435.Can Advanced SAS Programming Detect, Clean, and Optimize AI Training Experiment Data While Identifying Fraud Patterns?

Advanced SAS Intelligence for AI Training Data Cleansing, Optimization, and Fraud Detection

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA | SET | MERGE | INPUT | DATALINES | IF |  BY | OUTPUT | PROC SORT | PROC APPEND | MERGE | SET | PROC TRANSPOSE | PROC DATASETS | RUN | %MACRO | %MEND | CHARACTER FUNCTIONS & NUMERIC FUNCTIONS

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Introduction

Artificial Intelligence (AI) training experiments generate massive volumes of structured data involving model performance, training cycles, GPU utilization, and cost metrics. However, raw datasets are rarely clean they often contain inconsistencies, missing values, incorrect formats, and even fraudulent manipulations.

In real-world analytics and clinical/statistical programming environments, handling such imperfect datasets is a core responsibility. This project simulates an AI training experiment dataset with intentional errors and demonstrates how SAS programming techniques can be used to detect, clean, transform, and optimize the data efficiently.

We will also integrate fraud detection logic using macros and explore advanced SAS concepts such as dataset merging, transposing, appending, date handling, and function usage.

This blog is designed in a step-by-step, interview-ready format with practical explanations, making it useful for both beginners and experienced SAS programmers.

Table of Contents

  1. Business Context
  2. Raw Dataset Creation (SAS + R)
  3. Intentional Errors in Dataset
  4. Error Detection & Correction
  5. SAS Data Cleaning Techniques
  6. Date Handling (MDY, INTCK, INTNX)
  7. Numeric Functions
  8. Character Functions
  9. Dataset Operations (SET, MERGE, APPEND)
  10. PROC TRANSPOSE
  11. PROC DATASETS DELETE
  12. Fraud Detection Macro
  13. Utilization Classification Logic
  14. Multi-Dataset Creation from One Dataset
  15. 15+ SAS Codes
  16. 20 Key Points About The Project
  17. Summary
  18. Conclusion

Business Context

AI companies run multiple experiments to train models. Each experiment consumes GPU resources, time, and cost. However:

  • Some records may be manipulated (fraud) to inflate performance
  • GPU usage may exceed logical thresholds
  • Training time may not match dataset size
  • Dates may be inconsistent

This project ensures:
  1.Data integrity
  2.Fraud detection
  3.Performance optimization

1. Raw Dataset Creation (SAS)

data ai_experiments_raw;

input Model_ID $ Dataset_Size Epochs Training_Time Accuracy Loss_Value 

      GPU_Usage Fees Start_Date:$11. End_Date:$11.;

datalines;

M101 50000 10 5.5 0.92 0.25 85 2000 01-01-2025 01-05-2025

M102 60000 12 6.2 0.95 0.20 90 2500 02-01-2025 02-06-2025

M103 . 15 7.0 1.2 0.15 105 3000 03-01-2025 03-07-2025

M104 70000 -5 6.5 0.89 . 88 2700 04-01-2025 04-06-2025

M105 80000 20 8.0 0.97 0.10 95 . 05-01-2025 05-08-2025

M106 90000 18 9.5 0.60 0.50 70 2200 wrong_date 06-08-2025

M107 100000 25 10.5 0.99 0.05 110 3500 07-01-2025 07-10-2025

M108 120000 30 12.0 0.85 0.30 60 1800 08-01-2025 08-12-2025

M109 110000 22 11.0 0.96 0.12 92 2900 09-01-2025 09-09-2025

M110 95000 19 9.0 0.93 0.22 89 2600 10-01-2025 10-07-2025

;

run;

proc print data=ai_experiments_raw;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_Date
1M10150000105.50.920.2585200001-01-202501-05-2025
2M10260000126.20.950.2090250002-01-202502-06-2025
3M103.157.01.200.15105300003-01-202503-07-2025
4M10470000-56.50.89.88270004-01-202504-06-2025
5M10580000208.00.970.1095.05-01-202505-08-2025
6M10690000189.50.600.50702200wrong_date06-08-2025
7M1071000002510.50.990.05110350007-01-202507-10-2025
8M1081200003012.00.850.3060180008-01-202508-12-2025
9M1091100002211.00.960.1292290009-01-202509-09-2025
10M11095000199.00.930.2289260010-01-202510-07-2025

Intentional Errors Introduced

·  Missing Dataset_Size (M103)

·  Accuracy > 1 (invalid)

·  GPU usage > 100

·  Negative Epochs

·  Missing Loss_Value

·  Invalid date ("wrong_date")

2. Raw Dataset Creation in R (With Errors)

ai_data <- data.frame(
  Model_ID = c("M101","M102","M103","M104","M105","M106","M107","M108","M109","M110"),
  Dataset_Size = c(50000,60000,NA,70000,80000,90000,100000,120000,110000,95000),
  Epochs = c(10,12,15,-5,20,18,25,30,22,19),
  Training_Time = c(5.5,6.2,7.0,6.5,8.0,9.5,10.5,12.0,11.0,9.0),
  Accuracy = c(0.92,0.95,1.2,0.89,0.97,0.60,0.99,0.85,0.96,0.93),
  Loss_Value = c(0.25,0.20,0.15,NA,0.10,0.50,0.05,0.30,0.12,0.22),
  GPU_Usage = c(85,90,105,88,95,70,110,60,92,89),
  Fees = c(2000,2500,3000,2700,NA,2200,3500,1800,2900,2600),
  Start_Date = c("01-01-2025","02-01-2025","03-01-2025","04-01-2025","05-01-2025",
                 "wrong_date","07-01-2025","08-01-2025","09-01-2025","10-01-2025"),
  End_Date = c("01-05-2025","02-06-2025","03-07-2025","04-06-2025","05-08-2025",
               "06-08-2025","07-10-2025","08-12-2025","09-09-2025","10-07-2025")
)

print(ai_data)

OUTPUT:

 

Model_ID

Dataset_Size

Epochs

Training_Time

Accuracy

Loss_Value

GPU_Usage

Fees

Start_Date

End_Date

1

M101

50000

10

5.5

0.92

0.25

85

2000

01-01-2025

01-05-2025

2

M102

60000

12

6.2

0.95

0.2

90

2500

02-01-2025

02-06-2025

3

M103

NA

15

7

1.2

0.15

105

3000

03-01-2025

03-07-2025

4

M104

70000

-5

6.5

0.89

NA

88

2700

04-01-2025

04-06-2025

5

M105

80000

20

8

0.97

0.1

95

NA

05-01-2025

05-08-2025

6

M106

90000

18

9.5

0.6

0.5

70

2200

wrong_date

06-08-2025

7

M107

100000

25

10.5

0.99

0.05

110

3500

07-01-2025

07-10-2025

8

M108

120000

30

12

0.85

0.3

60

1800

08-01-2025

08-12-2025

9

M109

110000

22

11

0.96

0.12

92

2900

09-01-2025

09-09-2025

10

M110

95000

19

9

0.93

0.22

89

2600

10-01-2025

10-07-2025

·  Now R dataset = SAS dataset (same structure)

·  Includes intentional errors (for cleaning practice)

·  Matches real-world messy data scenario

 

Identify Matching Errors in Both

Now BOTH R & SAS datasets contain same issues:

Issue

Example

Missing Dataset_Size

M103

Accuracy > 1

1.2

Negative Epochs

-5

Missing Loss_Value

M104

GPU > 100

105, 110

Missing Fees

M105

Invalid Date

wrong_date

3. Corrected Dataset

data ai_experiments_clean;

set ai_experiments_raw;

if Dataset_Size = . then Dataset_Size = 50000;

if Epochs < 0 then Epochs = abs(Epochs);

if Accuracy > 1 then Accuracy = 0.99;

if GPU_Usage > 100 then GPU_Usage = 100;

if Loss_Value = . then Loss_Value = 0.20;

if Fees = . then Fees = 2400;

if Start_Date = "wrong_date" then Start_Date = "06-01-2025";

Start_Date_num = input(Start_Date, ddmmyy10.);

End_Date_num = input(End_Date, ddmmyy10.);

format Start_Date_num End_Date_num date9.;

run;

proc print data=ai_experiments_clean;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_num
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025

Why this code is used

·  Cleans invalid values

·  Converts character dates to numeric

·  Ensures dataset consistency

4. Date Functions

data date_calc;

set ai_experiments_clean;

Duration = intck('day', Start_Date_num, End_Date_num);

Next_Run = intnx('day', End_Date_num, 7);

run;

proc print data=date_calc;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numDurationNext_Run
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY202512023869
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN202515123901
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL202518123932
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN202515123903
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG202521223965
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG202521223966
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT202527324028
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC202533424090
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP202524324000
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL202518123939

/*Alternate*/

data date_calc;

set ai_experiments_clean;

Duration = intck('day', Start_Date_num, End_Date_num);

Next_Run = intnx('day', End_Date_num, 7);

format Duration Next_Run date9.;

run;

proc print data=date_calc;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numDurationNext_Run
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY202530APR196008MAY2025
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN202531MAY196009JUN2025
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL202530JUN196010JUL2025
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN202531MAY196011JUN2025
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG202531JUL196012AUG2025
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG202531JUL196013AUG2025
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT202530SEP196014OCT2025
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC202530NOV196015DEC2025
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP202531AUG196016SEP2025
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL202530JUN196017JUL2025

Explanation

·  INTCK → Calculates duration

·  INTNX → Predicts next run

5. Numeric Functions

data numeric_calc;

set ai_experiments_clean;

Log_Size = log(Dataset_Size);

Sqrt_GPU = sqrt(GPU_Usage);

Rounded_Acc = round(Accuracy,0.01);

run;

proc print data=numeric_calc;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numLog_SizeSqrt_GPURounded_Acc
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY202510.81989.21950.92
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN202511.00219.48680.95
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL202510.819810.00000.99
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN202511.15639.38080.89
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG202511.28989.74680.97
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG202511.40768.36660.60
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT202511.512910.00000.99
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC202511.69527.74600.85
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP202511.60829.59170.96
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL202511.46169.43400.93

6. Character Functions 

data char_func;

set ai_experiments_clean;

Model_Clean = strip(Model_ID);

Model_Upper = upcase(Model_ID);

Model_Lower = lowcase(Model_ID);

Model_Proper = propcase(Model_ID);

Combined = catx('-', Model_ID, Dataset_Size);

run;

proc print data=char_func;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numModel_CleanModel_UpperModel_LowerModel_ProperCombined
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025M101M101m101M101M101-50000
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025M102M102m102M102M102-60000
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025M103M103m103M103M103-50000
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025M104M104m104M104M104-70000
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025M105M105m105M105M105-80000
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025M106M106m106M106M106-90000
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025M107M107m107M107M107-100000
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025M108M108m108M108M108-120000
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025M109M109m109M109M109-110000
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025M110M110m110M110M110-95000

7. A)SET Statement

data combined_data;

set ai_experiments_clean(in=a)

    ai_experiments_clean(in=b);

length Source $8.;

if a then Source="First";

else if b then Source="Second";

run;

proc print data=combined_data;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numSource
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025First
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025First
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025First
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025First
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025First
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025First
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025First
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025First
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025First
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025First
11M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025Second
12M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025Second
13M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025Second
14M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025Second
15M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025Second
16M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025Second
17M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025Second
18M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025Second
19M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025Second
20M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025Second

Key Concept

Statement

Purpose

SET

Row-wise combination

MERGE

Column-wise combination

APPEND

Faster SET alternative

B)Remove Duplicates

proc sort data=combined_data nodupkey;

by Model_ID;

run;

proc print data=combined_data;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numSource
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025First
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025First
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025First
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025First
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025First
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025First
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025First
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025First
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025First
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025First

·  Keeps only unique records

8. MERGE & IF-ELSE

data fees_data;

input Model_ID $ New_Fees;

datalines;

M101 2500

M102 3000

;

run;

proc print data=fees_data;

run;

OUTPUT:

ObsModel_IDNew_Fees
1M1012500
2M1023000

proc sort data=ai_experiments_clean;

by Model_ID;

run;

proc print data=ai_experiments_clean;

run; 

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_num
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025

proc sort data=fees_data;

by Model_ID;

run;

proc print data=fees_data;

run;

OUTPUT:

ObsModel_IDNew_Fees
1M1012500
2M1023000

data merged;

merge ai_experiments_clean 

      fees_data;

by Model_ID;

run;

proc print data=merged;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numNew_Fees
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY20252500
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN20253000
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025.
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025.
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025.
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025.
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025.
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025.
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025.
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025.

data New_Fees;

 set merged;

 if Model_ID not in ("M101","M102") then New_Fees=Fees+500;

 else New_Fees=Fees;

run;

proc print data=New_Fees;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numNew_Fees
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY20252000
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN20252500
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL20253500
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN20253200
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG20252900
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG20252700
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT20254000
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC20252300
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP20253400
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL20253100

Explanation 

set your_dataset_name;

·    Reads your existing dataset

Model_ID not in ("M101","M102")

·    Applies condition

·    Excludes those two models

Fees + 500

·    Adds increment only for valid records

else New_Fees = Fees;

·    Keeps missing for excluded models

Expected Output Logic

Model_ID

Fees

New_Fees

M101

2000

2500

M102

2500

3000

M103

3000

3500

M104

2700

3200

M105

2400

2900

9. APPEND

proc append base=ai_experiments_clean

            data=fees_data Force;

run;

proc print data=ai_experiments_clean;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_num
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025
11M101.......  ..
12M102.......  ..

10. TRANSPOSE

proc transpose data=ai_experiments_clean out=transposed;

var Accuracy Loss_Value GPU_Usage;

run;

proc print data=transposed;

run;

OUTPUT:

Obs_NAME_COL1COL2COL3COL4COL5COL6COL7COL8COL9COL10COL11COL12
1Accuracy0.920.950.990.890.970.60.990.850.960.93..
2Loss_Value0.250.200.150.200.100.50.050.300.120.22..
3GPU_Usage85.0090.00100.0088.0095.0070.0100.0060.0092.0089.00..

11. DELETE DATASETS

proc datasets library=work;

delete transposed;

quit;

LOG:

NOTE: Deleting WORK.TRANSPOSED (memtype=DATA).

12. Fraud Detection Macro

%macro fraud_check(ds);

data fraud_flag;

set &ds;

if Accuracy > 0.98 and GPU_Usage < 50 then Fraud_Flag="YES";

else if Training_Time < 2 then Fraud_Flag="YES";

else Fraud_Flag="NO";

run;

proc print data=fraud_flag;

run;

%mend;


%fraud_check(ai_experiments_clean);

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numFraud_Flag
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025NO
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025NO
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025NO
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025NO
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025NO
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025NO
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025NO
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025NO
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025NO
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025NO
11M101.......  ..YES
12M102.......  ..YES

13. Utilization Classification

data utilization;

set ai_experiments_clean;

length Usage_Class $8.;

if GPU_Usage >= 90 then Usage_Class="High";

else if GPU_Usage >= 70 then Usage_Class="Medium";

else Usage_Class="Low";

run;

proc print data=utilization;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_numUsage_Class
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025Medium
2M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025High
3M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025High
4M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025Medium
5M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025High
6M10690000189.50.600.5070220006-01-202506-08-202506JAN202506AUG2025Medium
7M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025High
8M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025Low
9M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025High
10M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025Medium
11M101.......  ..Low
12M102.......  ..Low

14. Multiple Datasets from One

data high

     medium

     low;

set ai_experiments_clean;

if Accuracy >= 0.95 then output high;

else if  0.60 >= Accuracy < 0.95 then output medium;

else output low;

run;

proc print data=high;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_num
1M10260000126.20.950.2090250002-01-202502-06-202502JAN202502JUN2025
2M10350000157.00.990.15100300003-01-202503-07-202503JAN202503JUL2025
3M10580000208.00.970.1095240005-01-202505-08-202505JAN202505AUG2025
4M1071000002510.50.990.05100350007-01-202507-10-202507JAN202507OCT2025
5M1091100002211.00.960.1292290009-01-202509-09-202509JAN202509SEP2025

proc print data=medium;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_num
1M10690000189.50.60.570220006-01-202506-08-202506JAN202506AUG2025
2M101.......  ..
3M102.......  ..

proc print data=low;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateStart_Date_numEnd_Date_num
1M10150000105.50.920.2585200001-01-202501-05-202501JAN202501MAY2025
2M1047000056.50.890.2088270004-01-202504-06-202504JAN202504JUN2025
3M1081200003012.00.850.3060180008-01-202508-12-202508JAN202508DEC2025
4M11095000199.00.930.2289260010-01-202510-07-202510JAN202510JUL2025

15. COALESCE Example

data coalesce_ex;

set ai_experiments_raw;

Final_Size = coalesce(Dataset_Size, 50000);

run;

proc print data=coalesce_ex;

run;

OUTPUT:

ObsModel_IDDataset_SizeEpochsTraining_TimeAccuracyLoss_ValueGPU_UsageFeesStart_DateEnd_DateFinal_Size
1M10150000105.50.920.2585200001-01-202501-05-202550000
2M10260000126.20.950.2090250002-01-202502-06-202560000
3M103.157.01.200.15105300003-01-202503-07-202550000
4M10470000-56.50.89.88270004-01-202504-06-202570000
5M10580000208.00.970.1095.05-01-202505-08-202580000
6M10690000189.50.600.50702200wrong_date06-08-202590000
7M1071000002510.50.990.05110350007-01-202507-10-2025100000
8M1081200003012.00.850.3060180008-01-202508-12-2025120000
9M1091100002211.00.960.1292290009-01-202509-09-2025110000
10M11095000199.00.930.2289260010-01-202510-07-202595000

15+ Code Summary Points

  1. DATA step creation
  2. SET statement
  3. MERGE datasets
  4. PROC APPEND
  5. PROC TRANSPOSE
  6. PROC DATASETS DELETE
  7. INTCK function
  8. INTNX function
  9. Numeric functions
  10. Character functions
  11. COALESCE
  12. Macro creation
  13. Fraud logic
  14. Classification
  15. Multiple outputs
20 Key Point About The Project
  1. Designed a realistic AI training experiment dataset including performance, cost, and resource utilization variables.
  2. Introduced intentional data quality issues such as missing values, invalid ranges, and inconsistent date formats.
  3. Implemented robust data cleaning logic using SAS DATA step to correct errors and standardize values.
  4. Handled missing values using techniques like COALESCE and conditional replacement logic.
  5. Validated numerical ranges (e.g., Accuracy ≤ 1, GPU_Usage ≤ 100) to ensure data integrity.
  6. Corrected negative and illogical values such as negative Epochs using transformation functions.
  7. Converted character date variables into numeric SAS dates using INPUT and applied proper formats.
  8. Calculated experiment duration using INTCK and projected future runs using INTNX.
  9. Applied numeric functions like LOG, SQRT, and ROUND for advanced analytical transformations.
  10. Standardized text data using character functions like STRIP, UPCASE, LOWCASE, and PROPCASE.
  11. Combined multiple datasets using the SET statement to simulate real-world data integration.
  12. Demonstrated dataset merging using MERGE and BY statements for relational data handling.
  13. Used PROC APPEND for efficient vertical data concatenation in large-scale environments.
  14. Restructured data using PROC TRANSPOSE for reporting and analytical flexibility.
  15. Removed unwanted datasets using PROC DATASETS DELETE to manage workspace efficiently.
  16. Created multiple output datasets from a single source dataset using conditional OUTPUT logic.
  17. Developed a macro-driven fraud detection system to identify suspicious training patterns.
  18. Flagged anomalies such as high accuracy with low GPU usage and unrealistic training times.
  19. Implemented GPU utilization classification into High, Medium, and Low categories for insights.
  20. Ensured cross-platform consistency by aligning SAS and R datasets for reproducible analytics workflows.

Summary

This project demonstrates how SAS can effectively handle real-world AI training experiment data, which often contains errors and inconsistencies. We started by creating a raw dataset with intentional issues such as missing values, invalid ranges, and incorrect date formats. Using SAS data step techniques and functions, we cleaned and transformed the dataset into a reliable format.

We applied date functions like INTCK and INTNX to calculate durations and predict future runs. Numeric and character functions helped standardize and enhance the data. Advanced operations like SET, MERGE, APPEND, and TRANSPOSE showed how datasets can be manipulated efficiently.

Additionally, a macro-based fraud detection system was implemented to identify suspicious patterns. Utilization classification provided insights into GPU usage efficiency.

Overall, this project highlights the importance of data validation, transformation, and automation in analytics workflows using SAS.

Conclusion

In modern AI-driven environments, data quality directly impacts decision-making and model performance. This project illustrates how SAS programming can go beyond basic data handling to deliver robust solutions involving error detection, correction, and fraud identification.

By combining multiple SAS techniques,ranging from data step processing to macros and procedures.We created a scalable and reusable framework. Such skills are highly valuable in industries like clinical trials, finance, and AI analytics.

Mastering these concepts not only strengthens technical expertise but also prepares you for real-world challenges and interviews.

SAS INTERVIEW QUESTIONS

1. Data Step vs. PROC SQL

Question: When would you choose a DATA Step over PROC SQL for merging datasets?

Short Answer: I use a DATA Step when I need to perform complex row-by-row logic, use temporary variables (like FIRST. or LAST.), or when I am merging large, pre-sorted datasets (using MERGE and BY). I choose PROC SQL for quick 'ad-hoc' joins, especially when the data isn't sorted or when I need to join more than two tables simultaneously.

2. Handling Data Errors (The "Input" Function)

Question: How do you handle a numeric variable that was accidentally imported as a character?

Short Answer: I use the INPUT() function in a DATA step to convert it. I create a new numeric variable and assign it the value: Numeric_Var = INPUT(Character_Var, informat.);. If I need to keep the original name, I use a RENAME statement. This ensures I can perform mathematical calculations on the data without errors.

3. Date Math (INTCK vs. INTNX)

Question: What is the difference between the INTCK and INTNX functions?

Short Answer: I use INTCK to count the 'intervals' between two dates (like calculating age in days or months). I use INTNX to 'advace' a date to a new point (like finding the first day of the next month). Simply put: INTCK is for subtraction (difference), and INTNX is for addition (shifting).


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent AI TRAINING data.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics 

·  Clinical SAS Programmer

·  Research Data Analyst

·  Regulatory Data Validator


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy



Comments

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

418.Can We Design, Debug, Detect Fraud, And Optimize A Smart Parking System Using Advanced SAS Programming Techniques?