435.Can Advanced SAS Programming Detect, Clean, and Optimize AI Training Experiment Data While Identifying Fraud Patterns?
Advanced SAS Intelligence for AI Training Data Cleansing, Optimization, and Fraud Detection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA | SET | MERGE | INPUT | DATALINES | IF | BY | OUTPUT | PROC SORT | PROC APPEND | MERGE | SET | PROC TRANSPOSE | PROC DATASETS | RUN | %MACRO | %MEND | CHARACTER FUNCTIONS & NUMERIC FUNCTIONS
Introduction
Artificial
Intelligence (AI) training experiments generate massive volumes of structured
data involving model performance, training cycles, GPU utilization, and cost
metrics. However, raw datasets are rarely clean they often contain
inconsistencies, missing values, incorrect formats, and even fraudulent
manipulations.
In
real-world analytics and clinical/statistical programming environments,
handling such imperfect datasets is a core responsibility. This project
simulates an AI training experiment dataset with intentional errors and
demonstrates how SAS programming techniques can be used to detect,
clean, transform, and optimize the data efficiently.
We will
also integrate fraud detection logic using macros and explore advanced SAS
concepts such as dataset merging, transposing, appending, date handling, and
function usage.
This blog
is designed in a step-by-step, interview-ready format with practical
explanations, making it useful for both beginners and experienced SAS
programmers.
Table of Contents
- Business Context
- Raw Dataset Creation (SAS +
R)
- Intentional Errors in
Dataset
- Error Detection &
Correction
- SAS Data Cleaning Techniques
- Date Handling (MDY, INTCK,
INTNX)
- Numeric Functions
- Character Functions
- Dataset Operations (SET,
MERGE, APPEND)
- PROC TRANSPOSE
- PROC DATASETS DELETE
- Fraud Detection Macro
- Utilization Classification
Logic
- Multi-Dataset Creation from
One Dataset
- 15+ SAS Codes
- 20 Key Points About The
Project
- Summary
- Conclusion
Business Context
AI
companies run multiple experiments to train models. Each experiment consumes
GPU resources, time, and cost. However:
- Some records may be manipulated
(fraud) to inflate performance
- GPU usage may exceed logical
thresholds
- Training time may not match
dataset size
- Dates may be inconsistent
This
project ensures:
1.Data
integrity
2.Fraud
detection
3.Performance
optimization
1. Raw Dataset Creation (SAS)
data ai_experiments_raw;
input Model_ID $ Dataset_Size Epochs Training_Time Accuracy Loss_Value
GPU_Usage Fees Start_Date:$11. End_Date:$11.;
datalines;
M101 50000 10 5.5 0.92 0.25 85 2000 01-01-2025 01-05-2025
M102 60000 12 6.2 0.95 0.20 90 2500 02-01-2025 02-06-2025
M103 . 15 7.0 1.2 0.15 105 3000 03-01-2025 03-07-2025
M104 70000 -5 6.5 0.89 . 88 2700 04-01-2025 04-06-2025
M105 80000 20 8.0 0.97 0.10 95 . 05-01-2025 05-08-2025
M106 90000 18 9.5 0.60 0.50 70 2200 wrong_date 06-08-2025
M107 100000 25 10.5 0.99 0.05 110 3500 07-01-2025 07-10-2025
M108 120000 30 12.0 0.85 0.30 60 1800 08-01-2025 08-12-2025
M109 110000 22 11.0 0.96 0.12 92 2900 09-01-2025 09-09-2025
M110 95000 19 9.0 0.93 0.22 89 2600 10-01-2025 10-07-2025
;
run;
proc print data=ai_experiments_raw;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 |
| 3 | M103 | . | 15 | 7.0 | 1.20 | 0.15 | 105 | 3000 | 03-01-2025 | 03-07-2025 |
| 4 | M104 | 70000 | -5 | 6.5 | 0.89 | . | 88 | 2700 | 04-01-2025 | 04-06-2025 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | . | 05-01-2025 | 05-08-2025 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | wrong_date | 06-08-2025 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 110 | 3500 | 07-01-2025 | 07-10-2025 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 |
Intentional Errors Introduced
· Missing
Dataset_Size (M103)
· Accuracy
> 1 (invalid)
· GPU usage
> 100
· Negative
Epochs
· Missing
Loss_Value
· Invalid
date ("wrong_date")
2. Raw
Dataset Creation in R (With Errors)
OUTPUT:
|
|
Model_ID |
Dataset_Size |
Epochs |
Training_Time |
Accuracy |
Loss_Value |
GPU_Usage |
Fees |
Start_Date |
End_Date |
|
1 |
M101 |
50000 |
10 |
5.5 |
0.92 |
0.25 |
85 |
2000 |
01-01-2025 |
01-05-2025 |
|
2 |
M102 |
60000 |
12 |
6.2 |
0.95 |
0.2 |
90 |
2500 |
02-01-2025 |
02-06-2025 |
|
3 |
M103 |
NA |
15 |
7 |
1.2 |
0.15 |
105 |
3000 |
03-01-2025 |
03-07-2025 |
|
4 |
M104 |
70000 |
-5 |
6.5 |
0.89 |
NA |
88 |
2700 |
04-01-2025 |
04-06-2025 |
|
5 |
M105 |
80000 |
20 |
8 |
0.97 |
0.1 |
95 |
NA |
05-01-2025 |
05-08-2025 |
|
6 |
M106 |
90000 |
18 |
9.5 |
0.6 |
0.5 |
70 |
2200 |
wrong_date |
06-08-2025 |
|
7 |
M107 |
100000 |
25 |
10.5 |
0.99 |
0.05 |
110 |
3500 |
07-01-2025 |
07-10-2025 |
|
8 |
M108 |
120000 |
30 |
12 |
0.85 |
0.3 |
60 |
1800 |
08-01-2025 |
08-12-2025 |
|
9 |
M109 |
110000 |
22 |
11 |
0.96 |
0.12 |
92 |
2900 |
09-01-2025 |
09-09-2025 |
|
10 |
M110 |
95000 |
19 |
9 |
0.93 |
0.22 |
89 |
2600 |
10-01-2025 |
10-07-2025 |
· Now R
dataset = SAS dataset (same structure)
· Includes intentional
errors (for cleaning practice)
· Matches real-world
messy data scenario
Identify
Matching Errors in Both
Now BOTH R & SAS datasets contain same
issues:
|
Issue |
Example |
|
Missing
Dataset_Size |
M103 |
|
Accuracy
> 1 |
1.2 |
|
Negative
Epochs |
-5 |
|
Missing
Loss_Value |
M104 |
|
GPU
> 100 |
105,
110 |
|
Missing
Fees |
M105 |
|
Invalid
Date |
wrong_date |
3. Corrected Dataset
data ai_experiments_clean;
set ai_experiments_raw;
if Dataset_Size = . then Dataset_Size = 50000;
if Epochs < 0 then Epochs = abs(Epochs);
if Accuracy > 1 then Accuracy = 0.99;
if GPU_Usage > 100 then GPU_Usage = 100;
if Loss_Value = . then Loss_Value = 0.20;
if Fees = . then Fees = 2400;
if Start_Date = "wrong_date" then Start_Date = "06-01-2025";
Start_Date_num = input(Start_Date, ddmmyy10.);
End_Date_num = input(End_Date, ddmmyy10.);
format Start_Date_num End_Date_num date9.;
run;
proc print data=ai_experiments_clean;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 |
Why this code is used
· Cleans
invalid values
· Converts
character dates to numeric
· Ensures dataset consistency
4. Date Functions
data date_calc;
set ai_experiments_clean;
Duration = intck('day', Start_Date_num, End_Date_num);
Next_Run = intnx('day', End_Date_num, 7);
run;
proc print data=date_calc;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Duration | Next_Run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | 120 | 23869 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | 151 | 23901 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | 181 | 23932 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | 151 | 23903 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | 212 | 23965 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | 212 | 23966 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | 273 | 24028 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | 334 | 24090 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | 243 | 24000 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | 181 | 23939 |
/*Alternate*/
data date_calc;
set ai_experiments_clean;
Duration = intck('day', Start_Date_num, End_Date_num);
Next_Run = intnx('day', End_Date_num, 7);
format Duration Next_Run date9.;
run;
proc print data=date_calc;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Duration | Next_Run |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | 30APR1960 | 08MAY2025 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | 31MAY1960 | 09JUN2025 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | 30JUN1960 | 10JUL2025 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | 31MAY1960 | 11JUN2025 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | 31JUL1960 | 12AUG2025 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | 31JUL1960 | 13AUG2025 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | 30SEP1960 | 14OCT2025 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | 30NOV1960 | 15DEC2025 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | 31AUG1960 | 16SEP2025 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | 30JUN1960 | 17JUL2025 |
Explanation
· INTCK → Calculates duration
· INTNX → Predicts next run
5. Numeric Functions
data numeric_calc;
set ai_experiments_clean;
Log_Size = log(Dataset_Size);
Sqrt_GPU = sqrt(GPU_Usage);
Rounded_Acc = round(Accuracy,0.01);
run;
proc print data=numeric_calc;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Log_Size | Sqrt_GPU | Rounded_Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | 10.8198 | 9.2195 | 0.92 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | 11.0021 | 9.4868 | 0.95 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | 10.8198 | 10.0000 | 0.99 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | 11.1563 | 9.3808 | 0.89 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | 11.2898 | 9.7468 | 0.97 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | 11.4076 | 8.3666 | 0.60 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | 11.5129 | 10.0000 | 0.99 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | 11.6952 | 7.7460 | 0.85 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | 11.6082 | 9.5917 | 0.96 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | 11.4616 | 9.4340 | 0.93 |
6. Character Functions
data char_func;
set ai_experiments_clean;
Model_Clean = strip(Model_ID);
Model_Upper = upcase(Model_ID);
Model_Lower = lowcase(Model_ID);
Model_Proper = propcase(Model_ID);
Combined = catx('-', Model_ID, Dataset_Size);
run;
proc print data=char_func;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Model_Clean | Model_Upper | Model_Lower | Model_Proper | Combined |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | M101 | M101 | m101 | M101 | M101-50000 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | M102 | M102 | m102 | M102 | M102-60000 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | M103 | M103 | m103 | M103 | M103-50000 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | M104 | M104 | m104 | M104 | M104-70000 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | M105 | M105 | m105 | M105 | M105-80000 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | M106 | M106 | m106 | M106 | M106-90000 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | M107 | M107 | m107 | M107 | M107-100000 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | M108 | M108 | m108 | M108 | M108-120000 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | M109 | M109 | m109 | M109 | M109-110000 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | M110 | M110 | m110 | M110 | M110-95000 |
7. A)SET Statement
data combined_data;
set ai_experiments_clean(in=a)
ai_experiments_clean(in=b);
length Source $8.;
if a then Source="First";
else if b then Source="Second";
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | First |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | First |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | First |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | First |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | First |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | First |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | First |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | First |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | First |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | First |
| 11 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | Second |
| 12 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | Second |
| 13 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | Second |
| 14 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | Second |
| 15 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | Second |
| 16 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | Second |
| 17 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | Second |
| 18 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | Second |
| 19 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | Second |
| 20 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | Second |
Key Concept
|
Statement |
Purpose |
|
SET |
Row-wise
combination |
|
MERGE |
Column-wise
combination |
|
APPEND |
Faster
SET alternative |
B)Remove Duplicates
proc sort data=combined_data nodupkey;
by Model_ID;
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | First |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | First |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | First |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | First |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | First |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | First |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | First |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | First |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | First |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | First |
· Keeps only unique records
8. MERGE & IF-ELSE
data fees_data;
input Model_ID $ New_Fees;
datalines;
M101 2500
M102 3000
;
run;
proc print data=fees_data;
run;
OUTPUT:
| Obs | Model_ID | New_Fees |
|---|---|---|
| 1 | M101 | 2500 |
| 2 | M102 | 3000 |
proc sort data=ai_experiments_clean;
by Model_ID;
run;
proc print data=ai_experiments_clean;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 |
proc sort data=fees_data;
by Model_ID;
run;
proc print data=fees_data;
run;
OUTPUT:
| Obs | Model_ID | New_Fees |
|---|---|---|
| 1 | M101 | 2500 |
| 2 | M102 | 3000 |
data merged;
merge ai_experiments_clean
fees_data;
by Model_ID;
run;
proc print data=merged;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | New_Fees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | 2500 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | 3000 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | . |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | . |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | . |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | . |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | . |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | . |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | . |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | . |
data New_Fees;
set merged;
if Model_ID not in ("M101","M102") then New_Fees=Fees+500;
else New_Fees=Fees;
run;
proc print data=New_Fees;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | New_Fees |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | 2000 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | 2500 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | 3500 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | 3200 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | 2900 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | 2700 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | 4000 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | 2300 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | 3400 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | 3100 |
Explanation
set
your_dataset_name;
· Reads your existing
dataset
Model_ID
not in ("M101","M102")
·
Applies condition
· Excludes those two models
Fees
+ 500
· Adds increment only for valid records
else
New_Fees = Fees;
·
Keeps missing for excluded models
Expected Output Logic
|
Model_ID |
Fees |
New_Fees |
|
M101 |
2000 |
2500 |
|
M102 |
2500 |
3000 |
|
M103 |
3000 |
3500 |
|
M104 |
2700 |
3200 |
|
M105 |
2400 |
2900 |
9. APPEND
proc append base=ai_experiments_clean
data=fees_data Force;
run;
proc print data=ai_experiments_clean;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 |
| 11 | M101 | . | . | . | . | . | . | . | . | . | ||
| 12 | M102 | . | . | . | . | . | . | . | . | . |
10. TRANSPOSE
proc transpose data=ai_experiments_clean out=transposed;
var Accuracy Loss_Value GPU_Usage;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 | COL11 | COL12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Accuracy | 0.92 | 0.95 | 0.99 | 0.89 | 0.97 | 0.6 | 0.99 | 0.85 | 0.96 | 0.93 | . | . |
| 2 | Loss_Value | 0.25 | 0.20 | 0.15 | 0.20 | 0.10 | 0.5 | 0.05 | 0.30 | 0.12 | 0.22 | . | . |
| 3 | GPU_Usage | 85.00 | 90.00 | 100.00 | 88.00 | 95.00 | 70.0 | 100.00 | 60.00 | 92.00 | 89.00 | . | . |
11. DELETE DATASETS
proc datasets library=work;
delete transposed;
quit;
LOG:
12. Fraud Detection Macro
%macro fraud_check(ds);
data fraud_flag;
set &ds;
if Accuracy > 0.98 and GPU_Usage < 50 then Fraud_Flag="YES";
else if Training_Time < 2 then Fraud_Flag="YES";
else Fraud_Flag="NO";
run;
proc print data=fraud_flag;
run;
%mend;
%fraud_check(ai_experiments_clean);
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | NO |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | NO |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | NO |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | NO |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | NO |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | NO |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | NO |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | NO |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | NO |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | NO |
| 11 | M101 | . | . | . | . | . | . | . | . | . | YES | ||
| 12 | M102 | . | . | . | . | . | . | . | . | . | YES |
13. Utilization Classification
data utilization;
set ai_experiments_clean;
length Usage_Class $8.;
if GPU_Usage >= 90 then Usage_Class="High";
else if GPU_Usage >= 70 then Usage_Class="Medium";
else Usage_Class="Low";
run;
proc print data=utilization;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num | Usage_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 | Medium |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 | High |
| 3 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 | High |
| 4 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 | Medium |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 | High |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 | Medium |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 | High |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 | Low |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 | High |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 | Medium |
| 11 | M101 | . | . | . | . | . | . | . | . | . | Low | ||
| 12 | M102 | . | . | . | . | . | . | . | . | . | Low |
14. Multiple Datasets from One
data high
medium
low;
set ai_experiments_clean;
if Accuracy >= 0.95 then output high;
else if 0.60 >= Accuracy < 0.95 then output medium;
else output low;
run;
proc print data=high;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 02JAN2025 | 02JUN2025 |
| 2 | M103 | 50000 | 15 | 7.0 | 0.99 | 0.15 | 100 | 3000 | 03-01-2025 | 03-07-2025 | 03JAN2025 | 03JUL2025 |
| 3 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | 2400 | 05-01-2025 | 05-08-2025 | 05JAN2025 | 05AUG2025 |
| 4 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 100 | 3500 | 07-01-2025 | 07-10-2025 | 07JAN2025 | 07OCT2025 |
| 5 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 09JAN2025 | 09SEP2025 |
proc print data=medium;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M106 | 90000 | 18 | 9.5 | 0.6 | 0.5 | 70 | 2200 | 06-01-2025 | 06-08-2025 | 06JAN2025 | 06AUG2025 |
| 2 | M101 | . | . | . | . | . | . | . | . | . | ||
| 3 | M102 | . | . | . | . | . | . | . | . | . |
proc print data=low;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Start_Date_num | End_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 01JAN2025 | 01MAY2025 |
| 2 | M104 | 70000 | 5 | 6.5 | 0.89 | 0.20 | 88 | 2700 | 04-01-2025 | 04-06-2025 | 04JAN2025 | 04JUN2025 |
| 3 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 08JAN2025 | 08DEC2025 |
| 4 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 10JAN2025 | 10JUL2025 |
15. COALESCE Example
data coalesce_ex;
set ai_experiments_raw;
Final_Size = coalesce(Dataset_Size, 50000);
run;
proc print data=coalesce_ex;
run;
OUTPUT:
| Obs | Model_ID | Dataset_Size | Epochs | Training_Time | Accuracy | Loss_Value | GPU_Usage | Fees | Start_Date | End_Date | Final_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M101 | 50000 | 10 | 5.5 | 0.92 | 0.25 | 85 | 2000 | 01-01-2025 | 01-05-2025 | 50000 |
| 2 | M102 | 60000 | 12 | 6.2 | 0.95 | 0.20 | 90 | 2500 | 02-01-2025 | 02-06-2025 | 60000 |
| 3 | M103 | . | 15 | 7.0 | 1.20 | 0.15 | 105 | 3000 | 03-01-2025 | 03-07-2025 | 50000 |
| 4 | M104 | 70000 | -5 | 6.5 | 0.89 | . | 88 | 2700 | 04-01-2025 | 04-06-2025 | 70000 |
| 5 | M105 | 80000 | 20 | 8.0 | 0.97 | 0.10 | 95 | . | 05-01-2025 | 05-08-2025 | 80000 |
| 6 | M106 | 90000 | 18 | 9.5 | 0.60 | 0.50 | 70 | 2200 | wrong_date | 06-08-2025 | 90000 |
| 7 | M107 | 100000 | 25 | 10.5 | 0.99 | 0.05 | 110 | 3500 | 07-01-2025 | 07-10-2025 | 100000 |
| 8 | M108 | 120000 | 30 | 12.0 | 0.85 | 0.30 | 60 | 1800 | 08-01-2025 | 08-12-2025 | 120000 |
| 9 | M109 | 110000 | 22 | 11.0 | 0.96 | 0.12 | 92 | 2900 | 09-01-2025 | 09-09-2025 | 110000 |
| 10 | M110 | 95000 | 19 | 9.0 | 0.93 | 0.22 | 89 | 2600 | 10-01-2025 | 10-07-2025 | 95000 |
15+ Code
Summary Points
- DATA
step creation
- SET
statement
- MERGE
datasets
- PROC
APPEND
- PROC
TRANSPOSE
- PROC
DATASETS DELETE
- INTCK
function
- INTNX
function
- Numeric
functions
- Character
functions
- COALESCE
- Macro
creation
- Fraud
logic
- Classification
- Multiple
outputs
- Designed a realistic AI
training experiment dataset including performance, cost, and resource
utilization variables.
- Introduced intentional data
quality issues such as missing values, invalid ranges, and inconsistent
date formats.
- Implemented robust data
cleaning logic using SAS DATA step to correct errors and standardize
values.
- Handled missing values using
techniques like COALESCE and conditional replacement logic.
- Validated numerical ranges
(e.g., Accuracy ≤ 1, GPU_Usage ≤ 100) to ensure data integrity.
- Corrected negative and
illogical values such as negative Epochs using transformation functions.
- Converted character date
variables into numeric SAS dates using INPUT and applied proper formats.
- Calculated experiment duration
using INTCK and projected future runs using INTNX.
- Applied numeric functions
like LOG, SQRT, and ROUND for advanced analytical transformations.
- Standardized text data using
character functions like STRIP, UPCASE, LOWCASE, and PROPCASE.
- Combined multiple datasets
using the SET statement to simulate real-world data integration.
- Demonstrated dataset merging
using MERGE and BY statements for relational data handling.
- Used PROC APPEND for
efficient vertical data concatenation in large-scale environments.
- Restructured data using PROC
TRANSPOSE for reporting and analytical flexibility.
- Removed unwanted datasets
using PROC DATASETS DELETE to manage workspace efficiently.
- Created multiple output
datasets from a single source dataset using conditional OUTPUT logic.
- Developed a macro-driven
fraud detection system to identify suspicious training patterns.
- Flagged anomalies such as
high accuracy with low GPU usage and unrealistic training times.
- Implemented GPU utilization
classification into High, Medium, and Low categories for insights.
- Ensured cross-platform
consistency by aligning SAS and R datasets for reproducible analytics
workflows.
Summary
This
project demonstrates how SAS can effectively handle real-world AI training
experiment data, which often contains errors and inconsistencies. We started by
creating a raw dataset with intentional issues such as missing values, invalid
ranges, and incorrect date formats. Using SAS data step techniques and
functions, we cleaned and transformed the dataset into a reliable format.
We
applied date functions like INTCK and INTNX to calculate durations and predict
future runs. Numeric and character functions helped standardize and enhance the
data. Advanced operations like SET, MERGE, APPEND, and TRANSPOSE showed how
datasets can be manipulated efficiently.
Additionally,
a macro-based fraud detection system was implemented to identify suspicious
patterns. Utilization classification provided insights into GPU usage
efficiency.
Overall,
this project highlights the importance of data validation, transformation, and
automation in analytics workflows using SAS.
Conclusion
In modern
AI-driven environments, data quality directly impacts decision-making and model
performance. This project illustrates how SAS programming can go beyond basic
data handling to deliver robust solutions involving error detection,
correction, and fraud identification.
By combining
multiple SAS techniques,ranging from data step processing to macros and
procedures.We created a scalable and reusable framework. Such skills are highly
valuable in industries like clinical trials, finance, and AI analytics.
Mastering
these concepts not only strengthens technical expertise but also prepares you
for real-world challenges and interviews.
SAS INTERVIEW QUESTIONS
1.
Data Step vs. PROC SQL
Question:
When would you choose a DATA Step over PROC SQL for merging datasets?
Short
Answer: I use a DATA Step
when I need to perform complex row-by-row logic, use temporary variables (like FIRST.
or LAST.),
or when I am merging large, pre-sorted datasets (using MERGE
and BY).
I choose PROC SQL for
quick 'ad-hoc' joins, especially when the data isn't sorted or when I need to
join more than two tables simultaneously.
2. Handling Data Errors (The
"Input" Function)
Question:
How do you handle a numeric variable that was accidentally imported as a
character?
Short
Answer: I use the INPUT()
function in a DATA step to convert it. I create a new numeric variable and
assign it the value: Numeric_Var
= INPUT(Character_Var, informat.);. If I need to keep the
original name, I use a RENAME
statement. This ensures I can perform mathematical calculations on the data
without errors.
3. Date Math (INTCK vs. INTNX)
Question: What is the difference between the INTCK and INTNX
functions?
Short
Answer: I use INTCK
to count the 'intervals' between two dates (like calculating age in days or
months). I use INTNX
to 'advace' a date to a new point (like finding the first day of the next
month). Simply put: INTCK
is for subtraction
(difference), and INTNX
is for addition
(shifting).
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent AI TRAINING data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment