406.If fraud exists in laboratory experiments, would your SAS analytics detect it in time?
If fraud exists in laboratory experiments, would your SAS analytics detect it in time?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE WE USED THESE SAS STATEMENTS AND FUNCTIONS FOR THIS PROJECT:
DATA STEP | SET | INPUT | DATALINES | LENGTH | FORMAT |
IF–THEN–ELSE | DO–END | INTCK | INTNX | MDY | ABS | ROUND | SUM | COALESCE |
STRIP | TRIM | CAT | CATX | PROPCASE | UPCASE | LOWCASE | MERGE | PROC DATASETS
| DELETE | PROC SQL | CREATE TABLE | SELECT | GROUP BY | PROC MEANS | PROC
UNIVARIATE | PROC FREQ | PROC CORR | PROC SGPLOT | PROC TRANSPOSE | PROC APPEND
| MACRO | %MACRO | %MEND
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TABLE OF CONTENTS
1.
Introduction
2.
Business Context
3.
Project Objectives
4.
Dataset Design
5.
Raw Data Creation (With Intentional Errors)
6.
Identifying & Explaining Errors
7.
Corrected Dataset – Full Code
8.
Data Cleaning & Transformation
9.
Utilization Classification Macro
10. Fraud
Detection Macro
11. SQL
Analysis
12. Statistical
Analysis (MEANS, UNIVARIATE)
13. Correlation
Study
14. Visualization
(PROC SGPLOT)
15. Transpose
& Append Operations
16. Dataset
Cleanup
17. 5 Key Points About The Project
18. Summary
19. Conclusion
1. INTRODUCTION
Laboratory experiments are the backbone of
pharmaceutical research, biotechnology innovation, chemical analysis, and
material science validation.
Every experiment has:
·
Duration
·
Cost
·
Error count
·
Success rate
·
Validation status
·
Resource utilization
From an analytics perspective, we must answer:
·
Which domain performs better?
·
Which experiments show fraud-like patterns?
·
Are long durations increasing cost?
·
Does higher error count reduce success rate?
·
Which experiments need re-validation?
2. BUSINESS CONTEXT
Imagine you are working as:
·
Clinical Data Analyst
·
Research Analytics Lead
·
QA Validation Officer
·
Lab Operations Manager
Your responsibilities:
·
Monitor cost efficiency
·
Identify suspicious experiments
·
Validate success rate claims
·
Track experiment timelines
·
Improve operational utilization
Incorrect data may result in:
·
Financial loss
·
Regulatory failure
·
Scientific misinterpretation
·
Audit penalties
Therefore, data validation + fraud detection =
critical.
3. PROJECT OBJECTIVES
1.
Create 15+ experiment dataset
2. Introduce multiple intentional
data errors
3. Clean & correct dataset
4. Classify utilization levels
5. Detect fraud patterns
6. Perform statistical analysis
7. Visualize trends
8. Explain each step clearly
4. DATASET DESIGN
Variables:
|
Variable |
Type |
Description |
|
Experiment_ID |
Char |
Unique experiment code |
|
Domain |
Char |
Pharma / Biotech / Chemical |
|
Start_Date |
Date |
Experiment start |
|
End_Date |
Date |
Experiment end |
|
Duration_Days |
Num |
INTCK calculation |
|
Cost |
Num |
Total experiment cost |
|
Success_Rate |
Num |
% success |
|
Error_Count |
Num |
Number of operational errors |
|
Validation_Status |
Char |
Validated / Pending |
|
Scientist_Name |
Char |
Lead scientist |
|
Equipment_Utilization |
Num |
% usage |
|
Fraud_Flag |
Char |
Derived flag |
5. RAW DATA CREATION (WITH INTENTIONAL ERRORS)
data lab_raw;
length Experiment_ID $8 Domain $15 Validation_Status $12 Scientist_Name $20;
format Start_Date End_Date date9.;
input Experiment_ID $ Domain $ Start_Date :date9. End_Date :date9. Cost Success_Rate
Error_Count Validation_Status $ Scientist_Name $ Equipment_Utilization;
datalines;
EXP001 Pharma 01JAN2025 10JAN2025 50000 92 2 Validated Dr.Ravi 85
EXP002 Biotech 05JAN2025 02JAN2025 45000 105 3 Pending Dr.Sita 78
EXP003 Chemical 12JAN2025 20JAN2025 -20000 88 1 Validated Dr.Khan 90
EXP004 Pharma 15JAN2025 25JAN2025 60000 75 -2 Validated Dr.Ravi 88
EXP005 Biotech 20JAN2025 28JAN2025 52000 68 5 Pending Dr.Meera 110
EXP006 Chemical 01FEB2025 10FEB2025 48000 81 0 Validated Dr.Ajay 70
EXP007 Pharma 05FEB2025 18FEB2025 75000 95 2 Validated Dr.Ravi 92
EXP008 Biotech 10FEB2025 15FEB2025 30000 45 8 Pending Dr.Sita 65
EXP009 Chemical 12FEB2025 25FEB2025 90000 98 1 Validated Dr.Khan 97
EXP010 Pharma 20FEB2025 01MAR2025 100000 99 0 Validated Dr.Meera 99
EXP011 Biotech 22FEB2025 28FEB2025 55000 60 4 Pending Dr.Sita 75
EXP012 Chemical 01MAR2025 15MAR2025 65000 85 2 Validated Dr.Ajay 83
EXP013 Pharma 05MAR2025 20MAR2025 72000 88 3 Pending Dr.Ravi 89
EXP014 Biotech 10MAR2025 18MAR2025 58000 91 1 Validated Dr.Meera 82
EXP015 Chemical 12MAR2025 22MAR2025 61000 77 2 Validated Dr.Khan 84
;
run;
proc print data=lab_raw;
run;
OUTPUT:
| Obs | Experiment_ID | Domain | Validation_Status | Scientist_Name | Start_Date | End_Date | Cost | Success_Rate | Error_Count | Equipment_Utilization |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EXP001 | Pharma | Validated | Dr.Ravi | 01JAN2025 | 10JAN2025 | 50000 | 92 | 2 | 85 |
| 2 | EXP002 | Biotech | Pending | Dr.Sita | 05JAN2025 | 02JAN2025 | 45000 | 105 | 3 | 78 |
| 3 | EXP003 | Chemical | Validated | Dr.Khan | 12JAN2025 | 20JAN2025 | -20000 | 88 | 1 | 90 |
| 4 | EXP004 | Pharma | Validated | Dr.Ravi | 15JAN2025 | 25JAN2025 | 60000 | 75 | -2 | 88 |
| 5 | EXP005 | Biotech | Pending | Dr.Meera | 20JAN2025 | 28JAN2025 | 52000 | 68 | 5 | 110 |
| 6 | EXP006 | Chemical | Validated | Dr.Ajay | 01FEB2025 | 10FEB2025 | 48000 | 81 | 0 | 70 |
| 7 | EXP007 | Pharma | Validated | Dr.Ravi | 05FEB2025 | 18FEB2025 | 75000 | 95 | 2 | 92 |
| 8 | EXP008 | Biotech | Pending | Dr.Sita | 10FEB2025 | 15FEB2025 | 30000 | 45 | 8 | 65 |
| 9 | EXP009 | Chemical | Validated | Dr.Khan | 12FEB2025 | 25FEB2025 | 90000 | 98 | 1 | 97 |
| 10 | EXP010 | Pharma | Validated | Dr.Meera | 20FEB2025 | 01MAR2025 | 100000 | 99 | 0 | 99 |
| 11 | EXP011 | Biotech | Pending | Dr.Sita | 22FEB2025 | 28FEB2025 | 55000 | 60 | 4 | 75 |
| 12 | EXP012 | Chemical | Validated | Dr.Ajay | 01MAR2025 | 15MAR2025 | 65000 | 85 | 2 | 83 |
| 13 | EXP013 | Pharma | Pending | Dr.Ravi | 05MAR2025 | 20MAR2025 | 72000 | 88 | 3 | 89 |
| 14 | EXP014 | Biotech | Validated | Dr.Meera | 10MAR2025 | 18MAR2025 | 58000 | 91 | 1 | 82 |
| 15 | EXP015 | Chemical | Validated | Dr.Khan | 12MAR2025 | 22MAR2025 | 61000 | 77 | 2 | 84 |
·
Dates may be reversed
·
Percentages exceed logical bounds
·
Negative values appear due to system
bugs
·
Data entry mistakes occur
·
Missing or inconsistent formats exist
6. INTENTIONAL ERRORS IDENTIFIED
❌ Error 1
EXP002
End_Date < Start_Date
❌ Error 2
Success_Rate
= 105 (cannot exceed 100)
❌ Error 3
Cost =
-20000 (negative impossible)
❌ Error 4
Error_Count
= -2
❌ Error 5
Equipment_Utilization
= 110 (>100 impossible)
7. CORRECTED DATASET – FULL LENGTH CODE
data lab_clean;
set lab_raw;
Duration_Days = intck('day', Start_Date, End_Date);
if End_Date < Start_Date then do;
End_Date = intnx('day', Start_Date, 5);
end;
if Success_Rate > 100 then Success_Rate = 100;
if Success_Rate < 0 then Success_Rate = 0;
if Cost < 0 then Cost = abs(Cost);
if Error_Count < 0 then Error_Count = 0;
if Equipment_Utilization > 100 then Equipment_Utilization = 100;
Domain = propcase(strip(Domain));
Validation_Status = upcase(strip(Validation_Status));
Scientist_Name = propcase(Scientist_Name);
Experiment_Full = catx('-', Experiment_ID, Domain);
format Duration_Days 8.;
run;
proc print data=lab_clean;
run;
OUTPUT:
| Obs | Experiment_ID | Domain | Validation_Status | Scientist_Name | Start_Date | End_Date | Cost | Success_Rate | Error_Count | Equipment_Utilization | Duration_Days | Experiment_Full |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EXP001 | Pharma | VALIDATED | Dr.Ravi | 01JAN2025 | 10JAN2025 | 50000 | 92 | 2 | 85 | 9 | EXP001-Pharma |
| 2 | EXP002 | Biotech | PENDING | Dr.Sita | 05JAN2025 | 10JAN2025 | 45000 | 100 | 3 | 78 | -3 | EXP002-Biotech |
| 3 | EXP003 | Chemical | VALIDATED | Dr.Khan | 12JAN2025 | 20JAN2025 | 20000 | 88 | 1 | 90 | 8 | EXP003-Chemical |
| 4 | EXP004 | Pharma | VALIDATED | Dr.Ravi | 15JAN2025 | 25JAN2025 | 60000 | 75 | 0 | 88 | 10 | EXP004-Pharma |
| 5 | EXP005 | Biotech | PENDING | Dr.Meera | 20JAN2025 | 28JAN2025 | 52000 | 68 | 5 | 100 | 8 | EXP005-Biotech |
| 6 | EXP006 | Chemical | VALIDATED | Dr.Ajay | 01FEB2025 | 10FEB2025 | 48000 | 81 | 0 | 70 | 9 | EXP006-Chemical |
| 7 | EXP007 | Pharma | VALIDATED | Dr.Ravi | 05FEB2025 | 18FEB2025 | 75000 | 95 | 2 | 92 | 13 | EXP007-Pharma |
| 8 | EXP008 | Biotech | PENDING | Dr.Sita | 10FEB2025 | 15FEB2025 | 30000 | 45 | 8 | 65 | 5 | EXP008-Biotech |
| 9 | EXP009 | Chemical | VALIDATED | Dr.Khan | 12FEB2025 | 25FEB2025 | 90000 | 98 | 1 | 97 | 13 | EXP009-Chemical |
| 10 | EXP010 | Pharma | VALIDATED | Dr.Meera | 20FEB2025 | 01MAR2025 | 100000 | 99 | 0 | 99 | 9 | EXP010-Pharma |
| 11 | EXP011 | Biotech | PENDING | Dr.Sita | 22FEB2025 | 28FEB2025 | 55000 | 60 | 4 | 75 | 6 | EXP011-Biotech |
| 12 | EXP012 | Chemical | VALIDATED | Dr.Ajay | 01MAR2025 | 15MAR2025 | 65000 | 85 | 2 | 83 | 14 | EXP012-Chemical |
| 13 | EXP013 | Pharma | PENDING | Dr.Ravi | 05MAR2025 | 20MAR2025 | 72000 | 88 | 3 | 89 | 15 | EXP013-Pharma |
| 14 | EXP014 | Biotech | VALIDATED | Dr.Meera | 10MAR2025 | 18MAR2025 | 58000 | 91 | 1 | 82 | 8 | EXP014-Biotech |
| 15 | EXP015 | Chemical | VALIDATED | Dr.Khan | 12MAR2025 | 22MAR2025 | 61000 | 77 | 2 | 84 | 10 | EXP015-Chemical |
·
Date correction using INTNX
·
Duration calculation using INTCK
·
ABS() for negative cost
·
Logical boundary correction for
percentages
·
Character normalization using STRIP and
PROPCASE
·
STRIP() → removes extra spaces
·
PROPCASE( ) → proper capitalization
·
UPCASE( ) → standard format
·
LOWCASE( ) → lower conversion
·
CAT( ) / CATX( ) → combine strings
ABS ( ) : Converts negative to positive.
8. UTILIZATION CLASSIFICATION MACRO
%macro utilization_flag;
data lab_clean;
set lab_clean;
length Utilization_Level $10;
if Equipment_Utilization >= 90 then Utilization_Level="High";
else if Equipment_Utilization >= 75 then Utilization_Level="Medium";
else Utilization_Level="Low";
run;
proc print data=lab_clean;
run;
%mend;
%utilization_flag;
OUTPUT:
| Obs | Experiment_ID | Domain | Validation_Status | Scientist_Name | Start_Date | End_Date | Cost | Success_Rate | Error_Count | Equipment_Utilization | Duration_Days | Experiment_Full | Utilization_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EXP001 | Pharma | VALIDATED | Dr.Ravi | 01JAN2025 | 10JAN2025 | 50000 | 92 | 2 | 85 | 9 | EXP001-Pharma | Medium |
| 2 | EXP002 | Biotech | PENDING | Dr.Sita | 05JAN2025 | 10JAN2025 | 45000 | 100 | 3 | 78 | -3 | EXP002-Biotech | Medium |
| 3 | EXP003 | Chemical | VALIDATED | Dr.Khan | 12JAN2025 | 20JAN2025 | 20000 | 88 | 1 | 90 | 8 | EXP003-Chemical | High |
| 4 | EXP004 | Pharma | VALIDATED | Dr.Ravi | 15JAN2025 | 25JAN2025 | 60000 | 75 | 0 | 88 | 10 | EXP004-Pharma | Medium |
| 5 | EXP005 | Biotech | PENDING | Dr.Meera | 20JAN2025 | 28JAN2025 | 52000 | 68 | 5 | 100 | 8 | EXP005-Biotech | High |
| 6 | EXP006 | Chemical | VALIDATED | Dr.Ajay | 01FEB2025 | 10FEB2025 | 48000 | 81 | 0 | 70 | 9 | EXP006-Chemical | Low |
| 7 | EXP007 | Pharma | VALIDATED | Dr.Ravi | 05FEB2025 | 18FEB2025 | 75000 | 95 | 2 | 92 | 13 | EXP007-Pharma | High |
| 8 | EXP008 | Biotech | PENDING | Dr.Sita | 10FEB2025 | 15FEB2025 | 30000 | 45 | 8 | 65 | 5 | EXP008-Biotech | Low |
| 9 | EXP009 | Chemical | VALIDATED | Dr.Khan | 12FEB2025 | 25FEB2025 | 90000 | 98 | 1 | 97 | 13 | EXP009-Chemical | High |
| 10 | EXP010 | Pharma | VALIDATED | Dr.Meera | 20FEB2025 | 01MAR2025 | 100000 | 99 | 0 | 99 | 9 | EXP010-Pharma | High |
| 11 | EXP011 | Biotech | PENDING | Dr.Sita | 22FEB2025 | 28FEB2025 | 55000 | 60 | 4 | 75 | 6 | EXP011-Biotech | Medium |
| 12 | EXP012 | Chemical | VALIDATED | Dr.Ajay | 01MAR2025 | 15MAR2025 | 65000 | 85 | 2 | 83 | 14 | EXP012-Chemical | Medium |
| 13 | EXP013 | Pharma | PENDING | Dr.Ravi | 05MAR2025 | 20MAR2025 | 72000 | 88 | 3 | 89 | 15 | EXP013-Pharma | Medium |
| 14 | EXP014 | Biotech | VALIDATED | Dr.Meera | 10MAR2025 | 18MAR2025 | 58000 | 91 | 1 | 82 | 8 | EXP014-Biotech | Medium |
| 15 | EXP015 | Chemical | VALIDATED | Dr.Khan | 12MAR2025 | 22MAR2025 | 61000 | 77 | 2 | 84 | 10 | EXP015-Chemical | Medium |
·
High (>=90%)
·
Medium (75–89%)
·
Low (<75%)
Macro automates classification.
·
Code reuse
·
Scalability
·
Automation
·
Efficiency
9. FRAUD DETECTION MACRO
%macro fraud_check;
data fraud_check;
set lab_clean;
length Fraud_Flag $12;
if Success_Rate=100 and Error_Count>2 then Fraud_Flag="Suspicious";
else if Cost>90000 and Duration_Days<5 then Fraud_Flag="Suspicious";
else Fraud_Flag="Normal";
run;
proc print data=fraud_check;
run;
%mend;
%fraud_check;
OUTPUT:
| Obs | Experiment_ID | Domain | Validation_Status | Scientist_Name | Start_Date | End_Date | Cost | Success_Rate | Error_Count | Equipment_Utilization | Duration_Days | Experiment_Full | Utilization_Level | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EXP001 | Pharma | VALIDATED | Dr.Ravi | 01JAN2025 | 10JAN2025 | 50000 | 92 | 2 | 85 | 9 | EXP001-Pharma | Medium | Normal |
| 2 | EXP002 | Biotech | PENDING | Dr.Sita | 05JAN2025 | 10JAN2025 | 45000 | 100 | 3 | 78 | -3 | EXP002-Biotech | Medium | Suspicious |
| 3 | EXP003 | Chemical | VALIDATED | Dr.Khan | 12JAN2025 | 20JAN2025 | 20000 | 88 | 1 | 90 | 8 | EXP003-Chemical | High | Normal |
| 4 | EXP004 | Pharma | VALIDATED | Dr.Ravi | 15JAN2025 | 25JAN2025 | 60000 | 75 | 0 | 88 | 10 | EXP004-Pharma | Medium | Normal |
| 5 | EXP005 | Biotech | PENDING | Dr.Meera | 20JAN2025 | 28JAN2025 | 52000 | 68 | 5 | 100 | 8 | EXP005-Biotech | High | Normal |
| 6 | EXP006 | Chemical | VALIDATED | Dr.Ajay | 01FEB2025 | 10FEB2025 | 48000 | 81 | 0 | 70 | 9 | EXP006-Chemical | Low | Normal |
| 7 | EXP007 | Pharma | VALIDATED | Dr.Ravi | 05FEB2025 | 18FEB2025 | 75000 | 95 | 2 | 92 | 13 | EXP007-Pharma | High | Normal |
| 8 | EXP008 | Biotech | PENDING | Dr.Sita | 10FEB2025 | 15FEB2025 | 30000 | 45 | 8 | 65 | 5 | EXP008-Biotech | Low | Normal |
| 9 | EXP009 | Chemical | VALIDATED | Dr.Khan | 12FEB2025 | 25FEB2025 | 90000 | 98 | 1 | 97 | 13 | EXP009-Chemical | High | Normal |
| 10 | EXP010 | Pharma | VALIDATED | Dr.Meera | 20FEB2025 | 01MAR2025 | 100000 | 99 | 0 | 99 | 9 | EXP010-Pharma | High | Normal |
| 11 | EXP011 | Biotech | PENDING | Dr.Sita | 22FEB2025 | 28FEB2025 | 55000 | 60 | 4 | 75 | 6 | EXP011-Biotech | Medium | Normal |
| 12 | EXP012 | Chemical | VALIDATED | Dr.Ajay | 01MAR2025 | 15MAR2025 | 65000 | 85 | 2 | 83 | 14 | EXP012-Chemical | Medium | Normal |
| 13 | EXP013 | Pharma | PENDING | Dr.Ravi | 05MAR2025 | 20MAR2025 | 72000 | 88 | 3 | 89 | 15 | EXP013-Pharma | Medium | Normal |
| 14 | EXP014 | Biotech | VALIDATED | Dr.Meera | 10MAR2025 | 18MAR2025 | 58000 | 91 | 1 | 82 | 8 | EXP014-Biotech | Medium | Normal |
| 15 | EXP015 | Chemical | VALIDATED | Dr.Khan | 12MAR2025 | 22MAR2025 | 61000 | 77 | 2 | 84 | 10 | EXP015-Chemical | Medium | Normal |
Fraud indicators used:
·
100% success with high errors
·
High cost with extremely short duration
These patterns are statistically suspicious.
10. PROC SQL ANALYSIS
proc sql;
create table domain_summary as
select Domain,
count(*) as Total_Experiments,
avg(Cost) as Avg_Cost,
avg(Success_Rate) as Avg_Success
from fraud_check
group by Domain;
quit;
proc print data=domain_summary;
run;
OUTPUT:
| Obs | Domain | Total_Experiments | Avg_Cost | Avg_Success |
|---|---|---|---|---|
| 1 | Biotech | 5 | 48000 | 72.8 |
| 2 | Chemical | 5 | 56800 | 85.8 |
| 3 | Pharma | 5 | 71400 | 89.8 |
SQL allows:
·
Aggregation
·
Filtering
·
Grouping
·
Joining
11. PROC MEANS
proc means data=fraud_check mean median min max;
var Cost Success_Rate Duration_Days Error_Count;
run;
OUTPUT:
The MEANS Procedure
| Variable | Mean | Median | Minimum | Maximum |
|---|---|---|---|---|
Cost Success_Rate Duration_Days Error_Count | 58733.33 82.8000000 8.9333333 2.2666667 | 58000.00 88.0000000 9.0000000 2.0000000 | 20000.00 45.0000000 -3.0000000 0 | 100000.00 100.0000000 15.0000000 8.0000000 |
Provides:
·
Mean
·
Median
·
Minimum
·
Maximum
·
Standard deviation
12. PROC UNIVARIATE
proc univariate data=fraud_check;
var Cost;
histogram Cost;
run;
OUTPUT:
The UNIVARIATE Procedure
Variable: Cost
| Moments | |||
|---|---|---|---|
| N | 15 | Sum Weights | 15 |
| Mean | 58733.3333 | Sum Observations | 881000 |
| Std Deviation | 20585.9406 | Variance | 423780952 |
| Skewness | 0.19892832 | Kurtosis | 0.48197115 |
| Uncorrected SS | 5.7677E10 | Corrected SS | 5932933333 |
| Coeff Variation | 35.0498422 | Std Error Mean | 5315.26702 |
| Basic Statistical Measures | |||
|---|---|---|---|
| Location | Variability | ||
| Mean | 58733.33 | Std Deviation | 20586 |
| Median | 58000.00 | Variance | 423780952 |
| Mode | . | Range | 80000 |
| Interquartile Range | 24000 | ||
| Tests for Location: Mu0=0 | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Student's t | t | 11.04993 | Pr > |t| | <.0001 |
| Sign | M | 7.5 | Pr >= |M| | <.0001 |
| Signed Rank | S | 60 | Pr >= |S| | <.0001 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 100000 |
| 99% | 100000 |
| 95% | 100000 |
| 90% | 90000 |
| 75% Q3 | 72000 |
| 50% Median | 58000 |
| 25% Q1 | 48000 |
| 10% | 30000 |
| 5% | 20000 |
| 1% | 20000 |
| 0% Min | 20000 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 20000 | 3 | 65000 | 12 |
| 30000 | 8 | 72000 | 13 |
| 45000 | 2 | 75000 | 7 |
| 48000 | 6 | 90000 | 9 |
| 50000 | 1 | 100000 | 10 |
The UNIVARIATE Procedure
Gives:
·
Skewness
·
Kurtosis
·
Histogram
· Normality tests
13. PROC FREQ
proc freq data=fraud_check;
tables Domain*Fraud_Flag / chisq;
run;
OUTPUT:
The FREQ Procedure
|
| ||||||||||||||||||||||||||||
Statistics for Table of Domain by Fraud_Flag
| Statistic | DF | Value | Prob |
|---|---|---|---|
| WARNING: 100% of the cells have expected counts less than 5. Chi-Square may not be a valid test. | |||
| Chi-Square | 2 | 2.1429 | 0.3425 |
| Likelihood Ratio Chi-Square | 2 | 2.3439 | 0.3098 |
| Mantel-Haenszel Chi-Square | 1 | 1.5000 | 0.2207 |
| Phi Coefficient | 0.3780 | ||
| Contingency Coefficient | 0.3536 | ||
| Cramer's V | 0.3780 | ||
Sample Size = 15
Used for:
·
Domain distribution
·
Fraud flag counts
·
Validation status breakdown
With CHISQ option:
We test statistical association between domain
and fraud.
14. PROC CORR
proc corr data=fraud_check;
var Cost Duration_Days Success_Rate Error_Count;
run;
OUTPUT:
The CORR Procedure
| 4 Variables: | Cost Duration_Days Success_Rate Error_Count |
|---|
| Simple Statistics | ||||||
|---|---|---|---|---|---|---|
| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
| Cost | 15 | 58733 | 20586 | 881000 | 20000 | 100000 |
| Duration_Days | 15 | 8.93333 | 4.36654 | 134.00000 | -3.00000 | 15.00000 |
| Success_Rate | 15 | 82.80000 | 15.61684 | 1242 | 45.00000 | 100.00000 |
| Error_Count | 15 | 2.26667 | 2.15362 | 34.00000 | 0 | 8.00000 |
| Pearson Correlation Coefficients, N = 15 Prob > |r| under H0: Rho=0 | ||||
|---|---|---|---|---|
| Cost | Duration_Days | Success_Rate | Error_Count | |
| Cost | 1.00000 | 0.50994 0.0521 | 0.48173 0.0690 | -0.41234 0.1267 |
| Duration_Days | 0.50994 0.0521 | 1.00000 | 0.16005 0.5688 | -0.32459 0.2379 |
| Success_Rate | 0.48173 0.0690 | 0.16005 0.5688 | 1.00000 | -0.71401 0.0028 |
| Error_Count | -0.41234 0.1267 | -0.32459 0.2379 | -0.71401 0.0028 | 1.00000 |
Checks linear relationships:
·
Duration vs Cost
·
Error_Count vs Success_Rate
If correlation is strong:
We understand dependency.
15. PROC SGPLOT
proc sgplot data=fraud_check;
scatter x=Duration_Days y=Cost;
run;
OUTPUT:
Scatter plot:
Duration vs Cost.
Visual pattern reveals:
·
Linear trend
·
Outliers
·
Clusters
Executives understand visuals faster than tables.
16. PROC TRANSPOSE
proc transpose data=fraud_check out=cost_transpose;
var Cost Success_Rate;
id Experiment_ID;
run;
proc print data=cost_transpose;
run;
OUTPUT:
| Obs | _NAME_ | EXP001 | EXP002 | EXP003 | EXP004 | EXP005 | EXP006 | EXP007 | EXP008 | EXP009 | EXP010 | EXP011 | EXP012 | EXP013 | EXP014 | EXP015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Cost | 50000 | 45000 | 20000 | 60000 | 52000 | 48000 | 75000 | 30000 | 90000 | 100000 | 55000 | 65000 | 72000 | 58000 | 61000 |
| 2 | Success_Rate | 92 | 100 | 88 | 75 | 68 | 81 | 95 | 45 | 98 | 99 | 60 | 85 | 88 | 91 | 77 |
Converts rows to columns.
17. APPEND
data additional;
set fraud_check(obs=3);
Cost = Cost + 5000;
run;
proc print data=additional;
run;
OUTPUT:
| Obs | Experiment_ID | Domain | Validation_Status | Scientist_Name | Start_Date | End_Date | Cost | Success_Rate | Error_Count | Equipment_Utilization | Duration_Days | Experiment_Full | Utilization_Level | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EXP001 | Pharma | VALIDATED | Dr.Ravi | 01JAN2025 | 10JAN2025 | 55000 | 92 | 2 | 85 | 9 | EXP001-Pharma | Medium | Normal |
| 2 | EXP002 | Biotech | PENDING | Dr.Sita | 05JAN2025 | 10JAN2025 | 50000 | 100 | 3 | 78 | -3 | EXP002-Biotech | Medium | Suspicious |
| 3 | EXP003 | Chemical | VALIDATED | Dr.Khan | 12JAN2025 | 20JAN2025 | 25000 | 88 | 1 | 90 | 8 | EXP003-Chemical | High | Normal |
proc append base=fraud_check
data=additional force;
run;
proc print data=fraud_check;
run;
OUTPUT:
| Obs | Experiment_ID | Domain | Validation_Status | Scientist_Name | Start_Date | End_Date | Cost | Success_Rate | Error_Count | Equipment_Utilization | Duration_Days | Experiment_Full | Utilization_Level | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | EXP001 | Pharma | VALIDATED | Dr.Ravi | 01JAN2025 | 10JAN2025 | 50000 | 92 | 2 | 85 | 9 | EXP001-Pharma | Medium | Normal |
| 2 | EXP002 | Biotech | PENDING | Dr.Sita | 05JAN2025 | 10JAN2025 | 45000 | 100 | 3 | 78 | -3 | EXP002-Biotech | Medium | Suspicious |
| 3 | EXP003 | Chemical | VALIDATED | Dr.Khan | 12JAN2025 | 20JAN2025 | 20000 | 88 | 1 | 90 | 8 | EXP003-Chemical | High | Normal |
| 4 | EXP004 | Pharma | VALIDATED | Dr.Ravi | 15JAN2025 | 25JAN2025 | 60000 | 75 | 0 | 88 | 10 | EXP004-Pharma | Medium | Normal |
| 5 | EXP005 | Biotech | PENDING | Dr.Meera | 20JAN2025 | 28JAN2025 | 52000 | 68 | 5 | 100 | 8 | EXP005-Biotech | High | Normal |
| 6 | EXP006 | Chemical | VALIDATED | Dr.Ajay | 01FEB2025 | 10FEB2025 | 48000 | 81 | 0 | 70 | 9 | EXP006-Chemical | Low | Normal |
| 7 | EXP007 | Pharma | VALIDATED | Dr.Ravi | 05FEB2025 | 18FEB2025 | 75000 | 95 | 2 | 92 | 13 | EXP007-Pharma | High | Normal |
| 8 | EXP008 | Biotech | PENDING | Dr.Sita | 10FEB2025 | 15FEB2025 | 30000 | 45 | 8 | 65 | 5 | EXP008-Biotech | Low | Normal |
| 9 | EXP009 | Chemical | VALIDATED | Dr.Khan | 12FEB2025 | 25FEB2025 | 90000 | 98 | 1 | 97 | 13 | EXP009-Chemical | High | Normal |
| 10 | EXP010 | Pharma | VALIDATED | Dr.Meera | 20FEB2025 | 01MAR2025 | 100000 | 99 | 0 | 99 | 9 | EXP010-Pharma | High | Normal |
| 11 | EXP011 | Biotech | PENDING | Dr.Sita | 22FEB2025 | 28FEB2025 | 55000 | 60 | 4 | 75 | 6 | EXP011-Biotech | Medium | Normal |
| 12 | EXP012 | Chemical | VALIDATED | Dr.Ajay | 01MAR2025 | 15MAR2025 | 65000 | 85 | 2 | 83 | 14 | EXP012-Chemical | Medium | Normal |
| 13 | EXP013 | Pharma | PENDING | Dr.Ravi | 05MAR2025 | 20MAR2025 | 72000 | 88 | 3 | 89 | 15 | EXP013-Pharma | Medium | Normal |
| 14 | EXP014 | Biotech | VALIDATED | Dr.Meera | 10MAR2025 | 18MAR2025 | 58000 | 91 | 1 | 82 | 8 | EXP014-Biotech | Medium | Normal |
| 15 | EXP015 | Chemical | VALIDATED | Dr.Khan | 12MAR2025 | 22MAR2025 | 61000 | 77 | 2 | 84 | 10 | EXP015-Chemical | Medium | Normal |
| 16 | EXP001 | Pharma | VALIDATED | Dr.Ravi | 01JAN2025 | 10JAN2025 | 55000 | 92 | 2 | 85 | 9 | EXP001-Pharma | Medium | Normal |
| 17 | EXP002 | Biotech | PENDING | Dr.Sita | 05JAN2025 | 10JAN2025 | 50000 | 100 | 3 | 78 | -3 | EXP002-Biotech | Medium | Suspicious |
| 18 | EXP003 | Chemical | VALIDATED | Dr.Khan | 12JAN2025 | 20JAN2025 | 25000 | 88 | 1 | 90 | 8 | EXP003-Chemical | High | Normal |
Adds new records to base dataset.
Used when incremental data arrives.
18. PROC DATASETS DELETE
proc datasets lib=work;
delete lab_clean fraud_check ;
run;
LOG:
Deletes unwanted datasets.
Keeps WORK library clean.
Important for:
·
Memory optimization
·
Preventing dataset confusion
·
Maintaining reproducibility
19. 5
Key Points About the Project
1. Comprehensive Data
Cleaning & Validation
Implemented logical checks for dates, percentages, negative values, and
utilization limits to ensure data integrity.
2. Automation Using
Macros
Built reusable macros for utilization classification and fraud detection,
improving scalability and efficiency.
3. Advanced Statistical
Analysis
Applied PROC
MEANS, PROC UNIVARIATE,
and PROC CORR
to understand distribution, variability, and relationships between cost,
duration, and success rate.
4. Business
Intelligence via PROC SQL
Generated domain-level summaries and performance metrics to support management
decision-making.
5. Fraud & Risk
Analytics Framework
Designed rule-based detection logic to flag suspicious experiments based on
cost, duration, and error patterns.
20. Project
Summary
Laboratory experiment performance monitoring, data validation, utilization optimization, and fraud analytics using structured SAS programming.
21. Conclusion
This project
demonstrates a complete analytical lifecycle for laboratory experiment
performance monitoring using SAS. Starting from raw data containing intentional
logical errors, we applied systematic validation techniques to correct
inconsistencies in dates, percentages, costs, and error counts. Through
structured DATA step processing, SQL aggregation, statistical procedures, and
macro automation, the dataset was transformed into a reliable analytical model.
The utilization
classification macro provided operational insights into equipment efficiency,
while the fraud detection logic identified suspicious experiment patterns based
on abnormal success rates and cost-duration inconsistencies. Statistical
procedures such as PROC MEANS, PROC UNIVARIATE, and PROC CORR enabled deeper
understanding of cost distribution and variable relationships.
Overall, the project reflects real-world industry
practices in pharmaceutical and research environments, combining data quality
control, operational intelligence, and risk analytics into a single integrated
SAS-based framework suitable for audit, regulatory review, and strategic decision-making.
SAS INTERVIEW QUESTIONS
1.Why use INTCK vs INTNX?
2.How do macros help operational classification?
3.How do you detect data quality issues in SAS?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent LABORATORY data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
·
Clinical SAS Programmer
·
Research Data Analyst
·
Quality Control Officer
·
Regulatory Data Validator
·
Lab Operations Analyst
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment