431.Can Advanced SAS Programming Detect, Clean, and Optimize Vaccine Cold Chain Logistics Data While Identifying Fraud Patterns?
Vaccine Cold Chain Optimization Using Advanced SAS Techniques for Data Quality Enhancement and Fraud Detection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS
& R TECHNIQUES —
DATA STEP | PROC SQL | PROC PRINT | MACROS | PROC TRANSPOSE | PROC DATASETS
DELETE | DATA FUNCTIONS | R DATA FRAME | dplyr | rbind/bind_rows | merge |
reshape2 (melt) | Date Functions | String Functions | Numeric Functions
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Introduction
In modern
healthcare supply chains, vaccine cold chain logistics is one of the most
critical operational domains. Vaccines are highly temperature-sensitive
biological products, and even minor deviations can lead to loss of potency,
resulting in financial loss and public health risks. Organizations rely on
robust data systems to monitor transportation, storage, compliance, and
spoilage.
However,
real-world datasets are rarely clean. They often contain:
- Missing values
- Incorrect formats
- Logical inconsistencies
- Fraudulent manipulations
This
project simulates a real-world vaccine cold chain dataset with intentional
errors, and demonstrates how Advanced SAS programming and R can:
- Detect errors
- Correct inconsistencies
- Perform transformations
- Apply business rules
- Identify fraud patterns
The
explanation is written in a simple, human-readable way, while still
maintaining technical depth for interview preparation.
Table of Contents
- Business Context
- Raw Dataset Creation (With
Errors)
- Error Identification
- Data Cleaning Using SAS
- DATA Step Transformations
- PROC SQL Operations
- Date Functions (MDY, INTCK,
INTNX)
- SET, MERGE, APPEND
- PROC TRANSPOSE
- Character Functions
- Numeric Functions
- Fraud Detection Macro
- PROC DATASETS DELETE
- Final Clean Dataset
- Equivalent R Code
- Summary
- Conclusion
Business Context
A
pharmaceutical company distributes vaccines across multiple regions. Each
shipment is monitored using the following variables:
- Distribution_Node (location)
- Storage_Temperature
- Shipment_Delay
- Spoilage_Rate
- Units_Transported
- Compliance_Index
- Price
- Utilization
- Classification (Good / Risky
/ Fraud)
- Shipment_Date
Key Business Goals:
- Maintain temperature between
2°C to 8°C
- Reduce spoilage below 5%
- Detect fraud (fake
reporting, manipulation)
- Ensure regulatory compliance
1. Create Raw Dataset (With Intentional Errors)
SAS Code
data vaccine_raw;
input Distribution_Node $ Storage_Temperature Shipment_Delay Spoilage_Rate
Units_Transported Compliance_Index Price Utilization Shipment_Date : date9.;
format Shipment_Date date9.;
datalines;
Hyd -5 3 2.5 1000 95 12.5 80 01JAN2024
Vizag 25 -1 110 2000 105 abc 70 05FEB2024
Chennai 4 2 . 1500 85 15.5 . 10MAR2024
Delhi 7 5 3.5 -100 90 20.0 60 15APR2024
Mumbai 3 0 2.0 1200 . 18.5 75 20MAY2024
;
run;
proc print data=vaccine_raw;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | -5 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 |
| 2 | Vizag | 25 | -1 | 110.0 | 2000 | 105 | . | 70 | 05FEB2024 |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 |
| 4 | Delhi | 7 | 5 | 3.5 | -100 | 90 | 20.0 | 60 | 15APR2024 |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 |
Explanation
This dataset intentionally contains multiple issues:
· Negative
temperature (-5)
· Negative
delay (-1)
· Spoilage
> 100 (invalid)
· Price =
"abc" (character in numeric field)
· Missing
values
· Negative units
Why This Code is Used
· To simulate
real-world messy data
· To test cleaning and validation logic
R code
vaccine_raw <- data.frame(
Distribution_Node =c("Hyd","Vizag","Chennai","Delhi","Mumbai"),
Storage_Temperature = c(-5,25,4,7,3),
Shipment_Delay = c(3,-1,2,5,0),
Spoilage_Rate = c(2.5,110,NA,3.5,2.0),
Units_Transported = c(1000,2000,1500,-100,1200),
Compliance_Index = c(95,105,85,90,NA),
Price = c(12.5,NA,15.5,20.0,18.5),
Utilization = c(80,70,NA,60,75),
Shipment_Date = c("01JAN2024","05FEB2024","10MAR2024","15APR2024","20MAY2024")
)
print(vaccine_raw)
2. Error Detection
SAS Code
proc print data=vaccine_raw;
where Storage_Temperature < 0 or Shipment_Delay < 0 or
Spoilage_Rate > 100 or Units_Transported < 0;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | -5 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 |
| 2 | Vizag | 25 | -1 | 110.0 | 2000 | 105 | . | 70 | 05FEB2024 |
| 4 | Delhi | 7 | 5 | 3.5 | -100 | 90 | 20.0 | 60 | 15APR2024 |
Explanation
This identifies invalid records.
Why Used
· Early
validation step
· Helps isolate problematic rows
R Code
subset(vaccine_raw, Storage_Temperature < 0 | Shipment_Delay < 0 |
Spoilage_Rate > 100 | Units_Transported < 0)
3. Data Cleaning
SAS Code
data vaccine_clean;
set vaccine_raw;
if Storage_Temperature < 2 then Storage_Temperature = 2;
if Shipment_Delay < 0 then Shipment_Delay = 0;
if Spoilage_Rate > 100 then Spoilage_Rate = 5;
if Units_Transported < 0 then Units_Transported = abs(Units_Transported);
if Price = . then Price = 10;
if Compliance_Index > 100 then Compliance_Index = 100;
run;
proc print data=vaccine_clean;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 |
Explanation
Fixes logical inconsistencies.
Why Used
· Ensures
valid ranges
· Converts negative to positive
R Code
vaccine_clean <- vaccine_raw
vaccine_clean$Storage_Temperature[vaccine_clean$Storage_Temperature < 2] <- 2
vaccine_clean$Shipment_Delay[vaccine_clean$Shipment_Delay < 0] <- 0
vaccine_clean$Units_Transported <- abs(vaccine_clean$Units_Transported)
4. Character Functions
SAS Code
data vaccine_char;
set vaccine_clean;
Node_Upper = upcase(Distribution_Node);
Node_Proper = propcase(Distribution_Node);
Node_Clean = strip(Distribution_Node);
run;
proc print data=vaccine_char;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai |
Explanation
· UPCASE →
capital letters
· PROPCASE →
proper format
· STRIP → remove spaces
R Code
vaccine_clean <- vaccine_raw
vaccine_clean$Node_Upper <- toupper(vaccine_clean$Distribution_Node)
vaccine_clean$Node_Lower <- tolower(vaccine_clean$Distribution_Node)
print(vaccine_clean)
5. Numeric Functions
SAS Code
data vaccine_num;
set vaccine_char;
Total_Value = Units_Transported * Price;
Avg_Util = mean(Utilization, 50);
run;
proc print data=vaccine_num;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 |
|
SAS |
R |
|
mean(a,b)
→ row-wise |
mean()
→ column-wise |
|
Handles
missing automatically |
Needs
na.rm=TRUE |
|
Built-in
flexibility |
Needs
rowMeans() |
R Code
vaccine_clean$Total_Value <- vaccine_clean$Units_Transported * vaccine_clean$Price
vaccine_clean$Avg_Util <- rowMeans(cbind(vaccine_clean$Utilization, 50),na.rm = TRUE)
print(vaccine_clean)
6. Date Functions
SAS Code
data vaccine_date;
set vaccine_num;
Month = intck('month','01JAN2024'd,Shipment_Date);
Next_Month = intnx('month',Shipment_Date,1);
run;
proc print data=vaccine_date;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
|
Shipment_Date |
SAS
INTCK |
R %m |
|
01JAN2024 |
0 |
1 |
|
05FEB2024 |
1 |
2 |
|
10MAR2024 |
2 |
3 |
|
15APR2024 |
3 |
4 |
|
20MAY2024 |
4 |
5 |
Concept Difference
SAS (INTCK)
Counts intervals FROM reference
date
Example:
01JAN → 01JAN = 0
01JAN → FEB = 1
R (%m)
Returns actual calendar month
Example:
JAN = 1
FEB = 2
Key Insight
|
SAS
INTCK |
R
format("%m") |
|
Relative
count |
Absolute
month |
|
Starts
from 0 |
Starts
from 1 |
|
Interval
logic |
Calendar
logic |
R Code
Step 1:Convert to Date
vaccine_raw$Shipment_Date <- as.Date(vaccine_raw$Shipment_Date,format = "%d%b%Y")
Step 2:Extract Month
vaccine_raw$Month <- as.integer(format(vaccine_raw$Shipment_Date, "%m"))
print(vaccine_raw)
Step 3:How to Match SAS Output in R
vaccine_raw$Month_SAS <- as.integer(format(vaccine_raw$Shipment_Date, "%m")) - 1
print(vaccine_raw)
7. SET Statement
Step 1: Combine Two Different Datasets
data combined;
set vaccine_date
vaccine_clean;
run;
proc print data=combined;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
| 6 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | . | . | . | . | |||
| 7 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | . | . | . | . | |||
| 8 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | . | . | . | . | |||
| 9 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | . | . | . | . | |||
| 10 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | . | . | . | . |
Step 2: Remove Duplicates After Combining
proc sort data=combined nodupkey;
by Distribution_Node Shipment_Date;
run;
proc print data=combined;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 2 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 3 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 4 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
| 5 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
Step 3: Add Source Indicator
data combined;
set vaccine_date(in=a) vaccine_date(in=b);
length Source $12.;
if a then Source="Original";
if b then Source="Duplicate";
run;
proc print data=combined;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 | Original |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 | Original |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 | Original |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 | Original |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 | Original |
| 6 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 | Duplicate |
| 7 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 | Duplicate |
| 8 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 | Duplicate |
| 9 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 | Duplicate |
| 10 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 | Duplicate |
Step 4: If You DON'T Want Duplication
data combined;
set vaccine_date;
run;
proc print data=combined;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 2 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
| 3 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 4 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 5 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
Explanation
· The SET statement reads observations
from datasets sequentially.
· Here, the same dataset is appended twice →
dataset doubles in size.
Why It Is Used
· Combine
datasets with same structure
· Useful in:
· Monthly
data stacking
·
Multi-source integration
· Batch processing pipelines
R Code
Step 1:Convert to Date
vaccine_raw$Shipment_Date <- as.Date(vaccine_raw$Shipment_Date,format = "%d%b%Y")
Step 2:Extract Month
vaccine_raw$Month <- as.integer(format(vaccine_raw$Shipment_Date, "%m"))
print(vaccine_raw)
Step 3:How to Match SAS Output in R
vaccine_raw$Month_SAS <- as.integer(format(vaccine_raw$Shipment_Date, "%m")) - 1
print(vaccine_raw)
vaccine_clean2 <- vaccine_clean
combined <- rbind(vaccine_clean, vaccine_clean2)
vaccine_clean$Source <- "Original"
vaccine_clean2$Source <- "New"
R
Explanation
· rbind() stacks datasets row-wise.
· Works only if column structures match.
Why Used in R
· Equivalent
of SAS SET for vertical concatenation
· Common in time-series or batch append
operations
8. MERGE
SAS Code
proc sort data=vaccine_date; by Distribution_Node; run;
proc print data=vaccine_date;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 2 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 3 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 4 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
| 5 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
data merged;
merge vaccine_date(in=a) vaccine_date(in=b);
by Distribution_Node;
if a;
run;
proc print data=merged;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 2 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 3 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 4 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
| 5 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
Explanation
· Combines
datasets based on key (Distribution_Node)
· in= flags
help control which dataset contributes rows
Why It Is Used
· Combine
datasets with different variables
· Core for:
· SDTM → ADaM
derivation
· Lookup
joins
· Enrichment of datasets
R
Explanation
· merge() performs join
· all.x = TRUE → left join
Why Used in R
· Combine
multiple datasets
· Used in:
· Clinical
joins
· Lookup
merges
· Feature engineering
9. APPEND
SAS Code
proc append base=vaccine_date
data=vaccine_clean force;
run;
proc print data=vaccine_date;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 |
| 2 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 |
| 3 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 |
| 4 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 |
| 5 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 |
| 6 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | . | . | . | . | |||
| 7 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | . | . | . | . | |||
| 8 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | . | . | . | . | |||
| 9 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | . | . | . | . | |||
| 10 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | . | . | . | . |
Explanation
· Adds rows
of one dataset to another
· force → allows mismatched variables
Why It Is Used
· Faster than
SET for large datasets
· Used in:
· Incremental
data loads
· Real-time
ingestion
· ETL pipelines
R
Explanation
· Same logic
as append
· Combines rows into base dataset
Why Used in R
·
Efficient for incremental updates
·
Common in data pipelines
10. PROC TRANSPOSE
SAS Code
proc transpose data=vaccine_date out=transposed;
var Units_Transported Price;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Units_Transported | 1500.0 | 100 | 1000.0 | 1200.0 | 2000 | 1000.0 | 2000 | 1500.0 | 100 | 1200.0 |
| 2 | Price | 15.5 | 20 | 12.5 | 18.5 | 10 | 12.5 | 10 | 15.5 | 20 | 18.5 |
Explanation
· Converts columns
→ rows
· Wide → Long format transformation
Why It Is Used
Required for:
· Reporting
formats
· TLF
generation
· Reshaping
data
11. PROC SQL
SAS Code
proc sql;
create table summary as
select Distribution_Node,
sum(Units_Transported) as Total_Units,
avg(Price) as Avg_Price
from vaccine_date
group by Distribution_Node;
quit;
proc print data=summary;
run;
OUTPUT:
| Obs | Distribution_Node | Total_Units | Avg_Price |
|---|---|---|---|
| 1 | Chennai | 3000 | 15.5 |
| 2 | Delhi | 200 | 20.0 |
| 3 | Hyd | 2000 | 12.5 |
| 4 | Mumbai | 2400 | 18.5 |
| 5 | Vizag | 4000 | 10.0 |
Explanation
· Performs
aggregation
· Groups by Distribution Node
Why It Is Used
· SQL
is powerful for:
· Summaries
· Joins
· Data transformations
· Widely used in clinical reporting
R Code
library(dplyr)
summary <- vaccine_clean %>%
group_by(Distribution_Node) %>%
summarise(
Total_Units = sum(Units_Transported, na.rm = TRUE),
Avg_Price = mean(Price, na.rm = TRUE)
)
R Explanation
·
group_by() + summarise() performs
aggregation
Why Used in R
·
Equivalent to SQL aggregation
·
Clean and readable syntax
12. PROC DATASETS DELETE
SAS Code
proc datasets library=work nolist;
delete vaccine_raw combined merged;
quit;
LOG:
Explanation
·
Deletes datasets from WORK library
Why It Is Used
·
Memory optimization
·
Workspace cleanup
·
Prevent confusion with old datasets
R Code
rm(vaccine_raw, combined)
R Explanation
·
rm() removes objects
·
gc() clears memory
Why Used in R
·
Free
memory
·
Improve
performance
·
Avoid
object conflicts
Final Corrected Dataset Code
SAS Code
data final_dataset;
set vaccine_date;
/* Classification Logic */
if Spoilage_Rate < 5 then Classification="GOOD";
else if 5 <= Spoilage_Rate < 10 then Classification="RISK";
else Classification="CRITICAL";
length Util_Class $8.;
/* Utilization Classification */
if Utilization >= 80 then Util_Class="HIGH";
else if Utilization >= 60 then Util_Class="MEDIUM";
else Util_Class="LOW";
/* Compliance Flag */
if Compliance_Index >= 90 then Compliance_Flag="STRONG";
else Compliance_Flag="WEAK";
/* Fraud Flag */
if Spoilage_Rate > 10 and Compliance_Index < 80 then Fraud="YES";
else Fraud="NO";
run;
proc print data=final_dataset;
run;
OUTPUT:
| Obs | Distribution_Node | Storage_Temperature | Shipment_Delay | Spoilage_Rate | Units_Transported | Compliance_Index | Price | Utilization | Shipment_Date | Node_Upper | Node_Proper | Node_Clean | Total_Value | Avg_Util | Month | Next_Month | Classification | Util_Class | Compliance_Flag | Fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | CHENNAI | Chennai | Chennai | 23250 | 50.0 | 2 | 23467 | GOOD | LOW | WEAK | NO |
| 2 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | DELHI | Delhi | Delhi | 2000 | 55.0 | 3 | 23497 | GOOD | MEDIUM | STRONG | NO |
| 3 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | HYD | Hyd | Hyd | 12500 | 65.0 | 0 | 23407 | GOOD | HIGH | STRONG | NO |
| 4 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | MUMBAI | Mumbai | Mumbai | 22200 | 62.5 | 4 | 23528 | GOOD | MEDIUM | WEAK | NO |
| 5 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | VIZAG | Vizag | Vizag | 20000 | 60.0 | 1 | 23436 | RISK | MEDIUM | STRONG | NO |
| 6 | Hyd | 2 | 3 | 2.5 | 1000 | 95 | 12.5 | 80 | 01JAN2024 | . | . | . | . | GOOD | HIGH | STRONG | NO | |||
| 7 | Vizag | 25 | 0 | 5.0 | 2000 | 100 | 10.0 | 70 | 05FEB2024 | . | . | . | . | RISK | MEDIUM | STRONG | NO | |||
| 8 | Chennai | 4 | 2 | . | 1500 | 85 | 15.5 | . | 10MAR2024 | . | . | . | . | GOOD | LOW | WEAK | NO | |||
| 9 | Delhi | 7 | 5 | 3.5 | 100 | 90 | 20.0 | 60 | 15APR2024 | . | . | . | . | GOOD | MEDIUM | STRONG | NO | |||
| 10 | Mumbai | 3 | 0 | 2.0 | 1200 | . | 18.5 | 75 | 20MAY2024 | . | . | . | . | GOOD | MEDIUM | WEAK | NO |
Explanation
1. Classification
- Based on
Spoilage Rate
- Helps
identify risk level
2. Utilization Class
- Categorizes
efficiency of usage
3. Compliance Flag
- Indicates
regulatory adherence
4. Fraud Flag
- Detects
suspicious patterns
Why This Final Dataset is Important
This is the business-ready dataset:
- Used for dashboards
- Used for regulatory
reporting
- Used for fraud detection
systems
- Used for decision-making
R Code
final_dataset <- vaccine_clean
# Classification
final_dataset$Classification <- ifelse(final_dataset$Spoilage_Rate < 5, "GOOD",
ifelse(final_dataset$Spoilage_Rate < 10, "RISK", "CRITICAL"))
# Utilization Class
final_dataset$Util_Class <- ifelse(final_dataset$Utilization >= 80, "HIGH",
ifelse(final_dataset$Utilization >= 60, "MEDIUM", "LOW"))
# Compliance Flag
final_dataset$Compliance_Flag <- ifelse(final_dataset$Compliance_Index >= 90, "STRONG", "WEAK")
# Fraud Detection
final_dataset$Fraud <- ifelse(final_dataset$Spoilage_Rate > 10 &
final_dataset$Compliance_Index < 80, "YES", "NO")
Summary
In this
project, we created a vaccine cold chain dataset that reflects real-world
logistics challenges. Initially, the dataset contained several errors such as
invalid temperature values, negative shipment delays, unrealistic spoilage
rates, and missing or incorrect entries. Using SAS programming, we
systematically identified and corrected these issues through data cleaning
techniques.
We
applied various SAS concepts like DATA step processing, PROC SQL, macros, and
functions for handling dates, characters, and numeric values. Additionally, we
implemented fraud detection logic to flag suspicious records based on spoilage
and compliance patterns.
The
integration of SAS and R provided a dual perspective, helping us understand how
similar operations can be performed in both environments. Overall, this project
demonstrates how structured programming and validation logic can significantly
improve data quality and reliability in critical healthcare systems.
Conclusion
This
project highlights the importance of robust data handling in vaccine cold chain
logistics, where accuracy directly impacts public health outcomes. By
introducing intentional errors, we simulated real-world data challenges and
demonstrated how Advanced SAS programming techniques can effectively identify,
correct, and optimize such datasets.
Through
the use of DATA steps, PROC SQL, macros, and various functions, we transformed
unreliable raw data into a structured and meaningful dataset. The inclusion of
fraud detection logic further emphasized the role of analytics in safeguarding
operational integrity.
Additionally,
mapping each SAS step to R provided a comparative understanding, enhancing
flexibility for analysts working across platforms. Overall, this project not
only strengthens technical skills but also reinforces the importance of data
quality, validation, and business logic in clinical and logistics domains.
SAS INTERVIEW QUESTIONS
1. What is DISTINCT used for?
Answer:
DISTINCT removes duplicate rows from query results.
2. What is the difference between UNION and OUTER
UNION?
Answer:
UNION removes duplicates, while OUTER UNION keeps all rows including
duplicates.
3. What is a subquery?
Answer:
A subquery is a query nested inside
another SQL query.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent VACCINE data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment