Vaccine Cold Chain Optimization Using Advanced SAS Techniques for Data Quality Enhancement and Fraud Detection

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Introduction

In modern healthcare supply chains, vaccine cold chain logistics is one of the most critical operational domains. Vaccines are highly temperature-sensitive biological products, and even minor deviations can lead to loss of potency, resulting in financial loss and public health risks. Organizations rely on robust data systems to monitor transportation, storage, compliance, and spoilage.

However, real-world datasets are rarely clean. They often contain:

Missing values
Incorrect formats
Logical inconsistencies
Fraudulent manipulations

This project simulates a real-world vaccine cold chain dataset with intentional errors, and demonstrates how Advanced SAS programming and R can:

Detect errors
Correct inconsistencies
Perform transformations
Apply business rules
Identify fraud patterns

The explanation is written in a simple, human-readable way, while still maintaining technical depth for interview preparation.

Table of Contents

Business Context
Raw Dataset Creation (With Errors)
Error Identification
Data Cleaning Using SAS
DATA Step Transformations
PROC SQL Operations
Date Functions (MDY, INTCK, INTNX)
SET, MERGE, APPEND
PROC TRANSPOSE
Character Functions
Numeric Functions
Fraud Detection Macro
PROC DATASETS DELETE
Final Clean Dataset
Equivalent R Code
Summary
Conclusion

Business Context

A pharmaceutical company distributes vaccines across multiple regions. Each shipment is monitored using the following variables:

Distribution_Node (location)
Storage_Temperature
Shipment_Delay
Spoilage_Rate
Units_Transported
Compliance_Index
Price
Utilization
Classification (Good / Risky / Fraud)
Shipment_Date

Key Business Goals:

Maintain temperature between 2°C to 8°C
Reduce spoilage below 5%
Detect fraud (fake reporting, manipulation)
Ensure regulatory compliance

1. Create Raw Dataset (With Intentional Errors)

SAS Code

data vaccine_raw;

input Distribution_Node $ Storage_Temperature Shipment_Delay Spoilage_Rate

Units_Transported Compliance_Index Price Utilization Shipment_Date : date9.;

format Shipment_Date date9.;

datalines;

Hyd -5 3 2.5 1000 95 12.5 80 01JAN2024

Vizag 25 -1 110 2000 105 abc 70 05FEB2024

Chennai 4 2 . 1500 85 15.5 . 10MAR2024

Delhi 7 5 3.5 -100 90 20.0 60 15APR2024

Mumbai 3 0 2.0 1200 . 18.5 75 20MAY2024

;

run;

proc print data=vaccine_raw;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date
1	Hyd	-5	3	2.5	1000	95	12.5	80	01JAN2024
2	Vizag	25	-1	110.0	2000	105	.	70	05FEB2024
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024
4	Delhi	7	5	3.5	-100	90	20.0	60	15APR2024
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024

Explanation

This dataset intentionally contains multiple issues:

· Negative temperature (-5)

· Negative delay (-1)

· Spoilage > 100 (invalid)

· Price = "abc" (character in numeric field)

· Missing values

· Negative units

Why This Code is Used

· To simulate real-world messy data

· To test cleaning and validation logic

R code

vaccine_raw <- data.frame(

Distribution_Node =c("Hyd","Vizag","Chennai","Delhi","Mumbai"),

Storage_Temperature = c(-5,25,4,7,3),

Shipment_Delay = c(3,-1,2,5,0),

Spoilage_Rate = c(2.5,110,NA,3.5,2.0),

Units_Transported = c(1000,2000,1500,-100,1200),

Compliance_Index = c(95,105,85,90,NA),

Price = c(12.5,NA,15.5,20.0,18.5),

Utilization = c(80,70,NA,60,75),

Shipment_Date = c("01JAN2024","05FEB2024","10MAR2024","15APR2024","20MAY2024")

)

print(vaccine_raw)

2. Error Detection

SAS Code

proc print data=vaccine_raw;

where Storage_Temperature < 0 or Shipment_Delay < 0 or

Spoilage_Rate > 100 or Units_Transported < 0;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date
1	Hyd	-5	3	2.5	1000	95	12.5	80	01JAN2024
2	Vizag	25	-1	110.0	2000	105	.	70	05FEB2024
4	Delhi	7	5	3.5	-100	90	20.0	60	15APR2024

Explanation

This identifies invalid records.

Why Used

· Early validation step

· Helps isolate problematic rows

R Code

subset(vaccine_raw, Storage_Temperature < 0 | Shipment_Delay < 0 |

Spoilage_Rate > 100 | Units_Transported < 0)

3. Data Cleaning

SAS Code

data vaccine_clean;

set vaccine_raw;

if Storage_Temperature < 2 then Storage_Temperature = 2;

if Shipment_Delay < 0 then Shipment_Delay = 0;

if Spoilage_Rate > 100 then Spoilage_Rate = 5;

if Units_Transported < 0 then Units_Transported = abs(Units_Transported);

if Price = . then Price = 10;

if Compliance_Index > 100 then Compliance_Index = 100;

run;

proc print data=vaccine_clean;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024

Explanation

Fixes logical inconsistencies.

Why Used

· Ensures valid ranges

· Converts negative to positive

R Code

vaccine_clean <- vaccine_raw

vaccine_clean$Storage_Temperature[vaccine_clean$Storage_Temperature < 2] <- 2

vaccine_clean$Shipment_Delay[vaccine_clean$Shipment_Delay < 0] <- 0

vaccine_clean$Units_Transported <- abs(vaccine_clean$Units_Transported)

4. Character Functions

SAS Code

data vaccine_char;

set vaccine_clean;

Node_Upper = upcase(Distribution_Node);

Node_Proper = propcase(Distribution_Node);

Node_Clean = strip(Distribution_Node);

run;

proc print data=vaccine_char;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai

Explanation

· UPCASE → capital letters

· PROPCASE → proper format

· STRIP → remove spaces

R Code

vaccine_clean <- vaccine_raw

vaccine_clean$Node_Upper <- toupper(vaccine_clean$Distribution_Node)

vaccine_clean$Node_Lower <- tolower(vaccine_clean$Distribution_Node)

print(vaccine_clean)

5. Numeric Functions

SAS Code

data vaccine_num;

set vaccine_char;

Total_Value = Units_Transported * Price;

Avg_Util = mean(Utilization, 50);

run;

proc print data=vaccine_num;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5

SAS	R
mean(a,b) → row-wise	mean() → column-wise
Handles missing automatically	Needs na.rm=TRUE
Built-in flexibility	Needs rowMeans()

R Code

vaccine_clean$Total_Value <- vaccine_clean$Units_Transported * vaccine_clean$Price

vaccine_clean$Avg_Util <- rowMeans(cbind(vaccine_clean$Utilization, 50),na.rm = TRUE)

print(vaccine_clean)

6. Date Functions

SAS Code

data vaccine_date;

set vaccine_num;

Month = intck('month','01JAN2024'd,Shipment_Date);

Next_Month = intnx('month',Shipment_Date,1);

run;

proc print data=vaccine_date;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528

Shipment_Date	SAS INTCK	R %m
01JAN2024	0	1
05FEB2024	1	2
10MAR2024	2	3
15APR2024	3	4
20MAY2024	4	5

Concept Difference

SAS (INTCK)

Counts intervals FROM reference date

Example:

01JAN → 01JAN = 0
01JAN → FEB = 1

R (%m)

Returns actual calendar month

Example:

JAN = 1
FEB = 2

Key Insight

SAS INTCK	R format("%m")
Relative count	Absolute month
Starts from 0	Starts from 1
Interval logic	Calendar logic

R Code

Step 1:Convert to Date

vaccine_raw$Shipment_Date <- as.Date(vaccine_raw$Shipment_Date,format = "%d%b%Y")

Step 2:Extract Month

vaccine_raw$Month <- as.integer(format(vaccine_raw$Shipment_Date, "%m"))

print(vaccine_raw)

Step 3:How to Match SAS Output in R

vaccine_raw$Month_SAS <- as.integer(format(vaccine_raw$Shipment_Date, "%m")) - 1

print(vaccine_raw)

7. SET Statement

Step 1: Combine Two Different Datasets

data combined;

set vaccine_date

vaccine_clean;

run;

proc print data=combined;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528
6	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024				.	.	.	.
7	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024				.	.	.	.
8	Chennai	4	2	.	1500	85	15.5	.	10MAR2024				.	.	.	.
9	Delhi	7	5	3.5	100	90	20.0	60	15APR2024				.	.	.	.
10	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024				.	.	.	.

Step 2: Remove Duplicates After Combining

proc sort data=combined nodupkey;

by Distribution_Node Shipment_Date;

run;

proc print data=combined;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
2	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
3	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
4	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528
5	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436

Step 3: Add Source Indicator

data combined;

set vaccine_date(in=a) vaccine_date(in=b);

length Source $12.;

if a then Source="Original";

if b then Source="Duplicate";

run;

proc print data=combined;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month	Source
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407	Original
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436	Original
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467	Original
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497	Original
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528	Original
6	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407	Duplicate
7	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436	Duplicate
8	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467	Duplicate
9	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497	Duplicate
10	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528	Duplicate

Step 4: If You DON'T Want Duplication

data combined;

set vaccine_date;

run;

proc print data=combined;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
2	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436
3	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
4	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
5	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528

Explanation

· The SET statement reads observations from datasets sequentially.

· Here, the same dataset is appended twice → dataset doubles in size.

Why It Is Used

· Combine datasets with same structure

· Useful in:

· Monthly data stacking

· Multi-source integration

· Batch processing pipelines

R Code

Step 1:Convert to Date

vaccine_raw$Shipment_Date <- as.Date(vaccine_raw$Shipment_Date,format = "%d%b%Y")

Step 2:Extract Month

vaccine_raw$Month <- as.integer(format(vaccine_raw$Shipment_Date, "%m"))

print(vaccine_raw)

Step 3:How to Match SAS Output in R

vaccine_raw$Month_SAS <- as.integer(format(vaccine_raw$Shipment_Date, "%m")) - 1

print(vaccine_raw)

vaccine_clean2 <- vaccine_clean

combined <- rbind(vaccine_clean, vaccine_clean2)

vaccine_clean$Source <- "Original"

vaccine_clean2$Source <- "New"

R Explanation

· rbind() stacks datasets row-wise.

· Works only if column structures match.

Why Used in R

· Equivalent of SAS SET for vertical concatenation

· Common in time-series or batch append operations

8. MERGE

SAS Code

proc sort data=vaccine_date; by Distribution_Node; run;

proc print data=vaccine_date;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
2	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
3	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
4	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528
5	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436

data merged;

merge vaccine_date(in=a) vaccine_date(in=b);

by Distribution_Node;

if a;

run;

proc print data=merged;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
2	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
3	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
4	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528
5	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436

Explanation

· Combines datasets based on key (Distribution_Node)

· in= flags help control which dataset contributes rows

Why It Is Used

· Combine datasets with different variables

· Core for:

· SDTM → ADaM derivation

· Lookup joins

· Enrichment of datasets

R Explanation

· merge() performs join

· all.x = TRUE → left join

Why Used in R

· Combine multiple datasets

· Used in:

· Clinical joins

· Lookup merges

· Feature engineering

9. APPEND

SAS Code

proc append base=vaccine_date

data=vaccine_clean force;

run;

proc print data=vaccine_date;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month
1	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467
2	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497
3	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407
4	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528
5	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436
6	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024				.	.	.	.
7	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024				.	.	.	.
8	Chennai	4	2	.	1500	85	15.5	.	10MAR2024				.	.	.	.
9	Delhi	7	5	3.5	100	90	20.0	60	15APR2024				.	.	.	.
10	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024				.	.	.	.

Explanation

· Adds rows of one dataset to another

· force → allows mismatched variables

Why It Is Used

· Faster than SET for large datasets

· Used in:

· Incremental data loads

· Real-time ingestion

· ETL pipelines

R Explanation

· Same logic as append

· Combines rows into base dataset

Why Used in R

· Efficient for incremental updates

· Common in data pipelines

10. PROC TRANSPOSE

SAS Code

proc transpose data=vaccine_date out=transposed;

var Units_Transported Price;

run;

proc print data=transposed;

run;

OUTPUT:


Obs	_NAME_	COL1	COL2	COL3	COL4	COL5	COL6	COL7	COL8	COL9	COL10
1	Units_Transported	1500.0	100	1000.0	1200.0	2000	1000.0	2000	1500.0	100	1200.0
2	Price	15.5	20	12.5	18.5	10	12.5	10	15.5	20	18.5

Explanation

· Converts columns → rows

· Wide → Long format transformation

Why It Is Used

Required for:

· Reporting formats

· TLF generation

· Reshaping data

11. PROC SQL

SAS Code

proc sql;

create table summary as

select Distribution_Node,

sum(Units_Transported) as Total_Units,

avg(Price) as Avg_Price

from vaccine_date

group by Distribution_Node;

quit;

proc print data=summary;

run;

OUTPUT:


Obs	Distribution_Node	Total_Units	Avg_Price
1	Chennai	3000	15.5
2	Delhi	200	20.0
3	Hyd	2000	12.5
4	Mumbai	2400	18.5
5	Vizag	4000	10.0

Explanation

· Performs aggregation

· Groups by Distribution Node

Why It Is Used

· SQL is powerful for:

· Summaries

· Joins

· Data transformations

· Widely used in clinical reporting

R Code

library(dplyr)

summary <- vaccine_clean %>%

group_by(Distribution_Node) %>%

summarise(

Total_Units = sum(Units_Transported, na.rm = TRUE),

Avg_Price = mean(Price, na.rm = TRUE)

)

R Explanation

· group_by() + summarise() performs aggregation

Why Used in R

· Equivalent to SQL aggregation

· Clean and readable syntax

12. PROC DATASETS DELETE

SAS Code

proc datasets library=work nolist;

delete vaccine_raw combined merged;

quit;

LOG:

 NOTE: Deleting WORK.VACCINE_RAW (memtype=DATA).

 NOTE: Deleting WORK.COMBINED (memtype=DATA).

 NOTE: Deleting WORK.MERGED (memtype=DATA).

Explanation

· Deletes datasets from WORK library

Why It Is Used

· Memory optimization

· Workspace cleanup

· Prevent confusion with old datasets

R Code

rm(vaccine_raw, combined)

R Explanation

· rm() removes objects

· gc() clears memory

Why Used in R

· Free memory

· Improve performance

· Avoid object conflicts

Final Corrected Dataset Code

SAS Code

data final_dataset;

set vaccine_date;

/* Classification Logic */

if Spoilage_Rate < 5 then Classification="GOOD";

else if 5 <= Spoilage_Rate < 10 then Classification="RISK";

else Classification="CRITICAL";

length Util_Class $8.;

/* Utilization Classification */

if Utilization >= 80 then Util_Class="HIGH";

else if Utilization >= 60 then Util_Class="MEDIUM";

else Util_Class="LOW";

/* Compliance Flag */

if Compliance_Index >= 90 then Compliance_Flag="STRONG";

else Compliance_Flag="WEAK";

/* Fraud Flag */

if Spoilage_Rate > 10 and Compliance_Index < 80 then Fraud="YES";

else Fraud="NO";

run;

proc print data=final_dataset;

run;

OUTPUT:


Obs	Distribution_Node	Storage_Temperature	Shipment_Delay	Spoilage_Rate	Units_Transported	Compliance_Index	Price	Utilization	Shipment_Date	Node_Upper	Node_Proper	Node_Clean	Total_Value	Avg_Util	Month	Next_Month	Classification	Util_Class	Compliance_Flag	Fraud
1	Chennai	4	2	.	1500	85	15.5	.	10MAR2024	CHENNAI	Chennai	Chennai	23250	50.0	2	23467	GOOD	LOW	WEAK	NO
2	Delhi	7	5	3.5	100	90	20.0	60	15APR2024	DELHI	Delhi	Delhi	2000	55.0	3	23497	GOOD	MEDIUM	STRONG	NO
3	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024	HYD	Hyd	Hyd	12500	65.0	0	23407	GOOD	HIGH	STRONG	NO
4	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024	MUMBAI	Mumbai	Mumbai	22200	62.5	4	23528	GOOD	MEDIUM	WEAK	NO
5	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024	VIZAG	Vizag	Vizag	20000	60.0	1	23436	RISK	MEDIUM	STRONG	NO
6	Hyd	2	3	2.5	1000	95	12.5	80	01JAN2024				.	.	.	.	GOOD	HIGH	STRONG	NO
7	Vizag	25	0	5.0	2000	100	10.0	70	05FEB2024				.	.	.	.	RISK	MEDIUM	STRONG	NO
8	Chennai	4	2	.	1500	85	15.5	.	10MAR2024				.	.	.	.	GOOD	LOW	WEAK	NO
9	Delhi	7	5	3.5	100	90	20.0	60	15APR2024				.	.	.	.	GOOD	MEDIUM	STRONG	NO
10	Mumbai	3	0	2.0	1200	.	18.5	75	20MAY2024				.	.	.	.	GOOD	MEDIUM	WEAK	NO

Explanation

1. Classification

Based on Spoilage Rate
Helps identify risk level

2. Utilization Class

Categorizes efficiency of usage

3. Compliance Flag

Indicates regulatory adherence

4. Fraud Flag

Detects suspicious patterns

Why This Final Dataset is Important

This is the business-ready dataset:

Used for dashboards
Used for regulatory reporting
Used for fraud detection systems
Used for decision-making

R Code

final_dataset <- vaccine_clean

# Classification

final_dataset$Classification <- ifelse(final_dataset$Spoilage_Rate < 5, "GOOD",

ifelse(final_dataset$Spoilage_Rate < 10, "RISK", "CRITICAL"))

# Utilization Class

final_dataset$Util_Class <- ifelse(final_dataset$Utilization >= 80, "HIGH",

ifelse(final_dataset$Utilization >= 60, "MEDIUM", "LOW"))

# Compliance Flag

final_dataset$Compliance_Flag <- ifelse(final_dataset$Compliance_Index >= 90, "STRONG", "WEAK")

# Fraud Detection

final_dataset$Fraud <- ifelse(final_dataset$Spoilage_Rate > 10 &

final_dataset$Compliance_Index < 80, "YES", "NO")

Summary

In this project, we created a vaccine cold chain dataset that reflects real-world logistics challenges. Initially, the dataset contained several errors such as invalid temperature values, negative shipment delays, unrealistic spoilage rates, and missing or incorrect entries. Using SAS programming, we systematically identified and corrected these issues through data cleaning techniques.

We applied various SAS concepts like DATA step processing, PROC SQL, macros, and functions for handling dates, characters, and numeric values. Additionally, we implemented fraud detection logic to flag suspicious records based on spoilage and compliance patterns.

The integration of SAS and R provided a dual perspective, helping us understand how similar operations can be performed in both environments. Overall, this project demonstrates how structured programming and validation logic can significantly improve data quality and reliability in critical healthcare systems.

Conclusion

This project highlights the importance of robust data handling in vaccine cold chain logistics, where accuracy directly impacts public health outcomes. By introducing intentional errors, we simulated real-world data challenges and demonstrated how Advanced SAS programming techniques can effectively identify, correct, and optimize such datasets.

Through the use of DATA steps, PROC SQL, macros, and various functions, we transformed unreliable raw data into a structured and meaningful dataset. The inclusion of fraud detection logic further emphasized the role of analytics in safeguarding operational integrity.

Additionally, mapping each SAS step to R provided a comparative understanding, enhancing flexibility for analysts working across platforms. Overall, this project not only strengthens technical skills but also reinforces the importance of data quality, validation, and business logic in clinical and logistics domains.

SAS INTERVIEW QUESTIONS

1. What is DISTINCT used for?
Answer:
DISTINCT removes duplicate rows from query results.
2. What is the difference between UNION and OUTER UNION?
Answer:
UNION removes duplicates, while OUTER UNION keeps all rows including duplicates.
3. What is a subquery?

Answer:
A subquery is a query nested inside another SQL query.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent VACCINE data.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics
·  Clinical SAS Programmer
·  Research Data Analyst
·  Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can SAS analytics reveal which programming languages are popular but surprisingly difficult to master?
2.Is SDTM DM quality control in SAS the key to avoiding last-minute FDA rejection risks?
3.Can SAS analytics reveal which world tourist places truly attract visitors year after year?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Can Advanced SAS Programming Detect, Clean, and Optimize Vaccine Cold Chain Logistics Data While Identifying Fraud Patterns?

Vaccine Cold Chain Optimization Using Advanced SAS Techniques for Data Quality Enhancement and Fraud Detection

Explanation

Explanation

Why Used

Explanation

Why Used

Explanation

R Explanation

Why Used in R

Explanation

Why It Is Used

R Explanation

Why Used in R

Explanation

1. Classification

2. Utilization Class

3. Compliance Flag

4. Fraud Flag

Why This Final Dataset is Important

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can SAS analytics reveal which programming languages are popular but surprisingly difficult to master?
2.Is SDTM DM quality control in SAS the key to avoiding last-minute FDA rejection risks?
3.Can SAS analytics reveal which world tourist places truly attract visitors year after year?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Can Advanced SAS Programming Detect, Clean, and Optimize Vaccine Cold Chain Logistics Data While Identifying Fraud Patterns?

Vaccine Cold Chain Optimization Using Advanced SAS Techniques for Data Quality Enhancement and Fraud Detection

Explanation

Explanation

Why Used

Explanation

Why Used

Explanation

R Explanation

Why Used in R

Explanation

Why It Is Used

R Explanation

Why Used in R

Explanation

1. Classification

2. Utilization Class

3. Compliance Flag

4. Fraud Flag

Why This Final Dataset is Important

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can SAS analytics reveal which programming languages are popular but surprisingly difficult to master?2.Is SDTM DM quality control in SAS the key to avoiding last-minute FDA rejection risks?3.Can SAS analytics reveal which world tourist places truly attract visitors year after year?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.Can SAS analytics reveal which programming languages are popular but surprisingly difficult to master?
2.Is SDTM DM quality control in SAS the key to avoiding last-minute FDA rejection risks?
3.Can SAS analytics reveal which world tourist places truly attract visitors year after year?