Transforming Famous Ice Cream Datasets into Production-Grade Analytics Using SAS and R
Frozen Dreams, Corrupted Rows Famous Ice Cream Intelligence Using SAS PROC SQL, DATA Step, and Modern R Pipelines
Introduction
In
enterprise analytics, bad data rarely announces itself loudly. It quietly
destroys dashboards, regulatory outputs, AI models, operational reports, and
executive trust. One corrupted field inside a production dataset can trigger
millions of dollars in losses, delayed submissions, or incorrect strategic
decisions.
Imagine a
multinational frozen-dessert company conducting global customer behavior
analysis across premium ice cream brands. Their operational datasets contain
famous ice creams from multiple countries including Italy, India, the United
States, Turkey, Japan, and France. The company wants to understand regional
demand patterns, sugar preferences, sales performance, allergy categories, and
manufacturing quality metrics.
Unfortunately,
the raw production dataset is severely corrupted.
Duplicate
IceCream IDs exist.
Negative revenue values appear in financial columns.
Ages are unrealistic.
Email addresses are malformed.
Manufacturing dates are invalid.
Category labels contain inconsistent capitalization.
Whitespace corruption affects region codes.
Missing timestamps break reporting pipelines.
Mixed character/numeric variables crash validation macros.
The
consequences become dangerous:
- AI demand forecasts become
unreliable
- Executive dashboards show
incorrect regional profitability
- Clinical nutrition studies
using frozen dairy consumption become statistically invalid
- Regulatory reporting fails
QC checks
- Insurance risk models
misclassify allergy-sensitive consumers
- Machine learning pipelines
propagate corrupted metadata
This is where enterprise-grade SAS and R data engineering becomes critical.
Raw Dataset Creation in SAS
Raw Corrupted Dataset with Intentional Errors
data icecream_raw;
length IceCream_ID $8 Brand $25 Country $20 Flavor $20
Customer_Email $50 Region_Code $10 Category $15;
informat Manufacture_Date anydtdte20.;
format Manufacture_Date date9.;
infile datalines dlm='|' truncover;
input IceCream_ID $ Brand $ Country $ Flavor $ Customer_Age
Revenue Manufacture_Date Customer_Email $ Region_Code $
Category $;
datalines;
IC001|GelatoRoma|Italy|Vanilla|25|4500|12JAN2025|abc@gmail.com| eu01 |Premium
IC001|gelatoroma|ITALY|vanilla|25|-4500|31FEB2025|wrongmail.com|EU01|premium
IC002|Amul|India|Mango|150|3200|15MAR2025|amul@ice.com|in02|Regular
IC003|BaskinRobbins|USA|Chocolate|-5|5500|.|baskin@gmail|US01|Luxury
IC004|Cornetto|France|Strawberry|32|.|11APR2025|cornetto@yahoo.com| FR03|regular
IC005|NULL|Turkey|Pistachio|41|7000|15MAY2025|turkmail.com|TR01|Premium
IC006|HaagenDazs|USA|Cookie|29|6500|14JUN2025|haagen@gmail.com|us01|Luxury
IC007|KwalityWalls|India|Kulfi|38|-800|17JUL2025|kwality@gmail.com|IN-01|Regular
IC008|Movenpick|Switzerland|Coffee|999|9200|18AUG2025|move@gmail.com|SW01|Premium
IC009|Ben&Jerrys|USA|Brownie|44|8700|19SEP2025|ben@icecream.com|US 01|Luxury
IC010|Magnum|Belgium|Almond|33|7800|INVALID|magnum@gmail.com|BE01|Premium
IC011|Yili|China|Matcha|27|4500|25OCT2025|yili@@gmail.com|CN01|Regular
IC012|Walls|India|Chocolate|.|5000|12NOV2025|walls@gmail.com|IN01|Regular
IC013|Nirulas|India|Butterscotch|45|6500|15DEC2025|nirulasgmail.com|IN01|Premium
IC014|BlueBell|USA|Mint|29|7200|18JAN2025|blue@gmail.com|US01|Luxury
IC015|Grom|Italy|Hazelnut|31|8100|20FEB2025|grom@gmail.com|EU01|Premium
IC016|ColdStone|USA|Cheesecake|36|9100|22MAR2025|coldstone@gmail.com|US02|Luxury
IC017|Natural's|India|TenderCoconut|28|6200|24APR2025|natural@gmail.com|IN03|Premium
IC018|DairyDay|India|Vanilla|42|5800|26MAY2025|dairy@gmail.com|IN04|Regular
IC019|CarteDor|France|Caramel|35|7700|28JUN2025|carte@gmail.com|FR01|Luxury
IC020|TokyoCream|Japan|Matcha|30|8900|30JUL2025|tokyo@gmail.com|JP01|Premium
;
run;
proc print data = icecream_raw;
run;
OUTPUT:
| Obs | IceCream_ID | Brand | Country | Flavor | Customer_Email | Region_Code | Category | Manufacture_Date | Customer_Age | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | IC001 | GelatoRoma | Italy | Vanilla | abc@gmail.com | eu01 | Premium | 12JAN2025 | 25 | 4500 |
| 2 | IC001 | gelatoroma | ITALY | vanilla | wrongmail.com | EU01 | premium | . | 25 | -4500 |
| 3 | IC002 | Amul | India | Mango | amul@ice.com | in02 | Regular | 15MAR2025 | 150 | 3200 |
| 4 | IC003 | BaskinRobbins | USA | Chocolate | baskin@gmail | US01 | Luxury | . | -5 | 5500 |
| 5 | IC004 | Cornetto | France | Strawberry | cornetto@yahoo.com | FR03 | regular | 11APR2025 | 32 | . |
| 6 | IC005 | NULL | Turkey | Pistachio | turkmail.com | TR01 | Premium | 15MAY2025 | 41 | 7000 |
| 7 | IC006 | HaagenDazs | USA | Cookie | haagen@gmail.com | us01 | Luxury | 14JUN2025 | 29 | 6500 |
| 8 | IC007 | KwalityWalls | India | Kulfi | kwality@gmail.com | IN-01 | Regular | 17JUL2025 | 38 | -800 |
| 9 | IC008 | Movenpick | Switzerland | Coffee | move@gmail.com | SW01 | Premium | 18AUG2025 | 999 | 9200 |
| 10 | IC009 | Ben&Jerrys | USA | Brownie | ben@icecream.com | US 01 | Luxury | 19SEP2025 | 44 | 8700 |
| 11 | IC010 | Magnum | Belgium | Almond | magnum@gmail.com | BE01 | Premium | . | 33 | 7800 |
| 12 | IC011 | Yili | China | Matcha | yili@@gmail.com | CN01 | Regular | 25OCT2025 | 27 | 4500 |
| 13 | IC012 | Walls | India | Chocolate | walls@gmail.com | IN01 | Regular | 12NOV2025 | . | 5000 |
| 14 | IC013 | Nirulas | India | Butterscotch | nirulasgmail.com | IN01 | Premium | 15DEC2025 | 45 | 6500 |
| 15 | IC014 | BlueBell | USA | Mint | blue@gmail.com | US01 | Luxury | 18JAN2025 | 29 | 7200 |
| 16 | IC015 | Grom | Italy | Hazelnut | grom@gmail.com | EU01 | Premium | 20FEB2025 | 31 | 8100 |
| 17 | IC016 | ColdStone | USA | Cheesecake | coldstone@gmail.com | US02 | Luxury | 22MAR2025 | 36 | 9100 |
| 18 | IC017 | Natural's | India | TenderCoconut | natural@gmail.com | IN03 | Premium | 24APR2025 | 28 | 6200 |
| 19 | IC018 | DairyDay | India | Vanilla | dairy@gmail.com | IN04 | Regular | 26MAY2025 | 42 | 5800 |
| 20 | IC019 | CarteDor | France | Caramel | carte@gmail.com | FR01 | Luxury | 28JUN2025 | 35 | 7700 |
| 21 | IC020 | TokyoCream | Japan | Matcha | tokyo@gmail.com | JP01 | Premium | 30JUL2025 | 30 | 8900 |
Why LENGTH Statements
Matter in SAS
One of
the most misunderstood production risks in SAS is character truncation.
If LENGTH
statements are declared after assignments, SAS automatically allocates
variable length based on first encounter.
Example:
data test;
name='ChocolateBrownieSupreme';
length name $10;
run;
Result:
ChocolateB
The
string becomes truncated permanently.
In
enterprise clinical environments, truncation can destroy:
- SDTM variable integrity
- Medication names
- Region mappings
- Audit traceability
- Regulatory consistency
R behaves
differently because character vectors dynamically allocate memory. SAS requires
proactive metadata governance.
SAS Cleaning Workflow
1.Standardization Layer
data icecream_clean1;
retain Data_Source "GLOBAL_ICECREAM_SYSTEM";
set icecream_raw;
Brand = propcase(strip(Brand));
Country = upcase(strip(Country));
Flavor = propcase(compbl(Flavor));
Region_Code = compress(upcase(Region_Code));
Category = propcase(Category);
Customer_Email = lowcase(strip(Customer_Email));
Revenue = abs(Revenue);
if Customer_Age < 1 or Customer_Age > 100 then
Customer_Age = .;
if find(Customer_Email,'@')=0 then
Customer_Email='invalid@email.com';
if Manufacture_Date=. then
Manufacture_Date=intnx('month',today(),-1,'same');
Revenue=round(Revenue,.01);
run;
proc print data = icecream_clean1;
run;
OUTPUT:
| Obs | Data_Source | IceCream_ID | Brand | Country | Flavor | Customer_Email | Region_Code | Category | Manufacture_Date | Customer_Age | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_ICECREAM_SYSTEM | IC001 | Gelatoroma | ITALY | Vanilla | abc@gmail.com | EU01 | Premium | 12JAN2025 | 25 | 4500 |
| 2 | GLOBAL_ICECREAM_SYSTEM | IC001 | Gelatoroma | ITALY | Vanilla | invalid@email.com | EU01 | Premium | 04MAY2026 | 25 | 4500 |
| 3 | GLOBAL_ICECREAM_SYSTEM | IC002 | Amul | INDIA | Mango | amul@ice.com | IN02 | Regular | 15MAR2025 | . | 3200 |
| 4 | GLOBAL_ICECREAM_SYSTEM | IC003 | Baskinrobbins | USA | Chocolate | baskin@gmail | US01 | Luxury | 04MAY2026 | . | 5500 |
| 5 | GLOBAL_ICECREAM_SYSTEM | IC004 | Cornetto | FRANCE | Strawberry | cornetto@yahoo.com | FR03 | Regular | 11APR2025 | 32 | . |
| 6 | GLOBAL_ICECREAM_SYSTEM | IC005 | Null | TURKEY | Pistachio | invalid@email.com | TR01 | Premium | 15MAY2025 | 41 | 7000 |
| 7 | GLOBAL_ICECREAM_SYSTEM | IC006 | Haagendazs | USA | Cookie | haagen@gmail.com | US01 | Luxury | 14JUN2025 | 29 | 6500 |
| 8 | GLOBAL_ICECREAM_SYSTEM | IC007 | Kwalitywalls | INDIA | Kulfi | kwality@gmail.com | IN-01 | Regular | 17JUL2025 | 38 | 800 |
| 9 | GLOBAL_ICECREAM_SYSTEM | IC008 | Movenpick | SWITZERLAND | Coffee | move@gmail.com | SW01 | Premium | 18AUG2025 | . | 9200 |
| 10 | GLOBAL_ICECREAM_SYSTEM | IC009 | Ben&jerrys | USA | Brownie | ben@icecream.com | US01 | Luxury | 19SEP2025 | 44 | 8700 |
| 11 | GLOBAL_ICECREAM_SYSTEM | IC010 | Magnum | BELGIUM | Almond | magnum@gmail.com | BE01 | Premium | 04MAY2026 | 33 | 7800 |
| 12 | GLOBAL_ICECREAM_SYSTEM | IC011 | Yili | CHINA | Matcha | yili@@gmail.com | CN01 | Regular | 25OCT2025 | 27 | 4500 |
| 13 | GLOBAL_ICECREAM_SYSTEM | IC012 | Walls | INDIA | Chocolate | walls@gmail.com | IN01 | Regular | 12NOV2025 | . | 5000 |
| 14 | GLOBAL_ICECREAM_SYSTEM | IC013 | Nirulas | INDIA | Butterscotch | invalid@email.com | IN01 | Premium | 15DEC2025 | 45 | 6500 |
| 15 | GLOBAL_ICECREAM_SYSTEM | IC014 | Bluebell | USA | Mint | blue@gmail.com | US01 | Luxury | 18JAN2025 | 29 | 7200 |
| 16 | GLOBAL_ICECREAM_SYSTEM | IC015 | Grom | ITALY | Hazelnut | grom@gmail.com | EU01 | Premium | 20FEB2025 | 31 | 8100 |
| 17 | GLOBAL_ICECREAM_SYSTEM | IC016 | Coldstone | USA | Cheesecake | coldstone@gmail.com | US02 | Luxury | 22MAR2025 | 36 | 9100 |
| 18 | GLOBAL_ICECREAM_SYSTEM | IC017 | Natural's | INDIA | Tendercoconut | natural@gmail.com | IN03 | Premium | 24APR2025 | 28 | 6200 |
| 19 | GLOBAL_ICECREAM_SYSTEM | IC018 | Dairyday | INDIA | Vanilla | dairy@gmail.com | IN04 | Regular | 26MAY2025 | 42 | 5800 |
| 20 | GLOBAL_ICECREAM_SYSTEM | IC019 | Cartedor | FRANCE | Caramel | carte@gmail.com | FR01 | Luxury | 28JUN2025 | 35 | 7700 |
| 21 | GLOBAL_ICECREAM_SYSTEM | IC020 | Tokyocream | JAPAN | Matcha | tokyo@gmail.com | JP01 | Premium | 30JUL2025 | 30 | 8900 |
Explanation
This DATA
step performs enterprise normalization and corruption repair. PROPCASE, UPCASE,
and LOWCASE standardize inconsistent text formatting. COMPRESS removes
whitespace corruption from region codes. ABS converts negative revenue into
valid business values. Invalid ages are converted to missing for downstream
imputation logic. FIND detects malformed email structures. INTNX imputes missing
dates intelligently. ROUND ensures financial precision consistency. The RETAIN
statement preserves audit lineage metadata across rows. This mirrors
production-grade SDTM cleaning pipelines used in regulated environments.
2.Deduplication Using PROC SORT NODUPKEY
proc sort data=icecream_clean1
out=icecream_nodup nodupkey;
by IceCream_ID;
run;
proc print data = icecream_nodup;
run;
OUTPUT:
| Obs | Data_Source | IceCream_ID | Brand | Country | Flavor | Customer_Email | Region_Code | Category | Manufacture_Date | Customer_Age | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_ICECREAM_SYSTEM | IC001 | Gelatoroma | ITALY | Vanilla | abc@gmail.com | EU01 | Premium | 12JAN2025 | 25 | 4500 |
| 2 | GLOBAL_ICECREAM_SYSTEM | IC002 | Amul | INDIA | Mango | amul@ice.com | IN02 | Regular | 15MAR2025 | . | 3200 |
| 3 | GLOBAL_ICECREAM_SYSTEM | IC003 | Baskinrobbins | USA | Chocolate | baskin@gmail | US01 | Luxury | 04MAY2026 | . | 5500 |
| 4 | GLOBAL_ICECREAM_SYSTEM | IC004 | Cornetto | FRANCE | Strawberry | cornetto@yahoo.com | FR03 | Regular | 11APR2025 | 32 | . |
| 5 | GLOBAL_ICECREAM_SYSTEM | IC005 | Null | TURKEY | Pistachio | invalid@email.com | TR01 | Premium | 15MAY2025 | 41 | 7000 |
| 6 | GLOBAL_ICECREAM_SYSTEM | IC006 | Haagendazs | USA | Cookie | haagen@gmail.com | US01 | Luxury | 14JUN2025 | 29 | 6500 |
| 7 | GLOBAL_ICECREAM_SYSTEM | IC007 | Kwalitywalls | INDIA | Kulfi | kwality@gmail.com | IN-01 | Regular | 17JUL2025 | 38 | 800 |
| 8 | GLOBAL_ICECREAM_SYSTEM | IC008 | Movenpick | SWITZERLAND | Coffee | move@gmail.com | SW01 | Premium | 18AUG2025 | . | 9200 |
| 9 | GLOBAL_ICECREAM_SYSTEM | IC009 | Ben&jerrys | USA | Brownie | ben@icecream.com | US01 | Luxury | 19SEP2025 | 44 | 8700 |
| 10 | GLOBAL_ICECREAM_SYSTEM | IC010 | Magnum | BELGIUM | Almond | magnum@gmail.com | BE01 | Premium | 04MAY2026 | 33 | 7800 |
| 11 | GLOBAL_ICECREAM_SYSTEM | IC011 | Yili | CHINA | Matcha | yili@@gmail.com | CN01 | Regular | 25OCT2025 | 27 | 4500 |
| 12 | GLOBAL_ICECREAM_SYSTEM | IC012 | Walls | INDIA | Chocolate | walls@gmail.com | IN01 | Regular | 12NOV2025 | . | 5000 |
| 13 | GLOBAL_ICECREAM_SYSTEM | IC013 | Nirulas | INDIA | Butterscotch | invalid@email.com | IN01 | Premium | 15DEC2025 | 45 | 6500 |
| 14 | GLOBAL_ICECREAM_SYSTEM | IC014 | Bluebell | USA | Mint | blue@gmail.com | US01 | Luxury | 18JAN2025 | 29 | 7200 |
| 15 | GLOBAL_ICECREAM_SYSTEM | IC015 | Grom | ITALY | Hazelnut | grom@gmail.com | EU01 | Premium | 20FEB2025 | 31 | 8100 |
| 16 | GLOBAL_ICECREAM_SYSTEM | IC016 | Coldstone | USA | Cheesecake | coldstone@gmail.com | US02 | Luxury | 22MAR2025 | 36 | 9100 |
| 17 | GLOBAL_ICECREAM_SYSTEM | IC017 | Natural's | INDIA | Tendercoconut | natural@gmail.com | IN03 | Premium | 24APR2025 | 28 | 6200 |
| 18 | GLOBAL_ICECREAM_SYSTEM | IC018 | Dairyday | INDIA | Vanilla | dairy@gmail.com | IN04 | Regular | 26MAY2025 | 42 | 5800 |
| 19 | GLOBAL_ICECREAM_SYSTEM | IC019 | Cartedor | FRANCE | Caramel | carte@gmail.com | FR01 | Luxury | 28JUN2025 | 35 | 7700 |
| 20 | GLOBAL_ICECREAM_SYSTEM | IC020 | Tokyocream | JAPAN | Matcha | tokyo@gmail.com | JP01 | Premium | 30JUL2025 | 30 | 8900 |
Explanation
Duplicate
records are dangerous because they inflate revenue totals, distort AI training
data, and corrupt statistical calculations. PROC SORT NODUPKEY removes
duplicate IceCream IDs while preserving the first valid occurrence. In
enterprise clinical trials, duplicate patient IDs can create catastrophic
submission failures. Deduplication must always occur before aggregation or
reporting logic.
3.Validation Using PROC FORMAT
proc format;
value agecheck low-0='INVALID'
1-100='VALID'
101-high='INVALID';
run;
LOG:
Explanation
PROC
FORMAT creates reusable validation frameworks. Instead of repeatedly hardcoding
validation logic, formats centralize governance rules. This improves
maintainability, traceability, and QC consistency. Enterprise organizations
often maintain centralized validation libraries shared across SDTM, ADaM, and
operational reporting pipelines.
4.Advanced DATA Step with ARRAYS and DO Loops
data validation_flags;
set icecream_nodup;
array chars(*) Brand Country Flavor Category;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
end;
Missing_Count = cmiss(of _all_);
format Customer_Age agecheck.;
run;
proc print data = validation_flags;
run;
OUTPUT:
| Obs | Data_Source | IceCream_ID | Brand | Country | Flavor | Customer_Email | Region_Code | Category | Manufacture_Date | Customer_Age | Revenue | i | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_ICECREAM_SYSTEM | IC001 | Gelatoroma | ITALY | Vanilla | abc@gmail.com | EU01 | Premium | 12JAN2025 | VALID | 4500 | 5 | 1 |
| 2 | GLOBAL_ICECREAM_SYSTEM | IC002 | Amul | INDIA | Mango | amul@ice.com | IN02 | Regular | 15MAR2025 | . | 3200 | 5 | 2 |
| 3 | GLOBAL_ICECREAM_SYSTEM | IC003 | Baskinrobbins | USA | Chocolate | baskin@gmail | US01 | Luxury | 04MAY2026 | . | 5500 | 5 | 2 |
| 4 | GLOBAL_ICECREAM_SYSTEM | IC004 | Cornetto | FRANCE | Strawberry | cornetto@yahoo.com | FR03 | Regular | 11APR2025 | VALID | . | 5 | 2 |
| 5 | GLOBAL_ICECREAM_SYSTEM | IC005 | Null | TURKEY | Pistachio | invalid@email.com | TR01 | Premium | 15MAY2025 | VALID | 7000 | 5 | 1 |
| 6 | GLOBAL_ICECREAM_SYSTEM | IC006 | Haagendazs | USA | Cookie | haagen@gmail.com | US01 | Luxury | 14JUN2025 | VALID | 6500 | 5 | 1 |
| 7 | GLOBAL_ICECREAM_SYSTEM | IC007 | Kwalitywalls | INDIA | Kulfi | kwality@gmail.com | IN-01 | Regular | 17JUL2025 | VALID | 800 | 5 | 1 |
| 8 | GLOBAL_ICECREAM_SYSTEM | IC008 | Movenpick | SWITZERLAND | Coffee | move@gmail.com | SW01 | Premium | 18AUG2025 | . | 9200 | 5 | 2 |
| 9 | GLOBAL_ICECREAM_SYSTEM | IC009 | Ben&jerrys | USA | Brownie | ben@icecream.com | US01 | Luxury | 19SEP2025 | VALID | 8700 | 5 | 1 |
| 10 | GLOBAL_ICECREAM_SYSTEM | IC010 | Magnum | BELGIUM | Almond | magnum@gmail.com | BE01 | Premium | 04MAY2026 | VALID | 7800 | 5 | 1 |
| 11 | GLOBAL_ICECREAM_SYSTEM | IC011 | Yili | CHINA | Matcha | yili@@gmail.com | CN01 | Regular | 25OCT2025 | VALID | 4500 | 5 | 1 |
| 12 | GLOBAL_ICECREAM_SYSTEM | IC012 | Walls | INDIA | Chocolate | walls@gmail.com | IN01 | Regular | 12NOV2025 | . | 5000 | 5 | 2 |
| 13 | GLOBAL_ICECREAM_SYSTEM | IC013 | Nirulas | INDIA | Butterscotch | invalid@email.com | IN01 | Premium | 15DEC2025 | VALID | 6500 | 5 | 1 |
| 14 | GLOBAL_ICECREAM_SYSTEM | IC014 | Bluebell | USA | Mint | blue@gmail.com | US01 | Luxury | 18JAN2025 | VALID | 7200 | 5 | 1 |
| 15 | GLOBAL_ICECREAM_SYSTEM | IC015 | Grom | ITALY | Hazelnut | grom@gmail.com | EU01 | Premium | 20FEB2025 | VALID | 8100 | 5 | 1 |
| 16 | GLOBAL_ICECREAM_SYSTEM | IC016 | Coldstone | USA | Cheesecake | coldstone@gmail.com | US02 | Luxury | 22MAR2025 | VALID | 9100 | 5 | 1 |
| 17 | GLOBAL_ICECREAM_SYSTEM | IC017 | Natural's | INDIA | Tendercoconut | natural@gmail.com | IN03 | Premium | 24APR2025 | VALID | 6200 | 5 | 1 |
| 18 | GLOBAL_ICECREAM_SYSTEM | IC018 | Dairyday | INDIA | Vanilla | dairy@gmail.com | IN04 | Regular | 26MAY2025 | VALID | 5800 | 5 | 1 |
| 19 | GLOBAL_ICECREAM_SYSTEM | IC019 | Cartedor | FRANCE | Caramel | carte@gmail.com | FR01 | Luxury | 28JUN2025 | VALID | 7700 | 5 | 1 |
| 20 | GLOBAL_ICECREAM_SYSTEM | IC020 | Tokyocream | JAPAN | Matcha | tokyo@gmail.com | JP01 | Premium | 30JUL2025 | VALID | 8900 | 5 | 1 |
Explanation
ARRAYS
allow scalable transformation across multiple variables simultaneously. Instead
of repetitive code, loops automate enterprise cleaning logic. CMISS calculates
total missing values across mixed variable types. This technique becomes
extremely powerful in healthcare or insurance systems containing hundreds of
columns.
5.PROC SQL Enterprise Cleaning
proc sql;
create table revenue_summary as
select Country,Category,
count(*) as Total_Records,
mean(Revenue) as Avg_Revenue,
sum(Revenue) as Total_Revenue
from validation_flags
group by Country, Category;
quit;
proc print data = revenue_summary;
run;
OUTPUT:
| Obs | Country | Category | Total_Records | Avg_Revenue | Total_Revenue |
|---|---|---|---|---|---|
| 1 | BELGIUM | Premium | 1 | 7800 | 7800 |
| 2 | CHINA | Regular | 1 | 4500 | 4500 |
| 3 | FRANCE | Luxury | 1 | 7700 | 7700 |
| 4 | FRANCE | Regular | 1 | . | . |
| 5 | INDIA | Premium | 2 | 6350 | 12700 |
| 6 | INDIA | Regular | 4 | 3700 | 14800 |
| 7 | ITALY | Premium | 2 | 6300 | 12600 |
| 8 | JAPAN | Premium | 1 | 8900 | 8900 |
| 9 | SWITZERLAND | Premium | 1 | 9200 | 9200 |
| 10 | TURKEY | Premium | 1 | 7000 | 7000 |
| 11 | USA | Luxury | 5 | 7400 | 37000 |
Explanation
PROC SQL
simplifies aggregation and reporting workflows. SQL is especially useful when
joining transactional systems, operational marts, and external vendor datasets.
In enterprise analytics, SQL logic often powers executive dashboards and
clinical reporting environments.
6.FIRST./LAST. Processing Example
proc sort data=validation_flags;
by Country;
run;
proc print data = validation_flags;
run;
OUTPUT:
| Obs | Data_Source | IceCream_ID | Brand | Country | Flavor | Customer_Email | Region_Code | Category | Manufacture_Date | Customer_Age | Revenue | i | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_ICECREAM_SYSTEM | IC010 | Magnum | BELGIUM | Almond | magnum@gmail.com | BE01 | Premium | 04MAY2026 | VALID | 7800 | 5 | 1 |
| 2 | GLOBAL_ICECREAM_SYSTEM | IC011 | Yili | CHINA | Matcha | yili@@gmail.com | CN01 | Regular | 25OCT2025 | VALID | 4500 | 5 | 1 |
| 3 | GLOBAL_ICECREAM_SYSTEM | IC004 | Cornetto | FRANCE | Strawberry | cornetto@yahoo.com | FR03 | Regular | 11APR2025 | VALID | . | 5 | 2 |
| 4 | GLOBAL_ICECREAM_SYSTEM | IC019 | Cartedor | FRANCE | Caramel | carte@gmail.com | FR01 | Luxury | 28JUN2025 | VALID | 7700 | 5 | 1 |
| 5 | GLOBAL_ICECREAM_SYSTEM | IC002 | Amul | INDIA | Mango | amul@ice.com | IN02 | Regular | 15MAR2025 | . | 3200 | 5 | 2 |
| 6 | GLOBAL_ICECREAM_SYSTEM | IC007 | Kwalitywalls | INDIA | Kulfi | kwality@gmail.com | IN-01 | Regular | 17JUL2025 | VALID | 800 | 5 | 1 |
| 7 | GLOBAL_ICECREAM_SYSTEM | IC012 | Walls | INDIA | Chocolate | walls@gmail.com | IN01 | Regular | 12NOV2025 | . | 5000 | 5 | 2 |
| 8 | GLOBAL_ICECREAM_SYSTEM | IC013 | Nirulas | INDIA | Butterscotch | invalid@email.com | IN01 | Premium | 15DEC2025 | VALID | 6500 | 5 | 1 |
| 9 | GLOBAL_ICECREAM_SYSTEM | IC017 | Natural's | INDIA | Tendercoconut | natural@gmail.com | IN03 | Premium | 24APR2025 | VALID | 6200 | 5 | 1 |
| 10 | GLOBAL_ICECREAM_SYSTEM | IC018 | Dairyday | INDIA | Vanilla | dairy@gmail.com | IN04 | Regular | 26MAY2025 | VALID | 5800 | 5 | 1 |
| 11 | GLOBAL_ICECREAM_SYSTEM | IC001 | Gelatoroma | ITALY | Vanilla | abc@gmail.com | EU01 | Premium | 12JAN2025 | VALID | 4500 | 5 | 1 |
| 12 | GLOBAL_ICECREAM_SYSTEM | IC015 | Grom | ITALY | Hazelnut | grom@gmail.com | EU01 | Premium | 20FEB2025 | VALID | 8100 | 5 | 1 |
| 13 | GLOBAL_ICECREAM_SYSTEM | IC020 | Tokyocream | JAPAN | Matcha | tokyo@gmail.com | JP01 | Premium | 30JUL2025 | VALID | 8900 | 5 | 1 |
| 14 | GLOBAL_ICECREAM_SYSTEM | IC008 | Movenpick | SWITZERLAND | Coffee | move@gmail.com | SW01 | Premium | 18AUG2025 | . | 9200 | 5 | 2 |
| 15 | GLOBAL_ICECREAM_SYSTEM | IC005 | Null | TURKEY | Pistachio | invalid@email.com | TR01 | Premium | 15MAY2025 | VALID | 7000 | 5 | 1 |
| 16 | GLOBAL_ICECREAM_SYSTEM | IC003 | Baskinrobbins | USA | Chocolate | baskin@gmail | US01 | Luxury | 04MAY2026 | . | 5500 | 5 | 2 |
| 17 | GLOBAL_ICECREAM_SYSTEM | IC006 | Haagendazs | USA | Cookie | haagen@gmail.com | US01 | Luxury | 14JUN2025 | VALID | 6500 | 5 | 1 |
| 18 | GLOBAL_ICECREAM_SYSTEM | IC009 | Ben&jerrys | USA | Brownie | ben@icecream.com | US01 | Luxury | 19SEP2025 | VALID | 8700 | 5 | 1 |
| 19 | GLOBAL_ICECREAM_SYSTEM | IC014 | Bluebell | USA | Mint | blue@gmail.com | US01 | Luxury | 18JAN2025 | VALID | 7200 | 5 | 1 |
| 20 | GLOBAL_ICECREAM_SYSTEM | IC016 | Coldstone | USA | Cheesecake | coldstone@gmail.com | US02 | Luxury | 22MAR2025 | VALID | 9100 | 5 | 1 |
data regional_summary;
set validation_flags;
by Country;
if first.Country then Country_Count=0;
Country_Count+1;
if last.Country;
run;
proc print data = regional_summary;
run;
OUTPUT:
| Obs | Data_Source | IceCream_ID | Brand | Country | Flavor | Customer_Email | Region_Code | Category | Manufacture_Date | Customer_Age | Revenue | i | Missing_Count | Country_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_ICECREAM_SYSTEM | IC010 | Magnum | BELGIUM | Almond | magnum@gmail.com | BE01 | Premium | 04MAY2026 | VALID | 7800 | 5 | 1 | 1 |
| 2 | GLOBAL_ICECREAM_SYSTEM | IC011 | Yili | CHINA | Matcha | yili@@gmail.com | CN01 | Regular | 25OCT2025 | VALID | 4500 | 5 | 1 | 1 |
| 3 | GLOBAL_ICECREAM_SYSTEM | IC019 | Cartedor | FRANCE | Caramel | carte@gmail.com | FR01 | Luxury | 28JUN2025 | VALID | 7700 | 5 | 1 | 2 |
| 4 | GLOBAL_ICECREAM_SYSTEM | IC018 | Dairyday | INDIA | Vanilla | dairy@gmail.com | IN04 | Regular | 26MAY2025 | VALID | 5800 | 5 | 1 | 6 |
| 5 | GLOBAL_ICECREAM_SYSTEM | IC015 | Grom | ITALY | Hazelnut | grom@gmail.com | EU01 | Premium | 20FEB2025 | VALID | 8100 | 5 | 1 | 2 |
| 6 | GLOBAL_ICECREAM_SYSTEM | IC020 | Tokyocream | JAPAN | Matcha | tokyo@gmail.com | JP01 | Premium | 30JUL2025 | VALID | 8900 | 5 | 1 | 1 |
| 7 | GLOBAL_ICECREAM_SYSTEM | IC008 | Movenpick | SWITZERLAND | Coffee | move@gmail.com | SW01 | Premium | 18AUG2025 | . | 9200 | 5 | 2 | 1 |
| 8 | GLOBAL_ICECREAM_SYSTEM | IC005 | Null | TURKEY | Pistachio | invalid@email.com | TR01 | Premium | 15MAY2025 | VALID | 7000 | 5 | 1 | 1 |
| 9 | GLOBAL_ICECREAM_SYSTEM | IC016 | Coldstone | USA | Cheesecake | coldstone@gmail.com | US02 | Luxury | 22MAR2025 | VALID | 9100 | 5 | 1 | 5 |
Explanation
FIRST./LAST.
processing enables grouped calculations without SQL. This technique is heavily
used in SDTM derivations, visit sequencing, exposure tracking, and adverse
event summarization.
7.PROC REPORT Professional Output
proc report data=revenue_summary nowd;
column Country Category Total_Records Avg_Revenue Total_Revenue;
define Country / group;
define Category / group;
define Total_Records / analysis;
define Avg_Revenue / analysis format=dollar10.;
define Total_Revenue / analysis format=dollar12.;
run;
OUTPUT:
| Country | Category | Total_Records | Avg_Revenue | Total_Revenue |
|---|---|---|---|---|
| BELGIUM | Premium | 1 | $7,800 | $7,800 |
| CHINA | Regular | 1 | $4,500 | $4,500 |
| FRANCE | Luxury | 1 | $7,700 | $7,700 |
| Regular | 1 | . | . | |
| INDIA | Premium | 2 | $6,350 | $12,700 |
| Regular | 4 | $3,700 | $14,800 | |
| ITALY | Premium | 2 | $6,300 | $12,600 |
| JAPAN | Premium | 1 | $8,900 | $8,900 |
| SWITZERLAND | Premium | 1 | $9,200 | $9,200 |
| TURKEY | Premium | 1 | $7,000 | $7,000 |
| USA | Luxury | 5 | $7,400 | $37,000 |
Explanation
PROC
REPORT creates production-quality enterprise outputs. Pharmaceutical companies
use this extensively for Tables, Listings, and Figures (TLFs). Formatting
consistency is critical for executive presentations and regulatory
documentation.
8.Reusable SAS Macro
%macro dataset_check(ds);
proc contents data=&ds;
run;
proc means data=&ds n nmiss;
run;
proc freq data=&ds;
tables Country Category;
run;
%mend;
%dataset_check(validation_flags);
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.VALIDATION_FLAGS | Observations | 20 |
|---|---|---|---|
| Member Type | DATA | Variables | 13 |
| Engine | V9 | Indexes | 0 |
| Created | 06/04/2026 10:37:27 | Observation Length | 216 |
| Last Modified | 06/04/2026 10:37:27 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | YES | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 606 |
| Obs in First Data Page | 20 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_work701100007A7C_odaws02-apse1-2.oda.sas.com/SAS_workB2E300007A7C_odaws02-apse1-2.oda.sas.com/validation_flags.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 1125598 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||||
|---|---|---|---|---|---|
| # | Variable | Type | Len | Format | Informat |
| 3 | Brand | Char | 25 | ||
| 8 | Category | Char | 15 | ||
| 4 | Country | Char | 20 | ||
| 10 | Customer_Age | Num | 8 | AGECHECK. | |
| 6 | Customer_Email | Char | 50 | ||
| 1 | Data_Source | Char | 22 | ||
| 5 | Flavor | Char | 20 | ||
| 2 | IceCream_ID | Char | 8 | ||
| 9 | Manufacture_Date | Num | 8 | DATE9. | ANYDTDTE20. |
| 13 | Missing_Count | Num | 8 | ||
| 7 | Region_Code | Char | 10 | ||
| 11 | Revenue | Num | 8 | ||
| 12 | i | Num | 8 | ||
| Sort Information | |
|---|---|
| Sortedby | Country |
| Validated | YES |
| Character Set | ASCII |
The MEANS Procedure
| Variable | N | N Miss |
|---|---|---|
Manufacture_Date Customer_Age Revenue i Missing_Count | 20 16 19 20 20 | 0 4 1 0 0 |
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| BELGIUM | 1 | 5.00 | 1 | 5.00 |
| CHINA | 1 | 5.00 | 2 | 10.00 |
| FRANCE | 2 | 10.00 | 4 | 20.00 |
| INDIA | 6 | 30.00 | 10 | 50.00 |
| ITALY | 2 | 10.00 | 12 | 60.00 |
| JAPAN | 1 | 5.00 | 13 | 65.00 |
| SWITZERLAND | 1 | 5.00 | 14 | 70.00 |
| TURKEY | 1 | 5.00 | 15 | 75.00 |
| USA | 5 | 25.00 | 20 | 100.00 |
| Category | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Luxury | 6 | 30.00 | 6 | 30.00 |
| Premium | 8 | 40.00 | 14 | 70.00 |
| Regular | 6 | 30.00 | 20 | 100.00 |
Explanation
Macros
enable reusable validation frameworks. Instead of rewriting repetitive code,
enterprise teams standardize QC logic into centralized macro libraries. This
improves productivity, consistency, and deployment scalability.
9.FAMOUS ICE CREAMS WORLDWIDE RAW DATASET CREATION IN R
# using read.delim() with pipe delimiter
icecream_raw <- read.delim(
text = "
IC001|GelatoRoma|Italy|Vanilla|25|4500|12JAN2025|abc@gmail.com| eu01 |Premium
IC001|gelatoroma|ITALY|vanilla|25|-4500|31FEB2025|wrongmail.com|EU01|premium
IC002|Amul|India|Mango|150|3200|15MAR2025|amul@ice.com|in02|Regular
IC003|BaskinRobbins|USA|Chocolate|-5|5500|.|baskin@gmail|US01|Luxury
IC004|Cornetto|France|Strawberry|32|.|11APR2025|cornetto@yahoo.com| FR03|regular
IC005|NULL|Turkey|Pistachio|41|7000|15MAY2025|turkmail.com|TR01|Premium
IC006|HaagenDazs|USA|Cookie|29|6500|14JUN2025|haagen@gmail.com|us01|Luxury
IC007|KwalityWalls|India|Kulfi|38|-800|17JUL2025|kwality@gmail.com|IN-01|Regular
IC008|Movenpick|Switzerland|Coffee|999|9200|18AUG2025|move@gmail.com|SW01|Premium
IC009|Ben&Jerrys|USA|Brownie|44|8700|19SEP2025|ben@icecream.com|US 01|Luxury
IC010|Magnum|Belgium|Almond|33|7800|INVALID|magnum@gmail.com|BE01|Premium
IC011|Yili|China|Matcha|27|4500|25OCT2025|yili@@gmail.com|CN01|Regular
IC012|Walls|India|Chocolate|.|5000|12NOV2025|walls@gmail.com|IN01|Regular
IC013|Nirulas|India|Butterscotch|45|6500|15DEC2025|nirulasgmail.com|IN01|Premium
IC014|BlueBell|USA|Mint|29|7200|18JAN2025|blue@gmail.com|US01|Luxury
IC015|Grom|Italy|Hazelnut|31|8100|20FEB2025|grom@gmail.com|EU01|Premium
IC016|ColdStone|USA|Cheesecake|36|9100|22MAR2025|coldstone@gmail.com|US02|Luxury
IC017|Natural's|India|TenderCoconut|28|6200|24APR2025|natural@gmail.com|IN03|Premium
IC018|DairyDay|India|Vanilla|42|5800|26MAY2025|dairy@gmail.com|IN04|Regular
IC019|CarteDor|France|Caramel|35|7700|28JUN2025|carte@gmail.com|FR01|Luxury
IC020|TokyoCream|Japan|Matcha|30|8900|30JUL2025|tokyo@gmail.com|JP01|Premium
",
sep = "|",
header = FALSE,
stringsAsFactors = FALSE,
na.strings = c(".", "NULL", "INVALID")
)
OUTPUT:
|
|
V1 |
V2 |
V3 |
V4 |
V5 |
V6 |
V7 |
V8 |
V9 |
V10 |
|
1 |
IC001 |
GelatoRoma |
Italy |
Vanilla |
25 |
4500 |
12-Jan-2025 |
abc@gmail.com |
eu01 |
Premium |
|
2 |
IC001 |
gelatoroma |
ITALY |
vanilla |
25 |
-4500 |
31FEB2025 |
wrongmail.com |
EU01 |
premium |
|
3 |
IC002 |
Amul |
India |
Mango |
150 |
3200 |
15-Mar-2025 |
amul@ice.com |
in02 |
Regular |
|
4 |
IC003 |
BaskinRobbins |
USA |
Chocolate |
-5 |
5500 |
NA |
baskin@gmail |
US01 |
Luxury |
|
5 |
IC004 |
Cornetto |
France |
Strawberry |
32 |
NA |
11-Apr-2025 |
cornetto@yahoo.com |
FR03 |
regular |
|
6 |
IC005 |
NA |
Turkey |
Pistachio |
41 |
7000 |
15-May-2025 |
turkmail.com |
TR01 |
Premium |
|
7 |
IC006 |
HaagenDazs |
USA |
Cookie |
29 |
6500 |
14-Jun-2025 |
haagen@gmail.com |
us01 |
Luxury |
|
8 |
IC007 |
KwalityWalls |
India |
Kulfi |
38 |
-800 |
17-Jul-2025 |
kwality@gmail.com |
IN-01 |
Regular |
|
9 |
IC008 |
Movenpick |
Switzerland |
Coffee |
999 |
9200 |
18-Aug-2025 |
move@gmail.com |
SW01 |
Premium |
|
10 |
IC009 |
Ben&Jerrys |
USA |
Brownie |
44 |
8700 |
19-Sep-2025 |
ben@icecream.com |
US 01 |
Luxury |
|
11 |
IC010 |
Magnum |
Belgium |
Almond |
33 |
7800 |
NA |
magnum@gmail.com |
BE01 |
Premium |
|
12 |
IC011 |
Yili |
China |
Matcha |
27 |
4500 |
25-Oct-2025 |
yili@@gmail.com |
CN01 |
Regular |
|
13 |
IC012 |
Walls |
India |
Chocolate |
NA |
5000 |
12-Nov-2025 |
walls@gmail.com |
IN01 |
Regular |
|
14 |
IC013 |
Nirulas |
India |
Butterscotch |
45 |
6500 |
15-Dec-2025 |
nirulasgmail.com |
IN01 |
Premium |
|
15 |
IC014 |
BlueBell |
USA |
Mint |
29 |
7200 |
18-Jan-2025 |
blue@gmail.com |
US01 |
Luxury |
|
16 |
IC015 |
Grom |
Italy |
Hazelnut |
31 |
8100 |
20-Feb-2025 |
grom@gmail.com |
EU01 |
Premium |
|
17 |
IC016 |
ColdStone |
USA |
Cheesecake |
36 |
9100 |
22-Mar-2025 |
coldstone@gmail.com |
US02 |
Luxury |
|
18 |
IC017 |
Natural's |
India |
TenderCoconut |
28 |
6200 |
24-Apr-2025 |
natural@gmail.com |
IN03 |
Premium |
|
19 |
IC018 |
DairyDay |
India |
Vanilla |
42 |
5800 |
26-May-2025 |
dairy@gmail.com |
IN04 |
Regular |
|
20 |
IC019 |
CarteDor |
France |
Caramel |
35 |
7700 |
28-Jun-2025 |
carte@gmail.com |
FR01 |
Luxury |
|
21 |
IC020 |
TokyoCream |
Japan |
Matcha |
30 |
8900 |
30-Jul-2025 |
tokyo@gmail.com |
JP01 |
Premium |
# ASSIGNING COLUMN NAMES
colnames(icecream_raw) <- c(
"IceCream_ID",
"Brand",
"Country",
"Flavor",
"Customer_Age",
"Revenue",
"Manufacture_Date",
"Customer_Email",
"Region_Code",
"Category"
)
# CONVERTING DATA TYPES
icecream_raw$Customer_Age <-
as.numeric(icecream_raw$Customer_Age)
icecream_raw$Revenue <-
as.numeric(icecream_raw$Revenue)
# DATE CONVERSION
icecream_raw$Manufacture_Date <- as.Date(
icecream_raw$Manufacture_Date,
format = "%d%b%Y"
)
OUTPUT:
|
|
IceCream_ID |
Brand |
Country |
Flavor |
Customer_Age |
Revenue |
Manufacture_Date |
Customer_Email |
Region_Code |
Category |
|
1 |
IC001 |
GelatoRoma |
Italy |
Vanilla |
25 |
4500 |
12-01-2025 |
abc@gmail.com |
eu01 |
Premium |
|
2 |
IC001 |
gelatoroma |
ITALY |
vanilla |
25 |
-4500 |
NA |
wrongmail.com |
EU01 |
premium |
|
3 |
IC002 |
Amul |
India |
Mango |
150 |
3200 |
15-03-2025 |
amul@ice.com |
in02 |
Regular |
|
4 |
IC003 |
BaskinRobbins |
USA |
Chocolate |
-5 |
5500 |
NA |
baskin@gmail |
US01 |
Luxury |
|
5 |
IC004 |
Cornetto |
France |
Strawberry |
32 |
NA |
11-04-2025 |
cornetto@yahoo.com |
FR03 |
regular |
|
6 |
IC005 |
NA |
Turkey |
Pistachio |
41 |
7000 |
15-05-2025 |
turkmail.com |
TR01 |
Premium |
|
7 |
IC006 |
HaagenDazs |
USA |
Cookie |
29 |
6500 |
14-06-2025 |
haagen@gmail.com |
us01 |
Luxury |
|
8 |
IC007 |
KwalityWalls |
India |
Kulfi |
38 |
-800 |
17-07-2025 |
kwality@gmail.com |
IN-01 |
Regular |
|
9 |
IC008 |
Movenpick |
Switzerland |
Coffee |
999 |
9200 |
18-08-2025 |
move@gmail.com |
SW01 |
Premium |
|
10 |
IC009 |
Ben&Jerrys |
USA |
Brownie |
44 |
8700 |
19-09-2025 |
ben@icecream.com |
US 01 |
Luxury |
|
11 |
IC010 |
Magnum |
Belgium |
Almond |
33 |
7800 |
NA |
magnum@gmail.com |
BE01 |
Premium |
|
12 |
IC011 |
Yili |
China |
Matcha |
27 |
4500 |
25-10-2025 |
yili@@gmail.com |
CN01 |
Regular |
|
13 |
IC012 |
Walls |
India |
Chocolate |
NA |
5000 |
12-11-2025 |
walls@gmail.com |
IN01 |
Regular |
|
14 |
IC013 |
Nirulas |
India |
Butterscotch |
45 |
6500 |
15-12-2025 |
nirulasgmail.com |
IN01 |
Premium |
|
15 |
IC014 |
BlueBell |
USA |
Mint |
29 |
7200 |
18-01-2025 |
blue@gmail.com |
US01 |
Luxury |
|
16 |
IC015 |
Grom |
Italy |
Hazelnut |
31 |
8100 |
20-02-2025 |
grom@gmail.com |
EU01 |
Premium |
|
17 |
IC016 |
ColdStone |
USA |
Cheesecake |
36 |
9100 |
22-03-2025 |
coldstone@gmail.com |
US02 |
Luxury |
|
18 |
IC017 |
Natural's |
India |
TenderCoconut |
28 |
6200 |
24-04-2025 |
natural@gmail.com |
IN03 |
Premium |
|
19 |
IC018 |
DairyDay |
India |
Vanilla |
42 |
5800 |
26-05-2025 |
dairy@gmail.com |
IN04 |
Regular |
|
20 |
IC019 |
CarteDor |
France |
Caramel |
35 |
7700 |
28-06-2025 |
carte@gmail.com |
FR01 |
Luxury |
|
21 |
IC020 |
TokyoCream |
Japan |
Matcha |
30 |
8900 |
30-07-2025 |
tokyo@gmail.com |
JP01 |
Premium |
Explanation of the R Code
This R
program recreates the same corrupted enterprise-style Famous Ice Cream dataset
that was originally built in SAS using DATALINES. Instead of DATALINES, R uses read.delim()
with the text= argument to directly load raw pipe-delimited records into memory.
The separator sep="|" behaves similarly to SAS dlm='|'.
The na.strings
argument converts problematic values like ".", "NULL", and "INVALID"
into proper missing values (NA). This is extremely important because enterprise
systems often receive corrupted placeholders from APIs, legacy databases, or
Excel uploads.
colnames()
assigns professional business variable names. Numeric conversion is done using as.numeric()
because imported values initially enter R as character vectors. Date conversion
uses as.Date() with %d%b%Y, which interprets values like 12JAN2025.
This
dataset intentionally contains:
- Duplicate IceCream IDs
- Negative revenue
- Impossible ages
- Invalid dates
- Malformed emails
- Mixed case values
- Whitespace corruption
- Missing values
These
issues simulate real-world production failures commonly encountered in
healthcare, retail, banking, and insurance analytics systems.
Key
advantage of R:
- Flexible string handling
- Dynamic memory allocation
- Fast exploratory profiling
10.R Data Cleaning Layer
library(tidyverse)
library(janitor)
library(lubridate)
icecream_clean <- icecream_raw %>%
clean_names() %>%
mutate(
brand =str_to_title(str_trim(brand)),
country =str_to_upper(str_trim(country)),
flavor =str_to_title(str_trim(flavor)),
customer_email =if_else(
grepl("@", customer_email),
str_trim(customer_email),
"invalid@email.com"),
revenue =abs(revenue),
customer_age =if_else(customer_age < 1 |
customer_age > 100,NA_real_,
customer_age),
region_code =str_replace_all(
str_to_upper(region_code)," ",""),
category =str_to_title(str_trim(category))
) %>%
distinct(ice_cream_id, .keep_all = TRUE)
|
|
ice_cream_id |
brand |
country |
flavor |
customer_age |
revenue |
manufacture_date |
customer_email |
region_code |
category |
|
1 |
IC001 |
Gelatoroma |
ITALY |
Vanilla |
25 |
4500 |
12-01-2025 |
abc@gmail.com |
EU01 |
Premium |
|
2 |
IC002 |
Amul |
INDIA |
Mango |
NA |
3200 |
15-03-2025 |
amul@ice.com |
IN02 |
Regular |
|
3 |
IC003 |
Baskinrobbins |
USA |
Chocolate |
NA |
5500 |
NA |
baskin@gmail |
US01 |
Luxury |
|
4 |
IC004 |
Cornetto |
FRANCE |
Strawberry |
32 |
NA |
11-04-2025 |
cornetto@yahoo.com |
FR03 |
Regular |
|
5 |
IC005 |
NA |
TURKEY |
Pistachio |
41 |
7000 |
15-05-2025 |
invalid@email.com |
TR01 |
Premium |
|
6 |
IC006 |
Haagendazs |
USA |
Cookie |
29 |
6500 |
14-06-2025 |
haagen@gmail.com |
US01 |
Luxury |
|
7 |
IC007 |
Kwalitywalls |
INDIA |
Kulfi |
38 |
800 |
17-07-2025 |
kwality@gmail.com |
IN-01 |
Regular |
|
8 |
IC008 |
Movenpick |
SWITZERLAND |
Coffee |
NA |
9200 |
18-08-2025 |
move@gmail.com |
SW01 |
Premium |
|
9 |
IC009 |
Ben&Jerrys |
USA |
Brownie |
44 |
8700 |
19-09-2025 |
ben@icecream.com |
US01 |
Luxury |
|
10 |
IC010 |
Magnum |
BELGIUM |
Almond |
33 |
7800 |
NA |
magnum@gmail.com |
BE01 |
Premium |
|
11 |
IC011 |
Yili |
CHINA |
Matcha |
27 |
4500 |
25-10-2025 |
yili@@gmail.com |
CN01 |
Regular |
|
12 |
IC012 |
Walls |
INDIA |
Chocolate |
NA |
5000 |
12-11-2025 |
walls@gmail.com |
IN01 |
Regular |
|
13 |
IC013 |
Nirulas |
INDIA |
Butterscotch |
45 |
6500 |
15-12-2025 |
invalid@email.com |
IN01 |
Premium |
|
14 |
IC014 |
Bluebell |
USA |
Mint |
29 |
7200 |
18-01-2025 |
blue@gmail.com |
US01 |
Luxury |
|
15 |
IC015 |
Grom |
ITALY |
Hazelnut |
31 |
8100 |
20-02-2025 |
grom@gmail.com |
EU01 |
Premium |
|
16 |
IC016 |
Coldstone |
USA |
Cheesecake |
36 |
9100 |
22-03-2025 |
coldstone@gmail.com |
US02 |
Luxury |
|
17 |
IC017 |
Natural's |
INDIA |
Tendercoconut |
28 |
6200 |
24-04-2025 |
natural@gmail.com |
IN03 |
Premium |
|
18 |
IC018 |
Dairyday |
INDIA |
Vanilla |
42 |
5800 |
26-05-2025 |
dairy@gmail.com |
IN04 |
Regular |
|
19 |
IC019 |
Cartedor |
FRANCE |
Caramel |
35 |
7700 |
28-06-2025 |
carte@gmail.com |
FR01 |
Luxury |
|
20 |
IC020 |
Tokyocream |
JAPAN |
Matcha |
30 |
8900 |
30-07-2025 |
tokyo@gmail.com |
JP01 |
Premium |
Explanation
The R
workflow mirrors SAS cleaning logic using modern tidyverse functions. mutate()
transforms variables, if_else() performs conditional repair, grepl() validates
email structure, and parse_date_time() standardizes corrupted dates. Unlike
SAS, R handles character memory dynamically, reducing truncation risk. However,
SAS remains stronger in metadata governance and auditability.
Enterprise Validation &
Compliance
In
regulated industries like pharmaceuticals, banking, and insurance, data
cleaning is not optional it is a compliance requirement.
Clinical
trial standards like CDISC SDTM and ADaM require:
- traceable derivations
- reproducible transformations
- independent QC validation
- metadata consistency
- audit-ready outputs
One
dangerous SAS behavior is:
if revenue > . then
Missing
numeric values in SAS are treated as lower than valid numbers. Improper logic
can accidentally include missing values in calculations, creating severe
statistical errors.
Regulatory
agencies expect:
- full lineage
- documented transformations
- controlled macros
- version governance
- reproducible outputs
20 Enterprise Data Cleaning
Best Practices
- Standardize metadata early
- Validate date formats
immediately
- Remove duplicates before
aggregation
- Centralize validation rules
- Use reusable macros
- Track audit lineage
- Separate raw and cleaned
datasets
- Never overwrite source data
- Use controlled terminology
- Validate missingness
patterns
- Normalize text variables
- Standardize region mappings
- QC independently
- Use PROC CONTENTS frequently
- Validate variable lengths
- Use defensive programming
- Create exception reports
- Automate data profiling
- Maintain reproducible
workflows
- Document every derivation
Business Logic Behind
Cleaning
Enterprise
cleaning exists because analytics depends on trust. If patient ages are
negative or greater than 150, statistical models become invalid. If revenue
values are negative due to system corruption, profitability dashboards mislead
executives. Missing dates break time-series forecasting and visit-window calculations.
Text normalization matters because “premium,” “Premium,” and “ PREMIUM ” should
represent the same category. Otherwise, aggregation logic produces fragmented
results. Missing values are often imputed to preserve analytical continuity.
Standardized variables improve downstream joins, AI model consistency, and
dashboard reliability. Every transformation must support business meaning,
traceability, and analytical reproducibility.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Duplicate records destroy
reporting accuracy.
- Standardized variables
improve reproducibility.
- SAS excels in governed
enterprise environments.
- R provides flexible transformation
workflows.
- Metadata drives reliable
analytics.
- Audit trails protect
regulatory integrity.
- Missing values silently
corrupt models.
- Defensive programming
reduces production failures.
- PROC SQL simplifies
enterprise joins.
- DATA Step offers unmatched
row-level control.
- ARRAYS improve scalability.
- QC independence improves
trustworthiness.
- Enterprise cleaning requires
documentation discipline.
- Standard formats reduce
operational chaos.
- Automation improves
consistency.
- Clean datasets accelerate AI
reliability.
- Reporting accuracy depends
on preprocessing quality.
- Trustworthy analytics begins
with trustworthy data.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Very
High |
Growing |
|
Flexibility |
Structured |
Extremely
Flexible |
|
Enterprise
Governance |
Strong |
Moderate |
|
Visualization |
Moderate |
Excellent |
|
Metadata
Control |
Excellent |
Limited |
|
Performance |
Strong |
Strong |
|
Open
Source Ecosystem |
Limited |
Massive |
|
Reproducibility |
Excellent |
Excellent |
|
Statistical
Power |
Excellent |
Excellent |
Summary
SAS and R
complement each other exceptionally well in enterprise analytics ecosystems.
SAS dominates in regulated production systems because of its metadata
governance, auditability, standardized validation capabilities, and mature
reporting infrastructure. DATA Step programming provides extremely granular
row-level control, while PROC SQL and reporting procedures simplify enterprise
summarization. SAS remains deeply trusted across pharmaceutical, banking, and insurance
sectors because of reproducibility and compliance strength.
R excels
in flexibility, open-source innovation, modern visualization, and rapid
experimentation. Packages like tidyverse, stringr, lubridate, janitor, and
purrr dramatically simplify complex transformations. R also integrates
efficiently with machine learning workflows and modern AI pipelines.
The
strongest enterprise strategy is not SAS versus R it is SAS and R together. SAS
provides governance, production control, validation frameworks, and audit
readiness. R contributes agility, modern analytics, advanced transformations,
and scalable data science tooling. Together, they create reliable, scalable,
and production-grade analytical intelligence systems.
Conclusion
Modern
analytics is fundamentally a data-quality problem disguised as a reporting
problem. Organizations often invest millions into AI systems, dashboards, cloud
platforms, and predictive models while underestimating the destructive impact
of corrupted operational data. Duplicate records, malformed text, inconsistent
dates, negative values, missing fields, and metadata corruption silently damage
analytical trust long before executives notice the consequences.
This
Famous Ice Cream enterprise project demonstrates how SAS and R transform
chaotic operational information into trusted analytical intelligence. Using
DATA Step programming, PROC SQL, ARRAYS, FIRST./LAST. processing, PROC FORMAT,
macros, and enterprise reporting procedures, SAS provides unmatched governance
and production stability. Simultaneously, R offers agile transformation
frameworks through tidyverse pipelines, modern string processing, and flexible
statistical workflows.
In real
enterprise ecosystems, data cleaning is not a cosmetic task. It directly
impacts:
- clinical trial integrity
- regulatory compliance
- insurance claim accuracy
- banking fraud detection
- AI model reliability
- executive decision-making
- operational forecasting
- financial reporting
The most
successful analytics organizations treat cleaning pipelines as mission-critical
infrastructure. They standardize metadata, enforce governance rules, automate
validation frameworks, maintain reproducible audit trails, and separate raw
data from analytical datasets.
A single
malformed variable can invalidate an entire statistical submission. A duplicate
identifier can distort revenue reporting. Improper missing-value handling can
silently destroy AI predictions. That is why enterprise-grade cleaning frameworks
are essential.
SAS and R
together create a powerful ecosystem where governance meets flexibility,
compliance meets innovation, and analytical reliability becomes scalable. Clean
data is not merely technical hygiene it is the foundation of trustworthy
intelligence.
Interview Questions and
Answers
1. How would you identify duplicate IceCream IDs in
SAS?
Answer
I would
first use PROC SORT NODUPKEY to remove duplicates and then use PROC FREQ or PROC
SQL with GROUP BY HAVING COUNT(*) > 1 to identify repeated IDs. In
enterprise systems, duplicate identifiers can inflate revenue metrics and
distort patient or customer analytics.
2. Why is missing numeric handling dangerous in
SAS?
Answer
SAS
treats missing numeric values as smaller than valid numbers. Improper logic
like if amount > . may unintentionally include missing observations. This
can produce catastrophic analytical and statistical errors in regulated
reporting environments.
3. How would you validate malformed email addresses
in R?
Answer
I would
use grepl() or str_detect() with regex validation. Invalid emails would either
be corrected through reference mapping or flagged into exception datasets for
manual review.
4. Why should LENGTH statements appear before
assignments in SAS?
Answer
SAS
allocates character variable length during first compilation encounter. If
LENGTH appears after assignment, truncation may occur permanently. This is a
major production risk in SDTM and enterprise reporting environments.
5. When would you prefer DATA Step over PROC SQL?
Answer
DATA Step
is better for row-level sequential processing, FIRST./LAST. logic, arrays, and
complex conditional transformations. PROC SQL is preferred for joins,
aggregation, summarization, and relational operations. Enterprise workflows
usually combine both approaches strategically.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ICE-CREAM DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment