From Mughal Markets to Modern Metadata: Cleaning Indian Cultural Data at Scale
Kingdoms, Chaos & Corrupted Columns: Rebuilding Indian Culture Data Trust with SAS
Introduction
The Monday Morning Disaster Nobody Wanted
Three
years ago, I walked into a regulatory war room at 8:12 AM.
Nobody
was drinking coffee anymore. That phase had passed.
The FDA
had flagged a submission overnight.
Not
because the statistical model was wrong.
Not
because efficacy failed.
Because
the enrollment counts didn't reconcile.
A
duplicate patient ID just one had cascaded through SDTM mappings, inflated
treatment population counts, distorted exposure summaries, and corrupted
downstream ADaM derivations.
One
duplicated identifier.
One
invisible crack in the pipeline.
The
clinical dashboard looked perfect. Executive summaries were polished.
Visualization layers sparkled in Power BI.
But
underneath?
Rot.
That is
the brutal truth about enterprise analytics:
Dashboards
don't fail because of charts.
They fail because somebody trusted dirty data.
And the
scariest part?
Most
corrupted datasets still look normal.
A
negative claim amount.
A malformed timestamp.
A hidden trailing blank.
A lowercase region code that fails a merge.
A patient age of 342 quietly slipping into a survival model.
The
disaster doesn't happen during ingestion.
It
happens weeks later when executives make decisions using poisoned outputs.
Today,
I'll show you how senior SAS programmers and data engineers build defensive
cleaning pipelines using both SAS and R.
Not
classroom code.
Production-grade
thinking.
Business Scenario: Indian
Cultural Heritage Intelligence Platform (1200–1800 Era)
Imagine a
government-funded cultural analytics platform studying Indian cultural
evolution between 1200 and 1800.
The
platform aggregates records from:
- Temple archives
- Trade registries
- Kingdom census systems
- Cultural manuscripts
- Festival taxation records
- Historical migration logs
But
source systems are inconsistent.
Some
clerks used uppercase.
Some stored years as text.
Some kingdoms renamed regions.
Some records duplicated during digitization.
Some transaction values became negative during OCR conversion.
Now
executives want:
- Cultural migration
dashboards
- Economic trade summaries
- Regional heritage heatmaps
- Predictive AI models on
cultural influence
Bad
cleaning here means:
- Wrong regional
classifications
- Broken AI predictions
- False economic estimates
- Regulatory audit failures
- Corrupted historical
intelligence
Raw Enterprise Dataset
Design
We
intentionally inject realistic enterprise-quality problems:
|
Problem
Type |
Example |
|
Duplicate
IDs |
C102
repeated |
|
Impossible
Ages |
250, -5 |
|
Invalid
Dates |
32-13-1650 |
|
Corrupted
Emails |
kingdommail.com |
|
NULL
strings |
"NULL" |
|
Mixed
casing |
souTH,
EAST |
|
Trailing
blanks |
"
Mughal " |
|
Negative
values |
-4500 |
|
Character
numeric confusion |
"15O0" |
|
Invalid
category labels |
"FestivalX" |
Why LENGTH Must Come First in SAS
This is
one of the most dangerous production issues in SAS.
Consider
this:
data demo;
input region $;
datalines;
SOUTH_INDIA
;
run;
proc print data = demo;
run;
OUTPUT:
| Obs | region |
|---|---|
| 1 | SOUTH_IN |
data demo;
length region $12;
input region $;
datalines;
SOUTH_INDIA
;
run;
proc print data = demo;
run;
OUTPUT:
| Obs | region |
|---|---|
| 1 | SOUTH_INDIA |
data broken;
set demo;
region='NORTH_EAST_INDIA';
run;
proc print data = broken;
run;
OUTPUT:
| Obs | region |
|---|---|
| 1 | NORTH_EAST_I |
data broken;
length region $20;
set demo;
region='NORTH_EAST_INDIA';
run;
proc print data = broken;
run;
OUTPUT:
| Obs | region |
|---|---|
| 1 | NORTH_EAST_INDIA |
You might
expect "NORTH_EAST_INDIA".
You
actually get:
NORTH_
Why?
Because
SAS determines character variable length during first creation.
If region
was initially length 6, every future assignment gets truncated silently.
No
warning.
No error.
Just
corrupted production data.
This is
called Character Truncation Risk.
In
Fortune 500 systems, this destroys:
- country names
- protocol identifiers
- treatment descriptions
- audit trails
- metadata lineage
That is
why senior SAS programmers declare LENGTH first.
Always.
SAS Raw Dataset Creation
data indian_culture_raw;
length Culture_ID $12 Kingdom $40 Region $25 Festival_Type $20
Contact_Email $60 Trade_Value_Char $15 Population_Char $12
Timestamp_Char $30 Visit_Date_Char $20;
informat Visit_Date anydtdte20.;
format Visit_Date date9.;
infile datalines truncover;
input Culture_ID:$12. Kingdom:$40. Region:$25. Festival_Type:$20.
Age Trade_Value Trade_Value_Char:$15. Population_Char:$12.
Visit_Date_Char:$20. Timestamp_Char:$30. Contact_Email:$60.;
/* Convert character date safely */
Visit_Date = input(Visit_Date_Char, anydtdte20.);
drop Visit_Date_Char;
datalines;
C101 Mughal NORTH Holi 45 25000 25000 120000 15JAN1650 1650-01-15:10:30:00 king@mughal.com
C102 Mughal south Diwali 250 -5000 -5000 90000 . NULL kingmail.com
C103 Chola EAST Pongal 32 40000 40000 150000 10FEB1500 1500/02/10:11:00:00 admin@chola
C102 Mughal SOUTH Diwali 250 -5000 -5000 90000 . NULL kingmail.com
C104 NULL west FestivalX -5 30000 30000 80000 32-13-1650 INVALID_TS wrongemail
C105 Vijayanagara SOUTH Dussehra 55 45000 45O00 170000 01MAR1600 1600-03-01:14:10:00 king@vijayanagara.com
C106 Maratha EAST Holi . 35000 35000 abcde 12APR1700 1700-04-12:09:00:00 NULL
C107 Sikh NORTH Baisakhi 70 -4500 -4500 100000 . . contact@sikh.org
C108 Rajput EAST Holi 29 20000 20000 110000 15MAY1550 invalidtimestamp rajput@mail
C109 Mughal south Eid 48 60000 60000 130000 20JUN1655 1655-06-20:08:00:00 emperor@mughal.com
C110 Maratha WEST Ganesh 39 0 0 98000 11JUL1750 1750-07-11:07:15:00 admin@maratha.in
C111 Chola EAST Pongal 41 42000 42000 140000 09AUG1450 1450-08-09:11:11:11 king@chola.com
C112 Vijayanagara SOUTH NULL 36 51000 51000 175000 . . vijaya@empire
C113 Rajput EAST Holi 28 19000 19000 115000 22SEP1580 1580-09-22:10:20:00 raj@rajput.com
C114 Mughal NORTH Eid 52 61000 61000 135000 31OCT1690 badformat emperor@mughal
C115 Sikh north Baisakhi 66 33000 33000 105000 05NOV1710 1710-11-05:13:00:00 contact@sikh.org
;
run;
proc print data=indian_culture_raw;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | king@mughal.com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 |
| 2 | C102 | Mughal | south | Diwali | kingmail.com | -5000 | 90000 | NULL | . | 250 | -5000 |
| 3 | C103 | Chola | EAST | Pongal | admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 |
| 4 | C102 | Mughal | SOUTH | Diwali | kingmail.com | -5000 | 90000 | NULL | . | 250 | -5000 |
| 5 | C104 | NULL | west | FestivalX | wrongemail | 30000 | 80000 | INVALID_TS | . | -5 | 30000 |
| 6 | C105 | Vijayanagara | SOUTH | Dussehra | king@vijayanagara.com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 |
| 7 | C106 | Maratha | EAST | Holi | NULL | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 |
| 8 | C107 | Sikh | NORTH | Baisakhi | contact@sikh.org | -4500 | 100000 | . | 70 | -4500 | |
| 9 | C108 | Rajput | EAST | Holi | rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 |
| 10 | C109 | Mughal | south | Eid | emperor@mughal.com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 |
| 11 | C110 | Maratha | WEST | Ganesh | admin@maratha.in | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 |
| 12 | C111 | Chola | EAST | Pongal | king@chola.com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 |
| 13 | C112 | Vijayanagara | SOUTH | NULL | vijaya@empire | 51000 | 175000 | . | 36 | 51000 | |
| 14 | C113 | Rajput | EAST | Holi | raj@rajput.com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 |
| 15 | C114 | Mughal | NORTH | Eid | emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 |
| 16 | C115 | Sikh | north | Baisakhi | contact@sikh.org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 |
PROC CONTENTS — First Defensive Move
proc contents data=indian_culture_raw;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.INDIAN_CULTURE_RAW | Observations | 16 |
|---|---|---|---|
| Member Type | DATA | Variables | 11 |
| Engine | V9 | Indexes | 0 |
| Created | 05/31/2026 17:29:11 | Observation Length | 240 |
| Last Modified | 05/31/2026 17:29:11 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 545 |
| Obs in First Data Page | 16 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workE0C80001D1B4_odaws01-apse1-2.oda.sas.com/SAS_work1B940001D1B4_odaws01-apse1-2.oda.sas.com/indian_culture_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 67129944 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||||
|---|---|---|---|---|---|
| # | Variable | Type | Len | Format | Informat |
| 10 | Age | Num | 8 | ||
| 5 | Contact_Email | Char | 60 | ||
| 1 | Culture_ID | Char | 12 | ||
| 4 | Festival_Type | Char | 20 | ||
| 2 | Kingdom | Char | 40 | ||
| 7 | Population_Char | Char | 12 | ||
| 3 | Region | Char | 25 | ||
| 8 | Timestamp_Char | Char | 30 | ||
| 11 | Trade_Value | Num | 8 | ||
| 6 | Trade_Value_Char | Char | 15 | ||
| 9 | Visit_Date | Num | 8 | DATE9. | ANYDTDTE20. |
Why This Matters in Production
This dataset mirrors real ingestion pipelines.
Nothing here is random.
Every corruption pattern exists in live enterprise systems:
- OCR extraction failures
- ETL mapping defects
- manual data entry issues
- cross-platform encoding mismatches
- inconsistent regional taxonomies
The biggest beginner mistake?
Assuming raw data is trustworthy.
Senior engineers assume the opposite.
SAS Cleaning Pipeline
1.Standardization Layer
data culture_clean_stage1;
set indian_culture_raw;
array chars(*) Kingdom Region Festival_Type Contact_Email;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
chars(i)=compress(chars(i),,'kw');
chars(i)=tranwrd(chars(i),'NULL','');
chars(i)=propcase(lowcase(chars(i)));
end;
Region=upcase(region);
if age < 0 or age > 120 then age=.;
Trade_Value=abs(trade_value);
if missing(contact_email) then
contact_email='unknown@domain.com';
if index(contact_email,'@')=0 then
contact_email='invalid_email@fix.com';
run;
proc print data=culture_clean_stage1;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 | 5 |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 |
| 3 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 |
| 4 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 |
| 5 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | . | 30000 | 5 | |
| 6 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 | 5 |
| 7 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 | 5 |
| 8 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | 70 | 4500 | 5 | |
| 9 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 |
| 10 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 | 5 |
| 11 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 | 5 |
| 12 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 | 5 |
| 13 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | ||
| 14 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 | 5 |
| 15 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 |
| 16 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 | 5 |
Why This Matters
This is
not cosmetic cleaning.
This is
semantic normalization.
Notice
what happens:
- STRIP() removes hidden
whitespace
- COMPRESS() removes
non-printable junk
- TRANWRD() replaces fake NULL
values
- PROPCASE() standardizes
labels
- ABS() corrects financial
polarity errors
Production
systems fail because business rules are inconsistent.
Cleaning
isn't formatting.
Cleaning
is enforcing enterprise meaning.
Common failure:
People clean presentation instead of logic.
2.SELECT-WHEN for Controlled Categorization
data culture_clean_stage2;
set culture_clean_stage1;
length Region_Group $20;
select(region);
when('NORTH') Region_Group='Northern';
when('SOUTH') Region_Group='Southern';
when('EAST') Region_Group='Eastern';
when('WEST') Region_Group='Western';
otherwise Region_Group='Unknown';
end;
run;
proc print data=culture_clean_stage2;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 | 5 | Northern |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 3 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 | Eastern |
| 4 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 5 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | . | 30000 | 5 | Western | |
| 6 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 | 5 | Southern |
| 7 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 | 5 | Eastern |
| 8 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | 70 | 4500 | 5 | Northern | |
| 9 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 | Eastern |
| 10 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 | 5 | Southern |
| 11 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 | 5 | Western |
| 12 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 | 5 | Eastern |
| 13 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | Southern | ||
| 14 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 | 5 | Eastern |
| 15 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 | Northern |
| 16 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 | 5 | Northern |
Why This Matters
SELECT-WHEN
scales better than huge IF chains.
Production
systems often maintain:
- 200+ region mappings
- 500+ medical code groups
- evolving classification
standards
Structured
categorization reduces maintenance risk.
Common
failure:
Developers hardcode inconsistent IF conditions across programs.
3.PROC SORT NODUPKEY
proc sort data=culture_clean_stage2
out=culture_dedup nodupkey;
by Culture_ID;
run;
proc print data=culture_dedup;
run;
LOG:
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 | 5 | Northern |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 3 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 | Eastern |
| 4 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | . | 30000 | 5 | Western | |
| 5 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 | 5 | Southern |
| 6 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 | 5 | Eastern |
| 7 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | 70 | 4500 | 5 | Northern | |
| 8 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 | Eastern |
| 9 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 | 5 | Southern |
| 10 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 | 5 | Western |
| 11 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 | 5 | Eastern |
| 12 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | Southern | ||
| 13 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 | 5 | Eastern |
| 14 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 | Northern |
| 15 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 | 5 | Northern |
Why This Matters
Duplicate
IDs destroy:
- enrollment counts
- patient exposure totals
- financial aggregates
- AI model training
NODUPKEY
is defensive engineering.
But
here's the hidden danger:
It keeps
FIRST occurrence only.
If the
first record is wrong, you preserve corruption.
Senior
engineers always define survivorship logic explicitly.
4.FIRST./LAST. Processing
proc sort data=culture_clean_stage2;
by Culture_ID Visit_Date;
run;
proc print data=culture_clean_stage2;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 | 5 | Northern |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 3 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 4 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 | Eastern |
| 5 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | . | 30000 | 5 | Western | |
| 6 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 | 5 | Southern |
| 7 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 | 5 | Eastern |
| 8 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | 70 | 4500 | 5 | Northern | |
| 9 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 | Eastern |
| 10 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 | 5 | Southern |
| 11 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 | 5 | Western |
| 12 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 | 5 | Eastern |
| 13 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | Southern | ||
| 14 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 | 5 | Eastern |
| 15 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 | Northern |
| 16 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 | 5 | Northern |
data latest_records;
set culture_clean_stage2;
by Culture_ID;
if last.Culture_ID;
run;
proc print data=latest_records;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 | 5 | Northern |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 3 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 | Eastern |
| 4 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | . | 30000 | 5 | Western | |
| 5 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 | 5 | Southern |
| 6 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 | 5 | Eastern |
| 7 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | 70 | 4500 | 5 | Northern | |
| 8 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 | Eastern |
| 9 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 | 5 | Southern |
| 10 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 | 5 | Western |
| 11 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 | 5 | Eastern |
| 12 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | Southern | ||
| 13 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 | 5 | Eastern |
| 14 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 | Northern |
| 15 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 | 5 | Northern |
Why This Matters
FIRST./LAST.
logic is enterprise gold.
Used for:
- latest patient visit
- most recent transaction
- newest insurance claim
- last audit update
Common
mistake:
Forgetting to sort properly first.
Wrong
sorting = wrong business truth.
5.PROC SQL Validation Layer
proc sql;
create table invalid_emails as
select *
from culture_clean_stage2
where contact_email not like '%@%.%';
quit;
proc print data=invalid_emails;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 | Eastern |
| 2 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 | Eastern |
| 3 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | Southern | ||
| 4 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 | Northern |
Why This Matters
SQL
validation layers isolate anomalies.
Production
teams often maintain:
- fraud exception tables
- invalid demographic logs
- reconciliation failure
outputs
Never
delete bad records silently.
Quarantine
them.
Auditability
matters.
6.PROC FORMAT
proc format;
value agegrp . = 'Missing Age'
low - <18 = 'Child'
18 - 40 = 'Adult'
41 - 65 = 'Middle'
66 - high = 'Senior';
run;
LOG:
Why This Matters
Formats
separate business meaning from raw values.
This
improves:
- maintainability
- consistency
- reporting standardization
Common
failure:
Hardcoding category logic repeatedly.
7.RETAIN Statement
data running_trade;
set culture_dedup;
retain cumulative_trade 0;
cumulative_trade + trade_value;
format age agegrp.;
run;
proc print data=running_trade;
run;
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group | cumulative_trade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | Middle | 25000 | 5 | Northern | 25000 |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | Missing Age | 5000 | 5 | Southern | 30000 |
| 3 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | Adult | 40000 | 5 | Eastern | 70000 |
| 4 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | Missing Age | 30000 | 5 | Western | 100000 | |
| 5 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | Middle | 45000 | 5 | Southern | 145000 |
| 6 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | abcde | 1700-04-12:09:00:00 | 12APR1700 | Missing Age | 35000 | 5 | Eastern | 180000 |
| 7 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | Senior | 4500 | 5 | Northern | 184500 | |
| 8 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | Adult | 20000 | 5 | Eastern | 204500 |
| 9 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | Middle | 60000 | 5 | Southern | 264500 |
| 10 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | Adult | 0 | 5 | Western | 264500 |
| 11 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | Middle | 42000 | 5 | Eastern | 306500 |
| 12 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | Adult | 51000 | 5 | Southern | 357500 | ||
| 13 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | Adult | 19000 | 5 | Eastern | 376500 |
| 14 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | Middle | 61000 | 5 | Northern | 437500 |
| 15 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | Senior | 33000 | 5 | Northern | 470500 |
Why This Matters
RETAIN
preserves values across iterations.
Used
heavily for:
- cumulative exposure
- running balances
- audit counters
- sequence tracking
Without
RETAIN, SAS resets values every row.
8.Macro-Driven Cleaning Framework
%macro validate_numeric(ds,var);
data &ds;
set &ds;
if verify(strip(&var),'0123456789') > 0 then
&var='';
run;
proc print data=&ds;
run;
%mend;
%validate_numeric(culture_dedup,Population_Char);
OUTPUT:
| Obs | Culture_ID | Kingdom | Region | Festival_Type | Contact_Email | Trade_Value_Char | Population_Char | Timestamp_Char | Visit_Date | Age | Trade_Value | i | Region_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Mughal | NORTH | Holi | King@mughal.Com | 25000 | 120000 | 1650-01-15:10:30:00 | 15JAN1650 | 45 | 25000 | 5 | Northern |
| 2 | C102 | Mughal | SOUTH | Diwali | invalid_email@fix.com | -5000 | 90000 | NULL | . | . | 5000 | 5 | Southern |
| 3 | C103 | Chola | EAST | Pongal | Admin@chola | 40000 | 150000 | 1500/02/10:11:00:00 | . | 32 | 40000 | 5 | Eastern |
| 4 | C104 | WEST | Festivalx | invalid_email@fix.com | 30000 | 80000 | INVALID_TS | . | . | 30000 | 5 | Western | |
| 5 | C105 | Vijayanagara | SOUTH | Dussehra | King@vijayanagara.Com | 45O00 | 170000 | 1600-03-01:14:10:00 | 01MAR1600 | 55 | 45000 | 5 | Southern |
| 6 | C106 | Maratha | EAST | Holi | unknown@domain.com | 35000 | 1700-04-12:09:00:00 | 12APR1700 | . | 35000 | 5 | Eastern | |
| 7 | C107 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | -4500 | 100000 | . | 70 | 4500 | 5 | Northern | |
| 8 | C108 | Rajput | EAST | Holi | Rajput@mail | 20000 | 110000 | invalidtimestamp | . | 29 | 20000 | 5 | Eastern |
| 9 | C109 | Mughal | SOUTH | Eid | Emperor@mughal.Com | 60000 | 130000 | 1655-06-20:08:00:00 | 20JUN1655 | 48 | 60000 | 5 | Southern |
| 10 | C110 | Maratha | WEST | Ganesh | Admin@maratha.In | 0 | 98000 | 1750-07-11:07:15:00 | 11JUL1750 | 39 | 0 | 5 | Western |
| 11 | C111 | Chola | EAST | Pongal | King@chola.Com | 42000 | 140000 | 1450-08-09:11:11:11 | . | 41 | 42000 | 5 | Eastern |
| 12 | C112 | Vijayanagara | SOUTH | Vijaya@empire | 51000 | 175000 | . | 36 | 51000 | 5 | Southern | ||
| 13 | C113 | Rajput | EAST | Holi | Raj@rajput.Com | 19000 | 115000 | 1580-09-22:10:20:00 | . | 28 | 19000 | 5 | Eastern |
| 14 | C114 | Mughal | NORTH | Eid | Emperor@mughal | 61000 | 135000 | badformat | 31OCT1690 | 52 | 61000 | 5 | Northern |
| 15 | C115 | Sikh | NORTH | Baisakhi | Contact@sikh.Org | 33000 | 105000 | 1710-11-05:13:00:00 | 05NOV1710 | 66 | 33000 | 5 | Northern |
Why This Matters
Macros
industrialize cleaning logic.
Enterprise
systems may validate:
- 500 datasets nightly
- 10,000+ variables
- multi-country standards
Macros
create repeatable governance.
Common
failure:
Overcomplicated macros with no logging.
9.PROC SUMMARY
proc summary data=culture_dedup nway;
class Region_Group;
var Trade_Value;
output out=trade_summary sum=Trade_Value_Total
mean=Trade_Value_Avg
median=Trade_Value_Median;
run;
proc print data=trade_summary;
run;
OUTPUT:
| Obs | Region_Group | _TYPE_ | _FREQ_ | Trade_Value_Total | Trade_Value_Avg | Trade_Value_Median |
|---|---|---|---|---|---|---|
| 1 | Eastern | 1 | 5 | 156000 | 31200 | 35000 |
| 2 | Northern | 1 | 4 | 123500 | 30875 | 29000 |
| 3 | Southern | 1 | 4 | 161000 | 40250 | 48000 |
| 4 | Western | 1 | 2 | 30000 | 15000 | 15000 |
Why This Matters
Production
reporting pipelines rely heavily on summarized outputs.
This
drives:
- executive dashboards
- KPI reporting
- regulatory summaries
Bad
cleaning before summaries creates polished lies.
What PROC SUMMARY
Actually Generated
Your
original code probably created:
|
Variable |
Meaning |
|
Trade_Value_Sum |
Sum |
|
Trade_Value_Mean |
Mean |
|
Trade_Value_Median |
Median |
|
TYPE |
Classification level |
|
FREQ |
Observation count |
10.PROC REPORT
proc report data=trade_summary nowd;
column Region_Group
Trade_Value_Total
Trade_Value_Avg
Trade_Value_Median;
define Region_Group / group;
define Trade_Value_Total / analysis;
define Trade_Value_Avg / analysis;
define Trade_Value_Median / analysis;
run;
OUTPUT:
| Region_Group | Trade_Value_Total | Trade_Value_Avg | Trade_Value_Median |
|---|---|---|---|
| Eastern | 156000 | 31200 | 35000 |
| Northern | 123500 | 30875 | 29000 |
| Southern | 161000 | 40250 | 48000 |
| Western | 30000 | 15000 | 15000 |
Why This Matters
PROC
REPORT powers enterprise reporting.
Clinical
listings.
Financial summaries.
Operational metrics.
Formatting
is easy.
Reliable
upstream data is hard.
11.How R Handles Character Memory Differently
R behaves
differently internally.
Character
vectors in R are dynamically managed using references and string pools.
R does
not preallocate fixed-length character storage like SAS.
So this
works safely:
region <- "SOUTH"
region <- "NORTH_EAST_INDIA"
CONSOLE:
> region <- "SOUTH" > region <- "NORTH_EAST_INDIA"
No
truncation.
But R
introduces different risks:
- memory overhead
- inconsistent factor
conversion
- encoding issues
- hidden UTF corruption
SAS risks
truncation.
R risks inconsistency.
Enterprise
engineers must understand both.
12.R Version — Cleaning Workflow
Raw Dataset Creation in R
library(tidyverse)
library(lubridate)
library(janitor)
culture_raw <- tibble(
Culture_ID = c("C101","C102","C103","C102","C104",
"C105","C106","C107","C108","C109",
"C110","C111","C112","C113","C114","C115"),
Kingdom = c("Mughal","Mughal","Chola","Mughal","NULL",
"Vijayanagara","Maratha","Sikh","Rajput","Mughal",
"Maratha","Chola","Vijayanagara","Rajput","Mughal","Sikh"),
Region = c("NORTH","south","EAST","SOUTH","west",
"SOUTH","EAST","NORTH","EAST","south",
"WEST","EAST","SOUTH","EAST","NORTH","north"),
Age = c(45,250,32,250,-5,55,NA,70,29,48,39,41,36,28,52,66),
Trade_Value = c(25000,-5000,40000,-5000,30000,
45000,35000,-4500,20000,60000,
0,42000,51000,19000,61000,33000),
Visit_Date = c("15JAN1650",".","10FEB1500",".",
"32-13-1650","01MAR1600","12APR1700",".",
"15MAY1550","20JUN1655","11JUL1750",
"09AUG1450",".","22SEP1580",
"31OCT1690","05NOV1710")
)
OUTPUT:
|
|
Culture_ID |
Kingdom |
Region |
Age |
Trade_Value |
Visit_Date |
|
1 |
C101 |
Mughal |
NORTH |
45 |
25000 |
15JAN1650 |
|
2 |
C102 |
Mughal |
south |
250 |
-5000 |
. |
|
3 |
C103 |
Chola |
EAST |
32 |
40000 |
10FEB1500 |
|
4 |
C102 |
Mughal |
SOUTH |
250 |
-5000 |
. |
|
5 |
C104 |
NULL |
west |
-5 |
30000 |
32-13-1650 |
|
6 |
C105 |
Vijayanagara |
SOUTH |
55 |
45000 |
01MAR1600 |
|
7 |
C106 |
Maratha |
EAST |
NA |
35000 |
12APR1700 |
|
8 |
C107 |
Sikh |
NORTH |
70 |
-4500 |
. |
|
9 |
C108 |
Rajput |
EAST |
29 |
20000 |
15MAY1550 |
|
10 |
C109 |
Mughal |
south |
48 |
60000 |
20JUN1655 |
|
11 |
C110 |
Maratha |
WEST |
39 |
0 |
11JUL1750 |
|
12 |
C111 |
Chola |
EAST |
41 |
42000 |
09AUG1450 |
|
13 |
C112 |
Vijayanagara |
SOUTH |
36 |
51000 |
. |
|
14 |
C113 |
Rajput |
EAST |
28 |
19000 |
22SEP1580 |
|
15 |
C114 |
Mughal |
NORTH |
52 |
61000 |
31OCT1690 |
|
16 |
C115 |
Sikh |
north |
66 |
33000 |
05NOV1710 |
SAS vs R Translation
|
SAS |
R |
|
DATA
step |
tibble() |
|
INPUT |
c() |
|
LENGTH |
dynamic
character vectors |
|
FORMAT |
mutate()/formatting |
13.mutate() vs DATA Step Assignment
culture_clean <- culture_raw %>%
mutate(
Region = str_to_upper(str_trim(Region)),
Kingdom = str_to_title(str_trim(Kingdom)),
Trade_Value = abs(Trade_Value),
Age = if_else(Age < 0 | Age > 120,
NA_real_,Age)
)
OUTPUT:
|
|
Culture_ID |
Kingdom |
Region |
Age |
Trade_Value |
Visit_Date |
|
1 |
C101 |
Mughal |
NORTH |
45 |
25000 |
15JAN1650 |
|
2 |
C102 |
Mughal |
SOUTH |
NA |
5000 |
. |
|
3 |
C103 |
Chola |
EAST |
32 |
40000 |
10FEB1500 |
|
4 |
C102 |
Mughal |
SOUTH |
NA |
5000 |
. |
|
5 |
C104 |
Null |
WEST |
NA |
30000 |
32-13-1650 |
|
6 |
C105 |
Vijayanagara |
SOUTH |
55 |
45000 |
01MAR1600 |
|
7 |
C106 |
Maratha |
EAST |
NA |
35000 |
12APR1700 |
|
8 |
C107 |
Sikh |
NORTH |
70 |
4500 |
. |
|
9 |
C108 |
Rajput |
EAST |
29 |
20000 |
15MAY1550 |
|
10 |
C109 |
Mughal |
SOUTH |
48 |
60000 |
20JUN1655 |
|
11 |
C110 |
Maratha |
WEST |
39 |
0 |
11JUL1750 |
|
12 |
C111 |
Chola |
EAST |
41 |
42000 |
09AUG1450 |
|
13 |
C112 |
Vijayanagara |
SOUTH |
36 |
51000 |
. |
|
14 |
C113 |
Rajput |
EAST |
28 |
19000 |
22SEP1580 |
|
15 |
C114 |
Mughal |
NORTH |
52 |
61000 |
31OCT1690 |
|
16 |
C115 |
Sikh |
NORTH |
66 |
33000 |
05NOV1710 |
Why This Matters
This
mirrors SAS DATA-step transformations.
|
SAS |
R |
|
UPCASE |
str_to_upper |
|
STRIP |
str_trim |
|
ABS |
abs |
|
IF-THEN |
if_else |
R
pipelines feel declarative.
SAS feels
procedural.
Both are
powerful.
Both require discipline.
14.case_when() vs SELECT-WHEN
culture_clean <- culture_clean %>%
mutate(
Region_Group = case_when(
Region == "NORTH" ~ "Northern",
Region == "SOUTH" ~ "Southern",
Region == "EAST" ~ "Eastern",
Region == "WEST" ~ "Western",
TRUE ~ "Unknown")
)
OUTPUT:
|
|
Culture_ID |
Kingdom |
Region |
Age |
Trade_Value |
Visit_Date |
Region_Group |
|
1 |
C101 |
Mughal |
NORTH |
45 |
25000 |
15JAN1650 |
Northern |
|
2 |
C102 |
Mughal |
SOUTH |
NA |
5000 |
. |
Southern |
|
3 |
C103 |
Chola |
EAST |
32 |
40000 |
10FEB1500 |
Eastern |
|
4 |
C102 |
Mughal |
SOUTH |
NA |
5000 |
. |
Southern |
|
5 |
C104 |
Null |
WEST |
NA |
30000 |
32-13-1650 |
Western |
|
6 |
C105 |
Vijayanagara |
SOUTH |
55 |
45000 |
01MAR1600 |
Southern |
|
7 |
C106 |
Maratha |
EAST |
NA |
35000 |
12APR1700 |
Eastern |
|
8 |
C107 |
Sikh |
NORTH |
70 |
4500 |
. |
Northern |
|
9 |
C108 |
Rajput |
EAST |
29 |
20000 |
15MAY1550 |
Eastern |
|
10 |
C109 |
Mughal |
SOUTH |
48 |
60000 |
20JUN1655 |
Southern |
|
11 |
C110 |
Maratha |
WEST |
39 |
0 |
11JUL1750 |
Western |
|
12 |
C111 |
Chola |
EAST |
41 |
42000 |
09AUG1450 |
Eastern |
|
13 |
C112 |
Vijayanagara |
SOUTH |
36 |
51000 |
. |
Southern |
|
14 |
C113 |
Rajput |
EAST |
28 |
19000 |
22SEP1580 |
Eastern |
|
15 |
C114 |
Mughal |
NORTH |
52 |
61000 |
31OCT1690 |
Northern |
|
16 |
C115 |
Sikh |
NORTH |
66 |
33000 |
05NOV1710 |
Northern |
|
SAS |
R |
|
SELECT-WHEN |
case_when |
|
OTHERWISE |
TRUE ~ |
15.Date Parsing in R
culture_clean <- culture_raw %>%
mutate(
Visit_Date_Raw = Visit_Date,
Visit_Date_Raw = na_if(Visit_Date_Raw, "."),
Visit_Date = suppressWarnings(
parse_date_time( Visit_Date_Raw,orders = c("dmy","dby","dbY"))
),
Date_Parse_Flag = if_else( is.na(Visit_Date),
"INVALID_DATE","VALID_DATE")
)
OUTPUT:
|
|
Culture_ID |
Kingdom |
Region |
Age |
Trade_Value |
Visit_Date |
Visit_Date_Raw |
Date_Parse_Flag |
|
1 |
C101 |
Mughal |
NORTH |
45 |
25000 |
1650-01-15 |
15JAN1650 |
VALID_DATE |
|
2 |
C102 |
Mughal |
south |
250 |
-5000 |
NA |
NA |
INVALID_DATE |
|
3 |
C103 |
Chola |
EAST |
32 |
40000 |
1500-02-10 |
10FEB1500 |
VALID_DATE |
|
4 |
C102 |
Mughal |
SOUTH |
250 |
-5000 |
NA |
NA |
INVALID_DATE |
|
5 |
C104 |
NULL |
west |
-5 |
30000 |
NA |
32-13-1650 |
INVALID_DATE |
|
6 |
C105 |
Vijayanagara |
SOUTH |
55 |
45000 |
1600-03-01 |
01MAR1600 |
VALID_DATE |
|
7 |
C106 |
Maratha |
EAST |
NA |
35000 |
1700-04-12 |
12APR1700 |
VALID_DATE |
|
8 |
C107 |
Sikh |
NORTH |
70 |
-4500 |
NA |
NA |
INVALID_DATE |
|
9 |
C108 |
Rajput |
EAST |
29 |
20000 |
1550-05-15 |
15MAY1550 |
VALID_DATE |
|
10 |
C109 |
Mughal |
south |
48 |
60000 |
1655-06-20 |
20JUN1655 |
VALID_DATE |
|
11 |
C110 |
Maratha |
WEST |
39 |
0 |
1750-07-11 |
11JUL1750 |
VALID_DATE |
|
12 |
C111 |
Chola |
EAST |
41 |
42000 |
1450-08-09 |
09AUG1450 |
VALID_DATE |
|
13 |
C112 |
Vijayanagara |
SOUTH |
36 |
51000 |
NA |
NA |
INVALID_DATE |
|
14 |
C113 |
Rajput |
EAST |
28 |
19000 |
1580-09-22 |
22SEP1580 |
VALID_DATE |
|
15 |
C114 |
Mughal |
NORTH |
52 |
61000 |
1690-10-31 |
31OCT1690 |
VALID_DATE |
|
16 |
C115 |
Sikh |
north |
66 |
33000 |
1710-11-05 |
05NOV1710 |
VALID_DATE |
Why This Matters
Date
corruption is catastrophic in enterprise systems.
Bad dates
affect:
- treatment windows
- claims eligibility
- forecasting
- exposure calculations
R is
flexible.
Too
flexible sometimes.
SAS is
stricter.
Too
strict sometimes.
Senior
engineers know when each behavior helps.
16.distinct() vs PROC SORT NODUPKEY
culture_dedup <- culture_clean %>%
distinct(Culture_ID, .keep_all=TRUE)
|
|
Culture_ID |
Kingdom |
Region |
Age |
Trade_Value |
Visit_Date |
Visit_Date_Raw |
Date_Parse_Flag |
|
1 |
C101 |
Mughal |
NORTH |
45 |
25000 |
1650-01-15 |
15JAN1650 |
VALID_DATE |
|
2 |
C102 |
Mughal |
south |
250 |
-5000 |
NA |
NA |
INVALID_DATE |
|
3 |
C103 |
Chola |
EAST |
32 |
40000 |
1500-02-10 |
10FEB1500 |
VALID_DATE |
|
4 |
C104 |
NULL |
west |
-5 |
30000 |
NA |
32-13-1650 |
INVALID_DATE |
|
5 |
C105 |
Vijayanagara |
SOUTH |
55 |
45000 |
1600-03-01 |
01MAR1600 |
VALID_DATE |
|
6 |
C106 |
Maratha |
EAST |
NA |
35000 |
1700-04-12 |
12APR1700 |
VALID_DATE |
|
7 |
C107 |
Sikh |
NORTH |
70 |
-4500 |
NA |
NA |
INVALID_DATE |
|
8 |
C108 |
Rajput |
EAST |
29 |
20000 |
1550-05-15 |
15MAY1550 |
VALID_DATE |
|
9 |
C109 |
Mughal |
south |
48 |
60000 |
1655-06-20 |
20JUN1655 |
VALID_DATE |
|
10 |
C110 |
Maratha |
WEST |
39 |
0 |
1750-07-11 |
11JUL1750 |
VALID_DATE |
|
11 |
C111 |
Chola |
EAST |
41 |
42000 |
1450-08-09 |
09AUG1450 |
VALID_DATE |
|
12 |
C112 |
Vijayanagara |
SOUTH |
36 |
51000 |
NA |
NA |
INVALID_DATE |
|
13 |
C113 |
Rajput |
EAST |
28 |
19000 |
1580-09-22 |
22SEP1580 |
VALID_DATE |
|
14 |
C114 |
Mughal |
NORTH |
52 |
61000 |
1690-10-31 |
31OCT1690 |
VALID_DATE |
|
15 |
C115 |
Sikh |
north |
66 |
33000 |
1710-11-05 |
05NOV1710 |
VALID_DATE |
|
SAS |
R |
|
PROC
SORT NODUPKEY |
distinct() |
Enterprise Validation &
Compliance
The Dangerous SAS Missing Value Trap
This is
one of SAS's deadliest default behaviors:
. < -999 < 0 < 1
Missing numeric
values sort lower than every valid number.
Meaning:
if lab_value < 5 then
flag='LOW';
This
unintentionally flags missing values too.
That
single oversight has caused:
- false safety signals
- incorrect patient exclusions
- failed regulatory outputs
Defensive
SAS programming requires:
if not missing(lab_value)
and lab_value < 5 then
flag='LOW';
Always.
SDTM & ADaM Enterprise
Expectations
In
regulated environments:
|
Requirement |
Meaning |
|
Traceability |
Every
derived value traceable to source |
|
QC
Independence |
Separate
programmer validates logic |
|
Audit
Trails |
Every
transformation documented |
|
Metadata
Governance |
Variable
definitions standardized |
|
Reproducibility |
Same
code = same output always |
Cleaning
pipelines are not optional.
They are
regulatory infrastructure.
Business Logic Narrative
Why Cleaning Decisions Matter
Data
cleaning is not cosmetic housekeeping. It is operational risk management.
Take
patient age correction. If a subject age appears as 250 because of OCR
corruption, downstream population stratification fails. Elderly subgroup
analyses become distorted. Risk-adjustment algorithms inflate mortality
expectations. One impossible value contaminates every derived statistic
touching age.
Consider
salary normalization in banking systems. Suppose "₹50,000 " contains
trailing blanks while "50000" does not. Grouping operations may split
identical salaries into separate categories. Executives reviewing compensation
dashboards see fictional segmentation patterns. HR planning breaks because
strings were not standardized.
Missing-date
imputation is even more dangerous. Imagine an insurance claim without a
transaction date. If the ETL pipeline defaults it to January 1st, quarterly
financial reporting shifts artificially. Fraud detection windows become inaccurate.
Regulatory filing periods misalign.
Text
normalization matters too. "south", "South ", and
"SOUTH" appear identical to humans but not to machines. Predictive
models interpret them as different categories unless standardized.
This is
why experienced programmers think in terms of downstream consequences,
not syntax.
Every
cleaning rule changes business truth.
20 Real-World Data Cleaning
Best Practices
- Declare LENGTH before
assignments in SAS
- Never overwrite raw source
datasets
- Build quarantine tables for
invalid records
- Standardize metadata
centrally
- Use macro-driven validation
frameworks
- Validate before aggregation
- Separate business rules from
formatting logic
- Maintain audit trails for
derivations
- Use independent QC
programmers
- Never trust imported datatypes
- Explicitly handle missing
values
- Version-control cleaning
macros
- Validate join cardinality
before merges
- Standardize date formats
early
- Normalize casing before
comparisons
- Avoid hardcoded business
mappings
- Log all dropped observations
- Reconcile counts after every
transformation
- Design reusable validation
libraries
- Test pipelines using
intentionally corrupted datasets
20 Sharp One-Liner Insights
- Missing values in SAS aren't
empty they're quietly the smallest number in your dataset.
- A duplicate ID is never
“just one duplicate.”
- Dirty dimensions create fake
business intelligence.
- Most dashboards fail long
before visualization begins.
- Cleaning logic is business
logic in disguise.
- Whitespace is invisible
corruption.
- Invalid dates destroy
timelines silently.
- A bad merge can rewrite
history.
- Enterprise trust starts with
metadata discipline.
- PROC SORT can hide problems
as easily as solve them.
- R is flexible; SAS is
controlled. Production needs both mindsets.
- Auditability matters more
than elegance in regulated systems.
- Null handling separates
junior analysts from senior engineers.
- Defensive programming saves
careers.
- Production ETL pipelines are
survival systems.
- The cleanest dashboard can
still be analytically poisonous.
- Most AI failures are
data-engineering failures first.
- Data lineage is operational
insurance.
- Standardization reduces
entropy.
- Bad source data scales
faster than good governance.
SAS vs R Comparison Table
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Industry
Standard |
Growing |
|
Flexibility |
Structured |
Extremely
Flexible |
|
Memory
Handling |
Fixed-length
chars |
Dynamic
strings |
|
Enterprise
Governance |
Strong |
Requires
tooling |
|
Visualization |
Moderate |
Excellent |
|
Performance |
Excellent
on large ETL |
Excellent
analytics |
|
Macro
Automation |
Powerful |
Functional
programming |
|
Learning
Curve |
Moderate |
Moderate-High |
|
Clinical
Trial Fit |
Exceptional |
Expanding
rapidly |
Production Validation
Checklist
1.Validate metadata before
transformations
2.Check duplicate business keys
3.Standardize text casing
4.Remove hidden whitespace
5.Validate date ranges
6.Flag biologically impossible
values
7.Reconcile counts after merges
8.Validate missing-value handling
9.QC derived variables
independently
10.Preserve raw datasets
unchanged
11.Log all transformation rules
12.Validate join cardinality
13.Standardize formats centrally
14.Test edge-case scenarios
15.Review audit trail
completeness
Summary
SAS
dominates heavily regulated enterprise environments because it was engineered
around control, reproducibility, and auditability. In clinical trials, banking
compliance systems, and pharmaceutical submissions, SAS remains deeply trusted
because every transformation behaves predictably. DATA step processing is
deterministic. PROC-based workflows are standardized. Macro systems
industrialize repetitive governance logic. Most importantly, SAS integrates
naturally into validation-heavy environments where traceability matters as much
as computation itself.
R
approaches the world differently. It prioritizes flexibility, extensibility,
and analytical creativity. Packages like dplyr, stringr, and lubridate
dramatically accelerate exploratory cleaning and transformation workflows. R’s
vectorized operations and functional programming paradigms make complex
transformations elegant and expressive. However, this flexibility introduces
governance challenges. Without strong coding standards, R pipelines can become
inconsistent across teams.
In
practice, elite enterprise organizations increasingly combine both ecosystems.
SAS
handles:
- validated ETL
- SDTM/ADaM pipelines
- regulatory deliverables
- operational reproducibility
R
handles:
- advanced analytics
- AI feature engineering
- visualization
- exploratory modeling
The
strongest data engineers are bilingual.
They
understand when SAS rigidity protects the business and when R flexibility
accelerates innovation.
Conclusion
Organizations
love to invest in AI.
Executives
buy machine-learning platforms.
Teams build dashboards.
Consultants deploy predictive engines.
But
underneath every successful analytical ecosystem sits something far less
glamorous:
A
disciplined cleaning framework.
Without
it, enterprise intelligence becomes sophisticated fiction.
Think
about what dirty data actually does.
A
duplicated patient ID changes enrollment counts.
A malformed timestamp shifts quarterly revenue.
A lowercase region code breaks joins.
A missing numeric value silently enters statistical calculations.
A truncated variable destroys traceability.
None of
these failures announce themselves dramatically.
They
spread quietly.
That is
what makes them dangerous.
The
hardest production problems are rarely algorithmic.
They are
semantic.
The
machine computed exactly what you told it to compute using corrupted business
truth.
Structured
cleaning frameworks solve this by introducing consistency, governance,
defensibility, and auditability into every transformation step. They force
organizations to think explicitly about:
- lineage
- validation
- metadata
- reproducibility
- exception handling
- business-rule enforcement
This is
why senior engineers obsess over seemingly small details:
- variable lengths
- missing-value behavior
- casing standardization
- merge logic
- duplicate survivorship rules
Because
they understand the downstream blast radius.
In modern
enterprises, trust is the product.
And
trustworthy analytics begin long before dashboards, AI models, or executive
summaries.
They
begin the moment raw data enters the pipeline.
That is
where real engineering starts.
Interview Questions &
Answers
1. Duplicate Patient IDs Inflated Enrollment Counts
Question
A
regulatory submission showed 1,204 enrolled patients, but source systems
contained only 1,198. How would you investigate?
Answer
First, I
would validate uniqueness assumptions on patient identifiers using both PROC
SORT NODUPKEY and PROC SQL GROUP BY HAVING COUNT(*) > 1 logic. Then I’d examine
survivorship rules because duplicates are rarely identical. I’d compare visit
dates, source timestamps, and ingestion lineage to identify whether the issue
originated from ETL duplication, merge cardinality inflation, or source-system
replay failures. I would also reconcile counts at every pipeline stage to
isolate where inflation first appeared. The key is not simply removing
duplicates but understanding why they existed.
2. Missing Numeric Values Triggered False Safety
Flags
Question
Why can
missing numeric values in SAS become dangerous in production?
Answer
Because
SAS treats missing numeric values as smaller than any valid number. A condition
like:
if lab_value < 5;
unexpectedly
includes missing observations. In clinical safety analyses, this can falsely
classify subjects as abnormal. Defensive programming requires explicit missing
checks:
if not missing(lab_value)
and lab_value < 5;
I always
review conditional logic involving ranges, thresholds, and statistical flags to
ensure missing values are handled intentionally rather than implicitly.
3. Macro Debugging Failure in Production
Question
A macro
works for one dataset but fails for another with truncation issues. How do you
debug it?
Answer
I first
inspect metadata using PROC CONTENTS because truncation problems usually
originate from inherited lengths. Then I enable debugging options:
options mprint mlogic symbolgen;
Next, I
review whether variables are created before LENGTH statements. In SAS, the
first assignment determines character storage size unless LENGTH is declared
earlier. Most truncation bugs are architectural, not syntactical.
4. Cross-System Validation Between SAS and R Failed
Question
A SAS
summary total differs from an R summary total. What would you check first?
Answer
I would
immediately investigate:
- missing-value handling
- datatype coercion
- join behavior
- duplicate treatment
- date parsing
SAS and R
differ significantly in default behaviors. For example, SAS excludes missing
values in many procedures automatically, while R functions may require
na.rm=TRUE. I’d reconcile row counts step-by-step across both systems until
divergence appears.
5. Merge Logic Produced Unexpected Record Inflation
Question
A DATA
step MERGE unexpectedly doubled record counts. What happened?
Answer
Most
likely the BY keys were non-unique in one or both datasets, causing
many-to-many merge inflation. Before every merge, I validate cardinality using:
proc sql;
select key,count(*)
from dataset
group by key
having count(*)>1;
quit;
Production
engineers never assume join uniqueness. Every merge must have validated
business-key expectations before execution.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about CULTURE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment