Mangalsutra Meets Metadata: Inside Enterprise Marriage Data Cleaning with SAS and R
Sacred Vows, Corrupted Rows: Cleaning Indian Marriage Data Before Analytics Collapse
Introduction:
The Monday Morning Disaster Nobody Expected
Three
years ago, I walked into a regulatory war room at 7:10 AM.
Nobody
touched the coffee.
A Fortune
500 healthcare insurer had just failed an internal audit tied to a demographic
enrichment pipeline. The system merged public marriage registration data with
policyholder records to predict dependent coverage eligibility, fraud
probability, and regional premium risk.
The
dashboard showed a dramatic increase in “high-stability family households.
Executives
celebrated.
Actuaries
recalculated premium forecasts.
AI models
approved lower-risk policies.
Then QA
found the bomb.
Duplicate
spouse IDs.
Thousands
of them.
One
corrupted merge key inflated married-household counts by 18%.
Worse,
negative claim reimbursements silently passed through because the billing field
was stored as character data. Values like " -45000 " survived
validation because whitespace and implicit conversion masked the issue.
One
malformed region code broke downstream stratification logic.
One
impossible age value 212 years old flowed into actuarial modeling because
nobody implemented defensive range validation.
Within
hours:
- Fraud scores became unreliable
- Executive reports became
legally questionable
- Machine learning outputs
became statistically contaminated
- Audit traceability failed
- Regulatory confidence
collapsed
That’s
the reality of dirty data.
Not
theory.
Not
tutorials.
Production.
This
article walks through how senior SAS programmers and data scientists engineer
defensive, enterprise-grade cleaning frameworks using SAS DATA Step, PROC SQL,
and modern R pipelines.
We are going to build corrupted Indian marriage registration datasets from scratch, intentionally poison them with realistic enterprise failures, then clean them the way experienced production teams actually do it.
Why LENGTH Comes First in SAS The Silent Killer Called Character Truncation
Junior
programmers often think this is syntax trivia.
It isn’t.
It’s data
survival.
In SAS,
the first time a character variable appears, SAS determines its storage length
permanently unless explicitly controlled using LENGTH.
Example
disaster:
data demo;
spouse_name='Raj';
spouse_name='Rajesh Kumar Varma';
run;
proc print data = demo;
run;
OUTPUT:
| Obs | spouse_name |
|---|---|
| 1 | Raj |
Result?
spouse_name
becomes length 3 because "Raj" appeared first.
Everything
afterward truncates silently:
Raj
Raj
Not
warning.
Not
error.
Silent
corruption.
In
production, this destroys:
- Patient names
- Site identifiers
- Country codes
- Medication descriptions
- Audit traceability
That’s
why experienced SAS programmers place LENGTH first before INPUT, IF statements,
assignments, or merges.
R behaves
differently.
R
character vectors are dynamically allocated objects using internal string pooling
and reference semantics.R does not permanently truncate strings because memory
allocation is abstracted at runtime. A character column can store
"Raj" and "Rajesh Kumar Varma" together safely without
predefined width constraints.
SAS
prioritizes storage efficiency and fixed metadata structures.
R
prioritizes flexibility and dynamic memory handling.
Understanding
this difference separates production engineers from syntax learners.
data demo;
length spouse_name $20;
spouse_name='Raj';
spouse_name='Rajesh Kumar Varma';
run;
proc print data = demo;
run;
OUTPUT:
| Obs | spouse_name |
|---|---|
| 1 | Rajesh Kumar Varma |
Building the Raw Corrupted
Indian Marriage Dataset in SAS
SAS Raw Dataset Creation
data indian_marriage_raw;
length Marriage_ID $12 Husband_Name $40 Wife_Name $40 Region_Code $10
Marriage_Type $20 Registrar_Email $60 Annual_Income $15
Marriage_Date_Raw $25 City $25 Status_Flag $15;
informat Marriage_Date_Raw $25.;
infile datalines truncover;
input Marriage_ID:$12. Husband_Name:$40. Wife_Name:$40. Age_Husband
Age_Wife Region_Code:$10. Marriage_Type:$20. Registrar_Email:$60.
Annual_Income:$15. Marriage_Date_Raw:$25. City:$25. Status_Flag:$15.;
datalines;
M1001 Ravi_Kumar Anjali 29 24 AP Hindu registrar@apgov.in 850000 12-01-2025 Hyderabad ACTIVE
M1002 Ravi_Kumar Anjali 29 24 AP Hindu registrar@apgov.in 850000 12-01-2025 Hyderabad ACTIVE
M1003 Arjun Meera 212 31 TN Hindu invalidmail.com 650000 15/02/2025 Chennai ACTIVE
M1004 Kiran NULL -5 27 KA Muslim registrar@kgov.in -550000 . Bangalore PENDING
M1005 Suresh Lakshmi 35 33 ap Christian reg_apgov.in 1200000 2025-03-18 Vizag ACTIVE
M1006 Vikram Priya 41 39 MH Unknown registrar@mh.gov.in abc900 17MAR2025 Mumbai CLOSED
M1007 Dinesh Kavya 32 . KL Hindu registrar@kl.gov.in 990000 18-13-2025 Kochi ACTIVE
M1008 Rohan Sneha 28 22 XX Sikh registrar@xgov.in 1000000 11-02-2025 Delhi ACTIVE
M1009 Tarun Pooja 45 44 AP Hindu registrar@apgov.in -150000 20250315 Hyderabad ACTIVE
M1010 NULL Divya 37 36 TN Jain registrar@tn.gov.in 760000 14APR2025 Chennai CLOSED
M1011 Akash Nisha 33 29 AP Hindu registrar@apgov.in 850000 . Hyderabad ACTIVE
M1012 Akash Nisha 33 29 AP Hindu registrar@apgov.in 850000 . Hyderabad ACTIVE
M1013 Mohan Radha 999 21 MH Buddhist registrar@mgov.in 450000 10-10-2024 Pune ACTIVE
M1014 Rakesh Simran 39 35 KA Hindu registrar@kgov.in 0 25DEC2024 Mysore ACTIVE
M1015 Ajay Priti 31 28 AP Hindu registrar@apgov.in 780000 31/11/2025 Vijayawada PENDING
;
run;
proc print data=indian_marriage_raw;
run;
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Annual_Income | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1001 | Ravi_Kumar | Anjali | AP | Hindu | registrar@apgov.in | 850000 | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 |
| 2 | M1002 | Ravi_Kumar | Anjali | AP | Hindu | registrar@apgov.in | 850000 | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 |
| 3 | M1003 | Arjun | Meera | TN | Hindu | invalidmail.com | 650000 | 15/02/2025 | Chennai | ACTIVE | 212 | 31 |
| 4 | M1004 | Kiran | NULL | KA | Muslim | registrar@kgov.in | -550000 | Bangalore | PENDING | -5 | 27 | |
| 5 | M1005 | Suresh | Lakshmi | ap | Christian | reg_apgov.in | 1200000 | 2025-03-18 | Vizag | ACTIVE | 35 | 33 |
| 6 | M1006 | Vikram | Priya | MH | Unknown | registrar@mh.gov.in | abc900 | 17MAR2025 | Mumbai | CLOSED | 41 | 39 |
| 7 | M1007 | Dinesh | Kavya | KL | Hindu | registrar@kl.gov.in | 990000 | 18-13-2025 | Kochi | ACTIVE | 32 | . |
| 8 | M1008 | Rohan | Sneha | XX | Sikh | registrar@xgov.in | 1000000 | 11-02-2025 | Delhi | ACTIVE | 28 | 22 |
| 9 | M1009 | Tarun | Pooja | AP | Hindu | registrar@apgov.in | -150000 | 20250315 | Hyderabad | ACTIVE | 45 | 44 |
| 10 | M1010 | NULL | Divya | TN | Jain | registrar@tn.gov.in | 760000 | 14APR2025 | Chennai | CLOSED | 37 | 36 |
| 11 | M1011 | Akash | Nisha | AP | Hindu | registrar@apgov.in | 850000 | Hyderabad | ACTIVE | 33 | 29 | |
| 12 | M1012 | Akash | Nisha | AP | Hindu | registrar@apgov.in | 850000 | Hyderabad | ACTIVE | 33 | 29 | |
| 13 | M1013 | Mohan | Radha | MH | Buddhist | registrar@mgov.in | 450000 | 10-10-2024 | Pune | ACTIVE | 999 | 21 |
| 14 | M1014 | Rakesh | Simran | KA | Hindu | registrar@kgov.in | 0 | 25DEC2024 | Mysore | ACTIVE | 39 | 35 |
| 15 | M1015 | Ajay | Priti | AP | Hindu | registrar@apgov.in | 780000 | 31/11/2025 | Vijayawada | PENDING | 31 | 28 |
Why This Dataset Matters
This is
not random bad data.
Every
corruption pattern here reflects real production failures:
- Duplicate marriage
registrations
- Impossible ages
- Invalid region codes
- Negative income values
- Malformed emails
- Mixed-case inconsistencies
- Character/numeric confusion
- Missing dates
- Invalid timestamps
Production
systems rarely fail because of one catastrophic error.
They fail
because 400 small inconsistencies accumulate silently.
One
common mistake: developers validate structure but ignore semantic correctness.
"999" is numeric but biologically impossible.
Enterprise
cleaning is about meaning, not merely datatype compliance.
SAS DATA Step Cleaning
1. Standardization and Defensive Cleaning
data marriage_clean_stage1;
set indian_marriage_raw;
retain Load_Datetime;
Load_Datetime=datetime();
array chars(*) Husband_Name Wife_Name Region_Code
Marriage_Type Registrar_Email
City Status_Flag;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
chars(i)=compress(chars(i),,'kw');
end;
Husband_Name=propcase(lowcase(Husband_Name));
Wife_Name=propcase(lowcase(Wife_Name));
City=propcase(lowcase(City));
Region_Code=upcase(Region_Code);
Registrar_Email=lowcase(Registrar_Email);
if Husband_Name='Null' then Husband_Name='';
if Wife_Name='Null' then Wife_Name='';
Annual_Income_Num=input(compress(Annual_Income,,'kd'),8.);
if Annual_Income_Num < 0 then
Annual_Income_Num=abs(Annual_Income_Num);
Marriage_Date=input(Marriage_Date_Raw,anydtdte15.);
format Marriage_Date date9.;
if Age_Husband < 18 or Age_Husband > 100 then
Age_Husband=.;
if Age_Wife < 18 or Age_Wife > 100 then
Age_Wife=.;
Email_Flag=0;
if index(Registrar_Email,'@')=0 then
Email_Flag=1;
drop i Annual_Income;
rename Annual_Income_Num = Annual_Income;
run;
proc print data=marriage_clean_stage1;
run;
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife | Load_Datetime | Annual_Income | Marriage_Date | Email_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1001 | Ravi_kumar | Anjali | AP | Hindu | registrar@apgov.in | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 | 2096004747.3 | 850000 | 01DEC2025 | 0 |
| 2 | M1002 | Ravi_kumar | Anjali | AP | Hindu | registrar@apgov.in | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 | 2096004747.3 | 850000 | 01DEC2025 | 0 |
| 3 | M1003 | Arjun | Meera | TN | Hindu | invalidmail.com | 15/02/2025 | Chennai | ACTIVE | . | 31 | 2096004747.3 | 650000 | 15FEB2025 | 1 |
| 4 | M1004 | Kiran | KA | Muslim | registrar@kgov.in | Bangalore | PENDING | . | 27 | 2096004747.3 | 550000 | . | 0 | ||
| 5 | M1005 | Suresh | Lakshmi | AP | Christian | reg_apgov.in | 2025-03-18 | Vizag | ACTIVE | 35 | 33 | 2096004747.3 | 1200000 | 18MAR2025 | 1 |
| 6 | M1006 | Vikram | Priya | MH | Unknown | registrar@mh.gov.in | 17MAR2025 | Mumbai | CLOSED | 41 | 39 | 2096004747.3 | 900 | 17MAR2025 | 0 |
| 7 | M1007 | Dinesh | Kavya | KL | Hindu | registrar@kl.gov.in | 18-13-2025 | Kochi | ACTIVE | 32 | . | 2096004747.3 | 990000 | . | 0 |
| 8 | M1008 | Rohan | Sneha | XX | Sikh | registrar@xgov.in | 11-02-2025 | Delhi | ACTIVE | 28 | 22 | 2096004747.3 | 1000000 | 02NOV2025 | 0 |
| 9 | M1009 | Tarun | Pooja | AP | Hindu | registrar@apgov.in | 20250315 | Hyderabad | ACTIVE | 45 | 44 | 2096004747.3 | 150000 | 15MAR2025 | 0 |
| 10 | M1010 | Divya | TN | Jain | registrar@tn.gov.in | 14APR2025 | Chennai | CLOSED | 37 | 36 | 2096004747.3 | 760000 | 14APR2025 | 0 | |
| 11 | M1011 | Akash | Nisha | AP | Hindu | registrar@apgov.in | Hyderabad | ACTIVE | 33 | 29 | 2096004747.3 | 850000 | . | 0 | |
| 12 | M1012 | Akash | Nisha | AP | Hindu | registrar@apgov.in | Hyderabad | ACTIVE | 33 | 29 | 2096004747.3 | 850000 | . | 0 | |
| 13 | M1013 | Mohan | Radha | MH | Buddhist | registrar@mgov.in | 10-10-2024 | Pune | ACTIVE | . | 21 | 2096004747.3 | 450000 | 10OCT2024 | 0 |
| 14 | M1014 | Rakesh | Simran | KA | Hindu | registrar@kgov.in | 25DEC2024 | Mysore | ACTIVE | 39 | 35 | 2096004747.3 | 0 | 25DEC2024 | 0 |
| 15 | M1015 | Ajay | Priti | AP | Hindu | registrar@apgov.in | 31/11/2025 | Vijayawada | PENDING | 31 | 28 | 2096004747.3 | 780000 | . | 0 |
Why This DATA Step Matters
This DATA
step behaves like a defensive firewall.
Enterprise
cleaning is not cosmetic formatting. It is risk containment.
Notice
how arrays reduce repetitive logic. In real studies, you may standardize 200+
character variables. Arrays make programs maintainable and scalable.
COMPRESS(...,'kw')
removes unwanted whitespace artifacts while preserving readable content.
PROPCASE() normalizes names for reporting consistency. INPUT() converts dirty
character income values into numerics defensively.
One of
the most dangerous production assumptions is “numeric-looking data is safe.” It
isn’t.
abc900
becomes corrupted unless explicitly validated.
Another
common production failure: programmers overwrite raw fields without preserving
lineage. Always retain raw columns somewhere in the pipeline for audit
traceability.
2.Advanced Conditional Logic with SELECT-WHEN
data marriage_clean_stage2;
set marriage_clean_stage1;
length Region_Desc $30 Risk_Category $20;
select (Region_Code);
when ('AP') Region_Desc='Andhra Pradesh';
when ('TN') Region_Desc='Tamil Nadu';
when ('KA') Region_Desc='Karnataka';
when ('MH') Region_Desc='Maharashtra';
when ('KL') Region_Desc='Kerala';
otherwise Region_Desc='Invalid Region';
end;
select;
when (Annual_Income >= 1000000)
Risk_Category='High Income';
when (500000 <= Annual_Income <= 999999)
Risk_Category='Middle Income';
otherwise
Risk_Category='Low Income';
end;
run;
proc print data=marriage_clean_stage2;
run;
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife | Load_Datetime | Annual_Income | Marriage_Date | Email_Flag | Region_Desc | Risk_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1001 | Ravi_kumar | Anjali | AP | Hindu | registrar@apgov.in | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 | 2096004747.3 | 850000 | 01DEC2025 | 0 | Andhra Pradesh | Middle Income |
| 2 | M1002 | Ravi_kumar | Anjali | AP | Hindu | registrar@apgov.in | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 | 2096004747.3 | 850000 | 01DEC2025 | 0 | Andhra Pradesh | Middle Income |
| 3 | M1003 | Arjun | Meera | TN | Hindu | invalidmail.com | 15/02/2025 | Chennai | ACTIVE | . | 31 | 2096004747.3 | 650000 | 15FEB2025 | 1 | Tamil Nadu | Middle Income |
| 4 | M1004 | Kiran | KA | Muslim | registrar@kgov.in | Bangalore | PENDING | . | 27 | 2096004747.3 | 550000 | . | 0 | Karnataka | Middle Income | ||
| 5 | M1005 | Suresh | Lakshmi | AP | Christian | reg_apgov.in | 2025-03-18 | Vizag | ACTIVE | 35 | 33 | 2096004747.3 | 1200000 | 18MAR2025 | 1 | Andhra Pradesh | High Income |
| 6 | M1006 | Vikram | Priya | MH | Unknown | registrar@mh.gov.in | 17MAR2025 | Mumbai | CLOSED | 41 | 39 | 2096004747.3 | 900 | 17MAR2025 | 0 | Maharashtra | Low Income |
| 7 | M1007 | Dinesh | Kavya | KL | Hindu | registrar@kl.gov.in | 18-13-2025 | Kochi | ACTIVE | 32 | . | 2096004747.3 | 990000 | . | 0 | Kerala | Middle Income |
| 8 | M1008 | Rohan | Sneha | XX | Sikh | registrar@xgov.in | 11-02-2025 | Delhi | ACTIVE | 28 | 22 | 2096004747.3 | 1000000 | 02NOV2025 | 0 | Invalid Region | High Income |
| 9 | M1009 | Tarun | Pooja | AP | Hindu | registrar@apgov.in | 20250315 | Hyderabad | ACTIVE | 45 | 44 | 2096004747.3 | 150000 | 15MAR2025 | 0 | Andhra Pradesh | Low Income |
| 10 | M1010 | Divya | TN | Jain | registrar@tn.gov.in | 14APR2025 | Chennai | CLOSED | 37 | 36 | 2096004747.3 | 760000 | 14APR2025 | 0 | Tamil Nadu | Middle Income | |
| 11 | M1011 | Akash | Nisha | AP | Hindu | registrar@apgov.in | Hyderabad | ACTIVE | 33 | 29 | 2096004747.3 | 850000 | . | 0 | Andhra Pradesh | Middle Income | |
| 12 | M1012 | Akash | Nisha | AP | Hindu | registrar@apgov.in | Hyderabad | ACTIVE | 33 | 29 | 2096004747.3 | 850000 | . | 0 | Andhra Pradesh | Middle Income | |
| 13 | M1013 | Mohan | Radha | MH | Buddhist | registrar@mgov.in | 10-10-2024 | Pune | ACTIVE | . | 21 | 2096004747.3 | 450000 | 10OCT2024 | 0 | Maharashtra | Low Income |
| 14 | M1014 | Rakesh | Simran | KA | Hindu | registrar@kgov.in | 25DEC2024 | Mysore | ACTIVE | 39 | 35 | 2096004747.3 | 0 | 25DEC2024 | 0 | Karnataka | Low Income |
| 15 | M1015 | Ajay | Priti | AP | Hindu | registrar@apgov.in | 31/11/2025 | Vijayawada | PENDING | 31 | 28 | 2096004747.3 | 780000 | . | 0 | Andhra Pradesh | Middle Income |
Why SELECT-WHEN Beats Endless IF Statements
In
production codebases, readability becomes survival.
Nested IF
statements become dangerous in enterprise validation because logic overlaps
grow invisible over time.
SELECT-WHEN
improves:
- Traceability
- Rule segregation
- Audit readability
- Future enhancements
Regulatory
reviewers often inspect transformation logic manually. Clean branching
structures reduce ambiguity dramatically.
A common
production mistake is forgetting OTHERWISE. That creates unhandled categories
that silently propagate downstream.
3.PROC FORMAT for Controlled Standardization
proc format;
value agegrp low-29='Young'
30-45='Middle'
46-high='Senior';
run;
LOG:
Why Formats Matter
Formats
are metadata-driven intelligence layers.
Instead
of hardcoding categorization logic repeatedly, formats centralize business
definitions.
In
enterprise systems, changing an age grouping from 30-45 to 30-50 should happen
once not across 300 programs.
This
reduces validation drift and regulatory inconsistency.
4.Deduplication Using PROC SORT NODUPKEY
proc sort data=marriage_clean_stage2
out=marriage_dedup nodupkey;
by Husband_Name Wife_Name Region_Code;
run;
proc print data=marriage_dedup;
run;
LOG:
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife | Load_Datetime | Annual_Income | Marriage_Date | Email_Flag | Region_Desc | Risk_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1010 | Divya | TN | Jain | registrar@tn.gov.in | 14APR2025 | Chennai | CLOSED | 37 | 36 | 2096004747.3 | 760000 | 14APR2025 | 0 | Tamil Nadu | Middle Income | |
| 2 | M1015 | Ajay | Priti | AP | Hindu | registrar@apgov.in | 31/11/2025 | Vijayawada | PENDING | 31 | 28 | 2096004747.3 | 780000 | . | 0 | Andhra Pradesh | Middle Income |
| 3 | M1011 | Akash | Nisha | AP | Hindu | registrar@apgov.in | Hyderabad | ACTIVE | 33 | 29 | 2096004747.3 | 850000 | . | 0 | Andhra Pradesh | Middle Income | |
| 4 | M1003 | Arjun | Meera | TN | Hindu | invalidmail.com | 15/02/2025 | Chennai | ACTIVE | . | 31 | 2096004747.3 | 650000 | 15FEB2025 | 1 | Tamil Nadu | Middle Income |
| 5 | M1007 | Dinesh | Kavya | KL | Hindu | registrar@kl.gov.in | 18-13-2025 | Kochi | ACTIVE | 32 | . | 2096004747.3 | 990000 | . | 0 | Kerala | Middle Income |
| 6 | M1004 | Kiran | KA | Muslim | registrar@kgov.in | Bangalore | PENDING | . | 27 | 2096004747.3 | 550000 | . | 0 | Karnataka | Middle Income | ||
| 7 | M1013 | Mohan | Radha | MH | Buddhist | registrar@mgov.in | 10-10-2024 | Pune | ACTIVE | . | 21 | 2096004747.3 | 450000 | 10OCT2024 | 0 | Maharashtra | Low Income |
| 8 | M1014 | Rakesh | Simran | KA | Hindu | registrar@kgov.in | 25DEC2024 | Mysore | ACTIVE | 39 | 35 | 2096004747.3 | 0 | 25DEC2024 | 0 | Karnataka | Low Income |
| 9 | M1001 | Ravi_kumar | Anjali | AP | Hindu | registrar@apgov.in | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 | 2096004747.3 | 850000 | 01DEC2025 | 0 | Andhra Pradesh | Middle Income |
| 10 | M1008 | Rohan | Sneha | XX | Sikh | registrar@xgov.in | 11-02-2025 | Delhi | ACTIVE | 28 | 22 | 2096004747.3 | 1000000 | 02NOV2025 | 0 | Invalid Region | High Income |
| 11 | M1005 | Suresh | Lakshmi | AP | Christian | reg_apgov.in | 2025-03-18 | Vizag | ACTIVE | 35 | 33 | 2096004747.3 | 1200000 | 18MAR2025 | 1 | Andhra Pradesh | High Income |
| 12 | M1009 | Tarun | Pooja | AP | Hindu | registrar@apgov.in | 20250315 | Hyderabad | ACTIVE | 45 | 44 | 2096004747.3 | 150000 | 15MAR2025 | 0 | Andhra Pradesh | Low Income |
| 13 | M1006 | Vikram | Priya | MH | Unknown | registrar@mh.gov.in | 17MAR2025 | Mumbai | CLOSED | 41 | 39 | 2096004747.3 | 900 | 17MAR2025 | 0 | Maharashtra | Low Income |
Why Deduplication Is Never “Simple”
Duplicate
logic is business logic.
Should duplicates be removed by:
- Couple combination?
- Registration date?
- Latest modification
timestamp?
- Other?
Choosing
incorrectly can erase legitimate records.
NODUPKEY
keeps the first occurrence only. If records arrive unsorted, wrong observations
survive.
Senior
programmers always define deduplication precedence explicitly.
5.FIRST./LAST. Processing
proc sort data=marriage_dedup;
by Region_Code Marriage_ID ;
run;
proc print data=marriage_dedup;
run;
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife | Load_Datetime | Annual_Income | Marriage_Date | Email_Flag | Region_Desc | Risk_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1001 | Ravi_kumar | Anjali | AP | Hindu | registrar@apgov.in | 12-01-2025 | Hyderabad | ACTIVE | 29 | 24 | 2096004747.3 | 850000 | 01DEC2025 | 0 | Andhra Pradesh | Middle Income |
| 2 | M1005 | Suresh | Lakshmi | AP | Christian | reg_apgov.in | 2025-03-18 | Vizag | ACTIVE | 35 | 33 | 2096004747.3 | 1200000 | 18MAR2025 | 1 | Andhra Pradesh | High Income |
| 3 | M1009 | Tarun | Pooja | AP | Hindu | registrar@apgov.in | 20250315 | Hyderabad | ACTIVE | 45 | 44 | 2096004747.3 | 150000 | 15MAR2025 | 0 | Andhra Pradesh | Low Income |
| 4 | M1011 | Akash | Nisha | AP | Hindu | registrar@apgov.in | Hyderabad | ACTIVE | 33 | 29 | 2096004747.3 | 850000 | . | 0 | Andhra Pradesh | Middle Income | |
| 5 | M1015 | Ajay | Priti | AP | Hindu | registrar@apgov.in | 31/11/2025 | Vijayawada | PENDING | 31 | 28 | 2096004747.3 | 780000 | . | 0 | Andhra Pradesh | Middle Income |
| 6 | M1004 | Kiran | KA | Muslim | registrar@kgov.in | Bangalore | PENDING | . | 27 | 2096004747.3 | 550000 | . | 0 | Karnataka | Middle Income | ||
| 7 | M1014 | Rakesh | Simran | KA | Hindu | registrar@kgov.in | 25DEC2024 | Mysore | ACTIVE | 39 | 35 | 2096004747.3 | 0 | 25DEC2024 | 0 | Karnataka | Low Income |
| 8 | M1007 | Dinesh | Kavya | KL | Hindu | registrar@kl.gov.in | 18-13-2025 | Kochi | ACTIVE | 32 | . | 2096004747.3 | 990000 | . | 0 | Kerala | Middle Income |
| 9 | M1006 | Vikram | Priya | MH | Unknown | registrar@mh.gov.in | 17MAR2025 | Mumbai | CLOSED | 41 | 39 | 2096004747.3 | 900 | 17MAR2025 | 0 | Maharashtra | Low Income |
| 10 | M1013 | Mohan | Radha | MH | Buddhist | registrar@mgov.in | 10-10-2024 | Pune | ACTIVE | . | 21 | 2096004747.3 | 450000 | 10OCT2024 | 0 | Maharashtra | Low Income |
| 11 | M1003 | Arjun | Meera | TN | Hindu | invalidmail.com | 15/02/2025 | Chennai | ACTIVE | . | 31 | 2096004747.3 | 650000 | 15FEB2025 | 1 | Tamil Nadu | Middle Income |
| 12 | M1010 | Divya | TN | Jain | registrar@tn.gov.in | 14APR2025 | Chennai | CLOSED | 37 | 36 | 2096004747.3 | 760000 | 14APR2025 | 0 | Tamil Nadu | Middle Income | |
| 13 | M1008 | Rohan | Sneha | XX | Sikh | registrar@xgov.in | 11-02-2025 | Delhi | ACTIVE | 28 | 22 | 2096004747.3 | 1000000 | 02NOV2025 | 0 | Invalid Region | High Income |
data region_counts;
set marriage_dedup;
by Region_Code;
retain Marriage_Count;
if first.Region_Code then
Marriage_Count=0;
Marriage_Count+1;
if last.Region_Code;
run;
proc print data=region_counts;
run;
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife | Load_Datetime | Annual_Income | Marriage_Date | Email_Flag | Region_Desc | Risk_Category | Marriage_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1015 | Ajay | Priti | AP | Hindu | registrar@apgov.in | 31/11/2025 | Vijayawada | PENDING | 31 | 28 | 2096004747.3 | 780000 | . | 0 | Andhra Pradesh | Middle Income | 5 |
| 2 | M1014 | Rakesh | Simran | KA | Hindu | registrar@kgov.in | 25DEC2024 | Mysore | ACTIVE | 39 | 35 | 2096004747.3 | 0 | 25DEC2024 | 0 | Karnataka | Low Income | 2 |
| 3 | M1007 | Dinesh | Kavya | KL | Hindu | registrar@kl.gov.in | 18-13-2025 | Kochi | ACTIVE | 32 | . | 2096004747.3 | 990000 | . | 0 | Kerala | Middle Income | 1 |
| 4 | M1013 | Mohan | Radha | MH | Buddhist | registrar@mgov.in | 10-10-2024 | Pune | ACTIVE | . | 21 | 2096004747.3 | 450000 | 10OCT2024 | 0 | Maharashtra | Low Income | 2 |
| 5 | M1010 | Divya | TN | Jain | registrar@tn.gov.in | 14APR2025 | Chennai | CLOSED | 37 | 36 | 2096004747.3 | 760000 | 14APR2025 | 0 | Tamil Nadu | Middle Income | 2 | |
| 6 | M1008 | Rohan | Sneha | XX | Sikh | registrar@xgov.in | 11-02-2025 | Delhi | ACTIVE | 28 | 22 | 2096004747.3 | 1000000 | 02NOV2025 | 0 | Invalid Region | High Income | 1 |
Why FIRST./LAST. Processing Is Elite SAS
Engineering
This is
one of SAS’s greatest strengths.
FIRST./LAST.
logic enables streaming analytics without requiring joins or window functions.
It is
memory-efficient and production-scalable.
A common
mistake: forgetting to sort properly before BY-group processing. Incorrect
sorting silently produces invalid logic.
6.PROC SQL Validation Layer
proc sql;
create table invalid_emails as
select *
from marriage_dedup
where Registrar_Email not like '%@%.%';
quit;
proc print data=invalid_emails;
run;
OUTPUT:
| Obs | Marriage_ID | Husband_Name | Wife_Name | Region_Code | Marriage_Type | Registrar_Email | Marriage_Date_Raw | City | Status_Flag | Age_Husband | Age_Wife | Load_Datetime | Annual_Income | Marriage_Date | Email_Flag | Region_Desc | Risk_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M1005 | Suresh | Lakshmi | AP | Christian | reg_apgov.in | 2025-03-18 | Vizag | ACTIVE | 35 | 33 | 2096004747.3 | 1200000 | 18MAR2025 | 1 | Andhra Pradesh | High Income |
| 2 | M1003 | Arjun | Meera | TN | Hindu | invalidmail.com | 15/02/2025 | Chennai | ACTIVE | . | 31 | 2096004747.3 | 650000 | 15FEB2025 | 1 | Tamil Nadu | Middle Income |
Why PROC SQL Complements DATA Step
DATA Step
excels at row-wise engineering.
PROC SQL
excels at relational validation.
Experienced
SAS engineers use both strategically not religiously.
Production
systems fail when teams force every problem into one paradigm.
7.PROC REPORT Enterprise Reporting
proc report data=marriage_dedup nowd;
column Region_Desc Marriage_Type Annual_Income;
define Region_Desc / group;
define Marriage_Type / group;
define Annual_Income / analysis mean;
run;
| Region_Desc | Marriage_Type | Annual_Income |
|---|---|---|
| Andhra Pradesh | Christian | 1200000 |
| Hindu | 657500 | |
| Invalid Region | Sikh | 1000000 |
| Karnataka | Hindu | 0 |
| Muslim | 550000 | |
| Kerala | Hindu | 990000 |
| Maharashtra | Buddhist | 450000 |
| Unknown | 900 | |
| Tamil Nadu | Hindu | 650000 |
| Jain | 760000 |
Why Reporting Layers Matter
Data
cleaning is incomplete unless outputs are interpretable.
A
perfectly engineered dataset still fails if reviewers cannot validate summaries
quickly.
PROC
REPORT bridges engineering and business consumption.
8.R Version Dataset
library(tidyverse)
library(lubridate)
library(janitor)
marriage_raw <- tibble(
Marriage_ID = c("M1001","M1002","M1003","M1004",
"M1005","M1006","M1007","M1008",
"M1009","M1010","M1011","M1012",
"M1013","M1014","M1015"),
Husband_Name = c("Ravi Kumar","Ravi Kumar","Arjun",
"Kiran"," Suresh ","Vikram",
"Dinesh","Rohan","Tarun",
"NULL","Akash","Akash",
"Mohan","Rakesh","Ajay"),
Age_Husband = c(29,29,212,-5,35,41,32,28,45,37,33,33,999,39,31),
Region_Code = c("AP","AP","TN","KA","ap","MH",
"KL","XX","AP","TN","AP","AP",
"MH","KA","AP"),
Annual_Income = c("850000","850000","650000",
"-550000","1200000","abc900",
"990000","1000000","-150000",
"760000","850000","850000",
"450000","0","780000"),
Marriage_Date = c("12-01-2025","12-01-2025",
"15/02/2025",NA,"2025-03-18",
"17MAR2025","18-13-2025",
"11-02-2025","20250315",
"14APR2025",NA,NA,
"10-10-2024","25DEC2024",
"31/11/2025")
)
OUTPUT:
|
|
Marriage_ID |
Husband_Name |
Age_Husband |
Region_Code |
Annual_Income |
Marriage_Date |
|
1 |
M1001 |
Ravi Kumar |
29 |
AP |
850000 |
12-01-2025 |
|
2 |
M1002 |
Ravi Kumar |
29 |
AP |
850000 |
12-01-2025 |
|
3 |
M1003 |
Arjun |
212 |
TN |
650000 |
15-02-2025 |
|
4 |
M1004 |
Kiran |
-5 |
KA |
-550000 |
NA |
|
5 |
M1005 |
Suresh |
35 |
ap |
1200000 |
18-03-2025 |
|
6 |
M1006 |
Vikram |
41 |
MH |
abc900 |
17-Mar-2025 |
|
7 |
M1007 |
Dinesh |
32 |
KL |
990000 |
18-13-2025 |
|
8 |
M1008 |
Rohan |
28 |
XX |
1000000 |
11-02-2025 |
|
9 |
M1009 |
Tarun |
45 |
AP |
-150000 |
20250315 |
|
10 |
M1010 |
NULL |
37 |
TN |
760000 |
14-Apr-2025 |
|
11 |
M1011 |
Akash |
33 |
AP |
850000 |
NA |
|
12 |
M1012 |
Akash |
33 |
AP |
850000 |
NA |
|
13 |
M1013 |
Mohan |
999 |
MH |
450000 |
10-10-2024 |
|
14 |
M1014 |
Rakesh |
39 |
KA |
0 |
25-Dec-2024 |
|
15 |
M1015 |
Ajay |
31 |
AP |
780000 |
31/11/2025 |
SAS vs R Translation Insight
|
SAS |
R |
|
DATA
Step |
mutate() |
|
ARRAYS |
across() |
|
COMPRESS() |
str_replace_all() |
|
PROPCASE() |
str_to_title() |
|
INPUT() |
as.numeric()
/ parsing |
|
FIRST./LAST. |
group_by()
+ summarise() |
|
MERGE |
left_join() |
|
PROC
SQL |
dbplyr
/ sqldf / joins |
9.R Cleaning Pipeline
marriage_clean <- marriage_raw %>%
clean_names() %>%
mutate( across(
c(husband_name, region_code),str_trim),
husband_name =str_to_title(
str_to_lower(husband_name)),
region_code =str_to_upper(region_code),
annual_income =suppressWarnings(
as.numeric(gsub("[^0-9-]", "",
annual_income)
)
),
annual_income =abs(annual_income),
marriage_date =suppressWarnings(
parse_date_time(marriage_date,
orders = c("dmy","ymd","dbY")
)
),
age_husband =if_else(age_husband < 18 |
age_husband > 100,NA_real_,
as.numeric(age_husband))
)
OUTPUT:
|
|
marriage_id |
husband_name |
age_husband |
region_code |
annual_income |
marriage_date |
|
1 |
M1001 |
Ravi Kumar |
29 |
AP |
850000 |
12-01-2025 |
|
2 |
M1002 |
Ravi Kumar |
29 |
AP |
850000 |
12-01-2025 |
|
3 |
M1003 |
Arjun |
NA |
TN |
650000 |
15-02-2025 |
|
4 |
M1004 |
Kiran |
NA |
KA |
550000 |
NA |
|
5 |
M1005 |
Suresh |
35 |
AP |
1200000 |
18-03-2025 |
|
6 |
M1006 |
Vikram |
41 |
MH |
900 |
17-03-2025 |
|
7 |
M1007 |
Dinesh |
32 |
KL |
990000 |
NA |
|
8 |
M1008 |
Rohan |
28 |
XX |
1000000 |
11-02-2025 |
|
9 |
M1009 |
Tarun |
45 |
AP |
150000 |
15-03-2025 |
|
10 |
M1010 |
Null |
37 |
TN |
760000 |
14-04-2025 |
|
11 |
M1011 |
Akash |
33 |
AP |
850000 |
NA |
|
12 |
M1012 |
Akash |
33 |
AP |
850000 |
NA |
|
13 |
M1013 |
Mohan |
NA |
MH |
450000 |
10-10-2024 |
|
14 |
M1014 |
Rakesh |
39 |
KA |
0 |
25-12-2024 |
|
15 |
M1015 |
Ajay |
31 |
AP |
780000 |
NA |
Why This R Pipeline Works Well
The
tidyverse pipeline model behaves like a readable transformation storyboard.
Each
transformation becomes composable and auditable.
across()
mirrors SAS arrays conceptually.
parse_date_time()
is extraordinarily useful because enterprise data rarely follows one format
consistently.
One
production mistake in R: implicit coercion warnings are ignored.
as.numeric("abc900") creates NA, which can silently propagate unless
validation summaries exist.
PROC SQL vs DATA Step vs R
dplyr
|
Task |
SAS
DATA Step |
PROC
SQL |
R
dplyr |
|
Row-wise
cleaning |
Excellent |
Moderate |
Excellent |
|
Complex
joins |
Moderate |
Excellent |
Excellent |
|
Streaming
performance |
Excellent |
Moderate |
Moderate |
|
Auditability |
Very
High |
High |
Moderate |
|
Memory
handling |
Efficient |
Efficient |
Higher
RAM usage |
|
Regulatory
adoption |
Gold
Standard |
Gold
Standard |
Growing |
|
Macro
automation |
Extremely
powerful |
Limited |
Functional
programming |
|
Metadata
integration |
Native |
Native |
Manual |
10.distinct() vs PROC SORT NODUPKEY
marriage_dedup <- marriage_clean %>%
distinct(husband_name,region_code, .keep_all=TRUE)
|
|
marriage_id |
husband_name |
age_husband |
region_code |
annual_income |
marriage_date |
|
1 |
M1001 |
Ravi Kumar |
29 |
AP |
850000 |
12-01-2025 |
|
2 |
M1003 |
Arjun |
NA |
TN |
650000 |
15-02-2025 |
|
3 |
M1004 |
Kiran |
NA |
KA |
550000 |
NA |
|
4 |
M1005 |
Suresh |
35 |
AP |
1200000 |
18-03-2025 |
|
5 |
M1006 |
Vikram |
41 |
MH |
900 |
17-03-2025 |
|
6 |
M1007 |
Dinesh |
32 |
KL |
990000 |
NA |
|
7 |
M1008 |
Rohan |
28 |
XX |
1000000 |
11-02-2025 |
|
8 |
M1009 |
Tarun |
45 |
AP |
150000 |
15-03-2025 |
|
9 |
M1010 |
Null |
37 |
TN |
760000 |
14-04-2025 |
|
10 |
M1011 |
Akash |
33 |
AP |
850000 |
NA |
|
11 |
M1013 |
Mohan |
NA |
MH |
450000 |
10-10-2024 |
|
12 |
M1014 |
Rakesh |
39 |
KA |
0 |
25-12-2024 |
|
13 |
M1015 |
Ajay |
31 |
AP |
780000 |
NA |
|
SAS |
|
|
PROC SORT NODUPKEY |
distinct() |
Enterprise Validation &
Compliance
In
clinical programming, cleaning logic is not optional engineering polish.
It is
regulatory evidence.
Every
transformation must answer:
- What changed?
- Why did it change?
- Who approved it?
- Can it be reproduced?
- Can QA independently verify
it?
That is
the foundation of:
- Clinical Data Management
- SDTM
- ADaM
- Audit trail integrity
- Traceability matrices
- QC independence
The Dangerous SAS Missing Value Trap
In SAS:
. < 0 < 1 < 2
Missing
numeric values sort LOWER than all valid numbers.
That
means this condition:
if salary < 50000;
ACCIDENTALLY
includes missing salaries.
Catastrophic
consequences:
- Invalid low-income
populations
- Inflated adverse event rates
- Incorrect risk
stratification
- Regulatory reporting errors
Defensive
programming:
if not missing(salary)
and salary < 50000;
Senior
SAS programmers never trust implicit missing-value behavior.
Never.
Business Logic Narrative Why
Cleaning Decisions Matter
Data
cleaning is not cosmetic hygiene. It is business logic preservation.
Suppose a
patient age arrives as 212. Technically numeric. Statistically poisonous. If
left untouched, that single value can distort mean age calculations, affect
age-stratified efficacy analysis, and alter machine learning model weighting.
In actuarial systems, impossible ages can distort premium estimation and
reserve forecasting.
Negative
salary or income values create similar damage. Sometimes negative values
represent refunds. Sometimes system corruption. Sometimes sign inversions
during ETL ingestion. Blindly converting them to positive numbers without
understanding business context creates audit risk. Intelligent pipelines flag
them for controlled remediation.
Date
standardization is equally critical. A mixed-format field containing
"12-01-2025", "2025/01/12", and "12JAN2025" may
appear harmless until time-series analysis fails or duplicate detection logic
collapses because equivalent dates are interpreted differently.
Text
normalization affects downstream joins more than most junior analysts realize.
"AP", " ap ", and "Ap" can fracture regional
aggregation logic, leading to incomplete counts and broken dashboards.
Cleaning
is not about making data pretty.
It is
about preserving analytical truth.
20 Real-World Data Cleaning
Best Practices
- Preserve raw datasets untouched
- Standardize metadata
centrally
- Validate before
transformation
- Use defensive missing-value
checks
- Never overwrite source
variables blindly
- Build reusable macro
frameworks
- Separate derivation from
reporting
- Track lineage for every
variable
- Validate joins independently
- Use controlled terminology
libraries
- Implement duplicate
escalation rules
- Centralize date parsing
standards
- Avoid hardcoded business
logic
- Use QC programmers
independently
- Version-control validation
logic
- Validate ranges
semantically, not only structurally
- Build exception datasets
explicitly
- Audit all imputation logic
- Use modular pipelines
- Treat data cleaning as risk
engineering
20 Sharp One-Liner Insights
- Missing values in SAS aren’t
empty they’re secretly the smallest numbers in your dataset.
- Dirty joins create confident
lies.
- One malformed ID can poison
an entire warehouse.
- Production bugs rarely
scream they whisper.
- Character truncation is
silent data amputation.
- Every duplicate record tells
a system story.
- AI models inherit your data
quality morality.
- Cleaning logic without
lineage is audit suicide.
- NULL handling separates
analysts from engineers.
- Dashboards fail long before
executives notice.
- Date chaos destroys trust
faster than missing values.
- PROC SQL is powerful but
DATA Step is surgical.
- Defensive programming is
cheaper than regulatory remediation.
- Every transformation needs a
business reason.
- Bad metadata scales faster
than good governance.
- Imputation without
documentation is analytical fiction.
- Whitespace bugs have
destroyed million-dollar reports.
- Validation code deserves
validation too.
- Clean data is engineered,
not discovered.
- Trustworthy analytics begin
where assumptions end.
Production Validation
Checklist
1.Raw data archived
2.Metadata verified
3.Duplicate logic approved
4.Missing-value rules documented
5.Date formats standardized
6.Controlled terminology applied
7.Invalid ranges flagged
8.Character lengths validated
9.Join cardinality tested
10.QC independently executed
11.Audit trail captured
12.Exception datasets generated
13.PROC CONTENTS reviewed
14.Summary statistics validated
15.Regulatory traceability
confirmed
Summary
SAS
remains the dominant force in regulated enterprise environments because it was
engineered for repeatability, metadata consistency, and procedural traceability.
The DATA Step remains one of the most efficient row-processing engines ever
built for structured enterprise data. FIRST./LAST. processing, macro
automation, format catalogs, and native procedural reporting make SAS
extraordinarily powerful in production clinical pipelines. Most importantly,
regulators understand SAS behavior deeply. Validation standards, SDTM mapping
frameworks, Define.xml workflows, and submission ecosystems evolved around SAS
for decades.
R,
however, transformed modern analytical engineering through flexibility,
open-source innovation, and expressive transformation pipelines. The tidyverse
ecosystem dramatically improves readability, composability, and exploratory
engineering velocity. Packages like dplyr, stringr, purrr, and lubridate allow
analysts to build elegant, scalable transformation pipelines rapidly. R also
integrates naturally with machine learning, APIs, visualization ecosystems, and
modern cloud architectures.
But
enterprise trust differs from analytical flexibility.
SAS
prioritizes deterministic reproducibility.
R
prioritizes analytical agility.
In modern
organizations, the strongest teams rarely choose one exclusively. Instead, they
combine SAS for validated regulatory production and R for exploratory
analytics, modeling, automation, and scalable data science workflows.
The
future belongs to bilingual programmers.
Conclusion
Every
executive dashboard, every AI prediction, every regulatory submission, and
every strategic decision rests on one invisible assumption:
The data
is trustworthy.
Most
organizations obsess over visualization tools, cloud platforms, machine
learning frameworks, and AI acceleration while underinvesting in the
engineering discipline that determines whether any of those outputs deserve
trust in the first place.
Data
cleaning is not preprocessing.
It is
enterprise risk management.
A
corrupted patient ID can invalidate enrollment counts.
A
malformed timestamp can collapse longitudinal analysis.
A silent
truncation bug can erase regulatory traceability.
A
missing-value assumption can reverse analytical conclusions.
This is
why experienced SAS programmers and data scientists think differently from
junior coders. They do not merely write transformations. They engineer
defensible systems.
Good
cleaning frameworks create:
- Repeatability
- Auditability
- Scalability
- Traceability
- Statistical reliability
- Executive trust
The
strongest engineers build pipelines that assume failure will happen somewhere malformed
source feeds, duplicate merges, invalid formats, impossible values, broken
joins, inconsistent metadata, and human mistakes.
Defensive
programming is not pessimism.
It is
professionalism.
SAS and R
each bring unique strengths to enterprise cleaning ecosystems. SAS delivers
industrial-grade procedural stability and regulatory alignment. R delivers
extraordinary flexibility and analytical acceleration. Together, they form one
of the most powerful bilingual engineering combinations in modern data science.
At the
end of the day, analytics is not about generating numbers.
It is
about generating decisions people can trust.
And
trustworthy decisions begin with disciplined cleaning frameworks long before
the first dashboard appears.
Interview Questions &
Answers
1. Duplicate Marriage IDs Inflated Enrollment
Counts
Question
A
production dashboard suddenly shows 18% growth in registered couples overnight.
How would you investigate?
Answer
I would
first validate source ingestion counts versus historical baselines. Then I’d
compare duplicate frequencies using PROC SORT NODUPKEY, PROC FREQ, and SQL
COUNT DISTINCT logic. I’d examine whether duplicates originated from upstream
ingestion, failed merge logic, or repeated incremental loads. Most importantly,
I would define business-level uniqueness carefully sometimes duplicate IDs are
legitimate amendments rather than true duplicates. I’d isolate impacted
downstream datasets, assess reporting exposure, and document remediation
traceability before rerunning pipelines.
2. Missing Values Accidentally Included in
Low-Income Population
Question
A
regulatory output included patients with missing salaries in the low-income
category. What happened?
Answer
Classic
SAS missing-value behavior. SAS treats missing numeric values as smaller than
valid numbers. A condition like if salary < 50000 unintentionally includes
missing salaries. I would correct logic defensively using not missing(salary)
and validate all downstream derivations. Then I’d implement automated QC checks
specifically targeting missing-value inclusion patterns because this class of
bug is extremely common in production SAS environments.
3. Macro Debugging Failure in Production
Question
A
macro-generated dataset suddenly creates missing variables in only one study.
How would you debug?
Answer
I’d
enable MPRINT, MLOGIC, and SYMBOLGEN immediately to inspect macro resolution
behavior. Then I’d compare metadata structures between successful and failed
studies using PROC CONTENTS. Usually these failures come from variable-length
inconsistencies, conditional macro branches, or study-specific metadata drift.
I debug macros by validating generated code first not by staring at macro
syntax blindly.
4. Cross-System SAS vs R Validation Disagreement
Question
Your SAS
and R outputs produce different patient counts. What’s your approach?
Answer
I’d first
validate datatype coercion and missing-value handling because SAS and R treat
them differently. Then I’d compare row-level joins, date parsing, sorting
behavior, and duplicate handling rules. I’d create reconciliation datasets
isolating mismatched records and validate transformation ordering carefully.
Cross-platform validation is usually a semantics problem, not syntax problem.
5. Impossible Ages Passed Through QA
Question
How do
biologically impossible ages survive enterprise systems?
Answer
Because
many systems validate structure instead of meaning. If age is numeric, systems
often assume it’s valid. Mature pipelines implement semantic validation layers
range checks, controlled terminology validation, anomaly detection, and
exception reporting. I always separate datatype validation from business-rule
validation because both are necessary and neither alone is sufficient.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MARRIAGE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment