Sacred Vows, Corrupted Rows: Cleaning Indian Marriage Data Before Analytics Collapse

Introduction:

The Monday Morning Disaster Nobody Expected

Three years ago, I walked into a regulatory war room at 7:10 AM.

Nobody touched the coffee.

A Fortune 500 healthcare insurer had just failed an internal audit tied to a demographic enrichment pipeline. The system merged public marriage registration data with policyholder records to predict dependent coverage eligibility, fraud probability, and regional premium risk.

The dashboard showed a dramatic increase in “high-stability family households.

Executives celebrated.

Actuaries recalculated premium forecasts.

AI models approved lower-risk policies.

Then QA found the bomb.

Duplicate spouse IDs.

Thousands of them.

One corrupted merge key inflated married-household counts by 18%.

Worse, negative claim reimbursements silently passed through because the billing field was stored as character data. Values like " -45000 " survived validation because whitespace and implicit conversion masked the issue.

One malformed region code broke downstream stratification logic.

One impossible age value 212 years old flowed into actuarial modeling because nobody implemented defensive range validation.

Within hours:

Fraud scores became unreliable
Executive reports became legally questionable
Machine learning outputs became statistically contaminated
Audit traceability failed
Regulatory confidence collapsed

That’s the reality of dirty data.

Not theory.

Not tutorials.

Production.

This article walks through how senior SAS programmers and data scientists engineer defensive, enterprise-grade cleaning frameworks using SAS DATA Step, PROC SQL, and modern R pipelines.

We are going to build corrupted Indian marriage registration datasets from scratch, intentionally poison them with realistic enterprise failures, then clean them the way experienced production teams actually do it.

Why LENGTH Comes First in SAS The Silent Killer Called Character Truncation

Junior programmers often think this is syntax trivia.

It isn’t.

It’s data survival.

In SAS, the first time a character variable appears, SAS determines its storage length permanently unless explicitly controlled using LENGTH.

Example disaster:

data demo;

spouse_name='Raj';

spouse_name='Rajesh Kumar Varma';

run;

proc print data = demo;

run;

OUTPUT:


Obs	spouse_name
1	Raj

Result?

spouse_name becomes length 3 because "Raj" appeared first.

Everything afterward truncates silently:

Raj

Not warning.

Not error.

Silent corruption.

In production, this destroys:

Patient names
Site identifiers
Country codes
Medication descriptions
Audit traceability

That’s why experienced SAS programmers place LENGTH first before INPUT, IF statements, assignments, or merges.

R behaves differently.

R character vectors are dynamically allocated objects using internal string pooling and reference semantics.R does not permanently truncate strings because memory allocation is abstracted at runtime. A character column can store "Raj" and "Rajesh Kumar Varma" together safely without predefined width constraints.

SAS prioritizes storage efficiency and fixed metadata structures.

R prioritizes flexibility and dynamic memory handling.

Understanding this difference separates production engineers from syntax learners.

data demo;

length spouse_name $20;

spouse_name='Raj';

spouse_name='Rajesh Kumar Varma';

run;

proc print data = demo;

run;

OUTPUT:


Obs	spouse_name
1	Rajesh Kumar Varma

Building the Raw Corrupted Indian Marriage Dataset in SAS

SAS Raw Dataset Creation

data indian_marriage_raw;

length Marriage_ID $12 Husband_Name $40 Wife_Name $40 Region_Code $10

Marriage_Type $20 Registrar_Email $60 Annual_Income $15

Marriage_Date_Raw $25 City $25 Status_Flag $15;

informat Marriage_Date_Raw $25.;

infile datalines truncover;

input Marriage_ID:$12. Husband_Name:$40. Wife_Name:$40. Age_Husband

Age_Wife Region_Code:$10. Marriage_Type:$20. Registrar_Email:$60.

Annual_Income:$15. Marriage_Date_Raw:$25. City:$25. Status_Flag:$15.;

datalines;

M1001 Ravi_Kumar Anjali 29 24 AP Hindu registrar@apgov.in 850000 12-01-2025 Hyderabad ACTIVE

M1002 Ravi_Kumar Anjali 29 24 AP Hindu registrar@apgov.in 850000 12-01-2025 Hyderabad ACTIVE

M1003 Arjun Meera 212 31 TN Hindu invalidmail.com 650000 15/02/2025 Chennai ACTIVE

M1004 Kiran NULL -5 27 KA Muslim registrar@kgov.in -550000 . Bangalore PENDING

M1005 Suresh Lakshmi 35 33 ap Christian reg_apgov.in 1200000 2025-03-18 Vizag ACTIVE

M1006 Vikram Priya 41 39 MH Unknown registrar@mh.gov.in abc900 17MAR2025 Mumbai CLOSED

M1007 Dinesh Kavya 32 . KL Hindu registrar@kl.gov.in 990000 18-13-2025 Kochi ACTIVE

M1008 Rohan Sneha 28 22 XX Sikh registrar@xgov.in 1000000 11-02-2025 Delhi ACTIVE

M1009 Tarun Pooja 45 44 AP Hindu registrar@apgov.in -150000 20250315 Hyderabad ACTIVE

M1010 NULL Divya 37 36 TN Jain registrar@tn.gov.in 760000 14APR2025 Chennai CLOSED

M1011 Akash Nisha 33 29 AP Hindu registrar@apgov.in 850000 . Hyderabad ACTIVE

M1012 Akash Nisha 33 29 AP Hindu registrar@apgov.in 850000 . Hyderabad ACTIVE

M1013 Mohan Radha 999 21 MH Buddhist registrar@mgov.in 450000 10-10-2024 Pune ACTIVE

M1014 Rakesh Simran 39 35 KA Hindu registrar@kgov.in 0 25DEC2024 Mysore ACTIVE

M1015 Ajay Priti 31 28 AP Hindu registrar@apgov.in 780000 31/11/2025 Vijayawada PENDING

;

run;

proc print data=indian_marriage_raw;

run;

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Annual_Income	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife
1	M1001	Ravi_Kumar	Anjali	AP	Hindu	registrar@apgov.in	850000	12-01-2025	Hyderabad	ACTIVE	29	24
2	M1002	Ravi_Kumar	Anjali	AP	Hindu	registrar@apgov.in	850000	12-01-2025	Hyderabad	ACTIVE	29	24
3	M1003	Arjun	Meera	TN	Hindu	invalidmail.com	650000	15/02/2025	Chennai	ACTIVE	212	31
4	M1004	Kiran	NULL	KA	Muslim	registrar@kgov.in	-550000		Bangalore	PENDING	-5	27
5	M1005	Suresh	Lakshmi	ap	Christian	reg_apgov.in	1200000	2025-03-18	Vizag	ACTIVE	35	33
6	M1006	Vikram	Priya	MH	Unknown	registrar@mh.gov.in	abc900	17MAR2025	Mumbai	CLOSED	41	39
7	M1007	Dinesh	Kavya	KL	Hindu	registrar@kl.gov.in	990000	18-13-2025	Kochi	ACTIVE	32	.
8	M1008	Rohan	Sneha	XX	Sikh	registrar@xgov.in	1000000	11-02-2025	Delhi	ACTIVE	28	22
9	M1009	Tarun	Pooja	AP	Hindu	registrar@apgov.in	-150000	20250315	Hyderabad	ACTIVE	45	44
10	M1010	NULL	Divya	TN	Jain	registrar@tn.gov.in	760000	14APR2025	Chennai	CLOSED	37	36
11	M1011	Akash	Nisha	AP	Hindu	registrar@apgov.in	850000		Hyderabad	ACTIVE	33	29
12	M1012	Akash	Nisha	AP	Hindu	registrar@apgov.in	850000		Hyderabad	ACTIVE	33	29
13	M1013	Mohan	Radha	MH	Buddhist	registrar@mgov.in	450000	10-10-2024	Pune	ACTIVE	999	21
14	M1014	Rakesh	Simran	KA	Hindu	registrar@kgov.in	0	25DEC2024	Mysore	ACTIVE	39	35
15	M1015	Ajay	Priti	AP	Hindu	registrar@apgov.in	780000	31/11/2025	Vijayawada	PENDING	31	28

Why This Dataset Matters

This is not random bad data.

Every corruption pattern here reflects real production failures:

Duplicate marriage registrations
Impossible ages
Invalid region codes
Negative income values
Malformed emails
Mixed-case inconsistencies
Character/numeric confusion
Missing dates
Invalid timestamps

Production systems rarely fail because of one catastrophic error.

They fail because 400 small inconsistencies accumulate silently.

One common mistake: developers validate structure but ignore semantic correctness. "999" is numeric but biologically impossible.

Enterprise cleaning is about meaning, not merely datatype compliance.

SAS DATA Step Cleaning

1. Standardization and Defensive Cleaning

data marriage_clean_stage1;

set indian_marriage_raw;

retain Load_Datetime;

Load_Datetime=datetime();

array chars(*) Husband_Name Wife_Name Region_Code

Marriage_Type Registrar_Email

City Status_Flag;

do i=1 to dim(chars);

chars(i)=strip(chars(i));

chars(i)=compress(chars(i),,'kw');

end;

Husband_Name=propcase(lowcase(Husband_Name));

Wife_Name=propcase(lowcase(Wife_Name));

City=propcase(lowcase(City));

Region_Code=upcase(Region_Code);

Registrar_Email=lowcase(Registrar_Email);

if Husband_Name='Null' then Husband_Name='';

if Wife_Name='Null' then Wife_Name='';

Annual_Income_Num=input(compress(Annual_Income,,'kd'),8.);

if Annual_Income_Num < 0 then

Annual_Income_Num=abs(Annual_Income_Num);

Marriage_Date=input(Marriage_Date_Raw,anydtdte15.);

format Marriage_Date date9.;

if Age_Husband < 18 or Age_Husband > 100 then

Age_Husband=.;

if Age_Wife < 18 or Age_Wife > 100 then

Age_Wife=.;

Email_Flag=0;

if index(Registrar_Email,'@')=0 then

Email_Flag=1;

drop i Annual_Income;

rename Annual_Income_Num = Annual_Income;

run;

proc print data=marriage_clean_stage1;

run;

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife	Load_Datetime	Annual_Income	Marriage_Date	Email_Flag
1	M1001	Ravi_kumar	Anjali	AP	Hindu	registrar@apgov.in	12-01-2025	Hyderabad	ACTIVE	29	24	2096004747.3	850000	01DEC2025	0
2	M1002	Ravi_kumar	Anjali	AP	Hindu	registrar@apgov.in	12-01-2025	Hyderabad	ACTIVE	29	24	2096004747.3	850000	01DEC2025	0
3	M1003	Arjun	Meera	TN	Hindu	invalidmail.com	15/02/2025	Chennai	ACTIVE	.	31	2096004747.3	650000	15FEB2025	1
4	M1004	Kiran		KA	Muslim	registrar@kgov.in		Bangalore	PENDING	.	27	2096004747.3	550000	.	0
5	M1005	Suresh	Lakshmi	AP	Christian	reg_apgov.in	2025-03-18	Vizag	ACTIVE	35	33	2096004747.3	1200000	18MAR2025	1
6	M1006	Vikram	Priya	MH	Unknown	registrar@mh.gov.in	17MAR2025	Mumbai	CLOSED	41	39	2096004747.3	900	17MAR2025	0
7	M1007	Dinesh	Kavya	KL	Hindu	registrar@kl.gov.in	18-13-2025	Kochi	ACTIVE	32	.	2096004747.3	990000	.	0
8	M1008	Rohan	Sneha	XX	Sikh	registrar@xgov.in	11-02-2025	Delhi	ACTIVE	28	22	2096004747.3	1000000	02NOV2025	0
9	M1009	Tarun	Pooja	AP	Hindu	registrar@apgov.in	20250315	Hyderabad	ACTIVE	45	44	2096004747.3	150000	15MAR2025	0
10	M1010		Divya	TN	Jain	registrar@tn.gov.in	14APR2025	Chennai	CLOSED	37	36	2096004747.3	760000	14APR2025	0
11	M1011	Akash	Nisha	AP	Hindu	registrar@apgov.in		Hyderabad	ACTIVE	33	29	2096004747.3	850000	.	0
12	M1012	Akash	Nisha	AP	Hindu	registrar@apgov.in		Hyderabad	ACTIVE	33	29	2096004747.3	850000	.	0
13	M1013	Mohan	Radha	MH	Buddhist	registrar@mgov.in	10-10-2024	Pune	ACTIVE	.	21	2096004747.3	450000	10OCT2024	0
14	M1014	Rakesh	Simran	KA	Hindu	registrar@kgov.in	25DEC2024	Mysore	ACTIVE	39	35	2096004747.3	0	25DEC2024	0
15	M1015	Ajay	Priti	AP	Hindu	registrar@apgov.in	31/11/2025	Vijayawada	PENDING	31	28	2096004747.3	780000	.	0

Why This DATA Step Matters

This DATA step behaves like a defensive firewall.

Enterprise cleaning is not cosmetic formatting. It is risk containment.

Notice how arrays reduce repetitive logic. In real studies, you may standardize 200+ character variables. Arrays make programs maintainable and scalable.

COMPRESS(...,'kw') removes unwanted whitespace artifacts while preserving readable content. PROPCASE() normalizes names for reporting consistency. INPUT() converts dirty character income values into numerics defensively.

One of the most dangerous production assumptions is “numeric-looking data is safe.” It isn’t.

abc900 becomes corrupted unless explicitly validated.

Another common production failure: programmers overwrite raw fields without preserving lineage. Always retain raw columns somewhere in the pipeline for audit traceability.

2.Advanced Conditional Logic with SELECT-WHEN

data marriage_clean_stage2;

set marriage_clean_stage1;

length Region_Desc $30 Risk_Category $20;

select (Region_Code);

when ('AP') Region_Desc='Andhra Pradesh';

when ('TN') Region_Desc='Tamil Nadu';

when ('KA') Region_Desc='Karnataka';

when ('MH') Region_Desc='Maharashtra';

when ('KL') Region_Desc='Kerala';

otherwise Region_Desc='Invalid Region';

end;

select;

when (Annual_Income >= 1000000)

Risk_Category='High Income';

when (500000 <= Annual_Income <= 999999)

Risk_Category='Middle Income';

otherwise

Risk_Category='Low Income';

end;

run;

proc print data=marriage_clean_stage2;

run;

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife	Load_Datetime	Annual_Income	Marriage_Date	Email_Flag	Region_Desc	Risk_Category
1	M1001	Ravi_kumar	Anjali	AP	Hindu	registrar@apgov.in	12-01-2025	Hyderabad	ACTIVE	29	24	2096004747.3	850000	01DEC2025	0	Andhra Pradesh	Middle Income
2	M1002	Ravi_kumar	Anjali	AP	Hindu	registrar@apgov.in	12-01-2025	Hyderabad	ACTIVE	29	24	2096004747.3	850000	01DEC2025	0	Andhra Pradesh	Middle Income
3	M1003	Arjun	Meera	TN	Hindu	invalidmail.com	15/02/2025	Chennai	ACTIVE	.	31	2096004747.3	650000	15FEB2025	1	Tamil Nadu	Middle Income
4	M1004	Kiran		KA	Muslim	registrar@kgov.in		Bangalore	PENDING	.	27	2096004747.3	550000	.	0	Karnataka	Middle Income
5	M1005	Suresh	Lakshmi	AP	Christian	reg_apgov.in	2025-03-18	Vizag	ACTIVE	35	33	2096004747.3	1200000	18MAR2025	1	Andhra Pradesh	High Income
6	M1006	Vikram	Priya	MH	Unknown	registrar@mh.gov.in	17MAR2025	Mumbai	CLOSED	41	39	2096004747.3	900	17MAR2025	0	Maharashtra	Low Income
7	M1007	Dinesh	Kavya	KL	Hindu	registrar@kl.gov.in	18-13-2025	Kochi	ACTIVE	32	.	2096004747.3	990000	.	0	Kerala	Middle Income
8	M1008	Rohan	Sneha	XX	Sikh	registrar@xgov.in	11-02-2025	Delhi	ACTIVE	28	22	2096004747.3	1000000	02NOV2025	0	Invalid Region	High Income
9	M1009	Tarun	Pooja	AP	Hindu	registrar@apgov.in	20250315	Hyderabad	ACTIVE	45	44	2096004747.3	150000	15MAR2025	0	Andhra Pradesh	Low Income
10	M1010		Divya	TN	Jain	registrar@tn.gov.in	14APR2025	Chennai	CLOSED	37	36	2096004747.3	760000	14APR2025	0	Tamil Nadu	Middle Income
11	M1011	Akash	Nisha	AP	Hindu	registrar@apgov.in		Hyderabad	ACTIVE	33	29	2096004747.3	850000	.	0	Andhra Pradesh	Middle Income
12	M1012	Akash	Nisha	AP	Hindu	registrar@apgov.in		Hyderabad	ACTIVE	33	29	2096004747.3	850000	.	0	Andhra Pradesh	Middle Income
13	M1013	Mohan	Radha	MH	Buddhist	registrar@mgov.in	10-10-2024	Pune	ACTIVE	.	21	2096004747.3	450000	10OCT2024	0	Maharashtra	Low Income
14	M1014	Rakesh	Simran	KA	Hindu	registrar@kgov.in	25DEC2024	Mysore	ACTIVE	39	35	2096004747.3	0	25DEC2024	0	Karnataka	Low Income
15	M1015	Ajay	Priti	AP	Hindu	registrar@apgov.in	31/11/2025	Vijayawada	PENDING	31	28	2096004747.3	780000	.	0	Andhra Pradesh	Middle Income

Why SELECT-WHEN Beats Endless IF Statements

In production codebases, readability becomes survival.

Nested IF statements become dangerous in enterprise validation because logic overlaps grow invisible over time.

SELECT-WHEN improves:

Traceability
Rule segregation
Audit readability
Future enhancements

Regulatory reviewers often inspect transformation logic manually. Clean branching structures reduce ambiguity dramatically.

A common production mistake is forgetting OTHERWISE. That creates unhandled categories that silently propagate downstream.

3.PROC FORMAT for Controlled Standardization

proc format;

value agegrp low-29='Young'

30-45='Middle'

46-high='Senior';

run;

LOG:

 NOTE: Format AGEGRP has been output.

Why Formats Matter

Formats are metadata-driven intelligence layers.

Instead of hardcoding categorization logic repeatedly, formats centralize business definitions.

In enterprise systems, changing an age grouping from 30-45 to 30-50 should happen once not across 300 programs.

This reduces validation drift and regulatory inconsistency.

4.Deduplication Using PROC SORT NODUPKEY

proc sort data=marriage_clean_stage2

out=marriage_dedup nodupkey;

by Husband_Name Wife_Name Region_Code;

run;

proc print data=marriage_dedup;

run;

LOG:

 NOTE: There were 15 observations read from the data set WORK.MARRIAGE_CLEAN_STAGE2.

 NOTE: 2 observations with duplicate key values were deleted.

 NOTE: The data set WORK.MARRIAGE_DEDUP has 13 observations and 17 variables.

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife	Load_Datetime	Annual_Income	Marriage_Date	Email_Flag	Region_Desc	Risk_Category
1	M1010		Divya	TN	Jain	registrar@tn.gov.in	14APR2025	Chennai	CLOSED	37	36	2096004747.3	760000	14APR2025	0	Tamil Nadu	Middle Income
2	M1015	Ajay	Priti	AP	Hindu	registrar@apgov.in	31/11/2025	Vijayawada	PENDING	31	28	2096004747.3	780000	.	0	Andhra Pradesh	Middle Income
3	M1011	Akash	Nisha	AP	Hindu	registrar@apgov.in		Hyderabad	ACTIVE	33	29	2096004747.3	850000	.	0	Andhra Pradesh	Middle Income
4	M1003	Arjun	Meera	TN	Hindu	invalidmail.com	15/02/2025	Chennai	ACTIVE	.	31	2096004747.3	650000	15FEB2025	1	Tamil Nadu	Middle Income
5	M1007	Dinesh	Kavya	KL	Hindu	registrar@kl.gov.in	18-13-2025	Kochi	ACTIVE	32	.	2096004747.3	990000	.	0	Kerala	Middle Income
6	M1004	Kiran		KA	Muslim	registrar@kgov.in		Bangalore	PENDING	.	27	2096004747.3	550000	.	0	Karnataka	Middle Income
7	M1013	Mohan	Radha	MH	Buddhist	registrar@mgov.in	10-10-2024	Pune	ACTIVE	.	21	2096004747.3	450000	10OCT2024	0	Maharashtra	Low Income
8	M1014	Rakesh	Simran	KA	Hindu	registrar@kgov.in	25DEC2024	Mysore	ACTIVE	39	35	2096004747.3	0	25DEC2024	0	Karnataka	Low Income
9	M1001	Ravi_kumar	Anjali	AP	Hindu	registrar@apgov.in	12-01-2025	Hyderabad	ACTIVE	29	24	2096004747.3	850000	01DEC2025	0	Andhra Pradesh	Middle Income
10	M1008	Rohan	Sneha	XX	Sikh	registrar@xgov.in	11-02-2025	Delhi	ACTIVE	28	22	2096004747.3	1000000	02NOV2025	0	Invalid Region	High Income
11	M1005	Suresh	Lakshmi	AP	Christian	reg_apgov.in	2025-03-18	Vizag	ACTIVE	35	33	2096004747.3	1200000	18MAR2025	1	Andhra Pradesh	High Income
12	M1009	Tarun	Pooja	AP	Hindu	registrar@apgov.in	20250315	Hyderabad	ACTIVE	45	44	2096004747.3	150000	15MAR2025	0	Andhra Pradesh	Low Income
13	M1006	Vikram	Priya	MH	Unknown	registrar@mh.gov.in	17MAR2025	Mumbai	CLOSED	41	39	2096004747.3	900	17MAR2025	0	Maharashtra	Low Income

Why Deduplication Is Never “Simple”

Duplicate logic is business logic.

Should duplicates be removed by:

Couple combination?
Registration date?
Latest modification timestamp?
Other?

Choosing incorrectly can erase legitimate records.

NODUPKEY keeps the first occurrence only. If records arrive unsorted, wrong observations survive.

Senior programmers always define deduplication precedence explicitly.

5.FIRST./LAST. Processing

proc sort data=marriage_dedup;

by Region_Code Marriage_ID ;

run;

proc print data=marriage_dedup;

run;

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife	Load_Datetime	Annual_Income	Marriage_Date	Email_Flag	Region_Desc	Risk_Category
1	M1001	Ravi_kumar	Anjali	AP	Hindu	registrar@apgov.in	12-01-2025	Hyderabad	ACTIVE	29	24	2096004747.3	850000	01DEC2025	0	Andhra Pradesh	Middle Income
2	M1005	Suresh	Lakshmi	AP	Christian	reg_apgov.in	2025-03-18	Vizag	ACTIVE	35	33	2096004747.3	1200000	18MAR2025	1	Andhra Pradesh	High Income
3	M1009	Tarun	Pooja	AP	Hindu	registrar@apgov.in	20250315	Hyderabad	ACTIVE	45	44	2096004747.3	150000	15MAR2025	0	Andhra Pradesh	Low Income
4	M1011	Akash	Nisha	AP	Hindu	registrar@apgov.in		Hyderabad	ACTIVE	33	29	2096004747.3	850000	.	0	Andhra Pradesh	Middle Income
5	M1015	Ajay	Priti	AP	Hindu	registrar@apgov.in	31/11/2025	Vijayawada	PENDING	31	28	2096004747.3	780000	.	0	Andhra Pradesh	Middle Income
6	M1004	Kiran		KA	Muslim	registrar@kgov.in		Bangalore	PENDING	.	27	2096004747.3	550000	.	0	Karnataka	Middle Income
7	M1014	Rakesh	Simran	KA	Hindu	registrar@kgov.in	25DEC2024	Mysore	ACTIVE	39	35	2096004747.3	0	25DEC2024	0	Karnataka	Low Income
8	M1007	Dinesh	Kavya	KL	Hindu	registrar@kl.gov.in	18-13-2025	Kochi	ACTIVE	32	.	2096004747.3	990000	.	0	Kerala	Middle Income
9	M1006	Vikram	Priya	MH	Unknown	registrar@mh.gov.in	17MAR2025	Mumbai	CLOSED	41	39	2096004747.3	900	17MAR2025	0	Maharashtra	Low Income
10	M1013	Mohan	Radha	MH	Buddhist	registrar@mgov.in	10-10-2024	Pune	ACTIVE	.	21	2096004747.3	450000	10OCT2024	0	Maharashtra	Low Income
11	M1003	Arjun	Meera	TN	Hindu	invalidmail.com	15/02/2025	Chennai	ACTIVE	.	31	2096004747.3	650000	15FEB2025	1	Tamil Nadu	Middle Income
12	M1010		Divya	TN	Jain	registrar@tn.gov.in	14APR2025	Chennai	CLOSED	37	36	2096004747.3	760000	14APR2025	0	Tamil Nadu	Middle Income
13	M1008	Rohan	Sneha	XX	Sikh	registrar@xgov.in	11-02-2025	Delhi	ACTIVE	28	22	2096004747.3	1000000	02NOV2025	0	Invalid Region	High Income

data region_counts;

set marriage_dedup;

by Region_Code;

retain Marriage_Count;

if first.Region_Code then

Marriage_Count=0;

Marriage_Count+1;

if last.Region_Code;

run;

proc print data=region_counts;

run;

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife	Load_Datetime	Annual_Income	Marriage_Date	Region_Desc	Risk_Category	Marriage_Count
1	M1015	Ajay	Priti	AP	Hindu	registrar@apgov.in	31/11/2025	Vijayawada	PENDING	31	28	2096004747.3	780000	.	Andhra Pradesh	Middle Income	5
2	M1014	Rakesh	Simran	KA	Hindu	registrar@kgov.in	25DEC2024	Mysore	ACTIVE	39	35	2096004747.3	0	25DEC2024	Karnataka	Low Income	2
3	M1007	Dinesh	Kavya	KL	Hindu	registrar@kl.gov.in	18-13-2025	Kochi	ACTIVE	32	.	2096004747.3	990000	.	Kerala	Middle Income	1
4	M1013	Mohan	Radha	MH	Buddhist	registrar@mgov.in	10-10-2024	Pune	ACTIVE	.	21	2096004747.3	450000	10OCT2024	Maharashtra	Low Income	2
5	M1010		Divya	TN	Jain	registrar@tn.gov.in	14APR2025	Chennai	CLOSED	37	36	2096004747.3	760000	14APR2025	Tamil Nadu	Middle Income	2
6	M1008	Rohan	Sneha	XX	Sikh	registrar@xgov.in	11-02-2025	Delhi	ACTIVE	28	22	2096004747.3	1000000	02NOV2025	Invalid Region	High Income	1

Why FIRST./LAST. Processing Is Elite SAS Engineering

This is one of SAS’s greatest strengths.

FIRST./LAST. logic enables streaming analytics without requiring joins or window functions.

It is memory-efficient and production-scalable.

A common mistake: forgetting to sort properly before BY-group processing. Incorrect sorting silently produces invalid logic.

6.PROC SQL Validation Layer

proc sql;

create table invalid_emails as

select *

from marriage_dedup

where Registrar_Email not like '%@%.%';

quit;

proc print data=invalid_emails;

run;

OUTPUT:


Obs	Marriage_ID	Husband_Name	Wife_Name	Region_Code	Marriage_Type	Registrar_Email	Marriage_Date_Raw	City	Status_Flag	Age_Husband	Age_Wife	Load_Datetime	Annual_Income	Marriage_Date	Email_Flag	Region_Desc	Risk_Category
1	M1005	Suresh	Lakshmi	AP	Christian	reg_apgov.in	2025-03-18	Vizag	ACTIVE	35	33	2096004747.3	1200000	18MAR2025	1	Andhra Pradesh	High Income
2	M1003	Arjun	Meera	TN	Hindu	invalidmail.com	15/02/2025	Chennai	ACTIVE	.	31	2096004747.3	650000	15FEB2025	1	Tamil Nadu	Middle Income

Why PROC SQL Complements DATA Step

DATA Step excels at row-wise engineering.

PROC SQL excels at relational validation.

Experienced SAS engineers use both strategically not religiously.

Production systems fail when teams force every problem into one paradigm.

7.PROC REPORT Enterprise Reporting

proc report data=marriage_dedup nowd;

column Region_Desc Marriage_Type Annual_Income;

define Region_Desc / group;

define Marriage_Type / group;

define Annual_Income / analysis mean;

run;

OUTPUT:


Region_Desc	Marriage_Type	Annual_Income
Andhra Pradesh	Christian	1200000
	Hindu	657500
Invalid Region	Sikh	1000000
Karnataka	Hindu	0
	Muslim	550000
Kerala	Hindu	990000
Maharashtra	Buddhist	450000
	Unknown	900
Tamil Nadu	Hindu	650000
	Jain	760000

Why Reporting Layers Matter

Data cleaning is incomplete unless outputs are interpretable.

A perfectly engineered dataset still fails if reviewers cannot validate summaries quickly.

PROC REPORT bridges engineering and business consumption.

8.R Version Dataset

library(tidyverse)

library(lubridate)

library(janitor)

marriage_raw <- tibble(

Marriage_ID = c("M1001","M1002","M1003","M1004",

"M1005","M1006","M1007","M1008",

"M1009","M1010","M1011","M1012",

"M1013","M1014","M1015"),

Husband_Name = c("Ravi Kumar","Ravi Kumar","Arjun",

"Kiran"," Suresh ","Vikram",

"Dinesh","Rohan","Tarun",

"NULL","Akash","Akash",

"Mohan","Rakesh","Ajay"),

Age_Husband = c(29,29,212,-5,35,41,32,28,45,37,33,33,999,39,31),

Region_Code = c("AP","AP","TN","KA","ap","MH",

"KL","XX","AP","TN","AP","AP",

"MH","KA","AP"),

Annual_Income = c("850000","850000","650000",

"-550000","1200000","abc900",

"990000","1000000","-150000",

"760000","850000","850000",

"450000","0","780000"),

Marriage_Date = c("12-01-2025","12-01-2025",

"15/02/2025",NA,"2025-03-18",

"17MAR2025","18-13-2025",

"11-02-2025","20250315",

"14APR2025",NA,NA,

"10-10-2024","25DEC2024",

"31/11/2025")

)

OUTPUT:

	Marriage_ID	Husband_Name	Age_Husband	Region_Code	Annual_Income	Marriage_Date
1	M1001	Ravi Kumar	29	AP	850000	12-01-2025
2	M1002	Ravi Kumar	29	AP	850000	12-01-2025
3	M1003	Arjun	212	TN	650000	15-02-2025
4	M1004	Kiran	-5	KA	-550000	NA
5	M1005	Suresh	35	ap	1200000	18-03-2025
6	M1006	Vikram	41	MH	abc900	17-Mar-2025
7	M1007	Dinesh	32	KL	990000	18-13-2025
8	M1008	Rohan	28	XX	1000000	11-02-2025
9	M1009	Tarun	45	AP	-150000	20250315
10	M1010	NULL	37	TN	760000	14-Apr-2025
11	M1011	Akash	33	AP	850000	NA
12	M1012	Akash	33	AP	850000	NA
13	M1013	Mohan	999	MH	450000	10-10-2024
14	M1014	Rakesh	39	KA	0	25-Dec-2024
15	M1015	Ajay	31	AP	780000	31/11/2025

SAS vs R Translation Insight

SAS	R
DATA Step	mutate()
ARRAYS	across()
COMPRESS()	str_replace_all()
PROPCASE()	str_to_title()
INPUT()	as.numeric() / parsing
FIRST./LAST.	group_by() + summarise()
MERGE	left_join()
PROC SQL	dbplyr / sqldf / joins

9.R Cleaning Pipeline

marriage_clean <- marriage_raw %>%

clean_names() %>%

mutate( across(

c(husband_name, region_code),str_trim),

husband_name =str_to_title(

str_to_lower(husband_name)),

region_code =str_to_upper(region_code),

annual_income =suppressWarnings(

as.numeric(gsub("[^0-9-]", "",

annual_income)

)

annual_income =abs(annual_income),

marriage_date =suppressWarnings(

parse_date_time(marriage_date,

orders = c("dmy","ymd","dbY")

)

age_husband =if_else(age_husband < 18 |

age_husband > 100,NA_real_,

as.numeric(age_husband))

)

OUTPUT:

	marriage_id	husband_name	age_husband	region_code	annual_income	marriage_date
1	M1001	Ravi Kumar	29	AP	850000	12-01-2025
2	M1002	Ravi Kumar	29	AP	850000	12-01-2025
3	M1003	Arjun	NA	TN	650000	15-02-2025
4	M1004	Kiran	NA	KA	550000	NA
5	M1005	Suresh	35	AP	1200000	18-03-2025
6	M1006	Vikram	41	MH	900	17-03-2025
7	M1007	Dinesh	32	KL	990000	NA
8	M1008	Rohan	28	XX	1000000	11-02-2025
9	M1009	Tarun	45	AP	150000	15-03-2025
10	M1010	Null	37	TN	760000	14-04-2025
11	M1011	Akash	33	AP	850000	NA
12	M1012	Akash	33	AP	850000	NA
13	M1013	Mohan	NA	MH	450000	10-10-2024
14	M1014	Rakesh	39	KA	0	25-12-2024
15	M1015	Ajay	31	AP	780000	NA

Why This R Pipeline Works Well

The tidyverse pipeline model behaves like a readable transformation storyboard.

Each transformation becomes composable and auditable.

across() mirrors SAS arrays conceptually.

parse_date_time() is extraordinarily useful because enterprise data rarely follows one format consistently.

One production mistake in R: implicit coercion warnings are ignored. as.numeric("abc900") creates NA, which can silently propagate unless validation summaries exist.

PROC SQL vs DATA Step vs R dplyr

Task	SAS DATA Step	PROC SQL	R dplyr
Row-wise cleaning	Excellent	Moderate	Excellent
Complex joins	Moderate	Excellent	Excellent
Streaming performance	Excellent	Moderate	Moderate
Auditability	Very High	High	Moderate
Memory handling	Efficient	Efficient	Higher RAM usage
Regulatory adoption	Gold Standard	Gold Standard	Growing
Macro automation	Extremely powerful	Limited	Functional programming
Metadata integration	Native	Native	Manual

10.distinct() vs PROC SORT NODUPKEY

marriage_dedup <- marriage_clean %>%

distinct(husband_name,region_code, .keep_all=TRUE)

OUTPUT:

	marriage_id	husband_name	age_husband	region_code	annual_income	marriage_date
1	M1001	Ravi Kumar	29	AP	850000	12-01-2025
2	M1003	Arjun	NA	TN	650000	15-02-2025
3	M1004	Kiran	NA	KA	550000	NA
4	M1005	Suresh	35	AP	1200000	18-03-2025
5	M1006	Vikram	41	MH	900	17-03-2025
6	M1007	Dinesh	32	KL	990000	NA
7	M1008	Rohan	28	XX	1000000	11-02-2025
8	M1009	Tarun	45	AP	150000	15-03-2025
9	M1010	Null	37	TN	760000	14-04-2025
10	M1011	Akash	33	AP	850000	NA
11	M1013	Mohan	NA	MH	450000	10-10-2024
12	M1014	Rakesh	39	KA	0	25-12-2024
13	M1015	Ajay	31	AP	780000	NA

SAS	R
PROC SORT NODUPKEY	distinct()

Enterprise Validation & Compliance

In clinical programming, cleaning logic is not optional engineering polish.

It is regulatory evidence.

Every transformation must answer:

What changed?
Why did it change?
Who approved it?
Can it be reproduced?
Can QA independently verify it?

That is the foundation of:

Clinical Data Management
SDTM
ADaM
Audit trail integrity
Traceability matrices
QC independence

The Dangerous SAS Missing Value Trap

In SAS:

. < 0 < 1 < 2

Missing numeric values sort LOWER than all valid numbers.

That means this condition:

if salary < 50000;

ACCIDENTALLY includes missing salaries.

Catastrophic consequences:

Invalid low-income populations
Inflated adverse event rates
Incorrect risk stratification
Regulatory reporting errors

Defensive programming:

if not missing(salary)

and salary < 50000;

Senior SAS programmers never trust implicit missing-value behavior.

Never.

Business Logic Narrative Why Cleaning Decisions Matter

Data cleaning is not cosmetic hygiene. It is business logic preservation.

Suppose a patient age arrives as 212. Technically numeric. Statistically poisonous. If left untouched, that single value can distort mean age calculations, affect age-stratified efficacy analysis, and alter machine learning model weighting. In actuarial systems, impossible ages can distort premium estimation and reserve forecasting.

Negative salary or income values create similar damage. Sometimes negative values represent refunds. Sometimes system corruption. Sometimes sign inversions during ETL ingestion. Blindly converting them to positive numbers without understanding business context creates audit risk. Intelligent pipelines flag them for controlled remediation.

Date standardization is equally critical. A mixed-format field containing "12-01-2025", "2025/01/12", and "12JAN2025" may appear harmless until time-series analysis fails or duplicate detection logic collapses because equivalent dates are interpreted differently.

Text normalization affects downstream joins more than most junior analysts realize. "AP", " ap ", and "Ap" can fracture regional aggregation logic, leading to incomplete counts and broken dashboards.

Cleaning is not about making data pretty.

It is about preserving analytical truth.

20 Real-World Data Cleaning Best Practices

Preserve raw datasets untouched
Standardize metadata centrally
Validate before transformation
Use defensive missing-value checks
Never overwrite source variables blindly
Build reusable macro frameworks
Separate derivation from reporting
Track lineage for every variable
Validate joins independently
Use controlled terminology libraries
Implement duplicate escalation rules
Centralize date parsing standards
Avoid hardcoded business logic
Use QC programmers independently
Version-control validation logic
Validate ranges semantically, not only structurally
Build exception datasets explicitly
Audit all imputation logic
Use modular pipelines
Treat data cleaning as risk engineering

20 Sharp One-Liner Insights

Missing values in SAS aren’t empty they’re secretly the smallest numbers in your dataset.
Dirty joins create confident lies.
One malformed ID can poison an entire warehouse.
Production bugs rarely scream they whisper.
Character truncation is silent data amputation.
Every duplicate record tells a system story.
AI models inherit your data quality morality.
Cleaning logic without lineage is audit suicide.
NULL handling separates analysts from engineers.
Dashboards fail long before executives notice.
Date chaos destroys trust faster than missing values.
PROC SQL is powerful but DATA Step is surgical.
Defensive programming is cheaper than regulatory remediation.
Every transformation needs a business reason.
Bad metadata scales faster than good governance.
Imputation without documentation is analytical fiction.
Whitespace bugs have destroyed million-dollar reports.
Validation code deserves validation too.
Clean data is engineered, not discovered.
Trustworthy analytics begin where assumptions end.

Production Validation Checklist

1.Raw data archived

2.Metadata verified

3.Duplicate logic approved

4.Missing-value rules documented

5.Date formats standardized

6.Controlled terminology applied

7.Invalid ranges flagged

8.Character lengths validated

9.Join cardinality tested

10.QC independently executed

11.Audit trail captured

12.Exception datasets generated

13.PROC CONTENTS reviewed

14.Summary statistics validated

15.Regulatory traceability confirmed

Summary

SAS remains the dominant force in regulated enterprise environments because it was engineered for repeatability, metadata consistency, and procedural traceability. The DATA Step remains one of the most efficient row-processing engines ever built for structured enterprise data. FIRST./LAST. processing, macro automation, format catalogs, and native procedural reporting make SAS extraordinarily powerful in production clinical pipelines. Most importantly, regulators understand SAS behavior deeply. Validation standards, SDTM mapping frameworks, Define.xml workflows, and submission ecosystems evolved around SAS for decades.

R, however, transformed modern analytical engineering through flexibility, open-source innovation, and expressive transformation pipelines. The tidyverse ecosystem dramatically improves readability, composability, and exploratory engineering velocity. Packages like dplyr, stringr, purrr, and lubridate allow analysts to build elegant, scalable transformation pipelines rapidly. R also integrates naturally with machine learning, APIs, visualization ecosystems, and modern cloud architectures.

But enterprise trust differs from analytical flexibility.

SAS prioritizes deterministic reproducibility.

R prioritizes analytical agility.

In modern organizations, the strongest teams rarely choose one exclusively. Instead, they combine SAS for validated regulatory production and R for exploratory analytics, modeling, automation, and scalable data science workflows.

The future belongs to bilingual programmers.

Conclusion

Every executive dashboard, every AI prediction, every regulatory submission, and every strategic decision rests on one invisible assumption:

The data is trustworthy.

Most organizations obsess over visualization tools, cloud platforms, machine learning frameworks, and AI acceleration while underinvesting in the engineering discipline that determines whether any of those outputs deserve trust in the first place.

Data cleaning is not preprocessing.

It is enterprise risk management.

A corrupted patient ID can invalidate enrollment counts.

A malformed timestamp can collapse longitudinal analysis.

A silent truncation bug can erase regulatory traceability.

A missing-value assumption can reverse analytical conclusions.

This is why experienced SAS programmers and data scientists think differently from junior coders. They do not merely write transformations. They engineer defensible systems.

Good cleaning frameworks create:

Repeatability
Auditability
Scalability
Traceability
Statistical reliability
Executive trust

The strongest engineers build pipelines that assume failure will happen somewhere malformed source feeds, duplicate merges, invalid formats, impossible values, broken joins, inconsistent metadata, and human mistakes.

Defensive programming is not pessimism.

It is professionalism.

SAS and R each bring unique strengths to enterprise cleaning ecosystems. SAS delivers industrial-grade procedural stability and regulatory alignment. R delivers extraordinary flexibility and analytical acceleration. Together, they form one of the most powerful bilingual engineering combinations in modern data science.

At the end of the day, analytics is not about generating numbers.

It is about generating decisions people can trust.

And trustworthy decisions begin with disciplined cleaning frameworks long before the first dashboard appears.

Interview Questions & Answers

1. Duplicate Marriage IDs Inflated Enrollment Counts

Question

A production dashboard suddenly shows 18% growth in registered couples overnight. How would you investigate?

Answer

I would first validate source ingestion counts versus historical baselines. Then I’d compare duplicate frequencies using PROC SORT NODUPKEY, PROC FREQ, and SQL COUNT DISTINCT logic. I’d examine whether duplicates originated from upstream ingestion, failed merge logic, or repeated incremental loads. Most importantly, I would define business-level uniqueness carefully sometimes duplicate IDs are legitimate amendments rather than true duplicates. I’d isolate impacted downstream datasets, assess reporting exposure, and document remediation traceability before rerunning pipelines.

2. Missing Values Accidentally Included in Low-Income Population

Question

A regulatory output included patients with missing salaries in the low-income category. What happened?

Answer

Classic SAS missing-value behavior. SAS treats missing numeric values as smaller than valid numbers. A condition like if salary < 50000 unintentionally includes missing salaries. I would correct logic defensively using not missing(salary) and validate all downstream derivations. Then I’d implement automated QC checks specifically targeting missing-value inclusion patterns because this class of bug is extremely common in production SAS environments.

3. Macro Debugging Failure in Production

Question

A macro-generated dataset suddenly creates missing variables in only one study. How would you debug?

Answer

I’d enable MPRINT, MLOGIC, and SYMBOLGEN immediately to inspect macro resolution behavior. Then I’d compare metadata structures between successful and failed studies using PROC CONTENTS. Usually these failures come from variable-length inconsistencies, conditional macro branches, or study-specific metadata drift. I debug macros by validating generated code first not by staring at macro syntax blindly.

4. Cross-System SAS vs R Validation Disagreement

Question

Your SAS and R outputs produce different patient counts. What’s your approach?

Answer

I’d first validate datatype coercion and missing-value handling because SAS and R treat them differently. Then I’d compare row-level joins, date parsing, sorting behavior, and duplicate handling rules. I’d create reconciliation datasets isolating mismatched records and validate transformation ordering carefully. Cross-platform validation is usually a semantics problem, not syntax problem.

5. Impossible Ages Passed Through QA

Question

How do biologically impossible ages survive enterprise systems?

Answer

Because many systems validate structure instead of meaning. If age is numeric, systems often assume it’s valid. Mature pipelines implement semantic validation layers range checks, controlled terminology validation, anomaly detection, and exception reporting. I always separate datatype validation from business-rule validation because both are necessary and neither alone is sufficient.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MARRIAGE DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Global Villains Dataset Reimagined with SAS DATA Step, PROC SQL and R

2.When Missing Data Speaks: Unlocking Hidden Patterns in Orphan School Analytics

3.The Hidden Power of SAS Dates: Building Smart Time Logic with INTNX and INTCK
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Mangalsutra Meets Metadata: Inside Enterprise Marriage Data Cleaning with SAS and R

Sacred Vows, Corrupted Rows: Cleaning Indian Marriage Data Before Analytics Collapse

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS