Fraud Approved, Patients Duplicated, Models Broken: The Real Cost of Dirty Data

Dirty Data, Dead Decisions: How Enterprise SAS & R Pipelines Rescue Business Trust

Monday Morning — The FDA Escalation Nobody Wanted

At 7:42 AM, the escalation email arrived.

The subject line was short:

URGENT: Enrollment Count Mismatch Detected in Submission Package

Inside a global oncology trial, the FDA identified duplicate patient identifiers across two regional data feeds. The enrollment dashboard showed 2,148 patients.

Reality?

2,031.

A single duplicated patient stream inflated recruitment metrics, altered exposure summaries, and contaminated downstream efficacy tables. Safety percentages shifted. Treatment-arm denominators changed. Executive dashboards were wrong. Statistical assumptions cracked.

By noon, clinical operations blamed programming.
Programming blamed data management.
Data management blamed the vendor feed.
The vendor blamed "legacy mapping inconsistencies."

Meanwhile, regulators didn't care whose fault it was.

The submission failed.

That is what dirty data does in the real world.

Not theoretical classroom dirty data.

Production dirty data.

The kind that quietly enters systems through malformed CSV exports, legacy ETL pipelines, offshore vendor feeds, timestamp mismatches, whitespace corruption, character truncation, mixed encodings, duplicate keys, and default software behaviors nobody noticed until millions of dollars were at risk.

A single bad variable is never just one bad variable.

It spreads.

Like oil in water.

Duplicate IDs inflate enrollment.
Negative billing values reverse revenue.
Malformed timestamps break trend models.
Mixed-case region codes destroy joins.
Missing visit dates invalidate survival analyses.
Improper type conversion silently converts business logic into fiction.

This is why enterprise data cleaning is not “preprocessing.”

It is risk management.

And the programmers who understand this become the people organizations trust during crises.

Raw Enterprise Healthcare Claims Dataset Construction

We will simulate a corrupted healthcare claims dataset containing deliberate enterprise-grade data issues.

This is not an employee table.

This is realistic production chaos.

SAS Raw Dataset Creation

data raw_claims;

/*LENGTH MUST COME FIRST*/

length patient_id $12 patient_name $40 region_code $10  email_address $60

       diagnosis_grp $20 visit_status $20 bill_amount_chr $15

       visit_date_chr $25 submit_ts $30;

infile datalines dlm='|' truncover;

informat visit_date_chr $25.;

format patient_name $40.;

input patient_id $ patient_name $ age region_code $ email_address $

      diagnosis_grp $ bill_amount_chr $ visit_date_chr $ submit_ts $

      visit_status $;

datalines;

P1001|john smith|45|us_east|john@gmail.com|CARDIO|12000|2025-01-15|2025-01-15 10:10:00|COMPLETE

P1002|  SARAH LEE |150|Us_West|sarahgmail.com|ONCOLOGY|-4500|2025/02/01|2025-02-01 08:00|closed

P1003|mike ray|-3|NULL|mike@yahoo|NEURO|2300|15-03-2025|2025-03-15T11:22|OPEN

P1004|anita joseph|67|EU_CENTRAL|anita@gmail.com|cardio|9999999|NULL|2025-04-01 09:55:11|Complete

P1005|ROBERT KING|34|APAC |robert@@mail.com|ORTHO|3000.55|2025-05-10|bad_timestamp|complete

P1005|ROBERT KING|34|APAC |robert@@mail.com|ORTHO|3000.55|2025-05-10|bad_timestamp|complete

P1006|mary ann|.|US_EAST|mary@mail.com|NEURO|abc123|2025-06-20|2025-06-20 05:22:11|Open

P1007|david long|29|usa-east|david@mail|CARDIO|-900|2025-07-18|2025-07-18 15:00:00|closed

P1008|LISA WHITE|200|EUROPE|lisa@gmail.com|UNKNOWN|500|2025-08-01|2025-08-01 18:33:00|COMPLETE

P1009|kelly stone|54| US_WEST |kelly@gmail.com|ONCOLOGY|4500|2025-09-10|2025-09-10 22:15:44|complete

P1010|ravi kumar|38|APAC|ravi@gmail.com|CARDIO|0|2025-10-11|2025-10-11 06:30:00|pending

P1011|emma green|75|EU_CENTRAL|emma@gmail.com|ORTHO|-50|2025-11-15|2025-11-15 01:10:00|Closed

P1012|tony stark|48|US_EAST|tony@stark.com|CARDIO|100000|2025-12-01|2025-12-01 11:11:11|COMPLETE

P1013|bruce wayne|65|APAC|bruce#mail.com|ONCOLOGY|8500|2025-12-15|2025-12-15 09:09:09|COMPLETE

P1014|clark kent|42|NULL|clark@gmail.com|NEURO|7200|2025-12-20|2025-12-20 08:45:00|OPEN

;

run;

proc print data = raw_claims;

run;

OUTPUT:

Obspatient_idpatient_nameregion_codeemail_addressdiagnosis_grpvisit_statusbill_amount_chrvisit_date_chrsubmit_tsage
1P1001john smithus_eastjohn@gmail.comCARDIOCOMPLETE120002025-01-152025-01-15 10:10:0045
2P1002SARAH LEEUs_Westsarahgmail.comONCOLOGYclosed-45002025/02/012025-02-01 08:00150
3P1003mike rayNULLmike@yahooNEUROOPEN230015-03-20252025-03-15T11:22-3
4P1004anita josephEU_CENTRALanita@gmail.comcardioComplete9999999NULL2025-04-01 09:55:1167
5P1005ROBERT KINGAPACrobert@@mail.comORTHOcomplete3000.552025-05-10bad_timestamp34
6P1005ROBERT KINGAPACrobert@@mail.comORTHOcomplete3000.552025-05-10bad_timestamp34
7P1006mary annUS_EASTmary@mail.comNEUROOpenabc1232025-06-202025-06-20 05:22:11.
8P1007david longusa-eastdavid@mailCARDIOclosed-9002025-07-182025-07-18 15:00:0029
9P1008LISA WHITEEUROPElisa@gmail.comUNKNOWNCOMPLETE5002025-08-012025-08-01 18:33:00200
10P1009kelly stoneUS_WESTkelly@gmail.comONCOLOGYcomplete45002025-09-102025-09-10 22:15:4454
11P1010ravi kumarAPACravi@gmail.comCARDIOpending02025-10-112025-10-11 06:30:0038
12P1011emma greenEU_CENTRALemma@gmail.comORTHOClosed-502025-11-152025-11-15 01:10:0075
13P1012tony starkUS_EASTtony@stark.comCARDIOCOMPLETE1000002025-12-012025-12-01 11:11:1148
14P1013bruce wayneAPACbruce#mail.comONCOLOGYCOMPLETE85002025-12-152025-12-15 09:09:0965
15P1014clark kentNULLclark@gmail.comNEUROOPEN72002025-12-202025-12-20 08:45:0042

Why LENGTH Must Come First in SAS

This is one of the most dangerous silent failures in SAS production systems.

In SAS, character variable length is assigned the moment the variable is first created. If you accidentally assign:

patient_name='John';

before defining:

length patient_name $40;

then SAS permanently allocates only 4 bytes.

Later:

patient_name='Christopher Anderson';

becomes:

Chri

No error.
No failure.
No dramatic warning.

Just silent truncation.

This is called Character Truncation Risk, and in enterprise systems it destroys:

  • patient identifiers
  • treatment names
  • audit comments
  • vendor source values
  • regulatory metadata
  • JSON payload mappings

In production, truncation spreads downstream into joins, reports, APIs, and submission datasets.

R behaves differently.

R stores character vectors as dynamically allocated string objects internally. Strings are not fixed-width memory allocations like SAS character variables. R therefore avoids silent truncation by design. However, R introduces its own enterprise risks:

  • factor coercion
  • encoding inconsistencies
  • NA propagation
  • type instability during joins
  • memory explosion with large strings

SAS risks truncation.
R risks uncontrolled flexibility.

Senior programmers learn both failure modes.

Enterprise SAS Cleaning Pipeline

1.Standardization & Type Conversion

data claims_stage1;

set raw_claims;

retain batch_id "CLAIM_BATCH_2026";

patient_name = propcase(strip(patient_name));

region_code = upcase(compress(region_code));

visit_status = upcase(strip(visit_status));

diagnosis_grp = upcase(strip(diagnosis_grp));

email_address = lowcase(strip(email_address));

bill_amount = input(compress(bill_amount_chr,,'kd'),best12.);

visit_date = input(visit_date_chr,anydtdte20.);

submit_datetime = input(submit_ts,anydtdtm30.);

format visit_date date9. submit_datetime datetime20.

       bill_amount dollar12.2;

run;

proc print data = claims_stage1;

run;

OUTPUT:

Obspatient_idpatient_nameregion_codeemail_addressdiagnosis_grpvisit_statusbill_amount_chrvisit_date_chrsubmit_tsagebatch_idbill_amountvisit_datesubmit_datetime
1P1001John SmithUS_EASTjohn@gmail.comCARDIOCOMPLETE120002025-01-152025-01-15 10:10:0045CLAIM_BATCH_2026$12,000.0015JAN202515JAN2025:10:10:00
2P1002Sarah LeeUS_WESTsarahgmail.comONCOLOGYCLOSED-45002025/02/012025-02-01 08:00150CLAIM_BATCH_2026$4,500.0001FEB202501FEB2025:08:00:00
3P1003Mike RayNULLmike@yahooNEUROOPEN230015-03-20252025-03-15T11:22-3CLAIM_BATCH_2026$2,300.0015MAR2025.
4P1004Anita JosephEU_CENTRALanita@gmail.comCARDIOCOMPLETE9999999NULL2025-04-01 09:55:1167CLAIM_BATCH_2026$9999999.00.01APR2025:09:55:11
5P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.
6P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.
7P1006Mary AnnUS_EASTmary@mail.comNEUROOPENabc1232025-06-202025-06-20 05:22:11.CLAIM_BATCH_2026$123.0020JUN202520JUN2025:05:22:11
8P1007David LongUSA-EASTdavid@mailCARDIOCLOSED-9002025-07-182025-07-18 15:00:0029CLAIM_BATCH_2026$900.0018JUL202518JUL2025:15:00:00
9P1008Lisa WhiteEUROPElisa@gmail.comUNKNOWNCOMPLETE5002025-08-012025-08-01 18:33:00200CLAIM_BATCH_2026$500.0001AUG202501AUG2025:18:33:00
10P1009Kelly StoneUS_WESTkelly@gmail.comONCOLOGYCOMPLETE45002025-09-102025-09-10 22:15:4454CLAIM_BATCH_2026$4,500.0010SEP202510SEP2025:22:15:44
11P1010Ravi KumarAPACravi@gmail.comCARDIOPENDING02025-10-112025-10-11 06:30:0038CLAIM_BATCH_2026$0.0011OCT202511OCT2025:06:30:00
12P1011Emma GreenEU_CENTRALemma@gmail.comORTHOCLOSED-502025-11-152025-11-15 01:10:0075CLAIM_BATCH_2026$50.0015NOV202515NOV2025:01:10:00
13P1012Tony StarkUS_EASTtony@stark.comCARDIOCOMPLETE1000002025-12-012025-12-01 11:11:1148CLAIM_BATCH_2026$100,000.0001DEC202501DEC2025:11:11:11
14P1013Bruce WayneAPACbruce#mail.comONCOLOGYCOMPLETE85002025-12-152025-12-15 09:09:0965CLAIM_BATCH_2026$8,500.0015DEC202515DEC2025:09:09:09
15P1014Clark KentNULLclark@gmail.comNEUROOPEN72002025-12-202025-12-20 08:45:0042CLAIM_BATCH_2026$7,200.0020DEC202520DEC2025:08:45:00

Why This Matters in Production

This stage is where raw operational chaos becomes machine-readable enterprise data. Real production systems rarely agree on formatting conventions. One vendor sends "Us_West", another sends " us_west ", and another sends "USA-WEST". Unless standardized immediately, joins fail silently and aggregated analytics fracture into false categories.

The COMPRESS(...,'kd') trick is particularly important. It keeps only digits while stripping corruption from mixed billing values like "abc123" or "12,000USD".

The most common production failure here is improper type conversion. Analysts often apply INPUT() directly without validating malformed values. That causes mass missing values, which later distort summaries and predictive models.

2.Advanced Enterprise Cleaning with Arrays, DO Loops & Validation

data claims_stage2;

set claims_stage1;

array char_fix(*) patient_name region_code diagnosis_grp visit_status;

do i = 1 to dim(char_fix);

char_fix(i)=tranwrd(char_fix(i),'NULL','UNKNOWN');

char_fix(i)=strip(char_fix(i));

end;

if age < 0 then age=.;

else if age > 110 then age=90;

if bill_amount < 0 then

   bill_amount = abs(bill_amount);

if missing(visit_date) then

   visit_date = intnx('day',today(),-30);

   claim_age_days =intck('day',visit_date,

            today());

run;

proc print data = claims_stage2;

run;

OUTPUT:

Obspatient_idpatient_nameregion_codeemail_addressdiagnosis_grpvisit_statusbill_amount_chrvisit_date_chrsubmit_tsagebatch_idbill_amountvisit_datesubmit_datetimeiclaim_age_days
1P1001John SmithUS_EASTjohn@gmail.comCARDIOCOMPLETE120002025-01-152025-01-15 10:10:0045CLAIM_BATCH_2026$12,000.0015JAN202515JAN2025:10:10:005499
2P1002Sarah LeeUS_WESTsarahgmail.comONCOLOGYCLOSED-45002025/02/012025-02-01 08:0090CLAIM_BATCH_2026$4,500.0001FEB202501FEB2025:08:00:005482
3P1003Mike RayUNKNOWNmike@yahooNEUROOPEN230015-03-20252025-03-15T11:22.CLAIM_BATCH_2026$2,300.0015MAR2025.5440
4P1004Anita JosephEU_CENTRALanita@gmail.comCARDIOCOMPLETE9999999NULL2025-04-01 09:55:1167CLAIM_BATCH_2026$9999999.0029APR202601APR2025:09:55:11530
5P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.5384
6P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.5384
7P1006Mary AnnUS_EASTmary@mail.comNEUROOPENabc1232025-06-202025-06-20 05:22:11.CLAIM_BATCH_2026$123.0020JUN202520JUN2025:05:22:115343
8P1007David LongUSA-EASTdavid@mailCARDIOCLOSED-9002025-07-182025-07-18 15:00:0029CLAIM_BATCH_2026$900.0018JUL202518JUL2025:15:00:005315
9P1008Lisa WhiteEUROPElisa@gmail.comUNKNOWNCOMPLETE5002025-08-012025-08-01 18:33:0090CLAIM_BATCH_2026$500.0001AUG202501AUG2025:18:33:005301
10P1009Kelly StoneUS_WESTkelly@gmail.comONCOLOGYCOMPLETE45002025-09-102025-09-10 22:15:4454CLAIM_BATCH_2026$4,500.0010SEP202510SEP2025:22:15:445261
11P1010Ravi KumarAPACravi@gmail.comCARDIOPENDING02025-10-112025-10-11 06:30:0038CLAIM_BATCH_2026$0.0011OCT202511OCT2025:06:30:005230
12P1011Emma GreenEU_CENTRALemma@gmail.comORTHOCLOSED-502025-11-152025-11-15 01:10:0075CLAIM_BATCH_2026$50.0015NOV202515NOV2025:01:10:005195
13P1012Tony StarkUS_EASTtony@stark.comCARDIOCOMPLETE1000002025-12-012025-12-01 11:11:1148CLAIM_BATCH_2026$100,000.0001DEC202501DEC2025:11:11:115179
14P1013Bruce WayneAPACbruce#mail.comONCOLOGYCOMPLETE85002025-12-152025-12-15 09:09:0965CLAIM_BATCH_2026$8,500.0015DEC202515DEC2025:09:09:095165
15P1014Clark KentUNKNOWNclark@gmail.comNEUROOPEN72002025-12-202025-12-20 08:45:0042CLAIM_BATCH_2026$7,200.0020DEC202520DEC2025:08:45:005160

Why This Matters in Production

Enterprise cleaning is about defensive programming. Arrays allow scalable corrections across dozens or hundreds of fields without repetitive code. This becomes critical in SDTM and ADaM environments where domains may contain hundreds of standardized variables.

INTNX() and INTCK() are essential enterprise date-engineering tools. Regulatory timelines, aging metrics, claims latency, and exposure windows all depend on accurate interval calculations.

The most dangerous mistake here is careless imputation. Replacing missing dates without preserving traceability violates regulatory expectations. Production-grade systems must always preserve:

  • original values
  • derivation logic
  • imputation reason
  • audit metadata

3.Duplicate Detection Using FIRST./LAST.

proc sort data=claims_stage2;

by patient_id visit_date;

run;

proc print data = claims_stage2;

run;

OUTPUT:

Obspatient_idpatient_nameregion_codeemail_addressdiagnosis_grpvisit_statusbill_amount_chrvisit_date_chrsubmit_tsagebatch_idbill_amountvisit_datesubmit_datetimeiclaim_age_days
1P1001John SmithUS_EASTjohn@gmail.comCARDIOCOMPLETE120002025-01-152025-01-15 10:10:0045CLAIM_BATCH_2026$12,000.0015JAN202515JAN2025:10:10:005499
2P1002Sarah LeeUS_WESTsarahgmail.comONCOLOGYCLOSED-45002025/02/012025-02-01 08:0090CLAIM_BATCH_2026$4,500.0001FEB202501FEB2025:08:00:005482
3P1003Mike RayUNKNOWNmike@yahooNEUROOPEN230015-03-20252025-03-15T11:22.CLAIM_BATCH_2026$2,300.0015MAR2025.5440
4P1004Anita JosephEU_CENTRALanita@gmail.comCARDIOCOMPLETE9999999NULL2025-04-01 09:55:1167CLAIM_BATCH_2026$9999999.0029APR202601APR2025:09:55:11530
5P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.5384
6P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.5384
7P1006Mary AnnUS_EASTmary@mail.comNEUROOPENabc1232025-06-202025-06-20 05:22:11.CLAIM_BATCH_2026$123.0020JUN202520JUN2025:05:22:115343
8P1007David LongUSA-EASTdavid@mailCARDIOCLOSED-9002025-07-182025-07-18 15:00:0029CLAIM_BATCH_2026$900.0018JUL202518JUL2025:15:00:005315
9P1008Lisa WhiteEUROPElisa@gmail.comUNKNOWNCOMPLETE5002025-08-012025-08-01 18:33:0090CLAIM_BATCH_2026$500.0001AUG202501AUG2025:18:33:005301
10P1009Kelly StoneUS_WESTkelly@gmail.comONCOLOGYCOMPLETE45002025-09-102025-09-10 22:15:4454CLAIM_BATCH_2026$4,500.0010SEP202510SEP2025:22:15:445261
11P1010Ravi KumarAPACravi@gmail.comCARDIOPENDING02025-10-112025-10-11 06:30:0038CLAIM_BATCH_2026$0.0011OCT202511OCT2025:06:30:005230
12P1011Emma GreenEU_CENTRALemma@gmail.comORTHOCLOSED-502025-11-152025-11-15 01:10:0075CLAIM_BATCH_2026$50.0015NOV202515NOV2025:01:10:005195
13P1012Tony StarkUS_EASTtony@stark.comCARDIOCOMPLETE1000002025-12-012025-12-01 11:11:1148CLAIM_BATCH_2026$100,000.0001DEC202501DEC2025:11:11:115179
14P1013Bruce WayneAPACbruce#mail.comONCOLOGYCOMPLETE85002025-12-152025-12-15 09:09:0965CLAIM_BATCH_2026$8,500.0015DEC202515DEC2025:09:09:095165
15P1014Clark KentUNKNOWNclark@gmail.comNEUROOPEN72002025-12-202025-12-20 08:45:0042CLAIM_BATCH_2026$7,200.0020DEC202520DEC2025:08:45:005160

data duplicate_claims 

        clean_claims;

set claims_stage2;

by patient_id;

if first.patient_id and last.patient_id then

output clean_claims;

else output duplicate_claims;

run;

proc print data = clean_claims;

run;

OUTPUT:

Obspatient_idpatient_nameregion_codeemail_addressdiagnosis_grpvisit_statusbill_amount_chrvisit_date_chrsubmit_tsagebatch_idbill_amountvisit_datesubmit_datetimeiclaim_age_days
1P1001John SmithUS_EASTjohn@gmail.comCARDIOCOMPLETE120002025-01-152025-01-15 10:10:0045CLAIM_BATCH_2026$12,000.0015JAN202515JAN2025:10:10:005499
2P1002Sarah LeeUS_WESTsarahgmail.comONCOLOGYCLOSED-45002025/02/012025-02-01 08:0090CLAIM_BATCH_2026$4,500.0001FEB202501FEB2025:08:00:005482
3P1003Mike RayUNKNOWNmike@yahooNEUROOPEN230015-03-20252025-03-15T11:22.CLAIM_BATCH_2026$2,300.0015MAR2025.5440
4P1004Anita JosephEU_CENTRALanita@gmail.comCARDIOCOMPLETE9999999NULL2025-04-01 09:55:1167CLAIM_BATCH_2026$9999999.0029APR202601APR2025:09:55:11530
5P1006Mary AnnUS_EASTmary@mail.comNEUROOPENabc1232025-06-202025-06-20 05:22:11.CLAIM_BATCH_2026$123.0020JUN202520JUN2025:05:22:115343
6P1007David LongUSA-EASTdavid@mailCARDIOCLOSED-9002025-07-182025-07-18 15:00:0029CLAIM_BATCH_2026$900.0018JUL202518JUL2025:15:00:005315
7P1008Lisa WhiteEUROPElisa@gmail.comUNKNOWNCOMPLETE5002025-08-012025-08-01 18:33:0090CLAIM_BATCH_2026$500.0001AUG202501AUG2025:18:33:005301
8P1009Kelly StoneUS_WESTkelly@gmail.comONCOLOGYCOMPLETE45002025-09-102025-09-10 22:15:4454CLAIM_BATCH_2026$4,500.0010SEP202510SEP2025:22:15:445261
9P1010Ravi KumarAPACravi@gmail.comCARDIOPENDING02025-10-112025-10-11 06:30:0038CLAIM_BATCH_2026$0.0011OCT202511OCT2025:06:30:005230
10P1011Emma GreenEU_CENTRALemma@gmail.comORTHOCLOSED-502025-11-152025-11-15 01:10:0075CLAIM_BATCH_2026$50.0015NOV202515NOV2025:01:10:005195
11P1012Tony StarkUS_EASTtony@stark.comCARDIOCOMPLETE1000002025-12-012025-12-01 11:11:1148CLAIM_BATCH_2026$100,000.0001DEC202501DEC2025:11:11:115179
12P1013Bruce WayneAPACbruce#mail.comONCOLOGYCOMPLETE85002025-12-152025-12-15 09:09:0965CLAIM_BATCH_2026$8,500.0015DEC202515DEC2025:09:09:095165
13P1014Clark KentUNKNOWNclark@gmail.comNEUROOPEN72002025-12-202025-12-20 08:45:0042CLAIM_BATCH_2026$7,200.0020DEC202520DEC2025:08:45:005160

proc print data = duplicate_claims;

run;

OUTPUT:

Obspatient_idpatient_nameregion_codeemail_addressdiagnosis_grpvisit_statusbill_amount_chrvisit_date_chrsubmit_tsagebatch_idbill_amountvisit_datesubmit_datetimeiclaim_age_days
1P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.5384
2P1005Robert KingAPACrobert@@mail.comORTHOCOMPLETE3000.552025-05-10bad_timestamp34CLAIM_BATCH_2026$300,055.0010MAY2025.5384

Why This Matters in Production

Duplicate logic is rarely simple.

A patient may legitimately appear multiple times across visits, but duplicate IDs with identical timestamps may indicate feed replication errors. FIRST./LAST. processing gives precise control over group-level logic.

A common production failure is sorting incorrectly before BY-group processing. If the dataset is not sorted identically to the BY statement, SAS produces unreliable grouping behavior that may silently corrupt deduplication logic.

In banking and insurance systems, this can:

  • duplicate payments
  • suppress fraud alerts
  • inflate revenue
  • distort exposure metrics

4.PROC FORMAT for Standardization

proc format;

value $regionfmt 'US_EAST'='United States East'

                             'US_WEST'='United States West'

                                     'APAC'='Asia Pacific'

                     'EU_CENTRAL'='Europe Central'

                                          other='Unknown Region';

run;

LOG:

NOTE: Format $REGIONFMT has been output.

Why This Matters in Production

Formats separate business presentation logic from raw storage logic. That distinction becomes critical in enterprise reporting environments. Instead of permanently overwriting source values, formats provide controlled semantic interpretation.

The common mistake is embedding business mappings directly into DATA steps repeatedly. That creates governance nightmares during updates.

5.PROC SQL Validation

proc sql;

create table invalid_emails as

select patient_id,

       region_code format=$regionfmt.,

       email_address

from claims_stage2

where email_address not like '%@%.%';

quit;

proc print data = invalid_emails;

run;

OUTPUT:

Obspatient_idregion_codeemail_address
1P1002United States Westsarahgmail.com
2P1003Unknown Regionmike@yahoo
3P1007Unknown Regiondavid@mail
4P1013Asia Pacificbruce#mail.com

Why This Matters in Production

PROC SQL excels at declarative validation logic. SQL-based exception tables are widely used in enterprise reconciliation pipelines because they create auditable issue inventories.

The most common failure is assuming SQL validation alone guarantees clean data. SQL finds problems. It does not explain business intent.

R Version — Cleaning Workflow

Raw Dataset Creation

library(tidyverse)

library(lubridate)

library(janitor)

raw_claims <- tibble(

  patient_id = c(

    "P1001","P1002","P1003","P1004","P1005",

    "P1005","P1006","P1007","P1008","P1009",

    "P1010","P1011","P1012","P1013","P1014"),

  patient_name = c(

    "john smith","  SARAH LEE ","mike ray",

    "anita joseph","ROBERT KING",

    "ROBERT KING","mary ann","david long",

    "LISA WHITE","kelly stone",

    "ravi kumar","emma green",

    "tony stark","bruce wayne","clark kent"),

  age = c(45,150,-3,67,34,34,NA,29,200,54,38,75,48,65,42),

  region_code = c(

    "us_east","Us_West","NULL","EU_CENTRAL",

    "APAC ","APAC ","US_EAST","usa-east",

    "EUROPE"," US_WEST ","APAC","EU_CENTRAL",

    "US_EAST","APAC","NULL")

)

OUTPUT:

 

patient_id

patient_name

age

region_code

1

P1001

john smith

45

us_east

2

P1002

  SARAH LEE 

150

Us_West

3

P1003

mike ray

-3

NULL

4

P1004

anita joseph

67

EU_CENTRAL

5

P1005

ROBERT KING

34

APAC 

6

P1005

ROBERT KING

34

APAC 

7

P1006

mary ann

NA

US_EAST

8

P1007

david long

29

usa-east

9

P1008

LISA WHITE

200

EUROPE

10

P1009

kelly stone

54

 US_WEST 

11

P1010

ravi kumar

38

APAC

12

P1011

emma green

75

EU_CENTRAL

13

P1012

tony stark

48

US_EAST

14

P1013

bruce wayne

65

APAC

15

P1014

clark kent

42

NULL

SAS vs R Translation

SAS

R

DATA step

tibble()

SET dataset

pipe %>%

STRIP()

str_trim()

UPCASE()

str_to_upper()

PROPCASE()

str_to_title()

COMPRESS()

str_replace_all()

INPUT()

as.numeric() / parsers

NMISS()

is.na()

CATX()

unite()

COALESCEC()

coalesce()

R Cleaning Pipeline

claims_clean <- raw_claims %>%

clean_names() %>%

mutate(patient_name =patient_name %>%

       str_trim() %>%

       str_to_title(),

       region_code =region_code %>%

       str_trim() %>%

       str_to_upper(),

       age =case_when(age < 0 ~ NA_real_,

                      age > 110 ~ 90,

                      TRUE ~ age)

  ) %>%

  distinct(patient_id, .keep_all = TRUE)

OUTPUT:

 

patient_id

patient_name

age

region_code

1

P1001

John Smith

45

US_EAST

2

P1002

Sarah Lee

90

US_WEST

3

P1003

Mike Ray

NA

NULL

4

P1004

Anita Joseph

67

EU_CENTRAL

5

P1005

Robert King

34

APAC

6

P1006

Mary Ann

NA

US_EAST

7

P1007

David Long

29

USA-EAST

8

P1008

Lisa White

90

EUROPE

9

P1009

Kelly Stone

54

US_WEST

10

P1010

Ravi Kumar

38

APAC

11

P1011

Emma Green

75

EU_CENTRAL

12

P1012

Tony Stark

48

US_EAST

13

P1013

Bruce Wayne

65

APAC

14

P1014

Clark Kent

42

NULL

Why This Matters in Production

mutate() in R behaves similarly to iterative DATA step transformation logic in SAS, but R pipelines are more functionally expressive and composable.

case_when() is essentially R’s enterprise equivalent of SAS SELECT-WHEN or chained IF-THEN/ELSE logic.

The common enterprise failure in R is accidental type coercion. If one branch returns character while another returns numeric, the entire vector may become character silently.

distinct() in R

Keeps the first occurrence by default.

Enterprise Validation & Regulatory Compliance

In regulated environments like clinical trials, cleaning is not cosmetic.

It is compliance-critical.

Standards like CDISC SDTM and ADaM require:

  • traceable derivations
  • reproducible transformations
  • metadata-controlled variable lineage
  • QC independence
  • auditable programming logic
  • deterministic outputs

One of SAS’s most dangerous default behaviors is this:

Missing Numeric Values Sort Lower Than All Valid Numbers

In SAS:

. < 0 < 1 < 100

That means missing values appear FIRST during sorting.

This creates catastrophic analytical errors.

Example:

proc sort data=claims;

by bill_amount;

run;

If analysts later select:

if first.bill_amount;

they may accidentally process missing financial records before legitimate low-dollar claims.

This has caused:

  • false outlier detection
  • broken risk scoring
  • corrupted baseline calculations
  • invalid TLF summaries

Defensive programmers explicitly handle missing values before sorting.

Always.

Business Logic Behind Enterprise Cleaning

Data cleaning is not about “making datasets look pretty.”

It is about preserving business truth.

Suppose patient age is recorded as 150. Technically, the system accepted the value. But biologically and clinically, it is impossible. If untreated, survival models, demographic summaries, and dosing analyses become statistically distorted. Correcting or flagging these values preserves analytical integrity.

The same applies to salary normalization in banking or insurance systems. If one system records salary annually and another monthly, predictive risk models become inconsistent. A loan approval model may falsely classify high-risk applicants because the feature scale is corrupted.

Missing-date imputation is even more dangerous. Imagine a treatment start date missing in a clinical trial. Downstream exposure calculations fail. Adverse event windows become inaccurate. Regulators may question safety conclusions.

Text normalization matters too. "US_EAST", "us-east", and "US EAST" look similar to humans but represent entirely different categories to software. Failed joins create phantom regions, fractured reporting, and duplicate business entities.

Enterprise cleaning is therefore not “cleanup.”

It is controlled reconstruction of operational reality.

20 Enterprise Data Cleaning Best Practices

  1. Always define LENGTH before assignment in SAS.
  2. Separate raw, staging, and production layers.
  3. Never overwrite source data directly.
  4. Preserve audit columns for all derivations.
  5. Standardize date formats immediately.
  6. Centralize validation rules in macros/functions.
  7. Use metadata-driven programming whenever possible.
  8. QC independently from production code.
  9. Validate joins with record counts.
  10. Detect duplicate keys before merges.
  11. Normalize text before matching.
  12. Use defensive missing-value logic.
  13. Document imputation assumptions.
  14. Separate business logic from display formatting.
  15. Create exception datasets for rejected records.
  16. Validate ranges against domain standards.
  17. Build reusable enterprise macros.
  18. Version-control production code.
  19. Automate regression validation testing.
  20. Never trust vendor feeds blindly.

20 Sharp One-Liner Insights

  1. Dirty data rarely crashes systems immediately it corrupts trust slowly.
  2. Missing values in SAS aren't empty they're quietly the smallest number in your dataset.
  3. Duplicate IDs don’t duplicate rows they duplicate business risk.
  4. Every malformed timestamp is a future reconciliation meeting.
  5. Production bugs love silent truncation.
  6. A failed join is usually a text-standardization problem wearing a different costume.
  7. Data lineage is compliance insurance.
  8. Bad source systems create honest dashboards that still tell lies.
  9. Macros scale logic and scale mistakes.
  10. Whitespace is invisible until it breaks a million-dollar merge.
  11. Most AI problems begin as data engineering problems.
  12. Regulators trust traceability more than clever code.
  13. A dashboard is only as truthful as its weakest variable.
  14. Character encoding issues are globalization’s hidden tax.
  15. Every manual correction needs an audit trail.
  16. Production programming is defensive programming.
  17. NULL is not a business definition.
  18. Standardization creates analytical gravity.
  19. Fast code without validation is automated chaos.
  20. Cleaning data is reconstructing reality from operational noise.

SAS vs R — Enterprise Comparison

Area

SAS

R

Auditability

Extremely strong

Requires governance discipline

Regulatory Acceptance

Gold standard in pharma

Growing rapidly

Flexibility

Structured

Highly flexible

Performance

Excellent on enterprise grids

Excellent with optimization

Memory Model

Fixed-width

Dynamic

Traceability

Native enterprise orientation

Must be engineered

Macro Automation

Mature

More fragmented

Visualization

Moderate

Excellent

Learning Curve

Procedural

Functional + procedural

Enterprise Deployment

Extremely mature

Expanding rapidly

Production Validation Checklist

[ ] Duplicate keys validated

[ ] Missing values profiled

[ ] Date formats standardized

[ ] Character lengths reviewed

[ ] Invalid categories flagged

[ ] Joins reconciled

[ ] Record counts validated

[ ] QC independent review completed

[ ] Audit trail generated

[ ] Exception records archived

[ ] Imputation documented

[ ] Metadata synchronized

[ ] Derived variables traceable

[ ] Output reproducibility confirmed

SAS vs R — Enterprise Cleaning Summary

SAS remains dominant in heavily regulated industries because its architecture was designed around repeatability, traceability, procedural determinism, and enterprise governance. DATA step processing provides row-level control that remains unmatched for certain operational workloads. PROC-based reporting ecosystems integrate naturally into submission pipelines, making SAS exceptionally strong for SDTM, ADaM, and TLF production.

R, however, excels in flexibility, modern data science integration, visualization, package extensibility, and functional programming workflows. The tidyverse ecosystem enables elegant transformation logic that is often more concise and expressive than traditional SAS implementations. R also integrates naturally with machine learning frameworks and modern cloud-native analytics stacks.

In practice, elite enterprise teams increasingly use both.

SAS handles:

  • validated pipelines
  • regulatory submissions
  • traceable transformations
  • production governance

R handles:

  • exploratory analytics
  • advanced modeling
  • visualization
  • scalable statistical experimentation

The smartest programmers are bilingual.

Not because tools matter more than thinking.

But because different systems solve different business risks.

Conclusion

Organizations spend millions building dashboards, AI systems, predictive models, cloud platforms, and executive analytics ecosystems.

Yet most catastrophic failures still begin with something embarrassingly small:

  • a malformed date
  • a duplicate identifier
  • a missing value
  • a broken join
  • a truncated character field

Enterprise data cleaning is therefore not a support activity.

It is foundational infrastructure.

A hospital cannot trust patient-risk models built on corrupted source systems. A bank cannot trust fraud analytics polluted by inconsistent transaction logic. An insurance company cannot trust reserve calculations built from malformed claims feeds. A pharmaceutical company cannot trust safety conclusions derived from duplicated subjects or invalid treatment dates.

The irony is brutal:
the most sophisticated analytics systems in the world are still vulnerable to the smallest unvalidated variable.

This is why mature organizations build structured cleaning frameworks:

  • standardized staging layers
  • metadata-driven validation
  • enterprise macros
  • audit trails
  • exception handling
  • reproducible pipelines
  • QC independence
  • lineage-aware transformations

The goal is not perfection.

The goal is controlled reliability.

Good programmers write code that runs.

Senior programmers write code that survives production.

Enterprise programmers write systems that preserve trust when everything upstream becomes chaotic.

And that difference changes careers.

Interview Questions & Answers:

1. Duplicate Patient IDs Inflated Enrollment Counts

Question:

A regulatory submission failed because duplicate patient IDs inflated enrollment metrics. How would you investigate?

Answer:

First, I would validate whether the duplicates are true duplicates or legitimate repeat visits. I’d profile uniqueness using PROC SORT NODUPKEY, PROC FREQ, and SQL COUNT DISTINCT comparisons. Then I’d examine source lineage to identify whether the issue originated during ingestion, merging, or upstream vendor mapping. I would compare timestamps, site IDs, and treatment sequences to reconstruct the duplication path. Finally, I’d build exception datasets and validate corrected denominators against prior frozen snapshots before rerunning TLFs.

2. Missing Numeric Values Corrupted Risk Scores

Question:

A fraud model unexpectedly prioritized missing claims as lowest risk. What happened?

Answer:

In SAS, missing numeric values sort lower than all valid numbers. If ranking logic or FIRST./LAST. processing was applied after sorting, missing claims may have been incorrectly interpreted as minimal-risk observations. I would inspect sorting behavior, verify missing handling rules, and implement explicit conditional logic to separate missing from legitimate low values before scoring.

3. Macro Suddenly Failed After Deployment

Question:

A reusable macro worked in DEV but failed in PROD. How would you debug it?

Answer:

I’d first check environmental dependencies: library assignments, encoding differences, macro variable resolution, permissions, and metadata references. Then I’d enable MPRINT, SYMBOLGEN, and MLOGIC to trace execution flow. Most production macro failures stem from assumptions about variable existence, data structure consistency, or unresolved macro variables introduced by environment differences.

4. SAS and R Outputs Don’t Match

Question:

You generated the same summary in SAS and R but totals differ slightly. What do you do?

Answer:

I’d investigate type conversion, rounding behavior, missing-value handling, factor/category mappings, and join cardinality. SAS and R often differ in implicit conversions and NA treatment. I’d validate row counts after every transformation stage and reconcile discrepancies incrementally rather than comparing only final outputs.

5. Date Imputation Triggered Regulatory Concern

Question:

An auditor questioned your missing-date imputation strategy. How do you defend it?

Answer:

I’d provide documented business rules, SAP references, metadata lineage, derivation traceability, and validation evidence. I’d show both original and imputed values alongside rationale and demonstrate that the imputation method was predefined, consistently applied, and independently QC’d. Regulators care less about whether imputation exists and more about whether it is controlled, transparent, and reproducible.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BUSINESS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can SAS’s Precision and R’s Flexibility Together Eliminate Hidden Errors in Orbital Debris Data Analytics?

2.Can We Build an Accurate Product Demand Forecasting & Fraud Detection System in SAS  While Identifying and Fixing Intentional Errors?

3.Can SAS Identify the Most Efficient Waste Collection Routes in a City?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments