Fraud Approved, Patients Duplicated, Models Broken: The Real Cost of Dirty Data
Dirty Data, Dead Decisions: How Enterprise SAS & R Pipelines Rescue Business Trust
Monday Morning — The FDA Escalation Nobody Wanted
At 7:42
AM, the escalation email arrived.
The
subject line was short:
URGENT:
Enrollment Count Mismatch Detected in Submission Package
Inside a
global oncology trial, the FDA identified duplicate patient identifiers across
two regional data feeds. The enrollment dashboard showed 2,148 patients.
Reality?
2,031.
A single
duplicated patient stream inflated recruitment metrics, altered exposure
summaries, and contaminated downstream efficacy tables. Safety percentages
shifted. Treatment-arm denominators changed. Executive dashboards were wrong.
Statistical assumptions cracked.
By noon,
clinical operations blamed programming.
Programming blamed data management.
Data management blamed the vendor feed.
The vendor blamed "legacy mapping inconsistencies."
Meanwhile,
regulators didn't care whose fault it was.
The
submission failed.
That is
what dirty data does in the real world.
Not
theoretical classroom dirty data.
Production
dirty data.
The kind
that quietly enters systems through malformed CSV exports, legacy ETL
pipelines, offshore vendor feeds, timestamp mismatches, whitespace corruption,
character truncation, mixed encodings, duplicate keys, and default software
behaviors nobody noticed until millions of dollars were at risk.
A single
bad variable is never just one bad variable.
It
spreads.
Like oil
in water.
Duplicate
IDs inflate enrollment.
Negative billing values reverse revenue.
Malformed timestamps break trend models.
Mixed-case region codes destroy joins.
Missing visit dates invalidate survival analyses.
Improper type conversion silently converts business logic into fiction.
This is
why enterprise data cleaning is not “preprocessing.”
It is
risk management.
And the
programmers who understand this become the people organizations trust during crises.
Raw Enterprise Healthcare
Claims Dataset Construction
We will
simulate a corrupted healthcare claims dataset containing deliberate
enterprise-grade data issues.
This is
not an employee table.
This is realistic production chaos.
SAS Raw Dataset Creation
data raw_claims;
/*LENGTH MUST COME FIRST*/
length patient_id $12 patient_name $40 region_code $10 email_address $60
diagnosis_grp $20 visit_status $20 bill_amount_chr $15
visit_date_chr $25 submit_ts $30;
infile datalines dlm='|' truncover;
informat visit_date_chr $25.;
format patient_name $40.;
input patient_id $ patient_name $ age region_code $ email_address $
diagnosis_grp $ bill_amount_chr $ visit_date_chr $ submit_ts $
visit_status $;
datalines;
P1001|john smith|45|us_east|john@gmail.com|CARDIO|12000|2025-01-15|2025-01-15 10:10:00|COMPLETE
P1002| SARAH LEE |150|Us_West|sarahgmail.com|ONCOLOGY|-4500|2025/02/01|2025-02-01 08:00|closed
P1003|mike ray|-3|NULL|mike@yahoo|NEURO|2300|15-03-2025|2025-03-15T11:22|OPEN
P1004|anita joseph|67|EU_CENTRAL|anita@gmail.com|cardio|9999999|NULL|2025-04-01 09:55:11|Complete
P1005|ROBERT KING|34|APAC |robert@@mail.com|ORTHO|3000.55|2025-05-10|bad_timestamp|complete
P1005|ROBERT KING|34|APAC |robert@@mail.com|ORTHO|3000.55|2025-05-10|bad_timestamp|complete
P1006|mary ann|.|US_EAST|mary@mail.com|NEURO|abc123|2025-06-20|2025-06-20 05:22:11|Open
P1007|david long|29|usa-east|david@mail|CARDIO|-900|2025-07-18|2025-07-18 15:00:00|closed
P1008|LISA WHITE|200|EUROPE|lisa@gmail.com|UNKNOWN|500|2025-08-01|2025-08-01 18:33:00|COMPLETE
P1009|kelly stone|54| US_WEST |kelly@gmail.com|ONCOLOGY|4500|2025-09-10|2025-09-10 22:15:44|complete
P1010|ravi kumar|38|APAC|ravi@gmail.com|CARDIO|0|2025-10-11|2025-10-11 06:30:00|pending
P1011|emma green|75|EU_CENTRAL|emma@gmail.com|ORTHO|-50|2025-11-15|2025-11-15 01:10:00|Closed
P1012|tony stark|48|US_EAST|tony@stark.com|CARDIO|100000|2025-12-01|2025-12-01 11:11:11|COMPLETE
P1013|bruce wayne|65|APAC|bruce#mail.com|ONCOLOGY|8500|2025-12-15|2025-12-15 09:09:09|COMPLETE
P1014|clark kent|42|NULL|clark@gmail.com|NEURO|7200|2025-12-20|2025-12-20 08:45:00|OPEN
;
run;
proc print data = raw_claims;
run;
OUTPUT:
| Obs | patient_id | patient_name | region_code | email_address | diagnosis_grp | visit_status | bill_amount_chr | visit_date_chr | submit_ts | age |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P1001 | john smith | us_east | john@gmail.com | CARDIO | COMPLETE | 12000 | 2025-01-15 | 2025-01-15 10:10:00 | 45 |
| 2 | P1002 | SARAH LEE | Us_West | sarahgmail.com | ONCOLOGY | closed | -4500 | 2025/02/01 | 2025-02-01 08:00 | 150 |
| 3 | P1003 | mike ray | NULL | mike@yahoo | NEURO | OPEN | 2300 | 15-03-2025 | 2025-03-15T11:22 | -3 |
| 4 | P1004 | anita joseph | EU_CENTRAL | anita@gmail.com | cardio | Complete | 9999999 | NULL | 2025-04-01 09:55:11 | 67 |
| 5 | P1005 | ROBERT KING | APAC | robert@@mail.com | ORTHO | complete | 3000.55 | 2025-05-10 | bad_timestamp | 34 |
| 6 | P1005 | ROBERT KING | APAC | robert@@mail.com | ORTHO | complete | 3000.55 | 2025-05-10 | bad_timestamp | 34 |
| 7 | P1006 | mary ann | US_EAST | mary@mail.com | NEURO | Open | abc123 | 2025-06-20 | 2025-06-20 05:22:11 | . |
| 8 | P1007 | david long | usa-east | david@mail | CARDIO | closed | -900 | 2025-07-18 | 2025-07-18 15:00:00 | 29 |
| 9 | P1008 | LISA WHITE | EUROPE | lisa@gmail.com | UNKNOWN | COMPLETE | 500 | 2025-08-01 | 2025-08-01 18:33:00 | 200 |
| 10 | P1009 | kelly stone | US_WEST | kelly@gmail.com | ONCOLOGY | complete | 4500 | 2025-09-10 | 2025-09-10 22:15:44 | 54 |
| 11 | P1010 | ravi kumar | APAC | ravi@gmail.com | CARDIO | pending | 0 | 2025-10-11 | 2025-10-11 06:30:00 | 38 |
| 12 | P1011 | emma green | EU_CENTRAL | emma@gmail.com | ORTHO | Closed | -50 | 2025-11-15 | 2025-11-15 01:10:00 | 75 |
| 13 | P1012 | tony stark | US_EAST | tony@stark.com | CARDIO | COMPLETE | 100000 | 2025-12-01 | 2025-12-01 11:11:11 | 48 |
| 14 | P1013 | bruce wayne | APAC | bruce#mail.com | ONCOLOGY | COMPLETE | 8500 | 2025-12-15 | 2025-12-15 09:09:09 | 65 |
| 15 | P1014 | clark kent | NULL | clark@gmail.com | NEURO | OPEN | 7200 | 2025-12-20 | 2025-12-20 08:45:00 | 42 |
Why LENGTH Must Come First
in SAS
This is
one of the most dangerous silent failures in SAS production systems.
In SAS,
character variable length is assigned the moment the variable is first created.
If you accidentally assign:
patient_name='John';
before
defining:
length patient_name $40;
then SAS
permanently allocates only 4 bytes.
Later:
patient_name='Christopher
Anderson';
becomes:
Chri
No error.
No failure.
No dramatic warning.
Just
silent truncation.
This is
called Character Truncation Risk, and in enterprise systems it destroys:
- patient identifiers
- treatment names
- audit comments
- vendor source values
- regulatory metadata
- JSON payload mappings
In
production, truncation spreads downstream into joins, reports, APIs, and
submission datasets.
R behaves
differently.
R stores
character vectors as dynamically allocated string objects internally. Strings
are not fixed-width memory allocations like SAS character variables. R
therefore avoids silent truncation by design. However, R introduces its own
enterprise risks:
- factor coercion
- encoding inconsistencies
- NA propagation
- type instability during
joins
- memory explosion with large
strings
SAS risks
truncation.
R risks uncontrolled flexibility.
Senior
programmers learn both failure modes.
Enterprise SAS Cleaning
Pipeline
1.Standardization & Type Conversion
data claims_stage1;
set raw_claims;
retain batch_id "CLAIM_BATCH_2026";
patient_name = propcase(strip(patient_name));
region_code = upcase(compress(region_code));
visit_status = upcase(strip(visit_status));
diagnosis_grp = upcase(strip(diagnosis_grp));
email_address = lowcase(strip(email_address));
bill_amount = input(compress(bill_amount_chr,,'kd'),best12.);
visit_date = input(visit_date_chr,anydtdte20.);
submit_datetime = input(submit_ts,anydtdtm30.);
format visit_date date9. submit_datetime datetime20.
bill_amount dollar12.2;
run;
proc print data = claims_stage1;
run;
OUTPUT:
| Obs | patient_id | patient_name | region_code | email_address | diagnosis_grp | visit_status | bill_amount_chr | visit_date_chr | submit_ts | age | batch_id | bill_amount | visit_date | submit_datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P1001 | John Smith | US_EAST | john@gmail.com | CARDIO | COMPLETE | 12000 | 2025-01-15 | 2025-01-15 10:10:00 | 45 | CLAIM_BATCH_2026 | $12,000.00 | 15JAN2025 | 15JAN2025:10:10:00 |
| 2 | P1002 | Sarah Lee | US_WEST | sarahgmail.com | ONCOLOGY | CLOSED | -4500 | 2025/02/01 | 2025-02-01 08:00 | 150 | CLAIM_BATCH_2026 | $4,500.00 | 01FEB2025 | 01FEB2025:08:00:00 |
| 3 | P1003 | Mike Ray | NULL | mike@yahoo | NEURO | OPEN | 2300 | 15-03-2025 | 2025-03-15T11:22 | -3 | CLAIM_BATCH_2026 | $2,300.00 | 15MAR2025 | . |
| 4 | P1004 | Anita Joseph | EU_CENTRAL | anita@gmail.com | CARDIO | COMPLETE | 9999999 | NULL | 2025-04-01 09:55:11 | 67 | CLAIM_BATCH_2026 | $9999999.00 | . | 01APR2025:09:55:11 |
| 5 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . |
| 6 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . |
| 7 | P1006 | Mary Ann | US_EAST | mary@mail.com | NEURO | OPEN | abc123 | 2025-06-20 | 2025-06-20 05:22:11 | . | CLAIM_BATCH_2026 | $123.00 | 20JUN2025 | 20JUN2025:05:22:11 |
| 8 | P1007 | David Long | USA-EAST | david@mail | CARDIO | CLOSED | -900 | 2025-07-18 | 2025-07-18 15:00:00 | 29 | CLAIM_BATCH_2026 | $900.00 | 18JUL2025 | 18JUL2025:15:00:00 |
| 9 | P1008 | Lisa White | EUROPE | lisa@gmail.com | UNKNOWN | COMPLETE | 500 | 2025-08-01 | 2025-08-01 18:33:00 | 200 | CLAIM_BATCH_2026 | $500.00 | 01AUG2025 | 01AUG2025:18:33:00 |
| 10 | P1009 | Kelly Stone | US_WEST | kelly@gmail.com | ONCOLOGY | COMPLETE | 4500 | 2025-09-10 | 2025-09-10 22:15:44 | 54 | CLAIM_BATCH_2026 | $4,500.00 | 10SEP2025 | 10SEP2025:22:15:44 |
| 11 | P1010 | Ravi Kumar | APAC | ravi@gmail.com | CARDIO | PENDING | 0 | 2025-10-11 | 2025-10-11 06:30:00 | 38 | CLAIM_BATCH_2026 | $0.00 | 11OCT2025 | 11OCT2025:06:30:00 |
| 12 | P1011 | Emma Green | EU_CENTRAL | emma@gmail.com | ORTHO | CLOSED | -50 | 2025-11-15 | 2025-11-15 01:10:00 | 75 | CLAIM_BATCH_2026 | $50.00 | 15NOV2025 | 15NOV2025:01:10:00 |
| 13 | P1012 | Tony Stark | US_EAST | tony@stark.com | CARDIO | COMPLETE | 100000 | 2025-12-01 | 2025-12-01 11:11:11 | 48 | CLAIM_BATCH_2026 | $100,000.00 | 01DEC2025 | 01DEC2025:11:11:11 |
| 14 | P1013 | Bruce Wayne | APAC | bruce#mail.com | ONCOLOGY | COMPLETE | 8500 | 2025-12-15 | 2025-12-15 09:09:09 | 65 | CLAIM_BATCH_2026 | $8,500.00 | 15DEC2025 | 15DEC2025:09:09:09 |
| 15 | P1014 | Clark Kent | NULL | clark@gmail.com | NEURO | OPEN | 7200 | 2025-12-20 | 2025-12-20 08:45:00 | 42 | CLAIM_BATCH_2026 | $7,200.00 | 20DEC2025 | 20DEC2025:08:45:00 |
Why This Matters in Production
This
stage is where raw operational chaos becomes machine-readable enterprise data.
Real production systems rarely agree on formatting conventions. One vendor
sends "Us_West", another sends " us_west ", and another
sends "USA-WEST". Unless standardized immediately, joins fail
silently and aggregated analytics fracture into false categories.
The
COMPRESS(...,'kd') trick is particularly important. It keeps only digits while
stripping corruption from mixed billing values like "abc123" or
"12,000USD".
The most
common production failure here is improper type conversion. Analysts often
apply INPUT() directly without validating malformed values. That causes mass
missing values, which later distort summaries and predictive models.
2.Advanced Enterprise Cleaning with Arrays, DO Loops & Validation
data claims_stage2;
set claims_stage1;
array char_fix(*) patient_name region_code diagnosis_grp visit_status;
do i = 1 to dim(char_fix);
char_fix(i)=tranwrd(char_fix(i),'NULL','UNKNOWN');
char_fix(i)=strip(char_fix(i));
end;
if age < 0 then age=.;
else if age > 110 then age=90;
if bill_amount < 0 then
bill_amount = abs(bill_amount);
if missing(visit_date) then
visit_date = intnx('day',today(),-30);
claim_age_days =intck('day',visit_date,
today());
run;
proc print data = claims_stage2;
run;
OUTPUT:
| Obs | patient_id | patient_name | region_code | email_address | diagnosis_grp | visit_status | bill_amount_chr | visit_date_chr | submit_ts | age | batch_id | bill_amount | visit_date | submit_datetime | i | claim_age_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P1001 | John Smith | US_EAST | john@gmail.com | CARDIO | COMPLETE | 12000 | 2025-01-15 | 2025-01-15 10:10:00 | 45 | CLAIM_BATCH_2026 | $12,000.00 | 15JAN2025 | 15JAN2025:10:10:00 | 5 | 499 |
| 2 | P1002 | Sarah Lee | US_WEST | sarahgmail.com | ONCOLOGY | CLOSED | -4500 | 2025/02/01 | 2025-02-01 08:00 | 90 | CLAIM_BATCH_2026 | $4,500.00 | 01FEB2025 | 01FEB2025:08:00:00 | 5 | 482 |
| 3 | P1003 | Mike Ray | UNKNOWN | mike@yahoo | NEURO | OPEN | 2300 | 15-03-2025 | 2025-03-15T11:22 | . | CLAIM_BATCH_2026 | $2,300.00 | 15MAR2025 | . | 5 | 440 |
| 4 | P1004 | Anita Joseph | EU_CENTRAL | anita@gmail.com | CARDIO | COMPLETE | 9999999 | NULL | 2025-04-01 09:55:11 | 67 | CLAIM_BATCH_2026 | $9999999.00 | 29APR2026 | 01APR2025:09:55:11 | 5 | 30 |
| 5 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . | 5 | 384 |
| 6 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . | 5 | 384 |
| 7 | P1006 | Mary Ann | US_EAST | mary@mail.com | NEURO | OPEN | abc123 | 2025-06-20 | 2025-06-20 05:22:11 | . | CLAIM_BATCH_2026 | $123.00 | 20JUN2025 | 20JUN2025:05:22:11 | 5 | 343 |
| 8 | P1007 | David Long | USA-EAST | david@mail | CARDIO | CLOSED | -900 | 2025-07-18 | 2025-07-18 15:00:00 | 29 | CLAIM_BATCH_2026 | $900.00 | 18JUL2025 | 18JUL2025:15:00:00 | 5 | 315 |
| 9 | P1008 | Lisa White | EUROPE | lisa@gmail.com | UNKNOWN | COMPLETE | 500 | 2025-08-01 | 2025-08-01 18:33:00 | 90 | CLAIM_BATCH_2026 | $500.00 | 01AUG2025 | 01AUG2025:18:33:00 | 5 | 301 |
| 10 | P1009 | Kelly Stone | US_WEST | kelly@gmail.com | ONCOLOGY | COMPLETE | 4500 | 2025-09-10 | 2025-09-10 22:15:44 | 54 | CLAIM_BATCH_2026 | $4,500.00 | 10SEP2025 | 10SEP2025:22:15:44 | 5 | 261 |
| 11 | P1010 | Ravi Kumar | APAC | ravi@gmail.com | CARDIO | PENDING | 0 | 2025-10-11 | 2025-10-11 06:30:00 | 38 | CLAIM_BATCH_2026 | $0.00 | 11OCT2025 | 11OCT2025:06:30:00 | 5 | 230 |
| 12 | P1011 | Emma Green | EU_CENTRAL | emma@gmail.com | ORTHO | CLOSED | -50 | 2025-11-15 | 2025-11-15 01:10:00 | 75 | CLAIM_BATCH_2026 | $50.00 | 15NOV2025 | 15NOV2025:01:10:00 | 5 | 195 |
| 13 | P1012 | Tony Stark | US_EAST | tony@stark.com | CARDIO | COMPLETE | 100000 | 2025-12-01 | 2025-12-01 11:11:11 | 48 | CLAIM_BATCH_2026 | $100,000.00 | 01DEC2025 | 01DEC2025:11:11:11 | 5 | 179 |
| 14 | P1013 | Bruce Wayne | APAC | bruce#mail.com | ONCOLOGY | COMPLETE | 8500 | 2025-12-15 | 2025-12-15 09:09:09 | 65 | CLAIM_BATCH_2026 | $8,500.00 | 15DEC2025 | 15DEC2025:09:09:09 | 5 | 165 |
| 15 | P1014 | Clark Kent | UNKNOWN | clark@gmail.com | NEURO | OPEN | 7200 | 2025-12-20 | 2025-12-20 08:45:00 | 42 | CLAIM_BATCH_2026 | $7,200.00 | 20DEC2025 | 20DEC2025:08:45:00 | 5 | 160 |
Why This Matters in Production
Enterprise
cleaning is about defensive programming. Arrays allow scalable corrections
across dozens or hundreds of fields without repetitive code. This becomes critical
in SDTM and ADaM environments where domains may contain hundreds of
standardized variables.
INTNX()
and INTCK() are essential enterprise date-engineering tools. Regulatory
timelines, aging metrics, claims latency, and exposure windows all depend on accurate
interval calculations.
The most
dangerous mistake here is careless imputation. Replacing missing dates without
preserving traceability violates regulatory expectations. Production-grade
systems must always preserve:
- original values
- derivation logic
- imputation reason
- audit metadata
3.Duplicate Detection Using FIRST./LAST.
proc sort data=claims_stage2;
by patient_id visit_date;
run;
proc print data = claims_stage2;
run;
OUTPUT:
| Obs | patient_id | patient_name | region_code | email_address | diagnosis_grp | visit_status | bill_amount_chr | visit_date_chr | submit_ts | age | batch_id | bill_amount | visit_date | submit_datetime | i | claim_age_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P1001 | John Smith | US_EAST | john@gmail.com | CARDIO | COMPLETE | 12000 | 2025-01-15 | 2025-01-15 10:10:00 | 45 | CLAIM_BATCH_2026 | $12,000.00 | 15JAN2025 | 15JAN2025:10:10:00 | 5 | 499 |
| 2 | P1002 | Sarah Lee | US_WEST | sarahgmail.com | ONCOLOGY | CLOSED | -4500 | 2025/02/01 | 2025-02-01 08:00 | 90 | CLAIM_BATCH_2026 | $4,500.00 | 01FEB2025 | 01FEB2025:08:00:00 | 5 | 482 |
| 3 | P1003 | Mike Ray | UNKNOWN | mike@yahoo | NEURO | OPEN | 2300 | 15-03-2025 | 2025-03-15T11:22 | . | CLAIM_BATCH_2026 | $2,300.00 | 15MAR2025 | . | 5 | 440 |
| 4 | P1004 | Anita Joseph | EU_CENTRAL | anita@gmail.com | CARDIO | COMPLETE | 9999999 | NULL | 2025-04-01 09:55:11 | 67 | CLAIM_BATCH_2026 | $9999999.00 | 29APR2026 | 01APR2025:09:55:11 | 5 | 30 |
| 5 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . | 5 | 384 |
| 6 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . | 5 | 384 |
| 7 | P1006 | Mary Ann | US_EAST | mary@mail.com | NEURO | OPEN | abc123 | 2025-06-20 | 2025-06-20 05:22:11 | . | CLAIM_BATCH_2026 | $123.00 | 20JUN2025 | 20JUN2025:05:22:11 | 5 | 343 |
| 8 | P1007 | David Long | USA-EAST | david@mail | CARDIO | CLOSED | -900 | 2025-07-18 | 2025-07-18 15:00:00 | 29 | CLAIM_BATCH_2026 | $900.00 | 18JUL2025 | 18JUL2025:15:00:00 | 5 | 315 |
| 9 | P1008 | Lisa White | EUROPE | lisa@gmail.com | UNKNOWN | COMPLETE | 500 | 2025-08-01 | 2025-08-01 18:33:00 | 90 | CLAIM_BATCH_2026 | $500.00 | 01AUG2025 | 01AUG2025:18:33:00 | 5 | 301 |
| 10 | P1009 | Kelly Stone | US_WEST | kelly@gmail.com | ONCOLOGY | COMPLETE | 4500 | 2025-09-10 | 2025-09-10 22:15:44 | 54 | CLAIM_BATCH_2026 | $4,500.00 | 10SEP2025 | 10SEP2025:22:15:44 | 5 | 261 |
| 11 | P1010 | Ravi Kumar | APAC | ravi@gmail.com | CARDIO | PENDING | 0 | 2025-10-11 | 2025-10-11 06:30:00 | 38 | CLAIM_BATCH_2026 | $0.00 | 11OCT2025 | 11OCT2025:06:30:00 | 5 | 230 |
| 12 | P1011 | Emma Green | EU_CENTRAL | emma@gmail.com | ORTHO | CLOSED | -50 | 2025-11-15 | 2025-11-15 01:10:00 | 75 | CLAIM_BATCH_2026 | $50.00 | 15NOV2025 | 15NOV2025:01:10:00 | 5 | 195 |
| 13 | P1012 | Tony Stark | US_EAST | tony@stark.com | CARDIO | COMPLETE | 100000 | 2025-12-01 | 2025-12-01 11:11:11 | 48 | CLAIM_BATCH_2026 | $100,000.00 | 01DEC2025 | 01DEC2025:11:11:11 | 5 | 179 |
| 14 | P1013 | Bruce Wayne | APAC | bruce#mail.com | ONCOLOGY | COMPLETE | 8500 | 2025-12-15 | 2025-12-15 09:09:09 | 65 | CLAIM_BATCH_2026 | $8,500.00 | 15DEC2025 | 15DEC2025:09:09:09 | 5 | 165 |
| 15 | P1014 | Clark Kent | UNKNOWN | clark@gmail.com | NEURO | OPEN | 7200 | 2025-12-20 | 2025-12-20 08:45:00 | 42 | CLAIM_BATCH_2026 | $7,200.00 | 20DEC2025 | 20DEC2025:08:45:00 | 5 | 160 |
data duplicate_claims
clean_claims;
set claims_stage2;
by patient_id;
if first.patient_id and last.patient_id then
output clean_claims;
else output duplicate_claims;
run;
proc print data = clean_claims;
run;
OUTPUT:
| Obs | patient_id | patient_name | region_code | email_address | diagnosis_grp | visit_status | bill_amount_chr | visit_date_chr | submit_ts | age | batch_id | bill_amount | visit_date | submit_datetime | i | claim_age_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P1001 | John Smith | US_EAST | john@gmail.com | CARDIO | COMPLETE | 12000 | 2025-01-15 | 2025-01-15 10:10:00 | 45 | CLAIM_BATCH_2026 | $12,000.00 | 15JAN2025 | 15JAN2025:10:10:00 | 5 | 499 |
| 2 | P1002 | Sarah Lee | US_WEST | sarahgmail.com | ONCOLOGY | CLOSED | -4500 | 2025/02/01 | 2025-02-01 08:00 | 90 | CLAIM_BATCH_2026 | $4,500.00 | 01FEB2025 | 01FEB2025:08:00:00 | 5 | 482 |
| 3 | P1003 | Mike Ray | UNKNOWN | mike@yahoo | NEURO | OPEN | 2300 | 15-03-2025 | 2025-03-15T11:22 | . | CLAIM_BATCH_2026 | $2,300.00 | 15MAR2025 | . | 5 | 440 |
| 4 | P1004 | Anita Joseph | EU_CENTRAL | anita@gmail.com | CARDIO | COMPLETE | 9999999 | NULL | 2025-04-01 09:55:11 | 67 | CLAIM_BATCH_2026 | $9999999.00 | 29APR2026 | 01APR2025:09:55:11 | 5 | 30 |
| 5 | P1006 | Mary Ann | US_EAST | mary@mail.com | NEURO | OPEN | abc123 | 2025-06-20 | 2025-06-20 05:22:11 | . | CLAIM_BATCH_2026 | $123.00 | 20JUN2025 | 20JUN2025:05:22:11 | 5 | 343 |
| 6 | P1007 | David Long | USA-EAST | david@mail | CARDIO | CLOSED | -900 | 2025-07-18 | 2025-07-18 15:00:00 | 29 | CLAIM_BATCH_2026 | $900.00 | 18JUL2025 | 18JUL2025:15:00:00 | 5 | 315 |
| 7 | P1008 | Lisa White | EUROPE | lisa@gmail.com | UNKNOWN | COMPLETE | 500 | 2025-08-01 | 2025-08-01 18:33:00 | 90 | CLAIM_BATCH_2026 | $500.00 | 01AUG2025 | 01AUG2025:18:33:00 | 5 | 301 |
| 8 | P1009 | Kelly Stone | US_WEST | kelly@gmail.com | ONCOLOGY | COMPLETE | 4500 | 2025-09-10 | 2025-09-10 22:15:44 | 54 | CLAIM_BATCH_2026 | $4,500.00 | 10SEP2025 | 10SEP2025:22:15:44 | 5 | 261 |
| 9 | P1010 | Ravi Kumar | APAC | ravi@gmail.com | CARDIO | PENDING | 0 | 2025-10-11 | 2025-10-11 06:30:00 | 38 | CLAIM_BATCH_2026 | $0.00 | 11OCT2025 | 11OCT2025:06:30:00 | 5 | 230 |
| 10 | P1011 | Emma Green | EU_CENTRAL | emma@gmail.com | ORTHO | CLOSED | -50 | 2025-11-15 | 2025-11-15 01:10:00 | 75 | CLAIM_BATCH_2026 | $50.00 | 15NOV2025 | 15NOV2025:01:10:00 | 5 | 195 |
| 11 | P1012 | Tony Stark | US_EAST | tony@stark.com | CARDIO | COMPLETE | 100000 | 2025-12-01 | 2025-12-01 11:11:11 | 48 | CLAIM_BATCH_2026 | $100,000.00 | 01DEC2025 | 01DEC2025:11:11:11 | 5 | 179 |
| 12 | P1013 | Bruce Wayne | APAC | bruce#mail.com | ONCOLOGY | COMPLETE | 8500 | 2025-12-15 | 2025-12-15 09:09:09 | 65 | CLAIM_BATCH_2026 | $8,500.00 | 15DEC2025 | 15DEC2025:09:09:09 | 5 | 165 |
| 13 | P1014 | Clark Kent | UNKNOWN | clark@gmail.com | NEURO | OPEN | 7200 | 2025-12-20 | 2025-12-20 08:45:00 | 42 | CLAIM_BATCH_2026 | $7,200.00 | 20DEC2025 | 20DEC2025:08:45:00 | 5 | 160 |
proc print data = duplicate_claims;
run;
OUTPUT:
| Obs | patient_id | patient_name | region_code | email_address | diagnosis_grp | visit_status | bill_amount_chr | visit_date_chr | submit_ts | age | batch_id | bill_amount | visit_date | submit_datetime | i | claim_age_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . | 5 | 384 |
| 2 | P1005 | Robert King | APAC | robert@@mail.com | ORTHO | COMPLETE | 3000.55 | 2025-05-10 | bad_timestamp | 34 | CLAIM_BATCH_2026 | $300,055.00 | 10MAY2025 | . | 5 | 384 |
Why This Matters in Production
Duplicate
logic is rarely simple.
A patient
may legitimately appear multiple times across visits, but duplicate IDs with
identical timestamps may indicate feed replication errors. FIRST./LAST.
processing gives precise control over group-level logic.
A common
production failure is sorting incorrectly before BY-group processing. If the
dataset is not sorted identically to the BY statement, SAS produces unreliable
grouping behavior that may silently corrupt deduplication logic.
In
banking and insurance systems, this can:
- duplicate payments
- suppress fraud alerts
- inflate revenue
- distort exposure metrics
4.PROC FORMAT for Standardization
proc format;
value $regionfmt 'US_EAST'='United States East'
'US_WEST'='United States West'
'APAC'='Asia Pacific'
'EU_CENTRAL'='Europe Central'
other='Unknown Region';
run;
LOG:
Why This Matters in Production
Formats
separate business presentation logic from raw storage logic. That distinction
becomes critical in enterprise reporting environments. Instead of permanently
overwriting source values, formats provide controlled semantic interpretation.
The
common mistake is embedding business mappings directly into DATA steps
repeatedly. That creates governance nightmares during updates.
5.PROC SQL Validation
proc sql;
create table invalid_emails as
select patient_id,
region_code format=$regionfmt.,
email_address
from claims_stage2
where email_address not like '%@%.%';
quit;
proc print data = invalid_emails;
run;
OUTPUT:
| Obs | patient_id | region_code | email_address |
|---|---|---|---|
| 1 | P1002 | United States West | sarahgmail.com |
| 2 | P1003 | Unknown Region | mike@yahoo |
| 3 | P1007 | Unknown Region | david@mail |
| 4 | P1013 | Asia Pacific | bruce#mail.com |
Why This Matters in Production
PROC SQL
excels at declarative validation logic. SQL-based exception tables are widely
used in enterprise reconciliation pipelines because they create auditable issue
inventories.
The most
common failure is assuming SQL validation alone guarantees clean data. SQL
finds problems. It does not explain business intent.
R Version — Cleaning Workflow
Raw Dataset Creation
library(tidyverse)
library(lubridate)
library(janitor)
raw_claims <- tibble(
patient_id = c(
"P1001","P1002","P1003","P1004","P1005",
"P1005","P1006","P1007","P1008","P1009",
"P1010","P1011","P1012","P1013","P1014"),
patient_name = c(
"john smith"," SARAH LEE ","mike ray",
"anita joseph","ROBERT KING",
"ROBERT KING","mary ann","david long",
"LISA WHITE","kelly stone",
"ravi kumar","emma green",
"tony stark","bruce wayne","clark kent"),
age = c(45,150,-3,67,34,34,NA,29,200,54,38,75,48,65,42),
region_code = c(
"us_east","Us_West","NULL","EU_CENTRAL",
"APAC ","APAC ","US_EAST","usa-east",
"EUROPE"," US_WEST ","APAC","EU_CENTRAL",
"US_EAST","APAC","NULL")
)
OUTPUT:
|
|
patient_id |
patient_name |
age |
region_code |
|
1 |
P1001 |
john smith |
45 |
us_east |
|
2 |
P1002 |
SARAH LEE |
150 |
Us_West |
|
3 |
P1003 |
mike ray |
-3 |
NULL |
|
4 |
P1004 |
anita joseph |
67 |
EU_CENTRAL |
|
5 |
P1005 |
ROBERT KING |
34 |
APAC |
|
6 |
P1005 |
ROBERT KING |
34 |
APAC |
|
7 |
P1006 |
mary ann |
NA |
US_EAST |
|
8 |
P1007 |
david long |
29 |
usa-east |
|
9 |
P1008 |
LISA WHITE |
200 |
EUROPE |
|
10 |
P1009 |
kelly stone |
54 |
US_WEST |
|
11 |
P1010 |
ravi kumar |
38 |
APAC |
|
12 |
P1011 |
emma green |
75 |
EU_CENTRAL |
|
13 |
P1012 |
tony stark |
48 |
US_EAST |
|
14 |
P1013 |
bruce wayne |
65 |
APAC |
|
15 |
P1014 |
clark kent |
42 |
NULL |
SAS vs R Translation
|
SAS |
R |
|
DATA
step |
tibble() |
|
SET
dataset |
pipe
%>% |
|
STRIP() |
str_trim() |
|
UPCASE() |
str_to_upper() |
|
PROPCASE() |
str_to_title() |
|
COMPRESS() |
str_replace_all() |
|
INPUT() |
as.numeric()
/ parsers |
|
NMISS() |
is.na() |
|
CATX() |
unite() |
|
COALESCEC() |
coalesce() |
R Cleaning Pipeline
claims_clean <- raw_claims %>%
clean_names() %>%
mutate(patient_name =patient_name %>%
str_trim() %>%
str_to_title(),
region_code =region_code %>%
str_trim() %>%
str_to_upper(),
age =case_when(age < 0 ~ NA_real_,
age > 110 ~ 90,
TRUE ~ age)
) %>%
distinct(patient_id, .keep_all = TRUE)
OUTPUT:
|
|
patient_id |
patient_name |
age |
region_code |
|
1 |
P1001 |
John Smith |
45 |
US_EAST |
|
2 |
P1002 |
Sarah Lee |
90 |
US_WEST |
|
3 |
P1003 |
Mike Ray |
NA |
NULL |
|
4 |
P1004 |
Anita Joseph |
67 |
EU_CENTRAL |
|
5 |
P1005 |
Robert King |
34 |
APAC |
|
6 |
P1006 |
Mary Ann |
NA |
US_EAST |
|
7 |
P1007 |
David Long |
29 |
USA-EAST |
|
8 |
P1008 |
Lisa White |
90 |
EUROPE |
|
9 |
P1009 |
Kelly Stone |
54 |
US_WEST |
|
10 |
P1010 |
Ravi Kumar |
38 |
APAC |
|
11 |
P1011 |
Emma Green |
75 |
EU_CENTRAL |
|
12 |
P1012 |
Tony Stark |
48 |
US_EAST |
|
13 |
P1013 |
Bruce Wayne |
65 |
APAC |
|
14 |
P1014 |
Clark Kent |
42 |
NULL |
Why This Matters in Production
mutate()
in R behaves similarly to iterative DATA step transformation logic in SAS, but
R pipelines are more functionally expressive and composable.
case_when()
is essentially R’s enterprise equivalent of SAS SELECT-WHEN or chained
IF-THEN/ELSE logic.
The
common enterprise failure in R is accidental type coercion. If one branch
returns character while another returns numeric, the entire vector may become
character silently.
distinct() in R
Keeps the first occurrence by default.
Enterprise Validation &
Regulatory Compliance
In
regulated environments like clinical trials, cleaning is not cosmetic.
It is
compliance-critical.
Standards
like CDISC SDTM and ADaM require:
- traceable derivations
- reproducible transformations
- metadata-controlled variable
lineage
- QC independence
- auditable programming logic
- deterministic outputs
One of
SAS’s most dangerous default behaviors is this:
Missing Numeric Values Sort Lower Than All Valid
Numbers
In SAS:
. < 0 < 1 < 100
That
means missing values appear FIRST during sorting.
This
creates catastrophic analytical errors.
Example:
proc sort data=claims;
by bill_amount;
run;
If
analysts later select:
if first.bill_amount;
they may
accidentally process missing financial records before legitimate low-dollar
claims.
This has
caused:
- false outlier detection
- broken risk scoring
- corrupted baseline
calculations
- invalid TLF summaries
Defensive
programmers explicitly handle missing values before sorting.
Always.
Business Logic Behind
Enterprise Cleaning
Data
cleaning is not about “making datasets look pretty.”
It is
about preserving business truth.
Suppose
patient age is recorded as 150. Technically, the system accepted the value. But
biologically and clinically, it is impossible. If untreated, survival models,
demographic summaries, and dosing analyses become statistically distorted.
Correcting or flagging these values preserves analytical integrity.
The same
applies to salary normalization in banking or insurance systems. If one system
records salary annually and another monthly, predictive risk models become
inconsistent. A loan approval model may falsely classify high-risk applicants
because the feature scale is corrupted.
Missing-date
imputation is even more dangerous. Imagine a treatment start date missing in a
clinical trial. Downstream exposure calculations fail. Adverse event windows
become inaccurate. Regulators may question safety conclusions.
Text
normalization matters too. "US_EAST", "us-east", and
"US EAST" look similar to humans but represent entirely different
categories to software. Failed joins create phantom regions, fractured
reporting, and duplicate business entities.
Enterprise
cleaning is therefore not “cleanup.”
It is
controlled reconstruction of operational reality.
20 Enterprise Data Cleaning
Best Practices
- Always define LENGTH before
assignment in SAS.
- Separate raw, staging, and
production layers.
- Never overwrite source data
directly.
- Preserve audit columns for
all derivations.
- Standardize date formats immediately.
- Centralize validation rules
in macros/functions.
- Use metadata-driven
programming whenever possible.
- QC independently from
production code.
- Validate joins with record
counts.
- Detect duplicate keys before
merges.
- Normalize text before
matching.
- Use defensive missing-value
logic.
- Document imputation
assumptions.
- Separate business logic from
display formatting.
- Create exception datasets
for rejected records.
- Validate ranges against
domain standards.
- Build reusable enterprise
macros.
- Version-control production
code.
- Automate regression
validation testing.
- Never trust vendor feeds
blindly.
20 Sharp One-Liner Insights
- Dirty data rarely crashes
systems immediately it corrupts trust slowly.
- Missing values in SAS aren't
empty they're quietly the smallest number in your dataset.
- Duplicate IDs don’t duplicate
rows they duplicate business risk.
- Every malformed timestamp is
a future reconciliation meeting.
- Production bugs love silent
truncation.
- A failed join is usually a
text-standardization problem wearing a different costume.
- Data lineage is compliance
insurance.
- Bad source systems create
honest dashboards that still tell lies.
- Macros scale logic and scale
mistakes.
- Whitespace is invisible
until it breaks a million-dollar merge.
- Most AI problems begin as
data engineering problems.
- Regulators trust traceability
more than clever code.
- A dashboard is only as truthful
as its weakest variable.
- Character encoding issues are
globalization’s hidden tax.
- Every manual correction
needs an audit trail.
- Production programming is
defensive programming.
- NULL is not a business
definition.
- Standardization creates
analytical gravity.
- Fast code without validation
is automated chaos.
- Cleaning data is
reconstructing reality from operational noise.
SAS vs R — Enterprise
Comparison
|
Area |
SAS |
R |
|
Auditability |
Extremely
strong |
Requires
governance discipline |
|
Regulatory
Acceptance |
Gold
standard in pharma |
Growing
rapidly |
|
Flexibility |
Structured |
Highly
flexible |
|
Performance |
Excellent
on enterprise grids |
Excellent
with optimization |
|
Memory
Model |
Fixed-width |
Dynamic |
|
Traceability |
Native
enterprise orientation |
Must be
engineered |
|
Macro
Automation |
Mature |
More
fragmented |
|
Visualization |
Moderate |
Excellent |
|
Learning
Curve |
Procedural |
Functional
+ procedural |
|
Enterprise
Deployment |
Extremely
mature |
Expanding
rapidly |
Production Validation
Checklist
[ ] Duplicate keys validated
[ ] Missing values profiled
[ ] Date formats standardized
[ ] Character lengths reviewed
[ ] Invalid categories flagged
[ ] Joins reconciled
[ ] Record counts validated
[ ] QC independent review
completed
[ ] Audit trail generated
[ ] Exception records archived
[ ] Imputation documented
[ ] Metadata synchronized
[ ] Derived variables traceable
[ ] Output reproducibility
confirmed
SAS vs R — Enterprise
Cleaning Summary
SAS
remains dominant in heavily regulated industries because its architecture was
designed around repeatability, traceability, procedural determinism, and
enterprise governance. DATA step processing provides row-level control that
remains unmatched for certain operational workloads. PROC-based reporting
ecosystems integrate naturally into submission pipelines, making SAS
exceptionally strong for SDTM, ADaM, and TLF production.
R,
however, excels in flexibility, modern data science integration, visualization,
package extensibility, and functional programming workflows. The tidyverse
ecosystem enables elegant transformation logic that is often more concise and
expressive than traditional SAS implementations. R also integrates naturally
with machine learning frameworks and modern cloud-native analytics stacks.
In
practice, elite enterprise teams increasingly use both.
SAS
handles:
- validated pipelines
- regulatory submissions
- traceable transformations
- production governance
R
handles:
- exploratory analytics
- advanced modeling
- visualization
- scalable statistical
experimentation
The
smartest programmers are bilingual.
Not
because tools matter more than thinking.
But
because different systems solve different business risks.
Conclusion
Organizations
spend millions building dashboards, AI systems, predictive models, cloud
platforms, and executive analytics ecosystems.
Yet most
catastrophic failures still begin with something embarrassingly small:
- a malformed date
- a duplicate identifier
- a missing value
- a broken join
- a truncated character field
Enterprise
data cleaning is therefore not a support activity.
It is
foundational infrastructure.
A
hospital cannot trust patient-risk models built on corrupted source systems. A
bank cannot trust fraud analytics polluted by inconsistent transaction logic.
An insurance company cannot trust reserve calculations built from malformed
claims feeds. A pharmaceutical company cannot trust safety conclusions derived
from duplicated subjects or invalid treatment dates.
The irony
is brutal:
the most sophisticated analytics systems in the world are still vulnerable to
the smallest unvalidated variable.
This is
why mature organizations build structured cleaning frameworks:
- standardized staging layers
- metadata-driven validation
- enterprise macros
- audit trails
- exception handling
- reproducible pipelines
- QC independence
- lineage-aware
transformations
The goal
is not perfection.
The goal
is controlled reliability.
Good
programmers write code that runs.
Senior
programmers write code that survives production.
Enterprise
programmers write systems that preserve trust when everything upstream becomes
chaotic.
And that
difference changes careers.
Interview Questions & Answers:
1. Duplicate Patient IDs Inflated Enrollment Counts
Question:
A
regulatory submission failed because duplicate patient IDs inflated enrollment
metrics. How would you investigate?
Answer:
First, I
would validate whether the duplicates are true duplicates or legitimate repeat
visits. I’d profile uniqueness using PROC SORT NODUPKEY, PROC FREQ, and SQL
COUNT DISTINCT comparisons. Then I’d examine source lineage to identify whether
the issue originated during ingestion, merging, or upstream vendor mapping. I
would compare timestamps, site IDs, and treatment sequences to reconstruct the
duplication path. Finally, I’d build exception datasets and validate corrected
denominators against prior frozen snapshots before rerunning TLFs.
2. Missing Numeric Values Corrupted Risk Scores
Question:
A fraud
model unexpectedly prioritized missing claims as lowest risk. What happened?
Answer:
In SAS,
missing numeric values sort lower than all valid numbers. If ranking logic or
FIRST./LAST. processing was applied after sorting, missing claims may have been
incorrectly interpreted as minimal-risk observations. I would inspect sorting
behavior, verify missing handling rules, and implement explicit conditional
logic to separate missing from legitimate low values before scoring.
3. Macro Suddenly Failed After Deployment
Question:
A
reusable macro worked in DEV but failed in PROD. How would you debug it?
Answer:
I’d first
check environmental dependencies: library assignments, encoding differences,
macro variable resolution, permissions, and metadata references. Then I’d
enable MPRINT, SYMBOLGEN, and MLOGIC to trace execution flow. Most production
macro failures stem from assumptions about variable existence, data structure
consistency, or unresolved macro variables introduced by environment
differences.
4. SAS and R Outputs Don’t Match
Question:
You
generated the same summary in SAS and R but totals differ slightly. What do you
do?
Answer:
I’d
investigate type conversion, rounding behavior, missing-value handling,
factor/category mappings, and join cardinality. SAS and R often differ in
implicit conversions and NA treatment. I’d validate row counts after every
transformation stage and reconcile discrepancies incrementally rather than
comparing only final outputs.
5. Date Imputation Triggered Regulatory Concern
Question:
An
auditor questioned your missing-date imputation strategy. How do you defend it?
Answer:
I’d
provide documented business rules, SAP references, metadata lineage, derivation
traceability, and validation evidence. I’d show both original and imputed
values alongside rationale and demonstrate that the imputation method was predefined,
consistently applied, and independently QC’d. Regulators care less about
whether imputation exists and more about whether it is controlled, transparent,
and reproducible.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BUSINESS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment