Human Eaters, Hidden Errors & High-Risk Analytics: Enterprise SAS PROC SQL vs DATA Step Cleaning Frameworks Explained
Human Eaters in World Data
into Trusted Enterprise Intelligence Using Advanced SAS (PROC SQL vs DATA Step)
and Modern R Engineering Frameworks
INTRODUCTION:
In
enterprise analytics, dirty data is not just an inconvenience it is a silent
operational disaster. As Clinical SAS Programmers and Data Scientists, we often
inherit datasets that resemble chaos more than structured intelligence. Imagine
a global investigation dataset named “HUMAN EATERS IN WORLD”, designed
to track criminal incidents, forensic observations, victim patterns,
psychological classifications, regional activity, and investigation timelines.
Now
imagine this dataset feeding dashboards, predictive AI engines, regulatory
reports, law-enforcement surveillance systems, and executive decision-making
platforms.
One
corrupted variable can destroy analytical credibility.
One
duplicate subject ID can invalidate regulatory submissions.
One
malformed date can crash survival analysis.
One
missing region code can misclassify geographic risk.
That is
exactly why enterprise-grade data cleaning matters.
Business Crisis Scenario:
When Corrupted Data Creates Operational Disaster
A
multinational forensic intelligence organization launched a predictive
behavioral analytics system to identify high-risk criminal zones associated
with “Human Eaters” investigations globally.
However,
the raw operational data contained:
- Duplicate Case IDs
- Missing incident dates
- Negative victim counts
- Impossible ages
- Mixed case region names
- Corrupted timestamps
- Invalid emails
- NULL strings
- Whitespace contamination
- Mixed numeric-character
variables
As a
result:
- AI prediction engines
produced false risk zones
- Dashboards showed incorrect
death statistics
- Executive reports
misclassified criminal severity
- Compliance teams failed
audit traceability checks
- Regional intelligence
reports contradicted source systems
This is a
classic enterprise data-governance failure.
Dirty
data destroys trust.
Raw Enterprise Dataset
Creation in SAS
Below is
a deliberately corrupted dataset with 20+ observations and 9 variables.
Variables
|
Variable |
Description |
|
CASE_ID |
Investigation
ID |
|
REGION |
Geographic
zone |
|
EATER_NAME |
Suspect
Name |
|
AGE |
Suspect
age |
|
VICTIMS |
Number
of victims |
|
INCIDENT_DATE |
Investigation
date |
|
RISK_LEVEL |
Severity
category |
|
CONTACT_EMAIL |
Investigator
email |
|
STATUS |
Investigation
status |
SAS Raw Dataset with Intentional Errors
data human_eaters_raw;
length CASE_ID $12 REGION $20 EATER_NAME $40 RISK_LEVEL $15
CONTACT_EMAIL $50 STATUS $20;
informat INCIDENT_DATE anydtdte20.;
format INCIDENT_DATE date9.;
infile datalines dsd dlm='|';
input CASE_ID :$12. REGION :$20. EATER_NAME :$40. AGE
VICTIMS INCIDENT_DATE :anydtdte20. RISK_LEVEL :$15. CONTACT_EMAIL :$50.
STATUS :$20.;
datalines;
HE001|Asia|Cannibal_King|45|12|12JAN2025|HIGH|kinggmail.com|OPEN
HE002|europe|Flesh_Hunter|-5|8|.|Medium|hunter@mail.com|CLOSED
HE003|AFRICA|Bone_Eater|120|-3|25FEB2025|HIGH|eater@domain|OPEN
HE003|AFRICA|Bone_Eater|120|-3|25FEB2025|HIGH|eater@domain|OPEN
HE004|asia|NULL|33|4|31APR2025|LOW|null.com|ACTIVE
HE005|EUROPE|Night_Feaster|28|0|15MAY2025|MEDIUM|test@@mail.com|OPEN
HE006|Africa|Skull_Collector|.|14|17JUN2025|HIGH|skullmail.com|PENDING
HE007|Asia|WhiteSpace___|52|5|.|HIGH|white@mail|OPEN
HE008|Europe|Red_Hunter|41|9|12JUL2025|CRITICAL|red@mail.com|CLOSED
HE009|africa|Bone Lord|38|2|invaliddate|LOW|bone@mail.com|ACTIVE
HE010|ASIA|Dark_Eater|44|7|22AUG2025|medium|dark@domain.com|OPEN
HE011|Europe|Ghost_Eater|200|11|30SEP2025|HIGH|ghostmail.com|OPEN
HE012|AFRICA|Blood_Drinker|35|.|10OCT2025|MEDIUM|blood@mail.com|CLOSED
HE013|Asia|Cannibal_Zero|29|-8|14NOV2025|LOW|zero@mail.com|OPEN
HE014|Europe|Alpha_Biter|60|15|19DEC2025|HIGH|alpha@@mail.com|ACTIVE
HE015|Africa|Tribal_Eater|48|5|01JAN2026|CRITICAL|tribal@mail.com|OPEN
HE016|Asia|Hunter_X|31|6|.|HIGH|hunterx.com|CLOSED
HE017|Europe|Silent_Eater|39|3|11FEB2026|LOW|silent@mail|ACTIVE
HE018|AFRICA|Dark_Shadow|42|1|14MAR2026|MEDIUM|shadow@mail.com|OPEN
HE019|Asia|Unknown|0|2|20APR2026|HIGH|unknown@mail.com|PENDING
HE020|Europe|Cannibal_X|55|9|25MAY2026|HIGH|canni@mail.com|CLOSED
;
run;
proc print data=human_eaters_raw;
run;
OUTPUT:
| Obs | CASE_ID | REGION | EATER_NAME | RISK_LEVEL | CONTACT_EMAIL | STATUS | INCIDENT_DATE | AGE | VICTIMS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HE001 | Asia | Cannibal_King | HIGH | kinggmail.com | OPEN | 12JAN2025 | 45 | 12 |
| 2 | HE002 | europe | Flesh_Hunter | Medium | hunter@mail.com | CLOSED | . | -5 | 8 |
| 3 | HE003 | AFRICA | Bone_Eater | HIGH | eater@domain | OPEN | 25FEB2025 | 120 | -3 |
| 4 | HE003 | AFRICA | Bone_Eater | HIGH | eater@domain | OPEN | 25FEB2025 | 120 | -3 |
| 5 | HE004 | asia | NULL | LOW | null.com | ACTIVE | . | 33 | 4 |
| 6 | HE005 | EUROPE | Night_Feaster | MEDIUM | test@@mail.com | OPEN | 15MAY2025 | 28 | 0 |
| 7 | HE006 | Africa | Skull_Collector | HIGH | skullmail.com | PENDING | 17JUN2025 | . | 14 |
| 8 | HE007 | Asia | WhiteSpace___ | HIGH | white@mail | OPEN | . | 52 | 5 |
| 9 | HE008 | Europe | Red_Hunter | CRITICAL | red@mail.com | CLOSED | 12JUL2025 | 41 | 9 |
| 10 | HE009 | africa | Bone Lord | LOW | bone@mail.com | ACTIVE | . | 38 | 2 |
| 11 | HE010 | ASIA | Dark_Eater | medium | dark@domain.com | OPEN | 22AUG2025 | 44 | 7 |
| 12 | HE011 | Europe | Ghost_Eater | HIGH | ghostmail.com | OPEN | 30SEP2025 | 200 | 11 |
| 13 | HE012 | AFRICA | Blood_Drinker | MEDIUM | blood@mail.com | CLOSED | 10OCT2025 | 35 | . |
| 14 | HE013 | Asia | Cannibal_Zero | LOW | zero@mail.com | OPEN | 14NOV2025 | 29 | -8 |
| 15 | HE014 | Europe | Alpha_Biter | HIGH | alpha@@mail.com | ACTIVE | 19DEC2025 | 60 | 15 |
| 16 | HE015 | Africa | Tribal_Eater | CRITICAL | tribal@mail.com | OPEN | 01JAN2026 | 48 | 5 |
| 17 | HE016 | Asia | Hunter_X | HIGH | hunterx.com | CLOSED | . | 31 | 6 |
| 18 | HE017 | Europe | Silent_Eater | LOW | silent@mail | ACTIVE | 11FEB2026 | 39 | 3 |
| 19 | HE018 | AFRICA | Dark_Shadow | MEDIUM | shadow@mail.com | OPEN | 14MAR2026 | 42 | 1 |
| 20 | HE019 | Asia | Unknown | HIGH | unknown@mail.com | PENDING | 20APR2026 | 0 | 2 |
| 21 | HE020 | Europe | Cannibal_X | HIGH | canni@mail.com | CLOSED | 25MAY2026 | 55 | 9 |
1. DSD
infile datalines dsd
tells SAS
to:
- properly handle delimiters
- recognize missing values
- safely read character values
2. dlm='|'
Pipe
delimiter prevents whitespace corruption.
3. Colon Modifier (:)
AGE
INCIDENT_DATE :anydtdte20.
The colon
modifier allows flexible reading.
Very
important for:
- invalid dates
- variable-length fields
- clinical raw files
Why LENGTH Statements
Matter in SAS
One of
the biggest interview questions in Advanced SAS revolves around character
truncation risk.
In SAS,
if LENGTH is not declared before assignment:
name="Alexander";
SAS
automatically assigns variable length based on first encounter.
Later:
name="AlexanderTheGreat";
gets
truncated silently.
That is
extremely dangerous in SDTM, ADaM, banking, insurance, and healthcare
environments because truncation corrupts traceability.
R behaves
differently because character vectors dynamically allocate memory rather than
fixed-length storage.
SAS Cleaning Workflow Using
DATA Step Engineering
Standardization & Validation
data human_eaters_clean;
set human_eaters_raw;
REGION = propcase(strip(REGION));
EATER_NAME = tranwrd(EATER_NAME,"_"," ");
RISK_LEVEL = upcase(RISK_LEVEL);
STATUS = lowcase(strip(STATUS));
CONTACT_EMAIL = compress(CONTACT_EMAIL);
if AGE < 18 or AGE > 100 then AGE = .;
VICTIMS = abs(VICTIMS);
if missing(INCIDENT_DATE) then
INCIDENT_DATE = intnx('day',today(),-30);
if find(CONTACT_EMAIL,'@') = 0 then
CONTACT_EMAIL='invalid@mail.com';
if RISK_LEVEL not in ('LOW','MEDIUM','HIGH','CRITICAL')
then RISK_LEVEL='UNKNOWN';
run;
proc print data=human_eaters_clean;
run;
OUTPUT:
| Obs | CASE_ID | REGION | EATER_NAME | RISK_LEVEL | CONTACT_EMAIL | STATUS | INCIDENT_DATE | AGE | VICTIMS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HE001 | Asia | Cannibal King | HIGH | invalid@mail.com | open | 12JAN2025 | 45 | 12 |
| 2 | HE002 | Europe | Flesh Hunter | MEDIUM | hunter@mail.com | closed | 26APR2026 | . | 8 |
| 3 | HE003 | Africa | Bone Eater | HIGH | eater@domain | open | 25FEB2025 | . | 3 |
| 4 | HE003 | Africa | Bone Eater | HIGH | eater@domain | open | 25FEB2025 | . | 3 |
| 5 | HE004 | Asia | NULL | LOW | invalid@mail.com | active | 26APR2026 | 33 | 4 |
| 6 | HE005 | Europe | Night Feaster | MEDIUM | test@@mail.com | open | 15MAY2025 | 28 | 0 |
| 7 | HE006 | Africa | Skull Collector | HIGH | invalid@mail.com | pending | 17JUN2025 | . | 14 |
| 8 | HE007 | Asia | WhiteSpace | HIGH | white@mail | open | 26APR2026 | 52 | 5 |
| 9 | HE008 | Europe | Red Hunter | CRITICAL | red@mail.com | closed | 12JUL2025 | 41 | 9 |
| 10 | HE009 | Africa | Bone Lord | LOW | bone@mail.com | active | 26APR2026 | 38 | 2 |
| 11 | HE010 | Asia | Dark Eater | MEDIUM | dark@domain.com | open | 22AUG2025 | 44 | 7 |
| 12 | HE011 | Europe | Ghost Eater | HIGH | invalid@mail.com | open | 30SEP2025 | . | 11 |
| 13 | HE012 | Africa | Blood Drinker | MEDIUM | blood@mail.com | closed | 10OCT2025 | 35 | . |
| 14 | HE013 | Asia | Cannibal Zero | LOW | zero@mail.com | open | 14NOV2025 | 29 | 8 |
| 15 | HE014 | Europe | Alpha Biter | HIGH | alpha@@mail.com | active | 19DEC2025 | 60 | 15 |
| 16 | HE015 | Africa | Tribal Eater | CRITICAL | tribal@mail.com | open | 01JAN2026 | 48 | 5 |
| 17 | HE016 | Asia | Hunter X | HIGH | invalid@mail.com | closed | 26APR2026 | 31 | 6 |
| 18 | HE017 | Europe | Silent Eater | LOW | silent@mail | active | 11FEB2026 | 39 | 3 |
| 19 | HE018 | Africa | Dark Shadow | MEDIUM | shadow@mail.com | open | 14MAR2026 | 42 | 1 |
| 20 | HE019 | Asia | Unknown | HIGH | unknown@mail.com | pending | 20APR2026 | . | 2 |
| 21 | HE020 | Europe | Cannibal X | HIGH | canni@mail.com | closed | 25MAY2026 | 55 | 9 |
Explanation
This DATA
step demonstrates enterprise cleaning logic frequently used in clinical trials
and operational analytics.
- PROPCASE standardizes region
naming
- TRANWRD removes underscore
corruption
- ABS fixes negative victim
counts
- INTNX imputes missing dates
- FIND validates email
structure
- COMPRESS removes whitespace
contamination
- Invalid ages are converted
to missing values
This
mirrors real-world SDTM validation where unrealistic ages or impossible dates
must be corrected before FDA submission.
Removing Duplicates Using PROC SORT
proc sort data=human_eaters_clean
out=dedup_data nodupkey;
by CASE_ID;
run;
LOG:
proc print data=dedup_data;
run;
OUTPUT:
| Obs | CASE_ID | REGION | EATER_NAME | RISK_LEVEL | CONTACT_EMAIL | STATUS | INCIDENT_DATE | AGE | VICTIMS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HE001 | Asia | Cannibal King | HIGH | invalid@mail.com | open | 12JAN2025 | 45 | 12 |
| 2 | HE002 | Europe | Flesh Hunter | MEDIUM | hunter@mail.com | closed | 26APR2026 | . | 8 |
| 3 | HE003 | Africa | Bone Eater | HIGH | eater@domain | open | 25FEB2025 | . | 3 |
| 4 | HE004 | Asia | NULL | LOW | invalid@mail.com | active | 26APR2026 | 33 | 4 |
| 5 | HE005 | Europe | Night Feaster | MEDIUM | test@@mail.com | open | 15MAY2025 | 28 | 0 |
| 6 | HE006 | Africa | Skull Collector | HIGH | invalid@mail.com | pending | 17JUN2025 | . | 14 |
| 7 | HE007 | Asia | WhiteSpace | HIGH | white@mail | open | 26APR2026 | 52 | 5 |
| 8 | HE008 | Europe | Red Hunter | CRITICAL | red@mail.com | closed | 12JUL2025 | 41 | 9 |
| 9 | HE009 | Africa | Bone Lord | LOW | bone@mail.com | active | 26APR2026 | 38 | 2 |
| 10 | HE010 | Asia | Dark Eater | MEDIUM | dark@domain.com | open | 22AUG2025 | 44 | 7 |
| 11 | HE011 | Europe | Ghost Eater | HIGH | invalid@mail.com | open | 30SEP2025 | . | 11 |
| 12 | HE012 | Africa | Blood Drinker | MEDIUM | blood@mail.com | closed | 10OCT2025 | 35 | . |
| 13 | HE013 | Asia | Cannibal Zero | LOW | zero@mail.com | open | 14NOV2025 | 29 | 8 |
| 14 | HE014 | Europe | Alpha Biter | HIGH | alpha@@mail.com | active | 19DEC2025 | 60 | 15 |
| 15 | HE015 | Africa | Tribal Eater | CRITICAL | tribal@mail.com | open | 01JAN2026 | 48 | 5 |
| 16 | HE016 | Asia | Hunter X | HIGH | invalid@mail.com | closed | 26APR2026 | 31 | 6 |
| 17 | HE017 | Europe | Silent Eater | LOW | silent@mail | active | 11FEB2026 | 39 | 3 |
| 18 | HE018 | Africa | Dark Shadow | MEDIUM | shadow@mail.com | open | 14MAR2026 | 42 | 1 |
| 19 | HE019 | Asia | Unknown | HIGH | unknown@mail.com | pending | 20APR2026 | . | 2 |
| 20 | HE020 | Europe | Cannibal X | HIGH | canni@mail.com | closed | 25MAY2026 | 55 | 9 |
Explanation
PROC SORT
NODUPKEY is one of the most important deduplication mechanisms in SAS
production systems.
Duplicate
IDs can destroy:
- patient counts,
- adverse event summaries,
- risk calculations,
- fraud analysis,
- executive dashboards.
This step
ensures CASE_ID uniqueness before downstream reporting.
PROC SQL vs DATA Step Merge
PROC SQL Join
proc sql;
create table risk_summary as
select REGION,count(*) as TOTAL_CASES,
mean(VICTIMS) as AVG_VICTIMS,
max(AGE) as MAX_AGE
from dedup_data
group by REGION;
quit;
proc print data=risk_summary;
run;
OUTPUT:
| Obs | REGION | TOTAL_CASES | AVG_VICTIMS | MAX_AGE |
|---|---|---|---|---|
| 1 | Africa | 6 | 5.00000 | 48 |
| 2 | Asia | 7 | 6.28571 | 52 |
| 3 | Europe | 7 | 7.85714 | 60 |
DATA Step Alternative
proc summary data=dedup_data nway;
class REGION;
var VICTIMS AGE;
output out=summary_data mean=AVG_VICTIMS
max=MAX_AGE
n=TOTAL_CASES;
run;
proc print data=summary_data;
run;
OUTPUT:
| Obs | REGION | _TYPE_ | _FREQ_ | AVG_VICTIMS | MAX_AGE | TOTAL_CASES |
|---|---|---|---|---|---|---|
| 1 | Africa | 1 | 6 | 5.00000 | 14 | 5 |
| 2 | Asia | 1 | 7 | 6.28571 | 12 | 7 |
| 3 | Europe | 1 | 7 | 7.85714 | 15 | 7 |
Explanation
This is a
classic interview topic:
|
PROC
SQL |
DATA
Step / PROC SUMMARY |
|
Better
for joins |
Better
for sequential processing |
|
Easier
syntax |
Faster
for large datasets |
|
Database-style
logic |
SAS-native
optimization |
|
Flexible
grouping |
Excellent
aggregation performance |
Experienced
SAS programmers must know when each method is optimal.
Advanced DATA Step Engineering Techniques
ARRAYS + DO LOOP
data quality_flags;
set dedup_data;
array chars(*) REGION EATER_NAME
RISK_LEVEL STATUS;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
end;
if cmiss(of _all_) > 0 then FLAG='MISSING';
drop i;
run;
proc print data=quality_flags;
run;
OUTPUT:
| Obs | CASE_ID | REGION | EATER_NAME | RISK_LEVEL | CONTACT_EMAIL | STATUS | INCIDENT_DATE | AGE | VICTIMS | FLAG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HE001 | Asia | Cannibal King | HIGH | invalid@mail.com | open | 12JAN2025 | 45 | 12 | |
| 2 | HE002 | Europe | Flesh Hunter | MEDIUM | hunter@mail.com | closed | 26APR2026 | . | 8 | MISSING |
| 3 | HE003 | Africa | Bone Eater | HIGH | eater@domain | open | 25FEB2025 | . | 3 | MISSING |
| 4 | HE004 | Asia | NULL | LOW | invalid@mail.com | active | 26APR2026 | 33 | 4 | |
| 5 | HE005 | Europe | Night Feaster | MEDIUM | test@@mail.com | open | 15MAY2025 | 28 | 0 | |
| 6 | HE006 | Africa | Skull Collector | HIGH | invalid@mail.com | pending | 17JUN2025 | . | 14 | MISSING |
| 7 | HE007 | Asia | WhiteSpace | HIGH | white@mail | open | 26APR2026 | 52 | 5 | |
| 8 | HE008 | Europe | Red Hunter | CRITICAL | red@mail.com | closed | 12JUL2025 | 41 | 9 | |
| 9 | HE009 | Africa | Bone Lord | LOW | bone@mail.com | active | 26APR2026 | 38 | 2 | |
| 10 | HE010 | Asia | Dark Eater | MEDIUM | dark@domain.com | open | 22AUG2025 | 44 | 7 | |
| 11 | HE011 | Europe | Ghost Eater | HIGH | invalid@mail.com | open | 30SEP2025 | . | 11 | MISSING |
| 12 | HE012 | Africa | Blood Drinker | MEDIUM | blood@mail.com | closed | 10OCT2025 | 35 | . | MISSING |
| 13 | HE013 | Asia | Cannibal Zero | LOW | zero@mail.com | open | 14NOV2025 | 29 | 8 | |
| 14 | HE014 | Europe | Alpha Biter | HIGH | alpha@@mail.com | active | 19DEC2025 | 60 | 15 | |
| 15 | HE015 | Africa | Tribal Eater | CRITICAL | tribal@mail.com | open | 01JAN2026 | 48 | 5 | |
| 16 | HE016 | Asia | Hunter X | HIGH | invalid@mail.com | closed | 26APR2026 | 31 | 6 | |
| 17 | HE017 | Europe | Silent Eater | LOW | silent@mail | active | 11FEB2026 | 39 | 3 | |
| 18 | HE018 | Africa | Dark Shadow | MEDIUM | shadow@mail.com | open | 14MAR2026 | 42 | 1 | |
| 19 | HE019 | Asia | Unknown | HIGH | unknown@mail.com | pending | 20APR2026 | . | 2 | MISSING |
| 20 | HE020 | Europe | Cannibal X | HIGH | canni@mail.com | closed | 25MAY2026 | 55 | 9 |
Explanation
Arrays
simplify repetitive cleaning operations.
Instead
of writing multiple STRIP statements manually, arrays allow scalable enterprise
cleaning logic. This becomes critical in SDTM domains containing hundreds of
variables.
FIRST./LAST. Processing
proc sort data=dedup_data;
by REGION;
run;
proc print data=dedup_data;
run;
OUTPUT:
| Obs | CASE_ID | REGION | EATER_NAME | RISK_LEVEL | CONTACT_EMAIL | STATUS | INCIDENT_DATE | AGE | VICTIMS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HE003 | Africa | Bone Eater | HIGH | eater@domain | open | 25FEB2025 | . | 3 |
| 2 | HE006 | Africa | Skull Collector | HIGH | invalid@mail.com | pending | 17JUN2025 | . | 14 |
| 3 | HE009 | Africa | Bone Lord | LOW | bone@mail.com | active | 26APR2026 | 38 | 2 |
| 4 | HE012 | Africa | Blood Drinker | MEDIUM | blood@mail.com | closed | 10OCT2025 | 35 | . |
| 5 | HE015 | Africa | Tribal Eater | CRITICAL | tribal@mail.com | open | 01JAN2026 | 48 | 5 |
| 6 | HE018 | Africa | Dark Shadow | MEDIUM | shadow@mail.com | open | 14MAR2026 | 42 | 1 |
| 7 | HE001 | Asia | Cannibal King | HIGH | invalid@mail.com | open | 12JAN2025 | 45 | 12 |
| 8 | HE004 | Asia | NULL | LOW | invalid@mail.com | active | 26APR2026 | 33 | 4 |
| 9 | HE007 | Asia | WhiteSpace | HIGH | white@mail | open | 26APR2026 | 52 | 5 |
| 10 | HE010 | Asia | Dark Eater | MEDIUM | dark@domain.com | open | 22AUG2025 | 44 | 7 |
| 11 | HE013 | Asia | Cannibal Zero | LOW | zero@mail.com | open | 14NOV2025 | 29 | 8 |
| 12 | HE016 | Asia | Hunter X | HIGH | invalid@mail.com | closed | 26APR2026 | 31 | 6 |
| 13 | HE019 | Asia | Unknown | HIGH | unknown@mail.com | pending | 20APR2026 | . | 2 |
| 14 | HE002 | Europe | Flesh Hunter | MEDIUM | hunter@mail.com | closed | 26APR2026 | . | 8 |
| 15 | HE005 | Europe | Night Feaster | MEDIUM | test@@mail.com | open | 15MAY2025 | 28 | 0 |
| 16 | HE008 | Europe | Red Hunter | CRITICAL | red@mail.com | closed | 12JUL2025 | 41 | 9 |
| 17 | HE011 | Europe | Ghost Eater | HIGH | invalid@mail.com | open | 30SEP2025 | . | 11 |
| 18 | HE014 | Europe | Alpha Biter | HIGH | alpha@@mail.com | active | 19DEC2025 | 60 | 15 |
| 19 | HE017 | Europe | Silent Eater | LOW | silent@mail | active | 11FEB2026 | 39 | 3 |
| 20 | HE020 | Europe | Cannibal X | HIGH | canni@mail.com | closed | 25MAY2026 | 55 | 9 |
data regional_totals;
set dedup_data;
by REGION;
retain TOTAL_VICTIMS 0;
TOTAL_VICTIMS + VICTIMS;
if last.REGION then output;
run;
proc print data=regional_totals;
run;
OUTPUT:
| Obs | CASE_ID | REGION | EATER_NAME | RISK_LEVEL | CONTACT_EMAIL | STATUS | INCIDENT_DATE | AGE | VICTIMS | TOTAL_VICTIMS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HE018 | Africa | Dark Shadow | MEDIUM | shadow@mail.com | open | 14MAR2026 | 42 | 1 | 25 |
| 2 | HE019 | Asia | Unknown | HIGH | unknown@mail.com | pending | 20APR2026 | . | 2 | 69 |
| 3 | HE020 | Europe | Cannibal X | HIGH | canni@mail.com | closed | 25MAY2026 | 55 | 9 | 124 |
Explanation
FIRST.
and LAST. processing is heavily used in:
- patient visit tracking,
- exposure accumulation,
- financial rollups,
- insurance claims.
The RETAIN
statement preserves values across observations, enabling cumulative
calculations.
PROC FORMAT for Enterprise Reporting
proc format;
value $riskfmt 'LOW'=1
'MEDIUM'=2
'HIGH'=3
'CRITICAL'=3;
run;
LOG:
Explanation
Formats
standardize presentation layers without altering source data. This is crucial
in:
- regulatory submissions,
- executive dashboards,
- SDTM controlled terminology.
Professional Reporting Using PROC REPORT
proc report data=dedup_data nowd;
column REGION RISK_LEVEL VICTIMS AGE;
define REGION / group;
define RISK_LEVEL / order format=$riskfmt.;
define VICTIMS / analysis mean;
define AGE / analysis mean;
run;
OUTPUT:
| REGION | RISK_LEVEL | VICTIMS | AGE |
|---|---|---|---|
| Africa | 1 | 2 | 38 |
| 2 | . | 35 | |
| 1 | 42 | ||
| 3 | 3 | . | |
| 14 | . | ||
| 5 | 48 | ||
| Asia | 1 | 4 | 33 |
| 8 | 29 | ||
| 2 | 7 | 44 | |
| 3 | 12 | 45 | |
| 5 | 52 | ||
| 6 | 31 | ||
| 2 | . | ||
| Europe | 1 | 3 | 39 |
| 2 | 8 | . | |
| 0 | 28 | ||
| 3 | 9 | 41 | |
| 11 | . | ||
| 15 | 60 | ||
| 9 | 55 |
Explanation
PROC
REPORT produces highly customized enterprise-ready outputs. Unlike PROC PRINT,
it supports grouping, computed columns, summaries, and advanced layouts
required in pharmaceutical reporting.
Reusable SAS Macro Framework
%macro nullcheck(ds,var);
proc freq data=&ds;
tables &var / missing;
run;
%mend;
%nullcheck(dedup_data,REGION);
OUTPUT:
The FREQ Procedure
| REGION | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Africa | 6 | 30.00 | 6 | 30.00 |
| Asia | 7 | 35.00 | 13 | 65.00 |
| Europe | 7 | 35.00 | 20 | 100.00 |
Explanation
Macros
improve:
- reusability,
- standardization,
- automation,
- audit readiness.
Large
pharmaceutical companies heavily depend on macro frameworks for SDTM and ADaM
generation.
R Data Cleaning Workflow
Raw Dataset in R
library(tidyverse)
library(lubridate)
library(janitor)
human_raw <- tibble(
CASE_ID = c("HE001","HE002","HE003"),
REGION = c("asia"," EUROPE ","AFRICA"),
AGE = c(45,-5,120),
VICTIMS = c(12,-3,5),
EMAIL = c("wrongmail","test@@mail.com","good@mail.com")
)
OUTPUT:
|
|
CASE_ID |
REGION |
AGE |
VICTIMS |
EMAIL |
|
1 |
HE001 |
asia |
45 |
12 |
wrongmail |
|
2 |
HE002 |
EUROPE |
-5 |
-3 |
test@@mail.com |
|
3 |
HE003 |
AFRICA |
120 |
5 |
good@mail.com |
R Cleaning Layer
human_clean <- human_raw %>%
clean_names() %>%
mutate(region = str_to_title(str_trim(region)),
victims = abs(victims),
age = if_else(age < 18 | age > 100,
NA_real_,as.numeric(age)),
email = if_else(grepl("@",email),
email,"invalid@mail.com")
)
OUTPUT:
|
|
case_id |
region |
age |
victims |
email |
|
1 |
HE001 |
Asia |
45 |
12 |
invalid@mail.com |
|
2 |
HE002 |
Europe |
NA |
3 |
test@@mail.com |
|
3 |
HE003 |
Africa |
NA |
5 |
good@mail.com |
Explanation
This R
workflow mirrors SAS cleaning logic:
- mutate() ≈ DATA step
assignments
- if_else() ≈ IF-THEN
- str_trim() ≈ STRIP
- grepl() ≈ FIND/INDEX
- clean_names() standardizes
metadata
R offers
flexible string handling and pipeline readability, while SAS offers stronger
governance and auditability.
Enterprise Validation &
Compliance
In
clinical trials, incorrect cleaning logic can invalidate entire submissions.
Key Compliance Risks
SDTM/ADaM Traceability
Every
transformed variable must trace back to source CRF data.
Missing Value Danger in SAS
In SAS:
. < 0 < 1
Missing
numeric values are treated lower than valid numbers.
This
creates catastrophic errors if programmers write:
if lab_value < 5 then
flag='LOW';
Missing
values also become LOW unintentionally.
Correct
logic:
if not missing(lab_value) and
lab_value < 5 then flag='LOW';
This is a
critical interview concept.
20 Enterprise Data-Cleaning
Best Practices
- Standardize metadata early
- Validate variable types
- Remove duplicates before
joins
- Audit all imputations
- Preserve raw datasets
- Use reusable macros
- Maintain data lineage
- Validate ranges aggressively
- Normalize text fields
- Apply controlled terminology
- Document assumptions
- Avoid hardcoded logic
- Use version-controlled
programs
- Perform independent QC
- Validate joins carefully
- Flag missing critical
variables
- Use defensive programming
- Track derivation
traceability
- Test edge-case records
- Build reproducible workflows
Business Logic Behind
Cleaning Decisions
Enterprise
cleaning logic exists because operational systems rarely produce analytically
perfect data. In healthcare environments, missing patient ages may occur
because of incomplete electronic data capture systems. Analysts cannot simply
ignore those records because downstream demographic summaries and statistical
models depend on accurate age distributions. Therefore, unrealistic ages such
as 0, negative values, or 200 years are converted to missing values for
investigation or imputation.
Similarly,
missing visit dates affect treatment exposure calculations, adverse-event
timelines, and regulatory compliance. Standardizing dates using INTNX or
controlled imputations ensures continuity of longitudinal analytics.
Text
normalization is equally important. Mixed-case region names such as “asia,”
“ASIA,” and “Asia” appear identical to humans but are treated differently by
analytical engines. Standardization improves grouping accuracy and dashboard
reliability.
Malformed
emails disrupt notification systems and operational communication workflows.
Removing whitespace corruption and validating email structure prevents system
failures.
Negative
financial or victim counts also require correction because they distort
statistical summaries and AI prediction models. Using functions like ABS()
restores logical consistency while maintaining traceability.
Ultimately,
enterprise cleaning transforms operational chaos into trusted intelligence
capable of supporting compliance, reporting, forecasting, and executive
decision-making.
20 Sharp One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Duplicate IDs silently
corrupt analytics.
- Missing dates destroy
timeline integrity.
- Controlled terminology
improves consistency.
- PROC SQL simplifies
relational logic.
- DATA step excels in
sequential engineering.
- SAS macros improve
scalability.
- R pipelines improve
readability.
- Whitespace corruption breaks
joins.
- Standardized variables
improve reproducibility.
- Missing values require
explicit handling.
- Enterprise reporting depends
on trusted inputs.
- Auditability matters more
than shortcuts.
- Metadata governance prevents
confusion.
- Defensive programming
reduces production failures.
- Data lineage supports
regulatory trust.
- Clean data improves AI
reliability.
- Validation failures often
begin with poor ingestion.
- Good analytics starts with
disciplined cleaning.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Very
High |
Growing |
|
Flexibility |
Strong |
Extremely
Strong |
|
Performance |
Excellent |
Good |
|
Visualization |
Moderate |
Excellent |
|
Enterprise
Governance |
Excellent |
Moderate |
|
Open
Source |
No |
Yes |
|
Statistical
Ecosystem |
Mature |
Massive |
Summary
SAS and R
together create a powerful enterprise-grade analytics ecosystem. SAS dominates
highly regulated industries because of its structured architecture, validation
controls, reproducibility, and audit-friendly workflows. Clinical trial
environments especially depend on SAS for SDTM, ADaM, TLF generation, and
FDA-compliant reporting. Features such as PROC SQL, DATA step engineering, PROC
REPORT, macros, and metadata-driven programming make SAS exceptionally reliable
for large-scale operational analytics.
R,
however, provides remarkable flexibility and modern data-science capabilities.
Packages like tidyverse, stringr, lubridate, janitor, and purrr enable elegant
and highly readable transformation pipelines. R excels in exploratory
analytics, AI integration, visualization, and rapid prototyping.
In
enterprise cleaning workflows, the combination is powerful:
- SAS ensures governance and
compliance
- R enhances agility and
advanced analytics
The best
organizations do not treat SAS and R as competitors. They integrate both
strategically.
Modern
analytics demands:
- scalable cleaning,
- transparent derivations,
- reproducible pipelines,
- and trustworthy outputs.
Without
structured cleaning frameworks, dashboards become misleading, AI becomes
unreliable, and executive decisions become dangerous. Clean data is not just
technical hygiene it is enterprise survival.
Conclusion
The
“HUMAN EATERS IN WORLD” project demonstrates a realistic enterprise
data-engineering challenge where corrupted operational records threaten
analytical trust, reporting accuracy, and business intelligence reliability. In
real-world pharmaceutical, banking, insurance, retail, and forensic
environments, raw datasets almost never arrive in perfect condition. They
contain duplicate identifiers, invalid dates, inconsistent text values,
malformed emails, impossible numeric ranges, missing variables, and structural
inconsistencies that can severely damage downstream analytics.
Through
advanced SAS programming techniques such as DATA step engineering, PROC SQL,
PROC SUMMARY, PROC REPORT, ARRAYS, RETAIN logic, FIRST./LAST. processing, PROC
FORMAT, PROC SORT NODUPKEY, and reusable MACROS, organizations can
systematically transform unreliable raw data into production-grade analytical
intelligence. At the same time, modern R frameworks like tidyverse, stringr,
lubridate, janitor, and purrr provide flexible and highly readable
transformation pipelines that accelerate exploratory analysis and modern
data-science workflows.
One of
the most critical lessons is that enterprise cleaning is not merely cosmetic
formatting. It is about preserving analytical integrity, ensuring auditability,
supporting traceability, enabling regulatory compliance, and protecting
executive decision-making from hidden corruption. Even small issues such as
whitespace contamination or missing numeric values can produce catastrophic
downstream consequences when improperly handled.
Experienced
SAS programmers understand that successful analytics begins long before
modeling or reporting. It begins with disciplined ingestion, defensive
validation logic, metadata governance, standardized transformations, and
reproducible workflows. Clean data creates trustworthy intelligence.
Trustworthy intelligence drives reliable business outcomes.
In
today’s enterprise analytics ecosystem, SAS and R together provide one of the
strongest combinations for scalable, compliant, and production-ready data engineering.
Interview Questions &
Answers
1. How would you handle duplicate patient IDs in
SAS?
Answer:
I would first identify duplicates using PROC SORT NODUPKEY or PROC SQL GROUP BY
HAVING COUNT(*) > 1. Then I would investigate source-system lineage before
removing records. In regulated environments, deletion must be traceable and
documented.
2. Why is missing-value handling dangerous in SAS?
Answer:
SAS treats missing numeric values as smaller than valid numbers. Therefore:
if value < 5;
includes
missing values unintentionally. Proper defensive logic requires:
if not missing(value) and value
< 5;
3. When would you prefer DATA Step over PROC SQL?
Answer:
DATA step is preferred for sequential processing, row-wise transformations,
FIRST./LAST. logic, RETAIN calculations, and large-scale optimized processing.
PROC SQL is preferred for joins and relational aggregations.
4. How do you validate cleaned datasets in
enterprise environments?
Answer:
I compare record counts, variable attributes, summary statistics, missing-value
patterns, and derived outputs between raw and cleaned datasets. Independent QC
programming is also essential for regulatory compliance.
5. How does R complement SAS in enterprise
analytics?
Answer:
SAS provides governance, auditability, and regulatory reliability, while R
offers advanced visualization, flexible pipelines, machine learning
integration, and rapid exploratory analysis. Together they create scalable
enterprise intelligence workflows.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about HUMAN EATERS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment