Digital Doppelgängers, Missing Values & Validation Nightmares: Engineering Production-Ready Lookalike Analytics Using SAS DATA Step, PROC SQL and R
Global Lookalike Persons Dataset into Analysis-Ready Enterprise Analytics Using SAS (PROC SQL vs DATA Step) and Modern R Pipelines
Introduction
In
enterprise analytics, bad data is never “just a small issue.” A single
corrupted value can quietly destroy an entire reporting ecosystem. Imagine a
global intelligence organization studying “lookalike persons” individuals across countries who resemble
celebrities, politicians, athletes, or public figures. Now imagine duplicate
identities entering the system, impossible ages being assigned, malformed
timestamps corrupting event histories, and inconsistent region codes confusing
downstream dashboards.
Suddenly,
AI facial-recognition systems begin matching the wrong people. Fraud-detection
algorithms fail. Clinical enrollment systems incorrectly identify duplicate
patients. Executive dashboards show misleading counts. Regulatory submissions
fail QC validation.
This is
not theoretical.
As experienced
Clinical SAS Programmers and Data Scientists know, poor-quality datasets create
catastrophic downstream consequences:
- Incorrect SDTM subject
counts
- Failed ADaM derivations
- Misleading TLF outputs
- Broken audit trails
- Invalid statistical
conclusions
- Regulatory rejection risks
- Corrupted machine-learning
predictions
This
project demonstrates how to transform a deliberately corrupted Global
Lookalike Persons Dataset into reliable analytical intelligence using both SAS
and R with enterprise-grade cleaning logic.
Business Crisis Scenario:
The Duplicate Identity Disaster
A
multinational digital identity company launched a facial similarity analytics
platform used by:
- immigration departments,
- entertainment agencies,
- banking KYC systems,
- and healthcare identity validation
teams.
Within
weeks, executives noticed severe reporting discrepancies:
|
Problem |
Impact |
|
Duplicate
Person IDs |
Incorrect
identity counts |
|
Invalid
DOBs |
Age
derivation failures |
|
Missing
region codes |
Dashboard
segmentation errors |
|
Negative
transaction values |
Revenue
corruption |
|
Invalid
timestamps |
Timeline
analysis failure |
|
Mixed
uppercase/lowercase names |
Duplicate
matching failures |
|
NULL
strings |
AI
preprocessing errors |
|
Malformed
emails |
Notification
delivery failures |
The
organization needed a complete enterprise cleaning framework using SAS and R.
Raw Corrupted Dataset
Structure
Variables
|
Variable |
Description |
|
PERSON_ID |
Unique
identity ID |
|
PERSON_NAME |
Lookalike
person name |
|
COUNTRY |
Country |
|
AGE |
Person
age |
|
LOOKALIKE_OF |
Celebrity
resemblance |
|
REGION_CODE |
Geographic
code |
|
CONTACT_EMAIL |
Email |
|
EVENT_DATE |
Observation
date |
|
PAYMENT_AMOUNT |
Verification
fee |
1.SAS Raw Dataset Creation with Intentional Errors
DATA lookalike_raw;
LENGTH PERSON_ID $12 PERSON_NAME $40 COUNTRY $20 LOOKALIKE_OF $40
REGION_CODE $10 CONTACT_EMAIL $60 EVENT_DATE_RAW $30;
INFILE DATALINES DLM='|' MISSOVER;
INPUT PERSON_ID $ PERSON_NAME $ COUNTRY $ AGE
LOOKALIKE_OF $ REGION_CODE $ CONTACT_EMAIL $
EVENT_DATE_RAW $ PAYMENT_AMOUNT;
DATALINES;
L001|john_smith|usa|25|Tom_Cruise|na|johnmail.com|2025-01-10|200
L002| maria lee |India|-45|Angelina_Jolie|AP|maria@gmail.com|2025-02-30|-500
L003|NULL|UK|200|Brad_Pitt|eu|brad@@mail.com|invaliddate|1000
L004|DavidM|USA|35|SRK|NA|david@gmail.com|2025-03-15|700
L004|DavidM|USA|35|SRK|NA|david@gmail.com|2025-03-15|700
L005|Anita |india|.|Deepika|ap|anita@gmail|2025-04-10|350
L006|Chris_R|Canada|29|Leonardo|ca|NULL|2025-13-01|450
L007| Fatima |UAE|150|Aamir_Khan|me|fatima@mail.com|2025-05-11|-800
L008|Wei Chen|China|31|Jackie_Chan|as|wei@gmail.com|2025-06-05|900
L009|NULL|USA|40|Robert_Downey|NA|ironman.com|2025-07-09|600
;
RUN;
PROC PRINT DATA=lookalike_raw;
RUN;
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | EVENT_DATE_RAW | AGE | PAYMENT_AMOUNT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | L001 | john_smith | usa | Tom_Cruise | na | johnmail.com | 2025-01-10 | 25 | 200 |
| 2 | L002 | maria lee | India | Angelina_Jolie | AP | maria@gmail.com | 2025-02-30 | -45 | -500 |
| 3 | L003 | NULL | UK | Brad_Pitt | eu | brad@@mail.com | invaliddate | 200 | 1000 |
| 4 | L004 | DavidM | USA | SRK | NA | david@gmail.com | 2025-03-15 | 35 | 700 |
| 5 | L004 | DavidM | USA | SRK | NA | david@gmail.com | 2025-03-15 | 35 | 700 |
| 6 | L005 | Anita | india | Deepika | ap | anita@gmail | 2025-04-10 | . | 350 |
| 7 | L006 | Chris_R | Canada | Leonardo | ca | NULL | 2025-13-01 | 29 | 450 |
| 8 | L007 | Fatima | UAE | Aamir_Khan | me | fatima@mail.com | 2025-05-11 | 150 | -800 |
| 9 | L008 | Wei Chen | China | Jackie_Chan | as | wei@gmail.com | 2025-06-05 | 31 | 900 |
| 10 | L009 | NULL | USA | Robert_Downey | NA | ironman.com | 2025-07-09 | 40 | 600 |
Why LENGTH Must Appear
First in SAS
One of
the most misunderstood SAS concepts is character truncation risk.
If you
assign values before defining LENGTH, SAS permanently fixes variable length
based on the first assignment.
Example:
data test;
name='JonathanChristopher';
length name $5;
run;
The
variable becomes truncated because SAS already allocated memory.
Correct
approach:
data test;
length name $40;
name='JonathanChristopher';
run;
Unlike
SAS, R dynamically manages character vectors without fixed-width allocation.
SAS requires proactive metadata control for enterprise consistency.
Enterprise SAS Cleaning
Workflow
2.Standardization and Cleaning
DATA lookalike_clean;
SET lookalike_raw;
PERSON_NAME=PROPCASE(STRIP(TRANWRD(PERSON_NAME,'_',' ')));
COUNTRY=UPCASE(STRIP(COUNTRY));
LOOKALIKE_OF=PROPCASE(TRANWRD(LOOKALIKE_OF,'_',' '));
REGION_CODE=UPCASE(REGION_CODE);
CONTACT_EMAIL=LOWCASE(STRIP(CONTACT_EMAIL));
PAYMENT_AMOUNT=ABS(PAYMENT_AMOUNT);
IF AGE<0 OR AGE>120 THEN AGE=.;
IF PERSON_NAME='Null' THEN PERSON_NAME='Unknown';
IF INDEX(CONTACT_EMAIL,'@')=0 THEN
CONTACT_EMAIL='invalid@email.com';
EVENT_DATE=INPUT(EVENT_DATE_RAW,?? YYMMDD10.);
IF MISSING(EVENT_DATE) THEN
DATE_FLAG='INVALID DATE';
ELSE DATE_FLAG='VALID DATE';
FORMAT EVENT_DATE DATE9.;
DROP EVENT_DATE_RAW;
RUN;
PROC PRINT DATA=lookalike_clean;
RUN;
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | AGE | PAYMENT_AMOUNT | EVENT_DATE | DATE_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | L001 | John Smith | USA | Tom Cruise | NA | invalid@email.com | 25 | 200 | 10JAN2025 | VALID DATE |
| 2 | L002 | Maria Lee | INDIA | Angelina Jolie | AP | maria@gmail.com | . | 500 | . | INVALID DATE |
| 3 | L003 | Unknown | UK | Brad Pitt | EU | brad@@mail.com | . | 1000 | . | INVALID DATE |
| 4 | L004 | Davidm | USA | Srk | NA | david@gmail.com | 35 | 700 | 15MAR2025 | VALID DATE |
| 5 | L004 | Davidm | USA | Srk | NA | david@gmail.com | 35 | 700 | 15MAR2025 | VALID DATE |
| 6 | L005 | Anita | INDIA | Deepika | AP | anita@gmail | . | 350 | 10APR2025 | VALID DATE |
| 7 | L006 | Chris R | CANADA | Leonardo | CA | invalid@email.com | 29 | 450 | . | INVALID DATE |
| 8 | L007 | Fatima | UAE | Aamir Khan | ME | fatima@mail.com | . | 800 | 11MAY2025 | VALID DATE |
| 9 | L008 | Wei Chen | CHINA | Jackie Chan | AS | wei@gmail.com | 31 | 900 | 05JUN2025 | VALID DATE |
| 10 | L009 | Unknown | USA | Robert Downey | NA | invalid@email.com | 40 | 600 | 09JUL2025 | VALID DATE |
Explanation and Key Points
This DATA
step demonstrates real-world production cleaning logic. PROPCASE standardizes
names for duplicate matching. TRANWRD removes underscore corruption. ABS()
corrects negative financial values. INPUT(...ANYDTDTE.) intelligently converts
inconsistent date formats. Missing and unrealistic ages are standardized to SAS
missing values (.). Email validation uses containment logic. This mirrors
clinical trial workflows where demographic variables must be normalized before
SDTM mapping. In enterprise systems, consistent formatting directly improves
merge success rates, AI entity matching, and downstream reporting
reproducibility.INDEX() searches for a substring. If '@' is found → returns
position number. If not found → returns 0.
What ?? Does
Internally
|
Modifier |
Behavior |
|
none |
NOTE + ERROR=1 |
|
? |
suppress
NOTE |
|
?? |
suppress
NOTE + suppress ERROR |
3.Removing Duplicate Records
PROC SORT DATA=lookalike_clean
OUT=lookalike_nodup NODUPKEY;
BY PERSON_ID;
RUN;
PROC PRINT DATA=lookalike_nodup;
RUN;
LOG:
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | AGE | PAYMENT_AMOUNT | EVENT_DATE | DATE_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | L001 | John Smith | USA | Tom Cruise | NA | invalid@email.com | 25 | 200 | 10JAN2025 | VALID DATE |
| 2 | L002 | Maria Lee | INDIA | Angelina Jolie | AP | maria@gmail.com | . | 500 | . | INVALID DATE |
| 3 | L003 | Unknown | UK | Brad Pitt | EU | brad@@mail.com | . | 1000 | . | INVALID DATE |
| 4 | L004 | Davidm | USA | Srk | NA | david@gmail.com | 35 | 700 | 15MAR2025 | VALID DATE |
| 5 | L005 | Anita | INDIA | Deepika | AP | anita@gmail | . | 350 | 10APR2025 | VALID DATE |
| 6 | L006 | Chris R | CANADA | Leonardo | CA | invalid@email.com | 29 | 450 | . | INVALID DATE |
| 7 | L007 | Fatima | UAE | Aamir Khan | ME | fatima@mail.com | . | 800 | 11MAY2025 | VALID DATE |
| 8 | L008 | Wei Chen | CHINA | Jackie Chan | AS | wei@gmail.com | 31 | 900 | 05JUN2025 | VALID DATE |
| 9 | L009 | Unknown | USA | Robert Downey | NA | invalid@email.com | 40 | 600 | 09JUL2025 | VALID DATE |
Explanation and Key Points
PROC SORT
NODUPKEY removes duplicate business keys. In clinical trials, duplicate USUBJID
values can invalidate subject-level analyses. Duplicate removal protects
statistical integrity and prevents inflation of counts in dashboards or
regulatory outputs. SAS sorting procedures are highly optimized for
enterprise-scale processing involving millions of rows.
4.Advanced Validation Using ARRAYS
DATA validation_check;
SET lookalike_nodup;
ARRAY chars(*) PERSON_NAME COUNTRY LOOKALIKE_OF REGION_CODE CONTACT_EMAIL;
DO i=1 TO DIM(chars);
chars(i)=COMPBL(chars(i));
END;
MISSING_COUNT=CMISS(OF _ALL_);
RUN;
PROC PRINT DATA=validation_check;
RUN;
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | AGE | PAYMENT_AMOUNT | EVENT_DATE | DATE_FLAG | i | MISSING_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | L001 | John Smith | USA | Tom Cruise | NA | invalid@email.com | 25 | 200 | 10JAN2025 | VALID DATE | 6 | 1 |
| 2 | L002 | Maria Lee | INDIA | Angelina Jolie | AP | maria@gmail.com | . | 500 | . | INVALID DATE | 6 | 3 |
| 3 | L003 | Unknown | UK | Brad Pitt | EU | brad@@mail.com | . | 1000 | . | INVALID DATE | 6 | 3 |
| 4 | L004 | Davidm | USA | Srk | NA | david@gmail.com | 35 | 700 | 15MAR2025 | VALID DATE | 6 | 1 |
| 5 | L005 | Anita | INDIA | Deepika | AP | anita@gmail | . | 350 | 10APR2025 | VALID DATE | 6 | 2 |
| 6 | L006 | Chris R | CANADA | Leonardo | CA | invalid@email.com | 29 | 450 | . | INVALID DATE | 6 | 2 |
| 7 | L007 | Fatima | UAE | Aamir Khan | ME | fatima@mail.com | . | 800 | 11MAY2025 | VALID DATE | 6 | 2 |
| 8 | L008 | Wei Chen | CHINA | Jackie Chan | AS | wei@gmail.com | 31 | 900 | 05JUN2025 | VALID DATE | 6 | 1 |
| 9 | L009 | Unknown | USA | Robert Downey | NA | invalid@email.com | 40 | 600 | 09JUL2025 | VALID DATE | 6 | 1 |
Explanation and Key Points
Arrays
allow scalable variable-level transformations without repetitive code. COMPBL()
removes extra embedded blanks. CMISS() counts missing character and numeric
values simultaneously. This is extremely valuable in SDTM validation because
missingness directly affects subject disposition and endpoint derivations.
5.PROC SQL Enterprise Logic
PROC SQL;
CREATE TABLE profile_summary AS
SELECT COUNTRY,
COUNT(*) AS TOTAL_PERSONS,
MEAN(AGE) AS AVG_AGE,
SUM(PAYMENT_AMOUNT) AS TOTAL_REVENUE
FROM validation_check
GROUP BY COUNTRY;
QUIT;
PROC PRINT DATA=profile_summary;
RUN;
OUTPUT:
| Obs | COUNTRY | TOTAL_PERSONS | AVG_AGE | TOTAL_REVENUE |
|---|---|---|---|---|
| 1 | CANADA | 1 | 29.0000 | 450 |
| 2 | CHINA | 1 | 31.0000 | 900 |
| 3 | INDIA | 2 | . | 850 |
| 4 | UAE | 1 | . | 800 |
| 5 | UK | 1 | . | 1000 |
| 6 | USA | 3 | 33.3333 | 1500 |
Explanation and Key Points
PROC SQL
simplifies aggregation and join logic. SQL is preferred when relational
operations dominate the workflow. Clinical programmers frequently use SQL for
subject-level joins across AE, DM, LB, and EX domains. Aggregation logic
improves reporting automation and executive dashboard creation.
6.Using FIRST./LAST. Processing
PROC SORT DATA=validation_check;
BY COUNTRY;
RUN;
PROC PRINT DATA=validation_check;
RUN;
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | AGE | PAYMENT_AMOUNT | EVENT_DATE | DATE_FLAG | i | MISSING_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | L006 | Chris R | CANADA | Leonardo | CA | invalid@email.com | 29 | 450 | . | INVALID DATE | 6 | 2 |
| 2 | L008 | Wei Chen | CHINA | Jackie Chan | AS | wei@gmail.com | 31 | 900 | 05JUN2025 | VALID DATE | 6 | 1 |
| 3 | L002 | Maria Lee | INDIA | Angelina Jolie | AP | maria@gmail.com | . | 500 | . | INVALID DATE | 6 | 3 |
| 4 | L005 | Anita | INDIA | Deepika | AP | anita@gmail | . | 350 | 10APR2025 | VALID DATE | 6 | 2 |
| 5 | L007 | Fatima | UAE | Aamir Khan | ME | fatima@mail.com | . | 800 | 11MAY2025 | VALID DATE | 6 | 2 |
| 6 | L003 | Unknown | UK | Brad Pitt | EU | brad@@mail.com | . | 1000 | . | INVALID DATE | 6 | 3 |
| 7 | L001 | John Smith | USA | Tom Cruise | NA | invalid@email.com | 25 | 200 | 10JAN2025 | VALID DATE | 6 | 1 |
| 8 | L004 | Davidm | USA | Srk | NA | david@gmail.com | 35 | 700 | 15MAR2025 | VALID DATE | 6 | 1 |
| 9 | L009 | Unknown | USA | Robert Downey | NA | invalid@email.com | 40 | 600 | 09JUL2025 | VALID DATE | 6 | 1 |
DATA country_stats;
SET validation_check;
BY COUNTRY;
RETAIN COUNTRY_TOTAL 0;
COUNTRY_TOTAL+1;
IF LAST.COUNTRY THEN OUTPUT;
RUN;
PROC PRINT DATA=country_stats;
RUN;
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | AGE | PAYMENT_AMOUNT | EVENT_DATE | DATE_FLAG | i | MISSING_COUNT | COUNTRY_TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | L006 | Chris R | CANADA | Leonardo | CA | invalid@email.com | 29 | 450 | . | INVALID DATE | 6 | 2 | 1 |
| 2 | L008 | Wei Chen | CHINA | Jackie Chan | AS | wei@gmail.com | 31 | 900 | 05JUN2025 | VALID DATE | 6 | 1 | 2 |
| 3 | L005 | Anita | INDIA | Deepika | AP | anita@gmail | . | 350 | 10APR2025 | VALID DATE | 6 | 2 | 4 |
| 4 | L007 | Fatima | UAE | Aamir Khan | ME | fatima@mail.com | . | 800 | 11MAY2025 | VALID DATE | 6 | 2 | 5 |
| 5 | L003 | Unknown | UK | Brad Pitt | EU | brad@@mail.com | . | 1000 | . | INVALID DATE | 6 | 3 | 6 |
| 6 | L009 | Unknown | USA | Robert Downey | NA | invalid@email.com | 40 | 600 | 09JUL2025 | VALID DATE | 6 | 1 | 9 |
Explanation and Key Points
FIRST.
and LAST. processing are among SAS’s most powerful enterprise features. They
enable grouped calculations without SQL overhead. This logic is heavily used in
adverse-event summarization and longitudinal clinical tracking.
7.PROC FORMAT for Controlled Categories
PROC FORMAT;
VALUE agegrp LOW-17='Minor'
18-40='Young Adult'
41-60='Adult'
61-HIGH='Senior';
RUN;
LOG:
Explanation and Key Points
Formats
improve reporting readability while preserving raw storage values. Regulatory
reporting often requires grouped categories for demographics and safety
summaries.
8.DATA STEP Age_Format
DATA age_format;
SET validation_check;
FORMAT AGE agegrp.;
RUN;
PROC PRINT DATA=age_format;
RUN;
OUTPUT:
| Obs | PERSON_ID | PERSON_NAME | COUNTRY | LOOKALIKE_OF | REGION_CODE | CONTACT_EMAIL | AGE | PAYMENT_AMOUNT | EVENT_DATE | DATE_FLAG | i | MISSING_COUNT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | L006 | Chris R | CANADA | Leonardo | CA | invalid@email.com | Young Adult | 450 | . | INVALID DATE | 6 | 2 |
| 2 | L008 | Wei Chen | CHINA | Jackie Chan | AS | wei@gmail.com | Young Adult | 900 | 05JUN2025 | VALID DATE | 6 | 1 |
| 3 | L002 | Maria Lee | INDIA | Angelina Jolie | AP | maria@gmail.com | . | 500 | . | INVALID DATE | 6 | 3 |
| 4 | L005 | Anita | INDIA | Deepika | AP | anita@gmail | . | 350 | 10APR2025 | VALID DATE | 6 | 2 |
| 5 | L007 | Fatima | UAE | Aamir Khan | ME | fatima@mail.com | . | 800 | 11MAY2025 | VALID DATE | 6 | 2 |
| 6 | L003 | Unknown | UK | Brad Pitt | EU | brad@@mail.com | . | 1000 | . | INVALID DATE | 6 | 3 |
| 7 | L001 | John Smith | USA | Tom Cruise | NA | invalid@email.com | Young Adult | 200 | 10JAN2025 | VALID DATE | 6 | 1 |
| 8 | L004 | Davidm | USA | Srk | NA | david@gmail.com | Young Adult | 700 | 15MAR2025 | VALID DATE | 6 | 1 |
| 9 | L009 | Unknown | USA | Robert Downey | NA | invalid@email.com | Young Adult | 600 | 09JUL2025 | VALID DATE | 6 | 1 |
9.R Code — Creating Raw Corrupted Dataset
library(tidyverse)
lookalike_raw <- read.delim(
text = "
PERSON_ID|PERSON_NAME|COUNTRY|AGE|LOOKALIKE_OF|REGION_CODE|CONTACT_EMAIL|EVENT_DATE_RAW|PAYMENT_AMOUNT
L001|john_smith|usa|25|Tom_Cruise|na|johnmail.com|2025-01-10|200
L002| maria lee |India|-45|Angelina_Jolie|AP|maria@gmail.com|2025-02-30|-500
L003|NULL|UK|200|Brad_Pitt|eu|brad@@mail.com|invaliddate|1000
L004|DavidM|USA|35|SRK|NA|david@gmail.com|2025-03-15|700
L004|DavidM|USA|35|SRK|NA|david@gmail.com|2025-03-15|700
L005|Anita |india|.|Deepika|ap|anita@gmail|2025-04-10|350
L006|Chris_R|Canada|29|Leonardo|ca|NULL|2025-13-01|450
L007| Fatima |UAE|150|Aamir_Khan|me|fatima@mail.com|2025-05-11|-800
L008|Wei Chen|China|31|Jackie_Chan|as|wei@gmail.com|2025-06-05|900
L009|NULL|USA|40|Robert_Downey|NA|ironman.com|2025-07-09|600
",
sep = "|",
header = TRUE,
stringsAsFactors = FALSE,
strip.white = FALSE
)
OUTPUT:
|
|
PERSON_ID |
PERSON_NAME |
COUNTRY |
AGE |
LOOKALIKE_OF |
REGION_CODE |
CONTACT_EMAIL |
EVENT_DATE_RAW |
PAYMENT_AMOUNT |
|
1 |
L001 |
john_smith |
usa |
25 |
Tom_Cruise |
na |
johnmail.com |
10-01-2025 |
200 |
|
2 |
L002 |
maria lee |
India |
-45 |
Angelina_Jolie |
AP |
maria@gmail.com |
2025-02-30 |
-500 |
|
3 |
L003 |
NULL |
UK |
200 |
Brad_Pitt |
eu |
brad@@mail.com |
invaliddate |
1000 |
|
4 |
L004 |
DavidM |
USA |
35 |
SRK |
NA |
david@gmail.com |
15-03-2025 |
700 |
|
5 |
L004 |
DavidM |
USA |
35 |
SRK |
NA |
david@gmail.com |
15-03-2025 |
700 |
|
6 |
L005 |
Anita |
india |
. |
Deepika |
ap |
anita@gmail |
10-04-2025 |
350 |
|
7 |
L006 |
Chris_R |
Canada |
29 |
Leonardo |
ca |
NULL |
2025-13-01 |
450 |
|
8 |
L007 |
Fatima |
UAE |
150 |
Aamir_Khan |
me |
fatima@mail.com |
11-05-2025 |
-800 |
|
9 |
L008 |
Wei Chen |
China |
31 |
Jackie_Chan |
as |
wei@gmail.com |
05-06-2025 |
900 |
|
10 |
L009 |
NULL |
USA |
40 |
Robert_Downey |
NA |
ironman.com |
09-07-2025 |
600 |
10.Modern R Cleaning Workflow
library(tidyverse)
library(lubridate)
library(janitor)
lookalike_clean <- lookalike_raw %>%
clean_names() %>%
mutate(
person_name=str_to_title(str_trim(gsub("_"," ",person_name))),
country=str_to_upper(country),
lookalike_of=str_to_title(gsub("_"," ",lookalike_of)),
region_code=str_to_upper(region_code),
contact_email=tolower(contact_email),
payment_amount=abs(payment_amount),
age = suppressWarnings(as.numeric(age)),
age=if_else(age<0 | age>120,NA_real_,
as.numeric(age)),
event_date = suppressWarnings(parse_date_time(
event_date_raw,orders=c("ymd","dmy"))),
contact_email=if_else(grepl("@",contact_email),
contact_email,"invalid@email.com")
) %>%
distinct(person_id,.keep_all=TRUE) %>%
select(-event_date_raw)
OUTPUT:
|
|
person_id |
person_name |
country |
age |
lookalike_of |
region_code |
contact_email |
payment_amount |
event_date |
|
1 |
L001 |
John Smith |
USA |
25 |
Tom Cruise |
NA |
invalid@email.com |
200 |
10-01-2025 |
|
2 |
L002 |
Maria Lee |
INDIA |
NA |
Angelina Jolie |
AP |
maria@gmail.com |
500 |
NA |
|
3 |
L003 |
Null |
UK |
NA |
Brad Pitt |
EU |
brad@@mail.com |
1000 |
NA |
|
4 |
L004 |
Davidm |
USA |
35 |
Srk |
NA |
david@gmail.com |
700 |
15-03-2025 |
|
5 |
L005 |
Anita |
INDIA |
NA |
Deepika |
AP |
anita@gmail |
350 |
10-04-2025 |
|
6 |
L006 |
Chris R |
CANADA |
29 |
Leonardo |
CA |
invalid@email.com |
450 |
NA |
|
7 |
L007 |
Fatima |
UAE |
NA |
Aamir Khan |
ME |
fatima@mail.com |
800 |
11-05-2025 |
|
8 |
L008 |
Wei Chen |
CHINA |
31 |
Jackie Chan |
AS |
wei@gmail.com |
900 |
05-06-2025 |
|
9 |
L009 |
Null |
USA |
40 |
Robert Downey |
NA |
invalid@email.com |
600 |
09-07-2025 |
Explanation and Key Points
This R
pipeline mirrors SAS logic using tidyverse principles. mutate() handles
transformations elegantly. if_else() performs vectorized validation.
parse_date_time() from lubridate resolves inconsistent timestamps. distinct()
removes duplicate identifiers. Compared to SAS DATA steps, R provides more
concise syntax but less inherent auditability. In enterprise environments, R
excels at flexible exploratory cleaning while SAS dominates traceable
production workflows.
SAS vs R Comparison
|
Capability |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Industry
Standard |
Increasing |
|
Scalability |
Very
High |
High |
|
Flexibility |
Structured |
Extremely
Flexible |
|
QC
Traceability |
Strong |
Requires
Framework |
|
Metadata
Governance |
Excellent |
Manual |
|
Visualization |
Moderate |
Excellent |
SAS vs R Equivalent
|
SAS |
R |
|
DROP
EVENT_DATE_RAW |
select(-event_date_raw) |
|
KEEP
var1 var2; |
select(var1,var2) |
|
RENAME
old=new; |
rename(new=old) |
Enterprise Validation & Compliance
In
clinical trials, data cleaning is not cosmetic it is regulatory survival.
Incorrect
missing-value handling in SAS is particularly dangerous because:
. < 0
SAS
treats missing numeric values as smaller than valid numbers.
This can
accidentally include missing values in logical filters:
IF AGE < 18;
This
condition also captures missing ages unless explicitly handled.
Enterprise
validation frameworks require:
- SDTM traceability
- ADaM derivation lineage
- Independent QC programming
- Audit-ready logs
- Controlled terminology
validation
- Metadata consistency
- Reproducible transformations
Poor
cleaning logic can trigger FDA rejection risks.
20 Data-Cleaning Best
Practices
- Always standardize case
formatting
- Remove duplicates before
analysis
- Validate dates aggressively
- Never trust source-system
emails
- Use metadata-driven
programming
- Create reusable macros
- Separate raw and cleaned
layers
- Maintain audit logs
- Validate missingness
patterns
- Standardize region codes
- Avoid hardcoding logic
- Use defensive programming
- Preserve raw datasets
unchanged
- Validate categorical values
- Use controlled terminology
- Apply QC independence
- Track derivation lineage
- Use naming conventions
consistently
- Document transformations
thoroughly
- Automate validation
reporting
Business Logic Explanation
Business
logic transforms raw operational records into trusted analytical assets.
Missing values are often imputed because downstream analytics cannot reliably
process incomplete data. For example, missing patient ages in clinical trials
may exclude subjects from age-based efficacy analyses. Unrealistic values such
as age 200 or negative billing amounts are corrected because they distort
statistical summaries and AI models. Date standardization is essential because
inconsistent timestamps break longitudinal tracking and visit-window
calculations. Text normalization improves duplicate detection and reporting
accuracy. For instance, “india,” “INDIA,” and “ India ” should represent one
standardized country value. In banking systems, salary normalization prevents
incorrect loan eligibility classification. In healthcare, visit-date imputation
ensures accurate treatment exposure calculations. Without standardization,
dashboards display fragmented metrics, machine-learning models misclassify
risk, and regulatory submissions fail consistency checks. Enterprise data
cleaning therefore ensures analytical reliability, operational consistency,
reproducibility, and governance compliance across reporting ecosystems.
20 Sharp One-Line Insights
- Dirty data creates expensive
business mistakes.
- Standardized variables
improve reproducibility.
- Validation logic is stronger
than visual inspection.
- Duplicate IDs destroy
analytical trust.
- Metadata is enterprise
memory.
- SAS excels in auditability.
- R accelerates exploratory
cleaning.
- Missing dates break
timelines.
- Controlled terminology
reduces reporting chaos.
- Data lineage supports
compliance.
- QC independence prevents
silent failures.
- Character truncation
silently corrupts data.
- Negative financial values
distort forecasting.
- Case inconsistency ruins
joins.
- PROC SQL simplifies
relational workflows.
- DATA step enables granular
control.
- Arrays reduce repetitive
coding.
- Macros improve enterprise
scalability.
- Standardized formats improve
reporting clarity.
- Clean data powers
trustworthy AI.
Summary: SAS vs R
Enterprise Cleaning Workflows
SAS and R
each provide powerful but complementary strengths for enterprise data
engineering. SAS dominates heavily regulated industries because of its
auditability, reproducibility, metadata governance, and controlled execution
framework. DATA step programming offers granular transformation control, while
PROC SQL supports scalable relational operations. SAS logs provide traceability
essential for SDTM and ADaM compliance.
R, on the
other hand, excels in flexibility, rapid prototyping, modern data manipulation,
and advanced visualization ecosystems. Packages like tidyverse and lubridate
dramatically simplify text cleaning and date handling. R’s functional
programming capabilities enable scalable automation for exploratory analytics.
In
practice, modern organizations increasingly integrate SAS and R together. SAS
manages validated production pipelines, while R enhances exploratory analysis,
visualization, and machine-learning integration. Together, they create
scalable, trustworthy analytical ecosystems capable of supporting healthcare, banking,
retail, and insurance intelligence frameworks.
Conclusion
Modern
analytics systems cannot survive on raw operational data alone. Whether in
healthcare, banking, insurance, or retail, corrupted datasets create hidden
analytical liabilities that silently damage dashboards, machine-learning
predictions, financial forecasting, and regulatory submissions. The “Global
Lookalike Persons” project demonstrates how enterprise-grade cleaning
frameworks transform unreliable raw information into trustworthy analytical
intelligence.
SAS
provides unmatched strengths in auditability, controlled metadata management,
reproducibility, and regulatory traceability. Features such as DATA step
programming, FIRST./LAST. logic, PROC FORMAT, arrays, macros, and PROC SQL
allow organizations to engineer robust production pipelines capable of
processing millions of records consistently. In highly regulated environments
such as clinical trials, these capabilities are indispensable for SDTM and ADaM
compliance.
R
complements SAS by offering exceptional flexibility, modern transformation
libraries, rapid exploratory analytics, and scalable text-processing
capabilities. Tidyverse workflows simplify complex cleaning tasks while
enabling fast iteration for analytical experimentation.
Most
importantly, enterprise cleaning is not simply “data correction.” It is risk
management, governance enforcement, and analytical quality assurance. Every
standardized variable, corrected date, deduplicated identifier, and validated
category improves downstream reliability. Clean data strengthens executive
confidence, enhances AI accuracy, improves reporting reproducibility, and
protects organizations from compliance failure.
The
future of enterprise analytics belongs to organizations that treat data cleaning
not as a preprocessing step, but as a strategic engineering discipline. SAS and
R together create a powerful ecosystem for building scalable, production-grade,
analysis-ready intelligence platforms capable of supporting modern business
decision-making with confidence and precision.
Interview
Questions & Answers
1. What are TLFs?
TLFs mean
Tables, Listings, and Figures.
- Tables → Summary outputs
- Listings → Detailed
patient-level data
- Figures → Graphical
representation
These are
generated for clinical study reports using SAS procedures.
2. Explain PROC SQL and DATA Step difference
DATA
Step:
- Used for row-by-row
processing.
- Faster for sequential data
manipulation.
PROC SQL:
- Used for joins, grouping,
summarization.
- Similar to SQL database
operations.
Example:
- DATA step used for
derivations.
- PROC SQL used for combining
datasets using joins.
3. What is the use of SAS Macros?
Macros
are used to automate repetitive tasks and reduce code duplication.
Advantages:
- Reusable code
- Dynamic programming
- Easy maintenance
- Faster execution for
repeated tasks
4. Explain SET and MERGE statements
SET:
- Combines datasets
vertically.
- Appends observations.
MERGE:
- Combines datasets
horizontally using BY variables.
Example:
SET:
Combine AE2024 and AE2025 datasets.
MERGE:
Merge DM and AE using USUBJID.
5. What are common SAS procedures you use?
- PROC SORT
- PROC FREQ
- PROC MEANS
- PROC REPORT
- PROC SQL
- PROC TRANSPOSE
- PROC COMPARE
- PROC FORMAT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent LOOKALIKE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and Exams Reviewers and Observers
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment