FROM HOSPITAL CHAOS TO TRUSTED HEALTHCARE INTELLIGENCE IN SAS AND R
BEST HEALTHCARE CENTERS IN WORLD: Building Analysis-Ready Datasets in SAS and R Through Enterprise Data Cleaning, Validation, and Professional Reporting
Introduction: When Bad
Healthcare Data Becomes a Business Disaster
Imagine a
global healthcare analytics organization preparing executive dashboards for the
world's leading healthcare centers. Patient enrollment counts suddenly appear
inflated, billing revenue becomes negative in monthly reports, duplicate
patient records are submitted to regulatory agencies, and clinical trial visit
dates occur before enrollment dates.
The root
cause is not advanced analytics failure.
The root
cause is dirty data.
In modern
healthcare environments, poor-quality data can create:
- Incorrect patient enrollment
statistics
- Wrong clinical trial
conclusions
- Insurance claim rejections
- Regulatory compliance findings
- Misleading AI predictions
- Executive reporting failures
- SDTM and ADaM validation
issues
- Financial forecasting errors
As
Clinical SAS Programmers and Data Scientists, our responsibility is
transforming corrupted operational data into trustworthy analytical
intelligence.
This
article demonstrates a complete enterprise workflow using a BEST HEALTHCARE
CENTERS IN WORLD dataset containing intentional data quality problems and
shows how SAS and R can systematically convert chaos into reliable business
insights.
1.Raw Healthcare Dataset with Intentional Errors (SAS)
SAS Raw Dataset
data healthcare_raw;
length Center_ID $8 Center_Name $40 Country $20 Region_Code $20
Patient_Email $50 Category $20 Visit_Date $20 Billing_Amount $15;
infile datalines dlm='|' dsd truncover;
input Center_ID $ Center_Name $ Country $ Region_Code $
Patient_Age Patient_Email $ Visit_Date $ Billing_Amount $
Category $;
datalines;
HC001|Mayo Clinic|USA|US|45|john@gmail.com|2025-01-10|5000|Premium
HC002| cleveland clinic |usa|U.S.|-12|marygmail.com|2025-02-15|-3000|gold
HC003|Apollo Hospitals|India|APAC|210|NULL|2025/13/01|7000|VIP
HC004|Johns Hopkins|USA|USA|55|test@yahoo|NULL|4500|premium
HC005|Charite Hospital|Germany|EU|.|user@abc.com|2025-03-10|8000|GOLD
HC005|Charite Hospital|Germany|EUR|.|user@abc.com|2025-03-10|8000|GOLD
HC006|Singapore General Hospital|Singapore|ASIA|30|abc@@gmail.com|2025-04-01|-900|VIP
HC007|Mount Sinai|USA|US|125|demo@gmail.com|2025-05-01|1000|unknown
HC008|AIIMS|India|APAC|35|NULL|2025-06-11|6500|Silver
HC009|Toronto General Hospital|Canada|NA|44|samplegmail.com|2025-07-10|2000|premium
HC010|King Faisal Hospital|Saudi Arabia|MEA|60|info@gmail.com|BADDATE|5000|VIP
HC011|Asan Medical Center|Korea|APAC|52|user@test.com|2025-09-01|7000|Gold
HC012|Karolinska Institute|Sweden|EU|48|abc@test.com|2025-10-01|-500|Premium
HC013|Bumrungrad Hospital|Thailand|ASIA|41|usergmail.com|2025-11-10|4000|VIP
HC014|Cleveland Clinic Abu Dhabi|UAE|MiddleEast|39|mail@test.com|2025-12-10|5500|Gold
HC015|Fortis Healthcare|India|APAC|29|fortis@test.com|2025-06-15|4200|Silver
;
run;
proc print data=healthcare_raw;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 |
| 2 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | -3000 | -12 |
| 3 | HC003 | Apollo Hospitals | India | APAC | NULL | VIP | 2025/13/01 | 7000 | 210 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 4500 | 55 |
| 5 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 6 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 7 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 |
| 8 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 |
| 9 | HC008 | AIIMS | India | APAC | NULL | Silver | 2025-06-11 | 6500 | 35 |
| 10 | HC009 | Toronto General Hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 2000 | 44 |
| 11 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 |
| 12 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 |
| 13 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 |
| 14 | HC013 | Bumrungrad Hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 4000 | 41 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 |
| 16 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 |
Raw Healthcare Dataset in R
healthcare_raw <- tibble::tribble(
~center_id,~center_name,~country,~region_code,~patient_age,
~patient_email,~visit_date,~billing_amount,~category,
"HC001","Mayo Clinic","USA","US",45,"john@gmail.com","2025-01-10",5000,"Premium",
"HC002"," cleveland clinic ","usa","U.S.",-12,"marygmail.com","2025-02-15",-3000,"gold",
"HC003","Apollo Hospitals","India","APAC",210,"NULL","2025/13/01",7000,"VIP",
"HC004","Johns Hopkins","USA","USA",55,"test@yahoo","NULL",4500,"premium",
"HC005","Charite Hospital","Germany","EU",NA,"user@abc.com","2025-03-10",8000,"GOLD",
"HC005","Charite Hospital","Germany","EUR",NA,"user@abc.com","2025-03-10",8000,"GOLD",
"HC006","Singapore General Hospital","Singapore","ASIA",30,"abc@@gmail.com",
"2025-04-01",-900,"VIP",
"HC007","Mount Sinai","USA","US",125,"demo@gmail.com","2025-05-01",1000,"unknown",
"HC008","AIIMS","India","APAC",35,"NULL","2025-06-11",6500,"Silver",
"HC009","Toronto General Hospital","Canada","NA",44,"samplegmail.com",
"2025-07-10",2000,"premium",
"HC010","King Faisal Hospital","Saudi Arabia","MEA",60,"info@gmail.com",
"BADDATE",5000,"VIP"
)
|
center_id |
center_name |
country |
region_code |
patient_age |
patient_email |
visit_date |
billing_amount |
category |
|
HC001 |
Mayo Clinic |
USA |
US |
45 |
john@gmail.com |
2025-01-10 |
5000 |
Premium |
|
HC002 |
cleveland clinic |
usa |
U.S. |
-12 |
marygmail.com |
2025-02-15 |
-3000 |
gold |
|
HC003 |
Apollo Hospitals |
India |
APAC |
210 |
NULL |
2025/13/01 |
7000 |
VIP |
|
HC004 |
Johns Hopkins |
USA |
USA |
55 |
test@yahoo |
NULL |
4500 |
premium |
|
HC005 |
Charite Hospital |
Germany |
EU |
user@abc.com |
2025-03-10 |
8000 |
GOLD |
|
|
HC005 |
Charite Hospital |
Germany |
EUR |
user@abc.com |
2025-03-10 |
8000 |
GOLD |
|
|
HC006 |
Singapore General Hospital |
Singapore |
ASIA |
30 |
abc@@gmail.com |
2025-04-01 |
-900 |
VIP |
|
HC007 |
Mount Sinai |
USA |
US |
125 |
demo@gmail.com |
2025-05-01 |
1000 |
unknown |
|
HC008 |
AIIMS |
India |
APAC |
35 |
NULL |
2025-06-11 |
6500 |
Silver |
|
HC009 |
Toronto General Hospital |
Canada |
NA |
44 |
samplegmail.com |
2025-07-10 |
2000 |
premium |
|
HC010 |
King Faisal Hospital |
Saudi Arabia |
MEA |
60 |
info@gmail.com |
BADDATE |
5000 |
VIP |
Understanding Character
Truncation Risk in SAS
One of
the most misunderstood SAS behaviors is character variable truncation.
data demo;
length region $20;
region='NORTH_AMERICA';
run;
proc print data=demo;
run;
OUTPUT:
| Obs | region |
|---|---|
| 1 | NORTH_AMERICA |
Explanation
SAS
allocates storage during compilation. If LENGTH is omitted before assignment,
SAS determines length using the first encountered value, which may truncate
future longer values.
For
example, if "US" appears first, later values like
"NORTH_AMERICA" may be truncated. In regulated environments such as
SDTM and ADaM production, truncation can corrupt submission datasets and
produce validation findings.
R behaves
differently because character vectors dynamically accommodate varying string
lengths without requiring explicit length declarations. SAS programmers must
therefore define LENGTH statements before assignments, conditional logic, or
imports to avoid unexpected truncation.
2.SAS Enterprise Cleaning
Workflow
Step 1: Standardization
data healthcare_clean1;
set healthcare_raw;
center_name=propcase(strip(center_name));
country=upcase(strip(country));
category=propcase(strip(category));
patient_email=lowcase(strip(patient_email));
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 |
| 2 | HC002 | Cleveland Clinic | USA | U.S. | marygmail.com | Gold | 2025-02-15 | -3000 | -12 |
| 3 | HC003 | Apollo Hospitals | INDIA | APAC | null | Vip | 2025/13/01 | 7000 | 210 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | Premium | NULL | 4500 | 55 |
| 5 | HC005 | Charite Hospital | GERMANY | EU | user@abc.com | Gold | 2025-03-10 | 8000 | . |
| 6 | HC005 | Charite Hospital | GERMANY | EUR | user@abc.com | Gold | 2025-03-10 | 8000 | . |
| 7 | HC006 | Singapore General Hospital | SINGAPORE | ASIA | abc@@gmail.com | Vip | 2025-04-01 | -900 | 30 |
| 8 | HC007 | Mount Sinai | USA | US | demo@gmail.com | Unknown | 2025-05-01 | 1000 | 125 |
| 9 | HC008 | Aiims | INDIA | APAC | null | Silver | 2025-06-11 | 6500 | 35 |
| 10 | HC009 | Toronto General Hospital | CANADA | NA | samplegmail.com | Premium | 2025-07-10 | 2000 | 44 |
| 11 | HC010 | King Faisal Hospital | SAUDI ARABIA | MEA | info@gmail.com | Vip | BADDATE | 5000 | 60 |
| 12 | HC011 | Asan Medical Center | KOREA | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 |
| 13 | HC012 | Karolinska Institute | SWEDEN | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 |
| 14 | HC013 | Bumrungrad Hospital | THAILAND | ASIA | usergmail.com | Vip | 2025-11-10 | 4000 | 41 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 |
| 16 | HC015 | Fortis Healthcare | INDIA | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 |
Explanation
This step
removes leading/trailing spaces, standardizes capitalization, and creates
consistent reporting values. Without normalization, "premium",
"Premium", and "PREMIUM" appear as separate categories in
PROC FREQ outputs. Standardized text significantly improves reporting reliability,
grouping consistency, dashboard accuracy, and machine learning feature quality.
Step 2: Region Correction Using SELECT-WHEN
data healthcare_clean1;
set healthcare_raw;
length region_code $20;
select(region_code);
when ('US','USA','U.S.') region_code='NORTH_AMERICA';
when ('EU','EUR') region_code='EUROPE';
when ('APAC','ASIA') region_code='ASIA_PACIFIC';
otherwise region_code='OTHER';
end;
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | NORTH_AMERICA | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 |
| 2 | HC002 | cleveland clinic | usa | NORTH_AMERICA | marygmail.com | gold | 2025-02-15 | -3000 | -12 |
| 3 | HC003 | Apollo Hospitals | India | ASIA_PACIFIC | NULL | VIP | 2025/13/01 | 7000 | 210 |
| 4 | HC004 | Johns Hopkins | USA | NORTH_AMERICA | test@yahoo | premium | NULL | 4500 | 55 |
| 5 | HC005 | Charite Hospital | Germany | EUROPE | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 6 | HC005 | Charite Hospital | Germany | EUROPE | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 7 | HC006 | Singapore General Hospital | Singapore | ASIA_PACIFIC | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 |
| 8 | HC007 | Mount Sinai | USA | NORTH_AMERICA | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 |
| 9 | HC008 | AIIMS | India | ASIA_PACIFIC | NULL | Silver | 2025-06-11 | 6500 | 35 |
| 10 | HC009 | Toronto General Hospital | Canada | OTHER | samplegmail.com | premium | 2025-07-10 | 2000 | 44 |
| 11 | HC010 | King Faisal Hospital | Saudi Arabia | OTHER | info@gmail.com | VIP | BADDATE | 5000 | 60 |
| 12 | HC011 | Asan Medical Center | Korea | ASIA_PACIFIC | user@test.com | Gold | 2025-09-01 | 7000 | 52 |
| 13 | HC012 | Karolinska Institute | Sweden | EUROPE | abc@test.com | Premium | 2025-10-01 | -500 | 48 |
| 14 | HC013 | Bumrungrad Hospital | Thailand | ASIA_PACIFIC | usergmail.com | VIP | 2025-11-10 | 4000 | 41 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | OTHER | mail@test.com | Gold | 2025-12-10 | 5500 | 39 |
| 16 | HC015 | Fortis Healthcare | India | ASIA_PACIFIC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 |
Explanation
SELECT-WHEN
is cleaner than multiple IF-THEN statements. Enterprise code becomes easier to
maintain, audit, and validate. This technique is frequently used in SDTM
mapping programs when consolidating source-system categories into controlled
terminology.
Step 3: Correct Impossible Values
data healthcare_clean1;
set healthcare_raw;
if not missing(patient_age) then
patient_age = abs(patient_age);
billing_amount_num =
abs(input(compress(billing_amount,'$,'),best32.));
if patient_age > 100 then patient_age=.;
format billing_amount_num comma12.2;
drop billing_amount;
rename billing_amount_num=billing_amount;
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Patient_Age | billing_amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 45 | 5,000.00 |
| 2 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | 12 | 3,000.00 |
| 3 | HC003 | Apollo Hospitals | India | APAC | NULL | VIP | 2025/13/01 | . | 7,000.00 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 55 | 4,500.00 |
| 5 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 |
| 6 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 |
| 7 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | 30 | 900.00 |
| 8 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | . | 1,000.00 |
| 9 | HC008 | AIIMS | India | APAC | NULL | Silver | 2025-06-11 | 35 | 6,500.00 |
| 10 | HC009 | Toronto General Hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 44 | 2,000.00 |
| 11 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 60 | 5,000.00 |
| 12 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 52 | 7,000.00 |
| 13 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | 48 | 500.00 |
| 14 | HC013 | Bumrungrad Hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 41 | 4,000.00 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 39 | 5,500.00 |
| 16 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 29 | 4,200.00 |
Explanation
Negative
ages and negative billing values are impossible business conditions. ABS
converts negative values to positive values, while age limits enforce clinical
plausibility. ROUND improves reporting consistency and reduces floating-point
precision issues during statistical analyses.
Step 4: Email Validation
data healthcare_clean1;
set healthcare_raw;
if find(patient_email,'@')=0 then
patient_email='MISSING_EMAIL';
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 |
| 2 | HC002 | cleveland clinic | usa | U.S. | MISSING_EMAIL | gold | 2025-02-15 | -3000 | -12 |
| 3 | HC003 | Apollo Hospitals | India | APAC | MISSING_EMAIL | VIP | 2025/13/01 | 7000 | 210 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 4500 | 55 |
| 5 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 6 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 7 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 |
| 8 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 |
| 9 | HC008 | AIIMS | India | APAC | MISSING_EMAIL | Silver | 2025-06-11 | 6500 | 35 |
| 10 | HC009 | Toronto General Hospital | Canada | NA | MISSING_EMAIL | premium | 2025-07-10 | 2000 | 44 |
| 11 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 |
| 12 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 |
| 13 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 |
| 14 | HC013 | Bumrungrad Hospital | Thailand | ASIA | MISSING_EMAIL | VIP | 2025-11-10 | 4000 | 41 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 |
| 16 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 |
Explanation
Malformed
emails create communication failures and operational risks. FIND allows
efficient validation. In production systems, regex validation can further
enhance email verification.
Step 5: Duplicate Removal
proc sort data=healthcare_clean1
out=healthcare_clean2 nodupkey;
by center_id;
run;
proc print data=healthcare_clean2;
run;
LOG:
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 |
| 2 | HC002 | cleveland clinic | usa | U.S. | MISSING_EMAIL | gold | 2025-02-15 | -3000 | -12 |
| 3 | HC003 | Apollo Hospitals | India | APAC | MISSING_EMAIL | VIP | 2025/13/01 | 7000 | 210 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 4500 | 55 |
| 5 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 6 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 |
| 7 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 |
| 8 | HC008 | AIIMS | India | APAC | MISSING_EMAIL | Silver | 2025-06-11 | 6500 | 35 |
| 9 | HC009 | Toronto General Hospital | Canada | NA | MISSING_EMAIL | premium | 2025-07-10 | 2000 | 44 |
| 10 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 |
| 11 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 |
| 12 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 |
| 13 | HC013 | Bumrungrad Hospital | Thailand | ASIA | MISSING_EMAIL | VIP | 2025-11-10 | 4000 | 41 |
| 14 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 |
| 15 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 |
Explanation
Duplicate
records can inflate patient counts, billing totals, and enrollment metrics.
PROC SORT NODUPKEY removes duplicate keys while preserving one valid
observation. This is one of the most common validation checks in healthcare
analytics.
3.Advanced DATA Step
Techniques
Step 1: ARRAY Processing
data healthcare_clean1;
set healthcare_raw;
array chars(*) center_name country category;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
end;
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age | i |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 | 4 |
| 2 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | -3000 | -12 | 4 |
| 3 | HC003 | Apollo Hospitals | India | APAC | NULL | VIP | 2025/13/01 | 7000 | 210 | 4 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 4500 | 55 | 4 |
| 5 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . | 4 |
| 6 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | 8000 | . | 4 |
| 7 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 | 4 |
| 8 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 | 4 |
| 9 | HC008 | AIIMS | India | APAC | NULL | Silver | 2025-06-11 | 6500 | 35 | 4 |
| 10 | HC009 | Toronto General Hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 2000 | 44 | 4 |
| 11 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 | 4 |
| 12 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 | 4 |
| 13 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 | 4 |
| 14 | HC013 | Bumrungrad Hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 4000 | 41 | 4 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 | 4 |
| 16 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 | 4 |
Explanation
ARRAYS
reduce repetitive code and improve maintainability. Instead of writing multiple
STRIP statements, one loop cleans several variables simultaneously. Enterprise
programmers often use arrays when standardizing hundreds of SDTM variables.
Step 2: RETAIN Example
data healthcare_clean1;
set healthcare_raw;
retain cumulative_revenue 0;
cumulative_revenue+
billing_amount;
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age | cumulative_revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 | 5000 |
| 2 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | -3000 | -12 | 2000 |
| 3 | HC003 | Apollo Hospitals | India | APAC | NULL | VIP | 2025/13/01 | 7000 | 210 | 9000 |
| 4 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 4500 | 55 | 13500 |
| 5 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . | 21500 |
| 6 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | 8000 | . | 29500 |
| 7 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 | 28600 |
| 8 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 | 29600 |
| 9 | HC008 | AIIMS | India | APAC | NULL | Silver | 2025-06-11 | 6500 | 35 | 36100 |
| 10 | HC009 | Toronto General Hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 2000 | 44 | 38100 |
| 11 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 | 43100 |
| 12 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 | 50100 |
| 13 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 | 49600 |
| 14 | HC013 | Bumrungrad Hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 4000 | 41 | 53600 |
| 15 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 | 59100 |
| 16 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 | 63300 |
Explanation
RETAIN
preserves values across observations. This allows running totals, cumulative
metrics, and longitudinal tracking commonly required in healthcare operational
reporting.
Step 3: FIRST./LAST. Processing
proc sort data=healthcare_clean2;
by country;
run;
proc print data=healthcare_clean2;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC009 | Toronto General Hospital | Canada | NA | MISSING_EMAIL | premium | 2025-07-10 | 2000 | 44 |
| 2 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . |
| 3 | HC003 | Apollo Hospitals | India | APAC | MISSING_EMAIL | VIP | 2025/13/01 | 7000 | 210 |
| 4 | HC008 | AIIMS | India | APAC | MISSING_EMAIL | Silver | 2025-06-11 | 6500 | 35 |
| 5 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 |
| 6 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 |
| 7 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 |
| 8 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 |
| 9 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 |
| 10 | HC013 | Bumrungrad Hospital | Thailand | ASIA | MISSING_EMAIL | VIP | 2025-11-10 | 4000 | 41 |
| 11 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 |
| 12 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 5000 | 45 |
| 13 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 4500 | 55 |
| 14 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 |
| 15 | HC002 | cleveland clinic | usa | U.S. | MISSING_EMAIL | gold | 2025-02-15 | -3000 | -12 |
data country_summary;
set healthcare_clean2;
by country;
if first.country then count=0;
count+1;
if last.country;
run;
proc print data=country_summary;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Billing_Amount | Patient_Age | count |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HC009 | Toronto General Hospital | Canada | NA | MISSING_EMAIL | premium | 2025-07-10 | 2000 | 44 | 1 |
| 2 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | 8000 | . | 1 |
| 3 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 4200 | 29 | 3 |
| 4 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 7000 | 52 | 1 |
| 5 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 5000 | 60 | 1 |
| 6 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | -900 | 30 | 1 |
| 7 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | -500 | 48 | 1 |
| 8 | HC013 | Bumrungrad Hospital | Thailand | ASIA | MISSING_EMAIL | VIP | 2025-11-10 | 4000 | 41 | 1 |
| 9 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 5500 | 39 | 1 |
| 10 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | 1000 | 125 | 3 |
| 11 | HC002 | cleveland clinic | usa | U.S. | MISSING_EMAIL | gold | 2025-02-15 | -3000 | -12 | 1 |
Explanation
FIRST.
and LAST. processing supports grouped calculations. This technique is
fundamental in clinical trial summaries, patient counts, site metrics, and
regional reporting.
Step 4: PROC FORMAT
proc format;
value agegrp low-17='Pediatric'
18-64='Adult'
65-high='Senior';
run;
LOG:
Explanation
Formats
create reusable business classifications without physically changing stored
data. This improves reporting flexibility and consistency across projects.
4.PROC SQL vs DATA Step
Step 1: PROC SQL
proc sql;
create table revenue_country as
select country,
sum(billing_amount) as total_revenue
from healthcare_clean1
group by country;
quit;
proc print data=revenue_country;
run;
OUTPUT:
| Obs | Country | Patient_Age | total_revenue |
|---|---|---|---|
| 1 | Canada | Adult | 2000 |
| 2 | Germany | . | 16000 |
| 3 | Germany | . | 16000 |
| 4 | India | Adult | 17700 |
| 5 | India | . | 17700 |
| 6 | India | Adult | 17700 |
| 7 | Korea | Adult | 7000 |
| 8 | Saudi Arabia | Adult | 5000 |
| 9 | Singapore | Adult | 900 |
| 10 | Sweden | Adult | 500 |
| 11 | Thailand | Adult | 4000 |
| 12 | UAE | Adult | 5500 |
| 13 | USA | Adult | 10500 |
| 14 | USA | Adult | 10500 |
| 15 | USA | . | 10500 |
| 16 | usa | Pediatric | 3000 |
Explanation
PROC SQL
is concise and ideal for joins, aggregations, and relational processing. It
resembles database programming and is widely used in healthcare data
warehouses.
Step 2: DATA Step Alternative
proc sort data=healthcare_clean1;
by country;
run;
proc print data=healthcare_clean1;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Patient_Age | billing_amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC009 | Toronto General Hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 44 | 2,000.00 |
| 2 | HC005 | Charite Hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 |
| 3 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 |
| 4 | HC003 | Apollo Hospitals | India | APAC | NULL | VIP | 2025/13/01 | . | 7,000.00 |
| 5 | HC008 | AIIMS | India | APAC | NULL | Silver | 2025-06-11 | 35 | 6,500.00 |
| 6 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 29 | 4,200.00 |
| 7 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 52 | 7,000.00 |
| 8 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 60 | 5,000.00 |
| 9 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | 30 | 900.00 |
| 10 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | 48 | 500.00 |
| 11 | HC013 | Bumrungrad Hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 41 | 4,000.00 |
| 12 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 39 | 5,500.00 |
| 13 | HC001 | Mayo Clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 45 | 5,000.00 |
| 14 | HC004 | Johns Hopkins | USA | USA | test@yahoo | premium | NULL | 55 | 4,500.00 |
| 15 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | . | 1,000.00 |
| 16 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | 12 | 3,000.00 |
data revenue_country_ds;
set healthcare_clean1;
by country;
retain total_revenue;
if first.country then total_revenue=0;
total_revenue+billing_amount;
if last.country;
run;
proc print data=revenue_country_ds;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Patient_Age | billing_amount | total_revenue |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HC009 | Toronto General Hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 44 | 2,000.00 | 2000 |
| 2 | HC005 | Charite Hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 | 16000 |
| 3 | HC015 | Fortis Healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 29 | 4,200.00 | 17700 |
| 4 | HC011 | Asan Medical Center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 52 | 7,000.00 | 7000 |
| 5 | HC010 | King Faisal Hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 60 | 5,000.00 | 5000 |
| 6 | HC006 | Singapore General Hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | 30 | 900.00 | 900 |
| 7 | HC012 | Karolinska Institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | 48 | 500.00 | 500 |
| 8 | HC013 | Bumrungrad Hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 41 | 4,000.00 | 4000 |
| 9 | HC014 | Cleveland Clinic Abu Dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 39 | 5,500.00 | 5500 |
| 10 | HC007 | Mount Sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | . | 1,000.00 | 10500 |
| 11 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | 12 | 3,000.00 | 3000 |
Explanation
DATA Step
offers more control and often better performance for sequential processing. SAS
programmers frequently compare SQL and DATA Step approaches to choose the most
efficient solution.
Step 3: SAS Macro for Reusable Cleaning
%macro clean_text(var);
&var=lowcase(strip(&var));
%mend;
data healthcare_final;
set healthcare_clean1;
%clean_text(center_name);
run;
proc print data=healthcare_final;
run;
OUTPUT:
| Obs | Center_ID | Center_Name | Country | Region_Code | Patient_Email | Category | Visit_Date | Patient_Age | billing_amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HC009 | toronto general hospital | Canada | NA | samplegmail.com | premium | 2025-07-10 | 44 | 2,000.00 |
| 2 | HC005 | charite hospital | Germany | EU | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 |
| 3 | HC005 | charite hospital | Germany | EUR | user@abc.com | GOLD | 2025-03-10 | . | 8,000.00 |
| 4 | HC003 | apollo hospitals | India | APAC | NULL | VIP | 2025/13/01 | . | 7,000.00 |
| 5 | HC008 | aiims | India | APAC | NULL | Silver | 2025-06-11 | 35 | 6,500.00 |
| 6 | HC015 | fortis healthcare | India | APAC | fortis@test.com | Silver | 2025-06-15 | 29 | 4,200.00 |
| 7 | HC011 | asan medical center | Korea | APAC | user@test.com | Gold | 2025-09-01 | 52 | 7,000.00 |
| 8 | HC010 | king faisal hospital | Saudi Arabia | MEA | info@gmail.com | VIP | BADDATE | 60 | 5,000.00 |
| 9 | HC006 | singapore general hospital | Singapore | ASIA | abc@@gmail.com | VIP | 2025-04-01 | 30 | 900.00 |
| 10 | HC012 | karolinska institute | Sweden | EU | abc@test.com | Premium | 2025-10-01 | 48 | 500.00 |
| 11 | HC013 | bumrungrad hospital | Thailand | ASIA | usergmail.com | VIP | 2025-11-10 | 41 | 4,000.00 |
| 12 | HC014 | cleveland clinic abu dhabi | UAE | MiddleEast | mail@test.com | Gold | 2025-12-10 | 39 | 5,500.00 |
| 13 | HC001 | mayo clinic | USA | US | john@gmail.com | Premium | 2025-01-10 | 45 | 5,000.00 |
| 14 | HC004 | johns hopkins | USA | USA | test@yahoo | premium | NULL | 55 | 4,500.00 |
| 15 | HC007 | mount sinai | USA | US | demo@gmail.com | unknown | 2025-05-01 | . | 1,000.00 |
| 16 | HC002 | cleveland clinic | usa | U.S. | marygmail.com | gold | 2025-02-15 | 12 | 3,000.00 |
Explanation
Macros
promote standardization, reduce duplication, and improve maintainability.
Regulatory environments favor reusable validated components because they reduce
programming variability and QC effort.
Step 4: R Data Cleaning Workflow
library(tidyverse)
library(janitor)
library(lubridate)
healthcare_clean <- healthcare_raw %>%
clean_names() %>%
mutate(center_name = str_to_title(str_trim(center_name)),
country = str_to_upper(str_trim(country)),
category = case_when(category %in% c("gold","GOLD") ~ "Gold",
TRUE ~ str_to_title(category)),
billing_amount = abs(billing_amount),
patient_age = if_else(patient_age > 100,NA_real_,
abs(patient_age)),
patient_email = if_else(grepl("@", patient_email),
patient_email,NA_character_),
visit_date = na_if(visit_date, "NULL"),
visit_date = na_if(visit_date, "BADDATE"),
visit_date = if_else(grepl("^\\d{4}/13/", visit_date),
NA_character_,visit_date),
visit_date = parse_date_time(visit_date,
orders = c("ymd","Ymd"))
) %>%
distinct(center_id,.keep_all=TRUE)
write_xlsx(healthcare_clean,
"C:\\Users\\healthcare_clean.xlsx")
|
center_id |
center_name |
country |
region_code |
patient_age |
patient_email |
visit_date |
billing_amount |
category |
|
HC001 |
Mayo
Clinic |
USA |
US |
45 |
john@gmail.com |
2025-01-10 00:00:00 UTC |
5000 |
Premium |
|
HC002 |
Cleveland
Clinic |
USA |
U.S. |
12 |
2025-02-15 00:00:00 UTC |
3000 |
Gold |
|
|
HC003 |
Apollo
Hospitals |
INDIA |
APAC |
7000 |
Vip |
|||
|
HC004 |
Johns
Hopkins |
USA |
USA |
55 |
test@yahoo |
4500 |
Premium |
|
|
HC005 |
Charite
Hospital |
GERMANY |
EU |
user@abc.com |
2025-03-10 00:00:00 UTC |
8000 |
Gold |
|
|
HC006 |
Singapore
General Hospital |
SINGAPORE |
ASIA |
30 |
abc@@gmail.com |
2025-04-01 00:00:00 UTC |
900 |
Vip |
|
HC007 |
Mount
Sinai |
USA |
US |
demo@gmail.com |
2025-05-01 00:00:00 UTC |
1000 |
Unknown |
|
|
HC008 |
Aiims |
INDIA |
APAC |
35 |
2025-06-11 00:00:00 UTC |
6500 |
Silver |
|
|
HC009 |
Toronto
General Hospital |
CANADA |
NA |
44 |
2025-07-10 00:00:00 UTC |
2000 |
Premium |
|
|
HC010 |
King
Faisal Hospital |
SAUDI
ARABIA |
MEA |
60 |
info@gmail.com |
5000 |
Vip |
Explanation
This
workflow uses tidyverse principles to create clean analytical datasets.
mutate() transforms variables, case_when() standardizes categories, str_trim()
removes whitespace, abs() fixes numeric issues, parse_date_time() handles date
inconsistencies, and distinct() removes duplicates. Compared with SAS DATA Step
programming, tidyverse provides a declarative pipeline approach that many
modern analytics teams prefer.
Enterprise Validation &
Compliance
Healthcare
analytics operates under strict compliance expectations.
Key areas
include:
- SDTM standardization
- ADaM traceability
- Audit trail documentation
- QC independence
- Regulatory reproducibility
- Controlled terminology
A
critical SAS concept is missing-value behavior.
if age < 18 then flag='Y';
Missing
ages satisfy this condition because SAS treats missing numeric values as
smaller than all valid numbers.
This can
generate catastrophic classification errors.
Always
write:
if not missing(age) and age
<18 then flag='Y';
Independent
QC and traceability are essential for submission-quality datasets.
Business Logic Behind Data
Cleaning
Data
cleaning is fundamentally about protecting analytical truth. Missing values are
often imputed because many statistical models cannot process incomplete records
effectively. For example, if a patient's visit date is missing, enrollment
timelines may become inaccurate, affecting study milestones and resource
planning. Age corrections are equally important. A patient age of 210 years
clearly indicates data-entry failure. Allowing such values into statistical
analyses can distort averages, percentiles, and predictive models.
Billing
amount standardization prevents negative charges from inflating financial
losses. In banking, negative loan balances caused by input errors may trigger
incorrect risk classifications. In healthcare, negative treatment costs may
reduce revenue estimates and mislead executives.
Text
normalization improves grouping consistency. Without standardization,
"Premium", "premium", and "PREMIUM" create
multiple categories instead of one unified business segment. Standardized dates
ensure consistent chronological calculations. Regulatory reporting often
depends on accurate date intervals derived through INTCK and INTNX functions.
Imputation,
normalization, validation, and standardization collectively improve reporting reliability,
machine learning quality, operational efficiency, and regulatory compliance.
Effective data cleaning transforms unreliable raw information into trustworthy
business intelligence that supports accurate decision-making.
20 Data Cleaning Best
Practices
- Validate source metadata
first.
- Standardize variable naming
conventions.
- Use controlled terminology.
- Remove duplicates early.
- Validate date ranges.
- Audit missing values.
- Document derivations.
- Maintain lineage records.
- Use reusable macros.
- Separate development and
production.
- Perform independent QC.
- Validate category mappings.
- Protect against truncation.
- Standardize missing-value
handling.
- Version-control programs.
- Validate merge keys.
- Create reconciliation
reports.
- Automate validation checks.
- Maintain audit trails.
- Perform deployment reviews.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Missing values hide
operational risks.
- Duplicates inflate business
metrics.
- Metadata drives consistency.
- QC independence reduces
bias.
- Audit trails build trust.
- Traceability supports
compliance.
- Clean data improves AI
performance.
- Controlled terminology
reduces ambiguity.
- Defensive programming
prevents surprises.
- Reusable macros improve
efficiency.
- Automated validation saves
time.
- Business rules matter more
than tools.
- Reporting quality starts
with source quality.
- Consistent dates enable
accurate analytics.
- Data lineage supports
transparency.
- Reliable datasets improve
decisions.
- Enterprise analytics begins
with clean data.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Excellent |
Growing |
|
Flexibility |
High |
Very
High |
|
Macros |
Powerful |
Functions |
|
Data
Step |
Unique
Advantage |
No
Equivalent |
|
Visualization |
Good |
Excellent |
|
Open
Source |
No |
Yes |
|
Enterprise
Validation |
Excellent |
Good |
|
SDTM/ADaM
Usage |
Industry
Standard |
Supplemental |
|
Learning
Curve |
Moderate |
Moderate |
Validation Checklist
1.Duplicate
check completed
2.Missing-value
review completed
3.Date
validation completed
4.Category
standardization completed
5.Email
validation completed
6.Metadata
review completed
7.QC
review completed
8.Reporting
verification completed
Summary: SAS and R for
Enterprise Data Engineering
SAS and R
complement each other remarkably well in enterprise analytics. SAS excels in
controlled, auditable, highly regulated environments where traceability and
validation are paramount. Clinical trial submissions, SDTM conversions, ADaM
derivations, and regulatory reporting depend heavily on SAS because of its
mature data management capabilities and predictable execution framework.
R
provides flexibility, rapid development, modern data manipulation libraries,
and extensive visualization capabilities. Packages such as dplyr, stringr,
lubridate, janitor, and purrr simplify complex transformations through readable
pipelines. Data scientists frequently leverage R for exploratory analysis,
predictive modeling, and advanced reporting.
From a
scalability perspective, SAS performs exceptionally well on large operational
datasets and supports structured production deployment processes. R offers
unmatched extensibility and access to modern analytical ecosystems. Combining
SAS for validated data engineering and R for advanced analytics creates a
powerful hybrid architecture.
Organizations
that integrate both technologies benefit from stronger governance, greater
analytical flexibility, improved reproducibility, and more reliable business
intelligence. The ultimate objective is not simply cleaning data but
establishing trustworthy analytical foundations that support strategic decisions,
regulatory confidence, and operational excellence.
Conclusion
Modern
healthcare organizations generate enormous volumes of operational, clinical,
financial, and patient-level information every day. However, raw data rarely
arrives in a state suitable for analysis. Duplicate identifiers, missing dates,
invalid categories, malformed emails, negative financial values, inconsistent
coding standards, and corrupted text formats introduce significant analytical
risk.
The BEST
HEALTHCARE CENTERS IN WORLD project demonstrates how structured
data-cleaning methodologies transform unreliable information into dependable
analytical assets. Through SAS DATA Step programming, PROC SQL processing,
validation routines, formats, macros, and enterprise reporting procedures,
organizations can establish highly controlled and auditable workflows.
Simultaneously, R provides modern data transformation capabilities through
tidyverse pipelines that enable rapid, flexible, and scalable analytics.
The most
successful organizations treat data cleaning as a strategic discipline rather
than a preliminary task. Every corrected age, standardized date, validated
category, and removed duplicate improves downstream reporting quality. Strong
governance frameworks ensure reproducibility, traceability, compliance
readiness, and executive confidence.
Clinical
research, healthcare operations, banking, insurance, and retail analytics all
depend on trustworthy datasets. Poor-quality data can influence regulatory
submissions, distort machine learning models, mislead dashboards, and drive
costly business decisions. Robust validation processes, defensive programming
practices, metadata governance, and independent QC reviews are therefore
essential components of enterprise analytics.
When SAS
and R are combined effectively, organizations gain the auditability of SAS and
the flexibility of R. The result is a production-grade analytical ecosystem
capable of generating reliable intelligence from complex operational data.
Clean data is not merely a technical objective it is the foundation upon which
modern business decisions, patient safety, regulatory compliance, and
analytical excellence are built.
Interview Questions and
Answers
1. A clinical dataset contains duplicate patient
IDs. How would you identify and remove them in SAS?
Answer: I would use PROC SORT with
NODUPKEY and review duplicate outputs separately. I would validate duplicates
using business keys before deletion to ensure legitimate repeat visits are not
removed incorrectly.
2. A patient's age appears as -45 and 210. How
would you handle it?
Answer: First investigate source data.
Apply ABS for negative values if confirmed entry errors. Values exceeding
business thresholds (e.g., >100 years) should be set missing and reviewed
through data clarification workflows.
3. How do SAS and R differ in handling missing
values?
Answer: SAS numeric missing values are
treated as lower than valid numbers, requiring explicit checks. R uses NA
values and most functions require na.rm=TRUE for calculations.
4. When would you choose PROC SQL instead of a DATA
Step?
Answer: PROC SQL is preferred for joins,
aggregations, and relational operations. DATA Step is often preferred for
row-wise processing, retained calculations, arrays, and complex business-rule
implementations.
5. How would you validate a production-ready
healthcare reporting dataset?
Answer: I would perform metadata
validation, duplicate checks, missing-value review, category verification,
reconciliation against source systems, independent QC programming, audit-trail
review, and final reporting validation before deployment.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent HEALTHCARE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment