FROM HOSPITAL CHAOS TO TRUSTED HEALTHCARE INTELLIGENCE IN SAS AND R

BEST HEALTHCARE CENTERS IN WORLD: Building Analysis-Ready Datasets in SAS and R Through Enterprise Data Cleaning, Validation, and Professional Reporting

Introduction: When Bad Healthcare Data Becomes a Business Disaster

Imagine a global healthcare analytics organization preparing executive dashboards for the world's leading healthcare centers. Patient enrollment counts suddenly appear inflated, billing revenue becomes negative in monthly reports, duplicate patient records are submitted to regulatory agencies, and clinical trial visit dates occur before enrollment dates.

The root cause is not advanced analytics failure.

The root cause is dirty data.

In modern healthcare environments, poor-quality data can create:

  • Incorrect patient enrollment statistics
  • Wrong clinical trial conclusions
  • Insurance claim rejections
  • Regulatory compliance findings
  • Misleading AI predictions
  • Executive reporting failures
  • SDTM and ADaM validation issues
  • Financial forecasting errors

As Clinical SAS Programmers and Data Scientists, our responsibility is transforming corrupted operational data into trustworthy analytical intelligence.

This article demonstrates a complete enterprise workflow using a BEST HEALTHCARE CENTERS IN WORLD dataset containing intentional data quality problems and shows how SAS and R can systematically convert chaos into reliable business insights.

1.Raw Healthcare Dataset with Intentional Errors (SAS)

SAS Raw Dataset

data healthcare_raw;

length Center_ID $8 Center_Name $40 Country $20 Region_Code $20

Patient_Email $50 Category $20 Visit_Date $20 Billing_Amount $15;

infile datalines dlm='|' dsd truncover;

input Center_ID $ Center_Name $ Country $ Region_Code $

Patient_Age Patient_Email $ Visit_Date $ Billing_Amount $

Category $;

datalines;

HC001|Mayo Clinic|USA|US|45|john@gmail.com|2025-01-10|5000|Premium

HC002| cleveland clinic |usa|U.S.|-12|marygmail.com|2025-02-15|-3000|gold

HC003|Apollo Hospitals|India|APAC|210|NULL|2025/13/01|7000|VIP

HC004|Johns Hopkins|USA|USA|55|test@yahoo|NULL|4500|premium

HC005|Charite Hospital|Germany|EU|.|user@abc.com|2025-03-10|8000|GOLD

HC005|Charite Hospital|Germany|EUR|.|user@abc.com|2025-03-10|8000|GOLD

HC006|Singapore General Hospital|Singapore|ASIA|30|abc@@gmail.com|2025-04-01|-900|VIP

HC007|Mount Sinai|USA|US|125|demo@gmail.com|2025-05-01|1000|unknown

HC008|AIIMS|India|APAC|35|NULL|2025-06-11|6500|Silver

HC009|Toronto General Hospital|Canada|NA|44|samplegmail.com|2025-07-10|2000|premium

HC010|King Faisal Hospital|Saudi Arabia|MEA|60|info@gmail.com|BADDATE|5000|VIP

HC011|Asan Medical Center|Korea|APAC|52|user@test.com|2025-09-01|7000|Gold

HC012|Karolinska Institute|Sweden|EU|48|abc@test.com|2025-10-01|-500|Premium

HC013|Bumrungrad Hospital|Thailand|ASIA|41|usergmail.com|2025-11-10|4000|VIP

HC014|Cleveland Clinic Abu Dhabi|UAE|MiddleEast|39|mail@test.com|2025-12-10|5500|Gold

HC015|Fortis Healthcare|India|APAC|29|fortis@test.com|2025-06-15|4200|Silver

;

run;

proc print data=healthcare_raw;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Age
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10500045
2HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15-3000-12
3HC003Apollo HospitalsIndiaAPACNULLVIP2025/13/017000210
4HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL450055
5HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.
6HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-108000.
7HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-90030
8HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-011000125
9HC008AIIMSIndiaAPACNULLSilver2025-06-11650035
10HC009Toronto General HospitalCanadaNAsamplegmail.compremium2025-07-10200044
11HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE500060
12HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01700052
13HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-50048
14HC013Bumrungrad HospitalThailandASIAusergmail.comVIP2025-11-10400041
15HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10550039
16HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15420029

Raw Healthcare Dataset in R 

healthcare_raw <- tibble::tribble(

  ~center_id,~center_name,~country,~region_code,~patient_age,

  ~patient_email,~visit_date,~billing_amount,~category,

  "HC001","Mayo Clinic","USA","US",45,"john@gmail.com","2025-01-10",5000,"Premium",

  "HC002"," cleveland clinic ","usa","U.S.",-12,"marygmail.com","2025-02-15",-3000,"gold",

  "HC003","Apollo Hospitals","India","APAC",210,"NULL","2025/13/01",7000,"VIP",

  "HC004","Johns Hopkins","USA","USA",55,"test@yahoo","NULL",4500,"premium",

  "HC005","Charite Hospital","Germany","EU",NA,"user@abc.com","2025-03-10",8000,"GOLD",

  "HC005","Charite Hospital","Germany","EUR",NA,"user@abc.com","2025-03-10",8000,"GOLD",

  "HC006","Singapore General Hospital","Singapore","ASIA",30,"abc@@gmail.com",

  "2025-04-01",-900,"VIP",

  "HC007","Mount Sinai","USA","US",125,"demo@gmail.com","2025-05-01",1000,"unknown",

  "HC008","AIIMS","India","APAC",35,"NULL","2025-06-11",6500,"Silver",

  "HC009","Toronto General Hospital","Canada","NA",44,"samplegmail.com",

  "2025-07-10",2000,"premium",

  "HC010","King Faisal Hospital","Saudi Arabia","MEA",60,"info@gmail.com",

  "BADDATE",5000,"VIP"

)

OUTPUT:

center_id

center_name

country

region_code

patient_age

patient_email

visit_date

billing_amount

category

HC001

Mayo Clinic

USA

US

45

john@gmail.com

2025-01-10

5000

Premium

HC002

 cleveland clinic

usa

U.S.

-12

marygmail.com

2025-02-15

-3000

gold

HC003

Apollo Hospitals

India

APAC

210

NULL

2025/13/01

7000

VIP

HC004

Johns Hopkins

USA

USA

55

test@yahoo

NULL

4500

premium

HC005

Charite Hospital

Germany

EU

user@abc.com

2025-03-10

8000

GOLD

HC005

Charite Hospital

Germany

EUR

user@abc.com

2025-03-10

8000

GOLD

HC006

Singapore General Hospital

Singapore

ASIA

30

abc@@gmail.com

2025-04-01

-900

VIP

HC007

Mount Sinai

USA

US

125

demo@gmail.com

2025-05-01

1000

unknown

HC008

AIIMS

India

APAC

35

NULL

2025-06-11

6500

Silver

HC009

Toronto General Hospital

Canada

NA

44

samplegmail.com

2025-07-10

2000

premium

HC010

King Faisal Hospital

Saudi Arabia

MEA

60

info@gmail.com

BADDATE

5000

VIP

Understanding Character Truncation Risk in SAS

One of the most misunderstood SAS behaviors is character variable truncation.

data demo;

length region $20;

region='NORTH_AMERICA';

run;

proc print data=demo;

run;

OUTPUT:

Obsregion
1NORTH_AMERICA

Explanation

SAS allocates storage during compilation. If LENGTH is omitted before assignment, SAS determines length using the first encountered value, which may truncate future longer values.

For example, if "US" appears first, later values like "NORTH_AMERICA" may be truncated. In regulated environments such as SDTM and ADaM production, truncation can corrupt submission datasets and produce validation findings.

R behaves differently because character vectors dynamically accommodate varying string lengths without requiring explicit length declarations. SAS programmers must therefore define LENGTH statements before assignments, conditional logic, or imports to avoid unexpected truncation.

2.SAS Enterprise Cleaning Workflow

Step 1: Standardization

data healthcare_clean1;

set healthcare_raw;

center_name=propcase(strip(center_name));

country=upcase(strip(country));

category=propcase(strip(category));

patient_email=lowcase(strip(patient_email));

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Age
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10500045
2HC002Cleveland ClinicUSAU.S.marygmail.comGold2025-02-15-3000-12
3HC003Apollo HospitalsINDIAAPACnullVip2025/13/017000210
4HC004Johns HopkinsUSAUSAtest@yahooPremiumNULL450055
5HC005Charite HospitalGERMANYEUuser@abc.comGold2025-03-108000.
6HC005Charite HospitalGERMANYEURuser@abc.comGold2025-03-108000.
7HC006Singapore General HospitalSINGAPOREASIAabc@@gmail.comVip2025-04-01-90030
8HC007Mount SinaiUSAUSdemo@gmail.comUnknown2025-05-011000125
9HC008AiimsINDIAAPACnullSilver2025-06-11650035
10HC009Toronto General HospitalCANADANAsamplegmail.comPremium2025-07-10200044
11HC010King Faisal HospitalSAUDI ARABIAMEAinfo@gmail.comVipBADDATE500060
12HC011Asan Medical CenterKOREAAPACuser@test.comGold2025-09-01700052
13HC012Karolinska InstituteSWEDENEUabc@test.comPremium2025-10-01-50048
14HC013Bumrungrad HospitalTHAILANDASIAusergmail.comVip2025-11-10400041
15HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10550039
16HC015Fortis HealthcareINDIAAPACfortis@test.comSilver2025-06-15420029

Explanation

This step removes leading/trailing spaces, standardizes capitalization, and creates consistent reporting values. Without normalization, "premium", "Premium", and "PREMIUM" appear as separate categories in PROC FREQ outputs. Standardized text significantly improves reporting reliability, grouping consistency, dashboard accuracy, and machine learning feature quality.

Step 2: Region Correction Using SELECT-WHEN

data healthcare_clean1;

set healthcare_raw;

length region_code $20;

select(region_code);

when ('US','USA','U.S.') region_code='NORTH_AMERICA';

when ('EU','EUR') region_code='EUROPE';

when ('APAC','ASIA') region_code='ASIA_PACIFIC';

otherwise region_code='OTHER';

end;

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Age
1HC001Mayo ClinicUSANORTH_AMERICAjohn@gmail.comPremium2025-01-10500045
2HC002cleveland clinicusaNORTH_AMERICAmarygmail.comgold2025-02-15-3000-12
3HC003Apollo HospitalsIndiaASIA_PACIFICNULLVIP2025/13/017000210
4HC004Johns HopkinsUSANORTH_AMERICAtest@yahoopremiumNULL450055
5HC005Charite HospitalGermanyEUROPEuser@abc.comGOLD2025-03-108000.
6HC005Charite HospitalGermanyEUROPEuser@abc.comGOLD2025-03-108000.
7HC006Singapore General HospitalSingaporeASIA_PACIFICabc@@gmail.comVIP2025-04-01-90030
8HC007Mount SinaiUSANORTH_AMERICAdemo@gmail.comunknown2025-05-011000125
9HC008AIIMSIndiaASIA_PACIFICNULLSilver2025-06-11650035
10HC009Toronto General HospitalCanadaOTHERsamplegmail.compremium2025-07-10200044
11HC010King Faisal HospitalSaudi ArabiaOTHERinfo@gmail.comVIPBADDATE500060
12HC011Asan Medical CenterKoreaASIA_PACIFICuser@test.comGold2025-09-01700052
13HC012Karolinska InstituteSwedenEUROPEabc@test.comPremium2025-10-01-50048
14HC013Bumrungrad HospitalThailandASIA_PACIFICusergmail.comVIP2025-11-10400041
15HC014Cleveland Clinic Abu DhabiUAEOTHERmail@test.comGold2025-12-10550039
16HC015Fortis HealthcareIndiaASIA_PACIFICfortis@test.comSilver2025-06-15420029

Explanation

SELECT-WHEN is cleaner than multiple IF-THEN statements. Enterprise code becomes easier to maintain, audit, and validate. This technique is frequently used in SDTM mapping programs when consolidating source-system categories into controlled terminology.

Step 3: Correct Impossible Values

data healthcare_clean1;

set healthcare_raw;

if not missing(patient_age) then

    patient_age = abs(patient_age);

billing_amount_num =

       abs(input(compress(billing_amount,'$,'),best32.));

if patient_age > 100 then patient_age=.;

format billing_amount_num comma12.2;

drop billing_amount;

rename billing_amount_num=billing_amount;

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DatePatient_Agebilling_amount
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10455,000.00
2HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15123,000.00
3HC003Apollo HospitalsIndiaAPACNULLVIP2025/13/01.7,000.00
4HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL554,500.00
5HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-10.8,000.00
6HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-10.8,000.00
7HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-0130900.00
8HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-01.1,000.00
9HC008AIIMSIndiaAPACNULLSilver2025-06-11356,500.00
10HC009Toronto General HospitalCanadaNAsamplegmail.compremium2025-07-10442,000.00
11HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE605,000.00
12HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01527,000.00
13HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-0148500.00
14HC013Bumrungrad HospitalThailandASIAusergmail.comVIP2025-11-10414,000.00
15HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10395,500.00
16HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15294,200.00

Explanation

Negative ages and negative billing values are impossible business conditions. ABS converts negative values to positive values, while age limits enforce clinical plausibility. ROUND improves reporting consistency and reduces floating-point precision issues during statistical analyses.

Step 4: Email Validation

data healthcare_clean1;

set healthcare_raw;

if find(patient_email,'@')=0 then

patient_email='MISSING_EMAIL';

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Age
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10500045
2HC002cleveland clinicusaU.S.MISSING_EMAILgold2025-02-15-3000-12
3HC003Apollo HospitalsIndiaAPACMISSING_EMAILVIP2025/13/017000210
4HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL450055
5HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.
6HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-108000.
7HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-90030
8HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-011000125
9HC008AIIMSIndiaAPACMISSING_EMAILSilver2025-06-11650035
10HC009Toronto General HospitalCanadaNAMISSING_EMAILpremium2025-07-10200044
11HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE500060
12HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01700052
13HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-50048
14HC013Bumrungrad HospitalThailandASIAMISSING_EMAILVIP2025-11-10400041
15HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10550039
16HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15420029

Explanation

Malformed emails create communication failures and operational risks. FIND allows efficient validation. In production systems, regex validation can further enhance email verification.

Step 5: Duplicate Removal

proc sort data=healthcare_clean1

          out=healthcare_clean2 nodupkey;

by center_id;

run;

proc print data=healthcare_clean2;

run;

LOG:

NOTE: There were 16 observations read from the data set WORK.HEALTHCARE_CLEAN1.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.HEALTHCARE_CLEAN2 has 15 observations and 9 variables.

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Age
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10500045
2HC002cleveland clinicusaU.S.MISSING_EMAILgold2025-02-15-3000-12
3HC003Apollo HospitalsIndiaAPACMISSING_EMAILVIP2025/13/017000210
4HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL450055
5HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.
6HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-90030
7HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-011000125
8HC008AIIMSIndiaAPACMISSING_EMAILSilver2025-06-11650035
9HC009Toronto General HospitalCanadaNAMISSING_EMAILpremium2025-07-10200044
10HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE500060
11HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01700052
12HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-50048
13HC013Bumrungrad HospitalThailandASIAMISSING_EMAILVIP2025-11-10400041
14HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10550039
15HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15420029

Explanation

Duplicate records can inflate patient counts, billing totals, and enrollment metrics. PROC SORT NODUPKEY removes duplicate keys while preserving one valid observation. This is one of the most common validation checks in healthcare analytics.

3.Advanced DATA Step Techniques

Step 1: ARRAY Processing

data healthcare_clean1;

set healthcare_raw;

array chars(*) center_name country category;

do i=1 to dim(chars);

chars(i)=strip(chars(i));

end;

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Agei
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-105000454
2HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15-3000-124
3HC003Apollo HospitalsIndiaAPACNULLVIP2025/13/0170002104
4HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL4500554
5HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.4
6HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-108000.4
7HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-900304
8HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-0110001254
9HC008AIIMSIndiaAPACNULLSilver2025-06-116500354
10HC009Toronto General HospitalCanadaNAsamplegmail.compremium2025-07-102000444
11HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE5000604
12HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-017000524
13HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-500484
14HC013Bumrungrad HospitalThailandASIAusergmail.comVIP2025-11-104000414
15HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-105500394
16HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-154200294

Explanation

ARRAYS reduce repetitive code and improve maintainability. Instead of writing multiple STRIP statements, one loop cleans several variables simultaneously. Enterprise programmers often use arrays when standardizing hundreds of SDTM variables.

Step 2: RETAIN Example

data healthcare_clean1;

set healthcare_raw;

retain cumulative_revenue 0;

cumulative_revenue+

billing_amount;

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Agecumulative_revenue
1HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-105000455000
2HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15-3000-122000
3HC003Apollo HospitalsIndiaAPACNULLVIP2025/13/0170002109000
4HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL45005513500
5HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.21500
6HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-108000.29500
7HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-9003028600
8HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-01100012529600
9HC008AIIMSIndiaAPACNULLSilver2025-06-1165003536100
10HC009Toronto General HospitalCanadaNAsamplegmail.compremium2025-07-1020004438100
11HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE50006043100
12HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-0170005250100
13HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-5004849600
14HC013Bumrungrad HospitalThailandASIAusergmail.comVIP2025-11-1040004153600
15HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-1055003959100
16HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-1542002963300

Explanation

RETAIN preserves values across observations. This allows running totals, cumulative metrics, and longitudinal tracking commonly required in healthcare operational reporting.

Step 3: FIRST./LAST. Processing

proc sort data=healthcare_clean2;

by country;

run;

proc print data=healthcare_clean2;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Age
1HC009Toronto General HospitalCanadaNAMISSING_EMAILpremium2025-07-10200044
2HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.
3HC003Apollo HospitalsIndiaAPACMISSING_EMAILVIP2025/13/017000210
4HC008AIIMSIndiaAPACMISSING_EMAILSilver2025-06-11650035
5HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15420029
6HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01700052
7HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE500060
8HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-90030
9HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-50048
10HC013Bumrungrad HospitalThailandASIAMISSING_EMAILVIP2025-11-10400041
11HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10550039
12HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10500045
13HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL450055
14HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-011000125
15HC002cleveland clinicusaU.S.MISSING_EMAILgold2025-02-15-3000-12

data country_summary;

set healthcare_clean2;

by country;

if first.country then count=0;

count+1;

if last.country;

run;

proc print data=country_summary;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DateBilling_AmountPatient_Agecount
1HC009Toronto General HospitalCanadaNAMISSING_EMAILpremium2025-07-102000441
2HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-108000.1
3HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-154200293
4HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-017000521
5HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE5000601
6HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-01-900301
7HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-01-500481
8HC013Bumrungrad HospitalThailandASIAMISSING_EMAILVIP2025-11-104000411
9HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-105500391
10HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-0110001253
11HC002cleveland clinicusaU.S.MISSING_EMAILgold2025-02-15-3000-121

Explanation

FIRST. and LAST. processing supports grouped calculations. This technique is fundamental in clinical trial summaries, patient counts, site metrics, and regional reporting.

Step 4: PROC FORMAT

proc format;

value agegrp low-17='Pediatric'

              18-64='Adult'

            65-high='Senior';

run;

LOG:

NOTE: Format AGEGRP has been output.

Explanation

Formats create reusable business classifications without physically changing stored data. This improves reporting flexibility and consistency across projects.

4.PROC SQL vs DATA Step

Step 1: PROC SQL

proc sql;

create table revenue_country as

select country,

sum(billing_amount) as total_revenue

from healthcare_clean1

group by country;

quit;

proc print data=revenue_country;

run;

OUTPUT:

ObsCountryPatient_Agetotal_revenue
1CanadaAdult2000
2Germany.16000
3Germany.16000
4IndiaAdult17700
5India.17700
6IndiaAdult17700
7KoreaAdult7000
8Saudi ArabiaAdult5000
9SingaporeAdult900
10SwedenAdult500
11ThailandAdult4000
12UAEAdult5500
13USAAdult10500
14USAAdult10500
15USA.10500
16usaPediatric3000

Explanation

PROC SQL is concise and ideal for joins, aggregations, and relational processing. It resembles database programming and is widely used in healthcare data warehouses.

Step 2: DATA Step Alternative

proc sort data=healthcare_clean1;

by country;

run;

proc print data=healthcare_clean1;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DatePatient_Agebilling_amount
1HC009Toronto General HospitalCanadaNAsamplegmail.compremium2025-07-10442,000.00
2HC005Charite HospitalGermanyEUuser@abc.comGOLD2025-03-10.8,000.00
3HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-10.8,000.00
4HC003Apollo HospitalsIndiaAPACNULLVIP2025/13/01.7,000.00
5HC008AIIMSIndiaAPACNULLSilver2025-06-11356,500.00
6HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15294,200.00
7HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01527,000.00
8HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE605,000.00
9HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-0130900.00
10HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-0148500.00
11HC013Bumrungrad HospitalThailandASIAusergmail.comVIP2025-11-10414,000.00
12HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10395,500.00
13HC001Mayo ClinicUSAUSjohn@gmail.comPremium2025-01-10455,000.00
14HC004Johns HopkinsUSAUSAtest@yahoopremiumNULL554,500.00
15HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-01.1,000.00
16HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15123,000.00

data revenue_country_ds;

set healthcare_clean1;

by country;

retain total_revenue;

if first.country then total_revenue=0;

total_revenue+billing_amount;

if last.country;

run;

proc print data=revenue_country_ds;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DatePatient_Agebilling_amounttotal_revenue
1HC009Toronto General HospitalCanadaNAsamplegmail.compremium2025-07-10442,000.002000
2HC005Charite HospitalGermanyEURuser@abc.comGOLD2025-03-10.8,000.0016000
3HC015Fortis HealthcareIndiaAPACfortis@test.comSilver2025-06-15294,200.0017700
4HC011Asan Medical CenterKoreaAPACuser@test.comGold2025-09-01527,000.007000
5HC010King Faisal HospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE605,000.005000
6HC006Singapore General HospitalSingaporeASIAabc@@gmail.comVIP2025-04-0130900.00900
7HC012Karolinska InstituteSwedenEUabc@test.comPremium2025-10-0148500.00500
8HC013Bumrungrad HospitalThailandASIAusergmail.comVIP2025-11-10414,000.004000
9HC014Cleveland Clinic Abu DhabiUAEMiddleEastmail@test.comGold2025-12-10395,500.005500
10HC007Mount SinaiUSAUSdemo@gmail.comunknown2025-05-01.1,000.0010500
11HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15123,000.003000

Explanation

DATA Step offers more control and often better performance for sequential processing. SAS programmers frequently compare SQL and DATA Step approaches to choose the most efficient solution.

Step 3: SAS Macro for Reusable Cleaning

%macro clean_text(var);

&var=lowcase(strip(&var));

%mend;


data healthcare_final;

set healthcare_clean1;

%clean_text(center_name);

run;

proc print data=healthcare_final;

run;

OUTPUT:

ObsCenter_IDCenter_NameCountryRegion_CodePatient_EmailCategoryVisit_DatePatient_Agebilling_amount
1HC009toronto general hospitalCanadaNAsamplegmail.compremium2025-07-10442,000.00
2HC005charite hospitalGermanyEUuser@abc.comGOLD2025-03-10.8,000.00
3HC005charite hospitalGermanyEURuser@abc.comGOLD2025-03-10.8,000.00
4HC003apollo hospitalsIndiaAPACNULLVIP2025/13/01.7,000.00
5HC008aiimsIndiaAPACNULLSilver2025-06-11356,500.00
6HC015fortis healthcareIndiaAPACfortis@test.comSilver2025-06-15294,200.00
7HC011asan medical centerKoreaAPACuser@test.comGold2025-09-01527,000.00
8HC010king faisal hospitalSaudi ArabiaMEAinfo@gmail.comVIPBADDATE605,000.00
9HC006singapore general hospitalSingaporeASIAabc@@gmail.comVIP2025-04-0130900.00
10HC012karolinska instituteSwedenEUabc@test.comPremium2025-10-0148500.00
11HC013bumrungrad hospitalThailandASIAusergmail.comVIP2025-11-10414,000.00
12HC014cleveland clinic abu dhabiUAEMiddleEastmail@test.comGold2025-12-10395,500.00
13HC001mayo clinicUSAUSjohn@gmail.comPremium2025-01-10455,000.00
14HC004johns hopkinsUSAUSAtest@yahoopremiumNULL554,500.00
15HC007mount sinaiUSAUSdemo@gmail.comunknown2025-05-01.1,000.00
16HC002cleveland clinicusaU.S.marygmail.comgold2025-02-15123,000.00

Explanation

Macros promote standardization, reduce duplication, and improve maintainability. Regulatory environments favor reusable validated components because they reduce programming variability and QC effort.

Step 4: R Data Cleaning Workflow

library(tidyverse)

library(janitor)

library(lubridate)

healthcare_clean <- healthcare_raw %>%

  clean_names() %>%

  mutate(center_name = str_to_title(str_trim(center_name)),

             country = str_to_upper(str_trim(country)),

            category = case_when(category %in% c("gold","GOLD") ~ "Gold",

                       TRUE ~ str_to_title(category)),

      billing_amount = abs(billing_amount),

         patient_age = if_else(patient_age > 100,NA_real_,

                          abs(patient_age)),

       patient_email = if_else(grepl("@", patient_email),

                       patient_email,NA_character_),

      visit_date = na_if(visit_date, "NULL"),

      visit_date = na_if(visit_date, "BADDATE"),

      visit_date = if_else(grepl("^\\d{4}/13/", visit_date),

                   NA_character_,visit_date),

      visit_date = parse_date_time(visit_date,

                   orders = c("ymd","Ymd"))

  ) %>%

  distinct(center_id,.keep_all=TRUE)

write_xlsx(healthcare_clean,

  "C:\\Users\\healthcare_clean.xlsx")

OUTPUT:

center_id

center_name

country

region_code

patient_age

patient_email

  visit_date

billing_amount

category

HC001

Mayo Clinic

USA

US

45

john@gmail.com

2025-01-10 00:00:00 UTC

5000

Premium

HC002

Cleveland Clinic

USA

U.S.

12

2025-02-15 00:00:00 UTC

3000

Gold

HC003

Apollo Hospitals

INDIA

APAC

7000

Vip

HC004

Johns Hopkins

USA

USA

55

test@yahoo

4500

Premium

HC005

Charite Hospital

GERMANY

EU

user@abc.com

2025-03-10 00:00:00 UTC

8000

Gold

HC006

Singapore General Hospital

SINGAPORE

ASIA

30

abc@@gmail.com

2025-04-01 00:00:00 UTC

900

Vip

HC007

Mount Sinai

USA

US

demo@gmail.com

2025-05-01 00:00:00 UTC

1000

Unknown

HC008

Aiims

INDIA

APAC

35

2025-06-11 00:00:00 UTC

6500

Silver

HC009

Toronto General Hospital

CANADA

NA

44

2025-07-10 00:00:00 UTC

2000

Premium

HC010

King Faisal Hospital

SAUDI ARABIA

MEA

60

info@gmail.com

5000

Vip

Explanation

This workflow uses tidyverse principles to create clean analytical datasets. mutate() transforms variables, case_when() standardizes categories, str_trim() removes whitespace, abs() fixes numeric issues, parse_date_time() handles date inconsistencies, and distinct() removes duplicates. Compared with SAS DATA Step programming, tidyverse provides a declarative pipeline approach that many modern analytics teams prefer.

Enterprise Validation & Compliance

Healthcare analytics operates under strict compliance expectations.

Key areas include:

  • SDTM standardization
  • ADaM traceability
  • Audit trail documentation
  • QC independence
  • Regulatory reproducibility
  • Controlled terminology

A critical SAS concept is missing-value behavior.

if age < 18 then flag='Y';

Missing ages satisfy this condition because SAS treats missing numeric values as smaller than all valid numbers.

This can generate catastrophic classification errors.

Always write:

if not missing(age) and age <18 then flag='Y';

Independent QC and traceability are essential for submission-quality datasets.

Business Logic Behind Data Cleaning

Data cleaning is fundamentally about protecting analytical truth. Missing values are often imputed because many statistical models cannot process incomplete records effectively. For example, if a patient's visit date is missing, enrollment timelines may become inaccurate, affecting study milestones and resource planning. Age corrections are equally important. A patient age of 210 years clearly indicates data-entry failure. Allowing such values into statistical analyses can distort averages, percentiles, and predictive models.

Billing amount standardization prevents negative charges from inflating financial losses. In banking, negative loan balances caused by input errors may trigger incorrect risk classifications. In healthcare, negative treatment costs may reduce revenue estimates and mislead executives.

Text normalization improves grouping consistency. Without standardization, "Premium", "premium", and "PREMIUM" create multiple categories instead of one unified business segment. Standardized dates ensure consistent chronological calculations. Regulatory reporting often depends on accurate date intervals derived through INTCK and INTNX functions.

Imputation, normalization, validation, and standardization collectively improve reporting reliability, machine learning quality, operational efficiency, and regulatory compliance. Effective data cleaning transforms unreliable raw information into trustworthy business intelligence that supports accurate decision-making.

20 Data Cleaning Best Practices

  1. Validate source metadata first.
  2. Standardize variable naming conventions.
  3. Use controlled terminology.
  4. Remove duplicates early.
  5. Validate date ranges.
  6. Audit missing values.
  7. Document derivations.
  8. Maintain lineage records.
  9. Use reusable macros.
  10. Separate development and production.
  11. Perform independent QC.
  12. Validate category mappings.
  13. Protect against truncation.
  14. Standardize missing-value handling.
  15. Version-control programs.
  16. Validate merge keys.
  17. Create reconciliation reports.
  18. Automate validation checks.
  19. Maintain audit trails.
  20. Perform deployment reviews.

20 One-Line Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic is stronger than visual inspection.
  • Standardized variables improve reproducibility.
  • Missing values hide operational risks.
  • Duplicates inflate business metrics.
  • Metadata drives consistency.
  • QC independence reduces bias.
  • Audit trails build trust.
  • Traceability supports compliance.
  • Clean data improves AI performance.
  • Controlled terminology reduces ambiguity.
  • Defensive programming prevents surprises.
  • Reusable macros improve efficiency.
  • Automated validation saves time.
  • Business rules matter more than tools.
  • Reporting quality starts with source quality.
  • Consistent dates enable accurate analytics.
  • Data lineage supports transparency.
  • Reliable datasets improve decisions.
  • Enterprise analytics begins with clean data.

SAS vs R Comparison

Feature

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Excellent

Growing

Flexibility

High

Very High

Macros

Powerful

Functions

Data Step

Unique Advantage

No Equivalent

Visualization

Good

Excellent

Open Source

No

Yes

Enterprise Validation

Excellent

Good

SDTM/ADaM Usage

Industry Standard

Supplemental

Learning Curve

Moderate

Moderate

Validation Checklist

1.Duplicate check completed

2.Missing-value review completed

3.Date validation completed

4.Category standardization completed

5.Email validation completed

6.Metadata review completed

7.QC review completed

8.Reporting verification completed

Summary: SAS and R for Enterprise Data Engineering

SAS and R complement each other remarkably well in enterprise analytics. SAS excels in controlled, auditable, highly regulated environments where traceability and validation are paramount. Clinical trial submissions, SDTM conversions, ADaM derivations, and regulatory reporting depend heavily on SAS because of its mature data management capabilities and predictable execution framework.

R provides flexibility, rapid development, modern data manipulation libraries, and extensive visualization capabilities. Packages such as dplyr, stringr, lubridate, janitor, and purrr simplify complex transformations through readable pipelines. Data scientists frequently leverage R for exploratory analysis, predictive modeling, and advanced reporting.

From a scalability perspective, SAS performs exceptionally well on large operational datasets and supports structured production deployment processes. R offers unmatched extensibility and access to modern analytical ecosystems. Combining SAS for validated data engineering and R for advanced analytics creates a powerful hybrid architecture.

Organizations that integrate both technologies benefit from stronger governance, greater analytical flexibility, improved reproducibility, and more reliable business intelligence. The ultimate objective is not simply cleaning data but establishing trustworthy analytical foundations that support strategic decisions, regulatory confidence, and operational excellence.

Conclusion

Modern healthcare organizations generate enormous volumes of operational, clinical, financial, and patient-level information every day. However, raw data rarely arrives in a state suitable for analysis. Duplicate identifiers, missing dates, invalid categories, malformed emails, negative financial values, inconsistent coding standards, and corrupted text formats introduce significant analytical risk.

The BEST HEALTHCARE CENTERS IN WORLD project demonstrates how structured data-cleaning methodologies transform unreliable information into dependable analytical assets. Through SAS DATA Step programming, PROC SQL processing, validation routines, formats, macros, and enterprise reporting procedures, organizations can establish highly controlled and auditable workflows. Simultaneously, R provides modern data transformation capabilities through tidyverse pipelines that enable rapid, flexible, and scalable analytics.

The most successful organizations treat data cleaning as a strategic discipline rather than a preliminary task. Every corrected age, standardized date, validated category, and removed duplicate improves downstream reporting quality. Strong governance frameworks ensure reproducibility, traceability, compliance readiness, and executive confidence.

Clinical research, healthcare operations, banking, insurance, and retail analytics all depend on trustworthy datasets. Poor-quality data can influence regulatory submissions, distort machine learning models, mislead dashboards, and drive costly business decisions. Robust validation processes, defensive programming practices, metadata governance, and independent QC reviews are therefore essential components of enterprise analytics.

When SAS and R are combined effectively, organizations gain the auditability of SAS and the flexibility of R. The result is a production-grade analytical ecosystem capable of generating reliable intelligence from complex operational data. Clean data is not merely a technical objective it is the foundation upon which modern business decisions, patient safety, regulatory compliance, and analytical excellence are built.

Interview Questions and Answers

1. A clinical dataset contains duplicate patient IDs. How would you identify and remove them in SAS?

Answer: I would use PROC SORT with NODUPKEY and review duplicate outputs separately. I would validate duplicates using business keys before deletion to ensure legitimate repeat visits are not removed incorrectly.

2. A patient's age appears as -45 and 210. How would you handle it?

Answer: First investigate source data. Apply ABS for negative values if confirmed entry errors. Values exceeding business thresholds (e.g., >100 years) should be set missing and reviewed through data clarification workflows.

3. How do SAS and R differ in handling missing values?

Answer: SAS numeric missing values are treated as lower than valid numbers, requiring explicit checks. R uses NA values and most functions require na.rm=TRUE for calculations.

4. When would you choose PROC SQL instead of a DATA Step?

Answer: PROC SQL is preferred for joins, aggregations, and relational operations. DATA Step is often preferred for row-wise processing, retained calculations, arrays, and complex business-rule implementations.

5. How would you validate a production-ready healthcare reporting dataset?

Answer: I would perform metadata validation, duplicate checks, missing-value review, category verification, reconciliation against source systems, independent QC programming, audit-trail review, and final reporting validation before deployment.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent HEALTHCARE DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics 

·  Clinical SAS Programmer

·  Research Data Analyst

·  Regulatory Data Validator


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS