Where Data Learns to Dance: Advanced Cleaning Frameworks for Reliable Analytics Using Sas And R
From Stage Legends to Statistical Legends: Transforming the World's Best Dancers Dataset into Analysis-Ready Intelligence with SAS and R
The Business Crisis Nobody Saw Coming
Imagine a
global entertainment analytics company preparing an annual report ranking the
world's most influential dancers. Sponsors are allocating millions in endorsement
budgets, streaming platforms are using AI models to recommend performers, and
event organizers are planning international festivals.
Then
disaster strikes.
The
dashboard reports that a dancer earned negative annual revenue, another
performer appears twice due to duplicate IDs, ages exceed 140 years, and
several email addresses are malformed.
A dancer
from India is incorrectly classified under the European region because of
inconsistent region codes such as:
- APAC
- apac
- Asia Pacific
- AP-AC
- asia
Machine
learning models begin producing biased predictions.
Executive
reports become inaccurate.
Business
decisions become unreliable.
This is
exactly what happens in healthcare when patient enrollment is duplicated, in
banking when fraud transactions are misclassified, and in insurance when claim
records contain invalid policy dates.
Dirty
data is not merely an inconvenience.
It is an enterprise risk.
1.Raw Corrupted Dataset SAS Version
data dancers_raw;
length Dancer_Name $30 Country $20 Dance_Style $20 Region $15
Email $60 Award_Level $20;
infile datalines dlm='|' dsd truncover;
informat Debut_Date anydtdte20.;
format Debut_Date date9.;
input Dancer_ID Dancer_Name :$30. Country :$20. Age Dance_Style :$20.
Annual_Earnings Debut_Date :anydtdte20. Region :$15. Email :$60.
Award_Level :$20.;
datalines;
101|MichaelJackson|USA|50|Pop|-450000|15JAN1980|APAC|michael@gmail.com|Platinum
102|prabhuDeva|india|-5|Cinema|850000|12FEB1991|asia|prabhudeva@gmailcom|Gold
103|MadhuriDixit|India|57|Bollywood|920000|.|AP-AC|madhuri@gmail.com|Platinum
103|MadhuriDixit|India|57|Bollywood|920000|.|AP-AC|madhuri@gmail.com|Platinum
104|NULL|UK|150|Ballet|350000|32JAN1990|EUROPE|marygmail.com|Silver
105|HrithikRoshan|India|52|Bollywood|-90000|20MAR2000|apac|hrithik@ gmail.com|Gold
106|Shakira|Colombia|49|Latin|720000|18APR1998|LATAM|shakira@gmail.com|Premium
107|BTS|Korea|31|KPOP|600000|InvalidDate|APAC|bts@gmail.com|Premium
108|ChrisBrown|USA|38|HipHop|450000|25DEC2005|USA|chris@gmail.com|Platinum
109|NoraFatehi|India|36|BellyDance|-50000|.|asia|nora@gmailcom|Gold
110|MistyCopeland|USA|44|Ballet|900000|12JUL2003|usa|misty@gmail.com|Platinum
111|AkramKhan|UK|52|Contemporary|610000|14AUG1997|Europe|akram@gmail.com|Gold
112|InvalidName|India|.|Bollywood|450000|19SEP2010|APAC|invalidmail|Premium
113|Jabbawockeez|USA|28|HipHop|500000|25DEC2014|USA|crew@gmail.com|Gold
114|JenniferLopez|USA|56|Pop|980000|12JAN1999|usa|jlo@gmail.com|Platinum
115|JoaquinCortes|Spain|57|Flamenco|650000|14FEB1988|europe|joaquin@gmail.com|Gold
;
run;
proc print data=dancers_raw;
run;
OUTPUT:
| Obs | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual_Earnings | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MichaelJackson | USA | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | -450000 |
| 2 | prabhuDeva | india | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | -5 | 850000 |
| 3 | MadhuriDixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | . | 103 | 57 | 920000 |
| 4 | MadhuriDixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | . | 103 | 57 | 920000 |
| 5 | NULL | UK | Ballet | EUROPE | marygmail.com | Silver | . | 104 | 150 | 350000 |
| 6 | HrithikRoshan | India | Bollywood | apac | hrithik@ gmail.com | Gold | 20MAR2000 | 105 | 52 | -90000 |
| 7 | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 8 | BTS | Korea | KPOP | APAC | bts@gmail.com | Premium | . | 107 | 31 | 600000 |
| 9 | ChrisBrown | USA | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 10 | NoraFatehi | India | BellyDance | asia | nora@gmailcom | Gold | . | 109 | 36 | -50000 |
| 11 | MistyCopeland | USA | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 12 | AkramKhan | UK | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 13 | InvalidName | India | Bollywood | APAC | invalidmail | Premium | 19SEP2010 | 112 | . | 450000 |
| 14 | Jabbawockeez | USA | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 15 | JenniferLopez | USA | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
| 16 | JoaquinCortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
Why LENGTH Must Come First
One of
the most dangerous SAS behaviors is character truncation.
data example;
name="Christopher";
length name $5;
run;
proc print data=example;
run;
OUTPUT:
| Obs | name |
|---|---|
| 1 | Chris |
SAS
allocates memory during compile time.
If LENGTH
appears after assignment, the variable length is already fixed.
R behaves
differently because character vectors dynamically allocate memory and do not
truncate automatically.
In
regulated industries such as clinical trials, truncating treatment codes can
create SDTM compliance failures.
2.Enterprise Cleaning Workflow Using DATA Step
data dancers_clean;
length Region_Std $20 Email_Domain $30;
set dancers_raw;
Dancer_Name=propcase(strip(Dancer_Name));
Country=propcase(strip(Country));
Email=lowcase(compress(Email));
Annual_Earnings=abs(Annual_Earnings);
Age=abs(Age);
if Age>90 then Age=.;
if missing(Debut_Date) then
Debut_Date=intnx('year',today(),-10,'same');
select(upcase(region));
when("APAC","ASIA","AP-AC") Region_Std="Asia Pacific";
when("USA","US") Region_Std="North America";
when("EUROPE","UK") Region_Std="Europe";
otherwise Region_Std="Other";
end;
Email_Domain=scan(email,2,'@');
if index(email,'@')=0 then Email='invalid@email.com';
Award_Level=propcase(Award_Level);
run;
proc print data=dancers_clean;
run;
OUTPUT:
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual_Earnings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asia Pacific | gmail.com | Michaeljackson | Usa | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | 450000 |
| 2 | Asia Pacific | gmailcom | Prabhudeva | India | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | 5 | 850000 |
| 3 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 4 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 5 | Europe | Null | Uk | Ballet | EUROPE | invalid@email.com | Silver | 27JUN2016 | 104 | . | 350000 | |
| 6 | Asia Pacific | gmail.com | Hrithikroshan | India | Bollywood | apac | hrithik@gmail.com | Gold | 20MAR2000 | 105 | 52 | 90000 |
| 7 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 8 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 |
| 9 | North America | gmail.com | Chrisbrown | Usa | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 10 | Asia Pacific | gmailcom | Norafatehi | India | BellyDance | asia | nora@gmailcom | Gold | 27JUN2016 | 109 | 36 | 50000 |
| 11 | North America | gmail.com | Mistycopeland | Usa | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 12 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 13 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | |
| 14 | North America | gmail.com | Jabbawockeez | Usa | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 15 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
| 16 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
Key Points
This DATA
step demonstrates:
- IF THEN ELSE validation
- SELECT WHEN categorization
- ABS() correction
- INTNX() imputation
- SCAN() extraction
- INDEX() validation
- PROPCASE() normalization
- COMPRESS() whitespace
removal
The
workflow converts unusable operational records into analytical intelligence.
3.Duplicate Removal
proc sort data=dancers_clean nodupkey;
by Dancer_ID;
run;
proc print data=dancers_clean;
run;
LOG:
OUTPUT:
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual_Earnings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asia Pacific | gmail.com | Michaeljackson | Usa | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | 450000 |
| 2 | Asia Pacific | gmailcom | Prabhudeva | India | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | 5 | 850000 |
| 3 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 4 | Europe | Null | Uk | Ballet | EUROPE | invalid@email.com | Silver | 27JUN2016 | 104 | . | 350000 | |
| 5 | Asia Pacific | gmail.com | Hrithikroshan | India | Bollywood | apac | hrithik@gmail.com | Gold | 20MAR2000 | 105 | 52 | 90000 |
| 6 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 7 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 |
| 8 | North America | gmail.com | Chrisbrown | Usa | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 9 | Asia Pacific | gmailcom | Norafatehi | India | BellyDance | asia | nora@gmailcom | Gold | 27JUN2016 | 109 | 36 | 50000 |
| 10 | North America | gmail.com | Mistycopeland | Usa | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 11 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 12 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | |
| 13 | North America | gmail.com | Jabbawockeez | Usa | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 14 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
| 15 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
Explanation
NODUPKEY
removes duplicate business keys while preserving one validated record.
Clinical
trials frequently use this technique to avoid duplicate subjects in DM domains.
4.ARRAY-Based Validation
data validation;
set dancers_clean;
array nums(*) Age Annual_Earnings;
do i=1 to dim(nums);
if nums(i)<0 then nums(i)=abs(nums(i));
end;
drop i;
run;
proc print data=validation;
run;
OUTPUT:
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual_Earnings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Asia Pacific | gmail.com | Michaeljackson | Usa | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | 450000 |
| 2 | Asia Pacific | gmailcom | Prabhudeva | India | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | 5 | 850000 |
| 3 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 4 | Europe | Null | Uk | Ballet | EUROPE | invalid@email.com | Silver | 27JUN2016 | 104 | . | 350000 | |
| 5 | Asia Pacific | gmail.com | Hrithikroshan | India | Bollywood | apac | hrithik@gmail.com | Gold | 20MAR2000 | 105 | 52 | 90000 |
| 6 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 7 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 |
| 8 | North America | gmail.com | Chrisbrown | Usa | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 9 | Asia Pacific | gmailcom | Norafatehi | India | BellyDance | asia | nora@gmailcom | Gold | 27JUN2016 | 109 | 36 | 50000 |
| 10 | North America | gmail.com | Mistycopeland | Usa | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 11 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 12 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | |
| 13 | North America | gmail.com | Jabbawockeez | Usa | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 14 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
| 15 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
Explanation
ARRAY
processing allows scalable validation without repetitive code.
Large
SDTM domains may contain hundreds of numeric variables requiring identical
checks.
5.FIRST./LAST. Processing
proc sort data=dancers_clean;
by Country;
run;
proc print data=dancers_clean;
run;
OUTPUT:
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual_Earnings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 2 | Asia Pacific | gmailcom | Prabhudeva | India | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | 5 | 850000 |
| 3 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 4 | Asia Pacific | gmail.com | Hrithikroshan | India | Bollywood | apac | hrithik@gmail.com | Gold | 20MAR2000 | 105 | 52 | 90000 |
| 5 | Asia Pacific | gmailcom | Norafatehi | India | BellyDance | asia | nora@gmailcom | Gold | 27JUN2016 | 109 | 36 | 50000 |
| 6 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | |
| 7 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 |
| 8 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
| 9 | Europe | Null | Uk | Ballet | EUROPE | invalid@email.com | Silver | 27JUN2016 | 104 | . | 350000 | |
| 10 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 11 | Asia Pacific | gmail.com | Michaeljackson | Usa | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | 450000 |
| 12 | North America | gmail.com | Chrisbrown | Usa | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 13 | North America | gmail.com | Mistycopeland | Usa | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 14 | North America | gmail.com | Jabbawockeez | Usa | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 15 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
data country_summary;
set dancers_clean;
by Country;
retain Total_Earnings 0;
Total_Earnings+Annual_Earnings;
if last.Country;
run;
proc print data=country_summary;
run;
OUTPUT:
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual_Earnings | Total_Earnings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 | 720000 |
| 2 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | 3080000 | |
| 3 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 | 3680000 |
| 4 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 | 4330000 |
| 5 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 | 5290000 |
| 6 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 | 8570000 |
6.PROC FORMAT
proc format;
value agegrp 0-20='Young'
21-40='Mid Career'
41-high='Legend';
run;
LOG:
proc sql;
create table earnings_summary as
select Country,Age format=agegrp.,
count(*) as Total_Dancers,
mean(Annual_Earnings) as Avg_Earnings
from dancers_clean
group by Country;
quit;
proc print data=earnings_summary;
run;
OUTPUT:
| Obs | Country | Age | Total_Dancers | Avg_Earnings |
|---|---|---|---|---|
| 1 | Colombia | Legend | 1 | 720000 |
| 2 | India | Young | 5 | 472000 |
| 3 | India | Legend | 5 | 472000 |
| 4 | India | Legend | 5 | 472000 |
| 5 | India | Mid Career | 5 | 472000 |
| 6 | India | . | 5 | 472000 |
| 7 | Korea | Mid Career | 1 | 600000 |
| 8 | Spain | Legend | 1 | 650000 |
| 9 | Uk | . | 2 | 480000 |
| 10 | Uk | Legend | 2 | 480000 |
| 11 | Usa | Legend | 5 | 656000 |
| 12 | Usa | Mid Career | 5 | 656000 |
| 13 | Usa | Legend | 5 | 656000 |
| 14 | Usa | Mid Career | 5 | 656000 |
| 15 | Usa | Legend | 5 | 656000 |
8.DATA Step Equivalent
proc summary data=dancers_clean nway;
class Country;
var Annual_Earnings;
output out=summary mean=Avg_Earnings;
run;
proc print data=summary;
run;
OUTPUT:
| Obs | Country | _TYPE_ | _FREQ_ | Avg_Earnings |
|---|---|---|---|---|
| 1 | Colombia | 1 | 1 | 720000 |
| 2 | India | 1 | 5 | 472000 |
| 3 | Korea | 1 | 1 | 600000 |
| 4 | Spain | 1 | 1 | 650000 |
| 5 | Uk | 1 | 2 | 480000 |
| 6 | Usa | 1 | 5 | 656000 |
proc freq data=dancers_clean;
tables Country*Dance_Style/list missing;
run;
OUTPUT:
The FREQ Procedure
| Country | Dance_Style | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|---|
| Colombia | Latin | 1 | 6.67 | 1 | 6.67 |
| India | BellyDance | 1 | 6.67 | 2 | 13.33 |
| India | Bollywood | 3 | 20.00 | 5 | 33.33 |
| India | Cinema | 1 | 6.67 | 6 | 40.00 |
| Korea | KPOP | 1 | 6.67 | 7 | 46.67 |
| Spain | Flamenco | 1 | 6.67 | 8 | 53.33 |
| Uk | Ballet | 1 | 6.67 | 9 | 60.00 |
| Uk | Contemporary | 1 | 6.67 | 10 | 66.67 |
| Usa | Ballet | 1 | 6.67 | 11 | 73.33 |
| Usa | HipHop | 2 | 13.33 | 13 | 86.67 |
| Usa | Pop | 2 | 13.33 | 15 | 100.00 |
proc means data=dancers_clean n mean median max min;
var Age Annual_Earnings;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Median | Maximum | Minimum |
|---|---|---|---|---|---|
Age Annual_Earnings | 13 15 | 42.6923077 571333.33 | 49.0000000 600000.00 | 57.0000000 980000.00 | 5.0000000 50000.00 |
proc transpose data=dancers_clean
out=transpose_demo;
by Country;
var Annual_Earnings;
run;
proc print data=transpose_demo;
run;
OUTPUT:
| Obs | Country | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 |
|---|---|---|---|---|---|---|---|
| 1 | Colombia | Annual_Earnings | 720000 | . | . | . | . |
| 2 | India | Annual_Earnings | 850000 | 920000 | 90000 | 50000 | 450000 |
| 3 | Korea | Annual_Earnings | 600000 | . | . | . | . |
| 4 | Spain | Annual_Earnings | 650000 | . | . | . | . |
| 5 | Uk | Annual_Earnings | 350000 | 610000 | . | . | . |
| 6 | Usa | Annual_Earnings | 450000 | 450000 | 900000 | 500000 | 980000 |
12.PROC REPORT
proc report data=dancers_clean nowd;
column Country Dance_Style Annual_Earnings;
define Country/group;
define Dance_Style/group;
define Annual_Earnings/analysis sum;
run;
OUTPUT:
| Country | Dance_Style | Annual_Earnings |
|---|---|---|
| Colombia | Latin | 720000 |
| India | BellyDance | 50000 |
| Bollywood | 1460000 | |
| Cinema | 850000 | |
| Korea | KPOP | 600000 |
| Spain | Flamenco | 650000 |
| Uk | Ballet | 350000 |
| Contemporary | 610000 | |
| Usa | Ballet | 900000 |
| HipHop | 950000 | |
| Pop | 1430000 |
13.PROC CONTENTS
proc contents data=dancers_clean varnum;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.DANCERS_CLEAN | Observations | 15 |
|---|---|---|---|
| Member Type | DATA | Variables | 12 |
| Engine | V9 | Indexes | 0 |
| Created | 06/27/2026 16:33:53 | Observation Length | 248 |
| Last Modified | 06/27/2026 16:33:53 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | YES | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 528 |
| Obs in First Data Page | 15 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workB0030001F361_odaws02-apse1-2.oda.sas.com/SAS_work050E0001F361_odaws02-apse1-2.oda.sas.com/dancers_clean.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 67119700 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Variables in Creation Order | |||||
|---|---|---|---|---|---|
| # | Variable | Type | Len | Format | Informat |
| 1 | Region_Std | Char | 20 | ||
| 2 | Email_Domain | Char | 30 | ||
| 3 | Dancer_Name | Char | 30 | ||
| 4 | Country | Char | 20 | ||
| 5 | Dance_Style | Char | 20 | ||
| 6 | Region | Char | 15 | ||
| 7 | Char | 60 | |||
| 8 | Award_Level | Char | 20 | ||
| 9 | Debut_Date | Num | 8 | DATE9. | ANYDTDTE20. |
| 10 | Dancer_ID | Num | 8 | ||
| 11 | Age | Num | 8 | ||
| 12 | Annual_Earnings | Num | 8 | ||
| Sort Information | |
|---|---|
| Sortedby | Country |
| Validated | YES |
| Character Set | ASCII |
proc datasets library=work;
modify dancers_clean;
label Annual_Earnings='Annual Earnings in USD';
quit;
proc print data=dancers_clean label;
run;
OUTPUT:
| Directory | |
|---|---|
| Libref | WORK |
| Engine | V9 |
| Physical Name | /saswork/SAS_workB0030001F361_odaws02-apse1-2.oda.sas.com/SAS_work050E0001F361_odaws02-apse1-2.oda.sas.com |
| Filename | /saswork/SAS_workB0030001F361_odaws02-apse1-2.oda.sas.com/SAS_work050E0001F361_odaws02-apse1-2.oda.sas.com |
| Inode Number | 67122553 |
| Access Permission | rwx------ |
| Owner Name | u63247146 |
| File Size | 4KB |
| File Size (bytes) | 4096 |
| # | Name | Member Type | File Size | Last Modified |
|---|---|---|---|---|
| 1 | COUNTRY_SUMMARY | DATA | 256KB | 06/27/2026 11:04:04 |
| 2 | DANCERS_CLEAN | DATA | 256KB | 06/27/2026 11:03:52 |
| 3 | DANCERS_RAW | DATA | 256KB | 06/27/2026 11:02:02 |
| 4 | EARNINGS_SUMMARY | DATA | 256KB | 06/27/2026 11:04:30 |
| 5 | EXAMPLE | DATA | 256KB | 06/27/2026 11:02:15 |
| 6 | FORMATS | CATALOG | 24KB | 06/27/2026 11:04:18 |
| 7 | REGSTRY | ITEMSTOR | 32KB | 06/27/2026 10:55:12 |
| 8 | SASGOPT | CATALOG | 12KB | 06/27/2026 10:55:12 |
| 9 | SASMAC1 | CATALOG | 212KB | 06/27/2026 10:55:12 |
| 10 | SASMAC2 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 11 | SASMAC3 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 12 | SASMAC4 | CATALOG | 20KB | 06/27/2026 11:07:30 |
| 13 | SASMAC5 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 14 | SASMAC6 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 15 | SASMAC7 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 16 | SASMAC8 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 17 | SASMAC9 | CATALOG | 20KB | 06/27/2026 10:55:12 |
| 18 | SASMACR | CATALOG | 20KB | 06/27/2026 11:06:30 |
| 19 | SUMMARY | DATA | 256KB | 06/27/2026 11:04:45 |
| 20 | TRANSPOSE_DEMO | DATA | 256KB | 06/27/2026 11:06:04 |
| 21 | VALIDATION | DATA | 256KB | 06/27/2026 11:03:37 |
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual Earnings in USD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 2 | Asia Pacific | gmailcom | Prabhudeva | India | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | 5 | 850000 |
| 3 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 4 | Asia Pacific | gmail.com | Hrithikroshan | India | Bollywood | apac | hrithik@gmail.com | Gold | 20MAR2000 | 105 | 52 | 90000 |
| 5 | Asia Pacific | gmailcom | Norafatehi | India | BellyDance | asia | nora@gmailcom | Gold | 27JUN2016 | 109 | 36 | 50000 |
| 6 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | |
| 7 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 |
| 8 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
| 9 | Europe | Null | Uk | Ballet | EUROPE | invalid@email.com | Silver | 27JUN2016 | 104 | . | 350000 | |
| 10 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 11 | Asia Pacific | gmail.com | Michaeljackson | Usa | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | 450000 |
| 12 | North America | gmail.com | Chrisbrown | Usa | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 13 | North America | gmail.com | Mistycopeland | Usa | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 14 | North America | gmail.com | Jabbawockeez | Usa | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 15 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
%macro missing_check(ds);
data missing_records;
set &ds;
if cmiss(of _character_)>0 or
nmiss(of _numeric_)>0;
run;
proc print data=&ds label;
run;
proc sql;
select count(*) as Missing_Records
from missing_records;
quit;
%mend;
%missing_check(dancers_clean);
OUTPUT:
| Obs | Region_Std | Email_Domain | Dancer_Name | Country | Dance_Style | Region | Award_Level | Debut_Date | Dancer_ID | Age | Annual Earnings in USD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Other | gmail.com | Shakira | Colombia | Latin | LATAM | shakira@gmail.com | Premium | 18APR1998 | 106 | 49 | 720000 |
| 2 | Asia Pacific | gmailcom | Prabhudeva | India | Cinema | asia | prabhudeva@gmailcom | Gold | 12FEB1991 | 102 | 5 | 850000 |
| 3 | Asia Pacific | gmail.com | Madhuridixit | India | Bollywood | AP-AC | madhuri@gmail.com | Platinum | 27JUN2016 | 103 | 57 | 920000 |
| 4 | Asia Pacific | gmail.com | Hrithikroshan | India | Bollywood | apac | hrithik@gmail.com | Gold | 20MAR2000 | 105 | 52 | 90000 |
| 5 | Asia Pacific | gmailcom | Norafatehi | India | BellyDance | asia | nora@gmailcom | Gold | 27JUN2016 | 109 | 36 | 50000 |
| 6 | Asia Pacific | Invalidname | India | Bollywood | APAC | invalid@email.com | Premium | 19SEP2010 | 112 | . | 450000 | |
| 7 | Asia Pacific | gmail.com | Bts | Korea | KPOP | APAC | bts@gmail.com | Premium | 27JUN2016 | 107 | 31 | 600000 |
| 8 | Europe | gmail.com | Joaquincortes | Spain | Flamenco | europe | joaquin@gmail.com | Gold | 14FEB1988 | 115 | 57 | 650000 |
| 9 | Europe | Null | Uk | Ballet | EUROPE | invalid@email.com | Silver | 27JUN2016 | 104 | . | 350000 | |
| 10 | Europe | gmail.com | Akramkhan | Uk | Contemporary | Europe | akram@gmail.com | Gold | 14AUG1997 | 111 | 52 | 610000 |
| 11 | Asia Pacific | gmail.com | Michaeljackson | Usa | Pop | APAC | michael@gmail.com | Platinum | 15JAN1980 | 101 | 50 | 450000 |
| 12 | North America | gmail.com | Chrisbrown | Usa | HipHop | USA | chris@gmail.com | Platinum | 25DEC2005 | 108 | 38 | 450000 |
| 13 | North America | gmail.com | Mistycopeland | Usa | Ballet | usa | misty@gmail.com | Platinum | 12JUL2003 | 110 | 44 | 900000 |
| 14 | North America | gmail.com | Jabbawockeez | Usa | HipHop | USA | crew@gmail.com | Gold | 25DEC2014 | 113 | 28 | 500000 |
| 15 | North America | gmail.com | Jenniferlopez | Usa | Pop | usa | jlo@gmail.com | Platinum | 12JAN1999 | 114 | 56 | 980000 |
| Missing_Records |
|---|
| 2 |
library(tidyverse)
dancers_raw <- tibble(
Dancer_ID=c(101,102,103,103,104,105,106),
Name=c(" michael ","prabhuDeva","NULL",
"MadhuriDixit","ChrisBrown",
" Nora ","JenniferLopez"),
Age=c(50,-5,150,57,38,36,56),
Region=c("apac","AP-AC","asia",
"Europe","usa","APAC","USA"),
Email=c("a@gmail.com","badmail",
"x@gmail","nora@gmail.com",
"jlo@gmail.com","abc",
"good@gmail.com")
)
OUTPUT:
|
Dancer_ID |
Name |
Age |
Region |
Email |
|
101 |
michael |
50 |
apac |
a@gmail.com |
|
102 |
prabhuDeva |
-5 |
AP-AC |
badmail |
|
103 |
NULL |
150 |
asia |
x@gmail |
|
103 |
MadhuriDixit |
57 |
Europe |
nora@gmail.com |
|
104 |
ChrisBrown |
38 |
usa |
jlo@gmail.com |
|
105 |
Nora |
36 |
APAC |
abc |
|
106 |
JenniferLopez |
56 |
USA |
good@gmail.com |
17.R Cleaning Layer
library(janitor)
library(lubridate)
dancers_clean <- dancers_raw %>%
clean_names() %>%
mutate(
name=str_to_title(str_trim(name)),
age=abs(age),
age=if_else(age>90,NA_real_,age),
region=case_when(
region %in% c("apac","APAC","asia","AP-AC")
~"Asia Pacific",
TRUE~region
),
email=str_replace_all(email," ",""),
email=if_else(grepl("@",email),
email,
"invalid@email.com")
)
|
dancer_id |
name |
age |
region |
email |
|
101 |
Michael |
50 |
Asia Pacific |
a@gmail.com |
|
102 |
Prabhudeva |
5 |
Asia Pacific |
invalid@email.com |
|
103 |
Null |
Asia Pacific |
x@gmail |
|
|
103 |
Madhuridixit |
57 |
Europe |
nora@gmail.com |
|
104 |
Chrisbrown |
38 |
usa |
jlo@gmail.com |
|
105 |
Nora |
36 |
Asia Pacific |
invalid@email.com |
|
106 |
Jenniferlopez |
56 |
USA |
good@gmail.com |
SAS vs R Equivalent Transformations
|
SAS |
R |
|
PROPCASE() |
str_to_title() |
|
STRIP() |
str_trim() |
|
COMPRESS() |
str_replace_all() |
|
SELECT
WHEN |
case_when() |
|
COALESCEC() |
coalesce() |
|
PROC
SORT NODUPKEY |
distinct() |
|
INPUT() |
parse_date_time() |
|
PUT() |
as.character() |
Validation and Compliance
Clinical
environments require:
- SDTM traceability
- ADaM reproducibility
- audit trails
- QC independence
- regulatory consistency
- metadata lineage
One
dangerous SAS behavior is:
if Age < 18 then Flag='Y';
Since
missing numeric values are smaller than valid numbers in SAS:
. < 18 = TRUE
This can
accidentally classify missing ages as pediatric subjects.
Regulatory
submissions can fail because of this single mistake.
Always
use:
if not missing(age) and
age<18;
Business Logic Behind
Cleaning
Missing
values are rarely random.
A missing
patient visit date may indicate missed treatment exposure.
Negative
revenue values may represent refunds rather than earnings.
A patient
age of 250 years is clearly a data-entry error.
Dates
require standardization because analytics engines interpret formats
differently.
Text
normalization prevents "India", "INDIA", and
"india" from becoming separate categories.
Malformed
emails damage communication workflows.
Duplicate
IDs inflate counts and bias statistical inference.
Imputation
methods must be documented because every correction influences downstream
analysis.
Regulators
expect complete traceability from source data to analysis outputs.
20 Best Practices
- Maintain metadata
repositories.
- Validate dates before
analysis.
- Standardize macro libraries.
- Separate QC programming
teams.
- Use reusable validation
frameworks.
- Document assumptions.
- Preserve raw data.
- Version-control code.
- Use defensive programming.
- Validate ranges.
- Remove duplicates carefully.
- Standardize missing values.
- Audit transformations.
- Use independent review.
- Maintain lineage
documentation.
- Test edge cases.
- Lock production code.
- Validate formats.
- Create reconciliation
reports.
- Automate validation checks.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic beats
visual inspection.
- Missing values hide risk.
- Standardization improves
reproducibility.
- Metadata drives trust.
- Automation reduces defects.
- Duplicate records distort
analytics.
- Auditability matters.
- Defensive programming saves
projects.
- Dates deserve respect.
- Traceability protects
submissions.
- QC independence improves
quality.
- Business logic matters.
- Governance enables scale.
- Documentation prevents
confusion.
- Reproducibility builds
confidence.
- PROC SQL improves
readability.
- DATA Step improves control.
- R improves flexibility.
- SAS improves compliance.
SAS and R: Complementary
Strengths
SAS
excels in governance, auditability, scalability, and regulatory reporting.
R excels
in flexibility, visualization, machine learning integration, and rapid
experimentation.
SAS
provides deterministic production pipelines.
R
provides innovation.
Together
they create modern analytical ecosystems capable of handling enterprise-scale
healthcare, banking, insurance, retail, and entertainment analytics.
Conclusion
Data
cleaning is not a preprocessing activity.
It is
risk management.
The
world's best dancers dataset may appear simple, yet it contains the same
failures seen in oncology trials, insurance claims, banking systems, and retail
operations.
Duplicate
identifiers inflate counts.
Negative
values distort forecasts.
Malformed
dates break time-series models.
Inconsistent
categories corrupt dashboards.
Regulated
industries demand traceability, reproducibility, and auditability.
SAS
provides industrial-strength controls through DATA Step programming, PROC SQL,
metadata governance, and compliance-friendly workflows.
R
complements these strengths through flexible transformation frameworks and
modern data engineering libraries.
The
future of enterprise analytics belongs to organizations capable of transforming
imperfect operational records into trustworthy intelligence.
Clean
data is not merely cleaner data.
It is
safer science, smarter business, stronger compliance, and better decisions.
The real
competitive advantage is not collecting more data.
It is
trusting the data you already have.
Interview Questions and
Answers
1. How would you remove duplicate patients in SAS?
Use:
proc sort nodupkey;
by USUBJID;
run;
or
proc sql;
select distinct *
2. Why is missing value handling dangerous in SAS?
Because
numeric missing values are treated as lower than valid numbers.
3. When would you use PROC SQL instead of DATA
Step?
PROC SQL
is preferred for joins and aggregation while DATA Step provides row-wise
control.
4. How would you validate malformed emails in R?
Using:
grepl("@",email)
or
regular expressions.
5. How do you maintain audit readiness?
By
preserving source data, documenting transformations, and maintaining
traceability.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DANCERS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment