Where Data Learns to Dance: Advanced Cleaning Frameworks for Reliable Analytics Using Sas And R

From Stage Legends to Statistical Legends: Transforming the World's Best Dancers Dataset into Analysis-Ready Intelligence with SAS and R

The Business Crisis Nobody Saw Coming

Imagine a global entertainment analytics company preparing an annual report ranking the world's most influential dancers. Sponsors are allocating millions in endorsement budgets, streaming platforms are using AI models to recommend performers, and event organizers are planning international festivals.

Then disaster strikes.

The dashboard reports that a dancer earned negative annual revenue, another performer appears twice due to duplicate IDs, ages exceed 140 years, and several email addresses are malformed.

A dancer from India is incorrectly classified under the European region because of inconsistent region codes such as:

  • APAC
  • apac
  • Asia Pacific
  • AP-AC
  • asia

Machine learning models begin producing biased predictions.

Executive reports become inaccurate.

Business decisions become unreliable.

This is exactly what happens in healthcare when patient enrollment is duplicated, in banking when fraud transactions are misclassified, and in insurance when claim records contain invalid policy dates.

Dirty data is not merely an inconvenience.

It is an enterprise risk.

1.Raw Corrupted Dataset SAS Version

data dancers_raw;

length Dancer_Name $30 Country $20 Dance_Style $20 Region $15

Email $60 Award_Level $20;

infile datalines dlm='|' dsd truncover;

informat Debut_Date anydtdte20.;

format Debut_Date date9.;

input Dancer_ID Dancer_Name :$30. Country :$20. Age Dance_Style :$20.

Annual_Earnings Debut_Date :anydtdte20. Region :$15. Email :$60.

Award_Level :$20.;

datalines;

101|MichaelJackson|USA|50|Pop|-450000|15JAN1980|APAC|michael@gmail.com|Platinum

102|prabhuDeva|india|-5|Cinema|850000|12FEB1991|asia|prabhudeva@gmailcom|Gold

103|MadhuriDixit|India|57|Bollywood|920000|.|AP-AC|madhuri@gmail.com|Platinum

103|MadhuriDixit|India|57|Bollywood|920000|.|AP-AC|madhuri@gmail.com|Platinum

104|NULL|UK|150|Ballet|350000|32JAN1990|EUROPE|marygmail.com|Silver

105|HrithikRoshan|India|52|Bollywood|-90000|20MAR2000|apac|hrithik@ gmail.com|Gold

106|Shakira|Colombia|49|Latin|720000|18APR1998|LATAM|shakira@gmail.com|Premium

107|BTS|Korea|31|KPOP|600000|InvalidDate|APAC|bts@gmail.com|Premium

108|ChrisBrown|USA|38|HipHop|450000|25DEC2005|USA|chris@gmail.com|Platinum

109|NoraFatehi|India|36|BellyDance|-50000|.|asia|nora@gmailcom|Gold

110|MistyCopeland|USA|44|Ballet|900000|12JUL2003|usa|misty@gmail.com|Platinum

111|AkramKhan|UK|52|Contemporary|610000|14AUG1997|Europe|akram@gmail.com|Gold

112|InvalidName|India|.|Bollywood|450000|19SEP2010|APAC|invalidmail|Premium

113|Jabbawockeez|USA|28|HipHop|500000|25DEC2014|USA|crew@gmail.com|Gold

114|JenniferLopez|USA|56|Pop|980000|12JAN1999|usa|jlo@gmail.com|Platinum

115|JoaquinCortes|Spain|57|Flamenco|650000|14FEB1988|europe|joaquin@gmail.com|Gold

;

run;

proc print data=dancers_raw;

run;

OUTPUT:

ObsDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual_Earnings
1MichaelJacksonUSAPopAPACmichael@gmail.comPlatinum15JAN198010150-450000
2prabhuDevaindiaCinemaasiaprabhudeva@gmailcomGold12FEB1991102-5850000
3MadhuriDixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum.10357920000
4MadhuriDixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum.10357920000
5NULLUKBalletEUROPEmarygmail.comSilver.104150350000
6HrithikRoshanIndiaBollywoodapachrithik@ gmail.comGold20MAR200010552-90000
7ShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
8BTSKoreaKPOPAPACbts@gmail.comPremium.10731600000
9ChrisBrownUSAHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
10NoraFatehiIndiaBellyDanceasianora@gmailcomGold.10936-50000
11MistyCopelandUSABalletusamisty@gmail.comPlatinum12JUL200311044900000
12AkramKhanUKContemporaryEuropeakram@gmail.comGold14AUG199711152610000
13InvalidNameIndiaBollywoodAPACinvalidmailPremium19SEP2010112.450000
14JabbawockeezUSAHipHopUSAcrew@gmail.comGold25DEC201411328500000
15JenniferLopezUSAPopusajlo@gmail.comPlatinum12JAN199911456980000
16JoaquinCortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000

Why LENGTH Must Come First

One of the most dangerous SAS behaviors is character truncation.

data example;

name="Christopher";

length name $5;

run;

proc print data=example;

run;

OUTPUT:

Obsname
1Chris

SAS allocates memory during compile time.

If LENGTH appears after assignment, the variable length is already fixed.

R behaves differently because character vectors dynamically allocate memory and do not truncate automatically.

In regulated industries such as clinical trials, truncating treatment codes can create SDTM compliance failures.

2.Enterprise Cleaning Workflow Using DATA Step

data dancers_clean;

length Region_Std $20 Email_Domain $30;

set dancers_raw;

Dancer_Name=propcase(strip(Dancer_Name));

Country=propcase(strip(Country));

Email=lowcase(compress(Email));

Annual_Earnings=abs(Annual_Earnings);

Age=abs(Age);

if Age>90 then Age=.;

if missing(Debut_Date) then

Debut_Date=intnx('year',today(),-10,'same');

select(upcase(region));

when("APAC","ASIA","AP-AC") Region_Std="Asia Pacific";

when("USA","US") Region_Std="North America";

when("EUROPE","UK") Region_Std="Europe";

otherwise Region_Std="Other";

end;

Email_Domain=scan(email,2,'@');

if index(email,'@')=0 then Email='invalid@email.com';

Award_Level=propcase(Award_Level);

run;

proc print data=dancers_clean;

run;

OUTPUT:

ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual_Earnings
1Asia Pacificgmail.comMichaeljacksonUsaPopAPACmichael@gmail.comPlatinum15JAN198010150450000
2Asia PacificgmailcomPrabhudevaIndiaCinemaasiaprabhudeva@gmailcomGold12FEB19911025850000
3Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
4Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
5Europe NullUkBalletEUROPEinvalid@email.comSilver27JUN2016104.350000
6Asia Pacificgmail.comHrithikroshanIndiaBollywoodapachrithik@gmail.comGold20MAR20001055290000
7Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
8Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN201610731600000
9North Americagmail.comChrisbrownUsaHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
10Asia PacificgmailcomNorafatehiIndiaBellyDanceasianora@gmailcomGold27JUN20161093650000
11North Americagmail.comMistycopelandUsaBalletusamisty@gmail.comPlatinum12JUL200311044900000
12Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG199711152610000
13Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.450000
14North Americagmail.comJabbawockeezUsaHipHopUSAcrew@gmail.comGold25DEC201411328500000
15North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN199911456980000
16Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000

Key Points

This DATA step demonstrates:

  • IF THEN ELSE validation
  • SELECT WHEN categorization
  • ABS() correction
  • INTNX() imputation
  • SCAN() extraction
  • INDEX() validation
  • PROPCASE() normalization
  • COMPRESS() whitespace removal

The workflow converts unusable operational records into analytical intelligence.

3.Duplicate Removal

proc sort data=dancers_clean nodupkey;

by Dancer_ID;

run;

proc print data=dancers_clean;

run;

LOG:

NOTE: There were 16 observations read from the data set WORK.DANCERS_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.DANCERS_CLEAN has 15 observations and 12 variables.

OUTPUT:

ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual_Earnings
1Asia Pacificgmail.comMichaeljacksonUsaPopAPACmichael@gmail.comPlatinum15JAN198010150450000
2Asia PacificgmailcomPrabhudevaIndiaCinemaasiaprabhudeva@gmailcomGold12FEB19911025850000
3Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
4Europe NullUkBalletEUROPEinvalid@email.comSilver27JUN2016104.350000
5Asia Pacificgmail.comHrithikroshanIndiaBollywoodapachrithik@gmail.comGold20MAR20001055290000
6Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
7Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN201610731600000
8North Americagmail.comChrisbrownUsaHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
9Asia PacificgmailcomNorafatehiIndiaBellyDanceasianora@gmailcomGold27JUN20161093650000
10North Americagmail.comMistycopelandUsaBalletusamisty@gmail.comPlatinum12JUL200311044900000
11Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG199711152610000
12Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.450000
13North Americagmail.comJabbawockeezUsaHipHopUSAcrew@gmail.comGold25DEC201411328500000
14North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN199911456980000
15Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000

Explanation

NODUPKEY removes duplicate business keys while preserving one validated record.

Clinical trials frequently use this technique to avoid duplicate subjects in DM domains.

4.ARRAY-Based Validation

data validation;

set dancers_clean;

array nums(*) Age Annual_Earnings;

do i=1 to dim(nums);

if nums(i)<0 then nums(i)=abs(nums(i));

end;

drop i;

run;

proc print data=validation;

run;

OUTPUT:

ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual_Earnings
1Asia Pacificgmail.comMichaeljacksonUsaPopAPACmichael@gmail.comPlatinum15JAN198010150450000
2Asia PacificgmailcomPrabhudevaIndiaCinemaasiaprabhudeva@gmailcomGold12FEB19911025850000
3Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
4Europe NullUkBalletEUROPEinvalid@email.comSilver27JUN2016104.350000
5Asia Pacificgmail.comHrithikroshanIndiaBollywoodapachrithik@gmail.comGold20MAR20001055290000
6Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
7Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN201610731600000
8North Americagmail.comChrisbrownUsaHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
9Asia PacificgmailcomNorafatehiIndiaBellyDanceasianora@gmailcomGold27JUN20161093650000
10North Americagmail.comMistycopelandUsaBalletusamisty@gmail.comPlatinum12JUL200311044900000
11Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG199711152610000
12Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.450000
13North Americagmail.comJabbawockeezUsaHipHopUSAcrew@gmail.comGold25DEC201411328500000
14North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN199911456980000
15Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000

Explanation

ARRAY processing allows scalable validation without repetitive code.

Large SDTM domains may contain hundreds of numeric variables requiring identical checks.

5.FIRST./LAST. Processing

proc sort data=dancers_clean;

by Country;

run;

proc print data=dancers_clean;

run;

OUTPUT:

ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual_Earnings
1Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
2Asia PacificgmailcomPrabhudevaIndiaCinemaasiaprabhudeva@gmailcomGold12FEB19911025850000
3Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
4Asia Pacificgmail.comHrithikroshanIndiaBollywoodapachrithik@gmail.comGold20MAR20001055290000
5Asia PacificgmailcomNorafatehiIndiaBellyDanceasianora@gmailcomGold27JUN20161093650000
6Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.450000
7Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN201610731600000
8Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000
9Europe NullUkBalletEUROPEinvalid@email.comSilver27JUN2016104.350000
10Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG199711152610000
11Asia Pacificgmail.comMichaeljacksonUsaPopAPACmichael@gmail.comPlatinum15JAN198010150450000
12North Americagmail.comChrisbrownUsaHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
13North Americagmail.comMistycopelandUsaBalletusamisty@gmail.comPlatinum12JUL200311044900000
14North Americagmail.comJabbawockeezUsaHipHopUSAcrew@gmail.comGold25DEC201411328500000
15North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN199911456980000

data country_summary;

set dancers_clean;

by Country;

retain Total_Earnings 0;

Total_Earnings+Annual_Earnings;

if last.Country;

run;

proc print data=country_summary;

run;

OUTPUT:

ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual_EarningsTotal_Earnings
1Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000720000
2Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.4500003080000
3Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN2016107316000003680000
4Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB1988115576500004330000
5Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG1997111526100005290000
6North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN1999114569800008570000

6.PROC FORMAT

proc format;

value agegrp 0-20='Young'

            21-40='Mid Career'

          41-high='Legend';

run;

LOG:

NOTE: Format AGEGRP has been output.
7.PROC SQL Approach

proc sql;

create table earnings_summary as

select Country,Age format=agegrp.,

       count(*) as Total_Dancers,

       mean(Annual_Earnings) as Avg_Earnings

from dancers_clean

group by Country;

quit;

proc print data=earnings_summary;

run;

OUTPUT:

ObsCountryAgeTotal_DancersAvg_Earnings
1ColombiaLegend1720000
2IndiaYoung5472000
3IndiaLegend5472000
4IndiaLegend5472000
5IndiaMid Career5472000
6India.5472000
7KoreaMid Career1600000
8SpainLegend1650000
9Uk.2480000
10UkLegend2480000
11UsaLegend5656000
12UsaMid Career5656000
13UsaLegend5656000
14UsaMid Career5656000
15UsaLegend5656000

8.DATA Step Equivalent

proc summary data=dancers_clean nway;

class Country;

var Annual_Earnings;

output out=summary mean=Avg_Earnings;

run;

proc print data=summary;

run;

OUTPUT:

ObsCountry_TYPE__FREQ_Avg_Earnings
1Colombia11720000
2India15472000
3Korea11600000
4Spain11650000
5Uk12480000
6Usa15656000
9.PROC FREQ

proc freq data=dancers_clean;

tables Country*Dance_Style/list missing;

run;

OUTPUT:

The FREQ Procedure

CountryDance_StyleFrequencyPercentCumulative
Frequency
Cumulative
Percent
ColombiaLatin16.6716.67
IndiaBellyDance16.67213.33
IndiaBollywood320.00533.33
IndiaCinema16.67640.00
KoreaKPOP16.67746.67
SpainFlamenco16.67853.33
UkBallet16.67960.00
UkContemporary16.671066.67
UsaBallet16.671173.33
UsaHipHop213.331386.67
UsaPop213.3315100.00
10.PROC MEANS

proc means data=dancers_clean n mean median max min;

var Age Annual_Earnings;

run;

OUTPUT:

The MEANS Procedure

VariableNMeanMedianMaximumMinimum
Age
Annual_Earnings
13
15
42.6923077
571333.33
49.0000000
600000.00
57.0000000
980000.00
5.0000000
50000.00
11.PROC TRANSPOSE

proc transpose data=dancers_clean

               out=transpose_demo;

by Country;

var Annual_Earnings;

run;

proc print data=transpose_demo;

run;

OUTPUT:

ObsCountry_NAME_COL1COL2COL3COL4COL5
1ColombiaAnnual_Earnings720000....
2IndiaAnnual_Earnings8500009200009000050000450000
3KoreaAnnual_Earnings600000....
4SpainAnnual_Earnings650000....
5UkAnnual_Earnings350000610000...
6UsaAnnual_Earnings450000450000900000500000980000

12.PROC REPORT

proc report data=dancers_clean nowd;

column Country Dance_Style Annual_Earnings;

define Country/group;

define Dance_Style/group;

define Annual_Earnings/analysis sum;

run;

OUTPUT:

CountryDance_StyleAnnual_Earnings
ColombiaLatin720000
IndiaBellyDance50000
 Bollywood1460000
 Cinema850000
KoreaKPOP600000
SpainFlamenco650000
UkBallet350000
 Contemporary610000
UsaBallet900000
 HipHop950000
 Pop1430000

13.PROC CONTENTS

proc contents data=dancers_clean varnum;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.DANCERS_CLEANObservations15
Member TypeDATAVariables12
EngineV9Indexes0
Created06/27/2026 16:33:53Observation Length248
Last Modified06/27/2026 16:33:53Deleted Observations0
Protection CompressedNO
Data Set Type SortedYES
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page528
Obs in First Data Page15
Number of Data Set Repairs0
Filename/saswork/SAS_workB0030001F361_odaws02-apse1-2.oda.sas.com/SAS_work050E0001F361_odaws02-apse1-2.oda.sas.com/dancers_clean.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number67119700
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Variables in Creation Order
#VariableTypeLenFormatInformat
1Region_StdChar20  
2Email_DomainChar30  
3Dancer_NameChar30  
4CountryChar20  
5Dance_StyleChar20  
6RegionChar15  
7EmailChar60  
8Award_LevelChar20  
9Debut_DateNum8DATE9.ANYDTDTE20.
10Dancer_IDNum8  
11AgeNum8  
12Annual_EarningsNum8  
Sort Information
SortedbyCountry
ValidatedYES
Character SetASCII
14.PROC DATASETS

proc datasets library=work;

modify dancers_clean;

label Annual_Earnings='Annual Earnings in USD';

quit;

proc print data=dancers_clean label;

run;

OUTPUT:

Directory
LibrefWORK
EngineV9
Physical Name/saswork/SAS_workB0030001F361_odaws02-apse1-2.oda.sas.com/SAS_work050E0001F361_odaws02-apse1-2.oda.sas.com
Filename/saswork/SAS_workB0030001F361_odaws02-apse1-2.oda.sas.com/SAS_work050E0001F361_odaws02-apse1-2.oda.sas.com
Inode Number67122553
Access Permissionrwx------
Owner Nameu63247146
File Size4KB
File Size (bytes)4096
#NameMember TypeFile SizeLast Modified
1COUNTRY_SUMMARYDATA256KB06/27/2026 11:04:04
2DANCERS_CLEANDATA256KB06/27/2026 11:03:52
3DANCERS_RAWDATA256KB06/27/2026 11:02:02
4EARNINGS_SUMMARYDATA256KB06/27/2026 11:04:30
5EXAMPLEDATA256KB06/27/2026 11:02:15
6FORMATSCATALOG24KB06/27/2026 11:04:18
7REGSTRYITEMSTOR32KB06/27/2026 10:55:12
8SASGOPTCATALOG12KB06/27/2026 10:55:12
9SASMAC1CATALOG212KB06/27/2026 10:55:12
10SASMAC2CATALOG20KB06/27/2026 10:55:12
11SASMAC3CATALOG20KB06/27/2026 10:55:12
12SASMAC4CATALOG20KB06/27/2026 11:07:30
13SASMAC5CATALOG20KB06/27/2026 10:55:12
14SASMAC6CATALOG20KB06/27/2026 10:55:12
15SASMAC7CATALOG20KB06/27/2026 10:55:12
16SASMAC8CATALOG20KB06/27/2026 10:55:12
17SASMAC9CATALOG20KB06/27/2026 10:55:12
18SASMACRCATALOG20KB06/27/2026 11:06:30
19SUMMARYDATA256KB06/27/2026 11:04:45
20TRANSPOSE_DEMODATA256KB06/27/2026 11:06:04
21VALIDATIONDATA256KB06/27/2026 11:03:37
ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual Earnings in USD
1Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
2Asia PacificgmailcomPrabhudevaIndiaCinemaasiaprabhudeva@gmailcomGold12FEB19911025850000
3Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
4Asia Pacificgmail.comHrithikroshanIndiaBollywoodapachrithik@gmail.comGold20MAR20001055290000
5Asia PacificgmailcomNorafatehiIndiaBellyDanceasianora@gmailcomGold27JUN20161093650000
6Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.450000
7Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN201610731600000
8Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000
9Europe NullUkBalletEUROPEinvalid@email.comSilver27JUN2016104.350000
10Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG199711152610000
11Asia Pacificgmail.comMichaeljacksonUsaPopAPACmichael@gmail.comPlatinum15JAN198010150450000
12North Americagmail.comChrisbrownUsaHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
13North Americagmail.comMistycopelandUsaBalletusamisty@gmail.comPlatinum12JUL200311044900000
14North Americagmail.comJabbawockeezUsaHipHopUSAcrew@gmail.comGold25DEC201411328500000
15North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN199911456980000
15.Professional SAS Macro

%macro missing_check(ds);

data missing_records;

set &ds;

if cmiss(of _character_)>0 or

   nmiss(of _numeric_)>0;

run;

proc print data=&ds label;

run;

proc sql;

select count(*) as Missing_Records

from missing_records;

quit;

%mend;

%missing_check(dancers_clean);

OUTPUT:

ObsRegion_StdEmail_DomainDancer_NameCountryDance_StyleRegionEmailAward_LevelDebut_DateDancer_IDAgeAnnual Earnings in USD
1Othergmail.comShakiraColombiaLatinLATAMshakira@gmail.comPremium18APR199810649720000
2Asia PacificgmailcomPrabhudevaIndiaCinemaasiaprabhudeva@gmailcomGold12FEB19911025850000
3Asia Pacificgmail.comMadhuridixitIndiaBollywoodAP-ACmadhuri@gmail.comPlatinum27JUN201610357920000
4Asia Pacificgmail.comHrithikroshanIndiaBollywoodapachrithik@gmail.comGold20MAR20001055290000
5Asia PacificgmailcomNorafatehiIndiaBellyDanceasianora@gmailcomGold27JUN20161093650000
6Asia Pacific InvalidnameIndiaBollywoodAPACinvalid@email.comPremium19SEP2010112.450000
7Asia Pacificgmail.comBtsKoreaKPOPAPACbts@gmail.comPremium27JUN201610731600000
8Europegmail.comJoaquincortesSpainFlamencoeuropejoaquin@gmail.comGold14FEB198811557650000
9Europe NullUkBalletEUROPEinvalid@email.comSilver27JUN2016104.350000
10Europegmail.comAkramkhanUkContemporaryEuropeakram@gmail.comGold14AUG199711152610000
11Asia Pacificgmail.comMichaeljacksonUsaPopAPACmichael@gmail.comPlatinum15JAN198010150450000
12North Americagmail.comChrisbrownUsaHipHopUSAchris@gmail.comPlatinum25DEC200510838450000
13North Americagmail.comMistycopelandUsaBalletusamisty@gmail.comPlatinum12JUL200311044900000
14North Americagmail.comJabbawockeezUsaHipHopUSAcrew@gmail.comGold25DEC201411328500000
15North Americagmail.comJenniferlopezUsaPopusajlo@gmail.comPlatinum12JAN199911456980000
Missing_Records
2
16.R Raw Dataset

library(tidyverse)

dancers_raw <- tibble(

  Dancer_ID=c(101,102,103,103,104,105,106),

  Name=c(" michael ","prabhuDeva","NULL",

         "MadhuriDixit","ChrisBrown",

         " Nora ","JenniferLopez"),

  Age=c(50,-5,150,57,38,36,56),

  Region=c("apac","AP-AC","asia",

           "Europe","usa","APAC","USA"),

  Email=c("a@gmail.com","badmail",

          "x@gmail","nora@gmail.com",

          "jlo@gmail.com","abc",

          "good@gmail.com")

)

OUTPUT:

Dancer_ID

Name

Age

Region

Email

101

 michael

50

apac

a@gmail.com

102

prabhuDeva

-5

AP-AC

badmail

103

NULL

150

asia

x@gmail

103

MadhuriDixit

57

Europe

nora@gmail.com

104

ChrisBrown

38

usa

jlo@gmail.com

105

 Nora

36

APAC

abc

106

JenniferLopez

56

USA

good@gmail.com

17.R Cleaning Layer

library(janitor)

library(lubridate)

dancers_clean <- dancers_raw %>%

  clean_names() %>%

  mutate(

    name=str_to_title(str_trim(name)),

    age=abs(age),

    age=if_else(age>90,NA_real_,age),

    region=case_when(

      region %in% c("apac","APAC","asia","AP-AC")

      ~"Asia Pacific",

      TRUE~region

    ),

    email=str_replace_all(email," ",""),

    email=if_else(grepl("@",email),

                  email,

                  "invalid@email.com")

  )

OUTPUT:

dancer_id

name

age

region

email

101

Michael

50

Asia Pacific

a@gmail.com

102

Prabhudeva

5

Asia Pacific

invalid@email.com

103

Null

Asia Pacific

x@gmail

103

Madhuridixit

57

Europe

nora@gmail.com

104

Chrisbrown

38

usa

jlo@gmail.com

105

Nora

36

Asia Pacific

invalid@email.com

106

Jenniferlopez

56

USA

good@gmail.com


SAS vs R Equivalent Transformations

SAS

R

PROPCASE()

str_to_title()

STRIP()

str_trim()

COMPRESS()

str_replace_all()

SELECT WHEN

case_when()

COALESCEC()

coalesce()

PROC SORT NODUPKEY

distinct()

INPUT()

parse_date_time()

PUT()

as.character()

Validation and Compliance

Clinical environments require:

  • SDTM traceability
  • ADaM reproducibility
  • audit trails
  • QC independence
  • regulatory consistency
  • metadata lineage

One dangerous SAS behavior is:

if Age < 18 then Flag='Y';

Since missing numeric values are smaller than valid numbers in SAS:

. < 18 = TRUE

This can accidentally classify missing ages as pediatric subjects.

Regulatory submissions can fail because of this single mistake.

Always use:

if not missing(age) and age<18;

Business Logic Behind Cleaning

Missing values are rarely random.

A missing patient visit date may indicate missed treatment exposure.

Negative revenue values may represent refunds rather than earnings.

A patient age of 250 years is clearly a data-entry error.

Dates require standardization because analytics engines interpret formats differently.

Text normalization prevents "India", "INDIA", and "india" from becoming separate categories.

Malformed emails damage communication workflows.

Duplicate IDs inflate counts and bias statistical inference.

Imputation methods must be documented because every correction influences downstream analysis.

Regulators expect complete traceability from source data to analysis outputs.

20 Best Practices

  1. Maintain metadata repositories.
  2. Validate dates before analysis.
  3. Standardize macro libraries.
  4. Separate QC programming teams.
  5. Use reusable validation frameworks.
  6. Document assumptions.
  7. Preserve raw data.
  8. Version-control code.
  9. Use defensive programming.
  10. Validate ranges.
  11. Remove duplicates carefully.
  12. Standardize missing values.
  13. Audit transformations.
  14. Use independent review.
  15. Maintain lineage documentation.
  16. Test edge cases.
  17. Lock production code.
  18. Validate formats.
  19. Create reconciliation reports.
  20. Automate validation checks.

20 One-Line Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic beats visual inspection.
  • Missing values hide risk.
  • Standardization improves reproducibility.
  • Metadata drives trust.
  • Automation reduces defects.
  • Duplicate records distort analytics.
  • Auditability matters.
  • Defensive programming saves projects.
  • Dates deserve respect.
  • Traceability protects submissions.
  • QC independence improves quality.
  • Business logic matters.
  • Governance enables scale.
  • Documentation prevents confusion.
  • Reproducibility builds confidence.
  • PROC SQL improves readability.
  • DATA Step improves control.
  • R improves flexibility.
  • SAS improves compliance.

SAS and R: Complementary Strengths

SAS excels in governance, auditability, scalability, and regulatory reporting.

R excels in flexibility, visualization, machine learning integration, and rapid experimentation.

SAS provides deterministic production pipelines.

R provides innovation.

Together they create modern analytical ecosystems capable of handling enterprise-scale healthcare, banking, insurance, retail, and entertainment analytics.

Conclusion

Data cleaning is not a preprocessing activity.

It is risk management.

The world's best dancers dataset may appear simple, yet it contains the same failures seen in oncology trials, insurance claims, banking systems, and retail operations.

Duplicate identifiers inflate counts.

Negative values distort forecasts.

Malformed dates break time-series models.

Inconsistent categories corrupt dashboards.

Regulated industries demand traceability, reproducibility, and auditability.

SAS provides industrial-strength controls through DATA Step programming, PROC SQL, metadata governance, and compliance-friendly workflows.

R complements these strengths through flexible transformation frameworks and modern data engineering libraries.

The future of enterprise analytics belongs to organizations capable of transforming imperfect operational records into trustworthy intelligence.

Clean data is not merely cleaner data.

It is safer science, smarter business, stronger compliance, and better decisions.

The real competitive advantage is not collecting more data.

It is trusting the data you already have.

Interview Questions and Answers

1. How would you remove duplicate patients in SAS?

Use:

proc sort nodupkey;

by USUBJID;

run;

or

proc sql;

select distinct *

2. Why is missing value handling dangerous in SAS?

Because numeric missing values are treated as lower than valid numbers.

3. When would you use PROC SQL instead of DATA Step?

PROC SQL is preferred for joins and aggregation while DATA Step provides row-wise control.

4. How would you validate malformed emails in R?

Using:

grepl("@",email)

or regular expressions.

5. How do you maintain audit readiness?

By preserving source data, documenting transformations, and maintaining traceability.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DANCERS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS