Cleaning the World's Soundtrack: Advanced SAS and R Techniques for Reliable Music Analytics
From Symphony to Statistics: Transforming Global Music Data into Trusted Analytical Intelligence with SAS and R
Introduction:The Business Crisis Nobody Expected
A global
music streaming company launched an AI recommendation engine intended to
increase subscriber engagement across continents. Instead, the platform began
recommending classical Indian ragas to heavy metal listeners, African Afrobeat
to K-Pop fans, and duplicate royalty payments were issued to artists because
duplicate track IDs existed in production systems.
Executives
discovered serious data quality failures:
- Duplicate music identifiers
- Missing release dates
- Negative streaming revenue
values
- Invalid listener ages
- Corrupted genre labels
- Broken email addresses
- Region code inconsistencies
- Invalid timestamps
- Mixed uppercase and
lowercase artist names
- Embedded whitespace corruption
- NULL strings stored as
actual text
The
result was devastating:
- AI recommendation failures
- Incorrect royalty
calculations
- Executive dashboards
displaying false trends
- Financial reporting
discrepancies
- Regulatory audit concerns
- Damaged customer experience
This is
precisely why enterprise data engineering exists.
Understanding Different
Types of Music in the World
The world
contains thousands of musical traditions, but enterprise analytics often
categorize them into major genres:
|
Genre |
Origin |
Characteristics |
|
Classical |
Europe |
Orchestra,
symphonies |
|
Jazz |
USA |
Improvisation
and swing |
|
Blues |
USA |
Emotional
storytelling |
|
Rock |
Global |
Electric
guitars and drums |
|
Pop |
Global |
Commercial
appeal |
|
Hip Hop |
USA |
Rap and
beat production |
|
Country |
USA |
Acoustic
storytelling |
|
Reggae |
Jamaica |
Offbeat
rhythms |
|
Afrobeat |
Africa |
Percussion-driven
grooves |
|
K-Pop |
South
Korea |
Choreographed
pop |
|
J-Pop |
Japan |
Synth-heavy
melodies |
|
EDM |
Global |
Electronic
production |
|
Flamenco |
Spain |
Guitar
and dance |
|
Indian
Classical |
India |
Ragas
and improvisation |
|
Folk |
Worldwide |
Traditional
cultural music |
Raw Music Dataset with Intentional Errors
1.SAS Raw Dataset
data music_raw;
length Music_ID $8 Genre $20 Artist_Name $40 Region $10 Listener_Email $60
Revenue $8 Listener_Age 8 Release_Date $20 Stream_TS $30;
infile datalines dlm='|' dsd truncover;
input Music_ID $ Genre $ Artist_Name $ Region $ Listener_Email $
Revenue $ Listener_Age Release_Date $ Stream_TS $;
datalines;
M001|rock|queen|eu|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00
M001|ROCK| QUEEN |EUROPE|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00
M002|NULL|metallica|na|metallica.com|-500|34|2025-02-12|INVALID
M003|Jazz|Miles Davis|US|miles@gmail|1200|-10|2025-03-05|2025-03-05T09:00:00
M004|Pop| taylor swift |APAC|swift@gmail.com|3500|200||2025-03-20T08:30:00
M005|KPOP|BTS|KR|army@gmail.com|4500|22|2025-13-10|BADTIME
M006|HipHop|Drake|NA|drake@gmail.com|NULL|28|2025-04-12|2025-04-12T10:00:00
M007|AfroBeat| Burna Boy |africa|burna@gmail.com|3900|31|2025-05-01|2025-05-01T09:00:00
M008|EDM|Martin Garrix|EU|garrixgmail.com|2200|26|2025-06-01|2025-06-01T08:00:00
M009|Classical|Mozart|EU|mozart@gmail.com|1800|270|1770-01-01|1770TIME
M010|Folk|Unknown|APAC|unknown@|1300|45|2025-07-01|2025-07-01T10:00:00
M011|Reggae|Bob Marley|NA|bob@gmail.com|2600|36|1980-01-01|1980-01-01T10:00:00
M012|JPOP|Ado|JP|ado@gmail.com|2100|19|2025-08-01|2025-08-01T10:00:00
M013|Blues|BB King|US|bb@gmail.com|-300|55|2025-09-01|2025-09-01T10:00:00
M014|Opera|Pavarotti|EU|pav@gmail.com|4200|67|NULL|NULL
M015|Rock|Nirvana|EU|nirvana@gmail.com|3200|29|2025-10-01|2025-10-01T11:00:00
M016|EDM|NULL|EU|testgmail.com|2500|21|2025-11-01|2025-11-01T11:00:00
M017|Hip Hop| Eminem |NA|em@gmail.com|5000|52|2025-12-01|2025-12-01T11:00:00
M018|Classical|Beethoven|EU|beethoven@gmail.com|3400|248|1800-01-01|1800TIME
M019|Country|Luke Bryan|US|luke@gmail.com|2900|48|2025-01-10|2025-01-10T12:00:00
M020|Rock|Linkin Park|NA|lp@gmail.com|4100|39|2025-02-11|2025-02-11T11:00:00
;
run;
proc print data=music_raw;
run;
OUTPUT:
| Obs | Music_ID | Genre | Artist_Name | Region | Listener_Email | Revenue | Listener_Age | Release_Date | Stream_TS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | rock | queen | eu | fan@gmail.com | 2000 | 25 | 2025-01-01 | 2025-01-01T10:20:00 |
| 2 | M001 | ROCK | QUEEN | EUROPE | fan@gmail.com | 2000 | 25 | 2025-01-01 | 2025-01-01T10:20:00 |
| 3 | M002 | NULL | metallica | na | metallica.com | -500 | 34 | 2025-02-12 | INVALID |
| 4 | M003 | Jazz | Miles Davis | US | miles@gmail | 1200 | -10 | 2025-03-05 | 2025-03-05T09:00:00 |
| 5 | M004 | Pop | taylor swift | APAC | swift@gmail.com | 3500 | 200 | 2025-03-20T08:30:00 | |
| 6 | M005 | KPOP | BTS | KR | army@gmail.com | 4500 | 22 | 2025-13-10 | BADTIME |
| 7 | M006 | HipHop | Drake | NA | drake@gmail.com | NULL | 28 | 2025-04-12 | 2025-04-12T10:00:00 |
| 8 | M007 | AfroBeat | Burna Boy | africa | burna@gmail.com | 3900 | 31 | 2025-05-01 | 2025-05-01T09:00:00 |
| 9 | M008 | EDM | Martin Garrix | EU | garrixgmail.com | 2200 | 26 | 2025-06-01 | 2025-06-01T08:00:00 |
| 10 | M009 | Classical | Mozart | EU | mozart@gmail.com | 1800 | 270 | 1770-01-01 | 1770TIME |
| 11 | M010 | Folk | Unknown | APAC | unknown@ | 1300 | 45 | 2025-07-01 | 2025-07-01T10:00:00 |
| 12 | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 2600 | 36 | 1980-01-01 | 1980-01-01T10:00:00 |
| 13 | M012 | JPOP | Ado | JP | ado@gmail.com | 2100 | 19 | 2025-08-01 | 2025-08-01T10:00:00 |
| 14 | M013 | Blues | BB King | US | bb@gmail.com | -300 | 55 | 2025-09-01 | 2025-09-01T10:00:00 |
| 15 | M014 | Opera | Pavarotti | EU | pav@gmail.com | 4200 | 67 | NULL | NULL |
| 16 | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 3200 | 29 | 2025-10-01 | 2025-10-01T11:00:00 |
| 17 | M016 | EDM | NULL | EU | testgmail.com | 2500 | 21 | 2025-11-01 | 2025-11-01T11:00:00 |
| 18 | M017 | Hip Hop | Eminem | NA | em@gmail.com | 5000 | 52 | 2025-12-01 | 2025-12-01T11:00:00 |
| 19 | M018 | Classical | Beethoven | EU | beethoven@gmail.com | 3400 | 248 | 1800-01-01 | 1800TIME |
| 20 | M019 | Country | Luke Bryan | US | luke@gmail.com | 2900 | 48 | 2025-01-10 | 2025-01-10T12:00:00 |
| 21 | M020 | Rock | Linkin Park | NA | lp@gmail.com | 4100 | 39 | 2025-02-11 | 2025-02-11T11:00:00 |
Explanation and Key Points
This
dataset intentionally mimics real production corruption. Duplicate IDs
represent duplicate ingestion pipelines. Invalid timestamps represent API
failures. Negative revenue values often emerge from refund systems. Impossible
ages frequently occur after ETL conversion problems.
The LENGTH
statement appears before assignments because SAS allocates memory during
compilation. If omitted or placed later, character truncation can silently
occur.
R does
not behave this way because character vectors are dynamically allocated.
Character Truncation Risk
Bad SAS
example:
data bad;
name='International Classical Music';
length name $10;
run;
proc print data=bad;
run;
LOG:
OUTPUT:
| Obs | name |
|---|---|
| 1 | Internatio |
Correct
approach:
data good;
length name $50;
name='International Classical Music';
run;
proc print data=good;
run;
OUTPUT:
| Obs | name |
|---|---|
| 1 | International Classical Music |
2.Enterprise Cleaning Workflow in SAS
data music_clean;
set music_raw;
retain Source_System "STREAMING_PLATFORM";
Genre=propcase(strip(Genre));
Artist_Name=propcase(compbl(strip(Artist_Name)));
Region=upcase(strip(Region));
Listener_Email=lowcase(strip(Listener_Email));
Revenue_Num = input(Revenue,best32.);
Revenue_Num=abs(Revenue_Num);
if Listener_Age<0 then Listener_Age=.;
if Listener_Age>120 then Listener_Age=.;
if find(Listener_Email,'@')=0 then
Listener_Email='invalid@email.com';
if Genre='Null' then Genre='Unknown';
Revenue=round(Revenue,0.01);
if missing(Release_Date) then
Release_Date='2025-01-01';
Music_Key=catx('-',Music_ID,Genre,Region);
Missing_Count=cmiss(of _character_)
+nmiss(of _numeric_);
drop Revenue;
rename Revenue_Num=Revenue;
run;
proc print data=music_clean;
run;
OUTPUT:
| Obs | Music_ID | Genre | Artist_Name | Region | Listener_Email | Listener_Age | Release_Date | Stream_TS | Source_System | Revenue | Music_Key | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Rock | Queen | EU | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EU | 1 |
| 2 | M001 | Rock | Queen | EUROPE | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EUROPE | 1 |
| 3 | M002 | Unknown | Metallica | NA | invalid@email.com | 34 | 2025-02-12 | INVALID | STREAMING_PLATFORM | 500 | M002-Unknown-NA | 1 |
| 4 | M003 | Jazz | Miles Davis | US | miles@gmail | . | 2025-03-05 | 2025-03-05T09:00:00 | STREAMING_PLATFORM | 1200 | M003-Jazz-US | 2 |
| 5 | M004 | Pop | Taylor Swift | APAC | swift@gmail.com | . | 2025-01-01 | 2025-03-20T08:30:00 | STREAMING_PLATFORM | 3500 | M004-Pop-APAC | 2 |
| 6 | M005 | Kpop | Bts | KR | army@gmail.com | 22 | 2025-13-10 | BADTIME | STREAMING_PLATFORM | 4500 | M005-Kpop-KR | 1 |
| 7 | M006 | Hiphop | Drake | NA | drake@gmail.com | 28 | 2025-04-12 | 2025-04-12T10:00:00 | STREAMING_PLATFORM | . | M006-Hiphop-NA | 2 |
| 8 | M007 | Afrobeat | Burna Boy | AFRICA | burna@gmail.com | 31 | 2025-05-01 | 2025-05-01T09:00:00 | STREAMING_PLATFORM | 3900 | M007-Afrobeat-AFRICA | 1 |
| 9 | M008 | Edm | Martin Garrix | EU | invalid@email.com | 26 | 2025-06-01 | 2025-06-01T08:00:00 | STREAMING_PLATFORM | 2200 | M008-Edm-EU | 1 |
| 10 | M009 | Classical | Mozart | EU | mozart@gmail.com | . | 1770-01-01 | 1770TIME | STREAMING_PLATFORM | 1800 | M009-Classical-EU | 2 |
| 11 | M010 | Folk | Unknown | APAC | unknown@ | 45 | 2025-07-01 | 2025-07-01T10:00:00 | STREAMING_PLATFORM | 1300 | M010-Folk-APAC | 1 |
| 12 | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 36 | 1980-01-01 | 1980-01-01T10:00:00 | STREAMING_PLATFORM | 2600 | M011-Reggae-NA | 1 |
| 13 | M012 | Jpop | Ado | JP | ado@gmail.com | 19 | 2025-08-01 | 2025-08-01T10:00:00 | STREAMING_PLATFORM | 2100 | M012-Jpop-JP | 1 |
| 14 | M013 | Blues | Bb King | US | bb@gmail.com | 55 | 2025-09-01 | 2025-09-01T10:00:00 | STREAMING_PLATFORM | 300 | M013-Blues-US | 1 |
| 15 | M014 | Opera | Pavarotti | EU | pav@gmail.com | 67 | NULL | NULL | STREAMING_PLATFORM | 4200 | M014-Opera-EU | 1 |
| 16 | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 29 | 2025-10-01 | 2025-10-01T11:00:00 | STREAMING_PLATFORM | 3200 | M015-Rock-EU | 1 |
| 17 | M016 | Edm | Null | EU | invalid@email.com | 21 | 2025-11-01 | 2025-11-01T11:00:00 | STREAMING_PLATFORM | 2500 | M016-Edm-EU | 1 |
| 18 | M017 | Hip Hop | Eminem | NA | em@gmail.com | 52 | 2025-12-01 | 2025-12-01T11:00:00 | STREAMING_PLATFORM | 5000 | M017-Hip Hop-NA | 1 |
| 19 | M018 | Classical | Beethoven | EU | beethoven@gmail.com | . | 1800-01-01 | 1800TIME | STREAMING_PLATFORM | 3400 | M018-Classical-EU | 2 |
| 20 | M019 | Country | Luke Bryan | US | luke@gmail.com | 48 | 2025-01-10 | 2025-01-10T12:00:00 | STREAMING_PLATFORM | 2900 | M019-Country-US | 1 |
| 21 | M020 | Rock | Linkin Park | NA | lp@gmail.com | 39 | 2025-02-11 | 2025-02-11T11:00:00 | STREAMING_PLATFORM | 4100 | M020-Rock-NA | 1 |
Explanation and Key Points
This
workflow uses several advanced production techniques.
- RETAIN preserves metadata
lineage.
- COMPBL removes duplicate
spaces.
- PROPCASE standardizes
capitalization.
- ABS corrects financial
reversals.
- FIND validates emails.
- CATX generates business
keys.
- CMISS and NMISS support
completeness metrics.
In
regulated environments, these transformations would be documented in Define.xml
traceability metadata.
3.PROC SQL Deduplication
proc sql;
create table music_sql as
select distinct *
from music_clean;
quit;
proc print data=music_sql;
run;
OUTPUT:
| Obs | Music_ID | Genre | Artist_Name | Region | Listener_Email | Listener_Age | Release_Date | Stream_TS | Source_System | Revenue | Music_Key | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Rock | Queen | EU | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EU | 1 |
| 2 | M001 | Rock | Queen | EUROPE | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EUROPE | 1 |
| 3 | M002 | Unknown | Metallica | NA | invalid@email.com | 34 | 2025-02-12 | INVALID | STREAMING_PLATFORM | 500 | M002-Unknown-NA | 1 |
| 4 | M003 | Jazz | Miles Davis | US | miles@gmail | . | 2025-03-05 | 2025-03-05T09:00:00 | STREAMING_PLATFORM | 1200 | M003-Jazz-US | 2 |
| 5 | M004 | Pop | Taylor Swift | APAC | swift@gmail.com | . | 2025-01-01 | 2025-03-20T08:30:00 | STREAMING_PLATFORM | 3500 | M004-Pop-APAC | 2 |
| 6 | M005 | Kpop | Bts | KR | army@gmail.com | 22 | 2025-13-10 | BADTIME | STREAMING_PLATFORM | 4500 | M005-Kpop-KR | 1 |
| 7 | M006 | Hiphop | Drake | NA | drake@gmail.com | 28 | 2025-04-12 | 2025-04-12T10:00:00 | STREAMING_PLATFORM | . | M006-Hiphop-NA | 2 |
| 8 | M007 | Afrobeat | Burna Boy | AFRICA | burna@gmail.com | 31 | 2025-05-01 | 2025-05-01T09:00:00 | STREAMING_PLATFORM | 3900 | M007-Afrobeat-AFRICA | 1 |
| 9 | M008 | Edm | Martin Garrix | EU | invalid@email.com | 26 | 2025-06-01 | 2025-06-01T08:00:00 | STREAMING_PLATFORM | 2200 | M008-Edm-EU | 1 |
| 10 | M009 | Classical | Mozart | EU | mozart@gmail.com | . | 1770-01-01 | 1770TIME | STREAMING_PLATFORM | 1800 | M009-Classical-EU | 2 |
| 11 | M010 | Folk | Unknown | APAC | unknown@ | 45 | 2025-07-01 | 2025-07-01T10:00:00 | STREAMING_PLATFORM | 1300 | M010-Folk-APAC | 1 |
| 12 | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 36 | 1980-01-01 | 1980-01-01T10:00:00 | STREAMING_PLATFORM | 2600 | M011-Reggae-NA | 1 |
| 13 | M012 | Jpop | Ado | JP | ado@gmail.com | 19 | 2025-08-01 | 2025-08-01T10:00:00 | STREAMING_PLATFORM | 2100 | M012-Jpop-JP | 1 |
| 14 | M013 | Blues | Bb King | US | bb@gmail.com | 55 | 2025-09-01 | 2025-09-01T10:00:00 | STREAMING_PLATFORM | 300 | M013-Blues-US | 1 |
| 15 | M014 | Opera | Pavarotti | EU | pav@gmail.com | 67 | NULL | NULL | STREAMING_PLATFORM | 4200 | M014-Opera-EU | 1 |
| 16 | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 29 | 2025-10-01 | 2025-10-01T11:00:00 | STREAMING_PLATFORM | 3200 | M015-Rock-EU | 1 |
| 17 | M016 | Edm | Null | EU | invalid@email.com | 21 | 2025-11-01 | 2025-11-01T11:00:00 | STREAMING_PLATFORM | 2500 | M016-Edm-EU | 1 |
| 18 | M017 | Hip Hop | Eminem | NA | em@gmail.com | 52 | 2025-12-01 | 2025-12-01T11:00:00 | STREAMING_PLATFORM | 5000 | M017-Hip Hop-NA | 1 |
| 19 | M018 | Classical | Beethoven | EU | beethoven@gmail.com | . | 1800-01-01 | 1800TIME | STREAMING_PLATFORM | 3400 | M018-Classical-EU | 2 |
| 20 | M019 | Country | Luke Bryan | US | luke@gmail.com | 48 | 2025-01-10 | 2025-01-10T12:00:00 | STREAMING_PLATFORM | 2900 | M019-Country-US | 1 |
| 21 | M020 | Rock | Linkin Park | NA | lp@gmail.com | 39 | 2025-02-11 | 2025-02-11T11:00:00 | STREAMING_PLATFORM | 4100 | M020-Rock-NA | 1 |
Explanation
PROC SQL
excels in relational operations and deduplication.
Advantages:
- Simpler syntax
- Better join readability
- Database pushdown
optimization
4.DATA Step Deduplication
proc sort data=music_clean out=music_nodup nodupkey;
by Music_ID;
run;
proc print data=music_nodup;
run;
OUTPUT:
| Obs | Music_ID | Genre | Artist_Name | Region | Listener_Email | Listener_Age | Release_Date | Stream_TS | Source_System | Revenue | Music_Key | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Rock | Queen | EU | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EU | 1 |
| 2 | M002 | Unknown | Metallica | NA | invalid@email.com | 34 | 2025-02-12 | INVALID | STREAMING_PLATFORM | 500 | M002-Unknown-NA | 1 |
| 3 | M003 | Jazz | Miles Davis | US | miles@gmail | . | 2025-03-05 | 2025-03-05T09:00:00 | STREAMING_PLATFORM | 1200 | M003-Jazz-US | 2 |
| 4 | M004 | Pop | Taylor Swift | APAC | swift@gmail.com | . | 2025-01-01 | 2025-03-20T08:30:00 | STREAMING_PLATFORM | 3500 | M004-Pop-APAC | 2 |
| 5 | M005 | Kpop | Bts | KR | army@gmail.com | 22 | 2025-13-10 | BADTIME | STREAMING_PLATFORM | 4500 | M005-Kpop-KR | 1 |
| 6 | M006 | Hiphop | Drake | NA | drake@gmail.com | 28 | 2025-04-12 | 2025-04-12T10:00:00 | STREAMING_PLATFORM | . | M006-Hiphop-NA | 2 |
| 7 | M007 | Afrobeat | Burna Boy | AFRICA | burna@gmail.com | 31 | 2025-05-01 | 2025-05-01T09:00:00 | STREAMING_PLATFORM | 3900 | M007-Afrobeat-AFRICA | 1 |
| 8 | M008 | Edm | Martin Garrix | EU | invalid@email.com | 26 | 2025-06-01 | 2025-06-01T08:00:00 | STREAMING_PLATFORM | 2200 | M008-Edm-EU | 1 |
| 9 | M009 | Classical | Mozart | EU | mozart@gmail.com | . | 1770-01-01 | 1770TIME | STREAMING_PLATFORM | 1800 | M009-Classical-EU | 2 |
| 10 | M010 | Folk | Unknown | APAC | unknown@ | 45 | 2025-07-01 | 2025-07-01T10:00:00 | STREAMING_PLATFORM | 1300 | M010-Folk-APAC | 1 |
| 11 | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 36 | 1980-01-01 | 1980-01-01T10:00:00 | STREAMING_PLATFORM | 2600 | M011-Reggae-NA | 1 |
| 12 | M012 | Jpop | Ado | JP | ado@gmail.com | 19 | 2025-08-01 | 2025-08-01T10:00:00 | STREAMING_PLATFORM | 2100 | M012-Jpop-JP | 1 |
| 13 | M013 | Blues | Bb King | US | bb@gmail.com | 55 | 2025-09-01 | 2025-09-01T10:00:00 | STREAMING_PLATFORM | 300 | M013-Blues-US | 1 |
| 14 | M014 | Opera | Pavarotti | EU | pav@gmail.com | 67 | NULL | NULL | STREAMING_PLATFORM | 4200 | M014-Opera-EU | 1 |
| 15 | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 29 | 2025-10-01 | 2025-10-01T11:00:00 | STREAMING_PLATFORM | 3200 | M015-Rock-EU | 1 |
| 16 | M016 | Edm | Null | EU | invalid@email.com | 21 | 2025-11-01 | 2025-11-01T11:00:00 | STREAMING_PLATFORM | 2500 | M016-Edm-EU | 1 |
| 17 | M017 | Hip Hop | Eminem | NA | em@gmail.com | 52 | 2025-12-01 | 2025-12-01T11:00:00 | STREAMING_PLATFORM | 5000 | M017-Hip Hop-NA | 1 |
| 18 | M018 | Classical | Beethoven | EU | beethoven@gmail.com | . | 1800-01-01 | 1800TIME | STREAMING_PLATFORM | 3400 | M018-Classical-EU | 2 |
| 19 | M019 | Country | Luke Bryan | US | luke@gmail.com | 48 | 2025-01-10 | 2025-01-10T12:00:00 | STREAMING_PLATFORM | 2900 | M019-Country-US | 1 |
| 20 | M020 | Rock | Linkin Park | NA | lp@gmail.com | 39 | 2025-02-11 | 2025-02-11T11:00:00 | STREAMING_PLATFORM | 4100 | M020-Rock-NA | 1 |
Explanation
DATA Step
processing is often faster on extremely large datasets because SAS can optimize
sequential disk reads.
5.Advanced SAS Features
5.1 SELECT WHEN
data music_clean;
length Category $20.;
set music_clean;
select;
when (Genre='Rock') Category='Modern';
when (Genre='Classical') Category='Traditional';
otherwise Category='Other';
end;
run;
proc print data=music_clean;
run;
OUTPUT:
| Obs | Category | Music_ID | Genre | Artist_Name | Region | Listener_Email | Listener_Age | Release_Date | Stream_TS | Source_System | Revenue | Music_Key | Missing_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Modern | M001 | Rock | Queen | EU | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EU | 1 |
| 2 | Modern | M001 | Rock | Queen | EUROPE | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EUROPE | 1 |
| 3 | Other | M002 | Unknown | Metallica | NA | invalid@email.com | 34 | 2025-02-12 | INVALID | STREAMING_PLATFORM | 500 | M002-Unknown-NA | 1 |
| 4 | Other | M003 | Jazz | Miles Davis | US | miles@gmail | . | 2025-03-05 | 2025-03-05T09:00:00 | STREAMING_PLATFORM | 1200 | M003-Jazz-US | 2 |
| 5 | Other | M004 | Pop | Taylor Swift | APAC | swift@gmail.com | . | 2025-01-01 | 2025-03-20T08:30:00 | STREAMING_PLATFORM | 3500 | M004-Pop-APAC | 2 |
| 6 | Other | M005 | Kpop | Bts | KR | army@gmail.com | 22 | 2025-13-10 | BADTIME | STREAMING_PLATFORM | 4500 | M005-Kpop-KR | 1 |
| 7 | Other | M006 | Hiphop | Drake | NA | drake@gmail.com | 28 | 2025-04-12 | 2025-04-12T10:00:00 | STREAMING_PLATFORM | . | M006-Hiphop-NA | 2 |
| 8 | Other | M007 | Afrobeat | Burna Boy | AFRICA | burna@gmail.com | 31 | 2025-05-01 | 2025-05-01T09:00:00 | STREAMING_PLATFORM | 3900 | M007-Afrobeat-AFRICA | 1 |
| 9 | Other | M008 | Edm | Martin Garrix | EU | invalid@email.com | 26 | 2025-06-01 | 2025-06-01T08:00:00 | STREAMING_PLATFORM | 2200 | M008-Edm-EU | 1 |
| 10 | Traditional | M009 | Classical | Mozart | EU | mozart@gmail.com | . | 1770-01-01 | 1770TIME | STREAMING_PLATFORM | 1800 | M009-Classical-EU | 2 |
| 11 | Other | M010 | Folk | Unknown | APAC | unknown@ | 45 | 2025-07-01 | 2025-07-01T10:00:00 | STREAMING_PLATFORM | 1300 | M010-Folk-APAC | 1 |
| 12 | Other | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 36 | 1980-01-01 | 1980-01-01T10:00:00 | STREAMING_PLATFORM | 2600 | M011-Reggae-NA | 1 |
| 13 | Other | M012 | Jpop | Ado | JP | ado@gmail.com | 19 | 2025-08-01 | 2025-08-01T10:00:00 | STREAMING_PLATFORM | 2100 | M012-Jpop-JP | 1 |
| 14 | Other | M013 | Blues | Bb King | US | bb@gmail.com | 55 | 2025-09-01 | 2025-09-01T10:00:00 | STREAMING_PLATFORM | 300 | M013-Blues-US | 1 |
| 15 | Other | M014 | Opera | Pavarotti | EU | pav@gmail.com | 67 | NULL | NULL | STREAMING_PLATFORM | 4200 | M014-Opera-EU | 1 |
| 16 | Modern | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 29 | 2025-10-01 | 2025-10-01T11:00:00 | STREAMING_PLATFORM | 3200 | M015-Rock-EU | 1 |
| 17 | Other | M016 | Edm | Null | EU | invalid@email.com | 21 | 2025-11-01 | 2025-11-01T11:00:00 | STREAMING_PLATFORM | 2500 | M016-Edm-EU | 1 |
| 18 | Other | M017 | Hip Hop | Eminem | NA | em@gmail.com | 52 | 2025-12-01 | 2025-12-01T11:00:00 | STREAMING_PLATFORM | 5000 | M017-Hip Hop-NA | 1 |
| 19 | Traditional | M018 | Classical | Beethoven | EU | beethoven@gmail.com | . | 1800-01-01 | 1800TIME | STREAMING_PLATFORM | 3400 | M018-Classical-EU | 2 |
| 20 | Other | M019 | Country | Luke Bryan | US | luke@gmail.com | 48 | 2025-01-10 | 2025-01-10T12:00:00 | STREAMING_PLATFORM | 2900 | M019-Country-US | 1 |
| 21 | Modern | M020 | Rock | Linkin Park | NA | lp@gmail.com | 39 | 2025-02-11 | 2025-02-11T11:00:00 | STREAMING_PLATFORM | 4100 | M020-Rock-NA | 1 |
5.2 ARRAY Processing
data music_clean;
set music_clean;
array chars(*) Genre Artist_Name Region;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
end;
run;
proc print data=music_clean;
run;
OUTPUT:
| Obs | Category | Music_ID | Genre | Artist_Name | Region | Listener_Email | Listener_Age | Release_Date | Stream_TS | Source_System | Revenue | Music_Key | Missing_Count | i |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Modern | M001 | Rock | Queen | EU | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EU | 1 | 4 |
| 2 | Modern | M001 | Rock | Queen | EUROPE | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EUROPE | 1 | 4 |
| 3 | Other | M002 | Unknown | Metallica | NA | invalid@email.com | 34 | 2025-02-12 | INVALID | STREAMING_PLATFORM | 500 | M002-Unknown-NA | 1 | 4 |
| 4 | Other | M003 | Jazz | Miles Davis | US | miles@gmail | . | 2025-03-05 | 2025-03-05T09:00:00 | STREAMING_PLATFORM | 1200 | M003-Jazz-US | 2 | 4 |
| 5 | Other | M004 | Pop | Taylor Swift | APAC | swift@gmail.com | . | 2025-01-01 | 2025-03-20T08:30:00 | STREAMING_PLATFORM | 3500 | M004-Pop-APAC | 2 | 4 |
| 6 | Other | M005 | Kpop | Bts | KR | army@gmail.com | 22 | 2025-13-10 | BADTIME | STREAMING_PLATFORM | 4500 | M005-Kpop-KR | 1 | 4 |
| 7 | Other | M006 | Hiphop | Drake | NA | drake@gmail.com | 28 | 2025-04-12 | 2025-04-12T10:00:00 | STREAMING_PLATFORM | . | M006-Hiphop-NA | 2 | 4 |
| 8 | Other | M007 | Afrobeat | Burna Boy | AFRICA | burna@gmail.com | 31 | 2025-05-01 | 2025-05-01T09:00:00 | STREAMING_PLATFORM | 3900 | M007-Afrobeat-AFRICA | 1 | 4 |
| 9 | Other | M008 | Edm | Martin Garrix | EU | invalid@email.com | 26 | 2025-06-01 | 2025-06-01T08:00:00 | STREAMING_PLATFORM | 2200 | M008-Edm-EU | 1 | 4 |
| 10 | Traditional | M009 | Classical | Mozart | EU | mozart@gmail.com | . | 1770-01-01 | 1770TIME | STREAMING_PLATFORM | 1800 | M009-Classical-EU | 2 | 4 |
| 11 | Other | M010 | Folk | Unknown | APAC | unknown@ | 45 | 2025-07-01 | 2025-07-01T10:00:00 | STREAMING_PLATFORM | 1300 | M010-Folk-APAC | 1 | 4 |
| 12 | Other | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 36 | 1980-01-01 | 1980-01-01T10:00:00 | STREAMING_PLATFORM | 2600 | M011-Reggae-NA | 1 | 4 |
| 13 | Other | M012 | Jpop | Ado | JP | ado@gmail.com | 19 | 2025-08-01 | 2025-08-01T10:00:00 | STREAMING_PLATFORM | 2100 | M012-Jpop-JP | 1 | 4 |
| 14 | Other | M013 | Blues | Bb King | US | bb@gmail.com | 55 | 2025-09-01 | 2025-09-01T10:00:00 | STREAMING_PLATFORM | 300 | M013-Blues-US | 1 | 4 |
| 15 | Other | M014 | Opera | Pavarotti | EU | pav@gmail.com | 67 | NULL | NULL | STREAMING_PLATFORM | 4200 | M014-Opera-EU | 1 | 4 |
| 16 | Modern | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 29 | 2025-10-01 | 2025-10-01T11:00:00 | STREAMING_PLATFORM | 3200 | M015-Rock-EU | 1 | 4 |
| 17 | Other | M016 | Edm | Null | EU | invalid@email.com | 21 | 2025-11-01 | 2025-11-01T11:00:00 | STREAMING_PLATFORM | 2500 | M016-Edm-EU | 1 | 4 |
| 18 | Other | M017 | Hip Hop | Eminem | NA | em@gmail.com | 52 | 2025-12-01 | 2025-12-01T11:00:00 | STREAMING_PLATFORM | 5000 | M017-Hip Hop-NA | 1 | 4 |
| 19 | Traditional | M018 | Classical | Beethoven | EU | beethoven@gmail.com | . | 1800-01-01 | 1800TIME | STREAMING_PLATFORM | 3400 | M018-Classical-EU | 2 | 4 |
| 20 | Other | M019 | Country | Luke Bryan | US | luke@gmail.com | 48 | 2025-01-10 | 2025-01-10T12:00:00 | STREAMING_PLATFORM | 2900 | M019-Country-US | 1 | 4 |
| 21 | Modern | M020 | Rock | Linkin Park | NA | lp@gmail.com | 39 | 2025-02-11 | 2025-02-11T11:00:00 | STREAMING_PLATFORM | 4100 | M020-Rock-NA | 1 | 4 |
5.3 FIRST./LAST. Logic
data music_clean;
set music_clean;
by Music_ID;
if first.Music_ID then Count=0;
Count+1;
if last.Music_ID;
run;
proc print data=music_clean;
run;
OUTPUT:
| Obs | Category | Music_ID | Genre | Artist_Name | Region | Listener_Email | Listener_Age | Release_Date | Stream_TS | Source_System | Revenue | Music_Key | Missing_Count | i | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Modern | M001 | Rock | Queen | EUROPE | fan@gmail.com | 25 | 2025-01-01 | 2025-01-01T10:20:00 | STREAMING_PLATFORM | 2000 | M001-Rock-EUROPE | 1 | 4 | 2 |
| 2 | Other | M002 | Unknown | Metallica | NA | invalid@email.com | 34 | 2025-02-12 | INVALID | STREAMING_PLATFORM | 500 | M002-Unknown-NA | 1 | 4 | 1 |
| 3 | Other | M003 | Jazz | Miles Davis | US | miles@gmail | . | 2025-03-05 | 2025-03-05T09:00:00 | STREAMING_PLATFORM | 1200 | M003-Jazz-US | 2 | 4 | 1 |
| 4 | Other | M004 | Pop | Taylor Swift | APAC | swift@gmail.com | . | 2025-01-01 | 2025-03-20T08:30:00 | STREAMING_PLATFORM | 3500 | M004-Pop-APAC | 2 | 4 | 1 |
| 5 | Other | M005 | Kpop | Bts | KR | army@gmail.com | 22 | 2025-13-10 | BADTIME | STREAMING_PLATFORM | 4500 | M005-Kpop-KR | 1 | 4 | 1 |
| 6 | Other | M006 | Hiphop | Drake | NA | drake@gmail.com | 28 | 2025-04-12 | 2025-04-12T10:00:00 | STREAMING_PLATFORM | . | M006-Hiphop-NA | 2 | 4 | 1 |
| 7 | Other | M007 | Afrobeat | Burna Boy | AFRICA | burna@gmail.com | 31 | 2025-05-01 | 2025-05-01T09:00:00 | STREAMING_PLATFORM | 3900 | M007-Afrobeat-AFRICA | 1 | 4 | 1 |
| 8 | Other | M008 | Edm | Martin Garrix | EU | invalid@email.com | 26 | 2025-06-01 | 2025-06-01T08:00:00 | STREAMING_PLATFORM | 2200 | M008-Edm-EU | 1 | 4 | 1 |
| 9 | Traditional | M009 | Classical | Mozart | EU | mozart@gmail.com | . | 1770-01-01 | 1770TIME | STREAMING_PLATFORM | 1800 | M009-Classical-EU | 2 | 4 | 1 |
| 10 | Other | M010 | Folk | Unknown | APAC | unknown@ | 45 | 2025-07-01 | 2025-07-01T10:00:00 | STREAMING_PLATFORM | 1300 | M010-Folk-APAC | 1 | 4 | 1 |
| 11 | Other | M011 | Reggae | Bob Marley | NA | bob@gmail.com | 36 | 1980-01-01 | 1980-01-01T10:00:00 | STREAMING_PLATFORM | 2600 | M011-Reggae-NA | 1 | 4 | 1 |
| 12 | Other | M012 | Jpop | Ado | JP | ado@gmail.com | 19 | 2025-08-01 | 2025-08-01T10:00:00 | STREAMING_PLATFORM | 2100 | M012-Jpop-JP | 1 | 4 | 1 |
| 13 | Other | M013 | Blues | Bb King | US | bb@gmail.com | 55 | 2025-09-01 | 2025-09-01T10:00:00 | STREAMING_PLATFORM | 300 | M013-Blues-US | 1 | 4 | 1 |
| 14 | Other | M014 | Opera | Pavarotti | EU | pav@gmail.com | 67 | NULL | NULL | STREAMING_PLATFORM | 4200 | M014-Opera-EU | 1 | 4 | 1 |
| 15 | Modern | M015 | Rock | Nirvana | EU | nirvana@gmail.com | 29 | 2025-10-01 | 2025-10-01T11:00:00 | STREAMING_PLATFORM | 3200 | M015-Rock-EU | 1 | 4 | 1 |
| 16 | Other | M016 | Edm | Null | EU | invalid@email.com | 21 | 2025-11-01 | 2025-11-01T11:00:00 | STREAMING_PLATFORM | 2500 | M016-Edm-EU | 1 | 4 | 1 |
| 17 | Other | M017 | Hip Hop | Eminem | NA | em@gmail.com | 52 | 2025-12-01 | 2025-12-01T11:00:00 | STREAMING_PLATFORM | 5000 | M017-Hip Hop-NA | 1 | 4 | 1 |
| 18 | Traditional | M018 | Classical | Beethoven | EU | beethoven@gmail.com | . | 1800-01-01 | 1800TIME | STREAMING_PLATFORM | 3400 | M018-Classical-EU | 2 | 4 | 1 |
| 19 | Other | M019 | Country | Luke Bryan | US | luke@gmail.com | 48 | 2025-01-10 | 2025-01-10T12:00:00 | STREAMING_PLATFORM | 2900 | M019-Country-US | 1 | 4 | 1 |
| 20 | Modern | M020 | Rock | Linkin Park | NA | lp@gmail.com | 39 | 2025-02-11 | 2025-02-11T11:00:00 | STREAMING_PLATFORM | 4100 | M020-Rock-NA | 1 | 4 | 1 |
6.R Raw Dataset
library(tidyverse)
music_raw <- read_delim(
"
Music_ID|Genre|Artist_Name|Region|Listener_Email|Revenue|Listener_Age|Release_Date|Stream_TS
M001|rock|queen|eu|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00
M001|ROCK| QUEEN |EUROPE|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00
M002|NULL|metallica|na|metallica.com|-500|34|2025-02-12|INVALID
M003|Jazz|Miles Davis|US|miles@gmail|1200|-10|2025-03-05|2025-03-05T09:00:00
M004|Pop| taylor swift |APAC|swift@gmail.com|3500|200||2025-03-20T08:30:00
M005|KPOP|BTS|KR|army@gmail.com|4500|22|2025-13-10|BADTIME
M006|HipHop|Drake|NA|drake@gmail.com|NULL|28|2025-04-12|2025-04-12T10:00:00
M007|AfroBeat| Burna Boy |africa|burna@gmail.com|3900|31|2025-05-01|2025-05-01T09:00:00
M008|EDM|Martin Garrix|EU|garrixgmail.com|2200|26|2025-06-01|2025-06-01T08:00:00
M009|Classical|Mozart|EU|mozart@gmail.com|1800|270|1770-01-01|1770TIME
M010|Folk|Unknown|APAC|unknown@|1300|45|2025-07-01|2025-07-01T10:00:00
M011|Reggae|Bob Marley|NA|bob@gmail.com|2600|36|1980-01-01|1980-01-01T10:00:00
M012|JPOP|Ado|JP|ado@gmail.com|2100|19|2025-08-01|2025-08-01T10:00:00
M013|Blues|BB King|US|bb@gmail.com|-300|55|2025-09-01|2025-09-01T10:00:00
M014|Opera|Pavarotti|EU|pav@gmail.com|4200|67|NULL|NULL
M015|Rock|Nirvana|EU|nirvana@gmail.com|3200|29|2025-10-01|2025-10-01T11:00:00
M016|EDM|NULL|EU|testgmail.com|2500|21|2025-11-01|2025-11-01T11:00:00
M017|Hip Hop| Eminem |NA|em@gmail.com|5000|52|2025-12-01|2025-12-01T11:00:00
M018|Classical|Beethoven|EU|beethoven@gmail.com|3400|248|1800-01-01|1800TIME
M019|Country|Luke Bryan|US|luke@gmail.com|2900|48|2025-01-10|2025-01-10T12:00:00
M020|Rock|Linkin Park|NA|lp@gmail.com|4100|39|2025-02-11|2025-02-11T11:00:00
",
delim = "|",
show_col_types = FALSE,
trim_ws = FALSE,
col_types = cols(
Music_ID = col_character(),
Genre = col_character(),
Artist_Name = col_character(),
Region = col_character(),
Listener_Email = col_character(),
Revenue = col_character(),
Listener_Age = col_double(),
Release_Date = col_character(),
Stream_TS = col_character()
)
)
OUTPUT:
|
Music_ID |
Genre |
Artist_Name |
Region |
Listener_Email |
Revenue |
Listener_Age |
Release_Date |
Stream_TS |
|
M001 |
rock |
queen |
eu |
fan@gmail.com |
2000 |
25 |
2025-01-01 |
2025-01-01T10:20:00 |
|
M001 |
ROCK |
QUEEN |
EUROPE |
fan@gmail.com |
2000 |
25 |
2025-01-01 |
2025-01-01T10:20:00 |
|
M002 |
NULL |
metallica |
na |
metallica.com |
-500 |
34 |
2025-02-12 |
INVALID |
|
M003 |
Jazz |
Miles
Davis |
US |
miles@gmail |
1200 |
-10 |
2025-03-05 |
2025-03-05T09:00:00 |
|
M004 |
Pop |
taylor swift |
APAC |
swift@gmail.com |
3500 |
200 |
2025-03-20T08:30:00 |
|
|
M005 |
KPOP |
BTS |
KR |
army@gmail.com |
4500 |
22 |
2025-13-10 |
BADTIME |
|
M006 |
HipHop |
Drake |
drake@gmail.com |
NULL |
28 |
2025-04-12 |
2025-04-12T10:00:00 |
|
|
M007 |
AfroBeat |
Burna Boy |
africa |
burna@gmail.com |
3900 |
31 |
2025-05-01 |
2025-05-01T09:00:00 |
|
M008 |
EDM |
Martin
Garrix |
EU |
garrixgmail.com |
2200 |
26 |
2025-06-01 |
2025-06-01T08:00:00 |
|
M009 |
Classical |
Mozart |
EU |
mozart@gmail.com |
1800 |
270 |
1770-01-01 |
1770TIME |
|
M010 |
Folk |
Unknown |
APAC |
unknown@ |
1300 |
45 |
2025-07-01 |
2025-07-01T10:00:00 |
|
M011 |
Reggae |
Bob
Marley |
bob@gmail.com |
2600 |
36 |
1980-01-01 |
1980-01-01T10:00:00 |
|
|
M012 |
JPOP |
Ado |
JP |
ado@gmail.com |
2100 |
19 |
2025-08-01 |
2025-08-01T10:00:00 |
|
M013 |
Blues |
BB King |
US |
bb@gmail.com |
-300 |
55 |
2025-09-01 |
2025-09-01T10:00:00 |
|
M014 |
Opera |
Pavarotti |
EU |
pav@gmail.com |
4200 |
67 |
NULL |
NULL |
|
M015 |
Rock |
Nirvana |
EU |
nirvana@gmail.com |
3200 |
29 |
2025-10-01 |
2025-10-01T11:00:00 |
|
M016 |
EDM |
NULL |
EU |
testgmail.com |
2500 |
21 |
2025-11-01 |
2025-11-01T11:00:00 |
|
M017 |
Hip Hop |
Eminem |
em@gmail.com |
5000 |
52 |
2025-12-01 |
2025-12-01T11:00:00 |
|
|
M018 |
Classical |
Beethoven |
EU |
beethoven@gmail.com |
3400 |
248 |
1800-01-01 |
1800TIME |
|
M019 |
Country |
Luke
Bryan |
US |
luke@gmail.com |
2900 |
48 |
2025-01-10 |
2025-01-10T12:00:00 |
|
M020 |
Rock |
Linkin
Park |
lp@gmail.com |
4100 |
39 |
2025-02-11 |
2025-02-11T11:00:00 |
7.Equivalent R Cleaning Layer
music_clean <- music_raw %>%
janitor::clean_names() %>%
mutate(
revenue = na_if(revenue, "NULL"),
revenue = as.numeric(revenue),
genre = str_to_title(str_trim(genre)),
artist_name = str_to_title(str_trim(artist_name)),
region = str_to_upper(str_trim(region)),
revenue = abs(revenue),
listener_age = if_else(
listener_age < 0 | listener_age > 120,NA_real_,
listener_age),
listener_email = if_else(grepl("@", listener_email),
listener_email,"invalid@email.com"),
genre = na_if(genre, "NULL"),
genre = coalesce(genre, "Unknown")
) %>%
distinct(music_id, .keep_all = TRUE)
OUTPUT:
|
music_id |
genre |
artist_name |
region |
listener_email |
revenue |
listener_age |
release_date |
stream_ts |
|
M001 |
Rock |
Queen |
EU |
fan@gmail.com |
2000 |
25 |
2025-01-01 |
2025-01-01T10:20:00 |
|
M002 |
Null |
Metallica |
NA |
invalid@email.com |
500 |
34 |
2025-02-12 |
INVALID |
|
M003 |
Jazz |
Miles
Davis |
US |
miles@gmail |
1200 |
2025-03-05 |
2025-03-05T09:00:00 |
|
|
M004 |
Pop |
Taylor
Swift |
APAC |
swift@gmail.com |
3500 |
2025-03-20T08:30:00 |
||
|
M005 |
Kpop |
Bts |
KR |
army@gmail.com |
4500 |
22 |
2025-13-10 |
BADTIME |
|
M006 |
Hiphop |
Drake |
drake@gmail.com |
28 |
2025-04-12 |
2025-04-12T10:00:00 |
||
|
M007 |
Afrobeat |
Burna
Boy |
AFRICA |
burna@gmail.com |
3900 |
31 |
2025-05-01 |
2025-05-01T09:00:00 |
|
M008 |
Edm |
Martin
Garrix |
EU |
invalid@email.com |
2200 |
26 |
2025-06-01 |
2025-06-01T08:00:00 |
|
M009 |
Classical |
Mozart |
EU |
mozart@gmail.com |
1800 |
1770-01-01 |
1770TIME |
|
|
M010 |
Folk |
Unknown |
APAC |
unknown@ |
1300 |
45 |
2025-07-01 |
2025-07-01T10:00:00 |
|
M011 |
Reggae |
Bob
Marley |
bob@gmail.com |
2600 |
36 |
1980-01-01 |
1980-01-01T10:00:00 |
|
|
M012 |
Jpop |
Ado |
JP |
ado@gmail.com |
2100 |
19 |
2025-08-01 |
2025-08-01T10:00:00 |
|
M013 |
Blues |
Bb King |
US |
bb@gmail.com |
300 |
55 |
2025-09-01 |
2025-09-01T10:00:00 |
|
M014 |
Opera |
Pavarotti |
EU |
pav@gmail.com |
4200 |
67 |
NULL |
NULL |
|
M015 |
Rock |
Nirvana |
EU |
nirvana@gmail.com |
3200 |
29 |
2025-10-01 |
2025-10-01T11:00:00 |
|
M016 |
Edm |
Null |
EU |
invalid@email.com |
2500 |
21 |
2025-11-01 |
2025-11-01T11:00:00 |
|
M017 |
Hip Hop |
Eminem |
em@gmail.com |
5000 |
52 |
2025-12-01 |
2025-12-01T11:00:00 |
|
|
M018 |
Classical |
Beethoven |
EU |
beethoven@gmail.com |
3400 |
1800-01-01 |
1800TIME |
|
|
M019 |
Country |
Luke
Bryan |
US |
luke@gmail.com |
2900 |
48 |
2025-01-10 |
2025-01-10T12:00:00 |
|
M020 |
Rock |
Linkin
Park |
lp@gmail.com |
4100 |
39 |
2025-02-11 |
2025-02-11T11:00:00 |
Explanation and Key Points
R's mutate()
resembles SAS assignment statements.
|
SAS |
R |
|
IF THEN
ELSE |
if_else() |
|
SELECT
WHEN |
case_when() |
|
STRIP |
str_trim() |
|
UPCASE |
str_to_upper() |
|
LOWCASE |
str_to_lower() |
|
CATX |
unite() |
|
COALESCEC |
coalesce() |
Enterprise Validation and
Compliance
Clinical
programming environments demand:
- SDTM traceability
- ADaM derivation transparency
- Independent QC programming
- Audit trail preservation
- Metadata governance
A
dangerous SAS behavior is:
if age < 18 then flag='Y';
Missing
values satisfy this condition because:
. < 18 = TRUE
This has
caused real regulatory findings.
20 Data Cleaning Best
Practices
- Standardize metadata.
- Validate all primary keys.
- Detect duplicates early.
- Separate business rules from
code.
- Use macros for reuse.
- Preserve audit trails.
- Use version control.
- Perform independent QC.
- Validate date ranges.
- Normalize categories.
- Avoid hardcoding values.
- Document derivations.
- Track lineage.
- Use parameterized pipelines.
- Validate email formats.
- Standardize timestamps.
- Test edge cases.
- Use defensive programming.
- Monitor production drift.
- Automate reconciliation.
Business Logic Behind
Cleaning
Missing
values are rarely random.
A missing
patient age can invalidate subgroup analyses. A negative salary can distort
compensation models. Missing visit dates can affect survival analysis.
Standardization
improves reproducibility.
For
example:
rock
ROCK
Rock
become:
Rock
This
prevents false category inflation.
20 One-Line Insights
- Dirty data creates expensive
mistakes.
- Validation logic beats
visual inspection.
- Missing values hide
silently.
- Standardization improves
reproducibility.
- Duplicate records distort
trends.
- Audit trails build trust.
- Metadata is business
documentation.
- AI depends on clean inputs.
- Dates drive analytics.
- Categories require
governance.
- Automation reduces risk.
- Defensive programming saves
projects.
- Traceability supports
compliance.
- Macros improve consistency.
- PROC SQL improves
readability.
- DATA Step improves control.
- R accelerates exploration.
- SAS excels at governance.
- Validation is continuous.
- Data quality is everyone's
responsibility.
SAS vs R Comparison
|
Capability |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Excellent |
Limited |
|
Flexibility |
Moderate |
Excellent |
|
Visualization |
Moderate |
Excellent |
|
Scalability |
Excellent |
Good |
|
Metadata
Control |
Excellent |
Moderate |
|
Reproducibility |
Excellent |
Excellent |
Conclusion
Modern
analytical ecosystems are only as reliable as the quality of their underlying
data. Whether analyzing global music genres, clinical trial records, banking
transactions, or insurance claims, poor-quality data creates expensive business
failures.
SAS
provides enterprise-grade governance, auditability, traceability, and
regulatory compliance. R provides unmatched flexibility and exploratory power.
Together
they form one of the strongest production analytics combinations available
today.
The
future of analytics belongs not to organizations with the largest data lakes,
but to those with the cleanest, most trusted, and most reproducible datasets.
In
production analytics, data cleaning is not preparation work.
It is the
foundation upon which every dashboard, every AI model, every regulatory
submission, and every executive decision ultimately depends.
Interview Questions
1. How would you remove duplicate records?
Use:
- PROC SORT NODUPKEY
- PROC SQL DISTINCT
- dplyr::distinct()
2. Why is LENGTH placement important?
SAS
determines storage during compile time. Incorrect placement causes silent
truncation.
3. How do missing numeric values behave in SAS?
Missing
numeric values are smaller than all valid numbers.
4. When should PROC SQL be preferred?
For joins,
aggregations, and relational transformations.
5. When should DATA Step be preferred?
For
row-wise transformations and complex business logic.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MUSIC DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment