Cleaning the World's Soundtrack: Advanced SAS and R Techniques for Reliable Music Analytics

From Symphony to Statistics: Transforming Global Music Data into Trusted Analytical Intelligence with SAS and R

Introduction:The Business Crisis Nobody Expected

A global music streaming company launched an AI recommendation engine intended to increase subscriber engagement across continents. Instead, the platform began recommending classical Indian ragas to heavy metal listeners, African Afrobeat to K-Pop fans, and duplicate royalty payments were issued to artists because duplicate track IDs existed in production systems.

Executives discovered serious data quality failures:

  • Duplicate music identifiers
  • Missing release dates
  • Negative streaming revenue values
  • Invalid listener ages
  • Corrupted genre labels
  • Broken email addresses
  • Region code inconsistencies
  • Invalid timestamps
  • Mixed uppercase and lowercase artist names
  • Embedded whitespace corruption
  • NULL strings stored as actual text

The result was devastating:

  • AI recommendation failures
  • Incorrect royalty calculations
  • Executive dashboards displaying false trends
  • Financial reporting discrepancies
  • Regulatory audit concerns
  • Damaged customer experience

This is precisely why enterprise data engineering exists.

Understanding Different Types of Music in the World

The world contains thousands of musical traditions, but enterprise analytics often categorize them into major genres:

Genre

Origin

Characteristics

Classical

Europe

Orchestra, symphonies

Jazz

USA

Improvisation and swing

Blues

USA

Emotional storytelling

Rock

Global

Electric guitars and drums

Pop

Global

Commercial appeal

Hip Hop

USA

Rap and beat production

Country

USA

Acoustic storytelling

Reggae

Jamaica

Offbeat rhythms

Afrobeat

Africa

Percussion-driven grooves

K-Pop

South Korea

Choreographed pop

J-Pop

Japan

Synth-heavy melodies

EDM

Global

Electronic production

Flamenco

Spain

Guitar and dance

Indian Classical

India

Ragas and improvisation

Folk

Worldwide

Traditional cultural music

Raw Music Dataset with Intentional Errors

1.SAS Raw Dataset

data music_raw;

length Music_ID $8 Genre $20 Artist_Name $40 Region $10 Listener_Email $60

Revenue $8 Listener_Age 8 Release_Date $20 Stream_TS $30;

infile datalines dlm='|' dsd truncover;

input Music_ID $ Genre $ Artist_Name $ Region $ Listener_Email $

Revenue $ Listener_Age Release_Date $ Stream_TS $;

datalines;

M001|rock|queen|eu|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00

M001|ROCK| QUEEN |EUROPE|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00

M002|NULL|metallica|na|metallica.com|-500|34|2025-02-12|INVALID

M003|Jazz|Miles Davis|US|miles@gmail|1200|-10|2025-03-05|2025-03-05T09:00:00

M004|Pop| taylor swift |APAC|swift@gmail.com|3500|200||2025-03-20T08:30:00

M005|KPOP|BTS|KR|army@gmail.com|4500|22|2025-13-10|BADTIME

M006|HipHop|Drake|NA|drake@gmail.com|NULL|28|2025-04-12|2025-04-12T10:00:00

M007|AfroBeat| Burna Boy |africa|burna@gmail.com|3900|31|2025-05-01|2025-05-01T09:00:00

M008|EDM|Martin Garrix|EU|garrixgmail.com|2200|26|2025-06-01|2025-06-01T08:00:00

M009|Classical|Mozart|EU|mozart@gmail.com|1800|270|1770-01-01|1770TIME

M010|Folk|Unknown|APAC|unknown@|1300|45|2025-07-01|2025-07-01T10:00:00

M011|Reggae|Bob Marley|NA|bob@gmail.com|2600|36|1980-01-01|1980-01-01T10:00:00

M012|JPOP|Ado|JP|ado@gmail.com|2100|19|2025-08-01|2025-08-01T10:00:00

M013|Blues|BB King|US|bb@gmail.com|-300|55|2025-09-01|2025-09-01T10:00:00

M014|Opera|Pavarotti|EU|pav@gmail.com|4200|67|NULL|NULL

M015|Rock|Nirvana|EU|nirvana@gmail.com|3200|29|2025-10-01|2025-10-01T11:00:00

M016|EDM|NULL|EU|testgmail.com|2500|21|2025-11-01|2025-11-01T11:00:00

M017|Hip Hop| Eminem |NA|em@gmail.com|5000|52|2025-12-01|2025-12-01T11:00:00

M018|Classical|Beethoven|EU|beethoven@gmail.com|3400|248|1800-01-01|1800TIME

M019|Country|Luke Bryan|US|luke@gmail.com|2900|48|2025-01-10|2025-01-10T12:00:00

M020|Rock|Linkin Park|NA|lp@gmail.com|4100|39|2025-02-11|2025-02-11T11:00:00

;

run;

proc print data=music_raw;

run;

OUTPUT:

ObsMusic_IDGenreArtist_NameRegionListener_EmailRevenueListener_AgeRelease_DateStream_TS
1M001rockqueeneufan@gmail.com2000252025-01-012025-01-01T10:20:00
2M001ROCKQUEENEUROPEfan@gmail.com2000252025-01-012025-01-01T10:20:00
3M002NULLmetallicanametallica.com-500342025-02-12INVALID
4M003JazzMiles DavisUSmiles@gmail1200-102025-03-052025-03-05T09:00:00
5M004Poptaylor swiftAPACswift@gmail.com3500200 2025-03-20T08:30:00
6M005KPOPBTSKRarmy@gmail.com4500222025-13-10BADTIME
7M006HipHopDrakeNAdrake@gmail.comNULL282025-04-122025-04-12T10:00:00
8M007AfroBeatBurna Boyafricaburna@gmail.com3900312025-05-012025-05-01T09:00:00
9M008EDMMartin GarrixEUgarrixgmail.com2200262025-06-012025-06-01T08:00:00
10M009ClassicalMozartEUmozart@gmail.com18002701770-01-011770TIME
11M010FolkUnknownAPACunknown@1300452025-07-012025-07-01T10:00:00
12M011ReggaeBob MarleyNAbob@gmail.com2600361980-01-011980-01-01T10:00:00
13M012JPOPAdoJPado@gmail.com2100192025-08-012025-08-01T10:00:00
14M013BluesBB KingUSbb@gmail.com-300552025-09-012025-09-01T10:00:00
15M014OperaPavarottiEUpav@gmail.com420067NULLNULL
16M015RockNirvanaEUnirvana@gmail.com3200292025-10-012025-10-01T11:00:00
17M016EDMNULLEUtestgmail.com2500212025-11-012025-11-01T11:00:00
18M017Hip HopEminemNAem@gmail.com5000522025-12-012025-12-01T11:00:00
19M018ClassicalBeethovenEUbeethoven@gmail.com34002481800-01-011800TIME
20M019CountryLuke BryanUSluke@gmail.com2900482025-01-102025-01-10T12:00:00
21M020RockLinkin ParkNAlp@gmail.com4100392025-02-112025-02-11T11:00:00

Explanation and Key Points

This dataset intentionally mimics real production corruption. Duplicate IDs represent duplicate ingestion pipelines. Invalid timestamps represent API failures. Negative revenue values often emerge from refund systems. Impossible ages frequently occur after ETL conversion problems.

The LENGTH statement appears before assignments because SAS allocates memory during compilation. If omitted or placed later, character truncation can silently occur.

R does not behave this way because character vectors are dynamically allocated.

Character Truncation Risk

Bad SAS example:

data bad;

name='International Classical Music';

length name $10;

run;

proc print data=bad;

run;

LOG:

WARNING: Length of character variable name has already been set.
Use the LENGTH statement as the very first statement in the DATA
STEP to declare the length of a character variable.

OUTPUT:

Obsname
1Internatio

Correct approach:

data good;

length name $50;

name='International Classical Music';

run;

proc print data=good;

run;

OUTPUT:

Obsname
1International Classical Music

2.Enterprise Cleaning Workflow in SAS

data music_clean;

set music_raw;

retain Source_System "STREAMING_PLATFORM";

Genre=propcase(strip(Genre));

Artist_Name=propcase(compbl(strip(Artist_Name)));

Region=upcase(strip(Region));

Listener_Email=lowcase(strip(Listener_Email));

Revenue_Num = input(Revenue,best32.);

Revenue_Num=abs(Revenue_Num);

if Listener_Age<0 then Listener_Age=.;

if Listener_Age>120 then Listener_Age=.;

if find(Listener_Email,'@')=0 then

Listener_Email='invalid@email.com';

if Genre='Null' then Genre='Unknown';

Revenue=round(Revenue,0.01);

if missing(Release_Date) then

Release_Date='2025-01-01';

Music_Key=catx('-',Music_ID,Genre,Region);

Missing_Count=cmiss(of _character_)

            +nmiss(of _numeric_);

drop Revenue;

rename Revenue_Num=Revenue;

run;

proc print data=music_clean;

run;

OUTPUT:

ObsMusic_IDGenreArtist_NameRegionListener_EmailListener_AgeRelease_DateStream_TSSource_SystemRevenueMusic_KeyMissing_Count
1M001RockQueenEUfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EU1
2M001RockQueenEUROPEfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EUROPE1
3M002UnknownMetallicaNAinvalid@email.com342025-02-12INVALIDSTREAMING_PLATFORM500M002-Unknown-NA1
4M003JazzMiles DavisUSmiles@gmail.2025-03-052025-03-05T09:00:00STREAMING_PLATFORM1200M003-Jazz-US2
5M004PopTaylor SwiftAPACswift@gmail.com.2025-01-012025-03-20T08:30:00STREAMING_PLATFORM3500M004-Pop-APAC2
6M005KpopBtsKRarmy@gmail.com222025-13-10BADTIMESTREAMING_PLATFORM4500M005-Kpop-KR1
7M006HiphopDrakeNAdrake@gmail.com282025-04-122025-04-12T10:00:00STREAMING_PLATFORM.M006-Hiphop-NA2
8M007AfrobeatBurna BoyAFRICAburna@gmail.com312025-05-012025-05-01T09:00:00STREAMING_PLATFORM3900M007-Afrobeat-AFRICA1
9M008EdmMartin GarrixEUinvalid@email.com262025-06-012025-06-01T08:00:00STREAMING_PLATFORM2200M008-Edm-EU1
10M009ClassicalMozartEUmozart@gmail.com.1770-01-011770TIMESTREAMING_PLATFORM1800M009-Classical-EU2
11M010FolkUnknownAPACunknown@452025-07-012025-07-01T10:00:00STREAMING_PLATFORM1300M010-Folk-APAC1
12M011ReggaeBob MarleyNAbob@gmail.com361980-01-011980-01-01T10:00:00STREAMING_PLATFORM2600M011-Reggae-NA1
13M012JpopAdoJPado@gmail.com192025-08-012025-08-01T10:00:00STREAMING_PLATFORM2100M012-Jpop-JP1
14M013BluesBb KingUSbb@gmail.com552025-09-012025-09-01T10:00:00STREAMING_PLATFORM300M013-Blues-US1
15M014OperaPavarottiEUpav@gmail.com67NULLNULLSTREAMING_PLATFORM4200M014-Opera-EU1
16M015RockNirvanaEUnirvana@gmail.com292025-10-012025-10-01T11:00:00STREAMING_PLATFORM3200M015-Rock-EU1
17M016EdmNullEUinvalid@email.com212025-11-012025-11-01T11:00:00STREAMING_PLATFORM2500M016-Edm-EU1
18M017Hip HopEminemNAem@gmail.com522025-12-012025-12-01T11:00:00STREAMING_PLATFORM5000M017-Hip Hop-NA1
19M018ClassicalBeethovenEUbeethoven@gmail.com.1800-01-011800TIMESTREAMING_PLATFORM3400M018-Classical-EU2
20M019CountryLuke BryanUSluke@gmail.com482025-01-102025-01-10T12:00:00STREAMING_PLATFORM2900M019-Country-US1
21M020RockLinkin ParkNAlp@gmail.com392025-02-112025-02-11T11:00:00STREAMING_PLATFORM4100M020-Rock-NA1

Explanation and Key Points

This workflow uses several advanced production techniques.

  • RETAIN preserves metadata lineage.
  • COMPBL removes duplicate spaces.
  • PROPCASE standardizes capitalization.
  • ABS corrects financial reversals.
  • FIND validates emails.
  • CATX generates business keys.
  • CMISS and NMISS support completeness metrics.

In regulated environments, these transformations would be documented in Define.xml traceability metadata.

3.PROC SQL Deduplication

proc sql;

create table music_sql as

select distinct *

from music_clean;

quit;

proc print data=music_sql;

run;

OUTPUT:

ObsMusic_IDGenreArtist_NameRegionListener_EmailListener_AgeRelease_DateStream_TSSource_SystemRevenueMusic_KeyMissing_Count
1M001RockQueenEUfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EU1
2M001RockQueenEUROPEfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EUROPE1
3M002UnknownMetallicaNAinvalid@email.com342025-02-12INVALIDSTREAMING_PLATFORM500M002-Unknown-NA1
4M003JazzMiles DavisUSmiles@gmail.2025-03-052025-03-05T09:00:00STREAMING_PLATFORM1200M003-Jazz-US2
5M004PopTaylor SwiftAPACswift@gmail.com.2025-01-012025-03-20T08:30:00STREAMING_PLATFORM3500M004-Pop-APAC2
6M005KpopBtsKRarmy@gmail.com222025-13-10BADTIMESTREAMING_PLATFORM4500M005-Kpop-KR1
7M006HiphopDrakeNAdrake@gmail.com282025-04-122025-04-12T10:00:00STREAMING_PLATFORM.M006-Hiphop-NA2
8M007AfrobeatBurna BoyAFRICAburna@gmail.com312025-05-012025-05-01T09:00:00STREAMING_PLATFORM3900M007-Afrobeat-AFRICA1
9M008EdmMartin GarrixEUinvalid@email.com262025-06-012025-06-01T08:00:00STREAMING_PLATFORM2200M008-Edm-EU1
10M009ClassicalMozartEUmozart@gmail.com.1770-01-011770TIMESTREAMING_PLATFORM1800M009-Classical-EU2
11M010FolkUnknownAPACunknown@452025-07-012025-07-01T10:00:00STREAMING_PLATFORM1300M010-Folk-APAC1
12M011ReggaeBob MarleyNAbob@gmail.com361980-01-011980-01-01T10:00:00STREAMING_PLATFORM2600M011-Reggae-NA1
13M012JpopAdoJPado@gmail.com192025-08-012025-08-01T10:00:00STREAMING_PLATFORM2100M012-Jpop-JP1
14M013BluesBb KingUSbb@gmail.com552025-09-012025-09-01T10:00:00STREAMING_PLATFORM300M013-Blues-US1
15M014OperaPavarottiEUpav@gmail.com67NULLNULLSTREAMING_PLATFORM4200M014-Opera-EU1
16M015RockNirvanaEUnirvana@gmail.com292025-10-012025-10-01T11:00:00STREAMING_PLATFORM3200M015-Rock-EU1
17M016EdmNullEUinvalid@email.com212025-11-012025-11-01T11:00:00STREAMING_PLATFORM2500M016-Edm-EU1
18M017Hip HopEminemNAem@gmail.com522025-12-012025-12-01T11:00:00STREAMING_PLATFORM5000M017-Hip Hop-NA1
19M018ClassicalBeethovenEUbeethoven@gmail.com.1800-01-011800TIMESTREAMING_PLATFORM3400M018-Classical-EU2
20M019CountryLuke BryanUSluke@gmail.com482025-01-102025-01-10T12:00:00STREAMING_PLATFORM2900M019-Country-US1
21M020RockLinkin ParkNAlp@gmail.com392025-02-112025-02-11T11:00:00STREAMING_PLATFORM4100M020-Rock-NA1

Explanation

PROC SQL excels in relational operations and deduplication.

Advantages:

  • Simpler syntax
  • Better join readability
  • Database pushdown optimization

4.DATA Step Deduplication

proc sort data=music_clean out=music_nodup nodupkey;

by Music_ID;

run;

proc print data=music_nodup;

run;

OUTPUT:

ObsMusic_IDGenreArtist_NameRegionListener_EmailListener_AgeRelease_DateStream_TSSource_SystemRevenueMusic_KeyMissing_Count
1M001RockQueenEUfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EU1
2M002UnknownMetallicaNAinvalid@email.com342025-02-12INVALIDSTREAMING_PLATFORM500M002-Unknown-NA1
3M003JazzMiles DavisUSmiles@gmail.2025-03-052025-03-05T09:00:00STREAMING_PLATFORM1200M003-Jazz-US2
4M004PopTaylor SwiftAPACswift@gmail.com.2025-01-012025-03-20T08:30:00STREAMING_PLATFORM3500M004-Pop-APAC2
5M005KpopBtsKRarmy@gmail.com222025-13-10BADTIMESTREAMING_PLATFORM4500M005-Kpop-KR1
6M006HiphopDrakeNAdrake@gmail.com282025-04-122025-04-12T10:00:00STREAMING_PLATFORM.M006-Hiphop-NA2
7M007AfrobeatBurna BoyAFRICAburna@gmail.com312025-05-012025-05-01T09:00:00STREAMING_PLATFORM3900M007-Afrobeat-AFRICA1
8M008EdmMartin GarrixEUinvalid@email.com262025-06-012025-06-01T08:00:00STREAMING_PLATFORM2200M008-Edm-EU1
9M009ClassicalMozartEUmozart@gmail.com.1770-01-011770TIMESTREAMING_PLATFORM1800M009-Classical-EU2
10M010FolkUnknownAPACunknown@452025-07-012025-07-01T10:00:00STREAMING_PLATFORM1300M010-Folk-APAC1
11M011ReggaeBob MarleyNAbob@gmail.com361980-01-011980-01-01T10:00:00STREAMING_PLATFORM2600M011-Reggae-NA1
12M012JpopAdoJPado@gmail.com192025-08-012025-08-01T10:00:00STREAMING_PLATFORM2100M012-Jpop-JP1
13M013BluesBb KingUSbb@gmail.com552025-09-012025-09-01T10:00:00STREAMING_PLATFORM300M013-Blues-US1
14M014OperaPavarottiEUpav@gmail.com67NULLNULLSTREAMING_PLATFORM4200M014-Opera-EU1
15M015RockNirvanaEUnirvana@gmail.com292025-10-012025-10-01T11:00:00STREAMING_PLATFORM3200M015-Rock-EU1
16M016EdmNullEUinvalid@email.com212025-11-012025-11-01T11:00:00STREAMING_PLATFORM2500M016-Edm-EU1
17M017Hip HopEminemNAem@gmail.com522025-12-012025-12-01T11:00:00STREAMING_PLATFORM5000M017-Hip Hop-NA1
18M018ClassicalBeethovenEUbeethoven@gmail.com.1800-01-011800TIMESTREAMING_PLATFORM3400M018-Classical-EU2
19M019CountryLuke BryanUSluke@gmail.com482025-01-102025-01-10T12:00:00STREAMING_PLATFORM2900M019-Country-US1
20M020RockLinkin ParkNAlp@gmail.com392025-02-112025-02-11T11:00:00STREAMING_PLATFORM4100M020-Rock-NA1

Explanation

DATA Step processing is often faster on extremely large datasets because SAS can optimize sequential disk reads.

5.Advanced SAS Features

5.1 SELECT WHEN

data music_clean;

length Category $20.;

set music_clean;

select;

when (Genre='Rock') Category='Modern';

when (Genre='Classical') Category='Traditional';

otherwise Category='Other';

end;

run;

proc print data=music_clean;

run;

OUTPUT:

ObsCategoryMusic_IDGenreArtist_NameRegionListener_EmailListener_AgeRelease_DateStream_TSSource_SystemRevenueMusic_KeyMissing_Count
1ModernM001RockQueenEUfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EU1
2ModernM001RockQueenEUROPEfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EUROPE1
3OtherM002UnknownMetallicaNAinvalid@email.com342025-02-12INVALIDSTREAMING_PLATFORM500M002-Unknown-NA1
4OtherM003JazzMiles DavisUSmiles@gmail.2025-03-052025-03-05T09:00:00STREAMING_PLATFORM1200M003-Jazz-US2
5OtherM004PopTaylor SwiftAPACswift@gmail.com.2025-01-012025-03-20T08:30:00STREAMING_PLATFORM3500M004-Pop-APAC2
6OtherM005KpopBtsKRarmy@gmail.com222025-13-10BADTIMESTREAMING_PLATFORM4500M005-Kpop-KR1
7OtherM006HiphopDrakeNAdrake@gmail.com282025-04-122025-04-12T10:00:00STREAMING_PLATFORM.M006-Hiphop-NA2
8OtherM007AfrobeatBurna BoyAFRICAburna@gmail.com312025-05-012025-05-01T09:00:00STREAMING_PLATFORM3900M007-Afrobeat-AFRICA1
9OtherM008EdmMartin GarrixEUinvalid@email.com262025-06-012025-06-01T08:00:00STREAMING_PLATFORM2200M008-Edm-EU1
10TraditionalM009ClassicalMozartEUmozart@gmail.com.1770-01-011770TIMESTREAMING_PLATFORM1800M009-Classical-EU2
11OtherM010FolkUnknownAPACunknown@452025-07-012025-07-01T10:00:00STREAMING_PLATFORM1300M010-Folk-APAC1
12OtherM011ReggaeBob MarleyNAbob@gmail.com361980-01-011980-01-01T10:00:00STREAMING_PLATFORM2600M011-Reggae-NA1
13OtherM012JpopAdoJPado@gmail.com192025-08-012025-08-01T10:00:00STREAMING_PLATFORM2100M012-Jpop-JP1
14OtherM013BluesBb KingUSbb@gmail.com552025-09-012025-09-01T10:00:00STREAMING_PLATFORM300M013-Blues-US1
15OtherM014OperaPavarottiEUpav@gmail.com67NULLNULLSTREAMING_PLATFORM4200M014-Opera-EU1
16ModernM015RockNirvanaEUnirvana@gmail.com292025-10-012025-10-01T11:00:00STREAMING_PLATFORM3200M015-Rock-EU1
17OtherM016EdmNullEUinvalid@email.com212025-11-012025-11-01T11:00:00STREAMING_PLATFORM2500M016-Edm-EU1
18OtherM017Hip HopEminemNAem@gmail.com522025-12-012025-12-01T11:00:00STREAMING_PLATFORM5000M017-Hip Hop-NA1
19TraditionalM018ClassicalBeethovenEUbeethoven@gmail.com.1800-01-011800TIMESTREAMING_PLATFORM3400M018-Classical-EU2
20OtherM019CountryLuke BryanUSluke@gmail.com482025-01-102025-01-10T12:00:00STREAMING_PLATFORM2900M019-Country-US1
21ModernM020RockLinkin ParkNAlp@gmail.com392025-02-112025-02-11T11:00:00STREAMING_PLATFORM4100M020-Rock-NA1

5.2 ARRAY Processing

data music_clean;

set music_clean;

array chars(*) Genre Artist_Name Region;

do i=1 to dim(chars);

chars(i)=strip(chars(i));

end;

run;

proc print data=music_clean;

run;

OUTPUT:

ObsCategoryMusic_IDGenreArtist_NameRegionListener_EmailListener_AgeRelease_DateStream_TSSource_SystemRevenueMusic_KeyMissing_Counti
1ModernM001RockQueenEUfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EU14
2ModernM001RockQueenEUROPEfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EUROPE14
3OtherM002UnknownMetallicaNAinvalid@email.com342025-02-12INVALIDSTREAMING_PLATFORM500M002-Unknown-NA14
4OtherM003JazzMiles DavisUSmiles@gmail.2025-03-052025-03-05T09:00:00STREAMING_PLATFORM1200M003-Jazz-US24
5OtherM004PopTaylor SwiftAPACswift@gmail.com.2025-01-012025-03-20T08:30:00STREAMING_PLATFORM3500M004-Pop-APAC24
6OtherM005KpopBtsKRarmy@gmail.com222025-13-10BADTIMESTREAMING_PLATFORM4500M005-Kpop-KR14
7OtherM006HiphopDrakeNAdrake@gmail.com282025-04-122025-04-12T10:00:00STREAMING_PLATFORM.M006-Hiphop-NA24
8OtherM007AfrobeatBurna BoyAFRICAburna@gmail.com312025-05-012025-05-01T09:00:00STREAMING_PLATFORM3900M007-Afrobeat-AFRICA14
9OtherM008EdmMartin GarrixEUinvalid@email.com262025-06-012025-06-01T08:00:00STREAMING_PLATFORM2200M008-Edm-EU14
10TraditionalM009ClassicalMozartEUmozart@gmail.com.1770-01-011770TIMESTREAMING_PLATFORM1800M009-Classical-EU24
11OtherM010FolkUnknownAPACunknown@452025-07-012025-07-01T10:00:00STREAMING_PLATFORM1300M010-Folk-APAC14
12OtherM011ReggaeBob MarleyNAbob@gmail.com361980-01-011980-01-01T10:00:00STREAMING_PLATFORM2600M011-Reggae-NA14
13OtherM012JpopAdoJPado@gmail.com192025-08-012025-08-01T10:00:00STREAMING_PLATFORM2100M012-Jpop-JP14
14OtherM013BluesBb KingUSbb@gmail.com552025-09-012025-09-01T10:00:00STREAMING_PLATFORM300M013-Blues-US14
15OtherM014OperaPavarottiEUpav@gmail.com67NULLNULLSTREAMING_PLATFORM4200M014-Opera-EU14
16ModernM015RockNirvanaEUnirvana@gmail.com292025-10-012025-10-01T11:00:00STREAMING_PLATFORM3200M015-Rock-EU14
17OtherM016EdmNullEUinvalid@email.com212025-11-012025-11-01T11:00:00STREAMING_PLATFORM2500M016-Edm-EU14
18OtherM017Hip HopEminemNAem@gmail.com522025-12-012025-12-01T11:00:00STREAMING_PLATFORM5000M017-Hip Hop-NA14
19TraditionalM018ClassicalBeethovenEUbeethoven@gmail.com.1800-01-011800TIMESTREAMING_PLATFORM3400M018-Classical-EU24
20OtherM019CountryLuke BryanUSluke@gmail.com482025-01-102025-01-10T12:00:00STREAMING_PLATFORM2900M019-Country-US14
21ModernM020RockLinkin ParkNAlp@gmail.com392025-02-112025-02-11T11:00:00STREAMING_PLATFORM4100M020-Rock-NA14

5.3 FIRST./LAST. Logic

data music_clean;

set music_clean;

by Music_ID;

if first.Music_ID then Count=0;

Count+1;

if last.Music_ID;

run;

proc print data=music_clean;

run;

OUTPUT:

ObsCategoryMusic_IDGenreArtist_NameRegionListener_EmailListener_AgeRelease_DateStream_TSSource_SystemRevenueMusic_KeyMissing_CountiCount
1ModernM001RockQueenEUROPEfan@gmail.com252025-01-012025-01-01T10:20:00STREAMING_PLATFORM2000M001-Rock-EUROPE142
2OtherM002UnknownMetallicaNAinvalid@email.com342025-02-12INVALIDSTREAMING_PLATFORM500M002-Unknown-NA141
3OtherM003JazzMiles DavisUSmiles@gmail.2025-03-052025-03-05T09:00:00STREAMING_PLATFORM1200M003-Jazz-US241
4OtherM004PopTaylor SwiftAPACswift@gmail.com.2025-01-012025-03-20T08:30:00STREAMING_PLATFORM3500M004-Pop-APAC241
5OtherM005KpopBtsKRarmy@gmail.com222025-13-10BADTIMESTREAMING_PLATFORM4500M005-Kpop-KR141
6OtherM006HiphopDrakeNAdrake@gmail.com282025-04-122025-04-12T10:00:00STREAMING_PLATFORM.M006-Hiphop-NA241
7OtherM007AfrobeatBurna BoyAFRICAburna@gmail.com312025-05-012025-05-01T09:00:00STREAMING_PLATFORM3900M007-Afrobeat-AFRICA141
8OtherM008EdmMartin GarrixEUinvalid@email.com262025-06-012025-06-01T08:00:00STREAMING_PLATFORM2200M008-Edm-EU141
9TraditionalM009ClassicalMozartEUmozart@gmail.com.1770-01-011770TIMESTREAMING_PLATFORM1800M009-Classical-EU241
10OtherM010FolkUnknownAPACunknown@452025-07-012025-07-01T10:00:00STREAMING_PLATFORM1300M010-Folk-APAC141
11OtherM011ReggaeBob MarleyNAbob@gmail.com361980-01-011980-01-01T10:00:00STREAMING_PLATFORM2600M011-Reggae-NA141
12OtherM012JpopAdoJPado@gmail.com192025-08-012025-08-01T10:00:00STREAMING_PLATFORM2100M012-Jpop-JP141
13OtherM013BluesBb KingUSbb@gmail.com552025-09-012025-09-01T10:00:00STREAMING_PLATFORM300M013-Blues-US141
14OtherM014OperaPavarottiEUpav@gmail.com67NULLNULLSTREAMING_PLATFORM4200M014-Opera-EU141
15ModernM015RockNirvanaEUnirvana@gmail.com292025-10-012025-10-01T11:00:00STREAMING_PLATFORM3200M015-Rock-EU141
16OtherM016EdmNullEUinvalid@email.com212025-11-012025-11-01T11:00:00STREAMING_PLATFORM2500M016-Edm-EU141
17OtherM017Hip HopEminemNAem@gmail.com522025-12-012025-12-01T11:00:00STREAMING_PLATFORM5000M017-Hip Hop-NA141
18TraditionalM018ClassicalBeethovenEUbeethoven@gmail.com.1800-01-011800TIMESTREAMING_PLATFORM3400M018-Classical-EU241
19OtherM019CountryLuke BryanUSluke@gmail.com482025-01-102025-01-10T12:00:00STREAMING_PLATFORM2900M019-Country-US141
20ModernM020RockLinkin ParkNAlp@gmail.com392025-02-112025-02-11T11:00:00STREAMING_PLATFORM4100M020-Rock-NA141


6.R Raw Dataset

library(tidyverse)

music_raw <- read_delim(

  "

Music_ID|Genre|Artist_Name|Region|Listener_Email|Revenue|Listener_Age|Release_Date|Stream_TS

M001|rock|queen|eu|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00

M001|ROCK| QUEEN |EUROPE|fan@gmail.com|2000|25|2025-01-01|2025-01-01T10:20:00

M002|NULL|metallica|na|metallica.com|-500|34|2025-02-12|INVALID

M003|Jazz|Miles Davis|US|miles@gmail|1200|-10|2025-03-05|2025-03-05T09:00:00

M004|Pop| taylor swift |APAC|swift@gmail.com|3500|200||2025-03-20T08:30:00

M005|KPOP|BTS|KR|army@gmail.com|4500|22|2025-13-10|BADTIME

M006|HipHop|Drake|NA|drake@gmail.com|NULL|28|2025-04-12|2025-04-12T10:00:00

M007|AfroBeat| Burna Boy |africa|burna@gmail.com|3900|31|2025-05-01|2025-05-01T09:00:00

M008|EDM|Martin Garrix|EU|garrixgmail.com|2200|26|2025-06-01|2025-06-01T08:00:00

M009|Classical|Mozart|EU|mozart@gmail.com|1800|270|1770-01-01|1770TIME

M010|Folk|Unknown|APAC|unknown@|1300|45|2025-07-01|2025-07-01T10:00:00

M011|Reggae|Bob Marley|NA|bob@gmail.com|2600|36|1980-01-01|1980-01-01T10:00:00

M012|JPOP|Ado|JP|ado@gmail.com|2100|19|2025-08-01|2025-08-01T10:00:00

M013|Blues|BB King|US|bb@gmail.com|-300|55|2025-09-01|2025-09-01T10:00:00

M014|Opera|Pavarotti|EU|pav@gmail.com|4200|67|NULL|NULL

M015|Rock|Nirvana|EU|nirvana@gmail.com|3200|29|2025-10-01|2025-10-01T11:00:00

M016|EDM|NULL|EU|testgmail.com|2500|21|2025-11-01|2025-11-01T11:00:00

M017|Hip Hop| Eminem |NA|em@gmail.com|5000|52|2025-12-01|2025-12-01T11:00:00

M018|Classical|Beethoven|EU|beethoven@gmail.com|3400|248|1800-01-01|1800TIME

M019|Country|Luke Bryan|US|luke@gmail.com|2900|48|2025-01-10|2025-01-10T12:00:00

M020|Rock|Linkin Park|NA|lp@gmail.com|4100|39|2025-02-11|2025-02-11T11:00:00

",

  delim = "|",

  show_col_types = FALSE,

  trim_ws = FALSE,

  col_types = cols(

    Music_ID = col_character(),

    Genre = col_character(),

    Artist_Name = col_character(),

    Region = col_character(),

    Listener_Email = col_character(),

    Revenue = col_character(),

    Listener_Age = col_double(),

    Release_Date = col_character(),

    Stream_TS = col_character()

  )

)

OUTPUT:

Music_ID

Genre

Artist_Name

Region

Listener_Email

Revenue

Listener_Age

Release_Date

Stream_TS

M001

rock

queen

eu

fan@gmail.com

2000

25

2025-01-01

2025-01-01T10:20:00

M001

ROCK

 QUEEN

EUROPE

fan@gmail.com

2000

25

2025-01-01

2025-01-01T10:20:00

M002

NULL

metallica

na

metallica.com

-500

34

2025-02-12

INVALID

M003

Jazz

Miles Davis

US

miles@gmail

1200

-10

2025-03-05

2025-03-05T09:00:00

M004

Pop

 taylor swift

APAC

swift@gmail.com

3500

200

2025-03-20T08:30:00

M005

KPOP

BTS

KR

army@gmail.com

4500

22

2025-13-10

BADTIME

M006

HipHop

Drake

drake@gmail.com

NULL

28

2025-04-12

2025-04-12T10:00:00

M007

AfroBeat

 Burna Boy

africa

burna@gmail.com

3900

31

2025-05-01

2025-05-01T09:00:00

M008

EDM

Martin Garrix

EU

garrixgmail.com

2200

26

2025-06-01

2025-06-01T08:00:00

M009

Classical

Mozart

EU

mozart@gmail.com

1800

270

1770-01-01

1770TIME

M010

Folk

Unknown

APAC

unknown@

1300

45

2025-07-01

2025-07-01T10:00:00

M011

Reggae

Bob Marley

bob@gmail.com

2600

36

1980-01-01

1980-01-01T10:00:00

M012

JPOP

Ado

JP

ado@gmail.com

2100

19

2025-08-01

2025-08-01T10:00:00

M013

Blues

BB King

US

bb@gmail.com

-300

55

2025-09-01

2025-09-01T10:00:00

M014

Opera

Pavarotti

EU

pav@gmail.com

4200

67

NULL

NULL

M015

Rock

Nirvana

EU

nirvana@gmail.com

3200

29

2025-10-01

2025-10-01T11:00:00

M016

EDM

NULL

EU

testgmail.com

2500

21

2025-11-01

2025-11-01T11:00:00

M017

Hip Hop

 Eminem

em@gmail.com

5000

52

2025-12-01

2025-12-01T11:00:00

M018

Classical

Beethoven

EU

beethoven@gmail.com

3400

248

1800-01-01

1800TIME

M019

Country

Luke Bryan

US

luke@gmail.com

2900

48

2025-01-10

2025-01-10T12:00:00

M020

Rock

Linkin Park

lp@gmail.com

4100

39

2025-02-11

2025-02-11T11:00:00

7.Equivalent R Cleaning Layer

music_clean <- music_raw %>%

  janitor::clean_names() %>%

  mutate(

    revenue = na_if(revenue, "NULL"),

    revenue = as.numeric(revenue),

    genre = str_to_title(str_trim(genre)),

    artist_name = str_to_title(str_trim(artist_name)),

    region = str_to_upper(str_trim(region)),

    revenue = abs(revenue),

    listener_age = if_else(

      listener_age < 0 | listener_age > 120,NA_real_,

      listener_age),

    listener_email = if_else(grepl("@", listener_email),

      listener_email,"invalid@email.com"),

    genre = na_if(genre, "NULL"),

    genre = coalesce(genre, "Unknown")

  ) %>%

  distinct(music_id, .keep_all = TRUE)

OUTPUT:

music_id

genre

artist_name

region

listener_email

revenue

listener_age

release_date

stream_ts

M001

Rock

Queen

EU

fan@gmail.com

2000

25

2025-01-01

2025-01-01T10:20:00

M002

Null

Metallica

NA

invalid@email.com

500

34

2025-02-12

INVALID

M003

Jazz

Miles Davis

US

miles@gmail

1200

2025-03-05

2025-03-05T09:00:00

M004

Pop

Taylor Swift

APAC

swift@gmail.com

3500

2025-03-20T08:30:00

M005

Kpop

Bts

KR

army@gmail.com

4500

22

2025-13-10

BADTIME

M006

Hiphop

Drake

drake@gmail.com

28

2025-04-12

2025-04-12T10:00:00

M007

Afrobeat

Burna Boy

AFRICA

burna@gmail.com

3900

31

2025-05-01

2025-05-01T09:00:00

M008

Edm

Martin Garrix

EU

invalid@email.com

2200

26

2025-06-01

2025-06-01T08:00:00

M009

Classical

Mozart

EU

mozart@gmail.com

1800

1770-01-01

1770TIME

M010

Folk

Unknown

APAC

unknown@

1300

45

2025-07-01

2025-07-01T10:00:00

M011

Reggae

Bob Marley

bob@gmail.com

2600

36

1980-01-01

1980-01-01T10:00:00

M012

Jpop

Ado

JP

ado@gmail.com

2100

19

2025-08-01

2025-08-01T10:00:00

M013

Blues

Bb King

US

bb@gmail.com

300

55

2025-09-01

2025-09-01T10:00:00

M014

Opera

Pavarotti

EU

pav@gmail.com

4200

67

NULL

NULL

M015

Rock

Nirvana

EU

nirvana@gmail.com

3200

29

2025-10-01

2025-10-01T11:00:00

M016

Edm

Null

EU

invalid@email.com

2500

21

2025-11-01

2025-11-01T11:00:00

M017

Hip Hop

Eminem

em@gmail.com

5000

52

2025-12-01

2025-12-01T11:00:00

M018

Classical

Beethoven

EU

beethoven@gmail.com

3400

1800-01-01

1800TIME

M019

Country

Luke Bryan

US

luke@gmail.com

2900

48

2025-01-10

2025-01-10T12:00:00

M020

Rock

Linkin Park

lp@gmail.com

4100

39

2025-02-11

2025-02-11T11:00:00


Explanation and Key Points

R's mutate() resembles SAS assignment statements.

SAS

R

IF THEN ELSE

if_else()

SELECT WHEN

case_when()

STRIP

str_trim()

UPCASE

str_to_upper()

LOWCASE

str_to_lower()

CATX

unite()

COALESCEC

coalesce()

Enterprise Validation and Compliance

Clinical programming environments demand:

  • SDTM traceability
  • ADaM derivation transparency
  • Independent QC programming
  • Audit trail preservation
  • Metadata governance

A dangerous SAS behavior is:

if age < 18 then flag='Y';

Missing values satisfy this condition because:

. < 18 = TRUE

This has caused real regulatory findings.

20 Data Cleaning Best Practices

  1. Standardize metadata.
  2. Validate all primary keys.
  3. Detect duplicates early.
  4. Separate business rules from code.
  5. Use macros for reuse.
  6. Preserve audit trails.
  7. Use version control.
  8. Perform independent QC.
  9. Validate date ranges.
  10. Normalize categories.
  11. Avoid hardcoding values.
  12. Document derivations.
  13. Track lineage.
  14. Use parameterized pipelines.
  15. Validate email formats.
  16. Standardize timestamps.
  17. Test edge cases.
  18. Use defensive programming.
  19. Monitor production drift.
  20. Automate reconciliation.

Business Logic Behind Cleaning

Missing values are rarely random.

A missing patient age can invalidate subgroup analyses. A negative salary can distort compensation models. Missing visit dates can affect survival analysis.

Standardization improves reproducibility.

For example:

rock

ROCK

Rock

become:

Rock

This prevents false category inflation.

20 One-Line Insights

  • Dirty data creates expensive mistakes.
  • Validation logic beats visual inspection.
  • Missing values hide silently.
  • Standardization improves reproducibility.
  • Duplicate records distort trends.
  • Audit trails build trust.
  • Metadata is business documentation.
  • AI depends on clean inputs.
  • Dates drive analytics.
  • Categories require governance.
  • Automation reduces risk.
  • Defensive programming saves projects.
  • Traceability supports compliance.
  • Macros improve consistency.
  • PROC SQL improves readability.
  • DATA Step improves control.
  • R accelerates exploration.
  • SAS excels at governance.
  • Validation is continuous.
  • Data quality is everyone's responsibility.

SAS vs R Comparison

Capability

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Excellent

Limited

Flexibility

Moderate

Excellent

Visualization

Moderate

Excellent

Scalability

Excellent

Good

Metadata Control

Excellent

Moderate

Reproducibility

Excellent

Excellent

Conclusion

Modern analytical ecosystems are only as reliable as the quality of their underlying data. Whether analyzing global music genres, clinical trial records, banking transactions, or insurance claims, poor-quality data creates expensive business failures.

SAS provides enterprise-grade governance, auditability, traceability, and regulatory compliance. R provides unmatched flexibility and exploratory power.

Together they form one of the strongest production analytics combinations available today.

The future of analytics belongs not to organizations with the largest data lakes, but to those with the cleanest, most trusted, and most reproducible datasets.

In production analytics, data cleaning is not preparation work.

It is the foundation upon which every dashboard, every AI model, every regulatory submission, and every executive decision ultimately depends.

Interview Questions

1. How would you remove duplicate records?

Use:

  • PROC SORT NODUPKEY
  • PROC SQL DISTINCT
  • dplyr::distinct()

2. Why is LENGTH placement important?

SAS determines storage during compile time. Incorrect placement causes silent truncation.

3. How do missing numeric values behave in SAS?

Missing numeric values are smaller than all valid numbers.

4. When should PROC SQL be preferred?

For joins, aggregations, and relational transformations.

5. When should DATA Step be preferred?

For row-wise transformations and complex business logic.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MUSIC DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS