Transforming Corrupted Music Intelligence into Trusted Insights Using Sas And R

Corrupted Global Famous Songs Data into Enterprise-Grade Analytical Intelligence Using SAS (PROC SQL vs DATA Step) and R

Introduction

The modern analytics world runs on data integrity. Whether you work in clinical trials, banking, retail, insurance, or entertainment analytics, one harsh reality remains constant: dirty data destroys trust. A single malformed date, duplicated ID, corrupted category label, or inconsistent text value can silently contaminate dashboards, AI predictions, executive reporting, and regulatory submissions.

Imagine a global music streaming company analyzing “famous songs worldwide” to identify trending genres and audience behavior. Suddenly, executive dashboards begin showing incorrect revenue projections because duplicate Song IDs inflated streaming counts. Some records contain negative royalties. Others have invalid release dates like 32/15/2025. Genre labels vary between "POP", "pop", " Pop ", and "P0P". Regional codes are inconsistent. Artist emails are malformed. Missing listener counts distort machine learning models.

This resembles real-world clinical trial disasters where incorrect patient enrollment dates, duplicated subjects, or invalid adverse-event records can lead to rejected regulatory submissions. In enterprise environments, poor-quality data is not just a technical issue it becomes a compliance, financial, and reputational risk.

Business Crisis Scenario

A multinational streaming platform launched an AI recommendation engine trained on global famous-song data. Within weeks:

  • Duplicate Song IDs inflated royalty payments
  • Invalid timestamps corrupted streaming trends
  • Negative revenue values impacted forecasting
  • Mixed genre labels reduced recommendation accuracy
  • Missing release dates distorted monthly analytics
  • Incorrect listener counts damaged executive dashboards
  • Corrupted emails prevented artist communication
  • Inconsistent regional codes broke territory reporting

The same problems occur in clinical trial systems:

  • Incorrect patient ages
  • Missing visit dates
  • Invalid adverse-event categories
  • Duplicate subject IDs
  • Missing treatment assignments

In regulated environments like clinical SAS programming, such issues can lead to FDA rejection, failed QC validation, and delayed submissions.

Raw Corrupted Dataset (SAS)

Dataset Structure – 9 Variables

Variables:

  1. SONG_ID
  2. SONG_NAME
  3. ARTIST_NAME
  4. GENRE
  5. RELEASE_DATE
  6. STREAM_COUNT
  7. ROYALTY_AMOUNT
  8. REGION_CODE
  9. ARTIST_EMAIL

1.SAS Raw Dataset Creation

DATA songs_raw;

LENGTH SONG_ID $12 SONG_NAME $40 ARTIST_NAME $30 GENRE $20

RELEASE_DATE $20 REGION_CODE $10 ARTIST_EMAIL $50;

INFILE DATALINES DLM='|' MISSOVER;

INPUT SONG_ID $ SONG_NAME $ ARTIST_NAME $ GENRE $ RELEASE_DATE $

STREAM_COUNT ROYALTY_AMOUNT REGION_CODE $ ARTIST_EMAIL $;

DATALINES;

S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|us|ed@gmail.com

S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|US|ed@gmail.com

S002| BlindingLights |TheWeeknd| pop |2024/02/10|700000|-2500|usa|weekndgmail.com

S003|NULL|Adele|Soul|2024-13-01|450000|12000|UK|adele@gmail

S004|Believer|ImagineDragons|ROCK|2024-03-12|.|18000|EU|imagine@music.com

S005|Perfect|EdSheeran|P0P|2024-04-22|-90000|22000|US|ed@@gmail.com

S006|Halo|Beyonce|Pop| |850000|27000|Us|beyonce@gmail.com

S007|Despacito|LuisFonsi|Latin|2024-05-10|920000|30000|LATAM|luis@gmail.com

S008|Thunder|ImagineDragons|rock|2024-05-40|640000|17000|EU|wrongmail

S009|SomeoneLikeYou|Adele|SOUL|2024-06-01|730000|-1000|UK|adele@gmail.com

S010|Closer|Chainsmokers|EDM|2024-07-11|810000|25000|APAC|chain@gmail.com

S011|Faded|AlanWalker|edm|2024-08-05|NULL|19000|AP|alan@gmail.com

S012|Levitating|DuaLipa|POP|2024-09-15|620000|21000|USA|dua@gmail.com

S013|BadGuy|BillieEilish|Alternative|2024-10-10|430000|16000|US|billiegmail.com

S014|Senorita|ShawnMendes|Pop|invalid_date|780000|24000|US|shawn@gmail.com

S015|CountingStars|OneRepublic|ROCK|2024-11-21|560000|20000|EUROPE|one@gmail.com

S016|Stay|JustinBieber|Pop|2024-12-05|910000|29000|US|justin@gmail.com

S017|RollingInTheDeep|Adele|Soul|2024-09-18|610000|21000|UK|adele@gmail.com

S018|CheapThrills|Sia|Pop|2024-03-09|580000|18500|US|sia@gmail.com

S019|Radioactive|ImagineDragons|Rock|2024-04-14|620000|19500|EU|imagine@gmail.com

S020|HeatWaves|GlassAnimals|Indie|2024-08-25|730000|22500|APAC|glass@gmail.com

;

RUN;

PROC PRINT DATA=songs_raw;

RUN;

OUTPUT:

ObsSONG_IDSONG_NAMEARTIST_NAMEGENRERELEASE_DATEREGION_CODEARTIST_EMAILSTREAM_COUNTROYALTY_AMOUNT
1S001ShapeOfYouEdSheeranPOP2024-01-15used@gmail.com50000015000
2S001ShapeOfYouEdSheeranPOP2024-01-15USed@gmail.com50000015000
3S002BlindingLightsTheWeekndpop2024/02/10usaweekndgmail.com700000-2500
4S003NULLAdeleSoul2024-13-01UKadele@gmail45000012000
5S004BelieverImagineDragonsROCK2024-03-12EUimagine@music.com.18000
6S005PerfectEdSheeranP0P2024-04-22USed@@gmail.com-9000022000
7S006HaloBeyoncePop Usbeyonce@gmail.com85000027000
8S007DespacitoLuisFonsiLatin2024-05-10LATAMluis@gmail.com92000030000
9S008ThunderImagineDragonsrock2024-05-40EUwrongmail64000017000
10S009SomeoneLikeYouAdeleSOUL2024-06-01UKadele@gmail.com730000-1000
11S010CloserChainsmokersEDM2024-07-11APACchain@gmail.com81000025000
12S011FadedAlanWalkeredm2024-08-05APalan@gmail.com.19000
13S012LevitatingDuaLipaPOP2024-09-15USAdua@gmail.com62000021000
14S013BadGuyBillieEilishAlternative2024-10-10USbilliegmail.com43000016000
15S014SenoritaShawnMendesPopinvalid_dateUSshawn@gmail.com78000024000
16S015CountingStarsOneRepublicROCK2024-11-21EUROPEone@gmail.com56000020000
17S016StayJustinBieberPop2024-12-05USjustin@gmail.com91000029000
18S017RollingInTheDeepAdeleSoul2024-09-18UKadele@gmail.com61000021000
19S018CheapThrillsSiaPop2024-03-09USsia@gmail.com58000018500
20S019RadioactiveImagineDragonsRock2024-04-14EUimagine@gmail.com62000019500
21S020HeatWavesGlassAnimalsIndie2024-08-25APACglass@gmail.com73000022500

Explanation and Key Points

This raw dataset intentionally contains enterprise-grade corruption patterns commonly seen in real production systems. Duplicate SONG_ID values simulate duplicate patient records in clinical trials. Invalid dates mimic incorrect visit dates. Negative royalty amounts resemble negative billing values in banking systems. Mixed uppercase/lowercase text demonstrates metadata inconsistency. Malformed emails simulate communication failures.

The LENGTH statement appears BEFORE assignment logic because SAS allocates character memory during compilation. If LENGTH is declared later, SAS may permanently truncate values. For example:

artist_name = "ImagineDragons";

length artist_name $10;

This causes truncation because memory was already allocated. In R, character vectors dynamically resize, avoiding this specific risk.

The MISSOVER option prevents SAS from shifting values into incorrect columns when data is missing.

2.SAS Data Cleaning Workflow

DATA songs_clean;

SET songs_raw;

SONG_NAME      = PROPCASE(STRIP(SONG_NAME));

ARTIST_NAME    = PROPCASE(STRIP(ARTIST_NAME));

GENRE          = UPCASE(COMPRESS(GENRE));

REGION_CODE    = UPCASE(STRIP(REGION_CODE));

GENRE          = TRANWRD(GENRE,'P0P','POP');

IF SONG_NAME='NULL' THEN SONG_NAME='Unknown';

STREAM_COUNT = ABS(STREAM_COUNT);

ROYALTY_AMOUNT = ROUND(ABS(ROYALTY_AMOUNT),0.01);

RELEASE_DT = INPUT(RELEASE_DATE,ANYDTDTE15.);

FORMAT RELEASE_DT DATE9.;

IF MISSING(RELEASE_DT) THEN RELEASE_DT = TODAY();

IF FIND(ARTIST_EMAIL,'@')=0 THEN

ARTIST_EMAIL='invalid@email.com';

SELECT (REGION_CODE);

        WHEN ('USA') REGION_CODE='US';

        WHEN ('AP') REGION_CODE='APAC';

        WHEN ('EUROPE') REGION_CODE='EU';

        OTHERWISE;

END;

SONG_KEY = CATX('_',SONG_ID,REGION_CODE);

DROP RELEASE_DATE;

RENAME RELEASE_DT=RELEASE_DATE;

RUN;

PROC PRINT DATA=songs_clean;

RUN;

OUTPUT:

ObsSONG_IDSONG_NAMEARTIST_NAMEGENREREGION_CODEARTIST_EMAILSTREAM_COUNTROYALTY_AMOUNTRELEASE_DATESONG_KEY
1S001ShapeofyouEdsheeranPOPUSed@gmail.com5000001500015JAN2024S001_US
2S001ShapeofyouEdsheeranPOPUSed@gmail.com5000001500015JAN2024S001_US
3S002BlindinglightsTheweekndPOPUSinvalid@email.com700000250010FEB2024S002_US
4S003NullAdeleSOULUKadele@gmail4500001200013JUN2026S003_UK
5S004BelieverImaginedragonsROCKEUimagine@music.com.1800012MAR2024S004_EU
6S005PerfectEdsheeranPOPUSed@@gmail.com900002200022APR2024S005_US
7S006HaloBeyoncePOPUSbeyonce@gmail.com8500002700013JUN2026S006_US
8S007DespacitoLuisfonsiLATINLATAMluis@gmail.com9200003000010MAY2024S007_LATAM
9S008ThunderImaginedragonsROCKEUinvalid@email.com6400001700013JUN2026S008_EU
10S009SomeonelikeyouAdeleSOULUKadele@gmail.com730000100001JUN2024S009_UK
11S010CloserChainsmokersEDMAPACchain@gmail.com8100002500011JUL2024S010_APAC
12S011FadedAlanwalkerEDMAPACalan@gmail.com.1900005AUG2024S011_APAC
13S012LevitatingDualipaPOPUSdua@gmail.com6200002100015SEP2024S012_US
14S013BadguyBillieeilishALTERNATIVEUSinvalid@email.com4300001600010OCT2024S013_US
15S014SenoritaShawnmendesPOPUSshawn@gmail.com7800002400013JUN2026S014_US
16S015CountingstarsOnerepublicROCKEUone@gmail.com5600002000021NOV2024S015_EU
17S016StayJustinbieberPOPUSjustin@gmail.com9100002900005DEC2024S016_US
18S017RollinginthedeepAdeleSOULUKadele@gmail.com6100002100018SEP2024S017_UK
19S018CheapthrillsSiaPOPUSsia@gmail.com5800001850009MAR2024S018_US
20S019RadioactiveImaginedragonsROCKEUimagine@gmail.com6200001950014APR2024S019_EU
21S020HeatwavesGlassanimalsINDIEAPACglass@gmail.com7300002250025AUG2024S020_APAC

Explanation and Key Points

This workflow demonstrates enterprise-level SAS cleaning logic. PROPCASE() standardizes text capitalization. COMPRESS() removes hidden spaces. TRANWRD() repairs corrupted category labels. ABS() prevents negative analytical measures.

INPUT() converts character dates into true SAS numeric dates. Missing dates are imputed using TODAY()a common healthcare approach when partial dates exist.

SELECT-WHEN improves readability compared to multiple IF-THEN blocks. CATX() builds composite business keys for traceability.

Clinical SAS programmers frequently use this methodology when preparing SDTM and ADaM datasets for regulatory submissions.

3.Removing Duplicate Records

PROC SORT DATA=songs_clean OUT=songs_nodup NODUPKEY;

BY SONG_ID;

RUN;

PROC PRINT DATA=songs_nodup;

RUN;

LOG:

NOTE: There were 21 observations read from the data set WORK.SONGS_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.SONGS_NODUP has 20 observations and 10 variables.

OUTPUT:

ObsSONG_IDSONG_NAMEARTIST_NAMEGENREREGION_CODEARTIST_EMAILSTREAM_COUNTROYALTY_AMOUNTRELEASE_DATESONG_KEY
1S001ShapeofyouEdsheeranPOPUSed@gmail.com5000001500015JAN2024S001_US
2S002BlindinglightsTheweekndPOPUSinvalid@email.com700000250010FEB2024S002_US
3S003NullAdeleSOULUKadele@gmail4500001200013JUN2026S003_UK
4S004BelieverImaginedragonsROCKEUimagine@music.com.1800012MAR2024S004_EU
5S005PerfectEdsheeranPOPUSed@@gmail.com900002200022APR2024S005_US
6S006HaloBeyoncePOPUSbeyonce@gmail.com8500002700013JUN2026S006_US
7S007DespacitoLuisfonsiLATINLATAMluis@gmail.com9200003000010MAY2024S007_LATAM
8S008ThunderImaginedragonsROCKEUinvalid@email.com6400001700013JUN2026S008_EU
9S009SomeonelikeyouAdeleSOULUKadele@gmail.com730000100001JUN2024S009_UK
10S010CloserChainsmokersEDMAPACchain@gmail.com8100002500011JUL2024S010_APAC
11S011FadedAlanwalkerEDMAPACalan@gmail.com.1900005AUG2024S011_APAC
12S012LevitatingDualipaPOPUSdua@gmail.com6200002100015SEP2024S012_US
13S013BadguyBillieeilishALTERNATIVEUSinvalid@email.com4300001600010OCT2024S013_US
14S014SenoritaShawnmendesPOPUSshawn@gmail.com7800002400013JUN2026S014_US
15S015CountingstarsOnerepublicROCKEUone@gmail.com5600002000021NOV2024S015_EU
16S016StayJustinbieberPOPUSjustin@gmail.com9100002900005DEC2024S016_US
17S017RollinginthedeepAdeleSOULUKadele@gmail.com6100002100018SEP2024S017_UK
18S018CheapthrillsSiaPOPUSsia@gmail.com5800001850009MAR2024S018_US
19S019RadioactiveImaginedragonsROCKEUimagine@gmail.com6200001950014APR2024S019_EU
20S020HeatwavesGlassanimalsINDIEAPACglass@gmail.com7300002250025AUG2024S020_APAC

Explanation and Key Points

PROC SORT NODUPKEY removes duplicate keys efficiently. In clinical trials, duplicate USUBJID values can invalidate analysis populations. Deduplication must always be documented for audit trails.

PROC SQL vs DATA Step

4.PROC SQL Join

PROC SQL;

CREATE TABLE royalty_summary AS

SELECT REGION_CODE,GENRE,

    COUNT(*) AS TOTAL_SONGS,

    SUM(ROYALTY_AMOUNT) AS TOTAL_ROYALTY

FROM songs_nodup

GROUP BY REGION_CODE,GENRE;

QUIT;

PROC PRINT DATA=royalty_summary;

RUN;

OUTPUT:

ObsREGION_CODEGENRETOTAL_SONGSTOTAL_ROYALTY
1APACEDM244000
2APACINDIE122500
3EUROCK474500
4LATAMLATIN130000
5UKSOUL334000
6USALTERNATIVE116000
7USPOP8159000

5.DATA Step Equivalent

PROC SORT DATA=songs_nodup;

BY REGION_CODE GENRE;

RUN;

PROC PRINT DATA=songs_nodup;

RUN;

OUTPUT:

ObsSONG_IDSONG_NAMEARTIST_NAMEGENREREGION_CODEARTIST_EMAILSTREAM_COUNTROYALTY_AMOUNTRELEASE_DATESONG_KEY
1S010CloserChainsmokersEDMAPACchain@gmail.com8100002500011JUL2024S010_APAC
2S011FadedAlanwalkerEDMAPACalan@gmail.com.1900005AUG2024S011_APAC
3S020HeatwavesGlassanimalsINDIEAPACglass@gmail.com7300002250025AUG2024S020_APAC
4S004BelieverImaginedragonsROCKEUimagine@music.com.1800012MAR2024S004_EU
5S008ThunderImaginedragonsROCKEUinvalid@email.com6400001700013JUN2026S008_EU
6S015CountingstarsOnerepublicROCKEUone@gmail.com5600002000021NOV2024S015_EU
7S019RadioactiveImaginedragonsROCKEUimagine@gmail.com6200001950014APR2024S019_EU
8S007DespacitoLuisfonsiLATINLATAMluis@gmail.com9200003000010MAY2024S007_LATAM
9S003NullAdeleSOULUKadele@gmail4500001200013JUN2026S003_UK
10S009SomeonelikeyouAdeleSOULUKadele@gmail.com730000100001JUN2024S009_UK
11S017RollinginthedeepAdeleSOULUKadele@gmail.com6100002100018SEP2024S017_UK
12S013BadguyBillieeilishALTERNATIVEUSinvalid@email.com4300001600010OCT2024S013_US
13S001ShapeofyouEdsheeranPOPUSed@gmail.com5000001500015JAN2024S001_US
14S002BlindinglightsTheweekndPOPUSinvalid@email.com700000250010FEB2024S002_US
15S005PerfectEdsheeranPOPUSed@@gmail.com900002200022APR2024S005_US
16S006HaloBeyoncePOPUSbeyonce@gmail.com8500002700013JUN2026S006_US
17S012LevitatingDualipaPOPUSdua@gmail.com6200002100015SEP2024S012_US
18S014SenoritaShawnmendesPOPUSshawn@gmail.com7800002400013JUN2026S014_US
19S016StayJustinbieberPOPUSjustin@gmail.com9100002900005DEC2024S016_US
20S018CheapthrillsSiaPOPUSsia@gmail.com5800001850009MAR2024S018_US

DATA summary_ds;

SET songs_nodup;

BY REGION_CODE GENRE;

RETAIN TOTAL_ROYALTY TOTAL_SONGS;

IF FIRST.GENRE THEN DO;

   TOTAL_ROYALTY=0;

   TOTAL_SONGS=0;

END;

TOTAL_ROYALTY + ROYALTY_AMOUNT;

TOTAL_SONGS + 1;

IF LAST.GENRE;

RUN;

PROC PRINT DATA=summary_ds;

RUN;

OUTPUT:

ObsSONG_IDSONG_NAMEARTIST_NAMEGENREREGION_CODEARTIST_EMAILSTREAM_COUNTROYALTY_AMOUNTRELEASE_DATESONG_KEYTOTAL_ROYALTYTOTAL_SONGS
1S011FadedAlanwalkerEDMAPACalan@gmail.com.1900005AUG2024S011_APAC440002
2S020HeatwavesGlassanimalsINDIEAPACglass@gmail.com7300002250025AUG2024S020_APAC225001
3S019RadioactiveImaginedragonsROCKEUimagine@gmail.com6200001950014APR2024S019_EU745004
4S007DespacitoLuisfonsiLATINLATAMluis@gmail.com9200003000010MAY2024S007_LATAM300001
5S017RollinginthedeepAdeleSOULUKadele@gmail.com6100002100018SEP2024S017_UK340003
6S013BadguyBillieeilishALTERNATIVEUSinvalid@email.com4300001600010OCT2024S013_US160001
7S018CheapthrillsSiaPOPUSsia@gmail.com5800001850009MAR2024S018_US1590008

Explanation and Key Points

PROC SQL is concise and intuitive for aggregation logic. DATA step processing provides deeper control, especially for FIRST./LAST. group processing. Enterprise SAS developers often prefer DATA step for highly customized derivations.

6.Creating songs_raw Dataset in R

library(tidyverse)

songs_raw <- read.delim(

  text = "

S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|us|ed@gmail.com

S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|US|ed@gmail.com

S002| BlindingLights |TheWeeknd| pop |2024/02/10|700000|-2500|usa|weekndgmail.com

S003|NULL|Adele|Soul|2024-13-01|450000|12000|UK|adele@gmail

S004|Believer|ImagineDragons|ROCK|2024-03-12|.|18000|EU|imagine@music.com

S005|Perfect|EdSheeran|P0P|2024-04-22|-90000|22000|US|ed@@gmail.com

S006|Halo|Beyonce|Pop| |850000|27000|Us|beyonce@gmail.com

S007|Despacito|LuisFonsi|Latin|2024-05-10|920000|30000|LATAM|luis@gmail.com

S008|Thunder|ImagineDragons|rock|2024-05-40|640000|17000|EU|wrongmail

S009|SomeoneLikeYou|Adele|SOUL|2024-06-01|730000|-1000|UK|adele@gmail.com

S010|Closer|Chainsmokers|EDM|2024-07-11|810000|25000|APAC|chain@gmail.com

S011|Faded|AlanWalker|edm|2024-08-05|NULL|19000|AP|alan@gmail.com

S012|Levitating|DuaLipa|POP|2024-09-15|620000|21000|USA|dua@gmail.com

S013|BadGuy|BillieEilish|Alternative|2024-10-10|430000|16000|US|billiegmail.com

S014|Senorita|ShawnMendes|Pop|invalid_date|780000|24000|US|shawn@gmail.com

S015|CountingStars|OneRepublic|ROCK|2024-11-21|560000|20000|EUROPE|one@gmail.com

S016|Stay|JustinBieber|Pop|2024-12-05|910000|29000|US|justin@gmail.com

S017|RollingInTheDeep|Adele|Soul|2024-09-18|610000|21000|UK|adele@gmail.com

S018|CheapThrills|Sia|Pop|2024-03-09|580000|18500|US|sia@gmail.com

S019|Radioactive|ImagineDragons|Rock|2024-04-14|620000|19500|EU|imagine@gmail.com

S020|HeatWaves|GlassAnimals|Indie|2024-08-25|730000|22500|APAC|glass@gmail.com

",

  

  sep = "|",

  header = FALSE,

  stringsAsFactors = FALSE,

  na.strings = c("", "NULL", ".")

)

OUTPUT:

 

V1

V2

V3

V4

V5

V6

V7

V8

V9

1

S001

ShapeOfYou

EdSheeran

POP

15-01-2024

500000

15000

us

ed@gmail.com

2

S001

ShapeOfYou

EdSheeran

POP

15-01-2024

500000

15000

US

ed@gmail.com

3

S002

 BlindingLights 

TheWeeknd

 pop 

10-02-2024

700000

-2500

usa

weekndgmail.com

4

S003

NA

Adele

Soul

2024-13-01

450000

12000

UK

adele@gmail

5

S004

Believer

ImagineDragons

ROCK

12-03-2024

NA

18000

EU

imagine@music.com

6

S005

Perfect

EdSheeran

P0P

22-04-2024

-90000

22000

US

ed@@gmail.com

7

S006

Halo

Beyonce

Pop

 

850000

27000

Us

beyonce@gmail.com

8

S007

Despacito

LuisFonsi

Latin

10-05-2024

920000

30000

LATAM

luis@gmail.com

9

S008

Thunder

ImagineDragons

rock

2024-05-40

640000

17000

EU

wrongmail

10

S009

SomeoneLikeYou

Adele

SOUL

01-06-2024

730000

-1000

UK

adele@gmail.com

11

S010

Closer

Chainsmokers

EDM

11-07-2024

810000

25000

APAC

chain@gmail.com

12

S011

Faded

AlanWalker

edm

05-08-2024

NA

19000

AP

alan@gmail.com

13

S012

Levitating

DuaLipa

POP

15-09-2024

620000

21000

USA

dua@gmail.com

14

S013

BadGuy

BillieEilish

Alternative

10-10-2024

430000

16000

US

billiegmail.com

15

S014

Senorita

ShawnMendes

Pop

invalid_date

780000

24000

US

shawn@gmail.com

16

S015

CountingStars

OneRepublic

ROCK

21-11-2024

560000

20000

EUROPE

one@gmail.com

17

S016

Stay

JustinBieber

Pop

05-12-2024

910000

29000

US

justin@gmail.com

18

S017

RollingInTheDeep

Adele

Soul

18-09-2024

610000

21000

UK

adele@gmail.com

19

S018

CheapThrills

Sia

Pop

09-03-2024

580000

18500

US

sia@gmail.com

20

S019

Radioactive

ImagineDragons

Rock

14-04-2024

620000

19500

EU

imagine@gmail.com

21

S020

HeatWaves

GlassAnimals

Indie

25-08-2024

730000

22500

APAC

glass@gmail.com

#========================================================

# Assign Column Names

#========================================================

colnames(songs_raw) <- c(

  "SONG_ID",

  "SONG_NAME",

  "ARTIST_NAME",

  "GENRE",

  "RELEASE_DATE",

  "STREAM_COUNT",

  "ROYALTY_AMOUNT",

  "REGION_CODE",

  "ARTIST_EMAIL"

)

OUTPUT:

 

SONG_ID

SONG_NAME

ARTIST_NAME

GENRE

RELEASE_DATE

STREAM_COUNT

ROYALTY_AMOUNT

REGION_CODE

ARTIST_EMAIL

1

S001

ShapeOfYou

EdSheeran

POP

15-01-2024

500000

15000

us

ed@gmail.com

2

S001

ShapeOfYou

EdSheeran

POP

15-01-2024

500000

15000

US

ed@gmail.com

3

S002

 BlindingLights 

TheWeeknd

 pop 

10-02-2024

700000

-2500

usa

weekndgmail.com

4

S003

NA

Adele

Soul

2024-13-01

450000

12000

UK

adele@gmail

5

S004

Believer

ImagineDragons

ROCK

12-03-2024

NA

18000

EU

imagine@music.com

6

S005

Perfect

EdSheeran

P0P

22-04-2024

-90000

22000

US

ed@@gmail.com

7

S006

Halo

Beyonce

Pop

 

850000

27000

Us

beyonce@gmail.com

8

S007

Despacito

LuisFonsi

Latin

10-05-2024

920000

30000

LATAM

luis@gmail.com

9

S008

Thunder

ImagineDragons

rock

2024-05-40

640000

17000

EU

wrongmail

10

S009

SomeoneLikeYou

Adele

SOUL

01-06-2024

730000

-1000

UK

adele@gmail.com

11

S010

Closer

Chainsmokers

EDM

11-07-2024

810000

25000

APAC

chain@gmail.com

12

S011

Faded

AlanWalker

edm

05-08-2024

NA

19000

AP

alan@gmail.com

13

S012

Levitating

DuaLipa

POP

15-09-2024

620000

21000

USA

dua@gmail.com

14

S013

BadGuy

BillieEilish

Alternative

10-10-2024

430000

16000

US

billiegmail.com

15

S014

Senorita

ShawnMendes

Pop

invalid_date

780000

24000

US

shawn@gmail.com

16

S015

CountingStars

OneRepublic

ROCK

21-11-2024

560000

20000

EUROPE

one@gmail.com

17

S016

Stay

JustinBieber

Pop

05-12-2024

910000

29000

US

justin@gmail.com

18

S017

RollingInTheDeep

Adele

Soul

18-09-2024

610000

21000

UK

adele@gmail.com

19

S018

CheapThrills

Sia

Pop

09-03-2024

580000

18500

US

sia@gmail.com

20

S019

Radioactive

ImagineDragons

Rock

14-04-2024

620000

19500

EU

imagine@gmail.com

21

S020

HeatWaves

GlassAnimals

Indie

25-08-2024

730000

22500

APAC

glass@gmail.com

#========================================================

# Convert Numeric Variables

#========================================================

songs_raw$STREAM_COUNT   <- as.numeric(songs_raw$STREAM_COUNT)

songs_raw$ROYALTY_AMOUNT <- as.numeric(songs_raw$ROYALTY_AMOUNT)

OUTPUT:

 

SONG_ID

SONG_NAME

ARTIST_NAME

GENRE

RELEASE_DATE

STREAM_COUNT

ROYALTY_AMOUNT

REGION_CODE

ARTIST_EMAIL

1

S001

ShapeOfYou

EdSheeran

POP

15-01-2024

500000

15000

us

ed@gmail.com

2

S001

ShapeOfYou

EdSheeran

POP

15-01-2024

500000

15000

US

ed@gmail.com

3

S002

 BlindingLights 

TheWeeknd

 pop 

10-02-2024

700000

-2500

usa

weekndgmail.com

4

S003

NA

Adele

Soul

2024-13-01

450000

12000

UK

adele@gmail

5

S004

Believer

ImagineDragons

ROCK

12-03-2024

NA

18000

EU

imagine@music.com

6

S005

Perfect

EdSheeran

P0P

22-04-2024

-90000

22000

US

ed@@gmail.com

7

S006

Halo

Beyonce

Pop

 

850000

27000

Us

beyonce@gmail.com

8

S007

Despacito

LuisFonsi

Latin

10-05-2024

920000

30000

LATAM

luis@gmail.com

9

S008

Thunder

ImagineDragons

rock

2024-05-40

640000

17000

EU

wrongmail

10

S009

SomeoneLikeYou

Adele

SOUL

01-06-2024

730000

-1000

UK

adele@gmail.com

11

S010

Closer

Chainsmokers

EDM

11-07-2024

810000

25000

APAC

chain@gmail.com

12

S011

Faded

AlanWalker

edm

05-08-2024

NA

19000

AP

alan@gmail.com

13

S012

Levitating

DuaLipa

POP

15-09-2024

620000

21000

USA

dua@gmail.com

14

S013

BadGuy

BillieEilish

Alternative

10-10-2024

430000

16000

US

billiegmail.com

15

S014

Senorita

ShawnMendes

Pop

invalid_date

780000

24000

US

shawn@gmail.com

16

S015

CountingStars

OneRepublic

ROCK

21-11-2024

560000

20000

EUROPE

one@gmail.com

17

S016

Stay

JustinBieber

Pop

05-12-2024

910000

29000

US

justin@gmail.com

18

S017

RollingInTheDeep

Adele

Soul

18-09-2024

610000

21000

UK

adele@gmail.com

19

S018

CheapThrills

Sia

Pop

09-03-2024

580000

18500

US

sia@gmail.com

20

S019

Radioactive

ImagineDragons

Rock

14-04-2024

620000

19500

EU

imagine@gmail.com

21

S020

HeatWaves

GlassAnimals

Indie

25-08-2024

730000

22500

APAC

glass@gmail.com

Key Explanation

This R code is the equivalent of the SAS DATALINES approach. Instead of using INFILE DATALINES, R uses read.delim() with the text= argument to directly read inline raw data.

Important Enterprise-Level Points

·       sep="|" works like DLM='|' in SAS.

·       na.strings=c("", "NULL", ".") converts corrupted values into proper missing values (NA) similar to SAS missing handling.

·       stringsAsFactors=FALSE prevents automatic factor conversion, which is important for data cleaning projects.

·       as.numeric() converts mixed character numeric variables into numeric types.

·       Invalid values like "NULL" and "." become NA.

·       This dataset intentionally contains:

o   Duplicate SONG_IDs

o   Invalid dates

o   Negative royalties

o   Mixed case text

o   Corrupted emails

o   Missing values

o   Inconsistent region codes

These issues simulate real-world enterprise data quality problems commonly found in:

·       Clinical trial systems

·       Banking transactions

·       Insurance claims

·       Retail analytics

·       Streaming platforms

This raw dataset is now ready for advanced cleaning using:

·       dplyr

·       stringr

·       lubridate

·       janitor

·       tidyr

7.R Data Cleaning Layer

library(tidyverse)

library(lubridate)

library(janitor)

songs_clean <- songs_raw %>%

  clean_names() %>%

  mutate(

    song_name = str_to_title(str_trim(song_name)),

    genre = str_to_upper(str_trim(genre)),

    genre = str_replace_all(genre,"P0P","POP"),

    stream_count = abs(as.numeric(stream_count)),

    royalty_amount = abs(royalty_amount),

    release_date = str_trim(release_date),

    release_date = na_if(release_date, ""),

    release_date = if_else(release_date == "invalid_date",

      NA_character_,release_date),

    release_date = parse_date_time(release_date,

      orders = c("ymd","y/m/d"),quiet = TRUE),

    region_code = case_when(

      region_code=="usa" ~ "US",

      region_code=="AP" ~ "APAC",

      TRUE ~ str_to_upper(region_code)),

    artist_email = if_else(grepl("@",artist_email),

      artist_email,"invalid@email.com")

  ) %>%

distinct(song_id,.keep_all=TRUE)

OUTPUT:

 

song_id

song_name

artist_name

genre

release_date

stream_count

royalty_amount

region_code

artist_email

1

S001

Shapeofyou

EdSheeran

POP

15-01-2024

500000

15000

US

ed@gmail.com

2

S002

Blindinglights

TheWeeknd

POP

10-02-2024

700000

2500

US

invalid@email.com

3

S003

NA

Adele

SOUL

NA

450000

12000

UK

adele@gmail

4

S004

Believer

ImagineDragons

ROCK

12-03-2024

NA

18000

EU

imagine@music.com

5

S005

Perfect

EdSheeran

POP

22-04-2024

90000

22000

US

ed@@gmail.com

6

S006

Halo

Beyonce

POP

NA

850000

27000

US

beyonce@gmail.com

7

S007

Despacito

LuisFonsi

LATIN

10-05-2024

920000

30000

LATAM

luis@gmail.com

8

S008

Thunder

ImagineDragons

ROCK

NA

640000

17000

EU

invalid@email.com

9

S009

Someonelikeyou

Adele

SOUL

01-06-2024

730000

1000

UK

adele@gmail.com

10

S010

Closer

Chainsmokers

EDM

11-07-2024

810000

25000

APAC

chain@gmail.com

11

S011

Faded

AlanWalker

EDM

05-08-2024

NA

19000

APAC

alan@gmail.com

12

S012

Levitating

DuaLipa

POP

15-09-2024

620000

21000

USA

dua@gmail.com

13

S013

Badguy

BillieEilish

ALTERNATIVE

10-10-2024

430000

16000

US

invalid@email.com

14

S014

Senorita

ShawnMendes

POP

NA

780000

24000

US

shawn@gmail.com

15

S015

Countingstars

OneRepublic

ROCK

21-11-2024

560000

20000

EUROPE

one@gmail.com

16

S016

Stay

JustinBieber

POP

05-12-2024

910000

29000

US

justin@gmail.com

17

S017

Rollinginthedeep

Adele

SOUL

18-09-2024

610000

21000

UK

adele@gmail.com

18

S018

Cheapthrills

Sia

POP

09-03-2024

580000

18500

US

sia@gmail.com

19

S019

Radioactive

ImagineDragons

ROCK

14-04-2024

620000

19500

EU

imagine@gmail.com

20

S020

Heatwaves

GlassAnimals

INDIE

25-08-2024

730000

22500

APAC

glass@gmail.com

Explanation and Key Points

This R workflow mirrors SAS transformations using tidyverse syntax. mutate() behaves similarly to DATA step assignment logic. case_when() resembles SELECT-WHEN. distinct() matches PROC SORT NODUPKEY.

parse_date_time() from lubridate handles mixed date formats more flexibly than standard SAS informats.

R provides superior exploratory flexibility, while SAS offers stronger audit traceability and regulatory compliance.

Why quiet = TRUE is Important

quiet = TRUE

suppresses warning messages.

Equivalent to SAS:

?? informat.

Very common in enterprise ETL workflows.

Clinical SAS vs R Comparison

SAS

R

INPUT(date, ?? yymmdd10.)

parse_date_time(..., quiet=TRUE)

Invalid dates become .

Invalid dates become NA

Log warning generated

Warning generated

?? suppresses warning

quiet=TRUE suppresses warning

Enterprise Validation & Compliance

In clinical trials:

  • SDTM datasets standardize raw collection data
  • ADaM datasets support statistical analysis
  • QC programmers independently validate derivations
  • Audit trails track every transformation
  • Traceability ensures reproducibility

One dangerous SAS behavior:

IF lab_value < 5 THEN FLAG='Y';

Missing numeric values in SAS are treated as lower than valid numbers. Therefore, missing values accidentally satisfy <5.

Correct approach:

IF NOT MISSING(lab_value) AND lab_value < 5 THEN FLAG='Y';

This prevents catastrophic analytical errors in regulatory environments.

20 Data Cleaning Best Practices

  1. Always validate duplicates before merging
  2. Standardize categorical variables
  3. Use metadata-driven programming
  4. Create reusable macros
  5. Never overwrite raw datasets
  6. Maintain audit trails
  7. Validate missing-value logic
  8. Use defensive programming
  9. Standardize date formats
  10. Apply QC independence
  11. Document derivation logic
  12. Validate variable lengths
  13. Avoid hardcoding
  14. Use controlled terminology
  15. Validate ranges systematically
  16. Build reproducible workflows
  17. Track data lineage
  18. Separate raw and clean layers
  19. Validate merge cardinality
  20. Automate compliance checks

Business Logic Explanation

Enterprise data cleaning exists to ensure analytical reliability. Missing values are imputed because incomplete records distort statistical models, forecasting systems, and executive dashboards. For example, if patient visit dates are missing in clinical trials, treatment duration calculations become inaccurate. In music analytics, missing release dates disrupt trend analysis.

Unrealistic values are corrected because downstream algorithms assume valid business logic. A negative royalty amount implies impossible financial behavior. Similarly, a patient age of 250 years invalidates demographic summaries.

Text normalization ensures consistency across systems. "POP", "pop", and " Pop " represent the same category but appear different computationally. Without normalization, aggregation results become fragmented.

Date standardization is critical for time-series analytics, regulatory reporting, and AI modeling. Mixed date formats often cause parsing failures during ETL processing.

Missing numeric handling is especially important in SAS because missing numeric values are treated differently than NULLs in SQL databases.

These corrections collectively transform unreliable operational data into analysis-ready intelligence capable of supporting dashboards, machine learning, regulatory submissions, and strategic business decisions.

20 Sharp One-Line Insights

  1. Dirty data creates expensive business mistakes.
  2. Validation logic is stronger than visual inspection.
  3. Standardized variables improve reproducibility.
  4. Duplicate IDs silently corrupt analytics.
  5. Missing dates distort trend intelligence.
  6. Audit trails protect regulatory credibility.
  7. Character truncation causes hidden data loss.
  8. SAS excels in governed environments.
  9. R dominates exploratory transformation flexibility.
  10. Metadata-driven programming reduces risk.
  11. PROC SQL simplifies aggregation logic.
  12. DATA step offers granular control.
  13. Defensive programming prevents production failures.
  14. Controlled terminology improves reporting consistency.
  15. QC independence strengthens analytical trust.
  16. Missing-value handling can change outcomes.
  17. Production code requires scalability.
  18. Clean data improves AI predictions.
  19. Traceability is essential for compliance.
  20. Reliable analytics begins with disciplined cleaning.

SAS vs R Comparison Table

Feature

SAS

R

Regulatory Compliance

Excellent

Moderate

Audit Trails

Strong

Limited

Flexibility

Moderate

Excellent

Statistical Libraries

Excellent

Excellent

Enterprise Governance

Strong

Moderate

Visualization

Moderate

Excellent

Scalability

Excellent

Good

Learning Curve

Moderate

Moderate

Metadata Integration

Strong

Moderate

Open Source

No

Yes

Validation Checklist

1. Duplicate check completed
2.Missing-value validation completed
3.Invalid dates corrected
4.Text normalization completed
5.Range validation completed
6.Email validation completed
7.QC logic verified
8.Metadata reviewed
9.Audit trail documented
10.Analysis-ready dataset generated

Summary

SAS and R together create one of the strongest enterprise analytics ecosystems available today. SAS provides governed, scalable, audit-ready processing ideal for regulated industries such as pharmaceuticals, banking, and insurance. Features like DATA step programming, FIRST./LAST. processing, PROC SQL, and metadata-driven validation frameworks make SAS exceptionally powerful for production-grade workflows.

R complements SAS by offering highly flexible transformation libraries, rapid exploratory analytics, and modern data-engineering frameworks. Packages such as tidyverse, lubridate, janitor, and stringr simplify complex transformations while enabling interactive analytical workflows.

In regulated clinical environments, SAS remains dominant because of traceability, validation standards, and regulatory acceptance. However, R increasingly supports exploratory analytics, machine learning, and visualization layers integrated into enterprise pipelines.

This project demonstrated how corrupted global famous-song datasets resemble real-world healthcare and financial data-quality failures. Problems such as duplicate IDs, malformed emails, inconsistent categories, negative numeric values, and invalid dates can severely impact business intelligence and statistical reliability.

Using SAS PROC SQL, DATA step programming, and modern R workflows, we converted unreliable operational records into clean analytical intelligence. The result is an analysis-ready dataset capable of supporting dashboards, forecasting models, compliance reporting, and executive decision-making.

Ultimately, enterprise analytics success depends not on sophisticated AI algorithms alone, but on disciplined data-engineering foundations built through validation, standardization, reproducibility, and robust cleaning frameworks.

Conclusion

Modern organizations generate enormous volumes of operational data every second. Yet raw data alone has little value unless it can be trusted. Whether the domain is healthcare, clinical trials, retail, banking, insurance, or global music streaming analytics, poor-quality data introduces dangerous analytical uncertainty.

This project demonstrated how corrupted famous-song datasets mirror real-world enterprise data failures. Duplicate records inflate metrics. Missing dates break trend analysis. Negative financial values distort revenue forecasting. Inconsistent text labels damage segmentation models. Malformed emails disrupt communication workflows. Invalid timestamps create reporting inaccuracies.

In regulated industries like pharmaceuticals, such failures become even more critical. Incorrect SDTM or ADaM derivations can delay regulatory approvals and compromise patient safety analysis. Therefore, structured data-cleaning frameworks are not optional they are foundational to enterprise trust.

SAS provides exceptional strength in governed environments because of its auditability, traceability, validation capabilities, and scalable DATA step architecture. Features like FIRST./LAST. processing, PROC SQL, macros, arrays, and controlled metadata frameworks make SAS ideal for production analytics.

R complements this ecosystem by enabling highly flexible transformations, modern data engineering, exploratory analytics, and advanced visualization. Together, SAS and R form a hybrid analytical ecosystem capable of delivering scalable, reproducible, and trustworthy intelligence.

The most successful enterprise analytics teams are not merely coders they are data-quality engineers, validation specialists, compliance thinkers, and business translators. Their goal is not simply to clean data, but to transform unreliable operational chaos into analytical clarity capable of supporting executive decisions, AI systems, statistical outputs, and regulatory submissions.

Ultimately, trustworthy analytics begins with disciplined cleaning, standardized logic, reproducible workflows, and rigorous validation culture. Clean data is not a technical luxury it is the backbone of modern business intelligence.

Interview Questions and Answers

1. How would you identify duplicate records in SAS?

Answer:
I would use PROC SORT NODUPKEY or PROC SQL GROUP BY HAVING COUNT(*) > 1. In regulated environments, I would also document duplicate-removal logic for audit traceability.

2. Why is missing-value handling dangerous in SAS?

Answer:
Missing numeric values are treated as smaller than valid numbers. Therefore:

IF value < 5;

incorrectly includes missing values unless explicitly checked using:

IF NOT MISSING(value) AND value < 5;

3. When would you prefer DATA step over PROC SQL?

Answer:
I prefer DATA step for row-wise transformations, retained calculations, FIRST./LAST. logic, arrays, and complex conditional processing. PROC SQL is better for joins and aggregation.

4. How do SAS and R complement each other in enterprise analytics?

Answer:
SAS provides governance, validation, and regulatory traceability, while R offers flexible exploratory analytics and modern transformation libraries. Together they create scalable analytical ecosystems.

5. Describe a real-world validation issue you might face in clinical trial data.

Answer:
A common issue is duplicate subject enrollment causing inflated treatment populations. Another is invalid visit dates affecting treatment duration calculations. Both require validation checks, QC programming, and audit documentation before submission.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent SONGS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and Exams Reviewers and Observers


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can Advanced SAS Programming Detect, Analyze, and Fix Errors in High-Frequency Trading Data While Identifying Fraud Patterns?

2.How Do SAS and R Complement Each Other in Detecting, Cleaning, and Transforming Complex Sensor Fusion Vehicle Data?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS