Transforming Corrupted Music Intelligence into Trusted Insights Using Sas And R
Corrupted Global Famous Songs Data into Enterprise-Grade Analytical Intelligence Using SAS (PROC SQL vs DATA Step) and R
Introduction
The
modern analytics world runs on data integrity. Whether you work in clinical
trials, banking, retail, insurance, or entertainment analytics, one harsh
reality remains constant: dirty data destroys trust. A single malformed date,
duplicated ID, corrupted category label, or inconsistent text value can
silently contaminate dashboards, AI predictions, executive reporting, and
regulatory submissions.
Imagine a
global music streaming company analyzing “famous songs worldwide” to identify
trending genres and audience behavior. Suddenly, executive dashboards begin
showing incorrect revenue projections because duplicate Song IDs inflated
streaming counts. Some records contain negative royalties. Others have invalid
release dates like 32/15/2025. Genre labels vary between "POP", "pop",
" Pop ", and "P0P". Regional codes are inconsistent. Artist
emails are malformed. Missing listener counts distort machine learning models.
This
resembles real-world clinical trial disasters where incorrect patient
enrollment dates, duplicated subjects, or invalid adverse-event records can
lead to rejected regulatory submissions. In enterprise environments,
poor-quality data is not just a technical issue it becomes a compliance,
financial, and reputational risk.
Business Crisis Scenario
A
multinational streaming platform launched an AI recommendation engine trained
on global famous-song data. Within weeks:
- Duplicate Song IDs inflated
royalty payments
- Invalid timestamps corrupted
streaming trends
- Negative revenue values
impacted forecasting
- Mixed genre labels reduced
recommendation accuracy
- Missing release dates
distorted monthly analytics
- Incorrect listener counts
damaged executive dashboards
- Corrupted emails prevented
artist communication
- Inconsistent regional codes
broke territory reporting
The same
problems occur in clinical trial systems:
- Incorrect patient ages
- Missing visit dates
- Invalid adverse-event
categories
- Duplicate subject IDs
- Missing treatment
assignments
In
regulated environments like clinical SAS programming, such issues can lead to
FDA rejection, failed QC validation, and delayed submissions.
Raw Corrupted Dataset (SAS)
Dataset Structure – 9 Variables
Variables:
- SONG_ID
- SONG_NAME
- ARTIST_NAME
- GENRE
- RELEASE_DATE
- STREAM_COUNT
- ROYALTY_AMOUNT
- REGION_CODE
- ARTIST_EMAIL
1.SAS Raw Dataset Creation
DATA songs_raw;
LENGTH SONG_ID $12 SONG_NAME $40 ARTIST_NAME $30 GENRE $20
RELEASE_DATE $20 REGION_CODE $10 ARTIST_EMAIL $50;
INFILE DATALINES DLM='|' MISSOVER;
INPUT SONG_ID $ SONG_NAME $ ARTIST_NAME $ GENRE $ RELEASE_DATE $
STREAM_COUNT ROYALTY_AMOUNT REGION_CODE $ ARTIST_EMAIL $;
DATALINES;
S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|us|ed@gmail.com
S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|US|ed@gmail.com
S002| BlindingLights |TheWeeknd| pop |2024/02/10|700000|-2500|usa|weekndgmail.com
S003|NULL|Adele|Soul|2024-13-01|450000|12000|UK|adele@gmail
S004|Believer|ImagineDragons|ROCK|2024-03-12|.|18000|EU|imagine@music.com
S005|Perfect|EdSheeran|P0P|2024-04-22|-90000|22000|US|ed@@gmail.com
S006|Halo|Beyonce|Pop| |850000|27000|Us|beyonce@gmail.com
S007|Despacito|LuisFonsi|Latin|2024-05-10|920000|30000|LATAM|luis@gmail.com
S008|Thunder|ImagineDragons|rock|2024-05-40|640000|17000|EU|wrongmail
S009|SomeoneLikeYou|Adele|SOUL|2024-06-01|730000|-1000|UK|adele@gmail.com
S010|Closer|Chainsmokers|EDM|2024-07-11|810000|25000|APAC|chain@gmail.com
S011|Faded|AlanWalker|edm|2024-08-05|NULL|19000|AP|alan@gmail.com
S012|Levitating|DuaLipa|POP|2024-09-15|620000|21000|USA|dua@gmail.com
S013|BadGuy|BillieEilish|Alternative|2024-10-10|430000|16000|US|billiegmail.com
S014|Senorita|ShawnMendes|Pop|invalid_date|780000|24000|US|shawn@gmail.com
S015|CountingStars|OneRepublic|ROCK|2024-11-21|560000|20000|EUROPE|one@gmail.com
S016|Stay|JustinBieber|Pop|2024-12-05|910000|29000|US|justin@gmail.com
S017|RollingInTheDeep|Adele|Soul|2024-09-18|610000|21000|UK|adele@gmail.com
S018|CheapThrills|Sia|Pop|2024-03-09|580000|18500|US|sia@gmail.com
S019|Radioactive|ImagineDragons|Rock|2024-04-14|620000|19500|EU|imagine@gmail.com
S020|HeatWaves|GlassAnimals|Indie|2024-08-25|730000|22500|APAC|glass@gmail.com
;
RUN;
PROC PRINT DATA=songs_raw;
RUN;
OUTPUT:
| Obs | SONG_ID | SONG_NAME | ARTIST_NAME | GENRE | RELEASE_DATE | REGION_CODE | ARTIST_EMAIL | STREAM_COUNT | ROYALTY_AMOUNT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | ShapeOfYou | EdSheeran | POP | 2024-01-15 | us | ed@gmail.com | 500000 | 15000 |
| 2 | S001 | ShapeOfYou | EdSheeran | POP | 2024-01-15 | US | ed@gmail.com | 500000 | 15000 |
| 3 | S002 | BlindingLights | TheWeeknd | pop | 2024/02/10 | usa | weekndgmail.com | 700000 | -2500 |
| 4 | S003 | NULL | Adele | Soul | 2024-13-01 | UK | adele@gmail | 450000 | 12000 |
| 5 | S004 | Believer | ImagineDragons | ROCK | 2024-03-12 | EU | imagine@music.com | . | 18000 |
| 6 | S005 | Perfect | EdSheeran | P0P | 2024-04-22 | US | ed@@gmail.com | -90000 | 22000 |
| 7 | S006 | Halo | Beyonce | Pop | Us | beyonce@gmail.com | 850000 | 27000 | |
| 8 | S007 | Despacito | LuisFonsi | Latin | 2024-05-10 | LATAM | luis@gmail.com | 920000 | 30000 |
| 9 | S008 | Thunder | ImagineDragons | rock | 2024-05-40 | EU | wrongmail | 640000 | 17000 |
| 10 | S009 | SomeoneLikeYou | Adele | SOUL | 2024-06-01 | UK | adele@gmail.com | 730000 | -1000 |
| 11 | S010 | Closer | Chainsmokers | EDM | 2024-07-11 | APAC | chain@gmail.com | 810000 | 25000 |
| 12 | S011 | Faded | AlanWalker | edm | 2024-08-05 | AP | alan@gmail.com | . | 19000 |
| 13 | S012 | Levitating | DuaLipa | POP | 2024-09-15 | USA | dua@gmail.com | 620000 | 21000 |
| 14 | S013 | BadGuy | BillieEilish | Alternative | 2024-10-10 | US | billiegmail.com | 430000 | 16000 |
| 15 | S014 | Senorita | ShawnMendes | Pop | invalid_date | US | shawn@gmail.com | 780000 | 24000 |
| 16 | S015 | CountingStars | OneRepublic | ROCK | 2024-11-21 | EUROPE | one@gmail.com | 560000 | 20000 |
| 17 | S016 | Stay | JustinBieber | Pop | 2024-12-05 | US | justin@gmail.com | 910000 | 29000 |
| 18 | S017 | RollingInTheDeep | Adele | Soul | 2024-09-18 | UK | adele@gmail.com | 610000 | 21000 |
| 19 | S018 | CheapThrills | Sia | Pop | 2024-03-09 | US | sia@gmail.com | 580000 | 18500 |
| 20 | S019 | Radioactive | ImagineDragons | Rock | 2024-04-14 | EU | imagine@gmail.com | 620000 | 19500 |
| 21 | S020 | HeatWaves | GlassAnimals | Indie | 2024-08-25 | APAC | glass@gmail.com | 730000 | 22500 |
Explanation and Key Points
This raw
dataset intentionally contains enterprise-grade corruption patterns commonly
seen in real production systems. Duplicate SONG_ID values simulate duplicate
patient records in clinical trials. Invalid dates mimic incorrect visit dates.
Negative royalty amounts resemble negative billing values in banking systems.
Mixed uppercase/lowercase text demonstrates metadata inconsistency. Malformed
emails simulate communication failures.
The LENGTH
statement appears BEFORE assignment logic because SAS allocates character
memory during compilation. If LENGTH is declared later, SAS may permanently
truncate values. For example:
artist_name =
"ImagineDragons";
length artist_name $10;
This
causes truncation because memory was already allocated. In R, character vectors
dynamically resize, avoiding this specific risk.
The MISSOVER
option prevents SAS from shifting values into incorrect columns when data is
missing.
2.SAS Data Cleaning Workflow
DATA songs_clean;
SET songs_raw;
SONG_NAME = PROPCASE(STRIP(SONG_NAME));
ARTIST_NAME = PROPCASE(STRIP(ARTIST_NAME));
GENRE = UPCASE(COMPRESS(GENRE));
REGION_CODE = UPCASE(STRIP(REGION_CODE));
GENRE = TRANWRD(GENRE,'P0P','POP');
IF SONG_NAME='NULL' THEN SONG_NAME='Unknown';
STREAM_COUNT = ABS(STREAM_COUNT);
ROYALTY_AMOUNT = ROUND(ABS(ROYALTY_AMOUNT),0.01);
RELEASE_DT = INPUT(RELEASE_DATE,ANYDTDTE15.);
FORMAT RELEASE_DT DATE9.;
IF MISSING(RELEASE_DT) THEN RELEASE_DT = TODAY();
IF FIND(ARTIST_EMAIL,'@')=0 THEN
ARTIST_EMAIL='invalid@email.com';
SELECT (REGION_CODE);
WHEN ('USA') REGION_CODE='US';
WHEN ('AP') REGION_CODE='APAC';
WHEN ('EUROPE') REGION_CODE='EU';
OTHERWISE;
END;
SONG_KEY = CATX('_',SONG_ID,REGION_CODE);
DROP RELEASE_DATE;
RENAME RELEASE_DT=RELEASE_DATE;
RUN;
PROC PRINT DATA=songs_clean;
RUN;
OUTPUT:
| Obs | SONG_ID | SONG_NAME | ARTIST_NAME | GENRE | REGION_CODE | ARTIST_EMAIL | STREAM_COUNT | ROYALTY_AMOUNT | RELEASE_DATE | SONG_KEY |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Shapeofyou | Edsheeran | POP | US | ed@gmail.com | 500000 | 15000 | 15JAN2024 | S001_US |
| 2 | S001 | Shapeofyou | Edsheeran | POP | US | ed@gmail.com | 500000 | 15000 | 15JAN2024 | S001_US |
| 3 | S002 | Blindinglights | Theweeknd | POP | US | invalid@email.com | 700000 | 2500 | 10FEB2024 | S002_US |
| 4 | S003 | Null | Adele | SOUL | UK | adele@gmail | 450000 | 12000 | 13JUN2026 | S003_UK |
| 5 | S004 | Believer | Imaginedragons | ROCK | EU | imagine@music.com | . | 18000 | 12MAR2024 | S004_EU |
| 6 | S005 | Perfect | Edsheeran | POP | US | ed@@gmail.com | 90000 | 22000 | 22APR2024 | S005_US |
| 7 | S006 | Halo | Beyonce | POP | US | beyonce@gmail.com | 850000 | 27000 | 13JUN2026 | S006_US |
| 8 | S007 | Despacito | Luisfonsi | LATIN | LATAM | luis@gmail.com | 920000 | 30000 | 10MAY2024 | S007_LATAM |
| 9 | S008 | Thunder | Imaginedragons | ROCK | EU | invalid@email.com | 640000 | 17000 | 13JUN2026 | S008_EU |
| 10 | S009 | Someonelikeyou | Adele | SOUL | UK | adele@gmail.com | 730000 | 1000 | 01JUN2024 | S009_UK |
| 11 | S010 | Closer | Chainsmokers | EDM | APAC | chain@gmail.com | 810000 | 25000 | 11JUL2024 | S010_APAC |
| 12 | S011 | Faded | Alanwalker | EDM | APAC | alan@gmail.com | . | 19000 | 05AUG2024 | S011_APAC |
| 13 | S012 | Levitating | Dualipa | POP | US | dua@gmail.com | 620000 | 21000 | 15SEP2024 | S012_US |
| 14 | S013 | Badguy | Billieeilish | ALTERNATIVE | US | invalid@email.com | 430000 | 16000 | 10OCT2024 | S013_US |
| 15 | S014 | Senorita | Shawnmendes | POP | US | shawn@gmail.com | 780000 | 24000 | 13JUN2026 | S014_US |
| 16 | S015 | Countingstars | Onerepublic | ROCK | EU | one@gmail.com | 560000 | 20000 | 21NOV2024 | S015_EU |
| 17 | S016 | Stay | Justinbieber | POP | US | justin@gmail.com | 910000 | 29000 | 05DEC2024 | S016_US |
| 18 | S017 | Rollinginthedeep | Adele | SOUL | UK | adele@gmail.com | 610000 | 21000 | 18SEP2024 | S017_UK |
| 19 | S018 | Cheapthrills | Sia | POP | US | sia@gmail.com | 580000 | 18500 | 09MAR2024 | S018_US |
| 20 | S019 | Radioactive | Imaginedragons | ROCK | EU | imagine@gmail.com | 620000 | 19500 | 14APR2024 | S019_EU |
| 21 | S020 | Heatwaves | Glassanimals | INDIE | APAC | glass@gmail.com | 730000 | 22500 | 25AUG2024 | S020_APAC |
Explanation and Key Points
This
workflow demonstrates enterprise-level SAS cleaning logic. PROPCASE()
standardizes text capitalization. COMPRESS() removes hidden spaces. TRANWRD()
repairs corrupted category labels. ABS() prevents negative analytical measures.
INPUT()
converts character dates into true SAS numeric dates. Missing dates are imputed
using TODAY()a common healthcare approach when partial dates exist.
SELECT-WHEN
improves readability compared to multiple IF-THEN blocks. CATX() builds
composite business keys for traceability.
Clinical
SAS programmers frequently use this methodology when preparing SDTM and ADaM
datasets for regulatory submissions.
3.Removing Duplicate Records
PROC SORT DATA=songs_clean OUT=songs_nodup NODUPKEY;
BY SONG_ID;
RUN;
PROC PRINT DATA=songs_nodup;
RUN;
LOG:
OUTPUT:
| Obs | SONG_ID | SONG_NAME | ARTIST_NAME | GENRE | REGION_CODE | ARTIST_EMAIL | STREAM_COUNT | ROYALTY_AMOUNT | RELEASE_DATE | SONG_KEY |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Shapeofyou | Edsheeran | POP | US | ed@gmail.com | 500000 | 15000 | 15JAN2024 | S001_US |
| 2 | S002 | Blindinglights | Theweeknd | POP | US | invalid@email.com | 700000 | 2500 | 10FEB2024 | S002_US |
| 3 | S003 | Null | Adele | SOUL | UK | adele@gmail | 450000 | 12000 | 13JUN2026 | S003_UK |
| 4 | S004 | Believer | Imaginedragons | ROCK | EU | imagine@music.com | . | 18000 | 12MAR2024 | S004_EU |
| 5 | S005 | Perfect | Edsheeran | POP | US | ed@@gmail.com | 90000 | 22000 | 22APR2024 | S005_US |
| 6 | S006 | Halo | Beyonce | POP | US | beyonce@gmail.com | 850000 | 27000 | 13JUN2026 | S006_US |
| 7 | S007 | Despacito | Luisfonsi | LATIN | LATAM | luis@gmail.com | 920000 | 30000 | 10MAY2024 | S007_LATAM |
| 8 | S008 | Thunder | Imaginedragons | ROCK | EU | invalid@email.com | 640000 | 17000 | 13JUN2026 | S008_EU |
| 9 | S009 | Someonelikeyou | Adele | SOUL | UK | adele@gmail.com | 730000 | 1000 | 01JUN2024 | S009_UK |
| 10 | S010 | Closer | Chainsmokers | EDM | APAC | chain@gmail.com | 810000 | 25000 | 11JUL2024 | S010_APAC |
| 11 | S011 | Faded | Alanwalker | EDM | APAC | alan@gmail.com | . | 19000 | 05AUG2024 | S011_APAC |
| 12 | S012 | Levitating | Dualipa | POP | US | dua@gmail.com | 620000 | 21000 | 15SEP2024 | S012_US |
| 13 | S013 | Badguy | Billieeilish | ALTERNATIVE | US | invalid@email.com | 430000 | 16000 | 10OCT2024 | S013_US |
| 14 | S014 | Senorita | Shawnmendes | POP | US | shawn@gmail.com | 780000 | 24000 | 13JUN2026 | S014_US |
| 15 | S015 | Countingstars | Onerepublic | ROCK | EU | one@gmail.com | 560000 | 20000 | 21NOV2024 | S015_EU |
| 16 | S016 | Stay | Justinbieber | POP | US | justin@gmail.com | 910000 | 29000 | 05DEC2024 | S016_US |
| 17 | S017 | Rollinginthedeep | Adele | SOUL | UK | adele@gmail.com | 610000 | 21000 | 18SEP2024 | S017_UK |
| 18 | S018 | Cheapthrills | Sia | POP | US | sia@gmail.com | 580000 | 18500 | 09MAR2024 | S018_US |
| 19 | S019 | Radioactive | Imaginedragons | ROCK | EU | imagine@gmail.com | 620000 | 19500 | 14APR2024 | S019_EU |
| 20 | S020 | Heatwaves | Glassanimals | INDIE | APAC | glass@gmail.com | 730000 | 22500 | 25AUG2024 | S020_APAC |
Explanation and Key Points
PROC SORT
NODUPKEY removes duplicate keys efficiently. In clinical trials, duplicate
USUBJID values can invalidate analysis populations. Deduplication must always
be documented for audit trails.
PROC SQL vs DATA Step
4.PROC SQL Join
PROC SQL;
CREATE TABLE royalty_summary AS
SELECT REGION_CODE,GENRE,
COUNT(*) AS TOTAL_SONGS,
SUM(ROYALTY_AMOUNT) AS TOTAL_ROYALTY
FROM songs_nodup
GROUP BY REGION_CODE,GENRE;
QUIT;
PROC PRINT DATA=royalty_summary;
RUN;
OUTPUT:
| Obs | REGION_CODE | GENRE | TOTAL_SONGS | TOTAL_ROYALTY |
|---|---|---|---|---|
| 1 | APAC | EDM | 2 | 44000 |
| 2 | APAC | INDIE | 1 | 22500 |
| 3 | EU | ROCK | 4 | 74500 |
| 4 | LATAM | LATIN | 1 | 30000 |
| 5 | UK | SOUL | 3 | 34000 |
| 6 | US | ALTERNATIVE | 1 | 16000 |
| 7 | US | POP | 8 | 159000 |
5.DATA Step Equivalent
PROC SORT DATA=songs_nodup;
BY REGION_CODE GENRE;
RUN;
PROC PRINT DATA=songs_nodup;
RUN;
OUTPUT:
| Obs | SONG_ID | SONG_NAME | ARTIST_NAME | GENRE | REGION_CODE | ARTIST_EMAIL | STREAM_COUNT | ROYALTY_AMOUNT | RELEASE_DATE | SONG_KEY |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S010 | Closer | Chainsmokers | EDM | APAC | chain@gmail.com | 810000 | 25000 | 11JUL2024 | S010_APAC |
| 2 | S011 | Faded | Alanwalker | EDM | APAC | alan@gmail.com | . | 19000 | 05AUG2024 | S011_APAC |
| 3 | S020 | Heatwaves | Glassanimals | INDIE | APAC | glass@gmail.com | 730000 | 22500 | 25AUG2024 | S020_APAC |
| 4 | S004 | Believer | Imaginedragons | ROCK | EU | imagine@music.com | . | 18000 | 12MAR2024 | S004_EU |
| 5 | S008 | Thunder | Imaginedragons | ROCK | EU | invalid@email.com | 640000 | 17000 | 13JUN2026 | S008_EU |
| 6 | S015 | Countingstars | Onerepublic | ROCK | EU | one@gmail.com | 560000 | 20000 | 21NOV2024 | S015_EU |
| 7 | S019 | Radioactive | Imaginedragons | ROCK | EU | imagine@gmail.com | 620000 | 19500 | 14APR2024 | S019_EU |
| 8 | S007 | Despacito | Luisfonsi | LATIN | LATAM | luis@gmail.com | 920000 | 30000 | 10MAY2024 | S007_LATAM |
| 9 | S003 | Null | Adele | SOUL | UK | adele@gmail | 450000 | 12000 | 13JUN2026 | S003_UK |
| 10 | S009 | Someonelikeyou | Adele | SOUL | UK | adele@gmail.com | 730000 | 1000 | 01JUN2024 | S009_UK |
| 11 | S017 | Rollinginthedeep | Adele | SOUL | UK | adele@gmail.com | 610000 | 21000 | 18SEP2024 | S017_UK |
| 12 | S013 | Badguy | Billieeilish | ALTERNATIVE | US | invalid@email.com | 430000 | 16000 | 10OCT2024 | S013_US |
| 13 | S001 | Shapeofyou | Edsheeran | POP | US | ed@gmail.com | 500000 | 15000 | 15JAN2024 | S001_US |
| 14 | S002 | Blindinglights | Theweeknd | POP | US | invalid@email.com | 700000 | 2500 | 10FEB2024 | S002_US |
| 15 | S005 | Perfect | Edsheeran | POP | US | ed@@gmail.com | 90000 | 22000 | 22APR2024 | S005_US |
| 16 | S006 | Halo | Beyonce | POP | US | beyonce@gmail.com | 850000 | 27000 | 13JUN2026 | S006_US |
| 17 | S012 | Levitating | Dualipa | POP | US | dua@gmail.com | 620000 | 21000 | 15SEP2024 | S012_US |
| 18 | S014 | Senorita | Shawnmendes | POP | US | shawn@gmail.com | 780000 | 24000 | 13JUN2026 | S014_US |
| 19 | S016 | Stay | Justinbieber | POP | US | justin@gmail.com | 910000 | 29000 | 05DEC2024 | S016_US |
| 20 | S018 | Cheapthrills | Sia | POP | US | sia@gmail.com | 580000 | 18500 | 09MAR2024 | S018_US |
DATA summary_ds;
SET songs_nodup;
BY REGION_CODE GENRE;
RETAIN TOTAL_ROYALTY TOTAL_SONGS;
IF FIRST.GENRE THEN DO;
TOTAL_ROYALTY=0;
TOTAL_SONGS=0;
END;
TOTAL_ROYALTY + ROYALTY_AMOUNT;
TOTAL_SONGS + 1;
IF LAST.GENRE;
RUN;
PROC PRINT DATA=summary_ds;
RUN;
OUTPUT:
| Obs | SONG_ID | SONG_NAME | ARTIST_NAME | GENRE | REGION_CODE | ARTIST_EMAIL | STREAM_COUNT | ROYALTY_AMOUNT | RELEASE_DATE | SONG_KEY | TOTAL_ROYALTY | TOTAL_SONGS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S011 | Faded | Alanwalker | EDM | APAC | alan@gmail.com | . | 19000 | 05AUG2024 | S011_APAC | 44000 | 2 |
| 2 | S020 | Heatwaves | Glassanimals | INDIE | APAC | glass@gmail.com | 730000 | 22500 | 25AUG2024 | S020_APAC | 22500 | 1 |
| 3 | S019 | Radioactive | Imaginedragons | ROCK | EU | imagine@gmail.com | 620000 | 19500 | 14APR2024 | S019_EU | 74500 | 4 |
| 4 | S007 | Despacito | Luisfonsi | LATIN | LATAM | luis@gmail.com | 920000 | 30000 | 10MAY2024 | S007_LATAM | 30000 | 1 |
| 5 | S017 | Rollinginthedeep | Adele | SOUL | UK | adele@gmail.com | 610000 | 21000 | 18SEP2024 | S017_UK | 34000 | 3 |
| 6 | S013 | Badguy | Billieeilish | ALTERNATIVE | US | invalid@email.com | 430000 | 16000 | 10OCT2024 | S013_US | 16000 | 1 |
| 7 | S018 | Cheapthrills | Sia | POP | US | sia@gmail.com | 580000 | 18500 | 09MAR2024 | S018_US | 159000 | 8 |
Explanation and Key Points
PROC SQL
is concise and intuitive for aggregation logic. DATA step processing provides
deeper control, especially for FIRST./LAST. group processing. Enterprise SAS
developers often prefer DATA step for highly customized derivations.
6.Creating songs_raw Dataset in R
library(tidyverse)
songs_raw <- read.delim(
text = "
S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|us|ed@gmail.com
S001|ShapeOfYou|EdSheeran|POP|2024-01-15|500000|15000|US|ed@gmail.com
S002| BlindingLights |TheWeeknd| pop |2024/02/10|700000|-2500|usa|weekndgmail.com
S003|NULL|Adele|Soul|2024-13-01|450000|12000|UK|adele@gmail
S004|Believer|ImagineDragons|ROCK|2024-03-12|.|18000|EU|imagine@music.com
S005|Perfect|EdSheeran|P0P|2024-04-22|-90000|22000|US|ed@@gmail.com
S006|Halo|Beyonce|Pop| |850000|27000|Us|beyonce@gmail.com
S007|Despacito|LuisFonsi|Latin|2024-05-10|920000|30000|LATAM|luis@gmail.com
S008|Thunder|ImagineDragons|rock|2024-05-40|640000|17000|EU|wrongmail
S009|SomeoneLikeYou|Adele|SOUL|2024-06-01|730000|-1000|UK|adele@gmail.com
S010|Closer|Chainsmokers|EDM|2024-07-11|810000|25000|APAC|chain@gmail.com
S011|Faded|AlanWalker|edm|2024-08-05|NULL|19000|AP|alan@gmail.com
S012|Levitating|DuaLipa|POP|2024-09-15|620000|21000|USA|dua@gmail.com
S013|BadGuy|BillieEilish|Alternative|2024-10-10|430000|16000|US|billiegmail.com
S014|Senorita|ShawnMendes|Pop|invalid_date|780000|24000|US|shawn@gmail.com
S015|CountingStars|OneRepublic|ROCK|2024-11-21|560000|20000|EUROPE|one@gmail.com
S016|Stay|JustinBieber|Pop|2024-12-05|910000|29000|US|justin@gmail.com
S017|RollingInTheDeep|Adele|Soul|2024-09-18|610000|21000|UK|adele@gmail.com
S018|CheapThrills|Sia|Pop|2024-03-09|580000|18500|US|sia@gmail.com
S019|Radioactive|ImagineDragons|Rock|2024-04-14|620000|19500|EU|imagine@gmail.com
S020|HeatWaves|GlassAnimals|Indie|2024-08-25|730000|22500|APAC|glass@gmail.com
",
sep = "|",
header = FALSE,
stringsAsFactors = FALSE,
na.strings = c("", "NULL", ".")
)
OUTPUT:
|
|
V1 |
V2 |
V3 |
V4 |
V5 |
V6 |
V7 |
V8 |
V9 |
|
1 |
S001 |
ShapeOfYou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
us |
ed@gmail.com |
|
2 |
S001 |
ShapeOfYou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
US |
ed@gmail.com |
|
3 |
S002 |
BlindingLights |
TheWeeknd |
pop |
10-02-2024 |
700000 |
-2500 |
usa |
weekndgmail.com |
|
4 |
S003 |
NA |
Adele |
Soul |
2024-13-01 |
450000 |
12000 |
UK |
adele@gmail |
|
5 |
S004 |
Believer |
ImagineDragons |
ROCK |
12-03-2024 |
NA |
18000 |
EU |
imagine@music.com |
|
6 |
S005 |
Perfect |
EdSheeran |
P0P |
22-04-2024 |
-90000 |
22000 |
US |
ed@@gmail.com |
|
7 |
S006 |
Halo |
Beyonce |
Pop |
|
850000 |
27000 |
Us |
beyonce@gmail.com |
|
8 |
S007 |
Despacito |
LuisFonsi |
Latin |
10-05-2024 |
920000 |
30000 |
LATAM |
luis@gmail.com |
|
9 |
S008 |
Thunder |
ImagineDragons |
rock |
2024-05-40 |
640000 |
17000 |
EU |
wrongmail |
|
10 |
S009 |
SomeoneLikeYou |
Adele |
SOUL |
01-06-2024 |
730000 |
-1000 |
UK |
adele@gmail.com |
|
11 |
S010 |
Closer |
Chainsmokers |
EDM |
11-07-2024 |
810000 |
25000 |
APAC |
chain@gmail.com |
|
12 |
S011 |
Faded |
AlanWalker |
edm |
05-08-2024 |
NA |
19000 |
AP |
alan@gmail.com |
|
13 |
S012 |
Levitating |
DuaLipa |
POP |
15-09-2024 |
620000 |
21000 |
USA |
dua@gmail.com |
|
14 |
S013 |
BadGuy |
BillieEilish |
Alternative |
10-10-2024 |
430000 |
16000 |
US |
billiegmail.com |
|
15 |
S014 |
Senorita |
ShawnMendes |
Pop |
invalid_date |
780000 |
24000 |
US |
shawn@gmail.com |
|
16 |
S015 |
CountingStars |
OneRepublic |
ROCK |
21-11-2024 |
560000 |
20000 |
EUROPE |
one@gmail.com |
|
17 |
S016 |
Stay |
JustinBieber |
Pop |
05-12-2024 |
910000 |
29000 |
US |
justin@gmail.com |
|
18 |
S017 |
RollingInTheDeep |
Adele |
Soul |
18-09-2024 |
610000 |
21000 |
UK |
adele@gmail.com |
|
19 |
S018 |
CheapThrills |
Sia |
Pop |
09-03-2024 |
580000 |
18500 |
US |
sia@gmail.com |
|
20 |
S019 |
Radioactive |
ImagineDragons |
Rock |
14-04-2024 |
620000 |
19500 |
EU |
imagine@gmail.com |
|
21 |
S020 |
HeatWaves |
GlassAnimals |
Indie |
25-08-2024 |
730000 |
22500 |
APAC |
glass@gmail.com |
#========================================================
# Assign Column Names
#========================================================
colnames(songs_raw) <- c(
"SONG_ID",
"SONG_NAME",
"ARTIST_NAME",
"GENRE",
"RELEASE_DATE",
"STREAM_COUNT",
"ROYALTY_AMOUNT",
"REGION_CODE",
"ARTIST_EMAIL"
)
OUTPUT:
|
|
SONG_ID |
SONG_NAME |
ARTIST_NAME |
GENRE |
RELEASE_DATE |
STREAM_COUNT |
ROYALTY_AMOUNT |
REGION_CODE |
ARTIST_EMAIL |
|
1 |
S001 |
ShapeOfYou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
us |
ed@gmail.com |
|
2 |
S001 |
ShapeOfYou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
US |
ed@gmail.com |
|
3 |
S002 |
BlindingLights |
TheWeeknd |
pop |
10-02-2024 |
700000 |
-2500 |
usa |
weekndgmail.com |
|
4 |
S003 |
NA |
Adele |
Soul |
2024-13-01 |
450000 |
12000 |
UK |
adele@gmail |
|
5 |
S004 |
Believer |
ImagineDragons |
ROCK |
12-03-2024 |
NA |
18000 |
EU |
imagine@music.com |
|
6 |
S005 |
Perfect |
EdSheeran |
P0P |
22-04-2024 |
-90000 |
22000 |
US |
ed@@gmail.com |
|
7 |
S006 |
Halo |
Beyonce |
Pop |
|
850000 |
27000 |
Us |
beyonce@gmail.com |
|
8 |
S007 |
Despacito |
LuisFonsi |
Latin |
10-05-2024 |
920000 |
30000 |
LATAM |
luis@gmail.com |
|
9 |
S008 |
Thunder |
ImagineDragons |
rock |
2024-05-40 |
640000 |
17000 |
EU |
wrongmail |
|
10 |
S009 |
SomeoneLikeYou |
Adele |
SOUL |
01-06-2024 |
730000 |
-1000 |
UK |
adele@gmail.com |
|
11 |
S010 |
Closer |
Chainsmokers |
EDM |
11-07-2024 |
810000 |
25000 |
APAC |
chain@gmail.com |
|
12 |
S011 |
Faded |
AlanWalker |
edm |
05-08-2024 |
NA |
19000 |
AP |
alan@gmail.com |
|
13 |
S012 |
Levitating |
DuaLipa |
POP |
15-09-2024 |
620000 |
21000 |
USA |
dua@gmail.com |
|
14 |
S013 |
BadGuy |
BillieEilish |
Alternative |
10-10-2024 |
430000 |
16000 |
US |
billiegmail.com |
|
15 |
S014 |
Senorita |
ShawnMendes |
Pop |
invalid_date |
780000 |
24000 |
US |
shawn@gmail.com |
|
16 |
S015 |
CountingStars |
OneRepublic |
ROCK |
21-11-2024 |
560000 |
20000 |
EUROPE |
one@gmail.com |
|
17 |
S016 |
Stay |
JustinBieber |
Pop |
05-12-2024 |
910000 |
29000 |
US |
justin@gmail.com |
|
18 |
S017 |
RollingInTheDeep |
Adele |
Soul |
18-09-2024 |
610000 |
21000 |
UK |
adele@gmail.com |
|
19 |
S018 |
CheapThrills |
Sia |
Pop |
09-03-2024 |
580000 |
18500 |
US |
sia@gmail.com |
|
20 |
S019 |
Radioactive |
ImagineDragons |
Rock |
14-04-2024 |
620000 |
19500 |
EU |
imagine@gmail.com |
|
21 |
S020 |
HeatWaves |
GlassAnimals |
Indie |
25-08-2024 |
730000 |
22500 |
APAC |
glass@gmail.com |
#========================================================
# Convert Numeric Variables
#========================================================
songs_raw$STREAM_COUNT <- as.numeric(songs_raw$STREAM_COUNT)
songs_raw$ROYALTY_AMOUNT <- as.numeric(songs_raw$ROYALTY_AMOUNT)
OUTPUT:
|
|
SONG_ID |
SONG_NAME |
ARTIST_NAME |
GENRE |
RELEASE_DATE |
STREAM_COUNT |
ROYALTY_AMOUNT |
REGION_CODE |
ARTIST_EMAIL |
|
1 |
S001 |
ShapeOfYou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
us |
ed@gmail.com |
|
2 |
S001 |
ShapeOfYou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
US |
ed@gmail.com |
|
3 |
S002 |
BlindingLights |
TheWeeknd |
pop |
10-02-2024 |
700000 |
-2500 |
usa |
weekndgmail.com |
|
4 |
S003 |
NA |
Adele |
Soul |
2024-13-01 |
450000 |
12000 |
UK |
adele@gmail |
|
5 |
S004 |
Believer |
ImagineDragons |
ROCK |
12-03-2024 |
NA |
18000 |
EU |
imagine@music.com |
|
6 |
S005 |
Perfect |
EdSheeran |
P0P |
22-04-2024 |
-90000 |
22000 |
US |
ed@@gmail.com |
|
7 |
S006 |
Halo |
Beyonce |
Pop |
|
850000 |
27000 |
Us |
beyonce@gmail.com |
|
8 |
S007 |
Despacito |
LuisFonsi |
Latin |
10-05-2024 |
920000 |
30000 |
LATAM |
luis@gmail.com |
|
9 |
S008 |
Thunder |
ImagineDragons |
rock |
2024-05-40 |
640000 |
17000 |
EU |
wrongmail |
|
10 |
S009 |
SomeoneLikeYou |
Adele |
SOUL |
01-06-2024 |
730000 |
-1000 |
UK |
adele@gmail.com |
|
11 |
S010 |
Closer |
Chainsmokers |
EDM |
11-07-2024 |
810000 |
25000 |
APAC |
chain@gmail.com |
|
12 |
S011 |
Faded |
AlanWalker |
edm |
05-08-2024 |
NA |
19000 |
AP |
alan@gmail.com |
|
13 |
S012 |
Levitating |
DuaLipa |
POP |
15-09-2024 |
620000 |
21000 |
USA |
dua@gmail.com |
|
14 |
S013 |
BadGuy |
BillieEilish |
Alternative |
10-10-2024 |
430000 |
16000 |
US |
billiegmail.com |
|
15 |
S014 |
Senorita |
ShawnMendes |
Pop |
invalid_date |
780000 |
24000 |
US |
shawn@gmail.com |
|
16 |
S015 |
CountingStars |
OneRepublic |
ROCK |
21-11-2024 |
560000 |
20000 |
EUROPE |
one@gmail.com |
|
17 |
S016 |
Stay |
JustinBieber |
Pop |
05-12-2024 |
910000 |
29000 |
US |
justin@gmail.com |
|
18 |
S017 |
RollingInTheDeep |
Adele |
Soul |
18-09-2024 |
610000 |
21000 |
UK |
adele@gmail.com |
|
19 |
S018 |
CheapThrills |
Sia |
Pop |
09-03-2024 |
580000 |
18500 |
US |
sia@gmail.com |
|
20 |
S019 |
Radioactive |
ImagineDragons |
Rock |
14-04-2024 |
620000 |
19500 |
EU |
imagine@gmail.com |
|
21 |
S020 |
HeatWaves |
GlassAnimals |
Indie |
25-08-2024 |
730000 |
22500 |
APAC |
glass@gmail.com |
Key Explanation
This R code is the equivalent of the SAS DATALINES
approach. Instead of using INFILE
DATALINES, R uses read.delim()
with the text= argument to
directly read inline raw data.
Important
Enterprise-Level Points
· sep="|" works
like DLM='|' in SAS.
· na.strings=c("",
"NULL", ".") converts corrupted values into
proper missing values (NA)
similar to SAS missing handling.
· stringsAsFactors=FALSE
prevents automatic factor conversion, which is important for data cleaning
projects.
· as.numeric() converts
mixed character numeric variables into numeric types.
· Invalid
values like "NULL"
and "."
become NA.
· This
dataset intentionally contains:
o
Duplicate SONG_IDs
o
Invalid dates
o
Negative royalties
o
Mixed case text
o
Corrupted emails
o
Missing values
o
Inconsistent region codes
These issues simulate real-world enterprise data quality problems commonly
found in:
· Clinical
trial systems
· Banking
transactions
· Insurance
claims
· Retail
analytics
· Streaming
platforms
This raw dataset is now ready for advanced cleaning using:
· dplyr
· stringr
· lubridate
· janitor
· tidyr
7.R Data Cleaning Layer
library(tidyverse)
library(lubridate)
library(janitor)
songs_clean <- songs_raw %>%
clean_names() %>%
mutate(
song_name = str_to_title(str_trim(song_name)),
genre = str_to_upper(str_trim(genre)),
genre = str_replace_all(genre,"P0P","POP"),
stream_count = abs(as.numeric(stream_count)),
royalty_amount = abs(royalty_amount),
release_date = str_trim(release_date),
release_date = na_if(release_date, ""),
release_date = if_else(release_date == "invalid_date",
NA_character_,release_date),
release_date = parse_date_time(release_date,
orders = c("ymd","y/m/d"),quiet = TRUE),
region_code = case_when(
region_code=="usa" ~ "US",
region_code=="AP" ~ "APAC",
TRUE ~ str_to_upper(region_code)),
artist_email = if_else(grepl("@",artist_email),
artist_email,"invalid@email.com")
) %>%
distinct(song_id,.keep_all=TRUE)
OUTPUT:
|
|
song_id |
song_name |
artist_name |
genre |
release_date |
stream_count |
royalty_amount |
region_code |
artist_email |
|
1 |
S001 |
Shapeofyou |
EdSheeran |
POP |
15-01-2024 |
500000 |
15000 |
US |
ed@gmail.com |
|
2 |
S002 |
Blindinglights |
TheWeeknd |
POP |
10-02-2024 |
700000 |
2500 |
US |
invalid@email.com |
|
3 |
S003 |
NA |
Adele |
SOUL |
NA |
450000 |
12000 |
UK |
adele@gmail |
|
4 |
S004 |
Believer |
ImagineDragons |
ROCK |
12-03-2024 |
NA |
18000 |
EU |
imagine@music.com |
|
5 |
S005 |
Perfect |
EdSheeran |
POP |
22-04-2024 |
90000 |
22000 |
US |
ed@@gmail.com |
|
6 |
S006 |
Halo |
Beyonce |
POP |
NA |
850000 |
27000 |
US |
beyonce@gmail.com |
|
7 |
S007 |
Despacito |
LuisFonsi |
LATIN |
10-05-2024 |
920000 |
30000 |
LATAM |
luis@gmail.com |
|
8 |
S008 |
Thunder |
ImagineDragons |
ROCK |
NA |
640000 |
17000 |
EU |
invalid@email.com |
|
9 |
S009 |
Someonelikeyou |
Adele |
SOUL |
01-06-2024 |
730000 |
1000 |
UK |
adele@gmail.com |
|
10 |
S010 |
Closer |
Chainsmokers |
EDM |
11-07-2024 |
810000 |
25000 |
APAC |
chain@gmail.com |
|
11 |
S011 |
Faded |
AlanWalker |
EDM |
05-08-2024 |
NA |
19000 |
APAC |
alan@gmail.com |
|
12 |
S012 |
Levitating |
DuaLipa |
POP |
15-09-2024 |
620000 |
21000 |
USA |
dua@gmail.com |
|
13 |
S013 |
Badguy |
BillieEilish |
ALTERNATIVE |
10-10-2024 |
430000 |
16000 |
US |
invalid@email.com |
|
14 |
S014 |
Senorita |
ShawnMendes |
POP |
NA |
780000 |
24000 |
US |
shawn@gmail.com |
|
15 |
S015 |
Countingstars |
OneRepublic |
ROCK |
21-11-2024 |
560000 |
20000 |
EUROPE |
one@gmail.com |
|
16 |
S016 |
Stay |
JustinBieber |
POP |
05-12-2024 |
910000 |
29000 |
US |
justin@gmail.com |
|
17 |
S017 |
Rollinginthedeep |
Adele |
SOUL |
18-09-2024 |
610000 |
21000 |
UK |
adele@gmail.com |
|
18 |
S018 |
Cheapthrills |
Sia |
POP |
09-03-2024 |
580000 |
18500 |
US |
sia@gmail.com |
|
19 |
S019 |
Radioactive |
ImagineDragons |
ROCK |
14-04-2024 |
620000 |
19500 |
EU |
imagine@gmail.com |
|
20 |
S020 |
Heatwaves |
GlassAnimals |
INDIE |
25-08-2024 |
730000 |
22500 |
APAC |
glass@gmail.com |
Explanation and Key Points
This R
workflow mirrors SAS transformations using tidyverse syntax. mutate() behaves
similarly to DATA step assignment logic. case_when() resembles SELECT-WHEN. distinct()
matches PROC SORT NODUPKEY.
parse_date_time()
from lubridate handles mixed date formats more flexibly than standard SAS
informats.
R
provides superior exploratory flexibility, while SAS offers stronger audit
traceability and regulatory compliance.
Why quiet = TRUE is
Important
quiet = TRUE
suppresses
warning messages.
Equivalent
to SAS:
?? informat.
Very
common in enterprise ETL workflows.
Clinical SAS vs R
Comparison
|
SAS |
R |
|
INPUT(date,
?? yymmdd10.) |
parse_date_time(...,
quiet=TRUE) |
|
Invalid
dates become . |
Invalid
dates become NA |
|
Log
warning generated |
Warning
generated |
|
?? suppresses
warning |
quiet=TRUE
suppresses warning |
Enterprise Validation &
Compliance
In
clinical trials:
- SDTM datasets standardize
raw collection data
- ADaM datasets support
statistical analysis
- QC programmers independently
validate derivations
- Audit trails track every
transformation
- Traceability ensures
reproducibility
One
dangerous SAS behavior:
IF lab_value < 5 THEN
FLAG='Y';
Missing
numeric values in SAS are treated as lower than valid numbers. Therefore,
missing values accidentally satisfy <5.
Correct
approach:
IF NOT MISSING(lab_value) AND
lab_value < 5 THEN FLAG='Y';
This
prevents catastrophic analytical errors in regulatory environments.
20 Data Cleaning Best
Practices
- Always validate duplicates
before merging
- Standardize categorical
variables
- Use metadata-driven
programming
- Create reusable macros
- Never overwrite raw datasets
- Maintain audit trails
- Validate missing-value logic
- Use defensive programming
- Standardize date formats
- Apply QC independence
- Document derivation logic
- Validate variable lengths
- Avoid hardcoding
- Use controlled terminology
- Validate ranges
systematically
- Build reproducible workflows
- Track data lineage
- Separate raw and clean
layers
- Validate merge cardinality
- Automate compliance checks
Business Logic Explanation
Enterprise
data cleaning exists to ensure analytical reliability. Missing values are
imputed because incomplete records distort statistical models, forecasting
systems, and executive dashboards. For example, if patient visit dates are
missing in clinical trials, treatment duration calculations become inaccurate.
In music analytics, missing release dates disrupt trend analysis.
Unrealistic
values are corrected because downstream algorithms assume valid business logic.
A negative royalty amount implies impossible financial behavior. Similarly, a
patient age of 250 years invalidates demographic summaries.
Text
normalization ensures consistency across systems. "POP", "pop",
and " Pop " represent the same category but appear different
computationally. Without normalization, aggregation results become fragmented.
Date
standardization is critical for time-series analytics, regulatory reporting,
and AI modeling. Mixed date formats often cause parsing failures during ETL
processing.
Missing
numeric handling is especially important in SAS because missing numeric values
are treated differently than NULLs in SQL databases.
These
corrections collectively transform unreliable operational data into
analysis-ready intelligence capable of supporting dashboards, machine learning,
regulatory submissions, and strategic business decisions.
20 Sharp One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Duplicate IDs silently
corrupt analytics.
- Missing dates distort trend
intelligence.
- Audit trails protect
regulatory credibility.
- Character truncation causes
hidden data loss.
- SAS excels in governed
environments.
- R dominates exploratory
transformation flexibility.
- Metadata-driven programming
reduces risk.
- PROC SQL simplifies
aggregation logic.
- DATA step offers granular
control.
- Defensive programming
prevents production failures.
- Controlled terminology
improves reporting consistency.
- QC independence strengthens
analytical trust.
- Missing-value handling can
change outcomes.
- Production code requires
scalability.
- Clean data improves AI
predictions.
- Traceability is essential
for compliance.
- Reliable analytics begins
with disciplined cleaning.
SAS vs R Comparison Table
|
Feature |
SAS |
R |
|
Regulatory
Compliance |
Excellent |
Moderate |
|
Audit
Trails |
Strong |
Limited |
|
Flexibility |
Moderate |
Excellent |
|
Statistical
Libraries |
Excellent |
Excellent |
|
Enterprise
Governance |
Strong |
Moderate |
|
Visualization |
Moderate |
Excellent |
|
Scalability |
Excellent |
Good |
|
Learning
Curve |
Moderate |
Moderate |
|
Metadata
Integration |
Strong |
Moderate |
|
Open
Source |
No |
Yes |
Validation Checklist
1.
Duplicate check completed
2.Missing-value validation completed
3.Invalid dates corrected
4.Text normalization completed
5.Range validation completed
6.Email validation completed
7.QC logic verified
8.Metadata reviewed
9.Audit trail documented
10.Analysis-ready dataset generated
Summary
SAS and R
together create one of the strongest enterprise analytics ecosystems available
today. SAS provides governed, scalable, audit-ready processing ideal for
regulated industries such as pharmaceuticals, banking, and insurance. Features
like DATA step programming, FIRST./LAST. processing, PROC SQL, and
metadata-driven validation frameworks make SAS exceptionally powerful for
production-grade workflows.
R
complements SAS by offering highly flexible transformation libraries, rapid
exploratory analytics, and modern data-engineering frameworks. Packages such as
tidyverse, lubridate, janitor, and stringr simplify complex transformations
while enabling interactive analytical workflows.
In
regulated clinical environments, SAS remains dominant because of traceability,
validation standards, and regulatory acceptance. However, R increasingly
supports exploratory analytics, machine learning, and visualization layers
integrated into enterprise pipelines.
This
project demonstrated how corrupted global famous-song datasets resemble
real-world healthcare and financial data-quality failures. Problems such as
duplicate IDs, malformed emails, inconsistent categories, negative numeric
values, and invalid dates can severely impact business intelligence and
statistical reliability.
Using SAS
PROC SQL, DATA step programming, and modern R workflows, we converted
unreliable operational records into clean analytical intelligence. The result
is an analysis-ready dataset capable of supporting dashboards, forecasting
models, compliance reporting, and executive decision-making.
Ultimately,
enterprise analytics success depends not on sophisticated AI algorithms alone,
but on disciplined data-engineering foundations built through validation,
standardization, reproducibility, and robust cleaning frameworks.
Conclusion
Modern
organizations generate enormous volumes of operational data every second. Yet
raw data alone has little value unless it can be trusted. Whether the domain is
healthcare, clinical trials, retail, banking, insurance, or global music
streaming analytics, poor-quality data introduces dangerous analytical
uncertainty.
This
project demonstrated how corrupted famous-song datasets mirror real-world
enterprise data failures. Duplicate records inflate metrics. Missing dates
break trend analysis. Negative financial values distort revenue forecasting.
Inconsistent text labels damage segmentation models. Malformed emails disrupt
communication workflows. Invalid timestamps create reporting inaccuracies.
In
regulated industries like pharmaceuticals, such failures become even more
critical. Incorrect SDTM or ADaM derivations can delay regulatory approvals and
compromise patient safety analysis. Therefore, structured data-cleaning
frameworks are not optional they are foundational to enterprise trust.
SAS
provides exceptional strength in governed environments because of its
auditability, traceability, validation capabilities, and scalable DATA step
architecture. Features like FIRST./LAST. processing, PROC SQL, macros, arrays,
and controlled metadata frameworks make SAS ideal for production analytics.
R
complements this ecosystem by enabling highly flexible transformations, modern
data engineering, exploratory analytics, and advanced visualization. Together,
SAS and R form a hybrid analytical ecosystem capable of delivering scalable,
reproducible, and trustworthy intelligence.
The most
successful enterprise analytics teams are not merely coders they are
data-quality engineers, validation specialists, compliance thinkers, and
business translators. Their goal is not simply to clean data, but to transform
unreliable operational chaos into analytical clarity capable of supporting
executive decisions, AI systems, statistical outputs, and regulatory
submissions.
Ultimately,
trustworthy analytics begins with disciplined cleaning, standardized logic,
reproducible workflows, and rigorous validation culture. Clean data is not a
technical luxury it is the backbone of modern business intelligence.
Interview Questions and
Answers
1. How would you identify duplicate records in SAS?
Answer:
I would use PROC SORT NODUPKEY or PROC SQL GROUP BY HAVING COUNT(*) > 1. In
regulated environments, I would also document duplicate-removal logic for audit
traceability.
2. Why is missing-value handling dangerous in SAS?
Answer:
Missing numeric values are treated as smaller than valid numbers. Therefore:
IF value < 5;
incorrectly
includes missing values unless explicitly checked using:
IF NOT MISSING(value) AND value
< 5;
3. When would you prefer DATA step over PROC SQL?
Answer:
I prefer DATA step for row-wise transformations, retained calculations,
FIRST./LAST. logic, arrays, and complex conditional processing. PROC SQL is
better for joins and aggregation.
4. How do SAS and R complement each other in
enterprise analytics?
Answer:
SAS provides governance, validation, and regulatory traceability, while R
offers flexible exploratory analytics and modern transformation libraries.
Together they create scalable analytical ecosystems.
5. Describe a real-world validation issue you might
face in clinical trial data.
Answer:
A common issue is duplicate subject enrollment causing inflated treatment
populations. Another is invalid visit dates affecting treatment duration
calculations. Both require validation checks, QC programming, and audit
documentation before submission.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent SONGS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and Exams Reviewers and Observers
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment