Global Villains Dataset Reimagined with SAS DATA Step, PROC SQL and R
Transforming the World’s Most Famous Villains Dataset into Professional Analytical Intelligence with PROC FORMAT, PROC SQL, DATA Step, SAS and R
1. Introduction — When
Dirty Data Becomes the Real Villain
Imagine a global entertainment analytics company preparing a billion-dollar streaming recommendation engine. Their data warehouse stores information about the world’s most famous fictional villains — characters from movies, comics, anime, games, mythology, and literature. Executives want dashboards showing villain popularity, franchise earnings, danger ratings, and audience engagement scores.
But the
raw dataset is disastrous.
Some
villain names are in lowercase. Others contain "NULL" values. Ages
are negative. Release dates are invalid. Duplicate entries exist for iconic
villains like Joker and Darth Vader. Revenue values are inconsistent across
regions.
One
analyst mistakenly treats missing revenue as zero. Another leaves duplicate
records untouched. The reporting team generates executive dashboards directly
from the raw data. Suddenly:
- Revenue predictions become
inaccurate
- Popularity rankings collapse
- AI recommendation systems
fail
- Business decisions become
unreliable
This is
exactly why structured data cleaning matters.
In modern
analytics ecosystems, raw data is never production-ready. SAS and R dominate
enterprise-grade data engineering because they provide scalable frameworks for:
- Cleaning dirty datasets
- Standardizing inconsistent
values
- Applying business rules
- Formatting professional
reports
- Creating
regulatory-compliant outputs
In this
project, we will build a Most Famous Villains Dataset and intentionally
inject real-world data quality problems. Then we will clean, validate, format,
standardize, and professionally present the data using:
- SAS DATA Step
- PROC SQL
- PROC FORMAT
- PROC SORT
- PROC REPORT
- R tidyverse workflows
The focus
is not just coding it is understanding
the business logic behind professional analytics engineering.
2. Raw Data Creation in SAS
and R
Raw Dataset Variables (9 Variables)
|
Variable |
Description |
|
Villain_ID |
Unique
identifier |
|
Villain_Name |
Famous
villain |
|
Universe |
Franchise/Universe |
|
Country |
Origin
country |
|
Age |
Villain
age |
|
Power_Level |
Threat
score |
|
Revenue_Million |
Franchise
revenue |
|
Release_Date |
First
appearance |
|
Status |
Active/Inactive |
SAS Raw Dataset Creation
data villains_raw;
length Villain_Name $30 Universe $25 Country $20
Status $12;
informat Release_Date ddmmyy10.;
format Release_Date date9.;
input Villain_ID Villain_Name $ Universe $ Country $
Age Power_Level Revenue_Million Release_Date :?? ddmmyy10.
Status $;
datalines;
101 Joker DC_USA usa 45 99 1500 12/05/2008 Active
102 DarthVader StarWars USA -65 100 2200 25/05/1977 ACTIVE
103 Loki Marvel norway 1500 92 . 11/07/2011 Active
104 Voldemort HarryPotter UK 71 98 1800 15/07/2007 inactive
105 NULL Marvel USA 55 85 1600 01/01/2015 Active
106 Thanos marvel USA . 100 3000 27/04/2018 ACTIVE
106 Thanos marvel USA . 100 3000 27/04/2018 ACTIVE
107 Hannibal NBC USA 49 77 -100 20/04/2013 Active
108 Sauron LOTR NULL 3000 99 2500 19/12/2001 active
109 Dracula Horror Romania 600 89 900 99/99/2020 Active
110 Magneto Marvel USA 65 95 1700 14/07/2000 Active
111 Scar Disney kenya -20 80 400 15/06/1994 inactive
;
run;
proc print data = villains_raw;
run;
OUTPUT:
| Obs | Villain_Name | Universe | Country | Status | Release_Date | Villain_ID | Age | Power_Level | Revenue_Million |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Joker | DC_USA | usa | Active | 12MAY2008 | 101 | 45 | 99 | 1500 |
| 2 | DarthVader | StarWars | USA | ACTIVE | 25MAY1977 | 102 | -65 | 100 | 2200 |
| 3 | Loki | Marvel | norway | Active | 11JUL2011 | 103 | 1500 | 92 | . |
| 4 | Voldemort | HarryPotter | UK | inactive | 15JUL2007 | 104 | 71 | 98 | 1800 |
| 5 | NULL | Marvel | USA | Active | 01JAN2015 | 105 | 55 | 85 | 1600 |
| 6 | Thanos | marvel | USA | ACTIVE | 27APR2018 | 106 | . | 100 | 3000 |
| 7 | Thanos | marvel | USA | ACTIVE | 27APR2018 | 106 | . | 100 | 3000 |
| 8 | Hannibal | NBC | USA | Active | 20APR2013 | 107 | 49 | 77 | -100 |
| 9 | Sauron | LOTR | NULL | active | 19DEC2001 | 108 | 3000 | 99 | 2500 |
| 10 | Dracula | Horror | Romania | Active | . | 109 | 600 | 89 | 900 |
| 11 | Magneto | Marvel | USA | Active | 14JUL2000 | 110 | 65 | 95 | 1700 |
| 12 | Scar | Disney | kenya | inactive | 15JUN1994 | 111 | -20 | 80 | 400 |
Explanation and Key Points
This raw
dataset intentionally contains multiple enterprise-level data quality issues.
We inserted:
- Missing revenue values
- Invalid negative ages
- Duplicate records
- Invalid dates
- Mixed case inconsistencies
- NULL strings
- Negative revenue values
The LENGTH
statement is critical because SAS assigns variable lengths during compilation.
If lengths are not defined early, truncation may occur. The INFORMAT statement
ensures proper reading of raw dates, while FORMAT controls reporting
presentation.
This
mirrors real production environments where incoming flat files are often
inconsistent and unreliable.
|
Modifier |
Meaning |
|
: |
Modified
list input |
|
? |
Suppresses
invalid data message |
|
?? |
Suppresses
invalid data message AND prevents _ERROR_=1 |
R Code – Equivalent Raw Dataset
villains_raw <- data.frame(
Villain_ID = c(101,102,103,104,105,106,106,107,108,109,110,111),
Villain_Name = c("Joker","DarthVader","Loki","Voldemort","NULL","Thanos",
"Thanos","Hannibal","Sauron","Dracula","Magneto",
"Scar"),
Universe = c("DC_USA","StarWars","Marvel","HarryPotter","Marvel","marvel",
"marvel","NBC", "LOTR","Horror","Marvel","Disney"),
Country = c("usa","USA","norway","UK","USA","USA",
"USA","USA","NULL","Romania","USA","kenya"),
Age = c(45,-65,1500,71,55,NA,NA,49,3000,600,65,-20),
Power_Level = c(99,100,92,98,85,100,100,77,99,89,95,80),
Revenue_Million = c(1500,2200,NA,1800,1600,3000,
3000,-100,2500,900,1700,400),
Release_Date = c("12/05/2008","25/05/1977","11/07/2011",
"15/07/2007","01/01/2015","27/04/2018",
"27/04/2018","20/04/2013","19/12/2001",
"99/99/2020","14/07/2000","15/06/1994"),
Status = c("Active","ACTIVE","Active","inactive",
"Active","ACTIVE","ACTIVE","Active",
"active","Active","Active","inactive")
)
OUTPUT:
|
|
Villain_ID |
Villain_Name |
Universe |
Country |
Age |
Power_Level |
Revenue_Million |
Release_Date |
Status |
|
1 |
101 |
Joker |
DC_USA |
usa |
45 |
99 |
1500 |
12-05-2008 |
Active |
|
2 |
102 |
DarthVader |
StarWars |
USA |
-65 |
100 |
2200 |
25-05-1977 |
ACTIVE |
|
3 |
103 |
Loki |
Marvel |
norway |
1500 |
92 |
NA |
11-07-2011 |
Active |
|
4 |
104 |
Voldemort |
HarryPotter |
UK |
71 |
98 |
1800 |
15-07-2007 |
inactive |
|
5 |
105 |
NULL |
Marvel |
USA |
55 |
85 |
1600 |
01-01-2015 |
Active |
|
6 |
106 |
Thanos |
marvel |
USA |
NA |
100 |
3000 |
27-04-2018 |
ACTIVE |
|
7 |
106 |
Thanos |
marvel |
USA |
NA |
100 |
3000 |
27-04-2018 |
ACTIVE |
|
8 |
107 |
Hannibal |
NBC |
USA |
49 |
77 |
-100 |
20-04-2013 |
Active |
|
9 |
108 |
Sauron |
LOTR |
NULL |
3000 |
99 |
2500 |
19-12-2001 |
active |
|
10 |
109 |
Dracula |
Horror |
Romania |
600 |
89 |
900 |
99/99/2020 |
Active |
|
11 |
110 |
Magneto |
Marvel |
USA |
65 |
95 |
1700 |
14-07-2000 |
Active |
|
12 |
111 |
Scar |
Disney |
kenya |
-20 |
80 |
400 |
15-06-1994 |
inactive |
Explanation and Key Points
The R
dataset mirrors the SAS dataset exactly, enabling cross-platform validation.
Real-world analytics teams frequently use SAS for regulatory workflows and R
for exploratory refinement. This dual-platform consistency ensures
reproducibility.
The data.frame()
function creates structured tabular data. Notice the use of NA values to
simulate missing observations. Mixed capitalization and invalid business values
reflect realistic ingestion problems from APIs, flat files, or manual entry
systems.
3. The SAS Engineering
Layer
DATA STEP Cleaning
data villains_clean;
set villains_raw;
length Universe_Cat $20 Country_Clean $20 Status_Clean $10;
Villain_Name = coalescec(Villain_Name,"UNKNOWN");
if Villain_Name="NULL" then Villain_Name="UNKNOWN";
if not missing(Age) then Age = abs(Age);
if Revenue_Million < 0 then Revenue_Million=0;
Country_Clean = upcase(strip(Country));
if Country_Clean="NULL" then Country_Clean="UNKNOWN";
Status_Clean = propcase(Status);
select(upcase(Universe));
when("MARVEL") Universe_Cat="SUPERHERO";
when("DC_USA") Universe_Cat="COMIC";
when("STARWARS") Universe_Cat="SCI-FI";
otherwise Universe_Cat="FANTASY";
end;
format Revenue_Million comma8.;
drop Country Status;
rename Country_Clean = Country Status_Clean = Status;
run;
proc print data = villains_clean;
run;
OUTPUT:
| Obs | Villain_Name | Universe | Release_Date | Villain_ID | Age | Power_Level | Revenue_Million | Universe_Cat | Country | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Joker | DC_USA | 12MAY2008 | 101 | 45 | 99 | 1,500 | COMIC | USA | Active |
| 2 | DarthVader | StarWars | 25MAY1977 | 102 | 65 | 100 | 2,200 | SCI-FI | USA | Active |
| 3 | Loki | Marvel | 11JUL2011 | 103 | 1500 | 92 | 0 | SUPERHERO | NORWAY | Active |
| 4 | Voldemort | HarryPotter | 15JUL2007 | 104 | 71 | 98 | 1,800 | FANTASY | UK | Inactive |
| 5 | UNKNOWN | Marvel | 01JAN2015 | 105 | 55 | 85 | 1,600 | SUPERHERO | USA | Active |
| 6 | Thanos | marvel | 27APR2018 | 106 | . | 100 | 3,000 | SUPERHERO | USA | Active |
| 7 | Thanos | marvel | 27APR2018 | 106 | . | 100 | 3,000 | SUPERHERO | USA | Active |
| 8 | Hannibal | NBC | 20APR2013 | 107 | 49 | 77 | 0 | FANTASY | USA | Active |
| 9 | Sauron | LOTR | 19DEC2001 | 108 | 3000 | 99 | 2,500 | FANTASY | UNKNOWN | Active |
| 10 | Dracula | Horror | . | 109 | 600 | 89 | 900 | FANTASY | ROMANIA | Active |
| 11 | Magneto | Marvel | 14JUL2000 | 110 | 65 | 95 | 1,700 | SUPERHERO | USA | Active |
| 12 | Scar | Disney | 15JUN1994 | 111 | 20 | 80 | 400 | FANTASY | KENYA | Inactive |
Explanation and Key Points
This DATA
Step demonstrates industrial-strength SAS engineering logic.
Important Functions Used
|
Function |
Purpose |
|
COALESCEC |
Handles
missing character values |
|
ABS |
Converts
negative ages |
|
STRIP |
Removes
spaces |
|
UPCASE |
Standardizes
case |
|
INPUT/PUT |
Converts
dates safely |
|
SELECT-WHEN |
Cleaner
alternative to IF-ELSE |
SELECT-WHEN
becomes more scalable than nested IF statements in large categorical mappings.
The INPUT/PUT
combination is extremely important in enterprise SAS because raw dates often
arrive as inconsistent character strings.
4.PROC SORT Deduplication
proc sort data=villains_clean nodupkey;
by Villain_ID;
run;
proc print data = villains_clean;
run;
LOG:
OUTPUT:
| Obs | Villain_Name | Universe | Release_Date | Villain_ID | Age | Power_Level | Revenue_Million | Universe_Cat | Country | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Joker | DC_USA | 12MAY2008 | 101 | 45 | 99 | 1,500 | COMIC | USA | Active |
| 2 | DarthVader | StarWars | 25MAY1977 | 102 | 65 | 100 | 2,200 | SCI-FI | USA | Active |
| 3 | Loki | Marvel | 11JUL2011 | 103 | 1500 | 92 | 0 | SUPERHERO | NORWAY | Active |
| 4 | Voldemort | HarryPotter | 15JUL2007 | 104 | 71 | 98 | 1,800 | FANTASY | UK | Inactive |
| 5 | UNKNOWN | Marvel | 01JAN2015 | 105 | 55 | 85 | 1,600 | SUPERHERO | USA | Active |
| 6 | Thanos | marvel | 27APR2018 | 106 | . | 100 | 3,000 | SUPERHERO | USA | Active |
| 7 | Hannibal | NBC | 20APR2013 | 107 | 49 | 77 | 0 | FANTASY | USA | Active |
| 8 | Sauron | LOTR | 19DEC2001 | 108 | 3000 | 99 | 2,500 | FANTASY | UNKNOWN | Active |
| 9 | Dracula | Horror | . | 109 | 600 | 89 | 900 | FANTASY | ROMANIA | Active |
| 10 | Magneto | Marvel | 14JUL2000 | 110 | 65 | 95 | 1,700 | SUPERHERO | USA | Active |
| 11 | Scar | Disney | 15JUN1994 | 111 | 20 | 80 | 400 | FANTASY | KENYA | Inactive |
Explanation and Key Points
PROC SORT
NODUPKEY removes duplicate observations based on business keys. In production
environments, duplicates can inflate metrics, distort AI models, and violate
regulatory standards.
Deduplication
should always happen after initial standardization.
5.PROC FORMAT for Professional Outputs
proc format;
value powerfmt 0-79 = 'Moderate Threat'
80-95 = 'High Threat'
96-high = 'Extreme Threat';
LOG:
value $statusfmt 'Active' = 'Operational'
'Inactive' = 'Dormant';
run;
LOG:
Explanation and Key Points
PROC
FORMAT is one of SAS’s most powerful presentation tools. Instead of changing
raw values permanently, formats provide presentation-layer intelligence.
Benefits
include:
- Cleaner executive reports
- Consistent terminology
- Simplified dashboards
- Reusable business standards
This
separation between raw data and presentation logic is critical in regulated
industries.
PROC REPORT Professional Reporting
proc report data=villains_clean nowd;
columns Villain_ID Villain_Name Universe_Cat Power_Level
Status;
define Villain_Name / display;
define Universe_Cat / group;
define Power_Level / display format=powerfmt.;
define Status_Clean / display format=$statusfmt.;
run;
OUTPUT:
| Villain_ID | Villain_Name | Universe_Cat | Power_Level | Status |
|---|---|---|---|---|
| 101 | Joker | COMIC | Extreme Threat | Operational |
| 104 | Voldemort | FANTASY | Extreme Threat | Dormant |
| 107 | Hannibal | Moderate Threat | Operational | |
| 108 | Sauron | Extreme Threat | Operational | |
| 109 | Dracula | High Threat | Operational | |
| 111 | Scar | High Threat | Dormant | |
| 102 | DarthVader | SCI-FI | Extreme Threat | Operational |
| 103 | Loki | SUPERHERO | High Threat | Operational |
| 105 | UNKNOWN | High Threat | Operational | |
| 106 | Thanos | Extreme Threat | Operational | |
| 110 | Magneto | High Threat | Operational |
Explanation and Key Points
PROC
REPORT is designed for enterprise reporting and executive presentation. By
integrating PROC FORMAT, analysts can create professional semantic outputs
without modifying the underlying dataset.
This
improves maintainability and reporting governance.
6. The R Refinement Layer
library(dplyr)
library(stringr)
library(tidyr)
villains_clean <- villains_raw %>%
mutate(Villain_Name = ifelse(Villain_Name=="NULL","UNKNOWN",
Villain_Name),
Age = abs(Age),
Revenue_Million = ifelse(Revenue_Million < 0,0,Revenue_Million),
Country = toupper(trimws(Country)),
Country = ifelse(Country=="NULL","UNKNOWN",Country),
Status = str_to_title(Status),
Universe_Cat = case_when(toupper(Universe)=="MARVEL" ~ "SUPERHERO",
toupper(Universe)=="DC_USA" ~ "COMIC",
toupper(Universe)=="STARWARS" ~ "SCI-FI",
TRUE ~ "FANTASY")
) %>%
distinct(Villain_ID,.keep_all=TRUE)
OUTPUT:
|
|
Villain_ID |
Villain_Name |
Universe |
Country |
Age |
Power_Level |
Revenue_Million |
Release_Date |
Status |
Universe_Cat |
|
1 |
101 |
Joker |
DC_USA |
USA |
45 |
99 |
1500 |
12-05-2008 |
Active |
COMIC |
|
2 |
102 |
DarthVader |
StarWars |
USA |
65 |
100 |
2200 |
25-05-1977 |
Active |
SCI-FI |
|
3 |
103 |
Loki |
Marvel |
NORWAY |
1500 |
92 |
NA |
11-07-2011 |
Active |
SUPERHERO |
|
4 |
104 |
Voldemort |
HarryPotter |
UK |
71 |
98 |
1800 |
15-07-2007 |
Inactive |
FANTASY |
|
5 |
105 |
UNKNOWN |
Marvel |
USA |
55 |
85 |
1600 |
01-01-2015 |
Active |
SUPERHERO |
|
6 |
106 |
Thanos |
marvel |
USA |
NA |
100 |
3000 |
27-04-2018 |
Active |
SUPERHERO |
|
7 |
107 |
Hannibal |
NBC |
USA |
49 |
77 |
0 |
20-04-2013 |
Active |
FANTASY |
|
8 |
108 |
Sauron |
LOTR |
UNKNOWN |
3000 |
99 |
2500 |
19-12-2001 |
Active |
FANTASY |
|
9 |
109 |
Dracula |
Horror |
ROMANIA |
600 |
89 |
900 |
99/99/2020 |
Active |
FANTASY |
|
10 |
110 |
Magneto |
Marvel |
USA |
65 |
95 |
1700 |
14-07-2000 |
Active |
SUPERHERO |
|
11 |
111 |
Scar |
Disney |
KENYA |
20 |
80 |
400 |
15-06-1994 |
Inactive |
FANTASY |
Explanation and Logic Bridge
|
SAS
Concept |
R Equivalent |
|
IF-THEN |
ifelse() |
|
SELECT-WHEN |
case_when() |
|
PROC
SORT NODUPKEY |
distinct() |
|
STRIP |
trimws() |
|
UPCASE |
toupper() |
R’s
tidyverse provides elegant chained workflows. The %>% operator enables
readable pipelines similar to layered DATA Step transformations.
case_when()
is especially useful for scalable business-rule mapping.
7. Business Logic & The
“Why”
Consider
a streaming intelligence company ranking villains for a recommendation engine.
Suppose
missing revenue is incorrectly treated as zero instead of unknown.
Consequences:
- Thanos appears less
profitable
- AI ranking algorithms become
biased
- Investor dashboards become
inaccurate
- Marketing budgets are
misallocated
In
pharmaceutical analytics, this becomes even more dangerous. If patient
adverse-event severity is incorrectly defaulted to “mild,” regulators may
reject the submission.
Data
cleaning is not cosmetic. It is operational risk management.
8. 20 Key Points of
Implementation
- Always define LENGTH before
transformation logic.
- Standardize text casing
early.
- Validate dates before
analysis.
- Never trust raw flat files.
- Deduplicate before
aggregation.
- Use formats instead of
hardcoding labels.
- Keep raw data immutable.
- Create audit-ready
transformation logic.
- Use meaningful variable
names.
- Separate business logic from
presentation logic.
- Validate numeric ranges
aggressively.
- Handle NULL and blanks
differently.
- Document every
transformation.
- Use reusable macros for
scalability.
- Avoid nested IF statements
when possible.
- Prefer SELECT-WHEN for
maintainability.
- Always test edge cases.
- Use PROC CONTENTS
frequently.
- Validate duplicates using
business keys.
- Create production-ready
reports, not temporary outputs.
9. Extended Analysis in SAS
data villains;
length Villain_Name $30 Region $15
High_Value $3;
infile datalines dlm='|';
input Villain_ID Villain_Name $ Revenue Region $;
if Revenue > 2000 then High_Value='YES';
else High_Value='NO';
datalines;
101|Joker|1500|NorthAmerica
102|DarthVader|2200|Global
103|Loki|1800|Europe
104|Voldemort|2500|Europe
105|Thanos|3200|Global
105|Thanos|3200|Global
106|Hannibal|900|USA
107|Sauron|2700|MiddleEarth
108|Dracula|1100|Europe
109|Magneto|2100|USA
110|Scar|700|Africa
111|GreenGoblin|2400|USA
;
run;
proc print data=villains;
run;
OUTPUT:
| Obs | Villain_Name | Region | High_Value | Villain_ID | Revenue |
|---|---|---|---|---|---|
| 1 | Joker | NorthAmerica | NO | 101 | 1500 |
| 2 | DarthVader | Global | YES | 102 | 2200 |
| 3 | Loki | Europe | NO | 103 | 1800 |
| 4 | Voldemort | Europe | YES | 104 | 2500 |
| 5 | Thanos | Global | YES | 105 | 3200 |
| 6 | Thanos | Global | YES | 105 | 3200 |
| 7 | Hannibal | USA | NO | 106 | 900 |
| 8 | Sauron | MiddleEarth | YES | 107 | 2700 |
| 9 | Dracula | Europe | NO | 108 | 1100 |
| 10 | Magneto | USA | YES | 109 | 2100 |
| 11 | Scar | Africa | NO | 110 | 700 |
| 12 | GreenGoblin | USA | YES | 111 | 2400 |
proc means data=villains sum mean maxdec=2;
class Region;
var Revenue;
run;
OUTPUT:
The MEANS Procedure
| Analysis Variable : Revenue | |||
|---|---|---|---|
| Region | N Obs | Sum | Mean |
| Africa | 1 | 700.00 | 700.00 |
| Europe | 3 | 5400.00 | 1800.00 |
| Global | 3 | 8600.00 | 2866.67 |
| MiddleEarth | 1 | 2700.00 | 2700.00 |
| NorthAmerica | 1 | 1500.00 | 1500.00 |
| USA | 3 | 5400.00 | 1800.00 |
proc sort data=villains nodupkey;
by Villain_ID;
run;
proc print data=villains;
run;
LOG:
OUTPUT:
| Obs | Villain_Name | Region | High_Value | Villain_ID | Revenue |
|---|---|---|---|---|---|
| 1 | Joker | NorthAmerica | NO | 101 | 1500 |
| 2 | DarthVader | Global | YES | 102 | 2200 |
| 3 | Loki | Europe | NO | 103 | 1800 |
| 4 | Voldemort | Europe | YES | 104 | 2500 |
| 5 | Thanos | Global | YES | 105 | 3200 |
| 6 | Hannibal | USA | NO | 106 | 900 |
| 7 | Sauron | MiddleEarth | YES | 107 | 2700 |
| 8 | Dracula | Europe | NO | 108 | 1100 |
| 9 | Magneto | USA | YES | 109 | 2100 |
| 10 | Scar | Africa | NO | 110 | 700 |
| 11 | GreenGoblin | USA | YES | 111 | 2400 |
Explanation and Key Points
This
workflow demonstrates enterprise ETL architecture:
- INFILE imports flat-file
data
- IF-THEN applies business
rules
- PROC MEANS aggregates
revenue
- PROC SORT removes duplicates
- PROC PRINT generates
operational reporting
This
pattern is extremely common in banking, insurance, retail, and clinical trial
pipelines.
10. 20 Additional Data
Cleaning Best Practices
- Preserve original raw files.
- Maintain SDTM traceability.
- Document derivation logic.
- Validate controlled terminology.
- Track every data
modification.
- Use version control systems.
- Maintain audit trails.
- Validate primary keys.
- Standardize missing value
handling.
- Reconcile datasets
regularly.
- Use automated QC scripts.
- Validate variable metadata.
- Cross-check against protocol
specifications.
- Review duplicate handling
policies.
- Use consistent date
standards.
- Validate derived variables
independently.
- Perform peer code reviews.
- Monitor data lineage.
- Build reusable validation
macros.
- Ensure regulatory reproducibility.
11. Business Logic Behind
Data Cleaning
Data
cleaning exists because analytics systems depend entirely on data reliability.
When missing values remain untreated, business intelligence becomes misleading.
For example, a missing patient age in a clinical trial cannot simply become
zero because it changes statistical interpretation. Similarly, negative salary
values in banking systems create financial reporting distortions.
Correcting
unrealistic values protects decision-making integrity. If a villain’s revenue
becomes -100, financial dashboards may incorrectly calculate losses. Using ABS()
for impossible age values or replacing invalid text such as "NULL"
ensures analytical consistency.
Date
imputation is equally critical. In healthcare, missing treatment dates may
affect survival analysis. In banking, incorrect loan approval dates can create
legal disputes.
Standardization
also improves machine learning performance. AI systems require clean, normalized
inputs. If "usa", "USA", and "UsA" remain
inconsistent, aggregation logic may treat them as different countries.
Ultimately,
data cleaning transforms unreliable raw inputs into trusted analytical assets.
12.20 Sharp & Impactful
Insights
- Dirty data destroys trust.
- Standardization ensures
reproducibility.
- Missing values can bankrupt
analytics.
- Duplicate rows inflate
business metrics.
- PROC FORMAT improves
reporting elegance.
- Validation prevents
regulatory failures.
- Bad dates corrupt timelines.
- Clean data powers machine
learning.
- Metadata matters as much as
data.
- IF-THEN logic must reflect
business rules.
- Presentation logic should
remain separate.
- SAS excels in enterprise
governance.
- R accelerates exploratory
refinement.
- Audit trails protect
organizations legally.
- Every transformation must be
explainable.
- Text inconsistency breaks
aggregation logic.
- Deduplication improves
analytical accuracy.
- Scalable code reduces
maintenance costs.
- Data quality drives
executive confidence.
- Clean datasets create
reliable intelligence.
13. Summary
SAS and R
together create one of the most powerful data engineering ecosystems available
today. SAS dominates structured enterprise environments because of its
exceptional governance, validation, auditability, and reporting capabilities.
Features such as PROC FORMAT, DATA Step logic, PROC SQL, PROC REPORT, and PROC
SORT make SAS ideal for regulatory industries including pharmaceuticals,
banking, insurance, and healthcare.
R
complements SAS by offering highly flexible exploratory data manipulation
through tidyverse libraries like dplyr and stringr. Functions such as mutate(),
case_when(), distinct(), and replace_na() allow analysts to rapidly iterate
through transformations while maintaining readable workflows.
In this
project, we transformed a chaotic global villains dataset into a professional
analytical framework. We corrected missing values, standardized inconsistent
text, removed duplicates, validated dates, categorized universes, and generated
presentation-ready outputs using PROC FORMAT.
The key
lesson is simple: analytics quality depends entirely on data quality. No AI
model, dashboard, or executive report can outperform flawed input data.
Structured cleaning frameworks ensure reliability, reproducibility,
scalability, and business trust.
Professional
analytics is not about writing code quickly it is about engineering trustworthy
intelligence systems.
14. Conclusion
In the
modern data economy, raw data is rarely usable without intervention. Whether
the dataset involves clinical trials, banking transactions, streaming
intelligence, or fictional villains, the underlying challenge remains identical:
transforming chaos into reliable insight.
This
project demonstrated how SAS and R solve that challenge through complementary
strengths. SAS provides enterprise-grade structure, repeatability, governance,
and professional reporting capabilities. R provides flexibility, rapid
transformation, and exploratory analytical power.
PROC
FORMAT emerged as one of the most valuable tools in this workflow. Instead of
permanently altering raw values, formats allowed us to create elegant semantic
reporting layers. This separation between raw data and presentation logic is
foundational in enterprise analytics architecture.
We also
explored critical data engineering concepts:
- Deduplication
- Missing value handling
- Business rule validation
- Text standardization
- Date correction
- Conditional transformations
- Enterprise reporting
The
deeper lesson extends beyond coding. Data cleaning is fundamentally about
protecting organizational decisions. Incorrect analytics can trigger financial
losses, failed regulatory submissions, legal disputes, and broken AI systems.
Organizations
that invest in structured data engineering frameworks consistently outperform
those relying on ad hoc spreadsheet corrections.
The real
villain in analytics is not bad software.
It is unmanaged dirty data.
And the
heroes are disciplined engineering practices powered by SAS, R, validation
logic, and intelligent formatting frameworks.
15. Interview Questions and
Answers
1. Why is PROC FORMAT preferred over hardcoding
labels?
Answer:
PROC FORMAT centralizes presentation logic, improves maintainability, ensures
consistency across reports, and avoids repetitive IF-THEN statements.
2. Difference between IF-THEN and SELECT-WHEN in
SAS?
Answer:
IF-THEN is suitable for simple conditions, while SELECT-WHEN is cleaner and
more scalable for multiple categorical mappings.
3. How would you handle duplicate patient records
in clinical trials?
Answer:
Use PROC SORT NODUPKEY with business identifiers like USUBJID, then validate
against source systems and maintain audit documentation.
4. What is the R equivalent of PROC SORT NODUPKEY?
Answer:
distinct() from dplyr removes duplicates similarly while allowing retention of
selected variables.
5. Real-world debugging scenario?
Answer:
A dashboard showed negative revenue totals because refund transactions were
incorrectly merged with sales data. We used PROC SQL joins, validation checks,
and ABS logic to isolate and correct the issue before executive reporting.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about VILLAINS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment