461.Data Blockbusters: Transforming Inconsistent Film Records into Reliable Insights
Pixels Before Perfection:
Cleaning Inconsistent Film Data Across Countries with SAS & R
1. Introduction
Imagine
you're working as a clinical SAS programmer in a global pharma company. Your
task: generate safety tables for a submission. Everything looks fine until your
outputs show impossible patient ages (−5 years), duplicate subject IDs, and
dates like 30FEB2023. Suddenly, your clean statistical model collapses.
Now
replace “patients” with a global film dataset specifically, first
graphics films across countries. Same chaos. Same risk.
Dirty data
is not just an inconvenience it destroys analytical integrity. Whether
it's a clinical trial (SDTM/ADaM) or a business dataset, inconsistency leads
to:
- Biased results
- Regulatory rejection
- Wrong business decisions
This is
where SAS and R become your surgical tools. SAS excels in structured,
regulatory-grade pipelines, while R provides flexibility for exploratory
cleaning.
In this blog, we will simulate a “FIRST GRAPHICS FILMS IN ALL COUNTRIES” dataset, intentionally inject errors, and then clean it step-by-step using both SAS and R.
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset with intentional errors)
DATA films_raw;
INFILE DATALINES MISSOVER;
LENGTH Film_ID 8. Country $12. Film_Name $20. Release_Date $10.
Rating Director $10.;
INPUT Film_ID Country $ Film_Name $ Year Budget
Release_Date $ Rating Director $ Duration;
DATALINES;
1 india Avatar 2009 237000000 18-12-2009 4.8 cameron 162
2 USA Avatar 2009 . 2009/12/18 4.8 Cameron 162
3 UK null 2010 150000000 2010-05-20 4.5 Nolan 148
4 china Hero 2002 -50000000 2002/10/24 4.7 zhang 99
5 japan SpiritedAway 2001 19000000 2001-07-20 . Miyazaki 125
6 INDIA Avatar 2009 237000000 18-12-2009 4.8 cameron 162
7 france Amelie 2001 10000000 2001/04/25 4.6 Jeunet 122
8 germany NULL 2006 60000000 2006-01-01 4.3 Tykwer 147
9 usa Matrix 1999 63000000 1999-03-31 4.9 Wachowski 136
10 brazil CityOfGod 2002 3300000 2002-08-30 4.8 Meirelles 130
;
RUN;
PROC PRINT DATA = films_raw;
RUN;
OUTPUT:
| Obs | Film_ID | Country | Film_Name | Release_Date | Rating | Director | Year | Budget | Duration |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | india | Avatar | 18-12-2009 | 4.8 | cameron | 2009 | 237000000 | 162 |
| 2 | 2 | USA | Avatar | 2009/12/18 | 4.8 | Cameron | 2009 | . | 162 |
| 3 | 3 | UK | null | 2010-05-20 | 4.5 | Nolan | 2010 | 150000000 | 148 |
| 4 | 4 | china | Hero | 2002/10/24 | 4.7 | zhang | 2002 | -50000000 | 99 |
| 5 | 5 | japan | SpiritedAway | 2001-07-20 | Miyazaki | 2001 | 19000000 | 125 | |
| 6 | 6 | INDIA | Avatar | 18-12-2009 | 4.8 | cameron | 2009 | 237000000 | 162 |
| 7 | 7 | france | Amelie | 2001/04/25 | 4.6 | Jeunet | 2001 | 10000000 | 122 |
| 8 | 8 | germany | NULL | 2006-01-01 | 4.3 | Tykwer | 2006 | 60000000 | 147 |
| 9 | 9 | usa | Matrix | 1999-03-31 | 4.9 | Wachowski | 1999 | 63000000 | 136 |
| 10 | 10 | brazil | CityOfGod | 2002-08-30 | 4.8 | Meirelles | 2002 | 3300000 | 130 |
R Code – Equivalent Raw Dataset
films_raw <- data.frame(
Film_ID = c(1,2,3,4,5,6,7,8,9,10),
Country = c("india","USA","UK","china","japan","INDIA",
"france","GERMANY","usa","brazil"),
Film_Name = c("Avatar","Avatar","null","Hero","SpiritedAway",
"Avatar","Amelie","NULL","Matrix","CityOfGod"),
Year = c(2009,2009,2010,2002,2001,2009,2001,2006,1999,2002),
Budget = c(237000000,NA,150000000,-50000000,19000000,237000000,
10000000,60000000,63000000,3300000),
Release_Date = c("18-12-2009","2009/12/18","2010-05-20","2002/10/24",
"2001-07-20","18-12-2009","2001/04/25","2006-01-01",
"1999-03-31","2002-08-30"),
Rating = c(4.8,4.8,4.5,4.7,NA,4.8,4.6,4.3,4.9,4.8),
Director = c("cameron","Cameron","Nolan","zhang","Miyazaki",
"cameron","Jeunet","Tykwer","Wachowski","Meirelles"),
Duration = c(162,162,148,99,125,162,122,147,136,130)
)
OUTPUT:
|
|
Film_ID |
Country |
Film_Name |
Year |
Budget |
Release_Date |
Rating |
Director |
Duration |
|
1 |
1 |
india |
Avatar |
2009 |
237000000 |
18-12-2009 |
4.8 |
cameron |
162 |
|
2 |
2 |
USA |
Avatar |
2009 |
NA |
18-12-2009 |
4.8 |
Cameron |
162 |
|
3 |
3 |
UK |
null |
2010 |
150000000 |
20-05-2010 |
4.5 |
Nolan |
148 |
|
4 |
4 |
china |
Hero |
2002 |
-50000000 |
24-10-2002 |
4.7 |
zhang |
99 |
|
5 |
5 |
japan |
SpiritedAway |
2001 |
19000000 |
20-07-2001 |
NA |
Miyazaki |
125 |
|
6 |
6 |
INDIA |
Avatar |
2009 |
237000000 |
18-12-2009 |
4.8 |
cameron |
162 |
|
7 |
7 |
france |
Amelie |
2001 |
10000000 |
25-04-2001 |
4.6 |
Jeunet |
122 |
|
8 |
8 |
GERMANY |
NULL |
2006 |
60000000 |
01-01-2006 |
4.3 |
Tykwer |
147 |
|
9 |
9 |
usa |
Matrix |
1999 |
63000000 |
31-03-1999 |
4.9 |
Wachowski |
136 |
Issues Identified
- Missing values (. / NA)
- Invalid budget (negative)
- Duplicate records (Film_ID 1
& 6)
- Inconsistent text (india vs
INDIA vs usa)
- NULL strings
- Date inconsistencies
3. Phase 1: Data Cleaning in SAS
SAS Cleaning Code
DATA films_clean;
SET films_raw;
/* Standardize Country */
Country = UPCASE(STRIP(COALESCEC(Country,"UNKNOWN")));
/* Handle NULL film names */
IF UPCASE(Film_Name) IN ("NULL","") THEN Film_Name="UNKNOWN";
/* Fix negative budget */
IF Budget < 0 THEN Budget = ABS(Budget);
/* Replace missing rating */
IF Rating = . THEN Rating = 0;
/* Standardize Director */
Director = PROPCASE(STRIP(Director));
/* Convert Date */
Release_dt = INPUT(Release_Date, ANYDTDTE10.);
FORMAT Release_dt DATE9.;
DROP Release_Date;
RUN;
PROC PRINT DATA = films_clean;
RUN;
OUTPUT:
| Obs | Film_ID | Country | Film_Name | Rating | Director | Year | Budget | Duration | Release_dt |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Avatar | 4.8 | Cameron | 2009 | 237000000 | 162 | 18DEC2009 |
| 2 | 2 | USA | Avatar | 4.8 | Cameron | 2009 | . | 162 | 18DEC2009 |
| 3 | 3 | UK | UNKNOWN | 4.5 | Nolan | 2010 | 150000000 | 148 | 20MAY2010 |
| 4 | 4 | CHINA | Hero | 4.7 | Zhang | 2002 | 50000000 | 99 | 24OCT2002 |
| 5 | 5 | JAPAN | SpiritedAway | 0 | Miyazaki | 2001 | 19000000 | 125 | 20JUL2001 |
| 6 | 6 | INDIA | Avatar | 4.8 | Cameron | 2009 | 237000000 | 162 | 18DEC2009 |
| 7 | 7 | FRANCE | Amelie | 4.6 | Jeunet | 2001 | 10000000 | 122 | 25APR2001 |
| 8 | 8 | GERMANY | UNKNOWN | 4.3 | Tykwer | 2006 | 60000000 | 147 | 01JAN2006 |
| 9 | 9 | USA | Matrix | 4.9 | Wachowski | 1999 | 63000000 | 136 | 31MAR1999 |
| 10 | 10 | BRAZIL | CityOfGod | 4.8 | Meirelles | 2002 | 3300000 | 130 | 30AUG2002 |
/* Remove duplicates */
PROC SORT DATA = films_clean NODUPKEY;
BY Country Film_Name Year Director;
RUN;
PROC PRINT DATA = films_clean;
RUN;
OUTPUT:
| Obs | Film_ID | Country | Film_Name | Rating | Director | Year | Budget | Duration | Release_dt |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | BRAZIL | CityOfGod | 4.8 | Meirelles | 2002 | 3300000 | 130 | 30AUG2002 |
| 2 | 4 | CHINA | Hero | 4.7 | Zhang | 2002 | 50000000 | 99 | 24OCT2002 |
| 3 | 7 | FRANCE | Amelie | 4.6 | Jeunet | 2001 | 10000000 | 122 | 25APR2001 |
| 4 | 8 | GERMANY | UNKNOWN | 4.3 | Tykwer | 2006 | 60000000 | 147 | 01JAN2006 |
| 5 | 1 | INDIA | Avatar | 4.8 | Cameron | 2009 | 237000000 | 162 | 18DEC2009 |
| 6 | 5 | JAPAN | SpiritedAway | 0 | Miyazaki | 2001 | 19000000 | 125 | 20JUL2001 |
| 7 | 3 | UK | UNKNOWN | 4.5 | Nolan | 2010 | 150000000 | 148 | 20MAY2010 |
| 8 | 2 | USA | Avatar | 4.8 | Cameron | 2009 | . | 162 | 18DEC2009 |
| 9 | 9 | USA | Matrix | 4.9 | Wachowski | 1999 | 63000000 | 136 | 31MAR1999 |
PROC SORT DATA = films_clean;
BY Film_ID ;
RUN;
PROC PRINT DATA = films_clean;
RUN;
OUTPUT:
| Obs | Film_ID | Country | Film_Name | Rating | Director | Year | Budget | Duration | Release_dt |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Avatar | 4.8 | Cameron | 2009 | 237000000 | 162 | 18DEC2009 |
| 2 | 2 | USA | Avatar | 4.8 | Cameron | 2009 | . | 162 | 18DEC2009 |
| 3 | 3 | UK | UNKNOWN | 4.5 | Nolan | 2010 | 150000000 | 148 | 20MAY2010 |
| 4 | 4 | CHINA | Hero | 4.7 | Zhang | 2002 | 50000000 | 99 | 24OCT2002 |
| 5 | 5 | JAPAN | SpiritedAway | 0 | Miyazaki | 2001 | 19000000 | 125 | 20JUL2001 |
| 6 | 7 | FRANCE | Amelie | 4.6 | Jeunet | 2001 | 10000000 | 122 | 25APR2001 |
| 7 | 8 | GERMANY | UNKNOWN | 4.3 | Tykwer | 2006 | 60000000 | 147 | 01JAN2006 |
| 8 | 9 | USA | Matrix | 4.9 | Wachowski | 1999 | 63000000 | 136 | 31MAR1999 |
| 9 | 10 | BRAZIL | CityOfGod | 4.8 | Meirelles | 2002 | 3300000 | 130 | 30AUG2002 |
Explanation
This SAS
step applies structured data cleaning aligned with SDTM principles. COALESCEC
ensures missing character values are replaced systematically. UPCASE standardizes
categorical variables critical for downstream joins. Negative budgets are
corrected using ABS, avoiding illogical financial analytics. Date conversion
via ANYDTDTE. handles heterogeneous formats efficiently. Duplicate removal
using PROC SORT NODUPKEY ensures entity-level uniqueness (similar to USUBJID in
clinical trials). Overall, this pipeline ensures reproducibility, auditability,
and regulatory compliance key requirements in pharmaceutical and enterprise
analytics.
4. Phase 2: Data Cleaning in R
library(dplyr)
options(scipen = 999)
films_clean <- films_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country %in% c("NULL", NA), "UNKNOWN", Country))),
Film_Name = toupper(trimws(ifelse(Film_Name %in% c("NULL", NA), "UNKNOWN", Film_Name))),
Director = tools::toTitleCase(trimws(Director)),
Budget = ifelse(Budget < 0, abs(Budget), Budget),
Rating = ifelse(is.na(Rating), 0, Rating),
Release_Date = as.Date(Release_Date,
tryFormats = c("%d-%m-%Y", "%Y/%m/%d", "%Y-%m-%d")),
Release_Date = dplyr::coalesce(
Release_Date,
as.Date(paste0(Year, "-01-01"))
)
) %>%
arrange(Country, Film_Name, Year, Director, desc(Budget)) %>%
distinct(Country, Film_Name, Year, Director, .keep_all = TRUE) %>%
arrange(Film_ID)
OUTPUT:
|
|
Film_ID |
Country |
Film_Name |
Year |
Budget |
Release_Date |
Rating |
Director |
Duration |
|
1 |
1 |
INDIA |
AVATAR |
2009 |
237000000 |
18-12-2009 |
4.8 |
Cameron |
162 |
|
2 |
2 |
USA |
AVATAR |
2009 |
NA |
01-01-2009 |
4.8 |
Cameron |
162 |
|
3 |
3 |
UK |
NULL |
2010 |
150000000 |
01-01-2010 |
4.5 |
Nolan |
148 |
|
4 |
4 |
CHINA |
HERO |
2002 |
50000000 |
01-01-2002 |
4.7 |
Zhang |
99 |
|
5 |
5 |
JAPAN |
SPIRITEDAWAY |
2001 |
19000000 |
01-01-2001 |
0 |
Miyazaki |
125 |
|
6 |
7 |
FRANCE |
AMELIE |
2001 |
10000000 |
01-01-2001 |
4.6 |
Jeunet |
122 |
|
7 |
8 |
GERMANY |
UNKNOWN |
2006 |
60000000 |
01-01-2006 |
4.3 |
Tykwer |
147 |
|
8 |
9 |
USA |
MATRIX |
1999 |
63000000 |
01-01-1999 |
4.9 |
Wachowski |
136 |
|
9 |
10 |
BRAZIL |
CITYOFGOD |
2002 |
3300000 |
01-01-2002 |
4.8 |
Meirelles |
130 |
Explanation
R uses a
pipeline approach (dplyr) for clarity and flexibility. mutate() applies
transformations column-wise. ifelse() handles missing and invalid values
efficiently. String normalization (toupper, trimws) ensures consistency. The as.Date()
function supports multiple formats via tryFormats, solving real-world messy
date issues. Duplicate removal using distinct() mirrors SAS NODUPKEY. Compared
to SAS, R is more expressive but less rigid making it ideal for exploratory
cleaning but requiring careful validation for production-grade workflows.
5. Phase 3: Additional SAS Processing
DATA films_enriched;
SET films_clean;
LENGTH Length_Category $8.;
IF Duration > 150 THEN Length_Category="LONG";
ELSE Length_Category="SHORT";
IF Rating >= 4.5 THEN Hit_Film="YES";
ELSE Hit_Film="NO";
RUN;
PROC PRINT DATA = films_enriched;
RUN;
OUTPUT:
| Obs | Film_ID | Country | Film_Name | Rating | Director | Year | Budget | Duration | Release_dt | Length_Category | Hit_Film |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Avatar | 4.8 | Cameron | 2009 | 237000000 | 162 | 18DEC2009 | LONG | YES |
| 2 | 2 | USA | Avatar | 4.8 | Cameron | 2009 | . | 162 | 18DEC2009 | LONG | YES |
| 3 | 3 | UK | UNKNOWN | 4.5 | Nolan | 2010 | 150000000 | 148 | 20MAY2010 | SHORT | YES |
| 4 | 4 | CHINA | Hero | 4.7 | Zhang | 2002 | 50000000 | 99 | 24OCT2002 | SHORT | YES |
| 5 | 5 | JAPAN | SpiritedAway | 0 | Miyazaki | 2001 | 19000000 | 125 | 20JUL2001 | SHORT | NO |
| 6 | 7 | FRANCE | Amelie | 4.6 | Jeunet | 2001 | 10000000 | 122 | 25APR2001 | SHORT | YES |
| 7 | 8 | GERMANY | UNKNOWN | 4.3 | Tykwer | 2006 | 60000000 | 147 | 01JAN2006 | SHORT | NO |
| 8 | 9 | USA | Matrix | 4.9 | Wachowski | 1999 | 63000000 | 136 | 31MAR1999 | SHORT | YES |
| 9 | 10 | BRAZIL | CityOfGod | 4.8 | Meirelles | 2002 | 3300000 | 130 | 30AUG2002 | SHORT | YES |
/* Frequency Analysis */
PROC FREQ DATA=films_enriched;
TABLES Country Hit_Film;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| BRAZIL | 1 | 11.11 | 1 | 11.11 |
| CHINA | 1 | 11.11 | 2 | 22.22 |
| FRANCE | 1 | 11.11 | 3 | 33.33 |
| GERMANY | 1 | 11.11 | 4 | 44.44 |
| INDIA | 1 | 11.11 | 5 | 55.56 |
| JAPAN | 1 | 11.11 | 6 | 66.67 |
| UK | 1 | 11.11 | 7 | 77.78 |
| USA | 2 | 22.22 | 9 | 100.00 |
| Hit_Film | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| NO | 2 | 22.22 | 2 | 22.22 |
| YES | 7 | 77.78 | 9 | 100.00 |
Explanation
This step
introduces derived variables, similar to ADaM derivations. Categorizing film
duration improves interpretability. Binary classification (Hit_Film) mimics
responder analysis in clinical trials. PROC FREQ enables quick validation
checks ensuring no anomalies post-cleaning. This stage bridges raw data
cleaning and analytical readiness.
6. 20 Additional Data Cleaning Best Practices
- Always validate against
source CRF or metadata
- Use controlled terminology
(CDISC standards)
- Maintain audit trails
- Avoid overwriting raw data
- Perform range checks
- Validate date sequences
- Handle duplicates carefully
- Standardize case formats
- Use consistent missing value
strategy
- Validate keys (USUBJID)
- Apply derivation
traceability
- Use macros for repeatability
- Cross-check domains (SDTM
consistency)
- Implement QC programming
- Document assumptions
- Avoid hardcoding values
- Validate categorical levels
- Monitor outliers
- Perform reconciliation
checks
- Ensure regulatory compliance
(FDA/EMA)
7. Business Logic Behind Data Cleaning
Data
cleaning is not cosmetic it directly impacts decision-making. Missing values
are often replaced to maintain dataset completeness; however, the method
depends on context. For example, in clinical trials, missing patient age cannot
be guessed, but missing lab values may be imputed based on protocol-defined
rules. Similarly, negative salary or budget values are logically invalid, so
correcting them using absolute values or flagging them ensures analytical
consistency.
Unrealistic
values distort models. Imagine a patient age of −5 it would skew demographic
summaries. Likewise, a film budget of −50 million leads to incorrect ROI
calculations. Date imputation is another critical area. Inconsistent dates can
break time-to-event analyses in survival models or release trend analysis in
business datasets.
Standardization
ensures comparability. “india,” “INDIA,” and “India” must be unified;
otherwise, aggregation fails. In regulatory environments like SDTM/ADaM, such
inconsistencies can lead to submission rejection.
Ultimately,
clean data ensures:
- Accurate statistical
inference
- Reliable reporting
- Regulatory acceptance
- Better business insights
8. 20 Key Points
- Dirty data leads to wrong
conclusions
- Standardization ensures
reproducibility
- Missing data must be handled
systematically
- Validation prevents
downstream errors
- Duplicate data inflates
metrics
- Date consistency is critical
- SAS ensures structured
pipelines
- R provides flexibility
- Always validate assumptions
- Controlled terminology is
essential
- Cleaning is iterative
- Documentation is mandatory
- Automation improves
efficiency
- Outliers need investigation
- Never trust raw data blindly
- Consistency beats complexity
- QC is not optional
- Audit trails ensure
traceability
- Regulatory compliance is
non-negotiable
- Clean data builds trust
9. Summary
Data
cleaning is the backbone of reliable analytics. In this blog, we explored how
inconsistent data in a global film dataset mirrors real-world challenges seen
in clinical trials and business systems. Using SAS and R, we demonstrated
practical approaches to handle missing values, invalid entries, duplicate
records, and inconsistent formats.
SAS
stands out for its structured, regulatory-compliant workflows. Its
deterministic behavior, strong data step processing, and procedures like PROC
SORT make it ideal for production environments, especially in clinical
programming (SDTM/ADaM). On the other hand, R offers flexibility and expressive
syntax through packages like dplyr, making it highly effective for exploratory
analysis and rapid prototyping.
We also
extended the cleaning process into derivations and validation checks,
emphasizing that cleaning is not an isolated step but part of a larger
analytical pipeline. Best practices such as maintaining audit trails,
validating keys, and ensuring consistency across domains highlight the
importance of disciplined programming.
Ultimately,
the choice between SAS and R is not binary they complement each other. SAS
ensures robustness and compliance, while R accelerates innovation and
exploration. Together, they form a powerful toolkit for any data professional.
10. Conclusion
Inconsistent
data is inevitable but unmanaged inconsistency is unacceptable. Whether you are
working on a clinical trial dataset or analyzing global film trends, the
principles remain the same: structure, validate, standardize, and document.
This blog
demonstrated that data cleaning is not just about fixing errors it’s about engineering
trust into your data pipeline. SAS provides the rigor required for regulatory
environments, ensuring traceability and reproducibility. R complements this
with agility and flexibility, enabling deeper exploration and faster iteration.
The
real-world takeaway is this: never treat data cleaning as a preliminary step.
It is a core analytical function. Poor cleaning leads to flawed models,
misleading insights, and potentially catastrophic decisions especially in
healthcare and finance.
By
implementing structured cleaning frameworks, leveraging tools like SAS and R
effectively, and adhering to best practices, you can transform chaotic datasets
into reliable analytical assets.
In the
end, clean data is not just about accuracy it’s about credibility,
compliance, and confidence.
11. Interview Questions
Q1: How do you handle duplicate
records in SAS?
A: Use PROC SORT NODUPKEY with appropriate BY variables. Ensure key
selection aligns with business logic.
Q2: What would you do if dates are
in multiple formats?
A: Use ANYDTDTE. in SAS or as.Date() with multiple formats in R.
Q3: How do you validate missing
values in clinical datasets?
A: Cross-check with CRF, apply imputation rules per SAP, and document
assumptions.
Q4: Difference between SAS and R in
cleaning?
A: SAS is structured and regulatory-focused; R is flexible and
exploratory.
Q5: Debugging scenario: Negative age
found what next?
A: Investigate source, correct if derivable, else set to missing and
flag for review.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MOVIE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment