Pixels Before Perfection: Cleaning Inconsistent Film Data Across Countries with SAS & R

1. Introduction

Imagine you're working as a clinical SAS programmer in a global pharma company. Your task: generate safety tables for a submission. Everything looks fine until your outputs show impossible patient ages (−5 years), duplicate subject IDs, and dates like 30FEB2023. Suddenly, your clean statistical model collapses.

Now replace “patients” with a global film dataset specifically, first graphics films across countries. Same chaos. Same risk.

Dirty data is not just an inconvenience it destroys analytical integrity. Whether it's a clinical trial (SDTM/ADaM) or a business dataset, inconsistency leads to:

Biased results
Regulatory rejection
Wrong business decisions

This is where SAS and R become your surgical tools. SAS excels in structured, regulatory-grade pipelines, while R provides flexibility for exploratory cleaning.

In this blog, we will simulate a “FIRST GRAPHICS FILMS IN ALL COUNTRIES” dataset, intentionally inject errors, and then clean it step-by-step using both SAS and R.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset with intentional errors)

DATA films_raw;

INFILE DATALINES MISSOVER;

LENGTH Film_ID 8. Country $12. Film_Name $20. Release_Date $10.

Rating Director $10.;

INPUT Film_ID Country $ Film_Name $ Year Budget

Release_Date $ Rating Director $ Duration;

DATALINES;

1 india Avatar 2009 237000000 18-12-2009 4.8 cameron 162

2 USA Avatar 2009 . 2009/12/18 4.8 Cameron 162

3 UK null 2010 150000000 2010-05-20 4.5 Nolan 148

4 china Hero 2002 -50000000 2002/10/24 4.7 zhang 99

5 japan SpiritedAway 2001 19000000 2001-07-20 . Miyazaki 125

6 INDIA Avatar 2009 237000000 18-12-2009 4.8 cameron 162

7 france Amelie 2001 10000000 2001/04/25 4.6 Jeunet 122

8 germany NULL 2006 60000000 2006-01-01 4.3 Tykwer 147

9 usa Matrix 1999 63000000 1999-03-31 4.9 Wachowski 136

10 brazil CityOfGod 2002 3300000 2002-08-30 4.8 Meirelles 130

;

RUN;

PROC PRINT DATA = films_raw;

RUN;

OUTPUT:


Obs	Film_ID	Country	Film_Name	Release_Date	Rating	Director	Year	Budget	Duration
1	1	india	Avatar	18-12-2009	4.8	cameron	2009	237000000	162
2	2	USA	Avatar	2009/12/18	4.8	Cameron	2009	.	162
3	3	UK	null	2010-05-20	4.5	Nolan	2010	150000000	148
4	4	china	Hero	2002/10/24	4.7	zhang	2002	-50000000	99
5	5	japan	SpiritedAway	2001-07-20		Miyazaki	2001	19000000	125
6	6	INDIA	Avatar	18-12-2009	4.8	cameron	2009	237000000	162
7	7	france	Amelie	2001/04/25	4.6	Jeunet	2001	10000000	122
8	8	germany	NULL	2006-01-01	4.3	Tykwer	2006	60000000	147
9	9	usa	Matrix	1999-03-31	4.9	Wachowski	1999	63000000	136
10	10	brazil	CityOfGod	2002-08-30	4.8	Meirelles	2002	3300000	130

R Code – Equivalent Raw Dataset

films_raw <- data.frame(

Film_ID = c(1,2,3,4,5,6,7,8,9,10),

Country = c("india","USA","UK","china","japan","INDIA",

"france","GERMANY","usa","brazil"),

Film_Name = c("Avatar","Avatar","null","Hero","SpiritedAway",

"Avatar","Amelie","NULL","Matrix","CityOfGod"),

Year = c(2009,2009,2010,2002,2001,2009,2001,2006,1999,2002),

Budget = c(237000000,NA,150000000,-50000000,19000000,237000000,

10000000,60000000,63000000,3300000),

Release_Date = c("18-12-2009","2009/12/18","2010-05-20","2002/10/24",

"2001-07-20","18-12-2009","2001/04/25","2006-01-01",

"1999-03-31","2002-08-30"),

Rating = c(4.8,4.8,4.5,4.7,NA,4.8,4.6,4.3,4.9,4.8),

Director = c("cameron","Cameron","Nolan","zhang","Miyazaki",

"cameron","Jeunet","Tykwer","Wachowski","Meirelles"),

Duration = c(162,162,148,99,125,162,122,147,136,130)

)

OUTPUT:

	Film_ID	Country	Film_Name	Year	Budget	Release_Date	Rating	Director	Duration
1	1	india	Avatar	2009	237000000	18-12-2009	4.8	cameron	162
2	2	USA	Avatar	2009	NA	18-12-2009	4.8	Cameron	162
3	3	UK	null	2010	150000000	20-05-2010	4.5	Nolan	148
4	4	china	Hero	2002	-50000000	24-10-2002	4.7	zhang	99
5	5	japan	SpiritedAway	2001	19000000	20-07-2001	NA	Miyazaki	125
6	6	INDIA	Avatar	2009	237000000	18-12-2009	4.8	cameron	162
7	7	france	Amelie	2001	10000000	25-04-2001	4.6	Jeunet	122
8	8	GERMANY	NULL	2006	60000000	01-01-2006	4.3	Tykwer	147
9	9	usa	Matrix	1999	63000000	31-03-1999	4.9	Wachowski	136

Issues Identified

Missing values (. / NA)
Invalid budget (negative)
Duplicate records (Film_ID 1 & 6)
Inconsistent text (india vs INDIA vs usa)
NULL strings
Date inconsistencies

3. Phase 1: Data Cleaning in SAS

SAS Cleaning Code

DATA films_clean;

SET films_raw;

/* Standardize Country */

Country = UPCASE(STRIP(COALESCEC(Country,"UNKNOWN")));

/* Handle NULL film names */

IF UPCASE(Film_Name) IN ("NULL","") THEN Film_Name="UNKNOWN";

/* Fix negative budget */

IF Budget < 0 THEN Budget = ABS(Budget);

/* Replace missing rating */

IF Rating = . THEN Rating = 0;

/* Standardize Director */

Director = PROPCASE(STRIP(Director));

/* Convert Date */

Release_dt = INPUT(Release_Date, ANYDTDTE10.);

FORMAT Release_dt DATE9.;

DROP Release_Date;

RUN;

PROC PRINT DATA = films_clean;

RUN;

OUTPUT:


Obs	Film_ID	Country	Film_Name	Rating	Director	Year	Budget	Duration	Release_dt
1	1	INDIA	Avatar	4.8	Cameron	2009	237000000	162	18DEC2009
2	2	USA	Avatar	4.8	Cameron	2009	.	162	18DEC2009
3	3	UK	UNKNOWN	4.5	Nolan	2010	150000000	148	20MAY2010
4	4	CHINA	Hero	4.7	Zhang	2002	50000000	99	24OCT2002
5	5	JAPAN	SpiritedAway	0	Miyazaki	2001	19000000	125	20JUL2001
6	6	INDIA	Avatar	4.8	Cameron	2009	237000000	162	18DEC2009
7	7	FRANCE	Amelie	4.6	Jeunet	2001	10000000	122	25APR2001
8	8	GERMANY	UNKNOWN	4.3	Tykwer	2006	60000000	147	01JAN2006
9	9	USA	Matrix	4.9	Wachowski	1999	63000000	136	31MAR1999
10	10	BRAZIL	CityOfGod	4.8	Meirelles	2002	3300000	130	30AUG2002

/* Remove duplicates */

PROC SORT DATA = films_clean NODUPKEY;

BY Country Film_Name Year Director;

RUN;

PROC PRINT DATA = films_clean;

RUN;

OUTPUT:


Obs	Film_ID	Country	Film_Name	Rating	Director	Year	Budget	Duration	Release_dt
1	10	BRAZIL	CityOfGod	4.8	Meirelles	2002	3300000	130	30AUG2002
2	4	CHINA	Hero	4.7	Zhang	2002	50000000	99	24OCT2002
3	7	FRANCE	Amelie	4.6	Jeunet	2001	10000000	122	25APR2001
4	8	GERMANY	UNKNOWN	4.3	Tykwer	2006	60000000	147	01JAN2006
5	1	INDIA	Avatar	4.8	Cameron	2009	237000000	162	18DEC2009
6	5	JAPAN	SpiritedAway	0	Miyazaki	2001	19000000	125	20JUL2001
7	3	UK	UNKNOWN	4.5	Nolan	2010	150000000	148	20MAY2010
8	2	USA	Avatar	4.8	Cameron	2009	.	162	18DEC2009
9	9	USA	Matrix	4.9	Wachowski	1999	63000000	136	31MAR1999

PROC SORT DATA = films_clean;

BY Film_ID ;

RUN;

PROC PRINT DATA = films_clean;

RUN;

OUTPUT:


Obs	Film_ID	Country	Film_Name	Rating	Director	Year	Budget	Duration	Release_dt
1	1	INDIA	Avatar	4.8	Cameron	2009	237000000	162	18DEC2009
2	2	USA	Avatar	4.8	Cameron	2009	.	162	18DEC2009
3	3	UK	UNKNOWN	4.5	Nolan	2010	150000000	148	20MAY2010
4	4	CHINA	Hero	4.7	Zhang	2002	50000000	99	24OCT2002
5	5	JAPAN	SpiritedAway	0	Miyazaki	2001	19000000	125	20JUL2001
6	7	FRANCE	Amelie	4.6	Jeunet	2001	10000000	122	25APR2001
7	8	GERMANY	UNKNOWN	4.3	Tykwer	2006	60000000	147	01JAN2006
8	9	USA	Matrix	4.9	Wachowski	1999	63000000	136	31MAR1999
9	10	BRAZIL	CityOfGod	4.8	Meirelles	2002	3300000	130	30AUG2002

Explanation

This SAS step applies structured data cleaning aligned with SDTM principles. COALESCEC ensures missing character values are replaced systematically. UPCASE standardizes categorical variables critical for downstream joins. Negative budgets are corrected using ABS, avoiding illogical financial analytics. Date conversion via ANYDTDTE. handles heterogeneous formats efficiently. Duplicate removal using PROC SORT NODUPKEY ensures entity-level uniqueness (similar to USUBJID in clinical trials). Overall, this pipeline ensures reproducibility, auditability, and regulatory compliance key requirements in pharmaceutical and enterprise analytics.

4. Phase 2: Data Cleaning in R

library(dplyr)

options(scipen = 999)

films_clean <- films_raw %>%

mutate(

Country = toupper(trimws(ifelse(Country %in% c("NULL", NA), "UNKNOWN", Country))),

Film_Name = toupper(trimws(ifelse(Film_Name %in% c("NULL", NA), "UNKNOWN", Film_Name))),

Director = tools::toTitleCase(trimws(Director)),

Budget = ifelse(Budget < 0, abs(Budget), Budget),

Rating = ifelse(is.na(Rating), 0, Rating),

Release_Date = as.Date(Release_Date,

tryFormats = c("%d-%m-%Y", "%Y/%m/%d", "%Y-%m-%d")),

Release_Date = dplyr::coalesce(

Release_Date,

as.Date(paste0(Year, "-01-01"))

)

) %>%

arrange(Country, Film_Name, Year, Director, desc(Budget)) %>%

distinct(Country, Film_Name, Year, Director, .keep_all = TRUE) %>%

arrange(Film_ID)

OUTPUT:

	Film_ID	Country	Film_Name	Year	Budget	Release_Date	Rating	Director	Duration
1	1	INDIA	AVATAR	2009	237000000	18-12-2009	4.8	Cameron	162
2	2	USA	AVATAR	2009	NA	01-01-2009	4.8	Cameron	162
3	3	UK	NULL	2010	150000000	01-01-2010	4.5	Nolan	148
4	4	CHINA	HERO	2002	50000000	01-01-2002	4.7	Zhang	99
5	5	JAPAN	SPIRITEDAWAY	2001	19000000	01-01-2001	0	Miyazaki	125
6	7	FRANCE	AMELIE	2001	10000000	01-01-2001	4.6	Jeunet	122
7	8	GERMANY	UNKNOWN	2006	60000000	01-01-2006	4.3	Tykwer	147
8	9	USA	MATRIX	1999	63000000	01-01-1999	4.9	Wachowski	136
9	10	BRAZIL	CITYOFGOD	2002	3300000	01-01-2002	4.8	Meirelles	130

Explanation

R uses a pipeline approach (dplyr) for clarity and flexibility. mutate() applies transformations column-wise. ifelse() handles missing and invalid values efficiently. String normalization (toupper, trimws) ensures consistency. The as.Date() function supports multiple formats via tryFormats, solving real-world messy date issues. Duplicate removal using distinct() mirrors SAS NODUPKEY. Compared to SAS, R is more expressive but less rigid making it ideal for exploratory cleaning but requiring careful validation for production-grade workflows.

5. Phase 3: Additional SAS Processing

DATA films_enriched;

SET films_clean;

LENGTH Length_Category $8.;

IF Duration > 150 THEN Length_Category="LONG";

ELSE Length_Category="SHORT";

IF Rating >= 4.5 THEN Hit_Film="YES";

ELSE Hit_Film="NO";

RUN;

PROC PRINT DATA = films_enriched;

RUN;

OUTPUT:


Obs	Film_ID	Country	Film_Name	Rating	Director	Year	Budget	Duration	Release_dt	Length_Category	Hit_Film
1	1	INDIA	Avatar	4.8	Cameron	2009	237000000	162	18DEC2009	LONG	YES
2	2	USA	Avatar	4.8	Cameron	2009	.	162	18DEC2009	LONG	YES
3	3	UK	UNKNOWN	4.5	Nolan	2010	150000000	148	20MAY2010	SHORT	YES
4	4	CHINA	Hero	4.7	Zhang	2002	50000000	99	24OCT2002	SHORT	YES
5	5	JAPAN	SpiritedAway	0	Miyazaki	2001	19000000	125	20JUL2001	SHORT	NO
6	7	FRANCE	Amelie	4.6	Jeunet	2001	10000000	122	25APR2001	SHORT	YES
7	8	GERMANY	UNKNOWN	4.3	Tykwer	2006	60000000	147	01JAN2006	SHORT	NO
8	9	USA	Matrix	4.9	Wachowski	1999	63000000	136	31MAR1999	SHORT	YES
9	10	BRAZIL	CityOfGod	4.8	Meirelles	2002	3300000	130	30AUG2002	SHORT	YES

/* Frequency Analysis */

PROC FREQ DATA=films_enriched;

TABLES Country Hit_Film;

RUN;

OUTPUT:

The FREQ Procedure


Country	Frequency	Percent	Cumulative Frequency	Cumulative Percent
BRAZIL	1	11.11	1	11.11
CHINA	1	11.11	2	22.22
FRANCE	1	11.11	3	33.33
GERMANY	1	11.11	4	44.44
INDIA	1	11.11	5	55.56
JAPAN	1	11.11	6	66.67
UK	1	11.11	7	77.78
USA	2	22.22	9	100.00


Hit_Film	Frequency	Percent	Cumulative Frequency	Cumulative Percent
NO	2	22.22	2	22.22
YES	7	77.78	9	100.00

Explanation

This step introduces derived variables, similar to ADaM derivations. Categorizing film duration improves interpretability. Binary classification (Hit_Film) mimics responder analysis in clinical trials. PROC FREQ enables quick validation checks ensuring no anomalies post-cleaning. This stage bridges raw data cleaning and analytical readiness.

6. 20 Additional Data Cleaning Best Practices

Always validate against source CRF or metadata
Use controlled terminology (CDISC standards)
Maintain audit trails
Avoid overwriting raw data
Perform range checks
Validate date sequences
Handle duplicates carefully
Standardize case formats
Use consistent missing value strategy
Validate keys (USUBJID)
Apply derivation traceability
Use macros for repeatability
Cross-check domains (SDTM consistency)
Implement QC programming
Document assumptions
Avoid hardcoding values
Validate categorical levels
Monitor outliers
Perform reconciliation checks
Ensure regulatory compliance (FDA/EMA)

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic it directly impacts decision-making. Missing values are often replaced to maintain dataset completeness; however, the method depends on context. For example, in clinical trials, missing patient age cannot be guessed, but missing lab values may be imputed based on protocol-defined rules. Similarly, negative salary or budget values are logically invalid, so correcting them using absolute values or flagging them ensures analytical consistency.

Unrealistic values distort models. Imagine a patient age of −5 it would skew demographic summaries. Likewise, a film budget of −50 million leads to incorrect ROI calculations. Date imputation is another critical area. Inconsistent dates can break time-to-event analyses in survival models or release trend analysis in business datasets.

Standardization ensures comparability. “india,” “INDIA,” and “India” must be unified; otherwise, aggregation fails. In regulatory environments like SDTM/ADaM, such inconsistencies can lead to submission rejection.

Ultimately, clean data ensures:

Accurate statistical inference
Reliable reporting
Regulatory acceptance
Better business insights

8. 20 Key Points

Dirty data leads to wrong conclusions
Standardization ensures reproducibility
Missing data must be handled systematically
Validation prevents downstream errors
Duplicate data inflates metrics
Date consistency is critical
SAS ensures structured pipelines
R provides flexibility
Always validate assumptions
Controlled terminology is essential
Cleaning is iterative
Documentation is mandatory
Automation improves efficiency
Outliers need investigation
Never trust raw data blindly
Consistency beats complexity
QC is not optional
Audit trails ensure traceability
Regulatory compliance is non-negotiable
Clean data builds trust

9. Summary

Data cleaning is the backbone of reliable analytics. In this blog, we explored how inconsistent data in a global film dataset mirrors real-world challenges seen in clinical trials and business systems. Using SAS and R, we demonstrated practical approaches to handle missing values, invalid entries, duplicate records, and inconsistent formats.

SAS stands out for its structured, regulatory-compliant workflows. Its deterministic behavior, strong data step processing, and procedures like PROC SORT make it ideal for production environments, especially in clinical programming (SDTM/ADaM). On the other hand, R offers flexibility and expressive syntax through packages like dplyr, making it highly effective for exploratory analysis and rapid prototyping.

We also extended the cleaning process into derivations and validation checks, emphasizing that cleaning is not an isolated step but part of a larger analytical pipeline. Best practices such as maintaining audit trails, validating keys, and ensuring consistency across domains highlight the importance of disciplined programming.

Ultimately, the choice between SAS and R is not binary they complement each other. SAS ensures robustness and compliance, while R accelerates innovation and exploration. Together, they form a powerful toolkit for any data professional.

10. Conclusion

Inconsistent data is inevitable but unmanaged inconsistency is unacceptable. Whether you are working on a clinical trial dataset or analyzing global film trends, the principles remain the same: structure, validate, standardize, and document.

This blog demonstrated that data cleaning is not just about fixing errors it’s about engineering trust into your data pipeline. SAS provides the rigor required for regulatory environments, ensuring traceability and reproducibility. R complements this with agility and flexibility, enabling deeper exploration and faster iteration.

The real-world takeaway is this: never treat data cleaning as a preliminary step. It is a core analytical function. Poor cleaning leads to flawed models, misleading insights, and potentially catastrophic decisions especially in healthcare and finance.

By implementing structured cleaning frameworks, leveraging tools like SAS and R effectively, and adhering to best practices, you can transform chaotic datasets into reliable analytical assets.

In the end, clean data is not just about accuracy it’s about credibility, compliance, and confidence.

11. Interview Questions

Q1: How do you handle duplicate records in SAS?
A: Use PROC SORT NODUPKEY with appropriate BY variables. Ensure key selection aligns with business logic.

Q2: What would you do if dates are in multiple formats?
A: Use ANYDTDTE. in SAS or as.Date() with multiple formats in R.

Q3: How do you validate missing values in clinical datasets?
A: Cross-check with CRF, apply imputation rules per SAP, and document assumptions.

Q4: Difference between SAS and R in cleaning?
A: SAS is structured and regulatory-focused; R is flexible and exploratory.

Q5: Debugging scenario: Negative age found what next?
A: Investigate source, correct if derivable, else set to missing and flag for review.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MOVIE DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Billion-Dollar Marriages to Clean Datasets: The Art of SAS Standardization

2.The Hidden Power of SAS Dates: Building Smart Time Logic with INTNX and INTCK

3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

461.Data Blockbusters: Transforming Inconsistent Film Records into Reliable Insights

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

458.Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

441.Fixing Negative Data Errors Like A Pro Using SAS ABS Function