461.Data Blockbusters: Transforming Inconsistent Film Records into Reliable Insights

Pixels Before Perfection: Cleaning Inconsistent Film Data Across Countries with SAS & R

1. Introduction

Imagine you're working as a clinical SAS programmer in a global pharma company. Your task: generate safety tables for a submission. Everything looks fine until your outputs show impossible patient ages (−5 years), duplicate subject IDs, and dates like 30FEB2023. Suddenly, your clean statistical model collapses.

Now replace “patients” with a global film dataset: specifically, the first graphics films released across countries. Same chaos. Same risk.

Dirty data is not just an inconvenience; it destroys analytical integrity. Whether it's a clinical trial (SDTM/ADaM) or a business dataset, inconsistency leads to:

  • Biased results
  • Regulatory rejection
  • Wrong business decisions

This is where SAS and R become your surgical tools. SAS excels in structured, regulatory-grade pipelines, while R provides flexibility for exploratory cleaning.

In this blog, we will simulate a “FIRST GRAPHICS FILMS IN ALL COUNTRIES” dataset, intentionally inject errors, and then clean it step-by-step using both SAS and R.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset with intentional errors)

DATA films_raw;
  INFILE DATALINES MISSOVER;
  LENGTH Film_ID 8 Country $12 Film_Name $20 Release_Date $10
         Rating 8 Director $10;
  INPUT Film_ID Country $ Film_Name $ Year Budget
        Release_Date $ Rating Director $ Duration;
DATALINES;
1 india Avatar 2009 237000000 18-12-2009 4.8 cameron 162
2 USA Avatar 2009 . 2009/12/18 4.8 Cameron 162
3 UK null 2010 150000000 2010-05-20 4.5 Nolan 148
4 china Hero 2002 -50000000 2002/10/24 4.7 zhang 99
5 japan SpiritedAway 2001 19000000 2001-07-20 . Miyazaki 125
6 INDIA Avatar 2009 237000000 18-12-2009 4.8 cameron 162
7 france Amelie 2001 10000000 2001/04/25 4.6 Jeunet 122
8 germany NULL 2006 60000000 2006-01-01 4.3 Tykwer 147
9 usa Matrix 1999 63000000 1999-03-31 4.9 Wachowski 136
10 brazil CityOfGod 2002 3300000 2002-08-30 4.8 Meirelles 130
;
RUN;

PROC PRINT DATA=films_raw;
RUN;

OUTPUT:

Obs  Film_ID  Country  Film_Name     Release_Date  Rating  Director   Year  Budget     Duration
  1        1  india    Avatar        18-12-2009       4.8  cameron    2009  237000000       162
  2        2  USA      Avatar        2009/12/18       4.8  Cameron    2009          .       162
  3        3  UK       null          2010-05-20       4.5  Nolan      2010  150000000       148
  4        4  china    Hero          2002/10/24       4.7  zhang      2002  -50000000        99
  5        5  japan    SpiritedAway  2001-07-20         .  Miyazaki   2001   19000000       125
  6        6  INDIA    Avatar        18-12-2009       4.8  cameron    2009  237000000       162
  7        7  france   Amelie        2001/04/25       4.6  Jeunet     2001   10000000       122
  8        8  germany  NULL          2006-01-01       4.3  Tykwer     2006   60000000       147
  9        9  usa      Matrix        1999-03-31       4.9  Wachowski  1999   63000000       136
 10       10  brazil   CityOfGod     2002-08-30       4.8  Meirelles  2002    3300000       130

R Code – Equivalent Raw Dataset

films_raw <- data.frame(
  Film_ID = c(1,2,3,4,5,6,7,8,9,10),
  Country = c("india","USA","UK","china","japan","INDIA",
              "france","GERMANY","usa","brazil"),
  Film_Name = c("Avatar","Avatar","null","Hero","SpiritedAway",
                "Avatar","Amelie","NULL","Matrix","CityOfGod"),
  Year = c(2009,2009,2010,2002,2001,2009,2001,2006,1999,2002),
  Budget = c(237000000,NA,150000000,-50000000,19000000,237000000,
             10000000,60000000,63000000,3300000),
  Release_Date = c("18-12-2009","2009/12/18","2010-05-20","2002/10/24",
                   "2001-07-20","18-12-2009","2001/04/25","2006-01-01",
                   "1999-03-31","2002-08-30"),
  Rating = c(4.8,4.8,4.5,4.7,NA,4.8,4.6,4.3,4.9,4.8),
  Director = c("cameron","Cameron","Nolan","zhang","Miyazaki",
               "cameron","Jeunet","Tykwer","Wachowski","Meirelles"),
  Duration = c(162,162,148,99,125,162,122,147,136,130)
)

OUTPUT:

   Film_ID  Country  Film_Name     Year  Budget     Release_Date  Rating  Director   Duration
1        1  india    Avatar        2009  237000000  18-12-2009       4.8  cameron         162
2        2  USA      Avatar        2009         NA  2009/12/18       4.8  Cameron         162
3        3  UK       null          2010  150000000  2010-05-20       4.5  Nolan           148
4        4  china    Hero          2002  -50000000  2002/10/24       4.7  zhang            99
5        5  japan    SpiritedAway  2001   19000000  2001-07-20        NA  Miyazaki        125
6        6  INDIA    Avatar        2009  237000000  18-12-2009       4.8  cameron         162
7        7  france   Amelie        2001   10000000  2001/04/25       4.6  Jeunet          122
8        8  GERMANY  NULL          2006   60000000  2006-01-01       4.3  Tykwer          147
9        9  usa      Matrix        1999   63000000  1999-03-31       4.9  Wachowski       136
10      10  brazil   CityOfGod     2002    3300000  2002-08-30       4.8  Meirelles       130

Issues Identified

  • Missing values (. / NA)
  • Invalid budget (negative)
  • Duplicate records (Film_ID 1 & 6)
  • Inconsistent text case (india vs INDIA, usa vs USA)
  • NULL strings
  • Date inconsistencies
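Before any cleaning, it helps to profile the raw records and see which rows each issue class touches. The following Python sketch (illustrative only; the record layout, function name, and issue labels are assumptions, not part of the SAS/R programs in this post) performs such a profiling pass:

```python
# Hypothetical profiling pass: scan raw records (dicts) and flag the
# issue classes listed above before any cleaning begins.
def profile(records):
    issues = []
    seen = set()
    for r in records:
        # Entity key mirrors the dedup key used later: country/name/year
        key = (r["Country"].strip().upper(), r["Film_Name"].strip().upper(), r["Year"])
        if key in seen:
            issues.append(("DUPLICATE", r["Film_ID"]))
        seen.add(key)
        if r["Budget"] is not None and r["Budget"] < 0:
            issues.append(("NEGATIVE_BUDGET", r["Film_ID"]))
        if r["Film_Name"].strip().upper() in ("NULL", ""):
            issues.append(("NULL_NAME", r["Film_ID"]))
        if r["Rating"] is None:
            issues.append(("MISSING_RATING", r["Film_ID"]))
    return issues

sample = [
    {"Film_ID": 1, "Country": "india", "Film_Name": "Avatar", "Year": 2009,
     "Budget": 237000000, "Rating": 4.8},
    {"Film_ID": 4, "Country": "china", "Film_Name": "Hero", "Year": 2002,
     "Budget": -50000000, "Rating": 4.7},
    {"Film_ID": 6, "Country": "INDIA", "Film_Name": "Avatar", "Year": 2009,
     "Budget": 237000000, "Rating": 4.8},
]
print(profile(sample))
# [('NEGATIVE_BUDGET', 4), ('DUPLICATE', 6)]
```

Counting flags per class gives a cleaning to-do list before any value is changed.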

3. Phase 1: Data Cleaning in SAS

SAS Cleaning Code

DATA films_clean;
  SET films_raw;

  /* Standardize Country */
  Country = UPCASE(STRIP(COALESCEC(Country, "UNKNOWN")));

  /* Handle NULL film names */
  IF UPCASE(Film_Name) IN ("NULL", "") THEN Film_Name = "UNKNOWN";

  /* Fix negative budget */
  IF Budget < 0 THEN Budget = ABS(Budget);

  /* Replace missing rating */
  IF Rating = . THEN Rating = 0;

  /* Standardize Director */
  Director = PROPCASE(STRIP(Director));

  /* Convert Date */
  Release_dt = INPUT(Release_Date, ANYDTDTE10.);
  FORMAT Release_dt DATE9.;
  DROP Release_Date;
RUN;

PROC PRINT DATA=films_clean;
RUN;

OUTPUT:

Obs  Film_ID  Country  Film_Name     Rating  Director   Year  Budget     Duration  Release_dt
  1        1  INDIA    Avatar           4.8  Cameron    2009  237000000       162  18DEC2009
  2        2  USA      Avatar           4.8  Cameron    2009          .       162  18DEC2009
  3        3  UK       UNKNOWN          4.5  Nolan      2010  150000000       148  20MAY2010
  4        4  CHINA    Hero             4.7  Zhang      2002   50000000        99  24OCT2002
  5        5  JAPAN    SpiritedAway     0.0  Miyazaki   2001   19000000       125  20JUL2001
  6        6  INDIA    Avatar           4.8  Cameron    2009  237000000       162  18DEC2009
  7        7  FRANCE   Amelie           4.6  Jeunet     2001   10000000       122  25APR2001
  8        8  GERMANY  UNKNOWN          4.3  Tykwer     2006   60000000       147  01JAN2006
  9        9  USA      Matrix           4.9  Wachowski  1999   63000000       136  31MAR1999
 10       10  BRAZIL   CityOfGod        4.8  Meirelles  2002    3300000       130  30AUG2002

/* Remove duplicates */
PROC SORT DATA=films_clean NODUPKEY;
  BY Country Film_Name Year Director;
RUN;

PROC PRINT DATA=films_clean;
RUN;

OUTPUT:

Obs  Film_ID  Country  Film_Name     Rating  Director   Year  Budget     Duration  Release_dt
  1       10  BRAZIL   CityOfGod        4.8  Meirelles  2002    3300000       130  30AUG2002
  2        4  CHINA    Hero             4.7  Zhang      2002   50000000        99  24OCT2002
  3        7  FRANCE   Amelie           4.6  Jeunet     2001   10000000       122  25APR2001
  4        8  GERMANY  UNKNOWN          4.3  Tykwer     2006   60000000       147  01JAN2006
  5        1  INDIA    Avatar           4.8  Cameron    2009  237000000       162  18DEC2009
  6        5  JAPAN    SpiritedAway     0.0  Miyazaki   2001   19000000       125  20JUL2001
  7        3  UK       UNKNOWN          4.5  Nolan      2010  150000000       148  20MAY2010
  8        2  USA      Avatar           4.8  Cameron    2009          .       162  18DEC2009
  9        9  USA      Matrix           4.9  Wachowski  1999   63000000       136  31MAR1999

PROC SORT DATA=films_clean;
  BY Film_ID;
RUN;

PROC PRINT DATA=films_clean;
RUN;

OUTPUT:

Obs  Film_ID  Country  Film_Name     Rating  Director   Year  Budget     Duration  Release_dt
  1        1  INDIA    Avatar           4.8  Cameron    2009  237000000       162  18DEC2009
  2        2  USA      Avatar           4.8  Cameron    2009          .       162  18DEC2009
  3        3  UK       UNKNOWN          4.5  Nolan      2010  150000000       148  20MAY2010
  4        4  CHINA    Hero             4.7  Zhang      2002   50000000        99  24OCT2002
  5        5  JAPAN    SpiritedAway     0.0  Miyazaki   2001   19000000       125  20JUL2001
  6        7  FRANCE   Amelie           4.6  Jeunet     2001   10000000       122  25APR2001
  7        8  GERMANY  UNKNOWN          4.3  Tykwer     2006   60000000       147  01JAN2006
  8        9  USA      Matrix           4.9  Wachowski  1999   63000000       136  31MAR1999
  9       10  BRAZIL   CityOfGod        4.8  Meirelles  2002    3300000       130  30AUG2002

Explanation

This SAS step applies structured data cleaning aligned with SDTM principles. COALESCEC ensures missing character values are replaced systematically. UPCASE standardizes categorical variables, which is critical for downstream joins. Negative budgets are corrected with ABS, avoiding illogical financial analytics. Date conversion via the ANYDTDTE. informat handles heterogeneous formats efficiently. Duplicate removal using PROC SORT NODUPKEY ensures entity-level uniqueness (similar to USUBJID in clinical trials). Overall, this pipeline delivers reproducibility, auditability, and regulatory compliance, all key requirements in pharmaceutical and enterprise analytics.

4. Phase 2: Data Cleaning in R

library(dplyr)

options(scipen = 999)

films_clean <- films_raw %>%
  mutate(
    # Upper-case first so "null"/"NULL" are caught consistently
    Country   = toupper(trimws(Country)),
    Country   = ifelse(Country %in% c("NULL", NA), "UNKNOWN", Country),
    Film_Name = toupper(trimws(Film_Name)),
    Film_Name = ifelse(Film_Name %in% c("NULL", NA), "UNKNOWN", Film_Name),
    Director  = tools::toTitleCase(trimws(Director)),
    Budget    = ifelse(Budget < 0, abs(Budget), Budget),
    Rating    = ifelse(is.na(Rating), 0, Rating),
    Release_Date = as.Date(Release_Date,
                           tryFormats = c("%d-%m-%Y", "%Y/%m/%d", "%Y-%m-%d")),
    Release_Date = dplyr::coalesce(
      Release_Date,
      as.Date(paste0(Year, "-01-01"))
    )
  ) %>%
  arrange(Country, Film_Name, Year, Director, desc(Budget)) %>%
  distinct(Country, Film_Name, Year, Director, .keep_all = TRUE) %>%
  arrange(Film_ID)

OUTPUT:

  Film_ID  Country  Film_Name     Year  Budget     Release_Date  Rating  Director   Duration
1       1  INDIA    AVATAR        2009  237000000  2009-12-18       4.8  Cameron         162
2       2  USA      AVATAR        2009         NA  2009-01-01       4.8  Cameron         162
3       3  UK       UNKNOWN       2010  150000000  2010-01-01       4.5  Nolan           148
4       4  CHINA    HERO          2002   50000000  2002-01-01       4.7  Zhang            99
5       5  JAPAN    SPIRITEDAWAY  2001   19000000  2001-01-01       0.0  Miyazaki        125
6       7  FRANCE   AMELIE        2001   10000000  2001-01-01       4.6  Jeunet          122
7       8  GERMANY  UNKNOWN       2006   60000000  2006-01-01       4.3  Tykwer          147
8       9  USA      MATRIX        1999   63000000  1999-01-01       4.9  Wachowski       136
9      10  BRAZIL   CITYOFGOD     2002    3300000  2002-01-01       4.8  Meirelles       130
Explanation

R uses a pipeline approach (dplyr) for clarity and flexibility. mutate() applies transformations column-wise, and ifelse() handles missing and invalid values efficiently. String normalization (toupper, trimws) ensures consistency. The as.Date() function accepts multiple candidate formats via tryFormats, but it chooses one format (based on the first non-NA element) for the whole vector, so dates in the other formats come back as NA and are imputed by the coalesce() fallback. Duplicate removal using distinct() mirrors SAS NODUPKEY. Compared to SAS, R is more expressive but less rigid, making it ideal for exploratory cleaning but requiring careful validation for production-grade workflows.

5. Phase 3: Additional SAS Processing

DATA films_enriched;
  SET films_clean;
  LENGTH Length_Category $8 Hit_Film $3;

  IF Duration > 150 THEN Length_Category = "LONG";
  ELSE Length_Category = "SHORT";

  IF Rating >= 4.5 THEN Hit_Film = "YES";
  ELSE Hit_Film = "NO";
RUN;

PROC PRINT DATA=films_enriched;
RUN;

OUTPUT:

Obs  Film_ID  Country  Film_Name     Rating  Director   Year  Budget     Duration  Release_dt  Length_Category  Hit_Film
  1        1  INDIA    Avatar           4.8  Cameron    2009  237000000       162  18DEC2009   LONG             YES
  2        2  USA      Avatar           4.8  Cameron    2009          .       162  18DEC2009   LONG             YES
  3        3  UK       UNKNOWN          4.5  Nolan      2010  150000000       148  20MAY2010   SHORT            YES
  4        4  CHINA    Hero             4.7  Zhang      2002   50000000        99  24OCT2002   SHORT            YES
  5        5  JAPAN    SpiritedAway     0.0  Miyazaki   2001   19000000       125  20JUL2001   SHORT            NO
  6        7  FRANCE   Amelie           4.6  Jeunet     2001   10000000       122  25APR2001   SHORT            YES
  7        8  GERMANY  UNKNOWN          4.3  Tykwer     2006   60000000       147  01JAN2006   SHORT            NO
  8        9  USA      Matrix           4.9  Wachowski  1999   63000000       136  31MAR1999   SHORT            YES
  9       10  BRAZIL   CityOfGod        4.8  Meirelles  2002    3300000       130  30AUG2002   SHORT            YES

/* Frequency Analysis */
PROC FREQ DATA=films_enriched;
  TABLES Country Hit_Film;
RUN;

OUTPUT:

The FREQ Procedure

Country   Frequency  Percent  Cumulative Frequency  Cumulative Percent
BRAZIL            1    11.11                     1               11.11
CHINA             1    11.11                     2               22.22
FRANCE            1    11.11                     3               33.33
GERMANY           1    11.11                     4               44.44
INDIA             1    11.11                     5               55.56
JAPAN             1    11.11                     6               66.67
UK                1    11.11                     7               77.78
USA               2    22.22                     9              100.00

Hit_Film  Frequency  Percent  Cumulative Frequency  Cumulative Percent
NO                2    22.22                     2               22.22
YES               7    77.78                     9              100.00

Explanation

This step introduces derived variables, similar to ADaM derivations. Categorizing film duration improves interpretability, and the binary classification (Hit_Film) mimics responder analysis in clinical trials. PROC FREQ enables quick validation checks, confirming that no anomalies remain post-cleaning. This stage bridges raw data cleaning and analytical readiness.

6. 20 Additional Data Cleaning Best Practices

  1. Always validate against source CRF or metadata
  2. Use controlled terminology (CDISC standards)
  3. Maintain audit trails
  4. Avoid overwriting raw data
  5. Perform range checks
  6. Validate date sequences
  7. Handle duplicates carefully
  8. Standardize case formats
  9. Use consistent missing value strategy
  10. Validate keys (USUBJID)
  11. Apply derivation traceability
  12. Use macros for repeatability
  13. Cross-check domains (SDTM consistency)
  14. Implement QC programming
  15. Document assumptions
  16. Avoid hardcoding values
  17. Validate categorical levels
  18. Monitor outliers
  19. Perform reconciliation checks
  20. Ensure regulatory compliance (FDA/EMA)

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic; it directly impacts decision-making. Missing values are often replaced to maintain dataset completeness; however, the method depends on context. For example, in clinical trials a missing patient age cannot be guessed, but missing lab values may be imputed based on protocol-defined rules. Similarly, negative salary or budget values are logically invalid, so correcting them using absolute values, or flagging them for review, ensures analytical consistency.
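The correct-versus-flag decision can be encoded explicitly so every adjusted value carries an audit flag. A minimal Python sketch, assuming a sign-entry error is the agreed explanation for negative budgets (the function and flag names are illustrative, not from the SAS/R programs above):

```python
def clean_budget(budget):
    """Return (cleaned_value, audit_flag) for one budget figure.
    Negative values are treated as sign-entry errors here; in a
    regulated pipeline you would query the source instead of guessing."""
    if budget is None:
        return None, "MISSING"        # leave missing; do not invent a value
    if budget < 0:
        return abs(budget), "SIGN_CORRECTED"
    return budget, "OK"

print(clean_budget(-50000000))  # (50000000, 'SIGN_CORRECTED')
print(clean_budget(None))       # (None, 'MISSING')
```

Keeping the flag alongside the value preserves traceability: downstream reports can count how many figures were adjusted rather than silently absorbing the correction.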

Unrealistic values distort models. Imagine a patient age of −5; it would skew demographic summaries. Likewise, a film budget of −50 million leads to incorrect ROI calculations. Date imputation is another critical area: inconsistent dates can break time-to-event analyses in survival models or release-trend analysis in business datasets.
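The try-several-formats-then-impute pattern used in the R code can be sketched per value in Python (unlike R's as.Date, which picks one format for the whole vector). The function name and the 01 Jan fallback rule here are illustrative assumptions:

```python
from datetime import date, datetime

# Candidate formats seen in the raw film data (dd-mm-yyyy, yyyy/mm/dd, ISO)
FORMATS = ["%d-%m-%Y", "%Y/%m/%d", "%Y-%m-%d"]

def parse_release_date(raw, year):
    """Try each known format in turn; impute 01 Jan of the release
    year when nothing parses (e.g. impossible dates like 30FEB2023)."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return date(year, 1, 1)  # documented imputation rule, not a guess

print(parse_release_date("18-12-2009", 2009))  # 2009-12-18
print(parse_release_date("2009/12/18", 2009))  # 2009-12-18
print(parse_release_date("30FEB2023", 2023))   # 2023-01-01 (fallback)
```

Whatever the tool, the key point is that the fallback rule is explicit and documented, so an auditor can tell imputed dates from parsed ones.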

Standardization ensures comparability. “india,” “INDIA,” and “India” must be unified; otherwise, aggregation fails. In regulatory environments like SDTM/ADaM, such inconsistencies can lead to submission rejection.
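To see why unification matters, a small Python sketch (illustrative only) shows a count-by-country aggregation splitting apart on raw case and collapsing correctly after standardization:

```python
from collections import Counter

films = [("india", "Avatar"), ("INDIA", "Avatar"),
         ("usa", "Matrix"), ("USA", "Avatar")]

# Without standardization, one country splits into several groups.
raw_counts = Counter(country for country, _ in films)
print(raw_counts)    # Counter({'india': 1, 'INDIA': 1, 'usa': 1, 'USA': 1})

# Trim and upper-case first, and the counts aggregate as intended.
clean_counts = Counter(country.strip().upper() for country, _ in films)
print(clean_counts)  # Counter({'INDIA': 2, 'USA': 2})
```

The same failure mode appears in SQL GROUP BY, SAS BY-group processing, and dplyr group_by(), which is why case normalization comes before any aggregation or merge.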

Ultimately, clean data ensures:

  • Accurate statistical inference
  • Reliable reporting
  • Regulatory acceptance
  • Better business insights

8. 20 Key Points

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing data must be handled systematically
  • Validation prevents downstream errors
  • Duplicate data inflates metrics
  • Date consistency is critical
  • SAS ensures structured pipelines
  • R provides flexibility
  • Always validate assumptions
  • Controlled terminology is essential
  • Cleaning is iterative
  • Documentation is mandatory
  • Automation improves efficiency
  • Outliers need investigation
  • Never trust raw data blindly
  • Consistency beats complexity
  • QC is not optional
  • Audit trails ensure traceability
  • Regulatory compliance is non-negotiable
  • Clean data builds trust

9. Summary

Data cleaning is the backbone of reliable analytics. In this blog, we explored how inconsistent data in a global film dataset mirrors real-world challenges seen in clinical trials and business systems. Using SAS and R, we demonstrated practical approaches to handle missing values, invalid entries, duplicate records, and inconsistent formats.

SAS stands out for its structured, regulatory-compliant workflows. Its deterministic behavior, strong data step processing, and procedures like PROC SORT make it ideal for production environments, especially in clinical programming (SDTM/ADaM). On the other hand, R offers flexibility and expressive syntax through packages like dplyr, making it highly effective for exploratory analysis and rapid prototyping.

We also extended the cleaning process into derivations and validation checks, emphasizing that cleaning is not an isolated step but part of a larger analytical pipeline. Best practices such as maintaining audit trails, validating keys, and ensuring consistency across domains highlight the importance of disciplined programming.

Ultimately, the choice between SAS and R is not binary; they complement each other. SAS ensures robustness and compliance, while R accelerates innovation and exploration. Together, they form a powerful toolkit for any data professional.

10. Conclusion

Inconsistent data is inevitable, but unmanaged inconsistency is unacceptable. Whether you are working on a clinical trial dataset or analyzing global film trends, the principles remain the same: structure, validate, standardize, and document.

This blog demonstrated that data cleaning is not just about fixing errors; it is about engineering trust into your data pipeline. SAS provides the rigor required for regulatory environments, ensuring traceability and reproducibility. R complements this with agility and flexibility, enabling deeper exploration and faster iteration.

The real-world takeaway is this: never treat data cleaning as a preliminary step. It is a core analytical function. Poor cleaning leads to flawed models, misleading insights, and potentially catastrophic decisions, especially in healthcare and finance.

By implementing structured cleaning frameworks, leveraging tools like SAS and R effectively, and adhering to best practices, you can transform chaotic datasets into reliable analytical assets.

In the end, clean data is not just about accuracy; it is about credibility, compliance, and confidence.

11. Interview Questions

Q1: How do you handle duplicate records in SAS?
A: Use PROC SORT NODUPKEY with appropriate BY variables. Ensure key selection aligns with business logic.

Q2: What would you do if dates are in multiple formats?
A: Use ANYDTDTE. in SAS or as.Date() with multiple formats in R.

Q3: How do you validate missing values in clinical datasets?
A: Cross-check with CRF, apply imputation rules per SAP, and document assumptions.

Q4: Difference between SAS and R in cleaning?
A: SAS is structured and regulatory-focused; R is flexible and exploratory.

Q5: Debugging scenario: a negative age is found. What next?
A: Investigate the source; correct it if derivable, otherwise set it to missing and flag it for review.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only, using a fictional movie dataset.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics
