448.Hidden Errors in Movie Data and How SAS LENGTH & FORMAT Bring Clarity
When Movie Data Lies: Mastering LENGTH & FORMAT in SAS to Transform Broken Datasets into Reliable Intelligence
1. Introduction
Imagine you're working as a data analyst in a movie streaming company something like Netflix or Amazon Prime Video. Your manager asks for a report: “Which movies performed best last quarter?”
You confidently run your analysis… but the results look strange.
- Some movie durations are negative
- Release dates show 20235 instead of 2023
- Genres appear as action, Action, ACTION
- Duplicate movies inflate revenue
Now imagine this happening in a clinical trial dataset wrong age, incorrect treatment dates, duplicate patient IDs. That’s not just messy it’s dangerous.
👉 Bad data doesn’t just reduce accuracy it destroys decision-making.
This is where tools like SAS and R become critical. Especially in SAS, understanding LENGTH and FORMAT statements is foundational. They don’t just store data they control how data behaves, how it is interpreted, and how it is presented.
Let’s walk through a movie dataset project with intentional errors and fix it step-by-step using SAS and R like a real-world data engineer.
2. Raw Data Creation in SAS and R
Variables:
- Movie_ID
- Title
- Genre
- Release_Date
- Duration
- Rating
- Revenue
SAS Raw Dataset (With Intentional Errors)
DATA movie_raw;
LENGTH Movie_ID $5 Title $20 Genre $15;
INPUT Movie_ID $ Title $ Genre $ Release_Date : $10.
Duration Rating Revenue;
DATALINES;
M001 Inception Action 2010-07-16 148 8.8 800
M002 avatar action 2009-12-18 162 7.8 2800
M003 Titanic Drama 1997-12-19 195 9.0 2200
M004 NULL Comedy 20235 120 6.5 500
M005 Joker drama 2019-10-04 -122 8.5 1000
M006 Avengers ACTION 2012-05-04 143 . 1500
M007 Frozen Animation 2013-11-27 102 7.5 .
M008 Bahubali Action 2015-07-10 159 8.2 600
M008 Bahubali Action 2015-07-10 159 8.2 600
M009 kgf action 2018-12-21 155 8.4 700
M010 Dangal Drama 2016-12-23 161 8.3 2000
M011 Pushpa Action 2021-12-17 179 7.9 900
M012 RRR action 2022-03-25 187 8.7 1200
M013 Interstellar SciFi 2014-11-07 169 8.6 700
M014 Null Comedy 2018-05-05 110 6.0 400
M015 Jawan Action 2023-09-07 169 7.2 1100
;
RUN;
PROC PRINT DATA=movie_raw;
RUN;
OUTPUT:
| Obs | Movie_ID | Title | Genre | Release_Date | Duration | Rating | Revenue |
|---|---|---|---|---|---|---|---|
| 1 | M001 | Inception | Action | 2010-07-16 | 148 | 8.8 | 800 |
| 2 | M002 | avatar | action | 2009-12-18 | 162 | 7.8 | 2800 |
| 3 | M003 | Titanic | Drama | 1997-12-19 | 195 | 9.0 | 2200 |
| 4 | M004 | NULL | Comedy | 20235 | 120 | 6.5 | 500 |
| 5 | M005 | Joker | drama | 2019-10-04 | -122 | 8.5 | 1000 |
| 6 | M006 | Avengers | ACTION | 2012-05-04 | 143 | . | 1500 |
| 7 | M007 | Frozen | Animation | 2013-11-27 | 102 | 7.5 | . |
| 8 | M008 | Bahubali | Action | 2015-07-10 | 159 | 8.2 | 600 |
| 9 | M008 | Bahubali | Action | 2015-07-10 | 159 | 8.2 | 600 |
| 10 | M009 | kgf | action | 2018-12-21 | 155 | 8.4 | 700 |
| 11 | M010 | Dangal | Drama | 2016-12-23 | 161 | 8.3 | 2000 |
| 12 | M011 | Pushpa | Action | 2021-12-17 | 179 | 7.9 | 900 |
| 13 | M012 | RRR | action | 2022-03-25 | 187 | 8.7 | 1200 |
| 14 | M013 | Interstellar | SciFi | 2014-11-07 | 169 | 8.6 | 700 |
| 15 | M014 | Null | Comedy | 2018-05-05 | 110 | 6.0 | 400 |
| 16 | M015 | Jawan | Action | 2023-09-07 | 169 | 7.2 | 1100 |
Explanation (SAS Raw Data)
This dataset intentionally includes real-world issues:
- Missing values (. for numeric, NULL text)
- Invalid duration (-122)
- Incorrect date (20235)
- Duplicate record (M008)
- Inconsistent casing (action vs ACTION)
Key LENGTH Insight:
The LENGTH statement ensures variables like Title don’t get truncated. Without it, SAS may assign shorter default lengths cutting data silently.
R Raw Dataset
movie_raw <- data.frame(
Movie_ID = c("M001","M002","M003","M004","M005","M006","M007",
"M008","M008","M009","M010","M011","M012","M013",
"M014","M015"),
Title = c("Inception","avatar","Titanic","NULL","Joker","Avengers",
"Frozen","Bahubali","Bahubali","kgf","Dangal","Pushpa",
"RRR","Interstellar","Null","Jawan"),
Genre = c("Action","action","Drama","Comedy","drama","ACTION",
"Animation","Action","Action","action","Drama","Action",
"action","SciFi","Comedy","Action"),
Release_Date = c("2010-07-16","2009-12-18","1997-12-19","20235",
"2019-10-04","2012-05-04","2013-11-27","2015-07-10",
"2015-07-10","2018-12-21","2016-12-23","2021-12-17",
"2022-03-25","2014-11-07","2018-05-05","2023-09-07"),
Duration = c(148,162,195,120,-122,143,102,159,159,155,161,179,187,169,110,169),
Rating = c(8.8,7.8,9.0,6.5,8.5,NA,7.5,8.2,8.2,8.4,8.3,7.9,8.7,8.6,6.0,7.2),
Revenue = c(800,2800,2200,500,1000,1500,NA,600,600,700,2000,900,1200,700,400,1100)
)
| Movie_ID | Title | Genre | Release_Date | Duration | Rating | Revenue |
1 | M001 | Inception | Action | 16-07-2010 | 148 | 8.8 | 800 |
2 | M002 | avatar | action | 18-12-2009 | 162 | 7.8 | 2800 |
3 | M003 | Titanic | Drama | 19-12-1997 | 195 | 9 | 2200 |
4 | M004 | NULL | Comedy | 20235 | 120 | 6.5 | 500 |
5 | M005 | Joker | drama | 04-10-2019 | -122 | 8.5 | 1000 |
6 | M006 | Avengers | ACTION | 04-05-2012 | 143 | NA | 1500 |
7 | M007 | Frozen | Animation | 27-11-2013 | 102 | 7.5 | NA |
8 | M008 | Bahubali | Action | 10-07-2015 | 159 | 8.2 | 600 |
9 | M008 | Bahubali | Action | 10-07-2015 | 159 | 8.2 | 600 |
10 | M009 | kgf | action | 21-12-2018 | 155 | 8.4 | 700 |
11 | M010 | Dangal | Drama | 23-12-2016 | 161 | 8.3 | 2000 |
12 | M011 | Pushpa | Action | 17-12-2021 | 179 | 7.9 | 900 |
13 | M012 | RRR | action | 25-03-2022 | 187 | 8.7 | 1200 |
14 | M013 | Interstellar | SciFi | 07-11-2014 | 169 | 8.6 | 700 |
15 | M014 | Null | Comedy | 05-05-2018 | 110 | 6 | 400 |
16 | M015 | Jawan | Action | 07-09-2023 | 169 | 7.2 | 1100 |
Explanation (R Raw Data)
R allows flexible structures, but it doesn’t enforce strict metadata like SAS.
This makes it powerful but also risky if validation is ignored.
Key Insight:
R stores strings without fixed length, unlike SAS LENGTH so truncation isn’t an issue, but inconsistency still is.
3. Phase 1: Data Cleaning in SAS
DATA movie_clean1;
SET movie_raw;
/* Handle missing text */
IF Title IN ("NULL","Null","") THEN Title = "UNKNOWN";
/* Standardize Genre */
Genre = UPCASE(STRIP(Genre));
/* Fix invalid Duration */
IF Duration < 0 THEN Duration = ABS(Duration);
/* Handle missing Rating */
IF Rating = . THEN Rating = 7.0;
/* Fix Date */
Release_Date_Num = INPUT(Release_Date, YYMMDD10.);
FORMAT Release_Date_Num DATE9.;
RUN;
PROC PRINT DATA=movie_clean1;
RUN;
OUTPUT:
| Obs | Movie_ID | Title | Genre | Release_Date | Duration | Rating | Revenue | Release_Date_Num |
|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Inception | ACTION | 2010-07-16 | 148 | 8.8 | 800 | 16JUL2010 |
| 2 | M002 | avatar | ACTION | 2009-12-18 | 162 | 7.8 | 2800 | 18DEC2009 |
| 3 | M003 | Titanic | DRAMA | 1997-12-19 | 195 | 9.0 | 2200 | 19DEC1997 |
| 4 | M004 | UNKNOWN | COMEDY | 20235 | 120 | 6.5 | 500 | . |
| 5 | M005 | Joker | DRAMA | 2019-10-04 | 122 | 8.5 | 1000 | 04OCT2019 |
| 6 | M006 | Avengers | ACTION | 2012-05-04 | 143 | 7.0 | 1500 | 04MAY2012 |
| 7 | M007 | Frozen | ANIMATION | 2013-11-27 | 102 | 7.5 | . | 27NOV2013 |
| 8 | M008 | Bahubali | ACTION | 2015-07-10 | 159 | 8.2 | 600 | 10JUL2015 |
| 9 | M008 | Bahubali | ACTION | 2015-07-10 | 159 | 8.2 | 600 | 10JUL2015 |
| 10 | M009 | kgf | ACTION | 2018-12-21 | 155 | 8.4 | 700 | 21DEC2018 |
| 11 | M010 | Dangal | DRAMA | 2016-12-23 | 161 | 8.3 | 2000 | 23DEC2016 |
| 12 | M011 | Pushpa | ACTION | 2021-12-17 | 179 | 7.9 | 900 | 17DEC2021 |
| 13 | M012 | RRR | ACTION | 2022-03-25 | 187 | 8.7 | 1200 | 25MAR2022 |
| 14 | M013 | Interstellar | SCIFI | 2014-11-07 | 169 | 8.6 | 700 | 07NOV2014 |
| 15 | M014 | UNKNOWN | COMEDY | 2018-05-05 | 110 | 6.0 | 400 | 05MAY2018 |
| 16 | M015 | Jawan | ACTION | 2023-09-07 | 169 | 7.2 | 1100 | 07SEP2023 |
/* Remove duplicates */
PROC SORT DATA=movie_clean1 NODUPKEY;
BY Movie_ID;
RUN;
PROC PRINT DATA=movie_clean1;
RUN;
OUTPUT:
| Obs | Movie_ID | Title | Genre | Release_Date | Duration | Rating | Revenue | Release_Date_Num |
|---|---|---|---|---|---|---|---|---|
| 1 | M001 | Inception | ACTION | 2010-07-16 | 148 | 8.8 | 800 | 16JUL2010 |
| 2 | M002 | avatar | ACTION | 2009-12-18 | 162 | 7.8 | 2800 | 18DEC2009 |
| 3 | M003 | Titanic | DRAMA | 1997-12-19 | 195 | 9.0 | 2200 | 19DEC1997 |
| 4 | M004 | UNKNOWN | COMEDY | 20235 | 120 | 6.5 | 500 | . |
| 5 | M005 | Joker | DRAMA | 2019-10-04 | 122 | 8.5 | 1000 | 04OCT2019 |
| 6 | M006 | Avengers | ACTION | 2012-05-04 | 143 | 7.0 | 1500 | 04MAY2012 |
| 7 | M007 | Frozen | ANIMATION | 2013-11-27 | 102 | 7.5 | . | 27NOV2013 |
| 8 | M008 | Bahubali | ACTION | 2015-07-10 | 159 | 8.2 | 600 | 10JUL2015 |
| 9 | M009 | kgf | ACTION | 2018-12-21 | 155 | 8.4 | 700 | 21DEC2018 |
| 10 | M010 | Dangal | DRAMA | 2016-12-23 | 161 | 8.3 | 2000 | 23DEC2016 |
| 11 | M011 | Pushpa | ACTION | 2021-12-17 | 179 | 7.9 | 900 | 17DEC2021 |
| 12 | M012 | RRR | ACTION | 2022-03-25 | 187 | 8.7 | 1200 | 25MAR2022 |
| 13 | M013 | Interstellar | SCIFI | 2014-11-07 | 169 | 8.6 | 700 | 07NOV2014 |
| 14 | M014 | UNKNOWN | COMEDY | 2018-05-05 | 110 | 6.0 | 400 | 05MAY2018 |
| 15 | M015 | Jawan | ACTION | 2023-09-07 | 169 | 7.2 | 1100 | 07SEP2023 |
Explanation (SAS Cleaning)
This phase demonstrates real SAS power:
- COALESCEC alternative via IF handling text
- UPCASE + STRIP ensures consistency
- ABS() corrects illogical values
- INPUT() converts raw text to date
- FORMAT ensures readable output (DATE9.)
Key LENGTH & FORMAT Insight:
- LENGTH controls storage
- FORMAT controls display
Without FORMAT, SAS stores date as numeric confusing users.
4. Phase 2: Data Cleaning in R
library(dplyr)
movie_clean <- movie_raw %>%
mutate(
Title = ifelse(Title %in% c("NULL","Null",""),"UNKNOWN", Title),
Genre = toupper(trimws(Genre)),
Duration = ifelse(Duration < 0, abs(Duration), Duration),
Rating = ifelse(is.na(Rating), 7.0, Rating),
Release_Date = as.Date(Release_Date, format="%Y-%m-%d"),
) %>%
distinct(Movie_ID, .keep_all = TRUE)
LOG:
OUTPUT:
| Movie_ID | Title | Genre | Release_Date | Duration | Rating | Revenue |
1 | M001 | Inception | ACTION | 16-07-2010 | 148 | 8.8 | 800 |
2 | M002 | avatar | ACTION | 18-12-2009 | 162 | 7.8 | 2800 |
3 | M003 | Titanic | DRAMA | 19-12-1997 | 195 | 9 | 2200 |
4 | M004 | UNKNOWN | COMEDY | NA | 120 | 6.5 | 500 |
5 | M005 | Joker | DRAMA | 04-10-2019 | 122 | 8.5 | 1000 |
6 | M006 | Avengers | ACTION | 04-05-2012 | 143 | 7 | 1500 |
7 | M007 | Frozen | ANIMATION | 27-11-2013 | 102 | 7.5 | NA |
8 | M008 | Bahubali | ACTION | 10-07-2015 | 159 | 8.2 | 600 |
9 | M009 | kgf | ACTION | 21-12-2018 | 155 | 8.4 | 700 |
10 | M010 | Dangal | DRAMA | 23-12-2016 | 161 | 8.3 | 2000 |
11 | M011 | Pushpa | ACTION | 17-12-2021 | 179 | 7.9 | 900 |
12 | M012 | RRR | ACTION | 25-03-2022 | 187 | 8.7 | 1200 |
13 | M013 | Interstellar | SCIFI | 07-11-2014 | 169 | 8.6 | 700 |
14 | M014 | UNKNOWN | COMEDY | 05-05-2018 | 110 | 6 | 400 |
15 | M015 | Jawan | ACTION | 07-09-2023 | 169 | 7.2 | 1100 |
Explanation (R Cleaning)
R uses dplyr for elegant transformations:
- mutate() modifies columns
- ifelse() handles conditions
- distinct() removes duplicates
- toupper() standardizes text
Key Insight:
R is expressive and concise but requires discipline for reproducibility.
5. 20 Additional Data Cleaning Best Practices
- Always validate data against protocol (SDTM standards)
- Maintain audit trails for regulatory compliance
- Never overwrite raw data
- Use metadata-driven programming
- Validate date ranges carefully
- Ensure unique identifiers (USUBJID in clinical trials)
- Handle missing data systematically
- Normalize categorical variables
- Use controlled terminology
- Validate duplicates before removal
- Use PROC COMPARE for QC
- Maintain version control
- Document derivation logic
- Use macros for scalability
- Perform cross-dataset validation
- Implement edit checks
- Validate numeric ranges
- Ensure traceability (Define.xml)
- Follow CDISC standards strictly
- Automate validation checks
6. Business Logic Behind Data Cleaning
Why do we clean data?
- Missing rating → Replace with average → ensures analysis continuity
- Negative duration → Correct → avoids logical errors
- Invalid dates → Fix → ensures timeline accuracy
In clinical trials:
- Wrong patient age = wrong safety conclusions
- Wrong treatment date = incorrect efficacy results
In business:
- Revenue duplicates = inflated profits
- Wrong categories = misleading segmentation
7. 20 Key Points
- Dirty data leads to wrong conclusions
- LENGTH prevents truncation
- FORMAT improves readability
- Missing data must be handled logically
- Standardization ensures consistency
- Duplicate removal improves accuracy
- Date validation is critical
- Text normalization avoids grouping errors
- SAS enforces structure
- R offers flexibility
- QC is non-negotiable
- Audit trails are essential
- Metadata drives reliability
- Automation saves time
- Business rules guide cleaning
- Validation ensures trust
- Reproducibility is key
- Clean data drives insights
- Errors propagate quickly
- Prevention is better than correction
8. Summary
Feature | SAS | R |
Data Control | Strong (LENGTH, FORMAT) | Flexible |
Validation | Structured | Custom |
Clinical Use | Preferred | Supportive |
Scalability | High | High |
9. Conclusion
Data cleaning is not a technical step it’s a strategic necessity.
Understanding LENGTH and FORMAT in SAS gives you control over:
- Data storage
- Data interpretation
- Data presentation
When combined with R’s flexibility, you gain a powerful toolkit.
Clean data is not optional it’s the foundation of truth.
10. Interview Questions
Q1: What happens if LENGTH is not defined in SAS?
Answer:
If the LENGTH statement is not explicitly defined, SAS assigns a default length based on the first occurrence of the variable in the DATA step. For character variables, this often leads to unintended truncation if later values exceed that initial length.
For example, if the first value of Title is "KGF" (3 characters), SAS may assign length = 3. Later, "Interstellar" will be truncated to "Int" resulting in silent data loss.
Key Insight:
LENGTH should always be defined upfront for character variables to ensure data integrity, especially in production or clinical datasets.
Q2: What is the difference between FORMAT and INFORMAT in SAS?
Answer:
- INFORMAT is used to read and convert raw data into SAS internal format (during input).
- FORMAT is used to control how data is displayed in output (during reporting).
For example:
- INPUT(Date_Char, YYMMDD10.) → converts character to numeric date
- FORMAT Date_Num DATE9. → displays it as 01JAN2023
Key Insight:
SAS stores dates as numeric values internally. INFORMAT handles conversion, while FORMAT ensures human-readable presentation.
Q3: How do you handle duplicate records in SAS?
Answer:
Duplicate records are typically handled using PROC SORT with the NODUPKEY option:
PROC SORT DATA=dataset NODUPKEY;
BY key_variable;
RUN;
This keeps only the first occurrence of each unique key.
In more advanced scenarios:
- Use PROC SORT NODUPREC to remove exact duplicates
- Use PROC SQL with GROUP BY or DISTINCT for more control
Key Insight:
Always validate duplicates before removing them some duplicates may be legitimate in clinical or transactional data.
Q4: How do you fix invalid numeric values in SAS?
Answer:
Invalid numeric values are handled using conditional logic and validation rules:
- Use IF conditions for range checks
- Use functions like ABS() to correct negative values
- Replace missing values using business rules
Example:
IF Age < 0 THEN Age = ABS(Age);
IF Salary > 1000000 THEN Salary = .; /* unrealistic */
Key Insight:
Corrections should always align with domain-specific business logic not arbitrary assumptions.
Q5: How does data cleaning differ between SAS and R?
Answer:
- SAS provides a structured, metadata-driven environment with strict control over variable attributes (LENGTH, FORMAT, INFORMAT). It is widely used in regulated industries like clinical trials due to its reproducibility and auditability.
- R offers flexibility and powerful data manipulation libraries like dplyr, enabling fast and expressive transformations. However, it requires disciplined coding practices to ensure consistency and traceability.
Key Insight:
SAS is preferred for compliance and standardization, while R excels in exploration and rapid data processing. Many organizations use both together.
Validation Checklist (Bonus)
✔ No missing critical variables
✔ No duplicates
✔ Valid date formats
✔ Consistent categories
✔ Logical numeric values
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about MOVIE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment