459. Elevating SAS Validation to a New Standard: When Vintage Cameras Meet Dirty Data
From Daguerreotypes to Data Integrity: Mastering SAS Data Validation Through a Global Photography History Dataset
1. Introduction
Imagine you're working on a clinical trial dataset where patient ages are negative, visit dates occur before birth dates, and half the gender values are "M", "male", "MALE", or even blank. Now imagine making regulatory submissions based on that.

That's not just messy; it's dangerous.

The same chaos exists in business datasets. Suppose you're analyzing a global Photography History dataset capturing key inventions, inventors, and years. But your dataset contains duplicate inventions, impossible years like 3025, missing inventor names, and inconsistent country formats like "usa", "USA", and "U.S.A.".

Bad data doesn't just reduce quality; it destroys trust, invalidates analytics, and leads to flawed decisions.

This is where SAS and R become powerful. SAS dominates regulated environments like clinical trials (SDTM/ADaM), ensuring traceability and compliance. R offers flexibility and rapid transformations.

In this blog, we'll simulate a messy real-world dataset, walk through data validation techniques in SAS, compare them with R, and build a robust, audit-ready pipeline.
2. Raw Data Creation in SAS and R
We intentionally introduce errors such as:
- Missing values
- Invalid years (future or negative)
- Duplicate records
- Inconsistent text formats
SAS Code: Raw Dataset Creation
DATA photography_raw;
INPUT ID Camera_Name:$15. Year Inventor $ Country $
Type $ Popularity_Rank Cost;
DATALINES;
1 Daguerreotype 1839 Daguerre France Analog 1 500
2 Calotype 1841 Talbot UK Analog 2 450
3 Kodak 1888 Eastman usa Analog 3 300
4 DigitalCam -1990 NULL USA Digital 4 700
5 DSLR 2000 Canon Japan Digital 5 1200
6 Mirrorless 2010 sony japan Digital 6 1500
7 Polaroid 1948 Land USA Instant 7 600
8 SmartphoneCam 2025 NULL India Digital 8 0
9 Kodak 1888 Eastman USA Analog 3 300
10 FilmCam . NULL UK Analog 9 400
;
RUN;
PROC PRINT DATA = photography_raw;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | France | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | usa | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | -1990 | NULL | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | Japan | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | japan | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | 2025 | NULL | India | Digital | 8 | 0 |
| 9 | 9 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 10 | 10 | FilmCam | . | NULL | UK | Analog | 9 | 400 |
Explanation
This
dataset mimics real-world inconsistencies. The Year variable contains
invalid values such as negative (-1990) and missing (.). Inventor names are
missing or labeled as “NULL”. Country names are inconsistent (“usa”, “USA”,
“japan”). Duplicate records exist (Kodak appears twice). Cost includes
unrealistic values like 0. This raw dataset is intentionally flawed to
demonstrate SAS data validation techniques. In real clinical or business
environments, such inconsistencies could lead to incorrect reporting or
regulatory rejection. Creating such test datasets is a best practice to
validate cleaning pipelines before applying them to production data.
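For readers without a SAS or R installation, the same intentionally flawed test dataset can be prototyped in plain Python. This is a stdlib-only sketch, not part of the original SAS/R code; as a simplifying assumption, the literal string "NULL" and the missing values are both represented here as Python `None`.

```python
# Stdlib-only Python version of the same intentionally messy dataset.
# Field names mirror the SAS/R code; "NULL"/missing become None here.
FIELDS = ("ID", "Camera_Name", "Year", "Inventor", "Country",
          "Type", "Popularity_Rank", "Cost")
ROWS = [
    (1, "Daguerreotype", 1839, "Daguerre", "France", "Analog", 1, 500),
    (2, "Calotype", 1841, "Talbot", "UK", "Analog", 2, 450),
    (3, "Kodak", 1888, "Eastman", "usa", "Analog", 3, 300),
    (4, "DigitalCam", -1990, None, "USA", "Digital", 4, 700),
    (5, "DSLR", 2000, "Canon", "Japan", "Digital", 5, 1200),
    (6, "Mirrorless", 2010, "sony", "japan", "Digital", 6, 1500),
    (7, "Polaroid", 1948, "Land", "USA", "Instant", 7, 600),
    (8, "SmartphoneCam", 2025, None, "India", "Digital", 8, 0),
    (9, "Kodak", 1888, "Eastman", "USA", "Analog", 3, 300),
    (10, "FilmCam", None, None, "UK", "Analog", 9, 400),
]
photography_raw = [dict(zip(FIELDS, row)) for row in ROWS]
print(len(photography_raw))  # 10
```

Keeping the same seeded errors (negative year, duplicates, case inconsistencies) in every language makes it easy to verify that each platform's cleaning pipeline produces the same result.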
R Code – Equivalent Raw Dataset
photography_raw <- data.frame(
ID = 1:10,
Camera_Name = c("Daguerreotype","Calotype","Kodak","DigitalCam",
"DSLR","Mirrorless","Polaroid","SmartphoneCam",
"Kodak","FilmCam"),
Year = c(1839,1841,1888,-1990,2000,2010,1948,2025,1888,NA),
Inventor = c("Daguerre","Talbot","Eastman","NULL","Canon","sony",
"Land","NULL","Eastman","NULL"),
Country = c("France","UK","usa","USA","Japan","japan","USA",
"India","USA","UK"),
Type = c("Analog","Analog","Analog","Digital","Digital","Digital",
"Instant","Digital","Analog","Analog"),
Popularity_Rank = c(1,2,3,4,5,6,7,8,3,9),
Cost = c(500,450,300,700,1200,1500,600,0,300,400)
)
OUTPUT:
|   | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | France | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | usa | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | -1990 | NULL | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | Japan | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | japan | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | 2025 | NULL | India | Digital | 8 | 0 |
| 9 | 9 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 10 | 10 | FilmCam | NA | NULL | UK | Analog | 9 | 400 |
Explanation
This R
dataset mirrors the SAS dataset, ensuring consistency across platforms. It
includes missing values (NA), invalid entries (negative years), and inconsistent
strings. The purpose is to demonstrate cross-platform validation. In real-world
analytics, teams often use SAS for production and R for exploratory analysis.
Maintaining identical datasets helps ensure reproducibility. The presence of
errors allows us to test functions like is.na(), mutate(), and distinct(). This
setup is critical for building robust pipelines that can handle noisy data
without manual intervention.
3. Phase 1: Data Cleaning in SAS
DATA photography_clean;
SET photography_raw;
/* Handle missing Inventor */
IF Inventor = "NULL" OR Inventor = "" THEN Inventor = "UNKNOWN";
/* Fix invalid Year */
IF Year < 1800 OR Year > 2023 THEN Year = .;
/* Standardize Country */
Country = UPCASE(STRIP(Country));
/* Fix Cost */
IF Cost <= 0 THEN Cost = .;
RUN;
PROC PRINT DATA = photography_clean;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
| 9 | 9 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 10 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
/* Remove duplicates */
PROC SORT DATA=photography_clean NODUPKEY;
BY Camera_Name Year;
RUN;
PROC PRINT DATA = photography_clean;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 2 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 3 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
| 6 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 7 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 8 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 9 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
PROC SORT DATA=photography_clean;
BY ID;
RUN;
PROC PRINT DATA = photography_clean;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
| 9 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
Explanation
This SAS
step applies core validation logic. Missing inventors are replaced using conditional
checks. Invalid years are nullified based on realistic bounds (photography
began ~1800s). Country names are standardized using UPCASE and STRIP, ensuring
consistency for grouping. Cost values less than or equal to zero are flagged as
missing. Finally, PROC SORT NODUPKEY removes duplicate records based on key
variables. This approach ensures data integrity while maintaining auditability.
In clinical trials, similar logic is applied to patient demographics and
adverse events, ensuring regulatory compliance and traceability.
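The same four cleaning rules plus the deduplication step can be expressed as a stdlib-Python sketch, which is handy for unit-testing the logic outside SAS. The function and field names below are illustrative assumptions mirroring the SAS DATA step, not output from it.

```python
# Minimal sketch of the SAS cleaning rules: missing inventor, year bounds,
# country standardization, non-positive cost, then key-based deduplication.
def clean_record(rec):
    out = dict(rec)
    if out["Inventor"] in ("NULL", "", None):        # IF Inventor = "NULL" ...
        out["Inventor"] = "UNKNOWN"
    if out["Year"] is not None and not (1800 <= out["Year"] <= 2023):
        out["Year"] = None                           # IF Year < 1800 OR Year > 2023
    out["Country"] = out["Country"].strip().upper()  # UPCASE(STRIP(Country))
    if out["Cost"] is not None and out["Cost"] <= 0:
        out["Cost"] = None                           # IF Cost <= 0
    return out

def dedupe(records, keys=("Camera_Name", "Year")):
    # Keep the first record per key, like PROC SORT NODUPKEY BY Camera_Name Year
    seen, result = set(), []
    for rec in records:
        k = tuple(rec[key] for key in keys)
        if k not in seen:
            seen.add(k)
            result.append(rec)
    return result

raw = [
    {"Camera_Name": "Kodak", "Year": 1888, "Inventor": "Eastman", "Country": "usa", "Cost": 300},
    {"Camera_Name": "Kodak", "Year": 1888, "Inventor": "Eastman", "Country": "USA", "Cost": 300},
    {"Camera_Name": "DigitalCam", "Year": -1990, "Inventor": "NULL", "Country": "USA", "Cost": 700},
]
clean = dedupe([clean_record(r) for r in raw])
print(len(clean))            # 2 records survive deduplication
print(clean[1]["Inventor"])  # UNKNOWN
print(clean[1]["Year"])      # None
```

One design note: like NODUPKEY, this keeps the *first* record per key, so sort order matters when the duplicates are not identical.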
4. Phase 2: Data Cleaning in R
library(dplyr)
photography_clean <- photography_raw %>%
mutate(
Inventor = ifelse(Inventor == "NULL" | Inventor == "",
"UNKNOWN", Inventor),
Year = ifelse(Year < 1800 | Year > 2023, NA, Year),
Country = toupper(trimws(Country)),
Cost = ifelse(Cost <= 0, NA, Cost)
) %>%
distinct(Camera_Name, Year, .keep_all = TRUE)
OUTPUT:
|   | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | NA | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | NA | UNKNOWN | INDIA | Digital | 8 | NA |
| 9 | 10 | FilmCam | NA | UNKNOWN | UK | Analog | 9 | 400 |
Explanation
In R, the
dplyr package enables elegant transformations. The mutate() function updates
multiple columns simultaneously. Missing or invalid values are handled using ifelse().
Text standardization is achieved using toupper() and trimws(). Duplicate
removal is performed using distinct(), similar to SAS’s PROC SORT. This
pipeline is concise and readable, making it ideal for exploratory workflows.
However, unlike SAS, R does not inherently enforce audit trails, which is why
SAS is preferred in regulated environments.
5. Phase 3: Additional SAS Validation Code
PROC MEANS DATA=photography_clean N NMISS;
VAR Year Cost;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | N Miss |
|---|---|---|
| Year | 6 | 3 |
| Cost | 8 | 1 |
PROC FREQ DATA=photography_clean;
TABLES Country Type / MISSING;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| FRANCE | 1 | 11.11 | 1 | 11.11 |
| INDIA | 1 | 11.11 | 2 | 22.22 |
| JAPAN | 2 | 22.22 | 4 | 44.44 |
| UK | 2 | 22.22 | 6 | 66.67 |
| USA | 3 | 33.33 | 9 | 100.00 |
| Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Analog | 4 | 44.44 | 4 | 44.44 |
| Digital | 4 | 44.44 | 8 | 88.89 |
| Instant | 1 | 11.11 | 9 | 100.00 |
PROC PRINT DATA=photography_clean;
WHERE Year = .;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 8 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
| 9 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
Explanation
These
procedures provide validation insights. PROC MEANS identifies missing values
and distributions. PROC FREQ highlights categorical inconsistencies and missing
categories. PROC PRINT isolates problematic records for manual review.
Together, these steps act as a validation layer, ensuring cleaned data meets
business rules. In clinical trials, such checks are mandatory before SDTM/ADaM
submission, ensuring datasets are complete, consistent, and compliant.
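The same style of check — count missing values per numeric variable and tabulate categorical frequencies — can be sketched in stdlib Python. The records and helper name below are illustrative assumptions, not PROC output.

```python
from collections import Counter

# Sketch of PROC MEANS N NMISS and PROC FREQ style checks on a small sample.
records = [
    {"Year": 1839, "Cost": 500,  "Country": "FRANCE"},
    {"Year": None, "Cost": 700,  "Country": "USA"},
    {"Year": None, "Cost": None, "Country": "INDIA"},
]

def n_nmiss(records, var):
    # Return (non-missing count, missing count) for one variable.
    values = [r[var] for r in records]
    n = sum(v is not None for v in values)
    return n, len(values) - n

print(n_nmiss(records, "Year"))  # (1, 2)
print(n_nmiss(records, "Cost"))  # (2, 1)
# Frequency table of a categorical variable, like PROC FREQ TABLES Country
print(Counter(r["Country"] for r in records))
```

In practice these counts would be compared against expected thresholds, and any record failing a rule would be routed to a review listing, mirroring the PROC PRINT step above.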
6. 20 Additional Data Cleaning Best Practices
- Always validate against protocol/SAP
- Maintain audit trails
- Use controlled terminology (CDISC)
- Avoid hardcoding values
- Document assumptions
- Use macros for reusability
- Validate date sequences
- Check duplicates using composite keys
- Standardize categorical variables
- Flag outliers instead of deleting
- Perform cross-domain validation
- Use metadata-driven programming
- Automate QC checks
- Validate against external sources
- Use PROC COMPARE for validation
- Maintain version control
- Separate raw and cleaned datasets
- Use consistent naming conventions
- Perform unit testing
- Ensure reproducibility
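The "Use PROC COMPARE for validation" practice above can be approximated in stdlib Python as a field-by-field diff of two datasets keyed on an ID. This is a rough illustrative analogue under assumed record shapes, not a reimplementation of PROC COMPARE.

```python
# Rough PROC COMPARE analogue: diff two datasets keyed by ID and report
# every (key, field, old_value, new_value) that changed.
def compare(base, compare_to, key="ID"):
    base_by_key = {r[key]: r for r in base}
    diffs = []
    for rec in compare_to:
        old = base_by_key.get(rec[key])
        if old is None:
            diffs.append((rec[key], "<new record>", None, None))
            continue
        for field, new_val in rec.items():
            if old.get(field) != new_val:
                diffs.append((rec[key], field, old.get(field), new_val))
    return diffs

raw   = [{"ID": 3, "Country": "usa", "Cost": 300}]
clean = [{"ID": 3, "Country": "USA", "Cost": 300}]
print(compare(raw, clean))  # [(3, 'Country', 'usa', 'USA')]
```

A diff like this, saved alongside the cleaned dataset, doubles as a lightweight audit trail: every transformation is recorded as an explicit before/after pair.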
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic. For example, replacing missing values ensures continuity in analysis. In a photography dataset, if the inventor is missing, labeling it as "UNKNOWN" preserves the record without introducing bias. Similarly, unrealistic values like negative years must be corrected because photography did not exist before the 1800s. Keeping such values would distort historical analysis.

In clinical trials, patient age cannot be negative. If such values exist, they must be flagged or corrected based on source data. Salary normalization in business datasets ensures comparability across regions. Date imputation is another critical aspect: if a visit date is missing, it may be imputed based on study rules.
These
corrections directly impact decision-making. For instance, incorrect dates
could lead to wrong treatment duration calculations. In photography history,
incorrect invention years could misrepresent technological evolution.
Thus,
data cleaning ensures accuracy, consistency, and reliability, forming the
foundation for meaningful analytics.
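The rule-based flagging described above — flag a violation for review rather than silently deleting the record — can be sketched in stdlib Python. The rules and field names below are assumptions chosen to mirror this article's examples (negative age, visit before birth), not a production validation suite.

```python
from datetime import date

# Sketch of rule-based flagging: each rule appends a message instead of
# deleting or mutating the record, so reviewers can trace every issue.
def validate(rec):
    flags = []
    if rec["age"] is not None and rec["age"] < 0:
        flags.append("negative age")
    if rec["visit"] and rec["birth"] and rec["visit"] < rec["birth"]:
        flags.append("visit date before birth date")
    return flags

patient = {"age": -5, "birth": date(1990, 1, 1), "visit": date(1989, 6, 1)}
print(validate(patient))  # ['negative age', 'visit date before birth date']
```

Returning flags rather than fixing values in place keeps the correction decision with the business rules (or the source-data query), which is exactly where regulated workflows require it to live.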
8. 20 Key Points Of This Project
- Dirty data leads to wrong conclusions
- Validation is not optional; it's mandatory
- Standardization ensures reproducibility
- Missing data must be handled carefully
- Duplicate data inflates results
- Outliers require investigation
- Text inconsistencies break grouping
- Dates must follow logical sequences
- Audit trails are critical in SAS
- Regulatory compliance depends on clean data
- Metadata drives automation
- QC checks prevent downstream errors
- Business rules define cleaning logic
- Automation reduces manual errors
- SAS ensures traceability
- R ensures flexibility
- Clean data improves trust
- Validation improves accuracy
- Consistency enables scalability
- Data quality defines analytics success
9. Summary
This blog
demonstrated how a seemingly simple Photography History dataset can
become complex due to real-world data issues. We introduced intentional errors
such as missing values, invalid years, duplicates, and inconsistent text
formats. Using SAS, we applied structured data validation techniques, including
conditional logic, standardization, and duplicate removal. SAS proved to be
highly reliable, especially in maintaining audit trails and ensuring regulatory
compliance.
In
parallel, we implemented equivalent transformations in R using dplyr. R
provided a concise and flexible approach, making it ideal for exploratory
analysis. However, it lacks the built-in traceability that SAS offers, which is
critical in regulated environments like clinical trials.
The
comparison highlights a key insight: SAS and R are not competitors but
complementary tools. SAS excels in production-grade, compliant environments,
while R shines in rapid prototyping and visualization.
We also
explored business logic behind cleaning decisions, emphasizing that every
transformation must be justified. Additional best practices and validation
techniques further reinforced the importance of structured data pipelines.
Ultimately, data validation is not just a technical step; it is a strategic necessity. Clean data ensures accurate insights, reliable reporting, and confident decision-making.
10. Conclusion
Data validation is the backbone of any analytical workflow. Whether you are working with clinical trial datasets or historical photography data, the principles remain the same: accuracy, consistency, and reliability.
Through
this blog, we transformed a flawed dataset into a structured, analysis-ready
dataset using SAS and R. The journey highlighted the importance of identifying
and correcting errors such as missing values, invalid entries, and duplicates.
More importantly, it demonstrated how business logic drives every cleaning
decision.
SAS stands out as a robust tool for regulated environments, offering audit trails, reproducibility, and compliance. Its structured approach ensures that every transformation is traceable, a critical requirement in industries like pharmaceuticals. R, on the other hand, provides flexibility and speed, making it ideal for exploratory analysis and rapid prototyping.
However,
tools alone are not enough. A strong understanding of data validation
principles is essential. Analysts must think critically, question anomalies,
and apply domain knowledge to ensure data integrity.
In today's data-driven world, organizations rely heavily on analytics for decision-making. Poor data quality can lead to incorrect conclusions, financial losses, and regulatory issues. Therefore, investing in robust data validation frameworks is not optional; it is essential.
As you prepare for interviews or real-world projects, remember this: clean data is not a luxury; it is a necessity.
11. Interview Questions
Q1: How do you handle missing values in SAS?
A: Use IF conditions, COALESCEC, or PROC STDIZE depending on context.
Q2: How do you remove duplicates in SAS?
A: Use PROC SORT with the NODUPKEY option and appropriate BY key variables.
Q3 (Scenario): A Year value is negative; what do you do?
A: Validate against business rules, then set it to missing or correct it using source data.
Q4: How do you validate datasets in SAS?
A: Use PROC MEANS, PROC FREQ, PROC COMPARE, and custom QC checks.
Q5: What is the difference between SAS and R in data cleaning?
A: SAS ensures auditability and compliance; R offers flexibility and speed.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only, using photography history data as the example domain.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics