459. Elevating SAS Validation to a New Standard: When Vintage Cameras Meet Dirty Data
From Daguerreotypes to Data Integrity: Mastering SAS Data Validation Through a Global Photography History Dataset
1. Introduction
Imagine you're working on a clinical trial dataset where patient ages are negative, visit dates occur before birth dates, and half the gender values are "M", "male", "MALE", or even blank. Now imagine making regulatory submissions based on that.

That's not just messy; it's dangerous.

The same chaos exists in business datasets. Suppose you're analyzing a global Photography History dataset capturing key inventions, inventors, and years. But your dataset contains duplicate inventions, impossible years like 3025, missing inventor names, and inconsistent country formats like "usa", "USA", and "U.S.A.".

Bad data doesn't just reduce quality; it destroys trust, invalidates analytics, and leads to flawed decisions.

This is where SAS and R become powerful. SAS dominates regulated environments like clinical trials (SDTM/ADaM), ensuring traceability and compliance. R offers flexibility and rapid transformations.

In this blog, we'll simulate a messy real-world dataset, walk through data validation techniques in SAS, compare them with R, and build a robust, audit-ready pipeline.
2. Raw Data Creation in SAS and R
We intentionally introduce errors such as:
- Missing values
- Invalid years (future or negative)
- Duplicate records
- Inconsistent text formats
SAS Code: Raw Dataset Creation
DATA photography_raw;
INPUT ID Camera_Name:$15. Year Inventor $ Country $
Type $ Popularity_Rank Cost;
DATALINES;
1 Daguerreotype 1839 Daguerre France Analog 1 500
2 Calotype 1841 Talbot UK Analog 2 450
3 Kodak 1888 Eastman usa Analog 3 300
4 DigitalCam -1990 NULL USA Digital 4 700
5 DSLR 2000 Canon Japan Digital 5 1200
6 Mirrorless 2010 sony japan Digital 6 1500
7 Polaroid 1948 Land USA Instant 7 600
8 SmartphoneCam 2025 NULL India Digital 8 0
9 Kodak 1888 Eastman USA Analog 3 300
10 FilmCam . NULL UK Analog 9 400
;
RUN;
PROC PRINT DATA = photography_raw;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | France | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | usa | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | -1990 | NULL | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | Japan | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | japan | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | 2025 | NULL | India | Digital | 8 | 0 |
| 9 | 9 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 10 | 10 | FilmCam | . | NULL | UK | Analog | 9 | 400 |
Explanation
This
dataset mimics real-world inconsistencies. The Year variable contains
invalid values such as negative (-1990) and missing (.). Inventor names are
missing or labeled as “NULL”. Country names are inconsistent (“usa”, “USA”,
“japan”). Duplicate records exist (Kodak appears twice). Cost includes
unrealistic values like 0. This raw dataset is intentionally flawed to
demonstrate SAS data validation techniques. In real clinical or business
environments, such inconsistencies could lead to incorrect reporting or
regulatory rejection. Creating such test datasets is a best practice to
validate cleaning pipelines before applying them to production data.
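For readers without a SAS or R installation, the same intentionally flawed test dataset can be prototyped in plain Python. This is a stdlib-only sketch, not part of the original SAS/R code; as a simplifying assumption, the literal string "NULL" and the missing values are both represented here as Python `None`.

```python
# Stdlib-only Python version of the same intentionally messy dataset.
# Field names mirror the SAS/R code; "NULL"/missing become None here.
FIELDS = ("ID", "Camera_Name", "Year", "Inventor", "Country",
          "Type", "Popularity_Rank", "Cost")
ROWS = [
    (1, "Daguerreotype", 1839, "Daguerre", "France", "Analog", 1, 500),
    (2, "Calotype", 1841, "Talbot", "UK", "Analog", 2, 450),
    (3, "Kodak", 1888, "Eastman", "usa", "Analog", 3, 300),
    (4, "DigitalCam", -1990, None, "USA", "Digital", 4, 700),
    (5, "DSLR", 2000, "Canon", "Japan", "Digital", 5, 1200),
    (6, "Mirrorless", 2010, "sony", "japan", "Digital", 6, 1500),
    (7, "Polaroid", 1948, "Land", "USA", "Instant", 7, 600),
    (8, "SmartphoneCam", 2025, None, "India", "Digital", 8, 0),
    (9, "Kodak", 1888, "Eastman", "USA", "Analog", 3, 300),
    (10, "FilmCam", None, None, "UK", "Analog", 9, 400),
]
photography_raw = [dict(zip(FIELDS, row)) for row in ROWS]
print(len(photography_raw))  # 10
```

Keeping the same seeded errors (negative year, duplicates, case inconsistencies) in every language makes it easy to verify that each platform's cleaning pipeline produces the same result.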
R Code – Equivalent Raw Dataset
photography_raw <- data.frame(
ID = 1:10,
Camera_Name = c("Daguerreotype","Calotype","Kodak","DigitalCam",
"DSLR","Mirrorless","Polaroid","SmartphoneCam",
"Kodak","FilmCam"),
Year = c(1839,1841,1888,-1990,2000,2010,1948,2025,1888,NA),
Inventor = c("Daguerre","Talbot","Eastman","NULL","Canon","sony",
"Land","NULL","Eastman","NULL"),
Country = c("France","UK","usa","USA","Japan","japan","USA",
"India","USA","UK"),
Type = c("Analog","Analog","Analog","Digital","Digital","Digital",
"Instant","Digital","Analog","Analog"),
Popularity_Rank = c(1,2,3,4,5,6,7,8,3,9),
Cost = c(500,450,300,700,1200,1500,600,0,300,400)
)
OUTPUT:
|   | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | France | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | usa | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | -1990 | NULL | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | Japan | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | japan | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | 2025 | NULL | India | Digital | 8 | 0 |
| 9 | 9 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 10 | 10 | FilmCam | NA | NULL | UK | Analog | 9 | 400 |
Explanation
This R
dataset mirrors the SAS dataset, ensuring consistency across platforms. It
includes missing values (NA), invalid entries (negative years), and inconsistent
strings. The purpose is to demonstrate cross-platform validation. In real-world
analytics, teams often use SAS for production and R for exploratory analysis.
Maintaining identical datasets helps ensure reproducibility. The presence of
errors allows us to test functions like is.na(), mutate(), and distinct(). This
setup is critical for building robust pipelines that can handle noisy data
without manual intervention.
3. Phase 1: Data Cleaning in SAS
DATA photography_clean;
SET photography_raw;
/* Handle missing Inventor */
IF Inventor = "NULL" OR Inventor = "" THEN Inventor = "UNKNOWN";
/* Fix invalid Year */
IF Year < 1800 OR Year > 2023 THEN Year = .;
/* Standardize Country */
Country = UPCASE(STRIP(Country));
/* Fix Cost */
IF Cost <= 0 THEN Cost = .;
RUN;
PROC PRINT DATA = photography_clean;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
| 9 | 9 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 10 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
/* Remove duplicates */
PROC SORT DATA=photography_clean NODUPKEY;
BY Camera_Name Year;
RUN;
PROC PRINT DATA = photography_clean;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 2 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 3 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
| 6 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 7 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 8 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 9 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
PROC SORT DATA=photography_clean;
BY ID;
RUN;
PROC PRINT DATA = photography_clean;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
| 9 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
Explanation
This SAS
step applies core validation logic. Missing inventors are replaced using conditional
checks. Invalid years are nullified based on realistic bounds (photography
began ~1800s). Country names are standardized using UPCASE and STRIP, ensuring
consistency for grouping. Cost values less than or equal to zero are flagged as
missing. Finally, PROC SORT NODUPKEY removes duplicate records based on key
variables. This approach ensures data integrity while maintaining auditability.
In clinical trials, similar logic is applied to patient demographics and
adverse events, ensuring regulatory compliance and traceability.
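The same four cleaning rules plus the deduplication step can be expressed as a stdlib-Python sketch, which is handy for unit-testing the logic outside SAS. The function and field names below are illustrative assumptions mirroring the SAS DATA step, not output from it.

```python
# Minimal sketch of the SAS cleaning rules: missing inventor, year bounds,
# country standardization, non-positive cost, then key-based deduplication.
def clean_record(rec):
    out = dict(rec)
    if out["Inventor"] in ("NULL", "", None):        # IF Inventor = "NULL" ...
        out["Inventor"] = "UNKNOWN"
    if out["Year"] is not None and not (1800 <= out["Year"] <= 2023):
        out["Year"] = None                           # IF Year < 1800 OR Year > 2023
    out["Country"] = out["Country"].strip().upper()  # UPCASE(STRIP(Country))
    if out["Cost"] is not None and out["Cost"] <= 0:
        out["Cost"] = None                           # IF Cost <= 0
    return out

def dedupe(records, keys=("Camera_Name", "Year")):
    # Keep the first record per key, like PROC SORT NODUPKEY BY Camera_Name Year
    seen, result = set(), []
    for rec in records:
        k = tuple(rec[key] for key in keys)
        if k not in seen:
            seen.add(k)
            result.append(rec)
    return result

raw = [
    {"Camera_Name": "Kodak", "Year": 1888, "Inventor": "Eastman", "Country": "usa", "Cost": 300},
    {"Camera_Name": "Kodak", "Year": 1888, "Inventor": "Eastman", "Country": "USA", "Cost": 300},
    {"Camera_Name": "DigitalCam", "Year": -1990, "Inventor": "NULL", "Country": "USA", "Cost": 700},
]
clean = dedupe([clean_record(r) for r in raw])
print(len(clean))            # 2 records survive deduplication
print(clean[1]["Inventor"])  # UNKNOWN
print(clean[1]["Year"])      # None
```

One design note: like NODUPKEY, this keeps the *first* record per key, so sort order matters when the duplicates are not identical.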
4. Phase 2: Data Cleaning in R
library(dplyr)
photography_clean <- photography_raw %>%
mutate(
Inventor = ifelse(Inventor == "NULL" | Inventor == "",
"UNKNOWN", Inventor),
Year = ifelse(Year < 1800 | Year > 2023, NA, Year),
Country = toupper(trimws(Country)),
Cost = ifelse(Cost <= 0, NA, Cost)
) %>%
distinct(Camera_Name, Year, .keep_all = TRUE)
OUTPUT:
|   | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Daguerreotype | 1839 | Daguerre | FRANCE | Analog | 1 | 500 |
| 2 | 2 | Calotype | 1841 | Talbot | UK | Analog | 2 | 450 |
| 3 | 3 | Kodak | 1888 | Eastman | USA | Analog | 3 | 300 |
| 4 | 4 | DigitalCam | NA | UNKNOWN | USA | Digital | 4 | 700 |
| 5 | 5 | DSLR | 2000 | Canon | JAPAN | Digital | 5 | 1200 |
| 6 | 6 | Mirrorless | 2010 | sony | JAPAN | Digital | 6 | 1500 |
| 7 | 7 | Polaroid | 1948 | Land | USA | Instant | 7 | 600 |
| 8 | 8 | SmartphoneCam | NA | UNKNOWN | INDIA | Digital | 8 | NA |
| 9 | 10 | FilmCam | NA | UNKNOWN | UK | Analog | 9 | 400 |
Explanation
In R, the
dplyr package enables elegant transformations. The mutate() function updates
multiple columns simultaneously. Missing or invalid values are handled using ifelse().
Text standardization is achieved using toupper() and trimws(). Duplicate
removal is performed using distinct(), similar to SAS’s PROC SORT. This
pipeline is concise and readable, making it ideal for exploratory workflows.
However, unlike SAS, R does not inherently enforce audit trails, which is why
SAS is preferred in regulated environments.
5. Phase 3: Additional SAS Validation Code
PROC MEANS DATA=photography_clean N NMISS;
VAR Year Cost;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | N Miss |
|---|---|---|
| Year | 6 | 3 |
| Cost | 8 | 1 |
PROC FREQ DATA=photography_clean;
TABLES Country Type / MISSING;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| FRANCE | 1 | 11.11 | 1 | 11.11 |
| INDIA | 1 | 11.11 | 2 | 22.22 |
| JAPAN | 2 | 22.22 | 4 | 44.44 |
| UK | 2 | 22.22 | 6 | 66.67 |
| USA | 3 | 33.33 | 9 | 100.00 |
| Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Analog | 4 | 44.44 | 4 | 44.44 |
| Digital | 4 | 44.44 | 8 | 88.89 |
| Instant | 1 | 11.11 | 9 | 100.00 |
PROC PRINT DATA=photography_clean;
WHERE Year = .;
RUN;
OUTPUT:
| Obs | ID | Camera_Name | Year | Inventor | Country | Type | Popularity_Rank | Cost |
|---|---|---|---|---|---|---|---|---|
| 4 | 4 | DigitalCam | . | UNKNOWN | USA | Digital | 4 | 700 |
| 8 | 8 | SmartphoneCam | . | UNKNOWN | INDIA | Digital | 8 | . |
| 9 | 10 | FilmCam | . | UNKNOWN | UK | Analog | 9 | 400 |
Explanation
These
procedures provide validation insights. PROC MEANS identifies missing values
and distributions. PROC FREQ highlights categorical inconsistencies and missing
categories. PROC PRINT isolates problematic records for manual review.
Together, these steps act as a validation layer, ensuring cleaned data meets
business rules. In clinical trials, such checks are mandatory before SDTM/ADaM
submission, ensuring datasets are complete, consistent, and compliant.
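The same style of check — count missing values per numeric variable and tabulate categorical frequencies — can be sketched in stdlib Python. The records and helper name below are illustrative assumptions, not PROC output.

```python
from collections import Counter

# Sketch of PROC MEANS N NMISS and PROC FREQ style checks on a small sample.
records = [
    {"Year": 1839, "Cost": 500,  "Country": "FRANCE"},
    {"Year": None, "Cost": 700,  "Country": "USA"},
    {"Year": None, "Cost": None, "Country": "INDIA"},
]

def n_nmiss(records, var):
    # Return (non-missing count, missing count) for one variable.
    values = [r[var] for r in records]
    n = sum(v is not None for v in values)
    return n, len(values) - n

print(n_nmiss(records, "Year"))  # (1, 2)
print(n_nmiss(records, "Cost"))  # (2, 1)
# Frequency table of a categorical variable, like PROC FREQ TABLES Country
print(Counter(r["Country"] for r in records))
```

In practice these counts would be compared against expected thresholds, and any record failing a rule would be routed to a review listing, mirroring the PROC PRINT step above.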
6. 20 Additional Data Cleaning Best Practices
- Always validate against protocol/SAP
- Maintain audit trails
- Use controlled terminology (CDISC)
- Avoid hardcoding values
- Document assumptions
- Use macros for reusability
- Validate date sequences
- Check duplicates using composite keys
- Standardize categorical variables
- Flag outliers instead of deleting
- Perform cross-domain validation
- Use metadata-driven programming
- Automate QC checks
- Validate against external sources
- Use PROC COMPARE for validation
- Maintain version control
- Separate raw and cleaned datasets
- Use consistent naming conventions
- Perform unit testing
- Ensure reproducibility
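The "Use PROC COMPARE for validation" practice above can be approximated in stdlib Python as a field-by-field diff of two datasets keyed on an ID. This is a rough illustrative analogue under assumed record shapes, not a reimplementation of PROC COMPARE.

```python
# Rough PROC COMPARE analogue: diff two datasets keyed by ID and report
# every (key, field, old_value, new_value) that changed.
def compare(base, compare_to, key="ID"):
    base_by_key = {r[key]: r for r in base}
    diffs = []
    for rec in compare_to:
        old = base_by_key.get(rec[key])
        if old is None:
            diffs.append((rec[key], "<new record>", None, None))
            continue
        for field, new_val in rec.items():
            if old.get(field) != new_val:
                diffs.append((rec[key], field, old.get(field), new_val))
    return diffs

raw   = [{"ID": 3, "Country": "usa", "Cost": 300}]
clean = [{"ID": 3, "Country": "USA", "Cost": 300}]
print(compare(raw, clean))  # [(3, 'Country', 'usa', 'USA')]
```

A diff like this, saved alongside the cleaned dataset, doubles as a lightweight audit trail: every transformation is recorded as an explicit before/after pair.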
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic. For example, replacing missing values ensures continuity in analysis. In a photography dataset, if the inventor is missing, labeling it as "UNKNOWN" preserves the record without introducing bias. Similarly, unrealistic values like negative years must be corrected because photography did not exist before the 1800s. Keeping such values would distort historical analysis.

In clinical trials, patient age cannot be negative. If such values exist, they must be flagged or corrected based on source data. Salary normalization in business datasets ensures comparability across regions. Date imputation is another critical aspect: if a visit date is missing, it may be imputed based on study rules.
These
corrections directly impact decision-making. For instance, incorrect dates
could lead to wrong treatment duration calculations. In photography history,
incorrect invention years could misrepresent technological evolution.
Thus,
data cleaning ensures accuracy, consistency, and reliability, forming the
foundation for meaningful analytics.
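The rule-based flagging described above — flag a violation for review rather than silently deleting the record — can be sketched in stdlib Python. The rules and field names below are assumptions chosen to mirror this article's examples (negative age, visit before birth), not a production validation suite.

```python
from datetime import date

# Sketch of rule-based flagging: each rule appends a message instead of
# deleting or mutating the record, so reviewers can trace every issue.
def validate(rec):
    flags = []
    if rec["age"] is not None and rec["age"] < 0:
        flags.append("negative age")
    if rec["visit"] and rec["birth"] and rec["visit"] < rec["birth"]:
        flags.append("visit date before birth date")
    return flags

patient = {"age": -5, "birth": date(1990, 1, 1), "visit": date(1989, 6, 1)}
print(validate(patient))  # ['negative age', 'visit date before birth date']
```

Returning flags rather than fixing values in place keeps the correction decision with the business rules (or the source-data query), which is exactly where regulated workflows require it to live.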
8. 20 Key Points Of This Project
- Dirty data leads to wrong conclusions
- Validation is not optional; it's mandatory
- Standardization ensures reproducibility
- Missing data must be handled carefully
- Duplicate data inflates results
- Outliers require investigation
- Text inconsistencies break grouping
- Dates must follow logical sequences
- Audit trails are critical in SAS
- Regulatory compliance depends on clean data
- Metadata drives automation
- QC checks prevent downstream errors
- Business rules define cleaning logic
- Automation reduces manual errors
- SAS ensures traceability
- R ensures flexibility
- Clean data improves trust
- Validation improves accuracy
- Consistency enables scalability
- Data quality defines analytics success
9. Summary
This blog
demonstrated how a seemingly simple Photography History dataset can
become complex due to real-world data issues. We introduced intentional errors
such as missing values, invalid years, duplicates, and inconsistent text
formats. Using SAS, we applied structured data validation techniques, including
conditional logic, standardization, and duplicate removal. SAS proved to be
highly reliable, especially in maintaining audit trails and ensuring regulatory
compliance.
In
parallel, we implemented equivalent transformations in R using dplyr. R
provided a concise and flexible approach, making it ideal for exploratory
analysis. However, it lacks the built-in traceability that SAS offers, which is
critical in regulated environments like clinical trials.
The
comparison highlights a key insight: SAS and R are not competitors but
complementary tools. SAS excels in production-grade, compliant environments,
while R shines in rapid prototyping and visualization.
We also
explored business logic behind cleaning decisions, emphasizing that every
transformation must be justified. Additional best practices and validation
techniques further reinforced the importance of structured data pipelines.
Ultimately, data validation is not just a technical step; it is a strategic necessity. Clean data ensures accurate insights, reliable reporting, and confident decision-making.
10. Conclusion
Data validation is the backbone of any analytical workflow. Whether you are working with clinical trial datasets or historical photography data, the principles remain the same: accuracy, consistency, and reliability.
Through
this blog, we transformed a flawed dataset into a structured, analysis-ready
dataset using SAS and R. The journey highlighted the importance of identifying
and correcting errors such as missing values, invalid entries, and duplicates.
More importantly, it demonstrated how business logic drives every cleaning
decision.
SAS stands out as a robust tool for regulated environments, offering audit trails, reproducibility, and compliance. Its structured approach ensures that every transformation is traceable, a critical requirement in industries like pharmaceuticals. R, on the other hand, provides flexibility and speed, making it ideal for exploratory analysis and rapid prototyping.
However,
tools alone are not enough. A strong understanding of data validation
principles is essential. Analysts must think critically, question anomalies,
and apply domain knowledge to ensure data integrity.
In today's data-driven world, organizations rely heavily on analytics for decision-making. Poor data quality can lead to incorrect conclusions, financial losses, and regulatory issues. Therefore, investing in robust data validation frameworks is not optional; it is essential.
As you prepare for interviews or real-world projects, remember this: clean data is not a luxury; it is a necessity.
11. Interview Questions
Q1: How do you handle missing values in SAS?
A: Use IF conditions, COALESCEC, or PROC STDIZE depending on context.
Q2: How do you remove duplicates in SAS?
A: Use PROC SORT with the NODUPKEY option and appropriate BY key variables.
Q3 (Scenario): A Year value is negative; what do you do?
A: Validate against business rules, then set it to missing or correct it using source data.
Q4: How do you validate datasets in SAS?
A: Use PROC MEANS, PROC FREQ, PROC COMPARE, and custom QC checks.
Q5: What is the difference between SAS and R in data cleaning?
A: SAS ensures auditability and compliance; R offers flexibility and speed.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only, using photography history data as the example domain.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics