458. Data Cleaning Secrets Using a Famous Food Dataset: Handling Duplicate Records in SAS
From Global Plate Chaos to Clean Insights: Mastering Duplicate Handling in SAS with Real-World Food Data
1. Introduction
Imagine you’re working as a clinical data analyst at a pharmaceutical company. You receive a patient dataset for a Phase III trial. Everything looks fine until your summary statistics show 120 patients enrolled, while the actual study had only 100. Panic.
What went wrong? Duplicate records.
Now shift that same scenario to a business dataset, say global food analytics. You’re analyzing the “most popular foods worldwide” to guide a multinational restaurant expansion strategy. But duplicate rows inflate demand for Pizza in Italy or Sushi in Japan. Suddenly, your insights are misleading, and millions of dollars are at stake.
Bad data doesn’t just “look messy”; it destroys decision-making.
This is where tools like SAS and R come in. They don’t just clean data; they enforce discipline, reproducibility, and regulatory compliance (especially critical in clinical trials using SDTM/ADaM).
In this blog, we’ll simulate a “Famous Foods in the World” dataset with intentional issues, then systematically clean and de-duplicate it using SAS and R, just like a real-world data engineer.
2. Raw Data Creation in SAS and R (With Intentional Errors)
SAS Code (Raw Dataset)
DATA food_raw;
INPUT ID Country $ Food $ Price Rating Launch_Date :$10.;
DATALINES;
1 India Biryani 250 4.5 2023-01-10
2 USA Burger 150 4.2 2023-02-30
3 Italy Pizza 300 4.8 2023-03-12
4 Japan Sushi 500 4.9 2023-04-05
5 India biryani 250 4.5 2023-01-10
6 USA Burger . 4.2 2023-02-28
7 NULL Pasta 200 3.8 2023-03-15
8 China Noodles -50 4.1 2023-05-20
9 Italy Pizza 300 4.8 2023-03-12
10 Japan Sushi 500 4.9 .
11 India Biryani 250 4.5 2023-01-10
;
RUN;
PROC PRINT DATA = food_raw;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | India | Biryani | 250 | 4.5 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 |
| 3 | 3 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 4 | 4 | Japan | Sushi | 500 | 4.9 | 2023-04-05 |
| 5 | 5 | India | biryani | 250 | 4.5 | 2023-01-10 |
| 6 | 6 | USA | Burger | . | 4.2 | 2023-02-28 |
| 7 | 7 | NULL | Pasta | 200 | 3.8 | 2023-03-15 |
| 8 | 8 | China | Noodles | -50 | 4.1 | 2023-05-20 |
| 9 | 9 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 10 | 10 | Japan | Sushi | 500 | 4.9 | |
| 11 | 11 | India | Biryani | 250 | 4.5 | 2023-01-10 |
Explanation
This dataset intentionally contains real-world data quality issues. Duplicate records exist (e.g., Pizza and Biryani are repeated). Missing values appear as . and NULL. Invalid data includes a negative price (-50) and an impossible date (2023-02-30). Text inconsistency is present (Biryani vs biryani). Such issues are extremely common in raw data pipelines, especially when data comes from multiple sources such as APIs, manual entry, or legacy systems. In clinical trials, similar inconsistencies can occur in patient demographics or lab data, leading to incorrect analysis. This raw dataset sets the foundation for demonstrating robust cleaning techniques.
R Code – Equivalent Raw Dataset
food_raw <- data.frame(
ID = 1:11,
Country = c("India","USA","Italy","Japan","India","USA",
"NULL","China","Italy","Japan","India"),
Food = c("Biryani","Burger","Pizza","Sushi","biryani",
"Burger","Pasta","Noodles","Pizza","Sushi","Biryani"),
Price = c(250,150,300,500,250,NA,200,-50,300,500,250),
Rating = c(4.5,4.2,4.8,4.9,4.5,4.2,3.8,4.1,4.8,4.9,4.5),
Launch_Date = c("2023-01-10","2023-02-30","2023-03-12",
"2023-04-05","2023-01-10","2023-02-28",
"2023-03-15","2023-05-20","2023-03-12",NA,
"2023-01-10")
)
OUTPUT:
| | ID | Country | Food | Price | Rating | Launch_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | India | Biryani | 250 | 4.5 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 |
| 3 | 3 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 4 | 4 | Japan | Sushi | 500 | 4.9 | 2023-04-05 |
| 5 | 5 | India | biryani | 250 | 4.5 | 2023-01-10 |
| 6 | 6 | USA | Burger | NA | 4.2 | 2023-02-28 |
| 7 | 7 | NULL | Pasta | 200 | 3.8 | 2023-03-15 |
| 8 | 8 | China | Noodles | -50 | 4.1 | 2023-05-20 |
| 9 | 9 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 10 | 10 | Japan | Sushi | 500 | 4.9 | NA |
| 11 | 11 | India | Biryani | 250 | 4.5 | 2023-01-10 |
Explanation
The R dataset mirrors the SAS structure, ensuring cross-platform consistency. Missing values are represented using NA, while invalid values like the negative price and incorrect dates remain intact. Duplicate entries are intentionally included to demonstrate how distinct() works. In real-world pipelines, R is frequently used for exploratory analysis, while SAS dominates regulatory environments; understanding both ensures flexibility. This dataset highlights how even structured formats like data.frame() can contain inconsistent entries if validation rules are not enforced during ingestion. It prepares us for transformation using dplyr.
3. Phase 1: Data Cleaning in SAS
DATA food_clean;
SET food_raw;
/* Fix missing country */
IF Country IN ("NULL","") THEN Country = "UNKNOWN";
/* Standardize text */
Country = UPCASE(STRIP(Country));
Food = PROPCASE(STRIP(Food));
/* Fix invalid price */
IF Price < 0 THEN Price = ABS(Price);
/* Handle missing price */
IF Price = . THEN Price = 100;
/* Fix date: the ?? modifier suppresses log errors for invalid
   values such as 2023-02-30, which simply become missing */
Launch_dt = INPUT(Launch_Date, ?? yymmdd10.);
FORMAT Launch_dt yymmdd10.;
RUN;
PROC PRINT DATA = food_clean;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date | Launch_dt |
|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 | . |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 | 2023-04-05 |
| 5 | 5 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
| 6 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 | 2023-02-28 |
| 7 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 | 2023-03-15 |
| 8 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 | 2023-05-20 |
| 9 | 9 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 |
| 10 | 10 | JAPAN | Sushi | 500 | 4.9 | . | . |
| 11 | 11 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
/* Remove duplicates: ID is left out of the BY list,
   otherwise every record would be unique and nothing would be removed */
PROC SORT DATA=food_clean NODUPKEY;
BY Country Food Price Rating Launch_dt;
RUN;
PROC PRINT DATA = food_clean;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date | Launch_dt |
|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 | . |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 | 2023-04-05 |
| 5 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 | 2023-02-28 |
| 6 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 | 2023-03-15 |
| 7 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 | 2023-05-20 |
| 8 | 10 | JAPAN | Sushi | 500 | 4.9 | . | . |
Explanation
This SAS cleaning pipeline uses industry-standard techniques. UPCASE and PROPCASE ensure consistency, which is critical for grouping and joins. ABS() corrects invalid numeric entries like negative prices. Missing values are imputed using business logic (default price = 100). Date conversion uses INPUT() with the proper informat, ensuring SAS recognizes the value as a numeric date. Finally, PROC SORT NODUPKEY removes duplicates based on key variables; note that the surrogate ID is deliberately excluded from the BY statement, because including it would make every record unique and no duplicates would be removed. This step is crucial in clinical trials, where duplicate patient records can violate regulatory compliance. SAS ensures deterministic and reproducible cleaning, which is why it is preferred in FDA submissions.
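In production, it is good practice to keep a record of what the de-duplication actually deleted. A minimal sketch using the DUPOUT= option of PROC SORT (food_dedup and food_dups are illustrative dataset names, not part of the pipeline above):

```sas
/* Capture removed duplicates for an audit trail.
   DUPOUT= writes every record that NODUPKEY discards
   to a separate dataset. */
PROC SORT DATA=food_clean OUT=food_dedup DUPOUT=food_dups NODUPKEY;
    BY Country Food Price Rating Launch_dt;
RUN;

/* Review the discarded records before signing off */
PROC PRINT DATA=food_dups;
RUN;
```

This keeps the raw and cleaned data intact while making the removal fully traceable, which matters in audited environments.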
4. Phase 2: Data Cleaning in R
library(dplyr)
food_clean <- food_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country == "NULL" |
Country == "", "UNKNOWN", Country))),
Food = tools::toTitleCase(trimws(Food)),
Price = ifelse(is.na(Price), 100, abs(Price)),
Launch_Date = as.Date(Launch_Date, format="%Y-%m-%d")
) %>%
distinct(Country, Food, Price, Rating, Launch_Date, .keep_all = TRUE)
OUTPUT:
| | ID | Country | Food | Price | Rating | Launch_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | NA |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 |
| 5 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 |
| 6 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 |
| 7 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 |
| 8 | 10 | JAPAN | Sushi | 500 | 4.9 | NA |
Explanation
R uses dplyr for intuitive transformation. mutate() handles multiple cleaning steps in a single pipeline. ifelse() replaces missing or invalid values, while abs() ensures no negative price remains. toupper(), trimws(), and tools::toTitleCase() standardize text fields. Date parsing is handled with as.Date(), which turns invalid strings such as 2023-02-30 into NA. The distinct() function removes duplicates based on key variables, similar to PROC SORT NODUPKEY in SAS. R’s advantage lies in readability and chaining operations, making it ideal for rapid data exploration. However, unlike SAS, R requires careful validation to ensure reproducibility in regulated environments.
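Before calling distinct(), it often pays to inspect the duplicate groups first. A small sketch, assuming the food_raw data frame defined above (keys and dup_flag are illustrative variable names):

```r
# Flag every member of each duplicate group (not just the extras),
# keyed on the same columns used later for de-duplication.
keys <- food_raw[, c("Country", "Food", "Price", "Rating")]
dup_flag <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Note: "biryani" vs "Biryani" will NOT match here because the text
# has not been standardized yet -- one reason cleaning comes first.
food_raw[dup_flag, ]
```

Combining duplicated() with fromLast = TRUE surfaces both the original and its copies, which supports the best practice of validating duplicates before removal.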
5. Phase 3: Additional SAS Enhancements
DATA food_enhanced;
SET food_clean;
/* Categorize Price */
LENGTH Price_Group $8. Region $8.;
IF Price < 200 THEN Price_Group="LOW";
ELSE IF Price < 400 THEN Price_Group="MEDIUM";
ELSE Price_Group="HIGH";
/* Derive Region */
IF Country IN ("INDIA","CHINA","JAPAN") THEN Region="ASIA";
ELSE IF Country IN ("USA") THEN Region="WEST";
ELSE Region="EUROPE";
RUN;
PROC PRINT DATA = food_enhanced;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date | Launch_dt | Price_Group | Region |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 | MEDIUM | ASIA |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 | . | LOW | WEST |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 | MEDIUM | EUROPE |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 | 2023-04-05 | HIGH | ASIA |
| 5 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 | 2023-02-28 | LOW | WEST |
| 6 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 | 2023-03-15 | MEDIUM | EUROPE |
| 7 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 | 2023-05-20 | LOW | ASIA |
| 8 | 10 | JAPAN | Sushi | 500 | 4.9 | . | . | HIGH | ASIA |
Explanation
This phase introduces derived variables common in ADaM datasets. Price_Group helps in segmentation analysis, while Region enables geographical insights. These derivations mimic real-world clinical derivations like age groups or treatment arms. SAS allows structured transformation using conditional logic, ensuring traceability. In regulatory submissions, every derived variable must be documented in Define.xml. This step demonstrates how raw data evolves into analytical datasets, ready for reporting and visualization.
6. 20 Additional Data Cleaning Best Practices
- Always validate against source data
- Maintain audit trails
- Use controlled terminology (CDISC)
- Document derivation logic
- Avoid hardcoding values
- Use metadata-driven programming
- Validate duplicates before removal
- Use PROC COMPARE for QC
- Standardize formats early
- Handle missing values consistently
- Use macros for scalability
- Validate date ranges
- Cross-check with the SAP (Statistical Analysis Plan)
- Maintain version control
- Log all transformations
- Use validation datasets
- Apply business rules strictly
- Avoid overwriting raw data
- Use test cases
- Ensure reproducibility
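"Use PROC COMPARE for QC" usually means double programming: a second programmer independently derives the same dataset, and the two versions are compared. A sketch, where food_clean_qc is a hypothetical independently produced copy:

```sas
/* Compare the production dataset against an independent QC version.
   Both datasets must be sorted by the ID variable beforehand.
   OUTNOEQUAL keeps only rows that differ; LISTALL reports records
   present in one dataset but not the other. */
PROC COMPARE BASE=food_clean COMPARE=food_clean_qc
             OUT=compare_diffs OUTNOEQUAL LISTALL;
    ID ID;
RUN;
```

A clean PROC COMPARE report (zero unequal values) is the standard evidence of a validated dataset in double-programming workflows.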
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic.
- Missing price → replaced with a default because the business assumes a minimum cost
- Negative price → corrected using ABS (data entry error)
- Duplicate records → removed to avoid inflated demand
- Invalid dates → corrected to maintain timeline integrity
In clinical trials:
- Age < 0 → invalid, must be corrected or flagged
- Missing visit date → imputed using protocol rules
Bad cleaning = wrong insights = wrong decisions.
8. 20 Key Insights
- Dirty data leads to wrong conclusions
- Duplicate records inflate metrics
- Standardization ensures reproducibility
- Missing values must follow business rules
- SAS ensures regulatory compliance
- R enables fast prototyping
- Data cleaning is 70% of analytics
- Validation is non-negotiable
- Documentation is critical
- Audit trails ensure trust
- Dates must be consistent
- Text inconsistency breaks joins
- Negative values often indicate errors
- Imputation must be justified
- Duplicate removal must be traceable
- QC checks prevent disasters
- Clean data drives insights
- Poor data kills models
- Structured pipelines save time
- Clean data = reliable decisions
9. SAS vs R
| Feature | SAS | R |
|---|---|---|
| Regulatory Compliance | Strong | Moderate |
| Ease of Use | Structured | Flexible |
| Duplicate Handling | PROC SORT | distinct() |
| Scalability | High | Moderate |
| Audit Trails | Built-in | Manual |
10. Summary
This blog walked through a realistic, industry-style data cleaning workflow using a “Famous Foods in the World” dataset deliberately injected with common data quality issues: duplicates, missing values, invalid entries, and inconsistent text. The journey began with a raw dataset in both SAS and R, reflecting how real-world data often arrives unstructured and unreliable. Through structured phases, we demonstrated how SAS handles cleaning using functions like UPCASE, PROPCASE, ABS, and INPUT, followed by PROC SORT NODUPKEY to efficiently eliminate duplicate records based on key variables.
In parallel, we replicated the same transformations in R using dplyr, showcasing functions like mutate(), ifelse(), distinct(), and string-handling utilities. This dual approach highlights the complementary strengths of SAS and R: SAS for compliance-heavy, traceable workflows, and R for flexible, exploratory transformations.
We also extended the cleaned dataset with derived variables such as Price_Group and Region, simulating real-world ADaM-like derivations. Beyond coding, the blog emphasized 20 best practices aligned with clinical trial standards (SDTM/ADaM), including audit trails, validation checks, and reproducibility.
Critically, the discussion reinforced that data cleaning is not just technical; it is governed by business logic. Decisions like imputing missing values or correcting invalid entries must align with domain rules. The blog concluded with sharp insights and interview-focused Q&A, making it both practical and career-oriented. Overall, it demonstrated that handling duplicates and inconsistencies is foundational to producing reliable, decision-ready datasets.
11. Conclusion
Effective data cleaning is the backbone of any reliable analytical workflow, whether in global business analytics or regulated clinical trials. This exercise using a “Famous Foods” dataset clearly illustrates how even small inconsistencies like duplicate records or invalid values can significantly distort outcomes if left unaddressed. Duplicate handling, in particular, is not just a technical step but a critical control mechanism to ensure data integrity, accuracy, and trustworthiness.
SAS stands out as a powerful tool in this domain due to its structured programming paradigm, reproducibility, and strong support for regulatory compliance. Procedures like PROC SORT NODUPKEY provide deterministic and auditable duplicate removal, which is essential in environments such as FDA submissions. On the other hand, R offers flexibility and speed, enabling analysts to prototype and iterate quickly using intuitive libraries like dplyr. Together, they form a robust ecosystem for modern data professionals.
However, tools alone are not enough. The real strength lies in applying sound business logic, maintaining detailed documentation, and implementing rigorous validation checks. Every cleaning decision, whether imputing a missing value or correcting an anomaly, must be justifiable and traceable.
In the end, clean data is not just about aesthetics; it directly impacts strategic decisions, model accuracy, and regulatory credibility. A well-structured data cleaning framework ensures that the derived insights are not only accurate but also defensible. Mastering these practices positions you as a reliable and industry-ready data professional.
12. Interview Questions
Q1: How do you remove duplicates in SAS?
Answer: Use PROC SORT NODUPKEY with appropriate BY variables.
Q2: Difference between NODUPKEY and NODUP?
Answer:
- NODUPKEY → removes records that share the same BY-variable values, keeping the first occurrence
- NODUP → removes records that are exact duplicates of the preceding record across all variables
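The difference is easiest to see side by side. A sketch on the dataset from this post (by_key and exact_only are illustrative output dataset names):

```sas
/* NODUPKEY: one record per BY group -- the first occurrence wins */
PROC SORT DATA=food_clean OUT=by_key NODUPKEY;
    BY Country Food;
RUN;

/* NODUP: drops a record only when it matches the PREVIOUS record
   on ALL variables, so sort by enough keys to make true exact
   duplicates adjacent */
PROC SORT DATA=food_clean OUT=exact_only NODUP;
    BY Country Food Price Rating Launch_dt;
RUN;
```

Because NODUP only compares adjacent records, exact duplicates that end up non-adjacent after sorting can survive; NODUPKEY has no such caveat.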
Q3: How do you handle duplicates in R?
Answer: Use distinct() from dplyr.
Q4: What if duplicates contain conflicting values?
Answer: Apply business rules or aggregation (e.g., take the latest record).
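For instance, a "keep the latest record" rule can be sketched in dplyr, assuming the cleaned food_clean data frame from above (the rule itself is hypothetical; food_latest is an illustrative name):

```r
library(dplyr)

# Keep one record per Country/Food, preferring the newest Launch_Date.
# with_ties = FALSE guarantees exactly one row per group even when
# several records share the same date.
food_latest <- food_clean %>%
  group_by(Country, Food) %>%
  slice_max(Launch_Date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Whatever rule is chosen (latest, highest-rated, aggregated mean), it should be documented so the resolution of conflicting duplicates remains traceable.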
Q5: Clinical scenario – duplicate patient IDs?
Answer: Validate against source, flag discrepancies, consult data management, document resolution.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only, using a fictional famous-foods dataset.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics