458. Data Cleaning Secrets Using a Famous Food Dataset: Handling Duplicate Records in SAS

From Global Plate Chaos to Clean Insights: Mastering Duplicate Handling in SAS with Real-World Food Data

1. Introduction

Imagine you’re working as a clinical data analyst in a pharmaceutical company. You receive a patient dataset for a Phase III trial. Everything looks fine until your summary statistics show 120 patients enrolled, while the actual study had only 100. Panic.

What went wrong?

Duplicate records.

Now shift that same scenario to a business dataset, say, global food analytics. You’re analyzing the “most popular foods worldwide” to guide a multinational restaurant expansion strategy. But duplicate rows inflate demand for Pizza in Italy or Sushi in Japan. Suddenly, your insights are misleading, and millions of dollars are at stake.

Bad data doesn’t just “look messy”; it destroys decision-making.

This is where tools like SAS and R come in. They don’t just clean data; they enforce discipline, reproducibility, and regulatory compliance (especially critical in clinical trials using SDTM/ADaM).

In this blog, we’ll simulate a “Famous Foods in the World” dataset with intentional issues, then systematically clean and de-duplicate it using SAS and R like a real-world data engineer. 

2. Raw Data Creation in SAS and R (With Intentional Errors)

SAS Code (Raw Dataset)

DATA food_raw;
  INPUT ID Country $ Food $ Price Rating Launch_Date :$10.;
  DATALINES;
1 India Biryani 250 4.5 2023-01-10
2 USA Burger 150 4.2 2023-02-30
3 Italy Pizza 300 4.8 2023-03-12
4 Japan Sushi 500 4.9 2023-04-05
5 India biryani 250 4.5 2023-01-10
6 USA Burger . 4.2 2023-02-28
7 NULL Pasta 200 3.8 2023-03-15
8 China Noodles -50 4.1 2023-05-20
9 Italy Pizza 300 4.8 2023-03-12
10 Japan Sushi 500 4.9 .
11 India Biryani 250 4.5 2023-01-10
;
RUN;

PROC PRINT DATA=food_raw;
RUN;

OUTPUT:

Obs  ID  Country  Food     Price  Rating  Launch_Date
  1   1  India    Biryani    250     4.5  2023-01-10
  2   2  USA      Burger     150     4.2  2023-02-30
  3   3  Italy    Pizza      300     4.8  2023-03-12
  4   4  Japan    Sushi      500     4.9  2023-04-05
  5   5  India    biryani    250     4.5  2023-01-10
  6   6  USA      Burger       .     4.2  2023-02-28
  7   7  NULL     Pasta      200     3.8  2023-03-15
  8   8  China    Noodles    -50     4.1  2023-05-20
  9   9  Italy    Pizza      300     4.8  2023-03-12
 10  10  Japan    Sushi      500     4.9
 11  11  India    Biryani    250     4.5  2023-01-10

Explanation

This dataset intentionally contains real-world data quality issues. Duplicate records exist (e.g., Pizza, Biryani repeated). Missing values appear as . and NULL. Invalid data includes negative price (-50) and incorrect date (2023-02-30). Text inconsistency is present (Biryani vs biryani). Such issues are extremely common in raw data pipelines, especially when data comes from multiple sources like APIs, manual entry, or legacy systems. In clinical trials, similar inconsistencies can occur in patient demographics or lab data, leading to incorrect analysis. This raw dataset sets the foundation for demonstrating robust cleaning techniques.
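The checks described above can be made mechanical before any cleaning begins. As a language-agnostic illustration (a hypothetical Python sketch, not part of the SAS/R pipeline), profiling a few of the raw records for duplicates, invalid dates, and negative prices might look like this:

```python
from datetime import datetime

# Hypothetical sketch: profile raw records for the issues described above.
# A subset of the food rows, as (ID, Country, Food, Price, Launch_Date).
records = [
    (1,  "India", "Biryani", 250, "2023-01-10"),
    (2,  "USA",   "Burger",  150, "2023-02-30"),  # impossible calendar date
    (5,  "India", "biryani", 250, "2023-01-10"),  # case-variant duplicate of ID 1
    (8,  "China", "Noodles", -50, "2023-05-20"),  # negative price
    (11, "India", "Biryani", 250, "2023-01-10"),  # exact duplicate of ID 1
]

def is_valid_date(s):
    try:
        datetime.strptime(s, "%Y-%m-%d")
        return True
    except ValueError:
        return False

seen, dupes, bad_dates, bad_prices = set(), [], [], []
for rec_id, country, food, price, date in records:
    key = (country.upper(), food.upper(), price, date)  # case-insensitive key
    if key in seen:
        dupes.append(rec_id)
    seen.add(key)
    if not is_valid_date(date):
        bad_dates.append(rec_id)
    if price < 0:
        bad_prices.append(rec_id)

print(dupes, bad_dates, bad_prices)  # [5, 11] [2] [8]
```

Note how case-normalizing the key catches "biryani" vs "Biryani", which a naive exact comparison would miss.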

R Code – Equivalent Raw Dataset

food_raw <- data.frame(
  ID = 1:11,
  Country = c("India","USA","Italy","Japan","India","USA",
              "NULL","China","Italy","Japan","India"),
  Food = c("Biryani","Burger","Pizza","Sushi","biryani",
           "Burger","Pasta","Noodles","Pizza","Sushi","Biryani"),
  Price = c(250,150,300,500,250,NA,200,-50,300,500,250),
  Rating = c(4.5,4.2,4.8,4.9,4.5,4.2,3.8,4.1,4.8,4.9,4.5),
  Launch_Date = c("2023-01-10","2023-02-30","2023-03-12",
                  "2023-04-05","2023-01-10","2023-02-28",
                  "2023-03-15","2023-05-20","2023-03-12",NA,
                  "2023-01-10")
)

OUTPUT:

   ID Country    Food Price Rating Launch_Date
1   1   India Biryani   250    4.5  2023-01-10
2   2     USA  Burger   150    4.2  2023-02-30
3   3   Italy   Pizza   300    4.8  2023-03-12
4   4   Japan   Sushi   500    4.9  2023-04-05
5   5   India biryani   250    4.5  2023-01-10
6   6     USA  Burger    NA    4.2  2023-02-28
7   7    NULL   Pasta   200    3.8  2023-03-15
8   8   China Noodles   -50    4.1  2023-05-20
9   9   Italy   Pizza   300    4.8  2023-03-12
10 10   Japan   Sushi   500    4.9        <NA>
11 11   India Biryani   250    4.5  2023-01-10
Explanation

The R dataset mirrors the SAS structure, ensuring cross-platform consistency. Missing values are represented using NA, while invalid values like negative price and incorrect dates remain intact. Duplicate entries are intentionally included to demonstrate how distinct() works. In real-world pipelines, R is frequently used for exploratory analysis, while SAS dominates regulatory environments. Understanding both ensures flexibility. This dataset highlights how even structured formats like data.frame() can contain inconsistent entries if validation rules are not enforced during ingestion. It prepares us for transformation using dplyr.

3. Phase 1: Data Cleaning in SAS

DATA food_clean;
  SET food_raw;

  /* Fix missing country */
  IF Country IN ("NULL","") THEN Country = "UNKNOWN";

  /* Standardize text */
  Country = UPCASE(STRIP(Country));
  Food = PROPCASE(STRIP(Food));

  /* Handle missing price first (business default), then fix negatives;
     checking missing first avoids the SAS quirk that missing < 0 is true */
  IF Price = . THEN Price = 100;
  ELSE IF Price < 0 THEN Price = ABS(Price);

  /* Convert the character date to a numeric SAS date;
     ?? suppresses the log note for invalid dates such as 2023-02-30 */
  Launch_dt = INPUT(Launch_Date, ?? yymmdd10.);
  FORMAT Launch_dt yymmdd10.;
RUN;

PROC PRINT DATA=food_clean;
RUN;

OUTPUT:

Obs  ID  Country  Food     Price  Rating  Launch_Date  Launch_dt
  1   1  INDIA    Biryani    250     4.5  2023-01-10   2023-01-10
  2   2  USA      Burger     150     4.2  2023-02-30            .
  3   3  ITALY    Pizza      300     4.8  2023-03-12   2023-03-12
  4   4  JAPAN    Sushi      500     4.9  2023-04-05   2023-04-05
  5   5  INDIA    Biryani    250     4.5  2023-01-10   2023-01-10
  6   6  USA      Burger     100     4.2  2023-02-28   2023-02-28
  7   7  UNKNOWN  Pasta      200     3.8  2023-03-15   2023-03-15
  8   8  CHINA    Noodles     50     4.1  2023-05-20   2023-05-20
  9   9  ITALY    Pizza      300     4.8  2023-03-12   2023-03-12
 10  10  JAPAN    Sushi      500     4.9               .
 11  11  INDIA    Biryani    250     4.5  2023-01-10   2023-01-10

/* Remove duplicates: ID is left out of the BY list, otherwise every
   row would have a unique key and nothing would be removed */
PROC SORT DATA=food_clean NODUPKEY;
  BY Country Food Price Rating Launch_dt;
RUN;

PROC PRINT DATA=food_clean;
RUN;

OUTPUT:

Obs  ID  Country  Food     Price  Rating  Launch_Date  Launch_dt
  1   8  CHINA    Noodles     50     4.1  2023-05-20   2023-05-20
  2   1  INDIA    Biryani    250     4.5  2023-01-10   2023-01-10
  3   3  ITALY    Pizza      300     4.8  2023-03-12   2023-03-12
  4  10  JAPAN    Sushi      500     4.9               .
  5   4  JAPAN    Sushi      500     4.9  2023-04-05   2023-04-05
  6   7  UNKNOWN  Pasta      200     3.8  2023-03-15   2023-03-15
  7   6  USA      Burger     100     4.2  2023-02-28   2023-02-28
  8   2  USA      Burger     150     4.2  2023-02-30            .

Explanation

This SAS cleaning pipeline uses industry-standard techniques. UPCASE and PROPCASE enforce text consistency, which is critical for grouping and joins. ABS() corrects invalid numeric entries like negative prices, and missing values are imputed using business logic (default price = 100). Date conversion uses INPUT() with the proper informat, so SAS stores the value as a true numeric date. Finally, PROC SORT NODUPKEY removes duplicates based on the key variables in the BY statement; note that the unique row ID must be left out of that list, or every record would count as distinct. This step is crucial in clinical trials, where duplicate patient records can violate regulatory compliance. SAS ensures deterministic and reproducible cleaning, which is why it is preferred in FDA submissions.
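To make the keep-first-per-key behavior of NODUPKEY concrete outside of SAS, here is a hypothetical Python sketch of the same semantics: sort by the BY variables, then keep only the first record for each unique key combination (all names in the sketch are illustrative):

```python
# Hypothetical sketch of PROC SORT NODUPKEY semantics.
rows = [
    {"ID": 1, "Country": "INDIA", "Food": "Biryani", "Price": 250},
    {"ID": 3, "Country": "ITALY", "Food": "Pizza",   "Price": 300},
    {"ID": 5, "Country": "INDIA", "Food": "Biryani", "Price": 250},  # duplicate key
    {"ID": 9, "Country": "ITALY", "Food": "Pizza",   "Price": 300},  # duplicate key
]

def nodupkey(rows, by):
    """Sort by the BY variables, then keep the first record per key."""
    seen, kept = set(), []
    for row in sorted(rows, key=lambda r: [r[v] for v in by]):
        key = tuple(row[v] for v in by)
        if key not in seen:  # first record for this key survives
            seen.add(key)
            kept.append(row)
    return kept

deduped = nodupkey(rows, by=["Country", "Food", "Price"])
print([r["ID"] for r in deduped])  # [1, 3]
```

Because Python's sort is stable, ties keep their original order, so "first record wins" matches the SAS behavior.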

4. Phase 2: Data Cleaning in R

library(dplyr)

food_clean <- food_raw %>%
  mutate(
    Country = toupper(trimws(ifelse(Country == "NULL" | Country == "",
                                    "UNKNOWN", Country))),
    Food = tools::toTitleCase(trimws(Food)),
    Price = ifelse(is.na(Price), 100, abs(Price)),
    Launch_Date = as.Date(Launch_Date, format = "%Y-%m-%d")
  ) %>%
  distinct(Country, Food, Price, Rating, Launch_Date, .keep_all = TRUE)

OUTPUT:

  ID Country    Food Price Rating Launch_Date
1  1   INDIA Biryani   250    4.5  2023-01-10
2  2     USA  Burger   150    4.2        <NA>
3  3   ITALY   Pizza   300    4.8  2023-03-12
4  4   JAPAN   Sushi   500    4.9  2023-04-05
5  6     USA  Burger   100    4.2  2023-02-28
6  7 UNKNOWN   Pasta   200    3.8  2023-03-15
7  8   CHINA Noodles    50    4.1  2023-05-20
8 10   JAPAN   Sushi   500    4.9        <NA>
Explanation

R uses dplyr for intuitive transformation. mutate() handles multiple cleaning steps in a pipeline. ifelse() replaces missing or invalid values, while abs() ensures no negative price remains. toupper() and trimws() standardize text fields. Date parsing is handled using as.Date(). The distinct() function removes duplicates based on key variables, similar to PROC SORT NODUPKEY in SAS. R’s advantage lies in readability and chaining operations, making it ideal for rapid data exploration. However, unlike SAS, R requires careful validation to ensure reproducibility in regulated environments.

5. Phase 3: Additional SAS Enhancements

DATA food_enhanced;
  SET food_clean;
  LENGTH Price_Group $ 8 Region $ 8;

  /* Categorize Price */
  IF Price < 200 THEN Price_Group = "LOW";
  ELSE IF Price < 400 THEN Price_Group = "MEDIUM";
  ELSE Price_Group = "HIGH";

  /* Derive Region */
  IF Country IN ("INDIA","CHINA","JAPAN") THEN Region = "ASIA";
  ELSE IF Country IN ("USA") THEN Region = "WEST";
  ELSE Region = "EUROPE";
RUN;

PROC PRINT DATA=food_enhanced;
RUN;

OUTPUT:

Obs  ID  Country  Food     Price  Rating  Launch_Date  Launch_dt   Price_Group  Region
  1   8  CHINA    Noodles     50     4.1  2023-05-20   2023-05-20  LOW          ASIA
  2   1  INDIA    Biryani    250     4.5  2023-01-10   2023-01-10  MEDIUM       ASIA
  3   3  ITALY    Pizza      300     4.8  2023-03-12   2023-03-12  MEDIUM       EUROPE
  4  10  JAPAN    Sushi      500     4.9               .           HIGH         ASIA
  5   4  JAPAN    Sushi      500     4.9  2023-04-05   2023-04-05  HIGH         ASIA
  6   7  UNKNOWN  Pasta      200     3.8  2023-03-15   2023-03-15  MEDIUM       EUROPE
  7   6  USA      Burger     100     4.2  2023-02-28   2023-02-28  LOW          WEST
  8   2  USA      Burger     150     4.2  2023-02-30   .           LOW          WEST

Explanation

This phase introduces derived variables common in ADaM datasets. Price_Group helps in segmentation analysis, while Region enables geographical insights. These derivations mimic real-world clinical derivations like age groups or treatment arms. SAS allows structured transformation using conditional logic, ensuring traceability. In regulatory submissions, every derived variable must be documented in Define.xml. This step demonstrates how raw data evolves into analytical datasets, ready for reporting and visualization.
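These conditional derivations are tool-independent. A hypothetical Python sketch of the same Price_Group and Region logic (function names are illustrative, not part of any library):

```python
# Hypothetical sketch of the Price_Group and Region derivations shown above.
ASIA = {"INDIA", "CHINA", "JAPAN"}

def price_group(price):
    # Mirrors the IF / ELSE IF / ELSE ladder in the SAS step
    if price < 200:
        return "LOW"
    elif price < 400:
        return "MEDIUM"
    return "HIGH"

def region(country):
    if country in ASIA:
        return "ASIA"
    if country == "USA":
        return "WEST"
    return "EUROPE"  # catch-all, like the SAS ELSE branch (UNKNOWN lands here too)

print(price_group(250), region("INDIA"))  # MEDIUM ASIA
print(price_group(500), region("ITALY"))  # HIGH EUROPE
```

Note that the catch-all ELSE branch assigns "UNKNOWN" countries to EUROPE, exactly as the SAS code does; in production, an unmapped country would more likely be flagged for review.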

6. 20 Additional Data Cleaning Best Practices

  1. Always validate against source data
  2. Maintain audit trails
  3. Use controlled terminology (CDISC)
  4. Document derivation logic
  5. Avoid hardcoding values
  6. Use metadata-driven programming
  7. Validate duplicates before removal
  8. Use PROC COMPARE for QC
  9. Standardize formats early
  10. Handle missing values consistently
  11. Use macros for scalability
  12. Validate date ranges
  13. Cross-check with SAP
  14. Maintain version control
  15. Log all transformations
  16. Use validation datasets
  17. Apply business rules strictly
  18. Avoid overwriting raw data
  19. Use test cases
  20. Ensure reproducibility

7. Business Logic Behind Data Cleaning

Data cleaning is not arbitrary; it’s driven by business logic.

  • Missing price → replaced with default because business assumes minimum cost
  • Negative price → corrected using ABS (data entry error)
  • Duplicate records → removed to avoid inflated demand
  • Invalid dates → corrected to maintain timeline integrity

In clinical trials:

  • Age < 0 → invalid, must be corrected or flagged
  • Missing visit date → imputed using protocol rules

Bad cleaning = wrong insights = wrong decisions.

8. 20 Key Insights 

  • Dirty data leads to wrong conclusions
  • Duplicate records inflate metrics
  • Standardization ensures reproducibility
  • Missing values must follow business rules
  • SAS ensures regulatory compliance
  • R enables fast prototyping
  • Data cleaning is 70% of analytics
  • Validation is non-negotiable
  • Documentation is critical
  • Audit trails ensure trust
  • Dates must be consistent
  • Text inconsistency breaks joins
  • Negative values often indicate errors
  • Imputation must be justified
  • Duplicate removal must be traceable
  • QC checks prevent disasters
  • Clean data drives insights
  • Poor data kills models
  • Structured pipelines save time
  • Clean data = reliable decisions

9. SAS vs R

Feature                 SAS          R
Regulatory Compliance   Strong       Moderate
Ease of Use             Structured   Flexible
Duplicate Handling      PROC SORT    distinct()
Scalability             High         Moderate
Audit Trails            Built-in     Manual


10. Summary

This blog walked through a realistic, industry-style data cleaning workflow using a “Famous Foods in the World” dataset deliberately injected with common data quality issues: duplicates, missing values, invalid entries, and inconsistent text. The journey began with a raw dataset in both SAS and R, reflecting how real-world data often arrives unstructured and unreliable. Through structured phases, we demonstrated how SAS handles cleaning using functions like UPCASE, PROPCASE, ABS, and INPUT, followed by PROC SORT NODUPKEY to efficiently eliminate duplicate records based on key variables.

In parallel, we replicated the same transformations in R using dplyr, showcasing functions like mutate(), ifelse(), distinct(), and string-handling utilities. This dual approach highlights the complementary strengths of SAS and R: SAS for compliance-heavy, traceable workflows, and R for flexible, exploratory transformations.

We also extended the cleaned dataset with derived variables such as Price_Group and Region, simulating real-world ADaM-like derivations. Beyond coding, the blog emphasized 20 best practices aligned with clinical trial standards (SDTM/ADaM), including audit trails, validation checks, and reproducibility.

Critically, the discussion reinforced that data cleaning is not just technical; it is governed by business logic. Decisions like imputing missing values or correcting invalid entries must align with domain rules. The blog concluded with sharp insights and interview-focused Q&A, making it both practical and career-oriented. Overall, it demonstrated that handling duplicates and inconsistencies is foundational to producing reliable, decision-ready datasets.

11. Conclusion

Effective data cleaning is the backbone of any reliable analytical workflow, whether in global business analytics or regulated clinical trials. This exercise using a “Famous Foods” dataset clearly illustrates how even small inconsistencies like duplicate records or invalid values can significantly distort outcomes if left unaddressed. Duplicate handling, in particular, is not just a technical step but a critical control mechanism to ensure data integrity, accuracy, and trustworthiness.

SAS stands out as a powerful tool in this domain due to its structured programming paradigm, reproducibility, and strong support for regulatory compliance. Procedures like PROC SORT NODUPKEY provide deterministic and auditable duplicate removal, which is essential in environments such as FDA submissions. On the other hand, R offers flexibility and speed, enabling analysts to prototype and iterate quickly using intuitive libraries like dplyr. Together, they form a robust ecosystem for modern data professionals.

However, tools alone are not enough. The real strength lies in applying sound business logic, maintaining detailed documentation, and implementing rigorous validation checks. Every cleaning decision, whether imputing a missing value or correcting an anomaly, must be justifiable and traceable.

In the end, clean data is not just about aesthetics; it directly impacts strategic decisions, model accuracy, and regulatory credibility. A well-structured data cleaning framework ensures that insights derived are not only accurate but also defensible. Mastering these practices positions you as a reliable and industry-ready data professional.

12. Interview Questions

Q1: How do you remove duplicates in SAS?

Answer: Use PROC SORT NODUPKEY with appropriate BY variables.

Q2: Difference between NODUPKEY and NODUP?

Answer:

  • NODUPKEY → removes records that share the same values of the BY variables
  • NODUP (alias NODUPRECS) → removes a record only when the entire observation is identical to the previously written one after sorting
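The distinction is easy to see in a small, hypothetical Python sketch (not SAS itself): key-based removal versus whole-record removal, applied to rows already sorted by the key:

```python
# Hypothetical sketch contrasting NODUPKEY vs NODUP(RECS) semantics.
rows = [
    ("ITALY", "Pizza", 300),
    ("ITALY", "Pizza", 300),  # exact duplicate of the previous row
    ("ITALY", "Pizza", 350),  # same key (Country, Food), different Price
]

def nodupkey(rows, key_len):
    """Keep one row per key; rows assumed sorted by the key columns."""
    seen, kept = set(), []
    for row in rows:
        key = row[:key_len]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def noduprecs(rows):
    """Drop a row only if it is identical to the previously kept row."""
    kept = []
    for row in rows:
        if not kept or row != kept[-1]:
            kept.append(row)
    return kept

print(len(nodupkey(rows, key_len=2)))  # 1 -> one row per (Country, Food)
print(len(noduprecs(rows)))            # 2 -> only the exact repeat removed
```

So NODUPKEY is the stricter option here: it collapses the 350-price row too, because it shares the key, while NODUP keeps it.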

Q3: How do you handle duplicates in R?

Answer: Use distinct() from dplyr.

Q4: What if duplicates contain conflicting values?

Answer: Apply business rules or aggregation (e.g., take latest record).
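One common version of that rule, keep the most recent record per key, can be sketched in hypothetical Python (field names are illustrative):

```python
from datetime import date

# Hypothetical sketch: when duplicates conflict, keep the record with the
# latest date for each business key.
rows = [
    {"Country": "USA", "Food": "Burger", "Price": 150, "Date": date(2023, 2, 1)},
    {"Country": "USA", "Food": "Burger", "Price": 100, "Date": date(2023, 2, 28)},
]

latest = {}
for row in rows:
    key = (row["Country"], row["Food"])
    if key not in latest or row["Date"] > latest[key]["Date"]:
        latest[key] = row  # most recent record wins

print(latest[("USA", "Burger")]["Price"])  # 100
```

The same "latest wins" rule is written in SAS by sorting the key ascending and the date descending, then taking the first record per key.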

Q5: Clinical scenario – duplicate patient IDs?

Answer: Validate against source, flag discrepancies, consult data management, document resolution.
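In a clinical setting the first step is usually to flag, not drop. A hypothetical Python sketch of flagging duplicate subject IDs for review (USUBJID is the CDISC subject identifier; the records are invented for illustration):

```python
from collections import Counter

# Hypothetical sketch: flag, rather than silently drop, duplicate subject IDs.
records = [
    {"USUBJID": "P001", "AGE": 34},
    {"USUBJID": "P002", "AGE": 51},
    {"USUBJID": "P001", "AGE": 35},  # same subject, conflicting AGE -> review
]

counts = Counter(r["USUBJID"] for r in records)
flagged = [r for r in records if counts[r["USUBJID"]] > 1]
print(sorted({r["USUBJID"] for r in flagged}))  # ['P001']
```

Both conflicting records stay in the review list so data management can trace the discrepancy back to source documents before any resolution is applied.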

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article, built around a simulated famous-food dataset, are created for educational and demonstration purposes only.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and data projects

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


