458. Data Cleaning Secrets Using a Famous Food Dataset: Handling Duplicate Records in SAS
From Global Plate Chaos to Clean Insights: Mastering Duplicate Handling in SAS with Real-World Food Data
1. Introduction
Imagine you’re working as a clinical data analyst at a pharmaceutical company. You receive a patient dataset for a Phase III trial. Everything looks fine until your summary statistics show 120 patients enrolled, while the actual study had only 100. Panic.
What went wrong? Duplicate records.
Now shift that same scenario to a business dataset, say global food analytics. You’re analyzing the “most popular foods worldwide” to guide a multinational restaurant expansion strategy. But duplicate rows inflate demand for Pizza in Italy or Sushi in Japan. Suddenly, your insights are misleading, and millions of dollars are at stake.
Bad data doesn’t just “look messy”; it destroys decision-making.
This is where tools like SAS and R come in. They don’t just clean data; they enforce discipline, reproducibility, and regulatory compliance (especially critical in clinical trials using SDTM/ADaM).
In this blog, we’ll simulate a “Famous Foods in the World” dataset with intentional issues, then systematically clean and de-duplicate it using SAS and R, just like a real-world data engineer.
2. Raw Data Creation in SAS and R (With Intentional Errors)
SAS Code (Raw Dataset)
DATA food_raw;
INPUT ID Country $ Food $ Price Rating Launch_Date :$10.;
DATALINES;
1 India Biryani 250 4.5 2023-01-10
2 USA Burger 150 4.2 2023-02-30
3 Italy Pizza 300 4.8 2023-03-12
4 Japan Sushi 500 4.9 2023-04-05
5 India biryani 250 4.5 2023-01-10
6 USA Burger . 4.2 2023-02-28
7 NULL Pasta 200 3.8 2023-03-15
8 China Noodles -50 4.1 2023-05-20
9 Italy Pizza 300 4.8 2023-03-12
10 Japan Sushi 500 4.9 .
11 India Biryani 250 4.5 2023-01-10
;
RUN;
PROC PRINT DATA = food_raw;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | India | Biryani | 250 | 4.5 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 |
| 3 | 3 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 4 | 4 | Japan | Sushi | 500 | 4.9 | 2023-04-05 |
| 5 | 5 | India | biryani | 250 | 4.5 | 2023-01-10 |
| 6 | 6 | USA | Burger | . | 4.2 | 2023-02-28 |
| 7 | 7 | NULL | Pasta | 200 | 3.8 | 2023-03-15 |
| 8 | 8 | China | Noodles | -50 | 4.1 | 2023-05-20 |
| 9 | 9 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 10 | 10 | Japan | Sushi | 500 | 4.9 | |
| 11 | 11 | India | Biryani | 250 | 4.5 | 2023-01-10 |
Explanation
This dataset intentionally contains real-world data quality issues. Duplicate records exist (e.g., Pizza and Biryani are repeated). Missing values appear as . and NULL. Invalid data includes a negative price (-50) and an impossible date (2023-02-30). Text inconsistency is present (Biryani vs biryani). Such issues are extremely common in raw data pipelines, especially when data comes from multiple sources such as APIs, manual entry, or legacy systems. In clinical trials, similar inconsistencies can occur in patient demographics or lab data, leading to incorrect analysis. This raw dataset sets the foundation for demonstrating robust cleaning techniques.
R Code – Equivalent Raw Dataset
food_raw <- data.frame(
ID = 1:11,
Country = c("India","USA","Italy","Japan","India","USA",
"NULL","China","Italy","Japan","India"),
Food = c("Biryani","Burger","Pizza","Sushi","biryani",
"Burger","Pasta","Noodles","Pizza","Sushi","Biryani"),
Price = c(250,150,300,500,250,NA,200,-50,300,500,250),
Rating = c(4.5,4.2,4.8,4.9,4.5,4.2,3.8,4.1,4.8,4.9,4.5),
Launch_Date = c("2023-01-10","2023-02-30","2023-03-12",
"2023-04-05","2023-01-10","2023-02-28",
"2023-03-15","2023-05-20","2023-03-12",NA,
"2023-01-10")
)
OUTPUT:
| | ID | Country | Food | Price | Rating | Launch_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | India | Biryani | 250 | 4.5 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 |
| 3 | 3 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 4 | 4 | Japan | Sushi | 500 | 4.9 | 2023-04-05 |
| 5 | 5 | India | biryani | 250 | 4.5 | 2023-01-10 |
| 6 | 6 | USA | Burger | NA | 4.2 | 2023-02-28 |
| 7 | 7 | NULL | Pasta | 200 | 3.8 | 2023-03-15 |
| 8 | 8 | China | Noodles | -50 | 4.1 | 2023-05-20 |
| 9 | 9 | Italy | Pizza | 300 | 4.8 | 2023-03-12 |
| 10 | 10 | Japan | Sushi | 500 | 4.9 | NA |
| 11 | 11 | India | Biryani | 250 | 4.5 | 2023-01-10 |
Explanation
The R dataset mirrors the SAS structure, ensuring cross-platform consistency. Missing values are represented using NA, while invalid values like the negative price and incorrect dates remain intact. Duplicate entries are intentionally included to demonstrate how distinct() works. In real-world pipelines, R is frequently used for exploratory analysis, while SAS dominates regulatory environments; understanding both ensures flexibility. This dataset highlights how even structured formats like data.frame() can contain inconsistent entries if validation rules are not enforced during ingestion. It prepares us for transformation using dplyr.
3. Phase 1: Data Cleaning in SAS
DATA food_clean;
SET food_raw;
/* Fix missing country */
IF Country IN ("NULL","") THEN Country = "UNKNOWN";
/* Standardize text */
Country = UPCASE(STRIP(Country));
Food = PROPCASE(STRIP(Food));
/* Fix invalid price */
IF Price < 0 THEN Price = ABS(Price);
/* Handle missing price */
IF Price = . THEN Price = 100;
/* Fix date: the ?? modifier suppresses log errors for invalid
   values such as 2023-02-30, which simply become missing */
Launch_dt = INPUT(Launch_Date, ?? yymmdd10.);
FORMAT Launch_dt yymmdd10.;
RUN;
PROC PRINT DATA = food_clean;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date | Launch_dt |
|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 | . |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 | 2023-04-05 |
| 5 | 5 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
| 6 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 | 2023-02-28 |
| 7 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 | 2023-03-15 |
| 8 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 | 2023-05-20 |
| 9 | 9 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 |
| 10 | 10 | JAPAN | Sushi | 500 | 4.9 | . | . |
| 11 | 11 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
/* Remove duplicates: ID is left out of the BY list,
   otherwise every record would be unique and nothing would be removed */
PROC SORT DATA=food_clean NODUPKEY;
BY Country Food Price Rating Launch_dt;
RUN;
PROC PRINT DATA = food_clean;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date | Launch_dt |
|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 | . |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 | 2023-04-05 |
| 5 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 | 2023-02-28 |
| 6 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 | 2023-03-15 |
| 7 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 | 2023-05-20 |
| 8 | 10 | JAPAN | Sushi | 500 | 4.9 | . | . |
Explanation
This SAS cleaning pipeline uses industry-standard techniques. UPCASE and PROPCASE ensure consistency, which is critical for grouping and joins. ABS() corrects invalid numeric entries like negative prices. Missing values are imputed using business logic (default price = 100). Date conversion uses INPUT() with the proper informat, ensuring SAS recognizes the value as a numeric date. Finally, PROC SORT NODUPKEY removes duplicates based on key variables; note that the surrogate ID is deliberately excluded from the BY statement, because including it would make every record unique and no duplicates would be removed. This step is crucial in clinical trials, where duplicate patient records can violate regulatory compliance. SAS ensures deterministic and reproducible cleaning, which is why it is preferred in FDA submissions.
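In production, it is good practice to keep a record of what the de-duplication actually deleted. A minimal sketch using the DUPOUT= option of PROC SORT (food_dedup and food_dups are illustrative dataset names, not part of the pipeline above):

```sas
/* Capture removed duplicates for an audit trail.
   DUPOUT= writes every record that NODUPKEY discards
   to a separate dataset. */
PROC SORT DATA=food_clean OUT=food_dedup DUPOUT=food_dups NODUPKEY;
    BY Country Food Price Rating Launch_dt;
RUN;

/* Review the discarded records before signing off */
PROC PRINT DATA=food_dups;
RUN;
```

This keeps the raw and cleaned data intact while making the removal fully traceable, which matters in audited environments.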
4. Phase 2: Data Cleaning in R
library(dplyr)
food_clean <- food_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country == "NULL" |
Country == "", "UNKNOWN", Country))),
Food = tools::toTitleCase(trimws(Food)),
Price = ifelse(is.na(Price), 100, abs(Price)),
Launch_Date = as.Date(Launch_Date, format="%Y-%m-%d")
) %>%
distinct(Country, Food, Price, Rating, Launch_Date, .keep_all = TRUE)
OUTPUT:
| | ID | Country | Food | Price | Rating | Launch_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 |
| 2 | 2 | USA | Burger | 150 | 4.2 | NA |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 |
| 5 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 |
| 6 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 |
| 7 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 |
| 8 | 10 | JAPAN | Sushi | 500 | 4.9 | NA |
Explanation
R uses dplyr for intuitive transformation. mutate() handles multiple cleaning steps in a single pipeline. ifelse() replaces missing or invalid values, while abs() ensures no negative price remains. toupper(), trimws(), and tools::toTitleCase() standardize text fields. Date parsing is handled with as.Date(), which turns invalid strings such as 2023-02-30 into NA. The distinct() function removes duplicates based on key variables, similar to PROC SORT NODUPKEY in SAS. R’s advantage lies in readability and chaining operations, making it ideal for rapid data exploration. However, unlike SAS, R requires careful validation to ensure reproducibility in regulated environments.
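Before calling distinct(), it often pays to inspect the duplicate groups first. A small sketch, assuming the food_raw data frame defined above (keys and dup_flag are illustrative variable names):

```r
# Flag every member of each duplicate group (not just the extras),
# keyed on the same columns used later for de-duplication.
keys <- food_raw[, c("Country", "Food", "Price", "Rating")]
dup_flag <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Note: "biryani" vs "Biryani" will NOT match here because the text
# has not been standardized yet -- one reason cleaning comes first.
food_raw[dup_flag, ]
```

Combining duplicated() with fromLast = TRUE surfaces both the original and its copies, which supports the best practice of validating duplicates before removal.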
5. Phase 3: Additional SAS Enhancements
DATA food_enhanced;
SET food_clean;
/* Categorize Price */
LENGTH Price_Group $8. Region $8.;
IF Price < 200 THEN Price_Group="LOW";
ELSE IF Price < 400 THEN Price_Group="MEDIUM";
ELSE Price_Group="HIGH";
/* Derive Region */
IF Country IN ("INDIA","CHINA","JAPAN") THEN Region="ASIA";
ELSE IF Country IN ("USA") THEN Region="WEST";
ELSE Region="EUROPE";
RUN;
PROC PRINT DATA = food_enhanced;
RUN;
OUTPUT:
| Obs | ID | Country | Food | Price | Rating | Launch_Date | Launch_dt | Price_Group | Region |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | Biryani | 250 | 4.5 | 2023-01-10 | 2023-01-10 | MEDIUM | ASIA |
| 2 | 2 | USA | Burger | 150 | 4.2 | 2023-02-30 | . | LOW | WEST |
| 3 | 3 | ITALY | Pizza | 300 | 4.8 | 2023-03-12 | 2023-03-12 | MEDIUM | EUROPE |
| 4 | 4 | JAPAN | Sushi | 500 | 4.9 | 2023-04-05 | 2023-04-05 | HIGH | ASIA |
| 5 | 6 | USA | Burger | 100 | 4.2 | 2023-02-28 | 2023-02-28 | LOW | WEST |
| 6 | 7 | UNKNOWN | Pasta | 200 | 3.8 | 2023-03-15 | 2023-03-15 | MEDIUM | EUROPE |
| 7 | 8 | CHINA | Noodles | 50 | 4.1 | 2023-05-20 | 2023-05-20 | LOW | ASIA |
| 8 | 10 | JAPAN | Sushi | 500 | 4.9 | . | . | HIGH | ASIA |
Explanation
This phase introduces derived variables common in ADaM datasets. Price_Group helps in segmentation analysis, while Region enables geographical insights. These derivations mimic real-world clinical derivations like age groups or treatment arms. SAS allows structured transformation using conditional logic, ensuring traceability. In regulatory submissions, every derived variable must be documented in Define.xml. This step demonstrates how raw data evolves into analytical datasets, ready for reporting and visualization.
6. 20 Additional Data Cleaning Best Practices
- Always validate against source data
- Maintain audit trails
- Use controlled terminology (CDISC)
- Document derivation logic
- Avoid hardcoding values
- Use metadata-driven programming
- Validate duplicates before removal
- Use PROC COMPARE for QC
- Standardize formats early
- Handle missing values consistently
- Use macros for scalability
- Validate date ranges
- Cross-check with the SAP (Statistical Analysis Plan)
- Maintain version control
- Log all transformations
- Use validation datasets
- Apply business rules strictly
- Avoid overwriting raw data
- Use test cases
- Ensure reproducibility
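"Use PROC COMPARE for QC" usually means double programming: a second programmer independently derives the same dataset, and the two versions are compared. A sketch, where food_clean_qc is a hypothetical independently produced copy:

```sas
/* Compare the production dataset against an independent QC version.
   Both datasets must be sorted by the ID variable beforehand.
   OUTNOEQUAL keeps only rows that differ; LISTALL reports records
   present in one dataset but not the other. */
PROC COMPARE BASE=food_clean COMPARE=food_clean_qc
             OUT=compare_diffs OUTNOEQUAL LISTALL;
    ID ID;
RUN;
```

A clean PROC COMPARE report (zero unequal values) is the standard evidence of a validated dataset in double-programming workflows.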
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic.
- Missing price → replaced with a default because the business assumes a minimum cost
- Negative price → corrected using ABS (data entry error)
- Duplicate records → removed to avoid inflated demand
- Invalid dates → corrected to maintain timeline integrity
In clinical trials:
- Age < 0 → invalid, must be corrected or flagged
- Missing visit date → imputed using protocol rules
Bad cleaning = wrong insights = wrong decisions.
8. 20 Key Insights
- Dirty data leads to wrong conclusions
- Duplicate records inflate metrics
- Standardization ensures reproducibility
- Missing values must follow business rules
- SAS ensures regulatory compliance
- R enables fast prototyping
- Data cleaning is 70% of analytics
- Validation is non-negotiable
- Documentation is critical
- Audit trails ensure trust
- Dates must be consistent
- Text inconsistency breaks joins
- Negative values often indicate errors
- Imputation must be justified
- Duplicate removal must be traceable
- QC checks prevent disasters
- Clean data drives insights
- Poor data kills models
- Structured pipelines save time
- Clean data = reliable decisions
9. SAS vs R
| Feature | SAS | R |
|---|---|---|
| Regulatory Compliance | Strong | Moderate |
| Ease of Use | Structured | Flexible |
| Duplicate Handling | PROC SORT | distinct() |
| Scalability | High | Moderate |
| Audit Trails | Built-in | Manual |
10. Summary
This blog walked through a realistic, industry-style data cleaning workflow using a “Famous Foods in the World” dataset deliberately injected with common data quality issues: duplicates, missing values, invalid entries, and inconsistent text. The journey began with a raw dataset in both SAS and R, reflecting how real-world data often arrives unstructured and unreliable. Through structured phases, we demonstrated how SAS handles cleaning using functions like UPCASE, PROPCASE, ABS, and INPUT, followed by PROC SORT NODUPKEY to efficiently eliminate duplicate records based on key variables.
In parallel, we replicated the same transformations in R using dplyr, showcasing functions like mutate(), ifelse(), distinct(), and string-handling utilities. This dual approach highlights the complementary strengths of SAS and R: SAS for compliance-heavy, traceable workflows, and R for flexible, exploratory transformations.
We also extended the cleaned dataset with derived variables such as Price_Group and Region, simulating real-world ADaM-like derivations. Beyond coding, the blog emphasized 20 best practices aligned with clinical trial standards (SDTM/ADaM), including audit trails, validation checks, and reproducibility.
Critically, the discussion reinforced that data cleaning is not just technical; it is governed by business logic. Decisions like imputing missing values or correcting invalid entries must align with domain rules. The blog concluded with sharp insights and interview-focused Q&A, making it both practical and career-oriented. Overall, it demonstrated that handling duplicates and inconsistencies is foundational to producing reliable, decision-ready datasets.
11. Conclusion
Effective data cleaning is the backbone of any reliable analytical workflow, whether in global business analytics or regulated clinical trials. This exercise using a “Famous Foods” dataset clearly illustrates how even small inconsistencies like duplicate records or invalid values can significantly distort outcomes if left unaddressed. Duplicate handling, in particular, is not just a technical step but a critical control mechanism to ensure data integrity, accuracy, and trustworthiness.
SAS stands out as a powerful tool in this domain due to its structured programming paradigm, reproducibility, and strong support for regulatory compliance. Procedures like PROC SORT NODUPKEY provide deterministic and auditable duplicate removal, which is essential in environments such as FDA submissions. On the other hand, R offers flexibility and speed, enabling analysts to prototype and iterate quickly using intuitive libraries like dplyr. Together, they form a robust ecosystem for modern data professionals.
However, tools alone are not enough. The real strength lies in applying sound business logic, maintaining detailed documentation, and implementing rigorous validation checks. Every cleaning decision, whether imputing a missing value or correcting an anomaly, must be justifiable and traceable.
In the end, clean data is not just about aesthetics; it directly impacts strategic decisions, model accuracy, and regulatory credibility. A well-structured data cleaning framework ensures that the derived insights are not only accurate but also defensible. Mastering these practices positions you as a reliable and industry-ready data professional.
12. Interview Questions
Q1: How do you remove duplicates in SAS?
Answer: Use PROC SORT NODUPKEY with appropriate BY variables.
Q2: Difference between NODUPKEY and NODUP?
Answer:
- NODUPKEY → removes records that share the same BY-variable values, keeping the first occurrence
- NODUP → removes records that are exact duplicates of the preceding record across all variables
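The difference is easiest to see side by side. A sketch on the dataset from this post (by_key and exact_only are illustrative output dataset names):

```sas
/* NODUPKEY: one record per BY group -- the first occurrence wins */
PROC SORT DATA=food_clean OUT=by_key NODUPKEY;
    BY Country Food;
RUN;

/* NODUP: drops a record only when it matches the PREVIOUS record
   on ALL variables, so sort by enough keys to make true exact
   duplicates adjacent */
PROC SORT DATA=food_clean OUT=exact_only NODUP;
    BY Country Food Price Rating Launch_dt;
RUN;
```

Because NODUP only compares adjacent records, exact duplicates that end up non-adjacent after sorting can survive; NODUPKEY has no such caveat.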
Q3: How do you handle duplicates in R?
Answer: Use distinct() from dplyr.
Q4: What if duplicates contain conflicting values?
Answer: Apply business rules or aggregation (e.g., take the latest record).
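For instance, a "keep the latest record" rule can be sketched in dplyr, assuming the cleaned food_clean data frame from above (the rule itself is hypothetical; food_latest is an illustrative name):

```r
library(dplyr)

# Keep one record per Country/Food, preferring the newest Launch_Date.
# with_ties = FALSE guarantees exactly one row per group even when
# several records share the same date.
food_latest <- food_clean %>%
  group_by(Country, Food) %>%
  slice_max(Launch_Date, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Whatever rule is chosen (latest, highest-rated, aggregated mean), it should be documented so the resolution of conflicting duplicates remains traceable.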
Q5: Clinical scenario – duplicate patient IDs?
Answer: Validate against source, flag discrepancies, consult data management, document resolution.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only, using a fictional famous-foods dataset.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics