462. Cleaning Surprising Incident Records the Right Way: Raw Data to Real Intelligence
Chaos, Curiosity & Clean Code: Turning “Most Surprising Incidents in the World” into Reliable Insights with SAS & R
1. Introduction
Imagine you're working as a SAS programmer on a global clinical trial. You receive a dataset of adverse events from multiple countries. Everything looks fine until analysis begins:
- Patient ages include -5 and 230
- Dates like 2023-15-40
- Text fields contain "NULL", "unknown", "n/a", and blanks
- Duplicate patient records appear with slight variations
Now imagine building a safety report on top of this. That's not analytics; that's a disaster waiting to happen.
The same scenario applies to business datasets. Whether it's customer data, financial records, or, in our case, a "Most Surprising Incidents in the World" dataset, messy data silently corrupts results. This is where data cleaning becomes not just a step but a discipline.
Tools like SAS (strong in regulated environments like clinical trials) and R (flexible and powerful for transformation) play a crucial role in:
- Ensuring data integrity
- Supporting regulatory compliance
- Delivering accurate insights
In this blog, we'll simulate a messy real-world dataset, inject errors intentionally, and clean it step-by-step using both SAS and R.
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset)
DATA incidents_raw;
INFILE DATALINES MISSOVER;
INPUT Incident_ID :$5. Country :$10. Incident_Type :$10. Year
Age Description :$20. Impact_Score Date :$10.;
DATALINES;
I001 USA UFO 2020 45 strange_light 8 2020-05-10
I002 india ghost 2019 -5 haunted_house 7 2019-13-01
I003 UK mystery 2021 120 unknown . NULL
I004 USA UFO 2020 45 strange_light 8 2020-05-10
I005 Japan alien . 30 sighting 6 2021-07-15
I006 NULL ghost 2018 200 dark_shadow 9 2018-08-22
I007 Brazil mystery 2022 25 blank 5
I008 india ghost 2019 40 haunted_house 7 2019-11-01
I009 UK UFO 2021 35 strange_light 8 2021-06-30
I010 USA mystery 2023 29 unknown 6 2023-03-10
;
RUN;
PROC PRINT DATA = incidents_raw;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 2 | I002 | india | ghost | 2019 | -5 | haunted_house | 7 | 2019-13-01 |
| 3 | I003 | UK | mystery | 2021 | 120 | unknown | . | NULL |
| 4 | I004 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 5 | I005 | Japan | alien | . | 30 | sighting | 6 | 2021-07-15 |
| 6 | I006 | NULL | ghost | 2018 | 200 | dark_shadow | 9 | 2018-08-22 |
| 7 | I007 | Brazil | mystery | 2022 | 25 | blank | 5 | |
| 8 | I008 | india | ghost | 2019 | 40 | haunted_house | 7 | 2019-11-01 |
| 9 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 |
| 10 | I010 | USA | mystery | 2023 | 29 | unknown | 6 | 2023-03-10 |
Explanation
This dataset intentionally contains multiple real-world issues. We have:
- Missing values (. and NULL)
- Invalid age values (-5, 200)
- A duplicate record (I001 & I004)
- Inconsistent text (india vs India, NULL, blank)
- An invalid date format (2019-13-01)
- A missing year
These mimic real clinical trial issues, like incorrect patient demographics or adverse event dates. Using DATALINES simulates raw ingestion from external sources like CSV files or CRF systems.
The : modifier tells SAS to read until the delimiter or the informat's maximum length, whichever comes first. This combines:
- The flexibility of list input
- The control of formatted input
It prevents over-reading (e.g., pulling "strange_light 8" into one field) and ensures clean separation. This is an industry best practice when dealing with semi-structured data sources like flat files in clinical trials.
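The "read until delimiter, capped at a maximum length" idea is not SAS-specific. Here is a minimal, hypothetical Python sketch (the function name and length list are our own, purely for illustration) of how list-style reading with per-field length caps keeps adjacent fields like strange_light and 8 separate:

```python
def read_record(line, max_lens):
    """Split a whitespace-delimited record, truncating each token to its
    declared maximum length (in the spirit of SAS ':' modified list input)."""
    tokens = line.split()
    fields = []
    for token, max_len in zip(tokens, max_lens):
        # Stop at the delimiter OR the max length, whichever comes first
        fields.append(token[:max_len])
    return fields

# One raw line from the incidents data: ID($5), Country($10), Type($10), ...
record = read_record("I001 USA UFO 2020 45 strange_light 8 2020-05-10",
                     [5, 10, 10, 4, 3, 20, 2, 10])
print(record)  # Description stays 'strange_light'; the score stays '8'
```

Because the split happens at whitespace first, a long description can never swallow the numeric field that follows it.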
R Code – Equivalent Raw Dataset
incidents_raw <- data.frame(
Incident_ID = c("I001","I002","I003","I004","I005","I006",
"I007","I008","I009","I010"),
Country = c("USA","india","UK","USA","Japan","NULL","Brazil",
"india","UK","USA"),
Incident_Type = c("UFO","ghost","mystery","UFO","alien","ghost",
"mystery","ghost","UFO","mystery"),
Year = c(2020,2019,2021,2020,NA,2018,2022,2019,2021,2023),
Age = c(45,-5,120,45,30,200,25,40,35,29),
Description = c("strange_light","haunted_house","unknown",
"strange_light","sighting","dark_shadow",
"blank","haunted_house","strange_light","unknown"),
Impact_Score = c(8,7,NA,8,6,9,5,7,8,6),
Date = c("2020-05-10","2019-13-01","NULL","2020-05-10",
"2021-07-15","2018-08-22",NA,"2019-11-01","2021-06-30",
"2023-03-10")
)
OUTPUT:
| | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 2 | I002 | india | ghost | 2019 | -5 | haunted_house | 7 | 2019-13-01 |
| 3 | I003 | UK | mystery | 2021 | 120 | unknown | NA | NULL |
| 4 | I004 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 5 | I005 | Japan | alien | NA | 30 | sighting | 6 | 2021-07-15 |
| 6 | I006 | NULL | ghost | 2018 | 200 | dark_shadow | 9 | 2018-08-22 |
| 7 | I007 | Brazil | mystery | 2022 | 25 | blank | 5 | NA |
| 8 | I008 | india | ghost | 2019 | 40 | haunted_house | 7 | 2019-11-01 |
| 9 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 |
| 10 | I010 | USA | mystery | 2023 | 29 | unknown | 6 | 2023-03-10 |
Explanation
This R dataset mirrors the SAS input. It includes:
- NA values
- Invalid dates stored as strings
- Inconsistent text values
- Duplicate entries
R treats everything flexibly, which is both powerful and dangerous. Without strict typing, errors propagate easily. This is why validation layers are essential.
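What a "validation layer" means in practice is a set of explicit rules run against every record before analysis. A minimal sketch (shown in Python for neutrality; the function name and rule wording are our own) might look like this:

```python
def validate_row(row):
    """Return a list of data-quality issues found in one incident record."""
    issues = []
    # Categorical sentinel values that should never reach analysis
    if row.get("Country") in (None, "", "NULL"):
        issues.append("missing country")
    # Range check: ages outside 0-100 are implausible for this dataset
    age = row.get("Age")
    if age is not None and not (0 <= age <= 100):
        issues.append(f"implausible age {age}")
    if row.get("Impact_Score") is None:
        issues.append("missing impact score")
    return issues

row = {"Country": "NULL", "Age": 200, "Impact_Score": 9}
print(validate_row(row))  # ['missing country', 'implausible age 200']
```

Running such rules over the whole dataset produces an issue log, which is exactly the audit artifact regulated environments expect.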
3. Phase 1: Data Cleaning in SAS
DATA incidents_clean;
SET incidents_raw;
IF Country IN ("NULL", "", " ") THEN Country = "UNKNOWN";
Country = UPCASE(STRIP(Country));
Incident_Type = UPCASE(STRIP(Incident_Type));
IF Age < 0 OR Age > 100 THEN Age = .;
IF Date NOT IN (".", "NULL", "") THEN
Date_clean = INPUT(Date, yymmdd10.);
FORMAT Date_clean yymmdd10.;
IF Year = . AND NOT MISSING(Date_clean) THEN
Year = YEAR(Date_clean);
RUN;
PROC PRINT DATA = incidents_clean;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
| 2 | I002 | INDIA | GHOST | 2019 | . | haunted_house | 7 | 2019-13-01 | . |
| 3 | I003 | UK | MYSTERY | 2021 | . | unknown | . | NULL | . |
| 4 | I004 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
| 5 | I005 | JAPAN | ALIEN | 2021 | 30 | sighting | 6 | 2021-07-15 | 2021-07-15 |
| 6 | I006 | UNKNOWN | GHOST | 2018 | . | dark_shadow | 9 | 2018-08-22 | 2018-08-22 |
| 7 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | | . |
| 8 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 | 2019-11-01 |
| 9 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 | 2021-06-30 |
| 10 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 | 2023-03-10 |
PROC SORT DATA=incidents_clean NODUPKEY;
BY Country Incident_Type Year Age;
RUN;
PROC PRINT DATA = incidents_clean;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | | . |
| 2 | I002 | INDIA | GHOST | 2019 | . | haunted_house | 7 | 2019-13-01 | . |
| 3 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 | 2019-11-01 |
| 4 | I005 | JAPAN | ALIEN | 2021 | 30 | sighting | 6 | 2021-07-15 | 2021-07-15 |
| 5 | I003 | UK | MYSTERY | 2021 | . | unknown | . | NULL | . |
| 6 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 | 2021-06-30 |
| 7 | I006 | UNKNOWN | GHOST | 2018 | . | dark_shadow | 9 | 2018-08-22 | 2018-08-22 |
| 8 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 | 2023-03-10 |
| 9 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
PROC SORT DATA=incidents_clean NODUPKEY;
BY Incident_ID;
RUN;
PROC PRINT DATA = incidents_clean;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
| 2 | I002 | INDIA | GHOST | 2019 | . | haunted_house | 7 | 2019-13-01 | . |
| 3 | I003 | UK | MYSTERY | 2021 | . | unknown | . | NULL | . |
| 4 | I005 | JAPAN | ALIEN | 2021 | 30 | sighting | 6 | 2021-07-15 | 2021-07-15 |
| 5 | I006 | UNKNOWN | GHOST | 2018 | . | dark_shadow | 9 | 2018-08-22 | 2018-08-22 |
| 6 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | | . |
| 7 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 | 2019-11-01 |
| 8 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 | 2021-06-30 |
| 9 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 | 2023-03-10 |
Explanation
This SAS step demonstrates core data-cleaning techniques:
- Conditional IN-list logic replaces "NULL" and blank countries with UNKNOWN
- UPCASE + STRIP standardizes text
- Age validation removes unrealistic values
- INPUT() converts character dates into SAS date values
- A missing Year is derived from the cleaned date
- PROC SORT NODUPKEY removes duplicates
In clinical trials, such logic ensures:
- Consistency across SDTM domains
- Accurate patient-level analysis
- Compliance with regulatory expectations
4. Phase 2: Data Cleaning in R
library(dplyr)
incidents_clean <- incidents_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country %in% c("NULL", NA), "UNKNOWN", Country))),
Incident_Type = toupper(trimws(Incident_Type)),
Age = ifelse(Age < 0 | Age > 100, NA, Age),
Date = as.Date(Date, format="%Y-%m-%d")
) %>%
distinct(Country,Incident_Type,Year,Age, .keep_all = TRUE)
OUTPUT:
| | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 2 | I002 | INDIA | GHOST | 2019 | NA | haunted_house | 7 | NA |
| 3 | I003 | UK | MYSTERY | 2021 | NA | unknown | NA | NA |
| 4 | I005 | JAPAN | ALIEN | NA | 30 | sighting | 6 | 2021-07-15 |
| 5 | I006 | UNKNOWN | GHOST | 2018 | NA | dark_shadow | 9 | 2018-08-22 |
| 6 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | NA |
| 7 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 |
| 8 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 |
| 9 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 |
Explanation
This R pipeline uses dplyr for transformation:
- mutate() handles the transformations
- toupper() + trimws() standardize text
- Conditional logic cleans Age
- as.Date() converts date strings (invalid dates become NA)
- distinct() removes duplicates
R is highly expressive, making transformations readable. However, it lacks SAS-level strictness unless explicitly enforced.
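distinct(..., .keep_all = TRUE) keeps the first row for each unique key combination. That first-wins deduplication is easy to state generically; here is a small Python sketch (the function name is our own) of the same behavior:

```python
def distinct(records, keys):
    """Keep the first record for each unique combination of key fields,
    in the spirit of dplyr's distinct(..., .keep_all = TRUE)."""
    seen = set()
    out = []
    for rec in records:
        key = tuple(rec.get(k) for k in keys)
        if key not in seen:       # first occurrence wins
            seen.add(key)
            out.append(rec)
    return out

rows = [{"ID": "I001", "Country": "USA", "Age": 45},
        {"ID": "I004", "Country": "USA", "Age": 45},   # duplicate key
        {"ID": "I005", "Country": "JAPAN", "Age": 30}]
print(distinct(rows, ["Country", "Age"]))  # I004 is dropped
```

One design caveat worth knowing in both tools: "first wins" depends on row order, so sort (or otherwise fix the order) deliberately before deduplicating.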
5. Phase 3: Additional SAS Processing
PROC FREQ DATA=incidents_clean;
TABLES Country Incident_Type;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| BRAZIL | 1 | 11.11 | 1 | 11.11 |
| INDIA | 2 | 22.22 | 3 | 33.33 |
| JAPAN | 1 | 11.11 | 4 | 44.44 |
| UK | 2 | 22.22 | 6 | 66.67 |
| UNKNOWN | 1 | 11.11 | 7 | 77.78 |
| USA | 2 | 22.22 | 9 | 100.00 |
| Incident_Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| ALIEN | 1 | 11.11 | 1 | 11.11 |
| GHOST | 3 | 33.33 | 4 | 44.44 |
| MYSTERY | 3 | 33.33 | 7 | 77.78 |
| UFO | 2 | 22.22 | 9 | 100.00 |
PROC MEANS DATA=incidents_clean;
VAR Age Impact_Score;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Age | 6 | 34.0000000 | 7.4833148 | 25.0000000 | 45.0000000 |
| Impact_Score | 8 | 7.0000000 | 1.3093073 | 5.0000000 | 9.0000000 |
Explanation
These steps validate the cleaned data:
- PROC FREQ checks distribution consistency
- PROC MEANS identifies numeric anomalies
In real-world projects, this is part of QC validation, ensuring cleaned data behaves logically before analysis.
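The FREQ/MEANS pattern (count categories, summarize numerics, eyeball for anomalies) takes only a few lines in any language. A Python sketch using stdlib helpers, with the cleaned values from the tables above hard-coded for illustration:

```python
from collections import Counter
from statistics import mean

countries = ["USA", "INDIA", "UK", "JAPAN", "UNKNOWN",
             "BRAZIL", "INDIA", "UK", "USA"]
ages = [45, 40, 30, 25, 35, 29]  # the six non-missing ages after cleaning

# PROC FREQ analogue: category distribution
print(Counter(countries).most_common())

# PROC MEANS analogue: n, mean, min, max flag anomalies at a glance
print(len(ages), round(mean(ages), 2), min(ages), max(ages))
```

The mean of 34 and range 25-45 match the PROC MEANS output above; if an age of 200 had survived cleaning, the max would expose it immediately.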
6. 20 Additional Data Cleaning Best Practices
- Always validate raw data against the protocol
- Maintain an audit trail for all transformations
- Use SDTM standards for structure
- Never overwrite raw datasets
- Use controlled terminology
- Validate date formats rigorously
- Perform double programming (QC)
- Flag outliers before removing them
- Maintain metadata documentation
- Use macros for reusable cleaning logic
- Apply derivation rules consistently
- Validate key identifiers (USUBJID)
- Avoid hardcoding values
- Log all data issues
- Perform cross-domain validation
- Ensure reproducibility
- Validate merge logic
- Use PROC COMPARE for QC
- Document assumptions
- Follow regulatory guidelines (FDA/CDISC)
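Two of the practices above, double programming and PROC COMPARE, boil down to producing the same dataset twice, independently, and diffing the results. A minimal, hypothetical diff in Python (the function name and report wording are ours, not any PROC COMPARE output format):

```python
def compare_datasets(prod, qc, key):
    """Report row-level mismatches between two datasets keyed on `key`,
    in the spirit of PROC COMPARE for double-programming QC."""
    qc_by_key = {rec[key]: rec for rec in qc}
    diffs = []
    for rec in prod:
        other = qc_by_key.get(rec[key])
        if other is None:
            diffs.append((rec[key], "missing in QC dataset"))
        elif rec != other:
            diffs.append((rec[key], "values differ"))
    return diffs

prod = [{"ID": "I001", "Age": 45}, {"ID": "I002", "Age": None}]
qc   = [{"ID": "I001", "Age": 45}, {"ID": "I002", "Age": -5}]  # QC missed a rule
print(compare_datasets(prod, qc, "ID"))  # [('I002', 'values differ')]
```

An empty diff is the QC sign-off; any mismatch means one of the two programs applied a rule the other did not.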
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic and domain understanding. For example, in clinical trials, a patient age of -5 is not just incorrect; it can distort safety analysis. Similarly, a missing visit date can affect time-to-event analysis.
Replacing missing categorical values with "UNKNOWN" ensures consistency without introducing bias. Numerical fields, however, require caution: imputing values incorrectly can mislead conclusions.
Consider salary normalization in business analytics: if one entry is in USD and another in INR without conversion, average salary calculations become meaningless.
Date imputation is another critical example. If a treatment start date is missing, analysts may use the first known visit date, but this must be justified and documented.
In our incident dataset, cleaning ensures:
- Correct grouping by country
- Accurate trend analysis by year
- Reliable pattern detection
Ultimately, data cleaning is about preserving truth while enabling analysis.
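The salary example above is easy to quantify: convert everything to one currency before aggregating, or the average is meaningless. A sketch with made-up records and an assumed exchange rate (not live FX data):

```python
# Hypothetical mixed-currency salary records and an illustrative FX rate
salaries = [("USD", 60000), ("INR", 2500000), ("USD", 80000)]
to_usd = {"USD": 1.0, "INR": 0.012}  # assumed rate, for demonstration only

# Naive average: mixes units, so the INR entry dominates
naive_avg = sum(amount for _, amount in salaries) / len(salaries)

# Correct average: normalize to USD first
normalized = [amount * to_usd[cur] for cur, amount in salaries]
real_avg = sum(normalized) / len(normalized)

print(round(naive_avg))  # wildly inflated by the unconverted INR entry
print(round(real_avg))   # a comparable USD figure
```

The two numbers differ by more than an order of magnitude, which is exactly the kind of silent corruption unit-aware cleaning prevents.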
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing data must be handled strategically
- Outliers can be errors or insights
- Text inconsistency breaks grouping
- Dates drive time-based analysis
- Validation is not optional
- Duplicate data inflates results
- Cleaning is iterative
- Domain knowledge is critical
- SAS ensures compliance
- R ensures flexibility
- Metadata is your backbone
- Audit trails are mandatory
- QC prevents disasters
- Automation improves efficiency
- Never trust raw data blindly
- Consistency beats complexity
- Cleaning improves model accuracy
- Good data = good decisions
9. Summary
Data cleaning is the backbone of reliable analytics, especially in regulated environments like clinical trials. In this blog, we explored a messy dataset representing "Most Surprising Incidents in the World" and demonstrated how errors (missing values, duplicates, inconsistent text, and invalid entries) can compromise analysis.
Using SAS, we leveraged structured approaches like the DATA step, PROC SORT, and validation procedures to ensure consistency and compliance. SAS excels in environments where auditability, traceability, and regulatory standards are critical.
In contrast, R provided a flexible and expressive approach using dplyr. Its pipeline-based transformations made cleaning intuitive and efficient, especially for exploratory analysis and rapid prototyping.
The key takeaway is not choosing SAS vs R but understanding where each fits:
- SAS for clinical, production-grade pipelines
- R for exploration and advanced transformations
We also covered best practices aligned with SDTM/ADaM standards, ensuring that cleaned data is not only accurate but also compliant.
Ultimately, clean data enables:
- Better decision-making
- Reliable reporting
- Regulatory approval
Without it, even the most advanced analytics fail.
10. Conclusion
Data cleaning is often underestimated, but in reality it is the most critical phase of any data pipeline. Whether you're working with clinical trial datasets, financial records, or global incident reports, the quality of your output is directly tied to the quality of your input.
Through this blog, we simulated a realistic messy dataset and demonstrated how structured cleaning approaches in SAS and R can transform chaos into clarity. The techniques we applied (handling missing values, validating ranges, standardizing text, and removing duplicates) are not just technical steps; they are essential practices that ensure data integrity and trustworthiness.
In clinical research, poor data quality can lead to incorrect conclusions about drug safety or efficacy. In business, it can result in flawed strategies and financial loss. This is why organizations invest heavily in data governance frameworks, validation processes, and audit trails.
SAS provides a robust, compliance-driven environment ideal for production and regulatory submissions. R complements it with flexibility and speed, making it ideal for exploratory and advanced analytics.
The real mastery lies in combining both, using the right tool for the right task.
As a data professional, your responsibility is not just to analyze data but to protect its integrity. Because at the end of the day, clean data is not just about accuracy; it's about credibility.
11. Interview Questions
Q1: How do you handle missing values in SAS?
A: Use COALESCEC for character variables and conditional logic for numeric fields. Always justify any imputation.
Q2: How do you remove duplicates?
A: PROC SORT DATA=dataset NODUPKEY; BY key_variable; RUN;
Q3: How do you validate date formats?
A: Use INPUT() with the proper informat and check the log for invalid conversions.
Q4: R scenario: a dataset has mixed-case country names.
A: Use toupper(trimws(x)) for standardization.
Q5: Debugging scenario: why is Age missing after cleaning?
A: Check the cleaning conditions; values outside the valid range may have been set to missing intentionally.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only; the "surprising incidents" data is entirely fictional.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics