462. Cleaning Surprising Incident Records the Right Way: Raw Data to Real Intelligence
Chaos, Curiosity & Clean Code: Turning “Most Surprising Incidents in the World” into Reliable Insights with SAS & R
1. Introduction
Imagine you're working as a SAS programmer on a global clinical trial. You receive a dataset of adverse events from multiple countries. Everything looks fine until analysis begins:
- Patient ages include -5 and 230
- Dates like 2023-15-40
- Text fields contain "NULL", "unknown", "n/a", and blanks
- Duplicate patient records appear with slight variations
Now imagine building a safety report on top of this. That's not analytics; that's a disaster waiting to happen.
The same scenario applies to business datasets. Whether it's customer data, financial records, or, in our case, a "Most Surprising Incidents in the World" dataset, messy data silently corrupts results. This is where data cleaning becomes not just a step but a discipline.
Tools like SAS (strong in regulated environments like clinical trials) and R (flexible and powerful for transformation) play a crucial role in:
- Ensuring data integrity
- Supporting regulatory compliance
- Delivering accurate insights
In this blog, we'll simulate a messy real-world dataset, inject errors intentionally, and clean it step-by-step using both SAS and R.
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset)
DATA incidents_raw;
INFILE DATALINES MISSOVER;
INPUT Incident_ID :$5. Country :$10. Incident_Type :$10. Year
Age Description :$20. Impact_Score Date :$10.;
DATALINES;
I001 USA UFO 2020 45 strange_light 8 2020-05-10
I002 india ghost 2019 -5 haunted_house 7 2019-13-01
I003 UK mystery 2021 120 unknown . NULL
I004 USA UFO 2020 45 strange_light 8 2020-05-10
I005 Japan alien . 30 sighting 6 2021-07-15
I006 NULL ghost 2018 200 dark_shadow 9 2018-08-22
I007 Brazil mystery 2022 25 blank 5
I008 india ghost 2019 40 haunted_house 7 2019-11-01
I009 UK UFO 2021 35 strange_light 8 2021-06-30
I010 USA mystery 2023 29 unknown 6 2023-03-10
;
RUN;
PROC PRINT DATA = incidents_raw;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 2 | I002 | india | ghost | 2019 | -5 | haunted_house | 7 | 2019-13-01 |
| 3 | I003 | UK | mystery | 2021 | 120 | unknown | . | NULL |
| 4 | I004 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 5 | I005 | Japan | alien | . | 30 | sighting | 6 | 2021-07-15 |
| 6 | I006 | NULL | ghost | 2018 | 200 | dark_shadow | 9 | 2018-08-22 |
| 7 | I007 | Brazil | mystery | 2022 | 25 | blank | 5 | |
| 8 | I008 | india | ghost | 2019 | 40 | haunted_house | 7 | 2019-11-01 |
| 9 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 |
| 10 | I010 | USA | mystery | 2023 | 29 | unknown | 6 | 2023-03-10 |
Explanation
This dataset intentionally contains multiple real-world issues. We have:
- Missing values (. and NULL)
- Invalid age values (-5, 200)
- A duplicate record (I001 & I004)
- Inconsistent text (india vs India, NULL, blank)
- An invalid date format (2019-13-01)
- A missing year
These mimic real clinical trial issues, like incorrect patient demographics or adverse event dates. Using DATALINES simulates raw ingestion from external sources like CSV files or CRF systems.
The : modifier tells SAS to read until the delimiter or the informat's maximum length, whichever comes first. This combines:
- The flexibility of list input
- The control of formatted input
It prevents over-reading (e.g., pulling "strange_light 8" into one field) and ensures clean separation. This is an industry best practice when dealing with semi-structured data sources like flat files in clinical trials.
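The "read until delimiter, capped at a maximum length" idea is not SAS-specific. Here is a minimal, hypothetical Python sketch (the function name and length list are our own, purely for illustration) of how list-style reading with per-field length caps keeps adjacent fields like strange_light and 8 separate:

```python
def read_record(line, max_lens):
    """Split a whitespace-delimited record, truncating each token to its
    declared maximum length (in the spirit of SAS ':' modified list input)."""
    tokens = line.split()
    fields = []
    for token, max_len in zip(tokens, max_lens):
        # Stop at the delimiter OR the max length, whichever comes first
        fields.append(token[:max_len])
    return fields

# One raw line from the incidents data: ID($5), Country($10), Type($10), ...
record = read_record("I001 USA UFO 2020 45 strange_light 8 2020-05-10",
                     [5, 10, 10, 4, 3, 20, 2, 10])
print(record)  # Description stays 'strange_light'; the score stays '8'
```

Because the split happens at whitespace first, a long description can never swallow the numeric field that follows it.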
R Code – Equivalent Raw Dataset
incidents_raw <- data.frame(
Incident_ID = c("I001","I002","I003","I004","I005","I006",
"I007","I008","I009","I010"),
Country = c("USA","india","UK","USA","Japan","NULL","Brazil",
"india","UK","USA"),
Incident_Type = c("UFO","ghost","mystery","UFO","alien","ghost",
"mystery","ghost","UFO","mystery"),
Year = c(2020,2019,2021,2020,NA,2018,2022,2019,2021,2023),
Age = c(45,-5,120,45,30,200,25,40,35,29),
Description = c("strange_light","haunted_house","unknown",
"strange_light","sighting","dark_shadow",
"blank","haunted_house","strange_light","unknown"),
Impact_Score = c(8,7,NA,8,6,9,5,7,8,6),
Date = c("2020-05-10","2019-13-01","NULL","2020-05-10",
"2021-07-15","2018-08-22",NA,"2019-11-01","2021-06-30",
"2023-03-10")
)
OUTPUT:
| | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 2 | I002 | india | ghost | 2019 | -5 | haunted_house | 7 | 2019-13-01 |
| 3 | I003 | UK | mystery | 2021 | 120 | unknown | NA | NULL |
| 4 | I004 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 5 | I005 | Japan | alien | NA | 30 | sighting | 6 | 2021-07-15 |
| 6 | I006 | NULL | ghost | 2018 | 200 | dark_shadow | 9 | 2018-08-22 |
| 7 | I007 | Brazil | mystery | 2022 | 25 | blank | 5 | NA |
| 8 | I008 | india | ghost | 2019 | 40 | haunted_house | 7 | 2019-11-01 |
| 9 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 |
| 10 | I010 | USA | mystery | 2023 | 29 | unknown | 6 | 2023-03-10 |
Explanation
This R dataset mirrors the SAS input. It includes:
- NA values
- Invalid dates stored as strings
- Inconsistent text values
- Duplicate entries
R treats everything flexibly, which is both powerful and dangerous. Without strict typing, errors propagate easily. This is why validation layers are essential.
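What a "validation layer" means in practice is a set of explicit rules run against every record before analysis. A minimal sketch (shown in Python for neutrality; the function name and rule wording are our own) might look like this:

```python
def validate_row(row):
    """Return a list of data-quality issues found in one incident record."""
    issues = []
    # Categorical sentinel values that should never reach analysis
    if row.get("Country") in (None, "", "NULL"):
        issues.append("missing country")
    # Range check: ages outside 0-100 are implausible for this dataset
    age = row.get("Age")
    if age is not None and not (0 <= age <= 100):
        issues.append(f"implausible age {age}")
    if row.get("Impact_Score") is None:
        issues.append("missing impact score")
    return issues

row = {"Country": "NULL", "Age": 200, "Impact_Score": 9}
print(validate_row(row))  # ['missing country', 'implausible age 200']
```

Running such rules over the whole dataset produces an issue log, which is exactly the audit artifact regulated environments expect.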
3. Phase 1: Data Cleaning in SAS
DATA incidents_clean;
SET incidents_raw;
IF Country IN ("NULL", "", " ") THEN Country = "UNKNOWN";
Country = UPCASE(STRIP(Country));
Incident_Type = UPCASE(STRIP(Incident_Type));
IF Age < 0 OR Age > 100 THEN Age = .;
IF Date NOT IN (".", "NULL", "") THEN
Date_clean = INPUT(Date, yymmdd10.);
FORMAT Date_clean yymmdd10.;
IF Year = . AND NOT MISSING(Date_clean) THEN
Year = YEAR(Date_clean);
RUN;
PROC PRINT DATA = incidents_clean;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
| 2 | I002 | INDIA | GHOST | 2019 | . | haunted_house | 7 | 2019-13-01 | . |
| 3 | I003 | UK | MYSTERY | 2021 | . | unknown | . | NULL | . |
| 4 | I004 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
| 5 | I005 | JAPAN | ALIEN | 2021 | 30 | sighting | 6 | 2021-07-15 | 2021-07-15 |
| 6 | I006 | UNKNOWN | GHOST | 2018 | . | dark_shadow | 9 | 2018-08-22 | 2018-08-22 |
| 7 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | | . |
| 8 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 | 2019-11-01 |
| 9 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 | 2021-06-30 |
| 10 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 | 2023-03-10 |
PROC SORT DATA=incidents_clean NODUPKEY;
BY Country Incident_Type Year Age;
RUN;
PROC PRINT DATA = incidents_clean;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | | . |
| 2 | I002 | INDIA | GHOST | 2019 | . | haunted_house | 7 | 2019-13-01 | . |
| 3 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 | 2019-11-01 |
| 4 | I005 | JAPAN | ALIEN | 2021 | 30 | sighting | 6 | 2021-07-15 | 2021-07-15 |
| 5 | I003 | UK | MYSTERY | 2021 | . | unknown | . | NULL | . |
| 6 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 | 2021-06-30 |
| 7 | I006 | UNKNOWN | GHOST | 2018 | . | dark_shadow | 9 | 2018-08-22 | 2018-08-22 |
| 8 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 | 2023-03-10 |
| 9 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
PROC SORT DATA=incidents_clean NODUPKEY;
BY Incident_ID;
RUN;
PROC PRINT DATA = incidents_clean;
RUN;
OUTPUT:
| Obs | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 | 2020-05-10 |
| 2 | I002 | INDIA | GHOST | 2019 | . | haunted_house | 7 | 2019-13-01 | . |
| 3 | I003 | UK | MYSTERY | 2021 | . | unknown | . | NULL | . |
| 4 | I005 | JAPAN | ALIEN | 2021 | 30 | sighting | 6 | 2021-07-15 | 2021-07-15 |
| 5 | I006 | UNKNOWN | GHOST | 2018 | . | dark_shadow | 9 | 2018-08-22 | 2018-08-22 |
| 6 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | | . |
| 7 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 | 2019-11-01 |
| 8 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 | 2021-06-30 |
| 9 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 | 2023-03-10 |
Explanation
This SAS step demonstrates core data-cleaning techniques:
- Conditional IN-list logic replaces "NULL" and blank countries with UNKNOWN
- UPCASE + STRIP standardizes text
- Age validation removes unrealistic values
- INPUT() converts character dates into SAS date values
- A missing Year is derived from the cleaned date
- PROC SORT NODUPKEY removes duplicates
In clinical trials, such logic ensures:
- Consistency across SDTM domains
- Accurate patient-level analysis
- Compliance with regulatory expectations
4. Phase 2: Data Cleaning in R
library(dplyr)
incidents_clean <- incidents_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country %in% c("NULL", NA), "UNKNOWN", Country))),
Incident_Type = toupper(trimws(Incident_Type)),
Age = ifelse(Age < 0 | Age > 100, NA, Age),
Date = as.Date(Date, format="%Y-%m-%d")
) %>%
distinct(Country,Incident_Type,Year,Age, .keep_all = TRUE)
OUTPUT:
| | Incident_ID | Country | Incident_Type | Year | Age | Description | Impact_Score | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | I001 | USA | UFO | 2020 | 45 | strange_light | 8 | 2020-05-10 |
| 2 | I002 | INDIA | GHOST | 2019 | NA | haunted_house | 7 | NA |
| 3 | I003 | UK | MYSTERY | 2021 | NA | unknown | NA | NA |
| 4 | I005 | JAPAN | ALIEN | NA | 30 | sighting | 6 | 2021-07-15 |
| 5 | I006 | UNKNOWN | GHOST | 2018 | NA | dark_shadow | 9 | 2018-08-22 |
| 6 | I007 | BRAZIL | MYSTERY | 2022 | 25 | blank | 5 | NA |
| 7 | I008 | INDIA | GHOST | 2019 | 40 | haunted_house | 7 | 2019-11-01 |
| 8 | I009 | UK | UFO | 2021 | 35 | strange_light | 8 | 2021-06-30 |
| 9 | I010 | USA | MYSTERY | 2023 | 29 | unknown | 6 | 2023-03-10 |
Explanation
This R pipeline uses dplyr for transformation:
- mutate() handles the transformations
- toupper() + trimws() standardize text
- Conditional logic cleans Age
- as.Date() converts date strings (invalid dates become NA)
- distinct() removes duplicates
R is highly expressive, making transformations readable. However, it lacks SAS-level strictness unless explicitly enforced.
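distinct(..., .keep_all = TRUE) keeps the first row for each unique key combination. That first-wins deduplication is easy to state generically; here is a small Python sketch (the function name is our own) of the same behavior:

```python
def distinct(records, keys):
    """Keep the first record for each unique combination of key fields,
    in the spirit of dplyr's distinct(..., .keep_all = TRUE)."""
    seen = set()
    out = []
    for rec in records:
        key = tuple(rec.get(k) for k in keys)
        if key not in seen:       # first occurrence wins
            seen.add(key)
            out.append(rec)
    return out

rows = [{"ID": "I001", "Country": "USA", "Age": 45},
        {"ID": "I004", "Country": "USA", "Age": 45},   # duplicate key
        {"ID": "I005", "Country": "JAPAN", "Age": 30}]
print(distinct(rows, ["Country", "Age"]))  # I004 is dropped
```

One design caveat worth knowing in both tools: "first wins" depends on row order, so sort (or otherwise fix the order) deliberately before deduplicating.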
5. Phase 3: Additional SAS Processing
PROC FREQ DATA=incidents_clean;
TABLES Country Incident_Type;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| BRAZIL | 1 | 11.11 | 1 | 11.11 |
| INDIA | 2 | 22.22 | 3 | 33.33 |
| JAPAN | 1 | 11.11 | 4 | 44.44 |
| UK | 2 | 22.22 | 6 | 66.67 |
| UNKNOWN | 1 | 11.11 | 7 | 77.78 |
| USA | 2 | 22.22 | 9 | 100.00 |
| Incident_Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| ALIEN | 1 | 11.11 | 1 | 11.11 |
| GHOST | 3 | 33.33 | 4 | 44.44 |
| MYSTERY | 3 | 33.33 | 7 | 77.78 |
| UFO | 2 | 22.22 | 9 | 100.00 |
PROC MEANS DATA=incidents_clean;
VAR Age Impact_Score;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Age | 6 | 34.0000000 | 7.4833148 | 25.0000000 | 45.0000000 |
| Impact_Score | 8 | 7.0000000 | 1.3093073 | 5.0000000 | 9.0000000 |
Explanation
These steps validate the cleaned data:
- PROC FREQ checks distribution consistency
- PROC MEANS identifies numeric anomalies
In real-world projects, this is part of QC validation, ensuring cleaned data behaves logically before analysis.
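The FREQ/MEANS pattern (count categories, summarize numerics, eyeball for anomalies) takes only a few lines in any language. A Python sketch using stdlib helpers, with the cleaned values from the tables above hard-coded for illustration:

```python
from collections import Counter
from statistics import mean

countries = ["USA", "INDIA", "UK", "JAPAN", "UNKNOWN",
             "BRAZIL", "INDIA", "UK", "USA"]
ages = [45, 40, 30, 25, 35, 29]  # the six non-missing ages after cleaning

# PROC FREQ analogue: category distribution
print(Counter(countries).most_common())

# PROC MEANS analogue: n, mean, min, max flag anomalies at a glance
print(len(ages), round(mean(ages), 2), min(ages), max(ages))
```

The mean of 34 and range 25-45 match the PROC MEANS output above; if an age of 200 had survived cleaning, the max would expose it immediately.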
6. 20 Additional Data Cleaning Best Practices
- Always validate raw data against the protocol
- Maintain an audit trail for all transformations
- Use SDTM standards for structure
- Never overwrite raw datasets
- Use controlled terminology
- Validate date formats rigorously
- Perform double programming (QC)
- Flag outliers before removing them
- Maintain metadata documentation
- Use macros for reusable cleaning logic
- Apply derivation rules consistently
- Validate key identifiers (USUBJID)
- Avoid hardcoding values
- Log all data issues
- Perform cross-domain validation
- Ensure reproducibility
- Validate merge logic
- Use PROC COMPARE for QC
- Document assumptions
- Follow regulatory guidelines (FDA/CDISC)
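Two of the practices above, double programming and PROC COMPARE, boil down to producing the same dataset twice, independently, and diffing the results. A minimal, hypothetical diff in Python (the function name and report wording are ours, not any PROC COMPARE output format):

```python
def compare_datasets(prod, qc, key):
    """Report row-level mismatches between two datasets keyed on `key`,
    in the spirit of PROC COMPARE for double-programming QC."""
    qc_by_key = {rec[key]: rec for rec in qc}
    diffs = []
    for rec in prod:
        other = qc_by_key.get(rec[key])
        if other is None:
            diffs.append((rec[key], "missing in QC dataset"))
        elif rec != other:
            diffs.append((rec[key], "values differ"))
    return diffs

prod = [{"ID": "I001", "Age": 45}, {"ID": "I002", "Age": None}]
qc   = [{"ID": "I001", "Age": 45}, {"ID": "I002", "Age": -5}]  # QC missed a rule
print(compare_datasets(prod, qc, "ID"))  # [('I002', 'values differ')]
```

An empty diff is the QC sign-off; any mismatch means one of the two programs applied a rule the other did not.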
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic and domain understanding. For example, in clinical trials, a patient age of -5 is not just incorrect; it can distort safety analysis. Similarly, a missing visit date can affect time-to-event analysis.
Replacing missing categorical values with "UNKNOWN" ensures consistency without introducing bias. Numerical fields, however, require caution: imputing values incorrectly can mislead conclusions.
Consider salary normalization in business analytics: if one entry is in USD and another in INR without conversion, average salary calculations become meaningless.
Date imputation is another critical example. If a treatment start date is missing, analysts may use the first known visit date, but this must be justified and documented.
In our incident dataset, cleaning ensures:
- Correct grouping by country
- Accurate trend analysis by year
- Reliable pattern detection
Ultimately, data cleaning is about preserving truth while enabling analysis.
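The salary example above is easy to quantify: convert everything to one currency before aggregating, or the average is meaningless. A sketch with made-up records and an assumed exchange rate (not live FX data):

```python
# Hypothetical mixed-currency salary records and an illustrative FX rate
salaries = [("USD", 60000), ("INR", 2500000), ("USD", 80000)]
to_usd = {"USD": 1.0, "INR": 0.012}  # assumed rate, for demonstration only

# Naive average: mixes units, so the INR entry dominates
naive_avg = sum(amount for _, amount in salaries) / len(salaries)

# Correct average: normalize to USD first
normalized = [amount * to_usd[cur] for cur, amount in salaries]
real_avg = sum(normalized) / len(normalized)

print(round(naive_avg))  # wildly inflated by the unconverted INR entry
print(round(real_avg))   # a comparable USD figure
```

The two numbers differ by more than an order of magnitude, which is exactly the kind of silent corruption unit-aware cleaning prevents.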
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing data must be handled strategically
- Outliers can be errors or insights
- Text inconsistency breaks grouping
- Dates drive time-based analysis
- Validation is not optional
- Duplicate data inflates results
- Cleaning is iterative
- Domain knowledge is critical
- SAS ensures compliance
- R ensures flexibility
- Metadata is your backbone
- Audit trails are mandatory
- QC prevents disasters
- Automation improves efficiency
- Never trust raw data blindly
- Consistency beats complexity
- Cleaning improves model accuracy
- Good data = good decisions
9. Summary
Data cleaning is the backbone of reliable analytics, especially in regulated environments like clinical trials. In this blog, we explored a messy dataset representing "Most Surprising Incidents in the World" and demonstrated how errors (missing values, duplicates, inconsistent text, and invalid entries) can compromise analysis.
Using SAS, we leveraged structured approaches like the DATA step, PROC SORT, and validation procedures to ensure consistency and compliance. SAS excels in environments where auditability, traceability, and regulatory standards are critical.
In contrast, R provided a flexible and expressive approach using dplyr. Its pipeline-based transformations made cleaning intuitive and efficient, especially for exploratory analysis and rapid prototyping.
The key takeaway is not choosing SAS vs R but understanding where each fits:
- SAS for clinical, production-grade pipelines
- R for exploration and advanced transformations
We also covered best practices aligned with SDTM/ADaM standards, ensuring that cleaned data is not only accurate but also compliant.
Ultimately, clean data enables:
- Better decision-making
- Reliable reporting
- Regulatory approval
Without it, even the most advanced analytics fail.
10. Conclusion
Data cleaning is often underestimated, but in reality it is the most critical phase of any data pipeline. Whether you're working with clinical trial datasets, financial records, or global incident reports, the quality of your output is directly tied to the quality of your input.
Through this blog, we simulated a realistic messy dataset and demonstrated how structured cleaning approaches in SAS and R can transform chaos into clarity. The techniques we applied (handling missing values, validating ranges, standardizing text, and removing duplicates) are not just technical steps; they are essential practices that ensure data integrity and trustworthiness.
In clinical research, poor data quality can lead to incorrect conclusions about drug safety or efficacy. In business, it can result in flawed strategies and financial loss. This is why organizations invest heavily in data governance frameworks, validation processes, and audit trails.
SAS provides a robust, compliance-driven environment ideal for production and regulatory submissions. R complements it with flexibility and speed, making it ideal for exploratory and advanced analytics.
The real mastery lies in combining both, using the right tool for the right task.
As a data professional, your responsibility is not just to analyze data but to protect its integrity. Because at the end of the day, clean data is not just about accuracy; it's about credibility.
11. Interview Questions
Q1: How do you handle missing values in SAS?
A: Use COALESCEC for character variables and conditional logic for numeric fields. Always justify any imputation.
Q2: How do you remove duplicates?
A: PROC SORT DATA=dataset NODUPKEY; BY key_variable; RUN;
Q3: How do you validate date formats?
A: Use INPUT() with the proper informat and check the log for invalid conversions.
Q4: R scenario: a dataset has mixed-case country names.
A: Use toupper(trimws(x)) for standardization.
Q5: Debugging scenario: why is Age missing after cleaning?
A: Check the cleaning conditions; values outside the valid range may have been set to missing intentionally.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only; the "surprising incidents" data is entirely fictional.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics