462. Cleaning Surprising Incident Records the Right Way: Raw Data to Real Intelligence

Chaos, Curiosity & Clean Code: Turning “Most Surprising Incidents in the World” into Reliable Insights with SAS & R

1. Introduction

Imagine you're working as a SAS programmer on a global clinical trial. You receive a dataset of adverse events from multiple countries. Everything looks fine until analysis begins.

  • Patient ages include -5 and 230
  • Dates like 2023-15-40
  • Text fields contain “NULL”, “unknown”, “n/a”, blanks
  • Duplicate patient records appear with slight variations

Now imagine building a safety report on top of this.

That’s not analytics; that’s a disaster waiting to happen.

The same scenario applies to business datasets. Whether it's customer data, financial records, or, as in our case, a “Most Surprising Incidents in the World” dataset, messy data silently corrupts results.

This is where data cleaning becomes not just a step but a discipline.

Tools like SAS (strong in regulated environments like clinical trials) and R (flexible and powerful for transformation) play a crucial role in:

  • Ensuring data integrity
  • Supporting regulatory compliance
  • Delivering accurate insights

In this blog, we’ll simulate a messy real-world dataset, inject errors intentionally, and clean it step-by-step using both SAS and R.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset)

DATA incidents_raw;
   INFILE DATALINES MISSOVER;
   INPUT Incident_ID :$5. Country :$10. Incident_Type :$10. Year
         Age Description :$20. Impact_Score Date :$10.;
DATALINES;
I001 USA UFO 2020 45 strange_light 8 2020-05-10
I002 india ghost 2019 -5 haunted_house 7 2019-13-01
I003 UK mystery 2021 120 unknown . NULL
I004 USA UFO 2020 45 strange_light 8 2020-05-10
I005 Japan alien . 30 sighting 6 2021-07-15
I006 NULL ghost 2018 200 dark_shadow 9 2018-08-22
I007 Brazil mystery 2022 25 blank 5
I008 india ghost 2019 40 haunted_house 7 2019-11-01
I009 UK UFO 2021 35 strange_light 8 2021-06-30
I010 USA mystery 2023 29 unknown 6 2023-03-10
;
RUN;

PROC PRINT DATA=incidents_raw;
RUN;

OUTPUT:

Obs  Incident_ID  Country  Incident_Type  Year  Age  Description    Impact_Score  Date
  1  I001         USA      UFO            2020   45  strange_light             8  2020-05-10
  2  I002         india    ghost          2019   -5  haunted_house             7  2019-13-01
  3  I003         UK       mystery        2021  120  unknown                   .  NULL
  4  I004         USA      UFO            2020   45  strange_light             8  2020-05-10
  5  I005         Japan    alien             .   30  sighting                  6  2021-07-15
  6  I006         NULL     ghost          2018  200  dark_shadow               9  2018-08-22
  7  I007         Brazil   mystery        2022   25  blank                     5
  8  I008         india    ghost          2019   40  haunted_house             7  2019-11-01
  9  I009         UK       UFO            2021   35  strange_light             8  2021-06-30
 10  I010         USA      mystery        2023   29  unknown                   6  2023-03-10

Explanation

This dataset intentionally contains multiple real-world issues. We have:

  • Missing values (. and NULL)
  • Invalid age values (-5, 200)
  • Duplicate record (I001 & I004)
  • Inconsistent text (india vs India, NULL, blank)
  • Invalid date format (2019-13-01)
  • Missing year

These mimic real clinical trial issues, like incorrect patient demographics or adverse event dates. Using DATALINES simulates raw ingestion from external sources like CSV or CRF systems.

The : (colon) modifier tells SAS to read each value up to the next delimiter, capped at the informat's maximum length, whichever comes first.

This combines:

  • the flexibility of list input
  • the control of formatted input

Without it, a $20. informat would read a fixed 20 columns, so strange_light could swallow the 8 that follows. The modifier ensures clean separation, which is industry best practice when dealing with semi-structured data sources like flat files in clinical trials.
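To make the idea concrete outside SAS, here is a minimal, language-agnostic sketch in Python of what ':' list input does: split on the delimiter first, then cap each token at its declared width (the field names and widths mirror the INPUT statement above; the dictionary layout is an assumption for the example):

```python
# Sketch of ':' list input: tokenize on the delimiter, then truncate each
# token to the informat's maximum length. Widths mirror the INPUT statement.
fields = ["Incident_ID", "Country", "Incident_Type", "Year",
          "Age", "Description", "Impact_Score", "Date"]
widths = {"Incident_ID": 5, "Country": 10, "Incident_Type": 10,
          "Description": 20, "Date": 10}

line = "I001 USA UFO 2020 45 strange_light 8 2020-05-10"
tokens = line.split()  # reading stops at the delimiter, like ':' input
record = {name: (tok[:widths[name]] if name in widths else tok)
          for name, tok in zip(fields, tokens)}
print(record["Description"])  # strange_light -- the 8 never bleeds in
```

Because the split happens before the width cap, Description holds only strange_light and Impact_Score gets the 8, exactly the separation the ':' modifier guarantees.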

R Code – Equivalent Raw Dataset

incidents_raw <- data.frame(
  Incident_ID = c("I001","I002","I003","I004","I005","I006",
                  "I007","I008","I009","I010"),
  Country = c("USA","india","UK","USA","Japan","NULL","Brazil",
              "india","UK","USA"),
  Incident_Type = c("UFO","ghost","mystery","UFO","alien","ghost",
                    "mystery","ghost","UFO","mystery"),
  Year = c(2020,2019,2021,2020,NA,2018,2022,2019,2021,2023),
  Age = c(45,-5,120,45,30,200,25,40,35,29),
  Description = c("strange_light","haunted_house","unknown",
                  "strange_light","sighting","dark_shadow",
                  "blank","haunted_house","strange_light","unknown"),
  Impact_Score = c(8,7,NA,8,6,9,5,7,8,6),
  Date = c("2020-05-10","2019-13-01","NULL","2020-05-10",
           "2021-07-15","2018-08-22",NA,"2019-11-01","2021-06-30",
           "2023-03-10")
)

OUTPUT:

   Incident_ID Country Incident_Type Year Age   Description Impact_Score       Date
1         I001     USA           UFO 2020  45 strange_light            8 2020-05-10
2         I002   india         ghost 2019  -5 haunted_house            7 2019-13-01
3         I003      UK       mystery 2021 120       unknown           NA       NULL
4         I004     USA           UFO 2020  45 strange_light            8 2020-05-10
5         I005   Japan         alien   NA  30      sighting            6 2021-07-15
6         I006    NULL         ghost 2018 200   dark_shadow            9 2018-08-22
7         I007  Brazil       mystery 2022  25         blank            5       <NA>
8         I008   india         ghost 2019  40 haunted_house            7 2019-11-01
9         I009      UK           UFO 2021  35 strange_light            8 2021-06-30
10        I010     USA       mystery 2023  29       unknown            6 2023-03-10

Explanation

This R dataset mirrors SAS input. It includes:

  • NA values
  • Invalid dates stored as strings
  • Inconsistent text values
  • Duplicate entries

R treats everything flexibly, which is both powerful and dangerous. Without strict typing, errors propagate easily. This is why validation layers are essential.
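One way to add such a validation layer is to scan records and report rule violations before any value is coerced. A minimal sketch in Python (the function name validate and the rules themselves are illustrative assumptions mirroring this dataset, not a fixed API):

```python
# Minimal validation layer: report issues per record instead of silently
# coercing them. Rules here are illustrative, based on the incident dataset.
def validate(record):
    issues = []
    if record.get("Country") in (None, "", "NULL"):
        issues.append("Country missing or NULL")
    age = record.get("Age")
    if age is not None and not (0 <= age <= 100):
        issues.append(f"Age out of range: {age}")
    if record.get("Year") is None:
        issues.append("Year missing")
    return issues

print(validate({"Country": "NULL", "Age": -5, "Year": 2019}))
# ['Country missing or NULL', 'Age out of range: -5']
```

Running the validator over every row before cleaning produces an issue log, which doubles as the audit trail regulated environments expect.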

3. Phase 1: Data Cleaning in SAS

DATA incidents_clean;
   SET incidents_raw;

   /* Standardize country: treat NULL/blank as UNKNOWN, then uppercase */
   IF Country IN ("NULL", "", " ") THEN Country = "UNKNOWN";
   Country = UPCASE(STRIP(Country));
   Incident_Type = UPCASE(STRIP(Incident_Type));

   /* Range check: implausible ages become missing */
   IF Age < 0 OR Age > 100 THEN Age = .;

   /* Convert character dates; ?? suppresses notes for invalid values
      such as 2019-13-01, which simply become missing */
   IF Date NOT IN (".", "NULL", "") THEN
      Date_clean = INPUT(Date, ?? yymmdd10.);
   FORMAT Date_clean yymmdd10.;

   /* Derive a missing Year from the cleaned date */
   IF MISSING(Year) AND NOT MISSING(Date_clean) THEN
      Year = YEAR(Date_clean);
RUN;

PROC PRINT DATA=incidents_clean;
RUN;

OUTPUT:

Obs  Incident_ID  Country  Incident_Type  Year  Age  Description    Impact_Score  Date        Date_clean
  1  I001         USA      UFO            2020   45  strange_light             8  2020-05-10  2020-05-10
  2  I002         INDIA    GHOST          2019    .  haunted_house             7  2019-13-01           .
  3  I003         UK       MYSTERY        2021    .  unknown                   .  NULL                 .
  4  I004         USA      UFO            2020   45  strange_light             8  2020-05-10  2020-05-10
  5  I005         JAPAN    ALIEN          2021   30  sighting                  6  2021-07-15  2021-07-15
  6  I006         UNKNOWN  GHOST          2018    .  dark_shadow               9  2018-08-22  2018-08-22
  7  I007         BRAZIL   MYSTERY        2022   25  blank                     5                       .
  8  I008         INDIA    GHOST          2019   40  haunted_house             7  2019-11-01  2019-11-01
  9  I009         UK       UFO            2021   35  strange_light             8  2021-06-30  2021-06-30
 10  I010         USA      MYSTERY        2023   29  unknown                   6  2023-03-10  2023-03-10

PROC SORT DATA=incidents_clean NODUPKEY;
   BY Country Incident_Type Year Age;
RUN;

PROC PRINT DATA=incidents_clean;
RUN;

OUTPUT:

Obs  Incident_ID  Country  Incident_Type  Year  Age  Description    Impact_Score  Date        Date_clean
  1  I007         BRAZIL   MYSTERY        2022   25  blank                     5                       .
  2  I002         INDIA    GHOST          2019    .  haunted_house             7  2019-13-01           .
  3  I008         INDIA    GHOST          2019   40  haunted_house             7  2019-11-01  2019-11-01
  4  I005         JAPAN    ALIEN          2021   30  sighting                  6  2021-07-15  2021-07-15
  5  I003         UK       MYSTERY        2021    .  unknown                   .  NULL                 .
  6  I009         UK       UFO            2021   35  strange_light             8  2021-06-30  2021-06-30
  7  I006         UNKNOWN  GHOST          2018    .  dark_shadow               9  2018-08-22  2018-08-22
  8  I010         USA      MYSTERY        2023   29  unknown                   6  2023-03-10  2023-03-10
  9  I001         USA      UFO            2020   45  strange_light             8  2020-05-10  2020-05-10

PROC SORT DATA=incidents_clean NODUPKEY;
   BY Incident_ID;
RUN;

PROC PRINT DATA=incidents_clean;
RUN;

OUTPUT:

Obs  Incident_ID  Country  Incident_Type  Year  Age  Description    Impact_Score  Date        Date_clean
  1  I001         USA      UFO            2020   45  strange_light             8  2020-05-10  2020-05-10
  2  I002         INDIA    GHOST          2019    .  haunted_house             7  2019-13-01           .
  3  I003         UK       MYSTERY        2021    .  unknown                   .  NULL                 .
  4  I005         JAPAN    ALIEN          2021   30  sighting                  6  2021-07-15  2021-07-15
  5  I006         UNKNOWN  GHOST          2018    .  dark_shadow               9  2018-08-22  2018-08-22
  6  I007         BRAZIL   MYSTERY        2022   25  blank                     5                       .
  7  I008         INDIA    GHOST          2019   40  haunted_house             7  2019-11-01  2019-11-01
  8  I009         UK       UFO            2021   35  strange_light             8  2021-06-30  2021-06-30
  9  I010         USA      MYSTERY        2023   29  unknown                   6  2023-03-10  2023-03-10

Explanation

This SAS step demonstrates core data cleaning techniques:

  • Conditional IF logic recodes “NULL” and blank countries to UNKNOWN (COALESCEC is an alternative for character variables)
  • UPCASE + STRIP standardizes text
  • The age range check sets unrealistic values to missing
  • INPUT() converts character dates into SAS date values
  • A missing Year is derived from the cleaned date
  • PROC SORT NODUPKEY removes duplicates

In clinical trials, such logic ensures:

  • Consistency across SDTM domains
  • Accurate patient-level analysis
  • Compliance with regulatory expectations
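The same cleaning rules can be expressed in any language. As a hedged illustration, here is a minimal Python sketch of the step above (the function name clean_record and the dictionary layout are assumptions for the example, not part of the SAS program):

```python
from datetime import datetime

def clean_record(rec):
    # Recode NULL/blank country to UNKNOWN, then uppercase and strip
    country = (rec.get("Country") or "").strip()
    if country.upper() in ("", "NULL"):
        country = "UNKNOWN"
    rec["Country"] = country.upper()

    # Range check: implausible ages become missing (None)
    age = rec.get("Age")
    if age is not None and (age < 0 or age > 100):
        rec["Age"] = None

    # Invalid date strings become missing instead of raising an error,
    # mirroring the ?? modifier on INPUT()
    try:
        rec["Date_clean"] = datetime.strptime(rec.get("Date") or "",
                                              "%Y-%m-%d").date()
    except ValueError:
        rec["Date_clean"] = None

    # Derive a missing Year from the cleaned date
    if rec.get("Year") is None and rec["Date_clean"] is not None:
        rec["Year"] = rec["Date_clean"].year
    return rec
```

For a row like I006 (Country NULL, Age 200, valid date), this yields UNKNOWN, a missing age, and a parsed date, matching the SAS output above.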

4. Phase 2: Data Cleaning in R

library(dplyr)

incidents_clean <- incidents_raw %>%
  mutate(
    Country = toupper(trimws(ifelse(Country %in% c("NULL", NA), "UNKNOWN", Country))),
    Incident_Type = toupper(trimws(Incident_Type)),
    Age = ifelse(Age < 0 | Age > 100, NA, Age),
    Date = as.Date(Date, format = "%Y-%m-%d")
  ) %>%
  distinct(Country, Incident_Type, Year, Age, .keep_all = TRUE)

OUTPUT:

  Incident_ID Country Incident_Type Year Age   Description Impact_Score       Date
1        I001     USA           UFO 2020  45 strange_light            8 2020-05-10
2        I002   INDIA         GHOST 2019  NA haunted_house            7       <NA>
3        I003      UK       MYSTERY 2021  NA       unknown           NA       <NA>
4        I005   JAPAN         ALIEN   NA  30      sighting            6 2021-07-15
5        I006 UNKNOWN         GHOST 2018  NA   dark_shadow            9 2018-08-22
6        I007  BRAZIL       MYSTERY 2022  25         blank            5       <NA>
7        I008   INDIA         GHOST 2019  40 haunted_house            7 2019-11-01
8        I009      UK           UFO 2021  35 strange_light            8 2021-06-30
9        I010     USA       MYSTERY 2023  29       unknown            6 2023-03-10

Explanation

This R pipeline uses dplyr for transformation:

  • mutate() handles transformations
  • toupper + trimws standardizes text
  • Conditional logic cleans Age
  • as.Date() converts date strings
  • distinct() removes duplicates

R is highly expressive, making transformations readable. However, it lacks SAS-level strictness unless explicitly enforced.
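distinct(..., .keep_all = TRUE) keeps the first row for each key combination, much as PROC SORT NODUPKEY does (with the difference that NODUPKEY keeps the first row in sorted order, while distinct() keeps the first appearance). A small Python sketch of the underlying idea (distinct_on is a hypothetical helper, not a library function):

```python
# Order-preserving key-based deduplication: the first record per key wins,
# analogous to dplyr::distinct(..., .keep_all = TRUE).
def distinct_on(records, keys):
    seen, out = set(), []
    for rec in records:
        k = tuple(rec[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(rec)
    return out

rows = [{"Incident_ID": "I001", "Country": "USA", "Year": 2020},
        {"Incident_ID": "I004", "Country": "USA", "Year": 2020},
        {"Incident_ID": "I005", "Country": "JAPAN", "Year": None}]
kept = distinct_on(rows, ["Country", "Year"])
print([r["Incident_ID"] for r in kept])  # ['I001', 'I005'] -- I004 dropped
```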

5. Phase 3: Additional SAS Processing

PROC FREQ DATA=incidents_clean;
   TABLES Country Incident_Type;
RUN;

OUTPUT:

The FREQ Procedure

Country    Frequency   Percent   Cumulative Frequency   Cumulative Percent
BRAZIL             1     11.11                      1                11.11
INDIA              2     22.22                      3                33.33
JAPAN              1     11.11                      4                44.44
UK                 2     22.22                      6                66.67
UNKNOWN            1     11.11                      7                77.78
USA                2     22.22                      9               100.00

Incident_Type   Frequency   Percent   Cumulative Frequency   Cumulative Percent
ALIEN                   1     11.11                      1                11.11
GHOST                   3     33.33                      4                44.44
MYSTERY                 3     33.33                      7                77.78
UFO                     2     22.22                      9               100.00

PROC MEANS DATA=incidents_clean;
   VAR Age Impact_Score;
RUN;

OUTPUT:

The MEANS Procedure

Variable       N        Mean     Std Dev     Minimum     Maximum
Age            6  34.0000000   7.4833148  25.0000000  45.0000000
Impact_Score   8   7.0000000   1.3093073   5.0000000   9.0000000

Explanation

These steps validate cleaned data:

  • PROC FREQ checks distribution consistency
  • PROC MEANS identifies anomalies

In real-world projects, this is part of QC validation, ensuring cleaned data behaves logically before analysis.
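A lightweight QC pass can mirror PROC FREQ and PROC MEANS in a few lines. The sketch below (qc_report is a hypothetical helper) computes frequencies for one categorical column and N/mean/min/max for one numeric column, excluding missing values just as PROC MEANS does:

```python
from collections import Counter
from statistics import mean

# QC sketch: frequency table for a categorical column, summary statistics
# for a numeric column (missing values excluded from N and the mean).
def qc_report(records, cat_col, num_col):
    freq = Counter(r[cat_col] for r in records)
    vals = [r[num_col] for r in records if r[num_col] is not None]
    return freq, {"N": len(vals), "Mean": mean(vals),
                  "Min": min(vals), "Max": max(vals)}

rows = [{"Country": "USA", "Age": 45}, {"Country": "USA", "Age": 29},
        {"Country": "UK", "Age": None}]
freq, stats = qc_report(rows, "Country", "Age")
print(freq["USA"], stats["N"], stats["Mean"])  # 2 2 37
```

Comparing such summaries before and after cleaning is a quick sanity check that the transformations behaved as intended.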

6. 20 Additional Data Cleaning Best Practices

  1. Always validate raw data against protocol
  2. Maintain audit trail for all transformations
  3. Use SDTM standards for structure
  4. Never overwrite raw datasets
  5. Use controlled terminology
  6. Validate date formats rigorously
  7. Perform double programming (QC)
  8. Flag outliers before removing
  9. Maintain metadata documentation
  10. Use macros for reusable cleaning logic
  11. Apply derivation rules consistently
  12. Validate key identifiers (USUBJID)
  13. Avoid hardcoding values
  14. Log all data issues
  15. Perform cross-domain validation
  16. Ensure reproducibility
  17. Validate merge logic
  18. Use PROC COMPARE for QC
  19. Document assumptions
  20. Follow regulatory guidelines (FDA/CDISC)

7. Business Logic Behind Data Cleaning

Data cleaning is not arbitrary; it is driven by business logic and domain understanding. For example, in clinical trials, a patient age of -5 is not just incorrect: it can distort safety analysis. Similarly, a missing visit date can affect time-to-event analysis.

Replacing missing categorical values with “UNKNOWN” ensures consistency without introducing bias. Numerical fields, however, require caution: imputing values incorrectly can mislead conclusions.

Consider salary normalization in business analytics: if one entry is in USD and another in INR without conversion, average salary calculations become meaningless.

Date imputation is another critical example. If a treatment start date is missing, analysts may use the first known visit date, but this must be justified and documented.

In our incident dataset, cleaning ensures:

  • Correct grouping by country
  • Accurate trend analysis by year
  • Reliable pattern detection

Ultimately, data cleaning is about preserving truth while enabling analysis.
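The date-imputation rule described above can be made explicit and auditable. A hedged sketch (impute_start_date is a hypothetical helper; the fallback-to-earliest-visit rule is the example from the text, not a universal standard):

```python
from datetime import date

# Hypothetical imputation rule: when a treatment start date is missing,
# fall back to the earliest known visit date, and return a flag so the
# imputation is documented rather than silent.
def impute_start_date(start_date, visit_dates):
    if start_date is not None:
        return start_date, False      # observed value, no imputation
    if not visit_dates:
        return None, False            # nothing to impute from
    return min(visit_dates), True     # imputed, flagged for the audit trail

imputed, flag = impute_start_date(None, [date(2021, 7, 15), date(2021, 6, 30)])
print(imputed, flag)  # 2021-06-30 True
```

Carrying the flag through to the analysis dataset is what makes the imputation justifiable later.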

8. 20 Key Points

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing data must be handled strategically
  • Outliers can be errors or insights
  • Text inconsistency breaks grouping
  • Dates drive time-based analysis
  • Validation is not optional
  • Duplicate data inflates results
  • Cleaning is iterative
  • Domain knowledge is critical
  • SAS ensures compliance
  • R ensures flexibility
  • Metadata is your backbone
  • Audit trails are mandatory
  • QC prevents disasters
  • Automation improves efficiency
  • Never trust raw data blindly
  • Consistency beats complexity
  • Cleaning improves model accuracy
  • Good data = good decisions

9. Summary

Data cleaning is the backbone of reliable analytics, especially in regulated environments like clinical trials. In this blog, we explored a messy dataset representing “Most Surprising Incidents in the World” and demonstrated how errors such as missing values, duplicates, inconsistent text, and invalid entries can compromise analysis.

Using SAS, we leveraged structured approaches like DATA step, PROC SORT, and validation procedures to ensure consistency and compliance. SAS excels in environments where auditability, traceability, and regulatory standards are critical.

In contrast, R provided a flexible and expressive approach using dplyr. Its pipeline-based transformations made cleaning intuitive and efficient, especially for exploratory analysis and rapid prototyping.

The key takeaway is not choosing SAS vs R but understanding where each fits:

  • SAS for clinical, production-grade pipelines
  • R for exploration and advanced transformations

We also covered best practices aligned with SDTM/ADaM standards, ensuring that cleaned data is not only accurate but also compliant.

Ultimately, clean data enables:

  • Better decision-making
  • Reliable reporting
  • Regulatory approval

Without it, even the most advanced analytics fail.

10. Conclusion

Data cleaning is often underestimated, but in reality it is the most critical phase of any data pipeline. Whether you're working with clinical trial datasets, financial records, or global incident reports, the quality of your output is directly tied to the quality of your input.

Through this blog, we simulated a realistic messy dataset and demonstrated how structured cleaning approaches in SAS and R can transform chaos into clarity. The techniques we applied (handling missing values, validating ranges, standardizing text, and removing duplicates) are not just technical steps; they are essential practices that ensure data integrity and trustworthiness.

In clinical research, poor data quality can lead to incorrect conclusions about drug safety or efficacy. In business, it can result in flawed strategies and financial loss. This is why organizations invest heavily in data governance frameworks, validation processes, and audit trails.

SAS provides a robust, compliance-driven environment ideal for production and regulatory submissions. R complements it with flexibility and speed, making it ideal for exploratory and advanced analytics.

The real mastery lies in combining both using the right tool for the right task.

As a data professional, your responsibility is not just to analyze data but to protect its integrity. Because at the end of the day, clean data is not just about accuracy; it's about credibility.

11. Interview Questions

Q1: How do you handle missing values in SAS?
A: Use COALESCEC for character variables and conditional logic for numeric fields. Always justify imputation.

Q2: How do you remove duplicates?
A: PROC SORT NODUPKEY BY key_variable;

Q3: How do you validate date formats?
A: Use INPUT() with proper informat and check for invalid conversions.

Q4: R Scenario: A dataset has mixed-case country names.
A: Use toupper(trimws(x)) for standardization.

Q5: Debugging Scenario: Why is Age missing after cleaning?
A: Check the cleaning conditions; values outside the valid range may have been set to missing intentionally.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SURPRISING INCIDENTS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


