451.Corrupted Data, Dangerous Decisions: Cleaning Global Blast Records with SAS Techniques
Shocking Data Chaos in Global Bomb Blast Records: Mastering the COMPRESS Function in SAS for Precision Data Cleaning
1. Introduction
Imagine
you are working as a clinical SAS programmer or data analyst in a global
security analytics firm. You receive a dataset titled “Major Bomb Blasts
Around the World.” The dataset is supposed to help governments predict
high-risk zones, but once you open it chaos.
Dates
look like “32JAN2020”, locations are written as “new york”, “NEW
YORK”, “Ny”, casualties are negative, and some values literally say “NULL”.
Duplicate records? Plenty. Missing values? Everywhere.
Now pause
and thinknwhat happens if this data feeds into a predictive model?
👉 Governments may allocate resources incorrectly
👉 Risk assessments become flawed
👉 Policy decisions fail
This is
exactly why data cleaning is not optional it’s foundational.
In
industries like clinical trials (SDTM/ADaM), finance, or global risk analytics,
bad data destroys trust.
That’s
where SAS and R become powerful allies:
- SAS → Highly structured,
regulatory-grade cleaning
- R → Flexible, fast,
exploratory transformations
In this
blog, we will simulate a messy bomb blast dataset, intentionally
introduce errors, and then clean it step-by-step using:
- SAS (including COMPRESS
function)
- R (tidyverse approach)
2. Raw Data Creation in SAS and R
SAS Raw Dataset (With Intentional Errors)
DATA blast_raw;
INFILE DATALINES DLM='|' DSD MISSOVER;
INPUT ID Country $ City $ Date:$10.
Casualties Injured Weapon:$10. Group $;
DATALINES;
1|USA|NewYork|12JAN2020|50|100|Bomb|GroupA
2|india|delhi|32JAN2020|-10|50|NULL|GroupB
3|UK|London|05FEB2021|30|.|Bomb|GroupA
4|USA|NEWYORK|12JAN2020|50|100|Bomb|GroupA
5|France|Paris|.|20|40|Explosive|NULL
6|India|Delhi|15MAR2022|25|60|Bomb|groupb
7|USA|Ny|10APR2023|100|200|Bomb|GroupC
8|NULL|Berlin|20MAY2021|45|80|Bomb|GroupD
9|Japan|Tokyo|11JUN2020|-5|30|Bomb|GroupE
10|India|Mumbai|01JUL2022|60|120|Bomb|GroupF
11|USA|New York|10APR2023|100|200|Bomb|GroupC
12|UK|London|05FEB2021|30|.|Bomb|GroupA
;
RUN;
PROC PRINT DATA=blast_raw;
RUN;
OUTPUT:
| Obs | ID | Country | City | Date | Casualties | Injured | Weapon | Group |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | NewYork | 12JAN2020 | 50 | 100 | Bomb | GroupA |
| 2 | 2 | india | delhi | 32JAN2020 | -10 | 50 | NULL | GroupB |
| 3 | 3 | UK | London | 05FEB2021 | 30 | . | Bomb | GroupA |
| 4 | 4 | USA | NEWYORK | 12JAN2020 | 50 | 100 | Bomb | GroupA |
| 5 | 5 | France | Paris | 20 | 40 | Explosive | NULL | |
| 6 | 6 | India | Delhi | 15MAR2022 | 25 | 60 | Bomb | groupb |
| 7 | 7 | USA | Ny | 10APR2023 | 100 | 200 | Bomb | GroupC |
| 8 | 8 | NULL | Berlin | 20MAY2021 | 45 | 80 | Bomb | GroupD |
| 9 | 9 | Japan | Tokyo | 11JUN2020 | -5 | 30 | Bomb | GroupE |
| 10 | 10 | India | Mumbai | 01JUL2022 | 60 | 120 | Bomb | GroupF |
| 11 | 11 | USA | New York | 10APR2023 | 100 | 200 | Bomb | GroupC |
| 12 | 12 | UK | London | 05FEB2021 | 30 | . | Bomb | GroupA |
Explanation
This
dataset intentionally contains real-world data quality issues:
- Invalid dates (32JAN2020)
- Negative values (Casualties = -10)
- Missing values (.)
- Duplicates (records 1 & 4)
- Inconsistent text (“india”, “India”, “NULL”)
- Extra spaces (“New York” vs “NewYork”)
The issue
arises because SAS list input uses
space as delimiter, so multi-word values like “New York” break
variable alignment. This causes downstream variables (Date, Casualties, etc.)
to shift incorrectly. The best fix is either:
- Standardize values (e.g.,
“NewYork”)
- Use a custom delimiter (|) with
DLM= and DSD
MISSOVER
ensures SAS does not move to the next line when data is missing. In real-world
clinical or business datasets, using proper delimiters is critical to avoid
data corruption during ingestion.
This
mimics real-world raw datasets from unstructured sources. The goal is to
simulate how messy operational or clinical datasets arrive before SDTM/ADaM
transformation. Understanding these errors helps build strong debugging and
cleaning strategies.
R Raw Dataset
blast_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10,11,12),
Country = c("USA","india","UK","USA","France","India","USA","NULL","Japan",
"India","USA","UK"),
City = c("NewYork","delhi","London","NEWYORK","Paris","","Ny","Berlin","Tokyo",
"Mumbai","New York","London"),
Date = c("12JAN2020","32JAN2020","05FEB2021","12JAN2020",NA,"15MAR2022",
"10APR2023","20MAY2021","11JUN2020","01JUL2022","10APR2023","05FEB2021"),
Casualties = c(50,-10,30,50,20,25,100,45,-5,60,100,30),
Injured = c(100,50,NA,100,40,60,200,80,30,120,200,NA),
Weapon = c("Bomb","NULL","Bomb","Bomb","Explosive","Bomb","Bomb","Bomb",
"Bomb","Bomb","Bomb","Bomb"),
Group = c("GroupA","GroupB","GroupA","GroupA","NULL","groupb","GroupC",
"GroupD","GroupE","GroupF","GroupC","GroupA")
)
OUTPUT:
|
|
ID |
Country |
City |
Date |
Casualties |
Injured |
Weapon |
Group |
|
1 |
1 |
USA |
NewYork |
12-Jan-20 |
50 |
100 |
Bomb |
GroupA |
|
2 |
2 |
india |
delhi |
32JAN2020 |
-10 |
50 |
NULL |
GroupB |
|
3 |
3 |
UK |
London |
05-Feb-21 |
30 |
NA |
Bomb |
GroupA |
|
4 |
4 |
USA |
NEWYORK |
12-Jan-20 |
50 |
100 |
Bomb |
GroupA |
|
5 |
5 |
France |
Paris |
NA |
20 |
40 |
Explosive |
NULL |
|
6 |
6 |
India |
|
15-Mar-22 |
25 |
60 |
Bomb |
groupb |
|
7 |
7 |
USA |
Ny |
10-Apr-23 |
100 |
200 |
Bomb |
GroupC |
|
8 |
8 |
NULL |
Berlin |
20-May-21 |
45 |
80 |
Bomb |
GroupD |
|
9 |
9 |
Japan |
Tokyo |
11-Jun-20 |
-5 |
30 |
Bomb |
GroupE |
|
10 |
10 |
India |
Mumbai |
01-Jul-22 |
60 |
120 |
Bomb |
GroupF |
|
11 |
11 |
USA |
New York |
10-Apr-23 |
100 |
200 |
Bomb |
GroupC |
|
12 |
12 |
UK |
London |
05-Feb-21 |
30 |
NA |
Bomb |
GroupA |
Explanation
This R
dataset mirrors the SAS dataset, ensuring cross-platform consistency. It
includes:
- NA values
- Case inconsistencies
- Invalid numerical values
- Text anomalies
Creating
identical datasets in SAS and R allows analysts to validate transformations
across platforms a critical skill in regulated industries. It also enables
cross-verification during QC (Quality Control).
3. Phase 1: Data Cleaning in SAS
DATA blast_clean;
SET blast_raw;
/* Handle missing and NULL values */
IF Country = "NULL" THEN Country = "UNKNOWN";
ELSE Country = UPCASE(Country);
City = PROPCASE(COMPRESS(City));
Weapon = UPCASE(COALESCEC(Weapon, "UNKNOWN"));
Group = UPCASE(COMPRESS(Group));
/* Fix invalid numeric values */
if Casualties < 0 then Casualties = ABS(Casualties);
/* Fix missing injured */
if Injured = . then Injured = 0;
/* Fix dates */
Date_clean = INPUT(Date, ?? DATE9.);
FORMAT Date_clean DATE9.;
RUN;
PROC PRINT DATA=blast_clean;
RUN;
OUTPUT:
| Obs | ID | Country | City | Date | Casualties | Injured | Weapon | Group | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | Newyork | 12JAN2020 | 50 | 100 | BOMB | GROUPA | 12JAN2020 |
| 2 | 2 | INDIA | Delhi | 32JAN2020 | 10 | 50 | NULL | GROUPB | . |
| 3 | 3 | UK | London | 05FEB2021 | 30 | 0 | BOMB | GROUPA | 05FEB2021 |
| 4 | 4 | USA | Newyork | 12JAN2020 | 50 | 100 | BOMB | GROUPA | 12JAN2020 |
| 5 | 5 | FRANCE | Paris | 20 | 40 | EXPLOSIVE | NULL | . | |
| 6 | 6 | INDIA | Delhi | 15MAR2022 | 25 | 60 | BOMB | GROUPB | 15MAR2022 |
| 7 | 7 | USA | Ny | 10APR2023 | 100 | 200 | BOMB | GROUPC | 10APR2023 |
| 8 | 8 | UNKNOWN | Berlin | 20MAY2021 | 45 | 80 | BOMB | GROUPD | 20MAY2021 |
| 9 | 9 | JAPAN | Tokyo | 11JUN2020 | 5 | 30 | BOMB | GROUPE | 11JUN2020 |
| 10 | 10 | INDIA | Mumbai | 01JUL2022 | 60 | 120 | BOMB | GROUPF | 01JUL2022 |
| 11 | 11 | USA | Newyork | 10APR2023 | 100 | 200 | BOMB | GROUPC | 10APR2023 |
| 12 | 12 | UK | London | 05FEB2021 | 30 | 0 | BOMB | GROUPA | 05FEB2021 |
/* Remove duplicates */
PROC SORT DATA=blast_clean NODUPKEY;
BY ID Country City Date_clean;
RUN;
PROC PRINT DATA=blast_clean;
RUN;
OUTPUT:
| Obs | ID | Country | City | Date | Casualties | Injured | Weapon | Group | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | Newyork | 12JAN2020 | 50 | 100 | BOMB | GROUPA | 12JAN2020 |
| 2 | 2 | INDIA | Delhi | 32JAN2020 | 10 | 50 | NULL | GROUPB | . |
| 3 | 3 | UK | London | 05FEB2021 | 30 | 0 | BOMB | GROUPA | 05FEB2021 |
| 4 | 4 | USA | Newyork | 12JAN2020 | 50 | 100 | BOMB | GROUPA | 12JAN2020 |
| 5 | 5 | FRANCE | Paris | 20 | 40 | EXPLOSIVE | NULL | . | |
| 6 | 6 | INDIA | Delhi | 15MAR2022 | 25 | 60 | BOMB | GROUPB | 15MAR2022 |
| 7 | 7 | USA | Ny | 10APR2023 | 100 | 200 | BOMB | GROUPC | 10APR2023 |
| 8 | 8 | UNKNOWN | Berlin | 20MAY2021 | 45 | 80 | BOMB | GROUPD | 20MAY2021 |
| 9 | 9 | JAPAN | Tokyo | 11JUN2020 | 5 | 30 | BOMB | GROUPE | 11JUN2020 |
| 10 | 10 | INDIA | Mumbai | 01JUL2022 | 60 | 120 | BOMB | GROUPF | 01JUL2022 |
| 11 | 11 | USA | Newyork | 10APR2023 | 100 | 200 | BOMB | GROUPC | 10APR2023 |
| 12 | 12 | UK | London | 05FEB2021 | 30 | 0 | BOMB | GROUPA | 05FEB2021 |
Explanation
This
phase demonstrates structured SAS cleaning:
- COALESCEC replaces missing character
values
- COMPRESS removes unwanted spaces or
characters → critical for cleaning messy text like "New York"
- UPCASE/PROPCASE standardize text
- ABS() fixes negative values
- INPUT + FORMAT converts character dates
into SAS dates
- PROC SORT NODUPKEY removes duplicates
The COMPRESS
function is key here it eliminates unwanted spaces or special characters,
ensuring consistency in categorical variables. This is especially important in
SDTM where controlled terminology must be exact.
4. Phase 2: Data Cleaning in R
library(dplyr)
blast_clean <- blast_raw %>%
mutate(
Country = toupper(ifelse(Country == "NULL" | is.na(Country),
"UNKNOWN", Country)),
City = tools::toTitleCase(trimws(City)),
Weapon = toupper(trimws(Weapon)),
Group = toupper(trimws(Group)),
Casualties = ifelse(Casualties < 0, abs(Casualties),
Casualties),
Injured = ifelse(is.na(Injured), 0, Injured)
) %>%
distinct()
OUTPUT:
|
|
ID |
Country |
City |
Date |
Casualties |
Injured |
Weapon |
Group |
|
1 |
1 |
USA |
NewYork |
12-Jan-20 |
50 |
100 |
BOMB |
GROUPA |
|
2 |
2 |
INDIA |
Delhi |
32JAN2020 |
10 |
50 |
NULL |
GROUPB |
|
3 |
3 |
UK |
London |
05-Feb-21 |
30 |
0 |
BOMB |
GROUPA |
|
4 |
4 |
USA |
NEWYORK |
12-Jan-20 |
50 |
100 |
BOMB |
GROUPA |
|
5 |
5 |
FRANCE |
Paris |
NA |
20 |
40 |
EXPLOSIVE |
NULL |
|
6 |
6 |
INDIA |
|
15-Mar-22 |
25 |
60 |
BOMB |
GROUPB |
|
7 |
7 |
USA |
Ny |
10-Apr-23 |
100 |
200 |
BOMB |
GROUPC |
|
8 |
8 |
UNKNOWN |
Berlin |
20-May-21 |
45 |
80 |
BOMB |
GROUPD |
|
9 |
9 |
JAPAN |
Tokyo |
11-Jun-20 |
5 |
30 |
BOMB |
GROUPE |
|
10 |
10 |
INDIA |
Mumbai |
01-Jul-22 |
60 |
120 |
BOMB |
GROUPF |
|
11 |
11 |
USA |
New York |
10-Apr-23 |
100 |
200 |
BOMB |
GROUPC |
|
12 |
12 |
UK |
London |
05-Feb-21 |
30 |
0 |
BOMB |
GROUPA |
Explanation
In R:
- mutate() transforms variables
- ifelse() handles missing and invalid
values
- toupper() / trimws() standardize text
- distinct() removes duplicates
Unlike
SAS, R is more flexible and concise. However, it requires discipline to
maintain reproducibility. Functions like trimws() act similarly to SAS COMPRESS
but are less powerful for pattern-based cleaning. R excels in quick transformations
but lacks SAS’s built-in regulatory traceability.
5. Phase 3: Additional SAS Processing Using Raw Data
DATA blast_enhanced;
SET blast_raw;
/* Use COMPRESS to remove special characters */
City_clean = COMPRESS(UPCASE(City), , 'ka');
/* Create flag variables */
if Casualties > 80 then High_Impact = "YES";
else High_Impact = "NO";
/* Length standardization */
LENGTH Country_std $15;
Country_std = UPCASE(COMPRESS(Country));
RUN;
PROC PRINT DATA=blast_enhanced;
RUN;
OUTPUT:
| Obs | ID | Country | City | Date | Casualties | Injured | Weapon | Group | City_clean | High_Impact | Country_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | NewYork | 12JAN2020 | 50 | 100 | Bomb | GroupA | NEWYORK | NO | USA |
| 2 | 2 | india | delhi | 32JAN2020 | -10 | 50 | NULL | GroupB | DELHI | NO | INDIA |
| 3 | 3 | UK | London | 05FEB2021 | 30 | . | Bomb | GroupA | LONDON | NO | UK |
| 4 | 4 | USA | NEWYORK | 12JAN2020 | 50 | 100 | Bomb | GroupA | NEWYORK | NO | USA |
| 5 | 5 | France | Paris | 20 | 40 | Explosive | NULL | PARIS | NO | FRANCE | |
| 6 | 6 | India | Delhi | 15MAR2022 | 25 | 60 | Bomb | groupb | DELHI | NO | INDIA |
| 7 | 7 | USA | Ny | 10APR2023 | 100 | 200 | Bomb | GroupC | NY | YES | USA |
| 8 | 8 | NULL | Berlin | 20MAY2021 | 45 | 80 | Bomb | GroupD | BERLIN | NO | NULL |
| 9 | 9 | Japan | Tokyo | 11JUN2020 | -5 | 30 | Bomb | GroupE | TOKYO | NO | JAPAN |
| 10 | 10 | India | Mumbai | 01JUL2022 | 60 | 120 | Bomb | GroupF | MUMBAI | NO | INDIA |
| 11 | 11 | USA | New York | 10APR2023 | 100 | 200 | Bomb | GroupC | NEWYORK | YES | USA |
| 12 | 12 | UK | London | 05FEB2021 | 30 | . | Bomb | GroupA | LONDON | NO | UK |
Explanation
This
phase deepens COMPRESS usage:
- COMPRESS(UPCASE(City), ,
'ka') keeps only alphabetic characters
- Useful for removing noise
like symbols, numbers
- Helps in building
standardized keys
Additional
logic:
- Flagging high-impact events
- Standardizing variable
lengths
This
reflects real-world ADaM dataset derivations, where variables like flags (e.g.,
AVALC, ANL01FL) are created.
6. 20 Additional Data Cleaning Best Practices
- Always validate against
source data (SDTM vs ADaM)
- Maintain audit trails for
all transformations
- Use controlled terminology
(CDISC standards)
- Never overwrite raw datasets
- Validate date formats
rigorously
- Handle missing values with
business logic
- Check for duplicates across
keys
- Standardize categorical
variables
- Use macros for reusable
cleaning logic
- Document all assumptions
- Perform QC using independent
programming
- Use PROC COMPARE for
validation
- Validate ranges (age, lab
values)
- Track derivation logic
clearly
- Avoid hardcoding values
- Use metadata-driven
programming
- Ensure reproducibility
- Align with SAP
specifications
- Validate joins and merges
- Always test edge cases
7. Business Logic Behind Data Cleaning
Data
cleaning is not random it’s logic-driven.
- Missing values → Replace with defaults or
impute
Example: Missing injured → set to 0 - Negative values → Not realistic
Example: Casualties cannot be -10 - Text inconsistency → Impacts grouping
Example: “India” vs “india” - Date correction → Essential for timeline
analysis
In
clinical trials:
- Incorrect age → invalid
subject eligibility
- Missing visit date → affects
endpoint analysis
8. 20 Key Points
- Dirty data leads to wrong
conclusions.
- COMPRESS is powerful for
string cleaning.
- Standardization ensures
reproducibility.
- Missing values must follow
business logic.
- Negative values require validation.
- Duplicate data inflates
results.
- Dates must be consistent.
- Text inconsistencies break
grouping.
- SAS ensures regulatory
compliance.
- R enables flexible
transformations.
- Always validate after
cleaning.
- Never trust raw data
blindly.
- Cleaning improves model
accuracy.
- Documentation is critical.
- QC is mandatory.
- Use functions, not manual
fixes.
- Controlled terminology
matters.
- Automation reduces errors.
- Clean data builds trust.
- Data cleaning is data
science.
9. Summary
|
Aspect |
SAS |
R |
|
Structure |
Highly
structured |
Flexible |
|
Regulatory |
Strong |
Moderate |
|
Cleaning
Functions |
COMPRESS,
COALESCEC |
trimws,
mutate |
|
Reproducibility |
High |
Depends
on coding |
👉 SAS is ideal for clinical and regulatory work
👉 R is ideal for exploration and rapid analysis
10. Conclusion
Data
cleaning is not just a preprocessing step it is the foundation of reliable
analytics.
Using
functions like COMPRESS in SAS, we can eliminate inconsistencies and
bring structure to chaos. Whether in clinical trials, global risk analysis, or
business intelligence, clean data drives accurate decisions.
If your
data is messy, your insights are misleading.
If your data is clean, your decisions are powerful.
11. Interview Questions
Q1: Why use COMPRESS in SAS?
👉 Removes unwanted characters → ensures clean,
standardized strings.
Q2: How do you handle negative values in a dataset?
👉 Use conditions:
if value < 0 then value =
abs(value);
Q3: How do you remove duplicates in SAS?
👉 PROC SORT NODUPKEY:
PROC SORT DATA=data NODUPKEY;
BY variables;
RUN;
Q4: R equivalent of SAS COMPRESS?
👉 gsub() or trimws() depending on requirement.
Q5: If patient age = -5 → correct
using ABS or flag for review?(According To Real World Scenario).
if patient age < 0 then patient
age = abs(patient age);
Answer:
It is
technically correct SAS syntax, but not correct from a real-world
clinical or interview perspective.
Let’s
break this properly like an experienced SAS programmer would explain in an
interview.
if patient_age < 0 then
patient_age = abs(patient_age);
❌ Why This
Is Problematic in Real Scenarios
In real
clinical trials (SDTM/ADaM), this approach is considered inappropriate.
1. You Are Changing Raw Data Without Justification
- Age = -5 is not just
a small error
- It indicates data entry
issue / upstream system problem
- Converting it to +5
silently changes meaning
👉 This violates data integrity principles
2. Regulatory Risk (Very Important)
In
clinical trials:
- You cannot modify data
blindly
- Every change must be:
- Traceable
- Justified
- Documented
👉 Regulators (FDA, EMA) may question:
“How did
-5 become +5?”
3. Loss of Audit Trail
- Original error disappears
- No way to trace what
happened
- Fails audit requirements
Correct Approach in Real-World (Interview-Ready
Answer)
Option 1: Flag the Issue (Best Practice)
if patient_age < 0 then do;
age_flag = "INVALID";
end;
Option 2: Set to Missing + Flag
if patient_age < 0 then do;
patient_age = .;
age_flag = "NEGATIVE VALUE";
end;
Option 3: Use Business Rule (Only if Defined in
SAP)
if patient_age < 0 then
patient_age = abs(patient_age); /* Only if SAP allows */
But in
interview, you must say:
I would
only apply ABS if explicitly mentioned in SAP or data management guidelines.
Perfect Interview Answer (Use This)
Technically, the code is correct in SAS.
However, in real clinical scenarios, directly converting negative age to
positive using ABS is not appropriate. Negative age indicates a data quality
issue. Instead, I would flag the record or set it to missing and follow the
study’s SAP or data management guidelines. Any correction must be traceable and
justified for regulatory compliance.
Key Takeaways
- ✔ Syntax = Correct
- ❌ Logic (Real-world) = Risky
- ✔ Best Practice = Flag + Validate + Follow SAP
Pro Tip (Interview Booster)
If you
say this, you will stand out:
“In ADaM,
I would not derive AGE using ABS blindly. I would check DM domain, RFSTDTC,
BRTHDTC, and ensure derivation aligns with CDISC standards.”
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BOMB BLAST DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment