Shocking Data Chaos in Global Bomb Blast Records: Mastering the COMPRESS Function in SAS for Precision Data Cleaning

1. Introduction

Imagine you are working as a clinical SAS programmer or data analyst in a global security analytics firm. You receive a dataset titled “Major Bomb Blasts Around the World.” The dataset is supposed to help governments predict high-risk zones, but once you open it chaos.

Dates look like “32JAN2020”, locations are written as “new york”, “NEW YORK”, “Ny”, casualties are negative, and some values literally say “NULL”. Duplicate records? Plenty. Missing values? Everywhere.

Now pause and thinknwhat happens if this data feeds into a predictive model?

👉 Governments may allocate resources incorrectly
👉 Risk assessments become flawed
👉 Policy decisions fail

This is exactly why data cleaning is not optional it’s foundational.

In industries like clinical trials (SDTM/ADaM), finance, or global risk analytics, bad data destroys trust.

That’s where SAS and R become powerful allies:

SAS → Highly structured, regulatory-grade cleaning
R → Flexible, fast, exploratory transformations

In this blog, we will simulate a messy bomb blast dataset, intentionally introduce errors, and then clean it step-by-step using:

SAS (including COMPRESS function)
R (tidyverse approach)

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA blast_raw;

INFILE DATALINES DLM='|' DSD MISSOVER;

INPUT ID Country $ City $ Date:$10.

Casualties Injured Weapon:$10. Group $;

DATALINES;

1|USA|NewYork|12JAN2020|50|100|Bomb|GroupA

2|india|delhi|32JAN2020|-10|50|NULL|GroupB

3|UK|London|05FEB2021|30|.|Bomb|GroupA

4|USA|NEWYORK|12JAN2020|50|100|Bomb|GroupA

5|France|Paris|.|20|40|Explosive|NULL

6|India|Delhi|15MAR2022|25|60|Bomb|groupb

7|USA|Ny|10APR2023|100|200|Bomb|GroupC

8|NULL|Berlin|20MAY2021|45|80|Bomb|GroupD

9|Japan|Tokyo|11JUN2020|-5|30|Bomb|GroupE

10|India|Mumbai|01JUL2022|60|120|Bomb|GroupF

11|USA|New York|10APR2023|100|200|Bomb|GroupC

12|UK|London|05FEB2021|30|.|Bomb|GroupA

;

RUN;

PROC PRINT DATA=blast_raw;

RUN;

OUTPUT:


Obs	ID	Country	City	Date	Casualties	Injured	Weapon	Group
1	1	USA	NewYork	12JAN2020	50	100	Bomb	GroupA
2	2	india	delhi	32JAN2020	-10	50	NULL	GroupB
3	3	UK	London	05FEB2021	30	.	Bomb	GroupA
4	4	USA	NEWYORK	12JAN2020	50	100	Bomb	GroupA
5	5	France	Paris		20	40	Explosive	NULL
6	6	India	Delhi	15MAR2022	25	60	Bomb	groupb
7	7	USA	Ny	10APR2023	100	200	Bomb	GroupC
8	8	NULL	Berlin	20MAY2021	45	80	Bomb	GroupD
9	9	Japan	Tokyo	11JUN2020	-5	30	Bomb	GroupE
10	10	India	Mumbai	01JUL2022	60	120	Bomb	GroupF
11	11	USA	New York	10APR2023	100	200	Bomb	GroupC
12	12	UK	London	05FEB2021	30	.	Bomb	GroupA

Explanation

This dataset intentionally contains real-world data quality issues:

Invalid dates (32JAN2020)
Negative values (Casualties = -10)
Missing values (.)
Duplicates (records 1 & 4)
Inconsistent text (“india”, “India”, “NULL”)
Extra spaces (“New York” vs “NewYork”)

The issue arises because SAS list input uses space as delimiter, so multi-word values like “New York” break variable alignment. This causes downstream variables (Date, Casualties, etc.) to shift incorrectly. The best fix is either:

Standardize values (e.g., “NewYork”)
Use a custom delimiter (|) with DLM= and DSD

MISSOVER ensures SAS does not move to the next line when data is missing. In real-world clinical or business datasets, using proper delimiters is critical to avoid data corruption during ingestion.

This mimics real-world raw datasets from unstructured sources. The goal is to simulate how messy operational or clinical datasets arrive before SDTM/ADaM transformation. Understanding these errors helps build strong debugging and cleaning strategies.

R Raw Dataset

blast_raw <- data.frame(

ID = c(1,2,3,4,5,6,7,8,9,10,11,12),

Country = c("USA","india","UK","USA","France","India","USA","NULL","Japan",

"India","USA","UK"),

City = c("NewYork","delhi","London","NEWYORK","Paris","","Ny","Berlin","Tokyo",

"Mumbai","New York","London"),

Date = c("12JAN2020","32JAN2020","05FEB2021","12JAN2020",NA,"15MAR2022",

"10APR2023","20MAY2021","11JUN2020","01JUL2022","10APR2023","05FEB2021"),

Casualties = c(50,-10,30,50,20,25,100,45,-5,60,100,30),

Injured = c(100,50,NA,100,40,60,200,80,30,120,200,NA),

Weapon = c("Bomb","NULL","Bomb","Bomb","Explosive","Bomb","Bomb","Bomb",

"Bomb","Bomb","Bomb","Bomb"),

Group = c("GroupA","GroupB","GroupA","GroupA","NULL","groupb","GroupC",

"GroupD","GroupE","GroupF","GroupC","GroupA")

)

OUTPUT:

	ID	Country	City	Date	Casualties	Injured	Weapon	Group
1	1	USA	NewYork	12-Jan-20	50	100	Bomb	GroupA
2	2	india	delhi	32JAN2020	-10	50	NULL	GroupB
3	3	UK	London	05-Feb-21	30	NA	Bomb	GroupA
4	4	USA	NEWYORK	12-Jan-20	50	100	Bomb	GroupA
5	5	France	Paris	NA	20	40	Explosive	NULL
6	6	India		15-Mar-22	25	60	Bomb	groupb
7	7	USA	Ny	10-Apr-23	100	200	Bomb	GroupC
8	8	NULL	Berlin	20-May-21	45	80	Bomb	GroupD
9	9	Japan	Tokyo	11-Jun-20	-5	30	Bomb	GroupE
10	10	India	Mumbai	01-Jul-22	60	120	Bomb	GroupF
11	11	USA	New York	10-Apr-23	100	200	Bomb	GroupC
12	12	UK	London	05-Feb-21	30	NA	Bomb	GroupA

Explanation

This R dataset mirrors the SAS dataset, ensuring cross-platform consistency. It includes:

NA values
Case inconsistencies
Invalid numerical values
Text anomalies

Creating identical datasets in SAS and R allows analysts to validate transformations across platforms a critical skill in regulated industries. It also enables cross-verification during QC (Quality Control).

3. Phase 1: Data Cleaning in SAS

DATA blast_clean;

SET blast_raw;

/* Handle missing and NULL values */

IF Country = "NULL" THEN Country = "UNKNOWN";

ELSE Country = UPCASE(Country);

City = PROPCASE(COMPRESS(City));

Weapon = UPCASE(COALESCEC(Weapon, "UNKNOWN"));

Group = UPCASE(COMPRESS(Group));

/* Fix invalid numeric values */

if Casualties < 0 then Casualties = ABS(Casualties);

/* Fix missing injured */

if Injured = . then Injured = 0;

/* Fix dates */

Date_clean = INPUT(Date, ?? DATE9.);

FORMAT Date_clean DATE9.;

RUN;

PROC PRINT DATA=blast_clean;

RUN;

OUTPUT:


Obs	ID	Country	City	Date	Casualties	Injured	Weapon	Group	Date_clean
1	1	USA	Newyork	12JAN2020	50	100	BOMB	GROUPA	12JAN2020
2	2	INDIA	Delhi	32JAN2020	10	50	NULL	GROUPB	.
3	3	UK	London	05FEB2021	30	0	BOMB	GROUPA	05FEB2021
4	4	USA	Newyork	12JAN2020	50	100	BOMB	GROUPA	12JAN2020
5	5	FRANCE	Paris		20	40	EXPLOSIVE	NULL	.
6	6	INDIA	Delhi	15MAR2022	25	60	BOMB	GROUPB	15MAR2022
7	7	USA	Ny	10APR2023	100	200	BOMB	GROUPC	10APR2023
8	8	UNKNOWN	Berlin	20MAY2021	45	80	BOMB	GROUPD	20MAY2021
9	9	JAPAN	Tokyo	11JUN2020	5	30	BOMB	GROUPE	11JUN2020
10	10	INDIA	Mumbai	01JUL2022	60	120	BOMB	GROUPF	01JUL2022
11	11	USA	Newyork	10APR2023	100	200	BOMB	GROUPC	10APR2023
12	12	UK	London	05FEB2021	30	0	BOMB	GROUPA	05FEB2021

/* Remove duplicates */

PROC SORT DATA=blast_clean NODUPKEY;

BY ID Country City Date_clean;

RUN;

PROC PRINT DATA=blast_clean;

RUN;

OUTPUT:


Obs	ID	Country	City	Date	Casualties	Injured	Weapon	Group	Date_clean
1	1	USA	Newyork	12JAN2020	50	100	BOMB	GROUPA	12JAN2020
2	2	INDIA	Delhi	32JAN2020	10	50	NULL	GROUPB	.
3	3	UK	London	05FEB2021	30	0	BOMB	GROUPA	05FEB2021
4	4	USA	Newyork	12JAN2020	50	100	BOMB	GROUPA	12JAN2020
5	5	FRANCE	Paris		20	40	EXPLOSIVE	NULL	.
6	6	INDIA	Delhi	15MAR2022	25	60	BOMB	GROUPB	15MAR2022
7	7	USA	Ny	10APR2023	100	200	BOMB	GROUPC	10APR2023
8	8	UNKNOWN	Berlin	20MAY2021	45	80	BOMB	GROUPD	20MAY2021
9	9	JAPAN	Tokyo	11JUN2020	5	30	BOMB	GROUPE	11JUN2020
10	10	INDIA	Mumbai	01JUL2022	60	120	BOMB	GROUPF	01JUL2022
11	11	USA	Newyork	10APR2023	100	200	BOMB	GROUPC	10APR2023
12	12	UK	London	05FEB2021	30	0	BOMB	GROUPA	05FEB2021

Explanation

This phase demonstrates structured SAS cleaning:

COALESCEC replaces missing character values
COMPRESS removes unwanted spaces or characters → critical for cleaning messy text like "New York"
UPCASE/PROPCASE standardize text
ABS() fixes negative values
INPUT + FORMAT converts character dates into SAS dates
PROC SORT NODUPKEY removes duplicates

The COMPRESS function is key here it eliminates unwanted spaces or special characters, ensuring consistency in categorical variables. This is especially important in SDTM where controlled terminology must be exact.

4. Phase 2: Data Cleaning in R

library(dplyr)

blast_clean <- blast_raw %>%

mutate(

Country = toupper(ifelse(Country == "NULL" | is.na(Country),

"UNKNOWN", Country)),

City = tools::toTitleCase(trimws(City)),

Weapon = toupper(trimws(Weapon)),

Group = toupper(trimws(Group)),

Casualties = ifelse(Casualties < 0, abs(Casualties),

Casualties),

Injured = ifelse(is.na(Injured), 0, Injured)

) %>%

distinct()

OUTPUT:

	ID	Country	City	Date	Casualties	Injured	Weapon	Group
1	1	USA	NewYork	12-Jan-20	50	100	BOMB	GROUPA
2	2	INDIA	Delhi	32JAN2020	10	50	NULL	GROUPB
3	3	UK	London	05-Feb-21	30	0	BOMB	GROUPA
4	4	USA	NEWYORK	12-Jan-20	50	100	BOMB	GROUPA
5	5	FRANCE	Paris	NA	20	40	EXPLOSIVE	NULL
6	6	INDIA		15-Mar-22	25	60	BOMB	GROUPB
7	7	USA	Ny	10-Apr-23	100	200	BOMB	GROUPC
8	8	UNKNOWN	Berlin	20-May-21	45	80	BOMB	GROUPD
9	9	JAPAN	Tokyo	11-Jun-20	5	30	BOMB	GROUPE
10	10	INDIA	Mumbai	01-Jul-22	60	120	BOMB	GROUPF
11	11	USA	New York	10-Apr-23	100	200	BOMB	GROUPC
12	12	UK	London	05-Feb-21	30	0	BOMB	GROUPA

Explanation

In R:

mutate() transforms variables
ifelse() handles missing and invalid values
toupper() / trimws() standardize text
distinct() removes duplicates

Unlike SAS, R is more flexible and concise. However, it requires discipline to maintain reproducibility. Functions like trimws() act similarly to SAS COMPRESS but are less powerful for pattern-based cleaning. R excels in quick transformations but lacks SAS’s built-in regulatory traceability.

5. Phase 3: Additional SAS Processing Using Raw Data

DATA blast_enhanced;

SET blast_raw;

/* Use COMPRESS to remove special characters */

City_clean = COMPRESS(UPCASE(City), , 'ka');

/* Create flag variables */

if Casualties > 80 then High_Impact = "YES";

else High_Impact = "NO";

/* Length standardization */

LENGTH Country_std $15;

Country_std = UPCASE(COMPRESS(Country));

RUN;

PROC PRINT DATA=blast_enhanced;

RUN;

OUTPUT:


Obs	ID	Country	City	Date	Casualties	Injured	Weapon	Group	City_clean	High_Impact	Country_std
1	1	USA	NewYork	12JAN2020	50	100	Bomb	GroupA	NEWYORK	NO	USA
2	2	india	delhi	32JAN2020	-10	50	NULL	GroupB	DELHI	NO	INDIA
3	3	UK	London	05FEB2021	30	.	Bomb	GroupA	LONDON	NO	UK
4	4	USA	NEWYORK	12JAN2020	50	100	Bomb	GroupA	NEWYORK	NO	USA
5	5	France	Paris		20	40	Explosive	NULL	PARIS	NO	FRANCE
6	6	India	Delhi	15MAR2022	25	60	Bomb	groupb	DELHI	NO	INDIA
7	7	USA	Ny	10APR2023	100	200	Bomb	GroupC	NY	YES	USA
8	8	NULL	Berlin	20MAY2021	45	80	Bomb	GroupD	BERLIN	NO	NULL
9	9	Japan	Tokyo	11JUN2020	-5	30	Bomb	GroupE	TOKYO	NO	JAPAN
10	10	India	Mumbai	01JUL2022	60	120	Bomb	GroupF	MUMBAI	NO	INDIA
11	11	USA	New York	10APR2023	100	200	Bomb	GroupC	NEWYORK	YES	USA
12	12	UK	London	05FEB2021	30	.	Bomb	GroupA	LONDON	NO	UK

Explanation

This phase deepens COMPRESS usage:

COMPRESS(UPCASE(City), , 'ka') keeps only alphabetic characters
Useful for removing noise like symbols, numbers
Helps in building standardized keys

Additional logic:

Flagging high-impact events
Standardizing variable lengths

This reflects real-world ADaM dataset derivations, where variables like flags (e.g., AVALC, ANL01FL) are created.

6. 20 Additional Data Cleaning Best Practices

Always validate against source data (SDTM vs ADaM)
Maintain audit trails for all transformations
Use controlled terminology (CDISC standards)
Never overwrite raw datasets
Validate date formats rigorously
Handle missing values with business logic
Check for duplicates across keys
Standardize categorical variables
Use macros for reusable cleaning logic
Document all assumptions
Perform QC using independent programming
Use PROC COMPARE for validation
Validate ranges (age, lab values)
Track derivation logic clearly
Avoid hardcoding values
Use metadata-driven programming
Ensure reproducibility
Align with SAP specifications
Validate joins and merges
Always test edge cases

7. Business Logic Behind Data Cleaning

Data cleaning is not random it’s logic-driven.

Missing values → Replace with defaults or impute
Example: Missing injured → set to 0
Negative values → Not realistic
Example: Casualties cannot be -10
Text inconsistency → Impacts grouping
Example: “India” vs “india”
Date correction → Essential for timeline analysis

In clinical trials:

Incorrect age → invalid subject eligibility
Missing visit date → affects endpoint analysis

8. 20 Key Points

Dirty data leads to wrong conclusions.
COMPRESS is powerful for string cleaning.
Standardization ensures reproducibility.
Missing values must follow business logic.
Negative values require validation.
Duplicate data inflates results.
Dates must be consistent.
Text inconsistencies break grouping.
SAS ensures regulatory compliance.
R enables flexible transformations.
Always validate after cleaning.
Never trust raw data blindly.
Cleaning improves model accuracy.
Documentation is critical.
QC is mandatory.
Use functions, not manual fixes.
Controlled terminology matters.
Automation reduces errors.
Clean data builds trust.
Data cleaning is data science.

9. Summary

Aspect	SAS	R
Structure	Highly structured	Flexible
Regulatory	Strong	Moderate
Cleaning Functions	COMPRESS, COALESCEC	trimws, mutate
Reproducibility	High	Depends on coding

👉 SAS is ideal for clinical and regulatory work
👉 R is ideal for exploration and rapid analysis

10. Conclusion

Data cleaning is not just a preprocessing step it is the foundation of reliable analytics.

Using functions like COMPRESS in SAS, we can eliminate inconsistencies and bring structure to chaos. Whether in clinical trials, global risk analysis, or business intelligence, clean data drives accurate decisions.

If your data is messy, your insights are misleading.
If your data is clean, your decisions are powerful.

11. Interview Questions

Q1: Why use COMPRESS in SAS?

👉 Removes unwanted characters → ensures clean, standardized strings.

Q2: How do you handle negative values in a dataset?

👉 Use conditions:

if value < 0 then value = abs(value);

Q3: How do you remove duplicates in SAS?

👉 PROC SORT NODUPKEY:

PROC SORT DATA=data NODUPKEY;

BY variables;

RUN;

Q4: R equivalent of SAS COMPRESS?

👉 gsub() or trimws() depending on requirement.

Q5: If patient age = -5 → correct using ABS or flag for review?(According To Real World Scenario).

if patient age < 0 then patient age = abs(patient age);

Answer: It is technically correct SAS syntax, but not correct from a real-world clinical or interview perspective.

Let’s break this properly like an experienced SAS programmer would explain in an interview.

if patient_age < 0 then patient_age = abs(patient_age);

❌ Why This Is Problematic in Real Scenarios

In real clinical trials (SDTM/ADaM), this approach is considered inappropriate.

1. You Are Changing Raw Data Without Justification

Age = -5 is not just a small error
It indicates data entry issue / upstream system problem
Converting it to +5 silently changes meaning

👉 This violates data integrity principles

2. Regulatory Risk (Very Important)

In clinical trials:

You cannot modify data blindly
Every change must be:

Traceable
Justified
Documented

👉 Regulators (FDA, EMA) may question:

“How did -5 become +5?”

3. Loss of Audit Trail

Original error disappears
No way to trace what happened
Fails audit requirements

Correct Approach in Real-World (Interview-Ready Answer)

Option 1: Flag the Issue (Best Practice)

if patient_age < 0 then do;

age_flag = "INVALID";

end;

Option 2: Set to Missing + Flag

if patient_age < 0 then do;

patient_age = .;

age_flag = "NEGATIVE VALUE";

end;

Option 3: Use Business Rule (Only if Defined in SAP)

if patient_age < 0 then patient_age = abs(patient_age); /* Only if SAP allows */

But in interview, you must say:

I would only apply ABS if explicitly mentioned in SAP or data management guidelines.

Perfect Interview Answer (Use This)

Technically, the code is correct in SAS. However, in real clinical scenarios, directly converting negative age to positive using ABS is not appropriate. Negative age indicates a data quality issue. Instead, I would flag the record or set it to missing and follow the study’s SAP or data management guidelines. Any correction must be traceable and justified for regulatory compliance.

Key Takeaways

✔ Syntax = Correct
❌ Logic (Real-world) = Risky
✔ Best Practice = Flag + Validate + Follow SAP

Pro Tip (Interview Booster)

If you say this, you will stand out:

“In ADaM, I would not derive AGE using ABS blindly. I would check DM domain, RFSTDTC, BRTHDTC, and ensure derivation aligns with CDISC standards.”

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BOMB BLAST DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Dead Mobile Brands to Clean Data Mastery Using INPUT and PUT in SAS
2.Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?
3.Can We Build and Debug a Digital Identity Verification Fraud Detection System in SAS Using Advanced Data Engineering and Macros?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Corrupted Data, Dangerous Decisions: Cleaning Global Blast Records with SAS Techniques

Shocking Data Chaos in Global Bomb Blast Records: Mastering the COMPRESS Function in SAS for Precision Data Cleaning

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS