451.Corrupted Data, Dangerous Decisions: Cleaning Global Blast Records with SAS Techniques

Shocking Data Chaos in Global Bomb Blast Records: Mastering the COMPRESS Function in SAS for Precision Data Cleaning

1. Introduction

Imagine you are working as a clinical SAS programmer or data analyst in a global security analytics firm. You receive a dataset titled “Major Bomb Blasts Around the World.” The dataset is supposed to help governments predict high-risk zones, but once you open it, chaos greets you.

Dates look like “32JAN2020”, locations are written as “new york”, “NEW YORK”, “Ny”, casualties are negative, and some values literally say “NULL”. Duplicate records? Plenty. Missing values? Everywhere.

Now pause and think: what happens if this data feeds into a predictive model?

👉 Governments may allocate resources incorrectly
👉 Risk assessments become flawed
👉 Policy decisions fail

This is exactly why data cleaning is not optional; it’s foundational.

In industries like clinical trials (SDTM/ADaM), finance, or global risk analytics, bad data destroys trust.

That’s where SAS and R become powerful allies:

  • SAS → Highly structured, regulatory-grade cleaning
  • R → Flexible, fast, exploratory transformations

In this blog, we will simulate a messy bomb blast dataset, intentionally introduce errors, and then clean it step-by-step using:

  • SAS (including COMPRESS function)
  • R (tidyverse approach)

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA blast_raw;
  INFILE DATALINES DLM='|' DSD MISSOVER;
  INPUT ID Country $ City $ Date :$10.
        Casualties Injured Weapon :$10. Group $;
DATALINES;
1|USA|NewYork|12JAN2020|50|100|Bomb|GroupA
2|india|delhi|32JAN2020|-10|50|NULL|GroupB
3|UK|London|05FEB2021|30|.|Bomb|GroupA
4|USA|NEWYORK|12JAN2020|50|100|Bomb|GroupA
5|France|Paris|.|20|40|Explosive|NULL
6|India|Delhi|15MAR2022|25|60|Bomb|groupb
7|USA|Ny|10APR2023|100|200|Bomb|GroupC
8|NULL|Berlin|20MAY2021|45|80|Bomb|GroupD
9|Japan|Tokyo|11JUN2020|-5|30|Bomb|GroupE
10|India|Mumbai|01JUL2022|60|120|Bomb|GroupF
11|USA|New York|10APR2023|100|200|Bomb|GroupC
12|UK|London|05FEB2021|30|.|Bomb|GroupA
;
RUN;

PROC PRINT DATA=blast_raw;
RUN;

OUTPUT:

Obs  ID  Country  City      Date       Casualties  Injured  Weapon     Group
  1   1  USA      NewYork   12JAN2020          50      100  Bomb       GroupA
  2   2  india    delhi     32JAN2020         -10       50  NULL       GroupB
  3   3  UK       London    05FEB2021          30        .  Bomb       GroupA
  4   4  USA      NEWYORK   12JAN2020          50      100  Bomb       GroupA
  5   5  France   Paris                        20       40  Explosive  NULL
  6   6  India    Delhi     15MAR2022          25       60  Bomb       groupb
  7   7  USA      Ny        10APR2023         100      200  Bomb       GroupC
  8   8  NULL     Berlin    20MAY2021          45       80  Bomb       GroupD
  9   9  Japan    Tokyo     11JUN2020          -5       30  Bomb       GroupE
 10  10  India    Mumbai    01JUL2022          60      120  Bomb       GroupF
 11  11  USA      New York  10APR2023         100      200  Bomb       GroupC
 12  12  UK       London    05FEB2021          30        .  Bomb       GroupA

Explanation

This dataset intentionally contains real-world data quality issues:

  • Invalid dates (32JAN2020)
  • Negative values (Casualties = -10)
  • Missing values (.)
  • Duplicates (records 1 & 4)
  • Inconsistent text (“india”, “India”, “NULL”)
  • Extra spaces (“New York” vs “NewYork”)

The issue arises because default SAS list input treats spaces as delimiters, so multi-word values like “New York” break variable alignment and shift downstream variables (Date, Casualties, etc.) into the wrong columns. The best fix is either:

  1. Standardize values (e.g., “NewYork”)
  2. Use a custom delimiter (|) with DLM= and DSD

MISSOVER ensures SAS does not move to the next line when data is missing. In real-world clinical or business datasets, using proper delimiters is critical to avoid data corruption during ingestion.

This mimics real-world raw datasets from unstructured sources. The goal is to simulate how messy operational or clinical datasets arrive before SDTM/ADaM transformation. Understanding these errors helps build strong debugging and cleaning strategies.
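To make the delimiter point concrete, here is a minimal sketch (not part of the original dataset) of what happens when a multi-word city is read with default, space-delimited list input:

```sas
DATA demo_shift;
  INPUT ID City $ Casualties;
  DATALINES;
1 New York 50
;
RUN;
/* City becomes "New", and Casualties then tries to read "York",
   producing an "Invalid data" note in the log and a missing value.
   Reading the same record with DLM='|' and DSD maps each field to
   exactly one variable and avoids the shift. */
```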

R Raw Dataset

blast_raw <- data.frame(
  ID = c(1,2,3,4,5,6,7,8,9,10,11,12),
  Country = c("USA","india","UK","USA","France","India","USA","NULL","Japan",
              "India","USA","UK"),
  City = c("NewYork","delhi","London","NEWYORK","Paris","","Ny","Berlin","Tokyo",
           "Mumbai","New York","London"),
  Date = c("12JAN2020","32JAN2020","05FEB2021","12JAN2020",NA,"15MAR2022",
           "10APR2023","20MAY2021","11JUN2020","01JUL2022","10APR2023","05FEB2021"),
  Casualties = c(50,-10,30,50,20,25,100,45,-5,60,100,30),
  Injured = c(100,50,NA,100,40,60,200,80,30,120,200,NA),
  Weapon = c("Bomb","NULL","Bomb","Bomb","Explosive","Bomb","Bomb","Bomb",
             "Bomb","Bomb","Bomb","Bomb"),
  Group = c("GroupA","GroupB","GroupA","GroupA","NULL","groupb","GroupC",
            "GroupD","GroupE","GroupF","GroupC","GroupA")
)

OUTPUT:

   ID Country     City      Date Casualties Injured    Weapon  Group
1   1     USA  NewYork 12JAN2020         50     100      Bomb GroupA
2   2   india    delhi 32JAN2020        -10      50      NULL GroupB
3   3      UK   London 05FEB2021         30      NA      Bomb GroupA
4   4     USA  NEWYORK 12JAN2020         50     100      Bomb GroupA
5   5  France    Paris      <NA>         20      40 Explosive   NULL
6   6   India          15MAR2022         25      60      Bomb groupb
7   7     USA       Ny 10APR2023        100     200      Bomb GroupC
8   8    NULL   Berlin 20MAY2021         45      80      Bomb GroupD
9   9   Japan    Tokyo 11JUN2020         -5      30      Bomb GroupE
10 10   India   Mumbai 01JUL2022         60     120      Bomb GroupF
11 11     USA New York 10APR2023        100     200      Bomb GroupC
12 12      UK   London 05FEB2021         30      NA      Bomb GroupA

Explanation

This R dataset mirrors the SAS dataset, ensuring cross-platform consistency. It includes:

  • NA values
  • Case inconsistencies
  • Invalid numerical values
  • Text anomalies

Creating identical datasets in SAS and R allows analysts to validate transformations across platforms, a critical skill in regulated industries. It also enables cross-verification during QC (Quality Control).

3. Phase 1: Data Cleaning in SAS

DATA blast_clean;
  SET blast_raw;

  /* Handle missing and NULL values */
  IF Country = "NULL" THEN Country = "UNKNOWN";
  ELSE Country = UPCASE(Country);

  City   = PROPCASE(COMPRESS(City));
  Weapon = UPCASE(COALESCEC(Weapon, "UNKNOWN"));
  Group  = UPCASE(COMPRESS(Group));

  /* Fix invalid numeric values */
  IF Casualties < 0 THEN Casualties = ABS(Casualties);

  /* Fix missing injured counts */
  IF Injured = . THEN Injured = 0;

  /* Fix dates: ?? suppresses invalid-date errors in the log */
  Date_clean = INPUT(Date, ?? DATE9.);
  FORMAT Date_clean DATE9.;
RUN;

PROC PRINT DATA=blast_clean;
RUN;

OUTPUT:

Obs  ID  Country  City     Date       Casualties  Injured  Weapon     Group   Date_clean
  1   1  USA      Newyork  12JAN2020          50      100  BOMB       GROUPA   12JAN2020
  2   2  INDIA    Delhi    32JAN2020          10       50  NULL       GROUPB           .
  3   3  UK       London   05FEB2021          30        0  BOMB       GROUPA   05FEB2021
  4   4  USA      Newyork  12JAN2020          50      100  BOMB       GROUPA   12JAN2020
  5   5  FRANCE   Paris                       20       40  EXPLOSIVE  NULL             .
  6   6  INDIA    Delhi    15MAR2022          25       60  BOMB       GROUPB   15MAR2022
  7   7  USA      Ny       10APR2023         100      200  BOMB       GROUPC   10APR2023
  8   8  UNKNOWN  Berlin   20MAY2021          45       80  BOMB       GROUPD   20MAY2021
  9   9  JAPAN    Tokyo    11JUN2020           5       30  BOMB       GROUPE   11JUN2020
 10  10  INDIA    Mumbai   01JUL2022          60      120  BOMB       GROUPF   01JUL2022
 11  11  USA      Newyork  10APR2023         100      200  BOMB       GROUPC   10APR2023
 12  12  UK       London   05FEB2021          30        0  BOMB       GROUPA   05FEB2021

/* Remove duplicates */
PROC SORT DATA=blast_clean NODUPKEY;
  BY ID Country City Date_clean;
RUN;

PROC PRINT DATA=blast_clean;
RUN;

OUTPUT:

Obs  ID  Country  City     Date       Casualties  Injured  Weapon     Group   Date_clean
  1   1  USA      Newyork  12JAN2020          50      100  BOMB       GROUPA   12JAN2020
  2   2  INDIA    Delhi    32JAN2020          10       50  NULL       GROUPB           .
  3   3  UK       London   05FEB2021          30        0  BOMB       GROUPA   05FEB2021
  4   4  USA      Newyork  12JAN2020          50      100  BOMB       GROUPA   12JAN2020
  5   5  FRANCE   Paris                       20       40  EXPLOSIVE  NULL             .
  6   6  INDIA    Delhi    15MAR2022          25       60  BOMB       GROUPB   15MAR2022
  7   7  USA      Ny       10APR2023         100      200  BOMB       GROUPC   10APR2023
  8   8  UNKNOWN  Berlin   20MAY2021          45       80  BOMB       GROUPD   20MAY2021
  9   9  JAPAN    Tokyo    11JUN2020           5       30  BOMB       GROUPE   11JUN2020
 10  10  INDIA    Mumbai   01JUL2022          60      120  BOMB       GROUPF   01JUL2022
 11  11  USA      Newyork  10APR2023         100      200  BOMB       GROUPC   10APR2023
 12  12  UK       London   05FEB2021          30        0  BOMB       GROUPA   05FEB2021
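One caveat worth noting: because the unique ID is in the BY list, NODUPKEY deletes nothing above (all 12 rows survive). A sketch of a sort keyed on content instead, which would drop the true duplicates (records 4 and 12):

```sas
/* Dedupe on content, not on the unique ID key */
PROC SORT DATA=blast_clean OUT=blast_dedup NODUPKEY;
  BY Country City Date_clean Casualties Injured Weapon Group;
RUN;
```

The output dataset name blast_dedup is illustrative; in practice the BY variables should come from the study's deduplication rules.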

Explanation

This phase demonstrates structured SAS cleaning:

  • COALESCEC replaces missing character values
  • COMPRESS removes unwanted spaces or characters, critical for cleaning messy text like "New York"
  • UPCASE/PROPCASE standardize text
  • ABS() fixes negative values
  • INPUT with the ?? modifier converts character dates into SAS dates while suppressing log errors, so invalid values like "32JAN2020" simply become missing; FORMAT then displays them as DATE9.
  • PROC SORT NODUPKEY removes duplicate key combinations (with the unique ID in the BY list, no records are actually deleted here; dropping ID from the BY statement would remove the content duplicates)

The COMPRESS function is key here: it eliminates unwanted spaces or special characters, ensuring consistency in categorical variables. This is especially important in SDTM, where controlled terminology must be exact.
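To illustrate COMPRESS's three arguments (source string, characters to remove, and modifiers), here is a small standalone sketch with made-up values, not drawn from the dataset above:

```sas
DATA _null_;
  raw = "  New-York 123!! ";
  only_letters = COMPRESS(raw, , 'ka');  /* keep (k) alphabetic (a) only -> NewYork */
  no_digits    = COMPRESS(raw, , 'd');   /* remove digits (d)                      */
  no_space     = COMPRESS(raw, , 's');   /* remove whitespace (s)                  */
  no_punct     = COMPRESS(raw, '-!');    /* remove only the listed characters      */
  PUT only_letters= no_digits= no_space= no_punct=;
RUN;
```

The second argument lists explicit characters to strip; the third-argument modifiers select whole character classes, which is what makes `COMPRESS(City, , 'ka')` so useful for building standardized keys.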

4. Phase 2: Data Cleaning in R

library(dplyr)

blast_clean <- blast_raw %>%
  mutate(
    Country = toupper(ifelse(Country == "NULL" | is.na(Country),
                             "UNKNOWN", Country)),
    City = tools::toTitleCase(trimws(City)),
    Weapon = toupper(trimws(Weapon)),
    Group = toupper(trimws(Group)),
    Casualties = ifelse(Casualties < 0, abs(Casualties), Casualties),
    Injured = ifelse(is.na(Injured), 0, Injured)
  ) %>%
  distinct()

OUTPUT:

   ID Country     City      Date Casualties Injured    Weapon  Group
1   1     USA  NewYork 12JAN2020         50     100      BOMB GROUPA
2   2   INDIA    Delhi 32JAN2020         10      50      NULL GROUPB
3   3      UK   London 05FEB2021         30       0      BOMB GROUPA
4   4     USA  NEWYORK 12JAN2020         50     100      BOMB GROUPA
5   5  FRANCE    Paris      <NA>         20      40 EXPLOSIVE   NULL
6   6   INDIA          15MAR2022         25      60      BOMB GROUPB
7   7     USA       Ny 10APR2023        100     200      BOMB GROUPC
8   8 UNKNOWN   Berlin 20MAY2021         45      80      BOMB GROUPD
9   9   JAPAN    Tokyo 11JUN2020          5      30      BOMB GROUPE
10 10   INDIA   Mumbai 01JUL2022         60     120      BOMB GROUPF
11 11     USA New York 10APR2023        100     200      BOMB GROUPC
12 12      UK   London 05FEB2021         30       0      BOMB GROUPA

Explanation

In R:

  • mutate() transforms variables
  • ifelse() handles missing and invalid values
  • toupper() / trimws() standardize text
  • distinct() removes duplicates

Unlike SAS, R is more flexible and concise. However, it requires discipline to maintain reproducibility. trimws() only strips leading and trailing whitespace; for pattern-based cleaning closer to SAS COMPRESS, gsub() with a regular expression is the better match. R excels at quick transformations but lacks SAS’s built-in regulatory traceability.

5. Phase 3: Additional SAS Processing Using Raw Data

DATA blast_enhanced;
  SET blast_raw;

  /* Use COMPRESS to keep only alphabetic characters */
  City_clean = COMPRESS(UPCASE(City), , 'ka');

  /* Create flag variables */
  IF Casualties > 80 THEN High_Impact = "YES";
  ELSE High_Impact = "NO";

  /* Length standardization */
  LENGTH Country_std $15;
  Country_std = UPCASE(COMPRESS(Country));
RUN;

PROC PRINT DATA=blast_enhanced;
RUN;

OUTPUT:

Obs  ID  Country  City      Date       Casualties  Injured  Weapon     Group   City_clean  High_Impact  Country_std
  1   1  USA      NewYork   12JAN2020          50      100  Bomb       GroupA  NEWYORK     NO           USA
  2   2  india    delhi     32JAN2020         -10       50  NULL       GroupB  DELHI       NO           INDIA
  3   3  UK       London    05FEB2021          30        .  Bomb       GroupA  LONDON      NO           UK
  4   4  USA      NEWYORK   12JAN2020          50      100  Bomb       GroupA  NEWYORK     NO           USA
  5   5  France   Paris                        20       40  Explosive  NULL    PARIS       NO           FRANCE
  6   6  India    Delhi     15MAR2022          25       60  Bomb       groupb  DELHI       NO           INDIA
  7   7  USA      Ny        10APR2023         100      200  Bomb       GroupC  NY          YES          USA
  8   8  NULL     Berlin    20MAY2021          45       80  Bomb       GroupD  BERLIN      NO           NULL
  9   9  Japan    Tokyo     11JUN2020          -5       30  Bomb       GroupE  TOKYO       NO           JAPAN
 10  10  India    Mumbai    01JUL2022          60      120  Bomb       GroupF  MUMBAI      NO           INDIA
 11  11  USA      New York  10APR2023         100      200  Bomb       GroupC  NEWYORK     YES          USA
 12  12  UK       London    05FEB2021          30        .  Bomb       GroupA  LONDON      NO           UK

Explanation

This phase deepens COMPRESS usage:

  • COMPRESS(UPCASE(City), , 'ka') keeps only alphabetic characters
  • Useful for removing noise like symbols, numbers
  • Helps in building standardized keys

Additional logic:

  • Flagging high-impact events
  • Standardizing variable lengths

This reflects real-world ADaM dataset derivations, where standardized variables and record-level flags (e.g., ANL01FL) are created.
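A hedged sketch of how such an ADaM-style flag might be derived from the enhanced dataset above. The flag name ANL01FL and the selection rule are illustrative only; in a real study the derivation would come from the SAP:

```sas
DATA blast_adam;
  SET blast_enhanced;
  LENGTH ANL01FL $1;
  /* Hypothetical rule: flag high-impact records with a valid city key */
  IF High_Impact = "YES" AND NOT MISSING(City_clean) THEN ANL01FL = "Y";
RUN;
```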

6. 20 Additional Data Cleaning Best Practices

  1. Always validate against source data (SDTM vs ADaM)
  2. Maintain audit trails for all transformations
  3. Use controlled terminology (CDISC standards)
  4. Never overwrite raw datasets
  5. Validate date formats rigorously
  6. Handle missing values with business logic
  7. Check for duplicates across keys
  8. Standardize categorical variables
  9. Use macros for reusable cleaning logic
  10. Document all assumptions
  11. Perform QC using independent programming
  12. Use PROC COMPARE for validation
  13. Validate ranges (age, lab values)
  14. Track derivation logic clearly
  15. Avoid hardcoding values
  16. Use metadata-driven programming
  17. Ensure reproducibility
  18. Align with SAP specifications
  19. Validate joins and merges
  20. Always test edge cases
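Points 11 and 12 above can be sketched concretely. Here blast_clean_qc is a hypothetical copy of the dataset produced by an independent programmer; PROC COMPARE reports any mismatches between the two derivations:

```sas
PROC COMPARE BASE=blast_clean COMPARE=blast_clean_qc
             OUT=qc_diffs OUTNOEQUAL NOPRINT;
  ID ID;
RUN;
/* OUTNOEQUAL writes only the non-matching observations to qc_diffs,
   so an empty qc_diffs dataset means the two derivations agree. */
```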

7. Business Logic Behind Data Cleaning

Data cleaning is not random; it is logic-driven.

  • Missing values → Replace with defaults or impute
    Example: Missing injured → set to 0
  • Negative values → Not realistic
    Example: Casualties cannot be -10
  • Text inconsistency → Impacts grouping
    Example: “India” vs “india”
  • Date correction → Essential for timeline analysis

In clinical trials:

  • Incorrect age → invalid subject eligibility
  • Missing visit date → affects endpoint analysis

8. 20 Key Points

  1. Dirty data leads to wrong conclusions.
  2. COMPRESS is powerful for string cleaning.
  3. Standardization ensures reproducibility.
  4. Missing values must follow business logic.
  5. Negative values require validation.
  6. Duplicate data inflates results.
  7. Dates must be consistent.
  8. Text inconsistencies break grouping.
  9. SAS ensures regulatory compliance.
  10. R enables flexible transformations.
  11. Always validate after cleaning.
  12. Never trust raw data blindly.
  13. Cleaning improves model accuracy.
  14. Documentation is critical.
  15. QC is mandatory.
  16. Use functions, not manual fixes.
  17. Controlled terminology matters.
  18. Automation reduces errors.
  19. Clean data builds trust.
  20. Data cleaning is data science.

9. Summary

Aspect              | SAS                  | R
--------------------|----------------------|--------------------
Structure           | Highly structured    | Flexible
Regulatory          | Strong               | Moderate
Cleaning Functions  | COMPRESS, COALESCEC  | trimws, mutate
Reproducibility     | High                 | Depends on coding

👉 SAS is ideal for clinical and regulatory work
👉 R is ideal for exploration and rapid analysis

10. Conclusion

Data cleaning is not just a preprocessing step; it is the foundation of reliable analytics.

Using functions like COMPRESS in SAS, we can eliminate inconsistencies and bring structure to chaos. Whether in clinical trials, global risk analysis, or business intelligence, clean data drives accurate decisions.

If your data is messy, your insights are misleading.
If your data is clean, your decisions are powerful.

11. Interview Questions

Q1: Why use COMPRESS in SAS?

👉 Removes unwanted characters → ensures clean, standardized strings.

Q2: How do you handle negative values in a dataset?

👉 Use conditions:

if value < 0 then value = abs(value);

Q3: How do you remove duplicates in SAS?

👉 PROC SORT NODUPKEY:

PROC SORT DATA=data NODUPKEY;
  BY variables;
RUN;

Q4: R equivalent of SAS COMPRESS?

👉 gsub() for pattern-based removal, or trimws() for stripping leading/trailing whitespace, depending on the requirement.

Q5: If patient age = -5, do you correct it using ABS or flag it for review? (Real-world scenario.)

if patient_age < 0 then patient_age = abs(patient_age);

Answer: This is technically correct SAS syntax, but it is not the correct approach from a real-world clinical or interview perspective.

Let’s break this properly like an experienced SAS programmer would explain in an interview.

if patient_age < 0 then patient_age = abs(patient_age);

❌  Why This Is Problematic in Real Scenarios

In real clinical trials (SDTM/ADaM), this approach is considered inappropriate.

1. You Are Changing Raw Data Without Justification

  • Age = -5 is not just a small error
  • It indicates data entry issue / upstream system problem
  • Converting it to +5 silently changes meaning

👉 This violates data integrity principles

2. Regulatory Risk (Very Important)

In clinical trials:

  • You cannot modify data blindly
  • Every change must be:
    • Traceable
    • Justified
    • Documented

👉 Regulators (FDA, EMA) may question:

“How did -5 become +5?”

3. Loss of Audit Trail

  • Original error disappears
  • No way to trace what happened
  • Fails audit requirements

Correct Approach in Real-World (Interview-Ready Answer)

Option 1: Flag the Issue (Best Practice)

if patient_age < 0 then do;
    age_flag = "INVALID";
end;

Option 2: Set to Missing + Flag

if patient_age < 0 then do;
    patient_age = .;
    age_flag = "NEGATIVE VALUE";
end;

Option 3: Use Business Rule (Only if Defined in SAP)

if patient_age < 0 then patient_age = abs(patient_age); /* Only if SAP allows */

But in interview, you must say:

I would only apply ABS if explicitly mentioned in SAP or data management guidelines.

Perfect Interview Answer (Use This)

Technically, the code is correct in SAS. However, in real clinical scenarios, directly converting negative age to positive using ABS is not appropriate. Negative age indicates a data quality issue. Instead, I would flag the record or set it to missing and follow the study’s SAP or data management guidelines. Any correction must be traceable and justified for regulatory compliance.

Key Takeaways

  • Syntax = Correct
  • ❌ Logic (Real-world) = Risky
  • Best Practice = Flag + Validate + Follow SAP

Pro Tip (Interview Booster)

If you say this, you will stand out:

“In ADaM, I would not derive AGE using ABS blindly. I would check DM domain, RFSTDTC, BRTHDTC, and ensure derivation aligns with CDISC standards.”

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only; the bomb blast records used here are entirely simulated, not real events.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Dead Mobile Brands to Clean Data Mastery Using INPUT and PUT in SAS

2.Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?

3.Can We Build and Debug a Digital Identity Verification Fraud Detection System in SAS Using Advanced Data Engineering and Macros?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
