449.RETAIN-Based Data Correction in Guns Dataset Using SAS : From Errors to Accuracy

Mastering Guns Dataset Cleaning in SAS Using RETAIN for Intelligent Data Correction

1. Introduction

Imagine you’re working on a sensitive analytics project say, a firearms incident dataset collected across multiple regions. This dataset is supposed to help policymakers understand patterns, risks, and safety measures. But when you open the data… it’s a mess.

Ages are negative. Dates are inconsistent. Some records say “NULL,” others are blank. Duplicate entries inflate incident counts. And worst of all critical fields like weapon type or incident severity are missing.

Now imagine building a report on top of this.

The result? Misleading insights, poor decisions, and potentially dangerous conclusions.

This is exactly why data cleaning is not optional it’s foundational.

Tools like SAS and R are not just programming languages they are data integrity engines. And within SAS, one of the most underrated yet powerful tools is the RETAIN statement, which helps maintain state across observations crucial for handling sequential logic, imputation, and carry-forward corrections.

In this blog, we’ll simulate a Guns dataset with intentional errors, clean it step-by-step in SAS and R, and demonstrate how RETAIN transforms data logic from reactive to intelligent.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset with Errors)

DATA guns_raw;

INFILE DATALINES MISSOVER;

INPUT ID Age Location $ Weapon $ 

      Incident_Date:$9. Severity $;

DATALINES;

1 25 NY Pistol 12JAN2023 High

2 -30 CA Rifle 15FEB2023 Medium

3 . TX NULL 20MAR2023 Low

4 40 FL Shotgun WRONGDATE High

5 25 NY Pistol 12JAN2023 High

6 60 CA Rifle 18APR2023 

7 . TX Pistol 22MAY2023 Medium

8 35 FL NULL 25JUN2023 Low

9 -5 NY Shotgun 30JUL2023 High

10 45 CA Rifle 10AUG2023 Medium

11 50 TX Pistol 12SEP2023 NULL

12 . FL Shotgun 15OCT2023 High

13 70 NY Rifle 20NOV2023 Medium

14 80 CA NULL 25DEC2023 Low

15 30 TX Pistol 01JAN2024 High

;

RUN;

PROC PRINT DATA=guns_raw;

RUN;

OUTPUT:

ObsIDAgeLocationWeaponIncident_DateSeverity
1125NYPistol12JAN2023High
22-30CARifle15FEB2023Medium
33.TXNULL20MAR2023Low
4440FLShotgunWRONGDATEHigh
5525NYPistol12JAN2023High
6660CARifle18APR2023 
77.TXPistol22MAY2023Medium
8835FLNULL25JUN2023Low
99-5NYShotgun30JUL2023High
101045CARifle10AUG2023Medium
111150TXPistol12SEP2023NULL
1212.FLShotgun15OCT2023High
131370NYRifle20NOV2023Medium
141480CANULL25DEC2023Low
151530TXPistol01JAN2024High

Explanation (SAS Raw Data)

This dataset intentionally introduces real-world data quality issues:

  • Missing Age (.)
  • Negative Age values
  • Invalid dates (WRONGDATE)
  • Duplicate records (ID 1 & 5)
  • Inconsistent text (NULL, blank Severity)

This mimics raw ingestion datasets seen in clinical or operational systems before SDTM mapping.

Key Points

  • DATALINES is useful for quick prototyping
  • Missing numeric = .
  • Character inconsistencies simulate real ETL challenges
  • Duplicate records often occur in system merges

R Code (Equivalent Dataset)

guns_raw <- data.frame(

  ID = 1:15,

  Age = c(25, -30, NA, 40, 25, 60, NA, 35, -5, 45, 50, NA, 70, 80, 30),

  Location = c("NY","CA","TX","FL","NY","CA","TX","FL","NY","CA","TX",

               "FL","NY","CA","TX"),

  Weapon = c("Pistol","Rifle","NULL","Shotgun","Pistol","Rifle","Pistol",

             "NULL","Shotgun","Rifle","Pistol","Shotgun","Rifle","NULL",

             "Pistol"),

  Incident_Date = c("12JAN2023","15FEB2023","20MAR2023","WRONGDATE",

                    "12JAN2023","18APR2023","22MAY2023","25JUN2023",

                    "30JUL2023","10AUG2023","12SEP2023","15OCT2023",

                    "20NOV2023","25DEC2023","01JAN2024"),

  Severity = c("High","Medium","Low","High","High","","Medium","Low",

               "High","Medium","NULL","High","Medium","Low","High")

)

OUTPUT:

 

ID

Age

Location

Weapon

Incident_Date

Severity

1

1

25

NY

Pistol

12-Jan-23

High

2

2

-30

CA

Rifle

15-Feb-23

Medium

3

3

NA

TX

NULL

20-Mar-23

Low

4

4

40

FL

Shotgun

WRONGDATE

High

5

5

25

NY

Pistol

12-Jan-23

High

6

6

60

CA

Rifle

18-Apr-23

 

7

7

NA

TX

Pistol

22-May-23

Medium

8

8

35

FL

NULL

25-Jun-23

Low

9

9

-5

NY

Shotgun

30-Jul-23

High

10

10

45

CA

Rifle

10-Aug-23

Medium

11

11

50

TX

Pistol

12-Sep-23

NULL

12

12

NA

FL

Shotgun

15-Oct-23

High

13

13

70

NY

Rifle

20-Nov-23

Medium

14

14

80

CA

NULL

25-Dec-23

Low

15

15

30

TX

Pistol

01-Jan-24

High

Explanation (R Raw Data)

R uses data.frame() for structured datasets. NA represents missing values. Character inconsistencies remain intact, requiring explicit cleaning.

Key Points

  • NA handling differs from SAS (. vs NA)
  • Character vectors are flexible but error-prone
  • Data types need validation before analysis

3. Phase 1: Data Cleaning in SAS (Using RETAIN)

SAS Cleaning Code

PROC SORT DATA=guns_raw NODUPKEY OUT=guns_nodup;

BY ID;

RUN;

PROC PRINT DATA=guns_nodup;

RUN;

LOG:

NOTE: There were 15 observations read from the data set WORK.GUNS_RAW.
NOTE: 0 observations with duplicate key values were deleted.

OUTPUT:

ObsIDAgeLocationWeaponIncident_DateSeverity
1125NYPistol12JAN2023High
22-30CARifle15FEB2023Medium
33.TXNULL20MAR2023Low
4440FLShotgunWRONGDATEHigh
5525NYPistol12JAN2023High
6660CARifle18APR2023 
77.TXPistol22MAY2023Medium
8835FLNULL25JUN2023Low
99-5NYShotgun30JUL2023High
101045CARifle10AUG2023Medium
111150TXPistol12SEP2023NULL
1212.FLShotgun15OCT2023High
131370NYRifle20NOV2023Medium
141480CANULL25DEC2023Low
151530TXPistol01JAN2024High

DATA guns_clean;

SET guns_nodup;

RETAIN Last_Valid_Age;

* Handle missing and negative age;

IF Age = . OR Age < 0 THEN Age = Last_Valid_Age;

ELSE Last_Valid_Age = Age;

* Fix weapon inconsistencies;

IF STRIP(UPCASE(Weapon)) = "NULL" OR STRIP(Weapon) = "" 

THEN  Weapon = "UNKNOWN";

Weapon = PROPCASE(STRIP(Weapon));

* Fix severity;

IF STRIP(UPCASE(Severity)) = "NULL" OR STRIP(Severity) = "" 

THEN Severity = "MEDIUM";

Severity = PROPCASE(STRIP(Severity));

* SAFE DATE CONVERSION USING ??;

Incident_Date_Num = INPUT(Incident_Date, ?? DATE9.);

FORMAT Incident_Date_Num DATE9.;

IF Incident_Date_Num = . THEN Incident_Date_Num = TODAY();

DROP Incident_Date Last_Valid_Age;

RENAME Incident_Date_Num = Incident_Date;

RUN;

PROC PRINT DATA=guns_clean;

RUN;

OUTPUT:
ObsIDAgeLocationWeaponSeverityIncident_Date
1125NYPistolHigh12JAN2023
2225CARifleMedium15FEB2023
3325TXUnknownLow20MAR2023
4440FLShotgunHigh12APR2026
5525NYPistolHigh12JAN2023
6660CARifleMedium18APR2023
7760TXPistolMedium22MAY2023
8835FLUnknownLow25JUN2023
9935NYShotgunHigh30JUL2023
101045CARifleMedium10AUG2023
111150TXPistolMedium12SEP2023
121250FLShotgunHigh15OCT2023
131370NYRifleMedium20NOV2023
141480CAUnknownLow25DEC2023
151530TXPistolHigh01JAN2024

Explanation (SAS Cleaning + RETAIN)

This is where SAS shines.

  • PROC SORT NODUPKEY removes duplicate IDs
  • RETAIN Last_Valid_Age holds previous valid value
  • Missing/invalid age is replaced using retained value
  • Dates are converted using INPUT()
  • Invalid dates default to today's date

RETAIN enables forward-fill logic, commonly used in clinical datasets (e.g., last observation carried forward – LOCF)

Key Points

  • RETAIN prevents reinitialization each iteration
  • Useful for sequential dependency logic
  • Essential in time-series or patient-level datasets
  • INPUT + FORMAT ensures date integrity

4. Phase 2: Data Cleaning in R

R Cleaning Code

library(dplyr)

library(stringr)

library(zoo)


guns_clean <- guns_raw %>%

  distinct(ID, .keep_all = TRUE) %>%

  mutate(

    Age = ifelse(Age < 0, NA, Age),

    Age = zoo::na.locf(Age, na.rm = FALSE), 

    Weapon = ifelse(str_trim(str_to_upper(Weapon)) %in% c("NULL",""),

                    "UNKNOWN", Weapon),

    Weapon = str_to_title(str_trim(Weapon)),

    Severity = ifelse(str_trim(str_to_upper(Severity)) %in% c("NULL",""),

                      "MEDIUM", Severity),

    Severity = str_to_title(str_trim(Severity)),

    Incident_Date = as.Date(Incident_Date, format="%d%b%Y")

  )

guns_clean$Incident_Date[is.na(guns_clean$Incident_Date)] <- Sys.Date()

OUTPUT:

 

ID

Age

Location

Weapon

Incident_Date

Severity

1

1

25

NY

Pistol

12-01-2023

High

2

2

25

CA

Rifle

15-02-2023

Medium

3

3

25

TX

Unknown

20-03-2023

Low

4

4

40

FL

Shotgun

12-04-2026

High

5

5

25

NY

Pistol

12-01-2023

High

6

6

60

CA

Rifle

18-04-2023

Medium

7

7

60

TX

Pistol

22-05-2023

Medium

8

8

35

FL

Unknown

25-06-2023

Low

9

9

35

NY

Shotgun

30-07-2023

High

10

10

45

CA

Rifle

10-08-2023

Medium

11

11

50

TX

Pistol

12-09-2023

Medium

12

12

50

FL

Shotgun

15-10-2023

High

13

13

70

NY

Rifle

20-11-2023

Medium

14

14

80

CA

Unknown

25-12-2023

Low

15

15

30

TX

Pistol

01-01-2024

High

Explanation (R Cleaning)

  • distinct() removes duplicates
  • lag() mimics RETAIN behavior
  • mutate() applies transformations
  • as.Date() converts date formats
  • lag()RETAIN
  • na.locf() = true RETAIN equivalent
  • Always standardize BEFORE formatting text 

 Key Points

  • lag() is R’s equivalent of RETAIN
  • dplyr simplifies pipeline logic
  • Date parsing is stricter than SAS
  • Vectorized operations improve efficiency

5. 20 Additional Data Cleaning Best Practices

  1. Always validate raw data before transformation
  2. Maintain audit trails for regulatory compliance
  3. Use SDTM standards for clinical datasets
  4. Never overwrite raw datasets
  5. Document all derivations (ADaM requirement)
  6. Validate ranges (e.g., age 0–120)
  7. Use controlled terminology
  8. Handle missing data systematically
  9. Log all corrections
  10. Use macros for reusable logic
  11. Perform QC checks independently
  12. Avoid hardcoding values
  13. Normalize categorical variables
  14. Check date consistency across domains
  15. Validate keys before joins
  16. Ensure traceability from raw → final
  17. Use metadata-driven programming
  18. Perform duplicate checks regularly
  19. Align with SAP specifications
  20. Use validation tools (e.g., Pinnacle 21)

6. Business Logic Behind Data Cleaning

Data cleaning is not just technical it’s business-critical logic.

  • Missing Age → replaced using last valid value
    πŸ‘‰ Prevents bias in demographic analysis
  • Negative Age → corrected
    πŸ‘‰ Ensures realistic modeling
  • Date Imputation
    πŸ‘‰ Maintains timeline consistency
  • Severity normalization
    πŸ‘‰ Ensures consistent reporting

Example:

If a patient’s age is missing in a clinical trial:

  • Leaving it blank → excludes them from analysis
  • Imputing intelligently → retains statistical power

7. 20 Key Points (Sharp Insights)

  • Dirty data leads to wrong conclusions
  • RETAIN enables memory in SAS
  • Standardization ensures reproducibility
  • Missing data must be handled strategically
  • Duplicate records distort analysis
  • Date errors break timelines
  • Text inconsistencies reduce accuracy
  • Validation is as important as analysis
  • SAS excels in structured pipelines
  • R excels in flexibility
  • Audit trails are mandatory in pharma
  • LOCF is common in clinical trials
  • Always separate raw and cleaned data
  • QC checks prevent downstream errors
  • Business logic drives cleaning rules
  • Data cleaning is 70% of analytics work
  • Never trust raw data blindly
  • Automation reduces human error
  • Clean data = reliable insights
  • Consistency is the backbone of analytics

8. SAS vs R Comparison

Feature

SAS

R

Data Handling

Structured

Flexible

RETAIN Logic

Native

lag() workaround

Regulatory Use

Strong (FDA)

Limited

Learning Curve

Moderate

Steep

Automation

Macro-based

Functional


Logic

SAS

R

Remove duplicates

PROC SORT NODUPKEY

distinct()

RETAIN logic

RETAIN + carry forward

zoo::na.locf()

NULL handling

STRIP + UPCASE

str_trim + str_to_upper

Case formatting

PROPCASE

str_to_title

Date conversion

INPUT(?? DATE9.)

as.Date()

Missing date fix

TODAY()

Sys.Date()

9. Summary

We transformed a messy Guns dataset into a reliable analytical asset using:

  • SAS: Structured, RETAIN-driven logic
  • R: Flexible, pipeline-based transformations

Both tools are powerful but SAS dominates in regulated environments, while R excels in exploratory analytics.

10. Conclusion

Data cleaning is not a preprocessing step it’s the foundation of truth in analytics.

The RETAIN statement in SAS elevates your logic from row-based corrections to context-aware intelligence.

Whether you're working in clinical trials, finance, or public safety
πŸ‘‰ Clean data is not a luxury. It’s a responsibility.

11. Interview Questions

Q1: What is RETAIN in SAS?

πŸ‘‰ It prevents variables from resetting each iteration.

Q2: How do you handle missing values in SAS?

πŸ‘‰ Using IF conditions, COALESCEC, or RETAIN logic.

Q3: How does R handle RETAIN-like behavior?

πŸ‘‰ Using lag() function.

Q4: What is NODUPKEY in PROC SORT?

πŸ‘‰ Removes duplicate records based on BY variables.

Q5: A dataset has negative ages and missing dates. What will you do?

πŸ‘‰ Steps:

  • Validate ranges
  • Replace negatives using ABS or logic
  • Impute missing dates
  • Document all changes

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about GUNS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1. Are We Losing Wildlife Faster Than We Think? – Animal Conservation Analysis Using SAS

2.Can Modern Art Data Explain What Actually Sells? – A Real-World SAS Analytics Project

3.Hidden Errors in Movie Data and How SAS LENGTH & FORMAT Bring Clarity 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy









Comments

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

418.Can We Design, Debug, Detect Fraud, And Optimize A Smart Parking System Using Advanced SAS Programming Techniques?