Mastering Guns Dataset Cleaning in SAS Using RETAIN for Intelligent Data Correction

1. Introduction

Imagine you’re working on a sensitive analytics project say, a firearms incident dataset collected across multiple regions. This dataset is supposed to help policymakers understand patterns, risks, and safety measures. But when you open the data… it’s a mess.

Ages are negative. Dates are inconsistent. Some records say “NULL,” others are blank. Duplicate entries inflate incident counts. And worst of all critical fields like weapon type or incident severity are missing.

Now imagine building a report on top of this.

The result? Misleading insights, poor decisions, and potentially dangerous conclusions.

This is exactly why data cleaning is not optional it’s foundational.

Tools like SAS and R are not just programming languages they are data integrity engines. And within SAS, one of the most underrated yet powerful tools is the RETAIN statement, which helps maintain state across observations crucial for handling sequential logic, imputation, and carry-forward corrections.

In this blog, we’ll simulate a Guns dataset with intentional errors, clean it step-by-step in SAS and R, and demonstrate how RETAIN transforms data logic from reactive to intelligent.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset with Errors)

DATA guns_raw;

INFILE DATALINES MISSOVER;

INPUT ID Age Location $ Weapon $

Incident_Date:$9. Severity $;

DATALINES;

1 25 NY Pistol 12JAN2023 High

2 -30 CA Rifle 15FEB2023 Medium

3 . TX NULL 20MAR2023 Low

4 40 FL Shotgun WRONGDATE High

5 25 NY Pistol 12JAN2023 High

6 60 CA Rifle 18APR2023

7 . TX Pistol 22MAY2023 Medium

8 35 FL NULL 25JUN2023 Low

9 -5 NY Shotgun 30JUL2023 High

10 45 CA Rifle 10AUG2023 Medium

11 50 TX Pistol 12SEP2023 NULL

12 . FL Shotgun 15OCT2023 High

13 70 NY Rifle 20NOV2023 Medium

14 80 CA NULL 25DEC2023 Low

15 30 TX Pistol 01JAN2024 High

;

RUN;

PROC PRINT DATA=guns_raw;

RUN;

OUTPUT:


Obs	ID	Age	Location	Weapon	Incident_Date	Severity
1	1	25	NY	Pistol	12JAN2023	High
2	2	-30	CA	Rifle	15FEB2023	Medium
3	3	.	TX	NULL	20MAR2023	Low
4	4	40	FL	Shotgun	WRONGDATE	High
5	5	25	NY	Pistol	12JAN2023	High
6	6	60	CA	Rifle	18APR2023
7	7	.	TX	Pistol	22MAY2023	Medium
8	8	35	FL	NULL	25JUN2023	Low
9	9	-5	NY	Shotgun	30JUL2023	High
10	10	45	CA	Rifle	10AUG2023	Medium
11	11	50	TX	Pistol	12SEP2023	NULL
12	12	.	FL	Shotgun	15OCT2023	High
13	13	70	NY	Rifle	20NOV2023	Medium
14	14	80	CA	NULL	25DEC2023	Low
15	15	30	TX	Pistol	01JAN2024	High

Explanation (SAS Raw Data)

This dataset intentionally introduces real-world data quality issues:

Missing Age (.)
Negative Age values
Invalid dates (WRONGDATE)
Duplicate records (ID 1 & 5)
Inconsistent text (NULL, blank Severity)

This mimics raw ingestion datasets seen in clinical or operational systems before SDTM mapping.

Key Points

DATALINES is useful for quick prototyping
Missing numeric = .
Character inconsistencies simulate real ETL challenges
Duplicate records often occur in system merges

R Code (Equivalent Dataset)

guns_raw <- data.frame(

ID = 1:15,

Age = c(25, -30, NA, 40, 25, 60, NA, 35, -5, 45, 50, NA, 70, 80, 30),

Location = c("NY","CA","TX","FL","NY","CA","TX","FL","NY","CA","TX",

"FL","NY","CA","TX"),

Weapon = c("Pistol","Rifle","NULL","Shotgun","Pistol","Rifle","Pistol",

"NULL","Shotgun","Rifle","Pistol","Shotgun","Rifle","NULL",

"Pistol"),

Incident_Date = c("12JAN2023","15FEB2023","20MAR2023","WRONGDATE",

"12JAN2023","18APR2023","22MAY2023","25JUN2023",

"30JUL2023","10AUG2023","12SEP2023","15OCT2023",

"20NOV2023","25DEC2023","01JAN2024"),

Severity = c("High","Medium","Low","High","High","","Medium","Low",

"High","Medium","NULL","High","Medium","Low","High")

)

OUTPUT:

	ID	Age	Location	Weapon	Incident_Date	Severity
1	1	25	NY	Pistol	12-Jan-23	High
2	2	-30	CA	Rifle	15-Feb-23	Medium
3	3	NA	TX	NULL	20-Mar-23	Low
4	4	40	FL	Shotgun	WRONGDATE	High
5	5	25	NY	Pistol	12-Jan-23	High
6	6	60	CA	Rifle	18-Apr-23
7	7	NA	TX	Pistol	22-May-23	Medium
8	8	35	FL	NULL	25-Jun-23	Low
9	9	-5	NY	Shotgun	30-Jul-23	High
10	10	45	CA	Rifle	10-Aug-23	Medium
11	11	50	TX	Pistol	12-Sep-23	NULL
12	12	NA	FL	Shotgun	15-Oct-23	High
13	13	70	NY	Rifle	20-Nov-23	Medium
14	14	80	CA	NULL	25-Dec-23	Low
15	15	30	TX	Pistol	01-Jan-24	High

Explanation (R Raw Data)

R uses data.frame() for structured datasets. NA represents missing values. Character inconsistencies remain intact, requiring explicit cleaning.

Key Points

NA handling differs from SAS (. vs NA)
Character vectors are flexible but error-prone
Data types need validation before analysis

3. Phase 1: Data Cleaning in SAS (Using RETAIN)

SAS Cleaning Code

PROC SORT DATA=guns_raw NODUPKEY OUT=guns_nodup;

BY ID;

RUN;

PROC PRINT DATA=guns_nodup;

RUN;

LOG:

 NOTE: There were 15 observations read from the data set WORK.GUNS_RAW.

 NOTE: 0 observations with duplicate key values were deleted.

OUTPUT:


Obs	ID	Age	Location	Weapon	Incident_Date	Severity
1	1	25	NY	Pistol	12JAN2023	High
2	2	-30	CA	Rifle	15FEB2023	Medium
3	3	.	TX	NULL	20MAR2023	Low
4	4	40	FL	Shotgun	WRONGDATE	High
5	5	25	NY	Pistol	12JAN2023	High
6	6	60	CA	Rifle	18APR2023
7	7	.	TX	Pistol	22MAY2023	Medium
8	8	35	FL	NULL	25JUN2023	Low
9	9	-5	NY	Shotgun	30JUL2023	High
10	10	45	CA	Rifle	10AUG2023	Medium
11	11	50	TX	Pistol	12SEP2023	NULL
12	12	.	FL	Shotgun	15OCT2023	High
13	13	70	NY	Rifle	20NOV2023	Medium
14	14	80	CA	NULL	25DEC2023	Low
15	15	30	TX	Pistol	01JAN2024	High

DATA guns_clean;

SET guns_nodup;

RETAIN Last_Valid_Age;

* Handle missing and negative age;

IF Age = . OR Age < 0 THEN Age = Last_Valid_Age;

ELSE Last_Valid_Age = Age;

* Fix weapon inconsistencies;

IF STRIP(UPCASE(Weapon)) = "NULL" OR STRIP(Weapon) = ""

THEN Weapon = "UNKNOWN";

Weapon = PROPCASE(STRIP(Weapon));

* Fix severity;

IF STRIP(UPCASE(Severity)) = "NULL" OR STRIP(Severity) = ""

THEN Severity = "MEDIUM";

Severity = PROPCASE(STRIP(Severity));

* SAFE DATE CONVERSION USING ??;

Incident_Date_Num = INPUT(Incident_Date, ?? DATE9.);

FORMAT Incident_Date_Num DATE9.;

IF Incident_Date_Num = . THEN Incident_Date_Num = TODAY();

DROP Incident_Date Last_Valid_Age;

RENAME Incident_Date_Num = Incident_Date;

RUN;

PROC PRINT DATA=guns_clean;

RUN;

OUTPUT:


Obs	ID	Age	Location	Weapon	Severity	Incident_Date
1	1	25	NY	Pistol	High	12JAN2023
2	2	25	CA	Rifle	Medium	15FEB2023
3	3	25	TX	Unknown	Low	20MAR2023
4	4	40	FL	Shotgun	High	12APR2026
5	5	25	NY	Pistol	High	12JAN2023
6	6	60	CA	Rifle	Medium	18APR2023
7	7	60	TX	Pistol	Medium	22MAY2023
8	8	35	FL	Unknown	Low	25JUN2023
9	9	35	NY	Shotgun	High	30JUL2023
10	10	45	CA	Rifle	Medium	10AUG2023
11	11	50	TX	Pistol	Medium	12SEP2023
12	12	50	FL	Shotgun	High	15OCT2023
13	13	70	NY	Rifle	Medium	20NOV2023
14	14	80	CA	Unknown	Low	25DEC2023
15	15	30	TX	Pistol	High	01JAN2024

Explanation (SAS Cleaning + RETAIN)

This is where SAS shines.

PROC SORT NODUPKEY removes duplicate IDs
RETAIN Last_Valid_Age holds previous valid value
Missing/invalid age is replaced using retained value
Dates are converted using INPUT()
Invalid dates default to today's date

RETAIN enables forward-fill logic, commonly used in clinical datasets (e.g., last observation carried forward – LOCF)

Key Points

RETAIN prevents reinitialization each iteration
Useful for sequential dependency logic
Essential in time-series or patient-level datasets
INPUT + FORMAT ensures date integrity

4. Phase 2: Data Cleaning in R

R Cleaning Code

library(dplyr)

library(stringr)

library(zoo)

guns_clean <- guns_raw %>%

distinct(ID, .keep_all = TRUE) %>%

mutate(

Age = ifelse(Age < 0, NA, Age),

Age = zoo::na.locf(Age, na.rm = FALSE),

Weapon = ifelse(str_trim(str_to_upper(Weapon)) %in% c("NULL",""),

"UNKNOWN", Weapon),

Weapon = str_to_title(str_trim(Weapon)),

Severity = ifelse(str_trim(str_to_upper(Severity)) %in% c("NULL",""),

"MEDIUM", Severity),

Severity = str_to_title(str_trim(Severity)),

Incident_Date = as.Date(Incident_Date, format="%d%b%Y")

)

guns_clean$Incident_Date[is.na(guns_clean$Incident_Date)] <- Sys.Date()

OUTPUT:

	ID	Age	Location	Weapon	Incident_Date	Severity
1	1	25	NY	Pistol	12-01-2023	High
2	2	25	CA	Rifle	15-02-2023	Medium
3	3	25	TX	Unknown	20-03-2023	Low
4	4	40	FL	Shotgun	12-04-2026	High
5	5	25	NY	Pistol	12-01-2023	High
6	6	60	CA	Rifle	18-04-2023	Medium
7	7	60	TX	Pistol	22-05-2023	Medium
8	8	35	FL	Unknown	25-06-2023	Low
9	9	35	NY	Shotgun	30-07-2023	High
10	10	45	CA	Rifle	10-08-2023	Medium
11	11	50	TX	Pistol	12-09-2023	Medium
12	12	50	FL	Shotgun	15-10-2023	High
13	13	70	NY	Rifle	20-11-2023	Medium
14	14	80	CA	Unknown	25-12-2023	Low
15	15	30	TX	Pistol	01-01-2024	High

Explanation (R Cleaning)

distinct() removes duplicates
lag() mimics RETAIN behavior
mutate() applies transformations
as.Date() converts date formats
lag() ≠ RETAIN
na.locf() = true RETAIN equivalent
Always standardize BEFORE formatting text

Key Points

lag() is R’s equivalent of RETAIN
dplyr simplifies pipeline logic
Date parsing is stricter than SAS
Vectorized operations improve efficiency

5. 20 Additional Data Cleaning Best Practices

Always validate raw data before transformation
Maintain audit trails for regulatory compliance
Use SDTM standards for clinical datasets
Never overwrite raw datasets
Document all derivations (ADaM requirement)
Validate ranges (e.g., age 0–120)
Use controlled terminology
Handle missing data systematically
Log all corrections
Use macros for reusable logic
Perform QC checks independently
Avoid hardcoding values
Normalize categorical variables
Check date consistency across domains
Validate keys before joins
Ensure traceability from raw → final
Use metadata-driven programming
Perform duplicate checks regularly
Align with SAP specifications
Use validation tools (e.g., Pinnacle 21)

6. Business Logic Behind Data Cleaning

Data cleaning is not just technical it’s business-critical logic.

Missing Age → replaced using last valid value
👉 Prevents bias in demographic analysis
Negative Age → corrected
👉 Ensures realistic modeling
Date Imputation
👉 Maintains timeline consistency
Severity normalization
👉 Ensures consistent reporting

Example:

If a patient’s age is missing in a clinical trial:

Leaving it blank → excludes them from analysis
Imputing intelligently → retains statistical power

7. 20 Key Points (Sharp Insights)

Dirty data leads to wrong conclusions
RETAIN enables memory in SAS
Standardization ensures reproducibility
Missing data must be handled strategically
Duplicate records distort analysis
Date errors break timelines
Text inconsistencies reduce accuracy
Validation is as important as analysis
SAS excels in structured pipelines
R excels in flexibility
Audit trails are mandatory in pharma
LOCF is common in clinical trials
Always separate raw and cleaned data
QC checks prevent downstream errors
Business logic drives cleaning rules
Data cleaning is 70% of analytics work
Never trust raw data blindly
Automation reduces human error
Clean data = reliable insights
Consistency is the backbone of analytics

8. SAS vs R Comparison

Feature	SAS	R
Data Handling	Structured	Flexible
RETAIN Logic	Native	lag() workaround
Regulatory Use	Strong (FDA)	Limited
Learning Curve	Moderate	Steep
Automation	Macro-based	Functional

Logic	SAS	R
Remove duplicates	PROC SORT NODUPKEY	distinct()
RETAIN logic	RETAIN + carry forward	zoo::na.locf()
NULL handling	STRIP + UPCASE	str_trim + str_to_upper
Case formatting	PROPCASE	str_to_title
Date conversion	INPUT(?? DATE9.)	as.Date()
Missing date fix	TODAY()	Sys.Date()

9. Summary

We transformed a messy Guns dataset into a reliable analytical asset using:

SAS: Structured, RETAIN-driven logic
R: Flexible, pipeline-based transformations

Both tools are powerful but SAS dominates in regulated environments, while R excels in exploratory analytics.

10. Conclusion

Data cleaning is not a preprocessing step it’s the foundation of truth in analytics.

The RETAIN statement in SAS elevates your logic from row-based corrections to context-aware intelligence.

Whether you're working in clinical trials, finance, or public safety
👉 Clean data is not a luxury. It’s a responsibility.

11. Interview Questions

Q1: What is RETAIN in SAS?

👉 It prevents variables from resetting each iteration.

Q2: How do you handle missing values in SAS?

👉 Using IF conditions, COALESCEC, or RETAIN logic.

Q3: How does R handle RETAIN-like behavior?

👉 Using lag() function.

Q4: What is NODUPKEY in PROC SORT?

👉 Removes duplicate records based on BY variables.

Q5: A dataset has negative ages and missing dates. What will you do?

👉 Steps:

Validate ranges
Replace negatives using ABS or logic
Impute missing dates
Document all changes

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about GUNS DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1. Are We Losing Wildlife Faster Than We Think? – Animal Conservation Analysis Using SAS
2.Can Modern Art Data Explain What Actually Sells? – A Real-World SAS Analytics Project
3.Hidden Errors in Movie Data and How SAS LENGTH & FORMAT Bring Clarity
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

RETAIN-Based Data Correction in Guns Dataset Using SAS : From Errors to Accuracy

Mastering Guns Dataset Cleaning in SAS Using RETAIN for Intelligent Data Correction

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS