449.RETAIN-Based Data Correction in Guns Dataset Using SAS : From Errors to Accuracy
Mastering Guns Dataset Cleaning in SAS Using RETAIN for Intelligent Data Correction
1. Introduction
Imagine
you’re working on a sensitive analytics project say, a firearms incident
dataset collected across multiple regions. This dataset is supposed to help
policymakers understand patterns, risks, and safety measures. But when you open
the data… it’s a mess.
Ages are
negative. Dates are inconsistent. Some records say “NULL,” others are blank.
Duplicate entries inflate incident counts. And worst of all critical fields
like weapon type or incident severity are missing.
Now
imagine building a report on top of this.
The
result? Misleading insights, poor decisions, and potentially dangerous
conclusions.
This is
exactly why data cleaning is not optional it’s foundational.
Tools
like SAS and R are not just programming languages they are data
integrity engines. And within SAS, one of the most underrated yet powerful
tools is the RETAIN statement, which helps maintain state across
observations crucial for handling sequential logic, imputation, and
carry-forward corrections.
In this blog, we’ll simulate a Guns dataset with intentional errors, clean it step-by-step in SAS and R, and demonstrate how RETAIN transforms data logic from reactive to intelligent.
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset with Errors)
DATA guns_raw;
INFILE DATALINES MISSOVER;
INPUT ID Age Location $ Weapon $
Incident_Date:$9. Severity $;
DATALINES;
1 25 NY Pistol 12JAN2023 High
2 -30 CA Rifle 15FEB2023 Medium
3 . TX NULL 20MAR2023 Low
4 40 FL Shotgun WRONGDATE High
5 25 NY Pistol 12JAN2023 High
6 60 CA Rifle 18APR2023
7 . TX Pistol 22MAY2023 Medium
8 35 FL NULL 25JUN2023 Low
9 -5 NY Shotgun 30JUL2023 High
10 45 CA Rifle 10AUG2023 Medium
11 50 TX Pistol 12SEP2023 NULL
12 . FL Shotgun 15OCT2023 High
13 70 NY Rifle 20NOV2023 Medium
14 80 CA NULL 25DEC2023 Low
15 30 TX Pistol 01JAN2024 High
;
RUN;
PROC PRINT DATA=guns_raw;
RUN;
OUTPUT:
| Obs | ID | Age | Location | Weapon | Incident_Date | Severity |
|---|---|---|---|---|---|---|
| 1 | 1 | 25 | NY | Pistol | 12JAN2023 | High |
| 2 | 2 | -30 | CA | Rifle | 15FEB2023 | Medium |
| 3 | 3 | . | TX | NULL | 20MAR2023 | Low |
| 4 | 4 | 40 | FL | Shotgun | WRONGDATE | High |
| 5 | 5 | 25 | NY | Pistol | 12JAN2023 | High |
| 6 | 6 | 60 | CA | Rifle | 18APR2023 | |
| 7 | 7 | . | TX | Pistol | 22MAY2023 | Medium |
| 8 | 8 | 35 | FL | NULL | 25JUN2023 | Low |
| 9 | 9 | -5 | NY | Shotgun | 30JUL2023 | High |
| 10 | 10 | 45 | CA | Rifle | 10AUG2023 | Medium |
| 11 | 11 | 50 | TX | Pistol | 12SEP2023 | NULL |
| 12 | 12 | . | FL | Shotgun | 15OCT2023 | High |
| 13 | 13 | 70 | NY | Rifle | 20NOV2023 | Medium |
| 14 | 14 | 80 | CA | NULL | 25DEC2023 | Low |
| 15 | 15 | 30 | TX | Pistol | 01JAN2024 | High |
Explanation (SAS Raw Data)
This
dataset intentionally introduces real-world data quality issues:
- Missing Age (.)
- Negative Age values
- Invalid dates (WRONGDATE)
- Duplicate records (ID 1
& 5)
- Inconsistent text (NULL,
blank Severity)
This
mimics raw ingestion datasets seen in clinical or operational systems
before SDTM mapping.
Key Points
- DATALINES is useful for
quick prototyping
- Missing numeric = .
- Character inconsistencies
simulate real ETL challenges
- Duplicate records often
occur in system merges
R Code (Equivalent Dataset)
guns_raw <- data.frame(
ID = 1:15,
Age = c(25, -30, NA, 40, 25, 60, NA, 35, -5, 45, 50, NA, 70, 80, 30),
Location = c("NY","CA","TX","FL","NY","CA","TX","FL","NY","CA","TX",
"FL","NY","CA","TX"),
Weapon = c("Pistol","Rifle","NULL","Shotgun","Pistol","Rifle","Pistol",
"NULL","Shotgun","Rifle","Pistol","Shotgun","Rifle","NULL",
"Pistol"),
Incident_Date = c("12JAN2023","15FEB2023","20MAR2023","WRONGDATE",
"12JAN2023","18APR2023","22MAY2023","25JUN2023",
"30JUL2023","10AUG2023","12SEP2023","15OCT2023",
"20NOV2023","25DEC2023","01JAN2024"),
Severity = c("High","Medium","Low","High","High","","Medium","Low",
"High","Medium","NULL","High","Medium","Low","High")
)
OUTPUT:
|
|
ID |
Age |
Location |
Weapon |
Incident_Date |
Severity |
|
1 |
1 |
25 |
NY |
Pistol |
12-Jan-23 |
High |
|
2 |
2 |
-30 |
CA |
Rifle |
15-Feb-23 |
Medium |
|
3 |
3 |
NA |
TX |
NULL |
20-Mar-23 |
Low |
|
4 |
4 |
40 |
FL |
Shotgun |
WRONGDATE |
High |
|
5 |
5 |
25 |
NY |
Pistol |
12-Jan-23 |
High |
|
6 |
6 |
60 |
CA |
Rifle |
18-Apr-23 |
|
|
7 |
7 |
NA |
TX |
Pistol |
22-May-23 |
Medium |
|
8 |
8 |
35 |
FL |
NULL |
25-Jun-23 |
Low |
|
9 |
9 |
-5 |
NY |
Shotgun |
30-Jul-23 |
High |
|
10 |
10 |
45 |
CA |
Rifle |
10-Aug-23 |
Medium |
|
11 |
11 |
50 |
TX |
Pistol |
12-Sep-23 |
NULL |
|
12 |
12 |
NA |
FL |
Shotgun |
15-Oct-23 |
High |
|
13 |
13 |
70 |
NY |
Rifle |
20-Nov-23 |
Medium |
|
14 |
14 |
80 |
CA |
NULL |
25-Dec-23 |
Low |
|
15 |
15 |
30 |
TX |
Pistol |
01-Jan-24 |
High |
Explanation (R Raw Data)
R uses data.frame()
for structured datasets. NA represents missing values. Character
inconsistencies remain intact, requiring explicit cleaning.
Key Points
- NA handling differs from SAS
(. vs NA)
- Character vectors are
flexible but error-prone
- Data types need validation
before analysis
3. Phase 1: Data Cleaning in SAS (Using RETAIN)
SAS Cleaning Code
PROC SORT DATA=guns_raw NODUPKEY OUT=guns_nodup;
BY ID;
RUN;
PROC PRINT DATA=guns_nodup;
RUN;
LOG:
OUTPUT:
| Obs | ID | Age | Location | Weapon | Incident_Date | Severity |
|---|---|---|---|---|---|---|
| 1 | 1 | 25 | NY | Pistol | 12JAN2023 | High |
| 2 | 2 | -30 | CA | Rifle | 15FEB2023 | Medium |
| 3 | 3 | . | TX | NULL | 20MAR2023 | Low |
| 4 | 4 | 40 | FL | Shotgun | WRONGDATE | High |
| 5 | 5 | 25 | NY | Pistol | 12JAN2023 | High |
| 6 | 6 | 60 | CA | Rifle | 18APR2023 | |
| 7 | 7 | . | TX | Pistol | 22MAY2023 | Medium |
| 8 | 8 | 35 | FL | NULL | 25JUN2023 | Low |
| 9 | 9 | -5 | NY | Shotgun | 30JUL2023 | High |
| 10 | 10 | 45 | CA | Rifle | 10AUG2023 | Medium |
| 11 | 11 | 50 | TX | Pistol | 12SEP2023 | NULL |
| 12 | 12 | . | FL | Shotgun | 15OCT2023 | High |
| 13 | 13 | 70 | NY | Rifle | 20NOV2023 | Medium |
| 14 | 14 | 80 | CA | NULL | 25DEC2023 | Low |
| 15 | 15 | 30 | TX | Pistol | 01JAN2024 | High |
DATA guns_clean;
SET guns_nodup;
RETAIN Last_Valid_Age;
* Handle missing and negative age;
IF Age = . OR Age < 0 THEN Age = Last_Valid_Age;
ELSE Last_Valid_Age = Age;
* Fix weapon inconsistencies;
IF STRIP(UPCASE(Weapon)) = "NULL" OR STRIP(Weapon) = ""
THEN Weapon = "UNKNOWN";
Weapon = PROPCASE(STRIP(Weapon));
* Fix severity;
IF STRIP(UPCASE(Severity)) = "NULL" OR STRIP(Severity) = ""
THEN Severity = "MEDIUM";
Severity = PROPCASE(STRIP(Severity));
* SAFE DATE CONVERSION USING ??;
Incident_Date_Num = INPUT(Incident_Date, ?? DATE9.);
FORMAT Incident_Date_Num DATE9.;
IF Incident_Date_Num = . THEN Incident_Date_Num = TODAY();
DROP Incident_Date Last_Valid_Age;
RENAME Incident_Date_Num = Incident_Date;
RUN;
PROC PRINT DATA=guns_clean;
RUN;
| Obs | ID | Age | Location | Weapon | Severity | Incident_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | 25 | NY | Pistol | High | 12JAN2023 |
| 2 | 2 | 25 | CA | Rifle | Medium | 15FEB2023 |
| 3 | 3 | 25 | TX | Unknown | Low | 20MAR2023 |
| 4 | 4 | 40 | FL | Shotgun | High | 12APR2026 |
| 5 | 5 | 25 | NY | Pistol | High | 12JAN2023 |
| 6 | 6 | 60 | CA | Rifle | Medium | 18APR2023 |
| 7 | 7 | 60 | TX | Pistol | Medium | 22MAY2023 |
| 8 | 8 | 35 | FL | Unknown | Low | 25JUN2023 |
| 9 | 9 | 35 | NY | Shotgun | High | 30JUL2023 |
| 10 | 10 | 45 | CA | Rifle | Medium | 10AUG2023 |
| 11 | 11 | 50 | TX | Pistol | Medium | 12SEP2023 |
| 12 | 12 | 50 | FL | Shotgun | High | 15OCT2023 |
| 13 | 13 | 70 | NY | Rifle | Medium | 20NOV2023 |
| 14 | 14 | 80 | CA | Unknown | Low | 25DEC2023 |
| 15 | 15 | 30 | TX | Pistol | High | 01JAN2024 |
Explanation (SAS Cleaning + RETAIN)
This is
where SAS shines.
- PROC SORT NODUPKEY removes
duplicate IDs
- RETAIN Last_Valid_Age holds
previous valid value
- Missing/invalid age is
replaced using retained value
- Dates are converted using INPUT()
- Invalid dates default to
today's date
RETAIN
enables forward-fill logic, commonly used in clinical datasets (e.g.,
last observation carried forward – LOCF)
Key Points
- RETAIN prevents
reinitialization each iteration
- Useful for sequential
dependency logic
- Essential in time-series or
patient-level datasets
- INPUT + FORMAT ensures date
integrity
4. Phase 2: Data Cleaning in R
R Cleaning Code
library(dplyr)
library(stringr)
library(zoo)
guns_clean <- guns_raw %>%
distinct(ID, .keep_all = TRUE) %>%
mutate(
Age = ifelse(Age < 0, NA, Age),
Age = zoo::na.locf(Age, na.rm = FALSE),
Weapon = ifelse(str_trim(str_to_upper(Weapon)) %in% c("NULL",""),
"UNKNOWN", Weapon),
Weapon = str_to_title(str_trim(Weapon)),
Severity = ifelse(str_trim(str_to_upper(Severity)) %in% c("NULL",""),
"MEDIUM", Severity),
Severity = str_to_title(str_trim(Severity)),
Incident_Date = as.Date(Incident_Date, format="%d%b%Y")
)
guns_clean$Incident_Date[is.na(guns_clean$Incident_Date)] <- Sys.Date()
OUTPUT:
|
|
ID |
Age |
Location |
Weapon |
Incident_Date |
Severity |
|
1 |
1 |
25 |
NY |
Pistol |
12-01-2023 |
High |
|
2 |
2 |
25 |
CA |
Rifle |
15-02-2023 |
Medium |
|
3 |
3 |
25 |
TX |
Unknown |
20-03-2023 |
Low |
|
4 |
4 |
40 |
FL |
Shotgun |
12-04-2026 |
High |
|
5 |
5 |
25 |
NY |
Pistol |
12-01-2023 |
High |
|
6 |
6 |
60 |
CA |
Rifle |
18-04-2023 |
Medium |
|
7 |
7 |
60 |
TX |
Pistol |
22-05-2023 |
Medium |
|
8 |
8 |
35 |
FL |
Unknown |
25-06-2023 |
Low |
|
9 |
9 |
35 |
NY |
Shotgun |
30-07-2023 |
High |
|
10 |
10 |
45 |
CA |
Rifle |
10-08-2023 |
Medium |
|
11 |
11 |
50 |
TX |
Pistol |
12-09-2023 |
Medium |
|
12 |
12 |
50 |
FL |
Shotgun |
15-10-2023 |
High |
|
13 |
13 |
70 |
NY |
Rifle |
20-11-2023 |
Medium |
|
14 |
14 |
80 |
CA |
Unknown |
25-12-2023 |
Low |
|
15 |
15 |
30 |
TX |
Pistol |
01-01-2024 |
High |
Explanation (R Cleaning)
- distinct() removes
duplicates
- lag() mimics RETAIN behavior
- mutate() applies
transformations
- as.Date() converts date formats
lag()≠RETAINna.locf()= true RETAIN equivalent- Always standardize BEFORE formatting text
Key Points
- lag() is R’s equivalent of
RETAIN
- dplyr simplifies pipeline
logic
- Date parsing is stricter
than SAS
- Vectorized operations
improve efficiency
5. 20 Additional Data
Cleaning Best Practices
- Always validate raw data
before transformation
- Maintain audit trails for
regulatory compliance
- Use SDTM standards for
clinical datasets
- Never overwrite raw datasets
- Document all derivations
(ADaM requirement)
- Validate ranges (e.g., age
0–120)
- Use controlled terminology
- Handle missing data
systematically
- Log all corrections
- Use macros for reusable
logic
- Perform QC checks
independently
- Avoid hardcoding values
- Normalize categorical
variables
- Check date consistency
across domains
- Validate keys before joins
- Ensure traceability from raw
→ final
- Use metadata-driven
programming
- Perform duplicate checks
regularly
- Align with SAP
specifications
- Use validation tools (e.g.,
Pinnacle 21)
6. Business Logic Behind
Data Cleaning
Data
cleaning is not just technical it’s business-critical logic.
- Missing Age → replaced using
last valid value
π Prevents bias in demographic analysis - Negative Age → corrected
π Ensures realistic modeling - Date Imputation
π Maintains timeline consistency - Severity normalization
π Ensures consistent reporting
Example:
If a
patient’s age is missing in a clinical trial:
- Leaving it blank → excludes
them from analysis
- Imputing intelligently →
retains statistical power
7. 20 Key Points (Sharp
Insights)
- Dirty data leads to wrong
conclusions
- RETAIN enables memory in SAS
- Standardization ensures
reproducibility
- Missing data must be handled
strategically
- Duplicate records distort
analysis
- Date errors break timelines
- Text inconsistencies reduce
accuracy
- Validation is as important
as analysis
- SAS excels in structured
pipelines
- R excels in flexibility
- Audit trails are mandatory
in pharma
- LOCF is common in clinical
trials
- Always separate raw and
cleaned data
- QC checks prevent downstream
errors
- Business logic drives
cleaning rules
- Data cleaning is 70% of
analytics work
- Never trust raw data blindly
- Automation reduces human
error
- Clean data = reliable
insights
- Consistency is the backbone
of analytics
8. SAS vs R Comparison
|
Feature |
SAS |
R |
|
Data
Handling |
Structured |
Flexible |
|
RETAIN
Logic |
Native |
lag()
workaround |
|
Regulatory
Use |
Strong
(FDA) |
Limited |
|
Learning
Curve |
Moderate |
Steep |
|
Automation |
Macro-based |
Functional |
|
Logic |
SAS |
R |
|
Remove
duplicates |
PROC
SORT NODUPKEY |
distinct() |
|
RETAIN
logic |
RETAIN
+ carry forward |
zoo::na.locf() |
|
NULL
handling |
STRIP +
UPCASE |
str_trim
+ str_to_upper |
|
Case
formatting |
PROPCASE |
str_to_title |
|
Date
conversion |
INPUT(??
DATE9.) |
as.Date() |
|
Missing
date fix |
TODAY() |
Sys.Date() |
9. Summary
We
transformed a messy Guns dataset into a reliable analytical asset using:
- SAS: Structured,
RETAIN-driven logic
- R: Flexible, pipeline-based
transformations
Both
tools are powerful but SAS dominates in regulated environments, while R
excels in exploratory analytics.
10. Conclusion
Data
cleaning is not a preprocessing step it’s the foundation of truth in
analytics.
The
RETAIN statement in SAS elevates your logic from row-based corrections to context-aware
intelligence.
Whether
you're working in clinical trials, finance, or public safety
π Clean data is not a luxury. It’s a
responsibility.
11. Interview Questions
Q1: What is RETAIN in SAS?
π It prevents variables from resetting each
iteration.
Q2: How do you handle missing values in SAS?
π Using IF conditions, COALESCEC, or RETAIN logic.
Q3: How does R handle RETAIN-like behavior?
π Using lag() function.
Q4: What is NODUPKEY in PROC SORT?
π Removes duplicate records based on BY variables.
Q5: A dataset has negative ages and missing dates. What will you do?
π Steps:
- Validate ranges
- Replace negatives using ABS
or logic
- Impute missing dates
- Document all changes
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about GUNS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment