460. The Dark Side of Data: Eliminating Outliers in Crime Datasets with SAS
Bloodlines & Broken Data: Mastering Outlier Detection in SAS with a Global Crime Dataset
1. Introduction
Imagine you’re working on a global crime analytics project for an international intelligence agency. The goal is simple: identify patterns in high-profile murder cases across countries. But when you open the dataset, chaos greets you.
Victim ages are negative. Dates like 2023-02-30 exist. Countries appear as “usa”, “USA”, and “Usa”. Some murder counts are absurdly high, like 10,000 cases in a small town. Duplicate records silently distort trends.
This isn’t just messy; it’s dangerous.
Bad data doesn’t just slow analysis; it destroys decision-making. In clinical trials, it could mislead drug safety conclusions. In business, it could result in million-dollar mistakes.
This is where SAS and R become powerful allies. They don’t just process data; they enforce discipline, structure, and validation. And today, we’ll go deeper into one critical aspect: detecting outliers in SAS datasets and cleaning them effectively.
We’ll simulate a “Top Murders in World” dataset, intentionally inject errors, and then fix everything step by step, like a real-world data engineer.
2. Raw Data Creation in SAS and R
We first create a flawed dataset with intentional errors.
SAS Code (Raw Dataset)
DATA murders_raw;
INPUT Case_ID Country $ Year Victim_Age Murder_Count Date : $12.
Status $ Investigator $;
DATALINES;
1 USA 2020 34 5 2020-05-12 Closed John
2 india 2021 -45 8 2021-07-22 Open Ravi
3 UK 2022 29 9999 2022-13-01 Closed Smith
4 USA 2020 34 5 2020-05-12 Closed John
5 NULL 2019 40 . 2019-11-05 Open NULL
6 Brazil 2023 150 3 2023-02-30 Closed Ana
7 India 2021 25 7 2021-08-15 Open Ravi
8 usa 2022 30 -10 2022-09-10 Closed John
9 UK 2020 . 6 2020-03-25 Closed Smith
10 India 2021 27 7 2021-08-15 Open Ravi
;
RUN;
PROC PRINT DATA = murders_raw;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | india | 2021 | -45 | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | 9999 | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | NULL | 2019 | 40 | . | 2019-11-05 | Open | NULL |
| 6 | 6 | Brazil | 2023 | 150 | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | India | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | usa | 2022 | 30 | -10 | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | India | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
This dataset intentionally mimics real-world inconsistencies. We included missing values (.), invalid ages (-45, 150), duplicate records (Case_ID 1 & 4), inconsistent country formats (USA, usa, india), and impossible dates (2023-02-30). The Murder_Count variable includes extreme outliers like 9999 and negative values. These issues replicate challenges seen in clinical trial datasets (e.g., incorrect lab values, duplicate patient IDs). This raw structure forms the foundation for demonstrating data cleaning and outlier detection techniques using SAS procedures like PROC UNIVARIATE, PROC MEANS, and conditional transformations.
R Code – Equivalent Raw Dataset
murders_raw <- data.frame(
Case_ID = c(1,2,3,4,5,6,7,8,9,10),
Country = c("USA","india","UK","USA","NULL","Brazil","India",
"usa","UK","India"),
Year = c(2020,2021,2022,2020,2019,2023,2021,2022,2020,2021),
Victim_Age = c(34,-45,29,34,40,150,25,30,NA,27),
Murder_Count = c(5,8,9999,5,NA,3,7,-10,6,7),
Date = c("2020-05-12","2021-07-22","2022-13-01","2020-05-12",
"2019-11-05","2023-02-30","2021-08-15","2022-09-10",
"2020-03-25","2021-08-15"),
Status = c("Closed","Open","Closed","Closed","Open","Closed",
"Open","Closed","Closed","Open"),
Investigator = c("John","Ravi","Smith","John","NULL","Ana","Ravi",
"John","Smith","Ravi")
)
OUTPUT:
| | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | india | 2021 | -45 | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | 9999 | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | NULL | 2019 | 40 | NA | 2019-11-05 | Open | NULL |
| 6 | 6 | Brazil | 2023 | 150 | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | India | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | usa | 2022 | 30 | -10 | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | NA | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | India | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
The R dataset mirrors SAS but uses data.frame(). It includes NA for missing values, inconsistent casing, and incorrect numeric entries. R’s flexibility allows fast prototyping, but without validation, errors propagate quickly. This dataset will later be cleaned using dplyr functions like mutate(), filter(), and distinct(). The goal is to show parity between SAS and R workflows while emphasizing SAS’s strength in regulated environments.
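Before cleaning, it helps to quantify how dirty the data actually is. The sketch below is a minimal audit pass in Python (used here only as a language-neutral illustration; the row values are copied from the example dataset, and the dictionary field names are our own shorthand, not part of the SAS or R code):

```python
# Count data-quality issues in the raw example rows before cleaning.
rows = [
    {"age": 34,   "count": 5,    "country": "USA"},
    {"age": -45,  "count": 8,    "country": "india"},
    {"age": 29,   "count": 9999, "country": "UK"},
    {"age": 34,   "count": 5,    "country": "USA"},
    {"age": 40,   "count": None, "country": "NULL"},
    {"age": 150,  "count": 3,    "country": "Brazil"},
    {"age": 25,   "count": 7,    "country": "India"},
    {"age": 30,   "count": -10,  "country": "usa"},
    {"age": None, "count": 6,    "country": "UK"},
    {"age": 27,   "count": 7,    "country": "India"},
]
# Ages outside 0-120 and counts outside 0-100 violate the business rules
bad_age   = sum(1 for r in rows if r["age"] is not None and not 0 <= r["age"] <= 120)
bad_count = sum(1 for r in rows if r["count"] is not None and not 0 <= r["count"] <= 100)
variants  = len({r["country"].upper() for r in rows})  # distinct case-folded countries
print(bad_age, bad_count, variants)  # 2 2 5
```

A report like this, produced before any transformation, gives an audit baseline to verify the cleaning steps against afterwards.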
3. Phase 1: Data Cleaning in SAS
Step 1: Handle Missing & Invalid Values
DATA murders_clean1;
SET murders_raw;
Country = UPCASE(STRIP(Country));
IF Country = "NULL" OR Country = "" THEN Country = "UNKNOWN";
IF Victim_Age < 0 OR Victim_Age > 120 THEN Victim_Age = .;
IF Murder_Count < 0 THEN Murder_Count = .;
IF Murder_Count > 100 THEN Murder_Count = .;
RUN;
PROC PRINT DATA = murders_clean1;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
We standardize text using UPCASE and STRIP, replacing missing or NULL country values with “UNKNOWN”. Invalid ages (negative or >120) are set to missing. Murder counts are validated: negative values and extreme values (>100) are set to missing rather than silently corrected. (Note that in SAS a missing numeric compares less than any number, so already-missing values also satisfy the < 0 checks; the reassignment to missing is harmless here.) This step ensures logical consistency, similar to clinical trial cleaning where unrealistic lab values are removed.
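The same range-check rule can be expressed in any language. A minimal Python sketch (an illustration of the logic, not part of the SAS workflow; the helper name `range_check` is our own), applying the identical bounds of 0–120 for ages and 0–100 for counts:

```python
def range_check(value, low, high):
    """Return the value if it lies in [low, high]; otherwise missing (None)."""
    if value is None or value < low or value > high:
        return None
    return value

ages = [34, -45, 29, 34, 40, 150, 25, 30, None, 27]
counts = [5, 8, 9999, 5, None, 3, 7, -10, 6, 7]
print([range_check(a, 0, 120) for a in ages])
# [34, None, 29, 34, 40, None, 25, 30, None, 27]
print([range_check(c, 0, 100) for c in counts])
# [5, 8, None, 5, None, 3, 7, None, 6, 7]
```

The results match the murders_clean1 output above: -45 and 150 become missing ages, while 9999 and -10 become missing counts.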
Step 2: Date Formatting
DATA murders_clean2;
SET murders_clean1;
Date_clean = INPUT(Date, YYMMDD10.);
FORMAT Date_clean DATE9.;
RUN;
PROC PRINT DATA = murders_clean2;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi | 22JUL2021 |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith | . |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL | 05NOV2019 |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana | . |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John | 10SEP2022 |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
Explanation
Invalid dates like “2022-13-01” are converted to missing, and SAS writes an invalid-data note to the log when a value fails the informat. The INPUT function with the YYMMDD10. informat enforces strict date parsing, which is critical in regulatory datasets like SDTM where date accuracy is essential.
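Strict parsers behave this way in most languages: impossible calendar dates are rejected rather than guessed at. A Python sketch of the same rule (illustrative only; the helper name `parse_date` is ours), using the standard library’s strict `strptime`:

```python
from datetime import datetime

def parse_date(text):
    """Strictly parse YYYY-MM-DD; impossible calendar dates become missing (None)."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

print(parse_date("2020-05-12"))  # valid date
print(parse_date("2022-13-01"))  # month 13 -> None
print(parse_date("2023-02-30"))  # Feb 30 -> None
```

Like SAS’s INPUT with YYMMDD10., the parser turns both the month-13 and the February-30 values into missing instead of inventing a plausible date.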
Step 3: Remove Duplicates
PROC SORT DATA=murders_clean2 NODUPKEY;
BY Case_ID;
RUN;
PROC PRINT DATA = murders_clean2;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi | 22JUL2021 |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith | . |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL | 05NOV2019 |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana | . |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John | 10SEP2022 |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
Explanation
NODUPKEY removes rows that share the same BY key. One caveat: in this dataset every Case_ID is unique, so sorting BY Case_ID drops nothing, which is why the output still shows all ten rows. The true duplicate (Cases 1 and 4) is identical in every field except Case_ID; to remove it, sort BY the substantive fields instead (e.g., BY Country Year Victim_Age Murder_Count Date Status Investigator). In clinical trials, duplicate subject IDs can invalidate analysis, which is why NODUPKEY on the primary key remains a standard first check.
Step 4: Outlier Detection
PROC UNIVARIATE DATA=murders_clean2;
VAR Murder_Count;
RUN;
OUTPUT:
The UNIVARIATE Procedure
Variable: Murder_Count
| Moments | |||
|---|---|---|---|
| N | 7 | Sum Weights | 7 |
| Mean | 5.85714286 | Sum Observations | 41 |
| Std Deviation | 1.67616342 | Variance | 2.80952381 |
| Skewness | -0.5824439 | Kurtosis | 0.05188164 |
| Uncorrected SS | 257 | Corrected SS | 16.8571429 |
| Coeff Variation | 28.6174242 | Std Error Mean | 0.63353022 |
| Basic Statistical Measures | | | |
|---|---|---|---|
| Location | | Variability | |
| Mean | 5.857143 | Std Deviation | 1.67616 |
| Median | 6.000000 | Variance | 2.80952 |
| Mode | 5.000000 | Range | 5.00000 |
| | | Interquartile Range | 2.00000 |
Note: The mode displayed is the smallest of 2 modes with a count of 2.
| Tests for Location: Mu0=0 | | | | |
|---|---|---|---|---|
| Test | Statistic | | p Value | |
| Student's t | t | 9.245246 | Pr > \|t\| | <.0001 |
| Sign | M | 3.5 | Pr >= \|M\| | 0.0156 |
| Signed Rank | S | 14 | Pr >= \|S\| | 0.0156 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 8 |
| 99% | 8 |
| 95% | 8 |
| 90% | 8 |
| 75% Q3 | 7 |
| 50% Median | 6 |
| 25% Q1 | 5 |
| 10% | 3 |
| 5% | 3 |
| 1% | 3 |
| 0% Min | 3 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 3 | 6 | 5 | 4 |
| 5 | 4 | 6 | 9 |
| 5 | 1 | 7 | 7 |
| 6 | 9 | 7 | 10 |
| 7 | 10 | 8 | 2 |
| Missing Values | |||
|---|---|---|---|
| Missing Value | Count | Percent Of | |
| All Obs | Missing Obs | ||
| . | 3 | 30.00 | 100.00 |
Explanation
This identifies extreme values using statistical summaries. Outliers can distort mean and variance. In real-world analytics, identifying these ensures robust modeling.
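The quantiles reported by PROC UNIVARIATE (Q1 = 5, Q3 = 7) feed directly into the classic IQR fence rule: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is a candidate outlier. A Python sketch of that calculation (illustrative only; the values are the non-missing cleaned Murder_Count observations):

```python
import statistics

# Non-missing Murder_Count values after cleaning
counts = [5, 8, 5, 3, 7, 6, 7]
q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
iqr = q3 - q1                                   # 7 - 5 = 2, matching PROC UNIVARIATE
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences at 2 and 10
outliers = [v for v in counts if v < lower or v > upper]
print(lower, upper, outliers)  # 2.0 10.0 []
```

After cleaning, no value breaches the fences; on the raw data, 9999 and -10 would both fall far outside them, which is exactly why they were caught in Step 1.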
4. Phase 2: Data Cleaning in R
library(dplyr)
murders_clean <- murders_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country=="NULL",
"UNKNOWN",Country))),
Victim_Age = ifelse(Victim_Age < 0 | Victim_Age > 120,
NA, Victim_Age),
Murder_Count = ifelse(Murder_Count < 0 | Murder_Count > 100,
NA, Murder_Count)
) %>%
distinct(Case_ID, .keep_all = TRUE)
OUTPUT:
| | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | INDIA | 2021 | NA | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | NA | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | UNKNOWN | 2019 | 40 | NA | 2019-11-05 | Open | NULL |
| 6 | 6 | BRAZIL | 2023 | NA | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | USA | 2022 | 30 | NA | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | NA | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
R uses mutate() for transformation and distinct() for duplicates, with ifelse() handling the conditional cleaning. (As with the SAS step, distinct() on Case_ID alone keeps all ten rows, since every Case_ID is unique; catching the content duplicate requires keying on the substantive columns.) While R is flexible, it requires disciplined coding to ensure reproducibility, especially in regulated domains.
5. Phase 3: Additional SAS Code (Advanced Outlier Handling)
PROC MEANS DATA=murders_clean2 N MEAN STD MIN MAX;
VAR Murder_Count;
RUN;
OUTPUT:
The MEANS Procedure
| Analysis Variable : Murder_Count | ||||
|---|---|---|---|---|
| N | Mean | Std Dev | Minimum | Maximum |
| 7 | 5.8571429 | 1.6761634 | 3.0000000 | 8.0000000 |
DATA murders_outliers;
SET murders_clean2;
/* Flag values beyond mean + 3*STD: 5.857 + 3*1.676, i.e. approx 10.9 */
IF Murder_Count > 10.9 THEN Flag_Outlier = 1;
ELSE Flag_Outlier = 0;
RUN;
PROC PRINT DATA = murders_outliers;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator | Date_clean | Flag_Outlier |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 | 0 |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi | 22JUL2021 | 0 |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith | . | 0 |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 | 0 |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL | 05NOV2019 | 0 |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana | . | 0 |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 | 0 |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John | 10SEP2022 | 0 |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 | 0 |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 | 0 |
Explanation
We compute summary statistics and then flag outliers against a threshold. Threshold-based detection (e.g., mean + 3×STD) is common in statistical modeling, but the threshold should be derived from the computed statistics rather than hardcoded; in production SAS code it would typically be pulled dynamically, for example via CALL SYMPUTX or PROC SQL. This step ensures extreme values are flagged rather than blindly removed.
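The mean + 3×STD rule is easy to verify by hand. A Python sketch (illustrative only; the values are the cleaned, non-missing Murder_Count observations matching the PROC MEANS output above):

```python
import statistics

counts = [5, 8, 5, 3, 7, 6, 7]      # cleaned Murder_Count values
mean = statistics.mean(counts)       # ~5.857, matching PROC MEANS
std = statistics.stdev(counts)       # ~1.676 (sample standard deviation)
threshold = mean + 3 * std           # ~10.9
flags = [1 if v > threshold else 0 for v in counts]
print(flags)  # [0, 0, 0, 0, 0, 0, 0]
# The raw value 9999 would exceed this threshold by orders of magnitude.
```

Because the threshold is computed from the data, it adapts if the dataset changes; a hardcoded cutoff would silently go stale.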
6. 20 Additional Data Cleaning Best Practices
- Always validate primary keys (USUBJID in SDTM)
- Maintain audit trails for every transformation
- Use controlled terminology (CDISC standards)
- Validate date formats strictly
- Never overwrite raw data
- Use macros for repeatability
- Perform range checks on numeric fields
- Standardize categorical variables
- Document derivation logic
- Validate merges carefully
- Use PROC COMPARE for validation
- Ensure traceability from SDTM to ADaM
- Flag missing critical variables
- Perform cross-domain consistency checks
- Validate units (mg vs g)
- Automate QC checks
- Use metadata-driven programming
- Log all warnings/errors
- Avoid hardcoding values
- Follow regulatory compliance (FDA/EMA)
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic. Missing values are often replaced because they can bias analysis. For example, in a clinical trial, missing patient age could distort subgroup analysis. Similarly, unrealistic values like age = 150 must be corrected because they violate biological constraints. In financial datasets, salary normalization ensures comparability across regions.
Outliers are particularly critical. A murder count of 9999 in a small dataset is likely a data entry error, not reality. If left untreated, it inflates averages and misleads decision-makers. Date imputation is another key area: incorrect dates can disrupt time-series analysis.
Ultimately, data cleaning ensures trustworthy insights. Without it, even the most advanced models fail.
8. 20 Key Points About This Project
- Dirty data leads to wrong conclusions
- Outliers distort statistical models
- Standardization ensures reproducibility
- Missing data must be handled carefully
- Duplicate records inflate metrics
- SAS ensures regulatory compliance
- R offers flexibility in exploration
- Date errors break timelines
- Validation is non-negotiable
- Cleaning improves model accuracy
- Audit trails ensure traceability
- Controlled terminology avoids confusion
- Range checks prevent unrealistic values
- Automation reduces manual errors
- Metadata drives consistency
- QC checks ensure reliability
- Outliers must be flagged, not ignored
- Data integrity is critical in healthcare
- Structured cleaning saves time
- Clean data = trustworthy insights
9. Summary
Data cleaning is the backbone of reliable analytics, and this blog demonstrated how messy real-world datasets, like our simulated global murder dataset, can be systematically cleaned using SAS and R. We introduced intentional data issues such as missing values, invalid entries, duplicates, and inconsistent text formats to replicate real-world complexity. Using SAS, we applied structured techniques like UPCASE/STRIP standardization, conditional logic, PROC SORT, and PROC UNIVARIATE to clean and validate the data. SAS proved particularly strong in enforcing data integrity and handling regulated workflows.
In parallel, R provided a flexible and intuitive approach using dplyr functions such as mutate() and distinct(). While R excels in exploratory data analysis, it requires disciplined coding practices to match SAS’s robustness in production environments.
A major focus was outlier detection, which is critical for preventing distorted insights. We used statistical methods and logical thresholds to identify and handle extreme values. Additionally, we explored best practices aligned with clinical trial standards (SDTM/ADaM), emphasizing audit trails, validation checks, and regulatory compliance.
Ultimately, both SAS and R are powerful, but their effectiveness depends on how well data cleaning principles are applied. Clean data ensures accuracy, scalability, and trust in analytics, making it indispensable in both business and healthcare domains.
10. Conclusion
In the world of data analytics, tools alone do not guarantee success; discipline in data handling does. Through this deep dive into a “Top Murders in World” dataset, we exposed how easily data can become unreliable and how critical structured cleaning frameworks are.
SAS stands out as a gold standard in environments where compliance, traceability, and reproducibility are essential, especially in clinical trials. Its procedural rigor ensures that every transformation is documented and validated. R, on the other hand, provides unmatched flexibility and speed for exploratory analysis, making it invaluable for early-stage data investigation.
Outlier detection emerged as a central theme. Ignoring outliers can lead to flawed conclusions, while blindly removing them can erase meaningful signals. The right approach is to detect, analyze, and justify.
The broader lesson is this: data cleaning is not a preprocessing step; it is a core analytical responsibility. Whether you're working on healthcare data, financial systems, or crime analytics, the integrity of your dataset determines the quality of your insights.
Adopting best practices (standardization, validation, audit trails, and automation) ensures that your data pipeline is not just functional but trustworthy.
In the end, clean data is not a luxury; it is a necessity.
11. Interview Questions
Q1: How do you detect outliers in SAS?
Answer: Use PROC UNIVARIATE, PROC MEANS, or IQR-based methods. Example: flag values beyond mean + 3*STD, or outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
Q2: How do you handle duplicates?
Answer: Use PROC SORT with the NODUPKEY option and an appropriate BY statement, e.g. PROC SORT DATA=ds NODUPKEY; BY key_vars; RUN;
Q3: What is the difference between SAS and R in data cleaning?
Answer: SAS is structured and regulatory-friendly; R is flexible and exploratory.
Q4: Scenario: Age = -10 in the dataset. What do you do?
Answer: Validate against the source and either set the value to missing or correct it using documented business rules.
Q5: How do you debug inconsistent text values?
Answer: Use UPCASE, STRIP, and controlled terminology mapping.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are simulated and created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities