460. The Dark Side of Data: Eliminating Outliers in Crime Datasets with SAS
Bloodlines & Broken Data: Mastering Outlier Detection in SAS with a Global Crime Dataset
1. Introduction
Imagine you’re working on a global crime analytics project for an international intelligence agency. The goal is simple: identify patterns in high-profile murder cases across countries. But when you open the dataset, chaos greets you.
Victim ages are negative. Dates like 2023-02-30 exist. Countries appear as “usa”, “USA”, and “Usa”. Some murder counts are absurdly high, like 10,000 cases in a small town. Duplicate records silently distort trends.
This isn’t just messy; it’s dangerous.
Bad data doesn’t just slow analysis; it destroys decision-making. In clinical trials, it could mislead drug safety conclusions. In business, it could result in million-dollar mistakes.
This is where SAS and R become powerful allies. They don’t just process data; they enforce discipline, structure, and validation. And today, we’ll go deeper into one critical aspect: detecting outliers in SAS datasets and cleaning them effectively.
We’ll simulate a “Top Murders in World” dataset, intentionally inject errors, and then fix everything step by step, like a real-world data engineer.
2. Raw Data Creation in SAS and R
We first create a flawed dataset with intentional errors.
SAS Code (Raw Dataset)
DATA murders_raw;
INPUT Case_ID Country $ Year Victim_Age Murder_Count Date : $12.
Status $ Investigator $;
DATALINES;
1 USA 2020 34 5 2020-05-12 Closed John
2 india 2021 -45 8 2021-07-22 Open Ravi
3 UK 2022 29 9999 2022-13-01 Closed Smith
4 USA 2020 34 5 2020-05-12 Closed John
5 NULL 2019 40 . 2019-11-05 Open NULL
6 Brazil 2023 150 3 2023-02-30 Closed Ana
7 India 2021 25 7 2021-08-15 Open Ravi
8 usa 2022 30 -10 2022-09-10 Closed John
9 UK 2020 . 6 2020-03-25 Closed Smith
10 India 2021 27 7 2021-08-15 Open Ravi
;
RUN;
PROC PRINT DATA = murders_raw;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | india | 2021 | -45 | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | 9999 | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | NULL | 2019 | 40 | . | 2019-11-05 | Open | NULL |
| 6 | 6 | Brazil | 2023 | 150 | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | India | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | usa | 2022 | 30 | -10 | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | India | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
This dataset intentionally mimics real-world inconsistencies. We included missing values (.), invalid ages (-45, 150), duplicate records (Case_ID 1 & 4), inconsistent country formats (USA, usa, india), and impossible dates (2023-02-30). The Murder_Count variable includes extreme outliers like 9999 and negative values. These issues replicate challenges seen in clinical trial datasets (e.g., incorrect lab values, duplicate patient IDs). This raw structure forms the foundation for demonstrating data cleaning and outlier detection techniques using SAS procedures like PROC UNIVARIATE, PROC MEANS, and conditional transformations.
R Code – Equivalent Raw Dataset
murders_raw <- data.frame(
Case_ID = c(1,2,3,4,5,6,7,8,9,10),
Country = c("USA","india","UK","USA","NULL","Brazil","India",
"usa","UK","India"),
Year = c(2020,2021,2022,2020,2019,2023,2021,2022,2020,2021),
Victim_Age = c(34,-45,29,34,40,150,25,30,NA,27),
Murder_Count = c(5,8,9999,5,NA,3,7,-10,6,7),
Date = c("2020-05-12","2021-07-22","2022-13-01","2020-05-12",
"2019-11-05","2023-02-30","2021-08-15","2022-09-10",
"2020-03-25","2021-08-15"),
Status = c("Closed","Open","Closed","Closed","Open","Closed",
"Open","Closed","Closed","Open"),
Investigator = c("John","Ravi","Smith","John","NULL","Ana","Ravi",
"John","Smith","Ravi")
)
OUTPUT:
| | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | india | 2021 | -45 | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | 9999 | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | NULL | 2019 | 40 | NA | 2019-11-05 | Open | NULL |
| 6 | 6 | Brazil | 2023 | 150 | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | India | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | usa | 2022 | 30 | -10 | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | NA | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | India | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
The R dataset mirrors SAS but uses data.frame(). It includes NA for missing values, inconsistent casing, and incorrect numeric entries. R’s flexibility allows fast prototyping, but without validation, errors propagate quickly. This dataset will later be cleaned using dplyr functions like mutate(), filter(), and distinct(). The goal is to show parity between SAS and R workflows while emphasizing SAS’s strength in regulated environments.
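Before cleaning, it helps to quantify how dirty the data actually is. The sketch below is a minimal audit pass in Python (used here only as a language-neutral illustration; the row values are copied from the example dataset, and the dictionary field names are our own shorthand, not part of the SAS or R code):

```python
# Count data-quality issues in the raw example rows before cleaning.
rows = [
    {"age": 34,   "count": 5,    "country": "USA"},
    {"age": -45,  "count": 8,    "country": "india"},
    {"age": 29,   "count": 9999, "country": "UK"},
    {"age": 34,   "count": 5,    "country": "USA"},
    {"age": 40,   "count": None, "country": "NULL"},
    {"age": 150,  "count": 3,    "country": "Brazil"},
    {"age": 25,   "count": 7,    "country": "India"},
    {"age": 30,   "count": -10,  "country": "usa"},
    {"age": None, "count": 6,    "country": "UK"},
    {"age": 27,   "count": 7,    "country": "India"},
]
# Ages outside 0-120 and counts outside 0-100 violate the business rules
bad_age   = sum(1 for r in rows if r["age"] is not None and not 0 <= r["age"] <= 120)
bad_count = sum(1 for r in rows if r["count"] is not None and not 0 <= r["count"] <= 100)
variants  = len({r["country"].upper() for r in rows})  # distinct case-folded countries
print(bad_age, bad_count, variants)  # 2 2 5
```

A report like this, produced before any transformation, gives an audit baseline to verify the cleaning steps against afterwards.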
3. Phase 1: Data Cleaning in SAS
Step 1: Handle Missing & Invalid Values
DATA murders_clean1;
SET murders_raw;
Country = UPCASE(STRIP(Country));
IF Country = "NULL" OR Country = "" THEN Country = "UNKNOWN";
IF Victim_Age < 0 OR Victim_Age > 120 THEN Victim_Age = .;
IF Murder_Count < 0 THEN Murder_Count = .;
IF Murder_Count > 100 THEN Murder_Count = .;
RUN;
PROC PRINT DATA = murders_clean1;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
We standardize text using UPCASE and STRIP, replacing missing or NULL country values with “UNKNOWN”. Invalid ages (negative or >120) are set to missing. Murder counts are validated: negative values and extreme values (>100) are set to missing rather than silently corrected. (Note that in SAS a missing numeric compares less than any number, so already-missing values also satisfy the < 0 checks; the reassignment to missing is harmless here.) This step ensures logical consistency, similar to clinical trial cleaning where unrealistic lab values are removed.
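The same range-check rule can be expressed in any language. A minimal Python sketch (an illustration of the logic, not part of the SAS workflow; the helper name `range_check` is our own), applying the identical bounds of 0–120 for ages and 0–100 for counts:

```python
def range_check(value, low, high):
    """Return the value if it lies in [low, high]; otherwise missing (None)."""
    if value is None or value < low or value > high:
        return None
    return value

ages = [34, -45, 29, 34, 40, 150, 25, 30, None, 27]
counts = [5, 8, 9999, 5, None, 3, 7, -10, 6, 7]
print([range_check(a, 0, 120) for a in ages])
# [34, None, 29, 34, 40, None, 25, 30, None, 27]
print([range_check(c, 0, 100) for c in counts])
# [5, 8, None, 5, None, 3, 7, None, 6, 7]
```

The results match the murders_clean1 output above: -45 and 150 become missing ages, while 9999 and -10 become missing counts.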
Step 2: Date Formatting
DATA murders_clean2;
SET murders_clean1;
Date_clean = INPUT(Date, YYMMDD10.);
FORMAT Date_clean DATE9.;
RUN;
PROC PRINT DATA = murders_clean2;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi | 22JUL2021 |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith | . |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL | 05NOV2019 |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana | . |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John | 10SEP2022 |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
Explanation
Invalid dates like “2022-13-01” are converted to missing, and SAS writes an invalid-data note to the log when a value fails the informat. The INPUT function with the YYMMDD10. informat enforces strict date parsing, which is critical in regulatory datasets like SDTM where date accuracy is essential.
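Strict parsers behave this way in most languages: impossible calendar dates are rejected rather than guessed at. A Python sketch of the same rule (illustrative only; the helper name `parse_date` is ours), using the standard library’s strict `strptime`:

```python
from datetime import datetime

def parse_date(text):
    """Strictly parse YYYY-MM-DD; impossible calendar dates become missing (None)."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

print(parse_date("2020-05-12"))  # valid date
print(parse_date("2022-13-01"))  # month 13 -> None
print(parse_date("2023-02-30"))  # Feb 30 -> None
```

Like SAS’s INPUT with YYMMDD10., the parser turns both the month-13 and the February-30 values into missing instead of inventing a plausible date.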
Step 3: Remove Duplicates
PROC SORT DATA=murders_clean2 NODUPKEY;
BY Case_ID;
RUN;
PROC PRINT DATA = murders_clean2;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator | Date_clean |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi | 22JUL2021 |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith | . |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL | 05NOV2019 |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana | . |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John | 10SEP2022 |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 |
Explanation
NODUPKEY removes rows that share the same BY key. One caveat: in this dataset every Case_ID is unique, so sorting BY Case_ID drops nothing, which is why the output still shows all ten rows. The true duplicate (Cases 1 and 4) is identical in every field except Case_ID; to remove it, sort BY the substantive fields instead (e.g., BY Country Year Victim_Age Murder_Count Date Status Investigator). In clinical trials, duplicate subject IDs can invalidate analysis, which is why NODUPKEY on the primary key remains a standard first check.
Step 4: Outlier Detection
PROC UNIVARIATE DATA=murders_clean2;
VAR Murder_Count;
RUN;
OUTPUT:
The UNIVARIATE Procedure
Variable: Murder_Count
| Moments | |||
|---|---|---|---|
| N | 7 | Sum Weights | 7 |
| Mean | 5.85714286 | Sum Observations | 41 |
| Std Deviation | 1.67616342 | Variance | 2.80952381 |
| Skewness | -0.5824439 | Kurtosis | 0.05188164 |
| Uncorrected SS | 257 | Corrected SS | 16.8571429 |
| Coeff Variation | 28.6174242 | Std Error Mean | 0.63353022 |
| Basic Statistical Measures | | | |
|---|---|---|---|
| Location | | Variability | |
| Mean | 5.857143 | Std Deviation | 1.67616 |
| Median | 6.000000 | Variance | 2.80952 |
| Mode | 5.000000 | Range | 5.00000 |
| | | Interquartile Range | 2.00000 |
Note: The mode displayed is the smallest of 2 modes with a count of 2.
| Tests for Location: Mu0=0 | | | | |
|---|---|---|---|---|
| Test | Statistic | | p Value | |
| Student's t | t | 9.245246 | Pr > \|t\| | <.0001 |
| Sign | M | 3.5 | Pr >= \|M\| | 0.0156 |
| Signed Rank | S | 14 | Pr >= \|S\| | 0.0156 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 8 |
| 99% | 8 |
| 95% | 8 |
| 90% | 8 |
| 75% Q3 | 7 |
| 50% Median | 6 |
| 25% Q1 | 5 |
| 10% | 3 |
| 5% | 3 |
| 1% | 3 |
| 0% Min | 3 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 3 | 6 | 5 | 4 |
| 5 | 4 | 6 | 9 |
| 5 | 1 | 7 | 7 |
| 6 | 9 | 7 | 10 |
| 7 | 10 | 8 | 2 |
| Missing Values | |||
|---|---|---|---|
| Missing Value | Count | Percent Of | |
| All Obs | Missing Obs | ||
| . | 3 | 30.00 | 100.00 |
Explanation
This identifies extreme values using statistical summaries. Outliers can distort mean and variance. In real-world analytics, identifying these ensures robust modeling.
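The quantiles reported by PROC UNIVARIATE (Q1 = 5, Q3 = 7) feed directly into the classic IQR fence rule: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is a candidate outlier. A Python sketch of that calculation (illustrative only; the values are the non-missing cleaned Murder_Count observations):

```python
import statistics

# Non-missing Murder_Count values after cleaning
counts = [5, 8, 5, 3, 7, 6, 7]
q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
iqr = q3 - q1                                   # 7 - 5 = 2, matching PROC UNIVARIATE
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences at 2 and 10
outliers = [v for v in counts if v < lower or v > upper]
print(lower, upper, outliers)  # 2.0 10.0 []
```

After cleaning, no value breaches the fences; on the raw data, 9999 and -10 would both fall far outside them, which is exactly why they were caught in Step 1.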
4. Phase 2: Data Cleaning in R
library(dplyr)
murders_clean <- murders_raw %>%
mutate(
Country = toupper(trimws(ifelse(Country=="NULL",
"UNKNOWN",Country))),
Victim_Age = ifelse(Victim_Age < 0 | Victim_Age > 120,
NA, Victim_Age),
Murder_Count = ifelse(Murder_Count < 0 | Murder_Count > 100,
NA, Murder_Count)
) %>%
distinct(Case_ID, .keep_all = TRUE)
OUTPUT:
| | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 2 | 2 | INDIA | 2021 | NA | 8 | 2021-07-22 | Open | Ravi |
| 3 | 3 | UK | 2022 | 29 | NA | 2022-13-01 | Closed | Smith |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John |
| 5 | 5 | UNKNOWN | 2019 | 40 | NA | 2019-11-05 | Open | NULL |
| 6 | 6 | BRAZIL | 2023 | NA | 3 | 2023-02-30 | Closed | Ana |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi |
| 8 | 8 | USA | 2022 | 30 | NA | 2022-09-10 | Closed | John |
| 9 | 9 | UK | 2020 | NA | 6 | 2020-03-25 | Closed | Smith |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi |
Explanation
R uses mutate() for transformation and distinct() for duplicates, with ifelse() handling the conditional cleaning. (As with the SAS step, distinct() on Case_ID alone keeps all ten rows, since every Case_ID is unique; catching the content duplicate requires keying on the substantive columns.) While R is flexible, it requires disciplined coding to ensure reproducibility, especially in regulated domains.
5. Phase 3: Additional SAS Code (Advanced Outlier Handling)
PROC MEANS DATA=murders_clean2 N MEAN STD MIN MAX;
VAR Murder_Count;
RUN;
OUTPUT:
The MEANS Procedure
| Analysis Variable : Murder_Count | ||||
|---|---|---|---|---|
| N | Mean | Std Dev | Minimum | Maximum |
| 7 | 5.8571429 | 1.6761634 | 3.0000000 | 8.0000000 |
DATA murders_outliers;
SET murders_clean2;
/* Flag values beyond mean + 3*STD: 5.857 + 3*1.676, i.e. approx 10.9 */
IF Murder_Count > 10.9 THEN Flag_Outlier = 1;
ELSE Flag_Outlier = 0;
RUN;
PROC PRINT DATA = murders_outliers;
RUN;
OUTPUT:
| Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date | Status | Investigator | Date_clean | Flag_Outlier |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 | 0 |
| 2 | 2 | INDIA | 2021 | . | 8 | 2021-07-22 | Open | Ravi | 22JUL2021 | 0 |
| 3 | 3 | UK | 2022 | 29 | . | 2022-13-01 | Closed | Smith | . | 0 |
| 4 | 4 | USA | 2020 | 34 | 5 | 2020-05-12 | Closed | John | 12MAY2020 | 0 |
| 5 | 5 | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open | NULL | 05NOV2019 | 0 |
| 6 | 6 | BRAZIL | 2023 | . | 3 | 2023-02-30 | Closed | Ana | . | 0 |
| 7 | 7 | INDIA | 2021 | 25 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 | 0 |
| 8 | 8 | USA | 2022 | 30 | . | 2022-09-10 | Closed | John | 10SEP2022 | 0 |
| 9 | 9 | UK | 2020 | . | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 | 0 |
| 10 | 10 | INDIA | 2021 | 27 | 7 | 2021-08-15 | Open | Ravi | 15AUG2021 | 0 |
Explanation
We compute summary statistics and then flag outliers against a threshold. Threshold-based detection (e.g., mean + 3×STD) is common in statistical modeling, but the threshold should be derived from the computed statistics rather than hardcoded; in production SAS code it would typically be pulled dynamically, for example via CALL SYMPUTX or PROC SQL. This step ensures extreme values are flagged rather than blindly removed.
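The mean + 3×STD rule is easy to verify by hand. A Python sketch (illustrative only; the values are the cleaned, non-missing Murder_Count observations matching the PROC MEANS output above):

```python
import statistics

counts = [5, 8, 5, 3, 7, 6, 7]      # cleaned Murder_Count values
mean = statistics.mean(counts)       # ~5.857, matching PROC MEANS
std = statistics.stdev(counts)       # ~1.676 (sample standard deviation)
threshold = mean + 3 * std           # ~10.9
flags = [1 if v > threshold else 0 for v in counts]
print(flags)  # [0, 0, 0, 0, 0, 0, 0]
# The raw value 9999 would exceed this threshold by orders of magnitude.
```

Because the threshold is computed from the data, it adapts if the dataset changes; a hardcoded cutoff would silently go stale.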
6. 20 Additional Data Cleaning Best Practices
- Always validate primary keys (USUBJID in SDTM)
- Maintain audit trails for every transformation
- Use controlled terminology (CDISC standards)
- Validate date formats strictly
- Never overwrite raw data
- Use macros for repeatability
- Perform range checks on numeric fields
- Standardize categorical variables
- Document derivation logic
- Validate merges carefully
- Use PROC COMPARE for validation
- Ensure traceability from SDTM to ADaM
- Flag missing critical variables
- Perform cross-domain consistency checks
- Validate units (mg vs g)
- Automate QC checks
- Use metadata-driven programming
- Log all warnings/errors
- Avoid hardcoding values
- Follow regulatory compliance (FDA/EMA)
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business logic. Missing values are often replaced because they can bias analysis. For example, in a clinical trial, missing patient age could distort subgroup analysis. Similarly, unrealistic values like age = 150 must be corrected because they violate biological constraints. In financial datasets, salary normalization ensures comparability across regions.
Outliers are particularly critical. A murder count of 9999 in a small dataset is likely a data entry error, not reality. If left untreated, it inflates averages and misleads decision-makers. Date imputation is another key area: incorrect dates can disrupt time-series analysis.
Ultimately, data cleaning ensures trustworthy insights. Without it, even the most advanced models fail.
8. 20 Key Points About This Project
- Dirty data leads to wrong conclusions
- Outliers distort statistical models
- Standardization ensures reproducibility
- Missing data must be handled carefully
- Duplicate records inflate metrics
- SAS ensures regulatory compliance
- R offers flexibility in exploration
- Date errors break timelines
- Validation is non-negotiable
- Cleaning improves model accuracy
- Audit trails ensure traceability
- Controlled terminology avoids confusion
- Range checks prevent unrealistic values
- Automation reduces manual errors
- Metadata drives consistency
- QC checks ensure reliability
- Outliers must be flagged, not ignored
- Data integrity is critical in healthcare
- Structured cleaning saves time
- Clean data = trustworthy insights
9. Summary
Data cleaning is the backbone of reliable analytics, and this blog demonstrated how messy real-world datasets, like our simulated global murder dataset, can be systematically cleaned using SAS and R. We introduced intentional data issues such as missing values, invalid entries, duplicates, and inconsistent text formats to replicate real-world complexity. Using SAS, we applied structured techniques like UPCASE/STRIP standardization, conditional logic, PROC SORT, and PROC UNIVARIATE to clean and validate the data. SAS proved particularly strong in enforcing data integrity and handling regulated workflows.
In parallel, R provided a flexible and intuitive approach using dplyr functions such as mutate() and distinct(). While R excels in exploratory data analysis, it requires disciplined coding practices to match SAS’s robustness in production environments.
A major focus was outlier detection, which is critical for preventing distorted insights. We used statistical methods and logical thresholds to identify and handle extreme values. Additionally, we explored best practices aligned with clinical trial standards (SDTM/ADaM), emphasizing audit trails, validation checks, and regulatory compliance.
Ultimately, both SAS and R are powerful, but their effectiveness depends on how well data cleaning principles are applied. Clean data ensures accuracy, scalability, and trust in analytics, making it indispensable in both business and healthcare domains.
10. Conclusion
In the world of data analytics, tools alone do not guarantee success; discipline in data handling does. Through this deep dive into a “Top Murders in World” dataset, we exposed how easily data can become unreliable and how critical structured cleaning frameworks are.
SAS stands out as a gold standard in environments where compliance, traceability, and reproducibility are essential, especially in clinical trials. Its procedural rigor ensures that every transformation is documented and validated. R, on the other hand, provides unmatched flexibility and speed for exploratory analysis, making it invaluable for early-stage data investigation.
Outlier detection emerged as a central theme. Ignoring outliers can lead to flawed conclusions, while blindly removing them can erase meaningful signals. The right approach is to detect, analyze, and justify.
The broader lesson is this: data cleaning is not a preprocessing step; it is a core analytical responsibility. Whether you're working on healthcare data, financial systems, or crime analytics, the integrity of your dataset determines the quality of your insights.
Adopting best practices (standardization, validation, audit trails, and automation) ensures that your data pipeline is not just functional but trustworthy.
In the end, clean data is not a luxury; it is a necessity.
11. Interview Questions
Q1: How do you detect outliers in SAS?
Answer: Use PROC UNIVARIATE, PROC MEANS, or IQR-based methods. Example: flag values beyond mean + 3*STD, or outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
Q2: How do you handle duplicates?
Answer: Use PROC SORT with the NODUPKEY option and an appropriate BY statement, e.g. PROC SORT DATA=ds NODUPKEY; BY key_vars; RUN;
Q3: What is the difference between SAS and R in data cleaning?
Answer: SAS is structured and regulatory-friendly; R is flexible and exploratory.
Q4: Scenario: Age = -10 in the dataset. What do you do?
Answer: Validate against the source and either set the value to missing or correct it using documented business rules.
Q5: How do you debug inconsistent text values?
Answer: Use UPCASE, STRIP, and controlled terminology mapping.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are simulated and created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities