460. The Dark Side of Data: Eliminating Outliers in Crime Datasets with SAS

Bloodlines & Broken Data: Mastering Outlier Detection in SAS with a Global Crime Dataset

1. Introduction

Imagine you’re working on a global crime analytics project for an international intelligence agency. The goal is simple: identify patterns in high-profile murder cases across countries. But when you open the dataset, chaos greets you.

Victim ages are negative. Dates like 2023-02-30 exist. Countries appear as “usa”, “USA”, and “Usa”. Some murder counts are absurdly high, like 10,000 cases in a small town. Duplicate records silently distort trends.

This isn’t just messy; it’s dangerous.

Bad data doesn’t just slow analysis; it destroys decision-making. In clinical trials, it could mislead drug safety conclusions. In business, it could result in million-dollar mistakes.

This is where SAS and R become powerful allies. They don’t just process data; they enforce discipline, structure, and validation. And today, we’ll go deeper into one critical aspect:

Detecting Outliers in SAS datasets and cleaning them effectively

We’ll simulate a “Top Murders in World” dataset, intentionally inject errors, and then fix everything step by step, just like a real-world data engineer.

2. Raw Data Creation in SAS and R

We first create a flawed dataset with intentional errors.

SAS Code (Raw Dataset)

DATA murders_raw;
  INPUT Case_ID Country $ Year Victim_Age Murder_Count Date : $12.
        Status $ Investigator $;
  DATALINES;
1 USA 2020 34 5 2020-05-12 Closed John
2 india 2021 -45 8 2021-07-22 Open Ravi
3 UK 2022 29 9999 2022-13-01 Closed Smith
4 USA 2020 34 5 2020-05-12 Closed John
5 NULL 2019 40 . 2019-11-05 Open NULL
6 Brazil 2023 150 3 2023-02-30 Closed Ana
7 India 2021 25 7 2021-08-15 Open Ravi
8 usa 2022 30 -10 2022-09-10 Closed John
9 UK 2020 . 6 2020-03-25 Closed Smith
10 India 2021 27 7 2021-08-15 Open Ravi
;
RUN;

PROC PRINT DATA=murders_raw;
RUN;

OUTPUT:

Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator
1   | 1  | USA    | 2020 | 34  | 5    | 2020-05-12 | Closed | John
2   | 2  | india  | 2021 | -45 | 8    | 2021-07-22 | Open   | Ravi
3   | 3  | UK     | 2022 | 29  | 9999 | 2022-13-01 | Closed | Smith
4   | 4  | USA    | 2020 | 34  | 5    | 2020-05-12 | Closed | John
5   | 5  | NULL   | 2019 | 40  | .    | 2019-11-05 | Open   | NULL
6   | 6  | Brazil | 2023 | 150 | 3    | 2023-02-30 | Closed | Ana
7   | 7  | India  | 2021 | 25  | 7    | 2021-08-15 | Open   | Ravi
8   | 8  | usa    | 2022 | 30  | -10  | 2022-09-10 | Closed | John
9   | 9  | UK     | 2020 | .   | 6    | 2020-03-25 | Closed | Smith
10  | 10 | India  | 2021 | 27  | 7    | 2021-08-15 | Open   | Ravi

Explanation

This dataset intentionally mimics real-world inconsistencies. We included missing values (.), invalid ages (-45, 150), duplicate records (Case_ID 1 & 4), inconsistent country formats (USA, usa, india), and impossible dates (2023-02-30). The Murder_Count variable includes extreme outliers like 9999 and negative values. These issues replicate challenges seen in clinical trial datasets (e.g., incorrect lab values, duplicate patient IDs). This raw structure forms the foundation for demonstrating data cleaning and outlier detection techniques using SAS procedures like PROC UNIVARIATE, PROC MEANS, and conditional transformations.

R Code – Equivalent Raw Dataset

murders_raw <- data.frame(
  Case_ID = c(1,2,3,4,5,6,7,8,9,10),
  Country = c("USA","india","UK","USA","NULL","Brazil","India",
              "usa","UK","India"),
  Year = c(2020,2021,2022,2020,2019,2023,2021,2022,2020,2021),
  Victim_Age = c(34,-45,29,34,40,150,25,30,NA,27),
  Murder_Count = c(5,8,9999,5,NA,3,7,-10,6,7),
  Date = c("2020-05-12","2021-07-22","2022-13-01","2020-05-12",
           "2019-11-05","2023-02-30","2021-08-15","2022-09-10",
           "2020-03-25","2021-08-15"),
  Status = c("Closed","Open","Closed","Closed","Open","Closed",
             "Open","Closed","Closed","Open"),
  Investigator = c("John","Ravi","Smith","John","NULL","Ana","Ravi",
                   "John","Smith","Ravi")
)

OUTPUT:

 

   | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator
1  | 1  | USA    | 2020 | 34  | 5    | 2020-05-12 | Closed | John
2  | 2  | india  | 2021 | -45 | 8    | 2021-07-22 | Open   | Ravi
3  | 3  | UK     | 2022 | 29  | 9999 | 2022-13-01 | Closed | Smith
4  | 4  | USA    | 2020 | 34  | 5    | 2020-05-12 | Closed | John
5  | 5  | NULL   | 2019 | 40  | NA   | 2019-11-05 | Open   | NULL
6  | 6  | Brazil | 2023 | 150 | 3    | 2023-02-30 | Closed | Ana
7  | 7  | India  | 2021 | 25  | 7    | 2021-08-15 | Open   | Ravi
8  | 8  | usa    | 2022 | 30  | -10  | 2022-09-10 | Closed | John
9  | 9  | UK     | 2020 | NA  | 6    | 2020-03-25 | Closed | Smith
10 | 10 | India  | 2021 | 27  | 7    | 2021-08-15 | Open   | Ravi

Explanation

The R dataset mirrors SAS but uses data.frame(). It includes NA for missing values, inconsistent casing, and incorrect numeric entries. R’s flexibility allows fast prototyping, but without validation, errors propagate quickly. This dataset will later be cleaned using dplyr functions like mutate(), filter(), and distinct(). The goal is to show parity between SAS and R workflows while emphasizing SAS’s strength in regulated environments.

3. Phase 1: Data Cleaning in SAS

Step 1: Handle Missing & Invalid Values

DATA murders_clean1;
  SET murders_raw;
  Country = UPCASE(STRIP(Country));
  IF Country = "NULL" OR Country = "" THEN Country = "UNKNOWN";
  IF Victim_Age < 0 OR Victim_Age > 120 THEN Victim_Age = .;
  IF Murder_Count < 0 OR Murder_Count > 100 THEN Murder_Count = .;
RUN;

PROC PRINT DATA=murders_clean1;
RUN;

OUTPUT:

Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator
1   | 1  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John
2   | 2  | INDIA   | 2021 | .  | 8 | 2021-07-22 | Open   | Ravi
3   | 3  | UK      | 2022 | 29 | . | 2022-13-01 | Closed | Smith
4   | 4  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John
5   | 5  | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open   | NULL
6   | 6  | BRAZIL  | 2023 | .  | 3 | 2023-02-30 | Closed | Ana
7   | 7  | INDIA   | 2021 | 25 | 7 | 2021-08-15 | Open   | Ravi
8   | 8  | USA     | 2022 | 30 | . | 2022-09-10 | Closed | John
9   | 9  | UK      | 2020 | .  | 6 | 2020-03-25 | Closed | Smith
10  | 10 | INDIA   | 2021 | 27 | 7 | 2021-08-15 | Open   | Ravi

Explanation

We standardize text with UPCASE and STRIP and replace missing or “NULL” country values with “UNKNOWN”. Invalid ages (negative or above 120) are set to missing. Murder counts are validated the same way: negative values and extreme values (above 100) are set to missing. This step ensures logical consistency, similar to clinical trial cleaning where unrealistic lab values are removed.

Step 2: Date Formatting

DATA murders_clean2;
  SET murders_clean1;
  * the ?? modifier keeps invalid dates from writing error notes to the log;
  Date_clean = INPUT(Date, ?? YYMMDD10.);
  FORMAT Date_clean DATE9.;
RUN;

PROC PRINT DATA=murders_clean2;
RUN;

OUTPUT:

Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator | Date_clean
1   | 1  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John  | 12MAY2020
2   | 2  | INDIA   | 2021 | .  | 8 | 2021-07-22 | Open   | Ravi  | 22JUL2021
3   | 3  | UK      | 2022 | 29 | . | 2022-13-01 | Closed | Smith | .
4   | 4  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John  | 12MAY2020
5   | 5  | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open   | NULL  | 05NOV2019
6   | 6  | BRAZIL  | 2023 | .  | 3 | 2023-02-30 | Closed | Ana   | .
7   | 7  | INDIA   | 2021 | 25 | 7 | 2021-08-15 | Open   | Ravi  | 15AUG2021
8   | 8  | USA     | 2022 | 30 | . | 2022-09-10 | Closed | John  | 10SEP2022
9   | 9  | UK      | 2020 | .  | 6 | 2020-03-25 | Closed | Smith | 25MAR2020
10  | 10 | INDIA   | 2021 | 27 | 7 | 2021-08-15 | Open   | Ravi  | 15AUG2021

Explanation

Invalid dates like “2022-13-01” are converted to missing. SAS’s INPUT function with the YYMMDD10. informat enforces strict date parsing (the optional ?? modifier suppresses the log notes that invalid values would otherwise trigger), which is critical in regulatory datasets like SDTM where date accuracy is essential.
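The same strict-parsing behavior is easy to sketch outside SAS. The short Python snippet below is an illustrative aside (not part of the SAS workflow) showing impossible calendar dates collapsing to a missing marker:

```python
from datetime import datetime

def parse_strict(date_str):
    """Return a date for a valid ISO string, or None for impossible
    dates, mirroring how a strict informat turns bad values into missing."""
    try:
        return datetime.strptime(date_str, "%Y-%m-%d").date()
    except ValueError:
        return None

# "2022-13-01" (month 13) and "2023-02-30" (Feb 30) both fail validation
print([parse_strict(d) for d in ["2020-05-12", "2022-13-01", "2023-02-30"]])
```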

Step 3: Remove Duplicates

PROC SORT DATA=murders_clean2 NODUPKEY;
  BY Case_ID;
RUN;

PROC PRINT DATA=murders_clean2;
RUN;

OUTPUT:

Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator | Date_clean
1   | 1  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John  | 12MAY2020
2   | 2  | INDIA   | 2021 | .  | 8 | 2021-07-22 | Open   | Ravi  | 22JUL2021
3   | 3  | UK      | 2022 | 29 | . | 2022-13-01 | Closed | Smith | .
4   | 4  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John  | 12MAY2020
5   | 5  | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open   | NULL  | 05NOV2019
6   | 6  | BRAZIL  | 2023 | .  | 3 | 2023-02-30 | Closed | Ana   | .
7   | 7  | INDIA   | 2021 | 25 | 7 | 2021-08-15 | Open   | Ravi  | 15AUG2021
8   | 8  | USA     | 2022 | 30 | . | 2022-09-10 | Closed | John  | 10SEP2022
9   | 9  | UK      | 2020 | .  | 6 | 2020-03-25 | Closed | Smith | 25MAR2020
10  | 10 | INDIA   | 2021 | 27 | 7 | 2021-08-15 | Open   | Ravi  | 15AUG2021

Explanation

NODUPKEY keeps only the first record for each BY value. Because every Case_ID here is unique, no rows are actually deleted; the content-level duplicate (observations 1 and 4 match on everything except Case_ID) would only be caught by sorting on the substantive columns, for example BY Country Year Victim_Age Date, and applying NODUPKEY to those. In clinical trials, duplicate subject IDs can invalidate an analysis, so uniqueness checks on primary keys are standard.

Step 4: Outlier Detection

PROC UNIVARIATE DATA=murders_clean2;
  VAR Murder_Count;
RUN;

OUTPUT:

The UNIVARIATE Procedure

Variable: Murder_Count

Moments
N               | 7          | Sum Weights      | 7
Mean            | 5.85714286 | Sum Observations | 41
Std Deviation   | 1.67616342 | Variance         | 2.80952381
Skewness        | -0.5824439 | Kurtosis         | 0.05188164
Uncorrected SS  | 257        | Corrected SS     | 16.8571429
Coeff Variation | 28.6174242 | Std Error Mean   | 0.63353022

Basic Statistical Measures
Location          | Variability
Mean   5.857143   | Std Deviation       1.67616
Median 6.000000   | Variance            2.80952
Mode   5.000000   | Range               5.00000
                  | Interquartile Range 2.00000

Note: The mode displayed is the smallest of 2 modes with a count of 2.

Tests for Location: Mu0=0
Test        | Statistic  | p Value
Student's t | t 9.245246 | Pr > |t|  <.0001
Sign        | M 3.5      | Pr >= |M| 0.0156
Signed Rank | S 14       | Pr >= |S| 0.0156

Quantiles (Definition 5)
Level      | Quantile
100% Max   | 8
99%        | 8
95%        | 8
90%        | 8
75% Q3     | 7
50% Median | 6
25% Q1     | 5
10%        | 3
5%         | 3
1%         | 3
0% Min     | 3

Extreme Observations
Lowest: Value (Obs) | Highest: Value (Obs)
3 (6)               | 5 (4)
5 (4)               | 6 (9)
5 (1)               | 7 (7)
6 (9)               | 7 (10)
7 (10)              | 8 (2)

Missing Values
Missing Value | Count | Percent of All Obs | Percent of Missing Obs
.             | 3     | 30.00              | 100.00

Explanation

PROC UNIVARIATE surfaces extreme values through its quantiles and Extreme Observations tables, alongside the moments that outliers distort most (mean, variance, skewness). In real-world analytics, reviewing these summaries before modeling ensures robust results.
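As a sanity check, the Moments that PROC UNIVARIATE reports can be reproduced by hand. This Python sketch (an aside for readers without SAS) recomputes the headline statistics from the seven non-missing murder counts:

```python
import statistics

# the seven non-missing Murder_Count values after cleaning
counts = [5, 8, 5, 3, 7, 6, 7]

n = len(counts)
mean = statistics.mean(counts)            # 41 / 7
sd = statistics.stdev(counts)             # sample standard deviation
corrected_ss = sum((x - mean) ** 2 for x in counts)

# these agree with the Moments table: N=7, Mean=5.85714286,
# Std Deviation=1.67616342, Corrected SS=16.8571429
print(n, round(mean, 8), round(sd, 8), round(corrected_ss, 7))
```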

4. Phase 2: Data Cleaning in R

library(dplyr)

murders_clean <- murders_raw %>%
  mutate(
    Country = toupper(trimws(ifelse(Country == "NULL", "UNKNOWN", Country))),
    Victim_Age = ifelse(Victim_Age < 0 | Victim_Age > 120, NA, Victim_Age),
    Murder_Count = ifelse(Murder_Count < 0 | Murder_Count > 100, NA, Murder_Count)
  ) %>%
  distinct(Case_ID, .keep_all = TRUE)

OUTPUT:

 

   | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator
1  | 1  | USA     | 2020 | 34 | 5  | 2020-05-12 | Closed | John
2  | 2  | INDIA   | 2021 | NA | 8  | 2021-07-22 | Open   | Ravi
3  | 3  | UK      | 2022 | 29 | NA | 2022-13-01 | Closed | Smith
4  | 4  | USA     | 2020 | 34 | 5  | 2020-05-12 | Closed | John
5  | 5  | UNKNOWN | 2019 | 40 | NA | 2019-11-05 | Open   | NULL
6  | 6  | BRAZIL  | 2023 | NA | 3  | 2023-02-30 | Closed | Ana
7  | 7  | INDIA   | 2021 | 25 | 7  | 2021-08-15 | Open   | Ravi
8  | 8  | USA     | 2022 | 30 | NA | 2022-09-10 | Closed | John
9  | 9  | UK      | 2020 | NA | 6  | 2020-03-25 | Closed | Smith
10 | 10 | INDIA   | 2021 | 27 | 7  | 2021-08-15 | Open   | Ravi

Explanation

R uses mutate() for transformations and distinct() for duplicates, with ifelse() handling conditional cleaning. As in SAS, distinct(Case_ID, .keep_all = TRUE) keeps all ten rows here because every Case_ID is unique; catching the content-level duplicate would require deduplicating on the substantive columns instead. While R is flexible, it requires disciplined coding to ensure reproducibility, especially in regulated domains.

5. Phase 3: Additional SAS Code (Advanced Outlier Handling)

PROC MEANS DATA=murders_clean2 N MEAN STD MIN MAX;
  VAR Murder_Count;
RUN;

OUTPUT:

The MEANS Procedure

Analysis Variable : Murder_Count
N | Mean      | Std Dev   | Minimum   | Maximum
7 | 5.8571429 | 1.6761634 | 3.0000000 | 8.0000000

DATA murders_outliers;
  SET murders_clean2;
  * fixed demonstration threshold of 30 - in practice derive it as mean + 3*SD;
  IF Murder_Count > 30 THEN Flag_Outlier = 1;
  ELSE Flag_Outlier = 0;
RUN;

PROC PRINT DATA=murders_outliers;
RUN;

OUTPUT:

Obs | Case_ID | Country | Year | Victim_Age | Murder_Count | Date       | Status | Investigator | Date_clean | Flag_Outlier
1   | 1  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John  | 12MAY2020 | 0
2   | 2  | INDIA   | 2021 | .  | 8 | 2021-07-22 | Open   | Ravi  | 22JUL2021 | 0
3   | 3  | UK      | 2022 | 29 | . | 2022-13-01 | Closed | Smith | .         | 0
4   | 4  | USA     | 2020 | 34 | 5 | 2020-05-12 | Closed | John  | 12MAY2020 | 0
5   | 5  | UNKNOWN | 2019 | 40 | . | 2019-11-05 | Open   | NULL  | 05NOV2019 | 0
6   | 6  | BRAZIL  | 2023 | .  | 3 | 2023-02-30 | Closed | Ana   | .         | 0
7   | 7  | INDIA   | 2021 | 25 | 7 | 2021-08-15 | Open   | Ravi  | 15AUG2021 | 0
8   | 8  | USA     | 2022 | 30 | . | 2022-09-10 | Closed | John  | 10SEP2022 | 0
9   | 9  | UK      | 2020 | .  | 6 | 2020-03-25 | Closed | Smith | 25MAR2020 | 0
10  | 10 | INDIA   | 2021 | 27 | 7 | 2021-08-15 | Open   | Ravi  | 15AUG2021 | 0

Explanation

We compute summary statistics and then flag outliers manually. The flag above uses a hardcoded cutoff of 30 for demonstration; a data-driven rule such as mean + 3 × standard deviation (about 5.86 + 3 × 1.68 ≈ 10.9 for this variable) is the more common choice in statistical modeling. Either way, the point is that extreme values are flagged rather than blindly removed.
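For readers who want the data-driven version of the rule, here is a minimal sketch in Python (assuming the seven cleaned murder counts from above) that derives the cutoff as mean + 3 × SD instead of hardcoding it:

```python
import statistics

counts = [5, 8, 5, 3, 7, 6, 7]   # cleaned Murder_Count values

mean = statistics.mean(counts)
sd = statistics.stdev(counts)
threshold = mean + 3 * sd        # data-driven cutoff, roughly 10.9 here

flags = [1 if x > threshold else 0 for x in counts]
print(round(threshold, 2), flags)   # no value exceeds the cutoff
```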

6. 20 Additional Data Cleaning Best Practices

  1. Always validate primary keys (USUBJID in SDTM)
  2. Maintain audit trails for every transformation
  3. Use controlled terminology (CDISC standards)
  4. Validate date formats strictly
  5. Never overwrite raw data
  6. Use macros for repeatability
  7. Perform range checks on numeric fields
  8. Standardize categorical variables
  9. Document derivation logic
  10. Validate merges carefully
  11. Use PROC COMPARE for validation
  12. Ensure traceability from SDTM to ADaM
  13. Flag missing critical variables
  14. Perform cross-domain consistency checks
  15. Validate units (mg vs g)
  16. Automate QC checks
  17. Use metadata-driven programming
  18. Log all warnings/errors
  19. Avoid hardcoding values
  20. Follow regulatory compliance (FDA/EMA)
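Several of these practices, notably range checks and flagging instead of deleting, can be bundled into one reusable check. The helper below is a hypothetical Python sketch; the function name and record layout are invented for illustration:

```python
def range_check(records, field, low, high):
    """Report (row_index, value) pairs where `field` is outside [low, high].
    Violations are returned for review, never silently deleted."""
    violations = []
    for i, rec in enumerate(records):
        value = rec.get(field)
        if value is not None and not (low <= value <= high):
            violations.append((i, value))
    return violations

rows = [{"Victim_Age": 34}, {"Victim_Age": -45},
        {"Victim_Age": 150}, {"Victim_Age": None}]
print(range_check(rows, "Victim_Age", 0, 120))  # flags rows 1 and 2
```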

7. Business Logic Behind Data Cleaning

Data cleaning is not arbitrary; it’s driven by business logic. Missing values are often imputed or flagged because they can bias analysis. For example, in a clinical trial, missing patient age could distort subgroup analysis. Similarly, unrealistic values like age = 150 must be corrected because they violate biological constraints. In financial datasets, salary normalization ensures comparability across regions.

Outliers are particularly critical. A murder count of 9999 in a small dataset is likely a data entry error, not reality. If left untreated, it inflates averages and misleads decision-makers. Date imputation is another key area: incorrect dates can disrupt time-series analysis.
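The inflation effect is easy to quantify. Using the raw murder counts with the 9999 entry left in (a quick Python illustration):

```python
import statistics

with_outlier = [5, 8, 9999, 5, 3, 7, 6, 7]   # raw, bad entry included
without_outlier = [5, 8, 5, 3, 7, 6, 7]      # after cleaning

print(statistics.mean(with_outlier))     # one bad record pulls the mean to 1255
print(statistics.median(with_outlier))   # the median barely moves (6.5)
print(round(statistics.mean(without_outlier), 2))
```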

Ultimately, data cleaning ensures trustworthy insights. Without it, even the most advanced models fail.

8. 20 Key Points About This Project

  • Dirty data leads to wrong conclusions
  • Outliers distort statistical models
  • Standardization ensures reproducibility
  • Missing data must be handled carefully
  • Duplicate records inflate metrics
  • SAS ensures regulatory compliance
  • R offers flexibility in exploration
  • Date errors break timelines
  • Validation is non-negotiable
  • Cleaning improves model accuracy
  • Audit trails ensure traceability
  • Controlled terminology avoids confusion
  • Range checks prevent unrealistic values
  • Automation reduces manual errors
  • Metadata drives consistency
  • QC checks ensure reliability
  • Outliers must be flagged, not ignored
  • Data integrity is critical in healthcare
  • Structured cleaning saves time
  • Clean data = trustworthy insights

9. Summary

Data cleaning is the backbone of reliable analytics, and this blog demonstrated how messy real-world datasets like our simulated global murder dataset can be systematically cleaned using SAS and R. We introduced intentional data issues such as missing values, invalid entries, duplicates, and inconsistent text formats to replicate real-world complexity. Using SAS, we applied structured techniques such as UPCASE/STRIP standardization, conditional logic, strict date informats, PROC SORT, and PROC UNIVARIATE to clean and validate the data. SAS proved particularly strong in enforcing data integrity and handling regulated workflows.

In parallel, R provided a flexible and intuitive approach using dplyr functions such as mutate() and distinct(). While R excels in exploratory data analysis, it requires disciplined coding practices to match SAS’s robustness in production environments.

A major focus was outlier detection, which is critical for preventing distorted insights. We used statistical methods and logical thresholds to identify and handle extreme values. Additionally, we explored best practices aligned with clinical trial standards (SDTM/ADaM), emphasizing audit trails, validation checks, and regulatory compliance.

Ultimately, both SAS and R are powerful, but their effectiveness depends on how well data cleaning principles are applied. Clean data ensures accuracy, scalability, and trust in analytics, making it indispensable in both business and healthcare domains.

10. Conclusion

In the world of data analytics, tools alone do not guarantee success; discipline in data handling does. Through this deep dive into a “Top Murders in World” dataset, we exposed how easily data can become unreliable and how critical structured cleaning frameworks are.

SAS stands out as a gold standard in environments where compliance, traceability, and reproducibility are essential, especially in clinical trials. Its procedural rigor ensures that every transformation is documented and validated. R, on the other hand, provides unmatched flexibility and speed for exploratory analysis, making it invaluable for early-stage data investigation.

Outlier detection emerged as a central theme. Ignoring outliers can lead to flawed conclusions, while blindly removing them can erase meaningful signals. The right approach is to detect, analyze, and justify.

The broader lesson is this: data cleaning is not a preprocessing step; it is a core analytical responsibility. Whether you're working on healthcare data, financial systems, or crime analytics, the integrity of your dataset determines the quality of your insights.

Adopting best practices such as standardization, validation, audit trails, and automation ensures that your data pipeline is not just functional but trustworthy.

In the end, clean data is not a luxury; it is a necessity.

11. Interview Questions

Q1: How do you detect outliers in SAS?

Answer: Use PROC UNIVARIATE, PROC MEANS, or IQR-based rules, for example flagging values beyond mean ± 3 × standard deviation or outside Q1 − 1.5 × IQR to Q3 + 1.5 × IQR.
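A minimal sketch of the IQR rule, in Python (the quartiles here match the Q1 and Q3 shown in the PROC UNIVARIATE output earlier):

```python
import statistics

counts = [5, 8, 5, 3, 7, 6, 7]   # cleaned Murder_Count values

q1, _, q3 = statistics.quantiles(counts, n=4)   # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in counts if x < lower or x > upper]
print(lower, upper, outliers)   # fences at 2.0 and 10.0; nothing flagged
```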

Q2: How do you handle duplicates?

Answer: Sort with the NODUPKEY option: PROC SORT DATA=dataset NODUPKEY; BY key_variable; RUN;

Q3: What is the difference between SAS and R in data cleaning?

Answer: SAS is structured and regulatory-friendly; R is flexible and exploratory.

Q4: Scenario: Age = -10 in dataset. What do you do?

Answer: Validate and set to missing or correct using business rules.

Q5: How do you debug inconsistent text values?

Answer: Use UPCASE, STRIP, and controlled terminology mapping.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only; the “top murders” data is entirely fictional.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and data quality

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1. From Dead Mobile Brands to Clean Data Mastery Using INPUT and PUT in SAS

2. Hidden Errors in Movie Data and How SAS LENGTH & FORMAT Bring Clarity

3. Why Even Small Data Errors Can Collapse Enterprise Analytics

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
