Forensic Data Failures & Analytics Nightmares: Building Scalable Crime Intelligence Pipelines Using SAS and R
Global Crime Case Intelligence into Audit-Ready SAS and R Pipelines for Enterprise-Grade Decision Systems
Introduction
Modern
analytics systems fail silently when dirty data enters enterprise pipelines. In
global crime intelligence systems, one corrupted variable can distort fraud
detection, alter criminal profiling, trigger false investigations, and damage
legal reporting credibility. As Clinical SAS Programmers and Data Scientists,
we often discuss healthcare validation problems, but the exact same engineering
principles apply to crime analytics, banking fraud monitoring, insurance claims
intelligence, and retail risk surveillance.
Imagine a
multinational crime-monitoring organization aggregating global crime-case
records from different countries. The raw data arrives from police systems,
cybercrime portals, forensic units, and legal databases. Unfortunately, the
datasets contain duplicate case IDs, impossible ages, malformed timestamps,
corrupted location codes, invalid categories, whitespace contamination, mixed
uppercase/lowercase formats, and negative financial-loss values.
One
incorrectly cleaned fraud-loss variable can reduce millions in projected
damages. One missing arrest date can invalidate judicial reporting dashboards.
One malformed region code can shift crime hot-spots into the wrong geography.
In regulated industries such as healthcare and banking, such failures can
create compliance violations, audit findings, and catastrophic executive
decisions.
This
project demonstrates how to engineer analysis-ready datasets using SAS and R
through enterprise-grade cleaning workflows, validation logic, and professional
reporting pipelines.
Global Crime Case Dataset
Design
Below is a deliberately corrupted operational dataset representing different types of global crime cases.
1.SAS Raw Dataset Creation with Intentional Errors
data crime_raw;
length Case_ID $12 Crime_Type $30 Country $20 Incident_Date $15
Region_Code $10 Officer_Email $50 Status $15 Evidence_Score $10;
infile datalines dlm='|' truncover;
input Case_ID $ Crime_Type $ Country $ Age Financial_Loss
Incident_Date $ Region_Code $ Officer_Email $ Status $
Evidence_Score $;
datalines;
CR001|Cyber Fraud|india|34|50000|12JAN2025|ap01|officer1@gmail.com|Open|85
CR001|cyber fraud|INDIA|-5|-9000|31FEB2025|AP01|wrongmail.com|open|NULL
CR002|Money Laundering|usa|130|700000|15MAR2025|NY 01|agent#mail.com|Closed|91
CR003| Homicide |UK|45|250000|NULL|LN01|detective@gmail.com|Pending|76
CR004|Kidnapping|India|22|.|22APR2025|AP 02|officer2@gmail|OPEN|88
CR005|Cyber Fraud|Canada|19|-5000|19MAY2025|TR01|NULL|Solved|92
CR006|Drug Traffic|brazil|200|900000|99XYZ2025|BZ01|drug@gmail.com|Closed|abc
CR007|Robbery|India|35|25000|01JUN2025|AP03|robbery@gmail.com|Solved|79
CR008|Robbery| INDIA |36|26000|01JUN2025|AP03|robbery@gmail.com|Solved|79
CR009|Human Traffic|USA|41|650000|15JUL2025|NY02|human@@gmail.com|Pending|95
CR010|Cyber Fraud|India|29|0|16AUG2025|AP04|cyber@gmail.com|Open|83
CR011|Insurance Scam|UK|NULL|150000|01SEP2025|LN02|claim@gmail.com|Closed|90
CR012|Insurance Scam|UK|44|160000|01SEP2025|LN02|claim@gmail.com|Closed|90
CR013|Terror Funding|UAE|39|9999999|17OCT2025|DXB01|terror@gmail.com|Investigating|100
CR014|Retail Theft|india|17|1200|20NOV2025|AP05|retailgmail.com|Solved|65
CR015|Forgery|Australia|52|45000|12DEC2025|AU01|forgery@gmail.com|Open|72
;
run;
proc print data = crime_raw;
run;
OUTPUT:
| Obs | Case_ID | Crime_Type | Country | Incident_Date | Region_Code | Officer_Email | Status | Evidence_Score | Age | Financial_Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CR001 | Cyber Fraud | india | 12JAN2025 | ap01 | officer1@gmail.com | Open | 85 | 34 | 50000 |
| 2 | CR001 | cyber fraud | INDIA | 31FEB2025 | AP01 | wrongmail.com | open | NULL | -5 | -9000 |
| 3 | CR002 | Money Laundering | usa | 15MAR2025 | NY 01 | agent#mail.com | Closed | 91 | 130 | 700000 |
| 4 | CR003 | Homicide | UK | NULL | LN01 | detective@gmail.com | Pending | 76 | 45 | 250000 |
| 5 | CR004 | Kidnapping | India | 22APR2025 | AP 02 | officer2@gmail | OPEN | 88 | 22 | . |
| 6 | CR005 | Cyber Fraud | Canada | 19MAY2025 | TR01 | NULL | Solved | 92 | 19 | -5000 |
| 7 | CR006 | Drug Traffic | brazil | 99XYZ2025 | BZ01 | drug@gmail.com | Closed | abc | 200 | 900000 |
| 8 | CR007 | Robbery | India | 01JUN2025 | AP03 | robbery@gmail.com | Solved | 79 | 35 | 25000 |
| 9 | CR008 | Robbery | INDIA | 01JUN2025 | AP03 | robbery@gmail.com | Solved | 79 | 36 | 26000 |
| 10 | CR009 | Human Traffic | USA | 15JUL2025 | NY02 | human@@gmail.com | Pending | 95 | 41 | 650000 |
| 11 | CR010 | Cyber Fraud | India | 16AUG2025 | AP04 | cyber@gmail.com | Open | 83 | 29 | 0 |
| 12 | CR011 | Insurance Scam | UK | 01SEP2025 | LN02 | claim@gmail.com | Closed | 90 | . | 150000 |
| 13 | CR012 | Insurance Scam | UK | 01SEP2025 | LN02 | claim@gmail.com | Closed | 90 | 44 | 160000 |
| 14 | CR013 | Terror Funding | UAE | 17OCT2025 | DXB01 | terror@gmail.com | Investigating | 100 | 39 | 9999999 |
| 15 | CR014 | Retail Theft | india | 20NOV2025 | AP05 | retailgmail.com | Solved | 65 | 17 | 1200 |
| 16 | CR015 | Forgery | Australia | 12DEC2025 | AU01 | forgery@gmail.com | Open | 72 | 52 | 45000 |
Why LENGTH Statements
Matter in SAS
The LENGTH
statement appears before assignments because SAS determines variable storage
length during compilation. If character variables are assigned before defining
LENGTH, truncation occurs silently.
Example:
data demo;
x="INVESTIGATION_PENDING";
length x $10;
run;
Here SAS
stores only "INVESTIG" because the variable length was fixed before
the LENGTH statement executed.
In R,
strings dynamically resize in memory, reducing truncation risk. SAS, however,
uses fixed-length storage architecture, making variable planning extremely
important in enterprise pipelines.
Key
enterprise lesson:
- Always define LENGTH early.
- Metadata governance prevents
silent corruption.
- Truncated categories damage
joins and reporting accuracy.
- Regulatory submissions
require deterministic structures.
2.SAS Cleaning Workflow
data crime_clean;
retain Data_Source "GLOBAL_CRIME_MONITOR";
set crime_raw;
length Clean_Email $60 Clean_Status $15 Risk_Level $10;;
Crime_Type = propcase(strip(Crime_Type));
Country = upcase(compbl(strip(Country)));
Region_Code = upcase(compress(Region_Code));
Clean_Status = upcase(strip(Status));
if Age < 18 or Age > 100 then Age = .;
Financial_Loss = abs(Financial_Loss);
if Financial_Loss=. then Financial_Loss=0;
Financial_Loss = round(Financial_Loss,0.01);
Parsed_Date = input(Incident_Date,date9.);
format Parsed_Date yymmdd10.;
if missing(Parsed_Date) then
Parsed_Date=intnx('day',today(),-30);
if find(Officer_Email,'@')=0 then
Clean_Email='INVALID_EMAIL';
else Clean_Email=lowcase(strip(Officer_Email));
if Evidence_Score='NULL' then Evidence_Score='0';
Evidence_Num=input(Evidence_Score,best12.);
if missing(Evidence_Num) then Evidence_Num=0;
Case_Category = scan(Crime_Type,1,' ');
if Financial_Loss > 500000 then
Risk_Level="HIGH";
else if Financial_Loss > 100000 then
Risk_Level="MEDIUM";
else
Risk_Level="LOW";
drop Officer_Email Status Crime_Type
Evidence_Score Incident_Date;
rename Clean_Email = Officer_Email
Clean_Status = Status
Case_Category = Crime_Type
Evidence_Num = Evidence_Score
Parsed_Date = Incident_Date;
run;
proc print data = crime_clean;
run;
OUTPUT:
| Obs | Data_Source | Case_ID | Country | Region_Code | Age | Financial_Loss | Officer_Email | Status | Risk_Level | Incident_Date | Evidence_Score | Crime_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_CRIME_MONITOR | CR001 | INDIA | AP01 | 34 | 50000 | officer1@gmail.com | OPEN | LOW | 2025-01-12 | 85 | Cyber |
| 2 | GLOBAL_CRIME_MONITOR | CR001 | INDIA | AP01 | . | 9000 | INVALID_EMAIL | OPEN | LOW | 2026-05-08 | 0 | Cyber |
| 3 | GLOBAL_CRIME_MONITOR | CR002 | USA | NY01 | . | 700000 | INVALID_EMAIL | CLOSED | HIGH | 2025-03-15 | 91 | Money |
| 4 | GLOBAL_CRIME_MONITOR | CR003 | UK | LN01 | 45 | 250000 | detective@gmail.com | PENDING | MEDIUM | 2026-05-08 | 76 | Homicide |
| 5 | GLOBAL_CRIME_MONITOR | CR004 | INDIA | AP02 | 22 | 0 | officer2@gmail | OPEN | LOW | 2025-04-22 | 88 | Kidnapping |
| 6 | GLOBAL_CRIME_MONITOR | CR005 | CANADA | TR01 | 19 | 5000 | INVALID_EMAIL | SOLVED | LOW | 2025-05-19 | 92 | Cyber |
| 7 | GLOBAL_CRIME_MONITOR | CR006 | BRAZIL | BZ01 | . | 900000 | drug@gmail.com | CLOSED | HIGH | 2026-05-08 | 0 | Drug |
| 8 | GLOBAL_CRIME_MONITOR | CR007 | INDIA | AP03 | 35 | 25000 | robbery@gmail.com | SOLVED | LOW | 2025-06-01 | 79 | Robbery |
| 9 | GLOBAL_CRIME_MONITOR | CR008 | INDIA | AP03 | 36 | 26000 | robbery@gmail.com | SOLVED | LOW | 2025-06-01 | 79 | Robbery |
| 10 | GLOBAL_CRIME_MONITOR | CR009 | USA | NY02 | 41 | 650000 | human@@gmail.com | PENDING | HIGH | 2025-07-15 | 95 | Human |
| 11 | GLOBAL_CRIME_MONITOR | CR010 | INDIA | AP04 | 29 | 0 | cyber@gmail.com | OPEN | LOW | 2025-08-16 | 83 | Cyber |
| 12 | GLOBAL_CRIME_MONITOR | CR011 | UK | LN02 | . | 150000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance |
| 13 | GLOBAL_CRIME_MONITOR | CR012 | UK | LN02 | 44 | 160000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance |
| 14 | GLOBAL_CRIME_MONITOR | CR013 | UAE | DXB01 | 39 | 9999999 | terror@gmail.com | INVESTIGATING | HIGH | 2025-10-17 | 100 | Terror |
| 15 | GLOBAL_CRIME_MONITOR | CR014 | INDIA | AP05 | . | 1200 | INVALID_EMAIL | SOLVED | LOW | 2025-11-20 | 65 | Retail |
| 16 | GLOBAL_CRIME_MONITOR | CR015 | AUSTRALIA | AU01 | 52 | 45000 | forgery@gmail.com | OPEN | LOW | 2025-12-12 | 72 | Forgery |
Explanation of SAS Cleaning
Logic
This
workflow demonstrates enterprise defensive programming. PROPCASE, UPCASE, and COMPBL
normalize inconsistent text formatting. ABS() corrects negative financial
values frequently caused by ETL ingestion defects. INPUT() converts corrupted
dates into numeric SAS dates for analytics. Invalid dates are imputed using INTNX()
to maintain downstream continuity.
The email
validation logic uses FIND() to identify malformed addresses. SCAN() extracts
primary crime categories for reporting segmentation. The CASE expression
creates dynamic risk classification based on financial exposure.
Key
production concepts:
- Defensive validation
prevents dashboard corruption.
- Missing-value
standardization improves reproducibility.
- Controlled imputation
ensures traceability.
- Standardized categories
stabilize machine-learning inputs.
- Enterprise pipelines require
deterministic transformations.
3.Removing Duplicate Crime Cases
proc sort data=crime_clean
out=crime_nodup nodupkey;
by Case_ID;
run;
proc print data = crime_nodup;
run;
LOG:
OUTPUT:
| Obs | Data_Source | Case_ID | Country | Region_Code | Age | Financial_Loss | Officer_Email | Status | Risk_Level | Incident_Date | Evidence_Score | Crime_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_CRIME_MONITOR | CR001 | INDIA | AP01 | 34 | 50000 | officer1@gmail.com | OPEN | LOW | 2025-01-12 | 85 | Cyber |
| 2 | GLOBAL_CRIME_MONITOR | CR002 | USA | NY01 | . | 700000 | INVALID_EMAIL | CLOSED | HIGH | 2025-03-15 | 91 | Money |
| 3 | GLOBAL_CRIME_MONITOR | CR003 | UK | LN01 | 45 | 250000 | detective@gmail.com | PENDING | MEDIUM | 2026-05-08 | 76 | Homicide |
| 4 | GLOBAL_CRIME_MONITOR | CR004 | INDIA | AP02 | 22 | 0 | officer2@gmail | OPEN | LOW | 2025-04-22 | 88 | Kidnapping |
| 5 | GLOBAL_CRIME_MONITOR | CR005 | CANADA | TR01 | 19 | 5000 | INVALID_EMAIL | SOLVED | LOW | 2025-05-19 | 92 | Cyber |
| 6 | GLOBAL_CRIME_MONITOR | CR006 | BRAZIL | BZ01 | . | 900000 | drug@gmail.com | CLOSED | HIGH | 2026-05-08 | 0 | Drug |
| 7 | GLOBAL_CRIME_MONITOR | CR007 | INDIA | AP03 | 35 | 25000 | robbery@gmail.com | SOLVED | LOW | 2025-06-01 | 79 | Robbery |
| 8 | GLOBAL_CRIME_MONITOR | CR008 | INDIA | AP03 | 36 | 26000 | robbery@gmail.com | SOLVED | LOW | 2025-06-01 | 79 | Robbery |
| 9 | GLOBAL_CRIME_MONITOR | CR009 | USA | NY02 | 41 | 650000 | human@@gmail.com | PENDING | HIGH | 2025-07-15 | 95 | Human |
| 10 | GLOBAL_CRIME_MONITOR | CR010 | INDIA | AP04 | 29 | 0 | cyber@gmail.com | OPEN | LOW | 2025-08-16 | 83 | Cyber |
| 11 | GLOBAL_CRIME_MONITOR | CR011 | UK | LN02 | . | 150000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance |
| 12 | GLOBAL_CRIME_MONITOR | CR012 | UK | LN02 | 44 | 160000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance |
| 13 | GLOBAL_CRIME_MONITOR | CR013 | UAE | DXB01 | 39 | 9999999 | terror@gmail.com | INVESTIGATING | HIGH | 2025-10-17 | 100 | Terror |
| 14 | GLOBAL_CRIME_MONITOR | CR014 | INDIA | AP05 | . | 1200 | INVALID_EMAIL | SOLVED | LOW | 2025-11-20 | 65 | Retail |
| 15 | GLOBAL_CRIME_MONITOR | CR015 | AUSTRALIA | AU01 | 52 | 45000 | forgery@gmail.com | OPEN | LOW | 2025-12-12 | 72 | Forgery |
Explanation
Duplicate
case IDs create inflated crime counts and inaccurate trend reports. PROC SORT
NODUPKEY retains the first occurrence while removing duplicates.
Real-world
consequences of duplicates:
- False crime escalation
trends
- Incorrect fraud exposure
calculations
- Duplicate insurance
settlements
- Regulatory reporting
inconsistencies
4.Advanced DATA Step Processing
proc sort data=crime_nodup;
by Country;
run;
proc print data = crime_nodup;
run;
OUTPUT:
| Obs | Data_Source | Case_ID | Country | Region_Code | Age | Financial_Loss | Officer_Email | Status | Risk_Level | Incident_Date | Evidence_Score | Crime_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_CRIME_MONITOR | CR015 | AUSTRALIA | AU01 | 52 | 45000 | forgery@gmail.com | OPEN | LOW | 2025-12-12 | 72 | Forgery |
| 2 | GLOBAL_CRIME_MONITOR | CR006 | BRAZIL | BZ01 | . | 900000 | drug@gmail.com | CLOSED | HIGH | 2026-05-08 | 0 | Drug |
| 3 | GLOBAL_CRIME_MONITOR | CR005 | CANADA | TR01 | 19 | 5000 | INVALID_EMAIL | SOLVED | LOW | 2025-05-19 | 92 | Cyber |
| 4 | GLOBAL_CRIME_MONITOR | CR001 | INDIA | AP01 | 34 | 50000 | officer1@gmail.com | OPEN | LOW | 2025-01-12 | 85 | Cyber |
| 5 | GLOBAL_CRIME_MONITOR | CR004 | INDIA | AP02 | 22 | 0 | officer2@gmail | OPEN | LOW | 2025-04-22 | 88 | Kidnapping |
| 6 | GLOBAL_CRIME_MONITOR | CR007 | INDIA | AP03 | 35 | 25000 | robbery@gmail.com | SOLVED | LOW | 2025-06-01 | 79 | Robbery |
| 7 | GLOBAL_CRIME_MONITOR | CR008 | INDIA | AP03 | 36 | 26000 | robbery@gmail.com | SOLVED | LOW | 2025-06-01 | 79 | Robbery |
| 8 | GLOBAL_CRIME_MONITOR | CR010 | INDIA | AP04 | 29 | 0 | cyber@gmail.com | OPEN | LOW | 2025-08-16 | 83 | Cyber |
| 9 | GLOBAL_CRIME_MONITOR | CR014 | INDIA | AP05 | . | 1200 | INVALID_EMAIL | SOLVED | LOW | 2025-11-20 | 65 | Retail |
| 10 | GLOBAL_CRIME_MONITOR | CR013 | UAE | DXB01 | 39 | 9999999 | terror@gmail.com | INVESTIGATING | HIGH | 2025-10-17 | 100 | Terror |
| 11 | GLOBAL_CRIME_MONITOR | CR003 | UK | LN01 | 45 | 250000 | detective@gmail.com | PENDING | MEDIUM | 2026-05-08 | 76 | Homicide |
| 12 | GLOBAL_CRIME_MONITOR | CR011 | UK | LN02 | . | 150000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance |
| 13 | GLOBAL_CRIME_MONITOR | CR012 | UK | LN02 | 44 | 160000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance |
| 14 | GLOBAL_CRIME_MONITOR | CR002 | USA | NY01 | . | 700000 | INVALID_EMAIL | CLOSED | HIGH | 2025-03-15 | 91 | Money |
| 15 | GLOBAL_CRIME_MONITOR | CR009 | USA | NY02 | 41 | 650000 | human@@gmail.com | PENDING | HIGH | 2025-07-15 | 95 | Human |
data crime_flags;
set crime_nodup;
by Country;
retain Country_Count;
if first.Country then Country_Count=0;
Country_Count+1;
if last.Country then output;
run;
proc print data = crime_flags;
run;
OUTPUT:
| Obs | Data_Source | Case_ID | Country | Region_Code | Age | Financial_Loss | Officer_Email | Status | Risk_Level | Incident_Date | Evidence_Score | Crime_Type | Country_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_CRIME_MONITOR | CR015 | AUSTRALIA | AU01 | 52 | 45000 | forgery@gmail.com | OPEN | LOW | 2025-12-12 | 72 | Forgery | 1 |
| 2 | GLOBAL_CRIME_MONITOR | CR006 | BRAZIL | BZ01 | . | 900000 | drug@gmail.com | CLOSED | HIGH | 2026-05-08 | 0 | Drug | 1 |
| 3 | GLOBAL_CRIME_MONITOR | CR005 | CANADA | TR01 | 19 | 5000 | INVALID_EMAIL | SOLVED | LOW | 2025-05-19 | 92 | Cyber | 1 |
| 4 | GLOBAL_CRIME_MONITOR | CR014 | INDIA | AP05 | . | 1200 | INVALID_EMAIL | SOLVED | LOW | 2025-11-20 | 65 | Retail | 6 |
| 5 | GLOBAL_CRIME_MONITOR | CR013 | UAE | DXB01 | 39 | 9999999 | terror@gmail.com | INVESTIGATING | HIGH | 2025-10-17 | 100 | Terror | 1 |
| 6 | GLOBAL_CRIME_MONITOR | CR012 | UK | LN02 | 44 | 160000 | claim@gmail.com | CLOSED | MEDIUM | 2025-09-01 | 90 | Insurance | 3 |
| 7 | GLOBAL_CRIME_MONITOR | CR009 | USA | NY02 | 41 | 650000 | human@@gmail.com | PENDING | HIGH | 2025-07-15 | 95 | Human | 2 |
Explanation
FIRST.
and LAST. processing are essential in longitudinal and grouped analytics. Here,
country-level crime aggregation is performed efficiently without SQL overhead.
This
technique is widely used in:
- SDTM patient visit
sequencing
- Banking transaction grouping
- Insurance claim
summarization
- Retail customer analytics
5.PROC FORMAT for Controlled Reporting
proc format;
value riskfmt 1-99999='LOW'
100000-500000='MEDIUM'
500001-high='HIGH';
run;
LOG:
Explanation
Formats
centralize business logic and improve governance. Instead of hardcoding labels
repeatedly, enterprise teams maintain reusable standards through formats.
Benefits
include:
- Consistent reporting
- Easier QC
- Better auditability
- Centralized metadata control
5.PROC SQL Enterprise Join Example
proc sql;
create table crime_summary as
select Country,count(*) as Total_Cases,
sum(Financial_Loss) as Total_Loss format=riskfmt.,
avg(Evidence_Score) as Avg_Evidence
from crime_nodup
group by Country;
quit;
proc print data = crime_summary;
run;
OUTPUT:
| Obs | Country | Total_Cases | Total_Loss | Avg_Evidence |
|---|---|---|---|---|
| 1 | AUSTRALIA | 1 | LOW | 72.000 |
| 2 | BRAZIL | 1 | HIGH | 0.000 |
| 3 | CANADA | 1 | LOW | 92.000 |
| 4 | INDIA | 6 | MEDIUM | 79.833 |
| 5 | UAE | 1 | HIGH | 100.000 |
| 6 | UK | 3 | HIGH | 85.333 |
| 7 | USA | 2 | HIGH | 93.000 |
Explanation
PROC SQL
simplifies aggregation logic and joins. In production environments, SQL often
integrates external operational systems.
Advantages:
- Easier joins
- Faster summarization
- Flexible aggregation
- Better readability for relational
operations
DATA Step
remains superior for row-wise transformations and sequential logic.
6.PROC REPORT for Executive Outputs
proc report data=crime_summary nowd;
column Country Total_Cases Total_Loss Avg_Evidence;
define Country / group;
define Total_Cases / analysis;
define Total_Loss / analysis format=dollar15.;
define Avg_Evidence / analysis format=8.2;
run;
OUTPUT:
| Country | Total_Cases | Total_Loss | Avg_Evidence |
|---|---|---|---|
| AUSTRALIA | 1 | $45,000 | 72.00 |
| BRAZIL | 1 | $900,000 | 0.00 |
| CANADA | 1 | $5,000 | 92.00 |
| INDIA | 6 | $102,200 | 79.83 |
| UAE | 1 | $9,999,999 | 100.00 |
| UK | 3 | $560,000 | 85.33 |
| USA | 2 | $1,350,000 | 93.00 |
Explanation
PROC
REPORT creates enterprise-ready outputs for executives, regulators, and
auditors.
Production
advantages:
- Custom layouts
- Conditional reporting
- Regulatory-ready formatting
- Better presentation
flexibility
7.Reusable SAS Macro Framework
%macro missing_check(ds,var);
proc sql;
select count(*) as Missing_Count
from &ds
where missing(&var);
quit;
%mend;
%missing_check(crime_nodup,Age);
OUTPUT:
| Missing_Count |
|---|
| 4 |
Explanation
Macros
enable reusable validation frameworks across domains. Enterprise organizations
standardize macros to ensure consistency across studies and reporting pipelines.
Benefits:
- Reduced duplication
- Faster deployment
- Standardized QC
- Easier maintenance
8.R Raw Dataset Creation
library(tidyverse)
library(lubridate)
library(janitor)
crime_raw <- tribble(
~Case_ID,~Crime_Type,~Country,~Age,~Financial_Loss,
~Incident_Date,~Region_Code,~Officer_Email,~Status,
"CR001","Cyber Fraud","india",34,50000,
"12JAN2025","ap01","officer1@gmail.com","Open",
"CR001","cyber fraud","INDIA",-5,-9000,
"31FEB2025","AP01","wrongmail.com","open",
"CR002","Money Laundering","usa",130,700000,
"15MAR2025","NY01","agent#mail.com","Closed",
)
OUTPUT:
|
|
Case_ID |
Crime_Type |
Country |
Age |
Financial_Loss |
Incident_Date |
Region_Code |
Officer_Email |
Status |
|
1 |
CR001 |
Cyber Fraud |
india |
34 |
50000 |
12JAN2025 |
ap01 |
officer1@gmail.com |
Open |
|
2 |
CR001 |
cyber fraud |
INDIA |
-5 |
-9000 |
31FEB2025 |
AP01 |
wrongmail.com |
open |
|
3 |
CR002 |
Money Laundering |
usa |
130 |
700000 |
15MAR2025 |
NY01 |
agent#mail.com |
Closed |
9.R Enterprise Cleaning Workflow
options(scipen = 999)
crime_clean <- crime_raw %>%
clean_names() %>%
mutate(
crime_type=str_to_title(str_trim(crime_type)),
country=str_to_upper(str_trim(country)),
region_code=str_to_upper(str_replace_all(region_code," ","")),
financial_loss=abs(financial_loss),
age=if_else(age<18 | age>100,
NA_real_,as.numeric(age)),
incident_date=suppressWarnings(parse_date_time(
incident_date,orders="dby")),
officer_email=if_else(grepl("@",officer_email),
tolower(officer_email),"INVALID_EMAIL"),
status=str_to_upper(status),
risk_level=case_when(
financial_loss>500000 ~ "HIGH",
financial_loss>100000 ~ "MEDIUM",
TRUE ~ "LOW")
)
|
|
case_id |
crime_type |
country |
age |
financial_loss |
incident_date |
region_code |
officer_email |
status |
risk_level |
|
1 |
CR001 |
Cyber Fraud |
INDIA |
34 |
50000 |
2025-01-12 |
AP01 |
officer1@gmail.com |
OPEN |
LOW |
|
2 |
CR001 |
Cyber Fraud |
INDIA |
NA |
9000 |
NA |
AP01 |
INVALID_EMAIL |
OPEN |
LOW |
|
3 |
CR002 |
Money Laundering |
USA |
NA |
700000 |
2025-03-15 |
NY01 |
INVALID_EMAIL |
CLOSED |
HIGH |
Explanation of R Cleaning
Workflow
The tidyverse
ecosystem provides expressive and scalable cleaning pipelines. mutate()
performs controlled transformations, while case_when() creates readable
conditional logic similar to SAS IF-THEN structures.
parse_date_time()
handles inconsistent dates more flexibly than base parsing. str_trim() removes
hidden whitespace corruption. grepl() validates email structures, while if_else()
applies defensive correction logic.
Compared
with SAS:
|
SAS |
R
Equivalent |
|
PROPCASE |
str_to_title |
|
COMPRESS |
str_replace_all |
|
INPUT |
parse_date_time |
|
IF-THEN |
if_else |
|
PROC
SQL |
dplyr
summarise |
|
MERGE |
left_join |
Validation &
Compliance
In
regulated industries, cleaning is not cosmetic it is compliance-critical.
Clinical
trial environments require:
- SDTM traceability
- ADaM derivation
reproducibility
- Independent QC programming
- Audit trails
- Metadata lineage
- Validation documentation
One
dangerous SAS behavior is that missing numeric values are treated lower than
valid numbers.
Example:
if score < 50 then
flag='FAIL';
Missing
scores become FAIL automatically unless explicitly checked.
Correct
logic:
if not missing(score) and score
< 50 then flag='FAIL';
This
small mistake can invalidate regulatory outputs.
20 Enterprise Data-Cleaning
Best Practices
- Standardize variable naming
conventions
- Validate all date fields
- Remove duplicate keys early
- Preserve raw datasets
- Use reusable macros
- Implement metadata
governance
- Create audit-ready logs
- Validate categorical domains
- Normalize text formatting
- Apply defensive programming
- Separate raw and clean
layers
- Maintain traceability
- Document all derivations
- Use QC independence
- Validate joins carefully
- Standardize missing-value
handling
- Use controlled terminology
- Version-control production
code
- Build automated validation
checks
- Test edge cases aggressively
Business Logic Behind
Cleaning Decisions
Business
logic transforms raw operational chaos into reliable analytical intelligence.
Missing values are imputed because downstream models, dashboards, and
statistical calculations require completeness. For example, if patient age is
recorded as 250 years, the value clearly represents corruption rather than
reality. Setting it to missing prevents distorted demographic summaries.
Negative
salary or billing values often emerge from ingestion defects or sign reversals
during ETL migration. Applying ABS() ensures financial consistency. Date
standardization is equally important because inconsistent formats break trend
analysis and time-series modeling.
Text
normalization ensures “india,” “INDIA,” and “ India ” become one standardized
category. Without normalization, country-level reporting becomes fragmented.
Missing
visit dates in healthcare systems may require controlled imputation using
protocol-defined rules. In banking, inconsistent customer IDs can merge
unrelated accounts incorrectly.
Ultimately,
business logic ensures analytics reflect operational truth rather than raw
system noise.
20 Sharp SAS & R
Cleaning Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Duplicate keys destroy
analytical trust.
- Metadata governance prevents
silent corruption.
- Missing dates break
longitudinal analytics.
- Audit trails protect
regulatory credibility.
- Defensive programming
reduces production failures.
- PROC FORMAT improves
governance consistency.
- Macros accelerate enterprise
scalability.
- Text normalization
stabilizes reporting outputs.
- Date conversions require
strict validation.
- QC independence improves
reliability.
- Controlled terminology
prevents classification drift.
- Traceability is mandatory in
healthcare analytics.
- R excels in flexible
transformations.
- SAS dominates regulated
production systems.
- SQL joins require careful
duplicate checks.
- Enterprise reporting depends
on deterministic pipelines.
- Clean data improves AI
reliability.
SAS vs R for Enterprise
Cleaning Workflows
SAS
provides unmatched governance, auditability, metadata control, and regulatory
acceptance. Pharmaceutical companies heavily rely on SAS because of
deterministic execution, stable production deployment, and traceable outputs.
DATA Step processing remains exceptionally efficient for row-wise
transformations, while PROC REPORT and PROC FORMAT support enterprise reporting
frameworks.
R
provides greater flexibility and modern transformation capabilities. Packages
like dplyr, stringr, and lubridate simplify complex data engineering tasks with
concise syntax. R also integrates naturally with machine learning and
visualization ecosystems.
SAS
excels in:
- Regulatory compliance
- Audit readiness
- Stable enterprise deployment
- Large-scale production
pipelines
R excels
in:
- Flexible transformations
- Open-source innovation
- Advanced analytics
- Rapid experimentation
Modern
organizations increasingly combine both ecosystems. SAS handles validated
production reporting while R supports exploratory analytics and AI modeling.
Summary
This
project demonstrated how corrupted global crime-case data can be transformed
into analysis-ready intelligence using enterprise-grade SAS and R workflows.
The raw dataset intentionally included major operational issues such as
duplicate case IDs, invalid dates, negative financial losses, inconsistent text
formatting, malformed emails, missing values, whitespace corruption, and
unrealistic age values. These issues commonly occur in real-world healthcare,
banking, insurance, retail, and law-enforcement systems and can severely damage
dashboards, AI predictions, fraud detection models, and executive reporting
accuracy.
Using
SAS, the project showcased advanced DATA Step engineering techniques including
LENGTH, INPUT/PUT conversions, RETAIN, FIRST./LAST. processing, IF-THEN logic,
PROC SQL, PROC FORMAT, PROC REPORT, macros, deduplication, and enterprise
validation checks. The workflow emphasized audit readiness, metadata
governance, and regulatory traceability. In R, modern tidyverse tools such as
dplyr, stringr, janitor, and lubridate were used for scalable transformations,
normalization, filtering, parsing, and validation.
The
project also highlighted critical compliance concepts including SDTM/ADaM
relevance, QC independence, missing-value risks in SAS, and reproducibility
standards. Ultimately, the combination of SAS and R created a robust, scalable,
and trustworthy analytical framework capable of converting dirty operational
crime data into reliable business intelligence and professional reporting
outputs.
Conclusion
Modern
analytics ecosystems depend on structured data-cleaning frameworks more than
sophisticated dashboards or AI algorithms. Poor-quality operational data
silently destroys analytical reliability, regulatory credibility, and executive
confidence. Whether the domain involves crime intelligence, banking fraud
detection, insurance claims processing, retail analytics, or clinical trials,
the underlying engineering challenge remains identical: transforming corrupted
raw records into trustworthy analytical intelligence.
SAS
provides deterministic governance, scalable production execution, metadata
control, and audit-ready processing essential for regulated industries. Its
DATA Step architecture, PROC SQL integration, macro standardization, and
reporting capabilities make it ideal for enterprise-grade pipelines requiring
reproducibility and compliance validation.
R
complements SAS by offering highly flexible transformation pipelines, modern
data manipulation frameworks, and advanced analytical integration. Together,
SAS and R create a powerful hybrid ecosystem capable of handling both validated
enterprise reporting and modern exploratory analytics.
The most
successful data engineers do not simply “clean data.” They engineer traceable,
reproducible, scalable, and validated intelligence systems capable of
supporting regulatory submissions, executive dashboards, AI models, and
operational decisions.
In
enterprise environments, clean data is not a technical luxury it is a business
survival requirement.
Interview Questions and
Answers
1. How would you detect duplicate crime case
records in SAS?
Use PROC
SORT NODUPKEY or PROC SQL GROUP BY HAVING COUNT(*)>1. Always validate
composite keys carefully before removal.
2. Why are missing numeric values dangerous in SAS?
SAS
treats missing numeric values as smaller than valid numbers. Improper IF
conditions can misclassify records silently.
3. How would you validate malformed emails in R?
Use grepl()
with regex validation logic inside mutate().
4. When would DATA Step outperform PROC SQL?
DATA Step
performs better for sequential row-level logic, retained calculations, arrays,
and BY-group processing.
5. Why is audit traceability critical in clinical
datasets?
Regulators
require proof showing how every derived variable originated from source data.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about CRIME CASES DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment