Global Bank Loot Records into Trusted Analytical Intelligence Using Advanced SAS (PROC SQL vs DATA Step) and Modern R Data Engineering Frameworks

Introduction: When Dirty Data Becomes a Million-Dollar Disaster

Imagine a multinational banking consortium investigating global bank loot incidents across multiple countries. Fraud analysts discover that several suspicious transactions were incorrectly classified as “LOW RISK” because of corrupted timestamps, duplicated transaction IDs, inconsistent region labels, malformed emails, and negative stolen amounts.

One executive dashboard showed that losses in Europe were lower than Asia. After investigation, analysts discovered that Europe records were coded as:

EU
Eu
europe
EUR
NULL

Because of this inconsistency, the dashboard fragmented the same region into five categories.

Meanwhile, a missing robbery date caused fraud trend models to skip high-risk events entirely. Duplicate transaction identifiers triggered validation failures during regulatory reporting. Even worse, negative stolen amounts contaminated financial summaries and AI fraud-prediction models.

This is not hypothetical. In real enterprise environments involving healthcare, banking, insurance, retail, and clinical trials, dirty data destroys:

Regulatory submissions
Statistical outputs
SDTM/ADaM derivations
Executive dashboards
Machine learning reliability
Operational trust

A Clinical SAS Programmer or Data Scientist quickly learns one truth:

“Analytics is only as trustworthy as the quality of the underlying data.”

This project demonstrates how corrupted global bank-loot operational datasets can be transformed into production-grade analytical intelligence using both SAS and R.

Raw Corrupted Dataset Global Bank Loot Intelligence

SAS Raw Dataset Creation

data bank_loot_raw;

length Loot_ID $12 Bank_Name $30 Country $20 Region $15

Criminal_Email $40 Status $15;

infile datalines dlm='|' truncover;

input Loot_ID $ Bank_Name $ Country $ Region $ Criminal_Email $

Loot_Amount Loot_Date :?? yymmdd10. Criminal_Age Status $;

format Loot_Date date9.;

datalines;

L001|WorldBank|USA|NA|lootmaster@gmail.com|500000|2025-01-15|45|Closed

L002|SafeVault|india|APAC|robbermail.com|-70000|2025-02-18|17|Open

L003|TrustBank|UK|EU|NULL|950000|.|130|Investigating

L004|MoneySecure|usa| north america |crimeboss@yahoo|450000|2025-03-20|35|Closed

L005|CashEmpire|Canada|NULL|cashking@gmail.com|.|2025-04-01|52|Resolved

L006|SecureHold|Germany|EUROPE|fraudster#mail.com|300000|2025-05-09|-5|Open

L007|IronBank|India|apac|stealer@gmail.com|1000000|2025-13-10|40|Closed

L008|RoyalTrust|France|EU|robber@domain|200000|2025-07-12|39|OPEN

L010|PrimeCash|Brazil|LATAM|badmail@|720000|2025-09-18|29|Resolved

L011|SecureLife|UK|EU|stealworld@gmail.com|0|2025-10-11|48|Closed

L012|MegaFunds|Australia|APAC|moneyloot@gmail.com|880000|2025-11-14|33|Investigating

L013|GoldReserve|USA|Na| |250000|2025-12-17|44|Closed

L014|UrbanVault|India|APAC|urban@gmail.com|450000|2026-01-11|22|Resolved

L015|TrustSafe|Italy|EU|lootmail@gmail.com|9999999|2026-02-09|31|Closed

L016|NextBank|China|APAC|china_thief@gmail|560000|2026-03-05|60|Open

L017|CashFlow|Mexico|LATAM|mxloot@gmail.com|340000|2026-04-06|38|Closed

L018|VaultNation|USA|NA|nation@gmail.com|.|2026-05-07|41|Pending

L019|SafeMoney|India|APAC|safe@gmail.com|-100|2026-06-08|27|Open

L020|EuroTrust|Spain|EUROPE|euro@gmail.com|430000|2026-07-09|32|Closed

;

run;

proc print data = bank_loot_raw;

run;

OUTPUT:


Obs	Loot_ID	Bank_Name	Country	Region	Criminal_Email	Status	Loot_Amount	Loot_Date	Criminal_Age
1	L001	WorldBank	USA	NA	lootmaster@gmail.com	Closed	500000	15JAN2025	45
2	L002	SafeVault	india	APAC	robbermail.com	Open	-70000	18FEB2025	17
3	L003	TrustBank	UK	EU	NULL	Investigating	950000	.	130
4	L004	MoneySecure	usa	north america	crimeboss@yahoo	Closed	450000	20MAR2025	35
5	L004	MoneySecure	usa	north america	crimeboss@yahoo	Closed	450000	20MAR2025	35
6	L005	CashEmpire	Canada	NULL	cashking@gmail.com	Resolved	.	01APR2025	52
7	L006	SecureHold	Germany	EUROPE	fraudster#mail.com	Open	300000	09MAY2025	-5
8	L007	IronBank	India	apac	stealer@gmail.com	Closed	1000000	.	40
9	L008	RoyalTrust	France	EU	robber@domain	OPEN	200000	12JUL2025	39
10	L009	VaultCore	Japan	APAC	thief@gmail.com	Closed	650000	15AUG2025	.
11	L010	PrimeCash	Brazil	LATAM	badmail@	Resolved	720000	18SEP2025	29
12	L011	SecureLife	UK	EU	stealworld@gmail.com	Closed	0	11OCT2025	48
13	L012	MegaFunds	Australia	APAC	moneyloot@gmail.com	Investigating	880000	14NOV2025	33
14	L013	GoldReserve	USA	Na		Closed	250000	17DEC2025	44
15	L014	UrbanVault	India	APAC	urban@gmail.com	Resolved	450000	11JAN2026	22
16	L015	TrustSafe	Italy	EU	lootmail@gmail.com	Closed	9999999	09FEB2026	31
17	L016	NextBank	China	APAC	china_thief@gmail	Open	560000	05MAR2026	60
18	L017	CashFlow	Mexico	LATAM	mxloot@gmail.com	Closed	340000	06APR2026	38
19	L018	VaultNation	USA	NA	nation@gmail.com	Pending	.	07MAY2026	41
20	L019	SafeMoney	India	APAC	safe@gmail.com	Open	-100	08JUN2026	27
21	L020	EuroTrust	Spain	EUROPE	euro@gmail.com	Closed	430000	09JUL2026	32

Why LENGTH Must Come First in SAS

One of the biggest beginner mistakes in SAS programming is placing the LENGTH statement after assignments.

Incorrect

Bank_Flag='HIGH';

length Bank_Flag $20;

SAS already assigned default length before the LENGTH statement executes. This causes:

Character truncation
Unexpected missing values
Validation mismatches

Correct

length Bank_Flag $20;

Bank_Flag='HIGH';

In R, character vectors dynamically resize, so truncation risks are far less severe. SAS, however, allocates memory earlier during compilation.

Enterprise SAS Cleaning Workflow

1.Metadata Inspection

proc contents data=bank_loot_raw varnum;

run;

OUTPUT:

The CONTENTS Procedure


Data Set Name	WORK.BANK_LOOT_RAW	Observations	21
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	05/16/2026 15:40:06	Observation Length	160
Last Modified	05/16/2026 15:40:06	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	131072
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	818
Obs in First Data Page	21
Number of Data Set Repairs	0
Filename	/saswork/SAS_work860000004B2C_odaws02-apse1-2.oda.sas.com/SAS_workA50600004B2C_odaws02-apse1-2.oda.sas.com/bank_loot_raw.sas7bdat
Release Created	9.0401M8
Host Created	Linux
Inode Number	134325615
Access Permission	rw-r--r--
Owner Name	u63247146
File Size	256KB
File Size (bytes)	262144


Variables in Creation Order
#	Variable	Type	Len	Format
1	Loot_ID	Char	12
2	Bank_Name	Char	30
3	Country	Char	20
4	Region	Char	15
5	Criminal_Email	Char	40
6	Status	Char	15
7	Loot_Amount	Num	8
8	Loot_Date	Num	8	DATE9.
9	Criminal_Age	Num	8

Explanation

PROC CONTENTS helps analysts inspect metadata, variable types, lengths, formats, and informats. In regulated environments like clinical trials or fraud analytics, metadata verification is mandatory because incorrect variable attributes can break downstream derivations, SDTM mappings, and reporting pipelines. This step identifies whether variables were incorrectly imported as character instead of numeric or vice versa. It also validates format consistency before transformations begin.

2.Data Cleaning Using DATA Step

data bank_loot_clean;

retain Source_System 'GLOBAL_FRAUD_AUDIT';

set bank_loot_raw;

array chars(*) Bank_Name Country Region Status Criminal_Email;

do i=1 to dim(chars);

chars(i)=strip(propcase(chars(i)));

end;

Region=upcase(strip(Region));

if Region in ('NA','NORTH AMERICA') then Region='NORTH_AMERICA';

else if Region in ('EU','EUROPE') then Region='EUROPE';

else if Region='APAC' then Region='ASIA_PACIFIC';

Loot_Amount=abs(Loot_Amount);

if Criminal_Age < 18 then Criminal_Age=18;

if Criminal_Age > 100 then Criminal_Age=.;

if find(Criminal_Email,'@')=0 then Criminal_Email='INVALID_EMAIL';

if Loot_Amount=. then Loot_Amount=50000;

Loot_Risk=round(Loot_Amount/100000,.1);

length Loot_Category $10.;

select;

when (Loot_Amount > 900000) Loot_Category='HIGH';

when (Loot_Amount > 400000) Loot_Category='MEDIUM';

otherwise Loot_Category='LOW';

end;

run;

proc print data = bank_loot_clean;

run;

OUTPUT:


Obs	Source_System	Loot_ID	Bank_Name	Country	Region	Criminal_Email	Status	Loot_Amount	Loot_Date	Criminal_Age	i	Loot_Risk	Loot_Category
1	GLOBAL_FRAUD_AUDIT	L001	Worldbank	Usa	NORTH_AMERICA	Lootmaster@gmail.Com	Closed	500000	15JAN2025	45	6	5.0	MEDIUM
2	GLOBAL_FRAUD_AUDIT	L002	Safevault	India	ASIA_PACIFIC	INVALID_EMAIL	Open	70000	18FEB2025	18	6	0.7	LOW
3	GLOBAL_FRAUD_AUDIT	L003	Trustbank	Uk	EUROPE	INVALID_EMAIL	Investigating	950000	.	.	6	9.5	HIGH
4	GLOBAL_FRAUD_AUDIT	L004	Moneysecure	Usa	NORTH_AMERICA	Crimeboss@yahoo	Closed	450000	20MAR2025	35	6	4.5	MEDIUM
5	GLOBAL_FRAUD_AUDIT	L004	Moneysecure	Usa	NORTH_AMERICA	Crimeboss@yahoo	Closed	450000	20MAR2025	35	6	4.5	MEDIUM
6	GLOBAL_FRAUD_AUDIT	L005	Cashempire	Canada	NULL	Cashking@gmail.Com	Resolved	50000	01APR2025	52	6	0.5	LOW
7	GLOBAL_FRAUD_AUDIT	L006	Securehold	Germany	EUROPE	INVALID_EMAIL	Open	300000	09MAY2025	18	6	3.0	LOW
8	GLOBAL_FRAUD_AUDIT	L007	Ironbank	India	ASIA_PACIFIC	Stealer@gmail.Com	Closed	1000000	.	40	6	10.0	HIGH
9	GLOBAL_FRAUD_AUDIT	L008	Royaltrust	France	EUROPE	Robber@domain	Open	200000	12JUL2025	39	6	2.0	LOW
10	GLOBAL_FRAUD_AUDIT	L009	Vaultcore	Japan	ASIA_PACIFIC	Thief@gmail.Com	Closed	650000	15AUG2025	18	6	6.5	MEDIUM
11	GLOBAL_FRAUD_AUDIT	L010	Primecash	Brazil	LATAM	Badmail@	Resolved	720000	18SEP2025	29	6	7.2	MEDIUM
12	GLOBAL_FRAUD_AUDIT	L011	Securelife	Uk	EUROPE	Stealworld@gmail.Com	Closed	0	11OCT2025	48	6	0.0	LOW
13	GLOBAL_FRAUD_AUDIT	L012	Megafunds	Australia	ASIA_PACIFIC	Moneyloot@gmail.Com	Investigating	880000	14NOV2025	33	6	8.8	MEDIUM
14	GLOBAL_FRAUD_AUDIT	L013	Goldreserve	Usa	NORTH_AMERICA	INVALID_EMAIL	Closed	250000	17DEC2025	44	6	2.5	LOW
15	GLOBAL_FRAUD_AUDIT	L014	Urbanvault	India	ASIA_PACIFIC	Urban@gmail.Com	Resolved	450000	11JAN2026	22	6	4.5	MEDIUM
16	GLOBAL_FRAUD_AUDIT	L015	Trustsafe	Italy	EUROPE	Lootmail@gmail.Com	Closed	9999999	09FEB2026	31	6	100.0	HIGH
17	GLOBAL_FRAUD_AUDIT	L016	Nextbank	China	ASIA_PACIFIC	China_thief@gmail	Open	560000	05MAR2026	60	6	5.6	MEDIUM
18	GLOBAL_FRAUD_AUDIT	L017	Cashflow	Mexico	LATAM	Mxloot@gmail.Com	Closed	340000	06APR2026	38	6	3.4	LOW
19	GLOBAL_FRAUD_AUDIT	L018	Vaultnation	Usa	NORTH_AMERICA	Nation@gmail.Com	Pending	50000	07MAY2026	41	6	0.5	LOW
20	GLOBAL_FRAUD_AUDIT	L019	Safemoney	India	ASIA_PACIFIC	Safe@gmail.Com	Open	100	08JUN2026	27	6	0.0	LOW
21	GLOBAL_FRAUD_AUDIT	L020	Eurotrust	Spain	EUROPE	Euro@gmail.Com	Closed	430000	09JUL2026	32	6	4.3	MEDIUM

Explanation

This DATA step demonstrates enterprise-grade SAS cleaning logic. Arrays standardize multiple variables simultaneously, reducing repetitive coding. PROPCASE, STRIP, and UPCASE normalize inconsistent text values. ABS() corrects negative monetary values frequently caused by ETL corruption or reversed accounting signs. FIND() validates email structures, while conditional logic fixes impossible ages and missing amounts. ROUND() standardizes derived metrics for reporting consistency. Such logic is common in clinical trial SDTM preparation, banking fraud detection, and insurance claims processing where regulatory traceability matters.

3.Deduplication

proc sort data=bank_loot_clean nodupkey;

by Loot_ID;

run;

proc print data = bank_loot_clean;

run;

LOG:

 NOTE: There were 21 observations read from the data set WORK.BANK_LOOT_CLEAN.

 NOTE: 1 observations with duplicate key values were deleted.

OUTPUT:


Obs	Source_System	Loot_ID	Bank_Name	Country	Region	Criminal_Email	Status	Loot_Amount	Loot_Date	Criminal_Age	i	Loot_Risk	Loot_Category
1	GLOBAL_FRAUD_AUDIT	L001	Worldbank	Usa	NORTH_AMERICA	Lootmaster@gmail.Com	Closed	500000	15JAN2025	45	6	5.0	MEDIUM
2	GLOBAL_FRAUD_AUDIT	L002	Safevault	India	ASIA_PACIFIC	INVALID_EMAIL	Open	70000	18FEB2025	18	6	0.7	LOW
3	GLOBAL_FRAUD_AUDIT	L003	Trustbank	Uk	EUROPE	INVALID_EMAIL	Investigating	950000	.	.	6	9.5	HIGH
4	GLOBAL_FRAUD_AUDIT	L004	Moneysecure	Usa	NORTH_AMERICA	Crimeboss@yahoo	Closed	450000	20MAR2025	35	6	4.5	MEDIUM
5	GLOBAL_FRAUD_AUDIT	L005	Cashempire	Canada	NULL	Cashking@gmail.Com	Resolved	50000	01APR2025	52	6	0.5	LOW
6	GLOBAL_FRAUD_AUDIT	L006	Securehold	Germany	EUROPE	INVALID_EMAIL	Open	300000	09MAY2025	18	6	3.0	LOW
7	GLOBAL_FRAUD_AUDIT	L007	Ironbank	India	ASIA_PACIFIC	Stealer@gmail.Com	Closed	1000000	.	40	6	10.0	HIGH
8	GLOBAL_FRAUD_AUDIT	L008	Royaltrust	France	EUROPE	Robber@domain	Open	200000	12JUL2025	39	6	2.0	LOW
9	GLOBAL_FRAUD_AUDIT	L009	Vaultcore	Japan	ASIA_PACIFIC	Thief@gmail.Com	Closed	650000	15AUG2025	18	6	6.5	MEDIUM
10	GLOBAL_FRAUD_AUDIT	L010	Primecash	Brazil	LATAM	Badmail@	Resolved	720000	18SEP2025	29	6	7.2	MEDIUM
11	GLOBAL_FRAUD_AUDIT	L011	Securelife	Uk	EUROPE	Stealworld@gmail.Com	Closed	0	11OCT2025	48	6	0.0	LOW
12	GLOBAL_FRAUD_AUDIT	L012	Megafunds	Australia	ASIA_PACIFIC	Moneyloot@gmail.Com	Investigating	880000	14NOV2025	33	6	8.8	MEDIUM
13	GLOBAL_FRAUD_AUDIT	L013	Goldreserve	Usa	NORTH_AMERICA	INVALID_EMAIL	Closed	250000	17DEC2025	44	6	2.5	LOW
14	GLOBAL_FRAUD_AUDIT	L014	Urbanvault	India	ASIA_PACIFIC	Urban@gmail.Com	Resolved	450000	11JAN2026	22	6	4.5	MEDIUM
15	GLOBAL_FRAUD_AUDIT	L015	Trustsafe	Italy	EUROPE	Lootmail@gmail.Com	Closed	9999999	09FEB2026	31	6	100.0	HIGH
16	GLOBAL_FRAUD_AUDIT	L016	Nextbank	China	ASIA_PACIFIC	China_thief@gmail	Open	560000	05MAR2026	60	6	5.6	MEDIUM
17	GLOBAL_FRAUD_AUDIT	L017	Cashflow	Mexico	LATAM	Mxloot@gmail.Com	Closed	340000	06APR2026	38	6	3.4	LOW
18	GLOBAL_FRAUD_AUDIT	L018	Vaultnation	Usa	NORTH_AMERICA	Nation@gmail.Com	Pending	50000	07MAY2026	41	6	0.5	LOW
19	GLOBAL_FRAUD_AUDIT	L019	Safemoney	India	ASIA_PACIFIC	Safe@gmail.Com	Open	100	08JUN2026	27	6	0.0	LOW
20	GLOBAL_FRAUD_AUDIT	L020	Eurotrust	Spain	EUROPE	Euro@gmail.Com	Closed	430000	09JUL2026	32	6	4.3	MEDIUM

Explanation

Duplicate transaction identifiers create catastrophic reporting problems in banking and healthcare systems. PROC SORT NODUPKEY removes duplicate observations based on business keys. In clinical trials, duplicate subject IDs can invalidate efficacy analysis populations. In fraud systems, duplicate transactions inflate financial exposure calculations. Deduplication is therefore not cosmetic cleaning it is a compliance requirement.

4.PROC SQL Validation Layer

proc sql;

create table loot_summary as

select Region,count(*) as Total_Loots,

sum(Loot_Amount) as Total_Amount format=dollar15.,

avg(Criminal_Age) as Avg_Age

from bank_loot_clean

group by Region;

quit;

proc print data = loot_summary;

run;

OUTPUT:


Obs	Region	Total_Loots	Total_Amount	Avg_Age
1	ASIA_PACIFIC	7	$3,610,100	31.1429
2	EUROPE	6	$11,879,999	33.6000
3	LATAM	2	$1,060,000	33.5000
4	NORTH_AMERICA	4	$1,250,000	41.2500
5	NULL	1	$50,000	52.0000

Explanation

PROC SQL provides relational-style summarization and joins. Compared to DATA step BY-group processing, PROC SQL is often preferred for aggregation-heavy workflows. Financial intelligence systems use SQL-based summaries for executive dashboards, fraud heatmaps, and compliance reports. Proper grouping ensures consistent regional metrics after standardization logic has corrected corrupted region labels.

5.Advanced DATA Step MERGE Logic

data region_master;

input Region:$15. Risk_Level $;

datalines;

EUROPE HIGH

ASIA_PACIFIC MEDIUM

NORTH_AMERICA HIGH

LATAM MEDIUM

;

run;

proc print data = region_master;

run;

OUTPUT:


Obs	Region	Risk_Level
1	EUROPE	HIGH
2	ASIA_PACIFIC	MEDIUM
3	NORTH_AMERICA	HIGH
4	LATAM	MEDIUM

proc sort data=bank_loot_clean; by Region; run;

proc print data = bank_loot_clean;

run;

OUTPUT:


Obs	Source_System	Loot_ID	Bank_Name	Country	Region	Criminal_Email	Status	Loot_Amount	Loot_Date	Criminal_Age	i	Loot_Risk	Loot_Category
1	GLOBAL_FRAUD_AUDIT	L002	Safevault	India	ASIA_PACIFIC	INVALID_EMAIL	Open	70000	18FEB2025	18	6	0.7	LOW
2	GLOBAL_FRAUD_AUDIT	L007	Ironbank	India	ASIA_PACIFIC	Stealer@gmail.Com	Closed	1000000	.	40	6	10.0	HIGH
3	GLOBAL_FRAUD_AUDIT	L009	Vaultcore	Japan	ASIA_PACIFIC	Thief@gmail.Com	Closed	650000	15AUG2025	18	6	6.5	MEDIUM
4	GLOBAL_FRAUD_AUDIT	L012	Megafunds	Australia	ASIA_PACIFIC	Moneyloot@gmail.Com	Investigating	880000	14NOV2025	33	6	8.8	MEDIUM
5	GLOBAL_FRAUD_AUDIT	L014	Urbanvault	India	ASIA_PACIFIC	Urban@gmail.Com	Resolved	450000	11JAN2026	22	6	4.5	MEDIUM
6	GLOBAL_FRAUD_AUDIT	L016	Nextbank	China	ASIA_PACIFIC	China_thief@gmail	Open	560000	05MAR2026	60	6	5.6	MEDIUM
7	GLOBAL_FRAUD_AUDIT	L019	Safemoney	India	ASIA_PACIFIC	Safe@gmail.Com	Open	100	08JUN2026	27	6	0.0	LOW
8	GLOBAL_FRAUD_AUDIT	L003	Trustbank	Uk	EUROPE	INVALID_EMAIL	Investigating	950000	.	.	6	9.5	HIGH
9	GLOBAL_FRAUD_AUDIT	L006	Securehold	Germany	EUROPE	INVALID_EMAIL	Open	300000	09MAY2025	18	6	3.0	LOW
10	GLOBAL_FRAUD_AUDIT	L008	Royaltrust	France	EUROPE	Robber@domain	Open	200000	12JUL2025	39	6	2.0	LOW
11	GLOBAL_FRAUD_AUDIT	L011	Securelife	Uk	EUROPE	Stealworld@gmail.Com	Closed	0	11OCT2025	48	6	0.0	LOW
12	GLOBAL_FRAUD_AUDIT	L015	Trustsafe	Italy	EUROPE	Lootmail@gmail.Com	Closed	9999999	09FEB2026	31	6	100.0	HIGH
13	GLOBAL_FRAUD_AUDIT	L020	Eurotrust	Spain	EUROPE	Euro@gmail.Com	Closed	430000	09JUL2026	32	6	4.3	MEDIUM
14	GLOBAL_FRAUD_AUDIT	L010	Primecash	Brazil	LATAM	Badmail@	Resolved	720000	18SEP2025	29	6	7.2	MEDIUM
15	GLOBAL_FRAUD_AUDIT	L017	Cashflow	Mexico	LATAM	Mxloot@gmail.Com	Closed	340000	06APR2026	38	6	3.4	LOW
16	GLOBAL_FRAUD_AUDIT	L001	Worldbank	Usa	NORTH_AMERICA	Lootmaster@gmail.Com	Closed	500000	15JAN2025	45	6	5.0	MEDIUM
17	GLOBAL_FRAUD_AUDIT	L004	Moneysecure	Usa	NORTH_AMERICA	Crimeboss@yahoo	Closed	450000	20MAR2025	35	6	4.5	MEDIUM
18	GLOBAL_FRAUD_AUDIT	L013	Goldreserve	Usa	NORTH_AMERICA	INVALID_EMAIL	Closed	250000	17DEC2025	44	6	2.5	LOW
19	GLOBAL_FRAUD_AUDIT	L018	Vaultnation	Usa	NORTH_AMERICA	Nation@gmail.Com	Pending	50000	07MAY2026	41	6	0.5	LOW
20	GLOBAL_FRAUD_AUDIT	L005	Cashempire	Canada	NULL	Cashking@gmail.Com	Resolved	50000	01APR2025	52	6	0.5	LOW

proc sort data=region_master; by Region; run;

proc print data = region_master;

run;

OUTPUT:


Obs	Region	Risk_Level
1	ASIA_PACIFIC	MEDIUM
2	EUROPE	HIGH
3	LATAM	MEDIUM
4	NORTH_AMERICA	HIGH

data merged_loot;

merge bank_loot_clean(in=a)

region_master(in=b);

by Region;

if a;

run;

proc print data = merged_loot;

run;

OUTPUT:


Obs	Source_System	Loot_ID	Bank_Name	Country	Region	Criminal_Email	Status	Loot_Amount	Loot_Date	Criminal_Age	i	Loot_Risk	Loot_Category	Risk_Level
1	GLOBAL_FRAUD_AUDIT	L002	Safevault	India	ASIA_PACIFIC	INVALID_EMAIL	Open	70000	18FEB2025	18	6	0.7	LOW	MEDIUM
2	GLOBAL_FRAUD_AUDIT	L007	Ironbank	India	ASIA_PACIFIC	Stealer@gmail.Com	Closed	1000000	.	40	6	10.0	HIGH	MEDIUM
3	GLOBAL_FRAUD_AUDIT	L009	Vaultcore	Japan	ASIA_PACIFIC	Thief@gmail.Com	Closed	650000	15AUG2025	18	6	6.5	MEDIUM	MEDIUM
4	GLOBAL_FRAUD_AUDIT	L012	Megafunds	Australia	ASIA_PACIFIC	Moneyloot@gmail.Com	Investigating	880000	14NOV2025	33	6	8.8	MEDIUM	MEDIUM
5	GLOBAL_FRAUD_AUDIT	L014	Urbanvault	India	ASIA_PACIFIC	Urban@gmail.Com	Resolved	450000	11JAN2026	22	6	4.5	MEDIUM	MEDIUM
6	GLOBAL_FRAUD_AUDIT	L016	Nextbank	China	ASIA_PACIFIC	China_thief@gmail	Open	560000	05MAR2026	60	6	5.6	MEDIUM	MEDIUM
7	GLOBAL_FRAUD_AUDIT	L019	Safemoney	India	ASIA_PACIFIC	Safe@gmail.Com	Open	100	08JUN2026	27	6	0.0	LOW	MEDIUM
8	GLOBAL_FRAUD_AUDIT	L003	Trustbank	Uk	EUROPE	INVALID_EMAIL	Investigating	950000	.	.	6	9.5	HIGH	HIGH
9	GLOBAL_FRAUD_AUDIT	L006	Securehold	Germany	EUROPE	INVALID_EMAIL	Open	300000	09MAY2025	18	6	3.0	LOW	HIGH
10	GLOBAL_FRAUD_AUDIT	L008	Royaltrust	France	EUROPE	Robber@domain	Open	200000	12JUL2025	39	6	2.0	LOW	HIGH
11	GLOBAL_FRAUD_AUDIT	L011	Securelife	Uk	EUROPE	Stealworld@gmail.Com	Closed	0	11OCT2025	48	6	0.0	LOW	HIGH
12	GLOBAL_FRAUD_AUDIT	L015	Trustsafe	Italy	EUROPE	Lootmail@gmail.Com	Closed	9999999	09FEB2026	31	6	100.0	HIGH	HIGH
13	GLOBAL_FRAUD_AUDIT	L020	Eurotrust	Spain	EUROPE	Euro@gmail.Com	Closed	430000	09JUL2026	32	6	4.3	MEDIUM	HIGH
14	GLOBAL_FRAUD_AUDIT	L010	Primecash	Brazil	LATAM	Badmail@	Resolved	720000	18SEP2025	29	6	7.2	MEDIUM	MEDIUM
15	GLOBAL_FRAUD_AUDIT	L017	Cashflow	Mexico	LATAM	Mxloot@gmail.Com	Closed	340000	06APR2026	38	6	3.4	LOW	MEDIUM
16	GLOBAL_FRAUD_AUDIT	L001	Worldbank	Usa	NORTH_AMERICA	Lootmaster@gmail.Com	Closed	500000	15JAN2025	45	6	5.0	MEDIUM	HIGH
17	GLOBAL_FRAUD_AUDIT	L004	Moneysecure	Usa	NORTH_AMERICA	Crimeboss@yahoo	Closed	450000	20MAR2025	35	6	4.5	MEDIUM	HIGH
18	GLOBAL_FRAUD_AUDIT	L013	Goldreserve	Usa	NORTH_AMERICA	INVALID_EMAIL	Closed	250000	17DEC2025	44	6	2.5	LOW	HIGH
19	GLOBAL_FRAUD_AUDIT	L018	Vaultnation	Usa	NORTH_AMERICA	Nation@gmail.Com	Pending	50000	07MAY2026	41	6	0.5	LOW	HIGH
20	GLOBAL_FRAUD_AUDIT	L005	Cashempire	Canada	NULL	Cashking@gmail.Com	Resolved	50000	01APR2025	52	6	0.5	LOW

Explanation

MERGE logic is foundational in SAS production environments. Fraud systems often enrich operational data with lookup tables, risk classifications, or external reference datasets. The IN= dataset options provide record lineage tracking, ensuring only valid transactional observations are retained. Clinical trial programmers use similar logic when merging demographics, adverse events, and exposure datasets.

6.PROC REPORT for Executive Outputs

proc report data=loot_summary nowd;

columns Region Total_Loots Total_Amount Avg_Age;

define Region / group;

define Total_Loots / analysis;

define Total_Amount / analysis;

define Avg_Age / analysis;

run;

OUTPUT:


Region	Total_Loots	Total_Amount	Avg_Age
ASIA_PACIFIC	7	$3,610,100	31.142857
EUROPE	6	$11,879,999	33.6
LATAM	2	$1,060,000	33.5
NORTH_AMERICA	4	$1,250,000	41.25
NULL	1	$50,000	52

Explanation

PROC REPORT generates professional enterprise reports suitable for auditors, executives, and regulatory submissions. Unlike simple PROC PRINT outputs, PROC REPORT supports grouped presentation logic, calculated fields, custom formatting, and advanced layouts. In clinical programming, TLFs (Tables, Listings, Figures) heavily rely on reporting procedures like this.

7.Modern R Cleaning Workflow

Raw Dataset

library(tidyverse)

library(lubridate)

library(janitor)

bank_raw <- tibble(

Loot_ID = c("L001","L002","L003","L004","L004"),

Bank_Name = c("WorldBank","SafeVault","TrustBank","MoneySecure","MoneySecure"),

Region = c("NA","apac","EU"," north america ","EUROPE"),

Loot_Amount = c(500000,-70000,NA,450000,450000),

Criminal_Age = c(45,17,130,35,35),

Email = c("loot@gmail.com","badmail","NULL","boss@yahoo","boss@yahoo")

)

OUTPUT:

	Loot_ID	Bank_Name	Region	Loot_Amount	Criminal_Age	Email
1	L001	WorldBank	NA	500000	45	loot@gmail.com
2	L002	SafeVault	apac	-70000	17	badmail
3	L003	TrustBank	EU	NA	130	NULL
4	L004	MoneySecure	north america	450000	35	boss@yahoo
5	L004	MoneySecure	EUROPE	450000	35	boss@yahoo

Cleaning Logic in R

bank_clean <- bank_raw %>%

clean_names() %>%

mutate(region = str_trim(str_to_upper(region)),

region = case_when(

region %in% c("NA","NORTH AMERICA") ~ "NORTH_AMERICA",

region %in% c("EU","EUROPE") ~ "EUROPE",

TRUE ~ "ASIA_PACIFIC"),

loot_amount = abs(loot_amount),

criminal_age = if_else(criminal_age > 100 |

criminal_age < 18,NA_real_,

criminal_age),

email = if_else(grepl("@", email),email,"INVALID_EMAIL")

) %>%

distinct()

OUTPUT:

	loot_id	bank_name	region	loot_amount	criminal_age	email
1	L001	WorldBank	NORTH_AMERICA	500000	45	loot@gmail.com
2	L002	SafeVault	ASIA_PACIFIC	70000	NA	INVALID_EMAIL
3	L003	TrustBank	EUROPE	NA	NA	INVALID_EMAIL
4	L004	MoneySecure	NORTH_AMERICA	450000	35	boss@yahoo
5	L004	MoneySecure	EUROPE	450000	35	boss@yahoo

Explanation

This R workflow mirrors SAS transformations using tidyverse functions. mutate() behaves similarly to SAS assignment logic. case_when() parallels SELECT-WHEN. str_trim() and str_to_upper() standardize corrupted text. distinct() removes duplicates like PROC SORT NODUPKEY. if_else() provides conditional corrections comparable to IF-THEN logic in SAS. Tidyverse pipelines improve readability and modularity, especially in exploratory data science workflows.

SAS vs R Comparison

Feature	SAS	R
Auditability	Excellent	Moderate
Regulatory Acceptance	Very High	Growing
Scalability	Enterprise-grade	High with optimization
Metadata Control	Strong	Flexible
Clinical Trial Dominance	Industry Standard	Supplemental
Visualization	Moderate	Excellent
Reproducibility	Strong	Strong
Macro Automation	Powerful	Functional Programming

8.Enterprise Validation & Compliance

In SDTM and ADaM workflows, missing values are dangerous. SAS treats numeric missing values as lower than all valid numbers.

Example:

if score < 50 then flag='FAIL';

Missing scores also become FAIL unless explicitly checked.

Correct approach:

if score ne . and score < 50 then flag='FAIL';

Regulatory agencies expect:

Traceability
Audit trails
Independent QC
Metadata governance
Controlled terminology
Reproducibility
Validation independence

Improper missing-value handling can invalidate clinical trial analyses or fraud detection models.

9.Business Logic Behind Cleaning

Business logic defines how corrupted operational data should be transformed into analytically meaningful information. Missing values are often imputed because downstream reporting engines, machine learning models, and statistical calculations fail when essential variables are blank. For example, if a patient visit date is missing in a clinical trial, exposure duration calculations become impossible, affecting efficacy analysis. Similarly, missing transaction dates in fraud systems distort temporal risk trends.

Unrealistic values must also be corrected. A criminal age of 130 or negative stolen amount is logically impossible and likely caused by ETL corruption, manual-entry mistakes, or source-system failures. Standardizing dates using formats like YYYY-MM-DD ensures interoperability between systems and accurate time-series analysis.

Text normalization is equally critical. Region labels such as “EU,” “ europe ”, and “EUROPE” represent the same business entity but produce fragmented dashboards if not standardized. Email validation prevents broken communication pipelines and identity mismatches.

In healthcare, correcting age ranges ensures valid demographic analysis. In retail or banking, standardized customer categories improve segmentation models. Ultimately, business logic bridges raw operational chaos and reliable enterprise intelligence.

10.20 Data-Cleaning Best Practices

Validate metadata before transformations
Standardize controlled terminology
Remove duplicates early
Always preserve raw datasets
Use audit-ready logs
Separate derivation and validation logic
Apply defensive programming
Validate date ranges
Check impossible numeric values
Normalize categorical variables
Use reusable macros
Document assumptions clearly
Implement QC independence
Avoid hardcoded mappings
Track lineage across datasets
Use consistent formats/informats
Validate joins after merges
Perform frequency checks
Standardize missing-value handling
Automate validation reports

11.20 Sharp Insights

Dirty data creates expensive business mistakes.
Validation logic beats visual inspection.
Duplicate IDs destroy trust.
Missing values silently break analytics.
Standardized variables improve reproducibility.
Metadata controls enterprise quality.
Audit trails protect organizations.
PROC SQL simplifies relational analysis.
DATA step provides granular control.
R excels in exploratory transformations.
SAS dominates regulated industries.
Text normalization prevents dashboard fragmentation.
Defensive programming prevents production failures.
Controlled terminology reduces ambiguity.
Incorrect formats create hidden errors.
Validation is continuous, not optional.
Business rules must be documented.
Reproducibility is an enterprise asset.
Automated QC saves time and money.
Clean data drives trustworthy AI.

12.SAS and R Summary

SAS and R both play major roles in modern enterprise data engineering ecosystems, but they serve different operational strengths. SAS is deeply dominant in regulated industries such as clinical trials, banking, pharmaceuticals, and insurance because of its exceptional auditability, metadata governance, reproducibility, and validation traceability. Regulatory agencies trust SAS outputs because enterprise workflows are standardized, controlled, and easily reviewable. Procedures like PROC SQL, PROC REPORT, PROC FORMAT, and DATA step programming provide scalable frameworks for handling massive operational datasets with predictable execution behavior.

R, meanwhile, provides unmatched flexibility for exploratory analysis, modern visualization, advanced machine learning, and dynamic transformation pipelines. Packages like tidyverse, lubridate, janitor, and purrr simplify complex cleaning workflows using readable syntax and functional programming principles. R enables rapid experimentation and sophisticated analytics that complement SAS production systems.

From a scalability perspective, SAS excels in structured enterprise environments with large governance requirements, while R thrives in innovation-driven analytics teams. SAS macros provide industrial-grade automation, whereas R offers flexible reusable functions and package ecosystems.

The strongest enterprise strategy is not SAS versus R it is SAS and R together. Many organizations now use SAS for validated production pipelines and R for exploratory analytics, predictive modeling, and visualization. Combining both technologies creates scalable, compliant, high-performance analytical ecosystems capable of delivering trustworthy business intelligence across healthcare, banking, retail, and insurance domains.

13.Conclusion

Modern analytics systems are built on one foundational principle: reliable insights require reliable data. Whether the domain involves global bank loot investigations, clinical trial patient tracking, insurance claims analysis, or retail fraud detection, poor-quality operational data introduces enormous business risk. Duplicate transaction identifiers, malformed emails, corrupted region labels, impossible ages, invalid timestamps, and inconsistent formats do not merely create cosmetic reporting issues they directly impact strategic decisions, compliance outcomes, AI model reliability, and financial trust.

This project demonstrated how enterprise-grade data engineering workflows can systematically transform corrupted raw operational datasets into trustworthy analytical intelligence using both SAS and R. SAS provides unmatched control, auditability, metadata governance, and production reliability through DATA step programming, PROC SQL, macros, validation procedures, and reporting frameworks. Its strength lies in regulated environments where traceability and reproducibility are mandatory.

R complements these capabilities with flexible transformation pipelines, modern functional programming, advanced exploratory analytics, and efficient text handling through tidyverse ecosystems. Together, SAS and R create a hybrid analytics framework capable of supporting both highly governed enterprise reporting and modern data science innovation.

The most successful organizations do not treat data cleaning as a secondary technical task. They recognize it as a strategic operational discipline. Clean data improves fraud detection accuracy, strengthens clinical trial integrity, enhances executive dashboards, protects regulatory submissions, and enables trustworthy AI systems.

Ultimately, structured data-cleaning frameworks are no longer optional in modern analytics ecosystems. They are foundational pillars of scalable, production-grade intelligence. Organizations that invest in rigorous validation, metadata governance, defensive programming, and reproducible workflows gain something more valuable than clean datasets they gain confidence in every decision powered by their data.

14.Interview Questions and Answers

1. What are arrays in SAS and why are they useful?

Answer:

Arrays allow multiple variables to be processed using loops.

Example

data clean_scores;
set exam;

array scores(*) score1-score5;

do i=1 to dim(scores);

if scores(i) < 0 then scores(i)=0;

end;

run;

Key Points

Reduces repetitive code.
Useful in bulk cleaning operations.
Common in laboratory data processing.

2. What is the purpose of PROC FORMAT?

Answer:

PROC FORMAT creates custom labels and grouped categories.

Example

proc format;
value riskfmt 0-100000='LOW'
100001-500000='MEDIUM'
500001-high='HIGH';
run;

Usage

format amount riskfmt.;

Key Points

Improves report readability.
Standardizes outputs.
Widely used in TLF generation.

3. How do you validate email formats in SAS?

Answer:

Functions like FIND, INDEX, VERIFY, and PRXMATCH are used.

Example

if find(email,'@')=0 then email_status='INVALID';

Advanced Regex Example

if prxmatch('/^\S+@\S+\.\S+$/',email)
then status='VALID';

Key Points

Important in customer datasets.
Prevents communication failures.
Regex validation is more robust.

4. What is the difference between STRIP, TRIM, and COMPRESS?

Answer:

Function	Purpose
STRIP	Removes leading/trailing spaces
TRIM	Removes trailing spaces
COMPRESS	Removes specified characters

Example

clean_name=strip(name);

digits=compress(phone,'-() ');

Key Points

Important for standardization.
Improves matching accuracy.
Reduces hidden whitespace corruption.

5. How do you merge datasets in SAS?

Answer:

Datasets are merged using BY variables after sorting.

Example

proc sort data=demog; by Subject_ID; run;
proc sort data=lab; by Subject_ID; run;

data final;
merge demog(in=a)
lab(in=b);

by Subject_ID;

if a;
run;

Key Points

Sorting is mandatory.
IN= flags track dataset origin.
Common in SDTM/ADaM derivations.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BANK LOOTS DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Billion-Dollar Marriages to Clean Datasets: The Art of SAS Standardization

2.The Hidden Power of SAS Dates: Building Smart Time Logic with INTNX and INTCK

3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Looted Banks, Dirty Data & Executive Panic: Building Production-Ready Fraud Intelligence Systems with SAS PROC SQL and R

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS