Looted Banks, Dirty Data & Executive Panic: Building Production-Ready Fraud Intelligence Systems with SAS PROC SQL and R
Global Bank Loot Records into Trusted Analytical Intelligence Using Advanced SAS (PROC SQL vs DATA Step) and Modern R Data Engineering Frameworks
Introduction: When Dirty Data Becomes a
Million-Dollar Disaster
Imagine a
multinational banking consortium investigating global bank loot incidents
across multiple countries. Fraud analysts discover that several suspicious
transactions were incorrectly classified as “LOW RISK” because of corrupted
timestamps, duplicated transaction IDs, inconsistent region labels, malformed
emails, and negative stolen amounts.
One
executive dashboard showed that losses in Europe were lower than Asia. After
investigation, analysts discovered that Europe records were coded as:
- EU
- Eu
- europe
- EUR
- NULL
Because
of this inconsistency, the dashboard fragmented the same region into five
categories.
Meanwhile,
a missing robbery date caused fraud trend models to skip high-risk events
entirely. Duplicate transaction identifiers triggered validation failures
during regulatory reporting. Even worse, negative stolen amounts contaminated
financial summaries and AI fraud-prediction models.
This is
not hypothetical. In real enterprise environments involving healthcare,
banking, insurance, retail, and clinical trials, dirty data destroys:
- Regulatory submissions
- Statistical outputs
- SDTM/ADaM derivations
- Executive dashboards
- Machine learning reliability
- Operational trust
A
Clinical SAS Programmer or Data Scientist quickly learns one truth:
“Analytics
is only as trustworthy as the quality of the underlying data.”
This project demonstrates how corrupted global bank-loot operational datasets can be transformed into production-grade analytical intelligence using both SAS and R.
Raw Corrupted Dataset Global Bank Loot Intelligence
SAS Raw Dataset Creation
data bank_loot_raw;
length Loot_ID $12 Bank_Name $30 Country $20 Region $15
Criminal_Email $40 Status $15;
infile datalines dlm='|' truncover;
input Loot_ID $ Bank_Name $ Country $ Region $ Criminal_Email $
Loot_Amount Loot_Date :?? yymmdd10. Criminal_Age Status $;
format Loot_Date date9.;
datalines;
L001|WorldBank|USA|NA|lootmaster@gmail.com|500000|2025-01-15|45|Closed
L002|SafeVault|india|APAC|robbermail.com|-70000|2025-02-18|17|Open
L003|TrustBank|UK|EU|NULL|950000|.|130|Investigating
L004|MoneySecure|usa| north america |crimeboss@yahoo|450000|2025-03-20|35|Closed
L004|MoneySecure|usa| north america |crimeboss@yahoo|450000|2025-03-20|35|Closed
L005|CashEmpire|Canada|NULL|cashking@gmail.com|.|2025-04-01|52|Resolved
L006|SecureHold|Germany|EUROPE|fraudster#mail.com|300000|2025-05-09|-5|Open
L007|IronBank|India|apac|stealer@gmail.com|1000000|2025-13-10|40|Closed
L008|RoyalTrust|France|EU|robber@domain|200000|2025-07-12|39|OPEN
L009|VaultCore|Japan|APAC| thief@gmail.com |650000|2025-08-15|NULL|Closed
L010|PrimeCash|Brazil|LATAM|badmail@|720000|2025-09-18|29|Resolved
L011|SecureLife|UK|EU|stealworld@gmail.com|0|2025-10-11|48|Closed
L012|MegaFunds|Australia|APAC|moneyloot@gmail.com|880000|2025-11-14|33|Investigating
L013|GoldReserve|USA|Na| |250000|2025-12-17|44|Closed
L014|UrbanVault|India|APAC|urban@gmail.com|450000|2026-01-11|22|Resolved
L015|TrustSafe|Italy|EU|lootmail@gmail.com|9999999|2026-02-09|31|Closed
L016|NextBank|China|APAC|china_thief@gmail|560000|2026-03-05|60|Open
L017|CashFlow|Mexico|LATAM|mxloot@gmail.com|340000|2026-04-06|38|Closed
L018|VaultNation|USA|NA|nation@gmail.com|.|2026-05-07|41|Pending
L019|SafeMoney|India|APAC|safe@gmail.com|-100|2026-06-08|27|Open
L020|EuroTrust|Spain|EUROPE|euro@gmail.com|430000|2026-07-09|32|Closed
;
run;
proc print data = bank_loot_raw;
run;
OUTPUT:
| Obs | Loot_ID | Bank_Name | Country | Region | Criminal_Email | Status | Loot_Amount | Loot_Date | Criminal_Age |
|---|---|---|---|---|---|---|---|---|---|
| 1 | L001 | WorldBank | USA | NA | lootmaster@gmail.com | Closed | 500000 | 15JAN2025 | 45 |
| 2 | L002 | SafeVault | india | APAC | robbermail.com | Open | -70000 | 18FEB2025 | 17 |
| 3 | L003 | TrustBank | UK | EU | NULL | Investigating | 950000 | . | 130 |
| 4 | L004 | MoneySecure | usa | north america | crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 |
| 5 | L004 | MoneySecure | usa | north america | crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 |
| 6 | L005 | CashEmpire | Canada | NULL | cashking@gmail.com | Resolved | . | 01APR2025 | 52 |
| 7 | L006 | SecureHold | Germany | EUROPE | fraudster#mail.com | Open | 300000 | 09MAY2025 | -5 |
| 8 | L007 | IronBank | India | apac | stealer@gmail.com | Closed | 1000000 | . | 40 |
| 9 | L008 | RoyalTrust | France | EU | robber@domain | OPEN | 200000 | 12JUL2025 | 39 |
| 10 | L009 | VaultCore | Japan | APAC | thief@gmail.com | Closed | 650000 | 15AUG2025 | . |
| 11 | L010 | PrimeCash | Brazil | LATAM | badmail@ | Resolved | 720000 | 18SEP2025 | 29 |
| 12 | L011 | SecureLife | UK | EU | stealworld@gmail.com | Closed | 0 | 11OCT2025 | 48 |
| 13 | L012 | MegaFunds | Australia | APAC | moneyloot@gmail.com | Investigating | 880000 | 14NOV2025 | 33 |
| 14 | L013 | GoldReserve | USA | Na | Closed | 250000 | 17DEC2025 | 44 | |
| 15 | L014 | UrbanVault | India | APAC | urban@gmail.com | Resolved | 450000 | 11JAN2026 | 22 |
| 16 | L015 | TrustSafe | Italy | EU | lootmail@gmail.com | Closed | 9999999 | 09FEB2026 | 31 |
| 17 | L016 | NextBank | China | APAC | china_thief@gmail | Open | 560000 | 05MAR2026 | 60 |
| 18 | L017 | CashFlow | Mexico | LATAM | mxloot@gmail.com | Closed | 340000 | 06APR2026 | 38 |
| 19 | L018 | VaultNation | USA | NA | nation@gmail.com | Pending | . | 07MAY2026 | 41 |
| 20 | L019 | SafeMoney | India | APAC | safe@gmail.com | Open | -100 | 08JUN2026 | 27 |
| 21 | L020 | EuroTrust | Spain | EUROPE | euro@gmail.com | Closed | 430000 | 09JUL2026 | 32 |
Why LENGTH Must Come First in SAS
One of
the biggest beginner mistakes in SAS programming is placing the LENGTH
statement after assignments.
Incorrect
Bank_Flag='HIGH';
length Bank_Flag $20;
SAS
already assigned default length before the LENGTH statement executes. This
causes:
- Character truncation
- Unexpected missing values
- Validation mismatches
Correct
length Bank_Flag $20;
Bank_Flag='HIGH';
In R,
character vectors dynamically resize, so truncation risks are far less severe.
SAS, however, allocates memory earlier during compilation.
Enterprise SAS Cleaning Workflow
1.Metadata Inspection
proc contents data=bank_loot_raw varnum;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.BANK_LOOT_RAW | Observations | 21 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 05/16/2026 15:40:06 | Observation Length | 160 |
| Last Modified | 05/16/2026 15:40:06 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 818 |
| Obs in First Data Page | 21 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_work860000004B2C_odaws02-apse1-2.oda.sas.com/SAS_workA50600004B2C_odaws02-apse1-2.oda.sas.com/bank_loot_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 134325615 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Variables in Creation Order | ||||
|---|---|---|---|---|
| # | Variable | Type | Len | Format |
| 1 | Loot_ID | Char | 12 | |
| 2 | Bank_Name | Char | 30 | |
| 3 | Country | Char | 20 | |
| 4 | Region | Char | 15 | |
| 5 | Criminal_Email | Char | 40 | |
| 6 | Status | Char | 15 | |
| 7 | Loot_Amount | Num | 8 | |
| 8 | Loot_Date | Num | 8 | DATE9. |
| 9 | Criminal_Age | Num | 8 | |
Explanation
PROC
CONTENTS helps analysts inspect metadata, variable types, lengths, formats, and
informats. In regulated environments like clinical trials or fraud analytics,
metadata verification is mandatory because incorrect variable attributes can
break downstream derivations, SDTM mappings, and reporting pipelines. This step
identifies whether variables were incorrectly imported as character instead of
numeric or vice versa. It also validates format consistency before
transformations begin.
2.Data Cleaning Using DATA Step
data bank_loot_clean;
retain Source_System 'GLOBAL_FRAUD_AUDIT';
set bank_loot_raw;
array chars(*) Bank_Name Country Region Status Criminal_Email;
do i=1 to dim(chars);
chars(i)=strip(propcase(chars(i)));
end;
Region=upcase(strip(Region));
if Region in ('NA','NORTH AMERICA') then Region='NORTH_AMERICA';
else if Region in ('EU','EUROPE') then Region='EUROPE';
else if Region='APAC' then Region='ASIA_PACIFIC';
Loot_Amount=abs(Loot_Amount);
if Criminal_Age < 18 then Criminal_Age=18;
if Criminal_Age > 100 then Criminal_Age=.;
if find(Criminal_Email,'@')=0 then Criminal_Email='INVALID_EMAIL';
if Loot_Amount=. then Loot_Amount=50000;
Loot_Risk=round(Loot_Amount/100000,.1);
length Loot_Category $10.;
select;
when (Loot_Amount > 900000) Loot_Category='HIGH';
when (Loot_Amount > 400000) Loot_Category='MEDIUM';
otherwise Loot_Category='LOW';
end;
run;
proc print data = bank_loot_clean;
run;
OUTPUT:
| Obs | Source_System | Loot_ID | Bank_Name | Country | Region | Criminal_Email | Status | Loot_Amount | Loot_Date | Criminal_Age | i | Loot_Risk | Loot_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRAUD_AUDIT | L001 | Worldbank | Usa | NORTH_AMERICA | Lootmaster@gmail.Com | Closed | 500000 | 15JAN2025 | 45 | 6 | 5.0 | MEDIUM |
| 2 | GLOBAL_FRAUD_AUDIT | L002 | Safevault | India | ASIA_PACIFIC | INVALID_EMAIL | Open | 70000 | 18FEB2025 | 18 | 6 | 0.7 | LOW |
| 3 | GLOBAL_FRAUD_AUDIT | L003 | Trustbank | Uk | EUROPE | INVALID_EMAIL | Investigating | 950000 | . | . | 6 | 9.5 | HIGH |
| 4 | GLOBAL_FRAUD_AUDIT | L004 | Moneysecure | Usa | NORTH_AMERICA | Crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 | 6 | 4.5 | MEDIUM |
| 5 | GLOBAL_FRAUD_AUDIT | L004 | Moneysecure | Usa | NORTH_AMERICA | Crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 | 6 | 4.5 | MEDIUM |
| 6 | GLOBAL_FRAUD_AUDIT | L005 | Cashempire | Canada | NULL | Cashking@gmail.Com | Resolved | 50000 | 01APR2025 | 52 | 6 | 0.5 | LOW |
| 7 | GLOBAL_FRAUD_AUDIT | L006 | Securehold | Germany | EUROPE | INVALID_EMAIL | Open | 300000 | 09MAY2025 | 18 | 6 | 3.0 | LOW |
| 8 | GLOBAL_FRAUD_AUDIT | L007 | Ironbank | India | ASIA_PACIFIC | Stealer@gmail.Com | Closed | 1000000 | . | 40 | 6 | 10.0 | HIGH |
| 9 | GLOBAL_FRAUD_AUDIT | L008 | Royaltrust | France | EUROPE | Robber@domain | Open | 200000 | 12JUL2025 | 39 | 6 | 2.0 | LOW |
| 10 | GLOBAL_FRAUD_AUDIT | L009 | Vaultcore | Japan | ASIA_PACIFIC | Thief@gmail.Com | Closed | 650000 | 15AUG2025 | 18 | 6 | 6.5 | MEDIUM |
| 11 | GLOBAL_FRAUD_AUDIT | L010 | Primecash | Brazil | LATAM | Badmail@ | Resolved | 720000 | 18SEP2025 | 29 | 6 | 7.2 | MEDIUM |
| 12 | GLOBAL_FRAUD_AUDIT | L011 | Securelife | Uk | EUROPE | Stealworld@gmail.Com | Closed | 0 | 11OCT2025 | 48 | 6 | 0.0 | LOW |
| 13 | GLOBAL_FRAUD_AUDIT | L012 | Megafunds | Australia | ASIA_PACIFIC | Moneyloot@gmail.Com | Investigating | 880000 | 14NOV2025 | 33 | 6 | 8.8 | MEDIUM |
| 14 | GLOBAL_FRAUD_AUDIT | L013 | Goldreserve | Usa | NORTH_AMERICA | INVALID_EMAIL | Closed | 250000 | 17DEC2025 | 44 | 6 | 2.5 | LOW |
| 15 | GLOBAL_FRAUD_AUDIT | L014 | Urbanvault | India | ASIA_PACIFIC | Urban@gmail.Com | Resolved | 450000 | 11JAN2026 | 22 | 6 | 4.5 | MEDIUM |
| 16 | GLOBAL_FRAUD_AUDIT | L015 | Trustsafe | Italy | EUROPE | Lootmail@gmail.Com | Closed | 9999999 | 09FEB2026 | 31 | 6 | 100.0 | HIGH |
| 17 | GLOBAL_FRAUD_AUDIT | L016 | Nextbank | China | ASIA_PACIFIC | China_thief@gmail | Open | 560000 | 05MAR2026 | 60 | 6 | 5.6 | MEDIUM |
| 18 | GLOBAL_FRAUD_AUDIT | L017 | Cashflow | Mexico | LATAM | Mxloot@gmail.Com | Closed | 340000 | 06APR2026 | 38 | 6 | 3.4 | LOW |
| 19 | GLOBAL_FRAUD_AUDIT | L018 | Vaultnation | Usa | NORTH_AMERICA | Nation@gmail.Com | Pending | 50000 | 07MAY2026 | 41 | 6 | 0.5 | LOW |
| 20 | GLOBAL_FRAUD_AUDIT | L019 | Safemoney | India | ASIA_PACIFIC | Safe@gmail.Com | Open | 100 | 08JUN2026 | 27 | 6 | 0.0 | LOW |
| 21 | GLOBAL_FRAUD_AUDIT | L020 | Eurotrust | Spain | EUROPE | Euro@gmail.Com | Closed | 430000 | 09JUL2026 | 32 | 6 | 4.3 | MEDIUM |
Explanation
This DATA
step demonstrates enterprise-grade SAS cleaning logic. Arrays standardize
multiple variables simultaneously, reducing repetitive coding. PROPCASE, STRIP,
and UPCASE normalize inconsistent text values. ABS() corrects negative monetary
values frequently caused by ETL corruption or reversed accounting signs. FIND()
validates email structures, while conditional logic fixes impossible ages and
missing amounts. ROUND() standardizes derived metrics for reporting
consistency. Such logic is common in clinical trial SDTM preparation, banking
fraud detection, and insurance claims processing where regulatory traceability
matters.
3.Deduplication
proc sort data=bank_loot_clean nodupkey;
by Loot_ID;
run;
proc print data = bank_loot_clean;
run;
LOG:
OUTPUT:
| Obs | Source_System | Loot_ID | Bank_Name | Country | Region | Criminal_Email | Status | Loot_Amount | Loot_Date | Criminal_Age | i | Loot_Risk | Loot_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRAUD_AUDIT | L001 | Worldbank | Usa | NORTH_AMERICA | Lootmaster@gmail.Com | Closed | 500000 | 15JAN2025 | 45 | 6 | 5.0 | MEDIUM |
| 2 | GLOBAL_FRAUD_AUDIT | L002 | Safevault | India | ASIA_PACIFIC | INVALID_EMAIL | Open | 70000 | 18FEB2025 | 18 | 6 | 0.7 | LOW |
| 3 | GLOBAL_FRAUD_AUDIT | L003 | Trustbank | Uk | EUROPE | INVALID_EMAIL | Investigating | 950000 | . | . | 6 | 9.5 | HIGH |
| 4 | GLOBAL_FRAUD_AUDIT | L004 | Moneysecure | Usa | NORTH_AMERICA | Crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 | 6 | 4.5 | MEDIUM |
| 5 | GLOBAL_FRAUD_AUDIT | L005 | Cashempire | Canada | NULL | Cashking@gmail.Com | Resolved | 50000 | 01APR2025 | 52 | 6 | 0.5 | LOW |
| 6 | GLOBAL_FRAUD_AUDIT | L006 | Securehold | Germany | EUROPE | INVALID_EMAIL | Open | 300000 | 09MAY2025 | 18 | 6 | 3.0 | LOW |
| 7 | GLOBAL_FRAUD_AUDIT | L007 | Ironbank | India | ASIA_PACIFIC | Stealer@gmail.Com | Closed | 1000000 | . | 40 | 6 | 10.0 | HIGH |
| 8 | GLOBAL_FRAUD_AUDIT | L008 | Royaltrust | France | EUROPE | Robber@domain | Open | 200000 | 12JUL2025 | 39 | 6 | 2.0 | LOW |
| 9 | GLOBAL_FRAUD_AUDIT | L009 | Vaultcore | Japan | ASIA_PACIFIC | Thief@gmail.Com | Closed | 650000 | 15AUG2025 | 18 | 6 | 6.5 | MEDIUM |
| 10 | GLOBAL_FRAUD_AUDIT | L010 | Primecash | Brazil | LATAM | Badmail@ | Resolved | 720000 | 18SEP2025 | 29 | 6 | 7.2 | MEDIUM |
| 11 | GLOBAL_FRAUD_AUDIT | L011 | Securelife | Uk | EUROPE | Stealworld@gmail.Com | Closed | 0 | 11OCT2025 | 48 | 6 | 0.0 | LOW |
| 12 | GLOBAL_FRAUD_AUDIT | L012 | Megafunds | Australia | ASIA_PACIFIC | Moneyloot@gmail.Com | Investigating | 880000 | 14NOV2025 | 33 | 6 | 8.8 | MEDIUM |
| 13 | GLOBAL_FRAUD_AUDIT | L013 | Goldreserve | Usa | NORTH_AMERICA | INVALID_EMAIL | Closed | 250000 | 17DEC2025 | 44 | 6 | 2.5 | LOW |
| 14 | GLOBAL_FRAUD_AUDIT | L014 | Urbanvault | India | ASIA_PACIFIC | Urban@gmail.Com | Resolved | 450000 | 11JAN2026 | 22 | 6 | 4.5 | MEDIUM |
| 15 | GLOBAL_FRAUD_AUDIT | L015 | Trustsafe | Italy | EUROPE | Lootmail@gmail.Com | Closed | 9999999 | 09FEB2026 | 31 | 6 | 100.0 | HIGH |
| 16 | GLOBAL_FRAUD_AUDIT | L016 | Nextbank | China | ASIA_PACIFIC | China_thief@gmail | Open | 560000 | 05MAR2026 | 60 | 6 | 5.6 | MEDIUM |
| 17 | GLOBAL_FRAUD_AUDIT | L017 | Cashflow | Mexico | LATAM | Mxloot@gmail.Com | Closed | 340000 | 06APR2026 | 38 | 6 | 3.4 | LOW |
| 18 | GLOBAL_FRAUD_AUDIT | L018 | Vaultnation | Usa | NORTH_AMERICA | Nation@gmail.Com | Pending | 50000 | 07MAY2026 | 41 | 6 | 0.5 | LOW |
| 19 | GLOBAL_FRAUD_AUDIT | L019 | Safemoney | India | ASIA_PACIFIC | Safe@gmail.Com | Open | 100 | 08JUN2026 | 27 | 6 | 0.0 | LOW |
| 20 | GLOBAL_FRAUD_AUDIT | L020 | Eurotrust | Spain | EUROPE | Euro@gmail.Com | Closed | 430000 | 09JUL2026 | 32 | 6 | 4.3 | MEDIUM |
Explanation
Duplicate
transaction identifiers create catastrophic reporting problems in banking and
healthcare systems. PROC SORT NODUPKEY removes duplicate observations based on
business keys. In clinical trials, duplicate subject IDs can invalidate
efficacy analysis populations. In fraud systems, duplicate transactions inflate
financial exposure calculations. Deduplication is therefore not cosmetic cleaning
it is a compliance requirement.
4.PROC SQL Validation Layer
proc sql;
create table loot_summary as
select Region,count(*) as Total_Loots,
sum(Loot_Amount) as Total_Amount format=dollar15.,
avg(Criminal_Age) as Avg_Age
from bank_loot_clean
group by Region;
quit;
proc print data = loot_summary;
run;
OUTPUT:
| Obs | Region | Total_Loots | Total_Amount | Avg_Age |
|---|---|---|---|---|
| 1 | ASIA_PACIFIC | 7 | $3,610,100 | 31.1429 |
| 2 | EUROPE | 6 | $11,879,999 | 33.6000 |
| 3 | LATAM | 2 | $1,060,000 | 33.5000 |
| 4 | NORTH_AMERICA | 4 | $1,250,000 | 41.2500 |
| 5 | NULL | 1 | $50,000 | 52.0000 |
Explanation
PROC SQL provides
relational-style summarization and joins. Compared to DATA step BY-group
processing, PROC SQL is often preferred for aggregation-heavy workflows.
Financial intelligence systems use SQL-based summaries for executive
dashboards, fraud heatmaps, and compliance reports. Proper grouping ensures
consistent regional metrics after standardization logic has corrected corrupted
region labels.
5.Advanced DATA Step MERGE Logic
data region_master;
input Region:$15. Risk_Level $;
datalines;
EUROPE HIGH
ASIA_PACIFIC MEDIUM
NORTH_AMERICA HIGH
LATAM MEDIUM
;
run;
proc print data = region_master;
run;
OUTPUT:
| Obs | Region | Risk_Level |
|---|---|---|
| 1 | EUROPE | HIGH |
| 2 | ASIA_PACIFIC | MEDIUM |
| 3 | NORTH_AMERICA | HIGH |
| 4 | LATAM | MEDIUM |
proc sort data=bank_loot_clean; by Region; run;
proc print data = bank_loot_clean;
run;
OUTPUT:
| Obs | Source_System | Loot_ID | Bank_Name | Country | Region | Criminal_Email | Status | Loot_Amount | Loot_Date | Criminal_Age | i | Loot_Risk | Loot_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRAUD_AUDIT | L002 | Safevault | India | ASIA_PACIFIC | INVALID_EMAIL | Open | 70000 | 18FEB2025 | 18 | 6 | 0.7 | LOW |
| 2 | GLOBAL_FRAUD_AUDIT | L007 | Ironbank | India | ASIA_PACIFIC | Stealer@gmail.Com | Closed | 1000000 | . | 40 | 6 | 10.0 | HIGH |
| 3 | GLOBAL_FRAUD_AUDIT | L009 | Vaultcore | Japan | ASIA_PACIFIC | Thief@gmail.Com | Closed | 650000 | 15AUG2025 | 18 | 6 | 6.5 | MEDIUM |
| 4 | GLOBAL_FRAUD_AUDIT | L012 | Megafunds | Australia | ASIA_PACIFIC | Moneyloot@gmail.Com | Investigating | 880000 | 14NOV2025 | 33 | 6 | 8.8 | MEDIUM |
| 5 | GLOBAL_FRAUD_AUDIT | L014 | Urbanvault | India | ASIA_PACIFIC | Urban@gmail.Com | Resolved | 450000 | 11JAN2026 | 22 | 6 | 4.5 | MEDIUM |
| 6 | GLOBAL_FRAUD_AUDIT | L016 | Nextbank | China | ASIA_PACIFIC | China_thief@gmail | Open | 560000 | 05MAR2026 | 60 | 6 | 5.6 | MEDIUM |
| 7 | GLOBAL_FRAUD_AUDIT | L019 | Safemoney | India | ASIA_PACIFIC | Safe@gmail.Com | Open | 100 | 08JUN2026 | 27 | 6 | 0.0 | LOW |
| 8 | GLOBAL_FRAUD_AUDIT | L003 | Trustbank | Uk | EUROPE | INVALID_EMAIL | Investigating | 950000 | . | . | 6 | 9.5 | HIGH |
| 9 | GLOBAL_FRAUD_AUDIT | L006 | Securehold | Germany | EUROPE | INVALID_EMAIL | Open | 300000 | 09MAY2025 | 18 | 6 | 3.0 | LOW |
| 10 | GLOBAL_FRAUD_AUDIT | L008 | Royaltrust | France | EUROPE | Robber@domain | Open | 200000 | 12JUL2025 | 39 | 6 | 2.0 | LOW |
| 11 | GLOBAL_FRAUD_AUDIT | L011 | Securelife | Uk | EUROPE | Stealworld@gmail.Com | Closed | 0 | 11OCT2025 | 48 | 6 | 0.0 | LOW |
| 12 | GLOBAL_FRAUD_AUDIT | L015 | Trustsafe | Italy | EUROPE | Lootmail@gmail.Com | Closed | 9999999 | 09FEB2026 | 31 | 6 | 100.0 | HIGH |
| 13 | GLOBAL_FRAUD_AUDIT | L020 | Eurotrust | Spain | EUROPE | Euro@gmail.Com | Closed | 430000 | 09JUL2026 | 32 | 6 | 4.3 | MEDIUM |
| 14 | GLOBAL_FRAUD_AUDIT | L010 | Primecash | Brazil | LATAM | Badmail@ | Resolved | 720000 | 18SEP2025 | 29 | 6 | 7.2 | MEDIUM |
| 15 | GLOBAL_FRAUD_AUDIT | L017 | Cashflow | Mexico | LATAM | Mxloot@gmail.Com | Closed | 340000 | 06APR2026 | 38 | 6 | 3.4 | LOW |
| 16 | GLOBAL_FRAUD_AUDIT | L001 | Worldbank | Usa | NORTH_AMERICA | Lootmaster@gmail.Com | Closed | 500000 | 15JAN2025 | 45 | 6 | 5.0 | MEDIUM |
| 17 | GLOBAL_FRAUD_AUDIT | L004 | Moneysecure | Usa | NORTH_AMERICA | Crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 | 6 | 4.5 | MEDIUM |
| 18 | GLOBAL_FRAUD_AUDIT | L013 | Goldreserve | Usa | NORTH_AMERICA | INVALID_EMAIL | Closed | 250000 | 17DEC2025 | 44 | 6 | 2.5 | LOW |
| 19 | GLOBAL_FRAUD_AUDIT | L018 | Vaultnation | Usa | NORTH_AMERICA | Nation@gmail.Com | Pending | 50000 | 07MAY2026 | 41 | 6 | 0.5 | LOW |
| 20 | GLOBAL_FRAUD_AUDIT | L005 | Cashempire | Canada | NULL | Cashking@gmail.Com | Resolved | 50000 | 01APR2025 | 52 | 6 | 0.5 | LOW |
proc sort data=region_master; by Region; run;
proc print data = region_master;
run;
OUTPUT:
| Obs | Region | Risk_Level |
|---|---|---|
| 1 | ASIA_PACIFIC | MEDIUM |
| 2 | EUROPE | HIGH |
| 3 | LATAM | MEDIUM |
| 4 | NORTH_AMERICA | HIGH |
data merged_loot;
merge bank_loot_clean(in=a)
region_master(in=b);
by Region;
if a;
run;
proc print data = merged_loot;
run;
OUTPUT:
| Obs | Source_System | Loot_ID | Bank_Name | Country | Region | Criminal_Email | Status | Loot_Amount | Loot_Date | Criminal_Age | i | Loot_Risk | Loot_Category | Risk_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRAUD_AUDIT | L002 | Safevault | India | ASIA_PACIFIC | INVALID_EMAIL | Open | 70000 | 18FEB2025 | 18 | 6 | 0.7 | LOW | MEDIUM |
| 2 | GLOBAL_FRAUD_AUDIT | L007 | Ironbank | India | ASIA_PACIFIC | Stealer@gmail.Com | Closed | 1000000 | . | 40 | 6 | 10.0 | HIGH | MEDIUM |
| 3 | GLOBAL_FRAUD_AUDIT | L009 | Vaultcore | Japan | ASIA_PACIFIC | Thief@gmail.Com | Closed | 650000 | 15AUG2025 | 18 | 6 | 6.5 | MEDIUM | MEDIUM |
| 4 | GLOBAL_FRAUD_AUDIT | L012 | Megafunds | Australia | ASIA_PACIFIC | Moneyloot@gmail.Com | Investigating | 880000 | 14NOV2025 | 33 | 6 | 8.8 | MEDIUM | MEDIUM |
| 5 | GLOBAL_FRAUD_AUDIT | L014 | Urbanvault | India | ASIA_PACIFIC | Urban@gmail.Com | Resolved | 450000 | 11JAN2026 | 22 | 6 | 4.5 | MEDIUM | MEDIUM |
| 6 | GLOBAL_FRAUD_AUDIT | L016 | Nextbank | China | ASIA_PACIFIC | China_thief@gmail | Open | 560000 | 05MAR2026 | 60 | 6 | 5.6 | MEDIUM | MEDIUM |
| 7 | GLOBAL_FRAUD_AUDIT | L019 | Safemoney | India | ASIA_PACIFIC | Safe@gmail.Com | Open | 100 | 08JUN2026 | 27 | 6 | 0.0 | LOW | MEDIUM |
| 8 | GLOBAL_FRAUD_AUDIT | L003 | Trustbank | Uk | EUROPE | INVALID_EMAIL | Investigating | 950000 | . | . | 6 | 9.5 | HIGH | HIGH |
| 9 | GLOBAL_FRAUD_AUDIT | L006 | Securehold | Germany | EUROPE | INVALID_EMAIL | Open | 300000 | 09MAY2025 | 18 | 6 | 3.0 | LOW | HIGH |
| 10 | GLOBAL_FRAUD_AUDIT | L008 | Royaltrust | France | EUROPE | Robber@domain | Open | 200000 | 12JUL2025 | 39 | 6 | 2.0 | LOW | HIGH |
| 11 | GLOBAL_FRAUD_AUDIT | L011 | Securelife | Uk | EUROPE | Stealworld@gmail.Com | Closed | 0 | 11OCT2025 | 48 | 6 | 0.0 | LOW | HIGH |
| 12 | GLOBAL_FRAUD_AUDIT | L015 | Trustsafe | Italy | EUROPE | Lootmail@gmail.Com | Closed | 9999999 | 09FEB2026 | 31 | 6 | 100.0 | HIGH | HIGH |
| 13 | GLOBAL_FRAUD_AUDIT | L020 | Eurotrust | Spain | EUROPE | Euro@gmail.Com | Closed | 430000 | 09JUL2026 | 32 | 6 | 4.3 | MEDIUM | HIGH |
| 14 | GLOBAL_FRAUD_AUDIT | L010 | Primecash | Brazil | LATAM | Badmail@ | Resolved | 720000 | 18SEP2025 | 29 | 6 | 7.2 | MEDIUM | MEDIUM |
| 15 | GLOBAL_FRAUD_AUDIT | L017 | Cashflow | Mexico | LATAM | Mxloot@gmail.Com | Closed | 340000 | 06APR2026 | 38 | 6 | 3.4 | LOW | MEDIUM |
| 16 | GLOBAL_FRAUD_AUDIT | L001 | Worldbank | Usa | NORTH_AMERICA | Lootmaster@gmail.Com | Closed | 500000 | 15JAN2025 | 45 | 6 | 5.0 | MEDIUM | HIGH |
| 17 | GLOBAL_FRAUD_AUDIT | L004 | Moneysecure | Usa | NORTH_AMERICA | Crimeboss@yahoo | Closed | 450000 | 20MAR2025 | 35 | 6 | 4.5 | MEDIUM | HIGH |
| 18 | GLOBAL_FRAUD_AUDIT | L013 | Goldreserve | Usa | NORTH_AMERICA | INVALID_EMAIL | Closed | 250000 | 17DEC2025 | 44 | 6 | 2.5 | LOW | HIGH |
| 19 | GLOBAL_FRAUD_AUDIT | L018 | Vaultnation | Usa | NORTH_AMERICA | Nation@gmail.Com | Pending | 50000 | 07MAY2026 | 41 | 6 | 0.5 | LOW | HIGH |
| 20 | GLOBAL_FRAUD_AUDIT | L005 | Cashempire | Canada | NULL | Cashking@gmail.Com | Resolved | 50000 | 01APR2025 | 52 | 6 | 0.5 | LOW |
Explanation
MERGE
logic is foundational in SAS production environments. Fraud systems often
enrich operational data with lookup tables, risk classifications, or external
reference datasets. The IN= dataset options provide record lineage tracking,
ensuring only valid transactional observations are retained. Clinical trial
programmers use similar logic when merging demographics, adverse events, and
exposure datasets.
6.PROC REPORT for Executive Outputs
proc report data=loot_summary nowd;
columns Region Total_Loots Total_Amount Avg_Age;
define Region / group;
define Total_Loots / analysis;
define Total_Amount / analysis;
define Avg_Age / analysis;
run;
OUTPUT:
| Region | Total_Loots | Total_Amount | Avg_Age |
|---|---|---|---|
| ASIA_PACIFIC | 7 | $3,610,100 | 31.142857 |
| EUROPE | 6 | $11,879,999 | 33.6 |
| LATAM | 2 | $1,060,000 | 33.5 |
| NORTH_AMERICA | 4 | $1,250,000 | 41.25 |
| NULL | 1 | $50,000 | 52 |
Explanation
PROC
REPORT generates professional enterprise reports suitable for auditors,
executives, and regulatory submissions. Unlike simple PROC PRINT outputs, PROC
REPORT supports grouped presentation logic, calculated fields, custom
formatting, and advanced layouts. In clinical programming, TLFs (Tables,
Listings, Figures) heavily rely on reporting procedures like this.
7.Modern R Cleaning Workflow
Raw Dataset
library(tidyverse)
library(lubridate)
library(janitor)
bank_raw <- tibble(
Loot_ID = c("L001","L002","L003","L004","L004"),
Bank_Name = c("WorldBank","SafeVault","TrustBank","MoneySecure","MoneySecure"),
Region = c("NA","apac","EU"," north america ","EUROPE"),
Loot_Amount = c(500000,-70000,NA,450000,450000),
Criminal_Age = c(45,17,130,35,35),
Email = c("loot@gmail.com","badmail","NULL","boss@yahoo","boss@yahoo")
)
OUTPUT:
|
|
Loot_ID |
Bank_Name |
Region |
Loot_Amount |
Criminal_Age |
Email |
|
1 |
L001 |
WorldBank |
NA |
500000 |
45 |
loot@gmail.com |
|
2 |
L002 |
SafeVault |
apac |
-70000 |
17 |
badmail |
|
3 |
L003 |
TrustBank |
EU |
NA |
130 |
NULL |
|
4 |
L004 |
MoneySecure |
north america |
450000 |
35 |
boss@yahoo |
|
5 |
L004 |
MoneySecure |
EUROPE |
450000 |
35 |
boss@yahoo |
Cleaning Logic in R
bank_clean <- bank_raw %>%
clean_names() %>%
mutate(region = str_trim(str_to_upper(region)),
region = case_when(
region %in% c("NA","NORTH AMERICA") ~ "NORTH_AMERICA",
region %in% c("EU","EUROPE") ~ "EUROPE",
TRUE ~ "ASIA_PACIFIC"),
loot_amount = abs(loot_amount),
criminal_age = if_else(criminal_age > 100 |
criminal_age < 18,NA_real_,
criminal_age),
email = if_else(grepl("@", email),email,"INVALID_EMAIL")
) %>%
distinct()
OUTPUT:
|
|
loot_id |
bank_name |
region |
loot_amount |
criminal_age |
email |
|
1 |
L001 |
WorldBank |
NORTH_AMERICA |
500000 |
45 |
loot@gmail.com |
|
2 |
L002 |
SafeVault |
ASIA_PACIFIC |
70000 |
NA |
INVALID_EMAIL |
|
3 |
L003 |
TrustBank |
EUROPE |
NA |
NA |
INVALID_EMAIL |
|
4 |
L004 |
MoneySecure |
NORTH_AMERICA |
450000 |
35 |
boss@yahoo |
|
5 |
L004 |
MoneySecure |
EUROPE |
450000 |
35 |
boss@yahoo |
Explanation
This R
workflow mirrors SAS transformations using tidyverse functions. mutate()
behaves similarly to SAS assignment logic. case_when() parallels SELECT-WHEN. str_trim()
and str_to_upper() standardize corrupted text. distinct() removes duplicates
like PROC SORT NODUPKEY. if_else() provides conditional corrections comparable
to IF-THEN logic in SAS. Tidyverse pipelines improve readability and
modularity, especially in exploratory data science workflows.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Very
High |
Growing |
|
Scalability |
Enterprise-grade |
High
with optimization |
|
Metadata
Control |
Strong |
Flexible |
|
Clinical
Trial Dominance |
Industry
Standard |
Supplemental |
|
Visualization |
Moderate |
Excellent |
|
Reproducibility |
Strong |
Strong |
|
Macro
Automation |
Powerful |
Functional
Programming |
8.Enterprise Validation
& Compliance
In SDTM
and ADaM workflows, missing values are dangerous. SAS treats numeric missing
values as lower than all valid numbers.
Example:
if score < 50 then
flag='FAIL';
Missing
scores also become FAIL unless explicitly checked.
Correct
approach:
if score ne . and score < 50
then flag='FAIL';
Regulatory
agencies expect:
- Traceability
- Audit trails
- Independent QC
- Metadata governance
- Controlled terminology
- Reproducibility
- Validation independence
Improper
missing-value handling can invalidate clinical trial analyses or fraud
detection models.
9.Business Logic Behind
Cleaning
Business
logic defines how corrupted operational data should be transformed into
analytically meaningful information. Missing values are often imputed because
downstream reporting engines, machine learning models, and statistical
calculations fail when essential variables are blank. For example, if a patient
visit date is missing in a clinical trial, exposure duration calculations
become impossible, affecting efficacy analysis. Similarly, missing transaction
dates in fraud systems distort temporal risk trends.
Unrealistic
values must also be corrected. A criminal age of 130 or negative stolen amount
is logically impossible and likely caused by ETL corruption, manual-entry
mistakes, or source-system failures. Standardizing dates using formats like
YYYY-MM-DD ensures interoperability between systems and accurate time-series
analysis.
Text
normalization is equally critical. Region labels such as “EU,” “ europe ”, and
“EUROPE” represent the same business entity but produce fragmented dashboards
if not standardized. Email validation prevents broken communication pipelines
and identity mismatches.
In
healthcare, correcting age ranges ensures valid demographic analysis. In retail
or banking, standardized customer categories improve segmentation models.
Ultimately, business logic bridges raw operational chaos and reliable
enterprise intelligence.
10.20 Data-Cleaning Best
Practices
- Validate metadata before
transformations
- Standardize controlled
terminology
- Remove duplicates early
- Always preserve raw datasets
- Use audit-ready logs
- Separate derivation and
validation logic
- Apply defensive programming
- Validate date ranges
- Check impossible numeric
values
- Normalize categorical
variables
- Use reusable macros
- Document assumptions clearly
- Implement QC independence
- Avoid hardcoded mappings
- Track lineage across
datasets
- Use consistent
formats/informats
- Validate joins after merges
- Perform frequency checks
- Standardize missing-value
handling
- Automate validation reports
11.20 Sharp Insights
- Dirty data creates expensive
business mistakes.
- Validation logic beats
visual inspection.
- Duplicate IDs destroy trust.
- Missing values silently
break analytics.
- Standardized variables
improve reproducibility.
- Metadata controls enterprise
quality.
- Audit trails protect
organizations.
- PROC SQL simplifies relational
analysis.
- DATA step provides granular
control.
- R excels in exploratory
transformations.
- SAS dominates regulated
industries.
- Text normalization prevents
dashboard fragmentation.
- Defensive programming
prevents production failures.
- Controlled terminology
reduces ambiguity.
- Incorrect formats create
hidden errors.
- Validation is continuous,
not optional.
- Business rules must be
documented.
- Reproducibility is an
enterprise asset.
- Automated QC saves time and
money.
- Clean data drives
trustworthy AI.
12.SAS and R Summary
SAS and R
both play major roles in modern enterprise data engineering ecosystems, but
they serve different operational strengths. SAS is deeply dominant in regulated
industries such as clinical trials, banking, pharmaceuticals, and insurance
because of its exceptional auditability, metadata governance, reproducibility,
and validation traceability. Regulatory agencies trust SAS outputs because
enterprise workflows are standardized, controlled, and easily reviewable.
Procedures like PROC SQL, PROC REPORT, PROC FORMAT, and DATA step programming
provide scalable frameworks for handling massive operational datasets with
predictable execution behavior.
R,
meanwhile, provides unmatched flexibility for exploratory analysis, modern
visualization, advanced machine learning, and dynamic transformation pipelines.
Packages like tidyverse, lubridate, janitor, and purrr simplify complex
cleaning workflows using readable syntax and functional programming principles.
R enables rapid experimentation and sophisticated analytics that complement SAS
production systems.
From a
scalability perspective, SAS excels in structured enterprise environments with
large governance requirements, while R thrives in innovation-driven analytics
teams. SAS macros provide industrial-grade automation, whereas R offers
flexible reusable functions and package ecosystems.
The
strongest enterprise strategy is not SAS versus R it is SAS and R together.
Many organizations now use SAS for validated production pipelines and R for
exploratory analytics, predictive modeling, and visualization. Combining both
technologies creates scalable, compliant, high-performance analytical
ecosystems capable of delivering trustworthy business intelligence across
healthcare, banking, retail, and insurance domains.
13.Conclusion
Modern
analytics systems are built on one foundational principle: reliable insights
require reliable data. Whether the domain involves global bank loot
investigations, clinical trial patient tracking, insurance claims analysis, or
retail fraud detection, poor-quality operational data introduces enormous
business risk. Duplicate transaction identifiers, malformed emails, corrupted
region labels, impossible ages, invalid timestamps, and inconsistent formats do
not merely create cosmetic reporting issues they directly impact strategic
decisions, compliance outcomes, AI model reliability, and financial trust.
This
project demonstrated how enterprise-grade data engineering workflows can
systematically transform corrupted raw operational datasets into trustworthy
analytical intelligence using both SAS and R. SAS provides unmatched control,
auditability, metadata governance, and production reliability through DATA step
programming, PROC SQL, macros, validation procedures, and reporting frameworks.
Its strength lies in regulated environments where traceability and
reproducibility are mandatory.
R
complements these capabilities with flexible transformation pipelines, modern
functional programming, advanced exploratory analytics, and efficient text
handling through tidyverse ecosystems. Together, SAS and R create a hybrid
analytics framework capable of supporting both highly governed enterprise
reporting and modern data science innovation.
The most
successful organizations do not treat data cleaning as a secondary technical
task. They recognize it as a strategic operational discipline. Clean data
improves fraud detection accuracy, strengthens clinical trial integrity,
enhances executive dashboards, protects regulatory submissions, and enables
trustworthy AI systems.
Ultimately,
structured data-cleaning frameworks are no longer optional in modern analytics
ecosystems. They are foundational pillars of scalable, production-grade
intelligence. Organizations that invest in rigorous validation, metadata
governance, defensive programming, and reproducible workflows gain something
more valuable than clean datasets they gain confidence in every decision
powered by their data.
14.Interview
Questions and Answers
1. What are arrays in SAS
and why are they useful?
Answer:
Arrays
allow multiple variables to be processed using loops.
Example
data clean_scores;
set exam;
array scores(*) score1-score5;
do i=1 to dim(scores);
if scores(i) < 0 then scores(i)=0;
end;
run;
Key Points
- Reduces repetitive code.
- Useful in bulk cleaning
operations.
- Common in laboratory data
processing.
2. What is the purpose of
PROC FORMAT?
Answer:
PROC
FORMAT creates custom labels and grouped categories.
Example
proc format;
value riskfmt 0-100000='LOW'
100001-500000='MEDIUM'
500001-high='HIGH';
run;
Usage
format amount riskfmt.;
Key Points
- Improves report readability.
- Standardizes outputs.
- Widely used in TLF
generation.
3. How do you validate
email formats in SAS?
Answer:
Functions
like FIND, INDEX, VERIFY, and PRXMATCH are used.
Example
if find(email,'@')=0 then
email_status='INVALID';
Advanced Regex Example
if prxmatch('/^\S+@\S+\.\S+$/',email)
then status='VALID';
Key Points
- Important in customer
datasets.
- Prevents communication
failures.
- Regex validation is more
robust.
4. What is the difference
between STRIP, TRIM, and COMPRESS?
Answer:
|
Function |
Purpose |
|
STRIP |
Removes
leading/trailing spaces |
|
TRIM |
Removes
trailing spaces |
|
COMPRESS |
Removes
specified characters |
Example
clean_name=strip(name);
digits=compress(phone,'-() ');
Key Points
- Important for
standardization.
- Improves matching accuracy.
- Reduces hidden whitespace
corruption.
5. How do you merge
datasets in SAS?
Answer:
Datasets
are merged using BY variables after sorting.
Example
proc sort data=demog; by
Subject_ID; run;
proc sort data=lab; by Subject_ID; run;
data final;
merge demog(in=a)
lab(in=b);
by Subject_ID;
if a;
run;
Key Points
- Sorting is mandatory.
- IN= flags track dataset
origin.
- Common in SDTM/ADaM
derivations.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BANK LOOTS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment