Looted Banks, Dirty Data & Executive Panic: Building Production-Ready Fraud Intelligence Systems with SAS PROC SQL and R

Global Bank Loot Records into Trusted Analytical Intelligence Using Advanced SAS (PROC SQL vs DATA Step) and Modern R Data Engineering Frameworks

Introduction: When Dirty Data Becomes a Million-Dollar Disaster

Imagine a multinational banking consortium investigating global bank loot incidents across multiple countries. Fraud analysts discover that several suspicious transactions were incorrectly classified as “LOW RISK” because of corrupted timestamps, duplicated transaction IDs, inconsistent region labels, malformed emails, and negative stolen amounts.

One executive dashboard showed that losses in Europe were lower than Asia. After investigation, analysts discovered that Europe records were coded as:

  • EU
  • Eu
  • europe
  • EUR
  • NULL

Because of this inconsistency, the dashboard fragmented the same region into five categories.

Meanwhile, a missing robbery date caused fraud trend models to skip high-risk events entirely. Duplicate transaction identifiers triggered validation failures during regulatory reporting. Even worse, negative stolen amounts contaminated financial summaries and AI fraud-prediction models.

This is not hypothetical. In real enterprise environments involving healthcare, banking, insurance, retail, and clinical trials, dirty data destroys:

  • Regulatory submissions
  • Statistical outputs
  • SDTM/ADaM derivations
  • Executive dashboards
  • Machine learning reliability
  • Operational trust

A Clinical SAS Programmer or Data Scientist quickly learns one truth:

“Analytics is only as trustworthy as the quality of the underlying data.”

This project demonstrates how corrupted global bank-loot operational datasets can be transformed into production-grade analytical intelligence using both SAS and R.

Raw Corrupted Dataset Global Bank Loot Intelligence

SAS Raw Dataset Creation

data bank_loot_raw;

length Loot_ID $12 Bank_Name $30 Country $20 Region $15

       Criminal_Email $40  Status $15;

infile datalines dlm='|' truncover;

input Loot_ID $ Bank_Name $ Country $ Region $ Criminal_Email $

      Loot_Amount Loot_Date :?? yymmdd10. Criminal_Age Status $;

format Loot_Date date9.;

datalines;

L001|WorldBank|USA|NA|lootmaster@gmail.com|500000|2025-01-15|45|Closed

L002|SafeVault|india|APAC|robbermail.com|-70000|2025-02-18|17|Open

L003|TrustBank|UK|EU|NULL|950000|.|130|Investigating

L004|MoneySecure|usa| north america |crimeboss@yahoo|450000|2025-03-20|35|Closed

L004|MoneySecure|usa| north america |crimeboss@yahoo|450000|2025-03-20|35|Closed

L005|CashEmpire|Canada|NULL|cashking@gmail.com|.|2025-04-01|52|Resolved

L006|SecureHold|Germany|EUROPE|fraudster#mail.com|300000|2025-05-09|-5|Open

L007|IronBank|India|apac|stealer@gmail.com|1000000|2025-13-10|40|Closed

L008|RoyalTrust|France|EU|robber@domain|200000|2025-07-12|39|OPEN

L009|VaultCore|Japan|APAC|  thief@gmail.com |650000|2025-08-15|NULL|Closed

L010|PrimeCash|Brazil|LATAM|badmail@|720000|2025-09-18|29|Resolved

L011|SecureLife|UK|EU|stealworld@gmail.com|0|2025-10-11|48|Closed

L012|MegaFunds|Australia|APAC|moneyloot@gmail.com|880000|2025-11-14|33|Investigating

L013|GoldReserve|USA|Na| |250000|2025-12-17|44|Closed

L014|UrbanVault|India|APAC|urban@gmail.com|450000|2026-01-11|22|Resolved

L015|TrustSafe|Italy|EU|lootmail@gmail.com|9999999|2026-02-09|31|Closed

L016|NextBank|China|APAC|china_thief@gmail|560000|2026-03-05|60|Open

L017|CashFlow|Mexico|LATAM|mxloot@gmail.com|340000|2026-04-06|38|Closed

L018|VaultNation|USA|NA|nation@gmail.com|.|2026-05-07|41|Pending

L019|SafeMoney|India|APAC|safe@gmail.com|-100|2026-06-08|27|Open

L020|EuroTrust|Spain|EUROPE|euro@gmail.com|430000|2026-07-09|32|Closed

;

run;

proc print data = bank_loot_raw;

run;

OUTPUT:

ObsLoot_IDBank_NameCountryRegionCriminal_EmailStatusLoot_AmountLoot_DateCriminal_Age
1L001WorldBankUSANAlootmaster@gmail.comClosed50000015JAN202545
2L002SafeVaultindiaAPACrobbermail.comOpen-7000018FEB202517
3L003TrustBankUKEUNULLInvestigating950000.130
4L004MoneySecureusanorth americacrimeboss@yahooClosed45000020MAR202535
5L004MoneySecureusanorth americacrimeboss@yahooClosed45000020MAR202535
6L005CashEmpireCanadaNULLcashking@gmail.comResolved.01APR202552
7L006SecureHoldGermanyEUROPEfraudster#mail.comOpen30000009MAY2025-5
8L007IronBankIndiaapacstealer@gmail.comClosed1000000.40
9L008RoyalTrustFranceEUrobber@domainOPEN20000012JUL202539
10L009VaultCoreJapanAPACthief@gmail.comClosed65000015AUG2025.
11L010PrimeCashBrazilLATAMbadmail@Resolved72000018SEP202529
12L011SecureLifeUKEUstealworld@gmail.comClosed011OCT202548
13L012MegaFundsAustraliaAPACmoneyloot@gmail.comInvestigating88000014NOV202533
14L013GoldReserveUSANa Closed25000017DEC202544
15L014UrbanVaultIndiaAPACurban@gmail.comResolved45000011JAN202622
16L015TrustSafeItalyEUlootmail@gmail.comClosed999999909FEB202631
17L016NextBankChinaAPACchina_thief@gmailOpen56000005MAR202660
18L017CashFlowMexicoLATAMmxloot@gmail.comClosed34000006APR202638
19L018VaultNationUSANAnation@gmail.comPending.07MAY202641
20L019SafeMoneyIndiaAPACsafe@gmail.comOpen-10008JUN202627
21L020EuroTrustSpainEUROPEeuro@gmail.comClosed43000009JUL202632

Why LENGTH Must Come First in SAS

One of the biggest beginner mistakes in SAS programming is placing the LENGTH statement after assignments.

Incorrect

Bank_Flag='HIGH';

length Bank_Flag $20;

SAS already assigned default length before the LENGTH statement executes. This causes:

  • Character truncation
  • Unexpected missing values
  • Validation mismatches

Correct

length Bank_Flag $20;

Bank_Flag='HIGH';

In R, character vectors dynamically resize, so truncation risks are far less severe. SAS, however, allocates memory earlier during compilation.

Enterprise SAS Cleaning Workflow

1.Metadata Inspection

proc contents data=bank_loot_raw varnum;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.BANK_LOOT_RAWObservations21
Member TypeDATAVariables9
EngineV9Indexes0
Created05/16/2026 15:40:06Observation Length160
Last Modified05/16/2026 15:40:06Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page818
Obs in First Data Page21
Number of Data Set Repairs0
Filename/saswork/SAS_work860000004B2C_odaws02-apse1-2.oda.sas.com/SAS_workA50600004B2C_odaws02-apse1-2.oda.sas.com/bank_loot_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number134325615
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Variables in Creation Order
#VariableTypeLenFormat
1Loot_IDChar12 
2Bank_NameChar30 
3CountryChar20 
4RegionChar15 
5Criminal_EmailChar40 
6StatusChar15 
7Loot_AmountNum8 
8Loot_DateNum8DATE9.
9Criminal_AgeNum8 

Explanation

PROC CONTENTS helps analysts inspect metadata, variable types, lengths, formats, and informats. In regulated environments like clinical trials or fraud analytics, metadata verification is mandatory because incorrect variable attributes can break downstream derivations, SDTM mappings, and reporting pipelines. This step identifies whether variables were incorrectly imported as character instead of numeric or vice versa. It also validates format consistency before transformations begin.

2.Data Cleaning Using DATA Step

data bank_loot_clean;

retain Source_System 'GLOBAL_FRAUD_AUDIT';

set bank_loot_raw;

array chars(*) Bank_Name Country Region Status Criminal_Email;

do i=1 to dim(chars);

   chars(i)=strip(propcase(chars(i)));

end;

Region=upcase(strip(Region));

if Region in ('NA','NORTH AMERICA') then Region='NORTH_AMERICA';

else if Region in ('EU','EUROPE') then Region='EUROPE';

else if Region='APAC' then Region='ASIA_PACIFIC';

Loot_Amount=abs(Loot_Amount);

if Criminal_Age < 18 then Criminal_Age=18;

if Criminal_Age > 100 then Criminal_Age=.;

if find(Criminal_Email,'@')=0 then Criminal_Email='INVALID_EMAIL';

if Loot_Amount=. then Loot_Amount=50000;

Loot_Risk=round(Loot_Amount/100000,.1);

length Loot_Category $10.;

select;

   when (Loot_Amount > 900000) Loot_Category='HIGH';

   when (Loot_Amount > 400000) Loot_Category='MEDIUM';

   otherwise Loot_Category='LOW';

end;

run;

proc print data = bank_loot_clean;

run;

OUTPUT:

ObsSource_SystemLoot_IDBank_NameCountryRegionCriminal_EmailStatusLoot_AmountLoot_DateCriminal_AgeiLoot_RiskLoot_Category
1GLOBAL_FRAUD_AUDITL001WorldbankUsaNORTH_AMERICALootmaster@gmail.ComClosed50000015JAN20254565.0MEDIUM
2GLOBAL_FRAUD_AUDITL002SafevaultIndiaASIA_PACIFICINVALID_EMAILOpen7000018FEB20251860.7LOW
3GLOBAL_FRAUD_AUDITL003TrustbankUkEUROPEINVALID_EMAILInvestigating950000..69.5HIGH
4GLOBAL_FRAUD_AUDITL004MoneysecureUsaNORTH_AMERICACrimeboss@yahooClosed45000020MAR20253564.5MEDIUM
5GLOBAL_FRAUD_AUDITL004MoneysecureUsaNORTH_AMERICACrimeboss@yahooClosed45000020MAR20253564.5MEDIUM
6GLOBAL_FRAUD_AUDITL005CashempireCanadaNULLCashking@gmail.ComResolved5000001APR20255260.5LOW
7GLOBAL_FRAUD_AUDITL006SecureholdGermanyEUROPEINVALID_EMAILOpen30000009MAY20251863.0LOW
8GLOBAL_FRAUD_AUDITL007IronbankIndiaASIA_PACIFICStealer@gmail.ComClosed1000000.40610.0HIGH
9GLOBAL_FRAUD_AUDITL008RoyaltrustFranceEUROPERobber@domainOpen20000012JUL20253962.0LOW
10GLOBAL_FRAUD_AUDITL009VaultcoreJapanASIA_PACIFICThief@gmail.ComClosed65000015AUG20251866.5MEDIUM
11GLOBAL_FRAUD_AUDITL010PrimecashBrazilLATAMBadmail@Resolved72000018SEP20252967.2MEDIUM
12GLOBAL_FRAUD_AUDITL011SecurelifeUkEUROPEStealworld@gmail.ComClosed011OCT20254860.0LOW
13GLOBAL_FRAUD_AUDITL012MegafundsAustraliaASIA_PACIFICMoneyloot@gmail.ComInvestigating88000014NOV20253368.8MEDIUM
14GLOBAL_FRAUD_AUDITL013GoldreserveUsaNORTH_AMERICAINVALID_EMAILClosed25000017DEC20254462.5LOW
15GLOBAL_FRAUD_AUDITL014UrbanvaultIndiaASIA_PACIFICUrban@gmail.ComResolved45000011JAN20262264.5MEDIUM
16GLOBAL_FRAUD_AUDITL015TrustsafeItalyEUROPELootmail@gmail.ComClosed999999909FEB2026316100.0HIGH
17GLOBAL_FRAUD_AUDITL016NextbankChinaASIA_PACIFICChina_thief@gmailOpen56000005MAR20266065.6MEDIUM
18GLOBAL_FRAUD_AUDITL017CashflowMexicoLATAMMxloot@gmail.ComClosed34000006APR20263863.4LOW
19GLOBAL_FRAUD_AUDITL018VaultnationUsaNORTH_AMERICANation@gmail.ComPending5000007MAY20264160.5LOW
20GLOBAL_FRAUD_AUDITL019SafemoneyIndiaASIA_PACIFICSafe@gmail.ComOpen10008JUN20262760.0LOW
21GLOBAL_FRAUD_AUDITL020EurotrustSpainEUROPEEuro@gmail.ComClosed43000009JUL20263264.3MEDIUM

Explanation

This DATA step demonstrates enterprise-grade SAS cleaning logic. Arrays standardize multiple variables simultaneously, reducing repetitive coding. PROPCASE, STRIP, and UPCASE normalize inconsistent text values. ABS() corrects negative monetary values frequently caused by ETL corruption or reversed accounting signs. FIND() validates email structures, while conditional logic fixes impossible ages and missing amounts. ROUND() standardizes derived metrics for reporting consistency. Such logic is common in clinical trial SDTM preparation, banking fraud detection, and insurance claims processing where regulatory traceability matters.

3.Deduplication

proc sort data=bank_loot_clean nodupkey;

by Loot_ID;

run;

proc print data = bank_loot_clean;

run;

LOG:

NOTE: There were 21 observations read from the data set WORK.BANK_LOOT_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.

OUTPUT:

ObsSource_SystemLoot_IDBank_NameCountryRegionCriminal_EmailStatusLoot_AmountLoot_DateCriminal_AgeiLoot_RiskLoot_Category
1GLOBAL_FRAUD_AUDITL001WorldbankUsaNORTH_AMERICALootmaster@gmail.ComClosed50000015JAN20254565.0MEDIUM
2GLOBAL_FRAUD_AUDITL002SafevaultIndiaASIA_PACIFICINVALID_EMAILOpen7000018FEB20251860.7LOW
3GLOBAL_FRAUD_AUDITL003TrustbankUkEUROPEINVALID_EMAILInvestigating950000..69.5HIGH
4GLOBAL_FRAUD_AUDITL004MoneysecureUsaNORTH_AMERICACrimeboss@yahooClosed45000020MAR20253564.5MEDIUM
5GLOBAL_FRAUD_AUDITL005CashempireCanadaNULLCashking@gmail.ComResolved5000001APR20255260.5LOW
6GLOBAL_FRAUD_AUDITL006SecureholdGermanyEUROPEINVALID_EMAILOpen30000009MAY20251863.0LOW
7GLOBAL_FRAUD_AUDITL007IronbankIndiaASIA_PACIFICStealer@gmail.ComClosed1000000.40610.0HIGH
8GLOBAL_FRAUD_AUDITL008RoyaltrustFranceEUROPERobber@domainOpen20000012JUL20253962.0LOW
9GLOBAL_FRAUD_AUDITL009VaultcoreJapanASIA_PACIFICThief@gmail.ComClosed65000015AUG20251866.5MEDIUM
10GLOBAL_FRAUD_AUDITL010PrimecashBrazilLATAMBadmail@Resolved72000018SEP20252967.2MEDIUM
11GLOBAL_FRAUD_AUDITL011SecurelifeUkEUROPEStealworld@gmail.ComClosed011OCT20254860.0LOW
12GLOBAL_FRAUD_AUDITL012MegafundsAustraliaASIA_PACIFICMoneyloot@gmail.ComInvestigating88000014NOV20253368.8MEDIUM
13GLOBAL_FRAUD_AUDITL013GoldreserveUsaNORTH_AMERICAINVALID_EMAILClosed25000017DEC20254462.5LOW
14GLOBAL_FRAUD_AUDITL014UrbanvaultIndiaASIA_PACIFICUrban@gmail.ComResolved45000011JAN20262264.5MEDIUM
15GLOBAL_FRAUD_AUDITL015TrustsafeItalyEUROPELootmail@gmail.ComClosed999999909FEB2026316100.0HIGH
16GLOBAL_FRAUD_AUDITL016NextbankChinaASIA_PACIFICChina_thief@gmailOpen56000005MAR20266065.6MEDIUM
17GLOBAL_FRAUD_AUDITL017CashflowMexicoLATAMMxloot@gmail.ComClosed34000006APR20263863.4LOW
18GLOBAL_FRAUD_AUDITL018VaultnationUsaNORTH_AMERICANation@gmail.ComPending5000007MAY20264160.5LOW
19GLOBAL_FRAUD_AUDITL019SafemoneyIndiaASIA_PACIFICSafe@gmail.ComOpen10008JUN20262760.0LOW
20GLOBAL_FRAUD_AUDITL020EurotrustSpainEUROPEEuro@gmail.ComClosed43000009JUL20263264.3MEDIUM

Explanation

Duplicate transaction identifiers create catastrophic reporting problems in banking and healthcare systems. PROC SORT NODUPKEY removes duplicate observations based on business keys. In clinical trials, duplicate subject IDs can invalidate efficacy analysis populations. In fraud systems, duplicate transactions inflate financial exposure calculations. Deduplication is therefore not cosmetic cleaning it is a compliance requirement.

4.PROC SQL Validation Layer

proc sql;

create table loot_summary as

select Region,count(*) as Total_Loots,

       sum(Loot_Amount) as Total_Amount format=dollar15.,

       avg(Criminal_Age) as Avg_Age

from bank_loot_clean

group by Region;

quit;

proc print data = loot_summary;

run;

OUTPUT:

ObsRegionTotal_LootsTotal_AmountAvg_Age
1ASIA_PACIFIC7$3,610,10031.1429
2EUROPE6$11,879,99933.6000
3LATAM2$1,060,00033.5000
4NORTH_AMERICA4$1,250,00041.2500
5NULL1$50,00052.0000

Explanation

PROC SQL provides relational-style summarization and joins. Compared to DATA step BY-group processing, PROC SQL is often preferred for aggregation-heavy workflows. Financial intelligence systems use SQL-based summaries for executive dashboards, fraud heatmaps, and compliance reports. Proper grouping ensures consistent regional metrics after standardization logic has corrected corrupted region labels.

5.Advanced DATA Step MERGE Logic

data region_master;

input Region:$15. Risk_Level $;

datalines;

EUROPE HIGH

ASIA_PACIFIC MEDIUM

NORTH_AMERICA HIGH

LATAM MEDIUM

;

run;

proc print data = region_master;

run;

OUTPUT:

ObsRegionRisk_Level
1EUROPEHIGH
2ASIA_PACIFICMEDIUM
3NORTH_AMERICAHIGH
4LATAMMEDIUM

proc sort data=bank_loot_clean; by Region; run;

proc print data = bank_loot_clean;

run;

OUTPUT:

ObsSource_SystemLoot_IDBank_NameCountryRegionCriminal_EmailStatusLoot_AmountLoot_DateCriminal_AgeiLoot_RiskLoot_Category
1GLOBAL_FRAUD_AUDITL002SafevaultIndiaASIA_PACIFICINVALID_EMAILOpen7000018FEB20251860.7LOW
2GLOBAL_FRAUD_AUDITL007IronbankIndiaASIA_PACIFICStealer@gmail.ComClosed1000000.40610.0HIGH
3GLOBAL_FRAUD_AUDITL009VaultcoreJapanASIA_PACIFICThief@gmail.ComClosed65000015AUG20251866.5MEDIUM
4GLOBAL_FRAUD_AUDITL012MegafundsAustraliaASIA_PACIFICMoneyloot@gmail.ComInvestigating88000014NOV20253368.8MEDIUM
5GLOBAL_FRAUD_AUDITL014UrbanvaultIndiaASIA_PACIFICUrban@gmail.ComResolved45000011JAN20262264.5MEDIUM
6GLOBAL_FRAUD_AUDITL016NextbankChinaASIA_PACIFICChina_thief@gmailOpen56000005MAR20266065.6MEDIUM
7GLOBAL_FRAUD_AUDITL019SafemoneyIndiaASIA_PACIFICSafe@gmail.ComOpen10008JUN20262760.0LOW
8GLOBAL_FRAUD_AUDITL003TrustbankUkEUROPEINVALID_EMAILInvestigating950000..69.5HIGH
9GLOBAL_FRAUD_AUDITL006SecureholdGermanyEUROPEINVALID_EMAILOpen30000009MAY20251863.0LOW
10GLOBAL_FRAUD_AUDITL008RoyaltrustFranceEUROPERobber@domainOpen20000012JUL20253962.0LOW
11GLOBAL_FRAUD_AUDITL011SecurelifeUkEUROPEStealworld@gmail.ComClosed011OCT20254860.0LOW
12GLOBAL_FRAUD_AUDITL015TrustsafeItalyEUROPELootmail@gmail.ComClosed999999909FEB2026316100.0HIGH
13GLOBAL_FRAUD_AUDITL020EurotrustSpainEUROPEEuro@gmail.ComClosed43000009JUL20263264.3MEDIUM
14GLOBAL_FRAUD_AUDITL010PrimecashBrazilLATAMBadmail@Resolved72000018SEP20252967.2MEDIUM
15GLOBAL_FRAUD_AUDITL017CashflowMexicoLATAMMxloot@gmail.ComClosed34000006APR20263863.4LOW
16GLOBAL_FRAUD_AUDITL001WorldbankUsaNORTH_AMERICALootmaster@gmail.ComClosed50000015JAN20254565.0MEDIUM
17GLOBAL_FRAUD_AUDITL004MoneysecureUsaNORTH_AMERICACrimeboss@yahooClosed45000020MAR20253564.5MEDIUM
18GLOBAL_FRAUD_AUDITL013GoldreserveUsaNORTH_AMERICAINVALID_EMAILClosed25000017DEC20254462.5LOW
19GLOBAL_FRAUD_AUDITL018VaultnationUsaNORTH_AMERICANation@gmail.ComPending5000007MAY20264160.5LOW
20GLOBAL_FRAUD_AUDITL005CashempireCanadaNULLCashking@gmail.ComResolved5000001APR20255260.5LOW

proc sort data=region_master; by Region; run;

proc print data = region_master;

run;

OUTPUT:

ObsRegionRisk_Level
1ASIA_PACIFICMEDIUM
2EUROPEHIGH
3LATAMMEDIUM
4NORTH_AMERICAHIGH

data merged_loot;

merge bank_loot_clean(in=a)

      region_master(in=b);

by Region;

if a;

run;

proc print data = merged_loot;

run;

OUTPUT:

ObsSource_SystemLoot_IDBank_NameCountryRegionCriminal_EmailStatusLoot_AmountLoot_DateCriminal_AgeiLoot_RiskLoot_CategoryRisk_Level
1GLOBAL_FRAUD_AUDITL002SafevaultIndiaASIA_PACIFICINVALID_EMAILOpen7000018FEB20251860.7LOWMEDIUM
2GLOBAL_FRAUD_AUDITL007IronbankIndiaASIA_PACIFICStealer@gmail.ComClosed1000000.40610.0HIGHMEDIUM
3GLOBAL_FRAUD_AUDITL009VaultcoreJapanASIA_PACIFICThief@gmail.ComClosed65000015AUG20251866.5MEDIUMMEDIUM
4GLOBAL_FRAUD_AUDITL012MegafundsAustraliaASIA_PACIFICMoneyloot@gmail.ComInvestigating88000014NOV20253368.8MEDIUMMEDIUM
5GLOBAL_FRAUD_AUDITL014UrbanvaultIndiaASIA_PACIFICUrban@gmail.ComResolved45000011JAN20262264.5MEDIUMMEDIUM
6GLOBAL_FRAUD_AUDITL016NextbankChinaASIA_PACIFICChina_thief@gmailOpen56000005MAR20266065.6MEDIUMMEDIUM
7GLOBAL_FRAUD_AUDITL019SafemoneyIndiaASIA_PACIFICSafe@gmail.ComOpen10008JUN20262760.0LOWMEDIUM
8GLOBAL_FRAUD_AUDITL003TrustbankUkEUROPEINVALID_EMAILInvestigating950000..69.5HIGHHIGH
9GLOBAL_FRAUD_AUDITL006SecureholdGermanyEUROPEINVALID_EMAILOpen30000009MAY20251863.0LOWHIGH
10GLOBAL_FRAUD_AUDITL008RoyaltrustFranceEUROPERobber@domainOpen20000012JUL20253962.0LOWHIGH
11GLOBAL_FRAUD_AUDITL011SecurelifeUkEUROPEStealworld@gmail.ComClosed011OCT20254860.0LOWHIGH
12GLOBAL_FRAUD_AUDITL015TrustsafeItalyEUROPELootmail@gmail.ComClosed999999909FEB2026316100.0HIGHHIGH
13GLOBAL_FRAUD_AUDITL020EurotrustSpainEUROPEEuro@gmail.ComClosed43000009JUL20263264.3MEDIUMHIGH
14GLOBAL_FRAUD_AUDITL010PrimecashBrazilLATAMBadmail@Resolved72000018SEP20252967.2MEDIUMMEDIUM
15GLOBAL_FRAUD_AUDITL017CashflowMexicoLATAMMxloot@gmail.ComClosed34000006APR20263863.4LOWMEDIUM
16GLOBAL_FRAUD_AUDITL001WorldbankUsaNORTH_AMERICALootmaster@gmail.ComClosed50000015JAN20254565.0MEDIUMHIGH
17GLOBAL_FRAUD_AUDITL004MoneysecureUsaNORTH_AMERICACrimeboss@yahooClosed45000020MAR20253564.5MEDIUMHIGH
18GLOBAL_FRAUD_AUDITL013GoldreserveUsaNORTH_AMERICAINVALID_EMAILClosed25000017DEC20254462.5LOWHIGH
19GLOBAL_FRAUD_AUDITL018VaultnationUsaNORTH_AMERICANation@gmail.ComPending5000007MAY20264160.5LOWHIGH
20GLOBAL_FRAUD_AUDITL005CashempireCanadaNULLCashking@gmail.ComResolved5000001APR20255260.5LOW 

Explanation

MERGE logic is foundational in SAS production environments. Fraud systems often enrich operational data with lookup tables, risk classifications, or external reference datasets. The IN= dataset options provide record lineage tracking, ensuring only valid transactional observations are retained. Clinical trial programmers use similar logic when merging demographics, adverse events, and exposure datasets.

6.PROC REPORT for Executive Outputs

proc report data=loot_summary nowd;

columns Region Total_Loots Total_Amount Avg_Age;

define Region / group;

define Total_Loots / analysis;

define Total_Amount / analysis;

define Avg_Age / analysis;

run;

OUTPUT:

RegionTotal_LootsTotal_AmountAvg_Age
ASIA_PACIFIC7$3,610,10031.142857
EUROPE6$11,879,99933.6
LATAM2$1,060,00033.5
NORTH_AMERICA4$1,250,00041.25
NULL1$50,00052

Explanation

PROC REPORT generates professional enterprise reports suitable for auditors, executives, and regulatory submissions. Unlike simple PROC PRINT outputs, PROC REPORT supports grouped presentation logic, calculated fields, custom formatting, and advanced layouts. In clinical programming, TLFs (Tables, Listings, Figures) heavily rely on reporting procedures like this.

7.Modern R Cleaning Workflow

Raw Dataset

library(tidyverse)

library(lubridate)

library(janitor)

bank_raw <- tibble(

  Loot_ID = c("L001","L002","L003","L004","L004"),

  Bank_Name = c("WorldBank","SafeVault","TrustBank","MoneySecure","MoneySecure"),

  Region = c("NA","apac","EU"," north america ","EUROPE"),

  Loot_Amount = c(500000,-70000,NA,450000,450000),

  Criminal_Age = c(45,17,130,35,35),

  Email = c("loot@gmail.com","badmail","NULL","boss@yahoo","boss@yahoo")

)

OUTPUT:

 

Loot_ID

Bank_Name

Region

Loot_Amount

Criminal_Age

Email

1

L001

WorldBank

NA

500000

45

loot@gmail.com

2

L002

SafeVault

apac

-70000

17

badmail

3

L003

TrustBank

EU

NA

130

NULL

4

L004

MoneySecure

 north america 

450000

35

boss@yahoo

5

L004

MoneySecure

EUROPE

450000

35

boss@yahoo


Cleaning Logic in R

bank_clean <- bank_raw %>%

  clean_names() %>%

  mutate(region = str_trim(str_to_upper(region)),

         region = case_when(

         region %in% c("NA","NORTH AMERICA") ~ "NORTH_AMERICA",

         region %in% c("EU","EUROPE") ~ "EUROPE",

         TRUE ~ "ASIA_PACIFIC"),

         loot_amount = abs(loot_amount),

         criminal_age = if_else(criminal_age > 100 |

                               criminal_age < 18,NA_real_,

                               criminal_age),

         email = if_else(grepl("@", email),email,"INVALID_EMAIL")

) %>%

  distinct()

OUTPUT:

 

loot_id

bank_name

region

loot_amount

criminal_age

email

1

L001

WorldBank

NORTH_AMERICA

500000

45

loot@gmail.com

2

L002

SafeVault

ASIA_PACIFIC

70000

NA

INVALID_EMAIL

3

L003

TrustBank

EUROPE

NA

NA

INVALID_EMAIL

4

L004

MoneySecure

NORTH_AMERICA

450000

35

boss@yahoo

5

L004

MoneySecure

EUROPE

450000

35

boss@yahoo


Explanation

This R workflow mirrors SAS transformations using tidyverse functions. mutate() behaves similarly to SAS assignment logic. case_when() parallels SELECT-WHEN. str_trim() and str_to_upper() standardize corrupted text. distinct() removes duplicates like PROC SORT NODUPKEY. if_else() provides conditional corrections comparable to IF-THEN logic in SAS. Tidyverse pipelines improve readability and modularity, especially in exploratory data science workflows.

SAS vs R Comparison

Feature

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Very High

Growing

Scalability

Enterprise-grade

High with optimization

Metadata Control

Strong

Flexible

Clinical Trial Dominance

Industry Standard

Supplemental

Visualization

Moderate

Excellent

Reproducibility

Strong

Strong

Macro Automation

Powerful

Functional Programming

8.Enterprise Validation & Compliance

In SDTM and ADaM workflows, missing values are dangerous. SAS treats numeric missing values as lower than all valid numbers.

Example:

if score < 50 then flag='FAIL';

Missing scores also become FAIL unless explicitly checked.

Correct approach:

if score ne . and score < 50 then flag='FAIL';

Regulatory agencies expect:

  • Traceability
  • Audit trails
  • Independent QC
  • Metadata governance
  • Controlled terminology
  • Reproducibility
  • Validation independence

Improper missing-value handling can invalidate clinical trial analyses or fraud detection models.

9.Business Logic Behind Cleaning

Business logic defines how corrupted operational data should be transformed into analytically meaningful information. Missing values are often imputed because downstream reporting engines, machine learning models, and statistical calculations fail when essential variables are blank. For example, if a patient visit date is missing in a clinical trial, exposure duration calculations become impossible, affecting efficacy analysis. Similarly, missing transaction dates in fraud systems distort temporal risk trends.

Unrealistic values must also be corrected. A criminal age of 130 or negative stolen amount is logically impossible and likely caused by ETL corruption, manual-entry mistakes, or source-system failures. Standardizing dates using formats like YYYY-MM-DD ensures interoperability between systems and accurate time-series analysis.

Text normalization is equally critical. Region labels such as “EU,” “ europe ”, and “EUROPE” represent the same business entity but produce fragmented dashboards if not standardized. Email validation prevents broken communication pipelines and identity mismatches.

In healthcare, correcting age ranges ensures valid demographic analysis. In retail or banking, standardized customer categories improve segmentation models. Ultimately, business logic bridges raw operational chaos and reliable enterprise intelligence.

10.20 Data-Cleaning Best Practices

  1. Validate metadata before transformations
  2. Standardize controlled terminology
  3. Remove duplicates early
  4. Always preserve raw datasets
  5. Use audit-ready logs
  6. Separate derivation and validation logic
  7. Apply defensive programming
  8. Validate date ranges
  9. Check impossible numeric values
  10. Normalize categorical variables
  11. Use reusable macros
  12. Document assumptions clearly
  13. Implement QC independence
  14. Avoid hardcoded mappings
  15. Track lineage across datasets
  16. Use consistent formats/informats
  17. Validate joins after merges
  18. Perform frequency checks
  19. Standardize missing-value handling
  20. Automate validation reports

11.20 Sharp Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic beats visual inspection.
  • Duplicate IDs destroy trust.
  • Missing values silently break analytics.
  • Standardized variables improve reproducibility.
  • Metadata controls enterprise quality.
  • Audit trails protect organizations.
  • PROC SQL simplifies relational analysis.
  • DATA step provides granular control.
  • R excels in exploratory transformations.
  • SAS dominates regulated industries.
  • Text normalization prevents dashboard fragmentation.
  • Defensive programming prevents production failures.
  • Controlled terminology reduces ambiguity.
  • Incorrect formats create hidden errors.
  • Validation is continuous, not optional.
  • Business rules must be documented.
  • Reproducibility is an enterprise asset.
  • Automated QC saves time and money.
  • Clean data drives trustworthy AI.

12.SAS and R Summary

SAS and R both play major roles in modern enterprise data engineering ecosystems, but they serve different operational strengths. SAS is deeply dominant in regulated industries such as clinical trials, banking, pharmaceuticals, and insurance because of its exceptional auditability, metadata governance, reproducibility, and validation traceability. Regulatory agencies trust SAS outputs because enterprise workflows are standardized, controlled, and easily reviewable. Procedures like PROC SQL, PROC REPORT, PROC FORMAT, and DATA step programming provide scalable frameworks for handling massive operational datasets with predictable execution behavior.

R, meanwhile, provides unmatched flexibility for exploratory analysis, modern visualization, advanced machine learning, and dynamic transformation pipelines. Packages like tidyverse, lubridate, janitor, and purrr simplify complex cleaning workflows using readable syntax and functional programming principles. R enables rapid experimentation and sophisticated analytics that complement SAS production systems.

From a scalability perspective, SAS excels in structured enterprise environments with large governance requirements, while R thrives in innovation-driven analytics teams. SAS macros provide industrial-grade automation, whereas R offers flexible reusable functions and package ecosystems.

The strongest enterprise strategy is not SAS versus R it is SAS and R together. Many organizations now use SAS for validated production pipelines and R for exploratory analytics, predictive modeling, and visualization. Combining both technologies creates scalable, compliant, high-performance analytical ecosystems capable of delivering trustworthy business intelligence across healthcare, banking, retail, and insurance domains.

13.Conclusion

Modern analytics systems are built on one foundational principle: reliable insights require reliable data. Whether the domain involves global bank loot investigations, clinical trial patient tracking, insurance claims analysis, or retail fraud detection, poor-quality operational data introduces enormous business risk. Duplicate transaction identifiers, malformed emails, corrupted region labels, impossible ages, invalid timestamps, and inconsistent formats do not merely create cosmetic reporting issues they directly impact strategic decisions, compliance outcomes, AI model reliability, and financial trust.

This project demonstrated how enterprise-grade data engineering workflows can systematically transform corrupted raw operational datasets into trustworthy analytical intelligence using both SAS and R. SAS provides unmatched control, auditability, metadata governance, and production reliability through DATA step programming, PROC SQL, macros, validation procedures, and reporting frameworks. Its strength lies in regulated environments where traceability and reproducibility are mandatory.

R complements these capabilities with flexible transformation pipelines, modern functional programming, advanced exploratory analytics, and efficient text handling through tidyverse ecosystems. Together, SAS and R create a hybrid analytics framework capable of supporting both highly governed enterprise reporting and modern data science innovation.

The most successful organizations do not treat data cleaning as a secondary technical task. They recognize it as a strategic operational discipline. Clean data improves fraud detection accuracy, strengthens clinical trial integrity, enhances executive dashboards, protects regulatory submissions, and enables trustworthy AI systems.

Ultimately, structured data-cleaning frameworks are no longer optional in modern analytics ecosystems. They are foundational pillars of scalable, production-grade intelligence. Organizations that invest in rigorous validation, metadata governance, defensive programming, and reproducible workflows gain something more valuable than clean datasets they gain confidence in every decision powered by their data.

14.Interview Questions and Answers

1. What are arrays in SAS and why are they useful?

Answer:

Arrays allow multiple variables to be processed using loops.

Example

data clean_scores;
set exam;

array scores(*) score1-score5;

do i=1 to dim(scores);

   if scores(i) < 0 then scores(i)=0;

end;

run;

Key Points

  • Reduces repetitive code.
  • Useful in bulk cleaning operations.
  • Common in laboratory data processing.

2. What is the purpose of PROC FORMAT?

Answer:

PROC FORMAT creates custom labels and grouped categories.

Example

proc format;
value riskfmt 0-100000='LOW'
            100001-500000='MEDIUM'
                 500001-high='HIGH';
run;

Usage

format amount riskfmt.;

Key Points

  • Improves report readability.
  • Standardizes outputs.
  • Widely used in TLF generation.

3. How do you validate email formats in SAS?

Answer:

Functions like FIND, INDEX, VERIFY, and PRXMATCH are used.

Example

if find(email,'@')=0 then email_status='INVALID';

Advanced Regex Example

if prxmatch('/^\S+@\S+\.\S+$/',email)
then status='VALID';

Key Points

  • Important in customer datasets.
  • Prevents communication failures.
  • Regex validation is more robust.

4. What is the difference between STRIP, TRIM, and COMPRESS?

Answer:

Function

Purpose

STRIP

Removes leading/trailing spaces

TRIM

Removes trailing spaces

COMPRESS

Removes specified characters

Example

clean_name=strip(name);

digits=compress(phone,'-() ');

Key Points

  • Important for standardization.
  • Improves matching accuracy.
  • Reduces hidden whitespace corruption.

5. How do you merge datasets in SAS?

Answer:

Datasets are merged using BY variables after sorting.

Example

proc sort data=demog; by Subject_ID; run;
proc sort data=lab; by Subject_ID; run;

data final;
merge demog(in=a)
      lab(in=b);

by Subject_ID;

if a;
run;

Key Points

  • Sorting is mandatory.
  • IN= flags track dataset origin.
  • Common in SDTM/ADaM derivations.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about BANK LOOTS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS