Illegal Movie Tracking, Corrupted Logs & Validation Battles: Advanced SAS and R Data Engineering for Trusted Reporting
Analysis-Ready Illegal Movie
Intelligence Pipelines with SAS and R for Enterprise-Grade Reporting Excellence
Introduction: When Dirty Data Becomes a Global
Business Risk
Imagine a
global cybercrime monitoring organization attempting to track illegal movie
distribution networks across multiple countries. Analysts receive streaming
intelligence feeds from law-enforcement agencies, telecom operators, digital
platforms, and copyright monitoring vendors. Unfortunately, the incoming
datasets are severely corrupted.
Some
movie IDs are duplicated.
Several release dates are invalid.
Billing losses appear as negative values.
Region codes are inconsistent.
Email alerts contain malformed addresses.
Movie categories contain values like "NULL", "???", "pir@cy",
and "unknown ".
One
dashboard incorrectly reports that piracy losses decreased by 40%. Another AI
fraud model misclassifies high-risk piracy syndicates as low-risk due to
missing values. Regulatory reporting teams submit inconsistent statistical
outputs to international compliance boards.
This is
exactly how poor-quality data destroys analytical trust.
In
enterprise analytics ecosystems, dirty data is not just a technical
inconvenience—it becomes a compliance failure, operational disaster, and
financial liability.
As
Clinical SAS Programmers and Data Scientists, our responsibility is to
transform unreliable operational records into audit-ready analytical
intelligence.
This
article demonstrates how to create, clean, validate, standardize, and engineer
an analysis-ready dataset about different types of illegal movies worldwide
using both SAS and R.
Different Types of Illegal
Movies in the World
Illegal
movie categories monitored globally include:
|
Illegal
Movie Type |
Description |
|
CamRip |
Theater-recorded
illegal copies |
|
Torrent
Piracy |
Peer-to-peer
unauthorized distribution |
|
Bootleg
DVD |
Unlicensed
physical distribution |
|
IPTV
Piracy |
Illegal
streaming subscription services |
|
Screen
Recording |
OTT
platform capture |
|
Deepfake
Films |
AI-generated
manipulated movie content |
|
Copyright
Bypass |
DRM
removal and redistribution |
|
Dark
Web Cinema |
Hidden
marketplace movie sharing |
|
Telegram
Piracy |
Unauthorized
sharing through messaging apps |
|
Subtitle
Manipulation |
Fake
subtitles with malware embedding |
Raw Dataset with Intentional Errors (SAS)
1.SAS Raw Dataset Creation
data illegal_movies_raw;
length Movie_ID $12 Movie_Name $40 Country $20 Piracy_Type $25
Upload_Date $12 Region_Code $8 Contact_Email $50 Risk_Level $15;
infile datalines dlm='|' truncover;
input Movie_ID $ Movie_Name $ Country $ Piracy_Type $ Loss_Amount
Upload_Date $ Region_Code $ Contact_Email $ Risk_Level $;
datalines;
MV001|avatar2|india|Torrent|50000|12JAN2025|AP01|alert@gmail.com|HIGH
MV001|Avatar2| INDIA |torrent|-50000|31FEB2025|ap01|wrongmail.com|high
MV002|Batman|USA|CamRip|75000|15MAR2025|USA1|batman@gmail|MEDIUM
MV003|NULL|UK|IPTV|.|22APR2025|UK_01|admin@yahoo.com|LOW
MV004|Joker|canada|ScreenRec|98000|99XYZ2025|CA01|joker@@mail.com|CRITICAL
MV005|Pushpa2|India|bootlegdvd|120000|05MAY2025|IN01|pushpa@gmail.com|HIGH
MV006|kgf2|india|???|-45000|14JUN2025|IN_02|kgfmail.com|HIGH
MV007|Spiderman|USA|Torrent|87000|.|US02|spider@gmail.com|NULL
MV008|Frozen|Sweden|DeepFake|25000|11AUG2025|SW01|frozen@gmail.com|LOW
MV009|Avengers|usa|DarkWeb|100000|15SEP2025|US03|avengers@gmailcom|CRITICAL
MV010|Interstellar|UK|IPTV|65000|21OCT2025|UK01|interstellar@gmail.com|MEDIUM
MV011|Titanic|India|TelegramPiracy|45000|18NOV2025| IN03 |titanic@gmail.com|HIGH
MV012|Bahubali|India|Torrent|85000|29FEB2023|AP02|bahubali@gmail.|HIGH
MV013|Leo|India|ScreenRec|67000|07DEC2025|TN01|leo@gmail.com|MEDIUM
MV014|Animal|India|CamRip|-98000|09JAN2025|TN02|animal.com|HIGH
MV015|Jawan|India|Torrent|99000|10FEB2025|AP03|jawan@gmail.com|CRITICAL
MV016|Dune2|USA|IPTV|45000|15MAR2025|US04|dune2@gmail.com|LOW
MV017|Kalki|India|DeepFake|87000|18APR2025|AP04| kalki@gmail.com |HIGH
MV018|Mufasa|UK|Torrent|78000|20MAY2025|UK02|mufasa@gmail.com|MEDIUM
MV019|Gladiator|Italy|DarkWeb|91000|11JUN2025|IT01|gladiator@gmail.com|CRITICAL
MV020|Matrix|USA|NULL|0|12JUL2025|US05|matrixgmail.com|LOW
;
run;
proc print data = illegal_movies_raw;
run;
OUTPUT:
| Obs | Movie_ID | Movie_Name | Country | Piracy_Type | Upload_Date | Region_Code | Contact_Email | Risk_Level | Loss_Amount |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MV001 | avatar2 | india | Torrent | 12JAN2025 | AP01 | alert@gmail.com | HIGH | 50000 |
| 2 | MV001 | Avatar2 | INDIA | torrent | 31FEB2025 | ap01 | wrongmail.com | high | -50000 |
| 3 | MV002 | Batman | USA | CamRip | 15MAR2025 | USA1 | batman@gmail | MEDIUM | 75000 |
| 4 | MV003 | NULL | UK | IPTV | 22APR2025 | UK_01 | admin@yahoo.com | LOW | . |
| 5 | MV004 | Joker | canada | ScreenRec | 99XYZ2025 | CA01 | joker@@mail.com | CRITICAL | 98000 |
| 6 | MV005 | Pushpa2 | India | bootlegdvd | 05MAY2025 | IN01 | pushpa@gmail.com | HIGH | 120000 |
| 7 | MV006 | kgf2 | india | ??? | 14JUN2025 | IN_02 | kgfmail.com | HIGH | -45000 |
| 8 | MV007 | Spiderman | USA | Torrent | US02 | spider@gmail.com | NULL | 87000 | |
| 9 | MV008 | Frozen | Sweden | DeepFake | 11AUG2025 | SW01 | frozen@gmail.com | LOW | 25000 |
| 10 | MV009 | Avengers | usa | DarkWeb | 15SEP2025 | US03 | avengers@gmailcom | CRITICAL | 100000 |
| 11 | MV010 | Interstellar | UK | IPTV | 21OCT2025 | UK01 | interstellar@gmail.com | MEDIUM | 65000 |
| 12 | MV011 | Titanic | India | TelegramPiracy | 18NOV2025 | IN03 | titanic@gmail.com | HIGH | 45000 |
| 13 | MV012 | Bahubali | India | Torrent | 29FEB2023 | AP02 | bahubali@gmail. | HIGH | 85000 |
| 14 | MV013 | Leo | India | ScreenRec | 07DEC2025 | TN01 | leo@gmail.com | MEDIUM | 67000 |
| 15 | MV014 | Animal | India | CamRip | 09JAN2025 | TN02 | animal.com | HIGH | -98000 |
| 16 | MV015 | Jawan | India | Torrent | 10FEB2025 | AP03 | jawan@gmail.com | CRITICAL | 99000 |
| 17 | MV016 | Dune2 | USA | IPTV | 15MAR2025 | US04 | dune2@gmail.com | LOW | 45000 |
| 18 | MV017 | Kalki | India | DeepFake | 18APR2025 | AP04 | kalki@gmail.com | HIGH | 87000 |
| 19 | MV018 | Mufasa | UK | Torrent | 20MAY2025 | UK02 | mufasa@gmail.com | MEDIUM | 78000 |
| 20 | MV019 | Gladiator | Italy | DarkWeb | 11JUN2025 | IT01 | gladiator@gmail.com | CRITICAL | 91000 |
| 21 | MV020 | Matrix | USA | NULL | 12JUL2025 | US05 | matrixgmail.com | LOW | 0 |
Key Explanation
This raw
dataset intentionally contains:
- Duplicate Movie IDs
- Invalid dates
- Negative losses
- Missing values
- Malformed emails
- Inconsistent casing
- Corrupted category labels
- Whitespace contamination
- NULL string placeholders
The LENGTH
statement appears before assignments because SAS determines character variable
storage during compilation. If omitted early, SAS may truncate values
permanently.
This is
called Character Truncation Risk.
Unlike
SAS, R dynamically expands character vectors without predefined lengths.
2.Standardization and Cleaning
data illegal_movies_clean;
set illegal_movies_raw;
length Clean_Piracy_Type $25 Clean_Email $50;
Movie_Name=propcase(strip(Movie_Name));
Country=upcase(strip(Country));
Piracy_Type=compress(upcase(Piracy_Type));
Region_Code=compress(upcase(Region_Code));
Risk_Level=upcase(strip(Risk_Level));
if Loss_Amount < 0 then
Loss_Amount=abs(Loss_Amount);
Parsed_Date=input(Upload_Date,?? date9.);
format Parsed_Date yymmdd10.;
if missing(Parsed_Date) then
Parsed_Date='01JAN2025'd;
if find(Contact_Email,'@') and find(Contact_Email,'.')
then Clean_Email=strip(Contact_Email);
else Clean_Email='INVALID_EMAIL';
select(Piracy_Type);
when('TORRENT') Clean_Piracy_Type='TORRENT';
when('CAMRIP') Clean_Piracy_Type='CAMRIP';
when('BOOTLEGDVD') Clean_Piracy_Type='BOOTLEG_DVD';
when('IPTV') Clean_Piracy_Type='IPTV';
when('SCREENREC') Clean_Piracy_Type='SCREEN_RECORDING';
when('DEEPFAKE') Clean_Piracy_Type='DEEPFAKE';
when('DARKWEB') Clean_Piracy_Type='DARK_WEB';
otherwise
Clean_Piracy_Type='UNKNOWN';
end;
if Movie_Name='NULL' then
Movie_Name='UNKNOWN';
drop Piracy_Type Upload_Date Contact_Email;
rename Clean_Piracy_Type=Piracy_Type
Parsed_Date=Upload_Date
Clean_Email=Contact_Email;
run;
proc print data = illegal_movies_clean;
run;
OUTPUT:
| Obs | Movie_ID | Movie_Name | Country | Region_Code | Risk_Level | Loss_Amount | Piracy_Type | Contact_Email | Upload_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MV001 | Avatar2 | INDIA | AP01 | HIGH | 50000 | TORRENT | alert@gmail.com | 2025-01-12 |
| 2 | MV001 | Avatar2 | INDIA | AP01 | HIGH | 50000 | TORRENT | INVALID_EMAIL | 2025-01-01 |
| 3 | MV002 | Batman | USA | USA1 | MEDIUM | 75000 | CAMRIP | INVALID_EMAIL | 2025-03-15 |
| 4 | MV003 | Null | UK | UK_01 | LOW | . | IPTV | admin@yahoo.com | 2025-04-22 |
| 5 | MV004 | Joker | CANADA | CA01 | CRITICAL | 98000 | SCREEN_RECORDING | joker@@mail.com | 2025-01-01 |
| 6 | MV005 | Pushpa2 | INDIA | IN01 | HIGH | 120000 | BOOTLEG_DVD | pushpa@gmail.com | 2025-05-05 |
| 7 | MV006 | Kgf2 | INDIA | IN_02 | HIGH | 45000 | UNKNOWN | INVALID_EMAIL | 2025-06-14 |
| 8 | MV007 | Spiderman | USA | US02 | NULL | 87000 | TORRENT | spider@gmail.com | 2025-01-01 |
| 9 | MV008 | Frozen | SWEDEN | SW01 | LOW | 25000 | DEEPFAKE | frozen@gmail.com | 2025-08-11 |
| 10 | MV009 | Avengers | USA | US03 | CRITICAL | 100000 | DARK_WEB | INVALID_EMAIL | 2025-09-15 |
| 11 | MV010 | Interstellar | UK | UK01 | MEDIUM | 65000 | IPTV | interstellar@gmail.com | 2025-10-21 |
| 12 | MV011 | Titanic | INDIA | IN03 | HIGH | 45000 | UNKNOWN | titanic@gmail.com | 2025-11-18 |
| 13 | MV012 | Bahubali | INDIA | AP02 | HIGH | 85000 | TORRENT | bahubali@gmail. | 2025-01-01 |
| 14 | MV013 | Leo | INDIA | TN01 | MEDIUM | 67000 | SCREEN_RECORDING | leo@gmail.com | 2025-12-07 |
| 15 | MV014 | Animal | INDIA | TN02 | HIGH | 98000 | CAMRIP | INVALID_EMAIL | 2025-01-09 |
| 16 | MV015 | Jawan | INDIA | AP03 | CRITICAL | 99000 | TORRENT | jawan@gmail.com | 2025-02-10 |
| 17 | MV016 | Dune2 | USA | US04 | LOW | 45000 | IPTV | dune2@gmail.com | 2025-03-15 |
| 18 | MV017 | Kalki | INDIA | AP04 | HIGH | 87000 | DEEPFAKE | kalki@gmail.com | 2025-04-18 |
| 19 | MV018 | Mufasa | UK | UK02 | MEDIUM | 78000 | TORRENT | mufasa@gmail.com | 2025-05-20 |
| 20 | MV019 | Gladiator | ITALY | IT01 | CRITICAL | 91000 | DARK_WEB | gladiator@gmail.com | 2025-06-11 |
| 21 | MV020 | Matrix | USA | US05 | LOW | 0 | UNKNOWN | INVALID_EMAIL | 2025-07-12 |
Key Explanation
This DATA
step demonstrates enterprise-grade cleaning logic.
Important Techniques Used
|
Function |
Purpose |
|
PROPCASE |
Standardizes
text formatting |
|
STRIP |
Removes
leading/trailing blanks |
|
COMPRESS |
Removes
spaces/symbols |
|
ABS |
Converts
negative values |
|
INPUT |
Converts
character to SAS date |
|
FIND |
Detects
malformed emails |
|
SELECT-WHEN |
Cleaner
alternative to multiple IF statements |
This
approach improves consistency before analytics and downstream reporting.
3.Deduplication using PROC SORT
proc sort data=illegal_movies_clean
out=movies_nodup nodupkey;
by Movie_ID;
run;
proc print data = movies_nodup;
run;
LOG:
OUTPUT:
| Obs | Movie_ID | Movie_Name | Country | Region_Code | Risk_Level | Loss_Amount | Piracy_Type | Contact_Email | Upload_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MV001 | Avatar2 | INDIA | AP01 | HIGH | 50000 | TORRENT | alert@gmail.com | 2025-01-12 |
| 2 | MV002 | Batman | USA | USA1 | MEDIUM | 75000 | CAMRIP | INVALID_EMAIL | 2025-03-15 |
| 3 | MV003 | Null | UK | UK_01 | LOW | . | IPTV | admin@yahoo.com | 2025-04-22 |
| 4 | MV004 | Joker | CANADA | CA01 | CRITICAL | 98000 | SCREEN_RECORDING | joker@@mail.com | 2025-01-01 |
| 5 | MV005 | Pushpa2 | INDIA | IN01 | HIGH | 120000 | BOOTLEG_DVD | pushpa@gmail.com | 2025-05-05 |
| 6 | MV006 | Kgf2 | INDIA | IN_02 | HIGH | 45000 | UNKNOWN | INVALID_EMAIL | 2025-06-14 |
| 7 | MV007 | Spiderman | USA | US02 | NULL | 87000 | TORRENT | spider@gmail.com | 2025-01-01 |
| 8 | MV008 | Frozen | SWEDEN | SW01 | LOW | 25000 | DEEPFAKE | frozen@gmail.com | 2025-08-11 |
| 9 | MV009 | Avengers | USA | US03 | CRITICAL | 100000 | DARK_WEB | INVALID_EMAIL | 2025-09-15 |
| 10 | MV010 | Interstellar | UK | UK01 | MEDIUM | 65000 | IPTV | interstellar@gmail.com | 2025-10-21 |
| 11 | MV011 | Titanic | INDIA | IN03 | HIGH | 45000 | UNKNOWN | titanic@gmail.com | 2025-11-18 |
| 12 | MV012 | Bahubali | INDIA | AP02 | HIGH | 85000 | TORRENT | bahubali@gmail. | 2025-01-01 |
| 13 | MV013 | Leo | INDIA | TN01 | MEDIUM | 67000 | SCREEN_RECORDING | leo@gmail.com | 2025-12-07 |
| 14 | MV014 | Animal | INDIA | TN02 | HIGH | 98000 | CAMRIP | INVALID_EMAIL | 2025-01-09 |
| 15 | MV015 | Jawan | INDIA | AP03 | CRITICAL | 99000 | TORRENT | jawan@gmail.com | 2025-02-10 |
| 16 | MV016 | Dune2 | USA | US04 | LOW | 45000 | IPTV | dune2@gmail.com | 2025-03-15 |
| 17 | MV017 | Kalki | INDIA | AP04 | HIGH | 87000 | DEEPFAKE | kalki@gmail.com | 2025-04-18 |
| 18 | MV018 | Mufasa | UK | UK02 | MEDIUM | 78000 | TORRENT | mufasa@gmail.com | 2025-05-20 |
| 19 | MV019 | Gladiator | ITALY | IT01 | CRITICAL | 91000 | DARK_WEB | gladiator@gmail.com | 2025-06-11 |
| 20 | MV020 | Matrix | USA | US05 | LOW | 0 | UNKNOWN | INVALID_EMAIL | 2025-07-12 |
Key Explanation
PROC SORT
NODUPKEY removes duplicate records based on business keys.
In
enterprise systems, duplicate records can:
- Inflate revenue loss
- Corrupt statistical
summaries
- Produce inaccurate fraud
metrics
- Cause SDTM validation
failures
PROC SQL vs DATA Step
4.PROC SQL Aggregation
proc sql;
create table piracy_summary as
select Country,Piracy_Type,
count(*) as Total_Cases,
sum(Loss_Amount) as Total_Loss format=dollar15.
from movies_nodup
group by Country,Piracy_Type;
quit;
proc print data = piracy_summary;
run;
OUTPUT:
| Obs | Country | Piracy_Type | Total_Cases | Total_Loss |
|---|---|---|---|---|
| 1 | CANADA | SCREEN_RECORDING | 1 | $98,000 |
| 2 | INDIA | BOOTLEG_DVD | 1 | $120,000 |
| 3 | INDIA | CAMRIP | 1 | $98,000 |
| 4 | INDIA | DEEPFAKE | 1 | $87,000 |
| 5 | INDIA | SCREEN_RECORDING | 1 | $67,000 |
| 6 | INDIA | TORRENT | 3 | $234,000 |
| 7 | INDIA | UNKNOWN | 2 | $90,000 |
| 8 | ITALY | DARK_WEB | 1 | $91,000 |
| 9 | SWEDEN | DEEPFAKE | 1 | $25,000 |
| 10 | UK | IPTV | 2 | $65,000 |
| 11 | UK | TORRENT | 1 | $78,000 |
| 12 | USA | CAMRIP | 1 | $75,000 |
| 13 | USA | DARK_WEB | 1 | $100,000 |
| 14 | USA | IPTV | 1 | $45,000 |
| 15 | USA | TORRENT | 1 | $87,000 |
| 16 | USA | UNKNOWN | 1 | $0 |
Explanation
PROC SQL
is excellent for:
- Joins
- Aggregations
- Relational transformations
- Reporting summaries
It
resembles database programming and is easier for complex joins.
5.DATA Step Alternative
proc sort data=movies_nodup;
by Country;
run;
proc print data = movies_nodup;
run;
OUTPUT:
| Obs | Movie_ID | Movie_Name | Country | Region_Code | Risk_Level | Loss_Amount | Piracy_Type | Contact_Email | Upload_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MV004 | Joker | CANADA | CA01 | CRITICAL | 98000 | SCREEN_RECORDING | joker@@mail.com | 2025-01-01 |
| 2 | MV001 | Avatar2 | INDIA | AP01 | HIGH | 50000 | TORRENT | alert@gmail.com | 2025-01-12 |
| 3 | MV005 | Pushpa2 | INDIA | IN01 | HIGH | 120000 | BOOTLEG_DVD | pushpa@gmail.com | 2025-05-05 |
| 4 | MV006 | Kgf2 | INDIA | IN_02 | HIGH | 45000 | UNKNOWN | INVALID_EMAIL | 2025-06-14 |
| 5 | MV011 | Titanic | INDIA | IN03 | HIGH | 45000 | UNKNOWN | titanic@gmail.com | 2025-11-18 |
| 6 | MV012 | Bahubali | INDIA | AP02 | HIGH | 85000 | TORRENT | bahubali@gmail. | 2025-01-01 |
| 7 | MV013 | Leo | INDIA | TN01 | MEDIUM | 67000 | SCREEN_RECORDING | leo@gmail.com | 2025-12-07 |
| 8 | MV014 | Animal | INDIA | TN02 | HIGH | 98000 | CAMRIP | INVALID_EMAIL | 2025-01-09 |
| 9 | MV015 | Jawan | INDIA | AP03 | CRITICAL | 99000 | TORRENT | jawan@gmail.com | 2025-02-10 |
| 10 | MV017 | Kalki | INDIA | AP04 | HIGH | 87000 | DEEPFAKE | kalki@gmail.com | 2025-04-18 |
| 11 | MV019 | Gladiator | ITALY | IT01 | CRITICAL | 91000 | DARK_WEB | gladiator@gmail.com | 2025-06-11 |
| 12 | MV008 | Frozen | SWEDEN | SW01 | LOW | 25000 | DEEPFAKE | frozen@gmail.com | 2025-08-11 |
| 13 | MV003 | Null | UK | UK_01 | LOW | . | IPTV | admin@yahoo.com | 2025-04-22 |
| 14 | MV010 | Interstellar | UK | UK01 | MEDIUM | 65000 | IPTV | interstellar@gmail.com | 2025-10-21 |
| 15 | MV018 | Mufasa | UK | UK02 | MEDIUM | 78000 | TORRENT | mufasa@gmail.com | 2025-05-20 |
| 16 | MV002 | Batman | USA | USA1 | MEDIUM | 75000 | CAMRIP | INVALID_EMAIL | 2025-03-15 |
| 17 | MV007 | Spiderman | USA | US02 | NULL | 87000 | TORRENT | spider@gmail.com | 2025-01-01 |
| 18 | MV009 | Avengers | USA | US03 | CRITICAL | 100000 | DARK_WEB | INVALID_EMAIL | 2025-09-15 |
| 19 | MV016 | Dune2 | USA | US04 | LOW | 45000 | IPTV | dune2@gmail.com | 2025-03-15 |
| 20 | MV020 | Matrix | USA | US05 | LOW | 0 | UNKNOWN | INVALID_EMAIL | 2025-07-12 |
data country_summary;
set movies_nodup;
by Country;
retain Total_Loss 0;
if first.Country then Total_Loss=0;
Total_Loss + Loss_Amount;
if last.Country then output;
run;
proc print data = country_summary;
run;
OUTPUT:
| Obs | Movie_ID | Movie_Name | Country | Region_Code | Risk_Level | Loss_Amount | Piracy_Type | Contact_Email | Upload_Date | Total_Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MV004 | Joker | CANADA | CA01 | CRITICAL | 98000 | SCREEN_RECORDING | joker@@mail.com | 2025-01-01 | 98000 |
| 2 | MV017 | Kalki | INDIA | AP04 | HIGH | 87000 | DEEPFAKE | kalki@gmail.com | 2025-04-18 | 696000 |
| 3 | MV019 | Gladiator | ITALY | IT01 | CRITICAL | 91000 | DARK_WEB | gladiator@gmail.com | 2025-06-11 | 91000 |
| 4 | MV008 | Frozen | SWEDEN | SW01 | LOW | 25000 | DEEPFAKE | frozen@gmail.com | 2025-08-11 | 25000 |
| 5 | MV018 | Mufasa | UK | UK02 | MEDIUM | 78000 | TORRENT | mufasa@gmail.com | 2025-05-20 | 143000 |
| 6 | MV020 | Matrix | USA | US05 | LOW | 0 | UNKNOWN | INVALID_EMAIL | 2025-07-12 | 307000 |
Explanation
The DATA
step provides row-wise control.
FIRST./LAST. Processing
This is
extremely powerful in clinical trials:
- Subject-level derivations
- Visit tracking
- Exposure calculations
- Event sequencing
6.PROC FORMAT for Controlled Standardization
proc format;
value $riskfmt 'LOW'=1
'MEDIUM'=2
'HIGH'=3
'CRITICAL'=4
'NULL'=0;
run;
LOG:
7.PROC REPORT
proc report data=movies_nodup nowd;
columns Country Piracy_Type Loss_Amount Risk_Level;
define Country/group;
define Piracy_Type/group;
define Loss_Amount/analysis sum;
define Risk_Level/display format=$riskfmt.;
run;
OUTPUT:
| Country | Piracy_Type | Loss_Amount | Risk_Level |
|---|---|---|---|
| CANADA | SCREEN_RECORDING | 98000 | 4 |
| INDIA | BOOTLEG_DVD | 120000 | 3 |
| CAMRIP | 98000 | 3 | |
| DEEPFAKE | 87000 | 3 | |
| SCREEN_RECORDING | 67000 | 2 | |
| TORRENT | 50000 | 3 | |
| 85000 | 3 | ||
| 99000 | 4 | ||
| UNKNOWN | 45000 | 3 | |
| 45000 | 3 | ||
| ITALY | DARK_WEB | 91000 | 4 |
| SWEDEN | DEEPFAKE | 25000 | 1 |
| UK | IPTV | . | 1 |
| 65000 | 2 | ||
| TORRENT | 78000 | 2 | |
| USA | CAMRIP | 75000 | 2 |
| DARK_WEB | 100000 | 4 | |
| IPTV | 45000 | 1 | |
| TORRENT | 87000 | 0 | |
| UNKNOWN | 0 | 1 |
8.SAS Macro Automation
%macro audit(dataset);
proc contents data=&dataset;
run;
proc means data=&dataset n nmiss;
run;
proc freq data=&dataset;
tables Country Risk_Level;
run;
%mend;
%audit(movies_nodup);
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.MOVIES_NODUP | Observations | 20 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 06/08/2026 11:39:16 | Observation Length | 192 |
| Last Modified | 06/08/2026 11:39:16 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | YES | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 682 |
| Obs in First Data Page | 20 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workADF600010B33_odaws01-apse1-2.oda.sas.com/SAS_work2A2900010B33_odaws01-apse1-2.oda.sas.com/movies_nodup.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 1379871 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | ||||
|---|---|---|---|---|
| # | Variable | Type | Len | Format |
| 8 | Contact_Email | Char | 50 | |
| 3 | Country | Char | 20 | |
| 6 | Loss_Amount | Num | 8 | |
| 1 | Movie_ID | Char | 12 | |
| 2 | Movie_Name | Char | 40 | |
| 7 | Piracy_Type | Char | 25 | |
| 4 | Region_Code | Char | 8 | |
| 5 | Risk_Level | Char | 15 | |
| 9 | Upload_Date | Num | 8 | YYMMDD10. |
| Sort Information | |
|---|---|
| Sortedby | Country |
| Validated | YES |
| Character Set | ASCII |
| Variable | N | N Miss |
|---|---|---|
Loss_Amount Upload_Date | 19 20 | 1 0 |
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| CANADA | 1 | 5.00 | 1 | 5.00 |
| INDIA | 9 | 45.00 | 10 | 50.00 |
| ITALY | 1 | 5.00 | 11 | 55.00 |
| SWEDEN | 1 | 5.00 | 12 | 60.00 |
| UK | 3 | 15.00 | 15 | 75.00 |
| USA | 5 | 25.00 | 20 | 100.00 |
| Risk_Level | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| CRITICAL | 4 | 20.00 | 4 | 20.00 |
| HIGH | 7 | 35.00 | 11 | 55.00 |
| LOW | 4 | 20.00 | 15 | 75.00 |
| MEDIUM | 4 | 20.00 | 19 | 95.00 |
| NULL | 1 | 5.00 | 20 | 100.00 |
Macro Explanation
Macros
improve:
- Reusability
- Standardization
- Audit consistency
- Production scalability
Clinical programming teams heavily depend on reusable macros.
R Data Cleaning Workflow
9.Raw Dataset
library(tidyverse)
library(lubridate)
library(janitor)
movies_raw <- tibble(
Movie_ID=c("MV001","MV001","MV002","MV003"),
Movie_Name=c("avatar2"," Avatar2 ","Batman","NULL"),
Country=c("india"," INDIA ","USA","UK"),
Loss_Amount=c(50000,-50000,75000,NA),
Upload_Date=c("12JAN2025","31FEB2025","15MAR2025","22APR2025"),
Email=c("alert@gmail.com","wrongmail.com",
"batman@gmail","admin@yahoo.com")
)
OUTPUT:
|
|
Movie_ID |
Movie_Name |
Country |
Loss_Amount |
Upload_Date |
Email |
|
1 |
MV001 |
avatar2 |
india |
50000 |
12JAN2025 |
alert@gmail.com |
|
2 |
MV001 |
Avatar2 |
INDIA |
-50000 |
31FEB2025 |
wrongmail.com |
|
3 |
MV002 |
Batman |
USA |
75000 |
15MAR2025 |
batman@gmail |
|
4 |
MV003 |
NULL |
UK |
NA |
22APR20 25 |
admin@yahoo.com |
10.Cleaning in R
movies_clean <- movies_raw %>%
clean_names() %>%
mutate(
movie_name=str_to_title(str_trim(movie_name)),
country=str_to_upper(str_trim(country)),
loss_amount=abs(loss_amount),
parsed_date=suppressWarnings(
parse_date_time(upload_date,"dby")),
parsed_date=if_else(is.na(parsed_date),
as.POSIXct("2025-01-01"),parsed_date),
email=if_else(
grepl("@",email) & grepl("\\.",email),
email,"INVALID_EMAIL"),
movie_name=case_when(
movie_name=="Null" ~ "UNKNOWN",
TRUE ~ movie_name)
)
|
|
movie_id |
movie_name |
country |
loss_amount |
upload_date |
email |
parsed_date |
|
1 |
MV001 |
Avatar2 |
INDIA |
50000 |
12JAN2025 |
alert@gmail.com |
2025-01-12 |
|
2 |
MV002 |
Batman |
USA |
75000 |
15MAR2025 |
INVALID_EMAIL |
2025-03-15 |
|
3 |
MV003 |
UNKNOWN |
UK |
NA |
22APR25 |
admin@yahoo.com |
2025-04-22 |
R Cleaning Explanation
Equivalent SAS vs R Logic
|
SAS |
R |
|
PROPCASE |
str_to_title |
|
STRIP |
str_trim |
|
ABS |
abs |
|
INPUT |
parse_date_time |
|
FIND |
grepl |
|
IF-THEN |
if_else |
|
SELECT-WHEN |
case_when |
R
provides highly flexible vectorized transformations.
Validation & Compliance
In
regulated environments like clinical trials:
- SDTM requires standardized
domains.
- ADaM requires traceable
derivations.
- Audit trails must track
transformations.
- QC teams independently
validate outputs.
One
dangerous SAS behavior:
Missing
numeric values are treated lower than valid numbers.
Example:
if score < 50 then risk='HIGH';
If score=.
then SAS still evaluates TRUE.
This can
catastrophically misclassify patients or fraud risk.
Always
validate missing logic explicitly.
20 Enterprise Data-Cleaning
Best Practices
- Always standardize variable
casing
- Remove duplicate business
keys
- Validate dates before
analysis
- Use controlled terminology
- Never overwrite raw datasets
- Build reusable macros
- Implement QC independence
- Maintain audit trails
- Validate missing values
explicitly
- Use metadata-driven
programming
- Standardize formats globally
- Track derivation lineage
- Apply defensive programming
- Separate staging and
production layers
- Validate joins carefully
- Avoid hardcoding logic
- Use parameterized macros
- Log transformation
exceptions
- Validate categorical
mappings
- Document every derivation
clearly
Business Logic Behind
Cleaning
Business
rules exist because analytics depend on trust.
Suppose
patient age appears as -45. If not corrected, demographic summaries become
invalid. Similarly, missing clinical visit dates can distort survival analysis
timelines.
In
banking, negative loan balances may incorrectly reduce exposure risk
calculations. In retail analytics, inconsistent region codes like "AP01"
and "ap01" create duplicate reporting groups.
Missing
values are often imputed because statistical models cannot process blanks
consistently. Date standardization enables reliable interval calculations using
functions like INTCK and INTNX.
Text
normalization improves merge accuracy and reporting consistency. Without
standardization, "India", " INDIA ", and "india"
may be interpreted as separate groups.
These
transformations ensure:
- Reproducibility
- Compliance
- Statistical integrity
- Executive trust
- Accurate AI predictions
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Duplicate records destroy
analytical trust.
- Macros reduce programming
inconsistency.
- Missing dates break
longitudinal analysis.
- Controlled terminology
improves compliance.
- Audit trails protect
regulatory submissions.
- Defensive programming
prevents production failures.
- PROC SQL simplifies
relational logic.
- DATA step provides granular
control.
- R excels in flexible
transformation pipelines.
- SAS dominates regulated
analytics environments.
- QC independence is mandatory
in clinical trials.
- Metadata drives scalable
programming.
- Standardization improves
dashboard reliability.
- Invalid emails damage
notification systems.
- Character truncation
silently corrupts datasets.
- Data lineage improves
traceability.
- Clean data powers
trustworthy AI.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Very
High |
Growing |
|
Flexibility |
Structured |
Extremely
Flexible |
|
Performance |
Strong
on enterprise systems |
Strong with
memory optimization |
|
Reporting |
PROC
REPORT |
ggplot/tidyverse |
|
Macros |
Powerful |
Functions/Purrr |
|
Clinical
Trials |
Industry
Standard |
Increasing
adoption |
|
Traceability |
Excellent |
Requires
discipline |
Small Validation Checklist
1.Duplicate
checks
2.Missing value checks
3.Invalid date detection
4.Email validation
5.Range validation
6.Controlled terminology mapping
7. Format consistency
8.Join validation
9.Audit logging
10.Metadata verification
Summary
SAS and R
together form a powerful enterprise analytics ecosystem. SAS dominates highly
regulated industries because of its auditability, validation strength, metadata
management, and reproducible workflows. R complements SAS with modern
transformation libraries, rapid exploratory analysis, and flexible data
engineering capabilities.
SAS
excels in:
- SDTM/ADaM compliance
- Macro-driven automation
- Enterprise reporting
- Clinical validation
workflows
- Traceable derivations
R excels
in:
- Vectorized transformations
- Modern data wrangling
- Flexible visualization
- Rapid prototyping
- Open-source scalability
The
combination creates reliable analytical intelligence pipelines capable of
supporting healthcare, banking, insurance, cybersecurity, and retail
operations.
Conclusion
Modern
analytics systems are only as reliable as the data entering them. Whether
monitoring illegal movie piracy, managing clinical trial submissions,
evaluating insurance fraud, or processing financial transactions, poor-quality
data creates operational chaos.
Duplicate
records distort metrics. Missing values break models. Invalid dates corrupt
timelines. Inconsistent categorical labels destroy aggregation accuracy.
Malformed identifiers weaken reporting trust.
Enterprise-grade
data cleaning is not optional it is foundational.
SAS
provides the governance, auditability, and structured processing required for
regulated production systems. Its DATA step architecture, PROC SQL
capabilities, macro framework, and validation ecosystem make it indispensable
for enterprise reporting and clinical analytics.
R extends
these capabilities through modern transformation pipelines, flexible
programming paradigms, and scalable analytical workflows.
Together,
SAS and R enable organizations to transform corrupted operational records into
trustworthy analytical intelligence.
The real
value of data engineering lies not in writing code but in creating reliable
business truth.
That is what separates raw data from strategic intelligence.
1. How do you handle duplicate records in SAS?
Answer:
I use PROC SORT NODUPKEY for simple deduplication and FIRST./LAST. processing for advanced business-rule handling.
2. Why is LENGTH important in SAS?
Answer:
If LENGTH is defined after assignment, SAS may permanently truncate character variables during compilation.
3. Difference between PROC SQL and DATA step?
Answer:
PROC SQL is better for joins and aggregations, while DATA step provides row-wise sequential processing control.
4. How do you validate missing numeric values in SAS?
Answer:
I explicitly use missing(variable) because SAS treats missing numeric values as smaller than valid numbers.
5. Why combine SAS and R in enterprise analytics?
Answer:
SAS provides regulatory-grade traceability, while R offers flexible advanced transformation and visualization capabilities.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent ILLEGAL MOVIE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and Exams Reviewers and Observers
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment