422.Can We Detect And Correct Errors In Digital Advertising Fraud Data Using Advanced SAS Programming?
Detecting And Correcting Digital Advertising Fraud Data Errors Using Advanced SAS Programming
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA STEP | LENGTH | INPUT | DATALINES | SET | MERGE | PROC SORT | PROC PRINT | PROC CONTENTS | PROC FREQ | PROC TRANSPOSE | PROC APPEND | PROC DATASETS DELETE | MACRO / %MACRO / %MEND | NUMERIC AND CHARACTER FUNCTIONS
Introduction
Digital
advertising is one of the fastest-growing industries in the world. Companies
spend billions of dollars on online advertising campaigns across platforms such
as Google Ads, social media platforms, display networks, and affiliate traffic
sources. While digital marketing enables companies to reach targeted audiences
efficiently, it also introduces a major challenge: digital advertising fraud.
Advertising
fraud occurs when malicious actors generate fake clicks, impressions, or
conversions in order to steal advertising budgets. These fraudulent activities
can involve bot traffic, click farms, automated scripts, or manipulated traffic
sources. As a result, companies lose significant amounts of money and receive
inaccurate campaign performance metrics.
Data
analytics plays a critical role in identifying suspicious patterns in
advertising traffic. Organizations analyze campaign datasets to detect abnormal
click patterns, invalid traffic sources, and unusual conversion behavior.
Advanced analytics tools such as SAS (Statistical Analysis System) are
widely used in industries for fraud detection because of their powerful data
manipulation capabilities.
In this
project, we will simulate a Digital Advertising Fraud Dataset containing
campaign performance information such as:
- Campaign ID
- Traffic Source
- Clicks
- Invalid Clicks
- Conversion Rate
- Loss Amount
- Fraud Score
- Utilization Classification
- Fees
- Date Formats
The
dataset will intentionally include multiple programming errors, which
will later be identified and corrected. The goal is to demonstrate how SAS
programmers debug datasets and implement fraud detection logic using DATA
step programming, macros, character functions, numeric functions, and date
manipulation techniques.
By the
end of this project, we will:
- Create a dataset with intentional
coding errors
- Detect those errors
- Correct the code
- Implement fraud detection
logic using SAS macros
- Use character and numeric
functions
- Use date functions such
as MDY, INTCK, INTNX
- Apply SET, MERGE, APPEND,
TRANSPOSE
- Clean datasets using PROC
DATASETS DELETE
This
project simulates a real-world scenario where analysts must correct and analyze
campaign data before performing fraud analytics.
Table Of Contents
- Introduction
- Business Context
- Dataset Structure
- Creating Raw Dataset with
Intentional Errors
- Explanation of Errors
- Corrected Dataset Code
- Character Functions in SAS
- Numeric Functions in SAS
- Date Functions (MDY, INTCK,
INTNX)
- Dataset Combination using
SET
- Dataset Combination using
MERGE
- Dataset Combination using
PROC APPEND
- Dataset Transformation using
PROC TRANSPOSE
- Fraud Detection Macro
- Utilization Classification
Logic
- Fraud Score Classification
- Cleaning Temporary Datasets
using PROC DATASETS DELETE
- Final Dataset Output
- 20 Key Points About This Project
- Project Summary and
Conclusion
Business Context
Companies
invest heavily in digital advertising campaigns to generate customer engagement
and conversions. However, fraudulent traffic significantly affects campaign
performance metrics.
Common
fraud indicators include:
- Extremely high click counts
with low conversions
- High invalid click
percentages
- Unusual traffic spikes
within short time intervals
- Traffic from suspicious
sources
- Extremely high fraud risk
scores
Organizations
use analytics systems like SAS to monitor advertising datasets and
automatically detect fraud patterns.
For
example:
|
Campaign |
Clicks |
Invalid
Clicks |
Conversion
Rate |
|
C101 |
5000 |
50 |
4% |
|
C102 |
8000 |
4000 |
0.2% |
Campaign
C102 is suspicious because half of the clicks are invalid.
By
applying SAS analytics, companies can:
- Identify fraudulent
campaigns
- Calculate financial losses
- Block suspicious traffic
sources
- Optimize marketing budgets
Dataset Structure
Our
dataset will contain the following variables:
|
Variable |
Description |
|
Campaign_ID |
Unique
advertising campaign identifier |
|
Traffic_Source |
Source
of advertising traffic |
|
Clicks |
Total
number of clicks |
|
Invalid_Clicks |
Number
of fraudulent clicks |
|
Conversion_Rate |
Percentage
of clicks converting to sales |
|
Loss_Amount |
Estimated
monetary loss due to fraud |
|
Fraud_Score |
Fraud
risk score |
|
Percentage |
Percentage
of invalid clicks |
|
Fees |
Advertising
platform fees |
|
Utilization_Class |
Traffic
utilization classification |
|
Campaign_Date |
Date of
campaign |
1. Raw Dataset with Intentional Errors
data ad_fraud_raw;
length Campaign_ID $5 Traffic_Source $20 Utilization_Class $15;
input Campaign_ID $ Traffic_Source $ Clicks
Invalid_Clicks Conversion_Rate Loss_Amount
Fraud_Score Percentage Fees Utilization_Class $
Campaign_Date :ddmmyy10.;
format Campaign_Date date9.;
datalines;
C101 google 5000 50 4.2 200 15 1.0 20 High 01/01/2025
C102 facebook 8000 4000 0.2 900 90 50 35 High 02/01/2025
C103 affiliate 3500 200 2.3 150 30 5.7 10 Medium 03/01/2025
C104 bottraffic 9000 8500 0.1 1200 98 94 45 High 04/01/2025
C105 instagram 4200 150 3.0 180 25 3.5 15 Medium 05/01/2025
C106 youtube 3900 100 4.1 120 18 2.5 12 Low 06/01/2025
C107 google 5200 400 2.8 250 40 7.6 18 Medium 07/01/2025
C108 facebook 6100 300 3.2 200 35 4.9 17 Medium 08/01/2025
C109 affiliate 4500 600 1.5 310 55 13.3 20 High 09/01/2025
C110 bottraffic 10000 9000 0.0 1500 99 90 50 High 10/01/2025
C111 google 4800 120 3.9 160 22 2.5 14 Low 11/01/2025
C112 facebook 5100 180 3.5 170 28 3.5 15 Medium 12/01/2025
C113 youtube 4300 140 3.8 165 26 3.2 14 Low 13/01/2025
C114 affiliate 3700 500 2.0 300 50 13.5 22 High 14/01/2025
C115 bottraffic 8800 8400 0.1 1100 97 95 40 High 15/01/2025
C116 instagram 4600 160 3.3 175 27 3.4 16 Medium 16/01/2025
;
run;
proc print data=ad_fraud_raw;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | 15 | 1.0 | 20 | 01JAN2025 | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 90 | 50.0 | 35 | 02JAN2025 | |
| 3 | C103 | affiliate | Medium | 3500 | 200 | 2.3 | 150 | 30 | 5.7 | 10 | 03JAN2025 |
| 4 | C104 | bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 98 | 94.0 | 45 | 04JAN2025 |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | 25 | 3.5 | 15 | 05JAN2025 | |
| 6 | C106 | youtube | Low | 3900 | 100 | 4.1 | 120 | 18 | 2.5 | 12 | 06JAN2025 |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | 40 | 7.6 | 18 | 07JAN2025 | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | 35 | 4.9 | 17 | 08JAN2025 | |
| 9 | C109 | affiliate | High | 4500 | 600 | 1.5 | 310 | 55 | 13.3 | 20 | 09JAN2025 |
| 10 | C110 | bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 99 | 90.0 | 50 | 10JAN2025 |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | 22 | 2.5 | 14 | 11JAN2025 | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | 28 | 3.5 | 15 | 12JAN2025 | |
| 13 | C113 | youtube | Low | 4300 | 140 | 3.8 | 165 | 26 | 3.2 | 14 | 13JAN2025 |
| 14 | C114 | affiliate | High | 3700 | 500 | 2.0 | 300 | 50 | 13.5 | 22 | 14JAN2025 |
| 15 | C115 | bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 97 | 95.0 | 40 | 15JAN2025 |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | 27 | 3.4 | 16 | 16JAN2025 |
Intentional Errors in the Dataset
The dataset above contains multiple potential
issues that analysts must detect.
Examples:
Error 1 — Traffic Source Formatting
Traffic sources appear in inconsistent
formats.
Example:
googlefacebookaffiliatebottraffic
Best practice is to standardize case using UPCASE or PROPCASE.
Error 2 —
Percentage Not Derived
Percentage should be calculated using:
Invalid_Clicks / Clicks * 100
But the dataset manually entered values.
Error 3 — Fraud
Score Not Calculated Programmatically
Fraud scores are entered manually instead of
using analytics logic.
Error 4 — Character
Cleaning
Traffic source strings may contain leading or
trailing spaces.
Functions required:strip,trim
2. Corrected Dataset Code
data ad_fraud_clean;
set ad_fraud_raw;
Traffic_Source = propcase(strip(Traffic_Source));
Percentage = round((Invalid_Clicks/Clicks)*100,0.1);
Fraud_Score = round((Percentage*0.7) + ((1-Conversion_Rate)*20),0.1);
Total_Cost = sum(Loss_Amount,Fees);
Campaign_Name = catx("_",Campaign_ID,Traffic_Source);
run;
proc print data=ad_fraud_clean;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
Explanation of Character Functions
STRIP
Removes leading and trailing spaces.
strip(Traffic_Source)
Example:
" google " → "google"
TRIM
Removes trailing spaces.
trim(variable)
CAT
Concatenates values.
cat(a,b)
CATX
Concatenates values with a delimiter.
catx("_",Campaign_ID,Traffic_Source)
Result:
C101_Google
UPCASE
Converts text to uppercase.
upcase(variable)
LOWCASE
Converts text to lowercase.
PROPCASE
Capitalizes the first letter.
Example:
google → Google
Numeric Functions
SUM
Adds numeric variables.
Total_Cost = sum(Loss_Amount,Fees);
ROUND
Rounds numeric values.
round(value,0.1)
Date Functions
MDY
Creates a SAS date.
Campaign_Start = mdy(1,1,2025);
INTCK
Counts intervals between dates.
Days_Between = intck('day',Campaign_Start,Campaign_Date);
INTNX
Moves forward or backward in time.
Next_Campaign = intnx('month',Campaign_Date,1);3. Using SET Statement
Method 1 — If You Want the Same Dataset Copied Then
data combined_data;
set ad_fraud_clean;
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
Explanation:
·
DATA combined_data;
Creates a new dataset named combined_data.
·
SET ad_fraud_clean;
Reads all observations from ad_fraud_clean.
·
RUN;
Executes the DATA step.
Result:
combined_data
will contain exactly the same observations
as ad_fraud_clean.
Method 2 — If You Intentionally Want Two Copies of Data Then
data campaign_part1;
set ad_fraud_clean;
run;
proc print data=campaign_part1;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
data campaign_part2;
set ad_fraud_clean;
run;
proc print data=campaign_part2;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
data combined_data;
set campaign_part1 campaign_part2;
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram | |
| 17 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 18 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 19 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 20 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 21 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 22 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 23 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 24 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 25 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 26 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 27 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 28 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 29 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 30 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 31 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 32 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
Explanation:
· campaign_part1 and campaign_part2 are identical copies.
· SET campaign_part1 campaign_part2; stacks them vertically.
· Final dataset will contain double the observations.
Method 3 — If the Goal is Appending Then
proc append base=combined_data
data=ad_fraud_clean;
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram | |
| 17 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 18 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 19 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 20 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 21 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 22 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 23 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 24 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 25 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 26 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 27 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 28 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 29 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 30 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 31 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 32 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram | |
| 33 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 34 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 35 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 36 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 37 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 38 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 39 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 40 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 41 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 42 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 43 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 44 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 45 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 46 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 47 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 48 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
Explanation:
·
BASE= → target dataset
·
DATA= → dataset to append
·
Adds new observations without recreating dataset structure.
APPEND adds new observations efficiently.
4. Using PROC TRANSPOSE
proc sort data=ad_fraud_clean;by Campaign_ID;run;
proc print data=ad_fraud_clean;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram |
proc transpose data=ad_fraud_clean out=fraud_transpose;
by Campaign_ID;
var Clicks Invalid_Clicks Loss_Amount;
run;
proc print data=fraud_transpose;
run;
OUTPUT:
| Obs | Campaign_ID | _NAME_ | COL1 |
|---|---|---|---|
| 1 | C101 | Clicks | 5000 |
| 2 | C101 | Invalid_Clicks | 50 |
| 3 | C101 | Loss_Amount | 200 |
| 4 | C102 | Clicks | 8000 |
| 5 | C102 | Invalid_Clicks | 4000 |
| 6 | C102 | Loss_Amount | 900 |
| 7 | C103 | Clicks | 3500 |
| 8 | C103 | Invalid_Clicks | 200 |
| 9 | C103 | Loss_Amount | 150 |
| 10 | C104 | Clicks | 9000 |
| 11 | C104 | Invalid_Clicks | 8500 |
| 12 | C104 | Loss_Amount | 1200 |
| 13 | C105 | Clicks | 4200 |
| 14 | C105 | Invalid_Clicks | 150 |
| 15 | C105 | Loss_Amount | 180 |
| 16 | C106 | Clicks | 3900 |
| 17 | C106 | Invalid_Clicks | 100 |
| 18 | C106 | Loss_Amount | 120 |
| 19 | C107 | Clicks | 5200 |
| 20 | C107 | Invalid_Clicks | 400 |
| 21 | C107 | Loss_Amount | 250 |
| 22 | C108 | Clicks | 6100 |
| 23 | C108 | Invalid_Clicks | 300 |
| 24 | C108 | Loss_Amount | 200 |
| 25 | C109 | Clicks | 4500 |
| 26 | C109 | Invalid_Clicks | 600 |
| 27 | C109 | Loss_Amount | 310 |
| 28 | C110 | Clicks | 10000 |
| 29 | C110 | Invalid_Clicks | 9000 |
| 30 | C110 | Loss_Amount | 1500 |
| 31 | C111 | Clicks | 4800 |
| 32 | C111 | Invalid_Clicks | 120 |
| 33 | C111 | Loss_Amount | 160 |
| 34 | C112 | Clicks | 5100 |
| 35 | C112 | Invalid_Clicks | 180 |
| 36 | C112 | Loss_Amount | 170 |
| 37 | C113 | Clicks | 4300 |
| 38 | C113 | Invalid_Clicks | 140 |
| 39 | C113 | Loss_Amount | 165 |
| 40 | C114 | Clicks | 3700 |
| 41 | C114 | Invalid_Clicks | 500 |
| 42 | C114 | Loss_Amount | 300 |
| 43 | C115 | Clicks | 8800 |
| 44 | C115 | Invalid_Clicks | 8400 |
| 45 | C115 | Loss_Amount | 1100 |
| 46 | C116 | Clicks | 4600 |
| 47 | C116 | Invalid_Clicks | 160 |
| 48 | C116 | Loss_Amount | 175 |
TRANSPOSE converts rows
into columns.
5. Fraud Detection Macro
%macro fraud_check;
data fraud_analysis;
set ad_fraud_clean;
length Fraud_Level $12.;
if Percentage > 40 then Fraud_Level="Severe";
else if Percentage > 10 then Fraud_Level="Moderate";
else Fraud_Level="Low";
run;
proc print data=fraud_analysis;
run;
%mend;
%fraud_check;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name | Fraud_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | Low | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | Severe | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate | Low |
| 4 | C104 | Bottraffic | High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic | Severe |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | Low | |
| 6 | C106 | Youtube | Low | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube | Low |
| 7 | C107 | Medium | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | Low | |
| 8 | C108 | Medium | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | Low | |
| 9 | C109 | Affiliate | High | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate | Moderate |
| 10 | C110 | Bottraffic | High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic | Severe |
| 11 | C111 | Low | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | Low | |
| 12 | C112 | Medium | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | Low | |
| 13 | C113 | Youtube | Low | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube | Low |
| 14 | C114 | Affiliate | High | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate | Moderate |
| 15 | C115 | Bottraffic | High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic | Severe |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram | Low |
proc contents data=fraud_analysis;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.FRAUD_ANALYSIS | Observations | 16 |
|---|---|---|---|
| Member Type | DATA | Variables | 14 |
| Engine | V9 | Indexes | 0 |
| Created | 03/07/2026 11:49:58 | Observation Length | 328 |
| Last Modified | 03/07/2026 11:49:58 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 399 |
| Obs in First Data Page | 16 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workAFA100004268_odaws02-apse1-2.oda.sas.com/SAS_workDBAA00004268_odaws02-apse1-2.oda.sas.com/fraud_analysis.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 134320160 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | ||||
|---|---|---|---|---|
| # | Variable | Type | Len | Format |
| 11 | Campaign_Date | Num | 8 | DATE9. |
| 1 | Campaign_ID | Char | 5 | |
| 13 | Campaign_Name | Char | 200 | |
| 4 | Clicks | Num | 8 | |
| 6 | Conversion_Rate | Num | 8 | |
| 10 | Fees | Num | 8 | |
| 14 | Fraud_Level | Char | 12 | |
| 8 | Fraud_Score | Num | 8 | |
| 5 | Invalid_Clicks | Num | 8 | |
| 7 | Loss_Amount | Num | 8 | |
| 9 | Percentage | Num | 8 | |
| 12 | Total_Cost | Num | 8 | |
| 2 | Traffic_Source | Char | 20 | |
| 3 | Utilization_Class | Char | 15 | |
proc freq data=fraud_analysis;
tables Fraud_Level Traffic_Source;
run;
OUTPUT:
The FREQ Procedure
| Fraud_Level | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Low | 10 | 62.50 | 10 | 62.50 |
| Moderate | 2 | 12.50 | 12 | 75.00 |
| Severe | 4 | 25.00 | 16 | 100.00 |
| Traffic_Source | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Affiliate | 3 | 18.75 | 3 | 18.75 |
| Bottraffic | 3 | 18.75 | 6 | 37.50 |
| 3 | 18.75 | 9 | 56.25 | |
| 3 | 18.75 | 12 | 75.00 | |
| 2 | 12.50 | 14 | 87.50 | |
| Youtube | 2 | 12.50 | 16 | 100.00 |
6. Utilization Classification
%macro Utilization;
data Final;
set fraud_analysis;
length Utilization_Class $12.;
if Clicks > 8000 then Utilization_Class="Very High";
else if Clicks > 5000 then Utilization_Class="High";
else if Clicks > 3000 then Utilization_Class="Medium";
else Utilization_Class="Low";
run;
proc print data=Final;
run;
%mend;
%Utilization;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name | Fraud_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | Medium | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | Low | |
| 2 | C102 | High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | Severe | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate | Low |
| 4 | C104 | Bottraffic | Very High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic | Severe |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | Low | |
| 6 | C106 | Youtube | Medium | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube | Low |
| 7 | C107 | High | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | Low | |
| 8 | C108 | High | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | Low | |
| 9 | C109 | Affiliate | Medium | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate | Moderate |
| 10 | C110 | Bottraffic | Very High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic | Severe |
| 11 | C111 | Medium | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | Low | |
| 12 | C112 | High | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | Low | |
| 13 | C113 | Youtube | Medium | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube | Low |
| 14 | C114 | Affiliate | Medium | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate | Moderate |
| 15 | C115 | Bottraffic | Very High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic | Severe |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram | Low |
7. Cleaning Temporary Datasets
proc datasets library=work nolist;
delete combined_data merged_data fraud_transpose;
quit;
LOG:
8. Final Dataset Creation (Including Utilization Classification)
data ad_fraud_final;
set ad_fraud_clean;
length Fraud_Level $10 Utilization_Class $12;
* Fraud classification;
if Percentage > 40 then Fraud_Level="Severe";
else if Percentage > 10 then Fraud_Level="Moderate";
else Fraud_Level="Low";
* Utilization classification;
if Clicks >= 8000 then Utilization_Class="Very High";
else if Clicks >= 5000 then Utilization_Class="High";
else if Clicks >= 3000 then Utilization_Class="Medium";
else Utilization_Class="Low";
run;
proc print data=ad_fraud_final;
run;
OUTPUT:
| Obs | Campaign_ID | Traffic_Source | Utilization_Class | Clicks | Invalid_Clicks | Conversion_Rate | Loss_Amount | Fraud_Score | Percentage | Fees | Campaign_Date | Total_Cost | Campaign_Name | Fraud_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C101 | High | 5000 | 50 | 4.2 | 200 | -63.3 | 1.0 | 20 | 01JAN2025 | 220 | C101_Google | Low | |
| 2 | C102 | Very High | 8000 | 4000 | 0.2 | 900 | 51.0 | 50.0 | 35 | 02JAN2025 | 935 | C102_Facebook | Severe | |
| 3 | C103 | Affiliate | Medium | 3500 | 200 | 2.3 | 150 | -22.0 | 5.7 | 10 | 03JAN2025 | 160 | C103_Affiliate | Low |
| 4 | C104 | Bottraffic | Very High | 9000 | 8500 | 0.1 | 1200 | 84.1 | 94.4 | 45 | 04JAN2025 | 1245 | C104_Bottraffic | Severe |
| 5 | C105 | Medium | 4200 | 150 | 3.0 | 180 | -37.5 | 3.6 | 15 | 05JAN2025 | 195 | C105_Instagram | Low | |
| 6 | C106 | Youtube | Medium | 3900 | 100 | 4.1 | 120 | -60.2 | 2.6 | 12 | 06JAN2025 | 132 | C106_Youtube | Low |
| 7 | C107 | High | 5200 | 400 | 2.8 | 250 | -30.6 | 7.7 | 18 | 07JAN2025 | 268 | C107_Google | Low | |
| 8 | C108 | High | 6100 | 300 | 3.2 | 200 | -40.6 | 4.9 | 17 | 08JAN2025 | 217 | C108_Facebook | Low | |
| 9 | C109 | Affiliate | Medium | 4500 | 600 | 1.5 | 310 | -0.7 | 13.3 | 20 | 09JAN2025 | 330 | C109_Affiliate | Moderate |
| 10 | C110 | Bottraffic | Very High | 10000 | 9000 | 0.0 | 1500 | 83.0 | 90.0 | 50 | 10JAN2025 | 1550 | C110_Bottraffic | Severe |
| 11 | C111 | Medium | 4800 | 120 | 3.9 | 160 | -56.3 | 2.5 | 14 | 11JAN2025 | 174 | C111_Google | Low | |
| 12 | C112 | High | 5100 | 180 | 3.5 | 170 | -47.6 | 3.5 | 15 | 12JAN2025 | 185 | C112_Facebook | Low | |
| 13 | C113 | Youtube | Medium | 4300 | 140 | 3.8 | 165 | -53.7 | 3.3 | 14 | 13JAN2025 | 179 | C113_Youtube | Low |
| 14 | C114 | Affiliate | Medium | 3700 | 500 | 2.0 | 300 | -10.6 | 13.5 | 22 | 14JAN2025 | 322 | C114_Affiliate | Moderate |
| 15 | C115 | Bottraffic | Very High | 8800 | 8400 | 0.1 | 1100 | 84.9 | 95.5 | 40 | 15JAN2025 | 1140 | C115_Bottraffic | Severe |
| 16 | C116 | Medium | 4600 | 160 | 3.3 | 175 | -43.6 | 3.5 | 16 | 16JAN2025 | 191 | C116_Instagram | Low |
Explanation of the Code
DATA
ad_fraud_final
Creates the final analytical dataset
that contains all cleaned and derived variables.
SET
ad_fraud_clean
Reads observations from the cleaned dataset
and allows us to add more derived variables.
LENGTH
Fraud_Level $10 Utilization_Class $12
Defines the length of the character variables.
This is important because SAS assigns the
length based on the first value encountered if not specified.
This
classification evaluates campaign traffic utilization.
|
Click
Range |
Utilization |
|
≥ 8000 |
Very
High |
|
5000–7999 |
High |
|
3000–4999 |
Medium |
|
<
3000 |
Low |
This
helps analysts identify campaign popularity and resource usage.
Final Dataset Variables
|
Variable |
Description |
|
Campaign_ID |
Campaign
identifier |
|
Traffic_Source |
Advertising
platform |
|
Clicks |
Total
campaign clicks |
|
Invalid_Clicks |
Fraud
clicks |
|
Conversion_Rate |
Conversion
percentage |
|
Loss_Amount |
Fraud
loss |
|
Fraud_Score |
Fraud
risk score |
|
Percentage |
Invalid
click percentage |
|
Fees |
Platform
fees |
|
Total_Cost |
Loss +
fees |
|
Campaign_Name |
Combined
campaign label |
|
Fraud_Level |
Fraud
classification |
|
Utilization_Class |
Campaign
utilization category |
|
Campaign_Date |
Campaign
date |
Key Skills Demonstrated in This Project
• Character data cleaning techniques
• Numeric calculations and derived metrics
• Date function usage
• Dataset combination methods
• SAS macro development
• Dataset management and cleanup
- A simulated digital
advertising fraud dataset was created to analyze campaign traffic
behavior and identify fraudulent activity patterns.
- The dataset included key
variables such as Campaign_ID, Traffic_Source, Clicks, Invalid_Clicks,
Conversion_Rate, Loss_Amount, Fraud_Score, Percentage, Fees,
Utilization_Class, and Campaign_Date.
- More than 15 observations
were generated to simulate multiple advertising campaigns across
different traffic sources.
- Intentional data issues and
inconsistencies were introduced to simulate real-world raw marketing
data problems.
- The raw dataset AD_FRAUD_RAW
served as the starting point for the fraud analytics workflow.
- Data cleaning and
standardization were performed using character functions such as STRIP,
TRIM, PROPCASE, UPCASE, LOWCASE, CAT, and CATX.
- Numeric calculations were
implemented using SUM and ROUND functions to derive campaign cost
and fraud-related metrics.
- The invalid click
percentage was calculated using the formula (Invalid_Clicks / Clicks) × 100.
- A Fraud_Score metric
was derived to quantify the likelihood of fraudulent traffic within each
campaign.
- Campaign date analysis was
implemented using date functions MDY, INTCK, and INTNX to handle
campaign timelines and date intervals.
- The SET statement was
used to read and combine datasets vertically during data processing steps.
- Dataset integration
techniques such as MERGE were used to combine campaign data with
additional analytical variables.
- Additional observations
could be appended using PROC APPEND, which efficiently adds new
campaign records.
- The PROC TRANSPOSE
procedure was demonstrated to restructure campaign metrics for
analytical reporting.
- A reusable SAS macro was
developed to automate fraud detection classification across campaign
records.
- Campaigns were categorized
into Fraud Levels (Low, Moderate, Severe) based on the percentage
of invalid clicks.
- Campaign traffic was
classified into Utilization Classes (Low, Medium, High, Very High)
based on total click counts.
- Temporary or intermediate
datasets were cleaned using PROC DATASETS DELETE to maintain a
structured SAS working environment.
- The final analytical dataset
AD_FRAUD_FINAL contained cleaned variables, calculated fraud
metrics, and campaign classifications.
- The project demonstrates how
SAS programming techniques can be used to detect digital advertising
fraud, clean marketing datasets, and support campaign performance
analytics.
Project Summary
This project demonstrates how SAS programming can be used to detect and correct
errors in digital advertising fraud data.Digital advertising platforms
generate massive volumes of campaign performance data every day. However, raw
marketing data often contains inconsistencies, formatting issues, incorrect
calculations, and missing fraud indicators. Before meaningful analysis can be
performed, data must be cleaned and validated.
In this project, a simulated dataset representing digital advertising
campaigns was created with variables such as campaign identifiers, traffic
sources, click counts, invalid clicks, conversion rates, fraud scores, and
financial loss estimates.To simulate real-world analytics scenarios, intentional errors were introduced into the
dataset. These errors included inconsistent formatting of traffic
source names, manually entered fraud percentages, and unverified fraud scores.
Using SAS DATA step programming, these issues were corrected through various
techniques including character functions, numeric functions, and calculated
variables.Several SAS character functions were applied to standardize and clean
textual data. Functions such as STRIP,
TRIM, PROPCASE, UPCASE, and LOWCASE were used to remove unnecessary
spaces and convert text values into consistent formats.Numeric functions were
used to calculate derived metrics such as fraud percentages, total campaign
cost, and fraud scores. Functions like SUM
and ROUND helped ensure accurate financial calculations.
The project also demonstrated the use of date manipulation functions such as MDY, INTCK, and INTNX.
These functions are important when analyzing campaign timelines, calculating
campaign durations, and scheduling future campaign evaluations.
Dataset manipulation techniques such as SET, MERGE, and PROC APPEND were used to
combine multiple datasets and manage campaign data efficiently. The PROC TRANSPOSE procedure was used to
restructure data for analytical reporting.A fraud detection macro was developed to automatically
classify campaigns into fraud categories based on invalid click percentages.
Macros allow SAS programmers to automate repetitive logic and build reusable
fraud detection frameworks.
Finally, unnecessary temporary datasets were removed using PROC DATASETS DELETE, ensuring that the working library remained clean and organized.The final dataset produced in this project provides a structured and validated view of campaign fraud risks. Analysts can use this information to identify suspicious campaigns and prevent advertising budget losses.
Conclusion
Digital advertising fraud analytics requires
strong data management and fraud detection techniques. In this project we
demonstrated how SAS programming can be used to build and analyze a campaign
dataset containing digital advertising performance metrics.
We created a dataset with more than fifteen
observations representing advertising campaigns from multiple traffic sources.
The dataset contained important variables such as clicks, invalid clicks,
conversion rates, fraud scores, and loss amounts.
Intentional errors were introduced to
simulate real-world data problems. These errors included inconsistent
formatting, manually entered percentages, and improper fraud scoring. We then
corrected these issues using SAS DATA step programming.
Key SAS techniques used in this project
include:
·
Character functions for text cleaning
·
Numeric functions for calculations
·
Date functions for campaign timeline
analysis
·
Dataset manipulation using SET, MERGE,
and APPEND
·
Data restructuring using PROC TRANSPOSE
·
Fraud detection logic implemented
through SAS macros
·
Dataset cleanup using PROC DATASETS
This type of analytics approach allows companies to detect suspicious
campaign behavior, reduce financial losses, and improve digital marketing
efficiency.
For SAS programmers preparing for analytics
or fraud detection roles, projects like this demonstrate practical knowledge of
data cleaning, dataset transformation, and automated fraud classification.
SAS INTERVIEW QUESTIONS
1. How do FIRST. and LAST. work?
Answer:
FIRST. and LAST. are temporary
variables automatically created by SAS when using a BY statement after
sorting data.
- FIRST.variable = 1 for the first
observation of each BY group
- LAST.variable = 1 for the last
observation of each BY group
Example:
Used for group totals or identifying first/last records in a group.
2.
Difference between CALL SYMPUT and CALL SYMPUTX?
Answer:
|
Feature |
CALL
SYMPUT |
CALL
SYMPUTX |
|
Spaces |
Keeps
leading/trailing spaces |
Removes
spaces automatically |
|
Type
conversion |
May
cause numeric-to-character notes |
Automatically
converts values |
|
Usage |
Older
method |
Recommended
method |
Short: CALL SYMPUTX is an improved version of CALL SYMPUT.
3.
Difference between INPUT and PUT functions?
Answer:
|
Function |
Purpose |
|
INPUT |
Converts character → numeric |
|
PUT |
Converts numeric → character |
Example:
- INPUT(char_var,8.) → character to numeric
- PUT(num_var,8.) → numeric to character.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent DIGITAL ADVERTISING data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment