408.Can We Build an Accurate Product Demand Forecasting & Fraud Detection System in SAS While Identifying and Fixing Intentional Errors?
Product Demand Forecasting and Fraud Detection Analytics Using SAS with Intentional Error Handling
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —
DATA STEP | INPUT | DATALINES | FORMAT | SET | MERGE | IF-THEN-ELSE | LENGTH | STRIP | TRIM | CAT | CATX | PROPCASE | UPCASE | LOWCASE | COALESCE | ABS | MONTH | YEAR | MDY | INTNX | INTCK | PROC DATASETS DELETE | PROC SQL | PROC FREQ | PROC MEANS | PROC UNIVARIATE | PROC CORR | PROC TRANSPOSE | PROC SGPLOT | SCATTER | REG | MACRO | %MEND | %UTILIZATION | %FRAUD_CHECK
Table of Contents
1.
Introduction
2.
Business Context
3.
Project Objectives
4.
Dataset Design Structure
5.
Raw Data Creation with Intentional Errors
6.
Error Identification & Explanation
7.
Corrected Full-Length SAS Code
8.
Data Transformation & Feature Engineering
9.
Forecast Accuracy Calculation
10. Utilization
Classification Macro
11. Fraud
Detection Logic Macro
12. Statistical
Analysis (PROC MEANS, UNIVARIATE, CORR)
13. SQL-Based
Analysis
14. Visualization
(PROC SGPLOT)
15. Dataset
Management (TRANSPOSE, DELETE)
16. Why
Each Procedure Is Used
17. 12 Key Points About The Project
18. Conclusion
1. Introduction
Product demand forecasting is a critical
analytics problem in supply chain management. Organizations rely on forecast
models to estimate future sales demand and align procurement, production, and
inventory accordingly.
However:
·
Forecasts may be inaccurate
·
Inventory may be underutilized or overstocked
·
Fraudulent sales patterns may distort actual
data
·
Data quality issues may cause incorrect
reporting
In this project, we simulate a real-world
product demand forecasting environment using Base SAS, PROC SQL, statistical procedures, macros, and data
management techniques.
We will:
·
Create a dataset (15+ observations)
·
Introduce intentional errors
·
Identify and correct them
·
Apply forecasting accuracy calculations
·
Perform fraud detection logic
·
Use date functions (MDY, INTCK, INTNX)
·
Apply numeric & character functions
·
Use TRANSPOSE, SET
·
Perform statistical and correlation analysis
·
Generate visualizations
2. Business Context
Assume a retail company operates across 4
regions:
·
North
·
South
·
East
·
West
Each month, the company:
·
Forecasts demand for products
·
Records actual sales
·
Tracks inventory levels
·
Calculates forecast error
·
Measures accuracy percentage
The company wants to answer:
·
Which region has high forecast error?
·
Which products are underutilized?
·
Is there suspicious sales inflation?
·
How accurate are forecasts?
·
Is inventory aligned with demand?
3.Dataset Structure
|
Variable |
Description |
|
Product_ID |
Unique Product Code |
|
Region |
Sales Region |
|
Forecasted_Demand |
Predicted Units |
|
Actual_Sales |
Sold Units |
|
Forecast_Error |
Difference |
|
Inventory_Level |
Current Inventory |
|
Accuracy_Score |
Forecast Accuracy % |
|
Sales_Date |
Transaction Date |
|
Fraud_Flag |
Fraud Indicator |
4. RAW DATA CREATION (WITH INTENTIONAL ERRORS)
data demand_raw;
input Product_ID $ Region $ Forecasted_Demand Actual_Sales Inventory_Level
Sales_Date : date9.;
format Sales_Date date9.;
datalines;
P101 North 500 520 600 01JAN2025
P102 South 450 430 500 15JAN2025
P103 East 600 580 650 30JAN2025
P104 West 700 . 750 10FEB2025
P105 North 800 900 820 25FEB2025
P106 South -200 300 400 05MAR2025
P107 East 650 640 660 20MAR2025
P108 West 720 710 700 05APR2025
P109 North 550 0 580 15APR2025
P110 South 480 490 510 30APR2025
P111 East 620 610 630 10MAY2025
P112 West 710 730 720 20MAY2025
P113 North 900 950 910 05JUN2025
P114 South 500 10000 520 15JUN2025
P115 East 600 580 610 25JUN2025
;
run;
proc print data=demand_raw;
run;
OUTPUT:
| Obs | Product_ID | Region | Forecasted_Demand | Actual_Sales | Inventory_Level | Sales_Date |
|---|---|---|---|---|---|---|
| 1 | P101 | North | 500 | 520 | 600 | 01JAN2025 |
| 2 | P102 | South | 450 | 430 | 500 | 15JAN2025 |
| 3 | P103 | East | 600 | 580 | 650 | 30JAN2025 |
| 4 | P104 | West | 700 | . | 750 | 10FEB2025 |
| 5 | P105 | North | 800 | 900 | 820 | 25FEB2025 |
| 6 | P106 | South | -200 | 300 | 400 | 05MAR2025 |
| 7 | P107 | East | 650 | 640 | 660 | 20MAR2025 |
| 8 | P108 | West | 720 | 710 | 700 | 05APR2025 |
| 9 | P109 | North | 550 | 0 | 580 | 15APR2025 |
| 10 | P110 | South | 480 | 490 | 510 | 30APR2025 |
| 11 | P111 | East | 620 | 610 | 630 | 10MAY2025 |
| 12 | P112 | West | 710 | 730 | 720 | 20MAY2025 |
| 13 | P113 | North | 900 | 950 | 910 | 05JUN2025 |
| 14 | P114 | South | 500 | 10000 | 520 | 15JUN2025 |
| 15 | P115 | East | 600 | 580 | 610 | 25JUN2025 |
❗ Intentional Errors
·
Missing Actual_Sales (P104)
·
Negative Forecast (P106)
·
Zero Sales (P109)
·
Extremely High Sales (P114)
·
Forecast_Error not calculated
· Accuracy not calculated
What It Does
Creates the initial forecasting dataset with 15+
observations.
Why
Used
DATA step is used to manually create structured
raw data.
Business
Meaning
Simulates real forecasting records including:
·
Forecast
·
Actual sales
·
Inventory
·
Sales date
Intentional errors were included:
·
Missing sales
·
Negative forecast
·
Extremely high sales
·
Zero sales
5. ERROR IDENTIFICATION
Missing Sales
proc means data=demand_raw n nmiss;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | N Miss |
|---|---|---|
Forecasted_Demand Actual_Sales Inventory_Level Sales_Date | 15 14 15 15 | 0 1 0 0 |
Outlier Detection
proc univariate data=demand_raw;
var Actual_Sales;
run;
OUTPUT:
The UNIVARIATE Procedure
Variable: Actual_Sales
| Moments | |||
|---|---|---|---|
| N | 14 | Sum Weights | 14 |
| Mean | 1245.71429 | Sum Observations | 17440 |
| Std Deviation | 2530.79712 | Variance | 6404934.07 |
| Skewness | 3.68364406 | Kurtosis | 13.69523 |
| Uncorrected SS | 104989400 | Corrected SS | 83264142.9 |
| Coeff Variation | 203.160319 | Std Error Mean | 676.383981 |
| Basic Statistical Measures | |||
|---|---|---|---|
| Location | Variability | ||
| Mean | 1245.714 | Std Deviation | 2531 |
| Median | 595.000 | Variance | 6404934 |
| Mode | 580.000 | Range | 10000 |
| Interquartile Range | 240.00000 | ||
| Tests for Location: Mu0=0 | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Student's t | t | 1.841726 | Pr > |t| | 0.0884 |
| Sign | M | 6.5 | Pr >= |M| | 0.0002 |
| Signed Rank | S | 45.5 | Pr >= |S| | 0.0002 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 10000 |
| 99% | 10000 |
| 95% | 10000 |
| 90% | 950 |
| 75% Q3 | 730 |
| 50% Median | 595 |
| 25% Q1 | 490 |
| 10% | 300 |
| 5% | 0 |
| 1% | 0 |
| 0% Min | 0 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 0 | 9 | 710 | 8 |
| 300 | 6 | 730 | 12 |
| 430 | 2 | 900 | 5 |
| 490 | 10 | 950 | 13 |
| 520 | 1 | 10000 | 14 |
| Missing Values | |||
|---|---|---|---|
| Missing Value | Count | Percent Of | |
| All Obs | Missing Obs | ||
| . | 1 | 6.67 | 100.00 |
6. CORRECTED FULL-LENGTH DATA PROCESSING CODE
data demand_clean;
set demand_raw;
Product_ID = strip(upcase(Product_ID));
Region = propcase(Region);
if Forecasted_Demand < 0 then Forecasted_Demand = .;
if Actual_Sales = . then Actual_Sales = 0;
Forecast_Error = Actual_Sales - Forecasted_Demand;
Accuracy_Score = (1 - abs(Forecast_Error)/Forecasted_Demand)*100;
Month = month(Sales_Date);
Year = year(Sales_Date);
Month_Start = intnx('month', Sales_Date, 0, 'b');
Month_End = intnx('month', Sales_Date, 0, 'e');
Months_Since_Start = intck('month', mdy(1,1,2025), Sales_Date);
format Month_Start Month_End date9.;
run;
proc print data=demand_clean;
run;
OUTPUT:
| Obs | Product_ID | Region | Forecasted_Demand | Actual_Sales | Inventory_Level | Sales_Date | Forecast_Error | Accuracy_Score | Month | Year | Month_Start | Month_End | Months_Since_Start |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P101 | North | 500 | 520 | 600 | 01JAN2025 | 20 | 96.00 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 |
| 2 | P102 | South | 450 | 430 | 500 | 15JAN2025 | -20 | 95.56 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 |
| 3 | P103 | East | 600 | 580 | 650 | 30JAN2025 | -20 | 96.67 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 |
| 4 | P104 | West | 700 | 0 | 750 | 10FEB2025 | -700 | 0.00 | 2 | 2025 | 01FEB2025 | 28FEB2025 | 1 |
| 5 | P105 | North | 800 | 900 | 820 | 25FEB2025 | 100 | 87.50 | 2 | 2025 | 01FEB2025 | 28FEB2025 | 1 |
| 6 | P106 | South | . | 300 | 400 | 05MAR2025 | . | . | 3 | 2025 | 01MAR2025 | 31MAR2025 | 2 |
| 7 | P107 | East | 650 | 640 | 660 | 20MAR2025 | -10 | 98.46 | 3 | 2025 | 01MAR2025 | 31MAR2025 | 2 |
| 8 | P108 | West | 720 | 710 | 700 | 05APR2025 | -10 | 98.61 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 |
| 9 | P109 | North | 550 | 0 | 580 | 15APR2025 | -550 | 0.00 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 |
| 10 | P110 | South | 480 | 490 | 510 | 30APR2025 | 10 | 97.92 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 |
| 11 | P111 | East | 620 | 610 | 630 | 10MAY2025 | -10 | 98.39 | 5 | 2025 | 01MAY2025 | 31MAY2025 | 4 |
| 12 | P112 | West | 710 | 730 | 720 | 20MAY2025 | 20 | 97.18 | 5 | 2025 | 01MAY2025 | 31MAY2025 | 4 |
| 13 | P113 | North | 900 | 950 | 910 | 05JUN2025 | 50 | 94.44 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 |
| 14 | P114 | South | 500 | 10000 | 520 | 15JUN2025 | 9500 | -1800.00 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 |
| 15 | P115 | East | 600 | 580 | 610 | 25JUN2025 | -20 | 96.67 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 |
Explanation of Corrections:
1.STRIP + UPCASE
Removes
extra spaces and standardizes product codes.
2.PROPCASE
Standardizes
region names.
3.Negative Forecast Fix
Negative
demand is unrealistic → converted to missing.
4.Missing Sales
Replaced
with zero for safe calculation.
5. Forecast Error
Actual -
Forecast.
6. Accuracy Formula
Industry
standard forecast accuracy formula.
·
ABS( ) → Absolute error
·
MONTH( ) → Extract month
·
YEAR( ) → Extract year
7. UTILIZATION CLASSIFICATION MACRO
%macro utilization;
data demand_final;
set demand_clean;
Utilization_Ratio = Actual_Sales / Inventory_Level;
length Utilization_Class $20;
if Utilization_Ratio > 0.9 then Utilization_Class = "High";
else if Utilization_Ratio > 0.7 then Utilization_Class = "Medium";
else Utilization_Class = "Low";
run;
proc print data=demand_final;
run;
%mend;
%utilization;
OUTPUT:
| Obs | Product_ID | Region | Forecasted_Demand | Actual_Sales | Inventory_Level | Sales_Date | Forecast_Error | Accuracy_Score | Month | Year | Month_Start | Month_End | Months_Since_Start | Utilization_Ratio | Utilization_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P101 | North | 500 | 520 | 600 | 01JAN2025 | 20 | 96.00 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 | 0.8667 | Medium |
| 2 | P102 | South | 450 | 430 | 500 | 15JAN2025 | -20 | 95.56 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 | 0.8600 | Medium |
| 3 | P103 | East | 600 | 580 | 650 | 30JAN2025 | -20 | 96.67 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 | 0.8923 | Medium |
| 4 | P104 | West | 700 | 0 | 750 | 10FEB2025 | -700 | 0.00 | 2 | 2025 | 01FEB2025 | 28FEB2025 | 1 | 0.0000 | Low |
| 5 | P105 | North | 800 | 900 | 820 | 25FEB2025 | 100 | 87.50 | 2 | 2025 | 01FEB2025 | 28FEB2025 | 1 | 1.0976 | High |
| 6 | P106 | South | . | 300 | 400 | 05MAR2025 | . | . | 3 | 2025 | 01MAR2025 | 31MAR2025 | 2 | 0.7500 | Medium |
| 7 | P107 | East | 650 | 640 | 660 | 20MAR2025 | -10 | 98.46 | 3 | 2025 | 01MAR2025 | 31MAR2025 | 2 | 0.9697 | High |
| 8 | P108 | West | 720 | 710 | 700 | 05APR2025 | -10 | 98.61 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 | 1.0143 | High |
| 9 | P109 | North | 550 | 0 | 580 | 15APR2025 | -550 | 0.00 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 | 0.0000 | Low |
| 10 | P110 | South | 480 | 490 | 510 | 30APR2025 | 10 | 97.92 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 | 0.9608 | High |
| 11 | P111 | East | 620 | 610 | 630 | 10MAY2025 | -10 | 98.39 | 5 | 2025 | 01MAY2025 | 31MAY2025 | 4 | 0.9683 | High |
| 12 | P112 | West | 710 | 730 | 720 | 20MAY2025 | 20 | 97.18 | 5 | 2025 | 01MAY2025 | 31MAY2025 | 4 | 1.0139 | High |
| 13 | P113 | North | 900 | 950 | 910 | 05JUN2025 | 50 | 94.44 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 | 1.0440 | High |
| 14 | P114 | South | 500 | 10000 | 520 | 15JUN2025 | 9500 | -1800.00 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 | 19.2308 | High |
| 15 | P115 | East | 600 | 580 | 610 | 25JUN2025 | -20 | 96.67 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 | 0.9508 | High |
What It Does
Classifies inventory utilization.
Why Used
Automates classification logic.
Business Meaning
Identifies:
·
Overstock (Low utilization)
·
Optimal stock (Medium)
·
Risk of stockout (High)
8. FRAUD DETECTION MACRO
%macro fraud_check;
data fraud_data;
set demand_final;
length Fraud_Flag $10;
if Actual_Sales > 3*Forecasted_Demand then Fraud_Flag="Yes";
else Fraud_Flag="No";
run;
proc print data=fraud_data;
run;
%mend;
%fraud_check;
OUTPUT:
| Obs | Product_ID | Region | Forecasted_Demand | Actual_Sales | Inventory_Level | Sales_Date | Forecast_Error | Accuracy_Score | Month | Year | Month_Start | Month_End | Months_Since_Start | Utilization_Ratio | Utilization_Class | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P101 | North | 500 | 520 | 600 | 01JAN2025 | 20 | 96.00 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 | 0.8667 | Medium | No |
| 2 | P102 | South | 450 | 430 | 500 | 15JAN2025 | -20 | 95.56 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 | 0.8600 | Medium | No |
| 3 | P103 | East | 600 | 580 | 650 | 30JAN2025 | -20 | 96.67 | 1 | 2025 | 01JAN2025 | 31JAN2025 | 0 | 0.8923 | Medium | No |
| 4 | P104 | West | 700 | 0 | 750 | 10FEB2025 | -700 | 0.00 | 2 | 2025 | 01FEB2025 | 28FEB2025 | 1 | 0.0000 | Low | No |
| 5 | P105 | North | 800 | 900 | 820 | 25FEB2025 | 100 | 87.50 | 2 | 2025 | 01FEB2025 | 28FEB2025 | 1 | 1.0976 | High | No |
| 6 | P106 | South | . | 300 | 400 | 05MAR2025 | . | . | 3 | 2025 | 01MAR2025 | 31MAR2025 | 2 | 0.7500 | Medium | Yes |
| 7 | P107 | East | 650 | 640 | 660 | 20MAR2025 | -10 | 98.46 | 3 | 2025 | 01MAR2025 | 31MAR2025 | 2 | 0.9697 | High | No |
| 8 | P108 | West | 720 | 710 | 700 | 05APR2025 | -10 | 98.61 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 | 1.0143 | High | No |
| 9 | P109 | North | 550 | 0 | 580 | 15APR2025 | -550 | 0.00 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 | 0.0000 | Low | No |
| 10 | P110 | South | 480 | 490 | 510 | 30APR2025 | 10 | 97.92 | 4 | 2025 | 01APR2025 | 30APR2025 | 3 | 0.9608 | High | No |
| 11 | P111 | East | 620 | 610 | 630 | 10MAY2025 | -10 | 98.39 | 5 | 2025 | 01MAY2025 | 31MAY2025 | 4 | 0.9683 | High | No |
| 12 | P112 | West | 710 | 730 | 720 | 20MAY2025 | 20 | 97.18 | 5 | 2025 | 01MAY2025 | 31MAY2025 | 4 | 1.0139 | High | No |
| 13 | P113 | North | 900 | 950 | 910 | 05JUN2025 | 50 | 94.44 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 | 1.0440 | High | No |
| 14 | P114 | South | 500 | 10000 | 520 | 15JUN2025 | 9500 | -1800.00 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 | 19.2308 | High | Yes |
| 15 | P115 | East | 600 | 580 | 610 | 25JUN2025 | -20 | 96.67 | 6 | 2025 | 01JUN2025 | 30JUN2025 | 5 | 0.9508 | High | No |
What It Does
Flags suspicious sales spikes.
Why Used
Detects abnormal patterns.
Business Meaning
Sales > 3x forecast may indicate:
·
Data manipulation
·
Reporting error
·
Fraud activity
9. SQL ANALYSIS
proc sql;
create table summary_data as
select Region,
avg(Actual_Sales) as Avg_Sales,
avg(Forecasted_Demand) as Avg_Forecast,
avg(Accuracy_Score) as Avg_Accuracy
from fraud_data
group by Region;
quit;
proc print data=summary_data;
run;
OUTPUT:
| Obs | Region | Avg_Sales | Avg_Forecast | Avg_Accuracy |
|---|---|---|---|---|
| 1 | East | 602.5 | 617.500 | 97.545 |
| 2 | North | 592.5 | 687.500 | 69.486 |
| 3 | South | 2805.0 | 476.667 | -535.509 |
| 4 | West | 480.0 | 710.000 | 65.265 |
10. PROC FREQ
proc freq data=fraud_data;
tables Region*Fraud_Flag;
run;
OUTPUT:
The FREQ Procedure
|
| ||||||||||||||||||||||||||||||||
11. PROC MEANS
proc means data=fraud_data mean std min max;
var Forecasted_Demand Actual_Sales Accuracy_Score;
run;
OUTPUT:
The MEANS Procedure
| Variable | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|
Forecasted_Demand Actual_Sales Accuracy_Score | 627.1428571 1162.67 -53.0433682 | 129.6402461 2459.86 504.0110229 | 450.0000000 0 -1800.00 | 900.0000000 10000.00 98.6111111 |
12. PROC UNIVARIATE
proc univariate data=fraud_data;
var Forecast_Error;
histogram Forecast_Error;
run;
OUTPUT:
The UNIVARIATE Procedure
Variable: Forecast_Error
| Moments | |||
|---|---|---|---|
| N | 14 | Sum Weights | 14 |
| Mean | 597.142857 | Sum Observations | 8360 |
| Std Deviation | 2573.01508 | Variance | 6620406.59 |
| Skewness | 3.68631118 | Kurtosis | 13.7163928 |
| Uncorrected SS | 91057400 | Corrected SS | 86065285.7 |
| Coeff Variation | 430.887693 | Std Error Mean | 687.667205 |
| Basic Statistical Measures | |||
|---|---|---|---|
| Location | Variability | ||
| Mean | 597.1429 | Std Deviation | 2573 |
| Median | -10.0000 | Variance | 6620407 |
| Mode | -20.0000 | Range | 10200 |
| Interquartile Range | 40.00000 | ||
Note: The mode displayed is the smallest of 2 modes with a count of 3.
| Tests for Location: Mu0=0 | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Student's t | t | 0.86836 | Pr > |t| | 0.4009 |
| Sign | M | -1 | Pr >= |M| | 0.7905 |
| Signed Rank | S | -1 | Pr >= |S| | 0.9657 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 9500 |
| 99% | 9500 |
| 95% | 9500 |
| 90% | 100 |
| 75% Q3 | 20 |
| 50% Median | -10 |
| 25% Q1 | -20 |
| 10% | -550 |
| 5% | -700 |
| 1% | -700 |
| 0% Min | -700 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| -700 | 4 | 20 | 1 |
| -550 | 9 | 20 | 12 |
| -20 | 15 | 50 | 13 |
| -20 | 3 | 100 | 5 |
| -20 | 2 | 9500 | 14 |
| Missing Values | |||
|---|---|---|---|
| Missing Value | Count | Percent Of | |
| All Obs | Missing Obs | ||
| . | 1 | 6.67 | 100.00 |
The UNIVARIATE Procedure
13. PROC CORR
proc corr data=fraud_data;
var Forecasted_Demand Actual_Sales Inventory_Level Accuracy_Score;
run;
OUTPUT:
The CORR Procedure
| 4 Variables: | Forecasted_Demand Actual_Sales Inventory_Level Accuracy_Score |
|---|
| Simple Statistics | ||||||
|---|---|---|---|---|---|---|
| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
| Forecasted_Demand | 14 | 627.14286 | 129.64025 | 8780 | 450.00000 | 900.00000 |
| Actual_Sales | 15 | 1163 | 2460 | 17440 | 0 | 10000 |
| Inventory_Level | 15 | 637.33333 | 131.60909 | 9560 | 400.00000 | 910.00000 |
| Accuracy_Score | 14 | -53.04337 | 504.01102 | -742.60716 | -1800 | 98.61111 |
| Pearson Correlation Coefficients Prob > |r| under H0: Rho=0 Number of Observations | ||||
|---|---|---|---|---|
| Forecasted_Demand | Actual_Sales | Inventory_Level | Accuracy_Score | |
| Forecasted_Demand | 1.00000 14 | -0.22556 0.4381 14 | 0.97736 <.0001 14 | 0.28222 0.3283 14 |
| Actual_Sales | -0.22556 0.4381 14 | 1.00000 15 | -0.18877 0.5004 15 | -0.98561 <.0001 14 |
| Inventory_Level | 0.97736 <.0001 14 | -0.18877 0.5004 15 | 1.00000 15 | 0.32319 0.2597 14 |
| Accuracy_Score | 0.28222 0.3283 14 | -0.98561 <.0001 14 | 0.32319 0.2597 14 | 1.00000 14 |
What It Does
Calculates correlation matrix.
Why Used
Identifies relationships between variables.
14. TRANSPOSE
proc transpose data=summary_data out=transposed_summary;
by Region NotSorted;
var Avg_Sales Avg_Forecast Avg_Accuracy;
run;
proc print data=transposed_summary;
run;
OUTPUT:
| Obs | Region | _NAME_ | COL1 |
|---|---|---|---|
| 1 | East | Avg_Sales | 602.50 |
| 2 | East | Avg_Forecast | 617.50 |
| 3 | East | Avg_Accuracy | 97.55 |
| 4 | North | Avg_Sales | 592.50 |
| 5 | North | Avg_Forecast | 687.50 |
| 6 | North | Avg_Accuracy | 69.49 |
| 7 | South | Avg_Sales | 2805.00 |
| 8 | South | Avg_Forecast | 476.67 |
| 9 | South | Avg_Accuracy | -535.51 |
| 10 | West | Avg_Sales | 480.00 |
| 11 | West | Avg_Forecast | 710.00 |
| 12 | West | Avg_Accuracy | 65.26 |
What It Does
Converts rows to columns.
Why Used
For reporting format change.
15. VISUALIZATION
proc sgplot data=fraud_data;
scatter x=Forecasted_Demand y=Actual_Sales;
reg x=Forecasted_Demand y=Actual_Sales;
run;
OUTPUT:
What It Does
Creates scatter plot with regression line.
Why Used
Visual validation of forecast model.
16. PROC DATASETS – Clean Workspace
proc datasets library=work nolist;
delete demand_clean demand_final fraud_data summary_data;
quit;
LOG:
Why
Each Procedure Is Used
|
Procedure |
Purpose |
|
PROC
SQL |
Aggregation
& summary |
|
PROC
FREQ |
Categorical
analysis |
|
PROC
MEANS |
Summary
statistics |
|
PROC
UNIVARIATE |
Distribution
& outlier |
|
PROC
CORR |
Correlation |
|
PROC
SGPLOT |
Visualization |
|
PROC
TRANSPOSE |
Restructure
data |
|
PROC
APPEND |
Add
datasets |
|
PROC
DATASETS DELETE |
Clean
workspace |
- The project simulates a
real-world retail demand forecasting environment across multiple regions,
making it practical and business-oriented rather than purely theoretical.
- A structured dataset was
created with key variables such as Product_ID, Region, Forecasted_Demand,
Actual_Sales, Inventory_Level, Forecast_Error, Accuracy_Score, Sales_Date,
and Fraud_Flag to reflect real supply chain reporting systems.
- Multiple intentional data
issues were introduced, including missing sales values, negative
forecasts, zero sales, and extremely high sales, to demonstrate error
identification and correction techniques.
- Data cleaning was performed
using character functions like STRIP( ), UPCASE( ), and PROPCASE( ) to
standardize product and region values and ensure consistent reporting.
- Numeric validation logic was
applied to handle unrealistic values, such as converting negative
forecasts to missing and replacing missing sales with zero to prevent
calculation failures.
- Forecast performance was
evaluated using calculated metrics such as Forecast_Error (Actual −
Forecast) and Accuracy_Score percentage, providing measurable forecasting
efficiency.
- Date intelligence functions
like MDY( ), INTNX( ), and INTCK( ) were used to derive monthly boundaries
and time differences, enabling time-based analysis and reporting.
- Inventory utilization was
automated through a macro that classified products into High, Medium, and
Low utilization categories based on the sales-to-inventory ratio.
- Fraud detection logic was
implemented using a macro to flag abnormal sales spikes where Actual_Sales
significantly exceeded Forecasted_Demand.
- Analytical procedures such
as PROC SQL, PROC MEANS, PROC UNIVARIATE, PROC FREQ, and PROC CORR were
used to generate summaries, distributions, correlations, and categorical insights.
- Visualization using PROC
SGPLOT helped validate the relationship between forecasted demand and
actual sales through scatter plots and regression lines.
- Dataset management
techniques including TRANSPOSE, and PROC DATASETS DELETE ensured clean
workflow execution, structured reporting, and scalable production-level
processing.
This project demonstrates how a structured SAS-based
analytics workflow can transform raw forecasting data into meaningful business
insights. By creating a realistic multi-region product demand dataset and
intentionally introducing errors, the project highlights the importance of data
validation, cleansing, and correction before analysis. Forecast error and
accuracy calculations provide measurable indicators of model performance, while
date functions enable time-based tracking. Inventory utilization classification
supports stock optimization, and fraud detection logic identifies abnormal
sales spikes that may indicate reporting issues or manipulation. Statistical
procedures such as PROC MEANS, PROC UNIVARIATE, PROC FREQ, PROC CORR, and PROC
SQL deliver descriptive, distributional, and relational insights. Visualization
through PROC SGPLOT strengthens analytical interpretation. Overall, the project
integrates data management, statistical evaluation, automation via macros, and
business logic into a single scalable framework. It reflects real-world supply
chain analytics practices and demonstrates strong SAS programming competency suitable for
enterprise-level forecasting and fraud monitoring environments.
SAS INTERVIEW QUESTIONS
·
How do you handle missing values?
·
What is reconciliation?
·
What is log checking?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent FRAUD DETECTION SYSTEM data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment