405.What Do Vehicles, Waiting Time, and Cashless Payments Really Say About Toll Plaza Performance?A Complete SAS Analytics Investigation
What Happens When High Traffic Meets Long Waiting Time?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE WE USED THESE SAS STATEMENTS AND FUNCTIONS FOR THIS PROJECT:
DATA STEP | SET | INPUT | DATALINES | LENGTH | FORMAT |
INFORMAT | IF-THEN-ELSE | RENAME | DROP | KEEP | LABEL | ROUND | INTNX | INTCK
| MDY | STRIP | TRIM | CAT | CATX | PROPCASE | UPCASE | LOWCASE | COALESCE |
PROC CONTENTS | PROC SQL | CREATE TABLE | SELECT | GROUP BY | ORDER BY | PROC
FREQ | PROC MEANS | PROC UNIVARIATE | PROC CORR | PROC SGPLOT | SCATTER | PROC
APPEND | PROC TRANSPOSE | PROC DATASETS DELETE | MERGE | BY | %MACRO | %MEND
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INTRODUCTION
Toll plazas are critical infrastructure points
that directly impact:
·
Traffic flow
·
Revenue collection
·
Public satisfaction
·
Fraud risk
·
Digital payment adoption
This project simulates real-world toll plaza data and
demonstrates how a SAS programmer would
analyze operations, efficiency, utilization, and fraud patterns using Base SAS, PROC SQL, Macros, and Analytics
procedures.
TABLE OF CONTENTS
1.
Business Context
2.
Project Objectives
3.
Dataset Design
4.
Raw Data Creation (with intentional error)
5.
Identifying the Error
6.
Correcting the Error
7.
Final Clean Dataset Creation
8.
Date Handling & Derivations
9.
Efficiency & Utilization Logic (Macro)
10. Fraud
Detection Logic (Macro)
11. PROC
SQL Analytics
12. PROC
FREQ, MEANS, UNIVARIATE
13. Correlation
Analysis
14. Visualization
(PROC SGPLOT)
15. Append,
Set, Merge, Transpose
16. Character
& Numeric Functions
17. PROC
DATASETS Delete
18. Business
Insights
19. Key
Points
20. Conclusion
1. BUSINESS CONTEXT
Government authorities and private
concessionaires monitor toll plazas to:
·
Reduce congestion
·
Improve digital payments (FASTag / RFID)
·
Detect revenue leakage
·
Improve operational efficiency
SAS is widely used in:
·
Transport analytics
·
Government data platforms
·
Fraud and compliance systems
2. PROJECT OBJECTIVES
·
Create a realistic
toll plaza dataset
·
Introduce a deliberate data error
·
Detect and correct the error
·
Calculate Efficiency
Score (%)
·
Classify Utilization
Levels
·
Identify Fraud
Risk patterns
·
Use full
SAS toolchain
3. DATASET DESIGN
Variables
|
Variable |
Description |
|
Plaza_ID |
Unique toll plaza ID |
|
Highway_Name |
Name of highway |
|
Vehicles_Per_Day |
Average daily vehicles |
|
Revenue |
Daily revenue (INR) |
|
Avg_Wait_Time |
Avg waiting time (minutes) |
|
Cashless_Ratio |
% of digital payments |
|
Efficiency_Score |
Calculated (%) |
|
Collection_Date |
Observation date |
|
Region |
Geographic region |
|
Fraud_Flag |
Fraud indicator |
Plaza_ID
A unique identifier for each toll plaza.
Used as a primary reference across datasets.
Highway_Name
Indicates the highway on which the toll plaza
operates.
Useful for infrastructure planning and comparative studies.
Vehicles_Per_Day
Represents average daily traffic volume.
This is a core operational KPI.
Revenue
Daily revenue collected at the toll plaza.
Highly sensitive and audit-critical.
Avg_Wait_Time
Average waiting time per vehicle (minutes).
Directly impacts public satisfaction.
Cashless_Ratio
Percentage of vehicles using digital payment
methods.
Indicates modernization and efficiency.
Efficiency_Score
Derived metric measuring operational
effectiveness.
Combines traffic, payment mode, and waiting time.
Collection_Date
Date of data collection.
Enables trend and time-series analysis.
Region
Geographical grouping for regional insights.
Fraud_Flag
Derived indicator highlighting suspicious
patterns.
4. RAW DATA CREATION (WITH INTENTIONAL ERROR)
Intentional Error:
Revenue typed as character ($) instead of numeric
This will break calculations later
data toll_raw;
length Plaza_ID $6 Highway_Name $30 Region $15 Revenue $12;
input Plaza_ID $ Highway_Name $ Vehicles_Per_Day Revenue $ Avg_Wait_Time
Cashless_Ratio Collection_Date : date9. Region $;
format Collection_Date date9.;
datalines;
TP001 NH44 45000 850000 3.5 82 01JAN2026 South
TP002 NH48 38000 720000 4.2 78 01JAN2026 West
TP003 NH16 51000 910000 2.9 88 01JAN2026 East
TP004 NH65 29000 560000 6.8 60 01JAN2026 South
TP005 NH19 47000 880000 3.1 85 01JAN2026 North
TP006 NH27 26000 490000 7.5 55 01JAN2026 West
TP007 NH30 34000 650000 5.2 70 01JAN2026 Central
TP008 NH66 52000 970000 2.7 90 01JAN2026 South
TP009 NH52 31000 600000 6.1 62 01JAN2026 North
TP010 NH40 28000 540000 7.0 58 01JAN2026 East
TP011 NH75 36000 690000 4.8 73 01JAN2026 West
TP012 NH22 24000 460000 8.0 50 01JAN2026 North
TP013 NH13 49000 920000 3.0 86 01JAN2026 South
;
run;
proc print data=toll_raw;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Revenue | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 850000 | 45000 | 3.5 | 82 | 01JAN2026 |
| 2 | TP002 | NH48 | West | 720000 | 38000 | 4.2 | 78 | 01JAN2026 |
| 3 | TP003 | NH16 | East | 910000 | 51000 | 2.9 | 88 | 01JAN2026 |
| 4 | TP004 | NH65 | South | 560000 | 29000 | 6.8 | 60 | 01JAN2026 |
| 5 | TP005 | NH19 | North | 880000 | 47000 | 3.1 | 85 | 01JAN2026 |
| 6 | TP006 | NH27 | West | 490000 | 26000 | 7.5 | 55 | 01JAN2026 |
| 7 | TP007 | NH30 | Central | 650000 | 34000 | 5.2 | 70 | 01JAN2026 |
| 8 | TP008 | NH66 | South | 970000 | 52000 | 2.7 | 90 | 01JAN2026 |
| 9 | TP009 | NH52 | North | 600000 | 31000 | 6.1 | 62 | 01JAN2026 |
| 10 | TP010 | NH40 | East | 540000 | 28000 | 7.0 | 58 | 01JAN2026 |
| 11 | TP011 | NH75 | West | 690000 | 36000 | 4.8 | 73 | 01JAN2026 |
| 12 | TP012 | NH22 | North | 460000 | 24000 | 8.0 | 50 | 01JAN2026 |
| 13 | TP013 | NH13 | South | 920000 | 49000 | 3.0 | 86 | 01JAN2026 |
Common issues:
·
Numeric values stored as character
·
Missing values
·
Format mismatches
·
Inconsistent lengths
This project intentionally introduces an error where Revenue
is stored as a character variable
instead of numeric.
This simulates:
·
Manual data entry mistakes
·
Poor upstream system design
·
CSV or Excel ingestion issues
5. IDENTIFYING THE ERROR
proc contents data=toll_raw;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.TOLL_RAW | Observations | 13 |
|---|---|---|---|
| Member Type | DATA | Variables | 8 |
| Engine | V9 | Indexes | 0 |
| Created | 02/18/2026 19:27:00 | Observation Length | 96 |
| Last Modified | 02/18/2026 19:27:00 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 1363 |
| Obs in First Data Page | 13 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_work9BFC00019166_odaws02-apse1-2.oda.sas.com/SAS_workD59100019166_odaws02-apse1-2.oda.sas.com/toll_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 201335737 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | ||||
|---|---|---|---|---|
| # | Variable | Type | Len | Format |
| 6 | Avg_Wait_Time | Num | 8 | |
| 7 | Cashless_Ratio | Num | 8 | |
| 8 | Collection_Date | Num | 8 | DATE9. |
| 2 | Highway_Name | Char | 30 | |
| 1 | Plaza_ID | Char | 6 | |
| 3 | Region | Char | 15 | |
| 4 | Revenue | Char | 12 | |
| 5 | Vehicles_Per_Day | Num | 8 | |
PROC
CONTENTS is one of the most underrated but critical
procedures in SAS.
It allows us to:
·
Validate variable types
·
Check lengths and formats
·
Confirm dataset structure
Problem Identified
· Revenue is character
·
Cannot compute efficiency or fraud metrics
If
Revenue is character:
·
Arithmetic operations fail
·
Efficiency calculations break
·
Fraud logic becomes invalid
·
Reports become misleading
6. CORRECTING THE ERROR
data toll_clean;
set toll_raw;
Revenue_Num = input(strip(Revenue), best12.);
drop Revenue;
rename Revenue_Num = Revenue;
run;
proc print data=toll_clean;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue |
|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 |
·
STRIP ( ) removes spaces
· INPUT ( ) converts character → numeric
·
Dataset becomes analytics-ready
7. FINAL CLEAN DATASET
data toll_final;
set toll_clean;
Efficiency_Score =
round((Cashless_Ratio * Vehicles_Per_Day) /(Avg_Wait_Time * 1000),0.01);
format Efficiency_Score percent8.2;
run;
proc print data=toll_final;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score |
|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% |
WHY PERCENT FORMAT IS USED:
Efficiency is expressed as a percentage because:
·
It is easy to compare across plazas
·
It is intuitive for business users
·
It aligns with KPI dashboards
8. DATE HANDLING & DERIVATIONS
data toll_dates;
set toll_final;
Month = month(Collection_Date);
Next_Month = intnx('month', Collection_Date, 1);
Days_Since = intck('day', Collection_Date, today());
format Next_Month date9.;
run;
proc print data=toll_dates;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 |
Functions used:
·
INTCK() → difference between dates
·
INTNX() → shifting dates
In toll systems, fraud indicators
include:
·
High traffic but low revenue
·
Sudden revenue drops
·
Inconsistent payment ratios
9. UTILIZATION CLASSIFICATION MACRO
%macro utilization;
data toll_util;
set toll_dates;
length Utilization $10.;
if Vehicles_Per_Day >= 50000 then Utilization = 'Very High';
else if Vehicles_Per_Day >= 35000 then Utilization = 'High';
else if Vehicles_Per_Day >= 25000 then Utilization = 'Medium';
else Utilization = 'Low';
run;
proc print data=toll_util;
run;
%mend;
%utilization;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since | Utilization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 | High |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 | High |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 | Very High |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 | Medium |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 | High |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 | Medium |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 | Medium |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 | Very High |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 | Medium |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 | Medium |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 | High |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 | Low |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 | High |
10. FRAUD DETECTION MACRO
%macro fraud_check;
data toll_fraud;
set toll_util;
length Fraud_Flag $3.;
if Revenue < 500000 and Vehicles_Per_Day > 20000 then Fraud_Flag = 'YES';
else Fraud_Flag = 'NO';
run;
proc print data=toll_fraud;
run;
%mend;
%fraud_check;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
In toll systems, fraud indicators
include:
·
High traffic but low revenue
·
Sudden revenue drops
·
Inconsistent payment ratios
11. PROC SQL ANALYTICS
proc sql;
create table region_summary as
select Region,
count(*) as Plaza_Count,
avg(Revenue) as Avg_Revenue format=comma12.,
avg(Efficiency_Score) as Avg_Efficiency
from toll_fraud
group by Region;
quit;
proc print data=region_summary;
run;
OUTPUT:
| Obs | Region | Plaza_Count | Avg_Revenue | Avg_Efficiency |
|---|---|---|---|---|
| 1 | Central | 1 | 650,000 | 457.69 |
| 2 | East | 2 | 725,000 | 889.80 |
| 3 | North | 3 | 646,667 | 584.60 |
| 4 | South | 4 | 825,000 | 1112.04 |
| 5 | West | 3 | 633,333 | 481.29 |
·
It is powerful for aggregation
·
Business users understand SQL logic
·
It simplifies grouping and summarization
12. PROC FREQ, MEANS, UNIVARIATE
proc freq data=toll_fraud;
tables Utilization Fraud_Flag;
run;
OUTPUT:
The FREQ Procedure
| Utilization | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| High | 5 | 38.46 | 5 | 38.46 |
| Low | 1 | 7.69 | 6 | 46.15 |
| Medium | 5 | 38.46 | 11 | 84.62 |
| Very High | 2 | 15.38 | 13 | 100.00 |
| Fraud_Flag | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| NO | 11 | 84.62 | 11 | 84.62 |
| YES | 2 | 15.38 | 13 | 100.00 |
proc means data=toll_fraud mean min max;
var Revenue Vehicles_Per_Day Avg_Wait_Time;
run;
OUTPUT:
The MEANS Procedure
| Variable | Mean | Minimum | Maximum |
|---|---|---|---|
Revenue Vehicles_Per_Day Avg_Wait_Time | 710769.23 37692.31 4.9846154 | 460000.00 24000.00 2.7000000 | 970000.00 52000.00 8.0000000 |
PROC MEANS
·
Quick summary
·
Operational monitoring
proc univariate data=toll_fraud;
var Revenue;
run;
OUTPUT:
The UNIVARIATE Procedure
Variable: Revenue
| Moments | |||
|---|---|---|---|
| N | 13 | Sum Weights | 13 |
| Mean | 710769.231 | Sum Observations | 9240000 |
| Std Deviation | 177644.849 | Variance | 3.15577E10 |
| Skewness | 0.096711 | Kurtosis | -1.5645914 |
| Uncorrected SS | 6.9462E12 | Corrected SS | 3.78692E11 |
| Coeff Variation | 24.9933229 | Std Error Mean | 49269.8162 |
| Basic Statistical Measures | |||
|---|---|---|---|
| Location | Variability | ||
| Mean | 710769.2 | Std Deviation | 177645 |
| Median | 690000.0 | Variance | 3.15577E10 |
| Mode | . | Range | 510000 |
| Interquartile Range | 320000 | ||
| Tests for Location: Mu0=0 | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Student's t | t | 14.42606 | Pr > |t| | <.0001 |
| Sign | M | 6.5 | Pr >= |M| | 0.0002 |
| Signed Rank | S | 45.5 | Pr >= |S| | 0.0002 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 970000 |
| 99% | 970000 |
| 95% | 970000 |
| 90% | 920000 |
| 75% Q3 | 880000 |
| 50% Median | 690000 |
| 25% Q1 | 560000 |
| 10% | 490000 |
| 5% | 460000 |
| 1% | 460000 |
| 0% Min | 460000 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 460000 | 12 | 850000 | 1 |
| 490000 | 6 | 880000 | 5 |
| 540000 | 10 | 910000 | 3 |
| 560000 | 4 | 920000 | 13 |
| 600000 | 9 | 970000 | 8 |
PROC UNIVARIATE
·
Detailed distribution
·
Outlier detection
·
Statistical depth
13. CORRELATION ANALYSIS
proc corr data=toll_fraud;
var Vehicles_Per_Day Revenue Cashless_Ratio Avg_Wait_Time;
run;
OUTPUT:
The CORR Procedure
| 4 Variables: | Vehicles_Per_Day Revenue Cashless_Ratio Avg_Wait_Time |
|---|
| Simple Statistics | ||||||
|---|---|---|---|---|---|---|
| Variable | N | Mean | Std Dev | Sum | Minimum | Maximum |
| Vehicles_Per_Day | 13 | 37692 | 10020 | 490000 | 24000 | 52000 |
| Revenue | 13 | 710769 | 177645 | 9240000 | 460000 | 970000 |
| Cashless_Ratio | 13 | 72.07692 | 13.84113 | 937.00000 | 50.00000 | 90.00000 |
| Avg_Wait_Time | 13 | 4.98462 | 1.91435 | 64.80000 | 2.70000 | 8.00000 |
| Pearson Correlation Coefficients, N = 13 Prob > |r| under H0: Rho=0 | ||||
|---|---|---|---|---|
| Vehicles_Per_Day | Revenue | Cashless_Ratio | Avg_Wait_Time | |
| Vehicles_Per_Day | 1.00000 | 0.99735 <.0001 | 0.98382 <.0001 | -0.97951 <.0001 |
| Revenue | 0.99735 <.0001 | 1.00000 | 0.98656 <.0001 | -0.98455 <.0001 |
| Cashless_Ratio | 0.98382 <.0001 | 0.98656 <.0001 | 1.00000 | -0.99756 <.0001 |
| Avg_Wait_Time | -0.97951 <.0001 | -0.98455 <.0001 | -0.99756 <.0001 | 1.00000 |
14. VISUALIZATION
proc sgplot data=toll_fraud;
scatter x=Vehicles_Per_Day y=Revenue;
title "Vehicles vs Revenue";
run;
OUTPUT:
15. APPEND,TRANSPOSE
proc append base=toll_fraud
data=toll_fraud force;
run;
proc print data=toll_fraud;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
| 14 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 15 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 16 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 17 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 18 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 19 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 20 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 21 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 22 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 23 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 24 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 25 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 26 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
What Is Happening Here?
we are trying to append a dataset to
itself.
· BASE=
→ toll_fraud
·
DATA=
→ toll_fraud
·
Both point to the same physical dataset
Step 1:New Day Toll Data
data toll_fraud_new;
set toll_fraud;
Collection_Date = intnx('day', Collection_Date, 1);
Revenue = Revenue * 1.03;
run;
proc print data=toll_fraud_new;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 02JAN2026 | 875500 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 02JAN2026 | 741600 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 02JAN2026 | 937300 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 02JAN2026 | 576800 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 02JAN2026 | 906400 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 02JAN2026 | 504700 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 02JAN2026 | 669500 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 02JAN2026 | 999100 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 02JAN2026 | 618000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 02JAN2026 | 556200 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 02JAN2026 | 710700 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 02JAN2026 | 473800 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 02JAN2026 | 947600 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
Step 2:Correct Append
proc append base=toll_fraud
data=toll_fraud_new force;
run;
proc print data=toll_fraud;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 2 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 3 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 4 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 5 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 6 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 7 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 8 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 9 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 10 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 11 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 12 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 13 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
| 14 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 02JAN2026 | 875500 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 15 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 02JAN2026 | 741600 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 16 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 02JAN2026 | 937300 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 17 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 02JAN2026 | 576800 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 18 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 02JAN2026 | 906400 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 19 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 02JAN2026 | 504700 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 20 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 02JAN2026 | 669500 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 21 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 02JAN2026 | 999100 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 22 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 02JAN2026 | 618000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 23 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 02JAN2026 | 556200 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 24 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 02JAN2026 | 710700 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 25 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 02JAN2026 | 473800 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 26 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 02JAN2026 | 947600 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
PROC
APPEND is
designed for:
·
Daily incremental loads
·
Monthly data ingestion
·
Combining historical + current data
·
Faster than SET (no full
rewrite)
Step 1: Sort first
proc sort data=toll_fraud;
by Plaza_ID;
run;
proc print data=toll_fraud;
run;
OUTPUT:
| Obs | Plaza_ID | Highway_Name | Region | Vehicles_Per_Day | Avg_Wait_Time | Cashless_Ratio | Collection_Date | Revenue | Efficiency_Score | Month | Next_Month | Days_Since | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 01JAN2026 | 850000 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 2 | TP001 | NH44 | South | 45000 | 3.5 | 82 | 02JAN2026 | 875500 | 105E3% | 1 | 01FEB2026 | 48 | High | NO |
| 3 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 01JAN2026 | 720000 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 4 | TP002 | NH48 | West | 38000 | 4.2 | 78 | 02JAN2026 | 741600 | 70571% | 1 | 01FEB2026 | 48 | High | NO |
| 5 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 01JAN2026 | 910000 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 6 | TP003 | NH16 | East | 51000 | 2.9 | 88 | 02JAN2026 | 937300 | 155E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 7 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 01JAN2026 | 560000 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 8 | TP004 | NH65 | South | 29000 | 6.8 | 60 | 02JAN2026 | 576800 | 25588% | 1 | 01FEB2026 | 48 | Medium | NO |
| 9 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 01JAN2026 | 880000 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 10 | TP005 | NH19 | North | 47000 | 3.1 | 85 | 02JAN2026 | 906400 | 129E3% | 1 | 01FEB2026 | 48 | High | NO |
| 11 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 01JAN2026 | 490000 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 12 | TP006 | NH27 | West | 26000 | 7.5 | 55 | 02JAN2026 | 504700 | 19067% | 1 | 01FEB2026 | 48 | Medium | YES |
| 13 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 01JAN2026 | 650000 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 14 | TP007 | NH30 | Central | 34000 | 5.2 | 70 | 02JAN2026 | 669500 | 45769% | 1 | 01FEB2026 | 48 | Medium | NO |
| 15 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 01JAN2026 | 970000 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 16 | TP008 | NH66 | South | 52000 | 2.7 | 90 | 02JAN2026 | 999100 | 173E3% | 1 | 01FEB2026 | 48 | Very High | NO |
| 17 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 01JAN2026 | 600000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 18 | TP009 | NH52 | North | 31000 | 6.1 | 62 | 02JAN2026 | 618000 | 31508% | 1 | 01FEB2026 | 48 | Medium | NO |
| 19 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 01JAN2026 | 540000 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 20 | TP010 | NH40 | East | 28000 | 7.0 | 58 | 02JAN2026 | 556200 | 23200% | 1 | 01FEB2026 | 48 | Medium | NO |
| 21 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 01JAN2026 | 690000 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 22 | TP011 | NH75 | West | 36000 | 4.8 | 73 | 02JAN2026 | 710700 | 54750% | 1 | 01FEB2026 | 48 | High | NO |
| 23 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 01JAN2026 | 460000 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 24 | TP012 | NH22 | North | 24000 | 8.0 | 50 | 02JAN2026 | 473800 | 15000% | 1 | 01FEB2026 | 48 | Low | YES |
| 25 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 01JAN2026 | 920000 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
| 26 | TP013 | NH13 | South | 49000 | 3.0 | 86 | 02JAN2026 | 947600 | 14E4% | 1 | 01FEB2026 | 48 | High | NO |
Step 2:Transpose with BY
proc transpose data=toll_fraud out=toll_transposed;
by Plaza_ID NotSorted;
var Revenue Vehicles_Per_Day;
run;
proc print data=toll_transposed;
run;
OUTPUT:
| Obs | Plaza_ID | _NAME_ | COL1 | COL2 |
|---|---|---|---|---|
| 1 | TP001 | Revenue | 850000 | 875500 |
| 2 | TP001 | Vehicles_Per_Day | 45000 | 45000 |
| 3 | TP002 | Revenue | 720000 | 741600 |
| 4 | TP002 | Vehicles_Per_Day | 38000 | 38000 |
| 5 | TP003 | Revenue | 910000 | 937300 |
| 6 | TP003 | Vehicles_Per_Day | 51000 | 51000 |
| 7 | TP004 | Revenue | 560000 | 576800 |
| 8 | TP004 | Vehicles_Per_Day | 29000 | 29000 |
| 9 | TP005 | Revenue | 880000 | 906400 |
| 10 | TP005 | Vehicles_Per_Day | 47000 | 47000 |
| 11 | TP006 | Revenue | 490000 | 504700 |
| 12 | TP006 | Vehicles_Per_Day | 26000 | 26000 |
| 13 | TP007 | Revenue | 650000 | 669500 |
| 14 | TP007 | Vehicles_Per_Day | 34000 | 34000 |
| 15 | TP008 | Revenue | 970000 | 999100 |
| 16 | TP008 | Vehicles_Per_Day | 52000 | 52000 |
| 17 | TP009 | Revenue | 600000 | 618000 |
| 18 | TP009 | Vehicles_Per_Day | 31000 | 31000 |
| 19 | TP010 | Revenue | 540000 | 556200 |
| 20 | TP010 | Vehicles_Per_Day | 28000 | 28000 |
| 21 | TP011 | Revenue | 690000 | 710700 |
| 22 | TP011 | Vehicles_Per_Day | 36000 | 36000 |
| 23 | TP012 | Revenue | 460000 | 473800 |
| 24 | TP012 | Vehicles_Per_Day | 24000 | 24000 |
| 25 | TP013 | Revenue | 920000 | 947600 |
| 26 | TP013 | Vehicles_Per_Day | 49000 | 49000 |
16. CHARACTER & NUMERIC FUNCTIONS
data toll_text;
set toll_fraud;
Highway_Clean = propcase(strip(Highway_Name));
Region_Upper = upcase(Region);
Combined = catx('-', Plaza_ID, Region);
run;
proc print data=toll_text;
var Highway_Name Highway_Clean Region Region_Upper Plaza_ID Combined;
run;
OUTPUT:
| Obs | Highway_Name | Highway_Clean | Region | Region_Upper | Plaza_ID | Combined |
|---|---|---|---|---|---|---|
| 1 | NH44 | Nh44 | South | SOUTH | TP001 | TP001-South |
| 2 | NH44 | Nh44 | South | SOUTH | TP001 | TP001-South |
| 3 | NH48 | Nh48 | West | WEST | TP002 | TP002-West |
| 4 | NH48 | Nh48 | West | WEST | TP002 | TP002-West |
| 5 | NH16 | Nh16 | East | EAST | TP003 | TP003-East |
| 6 | NH16 | Nh16 | East | EAST | TP003 | TP003-East |
| 7 | NH65 | Nh65 | South | SOUTH | TP004 | TP004-South |
| 8 | NH65 | Nh65 | South | SOUTH | TP004 | TP004-South |
| 9 | NH19 | Nh19 | North | NORTH | TP005 | TP005-North |
| 10 | NH19 | Nh19 | North | NORTH | TP005 | TP005-North |
| 11 | NH27 | Nh27 | West | WEST | TP006 | TP006-West |
| 12 | NH27 | Nh27 | West | WEST | TP006 | TP006-West |
| 13 | NH30 | Nh30 | Central | CENTRAL | TP007 | TP007-Central |
| 14 | NH30 | Nh30 | Central | CENTRAL | TP007 | TP007-Central |
| 15 | NH66 | Nh66 | South | SOUTH | TP008 | TP008-South |
| 16 | NH66 | Nh66 | South | SOUTH | TP008 | TP008-South |
| 17 | NH52 | Nh52 | North | NORTH | TP009 | TP009-North |
| 18 | NH52 | Nh52 | North | NORTH | TP009 | TP009-North |
| 19 | NH40 | Nh40 | East | EAST | TP010 | TP010-East |
| 20 | NH40 | Nh40 | East | EAST | TP010 | TP010-East |
| 21 | NH75 | Nh75 | West | WEST | TP011 | TP011-West |
| 22 | NH75 | Nh75 | West | WEST | TP011 | TP011-West |
| 23 | NH22 | Nh22 | North | NORTH | TP012 | TP012-North |
| 24 | NH22 | Nh22 | North | NORTH | TP012 | TP012-North |
| 25 | NH13 | Nh13 | South | SOUTH | TP013 | TP013-South |
| 26 | NH13 | Nh13 | South | SOUTH | TP013 | TP013-South |
17. PROC DATASETS DELETE
proc datasets lib=work;
delete toll_raw toll_clean;
quit;
LOG:
Deleting intermediate datasets:
·
Saves memory
·
Reduces confusion
·
Improves workflow hygiene
18. BUSINESS INSIGHTS
·
High cashless ratio improves efficiency
·
Low revenue + high traffic = fraud risk
·
Waiting time is a critical KPI
·
Digital tolling directly impacts
congestion
19. 5 KEY POINTS ABOUT THIS PROJECT
1.This project covers the complete SAS workflow — from raw
data creation and validation to cleaning, ransformation, analysis,
visualization, and reporting. It demonstrates how real-world toll plaza data is
handled in structured enterprise environments.
2. It calculates key performance indicators such as
Vehicles_Per_Day, Revenue, Avg_Wait_Time, Cashless_Ratio, and derives an
Efficiency_Score (%). This helps evaluate how effectively each toll plaza
operates under varying traffic conditions.
3. The project includes rule-based fraud identification
(e.g., high traffic but low revenue scenarios). It simulates how analytics
teams detect potential revenue leakage or operational manipulation in
infrastructure systems.
4. It integrates DATA step, PROC SQL, PROC MEANS, PROC
UNIVARIATE, PROC FREQ, PROC CORR, PROC SGPLOT, APPEND, MERGE, TRANSPOSE, and
macro programming. This shows strong command over Base SAS and advanced
programming concepts.
5. An intentional data-type error is introduced and corrected to demonstrate real-world debugging, validation, and best practices. This strengthens production readiness and audit reliability.
20. CONCLUSION
The Toll Plaza Operations, Revenue,
Efficiency, and Fraud Analytics project demonstrates how SAS can be applied to
real-world infrastructure and financial monitoring systems. By simulating daily
toll plaza data, the project highlights the importance of structured data
management, validation, and analytical modeling. It shows how operational
metrics such as traffic volume, waiting time, and digital payment adoption
directly influence revenue generation and overall efficiency.
The inclusion of fraud detection logic adds
practical business value, as revenue leakage and irregular patterns are
critical concerns in toll management systems. Through procedures like PROC SQL,
PROC MEANS, PROC FREQ, PROC CORR, and visualization tools, the project
transforms raw operational data into meaningful insights.
Overall, this project not only strengthens technical SAS skills but also
builds business understanding, analytical thinking, and data governance
awareness. It is a strong portfolio- demonstrating both programming proficiency
and the ability to translate data into strategic decision-making insights.
INTERVIEW QUESTIONS FOR YOU
·
What is the difference between the DATA
step and PROC step in SAS, and when would you choose one over the other?
·
What is the difference between SET and
MERGE statements in SAS, and how does SAS handle matching observations during a
MERGE with BY variables?
·
How do you debug a SAS program if the
log shows warnings or errors, and what is the importance of checking the SAS
log before validating output?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
·
What is the difference between the DATA
step and PROC step in SAS, and when would you choose one over the other?
·
What is the difference between SET and
MERGE statements in SAS, and how does SAS handle matching observations during a
MERGE with BY variables?
·
How do you debug a SAS program if the
log shows warnings or errors, and what is the importance of checking the SAS
log before validating output?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent TOLL data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment