429.Can Advanced SAS Programming Detect, Analyze, and Fix Errors in High-Frequency Trading Data While Identifying Fraud Patterns?
Advanced SAS Programming for High-Frequency Trading Data: Error Detection, Data Correction, and Fraud Pattern Identification
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —
DATA STEP | SET | MERGE | PROC SORT | PROC PRINT | PROC CONTENTS | MACROS |
PROC TRANSPOSE | PROC APPEND | PROC DATASETS DELETE | NUMERIC FUNCTIONS |
CHARACTER FUNCTIONS | DATE FUNCTIONS
Table of Contents
- Introduction
- Business Context
- Dataset Design Overview
- Raw Data Creation with
Intentional Errors
- Code 1 – DATA Step with
INPUT & DATALINES
- Code 2 – Error Identification
- Code 3 – Data Cleaning &
Correction
- Code 4 – Date Handling (MDY,
INTCK, INTNX)
- Code 5 – Numeric Functions
- Code 6 – Character Functions
- Code 7 – Fraud Detection
Logic using Macro
- Code 8 – SET Statement
- Code 9 – MERGE Statement
- Code 10 – PROC SORT
- Code 11 – PROC TRANSPOSE
- Code 12 – PROC APPEND
- Code 13 – PROC DATASETS
DELETE
- Code 14 – Utilization
Classification
- Final Corrected Dataset Code
(Full Length)
- 15 Key Points
- Summary
- Conclusion
1. Introduction
High-Frequency
Trading (HFT) involves executing thousands of trades within milliseconds using
algorithms. Such data is extremely sensitive to:
·
Latency issues
·
Data inconsistencies
·
Fraudulent patterns
·
Incorrect timestamps
In
real-world financial systems, even a minor error can lead to millions
in losses.
This
project demonstrates:
·
Creating a realistic HFT dataset
·
Introducing intentional errors
·
Detecting and fixing them
·
Applying SAS programming techniques
·
Building fraud detection logic
2. Business Context
In an HFT
firm:
·
Traders use algorithms (Arbitrage, Momentum, Mean Reversion)
·
Trades are executed in milliseconds
·
Profit/Loss depends on:
· Speed (Latency)
· Market impact
· Fees
Key Risks:
·
Fake trades (fraud)
·
Incorrect timestamps
·
Duplicate trades
·
High latency manipulation
3. Dataset Design Overview
Variables:
· Trader_ID
· Algorithm_Type
· Trade_Frequency
· Profit_Loss
· Latency_ms
· Market_Impact
· Fees
· Utilization
· Trade_Date
4. Raw Data Creation (WITH INTENTIONAL ERRORS)
data hft_raw;
length Trader_ID $10 Algorithm_Type $20 Utilization $15 Trade_Date $20;
input Trader_ID $ Algorithm_Type $ Trade_Frequency Profit_Loss Latency_ms Market_Impact Fees Utilization $ Trade_Date $;
datalines;
T001 arbitrage 5000 12000 2 0.5 50 HIGH 12/31/2024
T002 momentum 7000 -5000 3 0.8 70 medium 31/12/2024
T003 mean_reversion . 8000 5 . 60 LOW 01-15-2025
T004 arbitrage 9000 abc 1 0.3 40 HIGH 2025/01/20
T005 momentum 8500 15000 -2 0.9 80 HIGH 02/30/2025
T006 arbitrage 10000 20000 2 1.2 90 HIGH 03/10/2025
;
run;
LOG:
Errors in Dataset
· Missing
Trade_Frequency (.)
· Character
in numeric field (abc)
· Invalid
date formats
· Negative
latency
· Invalid
date (Feb 30)
· Mixed case
values
· Missing Market_Impact
5. Code 1 – DATA Step Explanation
Purpose:
· Create
base dataset
Why used:
· Fundamental
SAS step for data creation
Key
Points:
· length
defines variable types
· input
reads raw data
· datalines
provides inline data
6. Code 2 – Error Identification
proc print data=hft_raw;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 |
| 2 | T002 | momentum | medium | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 |
| 3 | T003 | mean_reversion | LOW | 01-15-2025 | . | 8000 | 5 | . | 60 |
| 4 | T004 | arbitrage | HIGH | 2025/01/20 | 9000 | . | 1 | 0.3 | 40 |
| 5 | T005 | momentum | HIGH | 02/30/2025 | 8500 | 15000 | -2 | 0.9 | 80 |
| 6 | T006 | arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 |
proc contents data=hft_raw;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.HFT_RAW | Observations | 6 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 03/22/2026 18:19:08 | Observation Length | 112 |
| Last Modified | 03/22/2026 18:19:08 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 1168 |
| Obs in First Data Page | 6 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workCA240001CDEF_odaws01-apse1-2.oda.sas.com/SAS_work034B0001CDEF_odaws01-apse1-2.oda.sas.com/hft_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 201328818 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 2 | Algorithm_Type | Char | 20 |
| 9 | Fees | Num | 8 |
| 7 | Latency_ms | Num | 8 |
| 8 | Market_Impact | Num | 8 |
| 6 | Profit_Loss | Num | 8 |
| 4 | Trade_Date | Char | 20 |
| 5 | Trade_Frequency | Num | 8 |
| 1 | Trader_ID | Char | 10 |
| 3 | Utilization | Char | 15 |
Purpose:
· Inspect dataset
Why used:
· Detect structural
issues
Key
Points:
· PROC PRINT
→ data preview
· PROC CONTENTS → metadata
7. Code 3 – Data Cleaning & Correction
data hft_clean;
set hft_raw;
/* Fix Profit_Loss */
if Profit_Loss='abc' then Profit_Loss=0;
/* Fix missing Trade_Frequency */
if Trade_Frequency=. then Trade_Frequency=1000;
/* Fix negative latency */
if Latency_ms < 0 then Latency_ms = abs(Latency_ms);
/* Fix Market Impact */
if Market_Impact=. then Market_Impact=0.5;
/* Standardize Algorithm_Type */
Algorithm_Type=propcase(Algorithm_Type);
/* Standardize Utilization */
Utilization=upcase(Utilization);
run;
proc print data=hft_clean;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 |
Explanation:
· SET
reads existing dataset
· IF
conditions correct errors
· ABS()
fixes negative values
Why used:
· Real-world data is
messy → must clean before analysis
8. Code 4 – Date Handling (MDY, INTCK, INTNX)
data hft_dates;
set hft_clean;
length Trade_Date_Num 8;
/* Step 1: Clean string */
td = strip(Trade_Date);
/* Step 2: Extract components */
if index(td,'/') then do;
part1 = scan(td,1,'/');
part2 = scan(td,2,'/');
part3 = scan(td,3,'/');
end;
else if index(td,'-') then do;
part1 = scan(td,1,'-');
part2 = scan(td,2,'-');
part3 = scan(td,3,'-');
end;
/* Step 3: Convert to numeric */
p1 = input(part1, best.);
p2 = input(part2, best.);
year = input(part3, best.);
/* Step 4: Detect format */
if p1 > 12 then do;
day = p1;
month = p2;
end;
else do;
month = p1;
day = p2;
end;
/* Step 5: Fix invalid month */
if month < 1 then month = 1;
if month > 12 then month = 12;
/* Step 6: Create base date (1st of month) */
base_date = mdy(month,1,year);
/* Step 7: Adjust day automatically */
Trade_Date_Num = intnx('day', base_date, day-1);
/* Step 8: Final calculations */
Year = year(Trade_Date_Num);
Month = month(Trade_Date_Num);
Days_Since_Trade = intck('day', Trade_Date_Num, today());
Next_Trade = intnx('day', Trade_Date_Num, 1);
/* Format */
format Trade_Date_Num Next_Trade date9.;
drop td part1 part2 part3 p1 p2 base_date day month year;
run;
proc print data=hft_dates;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 |
Explanation:
· INPUT() converts string to date
· INTCK() calculates difference
· INTNX() shifts dates
Why used:
· Time analysis is critical
in trading
9. Code 5 – Numeric Functions
data hft_numeric;
set hft_dates;
Profit_Per_Trade = Profit_Loss / Trade_Frequency;
Adjusted_Profit = sum(Profit_Loss, -Fees);
Log_Profit = log(abs(Profit_Loss)+1);
drop td part1 part2 part3 p1 p2 base_date day month year;
run;
proc print data=hft_numeric;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 |
Functions:
· SUM() handles missing safely
· LOG() used in financial modelling
10. Code 6 – Character Functions
data hft_char;
set hft_numeric;
Trader_ID_clean = strip(Trader_ID);
Algo_Upper = upcase(Algorithm_Type);
Algo_Lower = lowcase(Algorithm_Type);
Full_ID = catx('-', Trader_ID, Algorithm_Type);
Utilization_clean = trim(Utilization);
run;
proc print data=hft_char;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH |
Functions:
· STRIP() removes spaces
· CATX() concatenates with delimiter
· TRIM() removes trailing blanks
11. Code 7 – Fraud Detection Macro
%macro fraud_check(input=, output=);
data &output;
set &input;
if Trade_Frequency > 9000 and Latency_ms < 2 then Fraud_Flag=1;
else if Profit_Loss < -10000 then Fraud_Flag=1;
else Fraud_Flag=0;
run;
proc print data=&output;
run;
%mend;
%fraud_check(input=hft_char, output=hft_fraud);
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 |
Logic:
·
High frequency + low latency → suspicious
·
Huge losses → suspicious
Why Macro:
·
Reusable
logic
·
Scalable
across datasets
12. Code 8 – SET Statement
data hft_combined;
set hft_fraud;
run;
proc print data=hft_combined;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 |
Purpose:
· Copy
dataset
13. Code 9 – MERGE Statement
data trader_info;
input Trader_ID $ Trader_Name $;
datalines;
T001 John
T002 Mike
T003 Sara
;
run;
proc print data=trader_info;
run;
OUTPUT:
| Obs | Trader_ID | Trader_Name |
|---|---|---|
| 1 | T001 | John |
| 2 | T002 | Mike |
| 3 | T003 | Sara |
proc sort data=hft_combined; by Trader_ID; run;
proc print data=hft_combined;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 |
proc sort data=trader_info; by Trader_ID; run;
proc print data=trader_info;
run;
OUTPUT:
| Obs | Trader_ID | Trader_Name |
|---|---|---|
| 1 | T001 | John |
| 2 | T002 | Mike |
| 3 | T003 | Sara |
data merged_data;
merge hft_combined
trader_info;
by Trader_ID;
run;
proc print data=merged_data;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag | Trader_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 | John |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 | Mike |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 | Sara |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 | |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 | |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 |
Why MERGE:
· Combine
datasets based on key
14. Code 10 – PROC SORT
proc sort data=merged_data;
by descending Profit_Loss;
run;
proc print data=merged_data;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag | Trader_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 | |
| 2 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 | |
| 3 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 | John |
| 4 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 | Sara |
| 5 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 | |
| 6 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 | Mike |
15. Code 11 – PROC TRANSPOSE
proc transpose data=merged_data out=transposed;
by Trader_ID NotSorted;
var Profit_Loss Fees;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | Trader_ID | _NAME_ | COL1 |
|---|---|---|---|
| 1 | T006 | Profit_Loss | 20000 |
| 2 | T006 | Fees | 90 |
| 3 | T005 | Profit_Loss | 15000 |
| 4 | T005 | Fees | 80 |
| 5 | T001 | Profit_Loss | 12000 |
| 6 | T001 | Fees | 50 |
| 7 | T003 | Profit_Loss | 8000 |
| 8 | T003 | Fees | 60 |
| 9 | T004 | Profit_Loss | 0 |
| 10 | T004 | Fees | 40 |
| 11 | T002 | Profit_Loss | -5000 |
| 12 | T002 | Fees | 70 |
16. Code 12 – PROC APPEND
data new_trades;
set merged_data(obs=2);
run;
proc print data=new_trades;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag | Trader_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 | |
| 2 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 |
proc append base=merged_data
data=new_trades;
run;
proc print data=merged_data;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag | Trader_Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 | |
| 2 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 | |
| 3 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 | John |
| 4 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 | Sara |
| 5 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 | |
| 6 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 | Mike |
| 7 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 | |
| 8 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 |
17. Code 13 – PROC DATASETS DELETE
proc datasets library=work;
delete hft_raw;
quit;
LOG:
NOTE: Deleting WORK.HFT_RAW (memtype=DATA).
18. Code 14 – Utilization Classification
data final_class;
set merged_data;
length Util_Class $8.;
if Trade_Frequency > 9000 then Util_Class='HIGH';
else if Trade_Frequency > 5000 then Util_Class='MEDIUM';
else Util_Class='LOW';
run;
proc print data=final_class;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Days_Since_Trade | Next_Trade | Profit_Per_Trade | Adjusted_Profit | Log_Profit | Trader_ID_clean | Algo_Upper | Algo_Lower | Full_ID | Utilization_clean | Fraud_Flag | Trader_Name | Util_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 | HIGH | |
| 2 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 | MEDIUM | |
| 3 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 446 | 01JAN2025 | 2.40000 | 11950 | 9.39275 | T001 | ARBITRAGE | arbitrage | T001-Arbitrage | HIGH | 0 | John | LOW |
| 4 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 431 | 16JAN2025 | 8.00000 | 7940 | 8.98732 | T003 | MEAN_REVERSION | mean_reversion | T003-Mean_reversion | LOW | 0 | Sara | LOW |
| 5 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 248 | 18JUL2025 | 0.00000 | -40 | 0.00000 | T004 | ARBITRAGE | arbitrage | T004-Arbitrage | HIGH | 0 | MEDIUM | |
| 6 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | 446 | 01JAN2025 | -0.71429 | -5070 | 8.51739 | T002 | MOMENTUM | momentum | T002-Momentum | MEDIUM | 0 | Mike | MEDIUM |
| 7 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 377 | 11MAR2025 | 2.00000 | 19910 | 9.90354 | T006 | ARBITRAGE | arbitrage | T006-Arbitrage | HIGH | 0 | HIGH | |
| 8 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 385 | 03MAR2025 | 1.76471 | 14920 | 9.61587 | T005 | MOMENTUM | momentum | T005-Momentum | HIGH | 0 | MEDIUM |
19. FINAL FULL CORRECTED DATASET CODE
data final_hft_dataset;
set hft_raw;
/* ---------- FIX NUMERIC ERRORS ---------- */
/* Convert Profit_Loss safely */
if vtype(Profit_Loss)='C' then Profit_Loss_num=input(Profit_Loss,best.);
else Profit_Loss_num=Profit_Loss;
if missing(Profit_Loss_num) then Profit_Loss_num=0;
/* Replace original */
Profit_Loss=Profit_Loss_num;
drop Profit_Loss_num;
/* Fix missing / invalid values */
if missing(Trade_Frequency) then Trade_Frequency=1000;
if Latency_ms < 0 then Latency_ms=abs(Latency_ms);
if missing(Market_Impact) then Market_Impact=0.5;
/* ---------- STANDARDIZATION ---------- */
Algorithm_Type=propcase(Algorithm_Type);
Utilization=upcase(Utilization);
/* ---------- ROBUST DATE FIX (ALWAYS VALID) ---------- */
length Trade_Date_Num 8;
td=strip(Trade_Date);
/* Split date */
if index(td,'/') then do;
p1=input(scan(td,1,'/'),best.);
p2=input(scan(td,2,'/'),best.);
p3=input(scan(td,3,'/'),best.);
end;
else if index(td,'-') then do;
p1=input(scan(td,1,'-'),best.);
p2=input(scan(td,2,'-'),best.);
p3=input(scan(td,3,'-'),best.);
end;
else if index(td,'/')=0 and index(td,'-')=0 then do;
Trade_Date_Num=input(td,anydtdte.);
end;
/* Detect format */
if missing(Trade_Date_Num) then do;
if p1>12 then do; day=p1; month=p2; end;
else do; month=p1; day=p2; end;
year=p3;
/* Fix month */
if month<1 then month=1;
if month>12 then month=12;
/* Base date */
base=mdy(month,1,year);
/* Adjust day */
Trade_Date_Num=intnx('day',base,day-1);
end;
format Trade_Date_Num date9.;
/* ---------- DERIVED VARIABLES ---------- */
Profit_Per_Trade=Profit_Loss/Trade_Frequency;
Adjusted_Profit=sum(Profit_Loss,-Fees);
/* ---------- FRAUD LOGIC ---------- */
if Trade_Frequency>9000 and Latency_ms<2 then Fraud_Flag=1;
else Fraud_Flag=0;
/* ---------- CLASSIFICATION ---------- */
length Util_Class $6;
if Trade_Frequency>9000 then Util_Class='HIGH';
else if Trade_Frequency>5000 then Util_Class='MEDIUM';
else Util_Class='LOW';
/* Cleanup */
drop td p1 p2 p3 base day month year;
run;
proc print data=final_hft_dataset;
run;
OUTPUT:
| Obs | Trader_ID | Algorithm_Type | Utilization | Trade_Date | Trade_Frequency | Profit_Loss | Latency_ms | Market_Impact | Fees | Trade_Date_Num | Profit_Per_Trade | Adjusted_Profit | Fraud_Flag | Util_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Arbitrage | HIGH | 12/31/2024 | 5000 | 12000 | 2 | 0.5 | 50 | 31DEC2024 | 2.40000 | 11950 | 0 | LOW |
| 2 | T002 | Momentum | MEDIUM | 31/12/2024 | 7000 | -5000 | 3 | 0.8 | 70 | 31DEC2024 | -0.71429 | -5070 | 0 | MEDIUM |
| 3 | T003 | Mean_reversion | LOW | 01-15-2025 | 1000 | 8000 | 5 | 0.5 | 60 | 15JAN2025 | 8.00000 | 7940 | 0 | LOW |
| 4 | T004 | Arbitrage | HIGH | 2025/01/20 | 9000 | 0 | 1 | 0.3 | 40 | 17JUL2025 | 0.00000 | -40 | 0 | MEDIUM |
| 5 | T005 | Momentum | HIGH | 02/30/2025 | 8500 | 15000 | 2 | 0.9 | 80 | 02MAR2025 | 1.76471 | 14920 | 0 | MEDIUM |
| 6 | T006 | Arbitrage | HIGH | 03/10/2025 | 10000 | 20000 | 2 | 1.2 | 90 | 10MAR2025 | 2.00000 | 19910 | 0 | HIGH |
20. 15 Key Points About This Project
The project simulates a real-world High-Frequency
Trading (HFT) dataset with
trading, latency, and financial metrics.
- It includes critical
variables like Trader_ID,
Algorithm_Type, Trade_Frequency, Profit_Loss, and Latency_ms.
- Intentional
data issues (missing values, invalid formats, incorrect entries) were
introduced to mimic real-world data challenges.
- A DATA step with INPUT and DATALINES
was used to construct the raw dataset.
- PROC PRINT and
PROC CONTENTS were applied to inspect structure and identify
anomalies.
- Data cleaning logic
corrected errors such as negative
latency, missing frequency, and invalid numeric values.
- Character standardization was
implemented using functions like PROPERCASE (PROPCASE), UPCASE, and STRIP.
- Date conversion was
handled using INPUT with ANYDTDTE informat for flexible parsing.
- Time-based
calculations were performed using INTCK
(difference) and INTNX (increment) functions.
- Financial
derivations included Profit per
Trade and Adjusted Profit using SUM and LOG functions.
- Character
manipulation functions like CATX,
TRIM, LOWCASE were used for data consistency.
- A reusable Macro (%fraud_check) was developed
to detect suspicious trading patterns.
- Fraud logic
flagged trades based on high
frequency + low latency and extreme losses.
- Data
integration was performed using SET,
MERGE, PROC SORT, PROC APPEND, and PROC TRANSPOSE.
- Final
dataset included Utilization
classification and cleaned, analysis-ready variables, ensuring
business usability.
21. Summary
This project demonstrates a complete SAS-based workflow for handling High-Frequency Trading (HFT) data, focusing on data quality, transformation, and fraud detection. A synthetic dataset was created with realistic trading variables and intentional errors to simulate real-world data challenges. Using SAS DATA step programming, issues such as missing values, invalid formats, and incorrect entries were identified and corrected. Key functions like INTCK, INTNX, SUM, and LOG were applied for temporal and financial derivations, while character functions ensured data standardization. A reusable macro was developed to implement fraud detection logic based on abnormal trading patterns such as high frequency with low latency and significant losses. Advanced dataset operations including SET, MERGE, PROC SORT, TRANSPOSE, and APPEND enabled efficient data manipulation and integration. The final output produced a clean, structured, and analysis-ready dataset with utilization classification, demonstrating practical SAS programming skills relevant for financial analytics and interview scenarios.
20. Conclusion
This project demonstrates:
·
Real-world HFT dataset simulation
·
Identification of data quality issues
·
Application of:
· DATA
step
· Macros
· Date
functions
· Character
& numeric functions
·
Fraud detection using business logic
Key Takeaways:
·
Clean data = reliable decisions
·
Macros = scalable solutions
·
Date functions = critical in finance
·
Fraud detection = must-have in HFT
INTERVIEW QUESTIONS FOR YOU
· What is CALL EXECUTE?
· What is SAS Metadata?
· What is SAS Grid?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
· What is CALL EXECUTE?
· What is SAS Metadata?
· What is SAS Grid?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent HIGH FREQUENCY TRADING data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
· EV and energy industry professionals
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Follow Us On :
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--->Follow our blog for more SAS-based analytics projects and industry data models.
---> Support Us By Following Our Blog..
To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment