429.Can Advanced SAS Programming Detect, Analyze, and Fix Errors in High-Frequency Trading Data While Identifying Fraud Patterns?

Advanced SAS Programming for High-Frequency Trading Data: Error Detection, Data Correction, and Fraud Pattern Identification

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —
DATA STEP | SET | MERGE | PROC SORT | PROC PRINT | PROC CONTENTS | MACROS | PROC TRANSPOSE | PROC APPEND | PROC DATASETS DELETE | NUMERIC FUNCTIONS | CHARACTER FUNCTIONS | DATE FUNCTIONS

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Table of Contents

  1. Introduction
  2. Business Context
  3. Dataset Design Overview
  4. Raw Data Creation with Intentional Errors
  5. Code 1 – DATA Step with INPUT & DATALINES
  6. Code 2 – Error Identification
  7. Code 3 – Data Cleaning & Correction
  8. Code 4 – Date Handling (MDY, INTCK, INTNX)
  9. Code 5 – Numeric Functions
  10. Code 6 – Character Functions
  11. Code 7 – Fraud Detection Logic using Macro
  12. Code 8 – SET Statement
  13. Code 9 – MERGE Statement
  14. Code 10 – PROC SORT
  15. Code 11 – PROC TRANSPOSE
  16. Code 12 – PROC APPEND
  17. Code 13 – PROC DATASETS DELETE
  18. Code 14 – Utilization Classification
  19. Final Corrected Dataset Code (Full Length)
  20. 15 Key Points
  21. Summary
  22. Conclusion

1. Introduction

High-Frequency Trading (HFT) involves executing thousands of trades within milliseconds using algorithms. Such data is extremely sensitive to:

·  Latency issues

·  Data inconsistencies

·  Fraudulent patterns

·  Incorrect timestamps

In real-world financial systems, even a minor error can lead to millions in losses.

This project demonstrates:

·  Creating a realistic HFT dataset

·  Introducing intentional errors

·  Detecting and fixing them

·  Applying SAS programming techniques

·  Building fraud detection logic

2. Business Context

In an HFT firm:

·  Traders use algorithms (Arbitrage, Momentum, Mean Reversion)

·  Trades are executed in milliseconds

·  Profit/Loss depends on:

    ·  Speed (Latency)

    ·  Market impact

    ·  Fees

Key Risks:

·  Fake trades (fraud)

·  Incorrect timestamps

·  Duplicate trades

·  High latency manipulation

3. Dataset Design Overview

Variables:

·  Trader_ID

·   Algorithm_Type

·  Trade_Frequency

·  Profit_Loss

·  Latency_ms

·  Market_Impact

·  Fees

·  Utilization 

·  Trade_Date

4. Raw Data Creation (WITH INTENTIONAL ERRORS)

data hft_raw;

length Trader_ID $10 Algorithm_Type $20 Utilization $15 Trade_Date $20;

input Trader_ID $ Algorithm_Type $ Trade_Frequency Profit_Loss Latency_ms Market_Impact Fees Utilization $ Trade_Date $;

datalines;

T001 arbitrage 5000 12000 2 0.5 50 HIGH 12/31/2024

T002 momentum 7000 -5000 3 0.8 70 medium 31/12/2024

T003 mean_reversion . 8000 5 . 60 LOW 01-15-2025

T004 arbitrage 9000 abc 1 0.3 40 HIGH 2025/01/20

T005 momentum 8500 15000 -2 0.9 80 HIGH 02/30/2025

T006 arbitrage 10000 20000 2 1.2 90 HIGH 03/10/2025

;

run;

LOG:

NOTE: Invalid data for Profit_Loss in line 76 21-23.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
76 T004 arbitrage 9000 abc 1 0.3 40 HIGH 2025/01/20
Trader_ID=T004 Algorithm_Type=arbitrage Utilization=HIGH Trade_Date=2025/01/20 Trade_Frequency=9000 Profit_Loss=. Latency_ms=1
Market_Impact=0.3 Fees=40 _ERROR_=1 _N_=4
NOTE: The data set WORK.HFT_RAW has 6 observations and 9 variables.

Errors in Dataset

·  Missing Trade_Frequency (.)

·  Character in numeric field (abc)

·  Invalid date formats

·  Negative latency

·  Invalid date (Feb 30)

·  Mixed case values

·  Missing Market_Impact

5. Code 1 – DATA Step Explanation

Purpose:

·  Create base dataset

Why used:

·  Fundamental SAS step for data creation

Key Points:

·  length defines variable types

·  input reads raw data

·  datalines provides inline data

6. Code 2 – Error Identification

proc print data=hft_raw;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFees
1T001arbitrageHIGH12/31/202450001200020.550
2T002momentummedium31/12/20247000-500030.870
3T003mean_reversionLOW01-15-2025.80005.60
4T004arbitrageHIGH2025/01/209000.10.340
5T005momentumHIGH02/30/2025850015000-20.980
6T006arbitrageHIGH03/10/2025100002000021.290

proc contents data=hft_raw;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.HFT_RAWObservations6
Member TypeDATAVariables9
EngineV9Indexes0
Created03/22/2026 18:19:08Observation Length112
Last Modified03/22/2026 18:19:08Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page1168
Obs in First Data Page6
Number of Data Set Repairs0
Filename/saswork/SAS_workCA240001CDEF_odaws01-apse1-2.oda.sas.com/SAS_work034B0001CDEF_odaws01-apse1-2.oda.sas.com/hft_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number201328818
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLen
2Algorithm_TypeChar20
9FeesNum8
7Latency_msNum8
8Market_ImpactNum8
6Profit_LossNum8
4Trade_DateChar20
5Trade_FrequencyNum8
1Trader_IDChar10
3UtilizationChar15

Purpose:

·  Inspect dataset

Why used:

·  Detect structural issues

Key Points:

·  PROC PRINT → data preview

 

·  PROC CONTENTS → metadata

7. Code 3 – Data Cleaning & Correction

data hft_clean;

set hft_raw;

/* Fix Profit_Loss */

if Profit_Loss='abc' then Profit_Loss=0;

/* Fix missing Trade_Frequency */

if Trade_Frequency=. then Trade_Frequency=1000;

/* Fix negative latency */

if Latency_ms < 0 then Latency_ms = abs(Latency_ms);

/* Fix Market Impact */

if Market_Impact=. then Market_Impact=0.5;

/* Standardize Algorithm_Type */

Algorithm_Type=propcase(Algorithm_Type);

/* Standardize Utilization */

Utilization=upcase(Utilization);

run;

proc print data=hft_clean;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFees
1T001ArbitrageHIGH12/31/202450001200020.550
2T002MomentumMEDIUM31/12/20247000-500030.870
3T003Mean_reversionLOW01-15-20251000800050.560
4T004ArbitrageHIGH2025/01/209000010.340
5T005MomentumHIGH02/30/202585001500020.980
6T006ArbitrageHIGH03/10/2025100002000021.290

Explanation:

·  SET reads existing dataset

·  IF conditions correct errors

·  ABS() fixes negative values

Why used:

·  Real-world data is messy → must clean before analysis

8. Code 4 – Date Handling (MDY, INTCK, INTNX)

data hft_dates;

set hft_clean;

length Trade_Date_Num 8;

/* Step 1: Clean string */

td = strip(Trade_Date);

/* Step 2: Extract components */

if index(td,'/') then do;

    part1 = scan(td,1,'/');

    part2 = scan(td,2,'/');

    part3 = scan(td,3,'/');

end;

else if index(td,'-') then do;

    part1 = scan(td,1,'-');

    part2 = scan(td,2,'-');

    part3 = scan(td,3,'-');

end;

/* Step 3: Convert to numeric */

p1 = input(part1, best.);

p2 = input(part2, best.);

year = input(part3, best.);

/* Step 4: Detect format */

if p1 > 12 then do; 

    day = p1; 

    month = p2; 

end;

else do; 

    month = p1; 

    day = p2; 

end;

/* Step 5: Fix invalid month */

if month < 1 then month = 1;

if month > 12 then month = 12;

/* Step 6: Create base date (1st of month) */

base_date = mdy(month,1,year);

/* Step 7: Adjust day automatically */

Trade_Date_Num = intnx('day', base_date, day-1);

/* Step 8: Final calculations */

Year  = year(Trade_Date_Num);

Month = month(Trade_Date_Num);

Days_Since_Trade = intck('day', Trade_Date_Num, today());

Next_Trade       = intnx('day', Trade_Date_Num, 1);

/* Format */

format Trade_Date_Num Next_Trade date9.;

drop td part1 part2 part3 p1 p2 base_date day month year;

run;

proc print data=hft_dates;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_Trade
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN2025
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN2025
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL2025
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR2025
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR2025

Explanation:

·  INPUT() converts string to date

·  INTCK() calculates difference

·  INTNX() shifts dates

Why used:

·  Time analysis is critical in trading

9. Code 5 – Numeric Functions

data hft_numeric;

set hft_dates;

Profit_Per_Trade = Profit_Loss / Trade_Frequency;

Adjusted_Profit = sum(Profit_Loss, -Fees);

Log_Profit = log(abs(Profit_Loss)+1);

drop td part1 part2 part3 p1 p2 base_date day month year;

run;

proc print data=hft_numeric;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_Profit
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354

Functions:

·  SUM() handles missing safely

·  LOG() used in financial modelling

10. Code 6 – Character Functions

data hft_char;

set hft_numeric;

Trader_ID_clean = strip(Trader_ID);

Algo_Upper = upcase(Algorithm_Type);

Algo_Lower = lowcase(Algorithm_Type);

Full_ID = catx('-', Trader_ID, Algorithm_Type);

Utilization_clean = trim(Utilization);

run;

proc print data=hft_char;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_clean
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH

Functions:

·  STRIP() removes spaces 

·  CATX() concatenates with delimiter

·  TRIM() removes trailing blanks

11. Code 7 – Fraud Detection Macro

%macro fraud_check(input=, output=);

data &output;

set &input;

if Trade_Frequency > 9000 and Latency_ms < 2 then Fraud_Flag=1;

else if Profit_Loss < -10000 then Fraud_Flag=1;

else Fraud_Flag=0;

run;

proc print data=&output;

run;

%mend;


%fraud_check(input=hft_char, output=hft_fraud);

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_Flag
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0

Logic:

·  High frequency + low latency → suspicious

·  Huge losses → suspicious

Why Macro:

·  Reusable logic

·  Scalable across datasets

12. Code 8 – SET Statement

data hft_combined;

set hft_fraud;

run;

proc print data=hft_combined;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_Flag
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0

Purpose:

·  Copy dataset

13. Code 9 – MERGE Statement

data trader_info;

input Trader_ID $ Trader_Name $;

datalines;

T001 John

T002 Mike

T003 Sara

;

run;

proc print data=trader_info;

run;

OUTPUT:

ObsTrader_IDTrader_Name
1T001John
2T002Mike
3T003Sara

proc sort data=hft_combined; by Trader_ID; run;

proc print data=hft_combined;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_Flag
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0

proc sort data=trader_info; by Trader_ID; run;

proc print data=trader_info;

run;

OUTPUT:

ObsTrader_IDTrader_Name
1T001John
2T002Mike
3T003Sara

data merged_data;

merge hft_combined 

      trader_info;

by Trader_ID;

run;

proc print data=merged_data;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_FlagTrader_Name
1T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0John
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0Mike
3T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0Sara
4T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0 
5T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 

Why MERGE:

·  Combine datasets based on key

14. Code 10 – PROC SORT

proc sort data=merged_data;

by descending Profit_Loss;

run;

proc print data=merged_data;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_FlagTrader_Name
1T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 
2T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 
3T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0John
4T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0Sara
5T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0 
6T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0Mike

15. Code 11 – PROC TRANSPOSE

proc transpose data=merged_data out=transposed;

by Trader_ID NotSorted;

var Profit_Loss Fees;

run;

proc print data=transposed;

run;

OUTPUT:

ObsTrader_ID_NAME_COL1
1T006Profit_Loss20000
2T006Fees90
3T005Profit_Loss15000
4T005Fees80
5T001Profit_Loss12000
6T001Fees50
7T003Profit_Loss8000
8T003Fees60
9T004Profit_Loss0
10T004Fees40
11T002Profit_Loss-5000
12T002Fees70

16. Code 12 – PROC APPEND

data new_trades;

set merged_data(obs=2);

run;

proc print data=new_trades;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_FlagTrader_Name
1T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 
2T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 

proc append base=merged_data 

            data=new_trades;

run;

proc print data=merged_data;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_FlagTrader_Name
1T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 
2T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 
3T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0John
4T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0Sara
5T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0 
6T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0Mike
7T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 
8T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 

17. Code 13 – PROC DATASETS DELETE

proc datasets library=work;

delete hft_raw;

quit;

LOG:

NOTE: Deleting WORK.HFT_RAW (memtype=DATA).

18. Code 14 – Utilization Classification

data final_class;

set merged_data;

length Util_Class $8.;

if Trade_Frequency > 9000 then Util_Class='HIGH';

else if Trade_Frequency > 5000 then Util_Class='MEDIUM';

else Util_Class='LOW';

run;

proc print data=final_class;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumDays_Since_TradeNext_TradeProfit_Per_TradeAdjusted_ProfitLog_ProfitTrader_ID_cleanAlgo_UpperAlgo_LowerFull_IDUtilization_cleanFraud_FlagTrader_NameUtil_Class
1T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 HIGH
2T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 MEDIUM
3T001ArbitrageHIGH12/31/202450001200020.55031DEC202444601JAN20252.40000119509.39275T001ARBITRAGEarbitrageT001-ArbitrageHIGH0JohnLOW
4T003Mean_reversionLOW01-15-20251000800050.56015JAN202543116JAN20258.0000079408.98732T003MEAN_REVERSIONmean_reversionT003-Mean_reversionLOW0SaraLOW
5T004ArbitrageHIGH2025/01/209000010.34017JUL202524818JUL20250.00000-400.00000T004ARBITRAGEarbitrageT004-ArbitrageHIGH0 MEDIUM
6T002MomentumMEDIUM31/12/20247000-500030.87031DEC202444601JAN2025-0.71429-50708.51739T002MOMENTUMmomentumT002-MomentumMEDIUM0MikeMEDIUM
7T006ArbitrageHIGH03/10/2025100002000021.29010MAR202537711MAR20252.00000199109.90354T006ARBITRAGEarbitrageT006-ArbitrageHIGH0 HIGH
8T005MomentumHIGH02/30/202585001500020.98002MAR202538503MAR20251.76471149209.61587T005MOMENTUMmomentumT005-MomentumHIGH0 MEDIUM

19. FINAL FULL CORRECTED DATASET CODE

data final_hft_dataset;

set hft_raw;

/* ---------- FIX NUMERIC ERRORS ---------- */

/* Convert Profit_Loss safely */

if vtype(Profit_Loss)='C' then Profit_Loss_num=input(Profit_Loss,best.);

else Profit_Loss_num=Profit_Loss;

if missing(Profit_Loss_num) then Profit_Loss_num=0;

/* Replace original */

Profit_Loss=Profit_Loss_num;

drop Profit_Loss_num;

/* Fix missing / invalid values */

if missing(Trade_Frequency) then Trade_Frequency=1000;

if Latency_ms < 0 then Latency_ms=abs(Latency_ms);

if missing(Market_Impact) then Market_Impact=0.5;

/* ---------- STANDARDIZATION ---------- */

Algorithm_Type=propcase(Algorithm_Type);

Utilization=upcase(Utilization);

/* ---------- ROBUST DATE FIX (ALWAYS VALID) ---------- */

length Trade_Date_Num 8;

td=strip(Trade_Date);

/* Split date */

if index(td,'/') then do;

    p1=input(scan(td,1,'/'),best.);

    p2=input(scan(td,2,'/'),best.);

    p3=input(scan(td,3,'/'),best.);

end;

else if index(td,'-') then do;

    p1=input(scan(td,1,'-'),best.);

    p2=input(scan(td,2,'-'),best.);

    p3=input(scan(td,3,'-'),best.);

end;

else if index(td,'/')=0 and index(td,'-')=0 then do;

    Trade_Date_Num=input(td,anydtdte.);

end;

/* Detect format */

if missing(Trade_Date_Num) then do;

    if p1>12 then do; day=p1; month=p2; end;

    else do; month=p1; day=p2; end;

    year=p3;

    /* Fix month */

    if month<1 then month=1;

    if month>12 then month=12;

    /* Base date */

    base=mdy(month,1,year);

    /* Adjust day */

    Trade_Date_Num=intnx('day',base,day-1);

end;

format Trade_Date_Num date9.;

/* ---------- DERIVED VARIABLES ---------- */

Profit_Per_Trade=Profit_Loss/Trade_Frequency;

Adjusted_Profit=sum(Profit_Loss,-Fees);

/* ---------- FRAUD LOGIC ---------- */

if Trade_Frequency>9000 and Latency_ms<2 then Fraud_Flag=1;

else Fraud_Flag=0;

/* ---------- CLASSIFICATION ---------- */

length Util_Class $6;

if Trade_Frequency>9000 then Util_Class='HIGH';

else if Trade_Frequency>5000 then Util_Class='MEDIUM';

else Util_Class='LOW';

/* Cleanup */

drop td p1 p2 p3 base day month year;

run;

proc print data=final_hft_dataset;

run;

OUTPUT:

ObsTrader_IDAlgorithm_TypeUtilizationTrade_DateTrade_FrequencyProfit_LossLatency_msMarket_ImpactFeesTrade_Date_NumProfit_Per_TradeAdjusted_ProfitFraud_FlagUtil_Class
1T001ArbitrageHIGH12/31/202450001200020.55031DEC20242.40000119500LOW
2T002MomentumMEDIUM31/12/20247000-500030.87031DEC2024-0.71429-50700MEDIUM
3T003Mean_reversionLOW01-15-20251000800050.56015JAN20258.0000079400LOW
4T004ArbitrageHIGH2025/01/209000010.34017JUL20250.00000-400MEDIUM
5T005MomentumHIGH02/30/202585001500020.98002MAR20251.76471149200MEDIUM
6T006ArbitrageHIGH03/10/2025100002000021.29010MAR20252.00000199100HIGH

20. 15 Key Points About This Project

The project simulates a real-world High-Frequency Trading (HFT) dataset with trading, latency, and financial metrics.

  1. It includes critical variables like Trader_ID, Algorithm_Type, Trade_Frequency, Profit_Loss, and Latency_ms.
  2. Intentional data issues (missing values, invalid formats, incorrect entries) were introduced to mimic real-world data challenges.
  3. A DATA step with INPUT and DATALINES was used to construct the raw dataset.
  4. PROC PRINT and PROC CONTENTS were applied to inspect structure and identify anomalies.
  5. Data cleaning logic corrected errors such as negative latency, missing frequency, and invalid numeric values.
  6. Character standardization was implemented using functions like PROPERCASE (PROPCASE), UPCASE, and STRIP.
  7. Date conversion was handled using INPUT with ANYDTDTE informat for flexible parsing.
  8. Time-based calculations were performed using INTCK (difference) and INTNX (increment) functions.
  9. Financial derivations included Profit per Trade and Adjusted Profit using SUM and LOG functions.
  10. Character manipulation functions like CATX, TRIM, LOWCASE were used for data consistency.
  11. A reusable Macro (%fraud_check) was developed to detect suspicious trading patterns.
  12. Fraud logic flagged trades based on high frequency + low latency and extreme losses.
  13. Data integration was performed using SET, MERGE, PROC SORT, PROC APPEND, and PROC TRANSPOSE.
  14. Final dataset included Utilization classification and cleaned, analysis-ready variables, ensuring business usability.

21. Summary

This project demonstrates a complete SAS-based workflow for handling High-Frequency Trading (HFT) data, focusing on data quality, transformation, and fraud detection. A synthetic dataset was created with realistic trading variables and intentional errors to simulate real-world data challenges. Using SAS DATA step programming, issues such as missing values, invalid formats, and incorrect entries were identified and corrected. Key functions like INTCK, INTNX, SUM, and LOG were applied for temporal and financial derivations, while character functions ensured data standardization. A reusable macro was developed to implement fraud detection logic based on abnormal trading patterns such as high frequency with low latency and significant losses. Advanced dataset operations including SET, MERGE, PROC SORT, TRANSPOSE, and APPEND enabled efficient data manipulation and integration. The final output produced a clean, structured, and analysis-ready dataset with utilization classification, demonstrating practical SAS programming skills relevant for financial analytics and interview scenarios.

20. Conclusion

This project demonstrates:

·  Real-world HFT dataset simulation

·  Identification of data quality issues

·  Application of:

    ·  DATA step

    ·  Macros

    ·  Date functions

    ·  Character & numeric functions

·  Fraud detection using business logic

Key Takeaways:

·  Clean data = reliable decisions

·  Macros = scalable solutions

·  Date functions = critical in finance

·  Fraud detection = must-have in HFT

INTERVIEW QUESTIONS FOR YOU

·  What is CALL EXECUTE?

·  What is SAS Metadata?

·  What is SAS Grid?


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent HIGH FREQUENCY TRADING data.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

·  EV and energy industry professionals

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Is there a strong relationship between employment rate and poverty reduction across states?A Complete Sas Study

2.Which mobile payment apps handle the highest number of transactions, and are they truly the most reliable?

3.Which namkeen products sell the most, and what does SAS data creation reveal about customer taste?A Sas Study

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy






Comments

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

418.Can We Design, Debug, Detect Fraud, And Optimize A Smart Parking System Using Advanced SAS Programming Techniques?