Railway Data Wars: Converting Broken Global Train Operations into Analysis-Ready Business Intelligence with SAS and R

Engineering World Train System Intelligence into Analysis-Ready SAS and R Pipelines for Enterprise-Grade Reporting Excellence

Introduction

Modern analytics teams often assume transportation data is “structured enough” for dashboards and AI modeling. In reality, enterprise train-system datasets are frequently chaotic. One global transportation analytics company recently discovered that multiple railway operators across Europe and Asia were reporting duplicate train IDs, corrupted timestamps, invalid maintenance categories, negative ticket revenue values, and inconsistent station region codes.

The consequences became catastrophic:

  • Executive dashboards showed impossible passenger growth.
  • Predictive maintenance AI models classified damaged engines as “low risk.”
  • Financial systems overestimated profitability due to duplicate ticket entries.
  • Regulatory transport audits failed because timestamps violated ISO standards.
  • Operational planning systems routed trains using invalid location mappings.

This is exactly where professional SAS and R data engineering becomes mission-critical.

In this project, we will build a complete enterprise-grade workflow for cleaning and transforming corrupted World Train Systems operational datasets into reliable analytical intelligence using both SAS and R.

Business Crisis Scenario: When Dirty Train Data Derails Enterprise Decisions

Imagine a multinational railway consortium managing bullet trains, metro systems, freight corridors, and passenger railways across India, Japan, Germany, France, and the United States.

A quarterly executive report suddenly shows:

  • 35% passenger increase in inactive regions
  • Negative maintenance costs
  • Missing inspection dates
  • Invalid route categories
  • Duplicate train transaction IDs
  • Corrupted conductor emails
  • Impossible train speeds
  • Mixed currency values
  • NULL region identifiers

Executives initially celebrate growth.

Later, auditors discover the issue was caused entirely by corrupted operational data.

This is not hypothetical.

In real-world analytics environments, dirty data damages:

  • AI forecasting
  • Operational KPIs
  • Clinical-style validation workflows
  • Executive dashboards
  • Statistical modeling
  • Regulatory reporting
  • Resource planning
  • Predictive maintenance systems

The lesson is simple:

“Dirty data creates expensive business mistakes.”

Raw Corrupted Train Systems Dataset 

Below is a deliberately corrupted enterprise-style railway dataset.

SAS Raw Dataset Creation

data world_train_raw;

length Train_ID $12 Country $20 Train_Type $25 Passenger_Count $20

Revenue $20 Journey_Date $15 Engineer_Email $20 Region_Code $10 

Maintenance_Category $20 Delay_Code $12;

infile datalines dlm='|' truncover;

input Train_ID $ Country $ Train_Type $ Passenger_Count $

Revenue $ Journey_Date $ Engineer_Email $ Region_Code $

Maintenance_Category $ Delay_Code $ Avg_Speed;

datalines;

TR001|india|Bullet Train|450|250000|12JAN2025|engine1@gmail.com| in01 |Critical|D01|320

TR001|INDIA|bullet train|450|-250000|31FEB2025|wrongmail.com|IN01|critical|D01|-320

TR002|Japan|Metro Express|.|180000|15MAR2025|metro.jp@gmail.com|JP_01|Medium|D02|210

TR003|Germany|Freight Rail|700|abc500|18APR2025|freight@rail|DE01|LOW|NULL|150

TR004|France|Passenger Rail|9999|550000|NULL|france_rail@gmail.com| FR01|High|D04|500

TR005|India|Cargo Line|-25|750000|22MAY2025|cargo@gmail|IN02|Medium|D02|110

TR006|USA|Metro Rail|380|450000|01JUN2025|usa.metro@gmail.com|US01|critical|D03|95

TR007|India|Passenger Rail|420|520000|15JUN2025|NULL|IN 03|Unknown|DXX|130

TR008|Japan|Bullet Train|510|870000|18JUL2025|bullet@japan.com|JP01|Critical|D01|340

TR009|Germany|Metro Rail|0|320000|29FEB2023|metro.de@gmail.com|DE_02|LOW|D02|85

TR010|France|Cargo Line|250|650000|05AUG2025|cargo.fr@gmail.com|FR02|Medium|D03|75

TR011|India|Metro Rail|390|NULL|12SEP2025|metro.india@gmail.com|IN01|HIGH|D01|102

TR012|USA|Passenger Rail|480|950000|15OCT2025|passenger.us@gmail.com|US02|Low|D04|145

TR013|Japan|Freight Rail|300|340000|31NOV2025|freight.jp@gmail.com|JP02|Medium|D03|88

TR014|Germany|Bullet Train|520|1200000|11DEC2025|bullet.de@gmail|DE03|Critical|D01|355

TR015|India|Metro Express|410|470000|07JAN2025| express@gmail.com |IN04|Medium|D02|118

TR016|France|Passenger Rail|NULL|610000|15FEB2025|rail.fr@gmail.com|FR03|LOW|D05|140

TR017|USA|Cargo Line|290|-990000|18MAR2025|cargo.us@gmail.com|US03|Critical|D01|92

TR018|Japan|Metro Rail|360|430000|20APR2025|metro.jp2@gmail.com|JP03|Medium|D02|98

TR019|India|Bullet Train|540|1500000|25MAY2025|india.bullet@gmail.com|IN05|Critical|D01|360

TR020|Germany|Freight Rail|610|880000|10JUN2025|germany.freight@gmail.com|DE04|High|D03|78

;

run;

proc print data = world_train_raw;

run;

OUTPUT:

ObsTrain_IDCountryTrain_TypePassenger_CountRevenueJourney_DateEngineer_EmailRegion_CodeMaintenance_CategoryDelay_CodeAvg_Speed
1TR001indiaBullet Train45025000012JAN2025engine1@gmail.comin01CriticalD01320
2TR001INDIAbullet train450-25000031FEB2025wrongmail.comIN01criticalD01-320
3TR002JapanMetro Express 18000015MAR2025metro.jp@gmail.comJP_01MediumD02210
4TR003GermanyFreight Rail700abc50018APR2025freight@railDE01LOWNULL150
5TR004FrancePassenger Rail9999550000NULLfrance_rail@gmail.coFR01HighD04500
6TR005IndiaCargo Line-2575000022MAY2025cargo@gmailIN02MediumD02110
7TR006USAMetro Rail38045000001JUN2025usa.metro@gmail.comUS01criticalD0395
8TR007IndiaPassenger Rail42052000015JUN2025NULLIN 03UnknownDXX130
9TR008JapanBullet Train51087000018JUL2025bullet@japan.comJP01CriticalD01340
10TR009GermanyMetro Rail032000029FEB2023metro.de@gmail.comDE_02LOWD0285
11TR010FranceCargo Line25065000005AUG2025cargo.fr@gmail.comFR02MediumD0375
12TR011IndiaMetro Rail390NULL12SEP2025metro.india@gmail.coIN01HIGHD01102
13TR012USAPassenger Rail48095000015OCT2025passenger.us@gmail.cUS02LowD04145
14TR013JapanFreight Rail30034000031NOV2025freight.jp@gmail.comJP02MediumD0388
15TR014GermanyBullet Train520120000011DEC2025bullet.de@gmailDE03CriticalD01355
16TR015IndiaMetro Express41047000007JAN2025express@gmail.comIN04MediumD02118
17TR016FrancePassenger RailNULL61000015FEB2025rail.fr@gmail.comFR03LOWD05140
18TR017USACargo Line290-99000018MAR2025cargo.us@gmail.comUS03CriticalD0192
19TR018JapanMetro Rail36043000020APR2025metro.jp2@gmail.comJP03MediumD0298
20TR019IndiaBullet Train540150000025MAY2025india.bullet@gmail.cIN05CriticalD01360
21TR020GermanyFreight Rail61088000010JUN2025germany.freight@gmaiDE04HighD0378

Why LENGTH Statements Matter in SAS

One of the most misunderstood production risks in SAS is character truncation.

If you assign:

Country="United States of America";

before defining:

length Country $20;

SAS may permanently truncate values based on the first assignment length.

This becomes extremely dangerous in:

  • SDTM domains
  • ADaM derivations
  • Regulatory submissions
  • Train routing systems
  • Insurance policy mappings

In R, character vectors dynamically resize, making truncation less dangerous. SAS, however, allocates memory earlier in compilation.

That is why professional SAS programmers place LENGTH statements BEFORE assignments.

Enterprise SAS Cleaning Workflow

1.Standardization Layer

data train_clean_stage1;

retain Data_Source "GLOBAL_TRAIN_SYSTEM";

set world_train_raw;

length Standard_Country $25;

Standard_Country = propcase(strip(lowcase(Country)));

Train_Type = propcase(compbl(strip(Train_Type)));

Passenger_Count_Num = input(strip(Passenger_Count),12.);

Passenger_Count_Num = abs(Passenger_Count_Num);

Region_Code = compress(upcase(Region_Code));

Maintenance_Category =propcase(strip(Maintenance_Category));

Journey_Date_Num = input(Journey_Date,anydtdte15.);

format Journey_Date_Num date9.;

Engineer_Email =lowcase(strip(Engineer_Email));

if Engineer_Email='null' then Engineer_Email='';

Revenue_Num = input(compress(put(Revenue,$20.),,'kd'),8.);

Revenue_Num = abs(Revenue_Num);

Avg_Speed = abs(Avg_Speed);

Passenger_Count = abs(Passenger_Count);

drop Passenger_Count Revenue Journey_Date;

rename Revenue_Num = Revenue

       Passenger_Count_Num = Passenger_Count

       Journey_Date_Num = Journey_Date;

run;

proc print data = train_clean_stage1;

run;

OUTPUT:

ObsData_SourceTrain_IDCountryTrain_TypeEngineer_EmailRegion_CodeMaintenance_CategoryDelay_CodeAvg_SpeedStandard_CountryPassenger_CountJourney_DateRevenue
1GLOBAL_TRAIN_SYSTEMTR001indiaBullet Trainengine1@gmail.comIN01CriticalD01320India45012JAN2025250000
2GLOBAL_TRAIN_SYSTEMTR001INDIABullet Trainwrongmail.comIN01CriticalD01320India450.250000
3GLOBAL_TRAIN_SYSTEMTR002JapanMetro Expressmetro.jp@gmail.comJP_01MediumD02210Japan.15MAR2025180000
4GLOBAL_TRAIN_SYSTEMTR003GermanyFreight Railfreight@railDE01LowNULL150Germany70018APR2025500
5GLOBAL_TRAIN_SYSTEMTR004FrancePassenger Railfrance_rail@gmail.coFR01HighD04500France9999.550000
6GLOBAL_TRAIN_SYSTEMTR005IndiaCargo Linecargo@gmailIN02MediumD02110India2522MAY2025750000
7GLOBAL_TRAIN_SYSTEMTR006USAMetro Railusa.metro@gmail.comUS01CriticalD0395Usa38001JUN2025450000
8GLOBAL_TRAIN_SYSTEMTR007IndiaPassenger Rail IN03UnknownDXX130India42015JUN2025520000
9GLOBAL_TRAIN_SYSTEMTR008JapanBullet Trainbullet@japan.comJP01CriticalD01340Japan51018JUL2025870000
10GLOBAL_TRAIN_SYSTEMTR009GermanyMetro Railmetro.de@gmail.comDE_02LowD0285Germany0.320000
11GLOBAL_TRAIN_SYSTEMTR010FranceCargo Linecargo.fr@gmail.comFR02MediumD0375France25005AUG2025650000
12GLOBAL_TRAIN_SYSTEMTR011IndiaMetro Railmetro.india@gmail.coIN01HighD01102India39012SEP2025.
13GLOBAL_TRAIN_SYSTEMTR012USAPassenger Railpassenger.us@gmail.cUS02LowD04145Usa48015OCT2025950000
14GLOBAL_TRAIN_SYSTEMTR013JapanFreight Railfreight.jp@gmail.comJP02MediumD0388Japan300.340000
15GLOBAL_TRAIN_SYSTEMTR014GermanyBullet Trainbullet.de@gmailDE03CriticalD01355Germany52011DEC20251200000
16GLOBAL_TRAIN_SYSTEMTR015IndiaMetro Expressexpress@gmail.comIN04MediumD02118India41007JAN2025470000
17GLOBAL_TRAIN_SYSTEMTR016FrancePassenger Railrail.fr@gmail.comFR03LowD05140France.15FEB2025610000
18GLOBAL_TRAIN_SYSTEMTR017USACargo Linecargo.us@gmail.comUS03CriticalD0192Usa29018MAR2025990000
19GLOBAL_TRAIN_SYSTEMTR018JapanMetro Railmetro.jp2@gmail.comJP03MediumD0298Japan36020APR2025430000
20GLOBAL_TRAIN_SYSTEMTR019IndiaBullet Trainindia.bullet@gmail.cIN05CriticalD01360India54025MAY20251500000
21GLOBAL_TRAIN_SYSTEMTR020GermanyFreight Railgermany.freight@gmaiDE04HighD0378Germany61010JUN2025880000

Explanation

This step performs enterprise normalization. PROPCASE, LOWCASE, and COMPBL standardize inconsistent text formatting. COMPRESS removes whitespace corruption in region codes. INPUT converts mixed character/numeric revenue fields into valid numeric values. ABS corrects negative operational metrics. This stage mirrors real-world SDTM standardization pipelines where operational inconsistencies must be corrected before downstream derivations. In R, equivalent transformations are handled using mutate(), str_trim(), and parse_number(). The key principle is reproducibility. Standardized variables create stable reporting logic, validated outputs, and reliable AI training datasets.

2.Date Validation Using INTCK and INPUT

data train_clean_stage2;

set train_clean_stage1;

format Parsed_Date yymmdd10.;

Parsed_Date = Journey_Date;

if missing(Parsed_Date) then Date_Flag='INVALID_DATE';

else Date_Flag='VALID_DATE';

Train_Age_Months =intck('month',Parsed_Date,today());

if Avg_Speed > 350 then Speed_Flag='EXTREME_SPEED';

else Speed_Flag='NORMAL_SPEED';

run;

proc print data = train_clean_stage2;

run;

OUTPUT:

ObsData_SourceTrain_IDCountryTrain_TypeEngineer_EmailRegion_CodeMaintenance_CategoryDelay_CodeAvg_SpeedStandard_CountryPassenger_CountJourney_DateRevenueParsed_DateDate_FlagTrain_Age_MonthsSpeed_Flag
1GLOBAL_TRAIN_SYSTEMTR001indiaBullet Trainengine1@gmail.comIN01CriticalD01320India45012JAN20252500002025-01-12VALID_DATE17NORMAL_SPEED
2GLOBAL_TRAIN_SYSTEMTR001INDIABullet Trainwrongmail.comIN01CriticalD01320India450.250000.INVALID_DATE.NORMAL_SPEED
3GLOBAL_TRAIN_SYSTEMTR002JapanMetro Expressmetro.jp@gmail.comJP_01MediumD02210Japan.15MAR20251800002025-03-15VALID_DATE15NORMAL_SPEED
4GLOBAL_TRAIN_SYSTEMTR003GermanyFreight Railfreight@railDE01LowNULL150Germany70018APR20255002025-04-18VALID_DATE14NORMAL_SPEED
5GLOBAL_TRAIN_SYSTEMTR004FrancePassenger Railfrance_rail@gmail.coFR01HighD04500France9999.550000.INVALID_DATE.EXTREME_SPEED
6GLOBAL_TRAIN_SYSTEMTR005IndiaCargo Linecargo@gmailIN02MediumD02110India2522MAY20257500002025-05-22VALID_DATE13NORMAL_SPEED
7GLOBAL_TRAIN_SYSTEMTR006USAMetro Railusa.metro@gmail.comUS01CriticalD0395Usa38001JUN20254500002025-06-01VALID_DATE12NORMAL_SPEED
8GLOBAL_TRAIN_SYSTEMTR007IndiaPassenger Rail IN03UnknownDXX130India42015JUN20255200002025-06-15VALID_DATE12NORMAL_SPEED
9GLOBAL_TRAIN_SYSTEMTR008JapanBullet Trainbullet@japan.comJP01CriticalD01340Japan51018JUL20258700002025-07-18VALID_DATE11NORMAL_SPEED
10GLOBAL_TRAIN_SYSTEMTR009GermanyMetro Railmetro.de@gmail.comDE_02LowD0285Germany0.320000.INVALID_DATE.NORMAL_SPEED
11GLOBAL_TRAIN_SYSTEMTR010FranceCargo Linecargo.fr@gmail.comFR02MediumD0375France25005AUG20256500002025-08-05VALID_DATE10NORMAL_SPEED
12GLOBAL_TRAIN_SYSTEMTR011IndiaMetro Railmetro.india@gmail.coIN01HighD01102India39012SEP2025.2025-09-12VALID_DATE9NORMAL_SPEED
13GLOBAL_TRAIN_SYSTEMTR012USAPassenger Railpassenger.us@gmail.cUS02LowD04145Usa48015OCT20259500002025-10-15VALID_DATE8NORMAL_SPEED
14GLOBAL_TRAIN_SYSTEMTR013JapanFreight Railfreight.jp@gmail.comJP02MediumD0388Japan300.340000.INVALID_DATE.NORMAL_SPEED
15GLOBAL_TRAIN_SYSTEMTR014GermanyBullet Trainbullet.de@gmailDE03CriticalD01355Germany52011DEC202512000002025-12-11VALID_DATE6EXTREME_SPEED
16GLOBAL_TRAIN_SYSTEMTR015IndiaMetro Expressexpress@gmail.comIN04MediumD02118India41007JAN20254700002025-01-07VALID_DATE17NORMAL_SPEED
17GLOBAL_TRAIN_SYSTEMTR016FrancePassenger Railrail.fr@gmail.comFR03LowD05140France.15FEB20256100002025-02-15VALID_DATE16NORMAL_SPEED
18GLOBAL_TRAIN_SYSTEMTR017USACargo Linecargo.us@gmail.comUS03CriticalD0192Usa29018MAR20259900002025-03-18VALID_DATE15NORMAL_SPEED
19GLOBAL_TRAIN_SYSTEMTR018JapanMetro Railmetro.jp2@gmail.comJP03MediumD0298Japan36020APR20254300002025-04-20VALID_DATE14NORMAL_SPEED
20GLOBAL_TRAIN_SYSTEMTR019IndiaBullet Trainindia.bullet@gmail.cIN05CriticalD01360India54025MAY202515000002025-05-25VALID_DATE13EXTREME_SPEED
21GLOBAL_TRAIN_SYSTEMTR020GermanyFreight Railgermany.freight@gmaiDE04HighD0378Germany61010JUN20258800002025-06-10VALID_DATE12NORMAL_SPEED

Explanation

Dates are among the most dangerous enterprise-quality risks. Invalid dates such as 31FEB2025 silently corrupt analytical timelines. INPUT() converts character dates into SAS date values while invalid values become missing. INTCK() calculates operational durations. This is heavily used in clinical trials for treatment exposure analysis and visit-window derivations. Here, it measures train operational aging. Missing dates are flagged for audit review instead of silently ignored. This defensive programming strategy protects downstream KPIs and predictive maintenance systems.

3.PROC SORT NODUPKEY Deduplication

proc sort data=train_clean_stage2

          out=train_dedup nodupkey;

by Train_ID;

run;

proc print data = train_dedup;

run;

LOG:

NOTE: There were 21 observations read from the data set WORK.TRAIN_CLEAN_STAGE2.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.TRAIN_DEDUP has 20 observations and 17 variables.

OUTPUT:

ObsData_SourceTrain_IDCountryTrain_TypeEngineer_EmailRegion_CodeMaintenance_CategoryDelay_CodeAvg_SpeedStandard_CountryPassenger_CountJourney_DateRevenueParsed_DateDate_FlagTrain_Age_MonthsSpeed_Flag
1GLOBAL_TRAIN_SYSTEMTR001indiaBullet Trainengine1@gmail.comIN01CriticalD01320India45012JAN20252500002025-01-12VALID_DATE17NORMAL_SPEED
2GLOBAL_TRAIN_SYSTEMTR002JapanMetro Expressmetro.jp@gmail.comJP_01MediumD02210Japan.15MAR20251800002025-03-15VALID_DATE15NORMAL_SPEED
3GLOBAL_TRAIN_SYSTEMTR003GermanyFreight Railfreight@railDE01LowNULL150Germany70018APR20255002025-04-18VALID_DATE14NORMAL_SPEED
4GLOBAL_TRAIN_SYSTEMTR004FrancePassenger Railfrance_rail@gmail.coFR01HighD04500France9999.550000.INVALID_DATE.EXTREME_SPEED
5GLOBAL_TRAIN_SYSTEMTR005IndiaCargo Linecargo@gmailIN02MediumD02110India2522MAY20257500002025-05-22VALID_DATE13NORMAL_SPEED
6GLOBAL_TRAIN_SYSTEMTR006USAMetro Railusa.metro@gmail.comUS01CriticalD0395Usa38001JUN20254500002025-06-01VALID_DATE12NORMAL_SPEED
7GLOBAL_TRAIN_SYSTEMTR007IndiaPassenger Rail IN03UnknownDXX130India42015JUN20255200002025-06-15VALID_DATE12NORMAL_SPEED
8GLOBAL_TRAIN_SYSTEMTR008JapanBullet Trainbullet@japan.comJP01CriticalD01340Japan51018JUL20258700002025-07-18VALID_DATE11NORMAL_SPEED
9GLOBAL_TRAIN_SYSTEMTR009GermanyMetro Railmetro.de@gmail.comDE_02LowD0285Germany0.320000.INVALID_DATE.NORMAL_SPEED
10GLOBAL_TRAIN_SYSTEMTR010FranceCargo Linecargo.fr@gmail.comFR02MediumD0375France25005AUG20256500002025-08-05VALID_DATE10NORMAL_SPEED
11GLOBAL_TRAIN_SYSTEMTR011IndiaMetro Railmetro.india@gmail.coIN01HighD01102India39012SEP2025.2025-09-12VALID_DATE9NORMAL_SPEED
12GLOBAL_TRAIN_SYSTEMTR012USAPassenger Railpassenger.us@gmail.cUS02LowD04145Usa48015OCT20259500002025-10-15VALID_DATE8NORMAL_SPEED
13GLOBAL_TRAIN_SYSTEMTR013JapanFreight Railfreight.jp@gmail.comJP02MediumD0388Japan300.340000.INVALID_DATE.NORMAL_SPEED
14GLOBAL_TRAIN_SYSTEMTR014GermanyBullet Trainbullet.de@gmailDE03CriticalD01355Germany52011DEC202512000002025-12-11VALID_DATE6EXTREME_SPEED
15GLOBAL_TRAIN_SYSTEMTR015IndiaMetro Expressexpress@gmail.comIN04MediumD02118India41007JAN20254700002025-01-07VALID_DATE17NORMAL_SPEED
16GLOBAL_TRAIN_SYSTEMTR016FrancePassenger Railrail.fr@gmail.comFR03LowD05140France.15FEB20256100002025-02-15VALID_DATE16NORMAL_SPEED
17GLOBAL_TRAIN_SYSTEMTR017USACargo Linecargo.us@gmail.comUS03CriticalD0192Usa29018MAR20259900002025-03-18VALID_DATE15NORMAL_SPEED
18GLOBAL_TRAIN_SYSTEMTR018JapanMetro Railmetro.jp2@gmail.comJP03MediumD0298Japan36020APR20254300002025-04-20VALID_DATE14NORMAL_SPEED
19GLOBAL_TRAIN_SYSTEMTR019IndiaBullet Trainindia.bullet@gmail.cIN05CriticalD01360India54025MAY202515000002025-05-25VALID_DATE13EXTREME_SPEED
20GLOBAL_TRAIN_SYSTEMTR020GermanyFreight Railgermany.freight@gmaiDE04HighD0378Germany61010JUN20258800002025-06-10VALID_DATE12NORMAL_SPEED

Explanation

Duplicate transaction IDs create massive financial distortion. PROC SORT NODUPKEY removes duplicate keys while preserving one trusted record. In production, additional reconciliation rules are often applied using timestamps or source-priority ranking. Deduplication is foundational for SDTM subject uniqueness and operational reporting consistency. In R, distinct() performs similar logic.

4.PROC FORMAT for Enterprise Classification

proc format;

value speedfmt low-100='LOW SPEED'

               101-250='MEDIUM SPEED'

              251-high='HIGH SPEED';

run;

LOG:

NOTE: Format SPEEDFMT has been output.

Explanation

Formats centralize business logic. Instead of repeatedly coding classifications, enterprise teams standardize reusable mappings. This improves governance, traceability, and consistency across dashboards, TLFs, and AI features.

5.PROC SQL Enterprise Reporting

proc sql;

create table regional_summary as

select Standard_Country,Maintenance_Category,

count(*) as Total_Trains,

sum(Revenue) as Total_Revenue,

mean(Avg_Speed) as Avg_Speed format=speedfmt.

from train_dedup

group by Standard_Country,Maintenance_Category;

quit;

proc print data = regional_summary;

run;

OUTPUT:

ObsStandard_CountryMaintenance_CategoryTotal_TrainsTotal_RevenueAvg_Speed
1FranceHigh1550000HIGH SPEED
2FranceLow1610000MEDIUM SPEED
3FranceMedium1650000LOW SPEED
4GermanyCritical11200000HIGH SPEED
5GermanyHigh1880000LOW SPEED
6GermanyLow2320500MEDIUM SPEED
7IndiaCritical21750000HIGH SPEED
8IndiaHigh1.MEDIUM SPEED
9IndiaMedium21220000MEDIUM SPEED
10IndiaUnknown1520000MEDIUM SPEED
11JapanCritical1870000HIGH SPEED
12JapanMedium3950000MEDIUM SPEED
13UsaCritical21440000LOW SPEED
14UsaLow1950000MEDIUM SPEED

Explanation

PROC SQL provides relational-style aggregation ideal for enterprise reporting. It simplifies grouped summaries and integrates naturally with validation frameworks. Compared with DATA step BY-group processing, SQL is often easier for reporting logic but less memory-efficient for very large datasets.

6.DATA Step FIRST./LAST. Processing

proc sort data=train_dedup;

by Standard_Country;

run;

proc print data = train_dedup;

run;

OUTPUT:

ObsData_SourceTrain_IDCountryTrain_TypeEngineer_EmailRegion_CodeMaintenance_CategoryDelay_CodeAvg_SpeedStandard_CountryPassenger_CountJourney_DateRevenueParsed_DateDate_FlagTrain_Age_MonthsSpeed_Flag
1GLOBAL_TRAIN_SYSTEMTR004FrancePassenger Railfrance_rail@gmail.coFR01HighD04500France9999.550000.INVALID_DATE.EXTREME_SPEED
2GLOBAL_TRAIN_SYSTEMTR010FranceCargo Linecargo.fr@gmail.comFR02MediumD0375France25005AUG20256500002025-08-05VALID_DATE10NORMAL_SPEED
3GLOBAL_TRAIN_SYSTEMTR016FrancePassenger Railrail.fr@gmail.comFR03LowD05140France.15FEB20256100002025-02-15VALID_DATE16NORMAL_SPEED
4GLOBAL_TRAIN_SYSTEMTR003GermanyFreight Railfreight@railDE01LowNULL150Germany70018APR20255002025-04-18VALID_DATE14NORMAL_SPEED
5GLOBAL_TRAIN_SYSTEMTR009GermanyMetro Railmetro.de@gmail.comDE_02LowD0285Germany0.320000.INVALID_DATE.NORMAL_SPEED
6GLOBAL_TRAIN_SYSTEMTR014GermanyBullet Trainbullet.de@gmailDE03CriticalD01355Germany52011DEC202512000002025-12-11VALID_DATE6EXTREME_SPEED
7GLOBAL_TRAIN_SYSTEMTR020GermanyFreight Railgermany.freight@gmaiDE04HighD0378Germany61010JUN20258800002025-06-10VALID_DATE12NORMAL_SPEED
8GLOBAL_TRAIN_SYSTEMTR001indiaBullet Trainengine1@gmail.comIN01CriticalD01320India45012JAN20252500002025-01-12VALID_DATE17NORMAL_SPEED
9GLOBAL_TRAIN_SYSTEMTR005IndiaCargo Linecargo@gmailIN02MediumD02110India2522MAY20257500002025-05-22VALID_DATE13NORMAL_SPEED
10GLOBAL_TRAIN_SYSTEMTR007IndiaPassenger Rail IN03UnknownDXX130India42015JUN20255200002025-06-15VALID_DATE12NORMAL_SPEED
11GLOBAL_TRAIN_SYSTEMTR011IndiaMetro Railmetro.india@gmail.coIN01HighD01102India39012SEP2025.2025-09-12VALID_DATE9NORMAL_SPEED
12GLOBAL_TRAIN_SYSTEMTR015IndiaMetro Expressexpress@gmail.comIN04MediumD02118India41007JAN20254700002025-01-07VALID_DATE17NORMAL_SPEED
13GLOBAL_TRAIN_SYSTEMTR019IndiaBullet Trainindia.bullet@gmail.cIN05CriticalD01360India54025MAY202515000002025-05-25VALID_DATE13EXTREME_SPEED
14GLOBAL_TRAIN_SYSTEMTR002JapanMetro Expressmetro.jp@gmail.comJP_01MediumD02210Japan.15MAR20251800002025-03-15VALID_DATE15NORMAL_SPEED
15GLOBAL_TRAIN_SYSTEMTR008JapanBullet Trainbullet@japan.comJP01CriticalD01340Japan51018JUL20258700002025-07-18VALID_DATE11NORMAL_SPEED
16GLOBAL_TRAIN_SYSTEMTR013JapanFreight Railfreight.jp@gmail.comJP02MediumD0388Japan300.340000.INVALID_DATE.NORMAL_SPEED
17GLOBAL_TRAIN_SYSTEMTR018JapanMetro Railmetro.jp2@gmail.comJP03MediumD0298Japan36020APR20254300002025-04-20VALID_DATE14NORMAL_SPEED
18GLOBAL_TRAIN_SYSTEMTR006USAMetro Railusa.metro@gmail.comUS01CriticalD0395Usa38001JUN20254500002025-06-01VALID_DATE12NORMAL_SPEED
19GLOBAL_TRAIN_SYSTEMTR012USAPassenger Railpassenger.us@gmail.cUS02LowD04145Usa48015OCT20259500002025-10-15VALID_DATE8NORMAL_SPEED
20GLOBAL_TRAIN_SYSTEMTR017USACargo Linecargo.us@gmail.comUS03CriticalD0192Usa29018MAR20259900002025-03-18VALID_DATE15NORMAL_SPEED

data country_summary;

set train_dedup;

by Standard_Country;

retain Total_Revenue 0;

Total_Revenue + Revenue;

if last.Standard_Country;

run;

proc print data = country_summary;

run;

OUTPUT:

ObsData_SourceTrain_IDCountryTrain_TypeEngineer_EmailRegion_CodeMaintenance_CategoryDelay_CodeAvg_SpeedStandard_CountryPassenger_CountJourney_DateRevenueParsed_DateDate_FlagTrain_Age_MonthsSpeed_FlagTotal_Revenue
1GLOBAL_TRAIN_SYSTEMTR016FrancePassenger Railrail.fr@gmail.comFR03LowD05140France.15FEB20256100002025-02-15VALID_DATE16NORMAL_SPEED1810000
2GLOBAL_TRAIN_SYSTEMTR020GermanyFreight Railgermany.freight@gmaiDE04HighD0378Germany61010JUN20258800002025-06-10VALID_DATE12NORMAL_SPEED4210500
3GLOBAL_TRAIN_SYSTEMTR019IndiaBullet Trainindia.bullet@gmail.cIN05CriticalD01360India54025MAY202515000002025-05-25VALID_DATE13EXTREME_SPEED7700500
4GLOBAL_TRAIN_SYSTEMTR018JapanMetro Railmetro.jp2@gmail.comJP03MediumD0298Japan36020APR20254300002025-04-20VALID_DATE14NORMAL_SPEED9520500
5GLOBAL_TRAIN_SYSTEMTR017USACargo Linecargo.us@gmail.comUS03CriticalD0192Usa29018MAR20259900002025-03-18VALID_DATE15NORMAL_SPEED11910500

Explanation

FIRST./LAST. logic is extremely powerful in SAS. It enables grouped calculations without SQL. This technique is widely used in SDTM and ADaM derivations, particularly exposure duration, adverse event counts, and patient visit sequencing.

7.Reusable SAS Macro

%macro validation(dsname);

proc freq data=&dsname;

tables Maintenance_Category Delay_Code / missing;

run;


proc means data=&dsname n nmiss mean max min;

var Revenue Avg_Speed Passenger_Count;

run;

%mend;

%validation(train_dedup);

OUTPUT:

The FREQ Procedure

Maintenance_CategoryFrequencyPercentCumulative
Frequency
Cumulative
Percent
Critical630.00630.00
High315.00945.00
Low420.001365.00
Medium630.001995.00
Unknown15.0020100.00
Delay_CodeFrequencyPercentCumulative
Frequency
Cumulative
Percent
D01630.00630.00
D02525.001155.00
D03420.001575.00
D04210.001785.00
D0515.001890.00
DXX15.001995.00
NULL15.0020100.00
The MEANS Procedure
VariableNN MissMeanMaximumMinimum
Revenue
Avg_Speed
Passenger_Count
19
20
18
1
0
2
626868.42
179.5500000
924.1111111
1500000.00
500.0000000
9999.00
500.0000000
75.0000000
0

Explanation

Macros standardize enterprise workflows. Instead of repeating validation logic manually, teams encapsulate reusable quality checks. This improves reproducibility, auditability, and deployment consistency.

8.R Raw Dataset

library(tidyverse)

library(lubridate)

library(janitor)

train_raw <- tribble(

  ~Train_ID,~Country,~Train_Type,~Passenger_Count,

  ~Revenue,~Journey_Date,~Engineer_Email,

  ~Region_Code,~Maintenance_Category,

  "TR001","india","Bullet Train",450,"250000",

  "12JAN2025","engine1@gmail.com"," in01 ","Critical", 

  "TR001","INDIA","bullet train",450,"-250000",

  "31FEB2025","wrongmail.com","IN01","critical",

  "TR002","Japan","Metro Express",NA,"180000",

  "15MAR2025","metro.jp@gmail.com","JP_01","Medium"

)

OUTPUT:

 

Train_ID

Country

Train_Type

Passenger_Count

Revenue

Journey_Date

Engineer_Email

Region_Code

Maintenance_Category

1

TR001

india

Bullet Train

450

250000

12-Jan-2025

engine1@gmail.com

 in01 

Critical

2

TR001

INDIA

bullet train

450

-250000

31FEB2025

wrongmail.com

IN01

critical

3

TR002

Japan

Metro Express

NA

180000

15-Mar-2025

metro.jp@gmail.com

JP_01

Medium


9.R Cleaning Pipeline

train_clean <- train_raw %>%

  clean_names() %>%

  mutate(

    country = str_to_title(str_trim(country)),

    train_type = str_to_title(train_type),

    region_code = str_replace_all(

        str_to_upper(region_code)," ",""),

    revenue = abs(as.numeric(gsub("[^0-9-]","",revenue))),

    journey_date = suppressWarnings(parse_date_time(

        journey_date,orders="dby")),

    engineer_email = str_trim(str_to_lower(engineer_email)),

    maintenance_category =case_when(

        maintenance_category %in%

          c("critical","Critical") ~ "Critical",

        TRUE ~ maintenance_category)

  ) %>%

  distinct(train_id,.keep_all=TRUE)

OUTPUT:

 

train_id

country

train_type

passenger_count

revenue

journey_date

engineer_email

region_code

maintenance_category

1

TR001

India

Bullet Train

450

250000

12-01-2025

engine1@gmail.com

IN01

Critical

2

TR002

Japan

Metro Express

NA

180000

15-03-2025

metro.jp@gmail.com

JP_01

Medium

Explanation

The R pipeline mirrors SAS transformations using tidyverse functions. mutate() handles variable derivation, case_when() replicates SAS IF-THEN logic, and parse_date_time() validates dates. distinct() performs deduplication. Compared with SAS DATA steps, R offers more concise syntax and functional chaining, while SAS provides stronger regulatory traceability.

Enterprise Validation & Compliance

In regulated industries like clinical trials and transportation safety systems, validation is mandatory.

Key concepts include:

  • SDTM standardization
  • ADaM traceability
  • QC independence
  • Audit trails
  • Metadata governance
  • Reproducibility
  • CFR Part 11 compliance

One major SAS risk:

Missing numeric values are treated as lower than valid numbers.

Example:

if Revenue_Num < 1000 then Flag='LOW';

Missing values also satisfy this condition unless explicitly checked.

Correct logic:

if not missing(Revenue_Num)

and Revenue_Num < 1000;

This single oversight can invalidate regulatory analyses.

20 Enterprise Data-Cleaning Best Practices

  1. Always validate primary keys
  2. Standardize date formats early
  3. Remove duplicate records before aggregation
  4. Normalize categorical values
  5. Validate email structures
  6. Use audit-ready derivation logic
  7. Avoid hardcoded mappings
  8. Centralize formats
  9. Create reusable macros
  10. Maintain metadata lineage
  11. Separate raw and clean layers
  12. Use defensive programming
  13. Validate impossible ranges
  14. Track imputation rules
  15. Document transformations
  16. Use QC-independent review
  17. Preserve source traceability
  18. Standardize missing-value handling
  19. Maintain production logs
  20. Build reproducible workflows

Business Logic Behind Data Cleaning

Enterprise analytics depends on trustworthy data. Missing values are imputed because downstream algorithms cannot reliably process incomplete records. For example, if passenger counts are missing for major train routes, forecasting models underestimate demand. Unrealistic values such as negative revenue or impossible train speeds distort executive KPIs and AI predictions. Dates must be standardized because operational sequencing depends on chronological consistency. Invalid dates break duration calculations, maintenance schedules, and trend analysis. Text normalization is equally critical. “india,” “INDIA,” and “India” must become one standardized category; otherwise dashboards produce fragmented analytics. Similar logic applies in healthcare where patient age correction prevents invalid dosing calculations, and salary normalization ensures accurate insurance risk modeling. Every cleaning rule should align with business meaning rather than cosmetic formatting.

20 One-Line Enterprise Insights

  • Dirty data creates expensive business mistakes.
  • Standardized variables improve reproducibility.
  • Validation logic is stronger than visual inspection.
  • Missing dates silently damage analytics.
  • Deduplication protects financial accuracy.
  • Metadata governance improves audit readiness.
  • SAS excels in regulated workflows.
  • R accelerates exploratory transformation.
  • Defensive programming prevents production failures.
  • AI models inherit data quality problems.
  • Enterprise reporting requires traceability.
  • QC independence improves trustworthiness.
  • Macros improve scalability.
  • Clean inputs create reliable outputs.
  • Regulatory systems demand reproducibility.
  • Invalid ranges corrupt forecasting models.
  • Character truncation causes hidden defects.
  • Standard formats improve cross-team collaboration.
  • Documentation is part of validation.
  • Reliable data powers executive confidence.

SAS vs R Comparison

Feature

SAS

R

Regulatory Acceptance

Excellent

Moderate

Audit Trails

Strong

Custom

Flexibility

Moderate

Excellent

Visualization

Moderate

Excellent

Enterprise Scalability

Excellent

Strong

Validation Frameworks

Built-in

Custom

Learning Curve

Structured

Flexible

Metadata Governance

Strong

Package-dependent

SAS dominates regulated enterprise ecosystems because of auditability, stability, validation frameworks, and controlled execution. Clinical trials, banking risk systems, and insurance reporting heavily rely on SAS due to compliance requirements. R, however, excels in flexibility, modern data wrangling, advanced visualization, machine learning integration, and rapid experimentation. SAS DATA steps provide deterministic execution suitable for production pipelines, while tidyverse workflows offer elegant transformation chains ideal for exploratory analytics. Many organizations now use hybrid ecosystems where SAS handles validated production reporting and R supports advanced modeling and exploratory data science. Together, they create scalable, production-grade analytical intelligence systems.

Validation Checklist

1.  Duplicate key validation
2.  Date standardization
3.  Missing value review
4.  Invalid range checks
5.  Category normalization
6.  Email validation
7.  Metadata documentation
8.  QC review completed
9.  Audit trail maintained
10.Reporting outputs verified

Summary

Global train-system analytics environments often suffer from corrupted operational data containing duplicate train IDs, invalid dates, negative revenue values, malformed emails, inconsistent region codes, NULL strings, and impossible speed measurements. These issues severely impact dashboards, AI prediction systems, financial reporting, maintenance forecasting, and enterprise decision-making. Using SAS and R together creates a powerful framework for transforming unreliable raw railway data into trusted analytical intelligence.

In SAS, enterprise-grade cleaning workflows were developed using DATA step programming, PROC SQL, PROC FORMAT, PROC SORT NODUPKEY, arrays, macros, FIRST./LAST. processing, INPUT/PUT conversions, INTNX/INTCK date calculations, and validation procedures such as PROC FREQ and PROC MEANS. Important production concepts like character truncation risk, metadata governance, audit trails, and missing-value handling were explained in detail. SAS demonstrated strong capabilities in regulatory compliance, reproducibility, and scalable reporting.

In R, modern data-cleaning pipelines were implemented using tidyverse, dplyr, stringr, lubridate, janitor, and purrr. Functions such as mutate(), case_when(), distinct(), parse_date_time(), replace_na(), and str_replace_all() provided flexible and readable transformations. The project emphasized that clean, standardized, and validated datasets are essential for reliable analytics, regulatory trust, AI accuracy, and enterprise-grade business intelligence across transportation and clinical-style operational systems.

Conclusion

Modern analytics ecosystems depend on reliable, analysis-ready data. Whether managing global train systems, clinical trials, insurance claims, or banking operations, organizations cannot trust dashboards, AI predictions, or executive reporting without structured data-engineering frameworks. Corrupted operational data introduces enormous business risk through duplicate records, invalid dates, inconsistent categorical values, malformed identifiers, and missing variables. Left unresolved, these issues damage forecasting accuracy, compliance submissions, regulatory audits, and strategic decision-making.

SAS remains one of the strongest enterprise platforms for production-grade data engineering because of its deterministic execution model, regulatory acceptance, auditability, metadata governance, and scalable validation frameworks. DATA step processing, PROC SQL, macros, formats, and enterprise reporting procedures provide extraordinary control over transformation pipelines. In highly regulated industries such as clinical research, insurance, and transportation safety systems, this level of traceability is indispensable.

R complements SAS by offering flexible transformation logic, modern functional programming, rich package ecosystems, and rapid exploratory analytics. Packages such as tidyverse, lubridate, janitor, and stringr allow data scientists to rapidly iterate through cleansing workflows while maintaining readable pipelines. When integrated strategically, SAS and R form a powerful hybrid ecosystem capable of delivering scalable, trustworthy, and production-grade analytical intelligence.

The ultimate lesson is simple:

Clean data is not merely a technical requirement it is the foundation of trustworthy business intelligence, regulatory credibility, AI reliability, and executive confidence.

Interview Questions and Answers

1. A train analytics dashboard suddenly shows negative operational revenue. How would you investigate this issue?

Answer

I would first validate the raw source data and identify whether the negative values originated from data-entry errors, refund transactions, or incorrect type conversions. In SAS, I would use PROC MEANS, PROC FREQ, and conditional DATA step logic to isolate suspicious records. I would also verify whether character-to-numeric conversions introduced corruption during ingestion. In R, I would use filter(revenue < 0) and summarize abnormal trends. After identifying the root cause, I would apply business rules such as ABS() or exception flagging depending on regulatory requirements.

2. How would you detect hidden duplicate records when Train_ID values appear unique?

Answer

Sometimes duplicate business records exist even when IDs differ slightly because of whitespace corruption, lowercase/uppercase inconsistencies, or hidden special characters. I would standardize variables using COMPRESS, STRIP, UPCASE, and PROPCASE in SAS before deduplication. In R, I would use str_trim() and str_to_upper(). Then I would compare combinations of train attributes such as route, date, and engineer email using PROC SQL grouping or group_by() in R.

3. A regulatory auditor says your train maintenance dates are unreliable. What validation steps would you perform?

Answer

I would validate all dates using SAS INPUT() with proper informats and identify invalid dates through missing conversions. I would also check impossible calendar values like 31FEB2025. In R, I would use parse_date_time() from lubridate. Additional checks would include future-date validation, chronological sequencing, and interval consistency using INTCK() in SAS. Every correction would be documented for audit traceability.

4. Why is missing-value handling considered dangerous in SAS production systems?

Answer

SAS treats missing numeric values as smaller than valid numbers. This behavior can unintentionally classify missing records into analytical categories. For example:

if revenue < 1000 then flag='LOW';

This condition also captures missing values unless explicitly checked. Proper logic should use:

if not missing(revenue) and revenue < 1000;

This prevents catastrophic reporting errors in regulated environments.

5. When would you choose PROC SQL over DATA Step in enterprise cleaning workflows?

Answer

I would use PROC SQL when performing joins, grouped aggregations, relational filtering, and reporting summaries because SQL syntax is concise and readable. However, for sequential row-level processing, retained calculations, FIRST./LAST. logic, and complex derivations, DATA step is usually faster and more memory-efficient. In production systems, both approaches are often combined strategically for optimal scalability and maintainability.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent TRAIN DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?

2.From Dead Mobile Brands to Clean Data Mastery Using INPUT and PUT in SAS

3.Can Advanced SAS Programming Detect, Clean, and Optimize AI Training Experiment Data While Identifying Fraud Patterns?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy


Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS