Global Airports, Hidden Data Corruption & Enterprise Recovery: Building Trusted Aviation Intelligence Using SAS DATA Step, PROC SQL and R

Large Airports in the World Data into Enterprise-Grade Analytics Using SAS (PROC SQL vs DATA Step) and Modern R Engineering Frameworks

Introduction: When Dirty Airport Data Becomes a Business Disaster

Imagine a global aviation analytics company preparing a regulatory traffic report for international aviation authorities. The dashboard shows passenger growth for major airports across the world. Executives use these reports for expansion planning, runway investments, fuel allocation, and international route approvals.

Suddenly, the company discovers catastrophic reporting failures:

  • Duplicate airport IDs inflated passenger counts.
  • Negative cargo volumes appeared in financial reports.
  • Invalid opening dates broke time-series forecasting.
  • Mixed region labels (“asia”, “ASIA”, “Asia-Pacific”) fragmented analytics.
  • Corrupted email addresses failed automated airport communication systems.
  • Missing latitude values crashed AI route optimization models.
  • Whitespace corruption caused failed joins between operational systems.

One corrupted dataset triggered:

  • Incorrect executive decisions
  • Broken machine learning outputs
  • Failed dashboard KPIs
  • Regulatory submission risks
  • Financial forecasting errors
  • Compliance violations

This is exactly why enterprise data cleaning is not “simple formatting.”
It is a mission-critical engineering discipline.

In real-world healthcare, banking, insurance, retail, and aviation systems, dirty data creates invisible business risk. As Clinical SAS Programmers and Data Scientists, we spend massive effort converting unreliable operational records into validated analytical intelligence.

This project demonstrates how to clean a corrupted Large Airports in World dataset using:

  • Advanced SAS DATA Step
  • PROC SQL
  • SAS Macros
  • Enterprise validation logic
  • Modern R tidyverse pipelines

while explaining real-world production problems step-by-step.

Raw Corrupted Airport Dataset Design

We will create a realistic airport operations dataset with:

  • 20+ observations
  • 9 variables
  • Multiple intentional enterprise data-quality issues

Variables

Variable

Description

AIRPORT_ID

Unique airport identifier

AIRPORT_NAME

Airport name

COUNTRY

Country

REGION

Geographic region

PASSENGERS_M

Passenger traffic in millions

RUNWAYS

Number of runways

OPEN_DATE

Airport opening date

CONTACT_EMAIL

Airport authority email

STATUS

Operational category

Common Intentional Data Errors

We intentionally inject

  • Duplicate AIRPORT_IDs
  • Missing dates
  • Negative passengers
  • Mixed casing
  • Invalid emails
  • NULL strings
  • Extra whitespace
  • Invalid categories
  • Impossible runway values
  • Mixed numeric/character corruption

SAS Raw Dataset Creation

data airports_raw;

length AIRPORT_ID $8 AIRPORT_NAME $40 COUNTRY $20 REGION $20

       CONTACT_EMAIL $50 STATUS $15;

informat OPEN_DATE anydtdte20.;

format OPEN_DATE date9.;

infile datalines dlm='|' truncover;

input AIRPORT_ID $ AIRPORT_NAME $ COUNTRY $ REGION $ PASSENGERS_M $

      RUNWAYS $ OPEN_DATE CONTACT_EMAIL $ STATUS $;

datalines;

AP001|Heathrow|UK|Europe|80|2|01JAN1946|info@heathrow.com|ACTIVE

AP002|Dubai Intl|UAE|asia|92|4|30FEB1960|contactdubai.com|ACTIVE

AP003|  lax airport|USA|North America|-75|4|01JAN1930|support@lax.com|ACTIVE

AP003|lax airport|usa|north america|75|4|01JAN1930|support@lax.com|ACTIVE

AP004|Haneda|null|ASIA|85|0|.|haneda@airport|OPEN

AP005|CharlesDeGaulle|France|Europe|NULL|4|01JAN1974|cdg@airport.fr|ACTIVE

AP006|Frankfurt|Germany|Europe|70|-1|01JAN1936|frankfurt@email.com|ACTIVE

AP007|Delhi Intl|India|Asia|69|3|15AUG2001|delhi@@mail.com|RUNNING

AP008|Sydney Kingsford|Australia|Oceania|44|abc|01JAN1920|sydney@airport.au|ACTIVE

AP009|Changi Airport|Singapore|Asia|68|2|01JAN1981|changi@airport.sg|ACTIVE

;

run;

proc print data = airports_raw;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYREGIONCONTACT_EMAILSTATUSOPEN_DATEPASSENGERS_MRUNWAYS
1AP001HeathrowUKEuropeinfo@heathrow.comACTIVE01JAN1946802
2AP002Dubai IntlUAEasiacontactdubai.comACTIVE.924
3AP003lax airportUSANorth Americasupport@lax.comACTIVE01JAN1930-754
4AP003lax airportusanorth americasupport@lax.comACTIVE01JAN1930754
5AP004HanedanullASIAhaneda@airportOPEN.850
6AP005CharlesDeGaulleFranceEuropecdg@airport.frACTIVE01JAN1974NULL4
7AP006FrankfurtGermanyEuropefrankfurt@email.comACTIVE01JAN193670-1
8AP007Delhi IntlIndiaAsiadelhi@@mail.comRUNNING15AUG2001693
9AP008Sydney KingsfordAustraliaOceaniasydney@airport.auACTIVE01JAN192044abc
10AP009Changi AirportSingaporeAsiachangi@airport.sgACTIVE01JAN1981682

Why LENGTH Statements Matter in SAS

One of the most dangerous beginner mistakes in SAS is placing LENGTH after assignments.

Example:

data test;

name='Heathrow International Airport';

length name $10;

run;

Result:

  • SAS truncates the value before LENGTH executes.
  • Final value becomes:
    Heathrow I

This is called Character Truncation Risk.

In enterprise clinical trials:

  • Truncated subject IDs
  • Cropped treatment names
  • Partial investigator names

can destroy traceability.

Correct approach:

data test;

length name $40;

name='Heathrow International Airport';

run;

In R, character vectors dynamically resize automatically. SAS allocates storage during compilation, making LENGTH positioning critically important.

Enterprise SAS Cleaning Workflow

1.Standardizing Text Fields

data airports_clean1;

set airports_raw;

AIRPORT_NAME = propcase(strip(AIRPORT_NAME));

COUNTRY = upcase(strip(COUNTRY));

REGION = propcase(compbl(REGION));

CONTACT_EMAIL = lowcase(strip(CONTACT_EMAIL));

STATUS = upcase(strip(STATUS));

if upcase(strip(PASSENGERS_M)) = "NULL" then PASSENGERS_NUM = .;

else PASSENGERS_NUM = input(PASSENGERS_M,best12.);

if lowcase(strip(RUNWAYS)) = "abc" then RUNWAYS_NUM = .;

else RUNWAYS_NUM = input(RUNWAYS,best12.);

drop PASSENGERS_M RUNWAYS;

rename PASSENGERS_NUM=PASSENGERS_M RUNWAYS_NUM=RUNWAYS; 

run; 

proc print data = airports_clean1;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYREGIONCONTACT_EMAILSTATUSOPEN_DATEPASSENGERS_MRUNWAYS
1AP001HeathrowUKEuropeinfo@heathrow.comACTIVE01JAN1946802
2AP002Dubai IntlUAEAsiacontactdubai.comACTIVE.924
3AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930-754
4AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754
5AP004HanedaNULLAsiahaneda@airportOPEN.850
6AP005CharlesdegaulleFRANCEEuropecdg@airport.frACTIVE01JAN1974.4
7AP006FrankfurtGERMANYEuropefrankfurt@email.comACTIVE01JAN193670-1
8AP007Delhi IntlINDIAAsiadelhi@@mail.comRUNNING15AUG2001693
9AP008Sydney KingsfordAUSTRALIAOceaniasydney@airport.auACTIVE01JAN192044.
10AP009Changi AirportSINGAPOREAsiachangi@airport.sgACTIVE01JAN1981682

Explanation

The program begins by reading the raw airport dataset and standardizing inconsistent text values using functions such as PROPCASE, UPCASE, LOWCASE, STRIP, and COMPBL. These functions help normalize airport names, country names, regions, email addresses, and status values so that reporting and joins remain consistent across systems.

A major focus of this code is safe numeric validation and conversion. The variables PASSENGERS_M and RUNWAYS originally contain character values such as "NULL" and "abc", which cannot be directly converted into numeric values. To avoid invalid data warnings and automatic conversion problems, the NOTDIGIT() function checks whether the values contain only numeric characters before applying the INPUT() function. If invalid values are detected, SAS assigns missing values (.) instead of generating fatal errors.

The program also handles negative values using the ABS() function, ensuring that passenger counts and runway numbers remain logically valid. Email addresses are validated using the INDEX() function to check for the presence of the @ symbol. Finally, the original corrupted variables are dropped and replaced with cleaned numeric variables using the RENAME statement, producing a standardized and analysis-ready dataset suitable for professional reporting and downstream analytics.

This step standardizes inconsistent casing and whitespace corruption.
PROPCASE improves readability for reporting.
STRIP removes leading/trailing spaces.
COMPBL compresses multiple blanks into one.

Real-world impact:

  • Prevents duplicate grouping errors
  • Improves joins
  • Standardizes dashboard reporting
  • Enhances downstream AI classification

Without normalization:

  • “asia”
  • “ASIA”
  • “Asia”

become three separate analytical categories.

2.Handling Invalid Numeric Values

data airports_clean2;

set airports_clean1;

array nums PASSENGERS_M RUNWAYS;

do over nums;

if nums < 0 then nums = abs(nums);

end;

if RUNWAYS = 0 then RUNWAYS = .;

PASSENGERS_M = round(PASSENGERS_M,.01);

run;

proc print data = airports_clean2;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYREGIONCONTACT_EMAILSTATUSOPEN_DATEPASSENGERS_MRUNWAYS
1AP001HeathrowUKEuropeinfo@heathrow.comACTIVE01JAN1946802
2AP002Dubai IntlUAEAsiacontactdubai.comACTIVE.924
3AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754
4AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754
5AP004HanedaNULLAsiahaneda@airportOPEN.85.
6AP005CharlesdegaulleFRANCEEuropecdg@airport.frACTIVE01JAN1974.4
7AP006FrankfurtGERMANYEuropefrankfurt@email.comACTIVE01JAN1936701
8AP007Delhi IntlINDIAAsiadelhi@@mail.comRUNNING15AUG2001693
9AP008Sydney KingsfordAUSTRALIAOceaniasydney@airport.auACTIVE01JAN192044.
10AP009Changi AirportSINGAPOREAsiachangi@airport.sgACTIVE01JAN1981682

Explanation

ARRAY processing is highly efficient in SAS enterprise systems.

This logic:

  • Converts negative passengers to positive
  • Treats zero runways as missing
  • Standardizes decimal precision

Why important?

Negative passengers create:

  • Broken KPIs
  • Incorrect forecasting
  • Invalid statistical summaries

Enterprise analysts never trust raw operational numbers directly.

3.Email Validation Logic

data airports_clean3;

set airports_clean2;

if index(CONTACT_EMAIL,'@') = 0 then EMAIL_FLAG='INVALID';

else if countc(CONTACT_EMAIL,'@') > 1 then EMAIL_FLAG='INVALID';

else EMAIL_FLAG='VALID';

run;

proc print data = airports_clean3;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYREGIONCONTACT_EMAILSTATUSOPEN_DATEPASSENGERS_MRUNWAYSEMAIL_FLAG
1AP001HeathrowUKEuropeinfo@heathrow.comACTIVE01JAN1946802VALID
2AP002Dubai IntlUAEAsiacontactdubai.comACTIVE.924INVALID
3AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754VALID
4AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754VALID
5AP004HanedaNULLAsiahaneda@airportOPEN.85.VALID
6AP005CharlesdegaulleFRANCEEuropecdg@airport.frACTIVE01JAN1974.4VALID
7AP006FrankfurtGERMANYEuropefrankfurt@email.comACTIVE01JAN1936701VALID
8AP007Delhi IntlINDIAAsiadelhi@@mail.comRUNNING15AUG2001693INVALID
9AP008Sydney KingsfordAUSTRALIAOceaniasydney@airport.auACTIVE01JAN192044.VALID
10AP009Changi AirportSINGAPOREAsiachangi@airport.sgACTIVE01JAN1981682VALID

Explanation

Email corruption is extremely common in operational systems.

This step uses:

  • INDEX
  • COUNTC

to detect malformed addresses.

In healthcare:

  • corrupted investigator emails
  • failed patient notifications
  • broken regulatory communication
can become compliance incidents.

4.Deduplication Using PROC SORT

proc sort data=airports_clean3

          out=airports_nodup nodupkey;

by AIRPORT_ID;

run;

proc print data = airports_nodup;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYREGIONCONTACT_EMAILSTATUSOPEN_DATEPASSENGERS_MRUNWAYSEMAIL_FLAG
1AP001HeathrowUKEuropeinfo@heathrow.comACTIVE01JAN1946802VALID
2AP002Dubai IntlUAEAsiacontactdubai.comACTIVE.924INVALID
3AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754VALID
4AP004HanedaNULLAsiahaneda@airportOPEN.85.VALID
5AP005CharlesdegaulleFRANCEEuropecdg@airport.frACTIVE01JAN1974.4VALID
6AP006FrankfurtGERMANYEuropefrankfurt@email.comACTIVE01JAN1936701VALID
7AP007Delhi IntlINDIAAsiadelhi@@mail.comRUNNING15AUG2001693INVALID
8AP008Sydney KingsfordAUSTRALIAOceaniasydney@airport.auACTIVE01JAN192044.VALID
9AP009Changi AirportSINGAPOREAsiachangi@airport.sgACTIVE01JAN1981682VALID

Explanation

Duplicate records are among the most dangerous enterprise problems.

Why?

  • Double-counting
  • Inflated metrics
  • Incorrect patient enrollment
  • Duplicate insurance claims

NODUPKEY removes repeated keys efficiently.

In clinical trials, duplicate subject records can invalidate submissions.

5.Validation with PROC FORMAT

proc format;

value $regionfmt

'Asia','Europe','North America','Oceania'='VALID'

other='INVALID';

run;

LOG:

NOTE: Format $REGIONFMT has been output.


data airports_validated;

set airports_nodup;

REGION_STATUS = put(REGION,$regionfmt.);

run;

proc print data = airports_validated;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYREGIONCONTACT_EMAILSTATUSOPEN_DATEPASSENGERS_MRUNWAYSEMAIL_FLAGREGION_STATUS
1AP001HeathrowUKEuropeinfo@heathrow.comACTIVE01JAN1946802VALIDVALID
2AP002Dubai IntlUAEAsiacontactdubai.comACTIVE.924INVALIDVALID
3AP003Lax AirportUSANorth Americasupport@lax.comACTIVE01JAN1930754VALIDVALID
4AP004HanedaNULLAsiahaneda@airportOPEN.85.VALIDVALID
5AP005CharlesdegaulleFRANCEEuropecdg@airport.frACTIVE01JAN1974.4VALIDVALID
6AP006FrankfurtGERMANYEuropefrankfurt@email.comACTIVE01JAN1936701VALIDVALID
7AP007Delhi IntlINDIAAsiadelhi@@mail.comRUNNING15AUG2001693INVALIDVALID
8AP008Sydney KingsfordAUSTRALIAOceaniasydney@airport.auACTIVE01JAN192044.VALIDVALID
9AP009Changi AirportSINGAPOREAsiachangi@airport.sgACTIVE01JAN1981682VALIDVALID

Explanation

PROC FORMAT acts like enterprise metadata governance.

Benefits:

  • Centralized validation logic
  • Standardized classifications
  • Reusable compliance checks

Instead of hardcoding conditions repeatedly, formats create maintainable validation frameworks.

6.PROC SQL Cleaning Strategy

proc sql;

create table airport_sql as

select distinct AIRPORT_ID,

propcase(AIRPORT_NAME) as AIRPORT_NAME,

upcase(COUNTRY) as COUNTRY,

mean(PASSENGERS_M) as AVG_PASSENGERS,

sum(RUNWAYS) as TOTAL_RUNWAYS

from airports_validated

group by AIRPORT_ID;

quit;

proc print data = airport_sql;

run;

OUTPUT:

ObsAIRPORT_IDAIRPORT_NAMECOUNTRYAVG_PASSENGERSTOTAL_RUNWAYS
1AP001HeathrowUK802
2AP002Dubai IntlUAE924
3AP003Lax AirportUSA754
4AP004HanedaNULL85.
5AP005CharlesdegaulleFRANCE.4
6AP006FrankfurtGERMANY701
7AP007Delhi IntlINDIA693
8AP008Sydney KingsfordAUSTRALIA44.
9AP009Changi AirportSINGAPORE682

Explanation

PROC SQL is ideal for:

  • Aggregation
  • Complex joins
  • Reporting layers
  • Multi-table integrations

DATA Step excels in row-level transformations.

Enterprise SAS programmers use BOTH strategically.

7.Advanced SAS Macro Framework

%macro missing_check(ds,var);

proc sql;

select count(*) as Missing_Count

from &ds

where missing(&var);

quit;

%mend;

%missing_check(airports_validated,COUNTRY);

OUTPUT:

Missing_Count
0

Explanation

Macros enable reusable enterprise automation.

Benefits:

  • Standardized QC
  • Reduced manual coding
  • Faster production deployment
  • Consistent validation

Large pharmaceutical companies heavily depend on macro libraries.

8.  Load Required Libraries

library(tidyverse)

library(lubridate)

library(janitor)

Create Raw Airport Dataset

airports_raw <- tribble(

  ~AIRPORT_ID, ~AIRPORT_NAME,        ~COUNTRY,   ~REGION,          ~PASSENGERS_M, ~RUNWAYS, ~OPEN_DATE,  ~CONTACT_EMAIL,            ~STATUS,  

  "AP001",     "Heathrow",           "UK",       "Europe",         "80",          "2",      "01JAN1946", "info@heathrow.com",       "ACTIVE",  

  "AP002",     "Dubai Intl",         "UAE",      "asia",           "92",          "4",      "30FEB1960", "contactdubai.com",        "ACTIVE", 

  "AP003",     "  lax airport",      "USA",      "North America",  "-75",         "4",      "01JAN1930", "support@lax.com",         "ACTIVE",

  "AP003",     "lax airport",        "usa",      "north america",  "75",          "4",      "01JAN1930", "support@lax.com",         "ACTIVE", 

  "AP004",     "Haneda",             "null",     "ASIA",           "85",          "0",      NA,          "haneda@airport",          "OPEN",  

  "AP005",     "CharlesDeGaulle",    "France",   "Europe",         "NULL",        "4",      "01JAN1974", "cdg@airport.fr",          "ACTIVE",  

  "AP006",     "Frankfurt",          "Germany",  "Europe",         "70",          "-1",     "01JAN1936", "frankfurt@email.com",     "ACTIVE",  

  "AP007",     "Delhi Intl",         "India",    "Asia",           "69",          "3",      "15AUG2001", "delhi@@mail.com",         "RUNNING",

  "AP008",     "Sydney Kingsford",   "Australia","Oceania",       "44",          "abc",    "01JAN1920", "sydney@airport.au",       "ACTIVE",

  "AP009",     "Changi Airport",     "Singapore","Asia",          "68",          "2",      "01JAN1981", "changi@airport.sg",       "ACTIVE"

)

OUTPUT:

 

AIRPORT_ID

AIRPORT_NAME

COUNTRY

REGION

PASSENGERS_M

RUNWAYS

OPEN_DATE

CONTACT_EMAIL

STATUS

1

AP001

Heathrow

UK

Europe

80

2

01-Jan-1946

info@heathrow.com

ACTIVE

2

AP002

Dubai Intl

UAE

asia

92

4

             30FEB1960

contactdubai.com

ACTIVE

3

AP003

  lax airport

USA

North America

-75

4

01-Jan-1930

support@lax.com

ACTIVE

4

AP003

lax airport

usa

north america

75

4

01-Jan-1930

support@lax.com

ACTIVE

5

AP004

Haneda

null

ASIA

85

0

                         NA

haneda@airport

OPEN

6

AP005

CharlesDeGaulle

France

Europe

NULL

4

01-Jan-1974

cdg@airport.fr

ACTIVE

7

AP006

Frankfurt

Germany

Europe

70

-1

01-Jan-1936

frankfurt@email.com

ACTIVE

8

AP007

Delhi Intl

India

Asia

69

3

15-Aug-2001

delhi@@mail.com

RUNNING

9

AP008

Sydney Kingsford

Australia

Oceania

44

abc

01-Jan-1920

sydney@airport.au

ACTIVE

10

AP009

Changi Airport

Singapore

Asia

68

2

01-Jan-1981

changi@airport.sg

ACTIVE

Explanation

1. library(tidyverse)

library(tidyverse)

Loads multiple packages together:

  • dplyr
  • tidyr
  • ggplot2
  • stringr
  • readr
  • tibble

This is the most commonly used modern R data-engineering ecosystem.

Equivalent SAS concept:

  • DATA Step
  • PROC SQL
  • string functions
  • PROC SORT

R tidyverse provides elegant pipeline-based transformations.

Equivalent SAS vs R comparison:

SAS

R

DATA Step

mutate()

IF-THEN

case_when()

PROC SORT

arrange()

PROC SQL

dplyr joins

STRIP

str_trim()

UPCASE

str_to_upper()

R excels in:

  • Exploratory analytics
  • Flexible transformations
  • Rapid prototyping

2. library(lubridate)

library(lubridate)

Used for:

  • date parsing
  • date cleaning
  • time calculations

Equivalent SAS:

  • INPUT()
  • INTNX()
  • INTCK()
  • date informats

3. library(janitor)

library(janitor)

Provides:

  • column cleaning
  • duplicate checks
  • frequency summaries

Very useful in enterprise cleaning workflows.

4. tribble()

tribble()

Creates a dataset manually in table format.

Equivalent SAS:

datalines;

Very useful for:

  • mock projects
  • interview practice
  • testing validation logic

5. ~ Symbol

~AIRPORT_ID

Defines column names.

Equivalent SAS:

input AIRPORT_ID $

6. Character Storage in R

Notice:

"80"

"NULL"

"abc"

Everything is stored as character initially.

This is actually safer for enterprise ingestion.

Unlike SAS:

  • R automatically treats mixed-type columns as character
  • avoids immediate numeric conversion errors

Important Difference Between SAS vs R

SAS

R

Numeric conversion errors appear immediately

R stores mixed values as character

INPUT statement controls type

Type inferred dynamically

Invalid numeric notes generated

Conversion delayed until explicit

Why This Design Is Enterprise-Friendly

Raw operational systems often contain:

  • NULL
  • UNKNOWN
  • abc
  • blank strings

Reading everything as character first:
preserves raw data
prevents ingestion failure
improves traceability

Corrupted Data Examples in Dataset

Variable

Corruption

PASSENGERS_M

NULL

RUNWAYS

abc

OPEN_DATE

30FEB1960

COUNTRY

null

REGION

mixed case

CONTACT_EMAIL

missing @

AIRPORT_ID

duplicate AP003

Why "30FEB1960" Is Invalid

February never has:

30 days

This creates:

Invalid date corruption

Equivalent SAS behavior:

OPEN_DATE=.

In R:

  • conversion produces NA

Why Duplicate AIRPORT_ID Is Important

AP003

appears twice.

This creates:

  • duplicate reporting
  • double-counting
  • dashboard inflation

Very common real-world issue.

Why Negative Passenger Counts Matter

-75

Negative passengers are unrealistic business values.

This indicates:

  • ETL corruption
  • sign reversal
  • upstream system bug

Why "abc" in RUNWAYS Is Dangerous

Runways should be numeric.

But:

abc

creates:

  • failed calculations
  • broken summaries
  • missing KPIs

Equivalent SAS issue:

Invalid numeric data

9.Cleaning Step in R

airports_clean <- airports_raw %>%

  mutate(

    PASSENGERS_M = as.numeric(na_if(PASSENGERS_M,"NULL")),  

    RUNWAYS = as.numeric(na_if(RUNWAYS,"abc"))

)

OUTPUT:

 

AIRPORT_ID

AIRPORT_NAME

COUNTRY

REGION

PASSENGERS_M

RUNWAYS

OPEN_DATE

CONTACT_EMAIL

STATUS

1

AP001

Heathrow

UK

Europe

80

2

01-Jan-1946

info@heathrow.com

ACTIVE

2

AP002

Dubai Intl

UAE

asia

92

4

        30FEB1960

contactdubai.com

ACTIVE

3

AP003

  lax airport

USA

North America

-75

4

01-Jan-1930

support@lax.com

ACTIVE

4

AP003

lax airport

usa

north america

75

4

01-Jan-1930

support@lax.com

ACTIVE

5

AP004

Haneda

null

ASIA

85

0

                     NA

haneda@airport

OPEN

6

AP005

CharlesDeGaulle

France

Europe

NA

4

01-Jan-1974

cdg@airport.fr

ACTIVE

7

AP006

Frankfurt

Germany

Europe

70

-1

01-Jan-1936

frankfurt@email.com

ACTIVE

8

AP007

Delhi Intl

India

Asia

69

3

15-Aug-2001

delhi@@mail.com

RUNNING

9

AP008

Sydney Kingsford

Australia

Oceania

44

NA

01-Jan-1920

sydney@airport.au

ACTIVE

10

AP009

Changi Airport

Singapore

Asia

68

2

01-Jan-1981

changi@airport.sg

ACTIVE

Explanation

Equivalent SAS:

input(var,best12.)

with validation logic.

Business Understanding

This raw dataset intentionally simulates:

  • corrupted operational systems
  • invalid business inputs
  • inconsistent metadata
  • poor source-system governance

10.R Data Cleaning Workflow

airports_clean1 <- airports_clean %>%

  clean_names() %>%

  mutate( 

    airport_name = str_to_title(str_trim(airport_name)),  

    country = str_to_upper(country),   

    region = str_to_title(region), 

    contact_email = str_trim(str_to_lower(contact_email)),  

    passengers_m = abs(passengers_m),  

    status = case_when(

      status %in% c("ACTIVE","RUNNING") ~ "ACTIVE",

      TRUE ~ "REVIEW"),  

    open_date_clean =

      parse_date_time(

        open_date,

        orders = c("dmy","ymd")),  

    date_flag = if_else(

      is.na(open_date_clean),

      "INVALID_DATE",

      "VALID_DATE")

  ) %>%

  distinct(airport_id, .keep_all = TRUE)

CONSOLE:

Warning message: There was 1 warning in `mutate()`. ℹ In argument: `open_date_clean = parse_date_time(open_date, orders = c("dmy", "ymd"))`. Caused by warning: ! 1 failed to parse.

OUTPUT:

 

airport_id

airport_name

country

region

passengers_m

runways

open_date

contact_email

status

open_date_clean

date_flag

1

AP001

Heathrow

UK

Europe

80

2

01-Jan-1946

info@heathrow.com

ACTIVE

01-01-1946

VALID_DATE

2

AP002

Dubai Intl

UAE

Asia

92

4

             30FEB1960

contactdubai.com

ACTIVE

                            NA

INVALID_DATE

3

AP003

Lax Airport

USA

North America

75

4

01-Jan-1930

support@lax.com

ACTIVE

01-01-1930

VALID_DATE

4

AP004

Haneda

NULL

Asia

85

0

                    NA

haneda@airport

REVIEW

                            NA

INVALID_DATE

5

AP005

Charlesdegaulle

FRANCE

Europe

NA

4

01-Jan-1974

cdg@airport.fr

ACTIVE

01-01-1974

VALID_DATE

6

AP006

Frankfurt

GERMANY

Europe

70

-1

01-Jan-1936

frankfurt@email.com

ACTIVE

01-01-1936

VALID_DATE

7

AP007

Delhi Intl

INDIA

Asia

69

3

15-Aug-2001

delhi@@mail.com

ACTIVE

15-08-2001

VALID_DATE

8

AP008

Sydney Kingsford

AUSTRALIA

Oceania

44

NA

01-Jan-1920

sydney@airport.au

ACTIVE

01-01-1920

VALID_DATE

9

AP009

Changi Airport

SINGAPORE

Asia

68

2

01-Jan-1981

changi@airport.sg

ACTIVE

01-01-1981

VALID_DATE

warning appears because:

Problem

Reason

30FEB1960

Impossible calendar date

parse_date_time()

Cannot create valid date

Result

NA generated

Warning

Data-quality parsing warning

Explanation

distinct() removes duplicate airport records efficiently.

Equivalent SAS:

  • PROC SORT NODUPKEY

Deduplication is foundational for:

  • regulatory accuracy
  • operational intelligence
  • financial reporting

Business Logic Behind Data Cleaning

Enterprise data cleaning is not cosmetic work.
It directly influences analytical trustworthiness.

Example:

  • Passenger counts cannot be negative.
  • Airport opening dates must be valid.
  • Region values must follow controlled terminology.
  • Emails must satisfy communication standards.

If age values in clinical trials exceed realistic human ranges:

  • statistical models fail
  • SDTM validation flags appear
  • FDA reviewers question dataset credibility

Similarly:

  • missing dates break time-series forecasting
  • inconsistent text breaks joins
  • duplicate IDs inflate KPIs

Imputation strategies are used carefully:

  • Replace missing regions using reference mappings
  • Infer missing categories from metadata
  • Standardize timestamps into ISO formats

Every cleaning decision requires:

  • traceability
  • reproducibility
  • auditability

because enterprise analytics drives billion-dollar decisions.

Enterprise Validation & Compliance

Why SAS Missing Values Are Dangerous

In SAS:

if . < 100 then FLAG='YES';

This evaluates TRUE.

Why?

  • Missing numeric values are treated lower than valid numbers.

This can silently corrupt:

  • safety analyses
  • risk scores
  • clinical summaries

Always validate missing handling explicitly.

SDTM & ADaM Relevance

Clinical trial standards require:

  • traceability
  • reproducibility
  • audit trails
  • QC independence

Poor cleaning can:

  • invalidate FDA submissions
  • create ADaM inconsistencies
  • fail Pinnacle 21 checks

20 Enterprise Data-Cleaning Best Practices

  1. Standardize metadata early
  2. Validate dates immediately
  3. Remove duplicates before aggregation
  4. Never trust source-system casing
  5. Use reusable macros
  6. Maintain audit trails
  7. Document derivations clearly
  8. Validate numeric ranges
  9. Separate raw and clean layers
  10. Use QC independence
  11. Apply controlled terminology
  12. Validate email structures
  13. Protect original datasets
  14. Version-control programs
  15. Use defensive programming
  16. Automate validation checks
  17. Monitor truncation risks
  18. Standardize missing values
  19. Test joins carefully
  20. Validate outputs statistically

20 Sharp SAS & R Cleaning Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic is stronger than visual inspection.
  • Standardized variables improve reproducibility.
  • Duplicate records destroy trust.
  • Missing dates break forecasting models.
  • PROC FORMAT improves governance.
  • Macros reduce enterprise coding risk.
  • Arrays simplify repetitive validation.
  • R pipelines improve readability.
  • SAS excels in auditability.
  • Clinical data requires traceability.
  • Controlled terminology prevents fragmentation.
  • QC must remain independent.
  • Negative values distort analytics.
  • Character truncation is silent corruption.
  • SQL joins require standardized keys.
  • Regulatory submissions demand consistency.
  • Data lineage matters.
  • Enterprise cleaning is engineering.
  • Reliable analytics begins with clean data.

SAS vs R for Enterprise Cleaning

Feature

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Industry Standard

Growing

Scalability

Very Strong

Strong

Flexibility

Structured

Extremely Flexible

Validation

Enterprise-grade

Community-driven

Visualization

Moderate

Excellent

Automation

Macro-driven

Package-driven

SAS dominates regulated industries because of:

  • stability
  • validation frameworks
  • traceability
  • reproducibility

R dominates:

  • advanced analytics
  • visualization
  • rapid experimentation

Modern enterprises increasingly combine BOTH.

Conclusion

Enterprise data cleaning is far more than formatting columns and fixing spelling mistakes. It is the foundation of trustworthy analytics, regulatory compliance, operational intelligence, and executive decision-making. Whether the domain is aviation, healthcare, banking, insurance, or retail, corrupted data silently destroys dashboards, AI predictions, forecasting systems, and statistical outputs. A single duplicate identifier or malformed date can cascade into millions of dollars of business risk.

This Large Airports in World project demonstrated how real-world operational data often arrives with severe corruption issues including duplicate keys, inconsistent text values, invalid numeric ranges, malformed emails, missing timestamps, and broken categorical labels. Using SAS and R together creates a powerful enterprise-grade cleaning ecosystem capable of transforming unreliable raw records into validated analytical intelligence.

SAS remains unmatched for:

  • regulatory traceability
  • audit readiness
  • enterprise-scale validation
  • production stability
  • macro automation
  • SDTM/ADaM compliance

Meanwhile, R provides:

  • flexible transformation pipelines
  • rapid exploration
  • modern string handling
  • advanced visualization
  • scalable analytical workflows

The most effective enterprise data teams no longer treat SAS and R as competitors. Instead, they integrate both strategically:

  • SAS for validated production pipelines
  • R for flexible analytics and exploratory engineering

The real lesson is this:

Clean data is not a technical luxury.
It is the backbone of trustworthy business intelligence.

Organizations that invest in structured data-cleaning frameworks gain:

  • reliable dashboards
  • accurate forecasting
  • reproducible analytics
  • compliant regulatory submissions
  • trustworthy AI systems

In modern analytics ecosystems, data cleaning is no longer a preprocessing step.
It is a critical enterprise engineering discipline that protects decision-making integrity across the entire organization.

Interview Questions & Answers

1. How would you handle duplicate airport records?

Use PROC SORT NODUPKEY or DISTINCT in R.
Then validate business keys and compare record counts before/after cleaning.

2. Why is LENGTH placement critical in SAS?

Because SAS determines variable attributes during compilation. Incorrect placement causes character truncation and data loss.

3. How do you validate malformed emails?

Use INDEX, FIND, PRXMATCH in SAS or grepl() in R to detect invalid structures.

4. Why are missing values dangerous in SAS?

Missing numeric values evaluate lower than valid numbers, causing incorrect conditional logic and statistical distortion.

5. When would you choose PROC SQL over DATA Step?

PROC SQL:

  • joins
  • aggregations
  • summarization

DATA Step:

  • row-level transformations
  • iterative logic
  • arrays
  • retained calculations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about AIRPORT DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS