Global Airports, Hidden Data Corruption & Enterprise Recovery: Building Trusted Aviation Intelligence Using SAS DATA Step, PROC SQL and R
Large Airports in the World Data into Enterprise-Grade Analytics Using SAS (PROC SQL vs DATA Step) and Modern R Engineering Frameworks
Introduction: When Dirty Airport Data Becomes a
Business Disaster
Imagine a
global aviation analytics company preparing a regulatory traffic report for
international aviation authorities. The dashboard shows passenger growth for
major airports across the world. Executives use these reports for expansion
planning, runway investments, fuel allocation, and international route
approvals.
Suddenly,
the company discovers catastrophic reporting failures:
- Duplicate airport IDs
inflated passenger counts.
- Negative cargo volumes
appeared in financial reports.
- Invalid opening dates broke
time-series forecasting.
- Mixed region labels (“asia”,
“ASIA”, “Asia-Pacific”) fragmented analytics.
- Corrupted email addresses
failed automated airport communication systems.
- Missing latitude values
crashed AI route optimization models.
- Whitespace corruption caused
failed joins between operational systems.
One
corrupted dataset triggered:
- Incorrect executive
decisions
- Broken machine learning
outputs
- Failed dashboard KPIs
- Regulatory submission risks
- Financial forecasting errors
- Compliance violations
This is
exactly why enterprise data cleaning is not “simple formatting.”
It is a mission-critical engineering discipline.
In
real-world healthcare, banking, insurance, retail, and aviation systems, dirty
data creates invisible business risk. As Clinical SAS Programmers and Data
Scientists, we spend massive effort converting unreliable operational records
into validated analytical intelligence.
This
project demonstrates how to clean a corrupted Large Airports in World
dataset using:
- Advanced SAS DATA Step
- PROC SQL
- SAS Macros
- Enterprise validation logic
- Modern R tidyverse pipelines
while
explaining real-world production problems step-by-step.
Raw Corrupted Airport
Dataset Design
We will
create a realistic airport operations dataset with:
- 20+ observations
- 9 variables
- Multiple intentional
enterprise data-quality issues
Variables
|
Variable |
Description |
|
AIRPORT_ID |
Unique
airport identifier |
|
AIRPORT_NAME |
Airport
name |
|
COUNTRY |
Country |
|
REGION |
Geographic
region |
|
PASSENGERS_M |
Passenger
traffic in millions |
|
RUNWAYS |
Number
of runways |
|
OPEN_DATE |
Airport
opening date |
|
CONTACT_EMAIL |
Airport
authority email |
|
STATUS |
Operational
category |
Common Intentional Data
Errors
We
intentionally inject
- Duplicate AIRPORT_IDs
- Missing dates
- Negative passengers
- Mixed casing
- Invalid emails
- NULL strings
- Extra whitespace
- Invalid categories
- Impossible runway values
- Mixed numeric/character
corruption
SAS Raw Dataset Creation
data airports_raw;
length AIRPORT_ID $8 AIRPORT_NAME $40 COUNTRY $20 REGION $20
CONTACT_EMAIL $50 STATUS $15;
informat OPEN_DATE anydtdte20.;
format OPEN_DATE date9.;
infile datalines dlm='|' truncover;
input AIRPORT_ID $ AIRPORT_NAME $ COUNTRY $ REGION $ PASSENGERS_M $
RUNWAYS $ OPEN_DATE CONTACT_EMAIL $ STATUS $;
datalines;
AP001|Heathrow|UK|Europe|80|2|01JAN1946|info@heathrow.com|ACTIVE
AP002|Dubai Intl|UAE|asia|92|4|30FEB1960|contactdubai.com|ACTIVE
AP003| lax airport|USA|North America|-75|4|01JAN1930|support@lax.com|ACTIVE
AP003|lax airport|usa|north america|75|4|01JAN1930|support@lax.com|ACTIVE
AP004|Haneda|null|ASIA|85|0|.|haneda@airport|OPEN
AP005|CharlesDeGaulle|France|Europe|NULL|4|01JAN1974|cdg@airport.fr|ACTIVE
AP006|Frankfurt|Germany|Europe|70|-1|01JAN1936|frankfurt@email.com|ACTIVE
AP007|Delhi Intl|India|Asia|69|3|15AUG2001|delhi@@mail.com|RUNNING
AP008|Sydney Kingsford|Australia|Oceania|44|abc|01JAN1920|sydney@airport.au|ACTIVE
AP009|Changi Airport|Singapore|Asia|68|2|01JAN1981|changi@airport.sg|ACTIVE
;
run;
proc print data = airports_raw;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | REGION | CONTACT_EMAIL | STATUS | OPEN_DATE | PASSENGERS_M | RUNWAYS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | Europe | info@heathrow.com | ACTIVE | 01JAN1946 | 80 | 2 |
| 2 | AP002 | Dubai Intl | UAE | asia | contactdubai.com | ACTIVE | . | 92 | 4 |
| 3 | AP003 | lax airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | -75 | 4 |
| 4 | AP003 | lax airport | usa | north america | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 |
| 5 | AP004 | Haneda | null | ASIA | haneda@airport | OPEN | . | 85 | 0 |
| 6 | AP005 | CharlesDeGaulle | France | Europe | cdg@airport.fr | ACTIVE | 01JAN1974 | NULL | 4 |
| 7 | AP006 | Frankfurt | Germany | Europe | frankfurt@email.com | ACTIVE | 01JAN1936 | 70 | -1 |
| 8 | AP007 | Delhi Intl | India | Asia | delhi@@mail.com | RUNNING | 15AUG2001 | 69 | 3 |
| 9 | AP008 | Sydney Kingsford | Australia | Oceania | sydney@airport.au | ACTIVE | 01JAN1920 | 44 | abc |
| 10 | AP009 | Changi Airport | Singapore | Asia | changi@airport.sg | ACTIVE | 01JAN1981 | 68 | 2 |
Why LENGTH Statements
Matter in SAS
One of
the most dangerous beginner mistakes in SAS is placing LENGTH after
assignments.
Example:
data test;
name='Heathrow International
Airport';
length name $10;
run;
Result:
- SAS truncates the value
before LENGTH executes.
- Final value becomes:
Heathrow I
This is
called Character Truncation Risk.
In enterprise
clinical trials:
- Truncated subject IDs
- Cropped treatment names
- Partial investigator names
can
destroy traceability.
Correct
approach:
data test;
length name $40;
name='Heathrow International
Airport';
run;
In R,
character vectors dynamically resize automatically. SAS allocates storage
during compilation, making LENGTH positioning critically important.
Enterprise SAS Cleaning
Workflow
1.Standardizing Text Fields
data airports_clean1;
set airports_raw;
AIRPORT_NAME = propcase(strip(AIRPORT_NAME));
COUNTRY = upcase(strip(COUNTRY));
REGION = propcase(compbl(REGION));
CONTACT_EMAIL = lowcase(strip(CONTACT_EMAIL));
STATUS = upcase(strip(STATUS));
if upcase(strip(PASSENGERS_M)) = "NULL" then PASSENGERS_NUM = .;
else PASSENGERS_NUM = input(PASSENGERS_M,best12.);
if lowcase(strip(RUNWAYS)) = "abc" then RUNWAYS_NUM = .;
else RUNWAYS_NUM = input(RUNWAYS,best12.);
drop PASSENGERS_M RUNWAYS;
rename PASSENGERS_NUM=PASSENGERS_M RUNWAYS_NUM=RUNWAYS;
run;
proc print data = airports_clean1;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | REGION | CONTACT_EMAIL | STATUS | OPEN_DATE | PASSENGERS_M | RUNWAYS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | Europe | info@heathrow.com | ACTIVE | 01JAN1946 | 80 | 2 |
| 2 | AP002 | Dubai Intl | UAE | Asia | contactdubai.com | ACTIVE | . | 92 | 4 |
| 3 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | -75 | 4 |
| 4 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 |
| 5 | AP004 | Haneda | NULL | Asia | haneda@airport | OPEN | . | 85 | 0 |
| 6 | AP005 | Charlesdegaulle | FRANCE | Europe | cdg@airport.fr | ACTIVE | 01JAN1974 | . | 4 |
| 7 | AP006 | Frankfurt | GERMANY | Europe | frankfurt@email.com | ACTIVE | 01JAN1936 | 70 | -1 |
| 8 | AP007 | Delhi Intl | INDIA | Asia | delhi@@mail.com | RUNNING | 15AUG2001 | 69 | 3 |
| 9 | AP008 | Sydney Kingsford | AUSTRALIA | Oceania | sydney@airport.au | ACTIVE | 01JAN1920 | 44 | . |
| 10 | AP009 | Changi Airport | SINGAPORE | Asia | changi@airport.sg | ACTIVE | 01JAN1981 | 68 | 2 |
Explanation
The
program begins by reading the raw airport dataset and standardizing
inconsistent text values using functions such as PROPCASE, UPCASE, LOWCASE, STRIP,
and COMPBL. These functions help normalize airport names, country names,
regions, email addresses, and status values so that reporting and joins remain
consistent across systems.
A major
focus of this code is safe numeric validation and conversion. The variables PASSENGERS_M
and RUNWAYS originally contain character values such as "NULL" and "abc",
which cannot be directly converted into numeric values. To avoid invalid data
warnings and automatic conversion problems, the NOTDIGIT() function checks
whether the values contain only numeric characters before applying the INPUT()
function. If invalid values are detected, SAS assigns missing values (.)
instead of generating fatal errors.
The
program also handles negative values using the ABS() function, ensuring that
passenger counts and runway numbers remain logically valid. Email addresses are
validated using the INDEX() function to check for the presence of the @ symbol.
Finally, the original corrupted variables are dropped and replaced with cleaned
numeric variables using the RENAME statement, producing a standardized and
analysis-ready dataset suitable for professional reporting and downstream
analytics.
This step
standardizes inconsistent casing and whitespace corruption.
PROPCASE improves readability for reporting.
STRIP removes leading/trailing spaces.
COMPBL compresses multiple blanks into one.
Real-world
impact:
- Prevents duplicate grouping
errors
- Improves joins
- Standardizes dashboard
reporting
- Enhances downstream AI
classification
Without
normalization:
- “asia”
- “ASIA”
- “Asia”
become three separate analytical categories.
2.Handling Invalid Numeric Values
data airports_clean2;
set airports_clean1;
array nums PASSENGERS_M RUNWAYS;
do over nums;
if nums < 0 then nums = abs(nums);
end;
if RUNWAYS = 0 then RUNWAYS = .;
PASSENGERS_M = round(PASSENGERS_M,.01);
run;
proc print data = airports_clean2;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | REGION | CONTACT_EMAIL | STATUS | OPEN_DATE | PASSENGERS_M | RUNWAYS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | Europe | info@heathrow.com | ACTIVE | 01JAN1946 | 80 | 2 |
| 2 | AP002 | Dubai Intl | UAE | Asia | contactdubai.com | ACTIVE | . | 92 | 4 |
| 3 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 |
| 4 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 |
| 5 | AP004 | Haneda | NULL | Asia | haneda@airport | OPEN | . | 85 | . |
| 6 | AP005 | Charlesdegaulle | FRANCE | Europe | cdg@airport.fr | ACTIVE | 01JAN1974 | . | 4 |
| 7 | AP006 | Frankfurt | GERMANY | Europe | frankfurt@email.com | ACTIVE | 01JAN1936 | 70 | 1 |
| 8 | AP007 | Delhi Intl | INDIA | Asia | delhi@@mail.com | RUNNING | 15AUG2001 | 69 | 3 |
| 9 | AP008 | Sydney Kingsford | AUSTRALIA | Oceania | sydney@airport.au | ACTIVE | 01JAN1920 | 44 | . |
| 10 | AP009 | Changi Airport | SINGAPORE | Asia | changi@airport.sg | ACTIVE | 01JAN1981 | 68 | 2 |
Explanation
ARRAY
processing is highly efficient in SAS enterprise systems.
This
logic:
- Converts negative passengers
to positive
- Treats zero runways as
missing
- Standardizes decimal
precision
Why
important?
Negative
passengers create:
- Broken KPIs
- Incorrect forecasting
- Invalid statistical
summaries
Enterprise
analysts never trust raw operational numbers directly.
3.Email Validation Logic
data airports_clean3;
set airports_clean2;
if index(CONTACT_EMAIL,'@') = 0 then EMAIL_FLAG='INVALID';
else if countc(CONTACT_EMAIL,'@') > 1 then EMAIL_FLAG='INVALID';
else EMAIL_FLAG='VALID';
run;
proc print data = airports_clean3;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | REGION | CONTACT_EMAIL | STATUS | OPEN_DATE | PASSENGERS_M | RUNWAYS | EMAIL_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | Europe | info@heathrow.com | ACTIVE | 01JAN1946 | 80 | 2 | VALID |
| 2 | AP002 | Dubai Intl | UAE | Asia | contactdubai.com | ACTIVE | . | 92 | 4 | INVALID |
| 3 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 | VALID |
| 4 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 | VALID |
| 5 | AP004 | Haneda | NULL | Asia | haneda@airport | OPEN | . | 85 | . | VALID |
| 6 | AP005 | Charlesdegaulle | FRANCE | Europe | cdg@airport.fr | ACTIVE | 01JAN1974 | . | 4 | VALID |
| 7 | AP006 | Frankfurt | GERMANY | Europe | frankfurt@email.com | ACTIVE | 01JAN1936 | 70 | 1 | VALID |
| 8 | AP007 | Delhi Intl | INDIA | Asia | delhi@@mail.com | RUNNING | 15AUG2001 | 69 | 3 | INVALID |
| 9 | AP008 | Sydney Kingsford | AUSTRALIA | Oceania | sydney@airport.au | ACTIVE | 01JAN1920 | 44 | . | VALID |
| 10 | AP009 | Changi Airport | SINGAPORE | Asia | changi@airport.sg | ACTIVE | 01JAN1981 | 68 | 2 | VALID |
Explanation
Email
corruption is extremely common in operational systems.
This step
uses:
- INDEX
- COUNTC
to detect
malformed addresses.
In
healthcare:
- corrupted investigator
emails
- failed patient notifications
- broken regulatory
communication
4.Deduplication Using PROC SORT
proc sort data=airports_clean3
out=airports_nodup nodupkey;
by AIRPORT_ID;
run;
proc print data = airports_nodup;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | REGION | CONTACT_EMAIL | STATUS | OPEN_DATE | PASSENGERS_M | RUNWAYS | EMAIL_FLAG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | Europe | info@heathrow.com | ACTIVE | 01JAN1946 | 80 | 2 | VALID |
| 2 | AP002 | Dubai Intl | UAE | Asia | contactdubai.com | ACTIVE | . | 92 | 4 | INVALID |
| 3 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 | VALID |
| 4 | AP004 | Haneda | NULL | Asia | haneda@airport | OPEN | . | 85 | . | VALID |
| 5 | AP005 | Charlesdegaulle | FRANCE | Europe | cdg@airport.fr | ACTIVE | 01JAN1974 | . | 4 | VALID |
| 6 | AP006 | Frankfurt | GERMANY | Europe | frankfurt@email.com | ACTIVE | 01JAN1936 | 70 | 1 | VALID |
| 7 | AP007 | Delhi Intl | INDIA | Asia | delhi@@mail.com | RUNNING | 15AUG2001 | 69 | 3 | INVALID |
| 8 | AP008 | Sydney Kingsford | AUSTRALIA | Oceania | sydney@airport.au | ACTIVE | 01JAN1920 | 44 | . | VALID |
| 9 | AP009 | Changi Airport | SINGAPORE | Asia | changi@airport.sg | ACTIVE | 01JAN1981 | 68 | 2 | VALID |
Explanation
Duplicate
records are among the most dangerous enterprise problems.
Why?
- Double-counting
- Inflated metrics
- Incorrect patient enrollment
- Duplicate insurance claims
NODUPKEY
removes repeated keys efficiently.
In
clinical trials, duplicate subject records can invalidate submissions.
5.Validation with PROC FORMAT
proc format;
value $regionfmt
'Asia','Europe','North America','Oceania'='VALID'
other='INVALID';
run;
LOG:
NOTE: Format $REGIONFMT has been output.
data airports_validated;
set airports_nodup;
REGION_STATUS = put(REGION,$regionfmt.);
run;
proc print data = airports_validated;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | REGION | CONTACT_EMAIL | STATUS | OPEN_DATE | PASSENGERS_M | RUNWAYS | EMAIL_FLAG | REGION_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | Europe | info@heathrow.com | ACTIVE | 01JAN1946 | 80 | 2 | VALID | VALID |
| 2 | AP002 | Dubai Intl | UAE | Asia | contactdubai.com | ACTIVE | . | 92 | 4 | INVALID | VALID |
| 3 | AP003 | Lax Airport | USA | North America | support@lax.com | ACTIVE | 01JAN1930 | 75 | 4 | VALID | VALID |
| 4 | AP004 | Haneda | NULL | Asia | haneda@airport | OPEN | . | 85 | . | VALID | VALID |
| 5 | AP005 | Charlesdegaulle | FRANCE | Europe | cdg@airport.fr | ACTIVE | 01JAN1974 | . | 4 | VALID | VALID |
| 6 | AP006 | Frankfurt | GERMANY | Europe | frankfurt@email.com | ACTIVE | 01JAN1936 | 70 | 1 | VALID | VALID |
| 7 | AP007 | Delhi Intl | INDIA | Asia | delhi@@mail.com | RUNNING | 15AUG2001 | 69 | 3 | INVALID | VALID |
| 8 | AP008 | Sydney Kingsford | AUSTRALIA | Oceania | sydney@airport.au | ACTIVE | 01JAN1920 | 44 | . | VALID | VALID |
| 9 | AP009 | Changi Airport | SINGAPORE | Asia | changi@airport.sg | ACTIVE | 01JAN1981 | 68 | 2 | VALID | VALID |
Explanation
PROC
FORMAT acts like enterprise metadata governance.
Benefits:
- Centralized validation logic
- Standardized classifications
- Reusable compliance checks
Instead
of hardcoding conditions repeatedly, formats create maintainable validation
frameworks.
6.PROC SQL Cleaning Strategy
proc sql;
create table airport_sql as
select distinct AIRPORT_ID,
propcase(AIRPORT_NAME) as AIRPORT_NAME,
upcase(COUNTRY) as COUNTRY,
mean(PASSENGERS_M) as AVG_PASSENGERS,
sum(RUNWAYS) as TOTAL_RUNWAYS
from airports_validated
group by AIRPORT_ID;
quit;
proc print data = airport_sql;
run;
OUTPUT:
| Obs | AIRPORT_ID | AIRPORT_NAME | COUNTRY | AVG_PASSENGERS | TOTAL_RUNWAYS |
|---|---|---|---|---|---|
| 1 | AP001 | Heathrow | UK | 80 | 2 |
| 2 | AP002 | Dubai Intl | UAE | 92 | 4 |
| 3 | AP003 | Lax Airport | USA | 75 | 4 |
| 4 | AP004 | Haneda | NULL | 85 | . |
| 5 | AP005 | Charlesdegaulle | FRANCE | . | 4 |
| 6 | AP006 | Frankfurt | GERMANY | 70 | 1 |
| 7 | AP007 | Delhi Intl | INDIA | 69 | 3 |
| 8 | AP008 | Sydney Kingsford | AUSTRALIA | 44 | . |
| 9 | AP009 | Changi Airport | SINGAPORE | 68 | 2 |
Explanation
PROC SQL
is ideal for:
- Aggregation
- Complex joins
- Reporting layers
- Multi-table integrations
DATA Step
excels in row-level transformations.
Enterprise
SAS programmers use BOTH strategically.
7.Advanced SAS Macro Framework
%macro missing_check(ds,var);
proc sql;
select count(*) as Missing_Count
from &ds
where missing(&var);
quit;
%mend;
%missing_check(airports_validated,COUNTRY);
OUTPUT:
| Missing_Count |
|---|
| 0 |
Explanation
Macros
enable reusable enterprise automation.
Benefits:
- Standardized QC
- Reduced manual coding
- Faster production deployment
- Consistent validation
Large
pharmaceutical companies heavily depend on macro libraries.
8. Load Required Libraries
library(tidyverse)
library(lubridate)
library(janitor)
Create Raw Airport Dataset
airports_raw <- tribble(
~AIRPORT_ID, ~AIRPORT_NAME, ~COUNTRY, ~REGION, ~PASSENGERS_M, ~RUNWAYS, ~OPEN_DATE, ~CONTACT_EMAIL, ~STATUS,
"AP001", "Heathrow", "UK", "Europe", "80", "2", "01JAN1946", "info@heathrow.com", "ACTIVE",
"AP002", "Dubai Intl", "UAE", "asia", "92", "4", "30FEB1960", "contactdubai.com", "ACTIVE",
"AP003", " lax airport", "USA", "North America", "-75", "4", "01JAN1930", "support@lax.com", "ACTIVE",
"AP003", "lax airport", "usa", "north america", "75", "4", "01JAN1930", "support@lax.com", "ACTIVE",
"AP004", "Haneda", "null", "ASIA", "85", "0", NA, "haneda@airport", "OPEN",
"AP005", "CharlesDeGaulle", "France", "Europe", "NULL", "4", "01JAN1974", "cdg@airport.fr", "ACTIVE",
"AP006", "Frankfurt", "Germany", "Europe", "70", "-1", "01JAN1936", "frankfurt@email.com", "ACTIVE",
"AP007", "Delhi Intl", "India", "Asia", "69", "3", "15AUG2001", "delhi@@mail.com", "RUNNING",
"AP008", "Sydney Kingsford", "Australia","Oceania", "44", "abc", "01JAN1920", "sydney@airport.au", "ACTIVE",
"AP009", "Changi Airport", "Singapore","Asia", "68", "2", "01JAN1981", "changi@airport.sg", "ACTIVE"
)
OUTPUT:
|
|
AIRPORT_ID |
AIRPORT_NAME |
COUNTRY |
REGION |
PASSENGERS_M |
RUNWAYS |
OPEN_DATE |
CONTACT_EMAIL |
STATUS |
|
1 |
AP001 |
Heathrow |
UK |
Europe |
80 |
2 |
01-Jan-1946 |
info@heathrow.com |
ACTIVE |
|
2 |
AP002 |
Dubai Intl |
UAE |
asia |
92 |
4 |
30FEB1960 |
contactdubai.com |
ACTIVE |
|
3 |
AP003 |
lax airport |
USA |
North America |
-75 |
4 |
01-Jan-1930 |
support@lax.com |
ACTIVE |
|
4 |
AP003 |
lax airport |
usa |
north america |
75 |
4 |
01-Jan-1930 |
support@lax.com |
ACTIVE |
|
5 |
AP004 |
Haneda |
null |
ASIA |
85 |
0 |
NA |
haneda@airport |
OPEN |
|
6 |
AP005 |
CharlesDeGaulle |
France |
Europe |
NULL |
4 |
01-Jan-1974 |
cdg@airport.fr |
ACTIVE |
|
7 |
AP006 |
Frankfurt |
Germany |
Europe |
70 |
-1 |
01-Jan-1936 |
frankfurt@email.com |
ACTIVE |
|
8 |
AP007 |
Delhi Intl |
India |
Asia |
69 |
3 |
15-Aug-2001 |
delhi@@mail.com |
RUNNING |
|
9 |
AP008 |
Sydney Kingsford |
Australia |
Oceania |
44 |
abc |
01-Jan-1920 |
sydney@airport.au |
ACTIVE |
|
10 |
AP009 |
Changi Airport |
Singapore |
Asia |
68 |
2 |
01-Jan-1981 |
changi@airport.sg |
ACTIVE |
Explanation
1. library(tidyverse)
library(tidyverse)
Loads
multiple packages together:
- dplyr
- tidyr
- ggplot2
- stringr
- readr
- tibble
This is
the most commonly used modern R data-engineering ecosystem.
Equivalent
SAS concept:
- DATA Step
- PROC SQL
- string functions
- PROC SORT
R
tidyverse provides elegant pipeline-based transformations.
Equivalent
SAS vs R comparison:
|
SAS |
R |
|
DATA
Step |
mutate() |
|
IF-THEN |
case_when() |
|
PROC
SORT |
arrange() |
|
PROC
SQL |
dplyr
joins |
|
STRIP |
str_trim() |
|
UPCASE |
str_to_upper() |
R excels
in:
- Exploratory analytics
- Flexible transformations
- Rapid prototyping
2. library(lubridate)
library(lubridate)
Used for:
- date parsing
- date cleaning
- time calculations
Equivalent
SAS:
- INPUT()
- INTNX()
- INTCK()
- date informats
3. library(janitor)
library(janitor)
Provides:
- column cleaning
- duplicate checks
- frequency summaries
Very
useful in enterprise cleaning workflows.
4. tribble()
tribble()
Creates a
dataset manually in table format.
Equivalent
SAS:
datalines;
Very
useful for:
- mock projects
- interview practice
- testing validation logic
5. ~ Symbol
~AIRPORT_ID
Defines
column names.
Equivalent
SAS:
input AIRPORT_ID $
6. Character Storage in R
Notice:
"80"
"NULL"
"abc"
Everything
is stored as character initially.
This is
actually safer for enterprise ingestion.
Unlike
SAS:
- R automatically treats
mixed-type columns as character
- avoids immediate numeric
conversion errors
Important Difference
Between SAS vs R
|
SAS |
R |
|
Numeric
conversion errors appear immediately |
R
stores mixed values as character |
|
INPUT
statement controls type |
Type
inferred dynamically |
|
Invalid
numeric notes generated |
Conversion
delayed until explicit |
Why This Design Is
Enterprise-Friendly
Raw
operational systems often contain:
- NULL
- UNKNOWN
- abc
- blank strings
Reading
everything as character first:
✔ preserves raw data
✔ prevents ingestion failure
✔ improves traceability
Corrupted Data Examples in
Dataset
|
Variable |
Corruption |
|
PASSENGERS_M |
NULL |
|
RUNWAYS |
abc |
|
OPEN_DATE |
30FEB1960 |
|
COUNTRY |
null |
|
REGION |
mixed
case |
|
CONTACT_EMAIL |
missing
@ |
|
AIRPORT_ID |
duplicate
AP003 |
Why "30FEB1960"
Is Invalid
February
never has:
30 days
This
creates:
Invalid date corruption
Equivalent
SAS behavior:
OPEN_DATE=.
In R:
- conversion produces NA
Why Duplicate AIRPORT_ID Is
Important
AP003
appears
twice.
This
creates:
- duplicate reporting
- double-counting
- dashboard inflation
Very
common real-world issue.
Why Negative Passenger
Counts Matter
-75
Negative
passengers are unrealistic business values.
This
indicates:
- ETL corruption
- sign reversal
- upstream system bug
Why "abc" in
RUNWAYS Is Dangerous
Runways
should be numeric.
But:
abc
creates:
- failed calculations
- broken summaries
- missing KPIs
Equivalent
SAS issue:
Invalid numeric data
9.Cleaning Step in R
airports_clean <- airports_raw %>%
mutate(
PASSENGERS_M = as.numeric(na_if(PASSENGERS_M,"NULL")),
RUNWAYS = as.numeric(na_if(RUNWAYS,"abc"))
)
OUTPUT:
|
|
AIRPORT_ID |
AIRPORT_NAME |
COUNTRY |
REGION |
PASSENGERS_M |
RUNWAYS |
OPEN_DATE |
CONTACT_EMAIL |
STATUS |
|
1 |
AP001 |
Heathrow |
UK |
Europe |
80 |
2 |
01-Jan-1946 |
info@heathrow.com |
ACTIVE |
|
2 |
AP002 |
Dubai Intl |
UAE |
asia |
92 |
4 |
30FEB1960 |
contactdubai.com |
ACTIVE |
|
3 |
AP003 |
lax airport |
USA |
North America |
-75 |
4 |
01-Jan-1930 |
support@lax.com |
ACTIVE |
|
4 |
AP003 |
lax airport |
usa |
north america |
75 |
4 |
01-Jan-1930 |
support@lax.com |
ACTIVE |
|
5 |
AP004 |
Haneda |
null |
ASIA |
85 |
0 |
NA |
haneda@airport |
OPEN |
|
6 |
AP005 |
CharlesDeGaulle |
France |
Europe |
NA |
4 |
01-Jan-1974 |
cdg@airport.fr |
ACTIVE |
|
7 |
AP006 |
Frankfurt |
Germany |
Europe |
70 |
-1 |
01-Jan-1936 |
frankfurt@email.com |
ACTIVE |
|
8 |
AP007 |
Delhi Intl |
India |
Asia |
69 |
3 |
15-Aug-2001 |
delhi@@mail.com |
RUNNING |
|
9 |
AP008 |
Sydney Kingsford |
Australia |
Oceania |
44 |
NA |
01-Jan-1920 |
sydney@airport.au |
ACTIVE |
|
10 |
AP009 |
Changi Airport |
Singapore |
Asia |
68 |
2 |
01-Jan-1981 |
changi@airport.sg |
ACTIVE |
Explanation
Equivalent
SAS:
input(var,best12.)
with
validation logic.
Business
Understanding
This raw
dataset intentionally simulates:
- corrupted operational
systems
- invalid business inputs
- inconsistent metadata
- poor source-system
governance
10.R Data Cleaning Workflow
airports_clean1 <- airports_clean %>%
clean_names() %>%
mutate(
airport_name = str_to_title(str_trim(airport_name)),
country = str_to_upper(country),
region = str_to_title(region),
contact_email = str_trim(str_to_lower(contact_email)),
passengers_m = abs(passengers_m),
status = case_when(
status %in% c("ACTIVE","RUNNING") ~ "ACTIVE",
TRUE ~ "REVIEW"),
open_date_clean =
parse_date_time(
open_date,
orders = c("dmy","ymd")),
date_flag = if_else(
is.na(open_date_clean),
"INVALID_DATE",
"VALID_DATE")
) %>%
distinct(airport_id, .keep_all = TRUE)
CONSOLE:
Warning message: There was 1 warning in `mutate()`. ℹ In argument: `open_date_clean = parse_date_time(open_date, orders = c("dmy", "ymd"))`. Caused by warning: ! 1 failed to parse.
OUTPUT:
|
|
airport_id |
airport_name |
country |
region |
passengers_m |
runways |
open_date |
contact_email |
status |
open_date_clean |
date_flag |
|
1 |
AP001 |
Heathrow |
UK |
Europe |
80 |
2 |
01-Jan-1946 |
info@heathrow.com |
ACTIVE |
01-01-1946 |
VALID_DATE |
|
2 |
AP002 |
Dubai Intl |
UAE |
Asia |
92 |
4 |
30FEB1960 |
contactdubai.com |
ACTIVE |
NA |
INVALID_DATE |
|
3 |
AP003 |
Lax Airport |
USA |
North America |
75 |
4 |
01-Jan-1930 |
support@lax.com |
ACTIVE |
01-01-1930 |
VALID_DATE |
|
4 |
AP004 |
Haneda |
NULL |
Asia |
85 |
0 |
NA |
haneda@airport |
REVIEW |
NA |
INVALID_DATE |
|
5 |
AP005 |
Charlesdegaulle |
FRANCE |
Europe |
NA |
4 |
01-Jan-1974 |
cdg@airport.fr |
ACTIVE |
01-01-1974 |
VALID_DATE |
|
6 |
AP006 |
Frankfurt |
GERMANY |
Europe |
70 |
-1 |
01-Jan-1936 |
frankfurt@email.com |
ACTIVE |
01-01-1936 |
VALID_DATE |
|
7 |
AP007 |
Delhi Intl |
INDIA |
Asia |
69 |
3 |
15-Aug-2001 |
delhi@@mail.com |
ACTIVE |
15-08-2001 |
VALID_DATE |
|
8 |
AP008 |
Sydney Kingsford |
AUSTRALIA |
Oceania |
44 |
NA |
01-Jan-1920 |
sydney@airport.au |
ACTIVE |
01-01-1920 |
VALID_DATE |
|
9 |
AP009 |
Changi Airport |
SINGAPORE |
Asia |
68 |
2 |
01-Jan-1981 |
changi@airport.sg |
ACTIVE |
01-01-1981 |
VALID_DATE |
warning
appears because:
|
Problem |
Reason |
|
30FEB1960 |
Impossible
calendar date |
|
parse_date_time() |
Cannot
create valid date |
|
Result |
NA
generated |
|
Warning |
Data-quality
parsing warning |
distinct()
removes duplicate airport records efficiently.
Equivalent
SAS:
- PROC SORT NODUPKEY
Deduplication
is foundational for:
- regulatory accuracy
- operational intelligence
- financial reporting
Business Logic Behind Data
Cleaning
Enterprise
data cleaning is not cosmetic work.
It directly influences analytical trustworthiness.
Example:
- Passenger counts cannot be
negative.
- Airport opening dates must
be valid.
- Region values must follow
controlled terminology.
- Emails must satisfy
communication standards.
If age
values in clinical trials exceed realistic human ranges:
- statistical models fail
- SDTM validation flags appear
- FDA reviewers question
dataset credibility
Similarly:
- missing dates break
time-series forecasting
- inconsistent text breaks
joins
- duplicate IDs inflate KPIs
Imputation
strategies are used carefully:
- Replace missing regions
using reference mappings
- Infer missing categories
from metadata
- Standardize timestamps into
ISO formats
Every
cleaning decision requires:
- traceability
- reproducibility
- auditability
because
enterprise analytics drives billion-dollar decisions.
Enterprise Validation &
Compliance
Why SAS Missing Values Are Dangerous
In SAS:
if . < 100 then FLAG='YES';
This
evaluates TRUE.
Why?
- Missing numeric values are
treated lower than valid numbers.
This can
silently corrupt:
- safety analyses
- risk scores
- clinical summaries
Always
validate missing handling explicitly.
SDTM & ADaM Relevance
Clinical
trial standards require:
- traceability
- reproducibility
- audit trails
- QC independence
Poor
cleaning can:
- invalidate FDA submissions
- create ADaM inconsistencies
- fail Pinnacle 21 checks
20 Enterprise Data-Cleaning
Best Practices
- Standardize metadata early
- Validate dates immediately
- Remove duplicates before
aggregation
- Never trust source-system
casing
- Use reusable macros
- Maintain audit trails
- Document derivations clearly
- Validate numeric ranges
- Separate raw and clean
layers
- Use QC independence
- Apply controlled terminology
- Validate email structures
- Protect original datasets
- Version-control programs
- Use defensive programming
- Automate validation checks
- Monitor truncation risks
- Standardize missing values
- Test joins carefully
- Validate outputs
statistically
20 Sharp SAS & R
Cleaning Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Duplicate records destroy
trust.
- Missing dates break
forecasting models.
- PROC FORMAT improves
governance.
- Macros reduce enterprise
coding risk.
- Arrays simplify repetitive
validation.
- R pipelines improve
readability.
- SAS excels in auditability.
- Clinical data requires
traceability.
- Controlled terminology
prevents fragmentation.
- QC must remain independent.
- Negative values distort
analytics.
- Character truncation is
silent corruption.
- SQL joins require
standardized keys.
- Regulatory submissions
demand consistency.
- Data lineage matters.
- Enterprise cleaning is engineering.
- Reliable analytics begins
with clean data.
SAS vs R for Enterprise
Cleaning
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Industry
Standard |
Growing |
|
Scalability |
Very
Strong |
Strong |
|
Flexibility |
Structured |
Extremely
Flexible |
|
Validation |
Enterprise-grade |
Community-driven |
|
Visualization |
Moderate |
Excellent |
|
Automation |
Macro-driven |
Package-driven |
SAS
dominates regulated industries because of:
- stability
- validation frameworks
- traceability
- reproducibility
R
dominates:
- advanced analytics
- visualization
- rapid experimentation
Modern
enterprises increasingly combine BOTH.
Conclusion
Enterprise
data cleaning is far more than formatting columns and fixing spelling mistakes.
It is the foundation of trustworthy analytics, regulatory compliance,
operational intelligence, and executive decision-making. Whether the domain is
aviation, healthcare, banking, insurance, or retail, corrupted data silently destroys
dashboards, AI predictions, forecasting systems, and statistical outputs. A
single duplicate identifier or malformed date can cascade into millions of
dollars of business risk.
This
Large Airports in World project demonstrated how real-world operational data
often arrives with severe corruption issues including duplicate keys,
inconsistent text values, invalid numeric ranges, malformed emails, missing
timestamps, and broken categorical labels. Using SAS and R together creates a
powerful enterprise-grade cleaning ecosystem capable of transforming unreliable
raw records into validated analytical intelligence.
SAS
remains unmatched for:
- regulatory traceability
- audit readiness
- enterprise-scale validation
- production stability
- macro automation
- SDTM/ADaM compliance
Meanwhile,
R provides:
- flexible transformation
pipelines
- rapid exploration
- modern string handling
- advanced visualization
- scalable analytical
workflows
The most
effective enterprise data teams no longer treat SAS and R as competitors.
Instead, they integrate both strategically:
- SAS for validated production
pipelines
- R for flexible analytics and
exploratory engineering
The real
lesson is this:
Clean
data is not a technical luxury.
It is the backbone of trustworthy business intelligence.
Organizations
that invest in structured data-cleaning frameworks gain:
- reliable dashboards
- accurate forecasting
- reproducible analytics
- compliant regulatory
submissions
- trustworthy AI systems
In modern
analytics ecosystems, data cleaning is no longer a preprocessing step.
It is a critical enterprise engineering discipline that protects
decision-making integrity across the entire organization.
Interview Questions &
Answers
1. How would you handle duplicate airport records?
Use PROC
SORT NODUPKEY or DISTINCT in R.
Then validate business keys and compare record counts before/after cleaning.
2. Why is LENGTH placement critical in SAS?
Because
SAS determines variable attributes during compilation. Incorrect placement
causes character truncation and data loss.
3. How do you validate malformed emails?
Use
INDEX, FIND, PRXMATCH in SAS or grepl() in R to detect invalid structures.
4. Why are missing values dangerous in SAS?
Missing
numeric values evaluate lower than valid numbers, causing incorrect conditional
logic and statistical distortion.
5. When would you choose PROC SQL over DATA Step?
PROC SQL:
- joins
- aggregations
- summarization
DATA
Step:
- row-level transformations
- iterative logic
- arrays
- retained calculations
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about AIRPORT DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment