From Taj Mahal to Machine Learning: Enterprise Data Cleaning Secrets Behind Reliable Tourism Analytics in SAS and R
The World’s Most Famous Tourist Spots Dataset into Trusted Business Analytics Using SAS (PROC SQL vs DATA Step) and R Data Engineering Frameworks
Introduction — When Beautiful Tourist Data Turns
Into an Enterprise Disaster
Imagine
you are working for a global travel analytics company responsible for
predicting tourism trends across the world. Your dashboards influence hotel
investments, airline route planning, tourism ministry budgets, and AI-powered
recommendation engines. One wrong value inside your dataset can distort
millions of dollars in decisions.
Now
imagine the following:
- Paris visitor counts stored
as "15M" instead of numeric values
- Taj Mahal dates entered as
32/14/2025
- Duplicate records for the
same tourist spot
- Negative revenue values
- Missing country names
- Inconsistent casing like
"new york", "NEW YORK", and "New york"
- Invalid ratings above 5
- Random special characters in
text columns
This is
exactly how real-world enterprise data looks before cleaning.
The same
nightmare happens in clinical trials. A single incorrect patient age, missing
treatment date, or duplicated adverse event record can delay regulatory
approvals from organizations like FDA or EMA.
Data
cleaning is not cosmetic work.
It is business survival.
In this
enterprise-level SAS and R case study, we will build a corrupted “World Famous
Tourist Spots” dataset with intentional errors and transform it into
production-ready analytical intelligence using:
- SAS DATA Step
- PROC SQL
- PROC REPORT
- SAS Macros
- R tidyverse ecosystem
- Enterprise validation logic
- Audit-ready reporting
pipelines
This
tutorial combines:
- Clinical-trial-grade
validation logic
- Tourism analytics
- SAS interview preparation
- Real-world business
intelligence engineering
Creating the Raw Tourist Dataset with Intentional Errors
Suppose a
tourism intelligence company collects worldwide tourist data from:
- APIs
- Excel sheets
- Manual entry systems
- Government portals
- Web scraping engines
The raw data arrives corrupted and inconsistent.
SAS Raw Dataset Creation
/*-----------------------------------------------------------
STEP 1: DEFINE VARIABLE LENGTHS BEFORE READING DATA.WHY?
SAS creates variable lengths during
compilation.
If LENGTH is placed after INPUT, truncation
occurs.
------------------------------------------------------------*/
data tourist_raw;
length Tourist_Spot $40 Country $25 City $25 Category $20 Rating_Text $10
Revenue_Text $15 Visit_Date_Text $20 Remarks $50;
infile datalines dlm='|' truncover;
input Tourist_Spot $ Country $ City $ Category $ Rating_Text $ Revenue_Text $
Visit_Date_Text $ Visitors Remarks $;
datalines;
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Top attraction
Taj_Mahal|India|Agra|Historical|6.5|2500000|31-14-2025|700000|Invalid rating
Statue_of_Liberty|USA|NewYork|Historical|4.7|-100000|15-08-2025|400000|Negative revenue
Great_Wall|China|Beijing|Historical|4.9|3500000|25-06-2025|-5000|Negative visitors
Machu_Picchu|Peru|Cusco|Historical|4.6|2100000|10-07-2025|300000|Good
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Duplicate
Santorini|Greece|santorini|Beach|4.5|1800000|22-09-2025|250000|lowercase city
Burj_Khalifa|UAE|Dubai|Modern|4.7|NULL|18-04-2025|600000|Missing revenue
Niagara_Falls|Canada|Toronto|Nature|abc|1300000|11-05-2025|450000|Invalid rating text
Colosseum| |Rome|Historical|4.4|1200000|05-03-2025|380000|Missing country
Sydney_Opera|Australia|Sydney|Modern|4.3|1700000|17-11-2025|320000|Good
Mount_Fuji|Japan|Tokyo|Nature|4.9|1600000|29-02-2025|410000|Invalid date
Grand_Canyon|USA|Arizona|Nature|4.8|2000000|07-08-2025|390000|Good
Banff_Park|Canada|Alberta|Nature|4.7|1900000|09-10-2025|280000|Good
Petra|Jordan|Amman|Historical|4.6|1450000|15-06-2025|310000|Good
;
run;
proc print data = tourist_raw;
run;
OUTPUT:
| Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | Top attraction | 500000 |
| 2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | Invalid rating | 700000 |
| 3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | Negative revenue | 400000 |
| 4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | Negative visitors | -5000 |
| 5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | Good | 300000 |
| 6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | Duplicate | 500000 |
| 7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | lowercase city | 250000 |
| 8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | Missing revenue | 600000 |
| 9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | Invalid rating text | 450000 |
| 10 | Colosseum | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | Missing country | 380000 | |
| 11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | Good | 320000 |
| 12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | Invalid date | 410000 |
| 13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | Good | 390000 |
| 14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | Good | 280000 |
| 15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | Good | 310000 |
Explanation
INTENTIONAL ERRORS INTRODUCED
- Missing countries
- Invalid dates
- Negative visitors
- Duplicate records
- Mixed casing
- Invalid ratings
- Embedded symbols
Understanding the SAS
“Truncation Trap”
One of
the biggest mistakes beginners make is placing LENGTH after the INPUT
statement.
Incorrect:
input Country $;
length Country $25;
In SAS,
variable attributes are assigned during compilation. If SAS first sees
"USA", it may assign length 3 permanently. Later
"Australia" becomes truncated.
Correct
approach:
length Country $25;
input Country $;
This is
extremely important in:
- SDTM datasets
- ADaM derivations
- Regulatory submissions
- Production ETL systems
R behaves
differently because strings are dynamically managed in memory. SAS allocates
fixed-length storage unless explicitly controlled.
#-----------------------------------------------------------
# CREATE RAW TOURIST DATASET IN R
# Equivalent to SAS DATALINES approach
#-----------------------------------------------------------
library(tidyverse)
library(stringr)
library(lubridate)
library(janitor)
library(purrr)
#-----------------------------------------------------------
# RAW DATA CREATED EXACTLY LIKE SAS DATALINES
# sep="|" acts like DLM='|'
# header=FALSE because SAS DATALINES has no header row
# stringsAsFactors=FALSE prevents automatic factor conversion
#-----------------------------------------------------------
tourist_raw <- read.table(text = "
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Top attraction
Taj_Mahal|India|Agra|Historical|6.5|2500000|31-14-2025|700000|Invalid rating
Statue_of_Liberty|USA|NewYork|Historical|4.7|-100000|15-08-2025|400000|Negative revenue
Great_Wall|China|Beijing|Historical|4.9|3500000|25-06-2025|-5000|Negative visitors
Machu_Picchu|Peru|Cusco|Historical|4.6|2100000|10-07-2025|300000|Good
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Duplicate
Santorini|Greece|santorini|Beach|4.5|1800000|22-09-2025|250000|lowercase city
Burj_Khalifa|UAE|Dubai|Modern|4.7|NULL|18-04-2025|600000|Missing revenue
Niagara_Falls|Canada|Toronto|Nature|abc|1300000|11-05-2025|450000|Invalid rating text
Colosseum||Rome|Historical|4.4|1200000|05-03-2025|380000|Missing country
Sydney_Opera|Australia|Sydney|Modern|4.3|1700000|17-11-2025|320000|Good
Mount_Fuji|Japan|Tokyo|Nature|4.9|1600000|29-02-2025|410000|Invalid date
Grand_Canyon|USA|Arizona|Nature|4.8|2000000|07-08-2025|390000|Good
Banff_Park|Canada|Alberta|Nature|4.7|1900000|09-10-2025|280000|Good
Petra|Jordan|Amman|Historical|4.6|1450000|15-06-2025|310000|Good
",
sep = "|", header = FALSE, stringsAsFactors = FALSE, fill = TRUE)
OUTPUT:
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 |
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction |
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating |
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue |
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | -5000 | Negative visitors |
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good |
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate |
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city |
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | Missing revenue |
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | Invalid rating text |
10 | Colosseum | | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country |
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good |
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date |
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good |
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good |
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good |
#-----------------------------------------------------------
# ASSIGN COLUMN NAMES
# Equivalent to SAS INPUT variable list
#-----------------------------------------------------------
colnames(tourist_raw) <- c("Tourist_Spot","Country","City","Category",
"Rating_Text","Revenue_Text","Visit_Date_Text","Visitors","Remarks"
)
OUTPUT:
| Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Visitors | Remarks |
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction |
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating |
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue |
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | -5000 | Negative visitors |
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good |
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate |
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city |
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | Missing revenue |
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | Invalid rating text |
10 | Colosseum | | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country |
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good |
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date |
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good |
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good |
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good |
This R
program is the direct equivalent of SAS DATALINES dataset creation workflow.
SAS vs R Mapping
|
SAS |
R |
|
DATALINES |
text= |
|
infile
datalines dlm=' |
' |
|
input |
colnames() |
|
PROC
PRINT |
print() |
|
PROC
CONTENTS |
str() |
Important Enterprise
Concepts
1. read.table()
This
function reads raw delimited text data into R.
Equivalent
SAS concept:
infile datalines dlm='|';
2. sep="|"
Defines
delimiter exactly like:
dlm='|'
3. fill=TRUE
Critical
for enterprise ingestion.
Suppose
some rows have missing fields.
Without fill=TRUE:
- R may fail ingestion
- column shifting occurs
This
behaves similarly to SAS:
truncover
4. stringsAsFactors=FALSE
Older R
versions automatically converted text into factors.
That
causes:
- unexpected modeling behavior
- merge issues
- reporting inconsistencies
Enterprise
systems usually disable this.
5. Why Column Names
Assigned Separately?
Because
raw text has no header row.
Equivalent
SAS logic:
input Tourist_Spot $
Country $
City $;
#-----------------------------------------------------------
# DISPLAY DATASET
# Equivalent to PROC PRINT
#-----------------------------------------------------------
print(tourist_raw)
#-----------------------------------------------------------
# STRUCTURE OF DATASET
# Similar to PROC CONTENTS
#-----------------------------------------------------------
str(tourist_raw)
SAS Data Cleaning Workflow — Enterprise Style
Step 1 — Standardization
data tourist_clean;
set tourist_raw;
/*-----------------------------------------------------------
PROPCASE standardizes inconsistent capitalization
------------------------------------------------------------*/
City = propcase(strip(City));
Country = propcase(strip(Country));
/*-----------------------------------------------------------
REMOVE SPECIAL CHARACTERS
COMPRESS keeps only alphabets and spaces
------------------------------------------------------------*/
Remarks = compress(Remarks,'@#$%^&*','k');
/*-----------------------------------------------------------
CONVERT REVENUE TO NUMERIC
INPUT converts character to numeric
------------------------------------------------------------*/
Revenue = input(Revenue_Text,best12.);
/*-----------------------------------------------------------
HANDLE NULL VALUES
------------------------------------------------------------*/
if Revenue_Text='NULL' then Revenue=.;
/*-----------------------------------------------------------
FIX NEGATIVE VALUES
ABS converts negative to positive
------------------------------------------------------------*/
Visitors = abs(Visitors);
Revenue = abs(Revenue);
/*-----------------------------------------------------------
CONVERT RATING
------------------------------------------------------------*/
Rating=input(Rating_Text,best12.);
/*-----------------------------------------------------------
INVALID RATINGS
------------------------------------------------------------*/
if Rating > 5 then Rating=5;
if Rating < 0 then Rating=.;
/*-----------------------------------------------------------
DATE CONVERSION
------------------------------------------------------------*/
Visit_Date=input(Visit_Date_Text,ddmmyy10.);
format Visit_Date date9.;
/*-----------------------------------------------------------
INVALID DATES
------------------------------------------------------------*/
if missing(Visit_Date) then
Visit_Date=intnx('month',today(),-1,'same');
run;
proc print data = tourist_clean;
run;
OUTPUT:
| Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | 1500000 | 4.8 | 12MAY2025 | |
| 2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | 2500000 | 5.0 | 14APR2026 | |
| 3 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | 100000 | 4.7 | 15AUG2025 | |
| 4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | 3500000 | 4.9 | 25JUN2025 | |
| 5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | 2100000 | 4.6 | 10JUL2025 | |
| 6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | 1500000 | 4.8 | 12MAY2025 | |
| 7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | 1800000 | 4.5 | 22SEP2025 | |
| 8 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | . | 4.7 | 18APR2025 | |
| 9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | 1300000 | . | 11MAY2025 | |
| 10 | Colosseum | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | 1200000 | 4.4 | 05MAR2025 | ||
| 11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | 1700000 | 4.3 | 17NOV2025 | |
| 12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | 1600000 | 4.9 | 14APR2026 | |
| 13 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | 2000000 | 4.8 | 07AUG2025 | |
| 14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | 1900000 | 4.7 | 09OCT2025 | |
| 15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | 1450000 | 4.6 | 15JUN2025 |
Explanation
This DATA
step demonstrates enterprise-grade defensive programming.
Instead of assuming clean input, every variable is validated.
Key
business logic:
- ABS() protects dashboards
from impossible negative metrics
- INPUT() converts raw text
into analytical numeric formats
- INTNX() imputes invalid
dates
- PROPCASE() standardizes text
for grouping consistency
This is
one of the most misunderstood yet powerful SAS character-cleaning statements
used in enterprise data cleaning projects, especially in:
- Clinical trial SDTM/ADaM
preparation
- Banking transaction
cleansing
- Tourism analytics
- Insurance claim systems
- Regulatory reporting
pipelines
Understanding the COMPRESS Function in SAS
General Syntax
COMPRESS(source, characters-to-remove, modifiers)
Parameters
|
Parameter |
Meaning |
|
|
Original variable |
|
|
Characters SAS should target |
|
|
Special behavior instructions |
Source Variable
Remarks
Suppose the raw values are:
Excellent@Place
Very#Crowded
Good^View
Amazing&Safe
These values contain unwanted special
characters.
'@#$%^&*'
This list
defines characters SAS should examine.
These
are:
- @
- $
- %
- ^
- &
'k'
The k
modifier means:
“KEEP the
listed characters instead of removing them.”
This
completely changes the behavior.
Critical Logic Difference
WITHOUT k
compress(Remarks,'@#$%^&*')
Means:
Remove @
# $ % ^ & *
Example:
|
Before |
After |
|
Good@Place |
GoodPlace |
|
Great#View |
GreatView |
This is
the normal behavior.
WITH k
compress(Remarks,'@#$%^&*','k')
Means:
KEEP ONLY
@ # $ % ^ & *
Remove everything else.
Example:
|
Before |
After |
|
Good@Place |
@ |
|
Great#View |
# |
|
Amazing&Safe |
& |
So your
original statement is actually NOT cleaning remarks properly.
It is
doing the opposite.
In
clinical trials, similar logic ensures:
- No impossible patient ages
- No future adverse event
dates
- No duplicate subject IDs
Without
these checks, regulatory audits fail.
Removing Duplicate Records
proc sort data=tourist_clean nodupkey;
by Tourist_Spot Country City;
run;
proc print data = tourist_clean;
run;
OUTPUT:
| Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | 1900000 | 4.7 | 09OCT2025 | |
| 2 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | . | 4.7 | 18APR2025 | |
| 3 | Colosseum | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | 1200000 | 4.4 | 05MAR2025 | ||
| 4 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | 1500000 | 4.8 | 12MAY2025 | |
| 5 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | 2000000 | 4.8 | 07AUG2025 | |
| 6 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | 3500000 | 4.9 | 25JUN2025 | |
| 7 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | 2100000 | 4.6 | 10JUL2025 | |
| 8 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | 1600000 | 4.9 | 14APR2026 | |
| 9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | 1300000 | . | 11MAY2025 | |
| 10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | 1450000 | 4.6 | 15JUN2025 | |
| 11 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | 1800000 | 4.5 | 22SEP2025 | |
| 12 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | 100000 | 4.7 | 15AUG2025 | |
| 13 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | 1700000 | 4.3 | 17NOV2025 | |
| 14 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | 2500000 | 5.0 | 14APR2026 |
Explanation
NODUPKEY
removes duplicate business keys.
In SDTM
datasets:
- Duplicate AE records create
safety-reporting failures
- Duplicate DM subjects break
population counts
Sorting
before analysis ensures:
- Accurate aggregation
- Reliable AI training
- Trustworthy KPI calculations
PROC SQL vs DATA Step
PROC SQL Approach
proc sql;
create table revenue_summary as
select Country,Category,count(*) as Total_Spots,
mean(Revenue) as Avg_Revenue format=dollar15.,
sum(Visitors) as Total_Visitors
from tourist_clean
group by Country, Category
order by Avg_Revenue desc;
quit;
proc print data = revenue_summary;
run;
OUTPUT:
| Obs | Country | Category | Total_Spots | Avg_Revenue | Total_Visitors |
|---|---|---|---|---|---|
| 1 | China | Historical | 1 | $3,500,000 | 5000 |
| 2 | India | Historical | 1 | $2,500,000 | 700000 |
| 3 | Peru | Historical | 1 | $2,100,000 | 300000 |
| 4 | Usa | Nature | 1 | $2,000,000 | 390000 |
| 5 | Greece | Beach | 1 | $1,800,000 | 250000 |
| 6 | Australia | Modern | 1 | $1,700,000 | 320000 |
| 7 | Canada | Nature | 2 | $1,600,000 | 730000 |
| 8 | Japan | Nature | 1 | $1,600,000 | 410000 |
| 9 | France | Historical | 1 | $1,500,000 | 500000 |
| 10 | Jordan | Historical | 1 | $1,450,000 | 310000 |
| 11 | Historical | 1 | $1,200,000 | 380000 | |
| 12 | Usa | Historical | 1 | $100,000 | 400000 |
| 13 | Uae | Modern | 1 | . | 600000 |
Explanation
PROC SQL
is excellent for:
- Aggregations
- Multi-table joins
- Database pushdown
optimization
- Business reporting
SQL
resembles enterprise warehouse systems like:
- Oracle
- Snowflake
- Teradata
DATA Step Alternative
proc sort data=tourist_clean;
by Country Category;
run;
proc print data = tourist_clean;
run;
OUTPUT:
| Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Colosseum | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | 1200000 | 4.4 | 05MAR2025 | ||
| 2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | 1700000 | 4.3 | 17NOV2025 | |
| 3 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | 1900000 | 4.7 | 09OCT2025 | |
| 4 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | 1300000 | . | 11MAY2025 | |
| 5 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | 3500000 | 4.9 | 25JUN2025 | |
| 6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | 1500000 | 4.8 | 12MAY2025 | |
| 7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | 1800000 | 4.5 | 22SEP2025 | |
| 8 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | 2500000 | 5.0 | 14APR2026 | |
| 9 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | 1600000 | 4.9 | 14APR2026 | |
| 10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | 1450000 | 4.6 | 15JUN2025 | |
| 11 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | 2100000 | 4.6 | 10JUL2025 | |
| 12 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | . | 4.7 | 18APR2025 | |
| 13 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | 100000 | 4.7 | 15AUG2025 | |
| 14 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | 2000000 | 4.8 | 07AUG2025 |
data summary_ds;
set tourist_clean;
by Country Category;
retain Total_Visitors Total_Revenue Count;
if first.Category then do;
Total_Visitors=0;
Total_Revenue=0;
Count=0;
end;
Total_Visitors+Visitors;
Total_Revenue+Revenue;
Count+1;
if last.Category then do;
Avg_Revenue=round(Total_Revenue/Count,0.01);
output;
end;
run;
proc print data = summary_ds;
run;
OUTPUT:
| Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date | Total_Visitors | Total_Revenue | Count | Avg_Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Colosseum | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | 1200000 | 4.4 | 05MAR2025 | 380000 | 1200000 | 1 | 1200000 | ||
| 2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | 1700000 | 4.3 | 17NOV2025 | 320000 | 1700000 | 1 | 1700000 | |
| 3 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | 1300000 | . | 11MAY2025 | 730000 | 3200000 | 2 | 1600000 | |
| 4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | 3500000 | 4.9 | 25JUN2025 | 5000 | 3500000 | 1 | 3500000 | |
| 5 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | 1500000 | 4.8 | 12MAY2025 | 500000 | 1500000 | 1 | 1500000 | |
| 6 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | 1800000 | 4.5 | 22SEP2025 | 250000 | 1800000 | 1 | 1800000 | |
| 7 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | 2500000 | 5.0 | 14APR2026 | 700000 | 2500000 | 1 | 2500000 | |
| 8 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | 1600000 | 4.9 | 14APR2026 | 410000 | 1600000 | 1 | 1600000 | |
| 9 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | 1450000 | 4.6 | 15JUN2025 | 310000 | 1450000 | 1 | 1450000 | |
| 10 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | 2100000 | 4.6 | 10JUL2025 | 300000 | 2100000 | 1 | 2100000 | |
| 11 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | . | 4.7 | 18APR2025 | 600000 | 0 | 1 | 0 | |
| 12 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | 100000 | 4.7 | 15AUG2025 | 400000 | 100000 | 1 | 100000 | |
| 13 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | 2000000 | 4.8 | 07AUG2025 | 390000 | 2000000 | 1 | 2000000 |
Explanation
The DATA
step gives granular row-level control using:
- FIRST.
- LAST.
- RETAIN
This
approach is preferred when:
- Complex derivations exist
- Stateful calculations are
needed
- Clinical-trial lineage must
be preserved
Advanced SAS Features
PROC FORMAT
proc format;
value revenuefmt low-1000000 = 'Low Revenue'
1000001-2000000 = 'Medium Revenue'
2000001-high = 'High Revenue';
run;
LOG:
NOTE: Format REVENUEFMT has been output.
Explanation
Formats
improve readability and dashboard usability.
Instead
of showing raw numbers, business users see:
- High Revenue
- Medium Revenue
- Low Revenue
This
improves executive communication.
data tourist_clean2;
set tourist_clean;
length Revenue_FMT $20;
Revenue_FMT = put(Revenue,revenuefmt.);
run;
proc print data = tourist_clean2;
run;
OUTPUT:
| Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date | Revenue_FMT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Colosseum | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | 1200000 | 4.4 | 05MAR2025 | Medium Revenue | ||
| 2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | 1700000 | 4.3 | 17NOV2025 | Medium Revenue | |
| 3 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | 1900000 | 4.7 | 09OCT2025 | Medium Revenue | |
| 4 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | 1300000 | . | 11MAY2025 | Medium Revenue | |
| 5 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | 3500000 | 4.9 | 25JUN2025 | High Revenue | |
| 6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | 1500000 | 4.8 | 12MAY2025 | Medium Revenue | |
| 7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | 1800000 | 4.5 | 22SEP2025 | Medium Revenue | |
| 8 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | 2500000 | 5.0 | 14APR2026 | High Revenue | |
| 9 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | 1600000 | 4.9 | 14APR2026 | Medium Revenue | |
| 10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | 1450000 | 4.6 | 15JUN2025 | Medium Revenue | |
| 11 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | 2100000 | 4.6 | 10JUL2025 | High Revenue | |
| 12 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | . | 4.7 | 18APR2025 | . | |
| 13 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | 100000 | 4.7 | 15AUG2025 | Low Revenue | |
| 14 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | 2000000 | 4.8 | 07AUG2025 | Medium Revenue |
PROC REPORT Dashboard
proc report data=tourist_clean2 nowd;
column Country Tourist_Spot Revenue Revenue_FMT Visitors Rating;
define Country / group;
define Tourist_Spot / display;
define Revenue / analysis sum format=dollar15.;
define Revenue_FMT / group;
define Visitors / analysis sum;
define Rating / analysis mean format=4.2;
compute Revenue;
if Revenue.sum > 2000000 then
call define(_col_,'style',
'style={background=lightgreen}');
endcomp;
run;
OUTPUT:
| Country | Tourist_Spot | Revenue | Revenue_FMT | Visitors | Rating |
|---|---|---|---|---|---|
| Australia | Sydney_Opera | $1,700,000 | Medium Revenue | 320000 | 4.30 |
| Canada | Banff_Park | $1,900,000 | Medium Revenue | 280000 | 4.70 |
| Niagara_Falls | $1,300,000 | 450000 | . | ||
| China | Great_Wall | $3,500,000 | High Revenue | 5000 | 4.90 |
| France | Eiffel_Tower | $1,500,000 | Medium Revenue | 500000 | 4.80 |
| Greece | Santorini | $1,800,000 | Medium Revenue | 250000 | 4.50 |
| India | Taj_Mahal | $2,500,000 | High Revenue | 700000 | 5.00 |
| Japan | Mount_Fuji | $1,600,000 | Medium Revenue | 410000 | 4.90 |
| Jordan | Petra | $1,450,000 | Medium Revenue | 310000 | 4.60 |
| Peru | Machu_Picchu | $2,100,000 | High Revenue | 300000 | 4.60 |
| Uae | Burj_Khalifa | . | . | 600000 | 4.70 |
| Usa | Statue_of_Liberty | $100,000 | Low Revenue | 400000 | 4.70 |
| Grand_Canyon | $2,000,000 | Medium Revenue | 390000 | 4.80 |
Explanation
PROC
REPORT creates audit-ready professional outputs.
Used
heavily in:
- Clinical TLF generation
- Financial reporting
- Regulatory submissions
Conditional
formatting improves executive visibility.
SAS Macro for Reusable Validation
%macro validate(ds,var);
proc freq data=&ds;
tables &var / missing;
run;
%mend;
%validate(tourist_clean2,Country);
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 1 | 7.14 | 1 | 7.14 | |
| Australia | 1 | 7.14 | 2 | 14.29 |
| Canada | 2 | 14.29 | 4 | 28.57 |
| China | 1 | 7.14 | 5 | 35.71 |
| France | 1 | 7.14 | 6 | 42.86 |
| Greece | 1 | 7.14 | 7 | 50.00 |
| India | 1 | 7.14 | 8 | 57.14 |
| Japan | 1 | 7.14 | 9 | 64.29 |
| Jordan | 1 | 7.14 | 10 | 71.43 |
| Peru | 1 | 7.14 | 11 | 78.57 |
| Uae | 1 | 7.14 | 12 | 85.71 |
| Usa | 2 | 14.29 | 14 | 100.00 |
Explanation
Macros
reduce repetitive programming.
Enterprise
benefits:
- Standardization
- Faster validation
- Reduced human error
- Easier maintenance
In pharmaceutical programming, reusable macros are essential.
R Refinement Layer
R Dataset Cleaning
tourist_clean <- tourist_raw %>%
clean_names() %>%
mutate(city = str_trim(str_to_title(city)),
country = coalesce(country,"Unknown"),
revenue_text = if_else(revenue_text == "NULL",
NA_character_,revenue_text),
revenue = as.numeric(revenue_text),
revenue = abs(revenue),
visitors = abs(visitors),
rating_text = if_else(grepl("[A-Za-z]", rating_text),
NA_character_,rating_text),rating = as.numeric(rating_text),
rating = if_else(rating > 5,5,rating),
visit_date =suppressWarnings(parse_date_time(visit_date_text,orders="dmy")),
remarks = str_replace_all(remarks,"[@#$%^&*]","")
) %>%
distinct()
|
|
tourist_spot |
country |
city |
category |
rating_text |
revenue_text |
visit_date_text |
visitors |
remarks |
revenue |
rating |
visit_date |
|
1 |
Eiffel_Tower |
France |
Paris |
Historical |
4.8 |
1500000 |
12-05-2025 |
500000 |
Top attraction |
1500000 |
4.8 |
12-05-2025 |
|
2 |
Taj_Mahal |
India |
Agra |
Historical |
6.5 |
2500000 |
31-14-2025 |
700000 |
Invalid rating |
2500000 |
5 |
NA |
|
3 |
Statue_of_Liberty |
USA |
Newyork |
Historical |
4.7 |
-100000 |
15-08-2025 |
400000 |
Negative revenue |
100000 |
4.7 |
15-08-2025 |
|
4 |
Great_Wall |
China |
Beijing |
Historical |
4.9 |
3500000 |
25-06-2025 |
5000 |
Negative visitors |
3500000 |
4.9 |
25-06-2025 |
|
5 |
Machu_Picchu |
Peru |
Cusco |
Historical |
4.6 |
2100000 |
10-07-2025 |
300000 |
Good |
2100000 |
4.6 |
10-07-2025 |
|
6 |
Eiffel_Tower |
France |
Paris |
Historical |
4.8 |
1500000 |
12-05-2025 |
500000 |
Duplicate |
1500000 |
4.8 |
12-05-2025 |
|
7 |
Santorini |
Greece |
Santorini |
Beach |
4.5 |
1800000 |
22-09-2025 |
250000 |
lowercase city |
1800000 |
4.5 |
22-09-2025 |
|
8 |
Burj_Khalifa |
UAE |
Dubai |
Modern |
4.7 |
NA |
18-04-2025 |
600000 |
Missing revenue |
NA |
4.7 |
18-04-2025 |
|
9 |
Niagara_Falls |
Canada |
Toronto |
Nature |
NA |
1300000 |
11-05-2025 |
450000 |
Invalid rating text |
1300000 |
NA |
11-05-2025 |
|
10 |
Colosseum |
|
Rome |
Historical |
4.4 |
1200000 |
05-03-2025 |
380000 |
Missing country |
1200000 |
4.4 |
05-03-2025 |
|
11 |
Sydney_Opera |
Australia |
Sydney |
Modern |
4.3 |
1700000 |
17-11-2025 |
320000 |
Good |
1700000 |
4.3 |
17-11-2025 |
|
12 |
Mount_Fuji |
Japan |
Tokyo |
Nature |
4.9 |
1600000 |
29-02-2025 |
410000 |
Invalid date |
1600000 |
4.9 |
NA |
|
13 |
Grand_Canyon |
USA |
Arizona |
Nature |
4.8 |
2000000 |
07-08-2025 |
390000 |
Good |
2000000 |
4.8 |
07-08-2025 |
|
14 |
Banff_Park |
Canada |
Alberta |
Nature |
4.7 |
1900000 |
09-10-2025 |
280000 |
Good |
1900000 |
4.7 |
09-10-2025 |
|
15 |
Petra |
Jordan |
Amman |
Historical |
4.6 |
1450000 |
15-06-2025 |
310000 |
Good |
1450000 |
4.6 |
15-06-2025 |
R’s
tidyverse pipeline is highly expressive.
Equivalent
SAS vs R comparisons:
|
SAS |
R |
|
PROPCASE |
str_to_title |
|
COMPRESS |
str_replace_all |
|
MISSING |
is.na |
|
IF-THEN |
if_else |
|
PROC
SORT NODUPKEY |
distinct |
|
INPUT |
as.numeric |
R excels
in:
- Interactive exploration
- Visualization
- Machine learning integration
SAS dominates
in:
- Regulatory compliance
- Metadata governance
- Audit traceability
Business Logic Behind Data
Cleaning
Data
cleaning exists because raw data reflects human behavior, system limitations,
and operational chaos. In healthcare, tourism, banking, and AI systems,
decisions are only as accurate as the underlying data.
Suppose a
patient age is recorded as 250. Without validation, clinical-trial analysis may
classify impossible demographics, impacting safety conclusions. Similarly, if
tourism revenue becomes negative due to ingestion errors, executive dashboards
may falsely indicate economic collapse.
Missing
values are replaced because analytical models cannot reliably interpret blanks.
For example:
- Missing tourist country →
replaced using business rules
- Missing patient treatment
dates → imputed for continuity
- Missing salary data →
normalized using averages or medians
Date
correction is equally critical. Invalid dates break forecasting pipelines, machine-learning
models, and regulatory timelines.
Standardization
also matters:
- "usa",
"USA", "Usa" should become one standardized value
- Duplicate tourist locations
inflate counts
- Malformed text disrupts
joins and reporting
In SDTM
and ADaM environments, every transformation must be reproducible and traceable.
That is why enterprise SAS programming emphasizes:
- validation,
- audit trails,
- controlled terminology,
- metadata consistency,
- and deterministic
transformations.
Clean
data is not convenience.
It is operational integrity.
20 Additional Data Cleaning
Best Practices
- Always validate
source-system lineage before ingestion.
- Maintain audit trails for
every transformation.
- Never overwrite raw datasets
directly.
- Use controlled terminology
in SDTM domains.
- Validate all date variables
against protocol timelines.
- Standardize casing before
joins.
- Remove hidden special
characters.
- Track duplicate subject IDs
carefully.
- Separate business rules from
transformation logic.
- Use macros for reusable
validations.
- Validate negative numeric
values.
- Document derivation logic
clearly.
- Perform frequency checks
before analysis.
- Use PROC CONTENTS to verify
metadata.
- Validate missingness
patterns statistically.
- Ensure reproducibility
across environments.
- Compare PROC SQL and DATA
step outputs.
- Preserve raw ingestion
copies for audits.
- Build exception reports for
invalid records.
- Implement peer-review
validation before production release.
20 Key Points — Sharp
Enterprise Insights
- Dirty data leads to wrong
conclusions.
- Standardization ensures
reproducibility.
- Duplicate records inflate
business KPIs.
- Invalid dates destroy
forecasting accuracy.
- Missing values impact
machine learning reliability.
- SAS excels in regulatory
governance.
- R excels in analytical
flexibility.
- LENGTH placement prevents truncation
disasters.
- PROC SQL simplifies
aggregation logic.
- DATA step enables row-level
control.
- Audit trails are mandatory
in healthcare analytics.
- Controlled terminology
improves consistency.
- PROC REPORT creates
executive-ready outputs.
- Macros reduce repetitive
validation work.
- Data lineage improves
compliance transparency.
- Enterprise ETL requires
defensive programming.
- Text normalization improves
join accuracy.
- Validation protects AI model
integrity.
- Clean data drives
trustworthy dashboards.
- Reliable analytics starts
with disciplined engineering.
Summary
This case
study demonstrated how messy tourism data can be transformed into
enterprise-grade analytical intelligence using both SAS and R. We intentionally
created corrupted datasets containing duplicates, missing values, malformed
text, negative metrics, inconsistent casing, and invalid dates to simulate
real-world enterprise ingestion problems.
Using SAS
DATA step programming, we implemented defensive validation logic with functions
like:
- PROPCASE
- COMPRESS
- INPUT
- INTNX
- ABS
- MISSING
- CATX
- SCAN
We also
explored:
- PROC SQL aggregations
- PROC REPORT dashboards
- PROC FORMAT categorization
- PROC SORT NODUPKEY
deduplication
- reusable SAS macros
The
tutorial highlighted why SAS remains dominant in regulated industries such as
clinical trials, banking, and compliance-focused analytics due to:
- auditability,
- metadata governance,
- reproducibility,
- and validation traceability.
On the R
side, tidyverse functions such as:
- mutate()
- case_when()
- distinct()
- replace_na()
- parse_date_time()
- str_replace_all()
provided
a modern and highly expressive refinement layer suitable for exploratory
analytics and machine-learning workflows.
The
biggest lesson is simple:
Raw data
is never trustworthy by default.
Whether
you are building:
- SDTM datasets,
- tourism dashboards,
- fraud-detection engines,
- or AI recommendation
systems,
your
analytical quality depends entirely on structured, validated, and reproducible
cleaning frameworks.
Clean
data creates trustworthy intelligence.
Trustworthy intelligence drives confident business decisions.
Conclusion
Modern
analytics is no longer about simply generating reports. It is about engineering
trust.
Organizations
today depend on data for:
- AI automation,
- regulatory compliance,
- predictive forecasting,
- financial planning,
- healthcare decisions,
- and operational
intelligence.
But raw
enterprise data is inherently chaotic.
Tourism
systems contain duplicate bookings, malformed locations, inconsistent currency
formats, and invalid visitor counts. Clinical-trial systems contain missing
treatment dates, duplicated adverse events, inconsistent medical coding, and
protocol deviations. Banking systems face corrupted transactions, fraud
anomalies, and incomplete customer records.
Without
structured cleaning frameworks, these issues silently poison analytical
outputs.
This
project demonstrated how SAS and R complement each other in enterprise
environments.
SAS
provides:
- industrial-grade governance,
- validation traceability,
- audit readiness,
- metadata control,
- and production reliability.
R
provides:
- agile data exploration,
- elegant transformation
pipelines,
- machine-learning
compatibility,
- and rapid experimentation.
The real
power emerges when both ecosystems work together.
A mature
enterprise workflow often follows this pattern:
- SAS performs ingestion,
compliance validation, SDTM-standard transformation, and production
reporting.
- R performs exploratory
analytics, visualization, statistical modeling, and advanced AI workflows.
The
future of analytics belongs to professionals who understand:
- data engineering,
- validation logic,
- business rules,
- compliance requirements,
- and scalable transformation
frameworks.
Because
ultimately, dashboards are only as trustworthy as the pipelines behind them.
Clean
data is not merely technical hygiene.
It is the foundation of regulatory credibility, executive confidence, AI
reliability, and enterprise intelligence.
Interview Questions and
Answers
1. Why is LENGTH placement important in SAS?
Answer:
SAS assigns variable attributes during compilation. If LENGTH is placed after
INPUT, character variables may truncate permanently. This can corrupt joins,
grouping logic, and reporting outputs.
2. When would you prefer DATA Step over PROC SQL?
Answer:
Use DATA Step when row-by-row processing, FIRST./LAST. logic, RETAIN
statements, or complex derivations are required. PROC SQL is better for
aggregation and joins.
3. How would you detect duplicate clinical records?
Answer:
proc sort data=ae nodupkey;
by usubjid aestdtc aeterm;
run;
This
removes duplicate adverse-event records using business keys.
4. How does R handle missing values differently
from SAS?
Answer:
R uses NA, while SAS uses numeric . and blank character values. R functions
like replace_na() explicitly manage missingness, whereas SAS uses MISSING() and
COALESCEC().
5. A clinical dataset shows negative patient weight values. How would you handle this?
Answer:
First
investigate source-system lineage. If confirmed as data-entry errors, apply
business-rule correction:
Weight = abs(Weight);
Then:
- flag corrected records,
- document derivation logic,
- preserve audit traceability,
- and validate downstream
calculations.
This
ensures regulatory compliance and analytical reliability.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TOURIST DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Follow Us On :
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--->Follow our blog for more SAS-based analytics projects and industry data models.
---> Support Us By Following Our Blog..
To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:
3.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About Us | Contact | Privacy Policy
Comments
Post a Comment