From Taj Mahal to Machine Learning: Enterprise Data Cleaning Secrets Behind Reliable Tourism Analytics in SAS and R

Turning the World’s Most Famous Tourist Spots Dataset into Trusted Business Analytics Using SAS (PROC SQL vs DATA Step) and R Data Engineering Frameworks

Introduction — When Beautiful Tourist Data Turns Into an Enterprise Disaster

Imagine you are working for a global travel analytics company responsible for predicting tourism trends across the world. Your dashboards influence hotel investments, airline route planning, tourism ministry budgets, and AI-powered recommendation engines. One wrong value inside your dataset can distort millions of dollars in decisions.

Now imagine the following:

  • Paris visitor counts stored as "15M" instead of numeric values
  • Taj Mahal dates entered as 31-14-2025
  • Duplicate records for the same tourist spot
  • Negative revenue values
  • Missing country names
  • Inconsistent casing like "new york", "NEW YORK", and "New york"
  • Invalid ratings above 5
  • Random special characters in text columns

This is exactly how real-world enterprise data looks before cleaning.

The same nightmare happens in clinical trials. A single incorrect patient age, missing treatment date, or duplicated adverse-event record can delay regulatory approvals from agencies such as the FDA or EMA.

Data cleaning is not cosmetic work.
It is business survival.

In this enterprise-level SAS and R case study, we will build a corrupted “World Famous Tourist Spots” dataset with intentional errors and transform it into production-ready analytical intelligence using:

  • SAS DATA Step
  • PROC SQL
  • PROC REPORT
  • SAS Macros
  • R tidyverse ecosystem
  • Enterprise validation logic
  • Audit-ready reporting pipelines

This tutorial combines:

  • Clinical-trial-grade validation logic
  • Tourism analytics
  • SAS interview preparation
  • Real-world business intelligence engineering

Creating the Raw Tourist Dataset with Intentional Errors

Suppose a tourism intelligence company collects worldwide tourist data from:

  • APIs
  • Excel sheets
  • Manual entry systems
  • Government portals
  • Web scraping engines

The raw data arrives corrupted and inconsistent.

SAS Raw Dataset Creation

/*-----------------------------------------------------------
 STEP 1: DEFINE VARIABLE LENGTHS BEFORE READING DATA. WHY?
 SAS assigns variable lengths during compilation.
 If LENGTH is placed after INPUT, truncation occurs.
------------------------------------------------------------*/
data tourist_raw;
length Tourist_Spot $40 Country $25 City $25 Category $20 Rating_Text $10
       Revenue_Text $15 Visit_Date_Text $20 Remarks $50;
infile datalines dlm='|' truncover;
input Tourist_Spot $ Country $ City $ Category $ Rating_Text $ Revenue_Text $
      Visit_Date_Text $ Visitors Remarks $;
datalines;
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Top attraction
Taj_Mahal|India|Agra|Historical|6.5|2500000|31-14-2025|700000|Invalid rating
Statue_of_Liberty|USA|NewYork|Historical|4.7|-100000|15-08-2025|400000|Negative revenue
Great_Wall|China|Beijing|Historical|4.9|3500000|25-06-2025|-5000|Negative visitors
Machu_Picchu|Peru|Cusco|Historical|4.6|2100000|10-07-2025|300000|Good
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Duplicate
Santorini|Greece|santorini|Beach|4.5|1800000|22-09-2025|250000|lowercase city
Burj_Khalifa|UAE|Dubai|Modern|4.7|NULL|18-04-2025|600000|Missing revenue
Niagara_Falls|Canada|Toronto|Nature|abc|1300000|11-05-2025|450000|Invalid rating text
Colosseum| |Rome|Historical|4.4|1200000|05-03-2025|380000|Missing country
Sydney_Opera|Australia|Sydney|Modern|4.3|1700000|17-11-2025|320000|Good
Mount_Fuji|Japan|Tokyo|Nature|4.9|1600000|29-02-2025|410000|Invalid date
Grand_Canyon|USA|Arizona|Nature|4.8|2000000|07-08-2025|390000|Good
Banff_Park|Canada|Alberta|Nature|4.7|1900000|09-10-2025|280000|Good
Petra|Jordan|Amman|Historical|4.6|1450000|15-06-2025|310000|Good
;
run;

proc print data=tourist_raw;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | Top attraction | 500000
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | Invalid rating | 700000
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | Negative revenue | 400000
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | Negative visitors | -5000
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | Good | 300000
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | Duplicate | 500000
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | lowercase city | 250000
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | Missing revenue | 600000
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | Invalid rating text | 450000
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | Missing country | 380000
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | Good | 320000
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | Invalid date | 410000
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | Good | 390000
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | Good | 280000
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | Good | 310000

Explanation

Intentional errors introduced:

  • Missing countries
  • Invalid dates
  • Negative visitors
  • Duplicate records
  • Mixed casing
  • Invalid ratings
  • Embedded symbols

Understanding the SAS “Truncation Trap”

One of the biggest mistakes beginners make is placing LENGTH after the INPUT statement.

Incorrect:

input Country $;
length Country $25;

In SAS, variable attributes are fixed at compile time. With simple list input, `input Country $;` assigns the default character length of 8, so the later LENGTH statement is ignored (SAS writes a note to the log) and "Australia" is truncated to "Australi".

Correct approach:

length Country $25;
input Country $;

This is extremely important in:

  • SDTM datasets
  • ADaM derivations
  • Regulatory submissions
  • Production ETL systems

R behaves differently because strings are dynamically managed in memory. SAS allocates fixed-length storage unless explicitly controlled.

#-----------------------------------------------------------
# CREATE RAW TOURIST DATASET IN R
# Equivalent to SAS DATALINES approach
#-----------------------------------------------------------

library(tidyverse)
library(stringr)
library(lubridate)
library(janitor)
library(purrr)

#-----------------------------------------------------------
# RAW DATA CREATED EXACTLY LIKE SAS DATALINES
# sep="|" acts like DLM='|'
# header=FALSE because SAS DATALINES has no header row
# stringsAsFactors=FALSE prevents automatic factor conversion
#-----------------------------------------------------------

tourist_raw <- read.table(text = "
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Top attraction
Taj_Mahal|India|Agra|Historical|6.5|2500000|31-14-2025|700000|Invalid rating
Statue_of_Liberty|USA|NewYork|Historical|4.7|-100000|15-08-2025|400000|Negative revenue
Great_Wall|China|Beijing|Historical|4.9|3500000|25-06-2025|-5000|Negative visitors
Machu_Picchu|Peru|Cusco|Historical|4.6|2100000|10-07-2025|300000|Good
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Duplicate
Santorini|Greece|santorini|Beach|4.5|1800000|22-09-2025|250000|lowercase city
Burj_Khalifa|UAE|Dubai|Modern|4.7|NULL|18-04-2025|600000|Missing revenue
Niagara_Falls|Canada|Toronto|Nature|abc|1300000|11-05-2025|450000|Invalid rating text
Colosseum||Rome|Historical|4.4|1200000|05-03-2025|380000|Missing country
Sydney_Opera|Australia|Sydney|Modern|4.3|1700000|17-11-2025|320000|Good
Mount_Fuji|Japan|Tokyo|Nature|4.9|1600000|29-02-2025|410000|Invalid date
Grand_Canyon|USA|Arizona|Nature|4.8|2000000|07-08-2025|390000|Good
Banff_Park|Canada|Alberta|Nature|4.7|1900000|09-10-2025|280000|Good
Petra|Jordan|Amman|Historical|4.6|1450000|15-06-2025|310000|Good
",
  sep = "|", header = FALSE, stringsAsFactors = FALSE, fill = TRUE)

OUTPUT:

  | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | -5000 | Negative visitors
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | Missing revenue
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | Invalid rating text
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good

#-----------------------------------------------------------
# ASSIGN COLUMN NAMES
# Equivalent to SAS INPUT variable list
#-----------------------------------------------------------

colnames(tourist_raw) <- c("Tourist_Spot","Country","City","Category",
  "Rating_Text","Revenue_Text","Visit_Date_Text","Visitors","Remarks")

OUTPUT:

  | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Visitors | Remarks
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | -5000 | Negative visitors
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | Missing revenue
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | Invalid rating text
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good
This R program is the direct equivalent of SAS DATALINES dataset creation workflow.

SAS vs R Mapping

SAS | R
DATALINES | text=
infile datalines dlm='|' | sep="|"
input | colnames()
PROC PRINT | print()
PROC CONTENTS | str()

Important Enterprise Concepts

1. read.table()

This function reads raw delimited text data into R.

Equivalent SAS concept:

infile datalines dlm='|';

2. sep="|"

Defines delimiter exactly like:

dlm='|'

3. fill=TRUE

Critical for enterprise ingestion.

Suppose some rows have missing fields.

Without fill=TRUE:

  • R may fail ingestion
  • column shifting occurs

This behaves similarly to SAS:

truncover
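A minimal sketch of this padding behavior (toy two-row input, not the tourist dataset):

```r
# With fill = TRUE, a short row is padded instead of aborting the
# read or shifting columns -- broadly analogous to SAS TRUNCOVER.
txt <- "A|1|x
B|2"
df <- read.table(text = txt, sep = "|", header = FALSE,
                 stringsAsFactors = FALSE, fill = TRUE)
nrow(df)  # both rows survive; the short row is padded in V3
```

Without `fill = TRUE`, the same call stops with "line 2 did not have 3 elements".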

4. stringsAsFactors=FALSE

Older R versions automatically converted text into factors.

That causes:

  • unexpected modeling behavior
  • merge issues
  • reporting inconsistencies

Enterprise systems usually disable this.
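A quick illustration of why factors surprise people, using a hypothetical two-value vector:

```r
# Factors store integer level codes, not text. Levels sort
# alphabetically ("Agra" < "Paris"), so the codes are 2 and 1.
x <- factor(c("Paris", "Agra"))
as.numeric(x)    # 2 1 -- the level codes, not the data
as.character(x)  # "Paris" "Agra" -- the original text
```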

5. Why Are Column Names Assigned Separately?

Because raw text has no header row.

Equivalent SAS logic:

input Tourist_Spot $
      Country $
      City $;

#-----------------------------------------------------------
# DISPLAY DATASET
# Equivalent to PROC PRINT
#-----------------------------------------------------------

print(tourist_raw)

#-----------------------------------------------------------
# STRUCTURE OF DATASET
# Similar to PROC CONTENTS
#-----------------------------------------------------------

str(tourist_raw)

SAS Data Cleaning Workflow — Enterprise Style

Step 1 — Standardization

data tourist_clean;
set tourist_raw;

/*-----------------------------------------------------------
 PROPCASE standardizes inconsistent capitalization
------------------------------------------------------------*/
City    = propcase(strip(City));
Country = propcase(strip(Country));

/*-----------------------------------------------------------
 REMOVE SPECIAL CHARACTERS
 CAUTION: the 'k' modifier KEEPS only the listed characters
 and removes everything else, so this statement blanks out
 Remarks. The COMPRESS section below dissects this bug.
------------------------------------------------------------*/
Remarks = compress(Remarks,'@#$%^&*','k');

/*-----------------------------------------------------------
 CONVERT REVENUE TO NUMERIC
 INPUT converts character to numeric
------------------------------------------------------------*/
Revenue = input(Revenue_Text,best12.);

/*-----------------------------------------------------------
 HANDLE NULL VALUES
------------------------------------------------------------*/
if Revenue_Text='NULL' then Revenue=.;

/*-----------------------------------------------------------
 FIX NEGATIVE VALUES
 ABS converts negative to positive
------------------------------------------------------------*/
Visitors = abs(Visitors);
Revenue  = abs(Revenue);

/*-----------------------------------------------------------
 CONVERT RATING
------------------------------------------------------------*/
Rating = input(Rating_Text,best12.);

/*-----------------------------------------------------------
 INVALID RATINGS
------------------------------------------------------------*/
if Rating > 5 then Rating=5;
if Rating < 0 then Rating=.;

/*-----------------------------------------------------------
 DATE CONVERSION
------------------------------------------------------------*/
Visit_Date = input(Visit_Date_Text,ddmmyy10.);
format Visit_Date date9.;

/*-----------------------------------------------------------
 INVALID DATES: impute to the same day of the previous month
------------------------------------------------------------*/
if missing(Visit_Date) then
   Visit_Date = intnx('month',today(),-1,'same');

run;

proc print data=tourist_clean;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026
3 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025
8 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026
13 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025

Explanation

This DATA step demonstrates enterprise-grade defensive programming.
Instead of assuming clean input, every variable is validated.

Key business logic:

  • ABS() protects dashboards from impossible negative metrics
  • INPUT() converts raw text into analytical numeric formats
  • INTNX() imputes invalid dates
  • PROPCASE() standardizes text for grouping consistency

This is one of the most misunderstood yet powerful SAS character-cleaning statements used in enterprise data cleaning projects, especially in:

  • Clinical trial SDTM/ADaM preparation
  • Banking transaction cleansing
  • Tourism analytics
  • Insurance claim systems
  • Regulatory reporting pipelines

Understanding the COMPRESS Function in SAS

General Syntax

COMPRESS(source, characters-to-remove, modifiers)

Parameters

Parameter | Meaning
source | Original variable
characters-to-remove | Characters SAS should target
modifiers | Special behavior instructions

Source Variable

Remarks

Suppose the raw values are:

Excellent@Place
Very#Crowded
Good^View
Amazing&Safe

These values contain unwanted special characters.

'@#$%^&*'

This list defines characters SAS should examine.

These are: @, #, $, %, ^, &, and *.

'k'

The k modifier means:

“KEEP the listed characters instead of removing them.”

This completely changes the behavior.

Critical Logic Difference

WITHOUT k

compress(Remarks,'@#$%^&*')

Means:

Remove @ # $ % ^ & *

Example:

Before | After
Good@Place | GoodPlace
Great#View | GreatView

This is the normal behavior.

WITH k

compress(Remarks,'@#$%^&*','k')

Means:

KEEP ONLY @ # $ % ^ & *
Remove everything else.

Example:

Before | After
Good@Place | @
Great#View | #
Amazing&Safe | &

So the original statement is actually NOT cleaning Remarks properly. It does the opposite, which is why Remarks is blank in the cleaned output above.
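The same remove-vs-keep distinction can be sketched in R with stringr, using the toy values from the tables above:

```r
library(stringr)

remarks <- c("Good@Place", "Great#View")

# Remove the listed characters, like compress(x,'@#$%^&*'):
str_remove_all(remarks, "[@#$%^&*]")   # "GoodPlace" "GreatView"

# Keep ONLY the listed characters, like compress(x,'@#$%^&*','k')
# (a negated character class removes everything else):
str_remove_all(remarks, "[^@#$%^&*]")  # "@" "#"
```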

In clinical trials, similar logic ensures:

  • No impossible patient ages
  • No future adverse event dates
  • No duplicate subject IDs

Without these checks, regulatory audits fail.

Removing Duplicate Records

proc sort data=tourist_clean nodupkey;
by Tourist_Spot Country City;
run;

proc print data=tourist_clean;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date
1 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025
2 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025
3 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025
4 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
5 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025
6 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025
7 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025
8 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025
10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025
11 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025
12 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025
13 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025
14 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026

Explanation

NODUPKEY removes duplicate business keys.

In SDTM datasets:

  • Duplicate AE records create safety-reporting failures
  • Duplicate DM subjects break population counts

Sorting before analysis ensures:

  • Accurate aggregation
  • Reliable AI training
  • Trustworthy KPI calculations
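In R, the same key-based deduplication can be sketched with dplyr::distinct() (toy data, not the full dataset):

```r
library(dplyr)

df <- tibble::tibble(
  Tourist_Spot = c("Eiffel_Tower", "Eiffel_Tower", "Petra"),
  Country      = c("France",       "France",       "Jordan"))

# Keep the first row per business key, like PROC SORT NODUPKEY
dedup <- distinct(df, Tourist_Spot, Country, .keep_all = TRUE)
nrow(dedup)  # 2
```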

PROC SQL vs DATA Step

PROC SQL Approach

proc sql;
create table revenue_summary as
select Country, Category, count(*) as Total_Spots,
       mean(Revenue) as Avg_Revenue format=dollar15.,
       sum(Visitors) as Total_Visitors
from tourist_clean
group by Country, Category
order by Avg_Revenue desc;
quit;

proc print data=revenue_summary;
run;

OUTPUT:

Obs | Country | Category | Total_Spots | Avg_Revenue | Total_Visitors
1 | China | Historical | 1 | $3,500,000 | 5000
2 | India | Historical | 1 | $2,500,000 | 700000
3 | Peru | Historical | 1 | $2,100,000 | 300000
4 | Usa | Nature | 1 | $2,000,000 | 390000
5 | Greece | Beach | 1 | $1,800,000 | 250000
6 | Australia | Modern | 1 | $1,700,000 | 320000
7 | Canada | Nature | 2 | $1,600,000 | 730000
8 | Japan | Nature | 1 | $1,600,000 | 410000
9 | France | Historical | 1 | $1,500,000 | 500000
10 | Jordan | Historical | 1 | $1,450,000 | 310000
11 |  | Historical | 1 | $1,200,000 | 380000
12 | Usa | Historical | 1 | $100,000 | 400000
13 | Uae | Modern | 1 | . | 600000

Explanation

PROC SQL is excellent for:

  • Aggregations
  • Multi-table joins
  • Database pushdown optimization
  • Business reporting

SQL resembles enterprise warehouse systems like:

  • Oracle
  • Snowflake
  • Teradata

DATA Step Alternative

proc sort data=tourist_clean;
by Country Category;
run;

proc print data=tourist_clean;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date
1 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025
2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025
3 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025
4 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025
5 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025
8 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026
9 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026
10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025
11 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025
12 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025
13 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025
14 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025

data summary_ds;
set tourist_clean;
by Country Category;
retain Total_Visitors Total_Revenue Count;
if first.Category then do;
   Total_Visitors=0;
   Total_Revenue=0;
   Count=0;
end;
Total_Visitors+Visitors;
Total_Revenue+Revenue;
Count+1;
if last.Category then do;
   Avg_Revenue=round(Total_Revenue/Count,0.01);
   output;
end;
run;

proc print data=summary_ds;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date | Total_Visitors | Total_Revenue | Count | Avg_Revenue
1 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025 | 380000 | 1200000 | 1 | 1200000
2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025 | 320000 | 1700000 | 1 | 1700000
3 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025 | 730000 | 3200000 | 2 | 1600000
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025 | 5000 | 3500000 | 1 | 3500000
5 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025 | 500000 | 1500000 | 1 | 1500000
6 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025 | 250000 | 1800000 | 1 | 1800000
7 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026 | 700000 | 2500000 | 1 | 2500000
8 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026 | 410000 | 1600000 | 1 | 1600000
9 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025 | 310000 | 1450000 | 1 | 1450000
10 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025 | 300000 | 2100000 | 1 | 2100000
11 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025 | 600000 | 0 | 1 | 0
12 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025 | 400000 | 100000 | 1 | 100000
13 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025 | 390000 | 2000000 | 1 | 2000000

Explanation

The DATA step gives granular row-level control using:

  • FIRST.
  • LAST.
  • RETAIN

This approach is preferred when:

  • Complex derivations exist
  • Stateful calculations are needed
  • Clinical-trial lineage must be preserved

Advanced SAS Features

PROC FORMAT

proc format;
value revenuefmt low-1000000 = 'Low Revenue'
             1000001-2000000 = 'Medium Revenue'
                2000001-high = 'High Revenue';
run;

LOG:

NOTE: Format REVENUEFMT has been output.

Explanation

Formats improve readability and dashboard usability.

Instead of showing raw numbers, business users see:

  • High Revenue
  • Medium Revenue
  • Low Revenue

This improves executive communication.
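In R, the closest base analogue of this format is cut(); a minimal sketch using the same boundaries:

```r
# Band raw revenue into the same three labels as REVENUEFMT.
# Default right-closed intervals: (-Inf,1e6], (1e6,2e6], (2e6,Inf)
revenue <- c(1200000, 3500000, 100000)
band <- cut(revenue,
            breaks = c(-Inf, 1000000, 2000000, Inf),
            labels = c("Low Revenue", "Medium Revenue", "High Revenue"))
as.character(band)  # "Medium Revenue" "High Revenue" "Low Revenue"
```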

data tourist_clean2;
set tourist_clean;
length Revenue_FMT $20;
Revenue_FMT = put(Revenue,revenuefmt.);
run;

proc print data=tourist_clean2;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date | Revenue_FMT
1 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025 | Medium Revenue
2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025 | Medium Revenue
3 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025 | Medium Revenue
4 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025 | Medium Revenue
5 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025 | High Revenue
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025 | Medium Revenue
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025 | Medium Revenue
8 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026 | High Revenue
9 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026 | Medium Revenue
10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025 | Medium Revenue
11 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025 | High Revenue
12 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025 | .
13 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025 | Low Revenue
14 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025 | Medium Revenue

PROC REPORT Dashboard

proc report data=tourist_clean2 nowd;
column Country Tourist_Spot Revenue Revenue_FMT Visitors Rating;
define Country / group;
define Tourist_Spot / display;
define Revenue / analysis sum format=dollar15.;
define Revenue_FMT / group;
define Visitors / analysis sum;
define Rating / analysis mean format=4.2;
compute Revenue;
   if Revenue.sum > 2000000 then
      call define(_col_,'style','style={background=lightgreen}');
endcomp;
run;

OUTPUT:

Country | Tourist_Spot | Revenue | Revenue_FMT | Visitors | Rating
Australia | Sydney_Opera | $1,700,000 | Medium Revenue | 320000 | 4.30
Canada | Banff_Park | $1,900,000 | Medium Revenue | 280000 | 4.70
 | Niagara_Falls | $1,300,000 |  | 450000 | .
China | Great_Wall | $3,500,000 | High Revenue | 5000 | 4.90
France | Eiffel_Tower | $1,500,000 | Medium Revenue | 500000 | 4.80
Greece | Santorini | $1,800,000 | Medium Revenue | 250000 | 4.50
India | Taj_Mahal | $2,500,000 | High Revenue | 700000 | 5.00
Japan | Mount_Fuji | $1,600,000 | Medium Revenue | 410000 | 4.90
Jordan | Petra | $1,450,000 | Medium Revenue | 310000 | 4.60
Peru | Machu_Picchu | $2,100,000 | High Revenue | 300000 | 4.60
Uae | Burj_Khalifa | . | . | 600000 | 4.70
Usa | Statue_of_Liberty | $100,000 | Low Revenue | 400000 | 4.70
 | Grand_Canyon | $2,000,000 | Medium Revenue | 390000 | 4.80

Explanation

PROC REPORT creates audit-ready professional outputs.

Used heavily in:

  • Clinical TLF generation
  • Financial reporting
  • Regulatory submissions

Conditional formatting improves executive visibility.

SAS Macro for Reusable Validation

%macro validate(ds,var);
proc freq data=&ds;
tables &var / missing;
run;
%mend;

%validate(tourist_clean2,Country);

OUTPUT:

The FREQ Procedure

Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent
 | 1 | 7.14 | 1 | 7.14
Australia | 1 | 7.14 | 2 | 14.29
Canada | 2 | 14.29 | 4 | 28.57
China | 1 | 7.14 | 5 | 35.71
France | 1 | 7.14 | 6 | 42.86
Greece | 1 | 7.14 | 7 | 50.00
India | 1 | 7.14 | 8 | 57.14
Japan | 1 | 7.14 | 9 | 64.29
Jordan | 1 | 7.14 | 10 | 71.43
Peru | 1 | 7.14 | 11 | 78.57
Uae | 1 | 7.14 | 12 | 85.71
Usa | 2 | 14.29 | 14 | 100.00

Explanation

Macros reduce repetitive programming.

Enterprise benefits:

  • Standardization
  • Faster validation
  • Reduced human error
  • Easier maintenance

In pharmaceutical programming, reusable macros are essential.
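The R analogue of this macro is an ordinary function; a minimal sketch (the function name `validate` mirrors the macro and is illustrative):

```r
# Frequency table including missing values,
# like PROC FREQ with TABLES var / MISSING
validate <- function(ds, var) {
  table(ds[[var]], useNA = "ifany")
}

df <- data.frame(Country = c("France", "France", NA))
validate(df, "Country")  # France: 2, <NA>: 1
```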

R Refinement Layer

R Dataset Cleaning

tourist_clean <- tourist_raw %>%
  clean_names() %>%
  mutate(
    city         = str_trim(str_to_title(city)),
    # empty strings must become NA first, otherwise coalesce()
    # cannot impute the missing country
    country      = na_if(str_trim(country), ""),
    country      = coalesce(country, "Unknown"),
    revenue_text = if_else(revenue_text == "NULL", NA_character_, revenue_text),
    revenue      = abs(as.numeric(revenue_text)),
    visitors     = abs(visitors),
    rating_text  = if_else(grepl("[A-Za-z]", rating_text), NA_character_, rating_text),
    rating       = as.numeric(rating_text),
    rating       = if_else(rating > 5, 5, rating),
    visit_date   = suppressWarnings(parse_date_time(visit_date_text, orders = "dmy")),
    remarks      = str_replace_all(remarks, "[@#$%^&*]", "")
  ) %>%
  distinct()

OUTPUT:

  | tourist_spot | country | city | category | rating_text | revenue_text | visit_date_text | visitors | remarks | revenue | rating | visit_date
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction | 1500000 | 4.8 | 12-05-2025
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating | 2500000 | 5 | NA
3 | Statue_of_Liberty | USA | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue | 100000 | 4.7 | 15-08-2025
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | Negative visitors | 3500000 | 4.9 | 25-06-2025
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good | 2100000 | 4.6 | 10-07-2025
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate | 1500000 | 4.8 | 12-05-2025
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city | 1800000 | 4.5 | 22-09-2025
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NA | 18-04-2025 | 600000 | Missing revenue | NA | 4.7 | 18-04-2025
9 | Niagara_Falls | Canada | Toronto | Nature | NA | 1300000 | 11-05-2025 | 450000 | Invalid rating text | 1300000 | NA | 11-05-2025
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country | 1200000 | 4.4 | 05-03-2025
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good | 1700000 | 4.3 | 17-11-2025
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date | 1600000 | 4.9 | NA
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good | 2000000 | 4.8 | 07-08-2025
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good | 1900000 | 4.7 | 09-10-2025
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good | 1450000 | 4.6 | 15-06-2025
Explanation

R’s tidyverse pipeline is highly expressive.

Equivalent SAS vs R comparisons:

SAS | R
PROPCASE | str_to_title
COMPRESS | str_replace_all
MISSING | is.na
IF-THEN | if_else
PROC SORT NODUPKEY | distinct
INPUT | as.numeric

R excels in:

  • Interactive exploration
  • Visualization
  • Machine learning integration

SAS dominates in:

  • Regulatory compliance
  • Metadata governance
  • Audit traceability

Business Logic Behind Data Cleaning

Data cleaning exists because raw data reflects human behavior, system limitations, and operational chaos. In healthcare, tourism, banking, and AI systems, decisions are only as accurate as the underlying data.

Suppose a patient age is recorded as 250. Without validation, clinical-trial analysis may classify impossible demographics, impacting safety conclusions. Similarly, if tourism revenue becomes negative due to ingestion errors, executive dashboards may falsely indicate economic collapse.

Missing values are replaced because analytical models cannot reliably interpret blanks. For example:

  • Missing tourist country → replaced using business rules
  • Missing patient treatment dates → imputed for continuity
  • Missing salary data → normalized using averages or medians

Date correction is equally critical. Invalid dates break forecasting pipelines, machine-learning models, and regulatory timelines.
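A small sketch of how lubridate flags the impossible dates from this dataset:

```r
library(lubridate)

dates  <- c("12-05-2025", "29-02-2025", "31-14-2025")
parsed <- suppressWarnings(dmy(dates))
is.na(parsed)  # FALSE TRUE TRUE -- 2025 has no 29 Feb and no month 14
```

Rows flagged as NA can then be routed to an exception report instead of silently entering forecasts.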

Standardization also matters:

  • "usa", "USA", "Usa" should become one standardized value
  • Duplicate tourist locations inflate counts
  • Malformed text disrupts joins and reporting
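A minimal DATA step sketch of this standardization, assuming a dataset named tourist_raw with country and spot variables (names are illustrative):

```sas
data tourist_std;
    set tourist_raw;
    length country_std $30;
    /* "usa", "USA", "Usa" collapse to one standardized value */
    country_std = upcase(strip(country));
    /* Strip special characters; keep letters, digits, and spaces */
    spot = compress(spot, ' ', 'kad');
run;

/* Drop duplicate locations on the business key */
proc sort data=tourist_std nodupkey;
    by spot country_std;
run;
```

Standardizing casing before the NODUPKEY sort matters: "USA" and "usa" are distinct keys until they are normalized.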

In SDTM and ADaM environments, every transformation must be reproducible and traceable. That is why enterprise SAS programming emphasizes:

  • validation,
  • audit trails,
  • controlled terminology,
  • metadata consistency,
  • and deterministic transformations.

Clean data is not convenience.
It is operational integrity.

20 Additional Data Cleaning Best Practices

  1. Always validate source-system lineage before ingestion.
  2. Maintain audit trails for every transformation.
  3. Never overwrite raw datasets directly.
  4. Use controlled terminology in SDTM domains.
  5. Validate all date variables against protocol timelines.
  6. Standardize casing before joins.
  7. Remove hidden special characters.
  8. Track duplicate subject IDs carefully.
  9. Separate business rules from transformation logic.
  10. Use macros for reusable validations.
  11. Validate negative numeric values.
  12. Document derivation logic clearly.
  13. Perform frequency checks before analysis.
  14. Use PROC CONTENTS to verify metadata.
  15. Validate missingness patterns statistically.
  16. Ensure reproducibility across environments.
  17. Compare PROC SQL and DATA step outputs.
  18. Preserve raw ingestion copies for audits.
  19. Build exception reports for invalid records.
  20. Implement peer-review validation before production release.
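Practice 10 (reusable validation macros) can be sketched as follows; the macro name, parameters, and dataset names are hypothetical:

```sas
/* Reusable range-check macro: writes out-of-range records to an exception dataset */
%macro range_check(ds=, var=, low=, high=);
    data &ds._invalid_&var;
        set &ds;
        if not missing(&var) and (&var < &low or &var > &high) then output;
    run;

    proc print data=&ds._invalid_&var;
        title "Invalid &var values in &ds (outside &low-&high)";
    run;
%mend range_check;

/* Example: flag ratings outside the valid 0-5 scale */
%range_check(ds=tourist_clean, var=rating, low=0, high=5);
```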

20 Key Points — Sharp Enterprise Insights

  • Dirty data leads to wrong conclusions.
  • Standardization ensures reproducibility.
  • Duplicate records inflate business KPIs.
  • Invalid dates destroy forecasting accuracy.
  • Missing values impact machine learning reliability.
  • SAS excels in regulatory governance.
  • R excels in analytical flexibility.
  • LENGTH placement prevents truncation disasters.
  • PROC SQL simplifies aggregation logic.
  • DATA step enables row-level control.
  • Audit trails are mandatory in healthcare analytics.
  • Controlled terminology improves consistency.
  • PROC REPORT creates executive-ready outputs.
  • Macros reduce repetitive validation work.
  • Data lineage improves compliance transparency.
  • Enterprise ETL requires defensive programming.
  • Text normalization improves join accuracy.
  • Validation protects AI model integrity.
  • Clean data drives trustworthy dashboards.
  • Reliable analytics starts with disciplined engineering.

Summary

This case study demonstrated how messy tourism data can be transformed into enterprise-grade analytical intelligence using both SAS and R. We intentionally created corrupted datasets containing duplicates, missing values, malformed text, negative metrics, inconsistent casing, and invalid dates to simulate real-world enterprise ingestion problems.

Using SAS DATA step programming, we implemented defensive validation logic with functions like:

  • PROPCASE
  • COMPRESS
  • INPUT
  • INTNX
  • ABS
  • MISSING
  • CATX
  • SCAN

We also explored:

  • PROC SQL aggregations
  • PROC REPORT dashboards
  • PROC FORMAT categorization
  • PROC SORT NODUPKEY deduplication
  • reusable SAS macros

The tutorial highlighted why SAS remains dominant in regulated industries such as clinical trials, banking, and compliance-focused analytics due to:

  • auditability,
  • metadata governance,
  • reproducibility,
  • and validation traceability.

On the R side, tidyverse functions such as:

  • mutate()
  • case_when()
  • distinct()
  • replace_na()
  • parse_date_time()
  • str_replace_all()

provided a modern and highly expressive refinement layer suitable for exploratory analytics and machine-learning workflows.

The biggest lesson is simple:

Raw data is never trustworthy by default.

Whether you are building:

  • SDTM datasets,
  • tourism dashboards,
  • fraud-detection engines,
  • or AI recommendation systems,

your analytical quality depends entirely on structured, validated, and reproducible cleaning frameworks.

Clean data creates trustworthy intelligence.
Trustworthy intelligence drives confident business decisions.

Conclusion

Modern analytics is no longer about simply generating reports. It is about engineering trust.

Organizations today depend on data for:

  • AI automation,
  • regulatory compliance,
  • predictive forecasting,
  • financial planning,
  • healthcare decisions,
  • and operational intelligence.

But raw enterprise data is inherently chaotic.

Tourism systems contain duplicate bookings, malformed locations, inconsistent currency formats, and invalid visitor counts. Clinical-trial systems contain missing treatment dates, duplicated adverse events, inconsistent medical coding, and protocol deviations. Banking systems face corrupted transactions, fraud anomalies, and incomplete customer records.

Without structured cleaning frameworks, these issues silently poison analytical outputs.

This project demonstrated how SAS and R complement each other in enterprise environments.

SAS provides:

  • industrial-grade governance,
  • validation traceability,
  • audit readiness,
  • metadata control,
  • and production reliability.

R provides:

  • agile data exploration,
  • elegant transformation pipelines,
  • machine-learning compatibility,
  • and rapid experimentation.

The real power emerges when both ecosystems work together.

A mature enterprise workflow often follows this pattern:

  1. SAS performs ingestion, compliance validation, SDTM-standard transformation, and production reporting.
  2. R performs exploratory analytics, visualization, statistical modeling, and advanced AI workflows.

The future of analytics belongs to professionals who understand:

  • data engineering,
  • validation logic,
  • business rules,
  • compliance requirements,
  • and scalable transformation frameworks.

Because ultimately, dashboards are only as trustworthy as the pipelines behind them.

Clean data is not merely technical hygiene.
It is the foundation of regulatory credibility, executive confidence, AI reliability, and enterprise intelligence.

Interview Questions and Answers

1. Why is LENGTH placement important in SAS?

Answer:
SAS assigns variable attributes during compilation. If LENGTH is placed after INPUT, character variables may truncate permanently. This can corrupt joins, grouping logic, and reporting outputs.
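A minimal sketch of the difference (variable names are illustrative):

```sas
/* Risky: list input defaults character variables to length $8,
   so "Grand_Canyon" truncates before LENGTH is ever seen */
data bad;
    input spot $ country $;
    length spot $40;   /* too late: attribute already fixed at $8 */
    datalines;
Grand_Canyon USA
;

/* Safe: declare lengths before the variables first appear */
data good;
    length spot $40 country $20;
    input spot $ country $;
    datalines;
Grand_Canyon USA
;
```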

2. When would you prefer DATA Step over PROC SQL?

Answer:
Use DATA Step when row-by-row processing, FIRST./LAST. logic, RETAIN statements, or complex derivations are required. PROC SQL is better for aggregation and joins.
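An illustrative FIRST.-variable pattern of the kind PROC SQL cannot express directly (dataset and variable names assumed):

```sas
proc sort data=visits;
    by usubjid visit_date;
run;

data first_visits;
    set visits;
    by usubjid;
    if first.usubjid;   /* keep only the earliest visit per subject */
run;
```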

3. How would you detect duplicate clinical records?

Answer:

proc sort data=ae nodupkey;
    by usubjid aestdtc aeterm;
run;

This removes duplicate adverse-event records using business keys.

4. How does R handle missing values differently from SAS?

Answer:
R uses NA, while SAS uses numeric . and blank character values. R functions like replace_na() explicitly manage missingness, whereas SAS uses MISSING() and COALESCEC().
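A small illustrative sketch of the R side (the column and values are assumed):

```r
library(dplyr)
library(tidyr)

ratings <- tibble(rating = c(4.5, NA, 4.8))

# Impute the missing rating with the column mean of the observed values
ratings %>%
  mutate(rating = replace_na(rating, mean(rating, na.rm = TRUE)))
```

The SAS counterpart would typically use COALESCE for numerics or COALESCEC for character values.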

5. A clinical dataset shows negative patient weight values. How would you handle this?

Answer:

First investigate source-system lineage. If confirmed as data-entry errors, apply business-rule correction:

Weight = abs(Weight);

Then:

  • flag corrected records,
  • document derivation logic,
  • preserve audit traceability,
  • and validate downstream calculations.

This ensures regulatory compliance and analytical reliability.
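The flag-and-correct pattern described above might be sketched like this (dataset and flag names are hypothetical):

```sas
data weights_clean;
    set weights_raw;
    length corr_flag $1;
    corr_flag = 'N';
    if not missing(weight) and weight < 0 then do;
        weight    = abs(weight);   /* business-rule correction */
        corr_flag = 'Y';           /* flag corrected record for audit */
    end;
run;

/* Exception listing preserves audit traceability */
proc print data=weights_clean;
    where corr_flag = 'Y';
    title 'Audit listing of corrected weight values';
run;
```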

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article were created for educational and demonstration purposes only; the tourist data shown here is entirely fictional.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


---> Follow our blog for more SAS-based analytics projects and industry data models.

---> Support us by following our blog.

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3. Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


 

