From Taj Mahal to Machine Learning: Enterprise Data Cleaning Secrets Behind Reliable Tourism Analytics in SAS and R

Turning the World’s Most Famous Tourist Spots Dataset into Trusted Business Analytics Using SAS (PROC SQL vs DATA Step) and R Data Engineering Frameworks

Introduction — When Beautiful Tourist Data Turns Into an Enterprise Disaster

Imagine you are working for a global travel analytics company responsible for predicting tourism trends across the world. Your dashboards influence hotel investments, airline route planning, tourism ministry budgets, and AI-powered recommendation engines. One wrong value inside your dataset can distort millions of dollars in decisions.

Now imagine the following:

  • Paris visitor counts stored as "15M" instead of numeric values
  • Taj Mahal dates entered as 31-14-2025
  • Duplicate records for the same tourist spot
  • Negative revenue values
  • Missing country names
  • Inconsistent casing like "new york", "NEW YORK", and "New york"
  • Invalid ratings above 5
  • Random special characters in text columns

This is exactly how real-world enterprise data looks before cleaning.

The same nightmare happens in clinical trials. A single incorrect patient age, missing treatment date, or duplicated adverse-event record can delay regulatory approvals from agencies such as the FDA or EMA.

Data cleaning is not cosmetic work.
It is business survival.

In this enterprise-level SAS and R case study, we will build a corrupted “World Famous Tourist Spots” dataset with intentional errors and transform it into production-ready analytical intelligence using:

  • SAS DATA Step
  • PROC SQL
  • PROC REPORT
  • SAS Macros
  • R tidyverse ecosystem
  • Enterprise validation logic
  • Audit-ready reporting pipelines

This tutorial combines:

  • Clinical-trial-grade validation logic
  • Tourism analytics
  • SAS interview preparation
  • Real-world business intelligence engineering

Creating the Raw Tourist Dataset with Intentional Errors

Suppose a tourism intelligence company collects worldwide tourist data from:

  • APIs
  • Excel sheets
  • Manual entry systems
  • Government portals
  • Web scraping engines

The raw data arrives corrupted and inconsistent.

SAS Raw Dataset Creation

/*-----------------------------------------------------------
 STEP 1: DEFINE VARIABLE LENGTHS BEFORE READING DATA. WHY?
 SAS assigns variable lengths during compilation.
 If LENGTH is placed after INPUT, truncation occurs.
------------------------------------------------------------*/
data tourist_raw;
length Tourist_Spot $40 Country $25 City $25 Category $20 Rating_Text $10
       Revenue_Text $15 Visit_Date_Text $20 Remarks $50;
infile datalines dlm='|' truncover;
input Tourist_Spot $ Country $ City $ Category $ Rating_Text $ Revenue_Text $
      Visit_Date_Text $ Visitors Remarks $;
datalines;
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Top attraction
Taj_Mahal|India|Agra|Historical|6.5|2500000|31-14-2025|700000|Invalid rating
Statue_of_Liberty|USA|NewYork|Historical|4.7|-100000|15-08-2025|400000|Negative revenue
Great_Wall|China|Beijing|Historical|4.9|3500000|25-06-2025|-5000|Negative visitors
Machu_Picchu|Peru|Cusco|Historical|4.6|2100000|10-07-2025|300000|Good
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Duplicate
Santorini|Greece|santorini|Beach|4.5|1800000|22-09-2025|250000|lowercase city
Burj_Khalifa|UAE|Dubai|Modern|4.7|NULL|18-04-2025|600000|Missing revenue
Niagara_Falls|Canada|Toronto|Nature|abc|1300000|11-05-2025|450000|Invalid rating text
Colosseum| |Rome|Historical|4.4|1200000|05-03-2025|380000|Missing country
Sydney_Opera|Australia|Sydney|Modern|4.3|1700000|17-11-2025|320000|Good
Mount_Fuji|Japan|Tokyo|Nature|4.9|1600000|29-02-2025|410000|Invalid date
Grand_Canyon|USA|Arizona|Nature|4.8|2000000|07-08-2025|390000|Good
Banff_Park|Canada|Alberta|Nature|4.7|1900000|09-10-2025|280000|Good
Petra|Jordan|Amman|Historical|4.6|1450000|15-06-2025|310000|Good
;
run;

proc print data=tourist_raw;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | Top attraction | 500000
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | Invalid rating | 700000
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | Negative revenue | 400000
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | Negative visitors | -5000
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | Good | 300000
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | Duplicate | 500000
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | lowercase city | 250000
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | Missing revenue | 600000
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | Invalid rating text | 450000
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | Missing country | 380000
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | Good | 320000
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | Invalid date | 410000
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | Good | 390000
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | Good | 280000
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | Good | 310000

Explanation

Intentional errors introduced:

  • Missing countries
  • Invalid dates
  • Negative visitors
  • Duplicate records
  • Mixed casing
  • Invalid ratings
  • Embedded symbols

Understanding the SAS “Truncation Trap”

One of the biggest mistakes beginners make is placing LENGTH after the INPUT statement.

Incorrect:

input Country $;
length Country $25;

In SAS, variable attributes are fixed at compile time. With simple list input, `input Country $;` assigns the default character length of 8, so the later LENGTH statement is ignored (SAS writes a note to the log) and "Australia" is truncated to "Australi".

Correct approach:

length Country $25;
input Country $;

This is extremely important in:

  • SDTM datasets
  • ADaM derivations
  • Regulatory submissions
  • Production ETL systems

R behaves differently because strings are dynamically managed in memory. SAS allocates fixed-length storage unless explicitly controlled.

#-----------------------------------------------------------
# CREATE RAW TOURIST DATASET IN R
# Equivalent to SAS DATALINES approach
#-----------------------------------------------------------

library(tidyverse)
library(stringr)
library(lubridate)
library(janitor)
library(purrr)

#-----------------------------------------------------------
# RAW DATA CREATED EXACTLY LIKE SAS DATALINES
# sep="|" acts like DLM='|'
# header=FALSE because SAS DATALINES has no header row
# stringsAsFactors=FALSE prevents automatic factor conversion
#-----------------------------------------------------------

tourist_raw <- read.table(text = "
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Top attraction
Taj_Mahal|India|Agra|Historical|6.5|2500000|31-14-2025|700000|Invalid rating
Statue_of_Liberty|USA|NewYork|Historical|4.7|-100000|15-08-2025|400000|Negative revenue
Great_Wall|China|Beijing|Historical|4.9|3500000|25-06-2025|-5000|Negative visitors
Machu_Picchu|Peru|Cusco|Historical|4.6|2100000|10-07-2025|300000|Good
Eiffel_Tower|France|Paris|Historical|4.8|1500000|12-05-2025|500000|Duplicate
Santorini|Greece|santorini|Beach|4.5|1800000|22-09-2025|250000|lowercase city
Burj_Khalifa|UAE|Dubai|Modern|4.7|NULL|18-04-2025|600000|Missing revenue
Niagara_Falls|Canada|Toronto|Nature|abc|1300000|11-05-2025|450000|Invalid rating text
Colosseum||Rome|Historical|4.4|1200000|05-03-2025|380000|Missing country
Sydney_Opera|Australia|Sydney|Modern|4.3|1700000|17-11-2025|320000|Good
Mount_Fuji|Japan|Tokyo|Nature|4.9|1600000|29-02-2025|410000|Invalid date
Grand_Canyon|USA|Arizona|Nature|4.8|2000000|07-08-2025|390000|Good
Banff_Park|Canada|Alberta|Nature|4.7|1900000|09-10-2025|280000|Good
Petra|Jordan|Amman|Historical|4.6|1450000|15-06-2025|310000|Good
",
  sep = "|", header = FALSE, stringsAsFactors = FALSE, fill = TRUE)

OUTPUT:

  | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | -5000 | Negative visitors
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | Missing revenue
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | Invalid rating text
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good

#-----------------------------------------------------------
# ASSIGN COLUMN NAMES
# Equivalent to SAS INPUT variable list
#-----------------------------------------------------------

colnames(tourist_raw) <- c("Tourist_Spot","Country","City","Category",
  "Rating_Text","Revenue_Text","Visit_Date_Text","Visitors","Remarks")

OUTPUT:

  | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Visitors | Remarks
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating
3 | Statue_of_Liberty | USA | NewYork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | -5000 | Negative visitors
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate
7 | Santorini | Greece | santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NULL | 18-04-2025 | 600000 | Missing revenue
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 | 450000 | Invalid rating text
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good
This R program is the direct equivalent of SAS DATALINES dataset creation workflow.

SAS vs R Mapping

SAS | R
DATALINES | text=
infile datalines dlm='|' | sep="|"
input | colnames()
PROC PRINT | print()
PROC CONTENTS | str()

Important Enterprise Concepts

1. read.table()

This function reads raw delimited text data into R.

Equivalent SAS concept:

infile datalines dlm='|';

2. sep="|"

Defines delimiter exactly like:

dlm='|'

3. fill=TRUE

Critical for enterprise ingestion.

Suppose some rows have missing fields.

Without fill=TRUE:

  • R may fail ingestion
  • column shifting occurs

This behaves similarly to SAS:

truncover
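A minimal sketch of this padding behavior (toy two-row input, not the tourist dataset):

```r
# With fill = TRUE, a short row is padded instead of aborting the
# read or shifting columns -- broadly analogous to SAS TRUNCOVER.
txt <- "A|1|x
B|2"
df <- read.table(text = txt, sep = "|", header = FALSE,
                 stringsAsFactors = FALSE, fill = TRUE)
nrow(df)  # both rows survive; the short row is padded in V3
```

Without `fill = TRUE`, the same call stops with "line 2 did not have 3 elements".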

4. stringsAsFactors=FALSE

Older R versions automatically converted text into factors.

That causes:

  • unexpected modeling behavior
  • merge issues
  • reporting inconsistencies

Enterprise systems usually disable this.
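A quick illustration of why factors surprise people, using a hypothetical two-value vector:

```r
# Factors store integer level codes, not text. Levels sort
# alphabetically ("Agra" < "Paris"), so the codes are 2 and 1.
x <- factor(c("Paris", "Agra"))
as.numeric(x)    # 2 1 -- the level codes, not the data
as.character(x)  # "Paris" "Agra" -- the original text
```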

5. Why Are Column Names Assigned Separately?

Because raw text has no header row.

Equivalent SAS logic:

input Tourist_Spot $
      Country $
      City $;

#-----------------------------------------------------------
# DISPLAY DATASET
# Equivalent to PROC PRINT
#-----------------------------------------------------------

print(tourist_raw)

#-----------------------------------------------------------
# STRUCTURE OF DATASET
# Similar to PROC CONTENTS
#-----------------------------------------------------------

str(tourist_raw)

SAS Data Cleaning Workflow — Enterprise Style

Step 1 — Standardization

data tourist_clean;
set tourist_raw;

/*-----------------------------------------------------------
 PROPCASE standardizes inconsistent capitalization
------------------------------------------------------------*/
City    = propcase(strip(City));
Country = propcase(strip(Country));

/*-----------------------------------------------------------
 REMOVE SPECIAL CHARACTERS
 CAUTION: the 'k' modifier KEEPS only the listed characters
 and removes everything else, so this statement blanks out
 Remarks. The COMPRESS section below dissects this bug.
------------------------------------------------------------*/
Remarks = compress(Remarks,'@#$%^&*','k');

/*-----------------------------------------------------------
 CONVERT REVENUE TO NUMERIC
 INPUT converts character to numeric
------------------------------------------------------------*/
Revenue = input(Revenue_Text,best12.);

/*-----------------------------------------------------------
 HANDLE NULL VALUES
------------------------------------------------------------*/
if Revenue_Text='NULL' then Revenue=.;

/*-----------------------------------------------------------
 FIX NEGATIVE VALUES
 ABS converts negative to positive
------------------------------------------------------------*/
Visitors = abs(Visitors);
Revenue  = abs(Revenue);

/*-----------------------------------------------------------
 CONVERT RATING
------------------------------------------------------------*/
Rating = input(Rating_Text,best12.);

/*-----------------------------------------------------------
 INVALID RATINGS
------------------------------------------------------------*/
if Rating > 5 then Rating=5;
if Rating < 0 then Rating=.;

/*-----------------------------------------------------------
 DATE CONVERSION
------------------------------------------------------------*/
Visit_Date = input(Visit_Date_Text,ddmmyy10.);
format Visit_Date date9.;

/*-----------------------------------------------------------
 INVALID DATES: impute to the same day of the previous month
------------------------------------------------------------*/
if missing(Visit_Date) then
   Visit_Date = intnx('month',today(),-1,'same');

run;

proc print data=tourist_clean;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026
3 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025
8 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026
13 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025

Explanation

This DATA step demonstrates enterprise-grade defensive programming.
Instead of assuming clean input, every variable is validated.

Key business logic:

  • ABS() protects dashboards from impossible negative metrics
  • INPUT() converts raw text into analytical numeric formats
  • INTNX() imputes invalid dates
  • PROPCASE() standardizes text for grouping consistency

This is one of the most misunderstood yet powerful SAS character-cleaning statements used in enterprise data cleaning projects, especially in:

  • Clinical trial SDTM/ADaM preparation
  • Banking transaction cleansing
  • Tourism analytics
  • Insurance claim systems
  • Regulatory reporting pipelines

Understanding the COMPRESS Function in SAS

General Syntax

COMPRESS(source, characters-to-remove, modifiers)

Parameters

Parameter | Meaning
source | Original variable
characters-to-remove | Characters SAS should target
modifiers | Special behavior instructions

Source Variable

Remarks

Suppose the raw values are:

Excellent@Place
Very#Crowded
Good^View
Amazing&Safe

These values contain unwanted special characters.

'@#$%^&*'

This list defines characters SAS should examine.

These are: @, #, $, %, ^, &, and *.

'k'

The k modifier means:

“KEEP the listed characters instead of removing them.”

This completely changes the behavior.

Critical Logic Difference

WITHOUT k

compress(Remarks,'@#$%^&*')

Means:

Remove @ # $ % ^ & *

Example:

Before | After
Good@Place | GoodPlace
Great#View | GreatView

This is the normal behavior.

WITH k

compress(Remarks,'@#$%^&*','k')

Means:

KEEP ONLY @ # $ % ^ & *
Remove everything else.

Example:

Before | After
Good@Place | @
Great#View | #
Amazing&Safe | &

So the original statement is actually NOT cleaning Remarks properly. It does the opposite, which is why Remarks is blank in the cleaned output above.
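The same remove-vs-keep distinction can be sketched in R with stringr, using the toy values from the tables above:

```r
library(stringr)

remarks <- c("Good@Place", "Great#View")

# Remove the listed characters, like compress(x,'@#$%^&*'):
str_remove_all(remarks, "[@#$%^&*]")   # "GoodPlace" "GreatView"

# Keep ONLY the listed characters, like compress(x,'@#$%^&*','k')
# (a negated character class removes everything else):
str_remove_all(remarks, "[^@#$%^&*]")  # "@" "#"
```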

In clinical trials, similar logic ensures:

  • No impossible patient ages
  • No future adverse event dates
  • No duplicate subject IDs

Without these checks, regulatory audits fail.

Removing Duplicate Records

proc sort data=tourist_clean nodupkey;
by Tourist_Spot Country City;
run;

proc print data=tourist_clean;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date
1 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025
2 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025
3 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025
4 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
5 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025
6 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025
7 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025
8 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026
9 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025
10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025
11 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025
12 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025
13 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025
14 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026

Explanation

NODUPKEY removes duplicate business keys.

In SDTM datasets:

  • Duplicate AE records create safety-reporting failures
  • Duplicate DM subjects break population counts

Sorting before analysis ensures:

  • Accurate aggregation
  • Reliable AI training
  • Trustworthy KPI calculations
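In R, the same key-based deduplication can be sketched with dplyr::distinct() (toy data, not the full dataset):

```r
library(dplyr)

df <- tibble::tibble(
  Tourist_Spot = c("Eiffel_Tower", "Eiffel_Tower", "Petra"),
  Country      = c("France",       "France",       "Jordan"))

# Keep the first row per business key, like PROC SORT NODUPKEY
dedup <- distinct(df, Tourist_Spot, Country, .keep_all = TRUE)
nrow(dedup)  # 2
```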

PROC SQL vs DATA Step

PROC SQL Approach

proc sql;
create table revenue_summary as
select Country, Category, count(*) as Total_Spots,
       mean(Revenue) as Avg_Revenue format=dollar15.,
       sum(Visitors) as Total_Visitors
from tourist_clean
group by Country, Category
order by Avg_Revenue desc;
quit;

proc print data=revenue_summary;
run;

OUTPUT:

Obs | Country | Category | Total_Spots | Avg_Revenue | Total_Visitors
1 | China | Historical | 1 | $3,500,000 | 5000
2 | India | Historical | 1 | $2,500,000 | 700000
3 | Peru | Historical | 1 | $2,100,000 | 300000
4 | Usa | Nature | 1 | $2,000,000 | 390000
5 | Greece | Beach | 1 | $1,800,000 | 250000
6 | Australia | Modern | 1 | $1,700,000 | 320000
7 | Canada | Nature | 2 | $1,600,000 | 730000
8 | Japan | Nature | 1 | $1,600,000 | 410000
9 | France | Historical | 1 | $1,500,000 | 500000
10 | Jordan | Historical | 1 | $1,450,000 | 310000
11 |  | Historical | 1 | $1,200,000 | 380000
12 | Usa | Historical | 1 | $100,000 | 400000
13 | Uae | Modern | 1 | . | 600000

Explanation

PROC SQL is excellent for:

  • Aggregations
  • Multi-table joins
  • Database pushdown optimization
  • Business reporting

SQL resembles enterprise warehouse systems like:

  • Oracle
  • Snowflake
  • Teradata

DATA Step Alternative

proc sort data=tourist_clean;
by Country Category;
run;

proc print data=tourist_clean;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date
1 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025
2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025
3 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025
4 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025
5 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025
8 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026
9 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026
10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025
11 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025
12 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025
13 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025
14 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025

data summary_ds;
set tourist_clean;
by Country Category;
retain Total_Visitors Total_Revenue Count;
if first.Category then do;
   Total_Visitors=0;
   Total_Revenue=0;
   Count=0;
end;
Total_Visitors+Visitors;
Total_Revenue+Revenue;
Count+1;
if last.Category then do;
   Avg_Revenue=round(Total_Revenue/Count,0.01);
   output;
end;
run;

proc print data=summary_ds;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date | Total_Visitors | Total_Revenue | Count | Avg_Revenue
1 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025 | 380000 | 1200000 | 1 | 1200000
2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025 | 320000 | 1700000 | 1 | 1700000
3 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025 | 730000 | 3200000 | 2 | 1600000
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025 | 5000 | 3500000 | 1 | 3500000
5 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025 | 500000 | 1500000 | 1 | 1500000
6 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025 | 250000 | 1800000 | 1 | 1800000
7 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026 | 700000 | 2500000 | 1 | 2500000
8 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026 | 410000 | 1600000 | 1 | 1600000
9 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025 | 310000 | 1450000 | 1 | 1450000
10 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025 | 300000 | 2100000 | 1 | 2100000
11 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025 | 600000 | 0 | 1 | 0
12 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025 | 400000 | 100000 | 1 | 100000
13 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025 | 390000 | 2000000 | 1 | 2000000

Explanation

The DATA step gives granular row-level control using:

  • FIRST.
  • LAST.
  • RETAIN

This approach is preferred when:

  • Complex derivations exist
  • Stateful calculations are needed
  • Clinical-trial lineage must be preserved

Advanced SAS Features

PROC FORMAT

proc format;
value revenuefmt low-1000000 = 'Low Revenue'
             1000001-2000000 = 'Medium Revenue'
                2000001-high = 'High Revenue';
run;

LOG:

NOTE: Format REVENUEFMT has been output.

Explanation

Formats improve readability and dashboard usability.

Instead of showing raw numbers, business users see:

  • High Revenue
  • Medium Revenue
  • Low Revenue

This improves executive communication.
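In R, the closest base analogue of this format is cut(); a minimal sketch using the same boundaries:

```r
# Band raw revenue into the same three labels as REVENUEFMT.
# Default right-closed intervals: (-Inf,1e6], (1e6,2e6], (2e6,Inf)
revenue <- c(1200000, 3500000, 100000)
band <- cut(revenue,
            breaks = c(-Inf, 1000000, 2000000, Inf),
            labels = c("Low Revenue", "Medium Revenue", "High Revenue"))
as.character(band)  # "Medium Revenue" "High Revenue" "Low Revenue"
```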

data tourist_clean2;
set tourist_clean;
length Revenue_FMT $20;
Revenue_FMT = put(Revenue,revenuefmt.);
run;

proc print data=tourist_clean2;
run;

OUTPUT:

Obs | Tourist_Spot | Country | City | Category | Rating_Text | Revenue_Text | Visit_Date_Text | Remarks | Visitors | Revenue | Rating | Visit_Date | Revenue_FMT
1 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 |  | 380000 | 1200000 | 4.4 | 05MAR2025 | Medium Revenue
2 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 |  | 320000 | 1700000 | 4.3 | 17NOV2025 | Medium Revenue
3 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 |  | 280000 | 1900000 | 4.7 | 09OCT2025 | Medium Revenue
4 | Niagara_Falls | Canada | Toronto | Nature | abc | 1300000 | 11-05-2025 |  | 450000 | 1300000 | . | 11MAY2025 | Medium Revenue
5 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 |  | 5000 | 3500000 | 4.9 | 25JUN2025 | High Revenue
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 |  | 500000 | 1500000 | 4.8 | 12MAY2025 | Medium Revenue
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 |  | 250000 | 1800000 | 4.5 | 22SEP2025 | Medium Revenue
8 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 |  | 700000 | 2500000 | 5.0 | 14APR2026 | High Revenue
9 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 |  | 410000 | 1600000 | 4.9 | 14APR2026 | Medium Revenue
10 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 |  | 310000 | 1450000 | 4.6 | 15JUN2025 | Medium Revenue
11 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 |  | 300000 | 2100000 | 4.6 | 10JUL2025 | High Revenue
12 | Burj_Khalifa | Uae | Dubai | Modern | 4.7 | NULL | 18-04-2025 |  | 600000 | . | 4.7 | 18APR2025 | .
13 | Statue_of_Liberty | Usa | Newyork | Historical | 4.7 | -100000 | 15-08-2025 |  | 400000 | 100000 | 4.7 | 15AUG2025 | Low Revenue
14 | Grand_Canyon | Usa | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 |  | 390000 | 2000000 | 4.8 | 07AUG2025 | Medium Revenue

PROC REPORT Dashboard

proc report data=tourist_clean2 nowd;
column Country Tourist_Spot Revenue Revenue_FMT Visitors Rating;
define Country / group;
define Tourist_Spot / display;
define Revenue / analysis sum format=dollar15.;
define Revenue_FMT / group;
define Visitors / analysis sum;
define Rating / analysis mean format=4.2;
compute Revenue;
   if Revenue.sum > 2000000 then
      call define(_col_,'style','style={background=lightgreen}');
endcomp;
run;

OUTPUT:

Country | Tourist_Spot | Revenue | Revenue_FMT | Visitors | Rating
Australia | Sydney_Opera | $1,700,000 | Medium Revenue | 320000 | 4.30
Canada | Banff_Park | $1,900,000 | Medium Revenue | 280000 | 4.70
 | Niagara_Falls | $1,300,000 |  | 450000 | .
China | Great_Wall | $3,500,000 | High Revenue | 5000 | 4.90
France | Eiffel_Tower | $1,500,000 | Medium Revenue | 500000 | 4.80
Greece | Santorini | $1,800,000 | Medium Revenue | 250000 | 4.50
India | Taj_Mahal | $2,500,000 | High Revenue | 700000 | 5.00
Japan | Mount_Fuji | $1,600,000 | Medium Revenue | 410000 | 4.90
Jordan | Petra | $1,450,000 | Medium Revenue | 310000 | 4.60
Peru | Machu_Picchu | $2,100,000 | High Revenue | 300000 | 4.60
Uae | Burj_Khalifa | . | . | 600000 | 4.70
Usa | Statue_of_Liberty | $100,000 | Low Revenue | 400000 | 4.70
 | Grand_Canyon | $2,000,000 | Medium Revenue | 390000 | 4.80

Explanation

PROC REPORT creates audit-ready professional outputs.

Used heavily in:

  • Clinical TLF generation
  • Financial reporting
  • Regulatory submissions

Conditional formatting improves executive visibility.

SAS Macro for Reusable Validation

%macro validate(ds,var);
proc freq data=&ds;
tables &var / missing;
run;
%mend;

%validate(tourist_clean2,Country);

OUTPUT:

The FREQ Procedure

Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent
 | 1 | 7.14 | 1 | 7.14
Australia | 1 | 7.14 | 2 | 14.29
Canada | 2 | 14.29 | 4 | 28.57
China | 1 | 7.14 | 5 | 35.71
France | 1 | 7.14 | 6 | 42.86
Greece | 1 | 7.14 | 7 | 50.00
India | 1 | 7.14 | 8 | 57.14
Japan | 1 | 7.14 | 9 | 64.29
Jordan | 1 | 7.14 | 10 | 71.43
Peru | 1 | 7.14 | 11 | 78.57
Uae | 1 | 7.14 | 12 | 85.71
Usa | 2 | 14.29 | 14 | 100.00

Explanation

Macros reduce repetitive programming.

Enterprise benefits:

  • Standardization
  • Faster validation
  • Reduced human error
  • Easier maintenance

In pharmaceutical programming, reusable macros are essential.
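The R analogue of this macro is an ordinary function; a minimal sketch (the function name `validate` mirrors the macro and is illustrative):

```r
# Frequency table including missing values,
# like PROC FREQ with TABLES var / MISSING
validate <- function(ds, var) {
  table(ds[[var]], useNA = "ifany")
}

df <- data.frame(Country = c("France", "France", NA))
validate(df, "Country")  # France: 2, <NA>: 1
```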

R Refinement Layer

R Dataset Cleaning

tourist_clean <- tourist_raw %>%
  clean_names() %>%
  mutate(
    city         = str_trim(str_to_title(city)),
    # empty strings must become NA first, otherwise coalesce()
    # cannot impute the missing country
    country      = na_if(str_trim(country), ""),
    country      = coalesce(country, "Unknown"),
    revenue_text = if_else(revenue_text == "NULL", NA_character_, revenue_text),
    revenue      = abs(as.numeric(revenue_text)),
    visitors     = abs(visitors),
    rating_text  = if_else(grepl("[A-Za-z]", rating_text), NA_character_, rating_text),
    rating       = as.numeric(rating_text),
    rating       = if_else(rating > 5, 5, rating),
    visit_date   = suppressWarnings(parse_date_time(visit_date_text, orders = "dmy")),
    remarks      = str_replace_all(remarks, "[@#$%^&*]", "")
  ) %>%
  distinct()

OUTPUT:

  | tourist_spot | country | city | category | rating_text | revenue_text | visit_date_text | visitors | remarks | revenue | rating | visit_date
1 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Top attraction | 1500000 | 4.8 | 12-05-2025
2 | Taj_Mahal | India | Agra | Historical | 6.5 | 2500000 | 31-14-2025 | 700000 | Invalid rating | 2500000 | 5 | NA
3 | Statue_of_Liberty | USA | Newyork | Historical | 4.7 | -100000 | 15-08-2025 | 400000 | Negative revenue | 100000 | 4.7 | 15-08-2025
4 | Great_Wall | China | Beijing | Historical | 4.9 | 3500000 | 25-06-2025 | 5000 | Negative visitors | 3500000 | 4.9 | 25-06-2025
5 | Machu_Picchu | Peru | Cusco | Historical | 4.6 | 2100000 | 10-07-2025 | 300000 | Good | 2100000 | 4.6 | 10-07-2025
6 | Eiffel_Tower | France | Paris | Historical | 4.8 | 1500000 | 12-05-2025 | 500000 | Duplicate | 1500000 | 4.8 | 12-05-2025
7 | Santorini | Greece | Santorini | Beach | 4.5 | 1800000 | 22-09-2025 | 250000 | lowercase city | 1800000 | 4.5 | 22-09-2025
8 | Burj_Khalifa | UAE | Dubai | Modern | 4.7 | NA | 18-04-2025 | 600000 | Missing revenue | NA | 4.7 | 18-04-2025
9 | Niagara_Falls | Canada | Toronto | Nature | NA | 1300000 | 11-05-2025 | 450000 | Invalid rating text | 1300000 | NA | 11-05-2025
10 | Colosseum |  | Rome | Historical | 4.4 | 1200000 | 05-03-2025 | 380000 | Missing country | 1200000 | 4.4 | 05-03-2025
11 | Sydney_Opera | Australia | Sydney | Modern | 4.3 | 1700000 | 17-11-2025 | 320000 | Good | 1700000 | 4.3 | 17-11-2025
12 | Mount_Fuji | Japan | Tokyo | Nature | 4.9 | 1600000 | 29-02-2025 | 410000 | Invalid date | 1600000 | 4.9 | NA
13 | Grand_Canyon | USA | Arizona | Nature | 4.8 | 2000000 | 07-08-2025 | 390000 | Good | 2000000 | 4.8 | 07-08-2025
14 | Banff_Park | Canada | Alberta | Nature | 4.7 | 1900000 | 09-10-2025 | 280000 | Good | 1900000 | 4.7 | 09-10-2025
15 | Petra | Jordan | Amman | Historical | 4.6 | 1450000 | 15-06-2025 | 310000 | Good | 1450000 | 4.6 | 15-06-2025
Explanation

R’s tidyverse pipeline is highly expressive.

Equivalent SAS vs R comparisons:

SAS | R
PROPCASE | str_to_title
COMPRESS | str_replace_all
MISSING | is.na
IF-THEN | if_else
PROC SORT NODUPKEY | distinct
INPUT | as.numeric

R excels in:

  • Interactive exploration
  • Visualization
  • Machine learning integration

SAS dominates in:

  • Regulatory compliance
  • Metadata governance
  • Audit traceability

Business Logic Behind Data Cleaning

Data cleaning exists because raw data reflects human behavior, system limitations, and operational chaos. In healthcare, tourism, banking, and AI systems, decisions are only as accurate as the underlying data.

Suppose a patient age is recorded as 250. Without validation, clinical-trial analysis may classify impossible demographics, impacting safety conclusions. Similarly, if tourism revenue becomes negative due to ingestion errors, executive dashboards may falsely indicate economic collapse.

Missing values are replaced because analytical models cannot reliably interpret blanks. For example:

  • Missing tourist country → replaced using business rules
  • Missing patient treatment dates → imputed for continuity
  • Missing salary data → normalized using averages or medians

Date correction is equally critical. Invalid dates break forecasting pipelines, machine-learning models, and regulatory timelines.
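A small sketch of how lubridate flags the impossible dates from this dataset:

```r
library(lubridate)

dates  <- c("12-05-2025", "29-02-2025", "31-14-2025")
parsed <- suppressWarnings(dmy(dates))
is.na(parsed)  # FALSE TRUE TRUE -- 2025 has no 29 Feb and no month 14
```

Rows flagged as NA can then be routed to an exception report instead of silently entering forecasts.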

Standardization also matters:

  • "usa", "USA", "Usa" should become one standardized value
  • Duplicate tourist locations inflate counts
  • Malformed text disrupts joins and reporting
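A minimal DATA step sketch of this standardization, assuming a dataset named tourist_raw with country and spot variables (names are illustrative):

```sas
data tourist_std;
    set tourist_raw;
    length country_std $30;
    /* "usa", "USA", "Usa" collapse to one standardized value */
    country_std = upcase(strip(country));
    /* Strip special characters; keep letters, digits, and spaces */
    spot = compress(spot, ' ', 'kad');
run;

/* Drop duplicate locations on the business key */
proc sort data=tourist_std nodupkey;
    by spot country_std;
run;
```

Standardizing casing before the NODUPKEY sort matters: "USA" and "usa" are distinct keys until they are normalized.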

In SDTM and ADaM environments, every transformation must be reproducible and traceable. That is why enterprise SAS programming emphasizes:

  • validation,
  • audit trails,
  • controlled terminology,
  • metadata consistency,
  • and deterministic transformations.

Clean data is not convenience.
It is operational integrity.

20 Additional Data Cleaning Best Practices

  1. Always validate source-system lineage before ingestion.
  2. Maintain audit trails for every transformation.
  3. Never overwrite raw datasets directly.
  4. Use controlled terminology in SDTM domains.
  5. Validate all date variables against protocol timelines.
  6. Standardize casing before joins.
  7. Remove hidden special characters.
  8. Track duplicate subject IDs carefully.
  9. Separate business rules from transformation logic.
  10. Use macros for reusable validations.
  11. Validate negative numeric values.
  12. Document derivation logic clearly.
  13. Perform frequency checks before analysis.
  14. Use PROC CONTENTS to verify metadata.
  15. Validate missingness patterns statistically.
  16. Ensure reproducibility across environments.
  17. Compare PROC SQL and DATA step outputs.
  18. Preserve raw ingestion copies for audits.
  19. Build exception reports for invalid records.
  20. Implement peer-review validation before production release.
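Practice 10 (reusable validation macros) can be sketched as follows; the macro name, parameters, and dataset names are hypothetical:

```sas
/* Reusable range-check macro: writes out-of-range records to an exception dataset */
%macro range_check(ds=, var=, low=, high=);
    data &ds._invalid_&var;
        set &ds;
        if not missing(&var) and (&var < &low or &var > &high) then output;
    run;

    proc print data=&ds._invalid_&var;
        title "Invalid &var values in &ds (outside &low-&high)";
    run;
%mend range_check;

/* Example: flag ratings outside the valid 0-5 scale */
%range_check(ds=tourist_clean, var=rating, low=0, high=5);
```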

20 Key Points — Sharp Enterprise Insights

  • Dirty data leads to wrong conclusions.
  • Standardization ensures reproducibility.
  • Duplicate records inflate business KPIs.
  • Invalid dates destroy forecasting accuracy.
  • Missing values impact machine learning reliability.
  • SAS excels in regulatory governance.
  • R excels in analytical flexibility.
  • LENGTH placement prevents truncation disasters.
  • PROC SQL simplifies aggregation logic.
  • DATA step enables row-level control.
  • Audit trails are mandatory in healthcare analytics.
  • Controlled terminology improves consistency.
  • PROC REPORT creates executive-ready outputs.
  • Macros reduce repetitive validation work.
  • Data lineage improves compliance transparency.
  • Enterprise ETL requires defensive programming.
  • Text normalization improves join accuracy.
  • Validation protects AI model integrity.
  • Clean data drives trustworthy dashboards.
  • Reliable analytics starts with disciplined engineering.

Summary

This case study demonstrated how messy tourism data can be transformed into enterprise-grade analytical intelligence using both SAS and R. We intentionally created corrupted datasets containing duplicates, missing values, malformed text, negative metrics, inconsistent casing, and invalid dates to simulate real-world enterprise ingestion problems.

Using SAS DATA step programming, we implemented defensive validation logic with functions like:

  • PROPCASE
  • COMPRESS
  • INPUT
  • INTNX
  • ABS
  • MISSING
  • CATX
  • SCAN

We also explored:

  • PROC SQL aggregations
  • PROC REPORT dashboards
  • PROC FORMAT categorization
  • PROC SORT NODUPKEY deduplication
  • reusable SAS macros

The tutorial highlighted why SAS remains dominant in regulated industries such as clinical trials, banking, and compliance-focused analytics due to:

  • auditability,
  • metadata governance,
  • reproducibility,
  • and validation traceability.

On the R side, tidyverse functions such as:

  • mutate()
  • case_when()
  • distinct()
  • replace_na()
  • parse_date_time()
  • str_replace_all()

provided a modern and highly expressive refinement layer suitable for exploratory analytics and machine-learning workflows.

The biggest lesson is simple:

Raw data is never trustworthy by default.

Whether you are building:

  • SDTM datasets,
  • tourism dashboards,
  • fraud-detection engines,
  • or AI recommendation systems,

your analytical quality depends entirely on structured, validated, and reproducible cleaning frameworks.

Clean data creates trustworthy intelligence.
Trustworthy intelligence drives confident business decisions.

Conclusion

Modern analytics is no longer about simply generating reports. It is about engineering trust.

Organizations today depend on data for:

  • AI automation,
  • regulatory compliance,
  • predictive forecasting,
  • financial planning,
  • healthcare decisions,
  • and operational intelligence.

But raw enterprise data is inherently chaotic.

Tourism systems contain duplicate bookings, malformed locations, inconsistent currency formats, and invalid visitor counts. Clinical-trial systems contain missing treatment dates, duplicated adverse events, inconsistent medical coding, and protocol deviations. Banking systems face corrupted transactions, fraud anomalies, and incomplete customer records.

Without structured cleaning frameworks, these issues silently poison analytical outputs.

This project demonstrated how SAS and R complement each other in enterprise environments.

SAS provides:

  • industrial-grade governance,
  • validation traceability,
  • audit readiness,
  • metadata control,
  • and production reliability.

R provides:

  • agile data exploration,
  • elegant transformation pipelines,
  • machine-learning compatibility,
  • and rapid experimentation.

The real power emerges when both ecosystems work together.

A mature enterprise workflow often follows this pattern:

  1. SAS performs ingestion, compliance validation, SDTM-standard transformation, and production reporting.
  2. R performs exploratory analytics, visualization, statistical modeling, and advanced AI workflows.

The future of analytics belongs to professionals who understand:

  • data engineering,
  • validation logic,
  • business rules,
  • compliance requirements,
  • and scalable transformation frameworks.

Because ultimately, dashboards are only as trustworthy as the pipelines behind them.

Clean data is not merely technical hygiene.
It is the foundation of regulatory credibility, executive confidence, AI reliability, and enterprise intelligence.

Interview Questions and Answers

1. Why is LENGTH placement important in SAS?

Answer:
SAS assigns variable attributes during compilation. If LENGTH is placed after INPUT, character variables may truncate permanently. This can corrupt joins, grouping logic, and reporting outputs.
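A minimal sketch of the difference (variable names are illustrative):

```sas
/* Risky: list input defaults character variables to length $8,
   so "Grand_Canyon" truncates before LENGTH is ever seen */
data bad;
    input spot $ country $;
    length spot $40;   /* too late: attribute already fixed at $8 */
    datalines;
Grand_Canyon USA
;

/* Safe: declare lengths before the variables first appear */
data good;
    length spot $40 country $20;
    input spot $ country $;
    datalines;
Grand_Canyon USA
;
```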

2. When would you prefer DATA Step over PROC SQL?

Answer:
Use DATA Step when row-by-row processing, FIRST./LAST. logic, RETAIN statements, or complex derivations are required. PROC SQL is better for aggregation and joins.
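An illustrative FIRST.-variable pattern of the kind PROC SQL cannot express directly (dataset and variable names assumed):

```sas
proc sort data=visits;
    by usubjid visit_date;
run;

data first_visits;
    set visits;
    by usubjid;
    if first.usubjid;   /* keep only the earliest visit per subject */
run;
```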

3. How would you detect duplicate clinical records?

Answer:

proc sort data=ae nodupkey;
    by usubjid aestdtc aeterm;
run;

This removes duplicate adverse-event records using business keys.

4. How does R handle missing values differently from SAS?

Answer:
R uses NA, while SAS uses numeric . and blank character values. R functions like replace_na() explicitly manage missingness, whereas SAS uses MISSING() and COALESCEC().
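A small illustrative sketch of the R side (the column and values are assumed):

```r
library(dplyr)
library(tidyr)

ratings <- tibble(rating = c(4.5, NA, 4.8))

# Impute the missing rating with the column mean of the observed values
ratings %>%
  mutate(rating = replace_na(rating, mean(rating, na.rm = TRUE)))
```

The SAS counterpart would typically use COALESCE for numerics or COALESCEC for character values.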

5. A clinical dataset shows negative patient weight values. How would you handle this?

Answer:

First investigate source-system lineage. If confirmed as data-entry errors, apply business-rule correction:

Weight = abs(Weight);

Then:

  • flag corrected records,
  • document derivation logic,
  • preserve audit traceability,
  • and validate downstream calculations.

This ensures regulatory compliance and analytical reliability.
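The flag-and-correct pattern described above might be sketched like this (dataset and flag names are hypothetical):

```sas
data weights_clean;
    set weights_raw;
    length corr_flag $1;
    corr_flag = 'N';
    if not missing(weight) and weight < 0 then do;
        weight    = abs(weight);   /* business-rule correction */
        corr_flag = 'Y';           /* flag corrected record for audit */
    end;
run;

/* Exception listing preserves audit traceability */
proc print data=weights_clean;
    where corr_flag = 'Y';
    title 'Audit listing of corrected weight values';
run;
```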

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article were created for educational and demonstration purposes only; the tourist data shown here is entirely fictional.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


---> Follow our blog for more SAS-based analytics projects and industry data models.

---> Support us by following our blog.

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3. Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


 

