452.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
From Chaos to Clarity: Cleaning Global Tourist Data with SAS TRANWRD Magic
1. Introduction
Imagine you are working as a SAS programmer at a global tourism analytics company. Your team is tasked with analyzing "Top Attractive Sites Around the World" to support business decisions such as where to invest, promote, or build travel partnerships.
You receive the dataset. At first glance, it looks fine. But within minutes, you realize:
- Country names appear as "usa", "USA", "Usa", "NULL"
- Visitor counts contain negative numbers
- Dates are inconsistent (2023/01/01 vs 01-01-2023)
- Duplicate entries exist
- Some records say "N/A", "null", or are simply blank
Now imagine making business decisions on THIS data.
👉 You might recommend investing in the wrong country
👉 You might miscalculate tourism trends
👉 You might produce completely misleading dashboards
This is why data cleaning is not optional: it is critical.
In industries like clinical trials (SDTM/ADaM) or global analytics, poor data quality can lead to:
- Regulatory rejection
- Wrong conclusions
- Financial loss
This is where SAS and R shine. SAS provides powerful functions like TRANWRD, while R offers flexible data manipulation tools.
2. Raw Data Creation in SAS and R
We begin by intentionally creating a messy raw dataset.
SAS Code (Raw Dataset)
DATA world_sites_raw;
INPUT ID Site_Name:$30. Country $ Visitors Year
Established_Date:$12. Rating;
DATALINES;
1 TajMahal india 5000000 2023 01-01-1632 4.8
2 EiffelTower FRANCE -2000000 2023 1889/03/31 4.7
3 GreatWall china NULL 2022 220BC 4.9
4 StatueOfLiberty usa 4300000 2023 10-28-1886 4.6
5 Colosseum Italy 3800000 2023 80AD 4.7
6 TajMahal india 5000000 2023 01-01-1632 4.8
7 MachuPicchu peru 1500000 2022 NULL 4.9
8 ChristRedeemer brazil 2000000 2023 1931-10-12 4.8
9 NULL NULL NULL 2023 NULL 4.0
10 Petra jordan 900000 2022 312BC 4.7
;
RUN;
PROC PRINT DATA = world_sites_raw;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 2 | 2 | EiffelTower | FRANCE | -2000000 | 2023 | 1889/03/31 | 4.7 |
| 3 | 3 | GreatWall | china | . | 2022 | 220BC | 4.9 |
| 4 | 4 | StatueOfLiberty | usa | 4300000 | 2023 | 10-28-1886 | 4.6 |
| 5 | 5 | Colosseum | Italy | 3800000 | 2023 | 80AD | 4.7 |
| 6 | 6 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 7 | 7 | MachuPicchu | peru | 1500000 | 2022 | NULL | 4.9 |
| 8 | 8 | ChristRedeemer | brazil | 2000000 | 2023 | 1931-10-12 | 4.8 |
| 9 | 9 | NULL | NULL | . | 2023 | NULL | 4.0 |
| 10 | 10 | Petra | jordan | 900000 | 2022 | 312BC | 4.7 |
Explanation
This dataset simulates real-world inconsistencies:
- Visitors has invalid values (negative, NULL)
- Country is inconsistent (case issues, NULL)
- Duplicate records (Taj Mahal repeated)
- Date formats vary widely
- Missing fields exist
Key point: In real projects, raw datasets are rarely clean. They come from multiple sources (APIs, Excel, databases), causing inconsistencies. As a SAS programmer, your responsibility is to standardize, validate, and prepare data for downstream analysis (ADaM/reporting).
R Code (Equivalent Dataset)
world_sites_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Site_Name = c("TajMahal","EiffelTower","GreatWall","StatueOfLiberty",
"Colosseum","TajMahal","MachuPicchu","ChristRedeemer",
"NULL","Petra"),
Country = c("india","FRANCE","china","usa","Italy","india","peru",
"brazil","NULL","jordan"),
Visitors = c(5000000,-2000000,NA,4300000,3800000,5000000,1500000,
2000000,NA,900000),
Year = c(2023,2023,2022,2023,2023,2023,2022,2023,2023,2022),
Established_Date = c("01-01-1632","1889/03/31","220BC","10-28-1886",
"80AD","01-01-1632",NA,"1931-10-12",NA,"312BC"),
Rating = c(4.8,4.7,4.9,4.6,4.7,4.8,4.9,4.8,4.0,4.7)
)
OUTPUT:
| | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 2 | 2 | EiffelTower | FRANCE | -2000000 | 2023 | 1889/03/31 | 4.7 |
| 3 | 3 | GreatWall | china | NA | 2022 | 220BC | 4.9 |
| 4 | 4 | StatueOfLiberty | usa | 4300000 | 2023 | 10-28-1886 | 4.6 |
| 5 | 5 | Colosseum | Italy | 3800000 | 2023 | 80AD | 4.7 |
| 6 | 6 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 7 | 7 | MachuPicchu | peru | 1500000 | 2022 | NA | 4.9 |
| 8 | 8 | ChristRedeemer | brazil | 2000000 | 2023 | 1931-10-12 | 4.8 |
| 9 | 9 | NULL | NULL | NA | 2023 | NA | 4.0 |
| 10 | 10 | Petra | jordan | 900000 | 2022 | 312BC | 4.7 |
Explanation
The R dataset mirrors the SAS data issues:
- NA values represent missing data
- Character inconsistencies affect grouping and analysis
- Duplicate entries distort aggregation
- Dates are inconsistent strings
In R, data cleaning relies on packages like dplyr and functions like mutate(), filter(), and distinct(). The key takeaway is that both SAS and R require structured pipelines to ensure data integrity.
3. Phase 1: Data Cleaning in SAS
PROC SORT DATA=world_sites_raw NODUPKEY OUT=world_nodup;
BY ID Site_Name;
RUN;
PROC PRINT DATA = world_nodup;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 2 | 2 | EiffelTower | FRANCE | -2000000 | 2023 | 1889/03/31 | 4.7 |
| 3 | 3 | GreatWall | china | . | 2022 | 220BC | 4.9 |
| 4 | 4 | StatueOfLiberty | usa | 4300000 | 2023 | 10-28-1886 | 4.6 |
| 5 | 5 | Colosseum | Italy | 3800000 | 2023 | 80AD | 4.7 |
| 6 | 6 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 7 | 7 | MachuPicchu | peru | 1500000 | 2022 | NULL | 4.9 |
| 8 | 8 | ChristRedeemer | brazil | 2000000 | 2023 | 1931-10-12 | 4.8 |
| 9 | 9 | NULL | NULL | . | 2023 | NULL | 4.0 |
| 10 | 10 | Petra | jordan | 900000 | 2022 | 312BC | 4.7 |
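A note on key choice: NODUPKEY treats two rows as duplicates only when ALL BY variables match, so including the unique ID in the BY list keeps both TajMahal records, as the output above shows. The same effect can be sketched in a language-agnostic way; this Python snippet uses hypothetical records, not the actual dataset:

```python
# Hypothetical records mirroring the tourist dataset (illustrative only).
rows = [
    {"ID": 1, "Site_Name": "TajMahal", "Country": "india"},
    {"ID": 6, "Site_Name": "TajMahal", "Country": "india"},
    {"ID": 10, "Site_Name": "Petra", "Country": "jordan"},
]

def dedupe(records, key_fields):
    """Keep the first record for each distinct key tuple."""
    seen, out = set(), []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

# Keying on (ID, Site_Name) keeps both TajMahal rows, since their IDs differ:
assert len(dedupe(rows, ["ID", "Site_Name"])) == 3
# Keying on content alone collapses them into one:
assert len(dedupe(rows, ["Site_Name", "Country"])) == 2
```

The lesson carries over directly: to deduplicate on content in SAS, drop ID from the BY list.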
DATA world_clean;
SET world_nodup;
/* Standardize Country */
Country = UPCASE(COALESCEC(Country, "UNKNOWN"));
Country = TRANWRD(Country, "NULL", "UNKNOWN");
/* Fix Visitors */
IF Visitors = . THEN Visitors = 0;
ELSE IF Visitors < 0 THEN Visitors = ABS(Visitors);
/* Standardize Site_Name */
Site_Name = TRANWRD(Site_Name, "NULL", "UNKNOWN");
IF STRIP(Site_Name) = "" THEN Site_Name = "UNKNOWN";
/* Fix Dates */
Formatted_Date = INPUT(Established_Date, ANYDTDTE.);
FORMAT Formatted_Date DATE9.;
RUN;
PROC PRINT DATA = world_clean;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating | Formatted_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . |
| 2 | 2 | EiffelTower | FRANCE | 2000000 | 2023 | 1889/03/31 | 4.7 | 03MAR1889 |
| 3 | 3 | GreatWall | CHINA | 0 | 2022 | 220BC | 4.9 | . |
| 4 | 4 | StatueOfLiberty | USA | 4300000 | 2023 | 10-28-1886 | 4.6 | . |
| 5 | 5 | Colosseum | ITALY | 3800000 | 2023 | 80AD | 4.7 | . |
| 6 | 6 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . |
| 7 | 7 | MachuPicchu | PERU | 1500000 | 2022 | NULL | 4.9 | . |
| 8 | 8 | ChristRedeemer | BRAZIL | 2000000 | 2023 | 1931-10-12 | 4.8 | 01OCT1931 |
| 9 | 9 | UNKNOWN | UNKNOWN | 0 | 2023 | NULL | 4.0 | . |
| 10 | 10 | Petra | JORDAN | 900000 | 2022 | 312BC | 4.7 | . |
Explanation
This step demonstrates core SAS cleaning techniques:
- PROC SORT NODUPKEY removes rows that share all BY variable values (here, since ID is unique, both TajMahal rows survive)
- COALESCEC() replaces missing character values
- ABS() corrects negative numbers
- TRANWRD() replaces unwanted text like "NULL" → "UNKNOWN"
- INPUT() with the ANYDTDTE. informat standardizes dates; note that its default width of 9 truncates longer strings, which is why 01-01-1632 and 10-28-1886 come out missing above and 1889/03/31 misreads as 03MAR1889 — specify ANYDTDTE12. to read the full value
👉 TRANWRD is powerful because it replaces substrings globally, making it ideal for:
- Fixing inconsistent labels
- Cleaning textual data
- Standardizing coding values
Key point: SAS ensures traceability and reproducibility, which is critical in regulated environments like clinical trials.
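TRANWRD's defining behavior is that it replaces every occurrence of the target substring, not just the first. A rough language-agnostic analogue in Python (a sketch that ignores TRANWRD's trailing-blank handling) makes both the power and the main caveat visible:

```python
def tranwrd(source, target, replacement):
    """Rough Python analogue of SAS TRANWRD: replace every
    occurrence of a substring. (SAS also trims trailing blanks
    from the target; this sketch ignores that detail.)"""
    return source.replace(target, replacement)

# Every occurrence is replaced, not just the first:
assert tranwrd("NULL and NULL", "NULL", "N/A") == "N/A and N/A"
# Caveat: matching is by substring, so it can hit inside longer words:
assert tranwrd("ANNULLED", "NULL", "X") == "ANXED"
```

The caveat is why it pays to combine TRANWRD with exact-match checks (like the STRIP comparison above) when a value should equal "NULL" entirely rather than merely contain it.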
4. Phase 2: Data Cleaning in R
library(dplyr)
library(lubridate)
world_clean <- world_sites_raw %>%
distinct(ID, Site_Name, .keep_all = TRUE) %>%
mutate(
Country = toupper(coalesce(Country, "UNKNOWN")),
Country = ifelse(Country == "NULL", "UNKNOWN", Country),
Visitors = case_when(is.na(Visitors) ~ 0,
                     Visitors < 0 ~ abs(Visitors),
                     TRUE ~ Visitors),
Site_Name = ifelse(is.na(Site_Name), "UNKNOWN", Site_Name),
Site_Name = ifelse(trimws(Site_Name) == "", "UNKNOWN", Site_Name),
Site_Name = ifelse(Site_Name == "NULL", "UNKNOWN", Site_Name),
Formatted_Date = parse_date_time(Established_Date,
orders = c("dmy", "ymd", "mdy"),
quiet = TRUE)
)
OUTPUT:
| | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating | Formatted_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | 1632-01-01 |
| 2 | 2 | EiffelTower | FRANCE | 2000000 | 2023 | 1889/03/31 | 4.7 | 1889-03-31 |
| 3 | 3 | GreatWall | CHINA | 0 | 2022 | 220BC | 4.9 | NA |
| 4 | 4 | StatueOfLiberty | USA | 4300000 | 2023 | 10-28-1886 | 4.6 | 1886-10-28 |
| 5 | 5 | Colosseum | ITALY | 3800000 | 2023 | 80AD | 4.7 | NA |
| 6 | 6 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | 1632-01-01 |
| 7 | 7 | MachuPicchu | PERU | 1500000 | 2022 | NA | 4.9 | NA |
| 8 | 8 | ChristRedeemer | BRAZIL | 2000000 | 2023 | 1931-10-12 | 4.8 | 1931-10-12 |
| 9 | 9 | UNKNOWN | UNKNOWN | 0 | 2023 | NA | 4.0 | NA |
| 10 | 10 | Petra | JORDAN | 900000 | 2022 | 312BC | 4.7 | NA |
Explanation
In R:
- distinct() removes duplicates
- mutate() transforms variables
- ifelse() handles missing values
- toupper() standardizes text
R provides flexibility but requires careful handling of formats. Unlike SAS, R does not enforce strict data typing, so developers must ensure consistency manually.
Key point: R is excellent for exploratory analysis, while SAS is preferred for validated pipelines.
Line-by-Line Mapping (SAS ↔ R)
| SAS Logic | R Equivalent |
|---|---|
| PROC SORT NODUPKEY; BY ID Site_Name; | distinct(ID, Site_Name, .keep_all = TRUE) |
| COALESCEC(Country, "UNKNOWN") | coalesce(Country, "UNKNOWN") |
| UPCASE(Country) | toupper(Country) |
| TRANWRD(Country, "NULL", "UNKNOWN") | ifelse(Country == "NULL", "UNKNOWN", Country) |
| IF Visitors = . THEN Visitors = 0; | case_when(is.na(Visitors) ~ 0, ...) |
| ABS(Visitors) | abs(Visitors) |
| INPUT(Established_Date, ANYDTDTE.) | parse_date_time(Established_Date, orders = c("dmy","ymd","mdy")) |
Key Takeaways
- %>% builds readable pipelines
- distinct() = duplicate removal
- mutate() = transformation engine
- Always handle: NA values, text inconsistencies, mixed formats
- Date parsing is one of the biggest real-world challenges
- Never assume raw data follows a single format
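The "try several formats in order" strategy behind both lubridate's parse_date_time(orders = ...) and the SAS ANYDTDTE. informat can be sketched in a few lines. This Python illustration is an assumption-laden analogue, not either library's actual algorithm; the candidate format list is chosen to match the sample data:

```python
from datetime import datetime

def parse_any_date(text, formats=("%m-%d-%Y", "%Y/%m/%d", "%Y-%m-%d")):
    """Try each candidate format in order; return None when
    nothing matches (the equivalent of a missing date)."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

assert parse_any_date("10-28-1886").year == 1886   # MDY format
assert parse_any_date("1889/03/31").day == 31      # YMD with slashes
assert parse_any_date("220BC") is None             # unparseable stays missing
```

The order of candidate formats matters for ambiguous strings like 01-02-2023, which is exactly why lubridate exposes the orders argument and SAS exposes the DATESTYLE option.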
5. Phase 3: Additional SAS Processing
DATA world_enhanced;
/* Length control: LENGTH must precede SET to take effect */
LENGTH Country $20;
SET world_clean;
/* Create category */
IF Visitors > 4000000 THEN Category = "High Traffic";
ELSE Category = "Moderate";
/* Additional cleaning */
Country = TRANWRD(Country, "NULL", "UNKNOWN");
RUN;
PROC PRINT DATA = world_enhanced;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating | Formatted_Date | Category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . | High Traffic |
| 2 | 2 | EiffelTower | FRANCE | 2000000 | 2023 | 1889/03/31 | 4.7 | 03MAR1889 | Moderate |
| 3 | 3 | GreatWall | CHINA | 0 | 2022 | 220BC | 4.9 | . | Moderate |
| 4 | 4 | StatueOfLiberty | USA | 4300000 | 2023 | 10-28-1886 | 4.6 | . | High Traffic |
| 5 | 5 | Colosseum | ITALY | 3800000 | 2023 | 80AD | 4.7 | . | Moderate |
| 6 | 6 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . | High Traffic |
| 7 | 7 | MachuPicchu | PERU | 1500000 | 2022 | NULL | 4.9 | . | Moderate |
| 8 | 8 | ChristRedeemer | BRAZIL | 2000000 | 2023 | 1931-10-12 | 4.8 | 01OCT1931 | Moderate |
| 9 | 9 | UNKNOWN | UNKNOWN | 0 | 2023 | NULL | 4.0 | . | Moderate |
| 10 | 10 | Petra | JORDAN | 900000 | 2022 | 312BC | 4.7 | . | Moderate |
Explanation
This phase enhances dataset usability:
- Categorization helps in reporting
- LENGTH prevents truncation
- Additional TRANWRD ensures full standardization
Key point: Cleaning is not just fixing errors; it's about making data analysis-ready.
6. 20 Additional Data Cleaning Best Practices
- Always validate against source data
- Maintain audit trails
- Use controlled terminology (CDISC)
- Handle missing values explicitly
- Avoid hardcoding values
- Use macros for scalability
- Validate duplicates carefully
- Standardize date formats
- Apply range checks
- Use logs for debugging
- Document transformations
- Validate domain consistency
- Cross-check derived variables
- Ensure reproducibility
- Follow SDTM standards
- Maintain metadata
- Perform QC checks
- Use version control
- Avoid overwriting raw data
- Validate before reporting
7. Business Logic Behind Data Cleaning
Why do we clean data?
- Missing values → replaced to avoid bias
- Negative visitors → corrected (invalid in reality)
- Duplicate records → inflate analytics
Example:
- A patient age = -5 → biologically impossible
- Salary = NULL → affects averages
- Date missing → affects timelines
👉 Clean data ensures accurate decision-making
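The business rules above (no negative counts, no silent NULLs) are easiest to enforce as explicit validation checks that report violations rather than silently patching them. A minimal sketch, assuming a dict-per-record layout and illustrative rule names not taken from the article:

```python
def validate(record):
    """Collect rule violations for one record; rules are
    illustrative examples, not a complete specification."""
    issues = []
    if record.get("Visitors") is None:
        issues.append("Visitors missing")
    elif record["Visitors"] < 0:
        issues.append("Visitors negative")
    if not record.get("Country") or record["Country"] == "NULL":
        issues.append("Country missing")
    return issues

assert validate({"Visitors": -2000000, "Country": "FRANCE"}) == ["Visitors negative"]
assert validate({"Visitors": None, "Country": "NULL"}) == ["Visitors missing", "Country missing"]
assert validate({"Visitors": 900000, "Country": "JORDAN"}) == []
```

Reporting violations first, then fixing them under documented rules, is what keeps the audit trail that regulated environments require.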
8. 20 Key Sharp Insights
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Validation prevents regulatory issues
- Missing data must be handled carefully
- Duplicate data inflates metrics
- Text inconsistency breaks grouping
- Dates must be uniform
- Logs are your best friend
- Always QC your dataset
- Never trust raw data blindly
- Use functions like TRANWRD wisely
- Document everything
- Data cleaning is iterative
- Business logic matters
- Automation improves efficiency
- SAS ensures reliability
- R enables flexibility
- Clean data builds trust
- Accuracy drives decisions
- Quality data = quality insights
9. Summary (SAS vs R)
| Feature | SAS | R |
|---|---|---|
| Stability | High | Medium |
| Flexibility | Moderate | High |
| Validation | Strong | Manual |
| Regulatory Use | Preferred | Limited |
| Learning Curve | Moderate | Steep |
10. Conclusion
Data cleaning is not just a technical step; it is a strategic necessity. Whether you are working with tourism data or clinical trials, your insights are only as good as your data quality.
Functions like TRANWRD in SAS act as silent heroes, transforming messy, inconsistent datasets into structured, reliable intelligence.
A strong data cleaning framework ensures:
- Accuracy
- Compliance
- Trust
11. Interview Questions
Q1: How do you handle NULL values in SAS?
👉 Use COALESCEC() for character variables and IF logic (or COALESCE()) for numerics; TRANWRD() replaces literal "NULL" strings.
Q2: What is TRANWRD used for?
👉 Replaces all occurrences of a substring in a
variable.
Q3: How do you remove duplicates?
👉 PROC SORT NODUPKEY
Q4: R vs SAS for cleaning?
👉 R is flexible, SAS is robust and validated.
Q5: Debugging scenario
👉 Check logs → validate input → test transformations
step-by-step.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only; they deliberately simulate "disaster" data quality.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------