452.Data Disasters to Data Intelligence: Mastering TRANWRD in SAS
From Chaos to Clarity: Cleaning Global Tourist Data with SAS TRANWRD Magic
1. Introduction
Imagine you are working as a SAS programmer at a global tourism analytics company. Your team is tasked with analyzing "Top Attractive Sites Around the World" to support business decisions such as where to invest, promote, or build travel partnerships.
You receive the dataset. At first glance, it looks fine. But within minutes, you realize:
- Country names appear as "usa", "USA", "Usa", "NULL"
- Visitor counts contain negative numbers
- Dates are inconsistent (2023/01/01 vs 01-01-2023)
- Duplicate entries exist
- Some records say "N/A", "null", or are simply blank
Now imagine making business decisions on THIS data.
👉 You might recommend investing in the wrong country
👉 You might miscalculate tourism trends
👉 You might produce completely misleading dashboards
This is why data cleaning is not optional: it is critical.
In industries like clinical trials (SDTM/ADaM) or global analytics, poor data quality can lead to:
- Regulatory rejection
- Wrong conclusions
- Financial loss
This is where SAS and R shine. SAS provides powerful functions like TRANWRD, while R offers flexible data manipulation tools.
2. Raw Data Creation in SAS and R
We begin by intentionally creating a messy raw dataset.
SAS Code (Raw Dataset)
DATA world_sites_raw;
INPUT ID Site_Name:$30. Country $ Visitors Year
Established_Date:$12. Rating;
DATALINES;
1 TajMahal india 5000000 2023 01-01-1632 4.8
2 EiffelTower FRANCE -2000000 2023 1889/03/31 4.7
3 GreatWall china NULL 2022 220BC 4.9
4 StatueOfLiberty usa 4300000 2023 10-28-1886 4.6
5 Colosseum Italy 3800000 2023 80AD 4.7
6 TajMahal india 5000000 2023 01-01-1632 4.8
7 MachuPicchu peru 1500000 2022 NULL 4.9
8 ChristRedeemer brazil 2000000 2023 1931-10-12 4.8
9 NULL NULL NULL 2023 NULL 4.0
10 Petra jordan 900000 2022 312BC 4.7
;
RUN;
PROC PRINT DATA = world_sites_raw;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 2 | 2 | EiffelTower | FRANCE | -2000000 | 2023 | 1889/03/31 | 4.7 |
| 3 | 3 | GreatWall | china | . | 2022 | 220BC | 4.9 |
| 4 | 4 | StatueOfLiberty | usa | 4300000 | 2023 | 10-28-1886 | 4.6 |
| 5 | 5 | Colosseum | Italy | 3800000 | 2023 | 80AD | 4.7 |
| 6 | 6 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 7 | 7 | MachuPicchu | peru | 1500000 | 2022 | NULL | 4.9 |
| 8 | 8 | ChristRedeemer | brazil | 2000000 | 2023 | 1931-10-12 | 4.8 |
| 9 | 9 | NULL | NULL | . | 2023 | NULL | 4.0 |
| 10 | 10 | Petra | jordan | 900000 | 2022 | 312BC | 4.7 |
Explanation
This dataset simulates real-world inconsistencies:
- Visitors has invalid values (negative, NULL)
- Country is inconsistent (case issues, NULL)
- Duplicate records (Taj Mahal repeated)
- Date formats vary widely
- Missing fields exist
Key point: In real projects, raw datasets are rarely clean. They come from multiple sources (APIs, Excel, databases), causing inconsistencies. As a SAS programmer, your responsibility is to standardize, validate, and prepare data for downstream analysis (ADaM/reporting).
R Code (Equivalent Dataset)
world_sites_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Site_Name = c("TajMahal","EiffelTower","GreatWall","StatueOfLiberty",
"Colosseum","TajMahal","MachuPicchu","ChristRedeemer",
"NULL","Petra"),
Country = c("india","FRANCE","china","usa","Italy","india","peru",
"brazil","NULL","jordan"),
Visitors = c(5000000,-2000000,NA,4300000,3800000,5000000,1500000,
2000000,NA,900000),
Year = c(2023,2023,2022,2023,2023,2023,2022,2023,2023,2022),
Established_Date = c("01-01-1632","1889/03/31","220BC","10-28-1886",
"80AD","01-01-1632",NA,"1931-10-12",NA,"312BC"),
Rating = c(4.8,4.7,4.9,4.6,4.7,4.8,4.9,4.8,4.0,4.7)
)
OUTPUT:
| | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 2 | 2 | EiffelTower | FRANCE | -2000000 | 2023 | 1889/03/31 | 4.7 |
| 3 | 3 | GreatWall | china | NA | 2022 | 220BC | 4.9 |
| 4 | 4 | StatueOfLiberty | usa | 4300000 | 2023 | 10-28-1886 | 4.6 |
| 5 | 5 | Colosseum | Italy | 3800000 | 2023 | 80AD | 4.7 |
| 6 | 6 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 7 | 7 | MachuPicchu | peru | 1500000 | 2022 | NA | 4.9 |
| 8 | 8 | ChristRedeemer | brazil | 2000000 | 2023 | 1931-10-12 | 4.8 |
| 9 | 9 | NULL | NULL | NA | 2023 | NA | 4.0 |
| 10 | 10 | Petra | jordan | 900000 | 2022 | 312BC | 4.7 |
Explanation
The R dataset mirrors the SAS data issues:
- NA values represent missing data
- Character inconsistencies affect grouping and analysis
- Duplicate entries distort aggregation
- Dates are inconsistent strings
In R, data cleaning relies on packages like dplyr and functions like mutate(), filter(), and distinct(). The key takeaway is that both SAS and R require structured pipelines to ensure data integrity.
3. Phase 1: Data Cleaning in SAS
PROC SORT DATA=world_sites_raw NODUPKEY OUT=world_nodup;
BY ID Site_Name;
RUN;
PROC PRINT DATA = world_nodup;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating |
|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 2 | 2 | EiffelTower | FRANCE | -2000000 | 2023 | 1889/03/31 | 4.7 |
| 3 | 3 | GreatWall | china | . | 2022 | 220BC | 4.9 |
| 4 | 4 | StatueOfLiberty | usa | 4300000 | 2023 | 10-28-1886 | 4.6 |
| 5 | 5 | Colosseum | Italy | 3800000 | 2023 | 80AD | 4.7 |
| 6 | 6 | TajMahal | india | 5000000 | 2023 | 01-01-1632 | 4.8 |
| 7 | 7 | MachuPicchu | peru | 1500000 | 2022 | NULL | 4.9 |
| 8 | 8 | ChristRedeemer | brazil | 2000000 | 2023 | 1931-10-12 | 4.8 |
| 9 | 9 | NULL | NULL | . | 2023 | NULL | 4.0 |
| 10 | 10 | Petra | jordan | 900000 | 2022 | 312BC | 4.7 |
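A note on key choice: NODUPKEY treats two rows as duplicates only when ALL BY variables match, so including the unique ID in the BY list keeps both TajMahal records, as the output above shows. The same effect can be sketched in a language-agnostic way; this Python snippet uses hypothetical records, not the actual dataset:

```python
# Hypothetical records mirroring the tourist dataset (illustrative only).
rows = [
    {"ID": 1, "Site_Name": "TajMahal", "Country": "india"},
    {"ID": 6, "Site_Name": "TajMahal", "Country": "india"},
    {"ID": 10, "Site_Name": "Petra", "Country": "jordan"},
]

def dedupe(records, key_fields):
    """Keep the first record for each distinct key tuple."""
    seen, out = set(), []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

# Keying on (ID, Site_Name) keeps both TajMahal rows, since their IDs differ:
assert len(dedupe(rows, ["ID", "Site_Name"])) == 3
# Keying on content alone collapses them into one:
assert len(dedupe(rows, ["Site_Name", "Country"])) == 2
```

The lesson carries over directly: to deduplicate on content in SAS, drop ID from the BY list.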
DATA world_clean;
SET world_nodup;
/* Standardize Country */
Country = UPCASE(COALESCEC(Country, "UNKNOWN"));
Country = TRANWRD(Country, "NULL", "UNKNOWN");
/* Fix Visitors */
IF Visitors = . THEN Visitors = 0;
ELSE IF Visitors < 0 THEN Visitors = ABS(Visitors);
/* Standardize Site_Name */
Site_Name = TRANWRD(Site_Name, "NULL", "UNKNOWN");
IF STRIP(Site_Name) = "" THEN Site_Name = "UNKNOWN";
/* Fix Dates */
Formatted_Date = INPUT(Established_Date, ANYDTDTE.);
FORMAT Formatted_Date DATE9.;
RUN;
PROC PRINT DATA = world_clean;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating | Formatted_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . |
| 2 | 2 | EiffelTower | FRANCE | 2000000 | 2023 | 1889/03/31 | 4.7 | 03MAR1889 |
| 3 | 3 | GreatWall | CHINA | 0 | 2022 | 220BC | 4.9 | . |
| 4 | 4 | StatueOfLiberty | USA | 4300000 | 2023 | 10-28-1886 | 4.6 | . |
| 5 | 5 | Colosseum | ITALY | 3800000 | 2023 | 80AD | 4.7 | . |
| 6 | 6 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . |
| 7 | 7 | MachuPicchu | PERU | 1500000 | 2022 | NULL | 4.9 | . |
| 8 | 8 | ChristRedeemer | BRAZIL | 2000000 | 2023 | 1931-10-12 | 4.8 | 01OCT1931 |
| 9 | 9 | UNKNOWN | UNKNOWN | 0 | 2023 | NULL | 4.0 | . |
| 10 | 10 | Petra | JORDAN | 900000 | 2022 | 312BC | 4.7 | . |
Explanation
This step demonstrates core SAS cleaning techniques:
- PROC SORT NODUPKEY removes rows that share all BY variable values (here, since ID is unique, both TajMahal rows survive)
- COALESCEC() replaces missing character values
- ABS() corrects negative numbers
- TRANWRD() replaces unwanted text like "NULL" → "UNKNOWN"
- INPUT() with the ANYDTDTE. informat standardizes dates; note that its default width of 9 truncates longer strings, which is why 01-01-1632 and 10-28-1886 come out missing above and 1889/03/31 misreads as 03MAR1889 — specify ANYDTDTE12. to read the full value
👉 TRANWRD is powerful because it replaces substrings globally, making it ideal for:
- Fixing inconsistent labels
- Cleaning textual data
- Standardizing coding values
Key point: SAS ensures traceability and reproducibility, which is critical in regulated environments like clinical trials.
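TRANWRD's defining behavior is that it replaces every occurrence of the target substring, not just the first. A rough language-agnostic analogue in Python (a sketch that ignores TRANWRD's trailing-blank handling) makes both the power and the main caveat visible:

```python
def tranwrd(source, target, replacement):
    """Rough Python analogue of SAS TRANWRD: replace every
    occurrence of a substring. (SAS also trims trailing blanks
    from the target; this sketch ignores that detail.)"""
    return source.replace(target, replacement)

# Every occurrence is replaced, not just the first:
assert tranwrd("NULL and NULL", "NULL", "N/A") == "N/A and N/A"
# Caveat: matching is by substring, so it can hit inside longer words:
assert tranwrd("ANNULLED", "NULL", "X") == "ANXED"
```

The caveat is why it pays to combine TRANWRD with exact-match checks (like the STRIP comparison above) when a value should equal "NULL" entirely rather than merely contain it.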
4. Phase 2: Data Cleaning in R
library(dplyr)
library(lubridate)
world_clean <- world_sites_raw %>%
distinct(ID, Site_Name, .keep_all = TRUE) %>%
mutate(
Country = toupper(coalesce(Country, "UNKNOWN")),
Country = ifelse(Country == "NULL", "UNKNOWN", Country),
Visitors = case_when(is.na(Visitors) ~ 0,
                     Visitors < 0 ~ abs(Visitors),
                     TRUE ~ Visitors),
Site_Name = ifelse(is.na(Site_Name), "UNKNOWN", Site_Name),
Site_Name = ifelse(trimws(Site_Name) == "", "UNKNOWN", Site_Name),
Site_Name = ifelse(Site_Name == "NULL", "UNKNOWN", Site_Name),
Formatted_Date = parse_date_time(Established_Date,
orders = c("dmy", "ymd", "mdy"),
quiet = TRUE)
)
OUTPUT:
| | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating | Formatted_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | 1632-01-01 |
| 2 | 2 | EiffelTower | FRANCE | 2000000 | 2023 | 1889/03/31 | 4.7 | 1889-03-31 |
| 3 | 3 | GreatWall | CHINA | 0 | 2022 | 220BC | 4.9 | NA |
| 4 | 4 | StatueOfLiberty | USA | 4300000 | 2023 | 10-28-1886 | 4.6 | 1886-10-28 |
| 5 | 5 | Colosseum | ITALY | 3800000 | 2023 | 80AD | 4.7 | NA |
| 6 | 6 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | 1632-01-01 |
| 7 | 7 | MachuPicchu | PERU | 1500000 | 2022 | NA | 4.9 | NA |
| 8 | 8 | ChristRedeemer | BRAZIL | 2000000 | 2023 | 1931-10-12 | 4.8 | 1931-10-12 |
| 9 | 9 | UNKNOWN | UNKNOWN | 0 | 2023 | NA | 4.0 | NA |
| 10 | 10 | Petra | JORDAN | 900000 | 2022 | 312BC | 4.7 | NA |
Explanation
In R:
- distinct() removes duplicates
- mutate() transforms variables
- ifelse() handles missing values
- toupper() standardizes text
R provides flexibility but requires careful handling of formats. Unlike SAS, R does not enforce strict data typing, so developers must ensure consistency manually.
Key point: R is excellent for exploratory analysis, while SAS is preferred for validated pipelines.
Line-by-Line Mapping (SAS ↔ R)
| SAS Logic | R Equivalent |
|---|---|
| PROC SORT NODUPKEY; BY ID Site_Name; | distinct(ID, Site_Name, .keep_all = TRUE) |
| COALESCEC(Country, "UNKNOWN") | coalesce(Country, "UNKNOWN") |
| UPCASE(Country) | toupper(Country) |
| TRANWRD(Country, "NULL", "UNKNOWN") | ifelse(Country == "NULL", "UNKNOWN", Country) |
| IF Visitors = . THEN Visitors = 0; | case_when(is.na(Visitors) ~ 0, ...) |
| ABS(Visitors) | abs(Visitors) |
| INPUT(Established_Date, ANYDTDTE.) | parse_date_time(Established_Date, orders = c("dmy","ymd","mdy")) |
Key Takeaways
- %>% builds readable pipelines
- distinct() = duplicate removal
- mutate() = transformation engine
- Always handle: NA values, text inconsistencies, mixed formats
- Date parsing is one of the biggest real-world challenges
- Never assume raw data follows a single format
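The "try several formats in order" strategy behind both lubridate's parse_date_time(orders = ...) and the SAS ANYDTDTE. informat can be sketched in a few lines. This Python illustration is an assumption-laden analogue, not either library's actual algorithm; the candidate format list is chosen to match the sample data:

```python
from datetime import datetime

def parse_any_date(text, formats=("%m-%d-%Y", "%Y/%m/%d", "%Y-%m-%d")):
    """Try each candidate format in order; return None when
    nothing matches (the equivalent of a missing date)."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

assert parse_any_date("10-28-1886").year == 1886   # MDY format
assert parse_any_date("1889/03/31").day == 31      # YMD with slashes
assert parse_any_date("220BC") is None             # unparseable stays missing
```

The order of candidate formats matters for ambiguous strings like 01-02-2023, which is exactly why lubridate exposes the orders argument and SAS exposes the DATESTYLE option.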
5. Phase 3: Additional SAS Processing
DATA world_enhanced;
/* Length control: LENGTH must precede SET to take effect */
LENGTH Country $20;
SET world_clean;
/* Create category */
IF Visitors > 4000000 THEN Category = "High Traffic";
ELSE Category = "Moderate";
/* Additional cleaning */
Country = TRANWRD(Country, "NULL", "UNKNOWN");
RUN;
PROC PRINT DATA = world_enhanced;
RUN;
OUTPUT:
| Obs | ID | Site_Name | Country | Visitors | Year | Established_Date | Rating | Formatted_Date | Category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . | High Traffic |
| 2 | 2 | EiffelTower | FRANCE | 2000000 | 2023 | 1889/03/31 | 4.7 | 03MAR1889 | Moderate |
| 3 | 3 | GreatWall | CHINA | 0 | 2022 | 220BC | 4.9 | . | Moderate |
| 4 | 4 | StatueOfLiberty | USA | 4300000 | 2023 | 10-28-1886 | 4.6 | . | High Traffic |
| 5 | 5 | Colosseum | ITALY | 3800000 | 2023 | 80AD | 4.7 | . | Moderate |
| 6 | 6 | TajMahal | INDIA | 5000000 | 2023 | 01-01-1632 | 4.8 | . | High Traffic |
| 7 | 7 | MachuPicchu | PERU | 1500000 | 2022 | NULL | 4.9 | . | Moderate |
| 8 | 8 | ChristRedeemer | BRAZIL | 2000000 | 2023 | 1931-10-12 | 4.8 | 01OCT1931 | Moderate |
| 9 | 9 | UNKNOWN | UNKNOWN | 0 | 2023 | NULL | 4.0 | . | Moderate |
| 10 | 10 | Petra | JORDAN | 900000 | 2022 | 312BC | 4.7 | . | Moderate |
Explanation
This phase enhances dataset usability:
- Categorization helps in reporting
- LENGTH prevents truncation
- Additional TRANWRD ensures full standardization
Key point: Cleaning is not just fixing errors; it's about making data analysis-ready.
6. 20 Additional Data Cleaning Best Practices
- Always validate against source data
- Maintain audit trails
- Use controlled terminology (CDISC)
- Handle missing values explicitly
- Avoid hardcoding values
- Use macros for scalability
- Validate duplicates carefully
- Standardize date formats
- Apply range checks
- Use logs for debugging
- Document transformations
- Validate domain consistency
- Cross-check derived variables
- Ensure reproducibility
- Follow SDTM standards
- Maintain metadata
- Perform QC checks
- Use version control
- Avoid overwriting raw data
- Validate before reporting
7. Business Logic Behind Data Cleaning
Why do we clean data?
- Missing values → replaced to avoid bias
- Negative visitors → corrected (invalid in reality)
- Duplicate records → inflate analytics
Example:
- A patient age = -5 → biologically impossible
- Salary = NULL → affects averages
- Date missing → affects timelines
👉 Clean data ensures accurate decision-making
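The business rules above (no negative counts, no silent NULLs) are easiest to enforce as explicit validation checks that report violations rather than silently patching them. A minimal sketch, assuming a dict-per-record layout and illustrative rule names not taken from the article:

```python
def validate(record):
    """Collect rule violations for one record; rules are
    illustrative examples, not a complete specification."""
    issues = []
    if record.get("Visitors") is None:
        issues.append("Visitors missing")
    elif record["Visitors"] < 0:
        issues.append("Visitors negative")
    if not record.get("Country") or record["Country"] == "NULL":
        issues.append("Country missing")
    return issues

assert validate({"Visitors": -2000000, "Country": "FRANCE"}) == ["Visitors negative"]
assert validate({"Visitors": None, "Country": "NULL"}) == ["Visitors missing", "Country missing"]
assert validate({"Visitors": 900000, "Country": "JORDAN"}) == []
```

Reporting violations first, then fixing them under documented rules, is what keeps the audit trail that regulated environments require.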
8. 20 Key Sharp Insights
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Validation prevents regulatory issues
- Missing data must be handled carefully
- Duplicate data inflates metrics
- Text inconsistency breaks grouping
- Dates must be uniform
- Logs are your best friend
- Always QC your dataset
- Never trust raw data blindly
- Use functions like TRANWRD wisely
- Document everything
- Data cleaning is iterative
- Business logic matters
- Automation improves efficiency
- SAS ensures reliability
- R enables flexibility
- Clean data builds trust
- Accuracy drives decisions
- Quality data = quality insights
9. Summary (SAS vs R)
| Feature | SAS | R |
|---|---|---|
| Stability | High | Medium |
| Flexibility | Moderate | High |
| Validation | Strong | Manual |
| Regulatory Use | Preferred | Limited |
| Learning Curve | Moderate | Steep |
10. Conclusion
Data cleaning is not just a technical step; it is a strategic necessity. Whether you are working with tourism data or clinical trials, your insights are only as good as your data quality.
Functions like TRANWRD in SAS act as silent heroes, transforming messy, inconsistent datasets into structured, reliable intelligence.
A strong data cleaning framework ensures:
- Accuracy
- Compliance
- Trust
11. Interview Questions
Q1: How do you handle NULL values in SAS?
👉 Use COALESCEC() for character variables and IF logic (or COALESCE()) for numerics; TRANWRD() replaces literal "NULL" strings.
Q2: What is TRANWRD used for?
👉 Replaces all occurrences of a substring in a
variable.
Q3: How do you remove duplicates?
👉 PROC SORT NODUPKEY
Q4: R vs SAS for cleaning?
👉 R is flexible, SAS is robust and validated.
Q5: Debugging scenario
👉 Check logs → validate input → test transformations
step-by-step.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only; they deliberately simulate "disaster" data quality.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------