Mastering Dataset Integration with SAS MERGE vs JOIN and R: Temples, Transformations, and Truth
Sacred Data, Hidden Errors: Transforming a Global Temples Dataset into Analytical Truth Using SAS MERGE vs JOIN and R
1. Introduction
Imagine you’re working as a data analyst in a global cultural research organization. Your task? Analyze a dataset of famous temples across the world to uncover tourism patterns, visitor demographics, and revenue insights.
Sounds simple, until you open the dataset.
You immediately notice chaos:
- Temple names written as “kedarnath”, “KEDARNATH”, and “Kedarnath Temple”
- Missing visitor counts
- Negative revenue values (yes, negative!)
- Duplicate records for the same temple
- Dates like 32-13-2022
This is not just messy; it’s dangerous.
In real-world analytics (especially in clinical trials or regulated industries), bad data leads to flawed insights, incorrect decisions, and regulatory risk.
This is where SAS and R shine:
- SAS: strong in structured, audit-ready pipelines
- R: flexible, expressive, ideal for transformation
In this blog, we’ll build a temple dataset, deliberately inject errors, and then:
- Clean it using SAS & R
- Combine datasets using MERGE vs JOIN
- Demonstrate PROC SQL vs DATA step
- Apply real-world logic
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset with Errors)
DATA temples_raw;
INFILE DATALINES DSD DLM=',' MISSOVER;
LENGTH Temple_ID $5 Temple_Name $30 Country $20
City $20 Visitors 8 Revenue 8 Visit_Date $12 Rating 8;
INPUT Temple_ID $ Temple_Name $ Country $ City $
Visitors Revenue Visit_Date $ Rating;
DATALINES;
T001,kedarnath,india,rishikesh,50000,2000000,12-05-2023,4.5
T002,Angkor Wat,Cambodia,Siem Reap,,5000000,15-08-2023,5
T003,Meenakshi Temple,India,Madurai,-2000,300000,32-13-2022,4
T004,Golden Temple,INDIA,Amritsar,80000,-900000,10-10-2023,4.8
T001,kedarnath,india,rishikesh,50000,2000000,12-05-2023,4.5
T005,NULL,Nepal,Kathmandu,20000,1000000,01-01-2023,3.5
T006,Borobudur,Indonesia,,45000,2500000,05-06-2023,4.7
T007,Shwedagon,Myanmar,Yangon,60000,3000000,,4.6
T008,lotus temple,india,delhi,70000,3500000,11-11-2023,4.4
T009,Wat Phra Kaew,Thailand,Bangkok,55000,2700000,09-09-2023,5
T010,Tirupati Temple,India,Tirupati,90000,6000000,08-08-2023,4.9
;
RUN;
PROC PRINT DATA = temples_raw;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 2 | T002 | Angkor Wat | Cambodia | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 |
| 3 | T003 | Meenakshi Temple | India | Madurai | -2000 | 300000 | 32-13-2022 | 4.0 |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | -900000 | 10-10-2023 | 4.8 |
| 5 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 6 | T005 | NULL | Nepal | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 |
| 7 | T006 | Borobudur | Indonesia | | 45000 | 2500000 | 05-06-2023 | 4.7 |
| 8 | T007 | Shwedagon | Myanmar | Yangon | 60000 | 3000000 | | 4.6 |
| 9 | T008 | lotus temple | india | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 |
| 10 | T009 | Wat Phra Kaew | Thailand | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 |
| 11 | T010 | Tirupati Temple | India | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 |
Explanation
This dataset intentionally mimics real-world inconsistencies. We used INFILE DATALINES to simulate flat-file ingestion. Notice the issues: missing values (a blank Visitors cell), invalid entries (negative visitors and revenue), duplicate records (T001), inconsistent casing (india vs INDIA), and an impossible date (32-13-2022). The variable Visit_Date is stored as character, as is common in raw datasets. Using MISSOVER ensures SAS doesn’t jump to the next line when values are missing. This dataset reflects typical ingestion-stage problems in clinical or business data pipelines. Understanding the raw structure is critical before transformation, especially when applying MERGE vs JOIN logic, since incorrect keys or duplicates can lead to major analytical errors.
R Code – Equivalent Raw Dataset
temples_raw <- data.frame(
Temple_ID = c("T001","T002","T003","T004","T001","T005","T006",
"T007","T008","T009","T010"),
Temple_Name = c("kedarnath","Angkor Wat","Meenakshi Temple",
"Golden Temple","kedarnath","NULL",
"Borobudur","Shwedagon","lotus temple",
"Wat Phra Kaew","Tirupati Temple"),
Country = c("india","Cambodia","India","INDIA","india","Nepal",
"Indonesia","Myanmar","india","Thailand","India"),
City = c("rishikesh","Siem Reap","Madurai","Amritsar","rishikesh",
"Kathmandu","", "Yangon","delhi","Bangkok","Tirupati"),
Visitors = c(50000, NA, -2000, 80000, 50000, 20000, 45000, 60000,
70000, 55000, 90000),
Revenue = c(2000000,5000000,300000,-900000,2000000,1000000,2500000,
3000000,3500000,2700000,6000000),
Visit_Date = c("12-05-2023","15-08-2023","32-13-2022","10-10-2023",
"12-05-2023",
"01-01-2023","05-06-2023",NA,"11-11-2023","09-09-2023",
"08-08-2023"),
Rating = c(4.5,5,4,4.8,4.5,3.5,4.7,4.6,4.4,5,4.9)
)
OUTPUT:
| | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 2 | T002 | Angkor Wat | Cambodia | Siem Reap | NA | 5000000 | 15-08-2023 | 5 |
| 3 | T003 | Meenakshi Temple | India | Madurai | -2000 | 300000 | 32-13-2022 | 4 |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | -900000 | 10-10-2023 | 4.8 |
| 5 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 6 | T005 | NULL | Nepal | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 |
| 7 | T006 | Borobudur | Indonesia | | 45000 | 2500000 | 05-06-2023 | 4.7 |
| 8 | T007 | Shwedagon | Myanmar | Yangon | 60000 | 3000000 | NA | 4.6 |
| 9 | T008 | lotus temple | india | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 |
| 10 | T009 | Wat Phra Kaew | Thailand | Bangkok | 55000 | 2700000 | 09-09-2023 | 5 |
| 11 | T010 | Tirupati Temple | India | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 |
Explanation
This R dataset mirrors the SAS structure using data.frame(). Missing values are represented with NA, while inconsistent strings (like "NULL") are preserved intentionally for cleaning. Numeric inconsistencies such as negative values are included to demonstrate correction logic later. Unlike SAS, R does not enforce strict typing unless specified, which adds flexibility but also risks inconsistency. This raw dataset is a classic example of unvalidated ingestion, where downstream transformations (joins, aggregations) can break or produce incorrect results if not cleaned properly. That is especially critical before applying joins (merge(), inner_join()) in R.
3. Phase 1: Data Cleaning in SAS
DATA temples_clean;
SET temples_raw;
* Handle missing temple names;
IF Temple_Name = 'NULL' THEN Temple_Name = 'UNKNOWN';
* Handle missing city;
IF City = " " THEN City = 'UNKNOWN';
* Fix negative visitors;
IF Visitors < 0 THEN Visitors = ABS(Visitors);
* Fix negative revenue;
IF Revenue < 0 THEN Revenue = .;
* Standardize country text;
Country = UPCASE(STRIP(Country));
* Convert date;
Visit_Date_Num = INPUT(Visit_Date, DDMMYY10.);
FORMAT Visit_Date_Num DATE9.;
RUN;
PROC PRINT DATA = temples_clean;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 |
| 5 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 6 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 |
| 7 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 |
| 8 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . |
| 9 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 |
| 10 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 |
| 11 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 |
PROC SORT DATA=temples_clean NODUPKEY;
BY Temple_ID;
RUN;
PROC PRINT DATA = temples_clean;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 |
Explanation
This step uses DATA step transformations, which are row-wise and efficient. IF-THEN logic handles invalid data, for example converting negative visitor counts with ABS() or setting negative revenue to missing. UPCASE() standardizes categorical variables, a critical step before joins. INPUT() converts character dates into numeric SAS dates, enabling proper sorting and filtering. Finally, PROC SORT NODUPKEY removes duplicate records based on Temple_ID, ensuring the uniqueness that is essential before using MERGE. This pipeline reflects clinical data-cleaning practice (e.g., SDTM preprocessing), where traceability and reproducibility are mandatory.
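For readers who think outside SAS and R, the same cleaning rules can be sketched in a few lines of plain Python. This is an illustrative parallel only; the record, field names, and values below are made up to mirror one messy row, not part of the SAS pipeline:

```python
from datetime import datetime

# Hypothetical raw record mirroring one messy temples row
row = {"Temple_Name": "NULL", "City": " ", "Visitors": -2000,
       "Revenue": -900000, "Country": " india ", "Visit_Date": "32-13-2022"}

def clean(rec):
    rec = dict(rec)
    if rec["Temple_Name"] == "NULL":            # sentinel string -> placeholder
        rec["Temple_Name"] = "UNKNOWN"
    if rec["City"].strip() == "":               # blank city -> placeholder
        rec["City"] = "UNKNOWN"
    if rec["Visitors"] is not None and rec["Visitors"] < 0:
        rec["Visitors"] = abs(rec["Visitors"])  # like ABS() in SAS
    if rec["Revenue"] is not None and rec["Revenue"] < 0:
        rec["Revenue"] = None                   # like Revenue = . in SAS
    rec["Country"] = rec["Country"].strip().upper()  # UPCASE(STRIP())
    try:                                        # like INPUT(..., DDMMYY10.)
        rec["Visit_Date_Num"] = datetime.strptime(rec["Visit_Date"], "%d-%m-%Y").date()
    except (ValueError, TypeError):
        rec["Visit_Date_Num"] = None            # invalid dates become missing
    return rec

cleaned = clean(row)
```

The impossible date 32-13-2022 fails parsing and becomes missing, exactly as DDMMYY10. leaves it as `.` in the SAS output above.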
4. Phase 2: Data Refinement in R
library(dplyr)
library(lubridate)
temples_clean <- temples_raw %>%
mutate(
# Handle missing temple names
Temple_Name = ifelse(Temple_Name == "NULL", "UNKNOWN", Temple_Name),
# Handle missing city (space only like SAS " ")
City = ifelse(is.na(City) | trimws(City) == "", "UNKNOWN", City),
# Fix negative visitors
Visitors = ifelse(Visitors < 0, abs(Visitors), Visitors),
# Fix negative revenue (set to NA like SAS .)
Revenue = ifelse(Revenue < 0, NA, Revenue),
# Standardize country text
Country = toupper(trimws(Country)),
# Convert date (equivalent to INPUT + FORMAT in SAS)
Visit_Date_Num = suppressWarnings(dmy(Visit_Date)),
# Fix invalid or unparseable dates with a default (note: SAS left these missing)
Visit_Date_Num = coalesce(Visit_Date_Num, as.Date("2023-01-01"))
) %>%
distinct(Temple_ID, .keep_all = TRUE)
OUTPUT:
| | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12-05-2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | NA | 5000000 | 15-08-2023 | 5 | 15-08-2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4 | 01-01-2023 |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | NA | 10-10-2023 | 4.8 | 10-10-2023 |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01-01-2023 |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05-06-2023 |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | NA | 4.6 | 01-01-2023 |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11-11-2023 |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5 | 09-09-2023 |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08-08-2023 |
Explanation
The mutate() function allows column-wise transformation. We replace invalid values using ifelse(), ensuring data consistency. Always combine is.na() with trimws() for text cleaning. toupper() and trimws() standardize text, which is critical before performing joins. distinct() removes duplicates based on Temple_ID, similar to SAS NODUPKEY. R’s pipe operator (%>%) enhances readability and modularity. This approach is widely used in exploratory data analysis and preprocessing workflows. Unlike SAS, R provides more expressive transformations but requires careful handling to maintain auditability.
| SAS Statement | R Equivalent | Explanation |
|---|---|---|
| IF Temple_Name = 'NULL' | Temple_Name == "NULL" | Condition check |
| IF City = " " | is.na(City) \| trimws(City) == "" | Blank check (the R version also trims) |
| ABS(Visitors) | abs(Visitors) | Same numeric function |
| Revenue = . | Revenue = NA | Missing numeric value |
| UPCASE(STRIP()) | toupper(trimws()) | Text standardization |
| FORMAT DATE9. | Not needed | R stores Date internally |
5. Phase 3: Extended Analysis in SAS
DATA temples_flag;
SET temples_clean;
* Flag high revenue temples;
IF Revenue > 3000000 THEN High_Value='YES';
ELSE High_Value='NO';
RUN;
PROC PRINT DATA = temples_flag;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num | High_Value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 | NO |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 | YES |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . | NO |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 | NO |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 | NO |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 | NO |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . | NO |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 | YES |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 | NO |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 | YES |
PROC SUMMARY DATA=temples_flag NWAY;
CLASS Country;
VAR Revenue;
OUTPUT OUT=summary_data SUM=Total_Revenue;
RUN;
PROC PRINT DATA = summary_data;
RUN;
OUTPUT:
| Obs | Country | _TYPE_ | _FREQ_ | Total_Revenue |
|---|---|---|---|---|
| 1 | CAMBODIA | 1 | 1 | 5000000 |
| 2 | INDIA | 1 | 5 | 11800000 |
| 3 | INDONESIA | 1 | 1 | 2500000 |
| 4 | MYANMAR | 1 | 1 | 3000000 |
| 5 | NEPAL | 1 | 1 | 1000000 |
| 6 | THAILAND | 1 | 1 | 2700000 |
PROC REPORT DATA=summary_data;
COLUMN Country Total_Revenue;
DEFINE Country / GROUP;
DEFINE Total_Revenue / SUM;
RUN;
OUTPUT:
| Country | Total_Revenue |
|---|---|
| CAMBODIA | 5000000 |
| INDIA | 11800000 |
| INDONESIA | 2500000 |
| MYANMAR | 3000000 |
| NEPAL | 1000000 |
| THAILAND | 2700000 |
Explanation
This section demonstrates analytical enrichment. The IF-THEN-ELSE logic flags high-value temples, a common pattern in business rules. PROC SUMMARY aggregates revenue by country, offering insight into regional performance. PROC REPORT formats the output for reporting, aligning with regulatory expectations. These steps resemble clinical reporting pipelines (TLFs), where derived flags and aggregated summaries are essential. SAS excels here thanks to its structured, validated output generation.
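The country-level aggregation that PROC SUMMARY performs can be sketched in plain Python for comparison. The country/revenue pairs below are taken from the cleaned dataset above (Golden Temple's revenue is missing after cleaning, so, like PROC SUMMARY, the sketch skips it):

```python
from collections import defaultdict

# (Country, Revenue) pairs from the cleaned data; None marks a missing value
rows = [("INDIA", 2000000), ("CAMBODIA", 5000000), ("INDIA", 300000),
        ("INDIA", None), ("NEPAL", 1000000), ("INDONESIA", 2500000),
        ("MYANMAR", 3000000), ("INDIA", 3500000), ("THAILAND", 2700000),
        ("INDIA", 6000000)]

total_revenue = defaultdict(int)
for country, revenue in rows:
    if revenue is not None:        # PROC SUMMARY ignores missing values
        total_revenue[country] += revenue
```

The INDIA total comes out to 11,800,000, matching the summary_data table.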
MERGE vs JOIN
This section covers:
- Creating a valid secondary dataset
- Performing MERGE (DATA step)
- Performing JOIN (PROC SQL)
- Explaining the difference clearly
Step 1: Create a Lookup Dataset
/* Create revenue category lookup from existing dataset */
DATA revenue_lookup;
SET temples_clean;
LENGTH Revenue_Category $15;
IF Revenue >= 5000000 THEN Revenue_Category = 'HIGH';
ELSE IF Revenue >= 2000000 THEN Revenue_Category = 'MEDIUM';
ELSE IF Revenue > 0 THEN Revenue_Category = 'LOW';
ELSE Revenue_Category = 'UNKNOWN';
KEEP Temple_ID Revenue_Category;
RUN;
PROC PRINT DATA = revenue_lookup;
RUN;
OUTPUT:
| Obs | Temple_ID | Revenue_Category |
|---|---|---|
| 1 | T001 | MEDIUM |
| 2 | T002 | HIGH |
| 3 | T003 | LOW |
| 4 | T004 | UNKNOWN |
| 5 | T005 | LOW |
| 6 | T006 | MEDIUM |
| 7 | T007 | MEDIUM |
| 8 | T008 | MEDIUM |
| 9 | T009 | MEDIUM |
| 10 | T010 | HIGH |
/* Remove duplicates to make it proper lookup */
PROC SORT DATA=revenue_lookup NODUPKEY;
BY Temple_ID;
RUN;
PROC PRINT DATA = revenue_lookup;
RUN;
OUTPUT:
| Obs | Temple_ID | Revenue_Category |
|---|---|---|
| 1 | T001 | MEDIUM |
| 2 | T002 | HIGH |
| 3 | T003 | LOW |
| 4 | T004 | UNKNOWN |
| 5 | T005 | LOW |
| 6 | T006 | MEDIUM |
| 7 | T007 | MEDIUM |
| 8 | T008 | MEDIUM |
| 9 | T009 | MEDIUM |
| 10 | T010 | HIGH |
Explanation
Here, we derive a secondary dataset from an existing dataset, which is a best practice. Instead of introducing an external dependency, we:
- Create Revenue_Category using business logic
- Keep only the key (Temple_ID) plus the lookup variable
- Remove duplicates with NODUPKEY to ensure one-to-one merge integrity
👉 This mimics dimension-table creation in real ETL pipelines.
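The IF/ELSE ladder that builds Revenue_Category is easy to mirror in any language. A hedged Python sketch (the sample IDs and revenues are illustrative), including the SAS subtlety that a missing revenue fails every comparison and falls through to UNKNOWN:

```python
def revenue_category(revenue):
    """Mirror of the SAS IF/ELSE ladder used to build revenue_lookup."""
    if revenue is None:
        return "UNKNOWN"     # SAS: missing (.) fails all >= tests -> ELSE branch
    if revenue >= 5000000:
        return "HIGH"
    if revenue >= 2000000:
        return "MEDIUM"
    if revenue > 0:
        return "LOW"
    return "UNKNOWN"

# Key -> category pairs, like KEEP Temple_ID Revenue_Category
lookup = {tid: revenue_category(rev) for tid, rev in
          [("T001", 2000000), ("T002", 5000000), ("T003", 300000), ("T004", None)]}
```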
Step 2: SAS MERGE (DATA Step)
/* Ensure both datasets are sorted before MERGE */
PROC SORT DATA=temples_clean; BY Temple_ID; RUN;
PROC PRINT DATA = temples_clean;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 |
PROC SORT DATA=revenue_lookup; BY Temple_ID; RUN;
PROC PRINT DATA = revenue_lookup;
RUN;
OUTPUT:
| Obs | Temple_ID | Revenue_Category |
|---|---|---|
| 1 | T001 | MEDIUM |
| 2 | T002 | HIGH |
| 3 | T003 | LOW |
| 4 | T004 | UNKNOWN |
| 5 | T005 | LOW |
| 6 | T006 | MEDIUM |
| 7 | T007 | MEDIUM |
| 8 | T008 | MEDIUM |
| 9 | T009 | MEDIUM |
| 10 | T010 | HIGH |
DATA temples_merged;
MERGE temples_clean (IN=a)
revenue_lookup (IN=b);
BY Temple_ID;
/* Keep only matching records (INNER JOIN behavior) */
IF a AND b;
RUN;
PROC PRINT DATA = temples_merged;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num | Revenue_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 | MEDIUM |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 | HIGH |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . | LOW |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 | UNKNOWN |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 | LOW |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 | MEDIUM |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . | MEDIUM |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 | MEDIUM |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 | MEDIUM |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 | HIGH |
Explanation
The MERGE statement works row by row and requires both datasets to be sorted by the BY variable. The IN= dataset options create temporary flags (a, b) that control the join logic. Here, IF a AND b keeps only matching records, mimicking an INNER JOIN. Without this condition, SAS performs a full outer merge, which may introduce unexpected missing values. This method is highly efficient for large datasets but rigid: incorrect sorting or duplicate keys can silently corrupt results. That is why preprocessing (deduplication) is mandatory before MERGE.
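The IN= flag logic boils down to: walk both sorted key sequences and keep a row only when the key appears in both inputs. A minimal Python sketch of that idea (the keys and values are illustrative stand-ins, including an orphan row on each side):

```python
# "IF a AND b": keep a key only when it exists in both sorted inputs
left = {"T001": "kedarnath", "T002": "Angkor Wat", "T099": "orphan row"}
right = {"T001": "MEDIUM", "T002": "HIGH", "T100": "unmatched lookup"}

merged = {key: (left[key], right[key])
          for key in sorted(left)   # MERGE walks keys in sorted order
          if key in right}          # IN=a AND IN=b -> inner-join behavior
```

Dropping the `if key in right` filter would be the analogue of omitting IF a AND b, which keeps unmatched rows with missing values, i.e. a full outer merge.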
Step 3: SAS JOIN (PROC SQL)
PROC SQL;
CREATE TABLE temples_joined AS
SELECT a.*,b.Revenue_Category
FROM temples_clean AS a
INNER JOIN revenue_lookup AS b
ON a.Temple_ID = b.Temple_ID;
QUIT;
PROC PRINT DATA = temples_joined;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num | Revenue_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 | MEDIUM |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 | HIGH |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . | LOW |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 | UNKNOWN |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 | LOW |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 | MEDIUM |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . | MEDIUM |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 | MEDIUM |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 | MEDIUM |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 | HIGH |
Explanation
PROC SQL
provides a relational approach to combining datasets. Unlike MERGE, it
does not require sorting, making it more flexible. The INNER JOIN
ensures only matching records are included, similar to IF a AND b in MERGE. SQL
joins are easier to read and maintain, especially when dealing with multiple
conditions or complex joins. However, they can be slightly slower on very large
datasets compared to DATA step MERGE. In regulated environments, SQL is often
preferred for clarity, while MERGE is used for performance-critical pipelines.
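Because PROC SQL is standard SQL at heart, the same join runs unchanged in any SQL engine. As a sketch, here is the identical INNER JOIN executed through Python's built-in sqlite3; the table and column names mirror the SAS example, but the data is a small illustrative stand-in:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE temples_clean (Temple_ID TEXT, Temple_Name TEXT)")
con.execute("CREATE TABLE revenue_lookup (Temple_ID TEXT, Revenue_Category TEXT)")
con.executemany("INSERT INTO temples_clean VALUES (?, ?)",
                [("T001", "kedarnath"), ("T002", "Angkor Wat"), ("T099", "no lookup")])
con.executemany("INSERT INTO revenue_lookup VALUES (?, ?)",
                [("T001", "MEDIUM"), ("T002", "HIGH")])

# Same shape as the PROC SQL step: unmatched T099 is dropped by the inner join
joined = con.execute(
    """SELECT a.Temple_ID, a.Temple_Name, b.Revenue_Category
       FROM temples_clean AS a
       INNER JOIN revenue_lookup AS b
       ON a.Temple_ID = b.Temple_ID
       ORDER BY a.Temple_ID"""
).fetchall()
```

Note that no sorting of the inputs was needed; the ORDER BY exists only to make the output deterministic.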
MERGE vs JOIN: The Real Difference
| Aspect | MERGE (DATA Step) | JOIN (PROC SQL) |
|---|---|---|
| Sorting required | ✅ Yes | ❌ No |
| Performance | ⚡ Faster for large data | Moderate |
| Flexibility | ❌ Limited | ✅ High |
| Debugging | ⚠️ Harder | ✅ Easier |
| Duplicate handling | ⚠️ Risky | ✅ Controlled |
Key Insight
- MERGE: requires sorted data, faster for large datasets
- JOIN: flexible, SQL-based, easier for complex logic
Key Takeaways (Sharp & Practical)
- Never reference non-existent datasets in SAS
- Always derive lookup tables explicitly
- MERGE requires sorted datasets and unique keys
- JOIN is flexible and easier to debug
- Use IN= flags to control merge logic
- Always validate row counts after a merge
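That last takeaway, validating row counts, is cheap to automate. A hedged sketch in Python of the kind of post-merge check meant here, assuming a one-to-one lookup (the row tuples are illustrative): the merged output must not have more rows than the input, and every row should have found a lookup value.

```python
# Illustrative pre- and post-merge row sets for a one-to-one lookup
clean_rows = [("T001",), ("T002",), ("T003",)]
merged_rows = [("T001", "MEDIUM"), ("T002", "HIGH"), ("T003", "LOW")]

# Check 1: a one-to-one merge must never multiply rows (duplicate-key symptom)
assert len(merged_rows) <= len(clean_rows), "merge multiplied rows: duplicate keys?"

# Check 2: every row should have picked up a lookup value
unmatched = [r for r in merged_rows if r[1] is None]
assert not unmatched, f"{len(unmatched)} rows missing a lookup value"
```

In SAS the same checks are typically done by comparing NOBS before and after, or with PROC FREQ on the key.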
6. 20 Additional Data Cleaning Best Practices
- Always validate key identifiers (USUBJID equivalent)
- Maintain audit trails for transformations
- Use controlled terminology (CDISC standards)
- Validate date formats strictly
- Avoid hardcoding logic
- Use metadata-driven programming
- Track derivations in Define.xml
- Perform double programming (QC)
- Validate joins/merges carefully
- Handle duplicates explicitly
- Normalize categorical variables
- Use formats for standardization
- Validate ranges (e.g., age 0–120)
- Log all transformations
- Avoid overwriting raw data
- Use macro variables for flexibility
- Implement validation checks
- Cross-check totals after aggregation
- Use PROC COMPARE for QC
- Document assumptions clearly
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business and domain logic. For instance, replacing missing values ensures continuity in analysis. If a visitor count is missing, imputing the median prevents skewed summaries. In clinical trials, a missing patient age could invalidate a subgroup analysis.
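As a concrete sketch of the median imputation mentioned above, here is the idea in plain Python; the list is the cleaned Visitors column from this article, with the Angkor Wat value missing:

```python
# Cleaned Visitors column; None is the missing Angkor Wat count
visitors = [50000, None, 2000, 80000, 20000, 45000, 60000, 70000, 55000, 90000]

# Compute the median of the known values
known = sorted(v for v in visitors if v is not None)
mid = len(known) // 2
median = known[mid] if len(known) % 2 else (known[mid - 1] + known[mid]) / 2

# Fill the gap with the median instead of dropping the row
imputed = [median if v is None else v for v in visitors]
```

With nine known values the median is 55000, so the missing count is filled without distorting the distribution's center; any imputation like this should still be justified and documented.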
Unrealistic values must be corrected because they distort analytics. A negative revenue value doesn’t make business sense; it may indicate a data-entry error or a system bug. Left uncorrected, it can lead to incorrect financial forecasting.
Similarly, date imputation is crucial. Suppose a visit date is missing; analysts may use the study midpoint or the last known visit. This ensures consistency in longitudinal analysis.
In temple datasets, incorrect city names or inconsistent country labels can break joins during aggregation. For example, "india" vs "INDIA" will create separate groups, leading to inaccurate regional insights.
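The "separate groups" problem is easy to demonstrate. A small Python illustration (the country list is a made-up sample): case variants of the same country form distinct groups until the text is normalized, which is exactly why the SAS and R pipelines apply UPCASE(STRIP()) and toupper(trimws()) before grouping or joining.

```python
# Case variants of the same country, plus one genuinely different country
countries = ["india", "INDIA", "India", "Nepal"]

raw_groups = set(countries)                                 # grouping as-is
normalized_groups = {c.strip().upper() for c in countries}  # grouping after cleanup
```

Ungrouped, four "countries" appear; after normalization only the true two remain.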
Ultimately, data cleaning ensures:
- Accuracy
- Consistency
- Reliability
Without it, even the most advanced analytics models produce misleading results.
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing values must be handled carefully
- Duplicate records distort analysis
- Text inconsistency breaks joins
- Dates must be validated
- Negative values need correction
- Merge requires sorted data
- SQL joins are flexible
- SAS ensures auditability
- R enables rapid transformation
- Data validation is mandatory
- Cleaning improves model accuracy
- Business rules drive cleaning
- Documentation is critical
- QC checks prevent errors
- Aggregation needs clean keys
- Imputation must be justified
- Always preserve raw data
- Clean data builds trust
9. Summary
This project demonstrated how a global temples dataset, when left uncleaned, can lead to analytical inconsistencies and misleading conclusions. We started with a deliberately flawed dataset containing missing values, invalid entries, duplicates, and inconsistent formatting, mirroring real-world data challenges.
Using SAS, we applied structured data-cleaning techniques in the DATA step, leveraging functions like ABS, UPCASE, and INPUT. We removed duplicates using PROC SORT NODUPKEY and performed aggregations with PROC SUMMARY. SAS proved highly reliable for enterprise-grade workflows, especially where audit trails and reproducibility are critical.
In R, we used dplyr functions like mutate, filter, and distinct to transform and refine the dataset. R’s syntax is expressive and efficient, making it ideal for exploratory data analysis and rapid prototyping.
A key highlight was understanding MERGE vs JOIN:
- MERGE is efficient but requires sorted datasets and strict key alignment
- JOIN (PROC SQL / dplyr joins) offers flexibility and is easier for complex relationships
Both approaches have their place, and choosing the right one depends on the use case.
Ultimately, the combination of SAS and R provides a powerful ecosystem for data cleaning and analysis. Clean data ensures accurate insights, better decision-making, and compliance with industry standards.
10. Conclusion
Data is often described as the “new oil,” but raw data, like crude oil, is unusable without refinement. This project reinforced a critical truth: data cleaning is not a preliminary step; it is the foundation of all analytics.
Through the lens of a global temples dataset, we explored how even simple inconsistencies, such as case sensitivity or duplicate records, can cascade into major analytical errors. In high-stakes environments such as clinical trials or financial reporting, these errors can have serious consequences.
SAS and R offer complementary strengths. SAS provides a robust, regulated framework ideal for production environments, while R delivers flexibility and speed for data exploration and transformation. Understanding when to use MERGE vs JOIN is essential for ensuring data integrity when combining datasets.
Moreover, the importance of business logic cannot be overstated. Data-cleaning decisions must align with domain knowledge, whether that means correcting a patient’s age in a clinical dataset or validating revenue figures in a business dataset.
As an aspiring or experienced data professional, your ability to clean, validate, and structure data will define the quality of your insights. Tools and techniques will evolve, but the principles of data integrity, consistency, and validation remain constant.
In the end, clean data is not just about accuracy; it’s about trust. And in analytics, trust is everything.
11. Interview Questions (Scenario-Based)
Q1: What is the difference between MERGE and JOIN in SAS?
Answer: MERGE requires sorted datasets and works row-wise. JOIN (PROC SQL) is more flexible and supports complex conditions.
Q2: How do you handle duplicate records?
Answer: Use PROC SORT NODUPKEY in SAS or distinct() in R, based on key variables.
Q3: What happens if you merge unsorted data?
Answer: SAS may stop with an error about improperly sorted BY variables, or produce incorrect results; data integrity is compromised.
Q4: How do you validate cleaned data?
Answer: Use PROC COMPARE, frequency checks, and cross-validation against the raw data.
Q5: R vs SAS for data cleaning?
Answer: R is flexible and fast; SAS is structured and audit-compliant. Choose based on project needs.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The temples dataset and analysis in this article are created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------