Mastering Dataset Integration with SAS MERGE vs JOIN and R: Temples, Transformations, and Truth
Sacred Data, Hidden Errors: Transforming a Global Temples Dataset into Analytical Truth Using SAS MERGE vs JOIN and R
1. Introduction
Imagine you’re working as a data analyst in a global cultural research organization. Your task? Analyze a dataset of famous temples across the world to uncover tourism patterns, visitor demographics, and revenue insights.
Sounds simple, until you open the dataset.
You immediately notice chaos:
- Temple names written as “kedarnath”, “KEDARNATH”, and “Kedarnath Temple”
- Missing visitor counts
- Negative revenue values (yes, negative!)
- Duplicate records for the same temple
- Dates like 32-13-2022
This is not just messy; it’s dangerous.
In real-world analytics (especially in clinical trials or regulated industries), bad data leads to flawed insights, incorrect decisions, and regulatory risk.
This is where SAS and R shine:
- SAS: strong in structured, audit-ready pipelines
- R: flexible, expressive, ideal for transformation
In this blog, we’ll build a temple dataset, deliberately inject errors, and then:
- Clean it using SAS & R
- Combine datasets using MERGE vs JOIN
- Demonstrate PROC SQL vs DATA step
- Apply real-world logic
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset with Errors)
DATA temples_raw;
INFILE DATALINES DSD DLM=',' MISSOVER;
LENGTH Temple_ID $5 Temple_Name $30 Country $20
City $20 Visitors 8 Revenue 8 Visit_Date $12 Rating 8;
INPUT Temple_ID $ Temple_Name $ Country $ City $
Visitors Revenue Visit_Date $ Rating;
DATALINES;
T001,kedarnath,india,rishikesh,50000,2000000,12-05-2023,4.5
T002,Angkor Wat,Cambodia,Siem Reap,,5000000,15-08-2023,5
T003,Meenakshi Temple,India,Madurai,-2000,300000,32-13-2022,4
T004,Golden Temple,INDIA,Amritsar,80000,-900000,10-10-2023,4.8
T001,kedarnath,india,rishikesh,50000,2000000,12-05-2023,4.5
T005,NULL,Nepal,Kathmandu,20000,1000000,01-01-2023,3.5
T006,Borobudur,Indonesia,,45000,2500000,05-06-2023,4.7
T007,Shwedagon,Myanmar,Yangon,60000,3000000,,4.6
T008,lotus temple,india,delhi,70000,3500000,11-11-2023,4.4
T009,Wat Phra Kaew,Thailand,Bangkok,55000,2700000,09-09-2023,5
T010,Tirupati Temple,India,Tirupati,90000,6000000,08-08-2023,4.9
;
RUN;
PROC PRINT DATA = temples_raw;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 2 | T002 | Angkor Wat | Cambodia | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 |
| 3 | T003 | Meenakshi Temple | India | Madurai | -2000 | 300000 | 32-13-2022 | 4.0 |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | -900000 | 10-10-2023 | 4.8 |
| 5 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 6 | T005 | NULL | Nepal | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 |
| 7 | T006 | Borobudur | Indonesia | | 45000 | 2500000 | 05-06-2023 | 4.7 |
| 8 | T007 | Shwedagon | Myanmar | Yangon | 60000 | 3000000 | | 4.6 |
| 9 | T008 | lotus temple | india | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 |
| 10 | T009 | Wat Phra Kaew | Thailand | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 |
| 11 | T010 | Tirupati Temple | India | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 |
Explanation
This dataset intentionally mimics real-world inconsistencies. We used INFILE DATALINES to simulate flat-file ingestion. Notice the issues: missing values (a blank Visitors cell), invalid entries (negative visitors and revenue), duplicate records (T001), inconsistent casing (india vs INDIA), and an impossible date (32-13-2022). The variable Visit_Date is stored as character, as is common in raw datasets. Using MISSOVER ensures SAS doesn’t jump to the next line when values are missing. This dataset reflects typical ingestion-stage problems in clinical or business data pipelines. Understanding the raw structure is critical before transformation, especially when applying MERGE vs JOIN logic, since incorrect keys or duplicates can lead to major analytical errors.
R Code – Equivalent Raw Dataset
temples_raw <- data.frame(
Temple_ID = c("T001","T002","T003","T004","T001","T005","T006",
"T007","T008","T009","T010"),
Temple_Name = c("kedarnath","Angkor Wat","Meenakshi Temple",
"Golden Temple","kedarnath","NULL",
"Borobudur","Shwedagon","lotus temple",
"Wat Phra Kaew","Tirupati Temple"),
Country = c("india","Cambodia","India","INDIA","india","Nepal",
"Indonesia","Myanmar","india","Thailand","India"),
City = c("rishikesh","Siem Reap","Madurai","Amritsar","rishikesh",
"Kathmandu","", "Yangon","delhi","Bangkok","Tirupati"),
Visitors = c(50000, NA, -2000, 80000, 50000, 20000, 45000, 60000,
70000, 55000, 90000),
Revenue = c(2000000,5000000,300000,-900000,2000000,1000000,2500000,
3000000,3500000,2700000,6000000),
Visit_Date = c("12-05-2023","15-08-2023","32-13-2022","10-10-2023",
"12-05-2023",
"01-01-2023","05-06-2023",NA,"11-11-2023","09-09-2023",
"08-08-2023"),
Rating = c(4.5,5,4,4.8,4.5,3.5,4.7,4.6,4.4,5,4.9)
)
OUTPUT:
| | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating |
|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 2 | T002 | Angkor Wat | Cambodia | Siem Reap | NA | 5000000 | 15-08-2023 | 5 |
| 3 | T003 | Meenakshi Temple | India | Madurai | -2000 | 300000 | 32-13-2022 | 4 |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | -900000 | 10-10-2023 | 4.8 |
| 5 | T001 | kedarnath | india | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 |
| 6 | T005 | NULL | Nepal | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 |
| 7 | T006 | Borobudur | Indonesia | | 45000 | 2500000 | 05-06-2023 | 4.7 |
| 8 | T007 | Shwedagon | Myanmar | Yangon | 60000 | 3000000 | NA | 4.6 |
| 9 | T008 | lotus temple | india | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 |
| 10 | T009 | Wat Phra Kaew | Thailand | Bangkok | 55000 | 2700000 | 09-09-2023 | 5 |
| 11 | T010 | Tirupati Temple | India | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 |
Explanation
This R dataset mirrors the SAS structure using data.frame(). Missing values are represented with NA, while inconsistent strings (like "NULL") are preserved intentionally for cleaning. Numeric inconsistencies such as negative values are included to demonstrate correction logic later. Unlike SAS, R does not enforce strict typing unless specified, which adds flexibility but also risks inconsistency. This raw dataset is a classic example of unvalidated ingestion, where downstream transformations (joins, aggregations) can break or produce incorrect results if not cleaned properly. That is especially critical before applying joins (merge(), inner_join()) in R.
3. Phase 1: Data Cleaning in SAS
DATA temples_clean;
SET temples_raw;
* Handle missing temple names;
IF Temple_Name = 'NULL' THEN Temple_Name = 'UNKNOWN';
* Handle missing city;
IF City = " " THEN City = 'UNKNOWN';
* Fix negative visitors;
IF Visitors < 0 THEN Visitors = ABS(Visitors);
* Fix negative revenue;
IF Revenue < 0 THEN Revenue = .;
* Standardize country text;
Country = UPCASE(STRIP(Country));
* Convert date;
Visit_Date_Num = INPUT(Visit_Date, DDMMYY10.);
FORMAT Visit_Date_Num DATE9.;
RUN;
PROC PRINT DATA = temples_clean;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 |
| 5 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 6 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 |
| 7 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 |
| 8 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . |
| 9 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 |
| 10 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 |
| 11 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 |
PROC SORT DATA=temples_clean NODUPKEY;
BY Temple_ID;
RUN;
PROC PRINT DATA = temples_clean;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 |
Explanation
This step uses DATA step transformations, which are row-wise and efficient. IF-THEN logic handles invalid data, for example converting negative visitor counts with ABS() or setting negative revenue to missing. UPCASE() standardizes categorical variables, a critical step before joins. INPUT() converts character dates into numeric SAS dates, enabling proper sorting and filtering. Finally, PROC SORT NODUPKEY removes duplicate records based on Temple_ID, ensuring the uniqueness that is essential before using MERGE. This pipeline reflects clinical data-cleaning practice (e.g., SDTM preprocessing), where traceability and reproducibility are mandatory.
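For readers who think outside SAS and R, the same cleaning rules can be sketched in a few lines of plain Python. This is an illustrative parallel only; the record, field names, and values below are made up to mirror one messy row, not part of the SAS pipeline:

```python
from datetime import datetime

# Hypothetical raw record mirroring one messy temples row
row = {"Temple_Name": "NULL", "City": " ", "Visitors": -2000,
       "Revenue": -900000, "Country": " india ", "Visit_Date": "32-13-2022"}

def clean(rec):
    rec = dict(rec)
    if rec["Temple_Name"] == "NULL":            # sentinel string -> placeholder
        rec["Temple_Name"] = "UNKNOWN"
    if rec["City"].strip() == "":               # blank city -> placeholder
        rec["City"] = "UNKNOWN"
    if rec["Visitors"] is not None and rec["Visitors"] < 0:
        rec["Visitors"] = abs(rec["Visitors"])  # like ABS() in SAS
    if rec["Revenue"] is not None and rec["Revenue"] < 0:
        rec["Revenue"] = None                   # like Revenue = . in SAS
    rec["Country"] = rec["Country"].strip().upper()  # UPCASE(STRIP())
    try:                                        # like INPUT(..., DDMMYY10.)
        rec["Visit_Date_Num"] = datetime.strptime(rec["Visit_Date"], "%d-%m-%Y").date()
    except (ValueError, TypeError):
        rec["Visit_Date_Num"] = None            # invalid dates become missing
    return rec

cleaned = clean(row)
```

The impossible date 32-13-2022 fails parsing and becomes missing, exactly as DDMMYY10. leaves it as `.` in the SAS output above.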
4. Phase 2: Data Refinement in R
library(dplyr)
library(lubridate)
temples_clean <- temples_raw %>%
mutate(
# Handle missing temple names
Temple_Name = ifelse(Temple_Name == "NULL", "UNKNOWN", Temple_Name),
# Handle missing city (space only like SAS " ")
City = ifelse(is.na(City) | trimws(City) == "", "UNKNOWN", City),
# Fix negative visitors
Visitors = ifelse(Visitors < 0, abs(Visitors), Visitors),
# Fix negative revenue (set to NA like SAS .)
Revenue = ifelse(Revenue < 0, NA, Revenue),
# Standardize country text
Country = toupper(trimws(Country)),
# Convert date (equivalent to INPUT + FORMAT in SAS)
Visit_Date_Num = suppressWarnings(dmy(Visit_Date)),
# Fix invalid or unparseable dates with a default (note: SAS left these missing)
Visit_Date_Num = coalesce(Visit_Date_Num, as.Date("2023-01-01"))
) %>%
distinct(Temple_ID, .keep_all = TRUE)
OUTPUT:
| | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12-05-2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | NA | 5000000 | 15-08-2023 | 5 | 15-08-2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4 | 01-01-2023 |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | NA | 10-10-2023 | 4.8 | 10-10-2023 |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01-01-2023 |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05-06-2023 |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | NA | 4.6 | 01-01-2023 |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11-11-2023 |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5 | 09-09-2023 |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08-08-2023 |
Explanation
The mutate() function allows column-wise transformation. We replace invalid values using ifelse(), ensuring data consistency. Always combine is.na() with trimws() for text cleaning. toupper() and trimws() standardize text, which is critical before performing joins. distinct() removes duplicates based on Temple_ID, similar to SAS NODUPKEY. R’s pipe operator (%>%) enhances readability and modularity. This approach is widely used in exploratory data analysis and preprocessing workflows. Unlike SAS, R provides more expressive transformations but requires careful handling to maintain auditability.
| SAS Statement | R Equivalent | Explanation |
|---|---|---|
| IF Temple_Name = 'NULL' | Temple_Name == "NULL" | Condition check |
| IF City = " " | is.na(City) \| trimws(City) == "" | Blank check (the R version also trims) |
| ABS(Visitors) | abs(Visitors) | Same numeric function |
| Revenue = . | Revenue = NA | Missing numeric value |
| UPCASE(STRIP()) | toupper(trimws()) | Text standardization |
| FORMAT DATE9. | Not needed | R stores Date internally |
5. Phase 3: Extended Analysis in SAS
DATA temples_flag;
SET temples_clean;
* Flag high revenue temples;
IF Revenue > 3000000 THEN High_Value='YES';
ELSE High_Value='NO';
RUN;
PROC PRINT DATA = temples_flag;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num | High_Value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 | NO |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 | YES |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . | NO |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 | NO |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 | NO |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 | NO |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . | NO |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 | YES |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 | NO |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 | YES |
PROC SUMMARY DATA=temples_flag NWAY;
CLASS Country;
VAR Revenue;
OUTPUT OUT=summary_data SUM=Total_Revenue;
RUN;
PROC PRINT DATA = summary_data;
RUN;
OUTPUT:
| Obs | Country | _TYPE_ | _FREQ_ | Total_Revenue |
|---|---|---|---|---|
| 1 | CAMBODIA | 1 | 1 | 5000000 |
| 2 | INDIA | 1 | 5 | 11800000 |
| 3 | INDONESIA | 1 | 1 | 2500000 |
| 4 | MYANMAR | 1 | 1 | 3000000 |
| 5 | NEPAL | 1 | 1 | 1000000 |
| 6 | THAILAND | 1 | 1 | 2700000 |
PROC REPORT DATA=summary_data;
COLUMN Country Total_Revenue;
DEFINE Country / GROUP;
DEFINE Total_Revenue / SUM;
RUN;
OUTPUT:
| Country | Total_Revenue |
|---|---|
| CAMBODIA | 5000000 |
| INDIA | 11800000 |
| INDONESIA | 2500000 |
| MYANMAR | 3000000 |
| NEPAL | 1000000 |
| THAILAND | 2700000 |
Explanation
This section demonstrates analytical enrichment. The IF-THEN-ELSE logic flags high-value temples, a common pattern in business rules. PROC SUMMARY aggregates revenue by country, offering insight into regional performance. PROC REPORT formats the output for reporting, aligning with regulatory expectations. These steps resemble clinical reporting pipelines (TLFs), where derived flags and aggregated summaries are essential. SAS excels here thanks to its structured, validated output generation.
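The country-level aggregation that PROC SUMMARY performs can be sketched in plain Python for comparison. The country/revenue pairs below are taken from the cleaned dataset above (Golden Temple's revenue is missing after cleaning, so, like PROC SUMMARY, the sketch skips it):

```python
from collections import defaultdict

# (Country, Revenue) pairs from the cleaned data; None marks a missing value
rows = [("INDIA", 2000000), ("CAMBODIA", 5000000), ("INDIA", 300000),
        ("INDIA", None), ("NEPAL", 1000000), ("INDONESIA", 2500000),
        ("MYANMAR", 3000000), ("INDIA", 3500000), ("THAILAND", 2700000),
        ("INDIA", 6000000)]

total_revenue = defaultdict(int)
for country, revenue in rows:
    if revenue is not None:        # PROC SUMMARY ignores missing values
        total_revenue[country] += revenue
```

The INDIA total comes out to 11,800,000, matching the summary_data table.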
MERGE vs JOIN
This section covers:
- Creating a valid secondary dataset
- Performing MERGE (DATA step)
- Performing JOIN (PROC SQL)
- Explaining the difference clearly
Step 1: Create a Lookup Dataset
/* Create revenue category lookup from existing dataset */
DATA revenue_lookup;
SET temples_clean;
LENGTH Revenue_Category $15;
IF Revenue >= 5000000 THEN Revenue_Category = 'HIGH';
ELSE IF Revenue >= 2000000 THEN Revenue_Category = 'MEDIUM';
ELSE IF Revenue > 0 THEN Revenue_Category = 'LOW';
ELSE Revenue_Category = 'UNKNOWN';
KEEP Temple_ID Revenue_Category;
RUN;
PROC PRINT DATA = revenue_lookup;
RUN;
OUTPUT:
| Obs | Temple_ID | Revenue_Category |
|---|---|---|
| 1 | T001 | MEDIUM |
| 2 | T002 | HIGH |
| 3 | T003 | LOW |
| 4 | T004 | UNKNOWN |
| 5 | T005 | LOW |
| 6 | T006 | MEDIUM |
| 7 | T007 | MEDIUM |
| 8 | T008 | MEDIUM |
| 9 | T009 | MEDIUM |
| 10 | T010 | HIGH |
/* Remove duplicates to make it proper lookup */
PROC SORT DATA=revenue_lookup NODUPKEY;
BY Temple_ID;
RUN;
PROC PRINT DATA = revenue_lookup;
RUN;
OUTPUT:
| Obs | Temple_ID | Revenue_Category |
|---|---|---|
| 1 | T001 | MEDIUM |
| 2 | T002 | HIGH |
| 3 | T003 | LOW |
| 4 | T004 | UNKNOWN |
| 5 | T005 | LOW |
| 6 | T006 | MEDIUM |
| 7 | T007 | MEDIUM |
| 8 | T008 | MEDIUM |
| 9 | T009 | MEDIUM |
| 10 | T010 | HIGH |
Explanation
Here, we derive a secondary dataset from an existing dataset, which is a best practice. Instead of introducing an external dependency, we:
- Create Revenue_Category using business logic
- Keep only the key (Temple_ID) plus the lookup variable
- Remove duplicates with NODUPKEY to ensure one-to-one merge integrity
👉 This mimics dimension-table creation in real ETL pipelines.
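The IF/ELSE ladder that builds Revenue_Category is easy to mirror in any language. A hedged Python sketch (the sample IDs and revenues are illustrative), including the SAS subtlety that a missing revenue fails every comparison and falls through to UNKNOWN:

```python
def revenue_category(revenue):
    """Mirror of the SAS IF/ELSE ladder used to build revenue_lookup."""
    if revenue is None:
        return "UNKNOWN"     # SAS: missing (.) fails all >= tests -> ELSE branch
    if revenue >= 5000000:
        return "HIGH"
    if revenue >= 2000000:
        return "MEDIUM"
    if revenue > 0:
        return "LOW"
    return "UNKNOWN"

# Key -> category pairs, like KEEP Temple_ID Revenue_Category
lookup = {tid: revenue_category(rev) for tid, rev in
          [("T001", 2000000), ("T002", 5000000), ("T003", 300000), ("T004", None)]}
```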
Step 2: SAS MERGE (DATA Step)
/* Ensure both datasets are sorted before MERGE */
PROC SORT DATA=temples_clean; BY Temple_ID; RUN;
PROC PRINT DATA = temples_clean;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 |
PROC SORT DATA=revenue_lookup; BY Temple_ID; RUN;
PROC PRINT DATA = revenue_lookup;
RUN;
OUTPUT:
| Obs | Temple_ID | Revenue_Category |
|---|---|---|
| 1 | T001 | MEDIUM |
| 2 | T002 | HIGH |
| 3 | T003 | LOW |
| 4 | T004 | UNKNOWN |
| 5 | T005 | LOW |
| 6 | T006 | MEDIUM |
| 7 | T007 | MEDIUM |
| 8 | T008 | MEDIUM |
| 9 | T009 | MEDIUM |
| 10 | T010 | HIGH |
DATA temples_merged;
MERGE temples_clean (IN=a)
revenue_lookup (IN=b);
BY Temple_ID;
/* Keep only matching records (INNER JOIN behavior) */
IF a AND b;
RUN;
PROC PRINT DATA = temples_merged;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num | Revenue_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 | MEDIUM |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 | HIGH |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . | LOW |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 | UNKNOWN |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 | LOW |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 | MEDIUM |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . | MEDIUM |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 | MEDIUM |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 | MEDIUM |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 | HIGH |
Explanation
The MERGE statement works row by row and requires both datasets to be sorted by the BY variable. The IN= dataset options create temporary flags (a, b) that control the join logic. Here, IF a AND b keeps only matching records, mimicking an INNER JOIN. Without this condition, SAS performs a full outer merge, which may introduce unexpected missing values. This method is highly efficient for large datasets but rigid: incorrect sorting or duplicate keys can silently corrupt results. That is why preprocessing (deduplication) is mandatory before MERGE.
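The IN= flag logic boils down to: walk both sorted key sequences and keep a row only when the key appears in both inputs. A minimal Python sketch of that idea (the keys and values are illustrative stand-ins, including an orphan row on each side):

```python
# "IF a AND b": keep a key only when it exists in both sorted inputs
left = {"T001": "kedarnath", "T002": "Angkor Wat", "T099": "orphan row"}
right = {"T001": "MEDIUM", "T002": "HIGH", "T100": "unmatched lookup"}

merged = {key: (left[key], right[key])
          for key in sorted(left)   # MERGE walks keys in sorted order
          if key in right}          # IN=a AND IN=b -> inner-join behavior
```

Dropping the `if key in right` filter would be the analogue of omitting IF a AND b, which keeps unmatched rows with missing values, i.e. a full outer merge.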
Step 3: SAS JOIN (PROC SQL)
PROC SQL;
CREATE TABLE temples_joined AS
SELECT a.*,b.Revenue_Category
FROM temples_clean AS a
INNER JOIN revenue_lookup AS b
ON a.Temple_ID = b.Temple_ID;
QUIT;
PROC PRINT DATA = temples_joined;
RUN;
OUTPUT:
| Obs | Temple_ID | Temple_Name | Country | City | Visitors | Revenue | Visit_Date | Rating | Visit_Date_Num | Revenue_Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | kedarnath | INDIA | rishikesh | 50000 | 2000000 | 12-05-2023 | 4.5 | 12MAY2023 | MEDIUM |
| 2 | T002 | Angkor Wat | CAMBODIA | Siem Reap | . | 5000000 | 15-08-2023 | 5.0 | 15AUG2023 | HIGH |
| 3 | T003 | Meenakshi Temple | INDIA | Madurai | 2000 | 300000 | 32-13-2022 | 4.0 | . | LOW |
| 4 | T004 | Golden Temple | INDIA | Amritsar | 80000 | . | 10-10-2023 | 4.8 | 10OCT2023 | UNKNOWN |
| 5 | T005 | UNKNOWN | NEPAL | Kathmandu | 20000 | 1000000 | 01-01-2023 | 3.5 | 01JAN2023 | LOW |
| 6 | T006 | Borobudur | INDONESIA | UNKNOWN | 45000 | 2500000 | 05-06-2023 | 4.7 | 05JUN2023 | MEDIUM |
| 7 | T007 | Shwedagon | MYANMAR | Yangon | 60000 | 3000000 | | 4.6 | . | MEDIUM |
| 8 | T008 | lotus temple | INDIA | delhi | 70000 | 3500000 | 11-11-2023 | 4.4 | 11NOV2023 | MEDIUM |
| 9 | T009 | Wat Phra Kaew | THAILAND | Bangkok | 55000 | 2700000 | 09-09-2023 | 5.0 | 09SEP2023 | MEDIUM |
| 10 | T010 | Tirupati Temple | INDIA | Tirupati | 90000 | 6000000 | 08-08-2023 | 4.9 | 08AUG2023 | HIGH |
Explanation
PROC SQL
provides a relational approach to combining datasets. Unlike MERGE, it
does not require sorting, making it more flexible. The INNER JOIN
ensures only matching records are included, similar to IF a AND b in MERGE. SQL
joins are easier to read and maintain, especially when dealing with multiple
conditions or complex joins. However, they can be slightly slower on very large
datasets compared to DATA step MERGE. In regulated environments, SQL is often
preferred for clarity, while MERGE is used for performance-critical pipelines.
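Because PROC SQL is standard SQL at heart, the same join runs unchanged in any SQL engine. As a sketch, here is the identical INNER JOIN executed through Python's built-in sqlite3; the table and column names mirror the SAS example, but the data is a small illustrative stand-in:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE temples_clean (Temple_ID TEXT, Temple_Name TEXT)")
con.execute("CREATE TABLE revenue_lookup (Temple_ID TEXT, Revenue_Category TEXT)")
con.executemany("INSERT INTO temples_clean VALUES (?, ?)",
                [("T001", "kedarnath"), ("T002", "Angkor Wat"), ("T099", "no lookup")])
con.executemany("INSERT INTO revenue_lookup VALUES (?, ?)",
                [("T001", "MEDIUM"), ("T002", "HIGH")])

# Same shape as the PROC SQL step: unmatched T099 is dropped by the inner join
joined = con.execute(
    """SELECT a.Temple_ID, a.Temple_Name, b.Revenue_Category
       FROM temples_clean AS a
       INNER JOIN revenue_lookup AS b
       ON a.Temple_ID = b.Temple_ID
       ORDER BY a.Temple_ID"""
).fetchall()
```

Note that no sorting of the inputs was needed; the ORDER BY exists only to make the output deterministic.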
MERGE vs JOIN: The Real Difference
| Aspect | MERGE (DATA Step) | JOIN (PROC SQL) |
|---|---|---|
| Sorting required | ✅ Yes | ❌ No |
| Performance | ⚡ Faster for large data | Moderate |
| Flexibility | ❌ Limited | ✅ High |
| Debugging | ⚠️ Harder | ✅ Easier |
| Duplicate handling | ⚠️ Risky | ✅ Controlled |
Key Insight
- MERGE: requires sorted data, faster for large datasets
- JOIN: flexible, SQL-based, easier for complex logic
Key Takeaways (Sharp & Practical)
- Never reference non-existent datasets in SAS
- Always derive lookup tables explicitly
- MERGE requires sorted datasets and unique keys
- JOIN is flexible and easier to debug
- Use IN= flags to control merge logic
- Always validate row counts after a merge
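That last takeaway, validating row counts, is cheap to automate. A hedged sketch in Python of the kind of post-merge check meant here, assuming a one-to-one lookup (the row tuples are illustrative): the merged output must not have more rows than the input, and every row should have found a lookup value.

```python
# Illustrative pre- and post-merge row sets for a one-to-one lookup
clean_rows = [("T001",), ("T002",), ("T003",)]
merged_rows = [("T001", "MEDIUM"), ("T002", "HIGH"), ("T003", "LOW")]

# Check 1: a one-to-one merge must never multiply rows (duplicate-key symptom)
assert len(merged_rows) <= len(clean_rows), "merge multiplied rows: duplicate keys?"

# Check 2: every row should have picked up a lookup value
unmatched = [r for r in merged_rows if r[1] is None]
assert not unmatched, f"{len(unmatched)} rows missing a lookup value"
```

In SAS the same checks are typically done by comparing NOBS before and after, or with PROC FREQ on the key.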
6. 20 Additional Data Cleaning Best Practices
- Always validate key identifiers (USUBJID equivalent)
- Maintain audit trails for transformations
- Use controlled terminology (CDISC standards)
- Validate date formats strictly
- Avoid hardcoding logic
- Use metadata-driven programming
- Track derivations in Define.xml
- Perform double programming (QC)
- Validate joins/merges carefully
- Handle duplicates explicitly
- Normalize categorical variables
- Use formats for standardization
- Validate ranges (e.g., age 0–120)
- Log all transformations
- Avoid overwriting raw data
- Use macro variables for flexibility
- Implement validation checks
- Cross-check totals after aggregation
- Use PROC COMPARE for QC
- Document assumptions clearly
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business and domain logic. For instance, replacing missing values ensures continuity in analysis. If a visitor count is missing, imputing the median prevents skewed summaries. In clinical trials, a missing patient age could invalidate a subgroup analysis.
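As a concrete sketch of the median imputation mentioned above, here is the idea in plain Python; the list is the cleaned Visitors column from this article, with the Angkor Wat value missing:

```python
# Cleaned Visitors column; None is the missing Angkor Wat count
visitors = [50000, None, 2000, 80000, 20000, 45000, 60000, 70000, 55000, 90000]

# Compute the median of the known values
known = sorted(v for v in visitors if v is not None)
mid = len(known) // 2
median = known[mid] if len(known) % 2 else (known[mid - 1] + known[mid]) / 2

# Fill the gap with the median instead of dropping the row
imputed = [median if v is None else v for v in visitors]
```

With nine known values the median is 55000, so the missing count is filled without distorting the distribution's center; any imputation like this should still be justified and documented.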
Unrealistic values must be corrected because they distort analytics. A negative revenue value doesn’t make business sense; it may indicate a data-entry error or a system bug. Left uncorrected, it can lead to incorrect financial forecasting.
Similarly, date imputation is crucial. Suppose a visit date is missing; analysts may use the study midpoint or the last known visit. This ensures consistency in longitudinal analysis.
In temple datasets, incorrect city names or inconsistent country labels can break joins during aggregation. For example, "india" vs "INDIA" will create separate groups, leading to inaccurate regional insights.
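The "separate groups" problem is easy to demonstrate. A small Python illustration (the country list is a made-up sample): case variants of the same country form distinct groups until the text is normalized, which is exactly why the SAS and R pipelines apply UPCASE(STRIP()) and toupper(trimws()) before grouping or joining.

```python
# Case variants of the same country, plus one genuinely different country
countries = ["india", "INDIA", "India", "Nepal"]

raw_groups = set(countries)                                 # grouping as-is
normalized_groups = {c.strip().upper() for c in countries}  # grouping after cleanup
```

Ungrouped, four "countries" appear; after normalization only the true two remain.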
Ultimately, data cleaning ensures:
- Accuracy
- Consistency
- Reliability
Without it, even the most advanced analytics models produce misleading results.
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing values must be handled carefully
- Duplicate records distort analysis
- Text inconsistency breaks joins
- Dates must be validated
- Negative values need correction
- Merge requires sorted data
- SQL joins are flexible
- SAS ensures auditability
- R enables rapid transformation
- Data validation is mandatory
- Cleaning improves model accuracy
- Business rules drive cleaning
- Documentation is critical
- QC checks prevent errors
- Aggregation needs clean keys
- Imputation must be justified
- Always preserve raw data
- Clean data builds trust
9. Summary
This project demonstrated how a global temples dataset, when left uncleaned, can lead to analytical inconsistencies and misleading conclusions. We started with a deliberately flawed dataset containing missing values, invalid entries, duplicates, and inconsistent formatting, mirroring real-world data challenges.
Using SAS, we applied structured data-cleaning techniques in the DATA step, leveraging functions like ABS, UPCASE, and INPUT. We removed duplicates using PROC SORT NODUPKEY and performed aggregations with PROC SUMMARY. SAS proved highly reliable for enterprise-grade workflows, especially where audit trails and reproducibility are critical.
In R, we used dplyr functions like mutate, filter, and distinct to transform and refine the dataset. R’s syntax is expressive and efficient, making it ideal for exploratory data analysis and rapid prototyping.
A key highlight was understanding MERGE vs JOIN:
- MERGE is efficient but requires sorted datasets and strict key alignment
- JOIN (PROC SQL / dplyr joins) offers flexibility and is easier for complex relationships
Both approaches have their place, and choosing the right one depends on the use case.
Ultimately, the combination of SAS and R provides a powerful ecosystem for data cleaning and analysis. Clean data ensures accurate insights, better decision-making, and compliance with industry standards.
10. Conclusion
Data is often described as the “new oil,” but raw data, like crude oil, is unusable without refinement. This project reinforced a critical truth: data cleaning is not a preliminary step; it is the foundation of all analytics.
Through the lens of a global temples dataset, we explored how even simple inconsistencies, such as case sensitivity or duplicate records, can cascade into major analytical errors. In high-stakes environments such as clinical trials or financial reporting, these errors can have serious consequences.
SAS and R offer complementary strengths. SAS provides a robust, regulated framework ideal for production environments, while R delivers flexibility and speed for data exploration and transformation. Understanding when to use MERGE vs JOIN is essential for ensuring data integrity when combining datasets.
Moreover, the importance of business logic cannot be overstated. Data-cleaning decisions must align with domain knowledge, whether that means correcting a patient’s age in a clinical dataset or validating revenue figures in a business dataset.
As an aspiring or experienced data professional, your ability to clean, validate, and structure data will define the quality of your insights. Tools and techniques will evolve, but the principles of data integrity, consistency, and validation remain constant.
In the end, clean data is not just about accuracy; it’s about trust. And in analytics, trust is everything.
11. Interview Questions (Scenario-Based)
Q1: What is the difference between MERGE and JOIN in SAS?
Answer: MERGE requires sorted datasets and works row-wise. JOIN (PROC SQL) is more flexible and supports complex conditions.
Q2: How do you handle duplicate records?
Answer: Use PROC SORT NODUPKEY in SAS or distinct() in R, based on key variables.
Q3: What happens if you merge unsorted data?
Answer: SAS may stop with an error about improperly sorted BY variables, or produce incorrect results; data integrity is compromised.
Q4: How do you validate cleaned data?
Answer: Use PROC COMPARE, frequency checks, and cross-validation against the raw data.
Q5: R vs SAS for data cleaning?
Answer: R is flexible and fast; SAS is structured and audit-compliant. Choose based on project needs.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The temples dataset and analysis in this article are created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------