When Bombay Meets Mumbai: Transforming Inconsistent City Data into Analytical Gold
Cities Reborn, Data Refined: Turning Chaotic Global City Name Changes into Analytical Intelligence with SAS & R
1. Introduction
Imagine you’re working on a global pharmaceutical trial. Patient data is coming in from multiple countries: India, the USA, Japan, Russia. Everything looks fine until you run a regional analysis. Suddenly, your output shows Madras and Chennai as two different locations. Bombay appears separately from Mumbai. Worse, some records show “NULL,” others “unknown,” and some are blank.
Now your regional efficacy results are split across duplicate geographies.
That’s not just messy; it’s dangerous.
In
clinical trials, business intelligence, or financial systems, bad data leads
to:
- Incorrect aggregations
- Misleading trends
- Faulty decision-making
- Regulatory risks
This is
where SAS and R become critical. SAS provides structured,
enterprise-grade data processing (ideal for clinical environments), while R
offers flexible, modern data wrangling capabilities.
In this
project, we simulate a global dataset of old vs new city names,
intentionally inject errors, and then clean, transform, and combine datasets
using:
- SAS DATA Step vs PROC SQL
(MERGE vs JOIN)
- R dplyr workflows
2. Raw Data Creation (With Intentional Errors)
SAS Code (Raw Dataset)
data city_raw;
length Old_Name $20 New_Name $20 Country $15 Region $15
Update_Date $12 Source $20 Status $10;
input ID Old_Name $ New_Name $ Country $ Region $ Update_Date $
Population Source $ Status $;
datalines;
1 Bombay Mumbai India West 2020-01-01 20000000 govt active
2 Madras Chennai India South 2019-05-10 11000000 govt active
3 Calcutta Kolkata India East 2018-07-21 15000000 govt active
4 NULL Delhi India North 2021-02-01 18000000 private active
5 Peking Beijing China North 2017-06-15 21000000 govt active
6 Saigon HoChiMinh Vietnam South 2016-03-12 9000000 govt active
7 Leningrad StPetersburg Russia West 2015-09-30 5000000 govt active
8 Bombay Mumbai India West 2020-01-01 20000000 govt active
9 -999 Karachi Pakistan South 2018-11-11 14000000 govt inactive
10 Edo Tokyo Japan East wrongdate 13000000 govt active
;
run;
proc print data = city_raw;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Country | Region | Update_Date | Source | Status | ID | Population |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Bombay | Mumbai | India | West | 2020-01-01 | govt | active | 1 | 20000000 |
| 2 | Madras | Chennai | India | South | 2019-05-10 | govt | active | 2 | 11000000 |
| 3 | Calcutta | Kolkata | India | East | 2018-07-21 | govt | active | 3 | 15000000 |
| 4 | NULL | Delhi | India | North | 2021-02-01 | private | active | 4 | 18000000 |
| 5 | Peking | Beijing | China | North | 2017-06-15 | govt | active | 5 | 21000000 |
| 6 | Saigon | HoChiMinh | Vietnam | South | 2016-03-12 | govt | active | 6 | 9000000 |
| 7 | Leningrad | StPetersburg | Russia | West | 2015-09-30 | govt | active | 7 | 5000000 |
| 8 | Bombay | Mumbai | India | West | 2020-01-01 | govt | active | 8 | 20000000 |
| 9 | -999 | Karachi | Pakistan | South | 2018-11-11 | govt | inactive | 9 | 14000000 |
| 10 | Edo | Tokyo | Japan | East | wrongdate | govt | active | 10 | 13000000 |
Explanation (SAS Raw Data)
This
dataset intentionally includes:
- Missing values (NULL)
- Invalid numeric (-999)
- Wrong date format
(wrongdate)
- Duplicate records (Bombay →
Mumbai)
- Inconsistent naming
This mirrors real-world ingestion issues. Notice the use of LENGTH: it is critical for preventing truncation. Without it, list input defaults character variables to 8 bytes, and SAS silently truncates longer strings.
R Code – Equivalent Raw Dataset
options(scipen=999)
city_raw <- data.frame(
ID = 1:10,
Old_Name = c("Bombay","Madras","Calcutta","NULL","Peking","Saigon",
"Leningrad","Bombay","-999","Edo"),
New_Name = c("Mumbai","Chennai","Kolkata","Delhi","Beijing",
"HoChiMinh","StPetersburg","Mumbai","Karachi","Tokyo"),
Country = c("India","India","India","India","China","Vietnam","Russia",
"India","Pakistan","Japan"),
Region = c("West","South","East","North","North","South","West","West",
"South","East"),
Update_Date = c("2020-01-01","2019-05-10","2018-07-21","2021-02-01",
"2017-06-15","2016-03-12","2015-09-30","2020-01-01",
"2018-11-11","wrongdate"),
Population = c(20000000,11000000,15000000,18000000,21000000,9000000,
5000000,20000000,14000000,13000000),
Source = c("govt","govt","govt","private","govt","govt","govt","govt",
"govt","govt"),
Status = c("active","active","active","active","active","active","active",
"active","inactive","active")
)
OUTPUT:
|   | ID | Old_Name | New_Name | Country | Region | Update_Date | Population | Source | Status |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Bombay | Mumbai | India | West | 2020-01-01 | 20000000 | govt | active |
| 2 | 2 | Madras | Chennai | India | South | 2019-05-10 | 11000000 | govt | active |
| 3 | 3 | Calcutta | Kolkata | India | East | 2018-07-21 | 15000000 | govt | active |
| 4 | 4 | NULL | Delhi | India | North | 2021-02-01 | 18000000 | private | active |
| 5 | 5 | Peking | Beijing | China | North | 2017-06-15 | 21000000 | govt | active |
| 6 | 6 | Saigon | HoChiMinh | Vietnam | South | 2016-03-12 | 9000000 | govt | active |
| 7 | 7 | Leningrad | StPetersburg | Russia | West | 2015-09-30 | 5000000 | govt | active |
| 8 | 8 | Bombay | Mumbai | India | West | 2020-01-01 | 20000000 | govt | active |
| 9 | 9 | -999 | Karachi | Pakistan | South | 2018-11-11 | 14000000 | govt | inactive |
| 10 | 10 | Edo | Tokyo | Japan | East | wrongdate | 13000000 | govt | active |
Explanation (R Raw Data)
R’s
data.frame() mirrors SAS input but allows more flexibility. However:
- Strings like
"NULL" are not automatically treated as missing
- "wrongdate"
remains a string
- No strict typing enforcement
like SAS
This
flexibility is powerful but dangerous if not controlled.
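A practical way to tame that flexibility is to convert the sentinel strings into real NA values immediately after load. A minimal base-R sketch (the small stand-in frame and the sentinel list are illustrative, not the full city_raw):

```r
# Tiny stand-in frame with the same pitfalls as city_raw above
city_checked <- data.frame(
  Old_Name    = c("Bombay", "NULL", "-999", "Edo"),
  Update_Date = c("2020-01-01", "2021-02-01", "2018-11-11", "wrongdate"),
  stringsAsFactors = FALSE
)

# Treat sentinel strings as genuine missing values
sentinels <- c("NULL", "-999", "unknown", "")
city_checked$Old_Name[city_checked$Old_Name %in% sentinels] <- NA

# Coerce the date column strictly; invalid strings become NA, not text
city_checked$Update_Date <- as.Date(city_checked$Update_Date, format = "%Y-%m-%d")

# Audit how much was lost to bad input
na_counts <- colSums(is.na(city_checked))
```

Once the sentinels are true NAs, is.na() and na.rm= behave correctly everywhere downstream.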
Section 1: Phase 1 – The SAS Engineering Layer
Data Cleaning in SAS
data city_clean;
set city_raw;
/* Fix missing names */
if strip(upcase(Old_Name)) in ("NULL","-999") then Old_Name = "";
Old_Name = coalescec(Old_Name, "UNKNOWN");
/* Handle invalid numeric */
if Population < 0 then Population = abs(Population);
/* Date conversion */
New_Date = input(Update_Date, ?? yymmdd10.);
format New_Date yymmdd10.;
/* Standardize text */
Old_Name = upcase(strip(Old_Name));
New_Name = propcase(strip(New_Name));
/* Category assignment */
select(Status);
when("active") Flag = 1;
when("inactive") Flag = 0;
otherwise Flag = .;
end;
run;
proc print data = city_clean;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Country | Region | Update_Date | Source | Status | ID | Population | New_Date | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BOMBAY | Mumbai | India | West | 2020-01-01 | govt | active | 1 | 20000000 | 2020-01-01 | 1 |
| 2 | MADRAS | Chennai | India | South | 2019-05-10 | govt | active | 2 | 11000000 | 2019-05-10 | 1 |
| 3 | CALCUTTA | Kolkata | India | East | 2018-07-21 | govt | active | 3 | 15000000 | 2018-07-21 | 1 |
| 4 | UNKNOWN | Delhi | India | North | 2021-02-01 | private | active | 4 | 18000000 | 2021-02-01 | 1 |
| 5 | PEKING | Beijing | China | North | 2017-06-15 | govt | active | 5 | 21000000 | 2017-06-15 | 1 |
| 6 | SAIGON | Hochiminh | Vietnam | South | 2016-03-12 | govt | active | 6 | 9000000 | 2016-03-12 | 1 |
| 7 | LENINGRAD | Stpetersburg | Russia | West | 2015-09-30 | govt | active | 7 | 5000000 | 2015-09-30 | 1 |
| 8 | BOMBAY | Mumbai | India | West | 2020-01-01 | govt | active | 8 | 20000000 | 2020-01-01 | 1 |
| 9 | UNKNOWN | Karachi | Pakistan | South | 2018-11-11 | govt | inactive | 9 | 14000000 | 2018-11-11 | 0 |
| 10 | EDO | Tokyo | Japan | East | wrongdate | govt | active | 10 | 13000000 | . | 1 |
Explanation
- COALESCEC() fills missing character values
- ABS() fixes invalid numeric values
- INPUT() converts string → date
- SELECT-WHEN is cleaner than IF-ELSE for categorical logic
- STRIP + UPCASE ensures standardization
- Converting "NULL" to blank first is what lets COALESCEC() substitute "UNKNOWN"
This is defensive SAS programming. The modifiers on the INPUT() informat:
- ? suppresses the invalid-data note in the log
- ?? suppresses the note and also keeps the automatic variable _ERROR_ at 0
👉 If "wrongdate" appears:
- No log clutter
- The value safely becomes missing (.)
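The R side of this idiom: as.Date() already behaves like the ?? modifier, quietly turning unparseable strings into NA, so the defensive step is to count those silent failures explicitly. A small sketch:

```r
raw_dates <- c("2020-01-01", "wrongdate", "2018-11-11")
parsed    <- as.Date(raw_dates, format = "%Y-%m-%d")

# Flag values that were present but failed to parse -- the R analogue
# of checking the log (or _ERROR_) after INPUT() in SAS
failed   <- !is.na(raw_dates) & is.na(parsed)
n_failed <- sum(failed)
```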
/* Remove duplicates */
proc sort data=city_clean nodupkey;
by Old_Name New_Name Country;
run;
proc print data = city_clean;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Country | Region | Update_Date | Source | Status | ID | Population | New_Date | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BOMBAY | Mumbai | India | West | 2020-01-01 | govt | active | 1 | 20000000 | 2020-01-01 | 1 |
| 2 | CALCUTTA | Kolkata | India | East | 2018-07-21 | govt | active | 3 | 15000000 | 2018-07-21 | 1 |
| 3 | EDO | Tokyo | Japan | East | wrongdate | govt | active | 10 | 13000000 | . | 1 |
| 4 | LENINGRAD | Stpetersburg | Russia | West | 2015-09-30 | govt | active | 7 | 5000000 | 2015-09-30 | 1 |
| 5 | MADRAS | Chennai | India | South | 2019-05-10 | govt | active | 2 | 11000000 | 2019-05-10 | 1 |
| 6 | PEKING | Beijing | China | North | 2017-06-15 | govt | active | 5 | 21000000 | 2017-06-15 | 1 |
| 7 | SAIGON | Hochiminh | Vietnam | South | 2016-03-12 | govt | active | 6 | 9000000 | 2016-03-12 | 1 |
| 8 | UNKNOWN | Delhi | India | North | 2021-02-01 | private | active | 4 | 18000000 | 2021-02-01 | 1 |
| 9 | UNKNOWN | Karachi | Pakistan | South | 2018-11-11 | govt | inactive | 9 | 14000000 | 2018-11-11 | 0 |
Explanation
NODUPKEY keeps only the first record for each combination of the BY variables (here Old_Name, New_Name, Country). This is critical in preventing double counting.
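The base-R counterpart uses duplicated() on the same key columns. A minimal sketch:

```r
df <- data.frame(
  Old_Name = c("BOMBAY", "BOMBAY", "MADRAS"),
  New_Name = c("Mumbai", "Mumbai", "Chennai"),
  Country  = c("India", "India", "India"),
  stringsAsFactors = FALSE
)

# Keep the first row per Old_Name/New_Name/Country combination,
# mirroring PROC SORT NODUPKEY with the same BY variables
keys    <- df[c("Old_Name", "New_Name", "Country")]
deduped <- df[!duplicated(keys), ]
```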
MERGE vs JOIN (SAS)
data region_map;
input Country $ Region_Code $;
datalines;
India IN
China CN
Japan JP
Pakistan PAK
Russia Rus
Vietnam Viet
;
run;
proc print data = region_map;
run;
OUTPUT:
| Obs | Country | Region_Code |
|---|---|---|
| 1 | India | IN |
| 2 | China | CN |
| 3 | Japan | JP |
| 4 | Pakistan | PAK |
| 5 | Russia | Rus |
| 6 | Vietnam | Viet |
/*Sorting Before Merging*/
proc sort data=region_map;
by Country;
run;
proc print data = region_map;
run;
OUTPUT:
| Obs | Country | Region_Code |
|---|---|---|
| 1 | China | CN |
| 2 | India | IN |
| 3 | Japan | JP |
| 4 | Pakistan | PAK |
| 5 | Russia | Rus |
| 6 | Vietnam | Viet |
proc sort data=city_clean;
by Country;
run;
proc print data = city_clean;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Country | Region | Update_Date | Source | Status | ID | Population | New_Date | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PEKING | Beijing | China | North | 2017-06-15 | govt | active | 5 | 21000000 | 2017-06-15 | 1 |
| 2 | BOMBAY | Mumbai | India | West | 2020-01-01 | govt | active | 1 | 20000000 | 2020-01-01 | 1 |
| 3 | CALCUTTA | Kolkata | India | East | 2018-07-21 | govt | active | 3 | 15000000 | 2018-07-21 | 1 |
| 4 | MADRAS | Chennai | India | South | 2019-05-10 | govt | active | 2 | 11000000 | 2019-05-10 | 1 |
| 5 | UNKNOWN | Delhi | India | North | 2021-02-01 | private | active | 4 | 18000000 | 2021-02-01 | 1 |
| 6 | EDO | Tokyo | Japan | East | wrongdate | govt | active | 10 | 13000000 | . | 1 |
| 7 | UNKNOWN | Karachi | Pakistan | South | 2018-11-11 | govt | inactive | 9 | 14000000 | 2018-11-11 | 0 |
| 8 | LENINGRAD | Stpetersburg | Russia | West | 2015-09-30 | govt | active | 7 | 5000000 | 2015-09-30 | 1 |
| 9 | SAIGON | Hochiminh | Vietnam | South | 2016-03-12 | govt | active | 6 | 9000000 | 2016-03-12 | 1 |
/* MERGE */
data merged_data;
merge city_clean(in=a) region_map(in=b);
by Country;
if a;
run;
proc print data = merged_data;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Country | Region | Update_Date | Source | Status | ID | Population | New_Date | Flag | Region_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PEKING | Beijing | China | North | 2017-06-15 | govt | active | 5 | 21000000 | 2017-06-15 | 1 | CN |
| 2 | BOMBAY | Mumbai | India | West | 2020-01-01 | govt | active | 1 | 20000000 | 2020-01-01 | 1 | IN |
| 3 | CALCUTTA | Kolkata | India | East | 2018-07-21 | govt | active | 3 | 15000000 | 2018-07-21 | 1 | IN |
| 4 | MADRAS | Chennai | India | South | 2019-05-10 | govt | active | 2 | 11000000 | 2019-05-10 | 1 | IN |
| 5 | UNKNOWN | Delhi | India | North | 2021-02-01 | private | active | 4 | 18000000 | 2021-02-01 | 1 | IN |
| 6 | EDO | Tokyo | Japan | East | wrongdate | govt | active | 10 | 13000000 | . | 1 | JP |
| 7 | UNKNOWN | Karachi | Pakistan | South | 2018-11-11 | govt | inactive | 9 | 14000000 | 2018-11-11 | 0 | PAK |
| 8 | LENINGRAD | Stpetersburg | Russia | West | 2015-09-30 | govt | active | 7 | 5000000 | 2015-09-30 | 1 | Rus |
| 9 | SAIGON | Hochiminh | Vietnam | South | 2016-03-12 | govt | active | 6 | 9000000 | 2016-03-12 | 1 | Viet |
/* PROC SQL JOIN */
proc sql;
create table joined_data as
select a.*, b.Region_Code
from city_clean a
left join region_map b
on a.Country = b.Country
order by ID;
quit;
proc print data = joined_data;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Country | Region | Update_Date | Source | Status | ID | Population | New_Date | Flag | Region_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BOMBAY | Mumbai | India | West | 2020-01-01 | govt | active | 1 | 20000000 | 2020-01-01 | 1 | IN |
| 2 | MADRAS | Chennai | India | South | 2019-05-10 | govt | active | 2 | 11000000 | 2019-05-10 | 1 | IN |
| 3 | CALCUTTA | Kolkata | India | East | 2018-07-21 | govt | active | 3 | 15000000 | 2018-07-21 | 1 | IN |
| 4 | UNKNOWN | Delhi | India | North | 2021-02-01 | private | active | 4 | 18000000 | 2021-02-01 | 1 | IN |
| 5 | PEKING | Beijing | China | North | 2017-06-15 | govt | active | 5 | 21000000 | 2017-06-15 | 1 | CN |
| 6 | SAIGON | Hochiminh | Vietnam | South | 2016-03-12 | govt | active | 6 | 9000000 | 2016-03-12 | 1 | Viet |
| 7 | LENINGRAD | Stpetersburg | Russia | West | 2015-09-30 | govt | active | 7 | 5000000 | 2015-09-30 | 1 | Rus |
| 8 | UNKNOWN | Karachi | Pakistan | South | 2018-11-11 | govt | inactive | 9 | 14000000 | 2018-11-11 | 0 | PAK |
| 9 | EDO | Tokyo | Japan | East | wrongdate | govt | active | 10 | 13000000 | . | 1 | JP |
Explanation
- MERGE requires sorted data and
works row-wise
- JOIN is flexible and does not
require sorting
- SQL JOIN is preferred for
complex relationships
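The equivalent lookup in R needs no pre-sorting either. A sketch with base merge() (dplyr::left_join(city, region_map, by = "Country") reads the same way):

```r
city <- data.frame(
  Country  = c("India", "China", "Japan"),
  New_Name = c("Mumbai", "Beijing", "Tokyo"),
  stringsAsFactors = FALSE
)
region_map <- data.frame(
  Country     = c("China", "India", "Japan"),
  Region_Code = c("CN", "IN", "JP"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every city row even without a match -- a LEFT JOIN,
# with no sorting required, unlike a DATA step MERGE
joined <- merge(city, region_map, by = "Country", all.x = TRUE)
```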
Section 2: Phase 2 – R Refinement Layer
library(dplyr)
city_clean <- city_raw %>%
mutate(
Old_Name = ifelse(Old_Name %in% c("NULL","-999"), "UNKNOWN", Old_Name),
Population = ifelse(Population < 0, abs(Population), Population),
Update_Date = as.Date(Update_Date, format="%Y-%m-%d"),
Old_Name = toupper(trimws(Old_Name)),
New_Name = tools::toTitleCase(trimws(New_Name)),
Flag = case_when(
Status == "active" ~ 1,
Status == "inactive" ~ 0,
TRUE ~ NA_real_
)
) %>%
distinct(Old_Name, New_Name, Country, .keep_all = TRUE)
OUTPUT:
|   | ID | Old_Name | New_Name | Country | Region | Update_Date | Population | Source | Status | Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | BOMBAY | Mumbai | India | West | 2020-01-01 | 20000000 | govt | active | 1 |
| 2 | 2 | MADRAS | Chennai | India | South | 2019-05-10 | 11000000 | govt | active | 1 |
| 3 | 3 | CALCUTTA | Kolkata | India | East | 2018-07-21 | 15000000 | govt | active | 1 |
| 4 | 4 | UNKNOWN | Delhi | India | North | 2021-02-01 | 18000000 | private | active | 1 |
| 5 | 5 | PEKING | Beijing | China | North | 2017-06-15 | 21000000 | govt | active | 1 |
| 6 | 6 | SAIGON | HoChiMinh | Vietnam | South | 2016-03-12 | 9000000 | govt | active | 1 |
| 7 | 7 | LENINGRAD | StPetersburg | Russia | West | 2015-09-30 | 5000000 | govt | active | 1 |
| 8 | 9 | UNKNOWN | Karachi | Pakistan | South | 2018-11-11 | 14000000 | govt | inactive | 0 |
| 9 | 10 | EDO | Tokyo | Japan | East | NA | 13000000 | govt | active | 1 |
Logic Bridge (SAS vs R)
- mutate() = DATA step transformation
- case_when() = SELECT-WHEN
- distinct() = PROC SORT NODUPKEY
- as.Date() = INPUT()
- trimws() before as.Date() removes hidden spaces and avoids conversion failures
- %>% requires the dplyr library
- %in% checks multiple values in one test and is safer than chained OR conditions
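That last point deserves a short demonstration: %in% tests membership against the whole sentinel vector at once, so the rule lives in one place as the list grows.

```r
old_names <- c("Bombay", "NULL", "-999", "Madras")

# Chained ORs: verbose, and easy to forget a value when the list grows
bad_or <- old_names == "NULL" | old_names == "-999"

# %in%: one vectorized membership test against a named sentinel list
sentinels <- c("NULL", "-999")
bad_in <- old_names %in% sentinels
```

Both produce the same logical vector here, but only the %in% version stays correct by editing a single list.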
Section 3: Business Logic & The “Why”
In a
financial system, if Mumbai and Bombay are treated separately:
- Revenue aggregation splits
- Regional performance becomes
inaccurate
- Decision-makers misinterpret
market size
In
clinical trials:
- Duplicate site names distort
patient counts
- Regulatory submissions fail
validation
A single
missing value treated incorrectly can lead to:
- Wrong dosage recommendations
- Financial misreporting
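The danger of treating missing as zero is easy to show numerically (the lab values below are illustrative, not from the dataset above):

```r
lab_values <- c(5.1, NA, 4.8, 5.3)

# Correct: exclude the missing reading from the summary
mean_ok <- mean(lab_values, na.rm = TRUE)

# Wrong: recoding NA to 0 drags the mean down and fabricates an abnormality
lab_as_zero <- ifelse(is.na(lab_values), 0, lab_values)
mean_bad <- mean(lab_as_zero)
```

mean_ok is about 5.07 while mean_bad falls to 3.8, a swing large enough to flip a clinical conclusion.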
Section 4: 20 Key Implementation Practices
- Always define LENGTH in SAS
- Standardize before merging
- Never trust raw input
- Deduplicate early
- Use audit logs
- Validate joins
- Avoid implicit conversions
- Check missing explicitly
- Use consistent casing
- Separate raw and clean
layers
- Validate dates
- Flag anomalies
- Avoid hardcoding
- Use reusable macros
- Test edge cases
- Use metadata-driven design
- Validate counts before/after
joins
- Keep backup datasets
- Document transformations
- Automate QC checks
3. Phase 3: Extended SAS Analysis
data file_import;
infile datalines dlm=',' firstobs=2;
input Old_Name $ New_Name $ Population;
datalines;
Bombay,Mumbai,20000000
Madras,Chennai,11000000
Calcutta,Kolkata,15000000
Peking,Beijing,21000000
Saigon,HoChiMinh,9000000
Leningrad,StPetersburg,5000000
Bombay,Mumbai,20000000
Edo,Tokyo,13000000
;
run;
proc print data = file_import;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Population |
|---|---|---|---|
| 1 | Madras | Chennai | 11000000 |
| 2 | Calcutta | Kolkata | 15000000 |
| 3 | Peking | Beijing | 21000000 |
| 4 | Saigon | HoChiMin | 9000000 |
| 5 | Leningra | StPeters | 5000000 |
| 6 | Bombay | Mumbai | 20000000 |
| 7 | Edo | Tokyo | 13000000 |
/* Flag high population */
data flagged;
set file_import;
if Population > 15000000 then High_Flag = 1;
else High_Flag = 0;
run;
proc print data = flagged;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Population | High_Flag |
|---|---|---|---|---|
| 1 | Madras | Chennai | 11000000 | 0 |
| 2 | Calcutta | Kolkata | 15000000 | 0 |
| 3 | Peking | Beijing | 21000000 | 1 |
| 4 | Saigon | HoChiMin | 9000000 | 0 |
| 5 | Leningra | StPeters | 5000000 | 0 |
| 6 | Bombay | Mumbai | 20000000 | 1 |
| 7 | Edo | Tokyo | 13000000 | 0 |
/* Aggregation */
proc means data=flagged;
var Population;
class High_Flag;
run;
OUTPUT:
The MEANS Procedure
Analysis Variable: Population
| High_Flag | N Obs | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|---|
| 0 | 5 | 5 | 10600000.00 | 3847076.81 | 5000000.00 | 15000000.00 |
| 1 | 2 | 2 | 20500000.00 | 707106.78 | 20000000.00 | 21000000.00 |
/* Deduplication */
proc sort data=flagged nodupkey;
by Old_Name;
run;
proc print data=flagged;
run;
OUTPUT:
| Obs | Old_Name | New_Name | Population | High_Flag |
|---|---|---|---|---|
| 1 | Bombay | Mumbai | 20000000 | 1 |
| 2 | Calcutta | Kolkata | 15000000 | 0 |
| 3 | Edo | Tokyo | 13000000 | 0 |
| 4 | Leningra | StPeters | 5000000 | 0 |
| 5 | Madras | Chennai | 11000000 | 0 |
| 6 | Peking | Beijing | 21000000 | 1 |
| 7 | Saigon | HoChiMin | 9000000 | 0 |
Explanation
This mini-pipeline covers the full cycle:
- Import
- Transform
- Aggregate
- Deduplicate
- Report
Note two pitfalls visible in the import step: FIRSTOBS=2 applies to the datalines, so the first record (Bombay) is skipped on read, and without a LENGTH statement list input defaults character variables to 8 bytes, truncating "HoChiMinh" to "HoChiMin" and "Leningrad" to "Leningra" in the output above.
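For comparison, the flag-and-aggregate steps above map onto a few lines of R (base aggregate() here; a dplyr group_by()/summarise() pipeline would read similarly):

```r
cities <- data.frame(
  New_Name   = c("Chennai", "Kolkata", "Beijing", "Mumbai", "Tokyo"),
  Population = c(11000000, 15000000, 21000000, 20000000, 13000000)
)

# Flag high-population cities, mirroring the IF/ELSE DATA step
cities$High_Flag <- ifelse(cities$Population > 15000000, 1, 0)

# Mean population per flag, mirroring PROC MEANS with CLASS High_Flag
by_flag <- aggregate(Population ~ High_Flag, data = cities, FUN = mean)
```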
4. 20 Additional Data Cleaning Best Practices
- Follow CDISC standards
- Maintain SDTM compliance
- Validate ADaM derivations
- Track lineage
- Ensure reproducibility
- Maintain audit trails
- Validate controlled
terminology
- Perform double programming
- Use version control
- Validate joins
- Cross-check outputs
- Use QC datasets
- Document assumptions
- Handle partial dates
- Normalize units
- Validate ranges
- Track missing patterns
- Ensure traceability
- Maintain regulatory logs
- Perform independent review
5. Business Logic Behind Data Cleaning
Data cleaning is not cosmetic; it directly impacts decision-making. Missing values are replaced to ensure continuity in analysis. For example, if a patient’s age is missing, imputing it from the median prevents bias in demographic summaries. Similarly, unrealistic values such as a negative population or age must be corrected, for example with ABS(), because they distort statistical distributions.
Date
imputation is critical in longitudinal studies. If a visit date is missing,
incorrect sequencing may lead to wrong conclusions about treatment efficacy. In
financial datasets, salary normalization ensures comparability across regions.
Incorrect
handling of missing values (e.g., treating NULL as zero) can significantly bias
results. For instance, in clinical trials, a missing lab value treated as zero
might falsely indicate abnormality.
Thus,
data cleaning ensures:
- Accuracy
- Consistency
- Regulatory compliance
- Reliable analytics
6. 20 Sharp Insights
- Dirty data leads to wrong
conclusions
- Standardization ensures
reproducibility
- Missing ≠ zero
- Deduplication prevents
inflation
- Dates drive timelines
- Validation prevents
disasters
- Merge carefully
- Always audit transformations
- Clean before analysis
- Consistency is key
- Logic errors are silent
killers
- Documentation is power
- Automation saves time
- QC is non-negotiable
- Data lineage matters
- Format before logic
- Always test edge cases
- Clean data builds trust
- Reproducibility is science
- Accuracy beats speed
7. Summary
SAS and R
both provide powerful frameworks for data cleaning, but they serve different
operational philosophies. SAS excels in structured, enterprise-level
processing, making it ideal for regulated industries like pharmaceuticals. Its
DATA step offers deterministic execution, while PROC SQL enables flexible joins
and aggregations.
R, on the
other hand, thrives in exploratory and flexible environments. Packages like
dplyr simplify transformations, while stringr enhances text cleaning
capabilities. The “tidyverse” ecosystem allows analysts to write expressive,
readable pipelines.
In this
project, we demonstrated:
- How messy global city name
data can distort analytics
- How SAS handles structured
cleaning using DATA steps, SELECT-WHEN, and PROC SORT
- How R achieves similar
outcomes using mutate(), case_when(), and distinct()
- The difference between MERGE
(row-wise, sorted) and JOIN (relational, flexible)
The key
takeaway:
SAS ensures reliability and compliance, while R provides agility and speed.
A hybrid
approach is often the best strategy in real-world projects.
8. Conclusion
Data is
only as valuable as its quality. Whether you are analyzing global city
transformations, financial transactions, or clinical trial outcomes, the
underlying principle remains the same: clean, structured, and validated data
is the foundation of trustworthy analytics.
Through
this project, we explored how seemingly simple inconsistencies like “Bombay” vs
“Mumbai” can cascade into major analytical errors. These issues are not
hypothetical; they occur daily in real-world systems. Without proper cleaning
frameworks, organizations risk making flawed decisions that can impact revenue,
compliance, and even human lives.
SAS
provides a robust, auditable environment where every transformation is
controlled and reproducible. This is why it remains the gold standard in
clinical and regulatory domains. R complements this by offering flexibility and
speed, allowing analysts to iterate quickly and explore data deeply.
The
combination of:
- Structured SAS pipelines
- Agile R workflows
- Strong business logic
- Rigorous validation
creates a
powerful data engineering ecosystem.
Ultimately, data cleaning is not a preliminary step; it is a core analytical function. Investing time in cleaning ensures accuracy, builds trust, and enables scalable analytics.
9. Interview Questions
1. Difference between MERGE and JOIN in SAS?
Answer: MERGE works row-wise and
requires sorting; JOIN is relational and flexible.
2. How do you handle missing values in SAS?
Answer: Use COALESCEC for character and
IF conditions for numeric.
3. R equivalent of SELECT-WHEN?
Answer: case_when()
4. What happens if LENGTH is not defined?
Answer: SAS may truncate variables,
causing data loss.
5. Real-world debugging scenario?
Answer: Duplicate city names causing
double counting resolved using PROC SORT NODUPKEY and standardization.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only; the city name data used throughout is illustrative.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------