Capitals, Chaos & Clean Code: Transforming a Global Countries Dataset into Analytical Gold with SAS & R
1. Introduction
Imagine you are working on a global analytics project for a multinational organization. The dataset contains country names, capital cities, population, GDP, and region classifications. It looks clean at first glance, but once you start analyzing, things fall apart.
You notice countries like "india", "INDIA", and "India " treated as separate entities. Some capitals are missing. Population values include negative numbers. GDP fields contain text like "NULL". Dates are incorrectly formatted. Duplicate rows silently inflate your statistics.
This is not just inconvenient; it's dangerous.

In industries like clinical trials or financial analytics, poor data quality leads to flawed decisions. Imagine a clinical dataset where patient age is -5 or treatment dates are reversed. The consequences can be regulatory rejection or incorrect conclusions.

This is where SAS and R become critical tools. They don't just analyze data; they transform chaos into reliability.
In this blog, we will:
- Create a messy dataset of popular countries and capitals
- Clean it using SAS and R
- Explore PROC MEANS vs PROC SUMMARY
- Apply real-world data cleaning logic
- Make it interview-ready
2. Raw Data Creation in SAS and R
SAS Code (With Intentional Errors)
DATA countries_raw;
INFILE DATALINES DLM=',' MISSOVER;
LENGTH Country $20 Capital $20 Region $15;
INPUT Country $ Capital $ Population GDP Year Region $;
DATALINES;
India,New Delhi,1400000000,3.5,2023,asia
USA,Washington,330000000,21.4,2023,North America
india,New Delhi,.,3.5,2023,ASIA
UK,London,-67000000,2.8,2023,Europe
France,Paris,65000000,NULL,2023,Europe
Germany,Berlin,83000000,3.8,2023,europe
Japan,Tokyo,125000000,5.1,2023,Asia
China,Beijing,1400000000,14.7,2023,asia
Brazil,Brasilia,212000000,1.8,2023,South America
India,New Delhi,1400000000,3.5,2023,asia
Canada,Ottawa,38000000,1.9,2023,North America
Australia,Canberra,26000000,1.6,2023,Australia
;
RUN;
PROC PRINT DATA = countries_raw;
RUN;
OUTPUT:
| Obs | Country | Capital | Region | Population | GDP | Year |
|---|---|---|---|---|---|---|
| 1 | India | New Delhi | asia | 1400000000 | 3.5 | 2023 |
| 2 | USA | Washington | North America | 330000000 | 21.4 | 2023 |
| 3 | india | New Delhi | ASIA | . | 3.5 | 2023 |
| 4 | UK | London | Europe | -67000000 | 2.8 | 2023 |
| 5 | France | Paris | Europe | 65000000 | . | 2023 |
| 6 | Germany | Berlin | europe | 83000000 | 3.8 | 2023 |
| 7 | Japan | Tokyo | Asia | 125000000 | 5.1 | 2023 |
| 8 | China | Beijing | asia | 1400000000 | 14.7 | 2023 |
| 9 | Brazil | Brasilia | South America | 212000000 | 1.8 | 2023 |
| 10 | India | New Delhi | asia | 1400000000 | 3.5 | 2023 |
| 11 | Canada | Ottawa | North America | 38000000 | 1.9 | 2023 |
| 12 | Australia | Canberra | Australia | 26000000 | 1.6 | 2023 |
Explanation
This dataset intentionally introduces real-world issues. Missing values appear as . and as the text "NULL". A negative population value (-67000000) simulates a data-entry error. Duplicate records (India repeated) inflate metrics. Text inconsistencies (asia, ASIA, Europe, europe) cause grouping issues, and mixed casing in country names leads to duplicate categories during aggregation. The dataset mimics real-world ingestion problems from multiple systems. INFILE DATALINES enables inline data creation, while MISSOVER prevents the input pointer from moving to the next line when fields are missing. Note that the text "NULL" in the numeric GDP field triggers an invalid-data note in the log and is read as missing, which is why France's GDP appears as . in the output. This raw dataset forms the foundation for demonstrating structured cleaning using SAS procedures and functions.
R Code – Equivalent Raw Dataset
options(scipen = 999)
countries_raw <- data.frame(
Country = c("India","USA","india","UK","France","Germany","Japan",
"China","Brazil","India","Canada","Australia"),
Capital = c("New Delhi","Washington","New Delhi","London","Paris",
"Berlin","Tokyo","Beijing","Brasilia","New Delhi",
"Ottawa","Canberra"),
Population = c(1400000000,330000000,NA,-67000000,65000000,83000000,
125000000,1400000000,212000000,1400000000,38000000,
26000000),
GDP = c(3.5,21.4,3.5,2.8,NA,3.8,5.1,14.7,1.8,3.5,1.9,1.6),
Year = c(2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,
2023,2023),
Region = c("asia","North America","ASIA","Europe","Europe","europe",
"Asia","asia","South America","asia","North America","Australia")
)
OUTPUT:
| | Country | Capital | Population | GDP | Year | Region |
|---|---|---|---|---|---|---|
| 1 | India | New Delhi | 1400000000 | 3.5 | 2023 | asia |
| 2 | USA | Washington | 330000000 | 21.4 | 2023 | North America |
| 3 | india | New Delhi | NA | 3.5 | 2023 | ASIA |
| 4 | UK | London | -67000000 | 2.8 | 2023 | Europe |
| 5 | France | Paris | 65000000 | NA | 2023 | Europe |
| 6 | Germany | Berlin | 83000000 | 3.8 | 2023 | europe |
| 7 | Japan | Tokyo | 125000000 | 5.1 | 2023 | Asia |
| 8 | China | Beijing | 1400000000 | 14.7 | 2023 | asia |
| 9 | Brazil | Brasilia | 212000000 | 1.8 | 2023 | South America |
| 10 | India | New Delhi | 1400000000 | 3.5 | 2023 | asia |
| 11 | Canada | Ottawa | 38000000 | 1.9 | 2023 | North America |
| 12 | Australia | Canberra | 26000000 | 1.6 | 2023 | Australia |
Explanation
The options(scipen = 999) setting increases the penalty for scientific notation, forcing R to display full numeric values. The R dataset mirrors the SAS version but uses NA for missing values, and the data inconsistencies are left intact for the cleaning demonstration. The data.frame() function provides a quick way to simulate structured tabular data. The negative population and inconsistent region casing replicate real-world ingestion issues, and the duplicate rows are preserved to show deduplication techniques. This setup allows us to apply dplyr transformations effectively. By keeping the raw data untouched initially, we ensure traceability, a critical requirement in regulated environments like clinical trials (SDTM/ADaM). The dataset acts as a sandbox for demonstrating robust data-cleaning pipelines in R.
3. Phase 1: Data Cleaning in SAS
DATA countries_clean;
SET countries_raw;
/* Standardize text */
Country = PROPCASE(STRIP(Country));
Region = UPCASE(STRIP(Region));
/* Handle missing GDP */
IF GDP = . THEN GDP = 0;
/* Fix invalid population */
IF Population < 0 THEN Population = ABS(Population);
/* Replace missing population */
IF Population = . THEN Population = 0;
/* Format Year */
Year_num = INPUT(PUT(Year,4.),4.);
RUN;
PROC PRINT DATA = countries_clean;
RUN;
OUTPUT:
| Obs | Country | Capital | Region | Population | GDP | Year | Year_num |
|---|---|---|---|---|---|---|---|
| 1 | India | New Delhi | ASIA | 1400000000 | 3.5 | 2023 | 2023 |
| 2 | Usa | Washington | NORTH AMERICA | 330000000 | 21.4 | 2023 | 2023 |
| 3 | India | New Delhi | ASIA | 0 | 3.5 | 2023 | 2023 |
| 4 | Uk | London | EUROPE | 67000000 | 2.8 | 2023 | 2023 |
| 5 | France | Paris | EUROPE | 65000000 | 0.0 | 2023 | 2023 |
| 6 | Germany | Berlin | EUROPE | 83000000 | 3.8 | 2023 | 2023 |
| 7 | Japan | Tokyo | ASIA | 125000000 | 5.1 | 2023 | 2023 |
| 8 | China | Beijing | ASIA | 1400000000 | 14.7 | 2023 | 2023 |
| 9 | Brazil | Brasilia | SOUTH AMERICA | 212000000 | 1.8 | 2023 | 2023 |
| 10 | India | New Delhi | ASIA | 1400000000 | 3.5 | 2023 | 2023 |
| 11 | Canada | Ottawa | NORTH AMERICA | 38000000 | 1.9 | 2023 | 2023 |
| 12 | Australia | Canberra | AUSTRALIA | 26000000 | 1.6 | 2023 | 2023 |
/* Remove duplicates */
PROC SORT DATA=countries_clean NODUPKEY;
BY Country Capital;
RUN;
PROC PRINT DATA = countries_clean;
RUN;
OUTPUT:
| Obs | Country | Capital | Region | Population | GDP | Year | Year_num |
|---|---|---|---|---|---|---|---|
| 1 | Australia | Canberra | AUSTRALIA | 26000000 | 1.6 | 2023 | 2023 |
| 2 | Brazil | Brasilia | SOUTH AMERICA | 212000000 | 1.8 | 2023 | 2023 |
| 3 | Canada | Ottawa | NORTH AMERICA | 38000000 | 1.9 | 2023 | 2023 |
| 4 | China | Beijing | ASIA | 1400000000 | 14.7 | 2023 | 2023 |
| 5 | France | Paris | EUROPE | 65000000 | 0.0 | 2023 | 2023 |
| 6 | Germany | Berlin | EUROPE | 83000000 | 3.8 | 2023 | 2023 |
| 7 | India | New Delhi | ASIA | 1400000000 | 3.5 | 2023 | 2023 |
| 8 | Japan | Tokyo | ASIA | 125000000 | 5.1 | 2023 | 2023 |
| 9 | Uk | London | EUROPE | 67000000 | 2.8 | 2023 | 2023 |
| 10 | Usa | Washington | NORTH AMERICA | 330000000 | 21.4 | 2023 | 2023 |
Explanation
This step applies core SAS data-cleaning techniques. PROPCASE standardizes country names, ensuring uniformity, and UPCASE normalizes region values, enabling accurate grouping. Missing GDP values are replaced with zero, which is useful for aggregation. Negative population values are corrected using ABS, and missing population values are set to zero to avoid analysis errors. The INPUT and PUT functions demonstrate explicit numeric formatting. Finally, PROC SORT NODUPKEY removes duplicate records based on the key variables, which preserves dataset integrity and prevents inflated counts during analysis. Note one side effect visible in the output: PROPCASE turns acronyms like USA and UK into Usa and Uk, so production code often adds a lookup table or explicit exceptions for such values. The approach reflects real-world SAS programming practices in clinical and business analytics.
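The same cleaning rules translate to almost any language. As an illustration outside the blog's SAS/R code (not part of the original workflow), here is a minimal Python sketch of the three steps on a few rows mirroring the dataset above: standardize text, repair invalid or missing numbers, then de-duplicate on the key columns.

```python
# Minimal sketch of the blog's cleaning rules in plain Python.
# The rows below are a subset of the dataset; names are illustrative.
rows = [
    {"Country": " india ", "Capital": "New Delhi", "Population": None, "GDP": 3.5, "Region": "ASIA"},
    {"Country": "UK", "Capital": "London", "Population": -67000000, "GDP": 2.8, "Region": "Europe"},
    {"Country": "India", "Capital": "New Delhi", "Population": 1400000000, "GDP": 3.5, "Region": "asia"},
]

cleaned, seen = [], set()
for r in rows:
    r = dict(r)
    r["Country"] = r["Country"].strip().title()       # like PROPCASE(STRIP(...))
    r["Region"] = r["Region"].strip().upper()         # like UPCASE(STRIP(...))
    r["GDP"] = 0 if r["GDP"] is None else r["GDP"]    # missing GDP -> 0
    pop = r["Population"]
    r["Population"] = 0 if pop is None else abs(pop)  # fix missing/negative
    key = (r["Country"], r["Capital"])                # like NODUPKEY BY Country Capital
    if key not in seen:
        seen.add(key)
        cleaned.append(r)
```

Like SAS PROPCASE, Python's str.title() turns "UK" into "Uk", so the same acronym caveat applies here.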
4. Phase 2: Data Cleaning in R
library(dplyr)
countries_clean <- countries_raw %>%
mutate(
Country = tools::toTitleCase(trimws(Country)),
Region = toupper(trimws(Region)),
GDP = ifelse(is.na(GDP), 0, GDP),
Population = ifelse(Population < 0, abs(Population), Population),
Population = ifelse(is.na(Population), 0, Population)
) %>%
distinct(Country, Capital, .keep_all = TRUE)
OUTPUT:
| | Country | Capital | Population | GDP | Year | Region |
|---|---|---|---|---|---|---|
| 1 | India | New Delhi | 1400000000 | 3.5 | 2023 | ASIA |
| 2 | USA | Washington | 330000000 | 21.4 | 2023 | NORTH AMERICA |
| 3 | UK | London | 67000000 | 2.8 | 2023 | EUROPE |
| 4 | France | Paris | 65000000 | 0 | 2023 | EUROPE |
| 5 | Germany | Berlin | 83000000 | 3.8 | 2023 | EUROPE |
| 6 | Japan | Tokyo | 125000000 | 5.1 | 2023 | ASIA |
| 7 | China | Beijing | 1400000000 | 14.7 | 2023 | ASIA |
| 8 | Brazil | Brasilia | 212000000 | 1.8 | 2023 | SOUTH AMERICA |
| 9 | Canada | Ottawa | 38000000 | 1.9 | 2023 | NORTH AMERICA |
| 10 | Australia | Canberra | 26000000 | 1.6 | 2023 | AUSTRALIA |
Explanation
The R cleaning pipeline uses dplyr for transformation. trimws() removes unwanted spaces, while tools::toTitleCase() standardizes country names (and, unlike SAS PROPCASE, leaves acronyms such as USA and UK intact). toupper() ensures consistent region formatting. Missing GDP values are replaced using ifelse(), negative population values are corrected using abs(), and missing population values are handled explicitly. The distinct() function removes duplicate records based on the key identifiers. This pipeline demonstrates a declarative, readable approach to data cleaning. Compared with SAS, R's chaining makes the sequence of transformations easier to interpret and debug, which is particularly useful in exploratory data analysis workflows.
5. Phase 3: Additional SAS Processing + PROC MEANS vs SUMMARY
PROC MEANS DATA=countries_clean N MEAN SUM MAXDEC=2;
CLASS Region;
VAR Population GDP;
RUN;
OUTPUT:
The MEANS Procedure
| Region | N Obs | Variable | N | Mean | Sum |
|---|---|---|---|---|---|
| ASIA | 3 | Population | 3 | 975000000.00 | 2925000000.00 |
| | | GDP | 3 | 7.77 | 23.30 |
| AUSTRALIA | 1 | Population | 1 | 26000000.00 | 26000000.00 |
| | | GDP | 1 | 1.60 | 1.60 |
| EUROPE | 3 | Population | 3 | 71666666.67 | 215000000.00 |
| | | GDP | 3 | 2.20 | 6.60 |
| NORTH AMERICA | 2 | Population | 2 | 184000000.00 | 368000000.00 |
| | | GDP | 2 | 11.65 | 23.30 |
| SOUTH AMERICA | 1 | Population | 1 | 212000000.00 | 212000000.00 |
| | | GDP | 1 | 1.80 | 1.80 |
PROC SUMMARY DATA=countries_clean NWAY;
CLASS Region;
VAR Population GDP;
OUTPUT OUT=summary_out MEAN= SUM=;
RUN;
PROC PRINT DATA = summary_out;
RUN;
OUTPUT:
| Obs | Region | _TYPE_ | _FREQ_ | Population | GDP |
|---|---|---|---|---|---|
| 1 | ASIA | 1 | 3 | 975000000.00 | 7.7667 |
| 2 | AUSTRALIA | 1 | 1 | 26000000.00 | 1.6000 |
| 3 | EUROPE | 1 | 3 | 71666666.67 | 2.2000 |
| 4 | NORTH AMERICA | 1 | 2 | 184000000.00 | 11.6500 |
| 5 | SOUTH AMERICA | 1 | 1 | 212000000.00 | 1.8000 |
Explanation
PROC MEANS and PROC SUMMARY share the same statistical engine but differ in output behavior. PROC MEANS prints results directly, making it useful for quick analysis. PROC SUMMARY is more flexible: it suppresses printed output unless you request it and is ideal for creating output datasets via OUT=. The NWAY option keeps only the highest-level combination of CLASS variables, dropping the overall-total row. Both procedures calculate statistics such as N, mean, and sum. In production environments, PROC SUMMARY is preferred for pipeline integration, while PROC MEANS is used for exploratory checks. Understanding this distinction is critical for efficient SAS programming.
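For readers outside SAS, the CLASS/VAR pattern is simply grouped aggregation. As a rough Python sketch (with a few illustrative rows, not the full cleaned dataset), this is what PROC SUMMARY's OUT= dataset computes per Region:

```python
# Sketch of PROC SUMMARY's CLASS Region / VAR Population GDP logic:
# bucket rows by Region, then compute N, mean, and sum per group.
from collections import defaultdict

rows = [  # (Region, Population, GDP) - illustrative subset
    ("ASIA", 1400000000, 3.5),
    ("ASIA", 125000000, 5.1),
    ("EUROPE", 65000000, 0.0),
    ("EUROPE", 83000000, 3.8),
]

groups = defaultdict(list)
for region, pop, gdp in rows:
    groups[region].append((pop, gdp))

summary_out = {}
for region, vals in groups.items():
    pops = [v[0] for v in vals]
    gdps = [v[1] for v in vals]
    summary_out[region] = {
        "_FREQ_": len(vals),              # like _FREQ_ in the OUT= dataset
        "Pop_Mean": sum(pops) / len(pops),
        "GDP_Sum": sum(gdps),
    }
```

The keys _FREQ_, Pop_Mean, and GDP_Sum are chosen here to echo the SAS output; the names are illustrative.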
6. 20 Additional Data Cleaning Best Practices
- Always validate source data before transformation
- Maintain audit trails for regulatory compliance
- Use metadata-driven programming (CDISC standards)
- Validate ranges (e.g., age 0–120)
- Standardize formats early
- Avoid overwriting raw data
- Document all transformations
- Use macros for repeatable cleaning
- Perform QC checks independently
- Track missing value patterns
- Use controlled terminology (MedDRA, WHO)
- Validate key identifiers
- Ensure referential integrity
- Apply consistent date formats
- Log all warnings/errors
- Use version control
- Validate duplicates carefully
- Perform cross-domain checks
- Automate validation rules
- Ensure reproducibility
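To make one of these practices concrete, here is a small, hypothetical range-validation sketch in Python. The bounds and field names are illustrative, not taken from any standard:

```python
# Hedged sketch of a range-validation rule (see "Validate ranges" above).
# RULES maps a field name to its plausible (low, high) bounds.
RULES = {"Age": (0, 120), "Population": (0, 8_000_000_000)}

def validate(record):
    """Return a list of violation messages for missing or out-of-range fields."""
    problems = []
    for field, (lo, hi) in RULES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    return problems

# A patient age of -5, as in the introduction, is flagged immediately.
issues = validate({"Age": -5, "Population": 67000000})
```

In practice such rules would live in metadata (edit-check specifications) rather than in code, but the logic is the same.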
7. Business Logic Behind Data Cleaning
Data cleaning is not just technical; it's deeply tied to business logic. Missing values are replaced based on context. For example, in clinical trials, a missing patient weight may be imputed using last observation carried forward (LOCF). In business datasets, missing GDP might be set to zero or estimated from historical averages.
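As a quick illustration of LOCF, here is a Python sketch with made-up visit weights for one subject (a teaching example, not a clinical implementation):

```python
# LOCF (last observation carried forward) sketch.
# One subject's weight at five visits; None marks a missing measurement.
weights = [70.2, None, None, 69.8, None]

filled, last = [], None
for w in weights:
    if w is not None:
        last = w          # remember the most recent observed value
    filled.append(last)   # carry it forward over missing visits
```

After the loop, each missing visit holds the last observed weight. Note that LOCF is a strong assumption and modern guidance often prefers other imputation strategies; it is shown here only because the text mentions it.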
Unrealistic values must be corrected because they distort analysis. A negative population is logically impossible and must be fixed or removed. Similarly, salary normalization ensures fair comparisons across regions.

Date imputation is critical. If treatment start dates are missing, derived variables like study duration become invalid. In financial datasets, incorrect dates can affect revenue recognition.

Every correction must align with domain knowledge. Blind transformations can introduce bias: for instance, replacing all missing values with zero may underestimate averages.
Ultimately, clean data ensures:
- Accurate reporting
- Reliable decision-making
- Regulatory compliance
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing values distort statistics
- Duplicates inflate metrics
- Validation prevents errors
- Clean data builds trust
- Automation improves efficiency
- Documentation ensures transparency
- QC is non-negotiable
- Regulatory compliance depends on data quality
- Consistency drives accuracy
- Raw data must remain untouched
- Transformation logic must be traceable
- Business rules guide cleaning
- Outliers must be investigated
- Metadata defines structure
- Clean data accelerates insights
- Poor data wastes resources
- Data quality is a competitive advantage
- Analytics is only as good as the data
9. Summary
SAS and R both provide powerful capabilities for data cleaning, but they differ in approach and strengths. SAS excels in structured, regulatory-driven environments such as clinical trials. Procedures like PROC SORT, PROC MEANS, and PROC SUMMARY provide robust, standardized workflows, and SAS supports the reproducibility, auditability, and compliance that are critical in industries like pharmaceuticals.

R, on the other hand, offers flexibility and readability. Packages like dplyr allow chaining transformations, making the code intuitive and concise. R is highly effective for exploratory data analysis and rapid prototyping.

In this blog, we transformed a messy dataset of global countries and capitals into a clean, analysis-ready dataset. We handled missing values, corrected invalid data, standardized text, and removed duplicates. We also explored the difference between PROC MEANS and PROC SUMMARY, highlighting their practical use cases.

The key takeaway is that data cleaning is not optional; it is foundational. Whether using SAS or R, the goal remains the same: ensure accuracy, consistency, and reliability.
10. Conclusion
Data cleaning is often underestimated, yet it is the backbone of any analytical process. Without clean data, even the most sophisticated models fail. In real-world scenarios, whether clinical trials, financial analytics, or global datasets, data arrives messy, inconsistent, and incomplete.

This blog demonstrated how to systematically approach data cleaning using SAS and R. By creating a flawed dataset and correcting it step by step, we highlighted practical techniques that are directly applicable in industry settings.

The comparison between PROC MEANS and PROC SUMMARY emphasized the importance of choosing the right tool for the right task. While both provide similar statistical outputs, their usage differs by context: exploratory versus production.

A structured data-cleaning framework ensures:
- Consistency across datasets
- Reliable statistical outputs
- Regulatory compliance
- Faster decision-making

For professionals preparing for interviews or working in data-intensive roles, mastering these concepts is essential. Employers look for candidates who not only write code but understand the logic behind transformations.

In conclusion, clean data is not a luxury; it is a necessity. Investing time in data cleaning pays dividends in accuracy, trust, and performance.
11. Interview Questions
Q1: What is the difference between PROC MEANS and PROC SUMMARY?
Answer: PROC MEANS prints output by default; PROC SUMMARY suppresses printing and is typically used to create output datasets, so it is preferred in pipelines.

Q2: How do you handle negative values in clinical datasets?
Answer: Apply validation rules. Replace them with missing values, or correct them with ABS only when the sign error is justified and documented.

Q3: How do you remove duplicates in SAS?
Answer: Use PROC SORT with the NODUPKEY option on the key variables.

Q4: How do you standardize text in R?
Answer: Use toupper(), tolower(), and trimws().

Q5: Scenario: a GDP column contains "NULL" and missing values. What will you do?
Answer: Convert "NULL" to a true missing value, then apply imputation (zero or a mean, based on business logic).
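The Q5 answer can be sketched in Python (illustrative values; whether to impute zero or a mean depends on the business logic):

```python
# Q5 sketch: convert the text "NULL" to a true missing value,
# then impute missing entries with the mean of the observed values.
raw_gdp = ["3.5", "NULL", "21.4", "", "2.8"]

# Step 1: "NULL" and empty strings become None (missing)
gdp = [float(x) if x not in ("NULL", "") else None for x in raw_gdp]

# Step 2: impute missing entries with the observed mean
observed = [x for x in gdp if x is not None]
mean_gdp = sum(observed) / len(observed)
imputed = [x if x is not None else round(mean_gdp, 2) for x in gdp]
```

The two-step order matters: imputation can only be computed correctly once every "NULL" has been converted to a genuine missing value.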
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The countries-and-capitals datasets and the analysis in this article are created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about data and analytics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------