Capitals, Chaos & Clean Code: Transforming a Global Countries Dataset into Analytical Gold with SAS & R
1. Introduction
Imagine you are working on a global analytics project for a multinational organization. The dataset contains country names, capital cities, population, GDP, and region classifications. It looks clean at first glance, but once you start analyzing, things fall apart.
You notice countries like "india", "INDIA", and "India " treated as separate entities. Some capitals are missing. Population values include negative numbers. GDP fields contain text like "NULL". Dates are incorrectly formatted. Duplicate rows silently inflate your statistics.
This is not just inconvenient; it's dangerous.

In industries like clinical trials or financial analytics, poor data quality leads to flawed decisions. Imagine a clinical dataset where patient age is -5 or treatment dates are reversed. The consequences can be regulatory rejection or incorrect conclusions.

This is where SAS and R become critical tools. They don't just analyze data; they transform chaos into reliability.
In this blog, we will:
- Create a messy dataset of popular countries and capitals
- Clean it using SAS and R
- Explore PROC MEANS vs PROC SUMMARY
- Apply real-world data cleaning logic
- Make it interview-ready
2. Raw Data Creation in SAS and R
SAS Code (With Intentional Errors)
DATA countries_raw;
INFILE DATALINES DLM=',' MISSOVER;
LENGTH Country $20 Capital $20 Region $15;
INPUT Country $ Capital $ Population GDP Year Region $;
DATALINES;
India,New Delhi,1400000000,3.5,2023,asia
USA,Washington,330000000,21.4,2023,North America
india,New Delhi,.,3.5,2023,ASIA
UK,London,-67000000,2.8,2023,Europe
France,Paris,65000000,NULL,2023,Europe
Germany,Berlin,83000000,3.8,2023,europe
Japan,Tokyo,125000000,5.1,2023,Asia
China,Beijing,1400000000,14.7,2023,asia
Brazil,Brasilia,212000000,1.8,2023,South America
India,New Delhi,1400000000,3.5,2023,asia
Canada,Ottawa,38000000,1.9,2023,North America
Australia,Canberra,26000000,1.6,2023,Australia
;
RUN;
PROC PRINT DATA = countries_raw;
RUN;
OUTPUT:
| Obs | Country | Capital | Region | Population | GDP | Year |
|---|---|---|---|---|---|---|
| 1 | India | New Delhi | asia | 1400000000 | 3.5 | 2023 |
| 2 | USA | Washington | North America | 330000000 | 21.4 | 2023 |
| 3 | india | New Delhi | ASIA | . | 3.5 | 2023 |
| 4 | UK | London | Europe | -67000000 | 2.8 | 2023 |
| 5 | France | Paris | Europe | 65000000 | . | 2023 |
| 6 | Germany | Berlin | europe | 83000000 | 3.8 | 2023 |
| 7 | Japan | Tokyo | Asia | 125000000 | 5.1 | 2023 |
| 8 | China | Beijing | asia | 1400000000 | 14.7 | 2023 |
| 9 | Brazil | Brasilia | South America | 212000000 | 1.8 | 2023 |
| 10 | India | New Delhi | asia | 1400000000 | 3.5 | 2023 |
| 11 | Canada | Ottawa | North America | 38000000 | 1.9 | 2023 |
| 12 | Australia | Canberra | Australia | 26000000 | 1.6 | 2023 |
Explanation
This dataset intentionally introduces real-world issues. Missing values appear as . and as the text "NULL". A negative population value (-67000000) simulates a data-entry error. Duplicate records (India repeated) inflate metrics. Text inconsistencies (asia, ASIA, Europe, europe) cause grouping issues, and mixed casing in country names leads to duplicate categories during aggregation. The dataset mimics real-world ingestion problems from multiple systems. INFILE DATALINES enables inline data creation, while MISSOVER prevents the input pointer from moving to the next line when fields are missing. Note that the text "NULL" in the numeric GDP field triggers an invalid-data note in the log and is read as missing, which is why France's GDP appears as . in the output. This raw dataset forms the foundation for demonstrating structured cleaning using SAS procedures and functions.
R Code – Equivalent Raw Dataset
options(scipen = 999)
countries_raw <- data.frame(
Country = c("India","USA","india","UK","France","Germany","Japan",
"China","Brazil","India","Canada","Australia"),
Capital = c("New Delhi","Washington","New Delhi","London","Paris",
"Berlin","Tokyo","Beijing","Brasilia","New Delhi",
"Ottawa","Canberra"),
Population = c(1400000000,330000000,NA,-67000000,65000000,83000000,
125000000,1400000000,212000000,1400000000,38000000,
26000000),
GDP = c(3.5,21.4,3.5,2.8,NA,3.8,5.1,14.7,1.8,3.5,1.9,1.6),
Year = c(2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,
2023,2023),
Region = c("asia","North America","ASIA","Europe","Europe","europe",
"Asia","asia","South America","asia","North America","Australia")
)
OUTPUT:
| | Country | Capital | Population | GDP | Year | Region |
|---|---|---|---|---|---|---|
| 1 | India | New Delhi | 1400000000 | 3.5 | 2023 | asia |
| 2 | USA | Washington | 330000000 | 21.4 | 2023 | North America |
| 3 | india | New Delhi | NA | 3.5 | 2023 | ASIA |
| 4 | UK | London | -67000000 | 2.8 | 2023 | Europe |
| 5 | France | Paris | 65000000 | NA | 2023 | Europe |
| 6 | Germany | Berlin | 83000000 | 3.8 | 2023 | europe |
| 7 | Japan | Tokyo | 125000000 | 5.1 | 2023 | Asia |
| 8 | China | Beijing | 1400000000 | 14.7 | 2023 | asia |
| 9 | Brazil | Brasilia | 212000000 | 1.8 | 2023 | South America |
| 10 | India | New Delhi | 1400000000 | 3.5 | 2023 | asia |
| 11 | Canada | Ottawa | 38000000 | 1.9 | 2023 | North America |
| 12 | Australia | Canberra | 26000000 | 1.6 | 2023 | Australia |
Explanation
The options(scipen = 999) setting increases the penalty for scientific notation, forcing R to display full numeric values. The R dataset mirrors the SAS version but uses NA for missing values, and the data inconsistencies are left intact for the cleaning demonstration. The data.frame() function provides a quick way to simulate structured tabular data. The negative population and inconsistent region casing replicate real-world ingestion issues, and the duplicate rows are preserved to show deduplication techniques. This setup allows us to apply dplyr transformations effectively. By keeping the raw data untouched initially, we ensure traceability, a critical requirement in regulated environments like clinical trials (SDTM/ADaM). The dataset acts as a sandbox for demonstrating robust data-cleaning pipelines in R.
3. Phase 1: Data Cleaning in SAS
DATA countries_clean;
SET countries_raw;
/* Standardize text */
Country = PROPCASE(STRIP(Country));
Region = UPCASE(STRIP(Region));
/* Handle missing GDP */
IF GDP = . THEN GDP = 0;
/* Fix invalid population */
IF Population < 0 THEN Population = ABS(Population);
/* Replace missing population */
IF Population = . THEN Population = 0;
/* Format Year */
Year_num = INPUT(PUT(Year,4.),4.);
RUN;
PROC PRINT DATA = countries_clean;
RUN;
OUTPUT:
| Obs | Country | Capital | Region | Population | GDP | Year | Year_num |
|---|---|---|---|---|---|---|---|
| 1 | India | New Delhi | ASIA | 1400000000 | 3.5 | 2023 | 2023 |
| 2 | Usa | Washington | NORTH AMERICA | 330000000 | 21.4 | 2023 | 2023 |
| 3 | India | New Delhi | ASIA | 0 | 3.5 | 2023 | 2023 |
| 4 | Uk | London | EUROPE | 67000000 | 2.8 | 2023 | 2023 |
| 5 | France | Paris | EUROPE | 65000000 | 0.0 | 2023 | 2023 |
| 6 | Germany | Berlin | EUROPE | 83000000 | 3.8 | 2023 | 2023 |
| 7 | Japan | Tokyo | ASIA | 125000000 | 5.1 | 2023 | 2023 |
| 8 | China | Beijing | ASIA | 1400000000 | 14.7 | 2023 | 2023 |
| 9 | Brazil | Brasilia | SOUTH AMERICA | 212000000 | 1.8 | 2023 | 2023 |
| 10 | India | New Delhi | ASIA | 1400000000 | 3.5 | 2023 | 2023 |
| 11 | Canada | Ottawa | NORTH AMERICA | 38000000 | 1.9 | 2023 | 2023 |
| 12 | Australia | Canberra | AUSTRALIA | 26000000 | 1.6 | 2023 | 2023 |
/* Remove duplicates */
PROC SORT DATA=countries_clean NODUPKEY;
BY Country Capital;
RUN;
PROC PRINT DATA = countries_clean;
RUN;
OUTPUT:
| Obs | Country | Capital | Region | Population | GDP | Year | Year_num |
|---|---|---|---|---|---|---|---|
| 1 | Australia | Canberra | AUSTRALIA | 26000000 | 1.6 | 2023 | 2023 |
| 2 | Brazil | Brasilia | SOUTH AMERICA | 212000000 | 1.8 | 2023 | 2023 |
| 3 | Canada | Ottawa | NORTH AMERICA | 38000000 | 1.9 | 2023 | 2023 |
| 4 | China | Beijing | ASIA | 1400000000 | 14.7 | 2023 | 2023 |
| 5 | France | Paris | EUROPE | 65000000 | 0.0 | 2023 | 2023 |
| 6 | Germany | Berlin | EUROPE | 83000000 | 3.8 | 2023 | 2023 |
| 7 | India | New Delhi | ASIA | 1400000000 | 3.5 | 2023 | 2023 |
| 8 | Japan | Tokyo | ASIA | 125000000 | 5.1 | 2023 | 2023 |
| 9 | Uk | London | EUROPE | 67000000 | 2.8 | 2023 | 2023 |
| 10 | Usa | Washington | NORTH AMERICA | 330000000 | 21.4 | 2023 | 2023 |
Explanation
This step applies core SAS data-cleaning techniques. PROPCASE standardizes country names, ensuring uniformity, and UPCASE normalizes region values, enabling accurate grouping. Missing GDP values are replaced with zero, which is useful for aggregation. Negative population values are corrected using ABS, and missing population values are set to zero to avoid analysis errors. The INPUT and PUT functions demonstrate explicit numeric formatting. Finally, PROC SORT NODUPKEY removes duplicate records based on the key variables, which preserves dataset integrity and prevents inflated counts during analysis. Note one side effect visible in the output: PROPCASE turns acronyms like USA and UK into Usa and Uk, so production code often adds a lookup table or explicit exceptions for such values. The approach reflects real-world SAS programming practices in clinical and business analytics.
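The same cleaning rules translate to almost any language. As an illustration outside the blog's SAS/R code (not part of the original workflow), here is a minimal Python sketch of the three steps on a few rows mirroring the dataset above: standardize text, repair invalid or missing numbers, then de-duplicate on the key columns.

```python
# Minimal sketch of the blog's cleaning rules in plain Python.
# The rows below are a subset of the dataset; names are illustrative.
rows = [
    {"Country": " india ", "Capital": "New Delhi", "Population": None, "GDP": 3.5, "Region": "ASIA"},
    {"Country": "UK", "Capital": "London", "Population": -67000000, "GDP": 2.8, "Region": "Europe"},
    {"Country": "India", "Capital": "New Delhi", "Population": 1400000000, "GDP": 3.5, "Region": "asia"},
]

cleaned, seen = [], set()
for r in rows:
    r = dict(r)
    r["Country"] = r["Country"].strip().title()       # like PROPCASE(STRIP(...))
    r["Region"] = r["Region"].strip().upper()         # like UPCASE(STRIP(...))
    r["GDP"] = 0 if r["GDP"] is None else r["GDP"]    # missing GDP -> 0
    pop = r["Population"]
    r["Population"] = 0 if pop is None else abs(pop)  # fix missing/negative
    key = (r["Country"], r["Capital"])                # like NODUPKEY BY Country Capital
    if key not in seen:
        seen.add(key)
        cleaned.append(r)
```

Like SAS PROPCASE, Python's str.title() turns "UK" into "Uk", so the same acronym caveat applies here.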
4. Phase 2: Data Cleaning in R
library(dplyr)
countries_clean <- countries_raw %>%
mutate(
Country = tools::toTitleCase(trimws(Country)),
Region = toupper(trimws(Region)),
GDP = ifelse(is.na(GDP), 0, GDP),
Population = ifelse(Population < 0, abs(Population), Population),
Population = ifelse(is.na(Population), 0, Population)
) %>%
distinct(Country, Capital, .keep_all = TRUE)
OUTPUT:
| | Country | Capital | Population | GDP | Year | Region |
|---|---|---|---|---|---|---|
| 1 | India | New Delhi | 1400000000 | 3.5 | 2023 | ASIA |
| 2 | USA | Washington | 330000000 | 21.4 | 2023 | NORTH AMERICA |
| 3 | UK | London | 67000000 | 2.8 | 2023 | EUROPE |
| 4 | France | Paris | 65000000 | 0 | 2023 | EUROPE |
| 5 | Germany | Berlin | 83000000 | 3.8 | 2023 | EUROPE |
| 6 | Japan | Tokyo | 125000000 | 5.1 | 2023 | ASIA |
| 7 | China | Beijing | 1400000000 | 14.7 | 2023 | ASIA |
| 8 | Brazil | Brasilia | 212000000 | 1.8 | 2023 | SOUTH AMERICA |
| 9 | Canada | Ottawa | 38000000 | 1.9 | 2023 | NORTH AMERICA |
| 10 | Australia | Canberra | 26000000 | 1.6 | 2023 | AUSTRALIA |
Explanation
The R cleaning pipeline uses dplyr for transformation. trimws() removes unwanted spaces, while tools::toTitleCase() standardizes country names (and, unlike SAS PROPCASE, leaves acronyms such as USA and UK intact). toupper() ensures consistent region formatting. Missing GDP values are replaced using ifelse(), negative population values are corrected using abs(), and missing population values are handled explicitly. The distinct() function removes duplicate records based on the key identifiers. This pipeline demonstrates a declarative, readable approach to data cleaning. Compared with SAS, R's chaining makes the sequence of transformations easier to interpret and debug, which is particularly useful in exploratory data analysis workflows.
5. Phase 3: Additional SAS Processing + PROC MEANS vs SUMMARY
PROC MEANS DATA=countries_clean N MEAN SUM MAXDEC=2;
CLASS Region;
VAR Population GDP;
RUN;
OUTPUT:
The MEANS Procedure
| Region | N Obs | Variable | N | Mean | Sum |
|---|---|---|---|---|---|
| ASIA | 3 | Population | 3 | 975000000.00 | 2925000000.00 |
| | | GDP | 3 | 7.77 | 23.30 |
| AUSTRALIA | 1 | Population | 1 | 26000000.00 | 26000000.00 |
| | | GDP | 1 | 1.60 | 1.60 |
| EUROPE | 3 | Population | 3 | 71666666.67 | 215000000.00 |
| | | GDP | 3 | 2.20 | 6.60 |
| NORTH AMERICA | 2 | Population | 2 | 184000000.00 | 368000000.00 |
| | | GDP | 2 | 11.65 | 23.30 |
| SOUTH AMERICA | 1 | Population | 1 | 212000000.00 | 212000000.00 |
| | | GDP | 1 | 1.80 | 1.80 |
PROC SUMMARY DATA=countries_clean NWAY;
CLASS Region;
VAR Population GDP;
OUTPUT OUT=summary_out MEAN= SUM=;
RUN;
PROC PRINT DATA = summary_out;
RUN;
OUTPUT:
| Obs | Region | _TYPE_ | _FREQ_ | Population | GDP |
|---|---|---|---|---|---|
| 1 | ASIA | 1 | 3 | 975000000.00 | 7.7667 |
| 2 | AUSTRALIA | 1 | 1 | 26000000.00 | 1.6000 |
| 3 | EUROPE | 1 | 3 | 71666666.67 | 2.2000 |
| 4 | NORTH AMERICA | 1 | 2 | 184000000.00 | 11.6500 |
| 5 | SOUTH AMERICA | 1 | 1 | 212000000.00 | 1.8000 |
Explanation
PROC MEANS and PROC SUMMARY share the same statistical engine but differ in output behavior. PROC MEANS prints results directly, making it useful for quick analysis. PROC SUMMARY is more flexible: it suppresses printed output unless you request it and is ideal for creating output datasets via OUT=. The NWAY option keeps only the highest-level combination of CLASS variables, dropping the overall-total row. Both procedures calculate statistics such as N, mean, and sum. In production environments, PROC SUMMARY is preferred for pipeline integration, while PROC MEANS is used for exploratory checks. Understanding this distinction is critical for efficient SAS programming.
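For readers outside SAS, the CLASS/VAR pattern is simply grouped aggregation. As a rough Python sketch (with a few illustrative rows, not the full cleaned dataset), this is what PROC SUMMARY's OUT= dataset computes per Region:

```python
# Sketch of PROC SUMMARY's CLASS Region / VAR Population GDP logic:
# bucket rows by Region, then compute N, mean, and sum per group.
from collections import defaultdict

rows = [  # (Region, Population, GDP) - illustrative subset
    ("ASIA", 1400000000, 3.5),
    ("ASIA", 125000000, 5.1),
    ("EUROPE", 65000000, 0.0),
    ("EUROPE", 83000000, 3.8),
]

groups = defaultdict(list)
for region, pop, gdp in rows:
    groups[region].append((pop, gdp))

summary_out = {}
for region, vals in groups.items():
    pops = [v[0] for v in vals]
    gdps = [v[1] for v in vals]
    summary_out[region] = {
        "_FREQ_": len(vals),              # like _FREQ_ in the OUT= dataset
        "Pop_Mean": sum(pops) / len(pops),
        "GDP_Sum": sum(gdps),
    }
```

The keys _FREQ_, Pop_Mean, and GDP_Sum are chosen here to echo the SAS output; the names are illustrative.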
6. 20 Additional Data Cleaning Best Practices
- Always validate source data before transformation
- Maintain audit trails for regulatory compliance
- Use metadata-driven programming (CDISC standards)
- Validate ranges (e.g., age 0–120)
- Standardize formats early
- Avoid overwriting raw data
- Document all transformations
- Use macros for repeatable cleaning
- Perform QC checks independently
- Track missing value patterns
- Use controlled terminology (MedDRA, WHO)
- Validate key identifiers
- Ensure referential integrity
- Apply consistent date formats
- Log all warnings/errors
- Use version control
- Validate duplicates carefully
- Perform cross-domain checks
- Automate validation rules
- Ensure reproducibility
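To make one of these practices concrete, here is a small, hypothetical range-validation sketch in Python. The bounds and field names are illustrative, not taken from any standard:

```python
# Hedged sketch of a range-validation rule (see "Validate ranges" above).
# RULES maps a field name to its plausible (low, high) bounds.
RULES = {"Age": (0, 120), "Population": (0, 8_000_000_000)}

def validate(record):
    """Return a list of violation messages for missing or out-of-range fields."""
    problems = []
    for field, (lo, hi) in RULES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    return problems

# A patient age of -5, as in the introduction, is flagged immediately.
issues = validate({"Age": -5, "Population": 67000000})
```

In practice such rules would live in metadata (edit-check specifications) rather than in code, but the logic is the same.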
7. Business Logic Behind Data Cleaning
Data cleaning is not just technical; it's deeply tied to business logic. Missing values are replaced based on context. For example, in clinical trials, a missing patient weight may be imputed using last observation carried forward (LOCF). In business datasets, missing GDP might be set to zero or estimated from historical averages.
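As a quick illustration of LOCF, here is a Python sketch with made-up visit weights for one subject (a teaching example, not a clinical implementation):

```python
# LOCF (last observation carried forward) sketch.
# One subject's weight at five visits; None marks a missing measurement.
weights = [70.2, None, None, 69.8, None]

filled, last = [], None
for w in weights:
    if w is not None:
        last = w          # remember the most recent observed value
    filled.append(last)   # carry it forward over missing visits
```

After the loop, each missing visit holds the last observed weight. Note that LOCF is a strong assumption and modern guidance often prefers other imputation strategies; it is shown here only because the text mentions it.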
Unrealistic values must be corrected because they distort analysis. A negative population is logically impossible and must be fixed or removed. Similarly, salary normalization ensures fair comparisons across regions.

Date imputation is critical. If treatment start dates are missing, derived variables like study duration become invalid. In financial datasets, incorrect dates can affect revenue recognition.

Every correction must align with domain knowledge. Blind transformations can introduce bias: for instance, replacing all missing values with zero may underestimate averages.
Ultimately, clean data ensures:
- Accurate reporting
- Reliable decision-making
- Regulatory compliance
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing values distort statistics
- Duplicates inflate metrics
- Validation prevents errors
- Clean data builds trust
- Automation improves efficiency
- Documentation ensures transparency
- QC is non-negotiable
- Regulatory compliance depends on data quality
- Consistency drives accuracy
- Raw data must remain untouched
- Transformation logic must be traceable
- Business rules guide cleaning
- Outliers must be investigated
- Metadata defines structure
- Clean data accelerates insights
- Poor data wastes resources
- Data quality is a competitive advantage
- Analytics is only as good as the data
9. Summary
SAS and R both provide powerful capabilities for data cleaning, but they differ in approach and strengths. SAS excels in structured, regulatory-driven environments such as clinical trials. Procedures like PROC SORT, PROC MEANS, and PROC SUMMARY provide robust, standardized workflows, and SAS supports the reproducibility, auditability, and compliance that are critical in industries like pharmaceuticals.

R, on the other hand, offers flexibility and readability. Packages like dplyr allow chaining transformations, making the code intuitive and concise. R is highly effective for exploratory data analysis and rapid prototyping.

In this blog, we transformed a messy dataset of global countries and capitals into a clean, analysis-ready dataset. We handled missing values, corrected invalid data, standardized text, and removed duplicates. We also explored the difference between PROC MEANS and PROC SUMMARY, highlighting their practical use cases.

The key takeaway is that data cleaning is not optional; it is foundational. Whether using SAS or R, the goal remains the same: ensure accuracy, consistency, and reliability.
10. Conclusion
Data cleaning is often underestimated, yet it is the backbone of any analytical process. Without clean data, even the most sophisticated models fail. In real-world scenarios, whether clinical trials, financial analytics, or global datasets, data arrives messy, inconsistent, and incomplete.

This blog demonstrated how to systematically approach data cleaning using SAS and R. By creating a flawed dataset and correcting it step by step, we highlighted practical techniques that are directly applicable in industry settings.

The comparison between PROC MEANS and PROC SUMMARY emphasized the importance of choosing the right tool for the right task. While both provide similar statistical outputs, their usage differs by context: exploratory versus production.

A structured data-cleaning framework ensures:
- Consistency across datasets
- Reliable statistical outputs
- Regulatory compliance
- Faster decision-making

For professionals preparing for interviews or working in data-intensive roles, mastering these concepts is essential. Employers look for candidates who not only write code but understand the logic behind transformations.

In conclusion, clean data is not a luxury; it is a necessity. Investing time in data cleaning pays dividends in accuracy, trust, and performance.
11. Interview Questions
Q1: What is the difference between PROC MEANS and PROC SUMMARY?
Answer: PROC MEANS prints output by default; PROC SUMMARY suppresses printing and is typically used to create output datasets, so it is preferred in pipelines.

Q2: How do you handle negative values in clinical datasets?
Answer: Apply validation rules. Replace them with missing values, or correct them with ABS only when the sign error is justified and documented.

Q3: How do you remove duplicates in SAS?
Answer: Use PROC SORT with the NODUPKEY option on the key variables.

Q4: How do you standardize text in R?
Answer: Use toupper(), tolower(), and trimws().

Q5: Scenario: a GDP column contains "NULL" and missing values. What will you do?
Answer: Convert "NULL" to a true missing value, then apply imputation (zero or a mean, based on business logic).
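The Q5 answer can be sketched in Python (illustrative values; whether to impute zero or a mean depends on the business logic):

```python
# Q5 sketch: convert the text "NULL" to a true missing value,
# then impute missing entries with the mean of the observed values.
raw_gdp = ["3.5", "NULL", "21.4", "", "2.8"]

# Step 1: "NULL" and empty strings become None (missing)
gdp = [float(x) if x not in ("NULL", "") else None for x in raw_gdp]

# Step 2: impute missing entries with the observed mean
observed = [x for x in gdp if x is not None]
mean_gdp = sum(observed) / len(observed)
imputed = [x if x is not None else round(mean_gdp, 2) for x in gdp]
```

The two-step order matters: imputation can only be computed correctly once every "NULL" has been converted to a genuine missing value.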
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The countries-and-capitals datasets and the analysis in this article are created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about data and analytics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------