Capitals, Chaos & Clean Code: Transforming a Global Countries Dataset into Analytical Gold with SAS & R

1. Introduction

Imagine you are working on a global analytics project for a multinational organization. The dataset contains country names, capital cities, population, GDP, and region classifications. It looks clean at first glance, but once you start analyzing, things fall apart.

You notice countries like “india”, “INDIA”, and “India ” treated as separate entities. Some capitals are missing. Population values include negative numbers. GDP fields contain text like “NULL.” Dates are incorrectly formatted. Duplicate rows silently inflate your statistics.

This is not just inconvenient; it's dangerous.

In industries like clinical trials or financial analytics, poor data quality leads to flawed decisions. Imagine a clinical dataset where patient age is -5 or treatment dates are reversed. The consequences can be regulatory rejection or incorrect conclusions.

This is where SAS and R become critical tools. They don't just analyze data; they transform chaos into reliability.

In this blog, we will:

  • Create a messy dataset of popular countries and capitals
  • Clean it using SAS and R
  • Explore PROC MEANS vs PROC SUMMARY
  • Apply real-world data cleaning logic
  • Make it interview-ready

2. Raw Data Creation in SAS and R

SAS Code (With Intentional Errors)

DATA countries_raw;
  INFILE DATALINES DLM=',' MISSOVER;
  LENGTH Country $20 Capital $20 Region $15;
  INPUT Country $ Capital $ Population GDP Year Region $;
  DATALINES;
India,New Delhi,1400000000,3.5,2023,asia
USA,Washington,330000000,21.4,2023,North America
india,New Delhi,.,3.5,2023,ASIA
UK,London,-67000000,2.8,2023,Europe
France,Paris,65000000,NULL,2023,Europe
Germany,Berlin,83000000,3.8,2023,europe
Japan,Tokyo,125000000,5.1,2023,Asia
China,Beijing,1400000000,14.7,2023,asia
Brazil,Brasilia,212000000,1.8,2023,South America
India,New Delhi,1400000000,3.5,2023,asia
Canada,Ottawa,38000000,1.9,2023,North America
Australia,Canberra,26000000,1.6,2023,Australia
;
RUN;

PROC PRINT DATA=countries_raw;
RUN;

OUTPUT:

Obs  Country    Capital     Region         Population    GDP  Year
  1  India      New Delhi   asia           1400000000    3.5  2023
  2  USA        Washington  North America   330000000   21.4  2023
  3  india      New Delhi   ASIA                    .    3.5  2023
  4  UK         London      Europe          -67000000    2.8  2023
  5  France     Paris       Europe           65000000      .  2023
  6  Germany    Berlin      europe           83000000    3.8  2023
  7  Japan      Tokyo       Asia            125000000    5.1  2023
  8  China      Beijing     asia           1400000000   14.7  2023
  9  Brazil     Brasilia    South America   212000000    1.8  2023
 10  India      New Delhi   asia           1400000000    3.5  2023
 11  Canada     Ottawa      North America    38000000    1.9  2023
 12  Australia  Canberra    Australia        26000000    1.6  2023

Explanation

This dataset intentionally introduces real-world issues. Missing values appear as . and as the text "NULL"; when SAS reads "NULL" into the numeric GDP variable, it writes an invalid-data note to the log and stores a missing value. A negative population (-67000000) simulates a data-entry error. Duplicate records (India repeated) inflate metrics. Inconsistent text (asia, ASIA, Europe, europe) causes grouping issues, and mixed casing in country names creates duplicate categories during aggregation. The dataset mimics real-world ingestion problems from multiple systems. INFILE DATALINES enables inline data creation, while MISSOVER prevents SAS from reading into the next record when fields are missing. This raw dataset forms the foundation for demonstrating structured cleaning with SAS procedures and functions.

R Code – Equivalent Raw Dataset

options(scipen = 999)

countries_raw <- data.frame(
  Country = c("India","USA","india","UK","France","Germany","Japan",
              "China","Brazil","India","Canada","Australia"),
  Capital = c("New Delhi","Washington","New Delhi","London","Paris",
              "Berlin","Tokyo","Beijing","Brasilia","New Delhi",
              "Ottawa","Canberra"),
  Population = c(1400000000,330000000,NA,-67000000,65000000,83000000,
                 125000000,1400000000,212000000,1400000000,38000000,
                 26000000),
  GDP = c(3.5,21.4,3.5,2.8,NA,3.8,5.1,14.7,1.8,3.5,1.9,1.6),
  Year = c(2023,2023,2023,2023,2023,2023,2023,2023,2023,2023,
           2023,2023),
  Region = c("asia","North America","ASIA","Europe","Europe","europe",
             "Asia","asia","South America","asia","North America","Australia")
)

OUTPUT:

     Country    Capital Population  GDP Year        Region
1      India  New Delhi 1400000000  3.5 2023          asia
2        USA Washington  330000000 21.4 2023 North America
3      india  New Delhi         NA  3.5 2023          ASIA
4         UK     London  -67000000  2.8 2023        Europe
5     France      Paris   65000000   NA 2023        Europe
6    Germany     Berlin   83000000  3.8 2023        europe
7      Japan      Tokyo  125000000  5.1 2023          Asia
8      China    Beijing 1400000000 14.7 2023          asia
9     Brazil   Brasilia  212000000  1.8 2023 South America
10     India  New Delhi 1400000000  3.5 2023          asia
11    Canada     Ottawa   38000000  1.9 2023 North America
12 Australia   Canberra   26000000  1.6 2023     Australia

Explanation

The options(scipen = 999) setting increases the penalty for scientific notation, forcing R to display full numeric values. The R dataset mirrors the SAS one but uses NA for missing values. Data inconsistencies are left intact for the cleaning demonstration. The data.frame() function provides a quick way to simulate structured tabular data. The negative population and inconsistent region casing replicate real-world ingestion issues, and duplicate rows are preserved so deduplication can be demonstrated. This setup allows us to apply dplyr transformations effectively. By keeping the raw data untouched initially, we ensure traceability, a critical requirement in regulated environments like clinical trials (SDTM/ADaM). The dataset acts as a sandbox for demonstrating robust data-cleaning pipelines in R.

3. Phase 1: Data Cleaning in SAS

DATA countries_clean;
  SET countries_raw;

  /* Standardize text */
  Country = PROPCASE(STRIP(Country));
  Region  = UPCASE(STRIP(Region));

  /* Handle missing GDP */
  IF GDP = . THEN GDP = 0;

  /* Replace missing population first: in SAS a missing value compares
     lower than any number, so it would also satisfy Population < 0 */
  IF Population = . THEN Population = 0;

  /* Fix invalid (negative) population */
  ELSE IF Population < 0 THEN Population = ABS(Population);

  /* Format Year */
  Year_num = INPUT(PUT(Year,4.),4.);
RUN;

PROC PRINT DATA=countries_clean;
RUN;

OUTPUT:

Obs  Country    Capital     Region         Population    GDP  Year  Year_num
  1  India      New Delhi   ASIA           1400000000    3.5  2023      2023
  2  Usa        Washington  NORTH AMERICA   330000000   21.4  2023      2023
  3  India      New Delhi   ASIA                    0    3.5  2023      2023
  4  Uk         London      EUROPE           67000000    2.8  2023      2023
  5  France     Paris       EUROPE           65000000    0.0  2023      2023
  6  Germany    Berlin      EUROPE           83000000    3.8  2023      2023
  7  Japan      Tokyo       ASIA            125000000    5.1  2023      2023
  8  China      Beijing     ASIA           1400000000   14.7  2023      2023
  9  Brazil     Brasilia    SOUTH AMERICA   212000000    1.8  2023      2023
 10  India      New Delhi   ASIA           1400000000    3.5  2023      2023
 11  Canada     Ottawa      NORTH AMERICA    38000000    1.9  2023      2023
 12  Australia  Canberra    AUSTRALIA        26000000    1.6  2023      2023

/* Remove duplicates */
PROC SORT DATA=countries_clean NODUPKEY;
  BY Country Capital;
RUN;

PROC PRINT DATA=countries_clean;
RUN;

OUTPUT:

Obs  Country    Capital     Region         Population    GDP  Year  Year_num
  1  Australia  Canberra    AUSTRALIA        26000000    1.6  2023      2023
  2  Brazil     Brasilia    SOUTH AMERICA   212000000    1.8  2023      2023
  3  Canada     Ottawa      NORTH AMERICA    38000000    1.9  2023      2023
  4  China      Beijing     ASIA           1400000000   14.7  2023      2023
  5  France     Paris       EUROPE           65000000    0.0  2023      2023
  6  Germany    Berlin      EUROPE           83000000    3.8  2023      2023
  7  India      New Delhi   ASIA           1400000000    3.5  2023      2023
  8  Japan      Tokyo       ASIA            125000000    5.1  2023      2023
  9  Uk         London      EUROPE           67000000    2.8  2023      2023
 10  Usa        Washington  NORTH AMERICA   330000000   21.4  2023      2023

Explanation

This step applies core SAS data-cleaning techniques. PROPCASE standardizes country names, ensuring uniformity. UPCASE normalizes region values, enabling accurate grouping. Missing GDP values are replaced with zero, which keeps them usable for aggregation. Negative population values are corrected using ABS, ensuring logical consistency, and missing population values are set to zero to avoid analysis errors (be careful here: in SAS, a missing value also satisfies Population < 0, because missing compares lower than any number). INPUT and PUT functions ensure proper numeric formatting. Finally, PROC SORT NODUPKEY removes duplicate records based on the key variables. This ensures dataset integrity and prevents inflated counts during analysis. The approach reflects real-world SAS programming practice in clinical and business analytics.

4. Phase 2: Data Cleaning in R

library(dplyr)

countries_clean <- countries_raw %>%
  mutate(
    Country = tools::toTitleCase(trimws(Country)),
    Region = toupper(trimws(Region)),
    GDP = ifelse(is.na(GDP), 0, GDP),
    Population = ifelse(Population < 0, abs(Population), Population),
    Population = ifelse(is.na(Population), 0, Population)
  ) %>%
  distinct(Country, Capital, .keep_all = TRUE)

OUTPUT:

     Country    Capital Population  GDP Year        Region
1      India  New Delhi 1400000000  3.5 2023          ASIA
2        USA Washington  330000000 21.4 2023 NORTH AMERICA
3         UK     London   67000000  2.8 2023        EUROPE
4     France      Paris   65000000  0.0 2023        EUROPE
5    Germany     Berlin   83000000  3.8 2023        EUROPE
6      Japan      Tokyo  125000000  5.1 2023          ASIA
7      China    Beijing 1400000000 14.7 2023          ASIA
8     Brazil   Brasilia  212000000  1.8 2023 SOUTH AMERICA
9     Canada     Ottawa   38000000  1.9 2023 NORTH AMERICA
10 Australia   Canberra   26000000  1.6 2023     AUSTRALIA

Explanation

The R cleaning pipeline uses dplyr for transformation. trimws() removes unwanted spaces, while tools::toTitleCase() standardizes country names; unlike SAS's PROPCASE (which turned USA into Usa), toTitleCase() leaves all-caps abbreviations such as USA and UK intact. toupper() ensures consistent region formatting. Missing GDP values are replaced using ifelse(), negative population values are corrected using abs(), and missing population values are handled explicitly. The distinct() function removes duplicate records based on key identifiers, keeping the first occurrence. This pipeline demonstrates a declarative and readable approach to data cleaning. Compared to SAS, R's chaining makes transformations easier to interpret and debug, which is particularly useful in exploratory data analysis workflows.

5. Phase 3: Additional SAS Processing + PROC MEANS vs SUMMARY

PROC MEANS DATA=countries_clean N MEAN SUM MAXDEC=2;
  CLASS Region;
  VAR Population GDP;
RUN;

OUTPUT:

The MEANS Procedure

Region         N Obs  Variable     N          Mean            Sum
ASIA               3  Population   3  975000000.00  2925000000.00
                      GDP          3          7.77          23.30
AUSTRALIA          1  Population   1   26000000.00    26000000.00
                      GDP          1          1.60           1.60
EUROPE             3  Population   3   71666666.67   215000000.00
                      GDP          3          2.20           6.60
NORTH AMERICA      2  Population   2  184000000.00   368000000.00
                      GDP          2         11.65          23.30
SOUTH AMERICA      1  Population   1  212000000.00   212000000.00
                      GDP          1          1.80           1.80

PROC SUMMARY DATA=countries_clean NWAY;
  CLASS Region;
  VAR Population GDP;
  OUTPUT OUT=summary_out MEAN= SUM=;
RUN;

PROC PRINT DATA=summary_out;
RUN;

OUTPUT:

Obs  Region         _TYPE_  _FREQ_    Population      GDP
  1  ASIA                1       3  975000000.00   7.7667
  2  AUSTRALIA           1       1   26000000.00   1.6000
  3  EUROPE              1       3   71666666.67   2.2000
  4  NORTH AMERICA       1       2  184000000.00  11.6500
  5  SOUTH AMERICA       1       1  212000000.00   1.8000

Explanation

PROC MEANS and PROC SUMMARY are nearly identical but differ in default output behavior. PROC MEANS prints results directly, making it useful for quick analysis. PROC SUMMARY is more flexible: it suppresses printed output unless the PRINT option is specified, and it is ideal for creating output datasets (OUT=). The NWAY option keeps only the highest-level combination of the CLASS variables in the output dataset. Both procedures calculate statistics such as N, mean, and sum. In production environments, PROC SUMMARY is preferred for pipeline integration, while PROC MEANS is used for exploratory checks. Understanding this distinction is critical for efficient SAS programming.

6. 20 Additional Data Cleaning Best Practices

  1. Always validate source data before transformation
  2. Maintain audit trails for regulatory compliance
  3. Use metadata-driven programming (CDISC standards)
  4. Validate ranges (e.g., age 0–120)
  5. Standardize formats early
  6. Avoid overwriting raw data
  7. Document all transformations
  8. Use macros for repeatable cleaning
  9. Perform QC checks independently
  10. Track missing value patterns
  11. Use controlled terminology (MedDRA, WHO)
  12. Validate key identifiers
  13. Ensure referential integrity
  14. Apply consistent date formats
  15. Log all warnings/errors
  16. Use version control
  17. Validate duplicates carefully
  18. Perform cross-domain checks
  19. Automate validation rules
  20. Ensure reproducibility
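Several of these practices, such as range validation, automated validation rules, and logging of failures, are language-agnostic. As a minimal sketch (in Python, used here only as neutral pseudocode-style illustration; the rule names and record structure are hypothetical), a rule-driven validation pass might look like this:

```python
# Hypothetical validation rules: each rule is a (name, predicate) pair
# applied to every record. Failures are logged, not silently "fixed".
RULES = [
    ("population_non_negative", lambda r: r["Population"] is None or r["Population"] >= 0),
    ("year_in_range",           lambda r: 1900 <= r["Year"] <= 2100),
    ("gdp_present",             lambda r: r["GDP"] is not None),
]

def validate(records):
    """Return (row_index, rule_name) for every failed check."""
    failures = []
    for i, rec in enumerate(records):
        for name, check in RULES:
            if not check(rec):
                failures.append((i, name))
    return failures

rows = [
    {"Country": "UK",     "Population": -67000000, "GDP": 2.8,  "Year": 2023},
    {"Country": "France", "Population": 65000000,  "GDP": None, "Year": 2023},
]
print(validate(rows))  # [(0, 'population_non_negative'), (1, 'gdp_present')]
```

The same idea carries directly into SAS (a DATA step writing to an exceptions dataset) or R (a list of predicate functions applied with sapply): the key design choice is that rules are data, so adding a new check never changes the engine.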

7. Business Logic Behind Data Cleaning

Data cleaning is not just technical; it is deeply tied to business logic. Missing values are replaced based on context. For example, in clinical trials a missing patient weight may be imputed using last observation carried forward (LOCF). In business datasets, a missing GDP might be set to zero or estimated from historical averages.

Unrealistic values must be corrected because they distort analysis. A negative population is logically impossible and must be fixed or removed. Similarly, salary normalization ensures fair comparisons across regions.

Date imputation is critical. If treatment start dates are missing, derived variables like study duration become invalid. In financial datasets, incorrect dates can affect revenue recognition.

Every correction must align with domain knowledge. Blind transformations can introduce bias. For instance, replacing all missing values with zero may underestimate averages.
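The LOCF imputation mentioned above is simple to sketch. Here is a minimal, language-neutral illustration (in Python; real clinical implementations follow protocol-specific rules and would be validated accordingly):

```python
# Last observation carried forward (LOCF): each missing value (None)
# is replaced by the most recent observed value. Leading missing
# values stay missing because nothing precedes them.
def locf(values):
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

weights = [72.5, None, None, 71.8, None]
print(locf(weights))  # [72.5, 72.5, 72.5, 71.8, 71.8]
```

Note how LOCF embeds a business assumption, that the last measurement is still representative, which is exactly why such imputations must be justified by domain knowledge rather than applied blindly.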

Ultimately, clean data ensures:

  • Accurate reporting
  • Reliable decision-making
  • Regulatory compliance

8. 20 Key Points

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing values distort statistics
  • Duplicates inflate metrics
  • Validation prevents errors
  • Clean data builds trust
  • Automation improves efficiency
  • Documentation ensures transparency
  • QC is non-negotiable
  • Regulatory compliance depends on data quality
  • Consistency drives accuracy
  • Raw data must remain untouched
  • Transformation logic must be traceable
  • Business rules guide cleaning
  • Outliers must be investigated
  • Metadata defines structure
  • Clean data accelerates insights
  • Poor data wastes resources
  • Data quality is a competitive advantage
  • Analytics is only as good as the data

9. Summary

SAS and R both provide powerful capabilities for data cleaning, but they differ in approach and strengths. SAS excels in structured, regulatory-driven environments such as clinical trials. Its procedures, like PROC SORT, PROC MEANS, and PROC SUMMARY, provide robust and standardized workflows, and SAS ensures the reproducibility, auditability, and compliance that are critical in industries like pharmaceuticals.

R, on the other hand, offers flexibility and readability. Packages like dplyr allow chaining transformations, making the code intuitive and concise. R is highly effective for exploratory data analysis and rapid prototyping.

In this blog, we transformed a messy dataset of global countries and capitals into a clean, analysis-ready dataset. We handled missing values, corrected invalid data, standardized text, and removed duplicates. We also explored the difference between PROC MEANS and PROC SUMMARY, highlighting their practical use cases.

The key takeaway is that data cleaning is not optional; it is foundational. Whether using SAS or R, the goal remains the same: ensure accuracy, consistency, and reliability.

10. Conclusion

Data cleaning is often underestimated, yet it is the backbone of any analytical process. Without clean data, even the most sophisticated models fail. In real-world scenarios, whether clinical trials, financial analytics, or global datasets, data arrives messy, inconsistent, and incomplete.

This blog demonstrated how to systematically approach data cleaning using SAS and R. By creating a flawed dataset and correcting it step by step, we highlighted practical techniques that are directly applicable in industry settings.

The comparison between PROC MEANS and PROC SUMMARY emphasized the importance of choosing the right tool for the task. While both produce similar statistical output, their usage differs by context: exploratory analysis versus production pipelines.

A structured data cleaning framework ensures:

  • Consistency across datasets
  • Reliable statistical outputs
  • Regulatory compliance
  • Faster decision-making

For professionals preparing for interviews or working in data-intensive roles, mastering these concepts is essential. Employers look for candidates who not only write code but understand the logic behind transformations.

In conclusion, clean data is not a luxury; it is a necessity. Investing time in data cleaning pays dividends in accuracy, trust, and performance.

11. Interview Questions

Q1: Difference between PROC MEANS and PROC SUMMARY?

Answer: PROC MEANS prints output by default; PROC SUMMARY suppresses printed output and is typically used to create output datasets, which makes it preferred in pipelines.

Q2: How do you handle negative values in clinical datasets?

Answer: Use validation rules. Replace with missing or correct using ABS if justified.

Q3: How do you remove duplicates in SAS?

Answer: Use PROC SORT NODUPKEY with key variables.

Q4: How do you standardize text in R?

Answer: Use toupper(), tolower(), trimws().

Q5: Scenario: GDP column has NULL and missing values what will you do?

Answer: Convert NULL to missing, then apply imputation (0 or mean based on business logic).
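That two-step answer, normalize placeholders to true missing values first, then impute according to the business rule, can be sketched as follows (a minimal Python illustration; the function name and the strategy parameter are hypothetical):

```python
# Step 1: convert textual placeholders ("NULL", "", etc.) to a real
# missing value by attempting numeric conversion.
# Step 2: impute. Whether to fill with 0 or the mean is a business
# decision, so it is passed in as a parameter.
def clean_gdp(raw, strategy="zero"):
    parsed = []
    for v in raw:
        try:
            parsed.append(float(v))
        except (TypeError, ValueError):  # "NULL", None, "" all land here
            parsed.append(None)
    if strategy == "mean":
        observed = [v for v in parsed if v is not None]
        fill = sum(observed) / len(observed)
    else:
        fill = 0.0
    return [fill if v is None else v for v in parsed]

print(clean_gdp(["3.5", "NULL", "2.8"]))          # [3.5, 0.0, 2.8]
print(clean_gdp(["2.0", "NULL", "4.0"], "mean"))  # [2.0, 3.0, 4.0]
```

In SAS, step 1 happens implicitly (reading "NULL" into a numeric variable yields missing plus a log note); the point of the interview answer is that step 2 must always be an explicit, documented business decision.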

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only, using global country and capital data.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..




--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
