Global Divorce Data Uncovered: Turning Inconsistent Records into Analytical Gold with SAS and R

1. Introduction

Imagine you’re working on a global analytics project analyzing divorce trends across countries. The goal is serious: identify patterns, support policymakers, and even assist social researchers. But when the dataset arrives, it’s chaos.

Some countries are written as “india”, some as “INDIA”, and others as “InDia”. Divorce counts include negative numbers. Dates appear as 2023-15-40. Ages of individuals include -5. Duplicate records inflate statistics. Missing values are scattered everywhere.

This is not just “messy data.” This is dangerous data.

In clinical trials or business analytics, bad data leads to:

  • Incorrect conclusions
  • Regulatory rejection
  • Financial losses
  • Poor decision-making

This is where SAS and R become powerful. SAS brings structured, regulated, audit-friendly processing, especially critical in CDISC environments (SDTM/ADaM). R brings flexibility, rapid transformations, and exploratory power.

In this blog, we’ll take a realistic messy divorce dataset, intentionally inject errors, and then clean, refine, and analyze it using SAS (PROC SQL vs DATA Step) and R like a real-world project.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset with Errors)

DATA divorce_raw;
  INFILE DATALINES DLM=',' MISSOVER;
  INPUT Country $ Age Divorce_Count Divorce_Date :$12.
        Income Status $ Gender $ ID;
DATALINES;
India,35,120,2023-01-15,50000,active,M,101
india,.,-50,2023-15-40,45000,Active,F,102
USA,45,200,2023-02-10,.,inactive,M,103
UK,-5,180,2023-03-05,60000,NULL,F,104
France,50,220,.,70000,active,M,105
Germany,38,200,2023-04-12,55000,ACTIVE,F,106
India,35,120,2023-01-15,50000,active,M,101
Brazil,29,160,2023-05-01,48000, ,F,107
China,42,210,2023-06-11,62000,active,NULL,108
Japan,37,190,2023-07-21,58000,active,M,109
;
RUN;

PROC PRINT DATA=divorce_raw;
RUN;

OUTPUT:

Obs  Country   Age  Divorce_Count  Divorce_Date  Income  Status    Gender  ID
  1  India      35            120  2023-01-15     50000  active    M       101
  2  india       .            -50  2023-15-40     45000  Active    F       102
  3  USA        45            200  2023-02-10         .  inactive  M       103
  4  UK         -5            180  2023-03-05     60000  NULL      F       104
  5  France     50            220                 70000  active    M       105
  6  Germany    38            200  2023-04-12     55000  ACTIVE    F       106
  7  India      35            120  2023-01-15     50000  active    M       101
  8  Brazil     29            160  2023-05-01     48000            F       107
  9  China      42            210  2023-06-11     62000  active    NULL    108
 10  Japan      37            190  2023-07-21     58000  active    M       109

Explanation (SAS Raw Data)

This dataset intentionally contains:

  • Missing Age (.)
  • Invalid Divorce_Count (-50)
  • Wrong Date (2023-15-40)
  • Duplicate record (India, ID=101)
  • Mixed case text (india, ACTIVE)
  • NULL / blank values

This simulates real-world ingestion from flat files or external systems where validation is absent.
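For readers who want a third cross-check alongside SAS and R, the same intentionally messy table can be sketched in Python with pandas (pandas is not part of the original workflow; this is purely illustrative):

```python
import pandas as pd
import numpy as np

# Illustrative pandas copy of the same messy dataset, with the same
# injected errors: missing Age, negative count, impossible date,
# duplicate ID 101, mixed-case text, and "NULL"/blank strings.
divorce_raw = pd.DataFrame({
    "Country": ["India", "india", "USA", "UK", "France",
                "Germany", "India", "Brazil", "China", "Japan"],
    "Age": [35, np.nan, 45, -5, 50, 38, 35, 29, 42, 37],
    "Divorce_Count": [120, -50, 200, 180, 220, 200, 120, 160, 210, 190],
    "Divorce_Date": ["2023-01-15", "2023-15-40", "2023-02-10", "2023-03-05",
                     None, "2023-04-12", "2023-01-15", "2023-05-01",
                     "2023-06-11", "2023-07-21"],
    "Income": [50000, 45000, np.nan, 60000, 70000,
               55000, 50000, 48000, 62000, 58000],
    "Status": ["active", "Active", "inactive", "NULL", "active",
               "ACTIVE", "active", " ", "active", "active"],
    "Gender": ["M", "F", "M", "F", "M", "F", "M", "F", "NULL", "M"],
    "ID": [101, 102, 103, 104, 105, 106, 101, 107, 108, 109],
})
print(len(divorce_raw))                      # 10 rows in total
print(divorce_raw["ID"].duplicated().sum())  # 1 duplicated ID (101)
```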

R Code – Equivalent Raw Dataset

library(sqldf)

divorce_raw <- data.frame(
  Country = c("India","india","USA","UK","France","Germany",
              "India","Brazil","China","Japan"),
  Age = c(35, NA, 45, -5, 50, 38, 35, 29, 42, 37),
  Divorce_Count = c(120,-50,200,180,220,200,120,160,210,190),
  Divorce_Date = c("2023-01-15","2023-15-40","2023-02-10",
                   "2023-03-05",NA,"2023-04-12","2023-01-15",
                   "2023-05-01","2023-06-11","2023-07-21"),
  Income = c(50000,45000,NA,60000,70000,55000,50000,48000,62000,58000),
  Status = c("active","Active","inactive","NULL","active","ACTIVE",
             "active"," ","active","active"),
  Gender = c("M","F","M","F","M","F","M","F","NULL","M"),
  ID = c(101,102,103,104,105,106,101,107,108,109)
)

OUTPUT:

   Country Age Divorce_Count Divorce_Date Income   Status Gender  ID
1    India  35           120   2023-01-15  50000   active      M 101
2    india  NA           -50   2023-15-40  45000   Active      F 102
3      USA  45           200   2023-02-10     NA inactive      M 103
4       UK  -5           180   2023-03-05  60000     NULL      F 104
5   France  50           220         <NA>  70000   active      M 105
6  Germany  38           200   2023-04-12  55000   ACTIVE      F 106
7    India  35           120   2023-01-15  50000   active      M 101
8   Brazil  29           160   2023-05-01  48000               F 107
9    China  42           210   2023-06-11  62000   active   NULL 108
10   Japan  37           190   2023-07-21  58000   active      M 109

Explanation (R Raw Data)

The same inconsistencies are recreated in R. This supports cross-platform validation, a common requirement in pharma where SAS is the primary system and R is used for QC.

PROC SQL Version in SAS

DATA dummy;
  x = 1;
RUN;

PROC SQL;
CREATE TABLE divorce_raw_sql AS
SELECT "India" AS Country,
       35 AS Age,
       120 AS Divorce_Count,
       "2023-01-15" AS Divorce_Date,
       50000 AS Income,
       "active" AS Status,
       "M" AS Gender,
       101 AS ID
FROM dummy
UNION ALL SELECT "india", ., -50, "2023-15-40", 45000, "Active", "F", 102 FROM dummy
UNION ALL SELECT "USA", 45, 200, "2023-02-10", ., "inactive", "M", 103 FROM dummy
UNION ALL SELECT "UK", -5, 180, "2023-03-05", 60000, "NULL", "F", 104 FROM dummy
UNION ALL SELECT "France", 50, 220, ".", 70000, "active", "M", 105 FROM dummy
UNION ALL SELECT "Germany", 38, 200, "2023-04-12", 55000, "ACTIVE", "F", 106 FROM dummy
UNION ALL SELECT "India", 35, 120, "2023-01-15", 50000, "active", "M", 101 FROM dummy
UNION ALL SELECT "Brazil", 29, 160, "2023-05-01", 48000, " ", "F", 107 FROM dummy
UNION ALL SELECT "China", 42, 210, "2023-06-11", 62000, "active", "NULL", 108 FROM dummy
UNION ALL SELECT "Japan", 37, 190, "2023-07-21", 58000, "active", "M", 109 FROM dummy
;
QUIT;

PROC PRINT DATA=divorce_raw_sql;
RUN;

OUTPUT:

Obs  Country   Age  Divorce_Count  Divorce_Date  Income  Status    Gender  ID
  1  India      35            120  2023-01-15     50000  active    M       101
  2  india       .            -50  2023-15-40     45000  Active    F       102
  3  USA        45            200  2023-02-10         .  inactive  M       103
  4  UK         -5            180  2023-03-05     60000  NULL      F       104
  5  France     50            220  .              70000  active    M       105
  6  Germany    38            200  2023-04-12     55000  ACTIVE    F       106
  7  India      35            120  2023-01-15     50000  active    M       101
  8  Brazil     29            160  2023-05-01     48000            F       107
  9  China      42            210  2023-06-11     62000  active    NULL    108
 10  Japan      37            190  2023-07-21     58000  active    M       109

Explanation (SAS PROC SQL Dataset)

This approach uses SELECT ... UNION ALL to manually construct rows.

Key Concepts:

  • Each SELECT represents one observation
  • UNION ALL stacks rows (like SET in DATA step)
  • Missing numeric values are represented as .
  • Character variables must be enclosed in quotes

Why this is useful:

  • Helpful when generating test datasets programmatically
  • Useful in macro-driven dynamic dataset creation
  • Often used in unit testing and validation scenarios

Limitation:

  • Not scalable for large datasets
  • Hard to maintain compared to DATA step

👉 Conclusion:
Use PROC SQL for dataset creation only when the data is small and static.
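The same SELECT ... UNION ALL pattern works in any SQL engine. As a hedged, minimal sketch outside SAS, here is the idea in SQLite via Python (the table name and the three trimmed-down columns are made up for the example):

```python
import sqlite3

# Sketch of programmatic test-row construction: each SELECT is one
# observation, and UNION ALL stacks the rows -- like SET in a DATA step.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE divorce_raw_sql (Country TEXT, Age REAL, ID INTEGER)")
con.execute("""
    INSERT INTO divorce_raw_sql
    SELECT 'India', 35, 101
    UNION ALL SELECT 'india', NULL, 102
    UNION ALL SELECT 'USA', 45, 103
""")
rows = con.execute("SELECT COUNT(*) FROM divorce_raw_sql").fetchone()[0]
print(rows)  # 3
```

As in SAS, this is handy for tiny unit-test fixtures but quickly becomes unmaintainable for anything larger.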

3. Phase 1: Data Cleaning in SAS

DATA divorce_clean;
  SET divorce_raw;

  /* Fix missing Age */
  IF Age = . THEN Age = 30;

  /* Remove negative Divorce_Count */
  IF Divorce_Count < 0 THEN Divorce_Count = ABS(Divorce_Count);

  /* Fix invalid Age */
  IF Age < 0 THEN Age = .;

  /* Standardize Country */
  Country = UPCASE(STRIP(Country));

  /* Fix Gender */
  IF Gender = "NULL" THEN Gender = "UNKNOWN";

  /* Fix Status */
  IF Status = "NULL" THEN Status = "UNKNOWN";
  Status = UPCASE(COALESCEC(Status,"UNKNOWN"));

  /* Convert Date */
  Divorce_Date_num = INPUT(Divorce_Date, yymmdd10.);
  FORMAT Divorce_Date_num DATE9.;
RUN;

PROC PRINT DATA=divorce_clean;
RUN;

OUTPUT:

Obs  Country   Age  Divorce_Count  Divorce_Date  Income  Status    Gender   ID   Divorce_Date_num
  1  INDIA      35            120  2023-01-15     50000  ACTIVE    M        101  15JAN2023
  2  INDIA      30             50  2023-15-40     45000  ACTIVE    F        102  .
  3  USA        45            200  2023-02-10         .  INACTIVE  M        103  10FEB2023
  4  UK          .            180  2023-03-05     60000  UNKNOWN   F        104  05MAR2023
  5  FRANCE     50            220                 70000  ACTIVE    M        105  .
  6  GERMANY    38            200  2023-04-12     55000  ACTIVE    F        106  12APR2023
  7  INDIA      35            120  2023-01-15     50000  ACTIVE    M        101  15JAN2023
  8  BRAZIL     29            160  2023-05-01     48000  UNKNOWN   F        107  01MAY2023
  9  CHINA      42            210  2023-06-11     62000  ACTIVE    UNKNOWN  108  11JUN2023
 10  JAPAN      37            190  2023-07-21     58000  ACTIVE    M        109  21JUL2023

/* Remove duplicates */
PROC SORT DATA=divorce_clean NODUPKEY;
  BY ID;
RUN;

PROC PRINT DATA=divorce_clean;
RUN;

OUTPUT:

Obs  Country   Age  Divorce_Count  Divorce_Date  Income  Status    Gender   ID   Divorce_Date_num
  1  INDIA      35            120  2023-01-15     50000  ACTIVE    M        101  15JAN2023
  2  INDIA      30             50  2023-15-40     45000  ACTIVE    F        102  .
  3  USA        45            200  2023-02-10         .  INACTIVE  M        103  10FEB2023
  4  UK          .            180  2023-03-05     60000  UNKNOWN   F        104  05MAR2023
  5  FRANCE     50            220                 70000  ACTIVE    M        105  .
  6  GERMANY    38            200  2023-04-12     55000  ACTIVE    F        106  12APR2023
  7  BRAZIL     29            160  2023-05-01     48000  UNKNOWN   F        107  01MAY2023
  8  CHINA      42            210  2023-06-11     62000  ACTIVE    UNKNOWN  108  11JUN2023
  9  JAPAN      37            190  2023-07-21     58000  ACTIVE    M        109  21JUL2023

Explanation

This step applies DATA step transformations, ideal for row-wise logic:

  • COALESCEC fills missing text
  • ABS() fixes invalid numeric values
  • INPUT() converts date strings
  • UPCASE() standardizes text
  • PROC SORT NODUPKEY removes duplicates

👉 The DATA Step is preferred for:

  • Row-by-row transformations
  • Complex conditional logic
  • Sequential data processing
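As a cross-check of the DATA step logic above, here is a hedged pandas sketch (not part of the original SAS/R workflow) that applies the same rules in the same order, on a trimmed-down set of columns:

```python
import pandas as pd
import numpy as np

# Miniature copy of a few raw rows (full dataset trimmed for brevity).
df = pd.DataFrame({
    "Country": ["India", "india", "UK", "India"],
    "Age": [35, np.nan, -5, 35],
    "Divorce_Count": [120, -50, 180, 120],
    "Divorce_Date": ["2023-01-15", "2023-15-40", "2023-03-05", "2023-01-15"],
    "ID": [101, 102, 104, 101],
})
df["Age"] = df["Age"].fillna(30)                       # IF Age = . THEN Age = 30
df["Divorce_Count"] = df["Divorce_Count"].abs()        # ABS() on negative counts
df.loc[df["Age"] < 0, "Age"] = np.nan                  # invalid ages -> missing
df["Country"] = df["Country"].str.strip().str.upper()  # UPCASE(STRIP(...))
df["Divorce_Date_num"] = pd.to_datetime(df["Divorce_Date"],
                                        format="%Y-%m-%d", errors="coerce")
df = df.drop_duplicates(subset="ID")                   # PROC SORT NODUPKEY BY ID
print(df["Divorce_Count"].tolist())  # [120, 50, 180]
```

Note the rule ordering matters exactly as in the DATA step: missing Age is imputed to 30 before negative Age is blanked, so -5 ends up missing, not 30.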

4. Phase 2: Data Refinement in R

divorce_clean_sql <- sqldf("
SELECT DISTINCT
    UPPER(TRIM(Country)) AS Country,
    /* Age cleaning */
    CASE
        WHEN Age < 0 THEN NULL
        WHEN Age IS NULL THEN 30
        ELSE Age
    END AS Age,
    /* Divorce count fix */
    CASE
        WHEN Divorce_Count < 0 THEN ABS(Divorce_Count)
        ELSE Divorce_Count
    END AS Divorce_Count,
    /* Date correction */
    CASE
        WHEN LENGTH(Divorce_Date) != 10 THEN NULL
        WHEN substr(Divorce_Date,6,2) > '12' THEN NULL
        WHEN substr(Divorce_Date,9,2) > '31' THEN NULL
        ELSE Divorce_Date
    END AS Divorce_Date,
    /* Income imputation */
    IFNULL(Income, 50000) AS Income,
    /* Status cleaning */
    CASE
        WHEN TRIM(Status) = '' OR UPPER(Status) = 'NULL' THEN 'UNKNOWN'
        ELSE UPPER(TRIM(Status))
    END AS Status,
    /* Gender cleaning */
    CASE
        WHEN Gender IS NULL OR TRIM(Gender) = '' OR UPPER(Gender) = 'NULL'
             THEN 'UNKNOWN'
        ELSE UPPER(TRIM(Gender))
    END AS Gender,
    ID
FROM divorce_raw")

OUTPUT:

  Country Age Divorce_Count Divorce_Date Income   Status  Gender  ID
1   INDIA  35           120   2023-01-15  50000   ACTIVE       M 101
2   INDIA  30            50         <NA>  45000   ACTIVE       F 102
3     USA  45           200   2023-02-10  50000 INACTIVE       M 103
4      UK  NA           180   2023-03-05  60000  UNKNOWN       F 104
5  FRANCE  50           220         <NA>  70000   ACTIVE       M 105
6 GERMANY  38           200   2023-04-12  55000   ACTIVE       F 106
7  BRAZIL  29           160   2023-05-01  48000  UNKNOWN       F 107
8   CHINA  42           210   2023-06-11  62000   ACTIVE UNKNOWN 108
9   JAPAN  37           190   2023-07-21  58000   ACTIVE       M 109


Explanation
This uses sqldf() to run SQL queries inside R:
  • UPPER() and TRIM() standardize text
  • CASE WHEN handles missing and invalid values
  • IFNULL() imputes missing Income
  • SELECT DISTINCT removes exact duplicate rows
  • Always use TRIM() to catch blank-space values
  • Always check for the string "NULL" as well as actual NULLs
  • Never rely on GROUP BY for deduplication

Method                        | Behavior
----------------------------- | -------------------------
DISTINCT                      | Removes exact duplicates
GROUP BY                      | Aggregates → may lose data
GROUP BY without aggregation  | Unsafe / non-standard
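A minimal SQLite demonstration of the first row of this table, DISTINCT removing an exact duplicate (the table name t and its two columns are made up for the example):

```python
import sqlite3

# DISTINCT removes exact duplicate rows; GROUP BY is an aggregation
# tool, not a deduplication tool.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Country TEXT, ID INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("India", 101), ("India", 101), ("USA", 103)])
distinct_rows = con.execute("SELECT DISTINCT Country, ID FROM t").fetchall()
print(len(distinct_rows))  # 2 -- the exact duplicate row is gone
```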

Key Features:

  • UPPER() → standardizes text
  • TRIM() → removes spaces
  • CASE WHEN → conditional cleaning
  • IFNULL() → replaces missing values
  • SELECT DISTINCT → removes duplicates

Why use SQL in R?

  • Useful for users with SQL background
  • Easier for joining multiple datasets
  • Mimics database-like operations

Limitation:

  • Slower than dplyr for large datasets
  • Less idiomatic in R workflows
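The hand-rolled substr() checks above validate the month and day fields separately. An alternative, sketched here in pandas (an assumption, not part of the original workflow), is strict parsing with coercion, which turns any impossible date into a missing value in one step:

```python
import pandas as pd

# Strict format plus errors="coerce": valid strings parse, while
# impossible dates like 2023-15-40 (and missing values) become NaT.
dates = pd.Series(["2023-01-15", "2023-15-40", None, "2023-07-21"])
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
print(parsed.isna().sum())  # 2 invalid/missing dates
```

This catches cases the substr() rules miss (e.g. "2023-02-30"), because the parser knows real calendar arithmetic.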

KEY TAKEAWAYS

  • DATA Step = row-level processing engine
  • PROC SQL = set-based relational engine
  • R (dplyr) = modern data manipulation pipeline
  • SQL in R = bridge for database-style thinking

5. Phase 3: Analysis in SAS (PROC SQL vs DATA Step)

DATA flagged;
  SET divorce_clean;
  LENGTH Flag $ 8;

  /* Flag high divorce counts */
  IF Divorce_Count > 200 THEN Flag = "HIGH";
  ELSE Flag = "NORMAL";
RUN;

PROC PRINT DATA=flagged;
RUN;

OUTPUT:

Obs  Country   Age  Divorce_Count  Divorce_Date  Income  Status    Gender   ID   Divorce_Date_num  Flag
  1  INDIA      35            120  2023-01-15     50000  ACTIVE    M        101  15JAN2023         NORMAL
  2  INDIA      30             50  2023-15-40     45000  ACTIVE    F        102  .                 NORMAL
  3  USA        45            200  2023-02-10         .  INACTIVE  M        103  10FEB2023         NORMAL
  4  UK          .            180  2023-03-05     60000  UNKNOWN   F        104  05MAR2023         NORMAL
  5  FRANCE     50            220                 70000  ACTIVE    M        105  .                 HIGH
  6  GERMANY    38            200  2023-04-12     55000  ACTIVE    F        106  12APR2023         NORMAL
  7  BRAZIL     29            160  2023-05-01     48000  UNKNOWN   F        107  01MAY2023         NORMAL
  8  CHINA      42            210  2023-06-11     62000  ACTIVE    UNKNOWN  108  11JUN2023         HIGH
  9  JAPAN      37            190  2023-07-21     58000  ACTIVE    M        109  21JUL2023         NORMAL

PROC MEANS DATA=flagged;
  CLASS Country;
  VAR Divorce_Count;
RUN;

OUTPUT:

The MEANS Procedure

Analysis Variable : Divorce_Count

Country   N Obs  N  Mean         Std Dev     Minimum      Maximum
BRAZIL        1  1  160.0000000  .           160.0000000  160.0000000
CHINA         1  1  210.0000000  .           210.0000000  210.0000000
FRANCE        1  1  220.0000000  .           220.0000000  220.0000000
GERMANY       1  1  200.0000000  .           200.0000000  200.0000000
INDIA         2  2   85.0000000  49.4974747   50.0000000  120.0000000
JAPAN         1  1  190.0000000  .           190.0000000  190.0000000
UK            1  1  180.0000000  .           180.0000000  180.0000000
USA           1  1  200.0000000  .           200.0000000  200.0000000

PROC SQL Approach

PROC SQL;
CREATE TABLE summary AS
SELECT Country,
       AVG(Divorce_Count) AS Avg_Divorce,
       COUNT(*) AS Total
FROM divorce_clean
GROUP BY Country;
QUIT;

PROC PRINT DATA=summary;
RUN;

OUTPUT:

Obs  Country  Avg_Divorce  Total
  1  BRAZIL           160      1
  2  CHINA            210      1
  3  FRANCE           220      1
  4  GERMANY          200      1
  5  INDIA             85      2
  6  JAPAN            190      1
  7  UK               180      1
  8  USA              200      1

Explanation

  • DATA Step: procedural, row-based
  • PROC SQL: declarative, set-based

👉 Use DATA Step when:

  • Complex derivations
  • Sequential transformations

👉 Use PROC SQL when:

  • Joins
  • Aggregations
  • Data summarization
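The two approaches above, a row-wise flag followed by a set-based summary, can be mirrored in pandas as a quick sketch (the miniature four-row dataset here is illustrative, not the full cleaned table):

```python
import pandas as pd

# Row-wise flag (DATA-step style) then group aggregation (PROC SQL style).
df = pd.DataFrame({
    "Country": ["INDIA", "INDIA", "FRANCE", "CHINA"],
    "Divorce_Count": [120, 50, 220, 210],
})
df["Flag"] = df["Divorce_Count"].apply(lambda x: "HIGH" if x > 200 else "NORMAL")
summary = df.groupby("Country", as_index=False).agg(
    Avg_Divorce=("Divorce_Count", "mean"),
    Total=("Divorce_Count", "size"),
)
print(df["Flag"].tolist())  # ['NORMAL', 'NORMAL', 'HIGH', 'HIGH']
print(summary.loc[summary["Country"] == "INDIA", "Avg_Divorce"].iloc[0])  # 85.0
```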

6. 20 Additional Data Cleaning Best Practices

  1. Always validate SDTM domains against Define.xml
  2. Maintain audit trails for regulatory compliance
  3. Use controlled terminology (CDISC)
  4. Avoid hardcoding values
  5. Perform double programming validation
  6. Check date consistency across domains
  7. Ensure unique subject IDs (USUBJID)
  8. Validate missingness patterns
  9. Use metadata-driven programming
  10. Maintain version control
  11. Perform range checks
  12. Normalize categorical variables
  13. Apply derivation traceability
  14. Validate against SAP specifications
  15. Use macros for reusability
  16. Document assumptions
  17. Perform QC using independent code
  18. Flag outliers
  19. Use PROC COMPARE for validation
  20. Ensure reproducibility

7. Business Logic Behind Data Cleaning

Data cleaning is not just technical; it is business-critical logic enforcement. When we replace missing values, we are making assumptions that directly influence analytical outcomes. For example, imputing a missing patient age with a median value ensures statistical stability but must align with study protocols.

Correcting unrealistic values like negative divorce counts or ages is essential because such values distort aggregates. Imagine reporting an average divorce rate including -50; it falsely lowers the mean, misleading policymakers.

Date imputation is equally critical. Invalid dates disrupt timelines, affecting longitudinal analysis. In clinical trials, incorrect dates can invalidate treatment exposure calculations.

Salary normalization or income standardization ensures comparability across regions. Without it, economic insights become biased.

Ultimately, clean data ensures:

  • Reliable analytics
  • Regulatory compliance
  • Trustworthy decision-making
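As a minimal sketch of the median-imputation idea discussed above (the values are made up, and in practice the imputation rule must follow the study protocol or SAP):

```python
import pandas as pd
import numpy as np

# Median imputation: fill missing ages with the median of observed ages.
age = pd.Series([35, np.nan, 45, 50, 38])
age_imputed = age.fillna(age.median())
print(age_imputed.tolist())  # [35.0, 41.5, 45.0, 50.0, 38.0]
```

The median (41.5 here) is robust to outliers, which is why it is often preferred over the mean for this kind of imputation.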

8. 20 Key Points (Sharp Insights)

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing data must be handled carefully
  • Validation is not optional
  • Duplicates inflate metrics
  • Dates drive timelines
  • Outliers require investigation
  • PROC SQL is set-based power
  • DATA Step is row-level control
  • R is flexible for transformations
  • SAS ensures regulatory compliance
  • Cleaning is 70% of analysis
  • Consistency beats complexity
  • Always audit your data
  • Use macros for scalability
  • Document every transformation
  • QC is mandatory
  • Never trust raw data
  • Clean data = credible insights
  • Garbage in, garbage out

9. Summary

This project demonstrates how messy real-world datasets like global divorce statistics can be systematically transformed into reliable analytical assets using SAS and R. The journey begins with raw, inconsistent data containing missing values, duplicates, invalid entries, and formatting issues. Without intervention, such datasets can mislead stakeholders and compromise decisions.

Using the SAS DATA Step, we handled row-level transformations efficiently. It allowed precise control over conditional logic, making it ideal for correcting invalid values, imputing missing fields, and standardizing variables. PROC SORT ensured duplicate removal, reinforcing data integrity.

PROC SQL, on the other hand, excelled in aggregation and summarization. Its declarative nature made it efficient for computing averages and grouping results across countries. Understanding when to use DATA Step versus PROC SQL is crucial DATA Step for transformation, SQL for summarization.

R complemented SAS by providing a flexible, SQL-friendly approach. Through sqldf(), constructs like CASE WHEN, IFNULL(), and SELECT DISTINCT enabled rapid data cleaning and transformation, making R an excellent tool for exploratory data analysis and validation.

Together, SAS and R form a powerful ecosystem. SAS ensures structure, compliance, and scalability, while R provides agility and analytical depth. Mastering both allows professionals to handle any data challenge with confidence.

10. Conclusion

In any data-driven environment, whether clinical research, finance, or social analytics, data cleaning is not a preliminary step; it is the foundation of truth. This project on global divorce data illustrates a universal reality: raw data is rarely analysis-ready.

By leveraging SAS and R, we transformed chaos into clarity. The SAS DATA Step provided granular control, allowing us to correct errors at the row level. PROC SQL offered efficiency in summarizing and aggregating insights. R added flexibility, enabling rapid iteration and validation.

The key takeaway is not just technical; it is strategic. Choosing between PROC SQL and DATA Step is not about preference; it is about understanding the problem. Complex transformations demand DATA Step precision, while relational operations benefit from SQL efficiency.

In regulated environments like clinical trials, this distinction becomes even more critical. Every transformation must be traceable, reproducible, and compliant with standards like CDISC. Here, SAS stands out as the gold standard.

However, modern data professionals must be bilingual: fluent in both SAS and R. This dual capability ensures not only accuracy but also adaptability.

Ultimately, clean data is credible data. And credible data drives confident decisions.

11. Interview Questions

Q1: When would you use PROC SQL over DATA Step?
A: Use PROC SQL for joins, aggregations, and summarization. DATA Step is better for row-wise transformations.

Q2: How do you handle duplicate records in SAS?
A: Use PROC SORT NODUPKEY BY key_variable;

Q3: In R, how do you standardize inconsistent text values?
A: Use toupper(trimws(column))

Q4: What would you do if a dataset has negative age values?
A: Set them to missing or correct using domain logic.

Q5: How do you validate cleaned data?
A: Use PROC COMPARE (SAS) or all.equal() in R for QC.
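For completeness, here is a Python analogue of the QC comparison in Q5; pandas.testing is an assumption standing in for PROC COMPARE / all.equal(), not part of the original workflow:

```python
import pandas as pd

# Double-programming QC: compare the production dataset against an
# independently programmed copy; any differing cell raises an error.
prod = pd.DataFrame({"ID": [101, 103], "Divorce_Count": [120, 200]})
qc = pd.DataFrame({"ID": [101, 103], "Divorce_Count": [120, 200]})
pd.testing.assert_frame_equal(prod, qc)  # silent when identical
print("QC passed")
```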

SAS vs R Comparison

Feature                | SAS      | R
---------------------- | -------- | --------
Regulatory Compliance  | Strong   | Moderate
Flexibility            | Moderate | High
Performance            | High     | High
Learning Curve         | Medium   | Medium
Visualization          | Limited  | Strong

FINAL COMPARISON: DATA STEP vs PROC SQL vs R SQL

Feature           | DATA Step (SAS) | PROC SQL (SAS) | SQL in R
----------------- | --------------- | -------------- | ---------
Row-wise control  | Excellent       | Limited        | Limited
Aggregation       | Moderate        | Excellent      | Excellent
Data creation     | Easy            | Hard           | Easy
Readability       | High            | Medium         | Medium
Performance       | High            | High           | Medium

Validation Checklist

  • Missing values handled
  • Duplicates removed
  • Text standardized
  • Dates validated
  • Outliers checked
  • QC performed

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only, using simulated divorce data.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:


