Strategy When Divorce Records Break Dashboards: Cleaning Legal Data with SAS and R

Broken Promises, Corrupted Records & Divorce Data Engineering Excellence

Why Divorce Data Requires Special Cleaning

Divorce registries are among the most complex administrative datasets because they contain:

  • Former spouse information
  • Court case details
  • Legal settlement amounts
  • Custody information
  • Regional court jurisdictions
  • Filing dates
  • Final judgment dates
  • Lawyer information

Small errors can produce:

  • Incorrect legal reports
  • Wrong custody statistics
  • Duplicate court cases
  • Invalid settlement calculations
  • Misleading demographic analyses

1. Create a Raw Divorce Dataset in SAS

SAS Raw Divorce Dataset 

data divorce_raw;

length Case_ID $12 Husband_Name $40 Wife_Name $40 Court_Code $8 Lawyer_Email $60

Settlement_Amount $15 Custody_Status $20 Case_Status $15 Filing_Date_Raw $20

Final_Order_Raw $20 City $25;

input Case_ID $ Husband_Name $ Wife_Name $ Years_Married Children_Count

Court_Code $ Lawyer_Email $ Settlement_Amount $ Custody_Status $ Case_Status $

Filing_Date_Raw $ Final_Order_Raw $ City $;

datalines;

D1001 Ravi Anjali 12 2 AP01 lawyer1@court.in 500000 Joint Closed 12-01-2025 15-04-2025 Hyderabad

D1002 Ravi Anjali 12 2 AP01 lawyer1@court.in 500000 Joint Closed 12-01-2025 15-04-2025 Hyderabad

D1003 Kiran Meena -2 1 TN01 invalidmail 700000 Mother Active 15/02/2025 . Chennai

D1004 Suresh Kavya 45 0 KA01 lawyer4@court.in -800000 Father Closed 18MAR2025 20APR2025 Bangalore

D1005 Mohan Radha 8 3 AP02 lawyer5@court.in abc500 Joint Active . . Vijayawada

D1006 Akash Nisha 110 1 MH01 lawyer6@court.in 900000 Mother Closed 12APR2025 15MAY2025 Mumbai

D1007 Arjun Priya 15 2 XX01 lawyer7@court.in 300000 Joint Closed 10-10-2025 12-12-2025 Delhi

D1008 Tarun Pooja 7 . AP03 lawyer8@court.in 450000 Father Active 20JAN2025 . Hyderabad

D1009 Raj Divya 14 2 KA02 lawyer9@court.in 1000000 Joint Closed 15FEB2025 18MAR2025 Mysore

D1010 Naveen Sneha 5 1 AP04 lawyer10@court.in 550000 Mother Active 17MAR2025 . Vizag

D1011 Rakesh Simran 20 2 MH02 lawyer11@court.in 850000 Joint Closed 18APR2025 22JUN2025 Pune

D1012 Ajay Priti 18 1 TN02 lawyer12@court.in 620000 Father Closed 10MAY2025 20JUL2025 Chennai

D1013 Deepak Rani 13 2 AP05 lawyer13@court.in 580000 Joint Active 14JUN2025 . Guntur

D1014 Vinod Rekha 9 0 KA03 lawyer14@court.in 400000 Mother Closed 15JUL2025 18SEP2025 Hubli

D1015 Rahul Keerthi 16 2 AP06 lawyer15@court.in 760000 Joint Active 21AUG2025 . Tirupati

D1016 Mahesh Swathi 4 1 TN03 lawyer16@court.in 250000 Father Closed 12SEP2025 15NOV2025 Madurai

D1017 Dinesh Latha 11 3 MH03 lawyer17@court.in 690000 Joint Active 17OCT2025 . Nagpur

D1018 Harish Deepa 6 0 AP07 lawyer18@court.in 350000 Mother Closed 20NOV2025 22DEC2025 Nellore

D1019 Vijay Sita 21 2 KA04 lawyer19@court.in 920000 Joint Closed 05DEC2025 10JAN2026 Belgaum

D1020 Anand Jyothi 10 1 AP08 lawyer20@court.in 480000 Father Active 12DEC2025 . Kurnool

;

run;

proc print data = divorce_raw;

run;

OUTPUT:

ObsCase_IDHusband_NameWife_NameCourt_CodeLawyer_EmailSettlement_AmountCustody_StatusCase_StatusFiling_Date_RawFinal_Order_RawCityYears_MarriedChildren_Count
1D1001RaviAnjaliAP01lawyer1@court.in500000JointClosed12-01-202515-04-2025Hyderabad122
2D1002RaviAnjaliAP01lawyer1@court.in500000JointClosed12-01-202515-04-2025Hyderabad122
3D1003KiranMeenaTN01invalidmail700000MotherActive15/02/2025 Chennai-21
4D1004SureshKavyaKA01lawyer4@court.in-800000FatherClosed18MAR202520APR2025Bangalore450
5D1005MohanRadhaAP02lawyer5@court.inabc500JointActive  Vijayawada83
6D1006AkashNishaMH01lawyer6@court.in900000MotherClosed12APR202515MAY2025Mumbai1101
7D1007ArjunPriyaXX01lawyer7@court.in300000JointClosed10-10-202512-12-2025Delhi152
8D1008TarunPoojaAP03lawyer8@court.in450000FatherActive20JAN2025 Hyderabad7.
9D1009RajDivyaKA02lawyer9@court.in1000000JointClosed15FEB202518MAR2025Mysore142
10D1010NaveenSnehaAP04lawyer10@court.in550000MotherActive17MAR2025 Vizag51
11D1011RakeshSimranMH02lawyer11@court.in850000JointClosed18APR202522JUN2025Pune202
12D1012AjayPritiTN02lawyer12@court.in620000FatherClosed10MAY202520JUL2025Chennai181
13D1013DeepakRaniAP05lawyer13@court.in580000JointActive14JUN2025 Guntur132
14D1014VinodRekhaKA03lawyer14@court.in400000MotherClosed15JUL202518SEP2025Hubli90
15D1015RahulKeerthiAP06lawyer15@court.in760000JointActive21AUG2025 Tirupati162
16D1016MaheshSwathiTN03lawyer16@court.in250000FatherClosed12SEP202515NOV2025Madurai41
17D1017DineshLathaMH03lawyer17@court.in690000JointActive17OCT2025 Nagpur113
18D1018HarishDeepaAP07lawyer18@court.in350000MotherClosed20NOV202522DEC2025Nellore60
19D1019VijaySitaKA04lawyer19@court.in920000JointClosed05DEC202510JAN2026Belgaum212
20D1020AnandJyothiAP08lawyer20@court.in480000FatherActive12DEC2025 Kurnool101

Data Quality Problems Intentionally Added

Issue

Example

Duplicate Cases

D1001, D1002

Invalid Marriage Years

-2

Unrealistic Marriage Duration

110

Invalid Court Code

XX01

Invalid Email

invalidmail

Character Amount

abc500

Negative Settlement

-800000

Missing Children Count

.

Missing Dates

.

These are realistic issues commonly found in legal and government datasets.

SAS Cleaning Using DIFFERENT Statements

Instead of ARRAYS and COMPRESS, use:

  • VERIFY()
  • TRANWRD()
  • CATX()
  • FIND()
  • INPUTN()
  • COALESCEC()
  • MISSING()
  • NOTDIGIT()

Explanation

This SAS program creates the initial raw divorce registration dataset named DIVORCE_RAW. In any real-world data engineering project, the first step is usually to build or receive a source dataset from external systems such as government courts, legal departments, CRM systems, or data warehouses. This dataset intentionally contains quality issues that will later be identified and corrected through data cleaning procedures.

The DATA statement tells SAS to create a new dataset called DIVORCE_RAW. Every observation entered through the DATALINES statement becomes one row in the dataset.

The LENGTH statement is extremely important because it explicitly defines the storage size of character variables. For example, Case_ID $12 reserves twelve characters for the case identifier. If the LENGTH statement is omitted, SAS determines character lengths based on the first encountered value, which may lead to truncation problems later. Production programmers almost always define LENGTH before reading data.

Variables such as Husband_Name, Wife_Name, Court_Code, Lawyer_Email, and City are defined as character variables using the $ symbol. Numeric variables like Years_Married and Children_Count do not require $.

The INPUT statement tells SAS how to read each field. The ampersand (&) modifier allows SAS to read character values that may contain embedded spaces. This is useful when names contain multiple words.

Finally, the DATALINES statement provides the actual divorce records. These records intentionally contain duplicates, invalid emails, unrealistic marriage durations, and inconsistent values to simulate real-world legal data challenges. This raw dataset serves as the foundation for all subsequent cleaning, validation, reporting, and analytical activities.

2.SAS Cleaning Program

data divorce_clean;

set divorce_raw;

Lawyer_Email=lowcase(strip(Lawyer_Email));

Court_Code=upcase(Court_Code);

City=propcase(City);

if Years_Married < 0 then Years_Married=.;

if Years_Married > 60 then Years_Married=.;

Settlement_Num=input(Settlement_Amount,?? best12.);

Settlement_Num=abs(Settlement_Num);

if missing(Settlement_Num) then

Settlement_Flag='INVALID';

else Settlement_Flag='VALID';

if find(Lawyer_Email,'@')=0 then

Email_Flag='INVALID';

else Email_Flag='VALID';

Filing_Date=input(Filing_Date_Raw,anydtdte20.);

Final_Order=input(Final_Order_Raw,anydtdte20.);

format Filing_Date Final_Order date9.;

Case_Label=catx('-',Court_Code,Case_ID,City);

drop Settlement_Amount;

rename Settlement_Num = Settlement_Amount;

run;

proc print data = divorce_clean;

run;

OUTPUT:

ObsCase_IDHusband_NameWife_NameCourt_CodeLawyer_EmailCustody_StatusCase_StatusFiling_Date_RawFinal_Order_RawCityYears_MarriedChildren_CountSettlement_AmountSettlement_FlagEmail_FlagFiling_DateFinal_OrderCase_Label
1D1001RaviAnjaliAP01lawyer1@court.inJointClosed12-01-202515-04-2025Hyderabad122500000VALIDVALID01DEC202515APR2025AP01-D1001-Hyderabad
2D1002RaviAnjaliAP01lawyer1@court.inJointClosed12-01-202515-04-2025Hyderabad122500000VALIDVALID01DEC202515APR2025AP01-D1002-Hyderabad
3D1003KiranMeenaTN01invalidmailMotherActive15/02/2025 Chennai.1700000VALIDINVALID15FEB2025.TN01-D1003-Chennai
4D1004SureshKavyaKA01lawyer4@court.inFatherClosed18MAR202520APR2025Bangalore450800000VALIDVALID18MAR202520APR2025KA01-D1004-Bangalore
5D1005MohanRadhaAP02lawyer5@court.inJointActive  Vijayawada83.INVALIDVALID..AP02-D1005-Vijayawada
6D1006AkashNishaMH01lawyer6@court.inMotherClosed12APR202515MAY2025Mumbai.1900000VALIDVALID12APR202515MAY2025MH01-D1006-Mumbai
7D1007ArjunPriyaXX01lawyer7@court.inJointClosed10-10-202512-12-2025Delhi152300000VALIDVALID10OCT202512DEC2025XX01-D1007-Delhi
8D1008TarunPoojaAP03lawyer8@court.inFatherActive20JAN2025 Hyderabad7.450000VALIDVALID20JAN2025.AP03-D1008-Hyderabad
9D1009RajDivyaKA02lawyer9@court.inJointClosed15FEB202518MAR2025Mysore1421000000VALIDVALID15FEB202518MAR2025KA02-D1009-Mysore
10D1010NaveenSnehaAP04lawyer10@court.inMotherActive17MAR2025 Vizag51550000VALIDVALID17MAR2025.AP04-D1010-Vizag
11D1011RakeshSimranMH02lawyer11@court.inJointClosed18APR202522JUN2025Pune202850000VALIDVALID18APR202522JUN2025MH02-D1011-Pune
12D1012AjayPritiTN02lawyer12@court.inFatherClosed10MAY202520JUL2025Chennai181620000VALIDVALID10MAY202520JUL2025TN02-D1012-Chennai
13D1013DeepakRaniAP05lawyer13@court.inJointActive14JUN2025 Guntur132580000VALIDVALID14JUN2025.AP05-D1013-Guntur
14D1014VinodRekhaKA03lawyer14@court.inMotherClosed15JUL202518SEP2025Hubli90400000VALIDVALID15JUL202518SEP2025KA03-D1014-Hubli
15D1015RahulKeerthiAP06lawyer15@court.inJointActive21AUG2025 Tirupati162760000VALIDVALID21AUG2025.AP06-D1015-Tirupati
16D1016MaheshSwathiTN03lawyer16@court.inFatherClosed12SEP202515NOV2025Madurai41250000VALIDVALID12SEP202515NOV2025TN03-D1016-Madurai
17D1017DineshLathaMH03lawyer17@court.inJointActive17OCT2025 Nagpur113690000VALIDVALID17OCT2025.MH03-D1017-Nagpur
18D1018HarishDeepaAP07lawyer18@court.inMotherClosed20NOV202522DEC2025Nellore60350000VALIDVALID20NOV202522DEC2025AP07-D1018-Nellore
19D1019VijaySitaKA04lawyer19@court.inJointClosed05DEC202510JAN2026Belgaum212920000VALIDVALID05DEC202510JAN2026KA04-D1019-Belgaum
20D1020AnandJyothiAP08lawyer20@court.inFatherActive12DEC2025 Kurnool101480000VALIDVALID12DEC2025.AP08-D1020-Kurnool

Why These Functions Matter

VERIFY / NOTDIGIT

Used to identify unwanted characters.

notdigit("abc500")

Returns position of first non-numeric character.

Very useful for:

  • Salary fields
  • Settlement values
  • Account numbers

CATX

Creates meaningful IDs.

Case_Label=

catx('-',Court_Code,Case_ID,City);

Output:

AP01-D1001-Hyderabad

Useful for:

  • Tracking
  • Auditing
  • Reporting

INPUT

Converts dates dynamically.

inputn(Filing_Date_Raw,'ANYDTDTE20.')

Handles:

12-01-2025

12JAN2025

2025-01-12

without writing separate conversion code.

Explanation:

This SAS DATA step is designed to transform the raw divorce dataset into a clean, standardized, and analysis-ready dataset. The statement data divorce_clean; set divorce_raw; creates a new dataset called DIVORCE_CLEAN by reading observations from the raw dataset DIVORCE_RAW. The first set of transformations focuses on standardizing text variables. lowcase(strip(Lawyer_Email)) removes unnecessary leading and trailing spaces from the lawyer email field and converts all email addresses to lowercase, ensuring consistency when validating or reporting email information. upcase(Court_Code) converts court codes to uppercase so that values such as "ap01", "Ap01", and "AP01" are treated identically. Similarly, propcase(City) converts city names into proper case format, improving readability in reports and dashboards.

The next section performs business-rule validation on the Years_Married variable. Divorce records with negative marriage durations are logically impossible, so values less than zero are converted to missing (.). Likewise, values greater than sixty years are considered unrealistic for this project and are also converted to missing. This prevents extreme or invalid values from distorting summary statistics and analytical results.

The settlement amount cleaning logic converts the character variable Settlement_Amount into a numeric variable called Settlement_Num. The input(Settlement_Amount,?? best12.) function attempts to convert character values into numbers while suppressing invalid data conversion notes through the ?? modifier. For example, values such as "500000" and "900000" are successfully converted, whereas invalid values like "abc500" become missing. The abs() function then converts negative settlement amounts into positive values, ensuring standardized monetary reporting. A new variable called Settlement_Flag is created to indicate whether the settlement amount is valid or invalid. Records with successfully converted amounts receive a value of "VALID", while records with missing converted values receive "INVALID".

The program then validates email addresses using the find() function. If the @ symbol is absent from the email address, the record is flagged as "INVALID" in the Email_Flag variable; otherwise, it is marked "VALID". This provides a simple but effective quality check for email data. Next, the code converts the raw filing and final order date fields into true SAS date values using the ANYDTDTE. informat. This informat is particularly useful because it can interpret multiple date formats such as 12-01-2025, 15FEB2025, and 2025-03-15. After conversion, the FORMAT statement displays these dates using the DATE9. format, making them easier to read and interpret.

Finally, the CATX() function creates a new variable called Case_Label by combining the court code, case identifier, and city into a single descriptive value separated by hyphens. For example, a record may produce a label such as AP01-D1001-Hyderabad. This composite identifier can be useful for reporting, auditing, and tracking records across different systems. Overall, this DATA step demonstrates several important data-cleaning concepts, including standardization, validation, data type conversion, error handling, date processing, quality flag creation, and generation of business-friendly identifiers, all of which are essential skills for SAS programmers working with real-world datasets.

3.PROC FREQ Validation

proc freq data=divorce_clean;

tables Court_Code Custody_Status Case_Status

Email_Flag;

run;

OUTPUT:

The FREQ Procedure

Court_CodeFrequencyPercentCumulative
Frequency
Cumulative
Percent
AP01210.00210.00
AP0215.00315.00
AP0315.00420.00
AP0415.00525.00
AP0515.00630.00
AP0615.00735.00
AP0715.00840.00
AP0815.00945.00
KA0115.001050.00
KA0215.001155.00
KA0315.001260.00
KA0415.001365.00
MH0115.001470.00
MH0215.001575.00
MH0315.001680.00
TN0115.001785.00
TN0215.001890.00
TN0315.001995.00
XX0115.0020100.00
Custody_StatusFrequencyPercentCumulative
Frequency
Cumulative
Percent
Father525.00525.00
Joint1050.001575.00
Mother525.0020100.00
Case_StatusFrequencyPercentCumulative
Frequency
Cumulative
Percent
Active840.00840.00
Closed1260.0020100.00
Email_FlagFrequencyPercentCumulative
Frequency
Cumulative
Percent
INVALID15.0015.00
VALID1995.0020100.00

Purpose:

  • Find unexpected categories
  • Detect spelling mistakes
  • Validate business rules

4.PROC SUMMARY

proc summary data=divorce_clean;

class Court_Code;

var Settlement_Amount;

output out=court_summary mean= median= max= min=;

run;

proc print data = court_summary;

run;

OUTPUT:

ObsCourt_Code_TYPE__FREQ_Settlement_Amount
1 020610526.32
2AP0112500000.00
3AP0211.
4AP0311450000.00
5AP0411550000.00
6AP0511580000.00
7AP0611760000.00
8AP0711350000.00
9AP0811480000.00
10KA0111800000.00
11KA02111000000.00
12KA0311400000.00
13KA0411920000.00
14MH0111900000.00
15MH0211850000.00
16MH0311690000.00
17TN0111700000.00
18TN0211620000.00
19TN0311250000.00
20XX0111300000.00

Purpose:

Court-wise settlement analysis.

5.PROC RANK

proc rank data=divorce_clean out=ranked_cases groups=4;

var Settlement_Amount;

ranks Settlement_Group;

run;

proc print data = ranked_cases;

run;

OUTPUT:

ObsCase_IDHusband_NameWife_NameCourt_CodeLawyer_EmailCustody_StatusCase_StatusFiling_Date_RawFinal_Order_RawCityYears_MarriedChildren_CountSettlement_AmountSettlement_FlagEmail_FlagFiling_DateFinal_OrderCase_LabelSettlement_Group
1D1001RaviAnjaliAP01lawyer1@court.inJointClosed12-01-202515-04-2025Hyderabad122500000VALIDVALID01DEC202515APR2025AP01-D1001-Hyderabad1
2D1002RaviAnjaliAP01lawyer1@court.inJointClosed12-01-202515-04-2025Hyderabad122500000VALIDVALID01DEC202515APR2025AP01-D1002-Hyderabad1
3D1003KiranMeenaTN01invalidmailMotherActive15/02/2025 Chennai.1700000VALIDINVALID15FEB2025.TN01-D1003-Chennai2
4D1004SureshKavyaKA01lawyer4@court.inFatherClosed18MAR202520APR2025Bangalore450800000VALIDVALID18MAR202520APR2025KA01-D1004-Bangalore3
5D1005MohanRadhaAP02lawyer5@court.inJointActive  Vijayawada83.INVALIDVALID..AP02-D1005-Vijayawada.
6D1006AkashNishaMH01lawyer6@court.inMotherClosed12APR202515MAY2025Mumbai.1900000VALIDVALID12APR202515MAY2025MH01-D1006-Mumbai3
7D1007ArjunPriyaXX01lawyer7@court.inJointClosed10-10-202512-12-2025Delhi152300000VALIDVALID10OCT202512DEC2025XX01-D1007-Delhi0
8D1008TarunPoojaAP03lawyer8@court.inFatherActive20JAN2025 Hyderabad7.450000VALIDVALID20JAN2025.AP03-D1008-Hyderabad1
9D1009RajDivyaKA02lawyer9@court.inJointClosed15FEB202518MAR2025Mysore1421000000VALIDVALID15FEB202518MAR2025KA02-D1009-Mysore3
10D1010NaveenSnehaAP04lawyer10@court.inMotherActive17MAR2025 Vizag51550000VALIDVALID17MAR2025.AP04-D1010-Vizag1
11D1011RakeshSimranMH02lawyer11@court.inJointClosed18APR202522JUN2025Pune202850000VALIDVALID18APR202522JUN2025MH02-D1011-Pune3
12D1012AjayPritiTN02lawyer12@court.inFatherClosed10MAY202520JUL2025Chennai181620000VALIDVALID10MAY202520JUL2025TN02-D1012-Chennai2
13D1013DeepakRaniAP05lawyer13@court.inJointActive14JUN2025 Guntur132580000VALIDVALID14JUN2025.AP05-D1013-Guntur2
14D1014VinodRekhaKA03lawyer14@court.inMotherClosed15JUL202518SEP2025Hubli90400000VALIDVALID15JUL202518SEP2025KA03-D1014-Hubli0
15D1015RahulKeerthiAP06lawyer15@court.inJointActive21AUG2025 Tirupati162760000VALIDVALID21AUG2025.AP06-D1015-Tirupati2
16D1016MaheshSwathiTN03lawyer16@court.inFatherClosed12SEP202515NOV2025Madurai41250000VALIDVALID12SEP202515NOV2025TN03-D1016-Madurai0
17D1017DineshLathaMH03lawyer17@court.inJointActive17OCT2025 Nagpur113690000VALIDVALID17OCT2025.MH03-D1017-Nagpur2
18D1018HarishDeepaAP07lawyer18@court.inMotherClosed20NOV202522DEC2025Nellore60350000VALIDVALID20NOV202522DEC2025AP07-D1018-Nellore0
19D1019VijaySitaKA04lawyer19@court.inJointClosed05DEC202510JAN2026Belgaum212920000VALIDVALID05DEC202510JAN2026KA04-D1019-Belgaum3
20D1020AnandJyothiAP08lawyer20@court.inFatherActive12DEC2025 Kurnool101480000VALIDVALID12DEC2025.AP08-D1020-Kurnool1

Produces:

0 = Lowest

1 = Low

2 = Medium

3 = Highest

6.Create Raw Divorce Dataset in R

library(dplyr)

library(stringr)

library(lubridate)

library(readr)

divorce_raw <- data.frame(

  case_id =

    c("D1001","D1002","D1003","D1004",

      "D1005","D1006","D1007","D1008",

      "D1009","D1010","D1011","D1012"),

  years_married =

    c(12,12,-2,45,8,110,15,7,14,5,20,18),

  settlement_amount =

    c("500000","500000","700000",

      "-800000","abc500",

      "900000","300000",

      "450000","1000000",

      "550000","850000",

      "620000"),

  court_code =

    c("AP01","AP01","TN01",

      "KA01","AP02","MH01",

      "XX01","AP03",

      "KA02","AP04",

      "MH02","TN02")

)

OUTPUT:

 

case_id

years_married

settlement_amount

court_code

1

D1001

12

500000

AP01

2

D1002

12

500000

AP01

3

D1003

-2

700000

TN01

4

D1004

45

-800000

KA01

5

D1005

8

abc500

AP02

6

D1006

110

900000

MH01

7

D1007

15

300000

XX01

8

D1008

7

450000

AP03

9

D1009

14

1000000

KA02

10

D1010

5

550000

AP04

11

D1011

20

850000

MH02

12

D1012

18

620000

TN02

Instead of:

across()

str_trim()

str_to_title()

Use:

library(dplyr)

library(stringr)

library(lubridate)

library(readr)

7.R Cleaning Pipeline

divorce_clean <- divorce_raw %>%

  mutate(

    court_code = toupper(court_code),

    years_married =

      replace(years_married,years_married < 0 |

          years_married > 60,NA),

    settlement_amount =parse_number(

        settlement_amount),

    settlement_amount = abs(settlement_amount),

    case_group =case_when(

      settlement_amount >= 800000 ~ "HIGH",

      settlement_amount >= 500000 ~ "MEDIUM",

        TRUE ~ "LOW")

  )

OUTPUT:

 

case_id

years_married

settlement_amount

court_code

case_group

1

D1001

12

500000

AP01

MEDIUM

2

D1002

12

500000

AP01

MEDIUM

3

D1003

NA

700000

TN01

MEDIUM

4

D1004

45

800000

KA01

HIGH

5

D1005

8

500

AP02

LOW

6

D1006

NA

900000

MH01

HIGH

7

D1007

15

300000

XX01

LOW

8

D1008

7

450000

AP03

LOW

9

D1009

14

1000000

KA02

HIGH

10

D1010

5

550000

AP04

MEDIUM

11

D1011

20

850000

MH02

HIGH

12

D1012

18

620000

TN02

MEDIUM

Why parse_number() Is Powerful

Instead of:

gsub("[^0-9]","",x)

Use:

parse_number()

Examples:

parse_number("$500000")

parse_number("abc500")

parse_number("-800000")

Automatically extracts numbers.

Advanced Divorce Analytics in R

8.Court-wise Statistics

divorce_group <- divorce_clean  %>%

  group_by(court_code) %>%

  summarise(cases=n(),

    avg_settlement=mean(

        settlement_amount,na.rm=TRUE)

  )

OUTPUT:

 

court_code

cases

avg_settlement

1

AP01

2

500000

2

AP02

1

500

3

AP03

1

450000

4

AP04

1

550000

5

KA01

1

800000

6

KA02

1

1000000

7

MH01

1

900000

8

MH02

1

850000

9

TN01

1

700000

10

TN02

1

620000

11

XX01

1

300000

9.Detect Duplicates

# Duplicate Check

divorce_summary <- divorce_clean %>%

  count(case_id) %>%

  filter(n > 1)

# Missing Value Audit

colSums(is.na(divorce_clean))

CONSOLE:

> colSums(is.na(divorce_clean))

          case_id     years_married settlement_amount        court_code

                0                 2                 0                 0

       case_group

                0 

# Structure Check

str(divorce_clean)

CONSOLE:

> str(divorce_clean)

'data.frame':  12 obs. of  5 variables:

 $ case_id          : chr  "D1001" "D1002" "D1003" "D1004" ...

 $ years_married    : num  12 12 NA 45 8 NA 15 7 14 5 ...

 $ settlement_amount: num  500000 500000 700000 800000 500 900000 300000 450000 1000000 550000 ...

 $ court_code       : chr  "AP01" "AP01" "TN01" "KA01" ...

 $ case_group       : chr  "MEDIUM" "MEDIUM" "MEDIUM" "HIGH" ...

# Summary Statistics

summary(divorce_clean)

CONSOLE:

summary(divorce_clean)

   case_id          years_married   settlement_amount  court_code      

 Length:12          Min.   : 5.00   Min.   :    500   Length:12        

 Class :character   1st Qu.: 9.00   1st Qu.: 487500   Class :character 

 Mode  :character   Median :13.00   Median : 585000   Mode  :character 

                    Mean   :15.60   Mean   : 597542                    

                    3rd Qu.:17.25   3rd Qu.: 812500                     

                    Max.   :45.00   Max.   :1000000                    

                    NA's   :2                                          

  case_group      

 Length:12        

 Class :character 

 Mode  :character  

# Invalid Settlement Records

divorce_clean %>%

  filter(is.na(settlement_amount))

CONSOLE:

[1] case_id           years_married     settlement_amount court_code      

[5] case_group      

<0 rows> (or 0-length row.names)

# Invalid Marriage Duration

divorce_clean %>%

  filter(is.na(years_married))

CONSOLE:

case_id years_married settlement_amount court_code case_group

1   D1003            NA             7e+05       TN01     MEDIUM

2   D1006            NA             9e+05       MH01       HIGH

 SAS vs R Comparison for Divorce Data

Activity

SAS

R

Date Conversion

INPUTN

lubridate

Categorization

FORMAT

case_when

Duplicate Detection

PROC SORT

count()

Validation

PROC FREQ

table()

Aggregation

PROC SUMMARY

summarise()

Ranking

PROC RANK

ntile()


Interview Questions & Answers

1. Data Integrity Investigation

Question:
A divorce court suddenly reports 40% more finalized cases. How would you validate whether the increase is real or caused by duplicate records?

Short Answer:
I would compare current and historical case counts, check for duplicate Case_ID values using PROC SORT NODUPKEY or PROC SQL, and verify whether additional records are genuine new cases or duplicated entries from data loading or merging issues.

2. Settlement Amount Corruption

Question:
A settlement column contains values like 500000, -700000, and abc500. How would you clean and validate these values in SAS?

Short Answer:
I would use the INPUT() function to convert character values to numeric, apply ABS() to handle negative amounts if required by business rules, and flag invalid values such as abc500 as missing for further investigation.

3. Court Code Governance

Question:
How would you identify invalid court jurisdictions such as XX01 before reporting?

Short Answer:
I would compare court codes against a predefined list of valid court codes using PROC FORMAT, PROC SQL, or lookup tables and flag any unmatched codes as invalid before generating reports.

4. Date Standardization Challenge

Question:
What approach would you use when filing dates arrive in five different formats?

Short Answer:
I would use the ANYDTDTE. informat to convert all date formats into a standard SAS date value and then apply a common display format such as DATE9. for consistency.

5. Enterprise Validation Strategy

Question:
How would you independently QC a divorce dataset generated by another programmer?

Short Answer:
I would independently recreate the dataset using the specifications, compare record counts, variables, summary statistics, and key outputs, then investigate and resolve any discrepancies found during reconciliation.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DIVORCE DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS