Strategy When Divorce Records Break Dashboards: Cleaning Legal Data with SAS and R
Broken Promises, Corrupted
Records & Divorce Data Engineering Excellence
Why Divorce Data Requires Special Cleaning
Divorce
registries are among the most complex administrative datasets because they
contain:
- Former spouse information
- Court case details
- Legal settlement amounts
- Custody information
- Regional court jurisdictions
- Filing dates
- Final judgment dates
- Lawyer information
Small
errors can produce:
- Incorrect legal reports
- Wrong custody statistics
- Duplicate court cases
- Invalid settlement
calculations
- Misleading demographic analyses
1. Create a Raw Divorce Dataset in SAS
SAS Raw Divorce Dataset
data divorce_raw;
length Case_ID $12 Husband_Name $40 Wife_Name $40 Court_Code $8 Lawyer_Email $60
Settlement_Amount $15 Custody_Status $20 Case_Status $15 Filing_Date_Raw $20
Final_Order_Raw $20 City $25;
input Case_ID $ Husband_Name $ Wife_Name $ Years_Married Children_Count
Court_Code $ Lawyer_Email $ Settlement_Amount $ Custody_Status $ Case_Status $
Filing_Date_Raw $ Final_Order_Raw $ City $;
datalines;
D1001 Ravi Anjali 12 2 AP01 lawyer1@court.in 500000 Joint Closed 12-01-2025 15-04-2025 Hyderabad
D1002 Ravi Anjali 12 2 AP01 lawyer1@court.in 500000 Joint Closed 12-01-2025 15-04-2025 Hyderabad
D1003 Kiran Meena -2 1 TN01 invalidmail 700000 Mother Active 15/02/2025 . Chennai
D1004 Suresh Kavya 45 0 KA01 lawyer4@court.in -800000 Father Closed 18MAR2025 20APR2025 Bangalore
D1005 Mohan Radha 8 3 AP02 lawyer5@court.in abc500 Joint Active . . Vijayawada
D1006 Akash Nisha 110 1 MH01 lawyer6@court.in 900000 Mother Closed 12APR2025 15MAY2025 Mumbai
D1007 Arjun Priya 15 2 XX01 lawyer7@court.in 300000 Joint Closed 10-10-2025 12-12-2025 Delhi
D1008 Tarun Pooja 7 . AP03 lawyer8@court.in 450000 Father Active 20JAN2025 . Hyderabad
D1009 Raj Divya 14 2 KA02 lawyer9@court.in 1000000 Joint Closed 15FEB2025 18MAR2025 Mysore
D1010 Naveen Sneha 5 1 AP04 lawyer10@court.in 550000 Mother Active 17MAR2025 . Vizag
D1011 Rakesh Simran 20 2 MH02 lawyer11@court.in 850000 Joint Closed 18APR2025 22JUN2025 Pune
D1012 Ajay Priti 18 1 TN02 lawyer12@court.in 620000 Father Closed 10MAY2025 20JUL2025 Chennai
D1013 Deepak Rani 13 2 AP05 lawyer13@court.in 580000 Joint Active 14JUN2025 . Guntur
D1014 Vinod Rekha 9 0 KA03 lawyer14@court.in 400000 Mother Closed 15JUL2025 18SEP2025 Hubli
D1015 Rahul Keerthi 16 2 AP06 lawyer15@court.in 760000 Joint Active 21AUG2025 . Tirupati
D1016 Mahesh Swathi 4 1 TN03 lawyer16@court.in 250000 Father Closed 12SEP2025 15NOV2025 Madurai
D1017 Dinesh Latha 11 3 MH03 lawyer17@court.in 690000 Joint Active 17OCT2025 . Nagpur
D1018 Harish Deepa 6 0 AP07 lawyer18@court.in 350000 Mother Closed 20NOV2025 22DEC2025 Nellore
D1019 Vijay Sita 21 2 KA04 lawyer19@court.in 920000 Joint Closed 05DEC2025 10JAN2026 Belgaum
D1020 Anand Jyothi 10 1 AP08 lawyer20@court.in 480000 Father Active 12DEC2025 . Kurnool
;
run;
proc print data = divorce_raw;
run;
OUTPUT:
| Obs | Case_ID | Husband_Name | Wife_Name | Court_Code | Lawyer_Email | Settlement_Amount | Custody_Status | Case_Status | Filing_Date_Raw | Final_Order_Raw | City | Years_Married | Children_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | D1001 | Ravi | Anjali | AP01 | lawyer1@court.in | 500000 | Joint | Closed | 12-01-2025 | 15-04-2025 | Hyderabad | 12 | 2 |
| 2 | D1002 | Ravi | Anjali | AP01 | lawyer1@court.in | 500000 | Joint | Closed | 12-01-2025 | 15-04-2025 | Hyderabad | 12 | 2 |
| 3 | D1003 | Kiran | Meena | TN01 | invalidmail | 700000 | Mother | Active | 15/02/2025 | Chennai | -2 | 1 | |
| 4 | D1004 | Suresh | Kavya | KA01 | lawyer4@court.in | -800000 | Father | Closed | 18MAR2025 | 20APR2025 | Bangalore | 45 | 0 |
| 5 | D1005 | Mohan | Radha | AP02 | lawyer5@court.in | abc500 | Joint | Active | Vijayawada | 8 | 3 | ||
| 6 | D1006 | Akash | Nisha | MH01 | lawyer6@court.in | 900000 | Mother | Closed | 12APR2025 | 15MAY2025 | Mumbai | 110 | 1 |
| 7 | D1007 | Arjun | Priya | XX01 | lawyer7@court.in | 300000 | Joint | Closed | 10-10-2025 | 12-12-2025 | Delhi | 15 | 2 |
| 8 | D1008 | Tarun | Pooja | AP03 | lawyer8@court.in | 450000 | Father | Active | 20JAN2025 | Hyderabad | 7 | . | |
| 9 | D1009 | Raj | Divya | KA02 | lawyer9@court.in | 1000000 | Joint | Closed | 15FEB2025 | 18MAR2025 | Mysore | 14 | 2 |
| 10 | D1010 | Naveen | Sneha | AP04 | lawyer10@court.in | 550000 | Mother | Active | 17MAR2025 | Vizag | 5 | 1 | |
| 11 | D1011 | Rakesh | Simran | MH02 | lawyer11@court.in | 850000 | Joint | Closed | 18APR2025 | 22JUN2025 | Pune | 20 | 2 |
| 12 | D1012 | Ajay | Priti | TN02 | lawyer12@court.in | 620000 | Father | Closed | 10MAY2025 | 20JUL2025 | Chennai | 18 | 1 |
| 13 | D1013 | Deepak | Rani | AP05 | lawyer13@court.in | 580000 | Joint | Active | 14JUN2025 | Guntur | 13 | 2 | |
| 14 | D1014 | Vinod | Rekha | KA03 | lawyer14@court.in | 400000 | Mother | Closed | 15JUL2025 | 18SEP2025 | Hubli | 9 | 0 |
| 15 | D1015 | Rahul | Keerthi | AP06 | lawyer15@court.in | 760000 | Joint | Active | 21AUG2025 | Tirupati | 16 | 2 | |
| 16 | D1016 | Mahesh | Swathi | TN03 | lawyer16@court.in | 250000 | Father | Closed | 12SEP2025 | 15NOV2025 | Madurai | 4 | 1 |
| 17 | D1017 | Dinesh | Latha | MH03 | lawyer17@court.in | 690000 | Joint | Active | 17OCT2025 | Nagpur | 11 | 3 | |
| 18 | D1018 | Harish | Deepa | AP07 | lawyer18@court.in | 350000 | Mother | Closed | 20NOV2025 | 22DEC2025 | Nellore | 6 | 0 |
| 19 | D1019 | Vijay | Sita | KA04 | lawyer19@court.in | 920000 | Joint | Closed | 05DEC2025 | 10JAN2026 | Belgaum | 21 | 2 |
| 20 | D1020 | Anand | Jyothi | AP08 | lawyer20@court.in | 480000 | Father | Active | 12DEC2025 | Kurnool | 10 | 1 |
Data Quality Problems
Intentionally Added
|
Issue |
Example |
|
Duplicate
Cases |
D1001,
D1002 |
|
Invalid
Marriage Years |
-2 |
|
Unrealistic
Marriage Duration |
110 |
|
Invalid
Court Code |
XX01 |
|
Invalid
Email |
invalidmail |
|
Character
Amount |
abc500 |
|
Negative
Settlement |
-800000 |
|
Missing
Children Count |
. |
|
Missing
Dates |
. |
These are
realistic issues commonly found in legal and government datasets.
SAS Cleaning Using
DIFFERENT Statements
Instead
of ARRAYS and COMPRESS, use:
- VERIFY()
- TRANWRD()
- CATX()
- FIND()
- INPUTN()
- COALESCEC()
- MISSING()
- NOTDIGIT()
Explanation
This SAS
program creates the initial raw divorce registration dataset named DIVORCE_RAW.
In any real-world data engineering project, the first step is usually to build
or receive a source dataset from external systems such as government courts,
legal departments, CRM systems, or data warehouses. This dataset intentionally
contains quality issues that will later be identified and corrected through
data cleaning procedures.
The DATA
statement tells SAS to create a new dataset called DIVORCE_RAW. Every
observation entered through the DATALINES statement becomes one row in the
dataset.
The
LENGTH statement is extremely important because it explicitly defines the
storage size of character variables. For example, Case_ID $12 reserves twelve
characters for the case identifier. If the LENGTH statement is omitted, SAS
determines character lengths based on the first encountered value, which may
lead to truncation problems later. Production programmers almost always define
LENGTH before reading data.
Variables
such as Husband_Name, Wife_Name, Court_Code, Lawyer_Email, and City are defined
as character variables using the $ symbol. Numeric variables like Years_Married
and Children_Count do not require $.
The INPUT
statement tells SAS how to read each field. The ampersand (&) modifier
allows SAS to read character values that may contain embedded spaces. This is
useful when names contain multiple words.
Finally,
the DATALINES statement provides the actual divorce records. These records
intentionally contain duplicates, invalid emails, unrealistic marriage
durations, and inconsistent values to simulate real-world legal data
challenges. This raw dataset serves as the foundation for all subsequent
cleaning, validation, reporting, and analytical activities.
2.SAS Cleaning Program
data divorce_clean;
set divorce_raw;
Lawyer_Email=lowcase(strip(Lawyer_Email));
Court_Code=upcase(Court_Code);
City=propcase(City);
if Years_Married < 0 then Years_Married=.;
if Years_Married > 60 then Years_Married=.;
Settlement_Num=input(Settlement_Amount,?? best12.);
Settlement_Num=abs(Settlement_Num);
if missing(Settlement_Num) then
Settlement_Flag='INVALID';
else Settlement_Flag='VALID';
if find(Lawyer_Email,'@')=0 then
Email_Flag='INVALID';
else Email_Flag='VALID';
Filing_Date=input(Filing_Date_Raw,anydtdte20.);
Final_Order=input(Final_Order_Raw,anydtdte20.);
format Filing_Date Final_Order date9.;
Case_Label=catx('-',Court_Code,Case_ID,City);
drop Settlement_Amount;
rename Settlement_Num = Settlement_Amount;
run;
proc print data = divorce_clean;
run;
OUTPUT:
| Obs | Case_ID | Husband_Name | Wife_Name | Court_Code | Lawyer_Email | Custody_Status | Case_Status | Filing_Date_Raw | Final_Order_Raw | City | Years_Married | Children_Count | Settlement_Amount | Settlement_Flag | Email_Flag | Filing_Date | Final_Order | Case_Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | D1001 | Ravi | Anjali | AP01 | lawyer1@court.in | Joint | Closed | 12-01-2025 | 15-04-2025 | Hyderabad | 12 | 2 | 500000 | VALID | VALID | 01DEC2025 | 15APR2025 | AP01-D1001-Hyderabad |
| 2 | D1002 | Ravi | Anjali | AP01 | lawyer1@court.in | Joint | Closed | 12-01-2025 | 15-04-2025 | Hyderabad | 12 | 2 | 500000 | VALID | VALID | 01DEC2025 | 15APR2025 | AP01-D1002-Hyderabad |
| 3 | D1003 | Kiran | Meena | TN01 | invalidmail | Mother | Active | 15/02/2025 | Chennai | . | 1 | 700000 | VALID | INVALID | 15FEB2025 | . | TN01-D1003-Chennai | |
| 4 | D1004 | Suresh | Kavya | KA01 | lawyer4@court.in | Father | Closed | 18MAR2025 | 20APR2025 | Bangalore | 45 | 0 | 800000 | VALID | VALID | 18MAR2025 | 20APR2025 | KA01-D1004-Bangalore |
| 5 | D1005 | Mohan | Radha | AP02 | lawyer5@court.in | Joint | Active | Vijayawada | 8 | 3 | . | INVALID | VALID | . | . | AP02-D1005-Vijayawada | ||
| 6 | D1006 | Akash | Nisha | MH01 | lawyer6@court.in | Mother | Closed | 12APR2025 | 15MAY2025 | Mumbai | . | 1 | 900000 | VALID | VALID | 12APR2025 | 15MAY2025 | MH01-D1006-Mumbai |
| 7 | D1007 | Arjun | Priya | XX01 | lawyer7@court.in | Joint | Closed | 10-10-2025 | 12-12-2025 | Delhi | 15 | 2 | 300000 | VALID | VALID | 10OCT2025 | 12DEC2025 | XX01-D1007-Delhi |
| 8 | D1008 | Tarun | Pooja | AP03 | lawyer8@court.in | Father | Active | 20JAN2025 | Hyderabad | 7 | . | 450000 | VALID | VALID | 20JAN2025 | . | AP03-D1008-Hyderabad | |
| 9 | D1009 | Raj | Divya | KA02 | lawyer9@court.in | Joint | Closed | 15FEB2025 | 18MAR2025 | Mysore | 14 | 2 | 1000000 | VALID | VALID | 15FEB2025 | 18MAR2025 | KA02-D1009-Mysore |
| 10 | D1010 | Naveen | Sneha | AP04 | lawyer10@court.in | Mother | Active | 17MAR2025 | Vizag | 5 | 1 | 550000 | VALID | VALID | 17MAR2025 | . | AP04-D1010-Vizag | |
| 11 | D1011 | Rakesh | Simran | MH02 | lawyer11@court.in | Joint | Closed | 18APR2025 | 22JUN2025 | Pune | 20 | 2 | 850000 | VALID | VALID | 18APR2025 | 22JUN2025 | MH02-D1011-Pune |
| 12 | D1012 | Ajay | Priti | TN02 | lawyer12@court.in | Father | Closed | 10MAY2025 | 20JUL2025 | Chennai | 18 | 1 | 620000 | VALID | VALID | 10MAY2025 | 20JUL2025 | TN02-D1012-Chennai |
| 13 | D1013 | Deepak | Rani | AP05 | lawyer13@court.in | Joint | Active | 14JUN2025 | Guntur | 13 | 2 | 580000 | VALID | VALID | 14JUN2025 | . | AP05-D1013-Guntur | |
| 14 | D1014 | Vinod | Rekha | KA03 | lawyer14@court.in | Mother | Closed | 15JUL2025 | 18SEP2025 | Hubli | 9 | 0 | 400000 | VALID | VALID | 15JUL2025 | 18SEP2025 | KA03-D1014-Hubli |
| 15 | D1015 | Rahul | Keerthi | AP06 | lawyer15@court.in | Joint | Active | 21AUG2025 | Tirupati | 16 | 2 | 760000 | VALID | VALID | 21AUG2025 | . | AP06-D1015-Tirupati | |
| 16 | D1016 | Mahesh | Swathi | TN03 | lawyer16@court.in | Father | Closed | 12SEP2025 | 15NOV2025 | Madurai | 4 | 1 | 250000 | VALID | VALID | 12SEP2025 | 15NOV2025 | TN03-D1016-Madurai |
| 17 | D1017 | Dinesh | Latha | MH03 | lawyer17@court.in | Joint | Active | 17OCT2025 | Nagpur | 11 | 3 | 690000 | VALID | VALID | 17OCT2025 | . | MH03-D1017-Nagpur | |
| 18 | D1018 | Harish | Deepa | AP07 | lawyer18@court.in | Mother | Closed | 20NOV2025 | 22DEC2025 | Nellore | 6 | 0 | 350000 | VALID | VALID | 20NOV2025 | 22DEC2025 | AP07-D1018-Nellore |
| 19 | D1019 | Vijay | Sita | KA04 | lawyer19@court.in | Joint | Closed | 05DEC2025 | 10JAN2026 | Belgaum | 21 | 2 | 920000 | VALID | VALID | 05DEC2025 | 10JAN2026 | KA04-D1019-Belgaum |
| 20 | D1020 | Anand | Jyothi | AP08 | lawyer20@court.in | Father | Active | 12DEC2025 | Kurnool | 10 | 1 | 480000 | VALID | VALID | 12DEC2025 | . | AP08-D1020-Kurnool |
Why These Functions Matter
VERIFY / NOTDIGIT
Used to
identify unwanted characters.
notdigit("abc500")
Returns
position of first non-numeric character.
Very
useful for:
- Salary fields
- Settlement values
- Account numbers
CATX
Creates
meaningful IDs.
Case_Label=
catx('-',Court_Code,Case_ID,City);
Output:
AP01-D1001-Hyderabad
Useful
for:
- Tracking
- Auditing
- Reporting
INPUT
Converts
dates dynamically.
inputn(Filing_Date_Raw,'ANYDTDTE20.')
Handles:
12-01-2025
12JAN2025
2025-01-12
without
writing separate conversion code.
Explanation:
This SAS
DATA step is designed to transform the raw divorce dataset into a clean,
standardized, and analysis-ready dataset. The statement data divorce_clean; set
divorce_raw; creates a new dataset called DIVORCE_CLEAN by reading
observations from the raw dataset DIVORCE_RAW. The first set of
transformations focuses on standardizing text variables.
lowcase(strip(Lawyer_Email)) removes unnecessary leading and trailing spaces
from the lawyer email field and converts all email addresses to lowercase,
ensuring consistency when validating or reporting email information.
upcase(Court_Code) converts court codes to uppercase so that values such as
"ap01", "Ap01", and "AP01" are treated
identically. Similarly, propcase(City) converts city names into proper case
format, improving readability in reports and dashboards.
The next
section performs business-rule validation on the Years_Married variable.
Divorce records with negative marriage durations are logically impossible, so
values less than zero are converted to missing (.). Likewise, values greater
than sixty years are considered unrealistic for this project and are also
converted to missing. This prevents extreme or invalid values from distorting
summary statistics and analytical results.
The
settlement amount cleaning logic converts the character variable
Settlement_Amount into a numeric variable called Settlement_Num. The
input(Settlement_Amount,?? best12.) function attempts to convert character
values into numbers while suppressing invalid data conversion notes through the
?? modifier. For example, values such as "500000" and
"900000" are successfully converted, whereas invalid values like
"abc500" become missing. The abs() function then converts negative
settlement amounts into positive values, ensuring standardized monetary
reporting. A new variable called Settlement_Flag is created to indicate whether
the settlement amount is valid or invalid. Records with successfully converted
amounts receive a value of "VALID", while records with missing
converted values receive "INVALID".
The
program then validates email addresses using the find() function. If the @
symbol is absent from the email address, the record is flagged as
"INVALID" in the Email_Flag variable; otherwise, it is marked
"VALID". This provides a simple but effective quality check for email
data. Next, the code converts the raw filing and final order date fields into
true SAS date values using the ANYDTDTE. informat. This informat is
particularly useful because it can interpret multiple date formats such as 12-01-2025,
15FEB2025, and 2025-03-15. After conversion, the FORMAT statement displays
these dates using the DATE9. format, making them easier to read and interpret.
Finally,
the CATX() function creates a new variable called Case_Label by combining the
court code, case identifier, and city into a single descriptive value separated
by hyphens. For example, a record may produce a label such as
AP01-D1001-Hyderabad. This composite identifier can be useful for reporting,
auditing, and tracking records across different systems. Overall, this DATA
step demonstrates several important data-cleaning concepts, including
standardization, validation, data type conversion, error handling, date
processing, quality flag creation, and generation of business-friendly identifiers,
all of which are essential skills for SAS programmers working with real-world
datasets.
3.PROC FREQ Validation
proc freq data=divorce_clean;
tables Court_Code Custody_Status Case_Status
Email_Flag;
run;
OUTPUT:
The FREQ Procedure
| Court_Code | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| AP01 | 2 | 10.00 | 2 | 10.00 |
| AP02 | 1 | 5.00 | 3 | 15.00 |
| AP03 | 1 | 5.00 | 4 | 20.00 |
| AP04 | 1 | 5.00 | 5 | 25.00 |
| AP05 | 1 | 5.00 | 6 | 30.00 |
| AP06 | 1 | 5.00 | 7 | 35.00 |
| AP07 | 1 | 5.00 | 8 | 40.00 |
| AP08 | 1 | 5.00 | 9 | 45.00 |
| KA01 | 1 | 5.00 | 10 | 50.00 |
| KA02 | 1 | 5.00 | 11 | 55.00 |
| KA03 | 1 | 5.00 | 12 | 60.00 |
| KA04 | 1 | 5.00 | 13 | 65.00 |
| MH01 | 1 | 5.00 | 14 | 70.00 |
| MH02 | 1 | 5.00 | 15 | 75.00 |
| MH03 | 1 | 5.00 | 16 | 80.00 |
| TN01 | 1 | 5.00 | 17 | 85.00 |
| TN02 | 1 | 5.00 | 18 | 90.00 |
| TN03 | 1 | 5.00 | 19 | 95.00 |
| XX01 | 1 | 5.00 | 20 | 100.00 |
| Custody_Status | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Father | 5 | 25.00 | 5 | 25.00 |
| Joint | 10 | 50.00 | 15 | 75.00 |
| Mother | 5 | 25.00 | 20 | 100.00 |
| Case_Status | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Active | 8 | 40.00 | 8 | 40.00 |
| Closed | 12 | 60.00 | 20 | 100.00 |
| Email_Flag | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| INVALID | 1 | 5.00 | 1 | 5.00 |
| VALID | 19 | 95.00 | 20 | 100.00 |
Purpose:
- Find unexpected categories
- Detect spelling mistakes
- Validate business rules
4.PROC SUMMARY
proc summary data=divorce_clean;
class Court_Code;
var Settlement_Amount;
output out=court_summary mean= median= max= min=;
run;
proc print data = court_summary;
run;
OUTPUT:
| Obs | Court_Code | _TYPE_ | _FREQ_ | Settlement_Amount |
|---|---|---|---|---|
| 1 | 0 | 20 | 610526.32 | |
| 2 | AP01 | 1 | 2 | 500000.00 |
| 3 | AP02 | 1 | 1 | . |
| 4 | AP03 | 1 | 1 | 450000.00 |
| 5 | AP04 | 1 | 1 | 550000.00 |
| 6 | AP05 | 1 | 1 | 580000.00 |
| 7 | AP06 | 1 | 1 | 760000.00 |
| 8 | AP07 | 1 | 1 | 350000.00 |
| 9 | AP08 | 1 | 1 | 480000.00 |
| 10 | KA01 | 1 | 1 | 800000.00 |
| 11 | KA02 | 1 | 1 | 1000000.00 |
| 12 | KA03 | 1 | 1 | 400000.00 |
| 13 | KA04 | 1 | 1 | 920000.00 |
| 14 | MH01 | 1 | 1 | 900000.00 |
| 15 | MH02 | 1 | 1 | 850000.00 |
| 16 | MH03 | 1 | 1 | 690000.00 |
| 17 | TN01 | 1 | 1 | 700000.00 |
| 18 | TN02 | 1 | 1 | 620000.00 |
| 19 | TN03 | 1 | 1 | 250000.00 |
| 20 | XX01 | 1 | 1 | 300000.00 |
Purpose:
Court-wise
settlement analysis.
5.PROC RANK
proc rank data=divorce_clean out=ranked_cases groups=4;
var Settlement_Amount;
ranks Settlement_Group;
run;
proc print data = ranked_cases;
run;
OUTPUT:
| Obs | Case_ID | Husband_Name | Wife_Name | Court_Code | Lawyer_Email | Custody_Status | Case_Status | Filing_Date_Raw | Final_Order_Raw | City | Years_Married | Children_Count | Settlement_Amount | Settlement_Flag | Email_Flag | Filing_Date | Final_Order | Case_Label | Settlement_Group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | D1001 | Ravi | Anjali | AP01 | lawyer1@court.in | Joint | Closed | 12-01-2025 | 15-04-2025 | Hyderabad | 12 | 2 | 500000 | VALID | VALID | 01DEC2025 | 15APR2025 | AP01-D1001-Hyderabad | 1 |
| 2 | D1002 | Ravi | Anjali | AP01 | lawyer1@court.in | Joint | Closed | 12-01-2025 | 15-04-2025 | Hyderabad | 12 | 2 | 500000 | VALID | VALID | 01DEC2025 | 15APR2025 | AP01-D1002-Hyderabad | 1 |
| 3 | D1003 | Kiran | Meena | TN01 | invalidmail | Mother | Active | 15/02/2025 | Chennai | . | 1 | 700000 | VALID | INVALID | 15FEB2025 | . | TN01-D1003-Chennai | 2 | |
| 4 | D1004 | Suresh | Kavya | KA01 | lawyer4@court.in | Father | Closed | 18MAR2025 | 20APR2025 | Bangalore | 45 | 0 | 800000 | VALID | VALID | 18MAR2025 | 20APR2025 | KA01-D1004-Bangalore | 3 |
| 5 | D1005 | Mohan | Radha | AP02 | lawyer5@court.in | Joint | Active | Vijayawada | 8 | 3 | . | INVALID | VALID | . | . | AP02-D1005-Vijayawada | . | ||
| 6 | D1006 | Akash | Nisha | MH01 | lawyer6@court.in | Mother | Closed | 12APR2025 | 15MAY2025 | Mumbai | . | 1 | 900000 | VALID | VALID | 12APR2025 | 15MAY2025 | MH01-D1006-Mumbai | 3 |
| 7 | D1007 | Arjun | Priya | XX01 | lawyer7@court.in | Joint | Closed | 10-10-2025 | 12-12-2025 | Delhi | 15 | 2 | 300000 | VALID | VALID | 10OCT2025 | 12DEC2025 | XX01-D1007-Delhi | 0 |
| 8 | D1008 | Tarun | Pooja | AP03 | lawyer8@court.in | Father | Active | 20JAN2025 | Hyderabad | 7 | . | 450000 | VALID | VALID | 20JAN2025 | . | AP03-D1008-Hyderabad | 1 | |
| 9 | D1009 | Raj | Divya | KA02 | lawyer9@court.in | Joint | Closed | 15FEB2025 | 18MAR2025 | Mysore | 14 | 2 | 1000000 | VALID | VALID | 15FEB2025 | 18MAR2025 | KA02-D1009-Mysore | 3 |
| 10 | D1010 | Naveen | Sneha | AP04 | lawyer10@court.in | Mother | Active | 17MAR2025 | Vizag | 5 | 1 | 550000 | VALID | VALID | 17MAR2025 | . | AP04-D1010-Vizag | 1 | |
| 11 | D1011 | Rakesh | Simran | MH02 | lawyer11@court.in | Joint | Closed | 18APR2025 | 22JUN2025 | Pune | 20 | 2 | 850000 | VALID | VALID | 18APR2025 | 22JUN2025 | MH02-D1011-Pune | 3 |
| 12 | D1012 | Ajay | Priti | TN02 | lawyer12@court.in | Father | Closed | 10MAY2025 | 20JUL2025 | Chennai | 18 | 1 | 620000 | VALID | VALID | 10MAY2025 | 20JUL2025 | TN02-D1012-Chennai | 2 |
| 13 | D1013 | Deepak | Rani | AP05 | lawyer13@court.in | Joint | Active | 14JUN2025 | Guntur | 13 | 2 | 580000 | VALID | VALID | 14JUN2025 | . | AP05-D1013-Guntur | 2 | |
| 14 | D1014 | Vinod | Rekha | KA03 | lawyer14@court.in | Mother | Closed | 15JUL2025 | 18SEP2025 | Hubli | 9 | 0 | 400000 | VALID | VALID | 15JUL2025 | 18SEP2025 | KA03-D1014-Hubli | 0 |
| 15 | D1015 | Rahul | Keerthi | AP06 | lawyer15@court.in | Joint | Active | 21AUG2025 | Tirupati | 16 | 2 | 760000 | VALID | VALID | 21AUG2025 | . | AP06-D1015-Tirupati | 2 | |
| 16 | D1016 | Mahesh | Swathi | TN03 | lawyer16@court.in | Father | Closed | 12SEP2025 | 15NOV2025 | Madurai | 4 | 1 | 250000 | VALID | VALID | 12SEP2025 | 15NOV2025 | TN03-D1016-Madurai | 0 |
| 17 | D1017 | Dinesh | Latha | MH03 | lawyer17@court.in | Joint | Active | 17OCT2025 | Nagpur | 11 | 3 | 690000 | VALID | VALID | 17OCT2025 | . | MH03-D1017-Nagpur | 2 | |
| 18 | D1018 | Harish | Deepa | AP07 | lawyer18@court.in | Mother | Closed | 20NOV2025 | 22DEC2025 | Nellore | 6 | 0 | 350000 | VALID | VALID | 20NOV2025 | 22DEC2025 | AP07-D1018-Nellore | 0 |
| 19 | D1019 | Vijay | Sita | KA04 | lawyer19@court.in | Joint | Closed | 05DEC2025 | 10JAN2026 | Belgaum | 21 | 2 | 920000 | VALID | VALID | 05DEC2025 | 10JAN2026 | KA04-D1019-Belgaum | 3 |
| 20 | D1020 | Anand | Jyothi | AP08 | lawyer20@court.in | Father | Active | 12DEC2025 | Kurnool | 10 | 1 | 480000 | VALID | VALID | 12DEC2025 | . | AP08-D1020-Kurnool | 1 |
Produces:
0 = Lowest
1 = Low
2 = Medium
3 = Highest
6.Create Raw Divorce Dataset in R
library(dplyr)
library(stringr)
library(lubridate)
library(readr)
divorce_raw <- data.frame(
case_id =
c("D1001","D1002","D1003","D1004",
"D1005","D1006","D1007","D1008",
"D1009","D1010","D1011","D1012"),
years_married =
c(12,12,-2,45,8,110,15,7,14,5,20,18),
settlement_amount =
c("500000","500000","700000",
"-800000","abc500",
"900000","300000",
"450000","1000000",
"550000","850000",
"620000"),
court_code =
c("AP01","AP01","TN01",
"KA01","AP02","MH01",
"XX01","AP03",
"KA02","AP04",
"MH02","TN02")
)
OUTPUT:
|
|
case_id |
years_married |
settlement_amount |
court_code |
|
1 |
D1001 |
12 |
500000 |
AP01 |
|
2 |
D1002 |
12 |
500000 |
AP01 |
|
3 |
D1003 |
-2 |
700000 |
TN01 |
|
4 |
D1004 |
45 |
-800000 |
KA01 |
|
5 |
D1005 |
8 |
abc500 |
AP02 |
|
6 |
D1006 |
110 |
900000 |
MH01 |
|
7 |
D1007 |
15 |
300000 |
XX01 |
|
8 |
D1008 |
7 |
450000 |
AP03 |
|
9 |
D1009 |
14 |
1000000 |
KA02 |
|
10 |
D1010 |
5 |
550000 |
AP04 |
|
11 |
D1011 |
20 |
850000 |
MH02 |
|
12 |
D1012 |
18 |
620000 |
TN02 |
Instead
of:
across()
str_trim()
str_to_title()
Use:
library(dplyr)
library(stringr)
library(lubridate)
library(readr)
7.R Cleaning Pipeline
divorce_clean <- divorce_raw %>%
mutate(
court_code = toupper(court_code),
years_married =
replace(years_married,years_married < 0 |
years_married > 60,NA),
settlement_amount =parse_number(
settlement_amount),
settlement_amount = abs(settlement_amount),
case_group =case_when(
settlement_amount >= 800000 ~ "HIGH",
settlement_amount >= 500000 ~ "MEDIUM",
TRUE ~ "LOW")
)
OUTPUT:
|
|
case_id |
years_married |
settlement_amount |
court_code |
case_group |
|
1 |
D1001 |
12 |
500000 |
AP01 |
MEDIUM |
|
2 |
D1002 |
12 |
500000 |
AP01 |
MEDIUM |
|
3 |
D1003 |
NA |
700000 |
TN01 |
MEDIUM |
|
4 |
D1004 |
45 |
800000 |
KA01 |
HIGH |
|
5 |
D1005 |
8 |
500 |
AP02 |
LOW |
|
6 |
D1006 |
NA |
900000 |
MH01 |
HIGH |
|
7 |
D1007 |
15 |
300000 |
XX01 |
LOW |
|
8 |
D1008 |
7 |
450000 |
AP03 |
LOW |
|
9 |
D1009 |
14 |
1000000 |
KA02 |
HIGH |
|
10 |
D1010 |
5 |
550000 |
AP04 |
MEDIUM |
|
11 |
D1011 |
20 |
850000 |
MH02 |
HIGH |
|
12 |
D1012 |
18 |
620000 |
TN02 |
MEDIUM |
Why parse_number() Is
Powerful
Instead
of:
gsub("[^0-9]","",x)
Use:
parse_number()
Examples:
parse_number("$500000")
parse_number("abc500")
parse_number("-800000")
Automatically
extracts numbers.
Advanced Divorce Analytics
in R
8.Court-wise Statistics
divorce_group <- divorce_clean %>%
group_by(court_code) %>%
summarise(cases=n(),
avg_settlement=mean(
settlement_amount,na.rm=TRUE)
)
OUTPUT:
|
|
court_code |
cases |
avg_settlement |
|
1 |
AP01 |
2 |
500000 |
|
2 |
AP02 |
1 |
500 |
|
3 |
AP03 |
1 |
450000 |
|
4 |
AP04 |
1 |
550000 |
|
5 |
KA01 |
1 |
800000 |
|
6 |
KA02 |
1 |
1000000 |
|
7 |
MH01 |
1 |
900000 |
|
8 |
MH02 |
1 |
850000 |
|
9 |
TN01 |
1 |
700000 |
|
10 |
TN02 |
1 |
620000 |
|
11 |
XX01 |
1 |
300000 |
9.Detect Duplicates
# Duplicate Check
divorce_summary <- divorce_clean %>%
count(case_id) %>%
filter(n > 1)
# Missing Value Audit
colSums(is.na(divorce_clean))
CONSOLE:
> colSums(is.na(divorce_clean))
case_id years_married settlement_amount court_code
0 2 0 0
case_group
0
# Structure Check
str(divorce_clean)
CONSOLE:
> str(divorce_clean)
'data.frame': 12 obs. of
5 variables:
$ case_id : chr
"D1001" "D1002" "D1003" "D1004"
...
$ years_married : num
12 12 NA 45 8 NA 15 7 14 5 ...
$ settlement_amount: num 500000 500000 700000 800000 500 900000 300000
450000 1000000 550000 ...
$ court_code : chr
"AP01" "AP01" "TN01" "KA01" ...
$ case_group : chr
"MEDIUM" "MEDIUM" "MEDIUM"
"HIGH" ...
# Summary Statistics
summary(divorce_clean)
CONSOLE:
summary(divorce_clean)
case_id years_married settlement_amount court_code
Length:12 Min.
: 5.00 Min. :
500 Length:12
Class :character 1st Qu.: 9.00 1st Qu.: 487500 Class :character
Mode
:character Median :13.00 Median : 585000 Mode
:character
Mean :15.60
Mean : 597542
3rd Qu.:17.25 3rd Qu.: 812500
Max. :45.00
Max. :1000000
NA's :2
case_group
Length:12
Class :character
Mode
:character
# Invalid Settlement Records
divorce_clean %>%
filter(is.na(settlement_amount))
CONSOLE:
[1]
case_id years_married settlement_amount court_code
[5]
case_group
<0
rows> (or 0-length row.names)
# Invalid Marriage Duration
divorce_clean %>%
filter(is.na(years_married))
CONSOLE:
case_id
years_married settlement_amount court_code case_group
1 D1003 NA 7e+05 TN01
MEDIUM
2 D1006 NA 9e+05 MH01 HIGH
|
Activity |
SAS |
R |
|
Date
Conversion |
INPUTN |
lubridate |
|
Categorization |
FORMAT |
case_when |
|
Duplicate
Detection |
PROC
SORT |
count() |
|
Validation |
PROC
FREQ |
table() |
|
Aggregation |
PROC
SUMMARY |
summarise() |
|
Ranking |
PROC
RANK |
ntile() |
Interview Questions &
Answers
1. Data Integrity Investigation
Question:
A divorce court suddenly reports 40% more finalized cases. How would you
validate whether the increase is real or caused by duplicate records?
Short Answer:
I would compare current and historical case counts, check for duplicate Case_ID values
using PROC SORT
NODUPKEY or PROC
SQL, and verify whether additional records are genuine new cases
or duplicated entries from data loading or merging issues.
2. Settlement Amount Corruption
Question:
A settlement column contains values like 500000,
-700000,
and abc500.
How would you clean and validate these values in SAS?
Short Answer:
I would use the INPUT()
function to convert character values to numeric, apply ABS() to handle negative
amounts if required by business rules, and flag invalid values such as abc500 as missing
for further investigation.
3. Court Code Governance
Question:
How would you identify invalid court jurisdictions such as XX01
before reporting?
Short
Answer:
I would compare court codes against a predefined list of valid court codes
using PROC FORMAT,
PROC SQL,
or lookup tables and flag any unmatched codes as invalid before generating
reports.
4. Date Standardization Challenge
Question:
What approach would you use when filing dates arrive in five different formats?
Short
Answer:
I would use the ANYDTDTE.
informat to convert all date formats into a standard SAS date value and then
apply a common display format such as DATE9.
for consistency.
5. Enterprise Validation Strategy
Question:
How would you independently QC a divorce dataset generated by another
programmer?
Short Answer:
I would independently recreate the dataset using the specifications, compare
record counts, variables, summary statistics, and key outputs, then investigate
and resolve any discrepancies found during reconciliation.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DIVORCE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment