456. Decoding Love Marriage Trends Worldwide Using SAS String Mastery and Data Cleaning Intelligence
Love, Data, and Chaos: Mastering Advanced String Manipulation in SAS Through a Global Love Marriage Dataset
1. Introduction – When Data Falls in Love… and Breaks Rules
Imagine you’re working as a SAS programmer in a global analytics firm. You’ve just received a dataset titled “Most Love Marriages in the World” from multiple countries. Sounds interesting, right? But when you open it…
- Ages are negative
- Dates are inconsistent (2023-02-30, 12/15/2022, NULL)
- Names are messy (john, JOHN, John )
- Duplicate couples appear multiple times
- Some countries are blank, some say "null"
This isn’t just messy; it’s dangerous.
In clinical trials or business analytics, dirty data = wrong decisions. Imagine deriving patient survival rates or financial forecasts from such flawed inputs. The consequences can be severe.
This is where SAS and R become your surgical tools. SAS offers structured, regulatory-compliant processing (especially in SDTM/ADaM), while R provides flexible, expressive transformations.
In this blog, we’ll:
- Create a messy love marriage dataset
- Inject intentional errors
- Clean it using advanced SAS string manipulation techniques
- Replicate the same steps in R
- Explain the business logic, plus interview-ready insights
2. Raw Data Creation in SAS and R
SAS Code – Raw Dataset (With Intentional Errors)
DATA love_raw;
INPUT ID Name $ Country $ Age Marriage_Date:$12. Partner_Name $
Status $ Salary Marriage_Type $;
DATALINES;
1 john india 25 2023-01-10 anna married 50000 love
2 JOHN INDIA -30 2023-02-30 ANNA married 50000 love
3 Ravi null 28 15/03/2022 Sita married . arranged
4 Priya USA 27 NULL Raj married 60000 love
5 Amit india 29 2022-12-01 Neha married 55000 LOVE
6 john india 25 2023-01-10 anna married 50000 love
7 Sara UK . 2021-05-20 Mike married 70000 love
8 NULL Canada 32 2020-11-15 John married -80000 love
9 David usa 35 2022/07/10 Lisa married 65000 love
10 Meena INDIA 26 2023-06-18 NULL married 48000 love
;
RUN;
PROC PRINT DATA=love_raw;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 2 | 2 | JOHN | INDIA | -30 | 2023-02-30 | ANNA | married | 50000 | love |
| 3 | 3 | Ravi | null | 28 | 15/03/2022 | Sita | married | . | arranged |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | love |
| 5 | 5 | Amit | india | 29 | 2022-12-01 | Neha | married | 55000 | LOVE |
| 6 | 6 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | love |
| 8 | 8 | NULL | Canada | 32 | 2020-11-15 | John | married | -80000 | love |
| 9 | 9 | David | usa | 35 | 2022/07/10 | Lisa | married | 65000 | love |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | NULL | married | 48000 | love |
Explanation
This dataset intentionally simulates real-world inconsistencies. We see duplicate rows (ID 1 & 6), invalid age (-30), missing values (.), and inconsistent date formats (YYYY-MM-DD, DD/MM/YYYY, NULL). String inconsistencies include "india", "INDIA", "null", and "NULL". Salary includes invalid negative values. Marriage type varies between "love" and "LOVE". These inconsistencies are common in multi-source datasets (e.g., clinical CRFs or global surveys). The goal is to apply SAS transformations like UPCASE, COMPRESS, COALESCEC, and date informats to standardize the dataset before analysis.
Key Points
- Raw data reflects real-world ingestion issues
- String inconsistencies are the biggest hidden problem
- Duplicate records distort analytics
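To make that “hidden problem” concrete, here is a quick illustration of how case and whitespace inconsistencies silently multiply categories. It is written in Python with pandas purely as a neutral stand-in (the blog’s own code is SAS and R); the values are invented for the demo:

```python
import pandas as pd

# Four raw values, but only two real countries
country = pd.Series(["india", "INDIA", " india ", "USA"])
print(country.nunique())   # 4 distinct values before cleaning

# Stripping whitespace and standardizing case collapses the fake categories
clean = country.str.strip().str.upper()
print(clean.nunique())     # 2 distinct values after cleaning
```

Any frequency table or group-by built on the raw column would report four countries instead of two, which is exactly how messy strings distort analytics.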
R Code – Equivalent Raw Dataset
love_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Name = c("john","JOHN","Ravi","Priya","Amit","john","Sara",
"NULL","David","Meena"),
Country = c("india","INDIA","null","USA","india","india",
"UK","Canada","usa","INDIA"),
Age = c(25,-30,28,27,29,25,NA,32,35,26),
Marriage_Date = c("2023-01-10","2023-02-30","15/03/2022",
"NULL","2022-12-01","2023-01-10","2021-05-20",
"2020-11-15","2022/07/10","2023-06-18"),
Partner_Name = c("anna","ANNA","Sita","Raj","Neha","anna",
"Mike","John","Lisa",NA),
Status = rep("married",10),
Salary = c(50000,50000,NA,60000,55000,50000,70000,-80000,
65000,48000),
Marriage_Type = c("love","love","arranged","love","LOVE","love",
"love","love","love","love")
)
OUTPUT:
| | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 2 | 2 | JOHN | INDIA | -30 | 2023-02-30 | ANNA | married | 50000 | love |
| 3 | 3 | Ravi | null | 28 | 15/03/2022 | Sita | married | NA | arranged |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | love |
| 5 | 5 | Amit | india | 29 | 2022-12-01 | Neha | married | 55000 | LOVE |
| 6 | 6 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 7 | 7 | Sara | UK | NA | 2021-05-20 | Mike | married | 70000 | love |
| 8 | 8 | NULL | Canada | 32 | 2020-11-15 | John | married | -80000 | love |
| 9 | 9 | David | usa | 35 | 2022/07/10 | Lisa | married | 65000 | love |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | NA | married | 48000 | love |
Explanation
R replicates the same structure using data.frame(). NA represents missing values, unlike the period (.) used by SAS. This dataset is equally messy and highlights the importance of cross-platform cleaning strategies. R is particularly useful for exploratory cleaning, while SAS is preferred in regulated environments.
Key Points
- NA vs. (.) — the missing-value markers differ between R and SAS
- Same data issues replicated for consistency
- Useful for cross-validation
3. Phase 1: Data Cleaning in SAS
DATA love_clean;
SET love_raw;
/* Standardize text */
Name = PROPCASE(STRIP(Name));
Country = UPCASE(COALESCEC(Country, "UNKNOWN"));
Marriage_Type = UPCASE(Marriage_Type);
/* Handle NULL values */
IF Country = "NULL" THEN Country = "UNKNOWN";
IF Name = "NULL" THEN Name = "UNKNOWN";
/* Fix Age */
IF Age < 0 THEN Age = ABS(Age);
/* Fix Salary */
IF Salary < 0 THEN Salary = ABS(Salary);
/* Date conversion: ANYDTDTE. defaults to a width of 9, so 10-character
   dates are truncated ("2023-01-10" is read as "2023-01-1");
   use ANYDTDTE12. to parse the full string */
Marriage_dt = INPUT(Marriage_Date, ANYDTDTE.);
FORMAT Marriage_dt DATE9.;
/* Remove missing partner */
IF Partner_Name = "NULL" THEN Partner_Name = "UNKNOWN";
RUN;
PROC PRINT DATA=love_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 2 | 2 | John | INDIA | 30 | 2023-02-30 | ANNA | married | 50000 | LOVE | 03FEB2023 |
| 3 | 3 | Ravi | UNKNOWN | 28 | 15/03/2022 | Sita | married | . | ARRANGED | . |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | LOVE | . |
| 5 | 5 | Amit | INDIA | 29 | 2022-12-01 | Neha | married | 55000 | LOVE | . |
| 6 | 6 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | LOVE | 02MAY2021 |
| 8 | 8 | Null | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE | 01NOV2020 |
| 9 | 9 | David | USA | 35 | 2022/07/10 | Lisa | married | 65000 | LOVE | 01JUL2022 |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | UNKNOWN | married | 48000 | LOVE | 01JUN2023 |
/* Remove duplicates */
PROC SORT DATA=love_clean NODUPKEY;
BY ID Name Marriage_dt;
RUN;
PROC PRINT DATA=love_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 2 | 2 | John | INDIA | 30 | 2023-02-30 | ANNA | married | 50000 | LOVE | 03FEB2023 |
| 3 | 3 | Ravi | UNKNOWN | 28 | 15/03/2022 | Sita | married | . | ARRANGED | . |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | LOVE | . |
| 5 | 5 | Amit | INDIA | 29 | 2022-12-01 | Neha | married | 55000 | LOVE | . |
| 6 | 6 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | LOVE | 02MAY2021 |
| 8 | 8 | Null | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE | 01NOV2020 |
| 9 | 9 | David | USA | 35 | 2022/07/10 | Lisa | married | 65000 | LOVE | 01JUL2022 |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | UNKNOWN | married | 48000 | LOVE | 01JUN2023 |
Explanation
This phase demonstrates advanced string manipulation and validation logic. PROPCASE() standardizes names, UPCASE() ensures uniformity, and COALESCEC() fills missing character values. STRIP() removes leading/trailing spaces, which is critical in matching operations. The ANYDTDTE. informat is powerful for parsing mixed date formats, but note its default width of 9: ten-character dates are truncated, which is why "2023-01-10" resolves to 01JAN2023 in the output above; use ANYDTDTE12. in production code. Two further pitfalls are visible in the output: the IF Name = "NULL" check runs after PROPCASE has already turned "NULL" into "Null", so row 8 is not caught (compare UPCASE(Name) = "NULL" instead), and NODUPKEY keyed BY ID removes nothing because ID is unique; to collapse duplicate couples such as rows 1 and 6, key on Name and Marriage_dt without ID. Negative values are corrected using ABS(). These techniques are essential in ADaM dataset derivation, where traceability and consistency are mandatory.
Key Points
- PROPCASE, UPCASE, STRIP are core string tools
- ANYDTDTE. handles messy dates
- PROC SORT NODUPKEY ensures uniqueness
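The best-practice list later in this post recommends cross-checking with R/Python. As an illustrative sketch only (not part of the original workflow, with the dataset trimmed to a few columns for brevity), the same Phase 1 logic might look like this in pandas:

```python
import pandas as pd

love_raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["john", "JOHN", "NULL"],
    "Country": ["india", "null", "Canada"],
    "Age": [25, -30, 32],
    "Salary": [50000, 50000, -80000],
    "Marriage_Date": ["2023-01-10", "2023-02-30", "2020-11-15"],
})

clean = love_raw.copy()
# Standardize text (STRIP + PROPCASE / UPCASE equivalents)
clean["Name"] = clean["Name"].str.strip().str.title()
clean["Country"] = clean["Country"].str.strip().str.upper()
# Catch literal NULL markers; comparing in upper case avoids the ordering
# pitfall of checking the raw spelling after title-casing
clean.loc[clean["Name"].str.upper() == "NULL", "Name"] = "UNKNOWN"
clean["Country"] = clean["Country"].replace("NULL", "UNKNOWN")
# Repair impossible negatives (ABS equivalent)
clean["Age"] = clean["Age"].abs()
clean["Salary"] = clean["Salary"].abs()
# Invalid calendar dates such as 2023-02-30 become NaT instead of raising
clean["Marriage_dt"] = pd.to_datetime(clean["Marriage_Date"], errors="coerce")
```

The errors="coerce" behavior parallels how an informat returns missing for unparseable input rather than aborting the step.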
4. Phase 2: Data Cleaning in R
library(dplyr)
love_clean <- love_raw %>%
  mutate(
    Name = tools::toTitleCase(trimws(Name)),
    Country = toupper(trimws(ifelse(Country == "null" | Country == "",
                                    "UNKNOWN", Country))),
    Marriage_Type = toupper(Marriage_Type),
    Age = ifelse(Age < 0, abs(Age), Age),
    Salary = ifelse(Salary < 0, abs(Salary), Salary),
    Marriage_Date = as.Date(Marriage_Date,
                            tryFormats = c("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"))
  ) %>%
  distinct(ID, Name, Marriage_Date, .keep_all = TRUE)
OUTPUT:
| | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE |
| 2 | 2 | JOHN | INDIA | 30 | NA | ANNA | married | 50000 | LOVE |
| 3 | 3 | Ravi | UNKNOWN | 28 | NA | Sita | married | NA | ARRANGED |
| 4 | 4 | Priya | USA | 27 | NA | Raj | married | 60000 | LOVE |
| 5 | 5 | Amit | INDIA | 29 | 2022-12-01 | Neha | married | 55000 | LOVE |
| 6 | 6 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE |
| 7 | 7 | Sara | UK | NA | 2021-05-20 | Mike | married | 70000 | LOVE |
| 8 | 8 | NULL | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE |
| 9 | 9 | David | USA | 35 | NA | Lisa | married | 65000 | LOVE |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | NA | married | 48000 | LOVE |
Explanation
R uses mutate() for transformations and distinct() for deduplication. trimws() removes spaces, while toupper() standardizes case. Conditional logic with ifelse() handles invalid values. as.Date() with tryFormats loosely mimics SAS’s ANYDTDTE., with one important caveat: it picks a single format (based on the first parseable value) and applies it to the entire vector, which is why the DD/MM/YYYY and YYYY/MM/DD rows come out as NA above. For genuine per-value mixed-format parsing, lubridate::parse_date_time() is the usual tool. This approach is efficient for exploratory data analysis and rapid prototyping.
Key Points
- mutate() = SAS DATA step equivalent
- distinct() removes duplicates
- Flexible date parsing
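SAS’s ANYDTDTE. tries several layouts per value, while as.Date() locks onto one format for the whole vector. A per-value fallback loop, sketched here in Python as an illustration (not a replacement for either tool), shows the idea, including why the order of candidate formats encodes an assumption about ambiguous dates:

```python
from datetime import datetime

# Candidate layouts tried in order for each value; the order itself decides
# how an ambiguous string like "2022/07/10" is interpreted
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

def parse_mixed_date(text):
    """Return a date for the first matching layout, or None if none fit."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None  # e.g. "NULL", or impossible dates like 2023-02-30

print(parse_mixed_date("15/03/2022"))  # 2022-03-15
print(parse_mixed_date("2023-02-30"))  # None (February 30 does not exist)
```

This is also why ANYDTDTE. results should always be validated: the informat has to guess when a string fits more than one layout.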
5. Phase 3: Advanced SAS String Techniques
DATA love_adv;
SET love_clean;
/* Extract first letter */
Initial = SUBSTR(Name,1,1);
/* Remove vowels from name */
Name_NoVowel = COMPRESS(Name, 'aeiouAEIOU');
/* Scan country */
Country_Part = SCAN(Country,1,' ');
/* Concatenate couple name */
Couple = CATX('_', Name, Partner_Name);
/* Replace text */
Country = TRANWRD(Country, "INDIA", "BHARAT");
RUN;
PROC PRINT DATA=love_adv;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type | Marriage_dt | Initial | Name_NoVowel | Country_Part | Couple |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | BHARAT | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 | J | Jhn | INDIA | John_anna |
| 2 | 2 | John | BHARAT | 30 | 2023-02-30 | ANNA | married | 50000 | LOVE | 03FEB2023 | J | Jhn | INDIA | John_ANNA |
| 3 | 3 | Ravi | UNKNOWN | 28 | 15/03/2022 | Sita | married | . | ARRANGED | . | R | Rv | UNKNOWN | Ravi_Sita |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | LOVE | . | P | Pry | USA | Priya_Raj |
| 5 | 5 | Amit | BHARAT | 29 | 2022-12-01 | Neha | married | 55000 | LOVE | . | A | mt | INDIA | Amit_Neha |
| 6 | 6 | John | BHARAT | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 | J | Jhn | INDIA | John_anna |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | LOVE | 02MAY2021 | S | Sr | UK | Sara_Mike |
| 8 | 8 | Null | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE | 01NOV2020 | N | Nll | CANADA | Null_John |
| 9 | 9 | David | USA | 35 | 2022/07/10 | Lisa | married | 65000 | LOVE | 01JUL2022 | D | Dvd | USA | David_Lisa |
| 10 | 10 | Meena | BHARAT | 26 | 2023-06-18 | UNKNOWN | married | 48000 | LOVE | 01JUN2023 | M | Mn | INDIA | Meena_UNKNOWN |
Explanation
This phase focuses on advanced string manipulation. SUBSTR() extracts substrings, COMPRESS() removes characters, SCAN() tokenizes strings, and CATX() concatenates efficiently. TRANWRD() replaces text dynamically. These are heavily used in SDTM variable derivations, especially when parsing free-text fields.
Key Points
- String parsing is critical in real datasets
- CATX avoids manual concatenation issues
- TRANWRD useful for standardization
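For readers cross-training between languages, the Phase 3 functions map to plain string operations. This Python sketch is illustrative only (note one real difference: str.replace swaps any substring, whereas TRANWRD swaps a word string) and mirrors each derivation from the DATA step above:

```python
import re

name, partner, country = "John", "anna", "NEW DELHI INDIA"

initial = name[0]                                   # SUBSTR(Name, 1, 1)
no_vowels = re.sub(r"[aeiouAEIOU]", "", name)       # COMPRESS(Name, 'aeiouAEIOU')
first_token = country.split(" ")[0]                 # SCAN(Country, 1, ' ')
couple = "_".join(x for x in (name, partner) if x)  # CATX('_', ...) skips blanks
replaced = country.replace("INDIA", "BHARAT")       # TRANWRD(Country, "INDIA", "BHARAT")

print(initial, no_vowels, first_token, couple, replaced)
# J Jhn NEW John_anna NEW DELHI BHARAT
```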
6. 20 Data Cleaning Best Practices
- Always validate against source (CRF/raw)
- Maintain audit trails
- Use controlled terminology (CDISC)
- Avoid hardcoding values
- Standardize formats early
- Validate ranges (age, lab values)
- Use metadata-driven programming
- Document all transformations
- Handle missing values systematically
- Perform duplicate checks
- Use validation macros
- Cross-check with R/Python
- Maintain version control
- Log all warnings/errors
- Use PROC COMPARE for QC
- Align with SAP requirements
- Ensure traceability (ADaM)
- Avoid overwriting raw data
- Validate date sequences
- Perform independent QC
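Several of these practices (range validation, duplicate checks, systematic missing handling) can be wired into a small automated report. The sketch below uses pandas, with an invented 18–100 age window, purely to illustrate the pattern:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Age": [25, -30, 28, 28],
    "Salary": [50000, 50000, None, None],
})

# Each rule is a boolean mask; the 18-100 age window is an assumed threshold
issues = {
    "age_out_of_range": ~df["Age"].between(18, 100),
    "missing_salary": df["Salary"].isna(),
    "duplicate_id": df["ID"].duplicated(keep=False),
}

# Count flagged rows per rule; a production pipeline would also log row IDs
report = {rule: int(mask.sum()) for rule, mask in issues.items()}
print(report)  # {'age_out_of_range': 1, 'missing_salary': 2, 'duplicate_id': 2}
```

Expressing each rule as a named mask keeps the checks documented and auditable, which is the spirit of the audit-trail and QC items above.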
7. Business Logic Behind Data Cleaning
Why do we clean data?
- Missing values → replaced to avoid bias
- Negative age → logically impossible → corrected
- Invalid salary → affects financial models
- Date inconsistencies → break time-based analysis
Example: if a patient’s age is -30, survival analysis becomes invalid. If a marriage date is wrong, trend analysis fails.
Cleaning ensures accurate decision-making and regulatory compliance.
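The negative-age point is easy to demonstrate with plain arithmetic; one impossible value drags the whole statistic:

```python
# One impossible age wrecks the mean
ages = [25, -30, 28, 27, 29]
print(sum(ages) / len(ages))    # 15.8 -- nonsense driven by the -30

# After correcting it (ABS here; dropping or flagging may be the right SOP)
fixed = [abs(a) for a in ages]
print(sum(fixed) / len(fixed))  # 27.8
```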
8. 20 Key Insights
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Dates drive timelines
- Missing ≠ zero
- Duplicates inflate metrics
- Validation is non-negotiable
- String cleaning is underrated
- Audit trails are mandatory
- SAS ensures compliance
- R ensures flexibility
- Business logic drives cleaning
- Data lineage matters
- QC saves projects
- Automation reduces errors
- Metadata is powerful
- Consistency is key
- Documentation is survival
- Clean data = trust
- Errors compound quickly
- Prevention is better than cleaning
9. SAS vs R Comparison
| Feature | SAS | R |
|---|---|---|
| Regulatory Use | Strong | Moderate |
| Flexibility | Moderate | High |
| String Functions | Robust | Extensive |
| Learning Curve | Moderate | Steep |
| Industry Use | Pharma | Data Science |
10. Summary
We transformed a chaotic love marriage dataset into a clean, analysis-ready structure using SAS and R. SAS provided structured, compliance-ready transformations, while R offered flexible manipulation.
11. Conclusion
Data cleaning is not just a step; it is the foundation of analytics. Whether you’re working on love marriage datasets or clinical trials, mastering string manipulation and validation logic is essential. SAS gives you control, traceability, and compliance: qualities that define a professional data programmer.
12. Interview Questions
Q1: How do you handle mixed date formats in SAS?
A: Use the ANYDTDTE. informat (with an explicit width such as ANYDTDTE12. so long dates are not truncated).
Q2: How do you remove duplicates?
A: PROC SORT NODUPKEY.
Q3: How do you standardize text?
A: UPCASE, LOWCASE, PROPCASE.
Q4: What is the R equivalent of the SAS DATA step?
A: mutate() in dplyr.
Q5: Scenario: Age is negative. What do you do?
A: Validate against the source, then apply ABS() or set the value to missing.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about LOVE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics