456. Decoding Love Marriage Trends Worldwide Using SAS String Mastery and Data Cleaning Intelligence
Love, Data, and Chaos: Mastering Advanced String Manipulation in SAS Through a Global Love Marriage Dataset
1. Introduction – When Data Falls in Love… and Breaks Rules
Imagine you’re working as a SAS programmer in a global analytics firm. You’ve just received a dataset titled “Most Love Marriages in the World” from multiple countries. Sounds interesting, right? But when you open it…
- Ages are negative
- Dates are inconsistent (2023-02-30, 12/15/2022, NULL)
- Names are messy (john, JOHN, John )
- Duplicate couples appear multiple times
- Some countries are blank, some say "null"
This isn’t just messy; it’s dangerous.
In clinical trials or business analytics, dirty data = wrong decisions. Imagine deriving patient survival rates or financial forecasts from such flawed inputs. The consequences can be severe.
This is where SAS and R become your surgical tools. SAS offers structured, regulatory-compliant processing (especially in SDTM/ADaM), while R provides flexible, expressive transformations.
In this blog, we’ll:
- Create a messy love marriage dataset
- Inject intentional errors
- Clean it using advanced SAS string manipulation techniques
- Replicate the same steps in R
- Explain the business logic, plus interview-ready insights
2. Raw Data Creation in SAS and R
SAS Code – Raw Dataset (With Intentional Errors)
DATA love_raw;
INPUT ID Name $ Country $ Age Marriage_Date:$12. Partner_Name $
Status $ Salary Marriage_Type $;
DATALINES;
1 john india 25 2023-01-10 anna married 50000 love
2 JOHN INDIA -30 2023-02-30 ANNA married 50000 love
3 Ravi null 28 15/03/2022 Sita married . arranged
4 Priya USA 27 NULL Raj married 60000 love
5 Amit india 29 2022-12-01 Neha married 55000 LOVE
6 john india 25 2023-01-10 anna married 50000 love
7 Sara UK . 2021-05-20 Mike married 70000 love
8 NULL Canada 32 2020-11-15 John married -80000 love
9 David usa 35 2022/07/10 Lisa married 65000 love
10 Meena INDIA 26 2023-06-18 NULL married 48000 love
;
RUN;
PROC PRINT DATA=love_raw;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 2 | 2 | JOHN | INDIA | -30 | 2023-02-30 | ANNA | married | 50000 | love |
| 3 | 3 | Ravi | null | 28 | 15/03/2022 | Sita | married | . | arranged |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | love |
| 5 | 5 | Amit | india | 29 | 2022-12-01 | Neha | married | 55000 | LOVE |
| 6 | 6 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | love |
| 8 | 8 | NULL | Canada | 32 | 2020-11-15 | John | married | -80000 | love |
| 9 | 9 | David | usa | 35 | 2022/07/10 | Lisa | married | 65000 | love |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | NULL | married | 48000 | love |
Explanation
This dataset intentionally simulates real-world inconsistencies. We see duplicate rows (ID 1 & 6), invalid age (-30), missing values (.), and inconsistent date formats (YYYY-MM-DD, DD/MM/YYYY, NULL). String inconsistencies include "india", "INDIA", "null", and "NULL". Salary includes invalid negative values. Marriage type varies between "love" and "LOVE". These inconsistencies are common in multi-source datasets (e.g., clinical CRFs or global surveys). The goal is to apply SAS transformations like UPCASE, COMPRESS, COALESCEC, and date informats to standardize the dataset before analysis.
Key Points
- Raw data reflects real-world ingestion issues
- String inconsistencies are the biggest hidden problem
- Duplicate records distort analytics
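To make that “hidden problem” concrete, here is a quick illustration of how case and whitespace inconsistencies silently multiply categories. It is written in Python with pandas purely as a neutral stand-in (the blog’s own code is SAS and R); the values are invented for the demo:

```python
import pandas as pd

# Four raw values, but only two real countries
country = pd.Series(["india", "INDIA", " india ", "USA"])
print(country.nunique())   # 4 distinct values before cleaning

# Stripping whitespace and standardizing case collapses the fake categories
clean = country.str.strip().str.upper()
print(clean.nunique())     # 2 distinct values after cleaning
```

Any frequency table or group-by built on the raw column would report four countries instead of two, which is exactly how messy strings distort analytics.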
R Code – Equivalent Raw Dataset
love_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Name = c("john","JOHN","Ravi","Priya","Amit","john","Sara",
"NULL","David","Meena"),
Country = c("india","INDIA","null","USA","india","india",
"UK","Canada","usa","INDIA"),
Age = c(25,-30,28,27,29,25,NA,32,35,26),
Marriage_Date = c("2023-01-10","2023-02-30","15/03/2022",
"NULL","2022-12-01","2023-01-10","2021-05-20",
"2020-11-15","2022/07/10","2023-06-18"),
Partner_Name = c("anna","ANNA","Sita","Raj","Neha","anna",
"Mike","John","Lisa",NA),
Status = rep("married",10),
Salary = c(50000,50000,NA,60000,55000,50000,70000,-80000,
65000,48000),
Marriage_Type = c("love","love","arranged","love","LOVE","love",
"love","love","love","love")
)
OUTPUT:
| | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 2 | 2 | JOHN | INDIA | -30 | 2023-02-30 | ANNA | married | 50000 | love |
| 3 | 3 | Ravi | null | 28 | 15/03/2022 | Sita | married | NA | arranged |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | love |
| 5 | 5 | Amit | india | 29 | 2022-12-01 | Neha | married | 55000 | LOVE |
| 6 | 6 | john | india | 25 | 2023-01-10 | anna | married | 50000 | love |
| 7 | 7 | Sara | UK | NA | 2021-05-20 | Mike | married | 70000 | love |
| 8 | 8 | NULL | Canada | 32 | 2020-11-15 | John | married | -80000 | love |
| 9 | 9 | David | usa | 35 | 2022/07/10 | Lisa | married | 65000 | love |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | NA | married | 48000 | love |
Explanation
R replicates the same structure using data.frame(). NA represents missing values, unlike the period (.) used by SAS. This dataset is equally messy and highlights the importance of cross-platform cleaning strategies. R is particularly useful for exploratory cleaning, while SAS is preferred in regulated environments.
Key Points
- NA vs. (.) — the missing-value markers differ between R and SAS
- Same data issues replicated for consistency
- Useful for cross-validation
3. Phase 1: Data Cleaning in SAS
DATA love_clean;
SET love_raw;
/* Standardize text */
Name = PROPCASE(STRIP(Name));
Country = UPCASE(COALESCEC(Country, "UNKNOWN"));
Marriage_Type = UPCASE(Marriage_Type);
/* Handle NULL values */
IF Country = "NULL" THEN Country = "UNKNOWN";
IF Name = "NULL" THEN Name = "UNKNOWN";
/* Fix Age */
IF Age < 0 THEN Age = ABS(Age);
/* Fix Salary */
IF Salary < 0 THEN Salary = ABS(Salary);
/* Date conversion: ANYDTDTE. defaults to a width of 9, so 10-character
   dates are truncated ("2023-01-10" is read as "2023-01-1");
   use ANYDTDTE12. to parse the full string */
Marriage_dt = INPUT(Marriage_Date, ANYDTDTE.);
FORMAT Marriage_dt DATE9.;
/* Remove missing partner */
IF Partner_Name = "NULL" THEN Partner_Name = "UNKNOWN";
RUN;
PROC PRINT DATA=love_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 2 | 2 | John | INDIA | 30 | 2023-02-30 | ANNA | married | 50000 | LOVE | 03FEB2023 |
| 3 | 3 | Ravi | UNKNOWN | 28 | 15/03/2022 | Sita | married | . | ARRANGED | . |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | LOVE | . |
| 5 | 5 | Amit | INDIA | 29 | 2022-12-01 | Neha | married | 55000 | LOVE | . |
| 6 | 6 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | LOVE | 02MAY2021 |
| 8 | 8 | Null | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE | 01NOV2020 |
| 9 | 9 | David | USA | 35 | 2022/07/10 | Lisa | married | 65000 | LOVE | 01JUL2022 |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | UNKNOWN | married | 48000 | LOVE | 01JUN2023 |
/* Remove duplicates */
PROC SORT DATA=love_clean NODUPKEY;
BY ID Name Marriage_dt;
RUN;
PROC PRINT DATA=love_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 2 | 2 | John | INDIA | 30 | 2023-02-30 | ANNA | married | 50000 | LOVE | 03FEB2023 |
| 3 | 3 | Ravi | UNKNOWN | 28 | 15/03/2022 | Sita | married | . | ARRANGED | . |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | LOVE | . |
| 5 | 5 | Amit | INDIA | 29 | 2022-12-01 | Neha | married | 55000 | LOVE | . |
| 6 | 6 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | LOVE | 02MAY2021 |
| 8 | 8 | Null | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE | 01NOV2020 |
| 9 | 9 | David | USA | 35 | 2022/07/10 | Lisa | married | 65000 | LOVE | 01JUL2022 |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | UNKNOWN | married | 48000 | LOVE | 01JUN2023 |
Explanation
This phase demonstrates advanced string manipulation and validation logic. PROPCASE() standardizes names, UPCASE() ensures uniformity, and COALESCEC() fills missing character values. STRIP() removes leading/trailing spaces, which is critical in matching operations. The ANYDTDTE. informat is powerful for parsing mixed date formats, but note its default width of 9: ten-character dates are truncated, which is why "2023-01-10" resolves to 01JAN2023 in the output above; use ANYDTDTE12. in production code. Two further pitfalls are visible in the output: the IF Name = "NULL" check runs after PROPCASE has already turned "NULL" into "Null", so row 8 is not caught (compare UPCASE(Name) = "NULL" instead), and NODUPKEY keyed BY ID removes nothing because ID is unique; to collapse duplicate couples such as rows 1 and 6, key on Name and Marriage_dt without ID. Negative values are corrected using ABS(). These techniques are essential in ADaM dataset derivation, where traceability and consistency are mandatory.
Key Points
- PROPCASE, UPCASE, STRIP are core string tools
- ANYDTDTE. handles messy dates
- PROC SORT NODUPKEY ensures uniqueness
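The best-practice list later in this post recommends cross-checking with R/Python. As an illustrative sketch only (not part of the original workflow, with the dataset trimmed to a few columns for brevity), the same Phase 1 logic might look like this in pandas:

```python
import pandas as pd

love_raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["john", "JOHN", "NULL"],
    "Country": ["india", "null", "Canada"],
    "Age": [25, -30, 32],
    "Salary": [50000, 50000, -80000],
    "Marriage_Date": ["2023-01-10", "2023-02-30", "2020-11-15"],
})

clean = love_raw.copy()
# Standardize text (STRIP + PROPCASE / UPCASE equivalents)
clean["Name"] = clean["Name"].str.strip().str.title()
clean["Country"] = clean["Country"].str.strip().str.upper()
# Catch literal NULL markers; comparing in upper case avoids the ordering
# pitfall of checking the raw spelling after title-casing
clean.loc[clean["Name"].str.upper() == "NULL", "Name"] = "UNKNOWN"
clean["Country"] = clean["Country"].replace("NULL", "UNKNOWN")
# Repair impossible negatives (ABS equivalent)
clean["Age"] = clean["Age"].abs()
clean["Salary"] = clean["Salary"].abs()
# Invalid calendar dates such as 2023-02-30 become NaT instead of raising
clean["Marriage_dt"] = pd.to_datetime(clean["Marriage_Date"], errors="coerce")
```

The errors="coerce" behavior parallels how an informat returns missing for unparseable input rather than aborting the step.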
4. Phase 2: Data Cleaning in R
library(dplyr)
love_clean <- love_raw %>%
  mutate(
    Name = tools::toTitleCase(trimws(Name)),
    Country = toupper(trimws(ifelse(Country == "null" | Country == "",
                                    "UNKNOWN", Country))),
    Marriage_Type = toupper(Marriage_Type),
    Age = ifelse(Age < 0, abs(Age), Age),
    Salary = ifelse(Salary < 0, abs(Salary), Salary),
    Marriage_Date = as.Date(Marriage_Date,
                            tryFormats = c("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"))
  ) %>%
  distinct(ID, Name, Marriage_Date, .keep_all = TRUE)
OUTPUT:
| | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE |
| 2 | 2 | JOHN | INDIA | 30 | NA | ANNA | married | 50000 | LOVE |
| 3 | 3 | Ravi | UNKNOWN | 28 | NA | Sita | married | NA | ARRANGED |
| 4 | 4 | Priya | USA | 27 | NA | Raj | married | 60000 | LOVE |
| 5 | 5 | Amit | INDIA | 29 | 2022-12-01 | Neha | married | 55000 | LOVE |
| 6 | 6 | John | INDIA | 25 | 2023-01-10 | anna | married | 50000 | LOVE |
| 7 | 7 | Sara | UK | NA | 2021-05-20 | Mike | married | 70000 | LOVE |
| 8 | 8 | NULL | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE |
| 9 | 9 | David | USA | 35 | NA | Lisa | married | 65000 | LOVE |
| 10 | 10 | Meena | INDIA | 26 | 2023-06-18 | NA | married | 48000 | LOVE |
Explanation
R uses mutate() for transformations and distinct() for deduplication. trimws() removes spaces, while toupper() standardizes case. Conditional logic with ifelse() handles invalid values. as.Date() with tryFormats loosely mimics SAS’s ANYDTDTE., with one important caveat: it picks a single format (based on the first parseable value) and applies it to the entire vector, which is why the DD/MM/YYYY and YYYY/MM/DD rows come out as NA above. For genuine per-value mixed-format parsing, lubridate::parse_date_time() is the usual tool. This approach is efficient for exploratory data analysis and rapid prototyping.
Key Points
- mutate() = SAS DATA step equivalent
- distinct() removes duplicates
- Flexible date parsing
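SAS’s ANYDTDTE. tries several layouts per value, while as.Date() locks onto one format for the whole vector. A per-value fallback loop, sketched here in Python as an illustration (not a replacement for either tool), shows the idea, including why the order of candidate formats encodes an assumption about ambiguous dates:

```python
from datetime import datetime

# Candidate layouts tried in order for each value; the order itself decides
# how an ambiguous string like "2022/07/10" is interpreted
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

def parse_mixed_date(text):
    """Return a date for the first matching layout, or None if none fit."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None  # e.g. "NULL", or impossible dates like 2023-02-30

print(parse_mixed_date("15/03/2022"))  # 2022-03-15
print(parse_mixed_date("2023-02-30"))  # None (February 30 does not exist)
```

This is also why ANYDTDTE. results should always be validated: the informat has to guess when a string fits more than one layout.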
5. Phase 3: Advanced SAS String Techniques
DATA love_adv;
SET love_clean;
/* Extract first letter */
Initial = SUBSTR(Name,1,1);
/* Remove vowels from name */
Name_NoVowel = COMPRESS(Name, 'aeiouAEIOU');
/* Scan country */
Country_Part = SCAN(Country,1,' ');
/* Concatenate couple name */
Couple = CATX('_', Name, Partner_Name);
/* Replace text */
Country = TRANWRD(Country, "INDIA", "BHARAT");
RUN;
PROC PRINT DATA=love_adv;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Age | Marriage_Date | Partner_Name | Status | Salary | Marriage_Type | Marriage_dt | Initial | Name_NoVowel | Country_Part | Couple |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | BHARAT | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 | J | Jhn | INDIA | John_anna |
| 2 | 2 | John | BHARAT | 30 | 2023-02-30 | ANNA | married | 50000 | LOVE | 03FEB2023 | J | Jhn | INDIA | John_ANNA |
| 3 | 3 | Ravi | UNKNOWN | 28 | 15/03/2022 | Sita | married | . | ARRANGED | . | R | Rv | UNKNOWN | Ravi_Sita |
| 4 | 4 | Priya | USA | 27 | NULL | Raj | married | 60000 | LOVE | . | P | Pry | USA | Priya_Raj |
| 5 | 5 | Amit | BHARAT | 29 | 2022-12-01 | Neha | married | 55000 | LOVE | . | A | mt | INDIA | Amit_Neha |
| 6 | 6 | John | BHARAT | 25 | 2023-01-10 | anna | married | 50000 | LOVE | 01JAN2023 | J | Jhn | INDIA | John_anna |
| 7 | 7 | Sara | UK | . | 2021-05-20 | Mike | married | 70000 | LOVE | 02MAY2021 | S | Sr | UK | Sara_Mike |
| 8 | 8 | Null | CANADA | 32 | 2020-11-15 | John | married | 80000 | LOVE | 01NOV2020 | N | Nll | CANADA | Null_John |
| 9 | 9 | David | USA | 35 | 2022/07/10 | Lisa | married | 65000 | LOVE | 01JUL2022 | D | Dvd | USA | David_Lisa |
| 10 | 10 | Meena | BHARAT | 26 | 2023-06-18 | UNKNOWN | married | 48000 | LOVE | 01JUN2023 | M | Mn | INDIA | Meena_UNKNOWN |
Explanation
This phase focuses on advanced string manipulation. SUBSTR() extracts substrings, COMPRESS() removes characters, SCAN() tokenizes strings, and CATX() concatenates efficiently. TRANWRD() replaces text dynamically. These are heavily used in SDTM variable derivations, especially when parsing free-text fields.
Key Points
- String parsing is critical in real datasets
- CATX avoids manual concatenation issues
- TRANWRD useful for standardization
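For readers cross-training between languages, the Phase 3 functions map to plain string operations. This Python sketch is illustrative only (note one real difference: str.replace swaps any substring, whereas TRANWRD swaps a word string) and mirrors each derivation from the DATA step above:

```python
import re

name, partner, country = "John", "anna", "NEW DELHI INDIA"

initial = name[0]                                   # SUBSTR(Name, 1, 1)
no_vowels = re.sub(r"[aeiouAEIOU]", "", name)       # COMPRESS(Name, 'aeiouAEIOU')
first_token = country.split(" ")[0]                 # SCAN(Country, 1, ' ')
couple = "_".join(x for x in (name, partner) if x)  # CATX('_', ...) skips blanks
replaced = country.replace("INDIA", "BHARAT")       # TRANWRD(Country, "INDIA", "BHARAT")

print(initial, no_vowels, first_token, couple, replaced)
# J Jhn NEW John_anna NEW DELHI BHARAT
```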
6. 20 Data Cleaning Best Practices
- Always validate against source (CRF/raw)
- Maintain audit trails
- Use controlled terminology (CDISC)
- Avoid hardcoding values
- Standardize formats early
- Validate ranges (age, lab values)
- Use metadata-driven programming
- Document all transformations
- Handle missing values systematically
- Perform duplicate checks
- Use validation macros
- Cross-check with R/Python
- Maintain version control
- Log all warnings/errors
- Use PROC COMPARE for QC
- Align with SAP requirements
- Ensure traceability (ADaM)
- Avoid overwriting raw data
- Validate date sequences
- Perform independent QC
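Several of these practices (range validation, duplicate checks, systematic missing handling) can be wired into a small automated report. The sketch below uses pandas, with an invented 18–100 age window, purely to illustrate the pattern:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Age": [25, -30, 28, 28],
    "Salary": [50000, 50000, None, None],
})

# Each rule is a boolean mask; the 18-100 age window is an assumed threshold
issues = {
    "age_out_of_range": ~df["Age"].between(18, 100),
    "missing_salary": df["Salary"].isna(),
    "duplicate_id": df["ID"].duplicated(keep=False),
}

# Count flagged rows per rule; a production pipeline would also log row IDs
report = {rule: int(mask.sum()) for rule, mask in issues.items()}
print(report)  # {'age_out_of_range': 1, 'missing_salary': 2, 'duplicate_id': 2}
```

Expressing each rule as a named mask keeps the checks documented and auditable, which is the spirit of the audit-trail and QC items above.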
7. Business Logic Behind Data Cleaning
Why do we clean data?
- Missing values → replaced to avoid bias
- Negative age → logically impossible → corrected
- Invalid salary → affects financial models
- Date inconsistencies → break time-based analysis
Example: if a patient’s age is -30, survival analysis becomes invalid. If a marriage date is wrong, trend analysis fails.
Cleaning ensures accurate decision-making and regulatory compliance.
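The negative-age point is easy to demonstrate with plain arithmetic; one impossible value drags the whole statistic:

```python
# One impossible age wrecks the mean
ages = [25, -30, 28, 27, 29]
print(sum(ages) / len(ages))    # 15.8 -- nonsense driven by the -30

# After correcting it (ABS here; dropping or flagging may be the right SOP)
fixed = [abs(a) for a in ages]
print(sum(fixed) / len(fixed))  # 27.8
```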
8. 20 Key Insights
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Dates drive timelines
- Missing ≠ zero
- Duplicates inflate metrics
- Validation is non-negotiable
- String cleaning is underrated
- Audit trails are mandatory
- SAS ensures compliance
- R ensures flexibility
- Business logic drives cleaning
- Data lineage matters
- QC saves projects
- Automation reduces errors
- Metadata is powerful
- Consistency is key
- Documentation is survival
- Clean data = trust
- Errors compound quickly
- Prevention is better than cleaning
9. SAS vs R Comparison
| Feature | SAS | R |
|---|---|---|
| Regulatory Use | Strong | Moderate |
| Flexibility | Moderate | High |
| String Functions | Robust | Extensive |
| Learning Curve | Moderate | Steep |
| Industry Use | Pharma | Data Science |
10. Summary
We transformed a chaotic love marriage dataset into a clean, analysis-ready structure using SAS and R. SAS provided structured, compliance-ready transformations, while R offered flexible manipulation.
11. Conclusion
Data cleaning is not just a step; it is the foundation of analytics. Whether you’re working on love marriage datasets or clinical trials, mastering string manipulation and validation logic is essential. SAS gives you control, traceability, and compliance: qualities that define a professional data programmer.
12. Interview Questions
Q1: How do you handle mixed date formats in SAS?
A: Use the ANYDTDTE. informat (with an explicit width such as ANYDTDTE12. so long dates are not truncated).
Q2: How do you remove duplicates?
A: PROC SORT NODUPKEY.
Q3: How do you standardize text?
A: UPCASE, LOWCASE, PROPCASE.
Q4: What is the R equivalent of the SAS DATA step?
A: mutate() in dplyr.
Q5: Scenario: Age is negative. What do you do?
A: Validate against the source, then apply ABS() or set the value to missing.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about LOVE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics