456.Decoding Love Marriage Trends Worldwide Using SAS String Mastery and Data Cleaning Intelligence

Love, Data, and Chaos: Mastering Advanced String Manipulation in SAS Through a Global Love Marriage Dataset

1. Introduction – When Data Falls in Love… and Breaks Rules

Imagine you’re working as a SAS programmer in a global analytics firm. You’ve just received a dataset titled “Most Love Marriages in the World” from multiple countries. Sounds interesting, right? But when you open it…

  • Ages are negative
  • Dates are inconsistent (2023-02-30, 12/15/2022, NULL)
  • Names are messy (john, JOHN, John )
  • Duplicate couples appear multiple times
  • Some countries are blank, some say "null"

This isn’t just messy; it’s dangerous.

In clinical trials or business analytics, dirty data = wrong decisions. Imagine deriving patient survival rates or financial forecasts from such flawed inputs. The consequences can be severe.

This is where SAS and R become your surgical tools. SAS offers structured, regulatory-compliant processing (especially in SDTM/ADaM), while R provides flexible, expressive transformations.

In this blog, we’ll:

  • Create a messy love marriage dataset
  • Inject intentional errors
  • Clean it using advanced SAS string manipulation techniques
  • Replicate the same in R
  • Explain business logic + interview-ready insights

2. Raw Data Creation in SAS and R

SAS Code – Raw Dataset (With Intentional Errors)

DATA love_raw;
INPUT ID Name $ Country $ Age Marriage_Date :$12. Partner_Name $
      Status $ Salary Marriage_Type $;
DATALINES;
1 john india 25 2023-01-10 anna married 50000 love
2 JOHN INDIA -30 2023-02-30 ANNA married 50000 love
3 Ravi null 28 15/03/2022 Sita married . arranged
4 Priya USA 27 NULL Raj married 60000 love
5 Amit india 29 2022-12-01 Neha married 55000 LOVE
6 john india 25 2023-01-10 anna married 50000 love
7 Sara UK . 2021-05-20 Mike married 70000 love
8 NULL Canada 32 2020-11-15 John married -80000 love
9 David usa 35 2022/07/10 Lisa married 65000 love
10 Meena INDIA 26 2023-06-18 NULL married 48000 love
;
RUN;

PROC PRINT DATA=love_raw;
RUN;

OUTPUT:

Obs ID Name  Country Age Marriage_Date Partner_Name Status  Salary Marriage_Type
1   1  john  india   25  2023-01-10    anna         married 50000  love
2   2  JOHN  INDIA   -30 2023-02-30    ANNA         married 50000  love
3   3  Ravi  null    28  15/03/2022    Sita         married .      arranged
4   4  Priya USA     27  NULL          Raj          married 60000  love
5   5  Amit  india   29  2022-12-01    Neha         married 55000  LOVE
6   6  john  india   25  2023-01-10    anna         married 50000  love
7   7  Sara  UK      .   2021-05-20    Mike         married 70000  love
8   8  NULL  Canada  32  2020-11-15    John         married -80000 love
9   9  David usa     35  2022/07/10    Lisa         married 65000  love
10  10 Meena INDIA   26  2023-06-18    NULL         married 48000  love

Explanation

This dataset intentionally simulates real-world inconsistencies. We see duplicate rows (ID 1 & 6), invalid age (-30), missing values (.), and inconsistent date formats (YYYY-MM-DD, DD/MM/YYYY, NULL). String inconsistencies include "india", "INDIA", "null", and "NULL". Salary includes invalid negative values. Marriage type varies between "love" and "LOVE". These inconsistencies are common in multi-source datasets (e.g., clinical CRFs or global surveys). The goal is to apply SAS transformations like UPCASE, COMPRESS, COALESCEC, and date informats to standardize the dataset before analysis.

Key Points

  • Raw data reflects real-world ingestion issues
  • String inconsistencies are the biggest hidden problem
  • Duplicate records distort analytics

R Code – Equivalent Raw Dataset

love_raw <- data.frame(
  ID = c(1,2,3,4,5,6,7,8,9,10),
  Name = c("john","JOHN","Ravi","Priya","Amit","john","Sara",
           "NULL","David","Meena"),
  Country = c("india","INDIA","null","USA","india","india",
              "UK","Canada","usa","INDIA"),
  Age = c(25,-30,28,27,29,25,NA,32,35,26),
  Marriage_Date = c("2023-01-10","2023-02-30","15/03/2022",
                    "NULL","2022-12-01","2023-01-10","2021-05-20",
                    "2020-11-15","2022/07/10","2023-06-18"),
  Partner_Name = c("anna","ANNA","Sita","Raj","Neha","anna",
                   "Mike","John","Lisa",NA),
  Status = rep("married",10),
  Salary = c(50000,50000,NA,60000,55000,50000,70000,-80000,
             65000,48000),
  Marriage_Type = c("love","love","arranged","love","LOVE","love",
                    "love","love","love","love")
)

OUTPUT:

ID Name  Country Age Marriage_Date Partner_Name Status  Salary Marriage_Type
1  john  india   25  2023-01-10    anna         married 50000  love
2  JOHN  INDIA   -30 2023-02-30    ANNA         married 50000  love
3  Ravi  null    28  15/03/2022    Sita         married NA     arranged
4  Priya USA     27  NULL          Raj          married 60000  love
5  Amit  india   29  2022-12-01    Neha         married 55000  LOVE
6  john  india   25  2023-01-10    anna         married 50000  love
7  Sara  UK      NA  2021-05-20    Mike         married 70000  love
8  NULL  Canada  32  2020-11-15    John         married -80000 love
9  David usa     35  2022/07/10    Lisa         married 65000  love
10 Meena INDIA   26  2023-06-18    NA           married 48000  love

Explanation

R replicates the same structure using data.frame(). NA represents missing values, where SAS uses a single period (.). This dataset is equally messy and highlights the importance of cross-platform cleaning strategies. R is particularly useful for exploratory cleaning, while SAS is preferred in regulated environments.

Key Points

  • NA vs . difference between R and SAS
  • Same data issues replicated for consistency
  • Useful for cross-validation
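The same raw table can also be mirrored in Python with pandas, purely as a cross-check (a third-language sketch, not part of this article's SAS/R workflow; a trimmed set of columns is used for brevity):

```python
import pandas as pd
import numpy as np

# Trimmed-down copy of the messy table; np.nan plays the role of
# SAS's "." and R's NA
love_raw = pd.DataFrame({
    "ID": range(1, 11),
    "Name": ["john", "JOHN", "Ravi", "Priya", "Amit",
             "john", "Sara", "NULL", "David", "Meena"],
    "Age": [25, -30, 28, 27, 29, 25, np.nan, 32, 35, 26],
    "Salary": [50000, 50000, np.nan, 60000, 55000,
               50000, 70000, -80000, 65000, 48000],
})

# Quick profile of the injected errors
print((love_raw["Age"] < 0).sum())                        # one negative age
print(love_raw.duplicated(subset=["Name", "Age"]).sum())  # one repeated person
```

Profiling the damage first, in any language, tells you which cleaning rules the SAS step actually needs.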

3. Phase 1: Data Cleaning in SAS

DATA love_clean;
SET love_raw;

/* Standardize text */
Name = PROPCASE(STRIP(Name));
Country = UPCASE(COALESCEC(Country, "UNKNOWN"));
Marriage_Type = UPCASE(Marriage_Type);

/* Handle NULL values (compare in upper case, because PROPCASE
   has already turned "NULL" into "Null") */
IF UPCASE(Name) = "NULL" THEN Name = "UNKNOWN";
IF Country = "NULL" THEN Country = "UNKNOWN";

/* Fix Age */
IF Age < 0 THEN Age = ABS(Age);

/* Fix Salary */
IF Salary < 0 THEN Salary = ABS(Salary);

/* Date conversion: give ANYDTDTE an explicit width so that
   10-character dates are not truncated */
Marriage_dt = INPUT(Marriage_Date, ANYDTDTE12.);
FORMAT Marriage_dt DATE9.;

/* Handle missing partner */
IF Partner_Name = "NULL" THEN Partner_Name = "UNKNOWN";

RUN;

PROC PRINT DATA=love_clean;
RUN;

OUTPUT:

Obs ID Name    Country Age Marriage_Date Partner_Name Status  Salary Marriage_Type Marriage_dt
1   1  John    INDIA   25  2023-01-10    anna         married 50000  LOVE          10JAN2023
2   2  John    INDIA   30  2023-02-30    ANNA         married 50000  LOVE          .
3   3  Ravi    UNKNOWN 28  15/03/2022    Sita         married .      ARRANGED      15MAR2022
4   4  Priya   USA     27  NULL          Raj          married 60000  LOVE          .
5   5  Amit    INDIA   29  2022-12-01    Neha         married 55000  LOVE          01DEC2022
6   6  John    INDIA   25  2023-01-10    anna         married 50000  LOVE          10JAN2023
7   7  Sara    UK      .   2021-05-20    Mike         married 70000  LOVE          20MAY2021
8   8  UNKNOWN CANADA  32  2020-11-15    John         married 80000  LOVE          15NOV2020
9   9  David   USA     35  2022/07/10    Lisa         married 65000  LOVE          10JUL2022
10  10 Meena   INDIA   26  2023-06-18    UNKNOWN      married 48000  LOVE          18JUN2023

Note that observation 2 keeps a missing Marriage_dt: 2023-02-30 is not a real calendar date, so ANYDTDTE12. correctly returns missing rather than inventing a value.

/* Remove duplicates */

PROC SORT DATA=love_clean NODUPKEY;
BY ID Name Marriage_dt;
RUN;

PROC PRINT DATA=love_clean;
RUN;

OUTPUT: all 10 observations remain, because ID is unique on every row (see the caveat in the Explanation below).

Explanation

This phase demonstrates advanced string manipulation and validation logic. PROPCASE() standardizes names, UPCASE() enforces a uniform case, and COALESCEC() fills missing character values. STRIP() removes leading/trailing spaces, which is critical in matching operations. Because PROPCASE() runs first, the NULL check compares UPCASE(Name), so the literal "NULL" (now "Null") is still caught. ANYDTDTE is powerful for parsing mixed date formats, but it needs an explicit width such as ANYDTDTE12.; the bare ANYDTDTE. informat reads only 9 characters and silently truncates 10-character dates. Impossible dates such as 2023-02-30 become missing rather than guessed. Negative values are corrected using ABS(). One caveat on the duplicate step: ID is unique on every row, so with ID in the BY statement NODUPKEY removes nothing; drop ID from the key (for example BY Name Partner_Name Marriage_dt) to collapse true duplicates such as IDs 1 and 6. These techniques are essential in ADaM dataset derivation, where traceability and consistency are mandatory.

Key Points

  • PROPCASE, UPCASE, STRIP are core string tools
  • ANYDTDTE12. handles messy dates (an explicit width avoids truncation)
  • PROC SORT NODUPKEY deduplicates, but only on the BY keys you choose
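For teams that cross-validate in Python (best practice #12 later in this post), the Phase 1 rules translate almost mechanically to pandas. This is an illustrative sketch on a three-row toy frame, not a drop-in replacement for the SAS step:

```python
import pandas as pd

# Toy frame reproducing the three text problems from the raw data
df = pd.DataFrame({
    "Name": [" john ", "NULL", "Ravi"],
    "Country": ["india", None, "null"],
    "Age": [25, -30, 28],
})

# PROPCASE + STRIP  ->  str.strip().str.title()
df["Name"] = df["Name"].str.strip().str.title()
df.loc[df["Name"].str.upper() == "NULL", "Name"] = "UNKNOWN"

# UPCASE + COALESCEC  ->  fillna() then str.upper()
df["Country"] = df["Country"].fillna("UNKNOWN").str.upper()
df.loc[df["Country"] == "NULL", "Country"] = "UNKNOWN"

# Negative ages -> absolute value, mirroring ABS()
df["Age"] = df["Age"].abs()
```

As in the SAS step, the "NULL" test is done case-insensitively after title-casing, otherwise "Null" would slip through.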

4. Phase 2: Data Cleaning in R

library(dplyr)

love_clean <- love_raw %>%
  mutate(
    Name = tools::toTitleCase(trimws(Name)),
    Country = toupper(trimws(ifelse(Country == "null" |
                                    Country == "", "UNKNOWN", Country))),
    Marriage_Type = toupper(Marriage_Type),
    Age = ifelse(Age < 0, abs(Age), Age),
    Salary = ifelse(Salary < 0, abs(Salary), Salary),
    # as.Date() infers ONE format from the first non-NA value,
    # so rows written in other formats become NA (see Explanation)
    Marriage_Date = as.Date(Marriage_Date,
                            tryFormats = c("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"))
  ) %>%
  distinct(ID, Name, Marriage_Date, .keep_all = TRUE)

OUTPUT:

ID Name  Country Age Marriage_Date Partner_Name Status  Salary Marriage_Type
1  John  INDIA   25  2023-01-10    anna         married 50000  LOVE
2  JOHN  INDIA   30  NA            ANNA         married 50000  LOVE
3  Ravi  UNKNOWN 28  NA            Sita         married NA     ARRANGED
4  Priya USA     27  NA            Raj          married 60000  LOVE
5  Amit  INDIA   29  2022-12-01    Neha         married 55000  LOVE
6  John  INDIA   25  2023-01-10    anna         married 50000  LOVE
7  Sara  UK      NA  2021-05-20    Mike         married 70000  LOVE
8  NULL  CANADA  32  2020-11-15    John         married 80000  LOVE
9  David USA     35  NA            Lisa         married 65000  LOVE
10 Meena INDIA   26  2023-06-18    NA           married 48000  LOVE

Explanation

R uses mutate() for transformations and distinct() for deduplication. trimws() removes spaces, while toupper() standardizes case; note that tools::toTitleCase() leaves all-caps words such as "JOHN" untouched, so wrap the value in tolower() first if those should become "John" too. Conditional logic with ifelse() handles invalid values. Two caveats separate this from the SAS version. First, as.Date() does not truly mimic ANYDTDTE.: it infers a single format from the first non-NA value and applies it to the whole vector, which is why "15/03/2022" and "2022/07/10" come back as NA above; genuinely mixed formats need element-wise parsing (for example lubridate::parse_date_time()). Second, because ID is part of the distinct() key, the duplicate rows 1 and 6 both survive; drop ID from the key to collapse them. Within those limits, this approach is efficient for exploratory data analysis and rapid prototyping.

Key Points

  • mutate() = SAS DATA step equivalent
  • distinct() removes duplicates (mind which columns form the key)
  • as.Date() assumes one format; mixed formats need element-wise parsing
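The ANYDTDTE-style "try several formats per value" behaviour that base R lacks is easy to sketch in plain Python. `parse_any` below is a hypothetical helper, and the format list is an assumption matched to this dataset; it is shown only to make the parsing logic concrete:

```python
from datetime import datetime

def parse_any(value, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")):
    """Try each known format in turn, per value, like SAS's ANYDTDTE informat."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # impossible dates (2023-02-30) and "NULL" fall through

print(parse_any("15/03/2022"))   # 2022-03-15 00:00:00
print(parse_any("2022/07/10"))   # 2022-07-10 00:00:00
print(parse_any("2023-02-30"))   # None
```

Parsing each value independently is what makes this robust to the mixed formats that defeat a single-format as.Date() call.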

5. Phase 3: Advanced SAS String Techniques

DATA love_adv;
SET love_clean;

/* Extract first letter */
Initial = SUBSTR(Name,1,1);

/* Remove vowels from name */
Name_NoVowel = COMPRESS(Name, 'aeiouAEIOU');

/* First word of country */
Country_Part = SCAN(Country,1,' ');

/* Concatenate couple names */
Couple = CATX('_', Name, Partner_Name);

/* Replace text (runs after the SCAN above, so Country_Part
   still holds the pre-replacement value) */
Country = TRANWRD(Country, "INDIA", "BHARAT");

RUN;

PROC PRINT DATA=love_adv;
RUN;

OUTPUT (unchanged input columns omitted for readability):

Obs ID Name    Country Marriage_dt Initial Name_NoVowel Country_Part Couple
1   1  John    BHARAT  10JAN2023   J       Jhn          INDIA        John_anna
2   2  John    BHARAT  .           J       Jhn          INDIA        John_ANNA
3   3  Ravi    UNKNOWN 15MAR2022   R       Rv           UNKNOWN      Ravi_Sita
4   4  Priya   USA     .           P       Pry          USA          Priya_Raj
5   5  Amit    BHARAT  01DEC2022   A       mt           INDIA        Amit_Neha
6   6  John    BHARAT  10JAN2023   J       Jhn          INDIA        John_anna
7   7  Sara    UK      20MAY2021   S       Sr           UK           Sara_Mike
8   8  UNKNOWN CANADA  15NOV2020   U       NKNWN        CANADA       UNKNOWN_John
9   9  David   USA     10JUL2022   D       Dvd          USA          David_Lisa
10  10 Meena   BHARAT  18JUN2023   M       Mn           INDIA        Meena_UNKNOWN

Note that COMPRESS removes the capital A from "Amit" too (leaving "mt"), because both cases are listed in the characters to remove.

Explanation

This phase focuses on advanced string manipulation. SUBSTR() extracts substrings, COMPRESS() removes characters, SCAN() tokenizes strings, and CATX() concatenates efficiently. TRANWRD() replaces text dynamically. These are heavily used in SDTM variable derivations, especially when parsing free-text fields.

Key Points

  • String parsing is critical in real datasets
  • CATX avoids manual concatenation issues
  • TRANWRD useful for standardization
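For comparison, the five SAS string functions used above have close one-line counterparts in Python (an illustrative mapping, not part of the SAS program):

```python
name, partner, country = "Meena", "UNKNOWN", "INDIA"

initial = name[0]                                # SUBSTR(Name,1,1)
no_vowel = name.translate(                       # COMPRESS(Name,'aeiouAEIOU')
    str.maketrans("", "", "aeiouAEIOU"))
first_word = country.split(" ")[0]               # SCAN(Country,1,' ')
couple = "_".join([name, partner])               # CATX('_', Name, Partner_Name)
renamed = country.replace("INDIA", "BHARAT")     # TRANWRD(Country,"INDIA","BHARAT")

print(initial, no_vowel, first_word, couple, renamed)
```

One behavioural difference worth remembering: CATX skips missing arguments entirely, while str.join does not, so missing values need filtering first in Python.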

6. 20 Data Cleaning Best Practices

  1. Always validate against source (CRF/raw)
  2. Maintain audit trails
  3. Use controlled terminology (CDISC)
  4. Avoid hardcoding values
  5. Standardize formats early
  6. Validate ranges (age, lab values)
  7. Use metadata-driven programming
  8. Document all transformations
  9. Handle missing systematically
  10. Perform duplicate checks
  11. Use validation macros
  12. Cross-check with R/Python
  13. Maintain version control
  14. Log all warnings/errors
  15. Use PROC COMPARE for QC
  16. Align with SAP requirements
  17. Ensure traceability (ADaM)
  18. Avoid overwriting raw data
  19. Validate date sequences
  20. Perform independent QC
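Several of these practices (range checks #6, missing/duplicate checks #9–10, logging #14) can be bundled into a small issue log. The sketch below uses pandas and invented column names purely for illustration; a production version would be metadata-driven:

```python
import pandas as pd
import numpy as np

# Toy frame with the three classic problems: negative, invalid, missing
df = pd.DataFrame({"Age": [25, -30, np.nan],
                   "Salary": [50000, -80000, 60000]})

# Collect every failed check into a simple, reviewable issue log
issues = []
if (df["Age"].dropna() < 0).any():
    issues.append("negative Age found")
if (df["Salary"] < 0).any():
    issues.append("negative Salary found")
if df["Age"].isna().any():
    issues.append("missing Age found")

print(issues)
```

Logging every finding, instead of silently fixing values, is what preserves the audit trail that practices #2 and #8 demand.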

7. Business Logic Behind Data Cleaning

Why do we clean data?

  • Missing values → replaced to avoid bias
  • Negative age → logically impossible → corrected
  • Invalid salary → affects financial models
  • Date inconsistencies → break time-based analysis

Example:
If patient age is -30 → survival analysis becomes invalid.
If marriage date is wrong → trend analysis fails.

Cleaning ensures accurate decision-making and regulatory compliance.
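A back-of-the-envelope example of why this matters: a single bad salary entry is enough to wreck a mean. The numbers below are invented for illustration:

```python
# One invalid negative entry hiding among otherwise plausible salaries
salaries_dirty = [50000, 50000, 60000, 55000, -80000]
salaries_clean = [s for s in salaries_dirty if s > 0]

mean_dirty = sum(salaries_dirty) / len(salaries_dirty)
mean_clean = sum(salaries_clean) / len(salaries_clean)

print(mean_dirty)  # 27000.0 -- badly distorted by one bad row
print(mean_clean)  # 53750.0
```

One row out of five cut the reported average roughly in half; in a financial model or a clinical summary table, that error would propagate into every downstream figure.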

8. 20 Key Insights

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Dates drive timelines
  • Missing ≠ zero
  • Duplicates inflate metrics
  • Validation is non-negotiable
  • String cleaning is underrated
  • Audit trails are mandatory
  • SAS ensures compliance
  • R ensures flexibility
  • Business logic drives cleaning
  • Data lineage matters
  • QC saves projects
  • Automation reduces errors
  • Metadata is powerful
  • Consistency is key
  • Documentation is survival
  • Clean data = trust
  • Errors compound quickly
  • Prevention is better than cleaning

9. SAS vs R Comparison

Feature          | SAS      | R
Regulatory Use   | Strong   | Moderate
Flexibility      | Moderate | High
String Functions | Robust   | Extensive
Learning Curve   | Moderate | Steep
Industry Use     | Pharma   | Data Science

10. Summary

We transformed a chaotic love marriage dataset into a clean, analysis-ready structure using SAS and R. SAS provided structured, compliance-ready transformations, while R offered flexible manipulation.

11. Conclusion

Data cleaning is not just a step; it is the foundation of analytics. Whether you’re working on love marriage datasets or clinical trials, mastering string manipulation and validation logic is essential. SAS gives you control, traceability, and compliance: the qualities that define a professional data programmer.

12. Interview Questions

Q1: How do you handle mixed date formats in SAS?
A: Use the ANYDTDTE informat with an explicit width, e.g. INPUT(datestr, ANYDTDTE12.), so long dates are not truncated.

Q2: How do you remove duplicates?
A: PROC SORT NODUPKEY.

Q3: How do you standardize text?
A: UPCASE, LOWCASE, PROPCASE.

Q4: R equivalent of SAS DATA step?
A: mutate() in dplyr.

Q5: Scenario: Age is negative—what do you do?
A: Validate source → apply ABS() or set to missing.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about LOVE DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and data projects

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can we analyze planets using SAS like NASA scientists?

2.Can SAS Compare How 3 Different Families Live in Their Homes?

3.Can SAS Help Us Choose the Most Efficient Cooler?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
