From Billion-Dollar Marriages to Clean Datasets: The Art of SAS Standardization
Billion-Dollar Weddings & Broken Data: Mastering Data Standardization in SAS with Real-World Chaos
1. Introduction – When Glamour Meets Garbage Data
Imagine you're working as a SAS programmer at a global analytics firm. Your manager assigns you a dataset titled "Costly Marriages in the World": high-profile weddings involving celebrities, business tycoons, and royal families. The goal? Generate insights on spending patterns, country-wise trends, and luxury benchmarks.
You open the dataset expecting elegance… but what you get is chaos.
- Marriage costs entered as "5M", "5000000", and "5 million"
- Dates like "2023-02-30" (which doesn't even exist)
- Country names: "india", "IND", "India "
- Duplicate billionaire weddings (because apparently one wasn't enough)
- Negative guest counts (time travelers?)
This is not just messy; it's dangerous.
In real-world analytics, especially in clinical trials (SDTM/ADaM) or financial reporting, bad data leads to:
- Incorrect conclusions
- Regulatory rejection
- Business loss
This is where data standardization becomes mission-critical. Tools like SAS and R are industry powerhouses for cleaning and transforming such messy datasets into structured, reliable formats.
Let's walk through a step-by-step, real-world implementation using both SAS and R.
2. Raw Data Creation in SAS and R (With Intentional Errors)
SAS Raw Dataset
DATA costly_marriages_raw;
INFILE DATALINES DSD DLM=',' TRUNCOVER;
INPUT ID Name:$20. Country:$15. Cost:$15. Guests Age Marriage_Date:$15.
Status:$10. Sponsor:$20.;
DATALINES;
1,RoyalWedding,india,5000000,2000,30,2023-02-28,Done,KingCorp
2,EliteUnion,USA,5M,1500,-25,2023-02-30,done,QueenLtd
3,LuxuryTie,UK,NULL,1800,40,2022/12/15,Completed,NULL
4,GrandAffair,India,7000000,,35,15-01-2023,Done,KingCorp
5,RoyalWedding,india,5000000,2000,30,2023-02-28,Done,KingCorp
6,DiamondBond,FRANCE,-8000000,2500,45,2023-03-10,Done,EliteGroup
7,GoldenCeremony,USA,9000000,3000,50,2023-04-12,,GoldInc
8,PlatinumUnion,UK,10000000,3500,60,2023-05-20,Done,
9,SilverKnot,IND,8500000,2800,55,2023-06-15,done,SilverCorp
10,EliteUnion,USA,5M,1500,-25,2023-02-30,done,QueenLtd
;
RUN;
PROC PRINT DATA = costly_marriages_raw;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Cost | Guests | Age | Marriage_Date | Status | Sponsor |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | RoyalWedding | india | 5000000 | 2000 | 30 | 2023-02-28 | Done | KingCorp |
| 2 | 2 | EliteUnion | USA | 5M | 1500 | -25 | 2023-02-30 | done | QueenLtd |
| 3 | 3 | LuxuryTie | UK | NULL | 1800 | 40 | 2022/12/15 | Completed | NULL |
| 4 | 4 | GrandAffair | India | 7000000 | . | 35 | 15-01-2023 | Done | KingCorp |
| 5 | 5 | RoyalWedding | india | 5000000 | 2000 | 30 | 2023-02-28 | Done | KingCorp |
| 6 | 6 | DiamondBond | FRANCE | -8000000 | 2500 | 45 | 2023-03-10 | Done | EliteGroup |
| 7 | 7 | GoldenCeremony | USA | 9000000 | 3000 | 50 | 2023-04-12 | | GoldInc |
| 8 | 8 | PlatinumUnion | UK | 10000000 | 3500 | 60 | 2023-05-20 | Done | |
| 9 | 9 | SilverKnot | IND | 8500000 | 2800 | 55 | 2023-06-15 | done | SilverCorp |
| 10 | 10 | EliteUnion | USA | 5M | 1500 | -25 | 2023-02-30 | done | QueenLtd |
R Code – Equivalent Raw Dataset
costly_marriages_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Name = c("RoyalWedding","EliteUnion","LuxuryTie","GrandAffair",
"RoyalWedding","DiamondBond","GoldenCeremony","PlatinumUnion",
"SilverKnot","EliteUnion"),
Country = c("india","USA","UK","India","india","FRANCE","USA",
"UK","IND","USA"),
Cost = c("5000000","5M","NULL","7000000","5000000","-8000000",
"9000000","10000000","8500000","5M"),
Guests = c(2000,1500,1800,NA,2000,2500,3000,3500,2800,1500),
Age = c(30,-25,40,35,30,45,50,60,55,-25),
Marriage_Date = c("2023-02-28","2023-02-30","2022/12/15",
"15-01-2023","2023-02-28","2023-03-10",
"2023-04-12","2023-05-20","2023-06-15",
"2023-02-30"),
Status = c("Done","done","Completed","Done","Done","Done","",
"Done","done","done"),
Sponsor = c("KingCorp","QueenLtd",NA,"KingCorp","KingCorp",
"EliteGroup","GoldInc",NA,"SilverCorp","QueenLtd")
)
OUTPUT:
| | ID | Name | Country | Cost | Guests | Age | Marriage_Date | Status | Sponsor |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | RoyalWedding | india | 5000000 | 2000 | 30 | 2023-02-28 | Done | KingCorp |
| 2 | 2 | EliteUnion | USA | 5M | 1500 | -25 | 2023-02-30 | done | QueenLtd |
| 3 | 3 | LuxuryTie | UK | NULL | 1800 | 40 | 2022/12/15 | Completed | NA |
| 4 | 4 | GrandAffair | India | 7000000 | NA | 35 | 15-01-2023 | Done | KingCorp |
| 5 | 5 | RoyalWedding | india | 5000000 | 2000 | 30 | 2023-02-28 | Done | KingCorp |
| 6 | 6 | DiamondBond | FRANCE | -8000000 | 2500 | 45 | 2023-03-10 | Done | EliteGroup |
| 7 | 7 | GoldenCeremony | USA | 9000000 | 3000 | 50 | 2023-04-12 | | GoldInc |
| 8 | 8 | PlatinumUnion | UK | 10000000 | 3500 | 60 | 2023-05-20 | Done | NA |
| 9 | 9 | SilverKnot | IND | 8500000 | 2800 | 55 | 2023-06-15 | done | SilverCorp |
| 10 | 10 | EliteUnion | USA | 5M | 1500 | -25 | 2023-02-30 | done | QueenLtd |
Key Issues Identified
- Missing values (NULL, blank)
- Invalid values (negative Age, negative Cost)
- Wrong dates
- Duplicates
- Inconsistent text formatting
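Before reaching for SAS or R, the same issue inventory can be automated as a quick profiling pass. A minimal Python sketch (the field names mirror the dataset above; the specific rules are illustrative assumptions, not part of the original spec):

```python
from datetime import date

def profile_row(row):
    """Return a list of data-quality flags for one raw record."""
    flags = []
    # Cost should be a plain non-negative integer string ("5M" and "-8000000" fail)
    if not row["Cost"].isdigit():
        flags.append("non-numeric or negative cost")
    # Ages must be positive
    if row["Age"] is not None and row["Age"] < 0:
        flags.append("negative age")
    # Dates must be ISO-formatted and actually exist on the calendar
    try:
        y, m, d = map(int, row["Marriage_Date"].split("-"))
        date(y, m, d)
    except ValueError:
        flags.append("invalid or non-ISO date")
    return flags

row = {"Cost": "5M", "Age": -25, "Marriage_Date": "2023-02-30"}
print(profile_row(row))
# → ['non-numeric or negative cost', 'negative age', 'invalid or non-ISO date']
```

Running a check like this over every row turns the informal issue list above into a reproducible data-quality report.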
3. Phase 1: Data Cleaning in SAS
Step-by-Step Cleaning
DATA costly_marriages_clean;
   SET costly_marriages_raw;

   /* Handle missing values */
   IF Sponsor = "" OR Sponsor = "NULL" THEN Sponsor = "UNKNOWN";

   /* Fix negative ages */
   IF Age < 0 THEN Age = ABS(Age);

   /* Convert Cost to numeric */
   IF Cost = "5M" THEN Cost_Num = 5000000;
   ELSE IF Cost = "NULL" THEN Cost_Num = .;
   ELSE Cost_Num = INPUT(Cost, BEST12.);
   IF Cost_Num < 0 THEN Cost_Num = ABS(Cost_Num);

   /* Standardize country and status text */
   Country = UPCASE(TRIM(Country));
   Status  = UPCASE(Status);

   /* Parse mixed date formats.
      Note: ANYDTDTE. defaults to a width of 9, which silently
      truncates 10-character dates like 2023-02-28, so specify
      the width explicitly. */
   Marriage_dt = INPUT(Marriage_Date, ANYDTDTE10.);
   FORMAT Marriage_dt DATE9.;
RUN;
PROC PRINT DATA = costly_marriages_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Cost | Guests | Age | Marriage_Date | Status | Sponsor | Cost_Num | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | RoyalWedding | INDIA | 5000000 | 2000 | 30 | 2023-02-28 | DONE | KingCorp | 5000000 | 28FEB2023 |
| 2 | 2 | EliteUnion | USA | 5M | 1500 | 25 | 2023-02-30 | DONE | QueenLtd | 5000000 | . |
| 3 | 3 | LuxuryTie | UK | NULL | 1800 | 40 | 2022/12/15 | COMPLETED | UNKNOWN | . | 15DEC2022 |
| 4 | 4 | GrandAffair | INDIA | 7000000 | . | 35 | 15-01-2023 | DONE | KingCorp | 7000000 | 15JAN2023 |
| 5 | 5 | RoyalWedding | INDIA | 5000000 | 2000 | 30 | 2023-02-28 | DONE | KingCorp | 5000000 | 28FEB2023 |
| 6 | 6 | DiamondBond | FRANCE | -8000000 | 2500 | 45 | 2023-03-10 | DONE | EliteGroup | 8000000 | 10MAR2023 |
| 7 | 7 | GoldenCeremony | USA | 9000000 | 3000 | 50 | 2023-04-12 | | GoldInc | 9000000 | 12APR2023 |
| 8 | 8 | PlatinumUnion | UK | 10000000 | 3500 | 60 | 2023-05-20 | DONE | UNKNOWN | 10000000 | 20MAY2023 |
| 9 | 9 | SilverKnot | IND | 8500000 | 2800 | 55 | 2023-06-15 | DONE | SilverCorp | 8500000 | 15JUN2023 |
| 10 | 10 | EliteUnion | USA | 5M | 1500 | 25 | 2023-02-30 | DONE | QueenLtd | 5000000 | . |
/* Remove Duplicates */
/* ID is unique on every row, so keying on ID would remove nothing.
   De-duplicate on the business fields that actually repeat. */
PROC SORT DATA=costly_marriages_clean NODUPKEY;
BY Name Country Marriage_dt;
RUN;
PROC PRINT DATA = costly_marriages_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Cost | Guests | Age | Marriage_Date | Status | Sponsor | Cost_Num | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6 | DiamondBond | FRANCE | -8000000 | 2500 | 45 | 2023-03-10 | DONE | EliteGroup | 8000000 | 10MAR2023 |
| 2 | 2 | EliteUnion | USA | 5M | 1500 | 25 | 2023-02-30 | DONE | QueenLtd | 5000000 | . |
| 3 | 7 | GoldenCeremony | USA | 9000000 | 3000 | 50 | 2023-04-12 | | GoldInc | 9000000 | 12APR2023 |
| 4 | 4 | GrandAffair | INDIA | 7000000 | . | 35 | 15-01-2023 | DONE | KingCorp | 7000000 | 15JAN2023 |
| 5 | 3 | LuxuryTie | UK | NULL | 1800 | 40 | 2022/12/15 | COMPLETED | UNKNOWN | . | 15DEC2022 |
| 6 | 8 | PlatinumUnion | UK | 10000000 | 3500 | 60 | 2023-05-20 | DONE | UNKNOWN | 10000000 | 20MAY2023 |
| 7 | 1 | RoyalWedding | INDIA | 5000000 | 2000 | 30 | 2023-02-28 | DONE | KingCorp | 5000000 | 28FEB2023 |
| 8 | 9 | SilverKnot | IND | 8500000 | 2800 | 55 | 2023-06-15 | DONE | SilverCorp | 8500000 | 15JUN2023 |
Explanation
This SAS step systematically addresses all data inconsistencies. Missing values are replaced using conditional logic, ensuring no blank sponsor fields remain. Negative values are corrected using ABS(), which is critical for maintaining logical consistency (e.g., age cannot be negative). Cost values are standardized into numeric format, enabling further analysis. The UPCASE and TRIM functions enforce uniform text formatting, which is essential for grouping and reporting. The ANYDTDTE. informat intelligently parses mixed date formats, provided an explicit width is specified. Finally, PROC SORT NODUPKEY removes duplicate records based on the business-key fields, ensuring dataset uniqueness.
Key Points
- Always convert character numbers to numeric early
- Use the ANYDTDTEw. informat (with an explicit width, e.g. ANYDTDTE10.) for messy date formats
- Standardization prevents grouping errors
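The most hard-coded part of the cleaning step above is the Cost conversion, which only recognizes the literal string "5M". The rule generalizes to any suffix-based shorthand. A Python sketch of that generalization (the suffix table is an assumption for illustration, not part of the original data spec):

```python
def normalize_cost(raw):
    """Convert cost strings like '5M', '5000000', or 'NULL' to a number (or None)."""
    s = raw.strip().upper()
    if s in ("", "NULL", "NA"):
        return None
    # Suffix multipliers are an illustrative assumption
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    mult = 1
    if s and s[-1] in multipliers:
        mult = multipliers[s[-1]]
        s = s[:-1]
    try:
        # abs() mirrors the SAS negative-cost fix
        return abs(float(s) * mult)
    except ValueError:
        return None  # unparseable values (e.g. "5 million") fall through as missing

print(normalize_cost("5M"))        # → 5000000.0
print(normalize_cost("-8000000"))  # → 8000000.0
print(normalize_cost("NULL"))      # → None
```

The same lookup-table idea could be expressed in SAS with a format or a small macro, keeping the cleaning rule in one place instead of scattered IF/ELSE branches.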
4. Phase 2: Data Cleaning in R
library(dplyr)

clean_data <- costly_marriages_raw %>%
  mutate(
    Sponsor = ifelse(is.na(Sponsor) | Sponsor == "", "UNKNOWN", Sponsor),
    Age = abs(Age),
    Cost_Num = case_when(
      Cost == "5M"   ~ 5000000,
      Cost == "NULL" ~ NA_real_,
      TRUE           ~ suppressWarnings(as.numeric(Cost))
    ),
    Cost_Num = abs(Cost_Num),
    Country = toupper(trimws(Country)),
    Status  = toupper(Status),
    # as.Date(tryFormats =) picks ONE format based on the first element and
    # applies it to the whole vector, so mixed formats need a per-element
    # fallback. Try day-first before year-first so "15-01-2023" is not
    # misread as year 15.
    Marriage_dt = coalesce(
      as.Date(Marriage_Date, format = "%d-%m-%Y"),
      as.Date(Marriage_Date, format = "%Y-%m-%d"),
      as.Date(Marriage_Date, format = "%Y/%m/%d")
    )
  ) %>%
  # ID is unique on every row; de-duplicate on the fields that actually repeat
  distinct(Name, Country, Marriage_Date, .keep_all = TRUE)
OUTPUT:
| | ID | Name | Country | Cost | Guests | Age | Marriage_Date | Status | Sponsor | Cost_Num | Marriage_dt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | RoyalWedding | INDIA | 5000000 | 2000 | 30 | 2023-02-28 | DONE | KingCorp | 5e+06 | 2023-02-28 |
| 2 | 2 | EliteUnion | USA | 5M | 1500 | 25 | 2023-02-30 | DONE | QueenLtd | 5e+06 | NA |
| 3 | 3 | LuxuryTie | UK | NULL | 1800 | 40 | 2022/12/15 | COMPLETED | UNKNOWN | NA | 2022-12-15 |
| 4 | 4 | GrandAffair | INDIA | 7000000 | NA | 35 | 15-01-2023 | DONE | KingCorp | 7e+06 | 2023-01-15 |
| 5 | 6 | DiamondBond | FRANCE | -8000000 | 2500 | 45 | 2023-03-10 | DONE | EliteGroup | 8e+06 | 2023-03-10 |
| 6 | 7 | GoldenCeremony | USA | 9000000 | 3000 | 50 | 2023-04-12 | | GoldInc | 9e+06 | 2023-04-12 |
| 7 | 8 | PlatinumUnion | UK | 10000000 | 3500 | 60 | 2023-05-20 | DONE | UNKNOWN | 1e+07 | 2023-05-20 |
| 8 | 9 | SilverKnot | IND | 8500000 | 2800 | 55 | 2023-06-15 | DONE | SilverCorp | 8.5e+06 | 2023-06-15 |
Explanation
R leverages the dplyr package to create a streamlined pipeline for data cleaning. Missing values are handled using ifelse() and is.na(), ensuring completeness. The abs() function corrects negative values, while as.numeric() converts cost into a usable numeric format. Text standardization is achieved using toupper() and trimws(). Date parsing is handled flexibly by attempting several explicit formats per value, with invalid dates becoming NA. Finally, distinct() ensures duplicate removal. This pipeline-style transformation improves readability, reproducibility, and debugging efficiency.
Key Points
- mutate() enables chained transformations
- distinct() handles duplicates efficiently
- Always validate date parsing
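The mixed-date lesson generalizes beyond R: every value must be tried against each candidate format, with strict validation, rather than guessing one format for the whole column. A Python sketch of the same per-value fallback idea (the format list is the one used above):

```python
from datetime import datetime

# Day-first is tried before year-first so "15-01-2023" is not misread as year 15
FORMATS = ("%d-%m-%Y", "%Y-%m-%d", "%Y/%m/%d")

def parse_mixed_date(s):
    """Try each candidate format in turn; return an ISO string, or None if all fail."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            continue  # wrong format or impossible date (e.g. Feb 30)
    return None

print(parse_mixed_date("15-01-2023"))  # → 2023-01-15
print(parse_mixed_date("2023-02-30"))  # → None (Feb 30 does not exist)
```

Because strptime validates calendar dates strictly, impossible values like "2023-02-30" fall through to None instead of silently becoming a wrong date.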
5. Phase 3: Additional SAS Enhancements
PROC MEANS DATA=costly_marriages_clean;
VAR Cost_Num Guests Age;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Cost_Num | 7 | 7500000.00 | 1936491.67 | 5000000.00 | 10000000.00 |
| Guests | 7 | 2442.86 | 713.81 | 1500.00 | 3500.00 |
| Age | 8 | 42.5000000 | 12.2474487 | 25.0000000 | 60.0000000 |
PROC FREQ DATA=costly_marriages_clean;
TABLES Country Status;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| FRANCE | 1 | 12.50 | 1 | 12.50 |
| IND | 1 | 12.50 | 2 | 25.00 |
| INDIA | 2 | 25.00 | 4 | 50.00 |
| UK | 2 | 25.00 | 6 | 75.00 |
| USA | 2 | 25.00 | 8 | 100.00 |

| Status | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| COMPLETED | 1 | 14.29 | 1 | 14.29 |
| DONE | 6 | 85.71 | 7 | 100.00 |

Frequency Missing = 1
Explanation
After cleaning, validation and exploratory analysis are essential. PROC MEANS provides summary statistics, helping detect anomalies like unusually high costs or guest counts. PROC FREQ reveals distribution patterns in categorical variables such as Country and Status; here it also flags a residual problem, since IND and INDIA still appear as separate levels and need a controlled-terminology mapping. These steps act as a secondary validation layer, ensuring that the cleaning logic has not introduced inconsistencies. In clinical trials, similar procedures are used for SDTM validation checks before ADaM dataset derivation.
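The same post-cleaning checks can be expressed as explicit assertions in any language, which makes them scriptable in a QC pipeline. A minimal Python sketch of such a validation layer (column names follow the dataset above; the controlled country list is an illustrative assumption):

```python
def validate(records):
    """Post-cleaning sanity checks; returns a list of failed rules."""
    failures = []
    # Range check: ages must be positive after cleaning
    if any(r["Age"] <= 0 for r in records):
        failures.append("non-positive age")
    # Range check: costs must be non-negative (missing values are allowed)
    costs = [r["Cost_Num"] for r in records if r["Cost_Num"] is not None]
    if any(c < 0 for c in costs):
        failures.append("negative cost")
    # Frequency-style check: country codes must come from one controlled list
    allowed = {"INDIA", "USA", "UK", "FRANCE"}
    bad = {r["Country"] for r in records} - allowed
    if bad:
        failures.append(f"unmapped country codes: {sorted(bad)}")
    return failures

cleaned = [
    {"Age": 30, "Cost_Num": 5000000, "Country": "INDIA"},
    {"Age": 55, "Cost_Num": 8500000, "Country": "IND"},  # slips past UPCASE alone
]
print(validate(cleaned))
# → ["unmapped country codes: ['IND']"]
```

Note that the second record passes every numeric check yet still fails the controlled-terminology rule, which is exactly the IND/INDIA gap the PROC FREQ output exposes.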
6. 20 Additional Data Cleaning Best Practices
- Always validate raw data before transformation
- Maintain audit trails for regulatory compliance
- Never overwrite raw datasets
- Use metadata-driven programming (CDISC)
- Apply range checks (e.g., Age > 0)
- Standardize units (currency, weight)
- Use controlled terminology
- Implement duplicate checks
- Validate date sequences
- Document all transformations
- Perform cross-dataset validation
- Use macros for repeatability
- Log all warnings/errors
- Validate derived variables
- Perform QC independently
- Use PROC COMPARE for validation
- Avoid hardcoding values
- Normalize categorical variables
- Validate against the SAP
- Ensure reproducibility
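Several of these practices, such as duplicate checks on business keys rather than surrogate IDs, reduce to a few lines of code. A hedged Python sketch of that one practice (the key fields mirror the wedding dataset; any real project would take them from its data dictionary):

```python
from collections import Counter

def duplicate_keys(records, keys=("Name", "Country", "Marriage_Date")):
    """Return business-key combinations that occur more than once, with counts."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {k: n for k, n in counts.items() if n > 1}

rows = [
    {"Name": "RoyalWedding", "Country": "INDIA", "Marriage_Date": "2023-02-28"},
    {"Name": "EliteUnion",   "Country": "USA",   "Marriage_Date": "2023-02-30"},
    {"Name": "RoyalWedding", "Country": "INDIA", "Marriage_Date": "2023-02-28"},
]
print(duplicate_keys(rows))
# → {('RoyalWedding', 'INDIA', '2023-02-28'): 2}
```

Keying on business fields is what catches the RoyalWedding repeat; keying on a unique surrogate ID would report nothing.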
7. Business Logic Behind Data Cleaning
Data cleaning is not just technical; it's strategic.
- Missing values: replacing Sponsor with "UNKNOWN" avoids null-related errors in reporting
- Negative Age: correcting it ensures logical consistency
- Cost normalization: enables accurate aggregation and comparisons
- Date correction: critical for time-based analytics
In clinical trials:
- Wrong patient age → incorrect subgroup analysis
- Invalid dates → protocol deviations
8. 20 Key Insights (Sharp & Impactful)
- Dirty data leads to wrong conclusions
- Standardization ensures reproducibility
- Missing values hide insights
- Validation prevents regulatory rejection
- Clean data builds trust
- Dates drive timelines
- Duplicates distort metrics
- Text inconsistency breaks grouping
- Automation reduces human error
- QC is non-negotiable
- SAS excels in structured pipelines
- R excels in flexibility
- Documentation is critical
- Business logic drives cleaning
- Errors compound quickly
- Clean data accelerates decisions
- Data integrity is everything
- Consistency beats complexity
- Validation saves projects
- Clean data = credible analytics
9. SAS vs R – Comparison
| Feature | SAS | R |
|---|---|---|
| Stability | High | Moderate |
| Flexibility | Moderate | High |
| Regulatory Use | Strong (CDISC) | Limited |
| Debugging | Easier logs | Requires skill |
| Performance | Optimized | Depends |
10. Summary
The journey through the "Costly Marriages in the World" dataset highlights a fundamental truth in data analytics: even the most glamorous data can be deeply flawed beneath the surface. What initially appears to be high-value, insight-rich information quickly turns unreliable when plagued with inconsistencies such as missing values, invalid entries, duplicate records, and non-standard formats. This mirrors real-world scenarios in domains like clinical trials and financial analytics, where poor data quality can lead to misleading conclusions and regulatory risks.
Using SAS, we demonstrated a structured, industry-grade approach to data standardization, leveraging functions and informats like UPCASE, INPUT, and ANYDTDTE., and procedures such as PROC SORT, to systematically clean and validate data. SAS excels in controlled, reproducible environments, making it the preferred choice in regulated industries. In parallel, R showcased flexibility and efficiency through dplyr, enabling streamlined transformations using mutate(), case_when(), and distinct(). However, it also revealed subtle pitfalls like coercion warnings, reinforcing the need for careful function selection and validation.
Ultimately, data standardization is not just a technical step; it is a strategic necessity. Clean, consistent data ensures accurate analysis, supports confident decision-making, and maintains compliance with industry standards. Whether working with billion-dollar weddings or clinical datasets, the principles remain the same: validate early, standardize rigorously, and document thoroughly. Mastering these practices distinguishes a competent programmer from a truly reliable data professional.
11. Conclusion
Data standardization is not optional; it is foundational.
Whether you're analyzing billion-dollar weddings or clinical trial data, the principles remain the same:
- Clean early
- Validate continuously
- Document thoroughly
A structured cleaning framework transforms chaos into clarity, and that is what separates average analysts from elite SAS programmers.
12. Interview Questions
Q1: How do you handle inconsistent date formats in SAS?
A: Use the ANYDTDTEw. informat with an explicit width, e.g. INPUT(var, ANYDTDTE10.).
Q2: How do you remove duplicates?
A: PROC SORT DATA=ds NODUPKEY; BY key-variables; RUN;
Q3: What is the R equivalent of SAS missing-value handling?
A: is.na() combined with ifelse().
Q4: Debugging scenario: negative ages in a dataset?
A: Correct with ABS() and validate against the source data.
Q5: Why is standardization critical in SDTM?
A: It ensures regulatory compliance and reproducibility.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article, including all wedding data, are created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and data quality