455.Mastering Date & Time Handling in SAS for Global Tech Projects
From Chaos to Clarity:
Mastering Date & Time Handling in SAS for Global Tech Projects
1. Introduction
Imagine
you are working as a SAS programmer in a global technology company handling
data from large-scale infrastructure projects AI deployments, cloud migrations,
smart cities, and telecom expansions. The dataset arrives from multiple
countries: India, USA, Germany, and Japan. Everything looks fine until you
start analyzing it.
Suddenly,
you see:
- Project start dates like 2023-13-01
- Completion dates earlier
than start dates
- Missing timestamps
- Duplicate project IDs
- “NULL” written as text
instead of missing values
This is
not uncommon. Whether in clinical trials (SDTM/ADaM datasets) or
business analytics, dirty data silently destroys decision-making.
A wrong
date can:
- Delay drug approvals
- Miscalculate revenue
timelines
- Break regulatory submissions
This is
where tools like SAS and R become powerful. They don’t just analyze data
they discipline it.
In this blog, we will build a real-world messy dataset, intentionally inject errors, and then clean, standardize, and validate it using SAS and R, with a special focus on date and time handling.
2. Raw Data Creation in SAS and R
SAS Raw Dataset (With Intentional Errors)
DATA tech_projects_raw;
INPUT Project_ID Country $ Budget Start_Date :$15.
End_Date :$15. Manager $ Status $ Duration;
DATALINES;
1 india 500000 2023-01-10 2023-12-20 Ravi Completed 340
2 USA -100000 2023-02-30 2023-11-15 John Ongoing 250
3 germany 750000 NULL 2023-10-10 Anna Completed 200
4 INDIA 600000 2023/03/15 2023/09/10 NULL Completed 180
5 japan 800000 2023-05-01 2023-04-01 Ken Completed -50
6 usa 900000 2023-06-10 2023-12-30 Mike Ongoing 200
6 usa 900000 2023-06-10 2023-12-30 Mike Ongoing 200
7 NULL 450000 2023-07-15 2023-12-10 Sara Completed 150
8 india . 2023-08-01 2023-12-01 Ravi Completed 120
9 germany 700000 2023-09-10 wrongdate Anna Completed 90
10 USA 650000 2023-10-05 2023-15-10 John Ongoing 100
;
RUN;
PROC PRINT DATA=tech_projects_raw;
RUN;
OUTPUT:
| Obs | Project_ID | Country | Budget | Start_Date | End_Date | Manager | Status | Duration |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | india | 500000 | 2023-01-10 | 2023-12-20 | Ravi | Complete | 340 |
| 2 | 2 | USA | -100000 | 2023-02-30 | 2023-11-15 | John | Ongoing | 250 |
| 3 | 3 | germany | 750000 | NULL | 2023-10-10 | Anna | Complete | 200 |
| 4 | 4 | INDIA | 600000 | 2023/03/15 | 2023/09/10 | NULL | Complete | 180 |
| 5 | 5 | japan | 800000 | 2023-05-01 | 2023-04-01 | Ken | Complete | -50 |
| 6 | 6 | usa | 900000 | 2023-06-10 | 2023-12-30 | Mike | Ongoing | 200 |
| 7 | 6 | usa | 900000 | 2023-06-10 | 2023-12-30 | Mike | Ongoing | 200 |
| 8 | 7 | NULL | 450000 | 2023-07-15 | 2023-12-10 | Sara | Complete | 150 |
| 9 | 8 | india | . | 2023-08-01 | 2023-12-01 | Ravi | Complete | 120 |
| 10 | 9 | germany | 700000 | 2023-09-10 | wrongdate | Anna | Complete | 90 |
| 11 | 10 | USA | 650000 | 2023-10-05 | 2023-15-10 | John | Ongoing | 100 |
Explanation
This
dataset simulates global technology projects with multiple issues. We
intentionally introduced:
- Invalid dates (2023-02-30, wrongdate)
- Missing values (NULL, .,
blank)
- Duplicate records
(Project_ID = 6)
- Negative budget and duration
- Inconsistent country naming
(india, INDIA, NULL)
Such
inconsistencies are common in real-world pipelines, especially when integrating
data from multiple systems. The goal is to replicate real industry chaos
before cleaning.
R Raw Dataset
tech_projects_raw <- data.frame(
Project_ID = c(1,2,3,4,5,6,6,7,8,9,10),
Country = c("india","USA","germany","INDIA","japan",
"usa","usa","NULL","india","germany","USA"),
Budget = c(500000,-100000,750000,600000,800000,900000,900000,
450000,NA,700000,650000),
Start_Date = c("2023-01-10","2023-02-30","NULL","2023/03/15",
"2023-05-01","2023-06-10","2023-06-10","2023-07-15",
"2023-08-01","2023-09-10","2023-10-05"),
End_Date = c("2023-12-20","2023-11-15","2023-10-10","2023-09-10",
"2023-04-01","2023-12-30","2023-12-30","2023-12-10",
"2023-12-01","wrongdate","2023-15-10"),
Manager = c("Ravi","John","Anna","NULL","Ken","Mike","Mike","Sara",
"Ravi","Anna","John"),
Status = c("Completed","Ongoing","Completed","Completed","Completed",
"Ongoing","Ongoing","Completed","Completed","Completed",
"Ongoing"),
Duration = c(340,250,200,180,-50,200,200,150,120,90,100)
)
OUTPUT:
|
|
Project_ID |
Country |
Budget |
Start_Date |
End_Date |
Manager |
Status |
Duration |
|
1 |
1 |
india |
500000 |
10-01-2023 |
20-12-2023 |
Ravi |
Completed |
340 |
|
2 |
2 |
USA |
-100000 |
2023-02-30 |
15-11-2023 |
John |
Ongoing |
250 |
|
3 |
3 |
germany |
750000 |
NULL |
10-10-2023 |
Anna |
Completed |
200 |
|
4 |
4 |
INDIA |
600000 |
15-03-2023 |
10-09-2023 |
NULL |
Completed |
180 |
|
5 |
5 |
japan |
800000 |
01-05-2023 |
01-04-2023 |
Ken |
Completed |
-50 |
|
6 |
6 |
usa |
900000 |
10-06-2023 |
30-12-2023 |
Mike |
Ongoing |
200 |
|
7 |
6 |
usa |
900000 |
10-06-2023 |
30-12-2023 |
Mike |
Ongoing |
200 |
|
8 |
7 |
NULL |
450000 |
15-07-2023 |
10-12-2023 |
Sara |
Completed |
150 |
|
9 |
8 |
india |
NA |
01-08-2023 |
01-12-2023 |
Ravi |
Completed |
120 |
|
10 |
9 |
germany |
700000 |
10-09-2023 |
wrongdate |
Anna |
Completed |
90 |
|
11 |
10 |
USA |
650000 |
05-10-2023 |
2023-15-10 |
John |
Ongoing |
100 |
Explanation
This R
dataset mirrors the SAS dataset. It demonstrates how data inconsistencies
remain consistent across tools, meaning cleaning logic must be
reproducible. R stores everything as character or numeric vectors, so incorrect
formats (like invalid dates) won’t fail immediately but will cause issues later
during analysis. This highlights the importance of explicit type conversion
and validation in R pipelines.
3. Phase 1: Data Cleaning in SAS
PROC SORT DATA=tech_projects_raw NODUPKEY OUT=tech_projects_nodup;
BY Project_ID;
RUN;
PROC PRINT DATA=tech_projects_nodup;
RUN;
OUTPUT:
| Obs | Project_ID | Country | Budget | Start_Date | End_Date | Manager | Status | Duration |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | india | 500000 | 2023-01-10 | 2023-12-20 | Ravi | Complete | 340 |
| 2 | 2 | USA | -100000 | 2023-02-30 | 2023-11-15 | John | Ongoing | 250 |
| 3 | 3 | germany | 750000 | NULL | 2023-10-10 | Anna | Complete | 200 |
| 4 | 4 | INDIA | 600000 | 2023/03/15 | 2023/09/10 | NULL | Complete | 180 |
| 5 | 5 | japan | 800000 | 2023-05-01 | 2023-04-01 | Ken | Complete | -50 |
| 6 | 6 | usa | 900000 | 2023-06-10 | 2023-12-30 | Mike | Ongoing | 200 |
| 7 | 7 | NULL | 450000 | 2023-07-15 | 2023-12-10 | Sara | Complete | 150 |
| 8 | 8 | india | . | 2023-08-01 | 2023-12-01 | Ravi | Complete | 120 |
| 9 | 9 | germany | 700000 | 2023-09-10 | wrongdate | Anna | Complete | 90 |
| 10 | 10 | USA | 650000 | 2023-10-05 | 2023-15-10 | John | Ongoing | 100 |
DATA tech_projects_clean;
SET tech_projects_nodup;
/* Handle missing values */
Country = UPCASE(COALESCEC(Country, "UNKNOWN"));
IF Country = "NULL" THEN Country = "UNKNOWN";
Manager = COALESCEC(Manager, "NA");
/* Fix invalid numeric values */
IF Budget < 0 THEN Budget = ABS(Budget);
IF Duration < 0 THEN Duration = .;
/* Date conversion */
Start_dt = INPUT(Start_Date, YYMMDD10.);
End_dt = INPUT(End_Date, YYMMDD10.);
FORMAT Start_dt End_dt DATE9.;
/* Logical correction */
IF End_dt < Start_dt THEN End_dt = .;
RUN;
PROC PRINT DATA=tech_projects_clean;
RUN;
OUTPUT:
| Obs | Project_ID | Country | Budget | Start_Date | End_Date | Manager | Status | Duration | Start_dt | End_dt |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | 500000 | 2023-01-10 | 2023-12-20 | Ravi | Complete | 340 | 10JAN2023 | 20DEC2023 |
| 2 | 2 | USA | 100000 | 2023-02-30 | 2023-11-15 | John | Ongoing | 250 | . | 15NOV2023 |
| 3 | 3 | GERMANY | 750000 | NULL | 2023-10-10 | Anna | Complete | 200 | . | 10OCT2023 |
| 4 | 4 | INDIA | 600000 | 2023/03/15 | 2023/09/10 | NULL | Complete | 180 | 15MAR2023 | 10SEP2023 |
| 5 | 5 | JAPAN | 800000 | 2023-05-01 | 2023-04-01 | Ken | Complete | . | 01MAY2023 | . |
| 6 | 6 | USA | 900000 | 2023-06-10 | 2023-12-30 | Mike | Ongoing | 200 | 10JUN2023 | 30DEC2023 |
| 7 | 7 | UNKNOWN | 450000 | 2023-07-15 | 2023-12-10 | Sara | Complete | 150 | 15JUL2023 | 10DEC2023 |
| 8 | 8 | INDIA | . | 2023-08-01 | 2023-12-01 | Ravi | Complete | 120 | 01AUG2023 | 01DEC2023 |
| 9 | 9 | GERMANY | 700000 | 2023-09-10 | wrongdate | Anna | Complete | 90 | 10SEP2023 | . |
| 10 | 10 | USA | 650000 | 2023-10-05 | 2023-15-10 | John | Ongoing | 100 | 05OCT2023 | . |
Explanation
This step
transforms chaos into structured data:
- PROC SORT NODUPKEY removes
duplicate records efficiently.
- COALESCEC standardizes
missing text fields.
- ABS() corrects invalid numeric values like negative budgets.
- Logical validation ensures End_Date
is not earlier than Start_Date.
This
reflects real-world clinical data validation, where incorrect dates can
invalidate entire studies.
4. Phase 2: Data Cleaning in R
library(dplyr)
tech_clean <- tech_projects_raw %>%
distinct(Project_ID, .keep_all = TRUE) %>%
mutate(
Country = toupper(trimws(ifelse(Country=="NULL" | Country=="",
"UNKNOWN", Country))),,
Budget = ifelse(Budget < 0, abs(Budget), Budget),
Duration = ifelse(Duration <0 , NA, Duration),
Start_dt = as.Date(Start_Date, format="%Y-%m-%d"),
End_dt = as.Date(End_Date, format="%Y-%m-%d"),
Date_Flag = case_when(is.na(Start_dt) ~ "MISSING_START",
is.na(End_dt) ~ "MISSING_END",End_dt < Start_dt ~ "INVALID_RANGE",
TRUE ~ "OK")
)
OUTPUT:
|
|
Project_ID |
Country |
Budget |
Start_Date |
End_Date |
Manager |
Status |
Duration |
Start_dt |
End_dt |
Date_Flag |
|
1 |
1 |
INDIA |
500000 |
10-01-2023 |
20-12-2023 |
Ravi |
Completed |
340 |
10-01-2023 |
20-12-2023 |
OK |
|
2 |
2 |
USA |
100000 |
2023-02-30 |
15-11-2023 |
John |
Ongoing |
250 |
NA |
15-11-2023 |
MISSING_START |
|
3 |
3 |
GERMANY |
750000 |
NULL |
10-10-2023 |
Anna |
Completed |
200 |
NA |
10-10-2023 |
MISSING_START |
|
4 |
4 |
INDIA |
600000 |
15-03-2023 |
10-09-2023 |
NULL |
Completed |
180 |
NA |
10-09-2023 |
MISSING_START |
|
5 |
5 |
JAPAN |
800000 |
01-05-2023 |
01-04-2023 |
Ken |
Completed |
NA |
01-05-2023 |
01-04-2023 |
INVALID_RANGE |
|
6 |
6 |
USA |
900000 |
10-06-2023 |
30-12-2023 |
Mike |
Ongoing |
200 |
10-06-2023 |
30-12-2023 |
OK |
|
7 |
7 |
UNKNOWN |
450000 |
15-07-2023 |
10-12-2023 |
Sara |
Completed |
150 |
15-07-2023 |
10-12-2023 |
OK |
|
8 |
8 |
INDIA |
NA |
01-08-2023 |
01-12-2023 |
Ravi |
Completed |
120 |
01-08-2023 |
01-12-2023 |
OK |
|
9 |
9 |
GERMANY |
700000 |
10-09-2023 |
wrongdate |
Anna |
Completed |
90 |
10-09-2023 |
NA |
MISSING_END |
|
10 |
10 |
USA |
650000 |
05-10-2023 |
2023-15-10 |
John |
Ongoing |
100 |
05-10-2023 |
NA |
MISSING_END |
Explanation
R uses a pipeline
approach:
- distinct() removes
duplicates
- mutate() transforms
variables
- toupper() and trimws()
standardize text
- as.Date() converts strings
to date format
- filter() enforces logical
rules
Unlike
SAS, R requires stricter format consistency, so invalid dates may result in NA.
This forces analysts to explicitly handle errors, improving transparency.
5. Phase 3: Additional SAS Processing
DATA tech_enhanced;
SET tech_projects_clean;
/* Derive Year */
Project_Year = YEAR(Start_dt);
/* Duration recalculation */
Calc_Duration = End_dt - Start_dt;
/* Flag issues */
IF Start_dt = . OR End_dt = . THEN Flag = "CHECK";
ELSE Flag = "OK";
RUN;
PROC PRINT DATA=tech_enhanced;
RUN;
OUTPUT:
| Obs | Project_ID | Country | Budget | Start_Date | End_Date | Manager | Status | Duration | Start_dt | End_dt | Project_Year | Calc_Duration | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | 500000 | 2023-01-10 | 2023-12-20 | Ravi | Complete | 340 | 10JAN2023 | 20DEC2023 | 2023 | 344 | OK |
| 2 | 2 | USA | 100000 | 2023-02-30 | 2023-11-15 | John | Ongoing | 250 | . | 15NOV2023 | . | . | CHECK |
| 3 | 3 | GERMANY | 750000 | NULL | 2023-10-10 | Anna | Complete | 200 | . | 10OCT2023 | . | . | CHECK |
| 4 | 4 | INDIA | 600000 | 2023/03/15 | 2023/09/10 | NULL | Complete | 180 | 15MAR2023 | 10SEP2023 | 2023 | 179 | OK |
| 5 | 5 | JAPAN | 800000 | 2023-05-01 | 2023-04-01 | Ken | Complete | . | 01MAY2023 | . | 2023 | . | CHECK |
| 6 | 6 | USA | 900000 | 2023-06-10 | 2023-12-30 | Mike | Ongoing | 200 | 10JUN2023 | 30DEC2023 | 2023 | 203 | OK |
| 7 | 7 | UNKNOWN | 450000 | 2023-07-15 | 2023-12-10 | Sara | Complete | 150 | 15JUL2023 | 10DEC2023 | 2023 | 148 | OK |
| 8 | 8 | INDIA | . | 2023-08-01 | 2023-12-01 | Ravi | Complete | 120 | 01AUG2023 | 01DEC2023 | 2023 | 122 | OK |
| 9 | 9 | GERMANY | 700000 | 2023-09-10 | wrongdate | Anna | Complete | 90 | 10SEP2023 | . | 2023 | . | CHECK |
| 10 | 10 | USA | 650000 | 2023-10-05 | 2023-15-10 | John | Ongoing | 100 | 05OCT2023 | . | 2023 | . | CHECK |
Explanation
This step
adds derived variables, which are essential in ADaM datasets:
- YEAR() extracts useful
temporal features
- Duration recalculation
ensures accuracy
- Flags help identify
problematic records
Such
derived variables are critical for regulatory reporting and trend analysis.
6. 20 Data Cleaning Best Practices
- Always validate date ranges
- Never trust source system
formats
- Maintain audit trails
- Document transformation
logic
- Use controlled terminology
- Validate duplicates
carefully
- Apply SDTM standards
- Ensure traceability
- Perform QC checks
- Use macros for repeatability
- Validate missing values
- Cross-check derived
variables
- Maintain metadata
- Follow CDISC compliance
- Perform independent
validation
- Log all changes
- Use version control
- Ensure reproducibility
- Validate against SAP
- Perform final data
reconciliation
7. Business Logic Behind Data Cleaning
Data
cleaning is not cosmetic it is decision-critical.
Example:
- Negative age → biologically
impossible → must be corrected
- Missing salary → impacts
financial forecasting
- Incorrect dates → breaks
timelines
In
clinical trials:
- Wrong date = protocol
deviation
- Missing data = regulatory
rejection
Thus,
cleaning ensures:
- Accuracy
- Consistency
- Trust
8. 20 Key Insights
- Dirty data leads to wrong
conclusions
- Dates define timelines
- Missing data hides risk
- Validation prevents failure
- Standardization ensures
consistency
- SAS ensures robustness
- R ensures flexibility
- Duplicate data inflates
results
- Invalid dates break models
- Cleaning improves
performance
- QC is mandatory
- Audit trails ensure
compliance
- Data lineage matters
- Reproducibility is critical
- Automation reduces errors
- Domain knowledge is
essential
- Business rules drive
cleaning
- Documentation saves time
- Clean data builds trust
- Data quality = business
quality
9. Summary: SAS vs R
|
Feature |
SAS |
R |
|
Stability |
High |
Moderate |
|
Flexibility |
Moderate |
High |
|
Regulatory |
Strong |
Limited |
|
Date
Handling |
Robust |
Strict |
|
Learning
Curve |
Medium |
High |
10. Conclusion
Handling
date and time in SAS is not just a technical skill it is a data engineering
discipline. Whether you are working on global tech projects or clinical
trials, structured cleaning ensures reliable insights.
Clean
data is not optional. It is foundational.
11. Interview Questions
Q1: How do you handle invalid dates
in SAS?
A: Use ANYDTDTE. informat and validate with conditions.
Q2: Difference between missing and
NULL?
A: Missing is system-defined; NULL is text → must be converted.
Q3: How do you remove duplicates?
A: PROC SORT NODUPKEY
Q4: R vs SAS date handling?
A: SAS flexible; R strict parsing.
A: Set to missing and flag for review.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about GLOBAL PROJECT DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment