From Chaos to Clarity: Mastering Date & Time Handling in SAS for Global Tech Projects

1. Introduction

Imagine you are working as a SAS programmer in a global technology company handling data from large-scale infrastructure projects AI deployments, cloud migrations, smart cities, and telecom expansions. The dataset arrives from multiple countries: India, USA, Germany, and Japan. Everything looks fine until you start analyzing it.

Suddenly, you see:

Project start dates like 2023-13-01
Completion dates earlier than start dates
Missing timestamps
Duplicate project IDs
“NULL” written as text instead of missing values

This is not uncommon. Whether in clinical trials (SDTM/ADaM datasets) or business analytics, dirty data silently destroys decision-making.

A wrong date can:

Delay drug approvals
Miscalculate revenue timelines
Break regulatory submissions

This is where tools like SAS and R become powerful. They don’t just analyze data they discipline it.

In this blog, we will build a real-world messy dataset, intentionally inject errors, and then clean, standardize, and validate it using SAS and R, with a special focus on date and time handling.

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA tech_projects_raw;

INPUT Project_ID Country $ Budget Start_Date :$15.

End_Date :$15. Manager $ Status $ Duration;

DATALINES;

1 india 500000 2023-01-10 2023-12-20 Ravi Completed 340

2 USA -100000 2023-02-30 2023-11-15 John Ongoing 250

3 germany 750000 NULL 2023-10-10 Anna Completed 200

4 INDIA 600000 2023/03/15 2023/09/10 NULL Completed 180

5 japan 800000 2023-05-01 2023-04-01 Ken Completed -50

6 usa 900000 2023-06-10 2023-12-30 Mike Ongoing 200

7 NULL 450000 2023-07-15 2023-12-10 Sara Completed 150

8 india . 2023-08-01 2023-12-01 Ravi Completed 120

9 germany 700000 2023-09-10 wrongdate Anna Completed 90

10 USA 650000 2023-10-05 2023-15-10 John Ongoing 100

;

RUN;

PROC PRINT DATA=tech_projects_raw;

RUN;

OUTPUT:


Obs	Project_ID	Country	Budget	Start_Date	End_Date	Manager	Status	Duration
1	1	india	500000	2023-01-10	2023-12-20	Ravi	Complete	340
2	2	USA	-100000	2023-02-30	2023-11-15	John	Ongoing	250
3	3	germany	750000	NULL	2023-10-10	Anna	Complete	200
4	4	INDIA	600000	2023/03/15	2023/09/10	NULL	Complete	180
5	5	japan	800000	2023-05-01	2023-04-01	Ken	Complete	-50
6	6	usa	900000	2023-06-10	2023-12-30	Mike	Ongoing	200
7	6	usa	900000	2023-06-10	2023-12-30	Mike	Ongoing	200
8	7	NULL	450000	2023-07-15	2023-12-10	Sara	Complete	150
9	8	india	.	2023-08-01	2023-12-01	Ravi	Complete	120
10	9	germany	700000	2023-09-10	wrongdate	Anna	Complete	90
11	10	USA	650000	2023-10-05	2023-15-10	John	Ongoing	100

Explanation

This dataset simulates global technology projects with multiple issues. We intentionally introduced:

Invalid dates (2023-02-30, wrongdate)
Missing values (NULL, ., blank)
Duplicate records (Project_ID = 6)
Negative budget and duration
Inconsistent country naming (india, INDIA, NULL)

Such inconsistencies are common in real-world pipelines, especially when integrating data from multiple systems. The goal is to replicate real industry chaos before cleaning.

R Raw Dataset

tech_projects_raw <- data.frame(

Project_ID = c(1,2,3,4,5,6,6,7,8,9,10),

Country = c("india","USA","germany","INDIA","japan",

"usa","usa","NULL","india","germany","USA"),

Budget = c(500000,-100000,750000,600000,800000,900000,900000,

450000,NA,700000,650000),

Start_Date = c("2023-01-10","2023-02-30","NULL","2023/03/15",

"2023-05-01","2023-06-10","2023-06-10","2023-07-15",

"2023-08-01","2023-09-10","2023-10-05"),

End_Date = c("2023-12-20","2023-11-15","2023-10-10","2023-09-10",

"2023-04-01","2023-12-30","2023-12-30","2023-12-10",

"2023-12-01","wrongdate","2023-15-10"),

Manager = c("Ravi","John","Anna","NULL","Ken","Mike","Mike","Sara",

"Ravi","Anna","John"),

Status = c("Completed","Ongoing","Completed","Completed","Completed",

"Ongoing","Ongoing","Completed","Completed","Completed",

"Ongoing"),

Duration = c(340,250,200,180,-50,200,200,150,120,90,100)

)

OUTPUT:

	Project_ID	Country	Budget	Start_Date	End_Date	Manager	Status	Duration
1	1	india	500000	10-01-2023	20-12-2023	Ravi	Completed	340
2	2	USA	-100000	2023-02-30	15-11-2023	John	Ongoing	250
3	3	germany	750000	NULL	10-10-2023	Anna	Completed	200
4	4	INDIA	600000	15-03-2023	10-09-2023	NULL	Completed	180
5	5	japan	800000	01-05-2023	01-04-2023	Ken	Completed	-50
6	6	usa	900000	10-06-2023	30-12-2023	Mike	Ongoing	200
7	6	usa	900000	10-06-2023	30-12-2023	Mike	Ongoing	200
8	7	NULL	450000	15-07-2023	10-12-2023	Sara	Completed	150
9	8	india	NA	01-08-2023	01-12-2023	Ravi	Completed	120
10	9	germany	700000	10-09-2023	wrongdate	Anna	Completed	90
11	10	USA	650000	05-10-2023	2023-15-10	John	Ongoing	100

Explanation

This R dataset mirrors the SAS dataset. It demonstrates how data inconsistencies remain consistent across tools, meaning cleaning logic must be reproducible. R stores everything as character or numeric vectors, so incorrect formats (like invalid dates) won’t fail immediately but will cause issues later during analysis. This highlights the importance of explicit type conversion and validation in R pipelines.

3. Phase 1: Data Cleaning in SAS

PROC SORT DATA=tech_projects_raw NODUPKEY OUT=tech_projects_nodup;

BY Project_ID;

RUN;

PROC PRINT DATA=tech_projects_nodup;

RUN;

OUTPUT:


Obs	Project_ID	Country	Budget	Start_Date	End_Date	Manager	Status	Duration
1	1	india	500000	2023-01-10	2023-12-20	Ravi	Complete	340
2	2	USA	-100000	2023-02-30	2023-11-15	John	Ongoing	250
3	3	germany	750000	NULL	2023-10-10	Anna	Complete	200
4	4	INDIA	600000	2023/03/15	2023/09/10	NULL	Complete	180
5	5	japan	800000	2023-05-01	2023-04-01	Ken	Complete	-50
6	6	usa	900000	2023-06-10	2023-12-30	Mike	Ongoing	200
7	7	NULL	450000	2023-07-15	2023-12-10	Sara	Complete	150
8	8	india	.	2023-08-01	2023-12-01	Ravi	Complete	120
9	9	germany	700000	2023-09-10	wrongdate	Anna	Complete	90
10	10	USA	650000	2023-10-05	2023-15-10	John	Ongoing	100

DATA tech_projects_clean;

SET tech_projects_nodup;

/* Handle missing values */

Country = UPCASE(COALESCEC(Country, "UNKNOWN"));

IF Country = "NULL" THEN Country = "UNKNOWN";

Manager = COALESCEC(Manager, "NA");

/* Fix invalid numeric values */

IF Budget < 0 THEN Budget = ABS(Budget);

IF Duration < 0 THEN Duration = .;

/* Date conversion */

Start_dt = INPUT(Start_Date, YYMMDD10.);

End_dt = INPUT(End_Date, YYMMDD10.);

FORMAT Start_dt End_dt DATE9.;

/* Logical correction */

IF End_dt < Start_dt THEN End_dt = .;

RUN;

PROC PRINT DATA=tech_projects_clean;

RUN;

OUTPUT:


Obs	Project_ID	Country	Budget	Start_Date	End_Date	Manager	Status	Duration	Start_dt	End_dt
1	1	INDIA	500000	2023-01-10	2023-12-20	Ravi	Complete	340	10JAN2023	20DEC2023
2	2	USA	100000	2023-02-30	2023-11-15	John	Ongoing	250	.	15NOV2023
3	3	GERMANY	750000	NULL	2023-10-10	Anna	Complete	200	.	10OCT2023
4	4	INDIA	600000	2023/03/15	2023/09/10	NULL	Complete	180	15MAR2023	10SEP2023
5	5	JAPAN	800000	2023-05-01	2023-04-01	Ken	Complete	.	01MAY2023	.
6	6	USA	900000	2023-06-10	2023-12-30	Mike	Ongoing	200	10JUN2023	30DEC2023
7	7	UNKNOWN	450000	2023-07-15	2023-12-10	Sara	Complete	150	15JUL2023	10DEC2023
8	8	INDIA	.	2023-08-01	2023-12-01	Ravi	Complete	120	01AUG2023	01DEC2023
9	9	GERMANY	700000	2023-09-10	wrongdate	Anna	Complete	90	10SEP2023	.
10	10	USA	650000	2023-10-05	2023-15-10	John	Ongoing	100	05OCT2023	.

Explanation

This step transforms chaos into structured data:

PROC SORT NODUPKEY removes duplicate records efficiently.
COALESCEC standardizes missing text fields.
ABS() corrects invalid numeric values like negative budgets.
Logical validation ensures End_Date is not earlier than Start_Date.

This reflects real-world clinical data validation, where incorrect dates can invalidate entire studies.

4. Phase 2: Data Cleaning in R

library(dplyr)

tech_clean <- tech_projects_raw %>%

distinct(Project_ID, .keep_all = TRUE) %>%

mutate(

Country = toupper(trimws(ifelse(Country=="NULL" | Country=="",

"UNKNOWN", Country))),,

Budget = ifelse(Budget < 0, abs(Budget), Budget),

Duration = ifelse(Duration <0 , NA, Duration),

Start_dt = as.Date(Start_Date, format="%Y-%m-%d"),

End_dt = as.Date(End_Date, format="%Y-%m-%d"),

Date_Flag = case_when(is.na(Start_dt) ~ "MISSING_START",

is.na(End_dt) ~ "MISSING_END",End_dt < Start_dt ~ "INVALID_RANGE",

TRUE ~ "OK")

)

OUTPUT:

	Project_ID	Country	Budget	Start_Date	End_Date	Manager	Status	Duration	Start_dt	End_dt	Date_Flag
1	1	INDIA	500000	10-01-2023	20-12-2023	Ravi	Completed	340	10-01-2023	20-12-2023	OK
2	2	USA	100000	2023-02-30	15-11-2023	John	Ongoing	250	NA	15-11-2023	MISSING_START
3	3	GERMANY	750000	NULL	10-10-2023	Anna	Completed	200	NA	10-10-2023	MISSING_START
4	4	INDIA	600000	15-03-2023	10-09-2023	NULL	Completed	180	NA	10-09-2023	MISSING_START
5	5	JAPAN	800000	01-05-2023	01-04-2023	Ken	Completed	NA	01-05-2023	01-04-2023	INVALID_RANGE
6	6	USA	900000	10-06-2023	30-12-2023	Mike	Ongoing	200	10-06-2023	30-12-2023	OK
7	7	UNKNOWN	450000	15-07-2023	10-12-2023	Sara	Completed	150	15-07-2023	10-12-2023	OK
8	8	INDIA	NA	01-08-2023	01-12-2023	Ravi	Completed	120	01-08-2023	01-12-2023	OK
9	9	GERMANY	700000	10-09-2023	wrongdate	Anna	Completed	90	10-09-2023	NA	MISSING_END
10	10	USA	650000	05-10-2023	2023-15-10	John	Ongoing	100	05-10-2023	NA	MISSING_END

Explanation

R uses a pipeline approach:

distinct() removes duplicates
mutate() transforms variables
toupper() and trimws() standardize text
as.Date() converts strings to date format
filter() enforces logical rules

Unlike SAS, R requires stricter format consistency, so invalid dates may result in NA. This forces analysts to explicitly handle errors, improving transparency.

5. Phase 3: Additional SAS Processing

DATA tech_enhanced;

SET tech_projects_clean;

/* Derive Year */

Project_Year = YEAR(Start_dt);

/* Duration recalculation */

Calc_Duration = End_dt - Start_dt;

/* Flag issues */

IF Start_dt = . OR End_dt = . THEN Flag = "CHECK";

ELSE Flag = "OK";

RUN;

PROC PRINT DATA=tech_enhanced;

RUN;

OUTPUT:


Obs	Project_ID	Country	Budget	Start_Date	End_Date	Manager	Status	Duration	Start_dt	End_dt	Project_Year	Calc_Duration	Flag
1	1	INDIA	500000	2023-01-10	2023-12-20	Ravi	Complete	340	10JAN2023	20DEC2023	2023	344	OK
2	2	USA	100000	2023-02-30	2023-11-15	John	Ongoing	250	.	15NOV2023	.	.	CHECK
3	3	GERMANY	750000	NULL	2023-10-10	Anna	Complete	200	.	10OCT2023	.	.	CHECK
4	4	INDIA	600000	2023/03/15	2023/09/10	NULL	Complete	180	15MAR2023	10SEP2023	2023	179	OK
5	5	JAPAN	800000	2023-05-01	2023-04-01	Ken	Complete	.	01MAY2023	.	2023	.	CHECK
6	6	USA	900000	2023-06-10	2023-12-30	Mike	Ongoing	200	10JUN2023	30DEC2023	2023	203	OK
7	7	UNKNOWN	450000	2023-07-15	2023-12-10	Sara	Complete	150	15JUL2023	10DEC2023	2023	148	OK
8	8	INDIA	.	2023-08-01	2023-12-01	Ravi	Complete	120	01AUG2023	01DEC2023	2023	122	OK
9	9	GERMANY	700000	2023-09-10	wrongdate	Anna	Complete	90	10SEP2023	.	2023	.	CHECK
10	10	USA	650000	2023-10-05	2023-15-10	John	Ongoing	100	05OCT2023	.	2023	.	CHECK

Explanation

This step adds derived variables, which are essential in ADaM datasets:

YEAR() extracts useful temporal features
Duration recalculation ensures accuracy
Flags help identify problematic records

Such derived variables are critical for regulatory reporting and trend analysis.

6. 20 Data Cleaning Best Practices

Always validate date ranges
Never trust source system formats
Maintain audit trails
Document transformation logic
Use controlled terminology
Validate duplicates carefully
Apply SDTM standards
Ensure traceability
Perform QC checks
Use macros for repeatability
Validate missing values
Cross-check derived variables
Maintain metadata
Follow CDISC compliance
Perform independent validation
Log all changes
Use version control
Ensure reproducibility
Validate against SAP
Perform final data reconciliation

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic it is decision-critical.

Example:

Negative age → biologically impossible → must be corrected
Missing salary → impacts financial forecasting
Incorrect dates → breaks timelines

In clinical trials:

Wrong date = protocol deviation
Missing data = regulatory rejection

Thus, cleaning ensures:

Accuracy
Consistency
Trust

8. 20 Key Insights

Dirty data leads to wrong conclusions
Dates define timelines
Missing data hides risk
Validation prevents failure
Standardization ensures consistency
SAS ensures robustness
R ensures flexibility
Duplicate data inflates results
Invalid dates break models
Cleaning improves performance
QC is mandatory
Audit trails ensure compliance
Data lineage matters
Reproducibility is critical
Automation reduces errors
Domain knowledge is essential
Business rules drive cleaning
Documentation saves time
Clean data builds trust
Data quality = business quality

9. Summary: SAS vs R

Feature	SAS	R
Stability	High	Moderate
Flexibility	Moderate	High
Regulatory	Strong	Limited
Date Handling	Robust	Strict
Learning Curve	Medium	High

10. Conclusion

Handling date and time in SAS is not just a technical skill it is a data engineering discipline. Whether you are working on global tech projects or clinical trials, structured cleaning ensures reliable insights.

Clean data is not optional. It is foundational.

11. Interview Questions

Q1: How do you handle invalid dates in SAS?
A: Use ANYDTDTE. informat and validate with conditions.

Q2: Difference between missing and NULL?
A: Missing is system-defined; NULL is text → must be converted.

Q3: How do you remove duplicates?
A: PROC SORT NODUPKEY

Q4: R vs SAS date handling?
A: SAS flexible; R strict parsing.

Q5: Scenario: End date < start date?
A: Set to missing and flag for review.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about GLOBAL PROJECT DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Mastering Date & Time Handling in SAS for Global Tech Projects

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Mastering Date & Time Handling in SAS for Global Tech Projects

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?