455.Mastering Date & Time Handling in SAS for Global Tech Projects

From Chaos to Clarity: Mastering Date & Time Handling in SAS for Global Tech Projects

1. Introduction

Imagine you are working as a SAS programmer in a global technology company handling data from large-scale infrastructure projects AI deployments, cloud migrations, smart cities, and telecom expansions. The dataset arrives from multiple countries: India, USA, Germany, and Japan. Everything looks fine until you start analyzing it.

Suddenly, you see:

  • Project start dates like 2023-13-01
  • Completion dates earlier than start dates
  • Missing timestamps
  • Duplicate project IDs
  • “NULL” written as text instead of missing values

This is not uncommon. Whether in clinical trials (SDTM/ADaM datasets) or business analytics, dirty data silently destroys decision-making.

A wrong date can:

  • Delay drug approvals
  • Miscalculate revenue timelines
  • Break regulatory submissions

This is where tools like SAS and R become powerful. They don’t just analyze data they discipline it.

In this blog, we will build a real-world messy dataset, intentionally inject errors, and then clean, standardize, and validate it using SAS and R, with a special focus on date and time handling. 

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA tech_projects_raw;

INPUT Project_ID Country $ Budget Start_Date :$15. 

      End_Date :$15. Manager $ Status $ Duration;

DATALINES;

1 india 500000 2023-01-10 2023-12-20 Ravi Completed 340

2 USA -100000 2023-02-30 2023-11-15 John Ongoing 250

3 germany 750000 NULL 2023-10-10 Anna Completed 200

4 INDIA 600000 2023/03/15 2023/09/10 NULL Completed 180

5 japan 800000 2023-05-01 2023-04-01 Ken Completed -50

6 usa 900000 2023-06-10 2023-12-30 Mike Ongoing 200

6 usa 900000 2023-06-10 2023-12-30 Mike Ongoing 200

7 NULL 450000 2023-07-15 2023-12-10 Sara Completed 150

8 india . 2023-08-01 2023-12-01 Ravi Completed 120

9 germany 700000 2023-09-10 wrongdate Anna Completed 90

10 USA 650000 2023-10-05 2023-15-10 John Ongoing 100

;

RUN;

PROC PRINT DATA=tech_projects_raw;

RUN;

OUTPUT:

ObsProject_IDCountryBudgetStart_DateEnd_DateManagerStatusDuration
11india5000002023-01-102023-12-20RaviComplete340
22USA-1000002023-02-302023-11-15JohnOngoing250
33germany750000NULL2023-10-10AnnaComplete200
44INDIA6000002023/03/152023/09/10NULLComplete180
55japan8000002023-05-012023-04-01KenComplete-50
66usa9000002023-06-102023-12-30MikeOngoing200
76usa9000002023-06-102023-12-30MikeOngoing200
87NULL4500002023-07-152023-12-10SaraComplete150
98india.2023-08-012023-12-01RaviComplete120
109germany7000002023-09-10wrongdateAnnaComplete90
1110USA6500002023-10-052023-15-10JohnOngoing100

Explanation

This dataset simulates global technology projects with multiple issues. We intentionally introduced:

  • Invalid dates (2023-02-30, wrongdate)
  • Missing values (NULL, ., blank)
  • Duplicate records (Project_ID = 6)
  • Negative budget and duration
  • Inconsistent country naming (india, INDIA, NULL)

Such inconsistencies are common in real-world pipelines, especially when integrating data from multiple systems. The goal is to replicate real industry chaos before cleaning.

R Raw Dataset

tech_projects_raw <- data.frame(

  Project_ID = c(1,2,3,4,5,6,6,7,8,9,10),

  Country = c("india","USA","germany","INDIA","japan",

              "usa","usa","NULL","india","germany","USA"),

  Budget = c(500000,-100000,750000,600000,800000,900000,900000,

             450000,NA,700000,650000),

  Start_Date = c("2023-01-10","2023-02-30","NULL","2023/03/15",

                 "2023-05-01","2023-06-10","2023-06-10","2023-07-15",

                 "2023-08-01","2023-09-10","2023-10-05"),

  End_Date = c("2023-12-20","2023-11-15","2023-10-10","2023-09-10",

               "2023-04-01","2023-12-30","2023-12-30","2023-12-10",

               "2023-12-01","wrongdate","2023-15-10"),

  Manager = c("Ravi","John","Anna","NULL","Ken","Mike","Mike","Sara",

              "Ravi","Anna","John"),

  Status = c("Completed","Ongoing","Completed","Completed","Completed",

             "Ongoing","Ongoing","Completed","Completed","Completed",

             "Ongoing"),

  Duration = c(340,250,200,180,-50,200,200,150,120,90,100)

)

OUTPUT:

 

Project_ID

Country

Budget

Start_Date

End_Date

Manager

Status

Duration

1

1

india

500000

10-01-2023

20-12-2023

Ravi

Completed

340

2

2

USA

-100000

2023-02-30

15-11-2023

John

Ongoing

250

3

3

germany

750000

NULL

10-10-2023

Anna

Completed

200

4

4

INDIA

600000

15-03-2023

10-09-2023

NULL

Completed

180

5

5

japan

800000

01-05-2023

01-04-2023

Ken

Completed

-50

6

6

usa

900000

10-06-2023

30-12-2023

Mike

Ongoing

200

7

6

usa

900000

10-06-2023

30-12-2023

Mike

Ongoing

200

8

7

NULL

450000

15-07-2023

10-12-2023

Sara

Completed

150

9

8

india

NA

01-08-2023

01-12-2023

Ravi

Completed

120

10

9

germany

700000

10-09-2023

wrongdate

Anna

Completed

90

11

10

USA

650000

05-10-2023

2023-15-10

John

Ongoing

100

Explanation

This R dataset mirrors the SAS dataset. It demonstrates how data inconsistencies remain consistent across tools, meaning cleaning logic must be reproducible. R stores everything as character or numeric vectors, so incorrect formats (like invalid dates) won’t fail immediately but will cause issues later during analysis. This highlights the importance of explicit type conversion and validation in R pipelines.

3. Phase 1: Data Cleaning in SAS

PROC SORT DATA=tech_projects_raw NODUPKEY OUT=tech_projects_nodup;

BY Project_ID;

RUN;

PROC PRINT DATA=tech_projects_nodup;

RUN;

OUTPUT:

ObsProject_IDCountryBudgetStart_DateEnd_DateManagerStatusDuration
11india5000002023-01-102023-12-20RaviComplete340
22USA-1000002023-02-302023-11-15JohnOngoing250
33germany750000NULL2023-10-10AnnaComplete200
44INDIA6000002023/03/152023/09/10NULLComplete180
55japan8000002023-05-012023-04-01KenComplete-50
66usa9000002023-06-102023-12-30MikeOngoing200
77NULL4500002023-07-152023-12-10SaraComplete150
88india.2023-08-012023-12-01RaviComplete120
99germany7000002023-09-10wrongdateAnnaComplete90
1010USA6500002023-10-052023-15-10JohnOngoing100

DATA tech_projects_clean;

SET tech_projects_nodup;

/* Handle missing values */

Country = UPCASE(COALESCEC(Country, "UNKNOWN"));

IF Country = "NULL" THEN Country = "UNKNOWN";

Manager = COALESCEC(Manager, "NA");

/* Fix invalid numeric values */

IF Budget < 0 THEN Budget = ABS(Budget);

IF Duration < 0 THEN Duration = .;

/* Date conversion */

Start_dt = INPUT(Start_Date, YYMMDD10.);

End_dt   = INPUT(End_Date, YYMMDD10.);

FORMAT Start_dt End_dt DATE9.;

/* Logical correction */

IF End_dt < Start_dt THEN End_dt = .;

RUN;

PROC PRINT DATA=tech_projects_clean;

RUN;

OUTPUT:

ObsProject_IDCountryBudgetStart_DateEnd_DateManagerStatusDurationStart_dtEnd_dt
11INDIA5000002023-01-102023-12-20RaviComplete34010JAN202320DEC2023
22USA1000002023-02-302023-11-15JohnOngoing250.15NOV2023
33GERMANY750000NULL2023-10-10AnnaComplete200.10OCT2023
44INDIA6000002023/03/152023/09/10NULLComplete18015MAR202310SEP2023
55JAPAN8000002023-05-012023-04-01KenComplete.01MAY2023.
66USA9000002023-06-102023-12-30MikeOngoing20010JUN202330DEC2023
77UNKNOWN4500002023-07-152023-12-10SaraComplete15015JUL202310DEC2023
88INDIA.2023-08-012023-12-01RaviComplete12001AUG202301DEC2023
99GERMANY7000002023-09-10wrongdateAnnaComplete9010SEP2023.
1010USA6500002023-10-052023-15-10JohnOngoing10005OCT2023.

Explanation

This step transforms chaos into structured data:

  • PROC SORT NODUPKEY removes duplicate records efficiently.
  • COALESCEC standardizes missing text fields.
  • ABS() corrects invalid numeric values like negative budgets.
  • Logical validation ensures End_Date is not earlier than Start_Date.

This reflects real-world clinical data validation, where incorrect dates can invalidate entire studies.

4. Phase 2: Data Cleaning in R

library(dplyr)

tech_clean <- tech_projects_raw %>%

  distinct(Project_ID, .keep_all = TRUE) %>%

  mutate(

    Country = toupper(trimws(ifelse(Country=="NULL" | Country=="",

                                    "UNKNOWN", Country))),,

    Budget = ifelse(Budget < 0, abs(Budget), Budget),

    Duration = ifelse(Duration <0 , NA, Duration),

    Start_dt = as.Date(Start_Date, format="%Y-%m-%d"),

    End_dt = as.Date(End_Date, format="%Y-%m-%d"),

    Date_Flag = case_when(is.na(Start_dt) ~ "MISSING_START",

    is.na(End_dt) ~ "MISSING_END",End_dt < Start_dt ~ "INVALID_RANGE",

    TRUE ~ "OK")

  )

OUTPUT:

 

Project_ID

Country

Budget

Start_Date

End_Date

Manager

Status

Duration

Start_dt

End_dt

Date_Flag

1

1

INDIA

500000

10-01-2023

20-12-2023

Ravi

Completed

340

10-01-2023

20-12-2023

OK

2

2

USA

100000

2023-02-30

15-11-2023

John

Ongoing

250

NA

15-11-2023

MISSING_START

3

3

GERMANY

750000

NULL

10-10-2023

Anna

Completed

200

NA

10-10-2023

MISSING_START

4

4

INDIA

600000

15-03-2023

10-09-2023

NULL

Completed

180

NA

10-09-2023

MISSING_START

5

5

JAPAN

800000

01-05-2023

01-04-2023

Ken

Completed

NA

01-05-2023

01-04-2023

INVALID_RANGE

6

6

USA

900000

10-06-2023

30-12-2023

Mike

Ongoing

200

10-06-2023

30-12-2023

OK

7

7

UNKNOWN

450000

15-07-2023

10-12-2023

Sara

Completed

150

15-07-2023

10-12-2023

OK

8

8

INDIA

NA

01-08-2023

01-12-2023

Ravi

Completed

120

01-08-2023

01-12-2023

OK

9

9

GERMANY

700000

10-09-2023

wrongdate

Anna

Completed

90

10-09-2023

NA

MISSING_END

10

10

USA

650000

05-10-2023

2023-15-10

John

Ongoing

100

05-10-2023

NA

MISSING_END

Explanation

R uses a pipeline approach:

  • distinct() removes duplicates
  • mutate() transforms variables
  • toupper() and trimws() standardize text
  • as.Date() converts strings to date format
  • filter() enforces logical rules

Unlike SAS, R requires stricter format consistency, so invalid dates may result in NA. This forces analysts to explicitly handle errors, improving transparency.

5. Phase 3: Additional SAS Processing

DATA tech_enhanced;

SET tech_projects_clean;

/* Derive Year */

Project_Year = YEAR(Start_dt);

/* Duration recalculation */

Calc_Duration = End_dt - Start_dt;

/* Flag issues */

IF Start_dt = . OR End_dt = . THEN Flag = "CHECK";

ELSE Flag = "OK";

RUN;

PROC PRINT DATA=tech_enhanced;

RUN;

OUTPUT:

ObsProject_IDCountryBudgetStart_DateEnd_DateManagerStatusDurationStart_dtEnd_dtProject_YearCalc_DurationFlag
11INDIA5000002023-01-102023-12-20RaviComplete34010JAN202320DEC20232023344OK
22USA1000002023-02-302023-11-15JohnOngoing250.15NOV2023..CHECK
33GERMANY750000NULL2023-10-10AnnaComplete200.10OCT2023..CHECK
44INDIA6000002023/03/152023/09/10NULLComplete18015MAR202310SEP20232023179OK
55JAPAN8000002023-05-012023-04-01KenComplete.01MAY2023.2023.CHECK
66USA9000002023-06-102023-12-30MikeOngoing20010JUN202330DEC20232023203OK
77UNKNOWN4500002023-07-152023-12-10SaraComplete15015JUL202310DEC20232023148OK
88INDIA.2023-08-012023-12-01RaviComplete12001AUG202301DEC20232023122OK
99GERMANY7000002023-09-10wrongdateAnnaComplete9010SEP2023.2023.CHECK
1010USA6500002023-10-052023-15-10JohnOngoing10005OCT2023.2023.CHECK

Explanation

This step adds derived variables, which are essential in ADaM datasets:

  • YEAR() extracts useful temporal features
  • Duration recalculation ensures accuracy
  • Flags help identify problematic records

Such derived variables are critical for regulatory reporting and trend analysis.

6. 20 Data Cleaning Best Practices

  1. Always validate date ranges
  2. Never trust source system formats
  3. Maintain audit trails
  4. Document transformation logic
  5. Use controlled terminology
  6. Validate duplicates carefully
  7. Apply SDTM standards
  8. Ensure traceability
  9. Perform QC checks
  10. Use macros for repeatability
  11. Validate missing values
  12. Cross-check derived variables
  13. Maintain metadata
  14. Follow CDISC compliance
  15. Perform independent validation
  16. Log all changes
  17. Use version control
  18. Ensure reproducibility
  19. Validate against SAP
  20. Perform final data reconciliation

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic it is decision-critical.

Example:

  • Negative age → biologically impossible → must be corrected
  • Missing salary → impacts financial forecasting
  • Incorrect dates → breaks timelines

In clinical trials:

  • Wrong date = protocol deviation
  • Missing data = regulatory rejection

Thus, cleaning ensures:

  • Accuracy
  • Consistency
  • Trust

8. 20 Key Insights

  1. Dirty data leads to wrong conclusions
  2. Dates define timelines
  3. Missing data hides risk
  4. Validation prevents failure
  5. Standardization ensures consistency
  6. SAS ensures robustness
  7. R ensures flexibility
  8. Duplicate data inflates results
  9. Invalid dates break models
  10. Cleaning improves performance
  11. QC is mandatory
  12. Audit trails ensure compliance
  13. Data lineage matters
  14. Reproducibility is critical
  15. Automation reduces errors
  16. Domain knowledge is essential
  17. Business rules drive cleaning
  18. Documentation saves time
  19. Clean data builds trust
  20. Data quality = business quality

9. Summary: SAS vs R

Feature

SAS

R

Stability

High

Moderate

Flexibility

Moderate

High

Regulatory

Strong

Limited

Date Handling

Robust

Strict

Learning Curve

Medium

High

10. Conclusion

Handling date and time in SAS is not just a technical skill it is a data engineering discipline. Whether you are working on global tech projects or clinical trials, structured cleaning ensures reliable insights.

Clean data is not optional. It is foundational.

11. Interview Questions

Q1: How do you handle invalid dates in SAS?
A: Use ANYDTDTE. informat and validate with conditions.

Q2: Difference between missing and NULL?
A: Missing is system-defined; NULL is text → must be converted.

Q3: How do you remove duplicates?
A: PROC SORT NODUPKEY

Q4: R vs SAS date handling?
A: SAS flexible; R strict parsing.

Q5: Scenario: End date < start date?
A: Set to missing and flag for review.

 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about GLOBAL PROJECT DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy


Comments

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS