Numbers Don’t Lie—Until Dates Do: Rebuilding Accuracy in Financial Data
Global Loans, Broken Dates & Hidden Risks: Mastering SAS Data Cleaning for Reliable Analytics
1. Introduction – When Data Lies, Decisions
Collapse
Imagine
you’re working on a global financial dataset tracking loans issued by top countries.
Everything looks fine until you discover that some loan dates are “2023-15-45”,
others are “NULL”, and a few are even in future centuries.
Now
picture this in a clinical trial instead of finance. A patient’s
treatment start date is recorded incorrectly. That single error can misalign
treatment timelines, skew efficacy analysis, and potentially mislead regulatory
submissions.
This is
the silent danger of bad data.
In the
real world, datasets are messy:
- Missing values creep in
- Dates get corrupted
- Duplicate records inflate
metrics
- Text fields become
inconsistent
And the
worst part? These issues don’t throw errors they quietly distort your insights.
This is
where SAS (especially in regulated domains like clinical trials) and R
(for flexible data manipulation) become powerful allies. They help transform
chaotic datasets into analysis-ready, audit-compliant data.
In this blog, we’ll walk through a “Top Countries Loans” dataset, intentionally inject errors, and systematically clean it using SAS and R focusing especially on handling invalid dates, a critical and often overlooked problem.
2. Raw Data Creation in SAS and R
We create a dataset with intentional issues:
- Invalid dates
- Missing values
- Duplicates
- Inconsistent text
SAS Code – Raw Dataset Creation
DATA loans_raw;
INPUT Country $ Loan_ID Loan_Amount Loan_Date:$10. Approval_Date:$10.
Interest_Rate Duration Borrower_Count Status $;
DATALINES;
India 101 500000 2023-01-15 2023-01-20 7.5 5 100 Approved
USA 102 800000 2023-15-45 2023-02-01 6.5 10 200 Approved
china 103 . 2022-12-10 2022-12-15 5.5 8 150 NULL
India 101 500000 2023-01-15 2023-01-20 7.5 5 100 Approved
UK 104 -300000 2023-03-12 2023-03-18 6.0 7 120 Approved
Germany 105 700000 NULL 2023-04-01 5.0 6 130 Approved
France 106 650000 2023-02-30 2023-03-05 5.8 9 140 Approved
Brazil 107 400000 2023-05-10 2023-05-15 6.2 4 110 Approved
Japan 108 900000 2023-06-01 2023-06-05 4.5 12 180 Approved
India 109 300000 2023-07-20 2023-07-25 7.0 3 90 Approved
;
RUN;
PROC PRINT DATA = loans_raw;
RUN;
OUTPUT:
| Obs | Country | Loan_ID | Loan_Amount | Loan_Date | Approval_Date | Interest_Rate | Duration | Borrower_Count | Status |
|---|---|---|---|---|---|---|---|---|---|
| 1 | India | 101 | 500000 | 2023-01-15 | 2023-01-20 | 7.5 | 5 | 100 | Approved |
| 2 | USA | 102 | 800000 | 2023-15-45 | 2023-02-01 | 6.5 | 10 | 200 | Approved |
| 3 | china | 103 | . | 2022-12-10 | 2022-12-15 | 5.5 | 8 | 150 | NULL |
| 4 | India | 101 | 500000 | 2023-01-15 | 2023-01-20 | 7.5 | 5 | 100 | Approved |
| 5 | UK | 104 | -300000 | 2023-03-12 | 2023-03-18 | 6.0 | 7 | 120 | Approved |
| 6 | Germany | 105 | 700000 | NULL | 2023-04-01 | 5.0 | 6 | 130 | Approved |
| 7 | France | 106 | 650000 | 2023-02-30 | 2023-03-05 | 5.8 | 9 | 140 | Approved |
| 8 | Brazil | 107 | 400000 | 2023-05-10 | 2023-05-15 | 6.2 | 4 | 110 | Approved |
| 9 | Japan | 108 | 900000 | 2023-06-01 | 2023-06-05 | 4.5 | 12 | 180 | Approved |
| 10 | India | 109 | 300000 | 2023-07-20 | 2023-07-25 | 7.0 | 3 | 90 | Approved |
Explanation & Key Points
This
dataset intentionally mimics real-world inconsistencies:
- 2023-15-45 → invalid date
- 2023-02-30 → non-existent
date
- NULL values → missing data
- Duplicate record (India 101)
- Negative loan amount
(-300000)
- Inconsistent casing (“china”)
In
financial or clinical datasets, such issues can break downstream derivations
(e.g., duration calculations, exposure windows). SAS reads everything as raw
input, but does not validate correctness automatically, making manual
cleaning essential.
R Code – Equivalent Raw Dataset
loans_raw <- data.frame(
Country = c("India","USA","china","India","UK","Germany",
"France","Brazil","Japan","India"),
Loan_ID = c(101,102,103,101,104,105,106,107,108,109),
Loan_Amount = c(500000,800000,NA,500000,-300000,700000,
650000,400000,900000,300000),
Loan_Date = c("2023-01-15","2023-15-45","2022-12-10",
"2023-01-15","2023-03-12","NULL","2023-02-30",
"2023-05-10","2023-06-01","2023-07-20"),
Approval_Date = c("2023-01-20","2023-02-01","2022-12-15",
"2023-01-20","2023-03-18","2023-04-01",
"2023-03-05","2023-05-15","2023-06-05",
"2023-07-25"),
Interest_Rate = c(7.5,6.5,5.5,7.5,6.0,5.0,5.8,6.2,4.5,7.0),
Duration = c(5,10,8,5,7,6,9,4,12,3),
Borrower_Count = c(100,200,150,100,120,130,140,110,180,90),
Status = c("Approved","Approved","NULL","Approved","Approved",
"Approved","Approved","Approved","Approved","Approved")
)
OUTPUT:
|
|
Country |
Loan_ID |
Loan_Amount |
Loan_Date |
Approval_Date |
Interest_Rate |
Duration |
Borrower_Count |
Status |
|
1 |
India |
101 |
500000 |
15-01-2023 |
20-01-2023 |
7.5 |
5 |
100 |
Approved |
|
2 |
USA |
102 |
800000 |
2023-15-45 |
01-02-2023 |
6.5 |
10 |
200 |
Approved |
|
3 |
china |
103 |
NA |
10-12-2022 |
15-12-2022 |
5.5 |
8 |
150 |
NULL |
|
4 |
India |
101 |
500000 |
15-01-2023 |
20-01-2023 |
7.5 |
5 |
100 |
Approved |
|
5 |
UK |
104 |
-300000 |
12-03-2023 |
18-03-2023 |
6 |
7 |
120 |
Approved |
|
6 |
Germany |
105 |
700000 |
NULL |
01-04-2023 |
5 |
6 |
130 |
Approved |
|
7 |
France |
106 |
650000 |
2023-02-30 |
05-03-2023 |
5.8 |
9 |
140 |
Approved |
|
8 |
Brazil |
107 |
400000 |
10-05-2023 |
15-05-2023 |
6.2 |
4 |
110 |
Approved |
|
9 |
Japan |
108 |
900000 |
01-06-2023 |
05-06-2023 |
4.5 |
12 |
180 |
Approved |
|
10 |
India |
109 |
300000 |
20-07-2023 |
25-07-2023 |
7 |
3 |
90 |
Approved |
Explanation & Key Points
R allows
flexible dataset creation but does not enforce strict data validation. Invalid
dates remain as strings, and missing values (NA, "NULL") coexist.
This flexibility is powerful but risky without cleaning, analysis functions may
silently fail or produce incorrect outputs.
3. Phase 1: Data Cleaning in SAS
SAS Cleaning Code
DATA loans_clean;
SET loans_raw;
Country = UPCASE(STRIP(Country));
IF Status IN ("", "NULL", ".", "NA") THEN
Status = "UNKNOWN";
IF Loan_Amount < 0 THEN Loan_Amount = ABS(Loan_Amount);
IF Loan_Date = "NULL" THEN Loan_Date = "";
Loan_Date_num = INPUT(Loan_Date,??YYMMDD10.);
Approval_Date_num = INPUT(Approval_Date,??YYMMDD10.);
FORMAT Loan_Date_num Approval_Date_num DATE9.;
IF Loan_Date_num = . THEN Loan_Date_num = Approval_Date_num;
RUN;
PROC PRINT DATA = loans_clean;
RUN;
OUTPUT:
| Obs | Country | Loan_ID | Loan_Amount | Loan_Date | Approval_Date | Interest_Rate | Duration | Borrower_Count | Status | Loan_Date_num | Approval_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | INDIA | 101 | 500000 | 2023-01-15 | 2023-01-20 | 7.5 | 5 | 100 | Approved | 15JAN2023 | 20JAN2023 |
| 2 | USA | 102 | 800000 | 2023-15-45 | 2023-02-01 | 6.5 | 10 | 200 | Approved | 01FEB2023 | 01FEB2023 |
| 3 | CHINA | 103 | . | 2022-12-10 | 2022-12-15 | 5.5 | 8 | 150 | UNKNOWN | 10DEC2022 | 15DEC2022 |
| 4 | INDIA | 101 | 500000 | 2023-01-15 | 2023-01-20 | 7.5 | 5 | 100 | Approved | 15JAN2023 | 20JAN2023 |
| 5 | UK | 104 | 300000 | 2023-03-12 | 2023-03-18 | 6.0 | 7 | 120 | Approved | 12MAR2023 | 18MAR2023 |
| 6 | GERMANY | 105 | 700000 | 2023-04-01 | 5.0 | 6 | 130 | Approved | 01APR2023 | 01APR2023 | |
| 7 | FRANCE | 106 | 650000 | 2023-02-30 | 2023-03-05 | 5.8 | 9 | 140 | Approved | 05MAR2023 | 05MAR2023 |
| 8 | BRAZIL | 107 | 400000 | 2023-05-10 | 2023-05-15 | 6.2 | 4 | 110 | Approved | 10MAY2023 | 15MAY2023 |
| 9 | JAPAN | 108 | 900000 | 2023-06-01 | 2023-06-05 | 4.5 | 12 | 180 | Approved | 01JUN2023 | 05JUN2023 |
| 10 | INDIA | 109 | 300000 | 2023-07-20 | 2023-07-25 | 7.0 | 3 | 90 | Approved | 20JUL2023 | 25JUL2023 |
PROC SORT DATA=loans_clean NODUPKEY;
BY Loan_ID;
RUN;
PROC PRINT DATA = loans_clean;
RUN;
OUTPUT:
| Obs | Country | Loan_ID | Loan_Amount | Loan_Date | Approval_Date | Interest_Rate | Duration | Borrower_Count | Status | Loan_Date_num | Approval_Date_num |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | INDIA | 101 | 500000 | 2023-01-15 | 2023-01-20 | 7.5 | 5 | 100 | Approved | 15JAN2023 | 20JAN2023 |
| 2 | USA | 102 | 800000 | 2023-15-45 | 2023-02-01 | 6.5 | 10 | 200 | Approved | 01FEB2023 | 01FEB2023 |
| 3 | CHINA | 103 | . | 2022-12-10 | 2022-12-15 | 5.5 | 8 | 150 | UNKNOWN | 10DEC2022 | 15DEC2022 |
| 4 | UK | 104 | 300000 | 2023-03-12 | 2023-03-18 | 6.0 | 7 | 120 | Approved | 12MAR2023 | 18MAR2023 |
| 5 | GERMANY | 105 | 700000 | 2023-04-01 | 5.0 | 6 | 130 | Approved | 01APR2023 | 01APR2023 | |
| 6 | FRANCE | 106 | 650000 | 2023-02-30 | 2023-03-05 | 5.8 | 9 | 140 | Approved | 05MAR2023 | 05MAR2023 |
| 7 | BRAZIL | 107 | 400000 | 2023-05-10 | 2023-05-15 | 6.2 | 4 | 110 | Approved | 10MAY2023 | 15MAY2023 |
| 8 | JAPAN | 108 | 900000 | 2023-06-01 | 2023-06-05 | 4.5 | 12 | 180 | Approved | 01JUN2023 | 05JUN2023 |
| 9 | INDIA | 109 | 300000 | 2023-07-20 | 2023-07-25 | 7.0 | 3 | 90 | Approved | 20JUL2023 | 25JUL2023 |
Explanation & Key Points
This is
where SAS shines structured, auditable cleaning:
- UPCASE + STRIP standardizes
text
- COALESCEC replaces missing
character values
- ABS() corrects invalid
numeric values
- INPUT() converts strings to
proper SAS dates
- ?? modifier
If conversion fails, don’t throw an error
- Invalid dates become missing
(.), enabling detection
- Imputation: missing
Loan_Date replaced with Approval_Date
- PROC SORT NODUPKEY removes
duplicates
SAS
enforces strict type conversion, making it ideal for regulated
environments (CDISC/SDTM).
4. Phase 2: Data Cleaning in R
library(dplyr)
loans_clean <- loans_raw %>%
mutate(
Country = toupper(trimws(Country)),
Status = ifelse(is.na(Status) | Status=="NULL","UNKNOWN",Status),
Loan_Amount = ifelse(!is.na(Loan_Amount) & Loan_Amount < 0, abs(Loan_Amount), Loan_Amount),
Loan_Date = as.Date(Loan_Date, origin = "1960-01-01"),
Approval_Date = as.Date(Approval_Date),
Loan_Date = if_else(is.na(Loan_Date), Approval_Date, Loan_Date)
) %>%
distinct(Loan_ID, .keep_all = TRUE)
OUTPUT:
|
|
Country |
Loan_ID |
Loan_Amount |
Loan_Date |
Approval_Date |
Interest_Rate |
Duration |
Borrower_Count |
Status |
|
1 |
INDIA |
101 |
500000 |
15-01-2023 |
20-01-2023 |
7.5 |
5 |
100 |
Approved |
|
2 |
USA |
102 |
800000 |
01-02-2023 |
01-02-2023 |
6.5 |
10 |
200 |
Approved |
|
3 |
CHINA |
103 |
NA |
10-12-2022 |
15-12-2022 |
5.5 |
8 |
150 |
UNKNOWN |
|
4 |
UK |
104 |
300000 |
12-03-2023 |
18-03-2023 |
6 |
7 |
120 |
Approved |
|
5 |
GERMANY |
105 |
700000 |
01-04-2023 |
01-04-2023 |
5 |
6 |
130 |
Approved |
|
6 |
FRANCE |
106 |
650000 |
05-03-2023 |
05-03-2023 |
5.8 |
9 |
140 |
Approved |
|
7 |
BRAZIL |
107 |
400000 |
10-05-2023 |
15-05-2023 |
6.2 |
4 |
110 |
Approved |
|
8 |
JAPAN |
108 |
900000 |
01-06-2023 |
05-06-2023 |
4.5 |
12 |
180 |
Approved |
|
9 |
INDIA |
109 |
300000 |
20-07-2023 |
25-07-2023 |
7 |
3 |
90 |
Approved |
Explanation & Key Points
R
provides flexible transformation pipelines:
- mutate() for transformation
- as.Date() automatically
flags invalid dates as NA
originfor numeric dates- if_else()Preserves
Date type
- distinct() removes
duplicates
Unlike
SAS, R requires more manual control but offers speed and flexibility,
especially for exploratory analysis.
5. Phase 3: Additional SAS Processing
PROC MEANS DATA=loans_clean;
VAR Loan_Amount Interest_Rate;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
Loan_Amount Interest_Rate | 8 9 | 568750.00 6.0000000 | 228250.33 0.9407444 | 300000.00 4.5000000 | 900000.00 7.5000000 |
PROC FREQ DATA=loans_clean;
TABLES Country Status;
RUN;
OUTPUT:
The FREQ Procedure
| Country | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| BRAZIL | 1 | 11.11 | 1 | 11.11 |
| CHINA | 1 | 11.11 | 2 | 22.22 |
| FRANCE | 1 | 11.11 | 3 | 33.33 |
| GERMANY | 1 | 11.11 | 4 | 44.44 |
| INDIA | 2 | 22.22 | 6 | 66.67 |
| JAPAN | 1 | 11.11 | 7 | 77.78 |
| UK | 1 | 11.11 | 8 | 88.89 |
| USA | 1 | 11.11 | 9 | 100.00 |
| Status | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Approved | 8 | 88.89 | 8 | 88.89 |
| UNKNOWN | 1 | 11.11 | 9 | 100.00 |
PROC SQL;
SELECT Country, SUM(Loan_Amount) AS Total_Loan
FROM loans_clean
GROUP BY Country;
QUIT;
OUTPUT:
| Country | Total_Loan |
|---|---|
| BRAZIL | 400000 |
| CHINA | . |
| FRANCE | 650000 |
| GERMANY | 700000 |
| INDIA | 800000 |
| JAPAN | 900000 |
| UK | 300000 |
| USA | 800000 |
Explanation & Key Points
These
procedures validate cleaned data:
- PROC MEANS → detects
anomalies in numeric values
- PROC FREQ → identifies
unexpected categories
- PROC SQL → aggregates
business insights
This step
ensures data is not just clean but analytically trustworthy.
6. 20 Data Cleaning Best Practices
- Always validate date formats
before conversion
- Use controlled terminology
(CDISC standards)
- Maintain audit trails for
transformations
- Never overwrite raw data
- Use metadata-driven programming
- Validate ranges (age,
amount, etc.)
- Use PROC COMPARE for QC
- Ensure referential integrity
across datasets
- Standardize variable naming
conventions
- Log all transformations
- Avoid hardcoding logic
- Validate against SAP
specifications
- Implement duplicate checks
- Use macros for reusability
- Perform cross-domain
consistency checks
- Track missing value patterns
- Use validation reports
- Follow regulatory guidelines
(FDA/EMA)
- Implement version control
- Conduct peer reviews
7. Business Logic Behind Data Cleaning
Data
cleaning is not just technical it’s deeply tied to business logic. Missing
values are often replaced to preserve analytical continuity. For example, if a
loan date is missing but approval date exists, using the approval date ensures
timeline consistency.
Unrealistic
values like negative loan amounts are corrected because they violate domain
logic. In clinical trials, a negative patient age is impossible; similarly, in
finance, a negative loan disbursement is invalid unless explicitly defined.
Date
imputation is critical. If treatment start dates are incorrect, efficacy
endpoints shift. In financial analytics, incorrect loan timelines distort
repayment schedules and risk models.
Salary
normalization (in HR datasets) or interest rate standardization ensures
comparability across regions. Without this, models produce biased outputs.
Ultimately,
every cleaning decision must align with real-world logic, not just code
correctness. Poor cleaning leads to flawed insights, which can result in
financial loss, regulatory rejection, or incorrect scientific conclusions.
8. 20 Key Insights
- Dirty data leads to wrong
conclusions
- Dates are the backbone of
timelines
- Missing data must be handled
logically
- Standardization ensures
reproducibility
- Duplicate records inflate
metrics
- Validation is not optional
- Always keep raw data intact
- Audit trails are critical in
clinical trials
- Automation reduces human
error
- Text inconsistencies break
joins
- Date formats must be
consistent
- Outliers must be
investigated, not ignored
- SAS ensures regulatory
compliance
- R accelerates exploratory
analysis
- Cleaning improves model
accuracy
- Business rules guide transformations
- Always validate after
cleaning
- Data quality defines
analytics quality
- Small errors create big
risks
- Clean data builds trust
9. Summary
This blog
demonstrated how a seemingly simple Top Countries Loans dataset can
become analytically unreliable due to issues like invalid dates, duplicates,
and inconsistent values. Through structured cleaning in SAS and flexible
transformations in R, we converted messy data into a reliable dataset.
SAS
provides a robust, audit-friendly framework ideal for regulated environments
such as clinical trials. Its strict typing and procedural clarity make it
highly reliable for production-grade data pipelines. Functions like INPUT, COALESCEC,
and PROC SORT enable systematic validation and cleaning.
R, on the
other hand, excels in flexibility and speed. With packages like dplyr, it
allows concise and readable transformations. However, it requires more manual
vigilance to ensure data integrity.
Handling
invalid dates was a central focus. SAS converts invalid dates into missing
values, enabling structured correction. R flags them as NA, allowing flexible
imputation strategies.
Ultimately,
data cleaning is not just a preprocessing step it is the foundation of
trustworthy analytics. Whether in finance or clinical trials, clean data
ensures accurate insights, regulatory compliance, and confident
decision-making.
10. Conclusion
In any
data-driven organization, the quality of insights is directly proportional to
the quality of data. This blog walked through a realistic scenario where a
global loan dataset contained hidden issues invalid dates, duplicates, and
inconsistent values that could easily mislead analysis.
By
leveraging SAS, we implemented a structured, step-by-step cleaning pipeline
that ensures traceability, reproducibility, and compliance key requirements in
regulated domains like clinical research. SAS’s ability to enforce strict data
types and provide detailed logs makes it indispensable for enterprise-level
data management.
R
complemented this process by offering a flexible and intuitive environment for
data transformation. Its expressive syntax allows analysts to quickly iterate
and explore datasets, making it highly effective for early-stage analysis.
However,
tools alone are not enough. The real strength lies in understanding the business
logic behind the data. Why should a missing date be imputed? Why must
negative values be corrected? These decisions require domain knowledge, not
just programming skills.
Handling
invalid dates is especially critical because time-based analysis underpins most
business and clinical decisions. A single incorrect date can cascade into
flawed models, incorrect conclusions, and costly mistakes.
The key
takeaway is clear: Data cleaning is not optional it is foundational. A
disciplined, well-documented, and logically sound cleaning process transforms
raw data into a strategic asset.
11. Interview Questions
Q1: How do you handle invalid dates in SAS?
Answer: Use INPUT() to convert character
to date. Invalid values become missing (.), which can then be imputed or
flagged.
Q2: How would you detect duplicates in SAS?
Answer: Use PROC SORT NODUPKEY or PROC
SQL GROUP BY HAVING COUNT(*) > 1.
Q3: In R, how do you handle invalid date formats?
Answer: Use as.Date(). Invalid formats
return NA, which can be handled using ifelse() or replace_na().
Q4: Scenario – Loan amount is negative. What will
you do?
Answer: Validate business logic. If
invalid, use ABS() or flag for review.
Q5: Why is date validation critical in clinical
trials?
Answer: Because endpoints depend on
timelines. Incorrect dates can invalidate study results and regulatory
submissions.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about LOAN DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment