444.Turning Messy Raw Data Into Decision-Ready Gold With SAS Error Fixing Techniques
Can A Step-By-Step SAS Data Error Fixing Framework Turn Chaotic Raw Data Into Trusted Intelligence?
Introduction: Why This Project Matters
In
real-world analytics, raw data is rarely clean. It behaves like an untrained
employee unpredictable, inconsistent, and sometimes outright wrong. As a SAS
programmer, your role is not just to analyze data, but to discipline it,
standardize it, and ultimately trust it.
In this project, we will simulate a realistic messy dataset and systematically fix it using SAS. Think of this as moving from “data chaos” to “decision-grade intelligence.”
The Raw Dataset
SAS Code (DATALINES)
DATA raw_errors;
INPUT ID Name $ Age Salary JoinDate:$10. Score;
DATALINES;
1 John 25 50000 2022-01-15 85
2 Mary -30 60000 2022-02-20 90
3 Steve 45 -70000 2022-03-10 88
4 Anna . 55000 2022-04-12 .
5 Mike 200 45000 2022-05-05 75
6 Sara 29 0 2022-06-25 92
7 David 34 72000 2022-13-01 80
8 Lisa 28 68000 2022-07-15 95
9 Tom 31 65000 . 89
10 Emma 27 70000 2022-08-20 91
11 Raj 26 62000 2022-09-10 87
12 Ravi 30 -1000 2022-10-11 85
13 Priya 29 58000 2022-11-01 93
14 Arun . 61000 2022-12-05 88
15 Kiran 33 67000 2022-06-31 90
;
RUN;
Proc print data=raw_errors;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 |
| 2 | 2 | Mary | -30 | 60000 | 2022-02-20 | 90 |
| 3 | 3 | Steve | 45 | -70000 | 2022-03-10 | 88 |
| 4 | 4 | Anna | . | 55000 | 2022-04-12 | . |
| 5 | 5 | Mike | 200 | 45000 | 2022-05-05 | 75 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 |
| 9 | 9 | Tom | 31 | 65000 | 89 | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 |
| 12 | 12 | Ravi | 30 | -1000 | 2022-10-11 | 85 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 |
| 14 | 14 | Arun | . | 61000 | 2022-12-05 | 88 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 |
Equivalent R Code
raw_errors <- data.frame(
ID = 1:15,
Name = c("John","Mary","Steve","Anna","Mike","Sara","David","Lisa",
"Tom","Emma","Raj","Ravi","Priya","Arun","Kiran"),
Age = c(25,-30,45,NA,200,29,34,28,31,27,26,30,29,NA,33),
Salary = c(50000,60000,-70000,55000,45000,0,72000,68000,65000,
70000,62000,-1000,58000,61000,67000),
JoinDate = c("2022-01-15","2022-02-20","2022-03-10","2022-04-12",
"2022-05-05","2022-06-25","2022-13-01","2022-07-15",NA,
"2022-08-20","2022-09-10","2022-10-11","2022-11-01",
"2022-12-05","2022-06-31"),
Score = c(85,90,88,NA,75,92,80,95,89,91,87,85,93,88,90)
)
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
JoinDate |
Score |
|
1 |
1 |
John |
25 |
50000 |
15-01-2022 |
85 |
|
2 |
2 |
Mary |
-30 |
60000 |
20-02-2022 |
90 |
|
3 |
3 |
Steve |
45 |
-70000 |
10-03-2022 |
88 |
|
4 |
4 |
Anna |
NA |
55000 |
12-04-2022 |
NA |
|
5 |
5 |
Mike |
200 |
45000 |
05-05-2022 |
75 |
|
6 |
6 |
Sara |
29 |
0 |
25-06-2022 |
92 |
|
7 |
7 |
David |
34 |
72000 |
2022-13-01 |
80 |
|
8 |
8 |
Lisa |
28 |
68000 |
15-07-2022 |
95 |
|
9 |
9 |
Tom |
31 |
65000 |
NA |
89 |
|
10 |
10 |
Emma |
27 |
70000 |
20-08-2022 |
91 |
|
11 |
11 |
Raj |
26 |
62000 |
10-09-2022 |
87 |
|
12 |
12 |
Ravi |
30 |
-1000 |
11-10-2022 |
85 |
|
13 |
13 |
Priya |
29 |
58000 |
01-11-2022 |
93 |
|
14 |
14 |
Arun |
NA |
61000 |
05-12-2022 |
88 |
|
15 |
15 |
Kiran |
33 |
67000 |
2022-06-31 |
90 |
Phase 1: Discovery &
Chaos
Understanding the 5 Critical Data Errors
This
dataset contains intentional real-world issues:
- Negative Age (-30) → impossible biologically
- Negative Salary (-70000,
-1000) →
invalid financial data
- Missing Values (Age, Score,
JoinDate) →
incomplete observations
- Out-of-Range Age (200) → logical violation
- Invalid Dates (2022-13-01,
2022-06-31) →
calendar errors
Why These Errors Destroy Scientific Integrity
Data
errors are not just technical issues they are business risks disguised as
numbers. Imagine calculating average salary including negative values. The
result becomes misleading, and any downstream decision—budgeting, forecasting,
or compensation benchmarking collapses.
Missing
values create silent bias. For example, if high performers have missing scores,
your analysis might underestimate performance trends. In regulated industries
like clinical trials, missing or incorrect data can invalidate entire studies.
Range
violations, such as age = 200, break statistical assumptions. Models depend on
realistic distributions. Outliers like this distort mean, variance, and
regression coefficients, leading to unstable predictions.
Date
errors are even more dangerous. Time-based analytics retention, cohort
analysis, survival analysis depend on chronological accuracy. A single invalid
date can misplace records across timelines, corrupting duration calculations
using functions like INTCK and INTNX.
Negative
salary values reflect upstream system failures ETL issues, incorrect
transformations, or data entry mistakes. Ignoring them signals poor data
governance.
In
essence, dirty data erodes trust, and trust is the currency of
analytics. Without trust, dashboards become decoration rather than decision
tools.
Phase 2: Step-by-Step SAS
Mastery
1. PROC SORT – Organizing Chaos
Business Logic
Sorting
is the first act of control. When datasets arrive unsorted, patterns are
hidden. Duplicate detection, merging, and BY-group processing all depend on
ordering. Think of sorting as arranging files in a cabinet before auditing
them.
PROC SORT DATA=raw_errors OUT=sorted_data;
BY ID;
RUN;
Proc print data=sorted_data;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 |
| 2 | 2 | Mary | -30 | 60000 | 2022-02-20 | 90 |
| 3 | 3 | Steve | 45 | -70000 | 2022-03-10 | 88 |
| 4 | 4 | Anna | . | 55000 | 2022-04-12 | . |
| 5 | 5 | Mike | 200 | 45000 | 2022-05-05 | 75 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 |
| 9 | 9 | Tom | 31 | 65000 | 89 | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 |
| 12 | 12 | Ravi | 30 | -1000 | 2022-10-11 | 85 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 |
| 14 | 14 | Arun | . | 61000 | 2022-12-05 | 88 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 |
Always
sort before merging datasets to avoid silent mismatches.
Key Takeaways
- Required for BY-group operations
- Prevents merge errors
- Improves readability
2. DATA Step – Fix Negative Values Using ABS
Business Logic
Negative
salaries are logically invalid. Using ABS() converts them to positive values.
This assumes the issue is sign inversion, not corruption.
DATA fix_salary;
SET sorted_data;
Salary = ABS(Salary);
RUN;
Proc print data=fix_salary;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 |
| 2 | 2 | Mary | -30 | 60000 | 2022-02-20 | 90 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 |
| 4 | 4 | Anna | . | 55000 | 2022-04-12 | . |
| 5 | 5 | Mike | 200 | 45000 | 2022-05-05 | 75 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 |
| 9 | 9 | Tom | 31 | 65000 | 89 | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 |
| 14 | 14 | Arun | . | 61000 | 2022-12-05 | 88 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 |
Validate
whether negatives are true errors before blindly applying ABS().
Key Takeaways
- ABS() ensures positivity
- Useful for financial data
cleaning
- Always validate assumptions
3. DATA Step – Handle Missing Values Using COALESCE
Business Logic
Missing
values distort aggregations. COALESCE() replaces missing with defaults or
fallback values.
DATA fix_missing;
SET fix_salary;
Age = COALESCE(Age, 30);
Score = COALESCE(Score, 85);
RUN;
Proc print data=fix_missing;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 |
| 2 | 2 | Mary | -30 | 60000 | 2022-02-20 | 90 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 |
| 4 | 4 | Anna | 30 | 55000 | 2022-04-12 | 85 |
| 5 | 5 | Mike | 200 | 45000 | 2022-05-05 | 75 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 |
| 9 | 9 | Tom | 31 | 65000 | 89 | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 |
| 14 | 14 | Arun | 30 | 61000 | 2022-12-05 | 88 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 |
Use
domain knowledge when imputing—not arbitrary values.
Key Takeaways
- Prevents NULL propagation
- Maintains dataset
completeness
- Supports statistical
stability
4. DATA Step – Fix Range Violations
Business Logic
Age must
be realistic. Values outside 18–65 are capped.
DATA fix_range;
SET fix_missing;
IF Age > 100 THEN Age = 60;
IF Age < 18 THEN Age = 25;
RUN;
Proc print data=fix_range;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 |
| 2 | 2 | Mary | 25 | 60000 | 2022-02-20 | 90 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 |
| 4 | 4 | Anna | 30 | 55000 | 2022-04-12 | 85 |
| 5 | 5 | Mike | 60 | 45000 | 2022-05-05 | 75 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 |
| 9 | 9 | Tom | 31 | 65000 | 89 | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 |
| 14 | 14 | Arun | 30 | 61000 | 2022-12-05 | 88 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 |
Use
business rules, not arbitrary thresholds.
Key Takeaways
- Enforces logical constraints
- Prevents outlier distortion
- Essential for modeling
5. DATE Handling with INPUT & FORMAT
Business Logic
Dates
must be numeric SAS dates for time analysis.
DATA fix_date;
SET fix_range;
JoinDate_clean = INPUT(JoinDate, YYMMDD10.);
FORMAT JoinDate_clean DATE9.;
RUN;
Proc print data=fix_date;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean |
|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 | 15JAN2022 |
| 2 | 2 | Mary | 25 | 60000 | 2022-02-20 | 90 | 20FEB2022 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 | 10MAR2022 |
| 4 | 4 | Anna | 30 | 55000 | 2022-04-12 | 85 | 12APR2022 |
| 5 | 5 | Mike | 60 | 45000 | 2022-05-05 | 75 | 05MAY2022 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 | 25JUN2022 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 | . |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 | 15JUL2022 |
| 9 | 9 | Tom | 31 | 65000 | 89 | . | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 | 20AUG2022 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 | 10SEP2022 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 | 11OCT2022 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 | 01NOV2022 |
| 14 | 14 | Arun | 30 | 61000 | 2022-12-05 | 88 | 05DEC2022 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 | . |
Invalid
dates become missing handle them separately.
Key Takeaways
- Converts character to
numeric
- Enables time calculations
- Essential for INTNX, INTCK
6. PROC MEANS – Validation Check
Business Logic
After
cleaning, validate distributions.
PROC MEANS DATA=fix_date;
VAR Age Salary Score;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
Age Salary Score | 15 15 15 | 32.1333333 53600.00 87.5333333 | 9.1172260 22846.69 5.1111456 | 25.0000000 0 75.0000000 | 60.0000000 72000.00 95.0000000 |
Always
validate after transformation.
Key Takeaways
- Detect anomalies
- Confirm corrections
- Quick sanity check
7. FORMAT – Categorization
Business Logic
Categorizing
salary helps reporting.
PROC FORMAT;
VALUE salfmt LOW-50000='Low'
50001-70000='Medium'
70001-HIGH='High';
RUN;
LOG:
Key Takeaways
- Improves interpretability
- Supports reporting
8. APPLY FORMAT
DATA formatted;
SET fix_date;
FORMAT Salary salfmt.;
RUN;
Proc print data=formatted;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean |
|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 2022-01-15 | 85 | 15JAN2022 |
| 2 | 2 | Mary | 25 | Medium | 2022-02-20 | 90 | 20FEB2022 |
| 3 | 3 | Steve | 45 | Medium | 2022-03-10 | 88 | 10MAR2022 |
| 4 | 4 | Anna | 30 | Medium | 2022-04-12 | 85 | 12APR2022 |
| 5 | 5 | Mike | 60 | Low | 2022-05-05 | 75 | 05MAY2022 |
| 6 | 6 | Sara | 29 | Low | 2022-06-25 | 92 | 25JUN2022 |
| 7 | 7 | David | 34 | High | 2022-13-01 | 80 | . |
| 8 | 8 | Lisa | 28 | Medium | 2022-07-15 | 95 | 15JUL2022 |
| 9 | 9 | Tom | 31 | Medium | 89 | . | |
| 10 | 10 | Emma | 27 | Medium | 2022-08-20 | 91 | 20AUG2022 |
| 11 | 11 | Raj | 26 | Medium | 2022-09-10 | 87 | 10SEP2022 |
| 12 | 12 | Ravi | 30 | Low | 2022-10-11 | 85 | 11OCT2022 |
| 13 | 13 | Priya | 29 | Medium | 2022-11-01 | 93 | 01NOV2022 |
| 14 | 14 | Arun | 30 | Medium | 2022-12-05 | 88 | 05DEC2022 |
| 15 | 15 | Kiran | 33 | Medium | 2022-06-31 | 90 | . |
9. PROC TRANSPOSE
Business Logic
Reshapes
data for reporting or modeling.
PROC TRANSPOSE DATA=formatted OUT=transposed;
BY ID;
VAR Age Salary Score;
RUN;
Proc print data=transposed;
run;
OUTPUT:
| Obs | ID | _NAME_ | COL1 |
|---|---|---|---|
| 1 | 1 | Age | 25 |
| 2 | 1 | Salary | 50000 |
| 3 | 1 | Score | 85 |
| 4 | 2 | Age | 25 |
| 5 | 2 | Salary | 60000 |
| 6 | 2 | Score | 90 |
| 7 | 3 | Age | 45 |
| 8 | 3 | Salary | 70000 |
| 9 | 3 | Score | 88 |
| 10 | 4 | Age | 30 |
| 11 | 4 | Salary | 55000 |
| 12 | 4 | Score | 85 |
| 13 | 5 | Age | 60 |
| 14 | 5 | Salary | 45000 |
| 15 | 5 | Score | 75 |
| 16 | 6 | Age | 29 |
| 17 | 6 | Salary | 0 |
| 18 | 6 | Score | 92 |
| 19 | 7 | Age | 34 |
| 20 | 7 | Salary | 72000 |
| 21 | 7 | Score | 80 |
| 22 | 8 | Age | 28 |
| 23 | 8 | Salary | 68000 |
| 24 | 8 | Score | 95 |
| 25 | 9 | Age | 31 |
| 26 | 9 | Salary | 65000 |
| 27 | 9 | Score | 89 |
| 28 | 10 | Age | 27 |
| 29 | 10 | Salary | 70000 |
| 30 | 10 | Score | 91 |
| 31 | 11 | Age | 26 |
| 32 | 11 | Salary | 62000 |
| 33 | 11 | Score | 87 |
| 34 | 12 | Age | 30 |
| 35 | 12 | Salary | 1000 |
| 36 | 12 | Score | 85 |
| 37 | 13 | Age | 29 |
| 38 | 13 | Salary | 58000 |
| 39 | 13 | Score | 93 |
| 40 | 14 | Age | 30 |
| 41 | 14 | Salary | 61000 |
| 42 | 14 | Score | 88 |
| 43 | 15 | Age | 33 |
| 44 | 15 | Salary | 67000 |
| 45 | 15 | Score | 90 |
10. PROC APPEND
PROC APPEND BASE=fix_date
DATA=formatted;
RUN;
Proc print data=fix_date;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean |
|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 | 15JAN2022 |
| 2 | 2 | Mary | 25 | 60000 | 2022-02-20 | 90 | 20FEB2022 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 | 10MAR2022 |
| 4 | 4 | Anna | 30 | 55000 | 2022-04-12 | 85 | 12APR2022 |
| 5 | 5 | Mike | 60 | 45000 | 2022-05-05 | 75 | 05MAY2022 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 | 25JUN2022 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 | . |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 | 15JUL2022 |
| 9 | 9 | Tom | 31 | 65000 | 89 | . | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 | 20AUG2022 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 | 10SEP2022 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 | 11OCT2022 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 | 01NOV2022 |
| 14 | 14 | Arun | 30 | 61000 | 2022-12-05 | 88 | 05DEC2022 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 | . |
| 16 | 1 | John | 25 | 50000 | 2022-01-15 | 85 | 15JAN2022 |
| 17 | 2 | Mary | 25 | 60000 | 2022-02-20 | 90 | 20FEB2022 |
| 18 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 | 10MAR2022 |
| 19 | 4 | Anna | 30 | 55000 | 2022-04-12 | 85 | 12APR2022 |
| 20 | 5 | Mike | 60 | 45000 | 2022-05-05 | 75 | 05MAY2022 |
| 21 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 | 25JUN2022 |
| 22 | 7 | David | 34 | 72000 | 2022-13-01 | 80 | . |
| 23 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 | 15JUL2022 |
| 24 | 9 | Tom | 31 | 65000 | 89 | . | |
| 25 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 | 20AUG2022 |
| 26 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 | 10SEP2022 |
| 27 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 | 11OCT2022 |
| 28 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 | 01NOV2022 |
| 29 | 14 | Arun | 30 | 61000 | 2022-12-05 | 88 | 05DEC2022 |
| 30 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 | . |
11. STRING CLEANING (TRIM/STRIP/PROPCASE)
DATA clean_names;
SET formatted;
Name = PROPCASE(STRIP(Name));
RUN;
Proc print data=clean_names;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean |
|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 2022-01-15 | 85 | 15JAN2022 |
| 2 | 2 | Mary | 25 | Medium | 2022-02-20 | 90 | 20FEB2022 |
| 3 | 3 | Steve | 45 | Medium | 2022-03-10 | 88 | 10MAR2022 |
| 4 | 4 | Anna | 30 | Medium | 2022-04-12 | 85 | 12APR2022 |
| 5 | 5 | Mike | 60 | Low | 2022-05-05 | 75 | 05MAY2022 |
| 6 | 6 | Sara | 29 | Low | 2022-06-25 | 92 | 25JUN2022 |
| 7 | 7 | David | 34 | High | 2022-13-01 | 80 | . |
| 8 | 8 | Lisa | 28 | Medium | 2022-07-15 | 95 | 15JUL2022 |
| 9 | 9 | Tom | 31 | Medium | 89 | . | |
| 10 | 10 | Emma | 27 | Medium | 2022-08-20 | 91 | 20AUG2022 |
| 11 | 11 | Raj | 26 | Medium | 2022-09-10 | 87 | 10SEP2022 |
| 12 | 12 | Ravi | 30 | Low | 2022-10-11 | 85 | 11OCT2022 |
| 13 | 13 | Priya | 29 | Medium | 2022-11-01 | 93 | 01NOV2022 |
| 14 | 14 | Arun | 30 | Medium | 2022-12-05 | 88 | 05DEC2022 |
| 15 | 15 | Kiran | 33 | Medium | 2022-06-31 | 90 | . |
12. INTNX – Date Increment
DATA add_month;
SET clean_names;
Next_Date = INTNX('MONTH', JoinDate_clean, 1);
FORMAT Next_Date DATE9.;
RUN;
Proc print data=add_month;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean | Next_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 2022-01-15 | 85 | 15JAN2022 | 01FEB2022 |
| 2 | 2 | Mary | 25 | Medium | 2022-02-20 | 90 | 20FEB2022 | 01MAR2022 |
| 3 | 3 | Steve | 45 | Medium | 2022-03-10 | 88 | 10MAR2022 | 01APR2022 |
| 4 | 4 | Anna | 30 | Medium | 2022-04-12 | 85 | 12APR2022 | 01MAY2022 |
| 5 | 5 | Mike | 60 | Low | 2022-05-05 | 75 | 05MAY2022 | 01JUN2022 |
| 6 | 6 | Sara | 29 | Low | 2022-06-25 | 92 | 25JUN2022 | 01JUL2022 |
| 7 | 7 | David | 34 | High | 2022-13-01 | 80 | . | . |
| 8 | 8 | Lisa | 28 | Medium | 2022-07-15 | 95 | 15JUL2022 | 01AUG2022 |
| 9 | 9 | Tom | 31 | Medium | 89 | . | . | |
| 10 | 10 | Emma | 27 | Medium | 2022-08-20 | 91 | 20AUG2022 | 01SEP2022 |
| 11 | 11 | Raj | 26 | Medium | 2022-09-10 | 87 | 10SEP2022 | 01OCT2022 |
| 12 | 12 | Ravi | 30 | Low | 2022-10-11 | 85 | 11OCT2022 | 01NOV2022 |
| 13 | 13 | Priya | 29 | Medium | 2022-11-01 | 93 | 01NOV2022 | 01DEC2022 |
| 14 | 14 | Arun | 30 | Medium | 2022-12-05 | 88 | 05DEC2022 | 01JAN2023 |
| 15 | 15 | Kiran | 33 | Medium | 2022-06-31 | 90 | . | . |
13. INTCK – Duration
DATA duration;
SET add_month;
Months = INTCK('MONTH', JoinDate_clean, TODAY());
RUN;
Proc print data=duration;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean | Next_Date | Months |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 2022-01-15 | 85 | 15JAN2022 | 01FEB2022 | 51 |
| 2 | 2 | Mary | 25 | Medium | 2022-02-20 | 90 | 20FEB2022 | 01MAR2022 | 50 |
| 3 | 3 | Steve | 45 | Medium | 2022-03-10 | 88 | 10MAR2022 | 01APR2022 | 49 |
| 4 | 4 | Anna | 30 | Medium | 2022-04-12 | 85 | 12APR2022 | 01MAY2022 | 48 |
| 5 | 5 | Mike | 60 | Low | 2022-05-05 | 75 | 05MAY2022 | 01JUN2022 | 47 |
| 6 | 6 | Sara | 29 | Low | 2022-06-25 | 92 | 25JUN2022 | 01JUL2022 | 46 |
| 7 | 7 | David | 34 | High | 2022-13-01 | 80 | . | . | . |
| 8 | 8 | Lisa | 28 | Medium | 2022-07-15 | 95 | 15JUL2022 | 01AUG2022 | 45 |
| 9 | 9 | Tom | 31 | Medium | 89 | . | . | . | |
| 10 | 10 | Emma | 27 | Medium | 2022-08-20 | 91 | 20AUG2022 | 01SEP2022 | 44 |
| 11 | 11 | Raj | 26 | Medium | 2022-09-10 | 87 | 10SEP2022 | 01OCT2022 | 43 |
| 12 | 12 | Ravi | 30 | Low | 2022-10-11 | 85 | 11OCT2022 | 01NOV2022 | 42 |
| 13 | 13 | Priya | 29 | Medium | 2022-11-01 | 93 | 01NOV2022 | 01DEC2022 | 41 |
| 14 | 14 | Arun | 30 | Medium | 2022-12-05 | 88 | 05DEC2022 | 01JAN2023 | 40 |
| 15 | 15 | Kiran | 33 | Medium | 2022-06-31 | 90 | . | . | . |
14. MACRO – Automated Cleaning
%MACRO clean_data(ds);
DATA &ds._clean;
SET &ds;
Salary = ABS(Salary);
RUN;
Proc print data=&ds._clean;
run;
%MEND;
%clean_data(raw_errors);
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 |
| 2 | 2 | Mary | -30 | 60000 | 2022-02-20 | 90 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 |
| 4 | 4 | Anna | . | 55000 | 2022-04-12 | . |
| 5 | 5 | Mike | 200 | 45000 | 2022-05-05 | 75 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 |
| 9 | 9 | Tom | 31 | 65000 | 89 | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 |
| 14 | 14 | Arun | . | 61000 | 2022-12-05 | 88 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 |
15. FINAL DATA STEP
DATA final_clean;
SET duration;
RUN;
Proc print data=final_clean;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean | Next_Date | Months |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 2022-01-15 | 85 | 15JAN2022 | 01FEB2022 | 51 |
| 2 | 2 | Mary | 25 | Medium | 2022-02-20 | 90 | 20FEB2022 | 01MAR2022 | 50 |
| 3 | 3 | Steve | 45 | Medium | 2022-03-10 | 88 | 10MAR2022 | 01APR2022 | 49 |
| 4 | 4 | Anna | 30 | Medium | 2022-04-12 | 85 | 12APR2022 | 01MAY2022 | 48 |
| 5 | 5 | Mike | 60 | Low | 2022-05-05 | 75 | 05MAY2022 | 01JUN2022 | 47 |
| 6 | 6 | Sara | 29 | Low | 2022-06-25 | 92 | 25JUN2022 | 01JUL2022 | 46 |
| 7 | 7 | David | 34 | High | 2022-13-01 | 80 | . | . | . |
| 8 | 8 | Lisa | 28 | Medium | 2022-07-15 | 95 | 15JUL2022 | 01AUG2022 | 45 |
| 9 | 9 | Tom | 31 | Medium | 89 | . | . | . | |
| 10 | 10 | Emma | 27 | Medium | 2022-08-20 | 91 | 20AUG2022 | 01SEP2022 | 44 |
| 11 | 11 | Raj | 26 | Medium | 2022-09-10 | 87 | 10SEP2022 | 01OCT2022 | 43 |
| 12 | 12 | Ravi | 30 | Low | 2022-10-11 | 85 | 11OCT2022 | 01NOV2022 | 42 |
| 13 | 13 | Priya | 29 | Medium | 2022-11-01 | 93 | 01NOV2022 | 01DEC2022 | 41 |
| 14 | 14 | Arun | 30 | Medium | 2022-12-05 | 88 | 05DEC2022 | 01JAN2023 | 40 |
| 15 | 15 | Kiran | 33 | Medium | 2022-06-31 | 90 | . | . | . |
16. Master Dataset
PROC SORT DATA=raw_errors OUT=sorted;
BY ID;
RUN;
LOG:
DATA cleaned;
SET sorted;
Salary = ABS(Salary);
Age = COALESCE(Age,30);
Score = COALESCE(Score,85);
IF Age > 100 THEN Age = 60;
IF Age < 18 THEN Age = 25;
JoinDate_clean = INPUT(JoinDate, YYMMDD10.);
FORMAT JoinDate_clean DATE9.;
Name = PROPCASE(STRIP(Name));
Next_Date = INTNX('MONTH', JoinDate_clean, 1);
Months = INTCK('MONTH', JoinDate_clean, TODAY());
RUN;
Proc print data=cleaned;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | JoinDate | Score | JoinDate_clean | Next_Date | Months |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 2022-01-15 | 85 | 15JAN2022 | 22677 | 51 |
| 2 | 2 | Mary | 25 | 60000 | 2022-02-20 | 90 | 20FEB2022 | 22705 | 50 |
| 3 | 3 | Steve | 45 | 70000 | 2022-03-10 | 88 | 10MAR2022 | 22736 | 49 |
| 4 | 4 | Anna | 30 | 55000 | 2022-04-12 | 85 | 12APR2022 | 22766 | 48 |
| 5 | 5 | Mike | 60 | 45000 | 2022-05-05 | 75 | 05MAY2022 | 22797 | 47 |
| 6 | 6 | Sara | 29 | 0 | 2022-06-25 | 92 | 25JUN2022 | 22827 | 46 |
| 7 | 7 | David | 34 | 72000 | 2022-13-01 | 80 | . | . | . |
| 8 | 8 | Lisa | 28 | 68000 | 2022-07-15 | 95 | 15JUL2022 | 22858 | 45 |
| 9 | 9 | Tom | 31 | 65000 | 89 | . | . | . | |
| 10 | 10 | Emma | 27 | 70000 | 2022-08-20 | 91 | 20AUG2022 | 22889 | 44 |
| 11 | 11 | Raj | 26 | 62000 | 2022-09-10 | 87 | 10SEP2022 | 22919 | 43 |
| 12 | 12 | Ravi | 30 | 1000 | 2022-10-11 | 85 | 11OCT2022 | 22950 | 42 |
| 13 | 13 | Priya | 29 | 58000 | 2022-11-01 | 93 | 01NOV2022 | 22980 | 41 |
| 14 | 14 | Arun | 30 | 61000 | 2022-12-05 | 88 | 05DEC2022 | 23011 | 40 |
| 15 | 15 | Kiran | 33 | 67000 | 2022-06-31 | 90 | . | . | . |
- Always validate before and
after cleaning
- Use macros for scalability
- Avoid hardcoding thresholds
- Separate raw and cleaned
datasets
- Log all transformations
- Use formats for reporting
- Missing ≠ zero
- Dates must be numeric
- Sorting improves joins
- Validate ranges early
- Use PROC FREQ for
categorical checks
- Keep audit trails
- Avoid overwriting raw data
- Use labels for clarity
- Standardize naming
conventions
- Check duplicates
- Use indexing for large data
- Optimize memory usage
- Automate repetitive tasks
- Always think like a data
auditor
18. Business Context
In a
corporate environment, poor data quality directly translates into financial
loss. Imagine a retail company calculating employee bonuses using incorrect
salary data. Negative or missing values could lead to underpayment or
overpayment, both of which create operational and legal risks.
Similarly,
HR analytics depends heavily on accurate age and tenure data. Incorrect join
dates distort attrition analysis, workforce planning, and promotion cycles. A
company might mistakenly assume high employee turnover due to flawed duration
calculations.
In
financial reporting, even a small percentage of erroneous data can lead to
incorrect forecasts. Executives rely on dashboards built from datasets like this.
If the underlying data is flawed, strategic decisions—such as expansion,
hiring, or investment become misinformed.
Data
cleaning using SAS ensures consistency, reliability, and auditability. By
applying structured transformations (ABS, COALESCE, INTNX, INTCK),
organizations create standardized pipelines. These pipelines reduce manual
intervention, minimize human error, and improve reproducibility.
Moreover,
regulatory industries like pharmaceuticals require strict compliance. Clean
datasets are essential for submissions, audits, and validation processes. A
well-designed SAS cleaning workflow ensures traceability every change is
documented and reproducible.
- Raw data is inherently
messy, often containing hidden errors that silently corrupt analysis if
left untreated.
- The first step is always data
profiling, where you understand distributions, missingness, and
anomalies using procedures like PROC MEANS and PROC FREQ.
- Sorting the dataset using PROC
SORT establishes structural order, which is essential for merges,
deduplication, and BY-group processing.
- Identifying negative
values in non-negative fields (like salary) is critical, as they
indicate data entry or system errors.
- Functions like ABS() help
normalize such anomalies, converting incorrect negative values into usable
positive numbers.
- Missing values are dangerous
because they propagate through calculations and distort averages, totals,
and model outputs.
- Using COALESCE() ensures
that missing values are replaced with logical defaults or derived values,
maintaining dataset completeness.
- Range validation ensures
that variables like age fall within realistic human limits, preventing
statistical distortion.
- Conditional logic (IF-THEN)
enforces business rules, correcting outliers such as unrealistic ages
(e.g., 200 years).
- Date variables must be
converted from character to numeric using INPUT() to enable time-based
analytics.
- Invalid dates automatically
become missing in SAS, making it easier to isolate and correct them.
- Functions like INTNX() allow
controlled date increments, which are essential for forecasting and
scheduling analysis.
- INTCK() calculates time
intervals, enabling tenure analysis, cohort tracking, and lifecycle
insights.
- String inconsistencies
(extra spaces, inconsistent casing) are resolved using STRIP(), TRIM(),
and PROPCASE().
- Applying PROC FORMAT
transforms raw numeric values into meaningful categories, improving
interpretability for stakeholders.
- PROC TRANSPOSE reshapes data
structures, making datasets compatible with reporting tools and
statistical models.
- PROC APPEND enables scalable
data integration by efficiently combining datasets without rewriting code.
- SAS macros automate repetitive
cleaning tasks, ensuring consistency and reducing manual coding effort.
- Validation after each
transformation step ensures that corrections are accurate and no new
errors are introduced.
- The final cleaned dataset
becomes analysis-ready, enabling reliable reporting, accurate
modeling, and confident business decision-making.
20. Summary & Conclusion
This
project demonstrates a fundamental truth: data cleaning is not a preliminary
step it is the foundation of analytics. Without it, even the most sophisticated
models fail.
We
started with a chaotic dataset filled with negative values, missing entries,
invalid dates, and logical inconsistencies. Step by step, we transformed it
into a structured, reliable dataset using SAS techniques.
Functions
like ABS() corrected financial anomalies, while COALESCE() ensured
completeness. Date functions such as INTNX and INTCK enabled temporal analysis.
String functions (STRIP, PROPCASE) standardized textual data. Procedures like PROC
SORT, PROC MEANS, and PROC TRANSPOSE enhanced structure and validation.
More
importantly, every transformation was guided by business logic, not just
technical execution. This is what separates a programmer from a data
professional.
In
real-world scenarios, data is messy by default. Your job is to impose order,
enforce rules, and ensure trust. Once data becomes reliable, everything else analysis,
modeling, reporting becomes exponentially easier.
Think of
data cleaning as building a strong foundation. If the foundation is weak, the
entire structure collapses. But if it is solid, you can build anything on top
of it.
21. Interview Preparation
1. Why use ABS() in data cleaning?
Answer: To correct negative numeric
values when they are logically invalid, especially in financial data.
2. Difference between COALESCE and IF-THEN?
Answer: COALESCE handles missing values
efficiently across multiple variables, while IF-THEN is conditional logic.
3. Why convert dates using INPUT()?
Answer: SAS requires numeric dates for
calculations like duration and intervals.
4. What is INTNX vs INTCK?
Answer: INTNX increments dates; INTCK
counts intervals.
5. Why avoid overwriting raw data?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DATA ERROR FIXING.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment