443. The Art Of Turning Imperfect Data Into Powerful Business Intelligence With SAS
Can A Step-By-Step SAS Data Cleaning Project Turn Messy Data Into Reliable Business Intelligence?
Introduction: Why Data Cleaning Is The Real Intelligence Layer
If raw data is the “fuel,” then data cleaning is the refinery. Without it, even the most advanced analytics (machine learning, forecasting, or regulatory reporting) collapses under silent inaccuracies.
In real-world clinical, banking, or retail systems, datasets are rarely pristine. They arrive fragmented, inconsistent, and sometimes outright misleading. As a SAS programmer, your value isn’t just writing code; it’s restoring data integrity.
In this project, we will simulate a real-world SAS data cleaning project using a deliberately flawed dataset. Then, step by step, we will transform it into a business-ready, validated dataset.
The Raw Dataset (With Intentional Errors)
SAS Code (DATALINES)
DATA raw_data;
INPUT ID Name $ Age Salary Join_Date : DATE9. Score;
FORMAT Join_Date DATE9.;
DATALINES;
1 John 25 50000 01JAN2020 80
2 Alice -30 60000 15FEB2021 90
3 Bob 45 -70000 10MAR2020 85
4 Eva 200 80000 25APR2019 .
5 Mike 35 55000 01JAN1800 88
6 Sara . 62000 12JUN2021 92
7 Tom 28 48000 15JUL2022 -10
8 Nina 32 0 20AUG2021 87
9 Raj 29 52000 . 91
10 Lee 40 70000 05SEP2020 89
11 Zara 27 65000 11OCT2021 93
12 Arun -5 40000 12DEC2020 75
13 Kiran 33 58000 18JAN2022 82
14 Priya 31 60000 20FEB2021 86
15 Dev 29 -1000 25MAR2020 90
;
RUN;
PROC PRINT DATA=raw_data;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | -30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | -70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | 200 | 80000 | 25APR2019 | . |
| 5 | 5 | Mike | 35 | 55000 | 01JAN1800 | 88 |
| 6 | 6 | Sara | . | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | -10 |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | . | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | -5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | -1000 | 25MAR2020 | 90 |
Equivalent R Code
raw_data <- data.frame(
ID = c(1:15),
Name = c("John","Alice","Bob","Eva","Mike","Sara","Tom",
"Nina","Raj","Lee","Zara","Arun","Kiran","Priya",
"Dev"),
Age = c(25,-30,45,200,35,NA,28,32,29,40,27,-5,33,31,29),
Salary = c(50000,60000,-70000,80000,55000,62000,48000,0,
52000,70000,65000,40000,58000,60000,-1000),
Join_Date = as.Date(c("2020-01-01","2021-02-15","2020-03-10",
"2019-04-25","1800-01-01","2021-06-12",
"2022-07-15","2021-08-20",NA,"2020-09-05",
"2021-10-11","2020-12-12","2022-01-18",
"2021-02-20","2020-03-25")),
Score = c(80,90,85,NA,88,92,-10,87,91,89,93,75,82,86,90)
)
OUTPUT:
|   | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01-01-2020 | 80 |
| 2 | 2 | Alice | -30 | 60000 | 15-02-2021 | 90 |
| 3 | 3 | Bob | 45 | -70000 | 10-03-2020 | 85 |
| 4 | 4 | Eva | 200 | 80000 | 25-04-2019 | NA |
| 5 | 5 | Mike | 35 | 55000 | 1800-01-01 | 88 |
| 6 | 6 | Sara | NA | 62000 | 12-06-2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15-07-2022 | -10 |
| 8 | 8 | Nina | 32 | 0 | 20-08-2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | NA | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05-09-2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11-10-2021 | 93 |
| 12 | 12 | Arun | -5 | 40000 | 12-12-2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18-01-2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20-02-2021 | 86 |
| 15 | 15 | Dev | 29 | -1000 | 25-03-2020 | 90 |
Phase 1: Discovery & Chaos
5 Critical Errors in the Dataset
- Negative Age (Alice, Arun)
- Negative Salary (Bob, Dev)
- Impossible Age (Eva = 200)
- Missing Values (Sara's Age, Raj's Join_Date, Eva's Score)
- Invalid Score Range (Tom = -10)
Why These Errors Destroy Scientific Integrity
In any analytical pipeline, data validity is the foundation of trust. When datasets contain logical inconsistencies, such as negative ages or salaries, they violate domain constraints, making downstream statistical models unreliable. Imagine calculating average employee age when one record is -30: the mean becomes distorted, leading to incorrect demographic insights.
Similarly, missing data introduces bias. If high-performing individuals disproportionately have missing scores, your performance metrics will be skewed downward. This leads to flawed business decisions, such as underestimating workforce productivity.
Extreme outliers, like an age of 200, can significantly inflate variance. In SAS procedures like PROC MEANS, this distorts the standard deviation, affecting confidence intervals and predictive modeling assumptions.
Temporal inconsistencies, such as a join date in the year 1800, break time-series logic. Functions like INTCK and INTNX rely on valid date sequences. Invalid dates corrupt tenure calculations, which are often used in churn prediction and HR analytics.
Finally, invalid ranges (e.g., a score of -10) undermine data governance policies. If such data passes through regulated pipelines (e.g., clinical trials), it can lead to compliance failures.
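The mean distortion is easy to quantify. Here is a minimal Python sketch (Python is used only for illustration; the pipeline itself is SAS) using the ages from the raw dataset above, excluding Sara's missing value:

```python
# Ages from the raw dataset, including the invalid entries (-30, 200, -5)
ages = [25, -30, 45, 200, 35, 28, 32, 29, 40, 27, -5, 33, 31, 29]

raw_mean = sum(ages) / len(ages)

# Keep only logically possible ages (domain rule: 0 <= Age <= 100)
valid = [a for a in ages if 0 <= a <= 100]
clean_mean = sum(valid) / len(valid)

print(round(raw_mean, 2))    # distorted by -30, 200, and -5
print(round(clean_mean, 2))  # plausible workforce average
```

A handful of impossible values shifts the average by several years, which is exactly the kind of silent error that propagates into demographic reporting.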
Phase 2: Step-by-Step SAS Mastery
1. PROC SORT – Structuring Errors
Business Logic
Sorting is the first stabilization step. Think of it like organizing scattered medical records before diagnosis. Without order, duplicate detection, merging, and validation become inefficient.
PROC SORT DATA=raw_data OUT=sorted_data;
BY ID;
RUN;
PROC PRINT DATA=sorted_data;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | -30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | -70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | 200 | 80000 | 25APR2019 | . |
| 5 | 5 | Mike | 35 | 55000 | 01JAN1800 | 88 |
| 6 | 6 | Sara | . | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | -10 |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | . | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | -5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | -1000 | 25MAR2020 | 90 |
Sorting improves performance for BY-group processing.
Key Takeaways
- Required before MERGE
- Improves readability
- Enables BY-group logic
2. Handling Negative Values Using ABS()
Business Logic
Negative salary or age values violate domain constraints. Using ABS() normalizes these anomalies without discarding data.
DATA clean1;
SET sorted_data;
Age = ABS(Age);
Salary = ABS(Salary);
RUN;
PROC PRINT DATA=clean1;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | 200 | 80000 | 25APR2019 | . |
| 5 | 5 | Mike | 35 | 55000 | 01JAN1800 | 88 |
| 6 | 6 | Sara | . | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | -10 |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | . | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
Use ABS() cautiously; sometimes negatives indicate deeper issues.
Key Takeaways
- Converts invalid negatives
- Preserves records
- Quick fix, not a root-cause solution
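The same normalization can be sketched in Python (the function name `normalize_negatives` is mine, for illustration only):

```python
def normalize_negatives(record):
    """Flip the sign on negative Age/Salary, mirroring SAS ABS().
    A quick fix only -- negatives may signal an upstream bug."""
    out = dict(record)
    for field in ("Age", "Salary"):
        if out[field] is not None and out[field] < 0:
            out[field] = abs(out[field])
    return out

row = {"ID": 2, "Name": "Alice", "Age": -30, "Salary": 60000}
print(normalize_negatives(row)["Age"])  # 30
```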
3. Handling Missing Values with COALESCE
Business Logic
Missing values are silent killers. COALESCE fills gaps intelligently.
DATA clean2;
SET clean1;
Age = COALESCE(Age,30);
Score = COALESCE(Score,85);
RUN;
PROC PRINT DATA=clean2;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | 200 | 80000 | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | 55000 | 01JAN1800 | 88 |
| 6 | 6 | Sara | 30 | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | -10 |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | . | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
Use domain-specific defaults, not arbitrary values.
Key Takeaways
- Prevents null propagation
- Improves model stability
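COALESCE simply returns the first non-missing argument. A minimal Python equivalent (the defaults 30 and 85 are the same illustrative choices as in the SAS step, not recommendations):

```python
def coalesce(*values):
    """Return the first non-missing value, like SAS COALESCE()."""
    for v in values:
        if v is not None:
            return v
    return None

age = coalesce(None, 30)   # missing Age falls back to the default
score = coalesce(92, 85)   # present Score is kept as-is
print(age, score)          # 30 92
```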
4. Range Validation
Business Logic
Ensures Age and Score fall within logical limits.
DATA clean3;
SET clean2;
IF Age > 100 THEN Age = .;
IF Score < 0 OR Score > 100 THEN Score = .;
RUN;
PROC PRINT DATA=clean3;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | 80000 | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | 55000 | 01JAN1800 | 88 |
| 6 | 6 | Sara | 30 | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | . | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
Always validate ranges before analysis.
Key Takeaways
- Removes unrealistic values
- Protects statistical integrity
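The range checks can be sketched as a small Python helper (a hypothetical `validate_range` function, mirroring the two SAS IF statements):

```python
def validate_range(value, low, high):
    """Set out-of-range values to missing (None), as the SAS IF checks do."""
    if value is None:
        return None
    return value if low <= value <= high else None

print(validate_range(200, 0, 100))  # Eva's impossible age -> None
print(validate_range(-10, 0, 100))  # Tom's invalid score -> None
print(validate_range(85, 0, 100))   # valid value passes through
```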
5. Date Correction using INTNX
Business Logic
Fix invalid historical dates.
DATA clean4;
SET clean3;
IF YEAR(Join_Date) < 2000 THEN
Join_Date = INTNX('year', TODAY(), -5);
FORMAT Join_Date DATE9.;
RUN;
PROC PRINT DATA=clean4;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | 80000 | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | 55000 | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
INTNX is essential for temporal adjustments. Note that a missing Join_Date also triggers this rule, because in SAS a missing value compares as less than any number; that is why Raj's missing date is imputed too. The exact imputed date depends on the run date, since it is derived from TODAY() minus five years.
Key Takeaways
- Corrects time anomalies
- Ensures realistic timelines
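The date repair can be approximated in Python with the standard library (the helpers `intnx_year` and `fix_join_date` are mine; INTNX with a 'year' interval and default alignment snaps to the beginning of the shifted year):

```python
from datetime import date

def intnx_year(base, increment):
    """Shift to 1 Jan of (base.year + increment), approximating
    SAS INTNX('year', base, increment) with default BEGINNING alignment."""
    return date(base.year + increment, 1, 1)

def fix_join_date(join_date, today):
    # In SAS a missing date compares as less than any year,
    # so missing dates are imputed too; mirror that here.
    if join_date is None or join_date.year < 2000:
        return intnx_year(today, -5)
    return join_date

print(fix_join_date(date(1800, 1, 1), date(2026, 3, 1)))  # 2021-01-01
```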
6. Standardizing Names using PROPCASE
DATA clean5;
SET clean4;
Name = PROPCASE(Name);
RUN;
PROC PRINT DATA=clean5;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | 80000 | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | 55000 | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
7. Removing Leading/Trailing Spaces (STRIP/TRIM)
DATA clean6;
SET clean5;
Name = STRIP(Name);
RUN;
PROC PRINT DATA=clean6;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | 80000 | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | 55000 | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
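Steps 6 and 7 together amount to trimming whitespace and normalizing case. A rough Python equivalent (note that `str.title()` is only an approximation of PROPCASE; they differ on names with apostrophes or hyphens):

```python
def standardize_name(name):
    """Trim stray whitespace and apply proper case,
    roughly mirroring SAS STRIP() + PROPCASE()."""
    return name.strip().title()

print(standardize_name("  john SMITH "))  # John Smith
```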
8. PROC MEANS – Statistical Validation
PROC MEANS DATA=clean6;
VAR Age Salary Score;
RUN;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Age | 14 | 29.9285714 | 8.8965359 | 5.0000000 | 45.0000000 |
| Salary | 15 | 51400.00 | 22893.54 | 0 | 80000.00 |
| Score | 14 | 86.6428571 | 5.0016481 | 75.0000000 | 93.0000000 |
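As a quick cross-check of the QA logic (not of SAS itself), the same N/mean summary can be reproduced in Python; note that PROC MEANS silently drops missing values, which is why N differs per variable:

```python
from statistics import mean

def qa_summary(values):
    """N / mean / min / max over non-missing values, like PROC MEANS."""
    obs = [v for v in values if v is not None]
    return {"N": len(obs), "Mean": mean(obs), "Min": min(obs), "Max": max(obs)}

# Score column of clean6 (Tom's invalid score is now missing)
scores = [80, 90, 85, 85, 88, 92, None, 87, 91, 89, 93, 75, 82, 86, 90]
s = qa_summary(scores)
print(s["N"], round(s["Mean"], 4))  # matches the PROC MEANS row for Score
```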
9. PROC FORMAT – Categorization
PROC FORMAT;
VALUE salaryfmt LOW-50000='Low'
50001-70000='Medium'
70001-HIGH='High';
RUN;
10. Applying Format
DATA clean7;
SET clean6;
FORMAT Salary salaryfmt.;
RUN;
PROC PRINT DATA=clean7;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | Medium | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | Medium | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | High | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | Medium | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | Medium | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | Low | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | Low | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | Medium | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | Medium | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | Medium | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | Low | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | Medium | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | Medium | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | Low | 25MAR2020 | 90 |
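The salaryfmt. banding rule can be expressed as a small Python function (a hypothetical `salary_band` helper; unlike a SAS format, this changes the value rather than just its display):

```python
def salary_band(salary):
    """Categorize salary the way the salaryfmt. format displays it:
    LOW-50000 -> Low, 50001-70000 -> Medium, 70001-HIGH -> High."""
    if salary is None:
        return None
    if salary <= 50000:
        return "Low"
    if salary <= 70000:
        return "Medium"
    return "High"

print(salary_band(48000), salary_band(62000), salary_band(80000))
```

A key advantage of PROC FORMAT is that the underlying numeric Salary is untouched; only its display changes.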
11. PROC TRANSPOSE
PROC TRANSPOSE DATA=clean7 OUT=transposed;
VAR Salary Score;
RUN;
PROC PRINT DATA=transposed;
RUN;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 | COL11 | COL12 | COL13 | COL14 | COL15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Salary | 50000 | 60000 | 70000 | 80000 | 55000 | 62000 | 48000 | 0 | 52000 | 70000 | 65000 | 40000 | 58000 | 60000 | 1000 |
| 2 | Score | 80 | 90 | 85 | 85 | 88 | 92 | . | 87 | 91 | 89 | 93 | 75 | 82 | 86 | 90 |
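The reshape performed by PROC TRANSPOSE (one output row per VAR variable, observations spread into COL1..COLn) can be sketched in Python:

```python
def transpose(rows, variables):
    """Turn one column per variable into one row per variable,
    like PROC TRANSPOSE with a VAR statement."""
    return [{"_NAME_": var,
             **{f"COL{i+1}": row[var] for i, row in enumerate(rows)}}
            for var in variables]

data = [{"Salary": 50000, "Score": 80}, {"Salary": 60000, "Score": 90}]
out = transpose(data, ["Salary", "Score"])
print(out[0]["_NAME_"], out[0]["COL1"], out[0]["COL2"])  # Salary 50000 60000
```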
12. Removing Duplicates
PROC SORT DATA=clean7 NODUPKEY;
BY ID;
RUN;
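NODUPKEY keeps the first observation for each BY value and discards the rest. A minimal Python sketch of that behavior (the `nodupkey` name is mine):

```python
def nodupkey(rows, key):
    """Keep the first row per key value after sorting,
    like PROC SORT with the NODUPKEY option."""
    seen, out = set(), []
    for row in sorted(rows, key=lambda r: r[key]):
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

rows = [{"ID": 1, "Name": "John"}, {"ID": 2, "Name": "Alice"},
        {"ID": 1, "Name": "John"}]
print(len(nodupkey(rows, "ID")))  # 2
```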
13. Creating Macro for Cleaning
%MACRO clean_data(ds);
DATA &ds;
SET &ds;
Age = ABS(Age);
RUN;
PROC PRINT DATA=&ds;
RUN;
%MEND;
%clean_data(raw_data);
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | -70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | 200 | 80000 | 25APR2019 | . |
| 5 | 5 | Mike | 35 | 55000 | 01JAN1800 | 88 |
| 6 | 6 | Sara | . | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | -10 |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | . | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | -1000 | 25MAR2020 | 90 |
14. PROC APPEND
PROC APPEND BASE=clean7
DATA=raw_data;
RUN;
PROC PRINT DATA=clean7;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | Medium | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | Medium | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | High | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | Medium | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | Medium | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | Low | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | Low | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | Medium | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | Medium | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | Medium | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | Low | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | Medium | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | Medium | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | Low | 25MAR2020 | 90 |
| 16 | 1 | John | 25 | Low | 01JAN2020 | 80 |
| 17 | 2 | Alice | 30 | Medium | 15FEB2021 | 90 |
| 18 | 3 | Bob | 45 | Low | 10MAR2020 | 85 |
| 19 | 4 | Eva | 200 | High | 25APR2019 | . |
| 20 | 5 | Mike | 35 | Medium | 01JAN1800 | 88 |
| 21 | 6 | Sara | . | Medium | 12JUN2021 | 92 |
| 22 | 7 | Tom | 28 | Low | 15JUL2022 | -10 |
| 23 | 8 | Nina | 32 | Low | 20AUG2021 | 87 |
| 24 | 9 | Raj | 29 | Medium | . | 91 |
| 25 | 10 | Lee | 40 | Medium | 05SEP2020 | 89 |
| 26 | 11 | Zara | 27 | Medium | 11OCT2021 | 93 |
| 27 | 12 | Arun | 5 | Low | 12DEC2020 | 75 |
| 28 | 13 | Kiran | 33 | Medium | 18JAN2022 | 82 |
| 29 | 14 | Priya | 31 | Medium | 20FEB2021 | 86 |
| 30 | 15 | Dev | 29 | Low | 25MAR2020 | 90 |
15. Final Dataset Creation
PROC SORT DATA=clean7 NODUPKEY;
BY ID;
RUN;
DATA final_data;
SET clean7;
RUN;
PROC PRINT DATA=final_data;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | Low | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | Medium | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | Medium | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | High | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | Medium | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | Medium | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | Low | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | Low | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | Medium | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | Medium | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | Medium | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | Low | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | Medium | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | Medium | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | Low | 25MAR2020 | 90 |
16. Master Dataset
DATA final_data;
SET raw_data;
Age = ABS(COALESCE(Age,30));
Salary = ABS(Salary);
Score = COALESCE(Score,85);
IF Age > 100 THEN Age = .;
IF Score < 0 OR Score > 100 THEN Score = .;
Name = PROPCASE(STRIP(Name));
IF YEAR(Join_Date) < 2000 THEN Join_Date = INTNX('year', TODAY(), -5);
FORMAT Join_Date DATE9.;
RUN;
PROC PRINT DATA=final_data;
RUN;
OUTPUT:
| Obs | ID | Name | Age | Salary | Join_Date | Score |
|---|---|---|---|---|---|---|
| 1 | 1 | John | 25 | 50000 | 01JAN2020 | 80 |
| 2 | 2 | Alice | 30 | 60000 | 15FEB2021 | 90 |
| 3 | 3 | Bob | 45 | 70000 | 10MAR2020 | 85 |
| 4 | 4 | Eva | . | 80000 | 25APR2019 | 85 |
| 5 | 5 | Mike | 35 | 55000 | 01JAN2021 | 88 |
| 6 | 6 | Sara | 30 | 62000 | 12JUN2021 | 92 |
| 7 | 7 | Tom | 28 | 48000 | 15JUL2022 | . |
| 8 | 8 | Nina | 32 | 0 | 20AUG2021 | 87 |
| 9 | 9 | Raj | 29 | 52000 | 01JAN2021 | 91 |
| 10 | 10 | Lee | 40 | 70000 | 05SEP2020 | 89 |
| 11 | 11 | Zara | 27 | 65000 | 11OCT2021 | 93 |
| 12 | 12 | Arun | 5 | 40000 | 12DEC2020 | 75 |
| 13 | 13 | Kiran | 33 | 58000 | 18JAN2022 | 82 |
| 14 | 14 | Priya | 31 | 60000 | 20FEB2021 | 86 |
| 15 | 15 | Dev | 29 | 1000 | 25MAR2020 | 90 |
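For comparison, the whole master DATA step can be condensed into one Python function (a hypothetical `clean_record`, combining the ABS, COALESCE, range, name, and date rules; the fixed `today` argument stands in for TODAY() so the example is reproducible):

```python
from datetime import date

def clean_record(rec, today=date(2026, 1, 1)):
    """Composite cleaner mirroring the master DATA step."""
    out = dict(rec)
    out["Age"] = abs(out["Age"]) if out["Age"] is not None else 30   # ABS(COALESCE(Age,30))
    out["Salary"] = abs(out["Salary"]) if out["Salary"] is not None else None
    out["Score"] = out["Score"] if out["Score"] is not None else 85  # COALESCE(Score,85)
    if out["Age"] is not None and out["Age"] > 100:                  # range checks
        out["Age"] = None
    if out["Score"] is not None and not (0 <= out["Score"] <= 100):
        out["Score"] = None
    out["Name"] = out["Name"].strip().title()                        # STRIP + PROPCASE
    jd = out["Join_Date"]
    if jd is None or jd.year < 2000:                                 # INTNX('year', TODAY(), -5)
        out["Join_Date"] = date(today.year - 5, 1, 1)
    return out

eva = {"ID": 4, "Name": "Eva", "Age": 200, "Salary": 80000,
       "Join_Date": date(2019, 4, 25), "Score": None}
print(clean_record(eva))
```

Running it on Eva's record reproduces the same decisions as the SAS step: the impossible age becomes missing, the missing score is imputed, and the valid join date is preserved.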
17. 20 Advanced Insights
- ABS is not always correct; investigate the root cause
- COALESCE differs from IF-THEN
- INTNX vs INTCK confusion
- PROC SORT affects performance
- Missing ≠ Zero
- Format vs Informat difference
- Transpose reshapes data logic
- Macro reuse reduces redundancy
- STRIP vs TRIM differences
- Date handling is critical
- Range checks prevent bias
- Use PROC MEANS for QA
- Validate before modeling
- Use LABEL for clarity
- Avoid hardcoding values
- Keep an audit trail
- Use WHERE for efficiency
- Optimize memory usage
- Document transformations
- Always QC twice
18. 20 Key Takeaways
1. Raw data is rarely analysis-ready; it almost always contains hidden inconsistencies that must be addressed before any meaningful insights can be derived.
2. Negative values in critical variables (like Age or Salary) violate domain rules and must be corrected using functions like ABS() or investigated further.
3. Missing values silently distort statistical outputs, making functions like COALESCE() essential for controlled imputation.
4. Range validation (e.g., Age ≤ 100, Score between 0–100) ensures that data aligns with real-world logical boundaries.
5. Date inconsistencies (e.g., year 1800) can break time-based analytics, requiring correction using INTNX() and validation with YEAR().
6. Standardizing text fields using PROPCASE() improves readability and ensures consistency across reporting outputs.
7. Removing unwanted spaces with STRIP() or TRIM() prevents matching and joining errors in downstream processes.
8. PROC SORT is foundational for organizing data and is mandatory before operations like MERGE or BY-group processing.
9. PROC MEANS acts as a diagnostic tool to quickly identify anomalies such as extreme values or unexpected distributions.
10. Formatting data using PROC FORMAT allows business-friendly categorization without altering the underlying raw values.
11. Applying formats improves interpretability for stakeholders while maintaining analytical precision in the dataset.
12. PROC TRANSPOSE helps reshape data structures, making it easier to perform comparative or cross-variable analysis.
13. Duplicate removal using PROC SORT NODUPKEY ensures data uniqueness and prevents inflated metrics.
14. SAS macros automate repetitive cleaning logic, improving efficiency and ensuring consistency across datasets.
15. PROC APPEND enables scalable data integration, especially useful when handling incremental or batch data loads.
16. Combining all cleaning steps into a master script ensures reproducibility and supports production-level deployment.
17. Data cleaning directly impacts model accuracy, as poor-quality inputs lead to misleading outputs (“garbage in, garbage out”).
18. Proper validation techniques reduce business risk by preventing incorrect decisions based on flawed data.
19. Clean data pipelines improve operational efficiency, saving time and reducing manual intervention costs.
20. Ultimately, structured SAS data cleaning transforms unreliable raw data into trusted, decision-ready business intelligence.
19. Business Context
In enterprise environments, especially clinical trials, banking analytics, and retail forecasting, data cleaning directly translates to cost savings and decision accuracy. Poor-quality data leads to rework cycles, delayed reporting, and flawed insights that can cost millions.
For example, in a pharmaceutical company, an incorrect patient age can invalidate subgroup analysis, delaying regulatory approvals. In banking, negative transaction values may distort fraud detection models, allowing fraudulent activities to go unnoticed.
By implementing structured SAS cleaning pipelines like this one, organizations achieve automation, consistency, and auditability. Automated macros reduce manual intervention, while standardized validation ensures compliance with data governance frameworks like CDISC.
Time savings are equally critical. A well-designed SAS cleaning script can process millions of records in minutes, compared to hours of manual correction.
Ultimately, clean data enables trustworthy analytics, which drives better strategic decisions, improves operational efficiency, and reduces regulatory risk.
Summary & Conclusion
Data cleaning is not a preliminary step; it is the core of analytics excellence. In this project, we transformed a flawed dataset into a structured, reliable asset using SAS.
We started with chaotic data containing negative values, missing fields, and logical inconsistencies. Through systematic application of SAS functions like ABS, COALESCE, and INTNX, and procedures like PROC SORT and PROC MEANS, we restored integrity.
Each transformation was not just technical; it was business-driven. Every correction aligned the data with real-world logic, ensuring that downstream analytics would be accurate and meaningful.
The key takeaway is this: clean data is not about perfection, it’s about trust. When your dataset reflects reality, your insights become actionable.
For SAS programmers, mastering data cleaning is a career accelerator. It demonstrates not just coding ability, but analytical thinking, domain understanding, and attention to detail.
If you can clean data effectively, you can solve real business problems, and that’s what truly matters.
Interview Preparation
1. Why use ABS in SAS?
To correct negative values while preserving records.
2. Difference between COALESCE and IF-THEN?
COALESCE returns the first non-missing value from a list in a single expression, while IF-THEN requires a separate condition for each case.
3. What is INTNX used for?
Shifting a date by a given number of intervals (days, months, years).
4. Why PROC SORT before MERGE?
It ensures proper alignment of BY variables.
5. How do you validate data quality?
Using PROC MEANS, range checks, and logical rules.
---
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only; the focus here is data cleaning.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and data quality
---