445.The Art And Science Of Data Cleaning Using SAS Structure And R Agility
Can A Structured SAS vs R Data Cleaning Framework Turn Messy Data Into Trustworthy Intelligence?
Introduction: Why This Comparison
Matters More Than You Think
In real-world data science especially in regulated industries like clinical
trials, banking, and pharma data cleaning is not just a preprocessing step. It
is the foundation of truth. If your data is wrong, your insights are
dangerous.
Think of raw data like unfiltered water. Both SAS and R
are purification systems. SAS is like an industrial-grade filtration
plant rigid, controlled, regulatory-compliant. R is like a flexible smart filter customizable,
powerful, but dependent on the user’s discipline.
This blog walks you through a complete end-to-end project
where we:
· Create
a 15+ observation dataset
· Introduce
intentional errors
· Clean
using 15+ SAS PROG1 techniques
· Compare
every step with R equivalents
· Deliver
a production-ready pipeline
The Dataset: Raw Data Creation (SAS
vs R)
Business Context of Dataset
We simulate a customer transaction dataset where errors are
intentionally injected.
Variables
· ID
· Name
· Age
· Salary
· Department
· Join_Date
SAS RAW DATA (DATALINES)
DATA raw_data;
INPUT ID Name $ Age Salary Department $ Join_Date :DATE9.;
FORMAT Join_Date DATE9.;
DATALINES;
1 Ravi 25 50000 IT 01JAN2020
2 Sita -30 60000 HR 15FEB2021
3 Arun 45 -10000 FIN 10MAR2019
4 NULL 29 45000 IT 25APR2020
5 John 200 70000 SALES 30MAY2022
6 Meena 32 . HR 12JUN2021
7 Raj 28 52000 IT 05JUL2020
8 Priya 27 51000 IT .
9 NULL 35 48000 FIN 11AUG2019
10 Kiran 40 62000 SALES 09SEP2020
11 Anil 38 -20000 HR 20OCT2021
12 Pooja 29 53000 IT 18NOV2022
13 NULL 31 55000 SALES 22DEC2020
14 Ramesh 33 56000 FIN 01JAN2021
15 Kavya 26 49000 IT 15FEB2022
;
RUN;
Proc print data=raw_data;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | -30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | -10000 | FIN | 10MAR2019 |
| 4 | 4 | NULL | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | 200 | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | . | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | . |
| 9 | 9 | NULL | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | -20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | NULL | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
R RAW DATA (data.frame)
raw_data <- data.frame(
ID = 1:15,
Name = c("Ravi","Sita","Arun",NA,"John","Meena","Raj",
"Priya",NA,"Kiran","Anil","Pooja",NA,"Ramesh","Kavya"),
Age = c(25,-30,45,29,200,32,28,27,35,40,38,29,31,33,26),
Salary = c(50000,60000,-10000,45000,70000,NA,52000,51000,
48000,62000,-20000,53000,55000,56000,49000),
Department = c("IT","HR","FIN","IT","SALES","HR","IT","IT",
"FIN","SALES","HR","IT","SALES","FIN","IT"),
Join_Date = as.Date(c("2020-01-01","2021-02-15","2019-03-10",
"2020-04-25","2022-05-30","2021-06-12",
"2020-07-05",NA,"2019-08-11","2020-09-09",
"2021-10-20","2022-11-18","2020-12-22",
"2021-01-01","2022-02-15"))
)
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
-30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
-10000 |
FIN |
10-03-2019 |
|
4 |
4 |
NA |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
200 |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
NA |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
NA |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
-20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
NA |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Phase 1: Discovery & Chaos
5 Intentional Errors
1. Negative
Age (-30)
2. Salary
negative values (-10000,
-20000)
3. Missing
Names (NULL)
4. Age
out of range (200)
5. Missing
Dates & Salary
Why These Errors Destroy Scientific
Integrity
Data errors are not just technical issues they are epistemological
threats. In scientific and business environments, every decision derives
from data assumptions. When those assumptions are violated, conclusions become
unreliable.
Take negative age. This is not just incorrect it signals
either data entry failure or system misalignment. If used in modeling, it skews
distributions, affects mean calculations, and breaks statistical assumptions.
Similarly, negative salary values distort financial
summaries. Imagine calculating average revenue these values artificially
deflate results, leading to underestimation of performance.
Missing values are even more dangerous. They introduce
bias. If high-income individuals are missing salary data, your dataset becomes
systematically skewed toward lower income.
Range violations (Age = 200) suggest data validation
failure. Such anomalies inflate variance and disrupt predictive models.
Finally, missing dates compromise time-series analysis.
Without proper temporal alignment, trend analysis becomes meaningless.
SAS vs R Perspective
· SAS
enforces structured validation through formats and controlled data steps.
· R
allows flexibility but requires manual validation using packages like dplyr.
Conclusion: Data cleaning is not optional it is the
difference between insight and illusion.
Phase 2: Step-by-Step SAS Mastery
Business Logic
Sorting is the first step in any structured pipeline. In enterprise
workflows, sorted data ensures reproducibility, consistency in merges, and
reliable reporting. Without sorting, operations like BY-group processing fail
silently.
In SAS, sorting is deterministic. In R, sorting depends on order()
or arrange() from dplyr.
1. PROC SORT
PROC SORT DATA=raw_data OUT=sorted_data;
BY ID;
RUN;
Proc print data=sorted_data;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | -30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | -10000 | FIN | 10MAR2019 |
| 4 | 4 | NULL | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | 200 | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | . | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | . |
| 9 | 9 | NULL | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | -20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | NULL | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
R Equivalent
sorted_data <- raw_data[order(raw_data$ID),]
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
-30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
-10000 |
FIN |
10-03-2019 |
|
4 |
4 |
NA |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
200 |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
NA |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
NA |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
-20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
NA |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Always sort before merges unsorted merges cause silent data corruption.
Key Takeaways
· Required
for BY processing
· Improves
readability
· Mandatory
before MERGE
2. Handling Missing Names
Business Logic
Missing names indicate identity loss. In regulated environments, this is
unacceptable. We standardize using "UNKNOWN".
DATA clean1;
SET sorted_data;
IF Name = "NULL" OR Name = "" THEN Name = "UNKNOWN";
RUN;
Proc print data=clean1;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | -30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | -10000 | FIN | 10MAR2019 |
| 4 | 4 | UNKNOWN | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | 200 | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | . | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | . |
| 9 | 9 | UNKNOWN | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | -20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | UNKNOWN | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
R Equivalent
clean1 <- sorted_data
clean1$Name[is.na(clean1$Name) | clean1$Name == "NULL" | clean1$Name == ""] <- "UNKNOWN"
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
-30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
-10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
200 |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
NA |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
-20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Use COALESCE in SQL for
scalable pipelines.
Key Takeaways
· Identity
fields must never be missing
· Use
fallback values
3. Fix Negative Age
Business Logic
Age must be realistic. Use ABS()
to correct.
DATA clean2;
SET clean1;
Age = ABS(Age);
RUN;
Proc print data=clean2;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | 30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | -10000 | FIN | 10MAR2019 |
| 4 | 4 | UNKNOWN | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | 200 | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | . | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | . |
| 9 | 9 | UNKNOWN | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | -20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | UNKNOWN | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
R Equivalent
clean2 <- clean1
clean2$Age <- abs(clean2$Age)
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
-10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
200 |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
NA |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
-20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
ABS is powerful but use cautiously sometimes errors should be flagged, not
fixed.
Key Takeaways
· Numeric
correction
· Removes
negative bias
4. Age Range Validation
Business Logic
Age > 100 is unrealistic. Replace with missing.
DATA clean3;
SET clean2;
IF Age > 100 THEN Age = .;
RUN;
Proc print data=clean3;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | 30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | -10000 | FIN | 10MAR2019 |
| 4 | 4 | UNKNOWN | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | . | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | . | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | . |
| 9 | 9 | UNKNOWN | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | -20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | UNKNOWN | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
R Equivalent
clean3 <- clean2
clean3$Age[clean3$Age > 100] <- NA
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
-10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
NA |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
NA |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
-20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Use domain knowledge not just code.
Key Takeaways
· Range
validation critical
5. Fix Salary Errors
Business Logic
Negative salary invalid → use ABS
DATA clean4;
SET clean3;
Salary = ABS(Salary);
RUN;
Proc print data=clean4;
run;
OUTPUT:
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | 30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | 10000 | FIN | 10MAR2019 |
| 4 | 4 | UNKNOWN | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | . | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | . | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | . |
| 9 | 9 | UNKNOWN | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | 20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | UNKNOWN | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
R Equivalent
clean4 <- clean3
clean4$Salary <- abs(clean4$Salary)
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
NA |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
NA |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Flag before correcting in audits.
Key Takeaways
· Financial
data must be clean
· ABS
simplifies correction
6. Handle Missing Salary
Business Logic
Replace missing salary with mean
PROC MEANS DATA=clean4 NOPRINT;
VAR Salary;
OUTPUT OUT=mean_sal MEAN=avg_sal;
RUN;
LOG:
DATA clean5;
IF _N_=1 THEN SET mean_sal;
SET clean4;
IF Salary = . THEN Salary = avg_sal;
RUN;
Proc print data=clean5;
run;
OUTPUT:
| Obs | _TYPE_ | _FREQ_ | avg_sal | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 15 | 48642.86 | 1 | Ravi | 25 | 50000.00 | IT | 01JAN2020 |
| 2 | 0 | 15 | 48642.86 | 2 | Sita | 30 | 60000.00 | HR | 15FEB2021 |
| 3 | 0 | 15 | 48642.86 | 3 | Arun | 45 | 10000.00 | FIN | 10MAR2019 |
| 4 | 0 | 15 | 48642.86 | 4 | UNKNOWN | 29 | 45000.00 | IT | 25APR2020 |
| 5 | 0 | 15 | 48642.86 | 5 | John | . | 70000.00 | SALES | 30MAY2022 |
| 6 | 0 | 15 | 48642.86 | 6 | Meena | 32 | 48642.86 | HR | 12JUN2021 |
| 7 | 0 | 15 | 48642.86 | 7 | Raj | 28 | 52000.00 | IT | 05JUL2020 |
| 8 | 0 | 15 | 48642.86 | 8 | Priya | 27 | 51000.00 | IT | . |
| 9 | 0 | 15 | 48642.86 | 9 | UNKNOWN | 35 | 48000.00 | FIN | 11AUG2019 |
| 10 | 0 | 15 | 48642.86 | 10 | Kiran | 40 | 62000.00 | SALES | 09SEP2020 |
| 11 | 0 | 15 | 48642.86 | 11 | Anil | 38 | 20000.00 | HR | 20OCT2021 |
| 12 | 0 | 15 | 48642.86 | 12 | Pooja | 29 | 53000.00 | IT | 18NOV2022 |
| 13 | 0 | 15 | 48642.86 | 13 | UNKNOWN | 31 | 55000.00 | SALES | 22DEC2020 |
| 14 | 0 | 15 | 48642.86 | 14 | Ramesh | 33 | 56000.00 | FIN | 01JAN2021 |
| 15 | 0 | 15 | 48642.86 | 15 | Kavya | 26 | 49000.00 | IT | 15FEB2022 |
R Equivalent
clean4$Salary[is.na(clean4$Salary)] <- mean(clean4$Salary, na.rm=TRUE)
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
NA |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
48642.86 |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
NA |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Mean imputation is simple but may introduce bias.
Key Takeaways
· Imputation
strategy matters
· Use
domain logic
7. Date Cleaning
Business Logic
Missing dates break time logic.
DATA clean6;
SET clean5;
IF Join_Date = . THEN Join_Date = TODAY();
RUN;
Proc print data=clean6;
run;
OUTPUT:
| Obs | _TYPE_ | _FREQ_ | avg_sal | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 15 | 48642.86 | 1 | Ravi | 25 | 50000.00 | IT | 01JAN2020 |
| 2 | 0 | 15 | 48642.86 | 2 | Sita | 30 | 60000.00 | HR | 15FEB2021 |
| 3 | 0 | 15 | 48642.86 | 3 | Arun | 45 | 10000.00 | FIN | 10MAR2019 |
| 4 | 0 | 15 | 48642.86 | 4 | UNKNOWN | 29 | 45000.00 | IT | 25APR2020 |
| 5 | 0 | 15 | 48642.86 | 5 | John | . | 70000.00 | SALES | 30MAY2022 |
| 6 | 0 | 15 | 48642.86 | 6 | Meena | 32 | 48642.86 | HR | 12JUN2021 |
| 7 | 0 | 15 | 48642.86 | 7 | Raj | 28 | 52000.00 | IT | 05JUL2020 |
| 8 | 0 | 15 | 48642.86 | 8 | Priya | 27 | 51000.00 | IT | 08APR2026 |
| 9 | 0 | 15 | 48642.86 | 9 | UNKNOWN | 35 | 48000.00 | FIN | 11AUG2019 |
| 10 | 0 | 15 | 48642.86 | 10 | Kiran | 40 | 62000.00 | SALES | 09SEP2020 |
| 11 | 0 | 15 | 48642.86 | 11 | Anil | 38 | 20000.00 | HR | 20OCT2021 |
| 12 | 0 | 15 | 48642.86 | 12 | Pooja | 29 | 53000.00 | IT | 18NOV2022 |
| 13 | 0 | 15 | 48642.86 | 13 | UNKNOWN | 31 | 55000.00 | SALES | 22DEC2020 |
| 14 | 0 | 15 | 48642.86 | 14 | Ramesh | 33 | 56000.00 | FIN | 01JAN2021 |
| 15 | 0 | 15 | 48642.86 | 15 | Kavya | 26 | 49000.00 | IT | 15FEB2022 |
R Equivalent
clean4$Join_Date[is.na(clean4$Join_Date)] <- Sys.Date()
OUTPUT:
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
NA |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
48642.86 |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
08-04-2026 |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
Avoid using TODAY in production Use business reference date.
8. Master Dataset
DATA final_clean;
SET raw_data;
IF Name = "NULL" OR Name = "" THEN Name = "UNKNOWN";
Age = ABS(Age);
IF Age > 100 THEN Age = .;
Salary = ABS(Salary);
IF Salary = . THEN Salary = 55000;
IF MISSING(Join_Date) THEN Join_Date = TODAY();
FORMAT Join_Date DATE9.;
RUN;
Proc print data=final_clean;
run;
| Obs | ID | Name | Age | Salary | Department | Join_Date |
|---|---|---|---|---|---|---|
| 1 | 1 | Ravi | 25 | 50000 | IT | 01JAN2020 |
| 2 | 2 | Sita | 30 | 60000 | HR | 15FEB2021 |
| 3 | 3 | Arun | 45 | 10000 | FIN | 10MAR2019 |
| 4 | 4 | UNKNOWN | 29 | 45000 | IT | 25APR2020 |
| 5 | 5 | John | . | 70000 | SALES | 30MAY2022 |
| 6 | 6 | Meena | 32 | 55000 | HR | 12JUN2021 |
| 7 | 7 | Raj | 28 | 52000 | IT | 05JUL2020 |
| 8 | 8 | Priya | 27 | 51000 | IT | 08APR2026 |
| 9 | 9 | UNKNOWN | 35 | 48000 | FIN | 11AUG2019 |
| 10 | 10 | Kiran | 40 | 62000 | SALES | 09SEP2020 |
| 11 | 11 | Anil | 38 | 20000 | HR | 20OCT2021 |
| 12 | 12 | Pooja | 29 | 53000 | IT | 18NOV2022 |
| 13 | 13 | UNKNOWN | 31 | 55000 | SALES | 22DEC2020 |
| 14 | 14 | Ramesh | 33 | 56000 | FIN | 01JAN2021 |
| 15 | 15 | Kavya | 26 | 49000 | IT | 15FEB2022 |
|
|
ID |
Name |
Age |
Salary |
Department |
Join_Date |
|
1 |
1 |
Ravi |
25 |
50000 |
IT |
01-01-2020 |
|
2 |
2 |
Sita |
30 |
60000 |
HR |
15-02-2021 |
|
3 |
3 |
Arun |
45 |
10000 |
FIN |
10-03-2019 |
|
4 |
4 |
UNKNOWN |
29 |
45000 |
IT |
25-04-2020 |
|
5 |
5 |
John |
NA |
70000 |
SALES |
30-05-2022 |
|
6 |
6 |
Meena |
32 |
55000 |
HR |
12-06-2021 |
|
7 |
7 |
Raj |
28 |
52000 |
IT |
05-07-2020 |
|
8 |
8 |
Priya |
27 |
51000 |
IT |
08-04-2026 |
|
9 |
9 |
UNKNOWN |
35 |
48000 |
FIN |
11-08-2019 |
|
10 |
10 |
Kiran |
40 |
62000 |
SALES |
09-09-2020 |
|
11 |
11 |
Anil |
38 |
20000 |
HR |
20-10-2021 |
|
12 |
12 |
Pooja |
29 |
53000 |
IT |
18-11-2022 |
|
13 |
13 |
UNKNOWN |
31 |
55000 |
SALES |
22-12-2020 |
|
14 |
14 |
Ramesh |
33 |
56000 |
FIN |
01-01-2021 |
|
15 |
15 |
Kavya |
26 |
49000 |
IT |
15-02-2022 |
1. Always
validate before transform
2. Use
macros for scalability
3. PROC
SQL for large joins
4. Avoid
hardcoding
5. Use
formats for classification
6. INTNX
for time shifts
7. INTCK
for intervals
8. Use
KEEP/DROP
9. Debug
with PUTLOG
10. Use WHERE
vs IF
11. Indexing
improves performance
12. Use LENGTH
early
13. Avoid
implicit conversions
14. Validate
duplicates
15. Use hash
objects
16. Modular
coding
17. Log
checking mandatory
18. Automate QC
19. Version
control SAS scripts
20. Compare
outputs with R for validation
In enterprise environments, poor data quality translates directly into
financial loss. Consider a pharmaceutical company analyzing clinical trial
data. If patient age is incorrect or missing, it affects safety analysis and
regulatory submissions. Incorrect salary data in HR systems can lead to payroll
discrepancies, compliance violations, and employee dissatisfaction.
By implementing a structured SAS data cleaning pipeline,
organizations reduce manual intervention. Automation ensures repeatability and
auditability key requirements in regulated industries.
Compared to R, SAS provides stronger governance, making it ideal for production
environments. R, however, excels in exploratory analysis and rapid prototyping.
A company adopting this hybrid approach benefits from:
· Reduced
data processing time by 40–60%
· Improved
accuracy in reporting
· Faster
regulatory approvals
· Lower
operational risk
Ultimately, clean data is not just about correctness it is about trust.
11. 20 Key Points About The Project
1. A
structured data cleaning framework ensures that raw, inconsistent data is
systematically transformed into reliable and analysis-ready datasets.
2. In
SAS, the DATA step enforces a disciplined, row-wise execution model, making
transformations predictable and audit-friendly.
3. In
R, vectorized operations allow faster and more flexible data manipulation,
especially for large-scale datasets.
4. SAS
uses explicit missing value representations (.
and " "),
while R uses a unified NA,
simplifying missing value handling.
5. Handling
missing data using functions like COALESCE
(SAS) and ifelse or dplyr::coalesce (R)
improves data completeness.
6. Functions
like ABS() in SAS and abs() in R correct invalid
negative values, but require domain validation before application.
7. Range
validation (e.g., Age > 100) ensures logical consistency and prevents
statistical distortion in downstream models.
8. SAS
PROC SORT enforces
ordering before merges, ensuring data alignment and preventing silent
mismatches.
9. R
sorting (arrange() or order()) enhances
readability but is not mandatory for joins, offering more flexibility.
10. Date
handling in SAS (TODAY(),
INTNX, INTCK)
provides strong temporal control, while R uses Sys.Date()
and lubridate for flexibility.
11. String
cleaning functions like STRIP,
TRIM, and PROPCASE
in SAS ensure standardized textual data.
12. R’s stringr package offers
advanced and readable string manipulation for similar transformations.
13. SAS PROC MEANS enables
statistical imputation, while R uses mean()
with na.rm=TRUE for
similar results.
14. Macros in
SAS automate repetitive cleaning tasks, while R uses functions and packages for
modularization.
15. SAS logs
provide detailed execution tracking, making debugging and validation easier in
regulated environments.
16. R provides
flexibility but requires explicit validation steps to ensure reproducibility
and compliance.
17. Combining
multiple cleaning steps into a master script improves efficiency and reduces
manual intervention.
18. Cross-validation
between SAS and R outputs increases confidence in data integrity and accuracy.
19. A
structured framework reduces processing time, minimizes human error, and
enhances scalability.
20. Ultimately,
clean data produced through SAS discipline and R flexibility becomes
trustworthy intelligence that drives accurate business decisions.
Data cleaning is not a mechanical step it is an intellectual process. It
requires understanding the data, questioning anomalies, and applying domain
logic.
In this project, we demonstrated how a messy dataset can be transformed into
a reliable asset using SAS and R. SAS provided structure, repeatability, and
enterprise-grade reliability. R offered flexibility and speed.
The key takeaway is not choosing one over the other but understanding when
to use each. In regulated industries, SAS dominates due to auditability. In
research and innovation, R shines.
If you master both, you become not just a programmer but a data
strategist.
Clean data leads to clear thinking. And clear thinking drives powerful
decisions.
Interview Preparation
Q1: Why use ABS() in data cleaning?
Answer: To correct negative numeric anomalies, but should
be validated before blind correction.
Q2: Difference between IF and WHERE
in SAS?
Answer: WHERE filters before reading; IF filters after
reading.
Q3: How does SAS handle missing
values vs R?
Answer: SAS uses .
while R uses NA.
Q4: Why use PROC MEANS for
imputation?
Answer: To compute statistical replacements like mean
efficiently.
Q5: SAS vs R Which is better?
Answer: SAS for production & compliance; R for
flexibility & analytics.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SAS Vs R DATA CLEANING.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment