Can A Step-By-Step SAS Data Error Fixing Framework Turn Chaotic Raw Data Into Trusted Intelligence?

Introduction: Why This Project Matters

In real-world analytics, raw data is rarely clean. It behaves like an untrained employee unpredictable, inconsistent, and sometimes outright wrong. As a SAS programmer, your role is not just to analyze data, but to discipline it, standardize it, and ultimately trust it.

In this project, we will simulate a realistic messy dataset and systematically fix it using SAS. Think of this as moving from “data chaos” to “decision-grade intelligence.”

The Raw Dataset

SAS Code (DATALINES)

DATA raw_errors;

INPUT ID Name $ Age Salary JoinDate:$10. Score;

DATALINES;

1 John 25 50000 2022-01-15 85

2 Mary -30 60000 2022-02-20 90

3 Steve 45 -70000 2022-03-10 88

4 Anna . 55000 2022-04-12 .

5 Mike 200 45000 2022-05-05 75

6 Sara 29 0 2022-06-25 92

7 David 34 72000 2022-13-01 80

8 Lisa 28 68000 2022-07-15 95

9 Tom 31 65000 . 89

10 Emma 27 70000 2022-08-20 91

11 Raj 26 62000 2022-09-10 87

12 Ravi 30 -1000 2022-10-11 85

13 Priya 29 58000 2022-11-01 93

14 Arun . 61000 2022-12-05 88

15 Kiran 33 67000 2022-06-31 90

;

RUN;

Proc print data=raw_errors;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	2022-01-15	85
2	2	Mary	-30	60000	2022-02-20	90
3	3	Steve	45	-70000	2022-03-10	88
4	4	Anna	.	55000	2022-04-12	.
5	5	Mike	200	45000	2022-05-05	75
6	6	Sara	29	0	2022-06-25	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	2022-07-15	95
9	9	Tom	31	65000		89
10	10	Emma	27	70000	2022-08-20	91
11	11	Raj	26	62000	2022-09-10	87
12	12	Ravi	30	-1000	2022-10-11	85
13	13	Priya	29	58000	2022-11-01	93
14	14	Arun	.	61000	2022-12-05	88
15	15	Kiran	33	67000	2022-06-31	90

Equivalent R Code

raw_errors <- data.frame(

ID = 1:15,

Name = c("John","Mary","Steve","Anna","Mike","Sara","David","Lisa",

"Tom","Emma","Raj","Ravi","Priya","Arun","Kiran"),

Age = c(25,-30,45,NA,200,29,34,28,31,27,26,30,29,NA,33),

Salary = c(50000,60000,-70000,55000,45000,0,72000,68000,65000,

70000,62000,-1000,58000,61000,67000),

JoinDate = c("2022-01-15","2022-02-20","2022-03-10","2022-04-12",

"2022-05-05","2022-06-25","2022-13-01","2022-07-15",NA,

"2022-08-20","2022-09-10","2022-10-11","2022-11-01",

"2022-12-05","2022-06-31"),

Score = c(85,90,88,NA,75,92,80,95,89,91,87,85,93,88,90)

)

OUTPUT:

	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	15-01-2022	85
2	2	Mary	-30	60000	20-02-2022	90
3	3	Steve	45	-70000	10-03-2022	88
4	4	Anna	NA	55000	12-04-2022	NA
5	5	Mike	200	45000	05-05-2022	75
6	6	Sara	29	0	25-06-2022	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	15-07-2022	95
9	9	Tom	31	65000	NA	89
10	10	Emma	27	70000	20-08-2022	91
11	11	Raj	26	62000	10-09-2022	87
12	12	Ravi	30	-1000	11-10-2022	85
13	13	Priya	29	58000	01-11-2022	93
14	14	Arun	NA	61000	05-12-2022	88
15	15	Kiran	33	67000	2022-06-31	90

Phase 1: Discovery & Chaos

Understanding the 5 Critical Data Errors

This dataset contains intentional real-world issues:

Negative Age (-30) → impossible biologically
Negative Salary (-70000, -1000) → invalid financial data
Missing Values (Age, Score, JoinDate) → incomplete observations
Out-of-Range Age (200) → logical violation
Invalid Dates (2022-13-01, 2022-06-31) → calendar errors

Why These Errors Destroy Scientific Integrity

Data errors are not just technical issues they are business risks disguised as numbers. Imagine calculating average salary including negative values. The result becomes misleading, and any downstream decision—budgeting, forecasting, or compensation benchmarking collapses.

Missing values create silent bias. For example, if high performers have missing scores, your analysis might underestimate performance trends. In regulated industries like clinical trials, missing or incorrect data can invalidate entire studies.

Range violations, such as age = 200, break statistical assumptions. Models depend on realistic distributions. Outliers like this distort mean, variance, and regression coefficients, leading to unstable predictions.

Date errors are even more dangerous. Time-based analytics retention, cohort analysis, survival analysis depend on chronological accuracy. A single invalid date can misplace records across timelines, corrupting duration calculations using functions like INTCK and INTNX.

Negative salary values reflect upstream system failures ETL issues, incorrect transformations, or data entry mistakes. Ignoring them signals poor data governance.

In essence, dirty data erodes trust, and trust is the currency of analytics. Without trust, dashboards become decoration rather than decision tools.

Phase 2: Step-by-Step SAS Mastery

1. PROC SORT – Organizing Chaos

Business Logic

Sorting is the first act of control. When datasets arrive unsorted, patterns are hidden. Duplicate detection, merging, and BY-group processing all depend on ordering. Think of sorting as arranging files in a cabinet before auditing them.

PROC SORT DATA=raw_errors OUT=sorted_data;

BY ID;

RUN;

Proc print data=sorted_data;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	2022-01-15	85
2	2	Mary	-30	60000	2022-02-20	90
3	3	Steve	45	-70000	2022-03-10	88
4	4	Anna	.	55000	2022-04-12	.
5	5	Mike	200	45000	2022-05-05	75
6	6	Sara	29	0	2022-06-25	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	2022-07-15	95
9	9	Tom	31	65000		89
10	10	Emma	27	70000	2022-08-20	91
11	11	Raj	26	62000	2022-09-10	87
12	12	Ravi	30	-1000	2022-10-11	85
13	13	Priya	29	58000	2022-11-01	93
14	14	Arun	.	61000	2022-12-05	88
15	15	Kiran	33	67000	2022-06-31	90

Always sort before merging datasets to avoid silent mismatches.

Key Takeaways

Required for BY-group operations
Prevents merge errors
Improves readability

2. DATA Step – Fix Negative Values Using ABS

Business Logic

Negative salaries are logically invalid. Using ABS() converts them to positive values. This assumes the issue is sign inversion, not corruption.

DATA fix_salary;

SET sorted_data;

Salary = ABS(Salary);

RUN;

Proc print data=fix_salary;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	2022-01-15	85
2	2	Mary	-30	60000	2022-02-20	90
3	3	Steve	45	70000	2022-03-10	88
4	4	Anna	.	55000	2022-04-12	.
5	5	Mike	200	45000	2022-05-05	75
6	6	Sara	29	0	2022-06-25	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	2022-07-15	95
9	9	Tom	31	65000		89
10	10	Emma	27	70000	2022-08-20	91
11	11	Raj	26	62000	2022-09-10	87
12	12	Ravi	30	1000	2022-10-11	85
13	13	Priya	29	58000	2022-11-01	93
14	14	Arun	.	61000	2022-12-05	88
15	15	Kiran	33	67000	2022-06-31	90

Validate whether negatives are true errors before blindly applying ABS().

Key Takeaways

ABS() ensures positivity
Useful for financial data cleaning
Always validate assumptions

3. DATA Step – Handle Missing Values Using COALESCE

Business Logic

Missing values distort aggregations. COALESCE() replaces missing with defaults or fallback values.

DATA fix_missing;

SET fix_salary;

Age = COALESCE(Age, 30);

Score = COALESCE(Score, 85);

RUN;

Proc print data=fix_missing;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	2022-01-15	85
2	2	Mary	-30	60000	2022-02-20	90
3	3	Steve	45	70000	2022-03-10	88
4	4	Anna	30	55000	2022-04-12	85
5	5	Mike	200	45000	2022-05-05	75
6	6	Sara	29	0	2022-06-25	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	2022-07-15	95
9	9	Tom	31	65000		89
10	10	Emma	27	70000	2022-08-20	91
11	11	Raj	26	62000	2022-09-10	87
12	12	Ravi	30	1000	2022-10-11	85
13	13	Priya	29	58000	2022-11-01	93
14	14	Arun	30	61000	2022-12-05	88
15	15	Kiran	33	67000	2022-06-31	90

Use domain knowledge when imputing—not arbitrary values.

Key Takeaways

Prevents NULL propagation
Maintains dataset completeness
Supports statistical stability

4. DATA Step – Fix Range Violations

Business Logic

Age must be realistic. Values outside 18–65 are capped.

DATA fix_range;

SET fix_missing;

IF Age > 100 THEN Age = 60;

IF Age < 18 THEN Age = 25;

RUN;

Proc print data=fix_range;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	2022-01-15	85
2	2	Mary	25	60000	2022-02-20	90
3	3	Steve	45	70000	2022-03-10	88
4	4	Anna	30	55000	2022-04-12	85
5	5	Mike	60	45000	2022-05-05	75
6	6	Sara	29	0	2022-06-25	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	2022-07-15	95
9	9	Tom	31	65000		89
10	10	Emma	27	70000	2022-08-20	91
11	11	Raj	26	62000	2022-09-10	87
12	12	Ravi	30	1000	2022-10-11	85
13	13	Priya	29	58000	2022-11-01	93
14	14	Arun	30	61000	2022-12-05	88
15	15	Kiran	33	67000	2022-06-31	90

Use business rules, not arbitrary thresholds.

Key Takeaways

Enforces logical constraints
Prevents outlier distortion
Essential for modeling

5. DATE Handling with INPUT & FORMAT

Business Logic

Dates must be numeric SAS dates for time analysis.

DATA fix_date;

SET fix_range;

JoinDate_clean = INPUT(JoinDate, YYMMDD10.);

FORMAT JoinDate_clean DATE9.;

RUN;

Proc print data=fix_date;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean
1	1	John	25	50000	2022-01-15	85	15JAN2022
2	2	Mary	25	60000	2022-02-20	90	20FEB2022
3	3	Steve	45	70000	2022-03-10	88	10MAR2022
4	4	Anna	30	55000	2022-04-12	85	12APR2022
5	5	Mike	60	45000	2022-05-05	75	05MAY2022
6	6	Sara	29	0	2022-06-25	92	25JUN2022
7	7	David	34	72000	2022-13-01	80	.
8	8	Lisa	28	68000	2022-07-15	95	15JUL2022
9	9	Tom	31	65000		89	.
10	10	Emma	27	70000	2022-08-20	91	20AUG2022
11	11	Raj	26	62000	2022-09-10	87	10SEP2022
12	12	Ravi	30	1000	2022-10-11	85	11OCT2022
13	13	Priya	29	58000	2022-11-01	93	01NOV2022
14	14	Arun	30	61000	2022-12-05	88	05DEC2022
15	15	Kiran	33	67000	2022-06-31	90	.

Invalid dates become missing handle them separately.

Key Takeaways

Converts character to numeric
Enables time calculations
Essential for INTNX, INTCK

6. PROC MEANS – Validation Check

Business Logic

After cleaning, validate distributions.

PROC MEANS DATA=fix_date;

VAR Age Salary Score;

RUN;

OUTPUT:

The MEANS Procedure


Variable	N	Mean	Std Dev	Minimum	Maximum
Age Salary Score	15 15 15	32.1333333 53600.00 87.5333333	9.1172260 22846.69 5.1111456	25.0000000 0 75.0000000	60.0000000 72000.00 95.0000000

Always validate after transformation.

Key Takeaways

Detect anomalies
Confirm corrections
Quick sanity check

7. FORMAT – Categorization

Business Logic

Categorizing salary helps reporting.

PROC FORMAT;

VALUE salfmt LOW-50000='Low'

50001-70000='Medium'

70001-HIGH='High';

RUN;

LOG:

 NOTE: Format SALFMT has been output.

Key Takeaways

Improves interpretability
Supports reporting

8. APPLY FORMAT

DATA formatted;

SET fix_date;

FORMAT Salary salfmt.;

RUN;

Proc print data=formatted;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean
1	1	John	25	Low	2022-01-15	85	15JAN2022
2	2	Mary	25	Medium	2022-02-20	90	20FEB2022
3	3	Steve	45	Medium	2022-03-10	88	10MAR2022
4	4	Anna	30	Medium	2022-04-12	85	12APR2022
5	5	Mike	60	Low	2022-05-05	75	05MAY2022
6	6	Sara	29	Low	2022-06-25	92	25JUN2022
7	7	David	34	High	2022-13-01	80	.
8	8	Lisa	28	Medium	2022-07-15	95	15JUL2022
9	9	Tom	31	Medium		89	.
10	10	Emma	27	Medium	2022-08-20	91	20AUG2022
11	11	Raj	26	Medium	2022-09-10	87	10SEP2022
12	12	Ravi	30	Low	2022-10-11	85	11OCT2022
13	13	Priya	29	Medium	2022-11-01	93	01NOV2022
14	14	Arun	30	Medium	2022-12-05	88	05DEC2022
15	15	Kiran	33	Medium	2022-06-31	90	.

9. PROC TRANSPOSE

Business Logic

Reshapes data for reporting or modeling.

PROC TRANSPOSE DATA=formatted OUT=transposed;

BY ID;

VAR Age Salary Score;

RUN;

Proc print data=transposed;

run;

OUTPUT:


Obs	ID	_NAME_	COL1
1	1	Age	25
2	1	Salary	50000
3	1	Score	85
4	2	Age	25
5	2	Salary	60000
6	2	Score	90
7	3	Age	45
8	3	Salary	70000
9	3	Score	88
10	4	Age	30
11	4	Salary	55000
12	4	Score	85
13	5	Age	60
14	5	Salary	45000
15	5	Score	75
16	6	Age	29
17	6	Salary	0
18	6	Score	92
19	7	Age	34
20	7	Salary	72000
21	7	Score	80
22	8	Age	28
23	8	Salary	68000
24	8	Score	95
25	9	Age	31
26	9	Salary	65000
27	9	Score	89
28	10	Age	27
29	10	Salary	70000
30	10	Score	91
31	11	Age	26
32	11	Salary	62000
33	11	Score	87
34	12	Age	30
35	12	Salary	1000
36	12	Score	85
37	13	Age	29
38	13	Salary	58000
39	13	Score	93
40	14	Age	30
41	14	Salary	61000
42	14	Score	88
43	15	Age	33
44	15	Salary	67000
45	15	Score	90

10. PROC APPEND

PROC APPEND BASE=fix_date

DATA=formatted;

RUN;

Proc print data=fix_date;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean
1	1	John	25	50000	2022-01-15	85	15JAN2022
2	2	Mary	25	60000	2022-02-20	90	20FEB2022
3	3	Steve	45	70000	2022-03-10	88	10MAR2022
4	4	Anna	30	55000	2022-04-12	85	12APR2022
5	5	Mike	60	45000	2022-05-05	75	05MAY2022
6	6	Sara	29	0	2022-06-25	92	25JUN2022
7	7	David	34	72000	2022-13-01	80	.
8	8	Lisa	28	68000	2022-07-15	95	15JUL2022
9	9	Tom	31	65000		89	.
10	10	Emma	27	70000	2022-08-20	91	20AUG2022
11	11	Raj	26	62000	2022-09-10	87	10SEP2022
12	12	Ravi	30	1000	2022-10-11	85	11OCT2022
13	13	Priya	29	58000	2022-11-01	93	01NOV2022
14	14	Arun	30	61000	2022-12-05	88	05DEC2022
15	15	Kiran	33	67000	2022-06-31	90	.
16	1	John	25	50000	2022-01-15	85	15JAN2022
17	2	Mary	25	60000	2022-02-20	90	20FEB2022
18	3	Steve	45	70000	2022-03-10	88	10MAR2022
19	4	Anna	30	55000	2022-04-12	85	12APR2022
20	5	Mike	60	45000	2022-05-05	75	05MAY2022
21	6	Sara	29	0	2022-06-25	92	25JUN2022
22	7	David	34	72000	2022-13-01	80	.
23	8	Lisa	28	68000	2022-07-15	95	15JUL2022
24	9	Tom	31	65000		89	.
25	10	Emma	27	70000	2022-08-20	91	20AUG2022
26	11	Raj	26	62000	2022-09-10	87	10SEP2022
27	12	Ravi	30	1000	2022-10-11	85	11OCT2022
28	13	Priya	29	58000	2022-11-01	93	01NOV2022
29	14	Arun	30	61000	2022-12-05	88	05DEC2022
30	15	Kiran	33	67000	2022-06-31	90	.

11. STRING CLEANING (TRIM/STRIP/PROPCASE)

DATA clean_names;

SET formatted;

Name = PROPCASE(STRIP(Name));

RUN;

Proc print data=clean_names;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean
1	1	John	25	Low	2022-01-15	85	15JAN2022
2	2	Mary	25	Medium	2022-02-20	90	20FEB2022
3	3	Steve	45	Medium	2022-03-10	88	10MAR2022
4	4	Anna	30	Medium	2022-04-12	85	12APR2022
5	5	Mike	60	Low	2022-05-05	75	05MAY2022
6	6	Sara	29	Low	2022-06-25	92	25JUN2022
7	7	David	34	High	2022-13-01	80	.
8	8	Lisa	28	Medium	2022-07-15	95	15JUL2022
9	9	Tom	31	Medium		89	.
10	10	Emma	27	Medium	2022-08-20	91	20AUG2022
11	11	Raj	26	Medium	2022-09-10	87	10SEP2022
12	12	Ravi	30	Low	2022-10-11	85	11OCT2022
13	13	Priya	29	Medium	2022-11-01	93	01NOV2022
14	14	Arun	30	Medium	2022-12-05	88	05DEC2022
15	15	Kiran	33	Medium	2022-06-31	90	.

12. INTNX – Date Increment

DATA add_month;

SET clean_names;

Next_Date = INTNX('MONTH', JoinDate_clean, 1);

FORMAT Next_Date DATE9.;

RUN;

Proc print data=add_month;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean	Next_Date
1	1	John	25	Low	2022-01-15	85	15JAN2022	01FEB2022
2	2	Mary	25	Medium	2022-02-20	90	20FEB2022	01MAR2022
3	3	Steve	45	Medium	2022-03-10	88	10MAR2022	01APR2022
4	4	Anna	30	Medium	2022-04-12	85	12APR2022	01MAY2022
5	5	Mike	60	Low	2022-05-05	75	05MAY2022	01JUN2022
6	6	Sara	29	Low	2022-06-25	92	25JUN2022	01JUL2022
7	7	David	34	High	2022-13-01	80	.	.
8	8	Lisa	28	Medium	2022-07-15	95	15JUL2022	01AUG2022
9	9	Tom	31	Medium		89	.	.
10	10	Emma	27	Medium	2022-08-20	91	20AUG2022	01SEP2022
11	11	Raj	26	Medium	2022-09-10	87	10SEP2022	01OCT2022
12	12	Ravi	30	Low	2022-10-11	85	11OCT2022	01NOV2022
13	13	Priya	29	Medium	2022-11-01	93	01NOV2022	01DEC2022
14	14	Arun	30	Medium	2022-12-05	88	05DEC2022	01JAN2023
15	15	Kiran	33	Medium	2022-06-31	90	.	.

13. INTCK – Duration

DATA duration;

SET add_month;

Months = INTCK('MONTH', JoinDate_clean, TODAY());

RUN;

Proc print data=duration;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean	Next_Date	Months
1	1	John	25	Low	2022-01-15	85	15JAN2022	01FEB2022	51
2	2	Mary	25	Medium	2022-02-20	90	20FEB2022	01MAR2022	50
3	3	Steve	45	Medium	2022-03-10	88	10MAR2022	01APR2022	49
4	4	Anna	30	Medium	2022-04-12	85	12APR2022	01MAY2022	48
5	5	Mike	60	Low	2022-05-05	75	05MAY2022	01JUN2022	47
6	6	Sara	29	Low	2022-06-25	92	25JUN2022	01JUL2022	46
7	7	David	34	High	2022-13-01	80	.	.	.
8	8	Lisa	28	Medium	2022-07-15	95	15JUL2022	01AUG2022	45
9	9	Tom	31	Medium		89	.	.	.
10	10	Emma	27	Medium	2022-08-20	91	20AUG2022	01SEP2022	44
11	11	Raj	26	Medium	2022-09-10	87	10SEP2022	01OCT2022	43
12	12	Ravi	30	Low	2022-10-11	85	11OCT2022	01NOV2022	42
13	13	Priya	29	Medium	2022-11-01	93	01NOV2022	01DEC2022	41
14	14	Arun	30	Medium	2022-12-05	88	05DEC2022	01JAN2023	40
15	15	Kiran	33	Medium	2022-06-31	90	.	.	.

14. MACRO – Automated Cleaning

%MACRO clean_data(ds);

DATA &ds._clean;

SET &ds;

Salary = ABS(Salary);

RUN;

Proc print data=&ds._clean;

run;

%MEND;

%clean_data(raw_errors);

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score
1	1	John	25	50000	2022-01-15	85
2	2	Mary	-30	60000	2022-02-20	90
3	3	Steve	45	70000	2022-03-10	88
4	4	Anna	.	55000	2022-04-12	.
5	5	Mike	200	45000	2022-05-05	75
6	6	Sara	29	0	2022-06-25	92
7	7	David	34	72000	2022-13-01	80
8	8	Lisa	28	68000	2022-07-15	95
9	9	Tom	31	65000		89
10	10	Emma	27	70000	2022-08-20	91
11	11	Raj	26	62000	2022-09-10	87
12	12	Ravi	30	1000	2022-10-11	85
13	13	Priya	29	58000	2022-11-01	93
14	14	Arun	.	61000	2022-12-05	88
15	15	Kiran	33	67000	2022-06-31	90

15. FINAL DATA STEP

DATA final_clean;

SET duration;

RUN;

Proc print data=final_clean;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean	Next_Date	Months
1	1	John	25	Low	2022-01-15	85	15JAN2022	01FEB2022	51
2	2	Mary	25	Medium	2022-02-20	90	20FEB2022	01MAR2022	50
3	3	Steve	45	Medium	2022-03-10	88	10MAR2022	01APR2022	49
4	4	Anna	30	Medium	2022-04-12	85	12APR2022	01MAY2022	48
5	5	Mike	60	Low	2022-05-05	75	05MAY2022	01JUN2022	47
6	6	Sara	29	Low	2022-06-25	92	25JUN2022	01JUL2022	46
7	7	David	34	High	2022-13-01	80	.	.	.
8	8	Lisa	28	Medium	2022-07-15	95	15JUL2022	01AUG2022	45
9	9	Tom	31	Medium		89	.	.	.
10	10	Emma	27	Medium	2022-08-20	91	20AUG2022	01SEP2022	44
11	11	Raj	26	Medium	2022-09-10	87	10SEP2022	01OCT2022	43
12	12	Ravi	30	Low	2022-10-11	85	11OCT2022	01NOV2022	42
13	13	Priya	29	Medium	2022-11-01	93	01NOV2022	01DEC2022	41
14	14	Arun	30	Medium	2022-12-05	88	05DEC2022	01JAN2023	40
15	15	Kiran	33	Medium	2022-06-31	90	.	.	.

16. Master Dataset

PROC SORT DATA=raw_errors OUT=sorted;

BY ID;

RUN;

LOG:

 NOTE: There were 15 observations read from the data set WORK.RAW_ERRORS.

 NOTE: The data set WORK.SORTED has 15 observations and 6 variables.

DATA cleaned;

SET sorted;

Salary = ABS(Salary);

Age = COALESCE(Age,30);

Score = COALESCE(Score,85);

IF Age > 100 THEN Age = 60;

IF Age < 18 THEN Age = 25;

JoinDate_clean = INPUT(JoinDate, YYMMDD10.);

FORMAT JoinDate_clean DATE9.;

Name = PROPCASE(STRIP(Name));

Next_Date = INTNX('MONTH', JoinDate_clean, 1);

Months = INTCK('MONTH', JoinDate_clean, TODAY());

RUN;

Proc print data=cleaned;

run;

OUTPUT:


Obs	ID	Name	Age	Salary	JoinDate	Score	JoinDate_clean	Next_Date	Months
1	1	John	25	50000	2022-01-15	85	15JAN2022	22677	51
2	2	Mary	25	60000	2022-02-20	90	20FEB2022	22705	50
3	3	Steve	45	70000	2022-03-10	88	10MAR2022	22736	49
4	4	Anna	30	55000	2022-04-12	85	12APR2022	22766	48
5	5	Mike	60	45000	2022-05-05	75	05MAY2022	22797	47
6	6	Sara	29	0	2022-06-25	92	25JUN2022	22827	46
7	7	David	34	72000	2022-13-01	80	.	.	.
8	8	Lisa	28	68000	2022-07-15	95	15JUL2022	22858	45
9	9	Tom	31	65000		89	.	.	.
10	10	Emma	27	70000	2022-08-20	91	20AUG2022	22889	44
11	11	Raj	26	62000	2022-09-10	87	10SEP2022	22919	43
12	12	Ravi	30	1000	2022-10-11	85	11OCT2022	22950	42
13	13	Priya	29	58000	2022-11-01	93	01NOV2022	22980	41
14	14	Arun	30	61000	2022-12-05	88	05DEC2022	23011	40
15	15	Kiran	33	67000	2022-06-31	90	.	.	.

17. 20 Advanced Insights

Always validate before and after cleaning
Use macros for scalability
Avoid hardcoding thresholds
Separate raw and cleaned datasets
Log all transformations
Use formats for reporting
Missing ≠ zero
Dates must be numeric
Sorting improves joins
Validate ranges early
Use PROC FREQ for categorical checks
Keep audit trails
Avoid overwriting raw data
Use labels for clarity
Standardize naming conventions
Check duplicates
Use indexing for large data
Optimize memory usage
Automate repetitive tasks
Always think like a data auditor

18. Business Context

In a corporate environment, poor data quality directly translates into financial loss. Imagine a retail company calculating employee bonuses using incorrect salary data. Negative or missing values could lead to underpayment or overpayment, both of which create operational and legal risks.

Similarly, HR analytics depends heavily on accurate age and tenure data. Incorrect join dates distort attrition analysis, workforce planning, and promotion cycles. A company might mistakenly assume high employee turnover due to flawed duration calculations.

In financial reporting, even a small percentage of erroneous data can lead to incorrect forecasts. Executives rely on dashboards built from datasets like this. If the underlying data is flawed, strategic decisions—such as expansion, hiring, or investment become misinformed.

Data cleaning using SAS ensures consistency, reliability, and auditability. By applying structured transformations (ABS, COALESCE, INTNX, INTCK), organizations create standardized pipelines. These pipelines reduce manual intervention, minimize human error, and improve reproducibility.

Moreover, regulatory industries like pharmaceuticals require strict compliance. Clean datasets are essential for submissions, audits, and validation processes. A well-designed SAS cleaning workflow ensures traceability every change is documented and reproducible.

Ultimately, investing time in data cleaning saves exponentially more time downstream. Analysts spend less time debugging and more time generating insights. Decision-makers gain confidence in reports, leading to faster and more accurate business actions.

19. 20 Key Points SAS Data Error Fixing Framework

Raw data is inherently messy, often containing hidden errors that silently corrupt analysis if left untreated.
The first step is always data profiling, where you understand distributions, missingness, and anomalies using procedures like PROC MEANS and PROC FREQ.
Sorting the dataset using PROC SORT establishes structural order, which is essential for merges, deduplication, and BY-group processing.
Identifying negative values in non-negative fields (like salary) is critical, as they indicate data entry or system errors.
Functions like ABS() help normalize such anomalies, converting incorrect negative values into usable positive numbers.
Missing values are dangerous because they propagate through calculations and distort averages, totals, and model outputs.
Using COALESCE() ensures that missing values are replaced with logical defaults or derived values, maintaining dataset completeness.
Range validation ensures that variables like age fall within realistic human limits, preventing statistical distortion.
Conditional logic (IF-THEN) enforces business rules, correcting outliers such as unrealistic ages (e.g., 200 years).
Date variables must be converted from character to numeric using INPUT() to enable time-based analytics.
Invalid dates automatically become missing in SAS, making it easier to isolate and correct them.
Functions like INTNX() allow controlled date increments, which are essential for forecasting and scheduling analysis.
INTCK() calculates time intervals, enabling tenure analysis, cohort tracking, and lifecycle insights.
String inconsistencies (extra spaces, inconsistent casing) are resolved using STRIP(), TRIM(), and PROPCASE().
Applying PROC FORMAT transforms raw numeric values into meaningful categories, improving interpretability for stakeholders.
PROC TRANSPOSE reshapes data structures, making datasets compatible with reporting tools and statistical models.
PROC APPEND enables scalable data integration by efficiently combining datasets without rewriting code.
SAS macros automate repetitive cleaning tasks, ensuring consistency and reducing manual coding effort.
Validation after each transformation step ensures that corrections are accurate and no new errors are introduced.
The final cleaned dataset becomes analysis-ready, enabling reliable reporting, accurate modeling, and confident business decision-making.

20. Summary & Conclusion

This project demonstrates a fundamental truth: data cleaning is not a preliminary step it is the foundation of analytics. Without it, even the most sophisticated models fail.

We started with a chaotic dataset filled with negative values, missing entries, invalid dates, and logical inconsistencies. Step by step, we transformed it into a structured, reliable dataset using SAS techniques.

Functions like ABS() corrected financial anomalies, while COALESCE() ensured completeness. Date functions such as INTNX and INTCK enabled temporal analysis. String functions (STRIP, PROPCASE) standardized textual data. Procedures like PROC SORT, PROC MEANS, and PROC TRANSPOSE enhanced structure and validation.

More importantly, every transformation was guided by business logic, not just technical execution. This is what separates a programmer from a data professional.

In real-world scenarios, data is messy by default. Your job is to impose order, enforce rules, and ensure trust. Once data becomes reliable, everything else analysis, modeling, reporting becomes exponentially easier.

Think of data cleaning as building a strong foundation. If the foundation is weak, the entire structure collapses. But if it is solid, you can build anything on top of it.

21. Interview Preparation

1. Why use ABS() in data cleaning?

Answer: To correct negative numeric values when they are logically invalid, especially in financial data.

2. Difference between COALESCE and IF-THEN?

Answer: COALESCE handles missing values efficiently across multiple variables, while IF-THEN is conditional logic.

3. Why convert dates using INPUT()?

Answer: SAS requires numeric dates for calculations like duration and intervals.

4. What is INTNX vs INTCK?

Answer: INTNX increments dates; INTCK counts intervals.

5. Why avoid overwriting raw data?

Answer: To maintain audit trails and ensure reproducibility.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about DATA ERROR FIXING.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can we analyze planets using SAS like NASA scientists?
2.Can SAS Compare How 3 Different Families Live in Their Homes?
3.Can SAS Help Us Choose the Most Efficient Cooler?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

444.Turning Messy Raw Data Into Decision-Ready Gold With SAS Error Fixing Techniques

Can A Step-By-Step SAS Data Error Fixing Framework Turn Chaotic Raw Data Into Trusted Intelligence?

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can we analyze planets using SAS like NASA scientists?
2.Can SAS Compare How 3 Different Families Live in Their Homes?
3.Can SAS Help Us Choose the Most Efficient Cooler?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

418.Can We Design, Debug, Detect Fraud, And Optimize A Smart Parking System Using Advanced SAS Programming Techniques?

444.Turning Messy Raw Data Into Decision-Ready Gold With SAS Error Fixing Techniques

Can A Step-By-Step SAS Data Error Fixing Framework Turn Chaotic Raw Data Into Trusted Intelligence?

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can we analyze planets using SAS like NASA scientists?2.Can SAS Compare How 3 Different Families Live in Their Homes?3.Can SAS Help Us Choose the Most Efficient Cooler?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

418.Can We Design, Debug, Detect Fraud, And Optimize A Smart Parking System Using Advanced SAS Programming Techniques?

1.Can we analyze planets using SAS like NASA scientists?
2.Can SAS Compare How 3 Different Families Live in Their Homes?
3.Can SAS Help Us Choose the Most Efficient Cooler?