447.Data Gone Wrong to Data Done Right Using IF-THEN Logic in SAS
Mastering IF-THEN Logic in SAS Using a School Dataset with Intentional Errors
1. Introduction
Imagine
you are working as a SAS programmer in a clinical trial company or a school
analytics system. You receive a dataset that looks fine at first glance,
but once you start analyzing, everything falls apart ages are negative, dates
don’t make sense, names are inconsistent, and duplicate records exist.
This is
not just a “messy dataset.” This is a silent threat to decision-making.
In
clinical trials, incorrect patient age can impact dosing decisions. In a school
dataset, wrong student data can mislead performance analytics, scholarship
eligibility, or attendance tracking.
Bad data
doesn’t just reduce accuracy it destroys trust.
This is
where IF-THEN logic in SAS becomes powerful. It allows you to enforce
business rules, correct inconsistencies, and transform unreliable data into a
trusted analytical asset.
Both SAS
and R are widely used in data cleaning:
- SAS excels in structured,
rule-based transformations (especially in regulated environments like
CDISC SDTM/ADaM).
- R provides flexibility and
modern data manipulation capabilities.
In this blog, we will simulate a school dataset with intentional errors, clean it step-by-step using SAS and R, and understand the business logic behind each transformation.
2. Raw Data Creation in SAS and R
Dataset Variables (7 Variables)
- Student_ID
- Name
- Age
- Gender
- Marks
- Join_Date
- City
SAS Code: Raw Dataset Creation
DATA school_raw;
INFILE DATALINES DLM='|' MISSOVER;
INPUT Student_ID Name $ Age Gender $ Marks
Join_Date : DATE9. City:$15.;
FORMAT Join_Date DATE9.;
DATALINES;
101|John|15|M|85|12JAN2020|Delhi
102|NULL|-16|F|90|15FEB2020|Mumbai
103|Alice|.|F|78|.|Chennai
104|Bob|200|M|-10|25MAR2020|Delhi
105|John|15|M|85|12JAN2020|Delhi
106|Sara|14|f|88|01APR2020|Kolkata
107|Mike|13|M|.|05MAY2020|NULL
108| |12|F|70|10JUN2020|Hyderabad
109|David|-5|M|95|15JUL2020|Pune
110|Emma|16|F|105|20AUG2020|Bangalore
;
RUN;
PROC PRINT DATA=school_raw;
RUN;
OUTPUT:
| Obs | Student_ID | Name | Age | Gender | Marks | Join_Date | City |
|---|---|---|---|---|---|---|---|
| 1 | 101 | John | 15 | M | 85 | 12JAN2020 | Delhi |
| 2 | 102 | NULL | -16 | F | 90 | 15FEB2020 | Mumbai |
| 3 | 103 | Alice | . | F | 78 | . | Chennai |
| 4 | 104 | Bob | 200 | M | -10 | 25MAR2020 | Delhi |
| 5 | 105 | John | 15 | M | 85 | 12JAN2020 | Delhi |
| 6 | 106 | Sara | 14 | f | 88 | 01APR2020 | Kolkata |
| 7 | 107 | Mike | 13 | M | . | 05MAY2020 | NULL |
| 8 | 108 | 12 | F | 70 | 10JUN2020 | Hyderabad | |
| 9 | 109 | David | -5 | M | 95 | 15JUL2020 | Pune |
| 10 | 110 | Emma | 16 | F | 105 | 20AUG2020 | Bangalore |
Explanation + Key Points
This SAS dataset is intentionally designed with real-world data quality issues:
- Missing values (Age, Join_Date, Name)
- Invalid values (negative age, marks >100)
- Duplicate record (Student_ID 105 same as 101)
- Inconsistent text (“NULL”, lowercase gender)
- Blank name (Student_ID 108)
Using DATALINES, we simulate raw ingestion similar to receiving flat files from external systems. The FORMAT ensures date readability.
Key Insight: Raw data should never be trusted blindly. Always assume errors exist and design validation logic accordingly.
Explanation
The MISSOVER option prevents SAS from reading the next line when data is missing in the current row. Without it, SAS attempts to “borrow” values from the next line, causing misalignment and _ERROR_. While MISSOVER improves robustness, it does not fix structural issues completely it only prevents spillover. You still need consistent placeholders for missing values. In production environments, combining MISSOVER, TRUNCOVER, and proper data validation checks ensures safe ingestion of imperfect datasets. However, the best practice remains: clean data at source whenever possible.
Equivalent R Dataset
school_raw <- data.frame(
Student_ID = c(101,102,103,104,105,106,107,108,109,110),
Name = c("John","NULL","Alice","Bob","John","Sara","Mike",
"", "David","Emma"),
Age = c(15,-16,NA,200,15,14,13,12,-5,16),
Gender = c("M","F","F","M","M","f","M","F","M","F"),
Marks = c(85,90,78,-10,85,88,NA,70,95,105),
Join_Date = as.Date(c("2020-01-12","2020-02-15",NA,"2020-03-25",
"2020-01-12","2020-04-01","2020-05-05",
"2020-06-10","2020-07-15","2020-08-20")),
City = c("Delhi","Mumbai","Chennai","Delhi","Delhi","Kolkata",
"NULL","Hyderabad","Pune","Bangalore")
)
|
|
Student_ID |
Name |
Age |
Gender |
Marks |
Join_Date |
City |
|
1 |
101 |
John |
15 |
M |
85 |
12-01-2020 |
Delhi |
|
2 |
102 |
NULL |
-16 |
F |
90 |
15-02-2020 |
Mumbai |
|
3 |
103 |
Alice |
NA |
F |
78 |
NA |
Chennai |
|
4 |
104 |
Bob |
200 |
M |
-10 |
25-03-2020 |
Delhi |
|
5 |
105 |
John |
15 |
M |
85 |
12-01-2020 |
Delhi |
|
6 |
106 |
Sara |
14 |
f |
88 |
01-04-2020 |
Kolkata |
|
7 |
107 |
Mike |
13 |
M |
NA |
05-05-2020 |
NULL |
|
8 |
108 |
|
12 |
F |
70 |
10-06-2020 |
Hyderabad |
|
9 |
109 |
David |
-5 |
M |
95 |
15-07-2020 |
Pune |
|
10 |
110 |
Emma |
16 |
F |
105 |
20-08-2020 |
Bangalore |
Explanation + Key Points
The R
dataset mirrors SAS but uses data.frame() for creation. Missing values are
represented using NA, and dates are handled using as.Date().
Notice:
- Empty string ("")
for missing name
- "NULL" used as
string placeholder
- Invalid numeric values
retained intentionally
Key
Insight: R
provides flexibility but requires explicit handling of missing values and type
consistency. Without proper cleaning, statistical functions may produce misleading
outputs.
3. Phase 1: Data Cleaning in SAS (IF-THEN Logic Focus)
DATA school_clean1;
SET school_raw;
/* Handle missing and NULL names */
IF Name = "" OR Name = "NULL" THEN Name = "UNKNOWN";
/* Fix Age */
IF Age < 0 THEN Age = ABS(Age);
IF Age > 100 OR Age = . THEN Age = 15;
/* Standardize Gender */
Gender = UPCASE(Gender);
/* Fix Marks */
IF Marks < 0 THEN Marks = ABS(Marks);
IF Marks > 100 THEN Marks = 100;
IF Marks = . THEN Marks = 75;
/* Fix City */
IF City = "NULL" THEN City = "UNKNOWN";
/* Fix missing dates */
IF Join_Date = . THEN Join_Date = TODAY();
RUN;
PROC PRINT DATA=school_clean1;
RUN;
OUTPUT:
| Obs | Student_ID | Name | Age | Gender | Marks | Join_Date | City |
|---|---|---|---|---|---|---|---|
| 1 | 101 | John | 15 | M | 85 | 12JAN2020 | Delhi |
| 2 | 102 | UNKNOWN | 16 | F | 90 | 15FEB2020 | Mumbai |
| 3 | 103 | Alice | 15 | F | 78 | 10APR2026 | Chennai |
| 4 | 104 | Bob | 15 | M | 10 | 25MAR2020 | Delhi |
| 5 | 105 | John | 15 | M | 85 | 12JAN2020 | Delhi |
| 6 | 106 | Sara | 14 | F | 88 | 01APR2020 | Kolkata |
| 7 | 107 | Mike | 13 | M | 75 | 05MAY2020 | UNKNOWN |
| 8 | 108 | UNKNOWN | 12 | F | 70 | 10JUN2020 | Hyderabad |
| 9 | 109 | David | 5 | M | 95 | 15JUL2020 | Pune |
| 10 | 110 | Emma | 16 | F | 100 | 20AUG2020 | Bangalore |
/* Remove duplicates */
PROC SORT DATA=school_clean1 NODUPKEY;
BY Student_ID;
RUN;
LOG:
NOTE: 0 observations with duplicate key values were deleted.
PROC PRINT DATA=school_clean1;
RUN;
OUTPUT:
| Obs | Student_ID | Name | Age | Gender | Marks | Join_Date | City |
|---|---|---|---|---|---|---|---|
| 1 | 101 | John | 15 | M | 85 | 12JAN2020 | Delhi |
| 2 | 102 | UNKNOWN | 16 | F | 90 | 15FEB2020 | Mumbai |
| 3 | 103 | Alice | 15 | F | 78 | 10APR2026 | Chennai |
| 4 | 104 | Bob | 15 | M | 10 | 25MAR2020 | Delhi |
| 5 | 105 | John | 15 | M | 85 | 12JAN2020 | Delhi |
| 6 | 106 | Sara | 14 | F | 88 | 01APR2020 | Kolkata |
| 7 | 107 | Mike | 13 | M | 75 | 05MAY2020 | UNKNOWN |
| 8 | 108 | UNKNOWN | 12 | F | 70 | 10JUN2020 | Hyderabad |
| 9 | 109 | David | 5 | M | 95 | 15JUL2020 | Pune |
| 10 | 110 | Emma | 16 | F | 100 | 20AUG2020 | Bangalore |
Explanation + Key Points
This is
where IF-THEN logic becomes the backbone of data cleaning.
- IF Name = "" OR
Name = "NULL" standardizes missing text.
- ABS(Age) corrects negative
values.
- Conditional logic ensures
Age remains within realistic bounds.
- UPCASE() enforces
consistency for categorical variables.
- Missing dates are imputed
using TODAY() common in operational datasets.
- PROC SORT NODUPKEY removes
duplicate records based on primary key.
Key
Insight: SAS
IF-THEN logic allows deterministic, auditable transformations critical for
regulatory environments like clinical trials.
4. Phase 2: Data Cleaning in R
library(dplyr)
school_clean <- school_raw %>%
mutate(Name = ifelse(Name == "" | Name == "NULL" , "UNKNOWN", Name),
Age = ifelse(is.na(Age) | Age > 100, 15, abs(Age)),
Gender = toupper(Gender),
Marks = ifelse(is.na(Marks), 75,
ifelse(Marks > 100, 100, abs(Marks))),
City = ifelse(City == "NULL", "UNKNOWN", City),
Join_Date = ifelse(is.na(Join_Date), Sys.Date(), Join_Date)
) %>%
distinct(Student_ID, .keep_all = TRUE)
OUTPUT:
|
|
Student_ID |
Name |
Age |
Gender |
Marks |
Join_Date |
City |
|
1 |
101 |
John |
15 |
M |
85 |
18273 |
Delhi |
|
2 |
102 |
UNKNOWN |
16 |
F |
90 |
18307 |
Mumbai |
|
3 |
103 |
Alice |
15 |
F |
78 |
20553 |
Chennai |
|
4 |
104 |
Bob |
15 |
M |
10 |
18346 |
Delhi |
|
5 |
105 |
John |
15 |
M |
85 |
18273 |
Delhi |
|
6 |
106 |
Sara |
14 |
F |
88 |
18353 |
Kolkata |
|
7 |
107 |
Mike |
13 |
M |
75 |
18387 |
UNKNOWN |
|
8 |
108 |
UNKNOWN |
12 |
F |
70 |
18423 |
Hyderabad |
|
9 |
109 |
David |
5 |
M |
95 |
18458 |
Pune |
|
10 |
110 |
Emma |
16 |
F |
100 |
18494 |
Bangalore |
Explanation + Key Points
R uses dplyr
for declarative transformations:
- mutate() applies column-wise
transformations
- ifelse() replaces SAS
IF-THEN logic
- toupper() standardizes
categorical variables
- distinct() removes
duplicates
Nested ifelse()
enables multi-condition handling similar to SAS.
Key
Insight: R is
expressive and compact but can become harder to audit compared to SAS in
regulated workflows. However, it excels in rapid prototyping and exploratory
analysis.
5. 20 Additional Data Cleaning Best Practices
- Always validate against
protocol (clinical trials)
- Maintain audit trails for
transformations
- Never overwrite raw data
- Use metadata-driven
programming
- Validate ranges using domain
knowledge
- Apply controlled terminology
(CDISC)
- Perform double programming
validation
- Document assumptions clearly
- Use macros for reusable
logic
- Perform frequency checks
- Cross-domain validation
(SDTM consistency)
- Handle missing data
systematically
- Flag imputed values
- Use version control
- Perform QC checks
independently
- Validate date sequences
- Ensure key uniqueness
- Avoid hardcoding values
- Use logs for error tracking
- Test edge cases thoroughly
6. Business Logic Behind Data Cleaning
Data
cleaning is not random it is business-rule driven.
- Missing Age → Default Value
(15):
In school datasets, a typical student age range exists. Missing values are imputed based on expected distribution. - Negative Age → ABS():
Negative values are data entry errors, not real-world possibilities. - Marks >100 → Cap at 100:
Academic systems have defined scoring limits. - Missing Dates → TODAY():
Useful in operational tracking when actual date is unavailable.
Impact:
Incorrect data leads to flawed analytics wrong performance rankings, incorrect
dropout predictions, or flawed clinical endpoints.
7. 20 Key Points (Sharp & Impactful)
- Dirty data leads to wrong
conclusions.
- IF-THEN logic enforces
business rules.
- Missing values must be
handled explicitly.
- Standardization ensures
reproducibility.
- Duplicate records distort
analytics.
- Validation is not
optional—it is mandatory.
- Data cleaning is 70% of
analytics work.
- Consistency beats
complexity.
- Every variable needs domain
understanding.
- Logs are your debugging
backbone.
- Never trust raw data blindly.
- Audit trails ensure
compliance.
- Automation reduces human
error.
- Edge cases define
robustness.
- Clean data drives better
decisions.
- Reproducibility is key in
regulated environments.
- SAS excels in structured
pipelines.
- R excels in flexibility.
- Documentation is as
important as code.
- Good data = Good insights.
8. Summary
SAS and R
both offer powerful data cleaning capabilities:
|
Feature |
SAS |
R |
|
Logic
Handling |
Strong
IF-THEN |
Flexible
ifelse |
|
Auditability |
High |
Moderate |
|
Regulatory
Use |
Preferred |
Limited |
|
Flexibility |
Moderate |
High |
SAS is
ideal for structured, repeatable workflows, while R is excellent for exploration
and flexibility.
9. Conclusion
Mastering
IF-THEN logic in SAS is not just about writing conditions it’s about embedding
business intelligence into your data pipeline.
A messy
dataset is inevitable. A messy analysis is optional.
When you combine:
- Strong validation rules
- Structured cleaning logic
- Domain knowledge
You
transform unreliable data into a decision-grade asset.
10. Interview Questions
1. How would you handle a
dataset where age contains negative and missing values in SAS?
Answer:
Handling
age requires both data correction logic and business rules.
SAS:
DATA clean_age;
SET raw_data;
/* Handle negative values */
IF Age < 0 THEN Age = ABS(Age);
/* Handle missing or unrealistic
values */
IF Age = . OR Age > 100 THEN Age = 30; /* Example imputation */
RUN;
Explanation:
Negative
age values are logically invalid and usually result from data entry errors, so
applying ABS() corrects them without losing information. Missing values (.) and
unrealistic values (e.g., Age > 100 in a school dataset) must be handled
using domain-driven imputation rules. For example, replacing with a median or
typical value (like 30 or 15 depending on context). In clinical trials, you
wouldn’t directly overwrite values—you might derive a new variable (e.g., AGE_IMP)
and flag it. The key is ensuring traceability, auditability, and consistency
in transformations.
2. Explain how PROC SORT
NODUPKEY works and when to use it?
Answer:
SAS Code
PROC SORT DATA=raw_data
OUT=dedup_data NODUPKEY;
BY Student_ID;
RUN;
Explanation:
PROC SORT
NODUPKEY removes duplicate observations based on the BY variables. It
keeps the first occurrence and removes subsequent duplicates. This is
useful when you have a unique identifier like Student_ID or USUBJID in clinical
trials. However, it does not compare all columns—it only checks the BY
variables. So if duplicates have conflicting values, this method may silently
discard important data. In regulated environments (like SDTM/ADaM), you should
first investigate duplicates using PROC FREQ or PROC SQL before removing them.
Always ensure that deduplication aligns with business rules.
3. In R, how would you
handle nested conditions similar to SAS IF-THEN?
Answer:
R Code
library(dplyr)
clean_data <- raw_data %>%
mutate(
Age = ifelse(is.na(Age), 30,
ifelse(Age < 0, abs(Age),
ifelse(Age > 100, 30, Age)))
)
Explanation:
In R,
nested conditions are implemented using ifelse() inside mutate(). Each ifelse()
acts like an IF-THEN-ELSE block in SAS. The structure flows from top to bottom,
similar to SAS logic. However, deeply nested ifelse() can reduce readability.
In such cases, case_when() from dplyr is preferred for clarity. The key
difference is that SAS executes row-wise in a data step, while R uses
vectorized operations. Understanding this distinction helps in writing
efficient and readable code. Always validate results after applying nested
conditions to avoid logical errors.
4. How do you validate
cleaned data in clinical trial datasets (SDTM/ADaM)?
Answer:
Validation Approach
- Perform double
programming (independent validation)
- Use PROC COMPARE
- Run range checks and
consistency checks
- Validate against SAP
(Statistical Analysis Plan)
- Cross-check with raw data
(SDTM vs ADaM)
SAS Example
PROC COMPARE BASE=adam_dataset
COMP=qc_dataset;
RUN;
Explanation:
Validation
in clinical trials is critical due to regulatory requirements (FDA, CDISC).
After cleaning, datasets must undergo independent QC validation, often
by a second programmer. PROC COMPARE is used to compare production and QC
datasets. Additionally, range checks (e.g., Age limits), consistency checks
(e.g., date sequences), and cross-domain validation (e.g., AE vs DM) are
performed. All transformations must align with the SAP. Any discrepancies are
documented and resolved. Audit trails are maintained to ensure traceability.
The goal is to ensure the dataset is analysis-ready, accurate, and compliant.
5. If duplicate records have
conflicting values, how would you resolve them?
Answer:
SAS:
PROC SORT DATA=raw_data;
BY Student_ID DESCENDING
Join_Date;
RUN;
DATA resolved;
SET raw_data;
BY Student_ID;
IF FIRST.Student_ID;
RUN;
Explanation:
When
duplicates have conflicting values, simple removal is risky. Instead,
resolution must follow business logic. For example, keeping the latest
record based on Join_Date or selecting the most complete record. In SAS,
sorting by a priority variable (e.g., descending date) ensures the preferred
record appears first, and FIRST. logic retains it. Alternatively, you can use PROC
SQL to aggregate or compare records. In clinical trials, discrepancies must be
investigated and documented rather than blindly resolved. The chosen approach
should be justified, reproducible, and aligned with study rules.
Bonus: Validation Checklist
- ✔ Missing values handled
- ✔ Range checks applied
- ✔ Duplicates removed
- ✔ Formats standardized
- ✔ Business rules applied
- ✔ Logs reviewed
- ✔ QC performed
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SCHOOL DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment