Mastering IF-THEN Logic in SAS Using a School Dataset with Intentional Errors

1. Introduction

Imagine you are working as a SAS programmer in a clinical trial company or a school analytics system. You receive a dataset that looks fine at first glance, but once you start analyzing, everything falls apart ages are negative, dates don’t make sense, names are inconsistent, and duplicate records exist.

This is not just a “messy dataset.” This is a silent threat to decision-making.

In clinical trials, incorrect patient age can impact dosing decisions. In a school dataset, wrong student data can mislead performance analytics, scholarship eligibility, or attendance tracking.

Bad data doesn’t just reduce accuracy it destroys trust.

This is where IF-THEN logic in SAS becomes powerful. It allows you to enforce business rules, correct inconsistencies, and transform unreliable data into a trusted analytical asset.

Both SAS and R are widely used in data cleaning:

SAS excels in structured, rule-based transformations (especially in regulated environments like CDISC SDTM/ADaM).
R provides flexibility and modern data manipulation capabilities.

In this blog, we will simulate a school dataset with intentional errors, clean it step-by-step using SAS and R, and understand the business logic behind each transformation.

2. Raw Data Creation in SAS and R

Dataset Variables (7 Variables)

Student_ID
Name
Age
Gender
Marks
Join_Date
City

SAS Code: Raw Dataset Creation

DATA school_raw;

INFILE DATALINES DLM='|' MISSOVER;

INPUT Student_ID Name $ Age Gender $ Marks

Join_Date : DATE9. City:$15.;

FORMAT Join_Date DATE9.;

DATALINES;

101|John|15|M|85|12JAN2020|Delhi

102|NULL|-16|F|90|15FEB2020|Mumbai

103|Alice|.|F|78|.|Chennai

104|Bob|200|M|-10|25MAR2020|Delhi

105|John|15|M|85|12JAN2020|Delhi

106|Sara|14|f|88|01APR2020|Kolkata

107|Mike|13|M|.|05MAY2020|NULL

108| |12|F|70|10JUN2020|Hyderabad

109|David|-5|M|95|15JUL2020|Pune

110|Emma|16|F|105|20AUG2020|Bangalore

;

RUN;

PROC PRINT DATA=school_raw;

RUN;

OUTPUT:


Obs	Student_ID	Name	Age	Gender	Marks	Join_Date	City
1	101	John	15	M	85	12JAN2020	Delhi
2	102	NULL	-16	F	90	15FEB2020	Mumbai
3	103	Alice	.	F	78	.	Chennai
4	104	Bob	200	M	-10	25MAR2020	Delhi
5	105	John	15	M	85	12JAN2020	Delhi
6	106	Sara	14	f	88	01APR2020	Kolkata
7	107	Mike	13	M	.	05MAY2020	NULL
8	108		12	F	70	10JUN2020	Hyderabad
9	109	David	-5	M	95	15JUL2020	Pune
10	110	Emma	16	F	105	20AUG2020	Bangalore

Explanation + Key Points

This SAS dataset is intentionally designed with real-world data quality issues:

Missing values (Age, Join_Date, Name)
Invalid values (negative age, marks >100)
Duplicate record (Student_ID 105 same as 101)
Inconsistent text (“NULL”, lowercase gender)
Blank name (Student_ID 108)

Using DATALINES, we simulate raw ingestion similar to receiving flat files from external systems. The FORMAT ensures date readability.

Key Insight: Raw data should never be trusted blindly. Always assume errors exist and design validation logic accordingly.

Explanation

The MISSOVER option prevents SAS from reading the next line when data is missing in the current row. Without it, SAS attempts to “borrow” values from the next line, causing misalignment and _ERROR_. While MISSOVER improves robustness, it does not fix structural issues completely it only prevents spillover. You still need consistent placeholders for missing values. In production environments, combining MISSOVER, TRUNCOVER, and proper data validation checks ensures safe ingestion of imperfect datasets. However, the best practice remains: clean data at source whenever possible.

Equivalent R Dataset

school_raw <- data.frame(

Student_ID = c(101,102,103,104,105,106,107,108,109,110),

Name = c("John","NULL","Alice","Bob","John","Sara","Mike",

"", "David","Emma"),

Age = c(15,-16,NA,200,15,14,13,12,-5,16),

Gender = c("M","F","F","M","M","f","M","F","M","F"),

Marks = c(85,90,78,-10,85,88,NA,70,95,105),

Join_Date = as.Date(c("2020-01-12","2020-02-15",NA,"2020-03-25",

"2020-01-12","2020-04-01","2020-05-05",

"2020-06-10","2020-07-15","2020-08-20")),

City = c("Delhi","Mumbai","Chennai","Delhi","Delhi","Kolkata",

"NULL","Hyderabad","Pune","Bangalore")

)

OUTPUT:

	Student_ID	Name	Age	Gender	Marks	Join_Date	City
1	101	John	15	M	85	12-01-2020	Delhi
2	102	NULL	-16	F	90	15-02-2020	Mumbai
3	103	Alice	NA	F	78	NA	Chennai
4	104	Bob	200	M	-10	25-03-2020	Delhi
5	105	John	15	M	85	12-01-2020	Delhi
6	106	Sara	14	f	88	01-04-2020	Kolkata
7	107	Mike	13	M	NA	05-05-2020	NULL
8	108		12	F	70	10-06-2020	Hyderabad
9	109	David	-5	M	95	15-07-2020	Pune
10	110	Emma	16	F	105	20-08-2020	Bangalore

Explanation + Key Points

The R dataset mirrors SAS but uses data.frame() for creation. Missing values are represented using NA, and dates are handled using as.Date().

Notice:

Empty string ("") for missing name
"NULL" used as string placeholder
Invalid numeric values retained intentionally

Key Insight: R provides flexibility but requires explicit handling of missing values and type consistency. Without proper cleaning, statistical functions may produce misleading outputs.

3. Phase 1: Data Cleaning in SAS (IF-THEN Logic Focus)

DATA school_clean1;

SET school_raw;

/* Handle missing and NULL names */

IF Name = "" OR Name = "NULL" THEN Name = "UNKNOWN";

/* Fix Age */

IF Age < 0 THEN Age = ABS(Age);

IF Age > 100 OR Age = . THEN Age = 15;

/* Standardize Gender */

Gender = UPCASE(Gender);

/* Fix Marks */

IF Marks < 0 THEN Marks = ABS(Marks);

IF Marks > 100 THEN Marks = 100;

IF Marks = . THEN Marks = 75;

/* Fix City */

IF City = "NULL" THEN City = "UNKNOWN";

/* Fix missing dates */

IF Join_Date = . THEN Join_Date = TODAY();

RUN;

PROC PRINT DATA=school_clean1;

RUN;

OUTPUT:


Obs	Student_ID	Name	Age	Gender	Marks	Join_Date	City
1	101	John	15	M	85	12JAN2020	Delhi
2	102	UNKNOWN	16	F	90	15FEB2020	Mumbai
3	103	Alice	15	F	78	10APR2026	Chennai
4	104	Bob	15	M	10	25MAR2020	Delhi
5	105	John	15	M	85	12JAN2020	Delhi
6	106	Sara	14	F	88	01APR2020	Kolkata
7	107	Mike	13	M	75	05MAY2020	UNKNOWN
8	108	UNKNOWN	12	F	70	10JUN2020	Hyderabad
9	109	David	5	M	95	15JUL2020	Pune
10	110	Emma	16	F	100	20AUG2020	Bangalore

/* Remove duplicates */

PROC SORT DATA=school_clean1 NODUPKEY;

BY Student_ID;

RUN;

LOG:

NOTE: 0 observations with duplicate key values were deleted.

PROC PRINT DATA=school_clean1;

RUN;

OUTPUT:


Obs	Student_ID	Name	Age	Gender	Marks	Join_Date	City
1	101	John	15	M	85	12JAN2020	Delhi
2	102	UNKNOWN	16	F	90	15FEB2020	Mumbai
3	103	Alice	15	F	78	10APR2026	Chennai
4	104	Bob	15	M	10	25MAR2020	Delhi
5	105	John	15	M	85	12JAN2020	Delhi
6	106	Sara	14	F	88	01APR2020	Kolkata
7	107	Mike	13	M	75	05MAY2020	UNKNOWN
8	108	UNKNOWN	12	F	70	10JUN2020	Hyderabad
9	109	David	5	M	95	15JUL2020	Pune
10	110	Emma	16	F	100	20AUG2020	Bangalore

Explanation + Key Points

This is where IF-THEN logic becomes the backbone of data cleaning.

IF Name = "" OR Name = "NULL" standardizes missing text.
ABS(Age) corrects negative values.
Conditional logic ensures Age remains within realistic bounds.
UPCASE() enforces consistency for categorical variables.
Missing dates are imputed using TODAY() common in operational datasets.
PROC SORT NODUPKEY removes duplicate records based on primary key.

Key Insight: SAS IF-THEN logic allows deterministic, auditable transformations critical for regulatory environments like clinical trials.

4. Phase 2: Data Cleaning in R

library(dplyr)

school_clean <- school_raw %>%

mutate(Name = ifelse(Name == "" | Name == "NULL" , "UNKNOWN", Name),

Age = ifelse(is.na(Age) | Age > 100, 15, abs(Age)),

Gender = toupper(Gender),

Marks = ifelse(is.na(Marks), 75,

ifelse(Marks > 100, 100, abs(Marks))),

City = ifelse(City == "NULL", "UNKNOWN", City),

Join_Date = ifelse(is.na(Join_Date), Sys.Date(), Join_Date)

) %>%

distinct(Student_ID, .keep_all = TRUE)

OUTPUT:

	Student_ID	Name	Age	Gender	Marks	Join_Date	City
1	101	John	15	M	85	18273	Delhi
2	102	UNKNOWN	16	F	90	18307	Mumbai
3	103	Alice	15	F	78	20553	Chennai
4	104	Bob	15	M	10	18346	Delhi
5	105	John	15	M	85	18273	Delhi
6	106	Sara	14	F	88	18353	Kolkata
7	107	Mike	13	M	75	18387	UNKNOWN
8	108	UNKNOWN	12	F	70	18423	Hyderabad
9	109	David	5	M	95	18458	Pune
10	110	Emma	16	F	100	18494	Bangalore

Explanation + Key Points

R uses dplyr for declarative transformations:

mutate() applies column-wise transformations
ifelse() replaces SAS IF-THEN logic
toupper() standardizes categorical variables
distinct() removes duplicates

Nested ifelse() enables multi-condition handling similar to SAS.

Key Insight: R is expressive and compact but can become harder to audit compared to SAS in regulated workflows. However, it excels in rapid prototyping and exploratory analysis.

5. 20 Additional Data Cleaning Best Practices

Always validate against protocol (clinical trials)
Maintain audit trails for transformations
Never overwrite raw data
Use metadata-driven programming
Validate ranges using domain knowledge
Apply controlled terminology (CDISC)
Perform double programming validation
Document assumptions clearly
Use macros for reusable logic
Perform frequency checks
Cross-domain validation (SDTM consistency)
Handle missing data systematically
Flag imputed values
Use version control
Perform QC checks independently
Validate date sequences
Ensure key uniqueness
Avoid hardcoding values
Use logs for error tracking
Test edge cases thoroughly

6. Business Logic Behind Data Cleaning

Data cleaning is not random it is business-rule driven.

Missing Age → Default Value (15):
In school datasets, a typical student age range exists. Missing values are imputed based on expected distribution.
Negative Age → ABS():
Negative values are data entry errors, not real-world possibilities.
Marks >100 → Cap at 100:
Academic systems have defined scoring limits.
Missing Dates → TODAY():
Useful in operational tracking when actual date is unavailable.

Impact:
Incorrect data leads to flawed analytics wrong performance rankings, incorrect dropout predictions, or flawed clinical endpoints.

7. 20 Key Points (Sharp & Impactful)

Dirty data leads to wrong conclusions.
IF-THEN logic enforces business rules.
Missing values must be handled explicitly.
Standardization ensures reproducibility.
Duplicate records distort analytics.
Validation is not optional—it is mandatory.
Data cleaning is 70% of analytics work.
Consistency beats complexity.
Every variable needs domain understanding.
Logs are your debugging backbone.
Never trust raw data blindly.
Audit trails ensure compliance.
Automation reduces human error.
Edge cases define robustness.
Clean data drives better decisions.
Reproducibility is key in regulated environments.
SAS excels in structured pipelines.
R excels in flexibility.
Documentation is as important as code.
Good data = Good insights.

8. Summary

SAS and R both offer powerful data cleaning capabilities:

Feature	SAS	R
Logic Handling	Strong IF-THEN	Flexible ifelse
Auditability	High	Moderate
Regulatory Use	Preferred	Limited
Flexibility	Moderate	High

SAS is ideal for structured, repeatable workflows, while R is excellent for exploration and flexibility.

9. Conclusion

Mastering IF-THEN logic in SAS is not just about writing conditions it’s about embedding business intelligence into your data pipeline.

A messy dataset is inevitable. A messy analysis is optional.

When you combine:

Strong validation rules
Structured cleaning logic
Domain knowledge

You transform unreliable data into a decision-grade asset.

10. Interview Questions

1. How would you handle a dataset where age contains negative and missing values in SAS?

Answer:

Handling age requires both data correction logic and business rules.

SAS:

DATA clean_age;

SET raw_data;

/* Handle negative values */

IF Age < 0 THEN Age = ABS(Age);

/* Handle missing or unrealistic values */

IF Age = . OR Age > 100 THEN Age = 30; /* Example imputation */

RUN;

Explanation:

Negative age values are logically invalid and usually result from data entry errors, so applying ABS() corrects them without losing information. Missing values (.) and unrealistic values (e.g., Age > 100 in a school dataset) must be handled using domain-driven imputation rules. For example, replacing with a median or typical value (like 30 or 15 depending on context). In clinical trials, you wouldn’t directly overwrite values—you might derive a new variable (e.g., AGE_IMP) and flag it. The key is ensuring traceability, auditability, and consistency in transformations.

2. Explain how PROC SORT NODUPKEY works and when to use it?

Answer:

SAS Code

PROC SORT DATA=raw_data OUT=dedup_data NODUPKEY;

BY Student_ID;

RUN;

Explanation:

PROC SORT NODUPKEY removes duplicate observations based on the BY variables. It keeps the first occurrence and removes subsequent duplicates. This is useful when you have a unique identifier like Student_ID or USUBJID in clinical trials. However, it does not compare all columns—it only checks the BY variables. So if duplicates have conflicting values, this method may silently discard important data. In regulated environments (like SDTM/ADaM), you should first investigate duplicates using PROC FREQ or PROC SQL before removing them. Always ensure that deduplication aligns with business rules.

3. In R, how would you handle nested conditions similar to SAS IF-THEN?

Answer:

R Code

library(dplyr)

clean_data <- raw_data %>%

mutate(

Age = ifelse(is.na(Age), 30,

ifelse(Age < 0, abs(Age),

ifelse(Age > 100, 30, Age)))

)

Explanation:

In R, nested conditions are implemented using ifelse() inside mutate(). Each ifelse() acts like an IF-THEN-ELSE block in SAS. The structure flows from top to bottom, similar to SAS logic. However, deeply nested ifelse() can reduce readability. In such cases, case_when() from dplyr is preferred for clarity. The key difference is that SAS executes row-wise in a data step, while R uses vectorized operations. Understanding this distinction helps in writing efficient and readable code. Always validate results after applying nested conditions to avoid logical errors.

4. How do you validate cleaned data in clinical trial datasets (SDTM/ADaM)?

Answer:

Validation Approach

Perform double programming (independent validation)
Use PROC COMPARE
Run range checks and consistency checks
Validate against SAP (Statistical Analysis Plan)
Cross-check with raw data (SDTM vs ADaM)

SAS Example

PROC COMPARE BASE=adam_dataset COMP=qc_dataset;

RUN;

Explanation:

Validation in clinical trials is critical due to regulatory requirements (FDA, CDISC). After cleaning, datasets must undergo independent QC validation, often by a second programmer. PROC COMPARE is used to compare production and QC datasets. Additionally, range checks (e.g., Age limits), consistency checks (e.g., date sequences), and cross-domain validation (e.g., AE vs DM) are performed. All transformations must align with the SAP. Any discrepancies are documented and resolved. Audit trails are maintained to ensure traceability. The goal is to ensure the dataset is analysis-ready, accurate, and compliant.

5. If duplicate records have conflicting values, how would you resolve them?

Answer:

SAS:

PROC SORT DATA=raw_data;

BY Student_ID DESCENDING Join_Date;

RUN;

DATA resolved;

SET raw_data;

BY Student_ID;

IF FIRST.Student_ID;

RUN;

Explanation:

When duplicates have conflicting values, simple removal is risky. Instead, resolution must follow business logic. For example, keeping the latest record based on Join_Date or selecting the most complete record. In SAS, sorting by a priority variable (e.g., descending date) ensures the preferred record appears first, and FIRST. logic retains it. Alternatively, you can use PROC SQL to aggregate or compare records. In clinical trials, discrepancies must be investigated and documented rather than blindly resolved. The chosen approach should be justified, reproducible, and aligned with study rules.

Bonus: Validation Checklist

✔ Missing values handled
✔ Range checks applied
✔ Duplicates removed
✔ Formats standardized
✔ Business rules applied
✔ Logs reviewed
✔ QC performed

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SCHOOL DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Data Gone Wrong to Data Done Right Using IF-THEN Logic in SAS

Mastering IF-THEN Logic in SAS Using a School Dataset with Intentional Errors

Explanation

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Data Gone Wrong to Data Done Right Using IF-THEN Logic in SAS

Mastering IF-THEN Logic in SAS Using a School Dataset with Intentional Errors

Explanation

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?