447.Data Gone Wrong to Data Done Right Using IF-THEN Logic in SAS

Mastering IF-THEN Logic in SAS Using a School Dataset with Intentional Errors

1. Introduction

Imagine you are working as a SAS programmer in a clinical trial company or a school analytics system. You receive a dataset that looks fine at first glance, but once you start analyzing, everything falls apart ages are negative, dates don’t make sense, names are inconsistent, and duplicate records exist.

This is not just a “messy dataset.” This is a silent threat to decision-making.

In clinical trials, incorrect patient age can impact dosing decisions. In a school dataset, wrong student data can mislead performance analytics, scholarship eligibility, or attendance tracking.

Bad data doesn’t just reduce accuracy it destroys trust.

This is where IF-THEN logic in SAS becomes powerful. It allows you to enforce business rules, correct inconsistencies, and transform unreliable data into a trusted analytical asset.

Both SAS and R are widely used in data cleaning:

  • SAS excels in structured, rule-based transformations (especially in regulated environments like CDISC SDTM/ADaM).
  • R provides flexibility and modern data manipulation capabilities.

In this blog, we will simulate a school dataset with intentional errors, clean it step-by-step using SAS and R, and understand the business logic behind each transformation. 

2. Raw Data Creation in SAS and R

Dataset Variables (7 Variables)

  • Student_ID
  • Name
  • Age
  • Gender
  • Marks
  • Join_Date
  • City

SAS Code: Raw Dataset Creation

DATA school_raw;

INFILE DATALINES DLM='|' MISSOVER;

INPUT Student_ID Name $ Age Gender $ Marks 

      Join_Date : DATE9. City:$15.;

FORMAT Join_Date DATE9.;

DATALINES;

101|John|15|M|85|12JAN2020|Delhi

102|NULL|-16|F|90|15FEB2020|Mumbai

103|Alice|.|F|78|.|Chennai

104|Bob|200|M|-10|25MAR2020|Delhi

105|John|15|M|85|12JAN2020|Delhi

106|Sara|14|f|88|01APR2020|Kolkata

107|Mike|13|M|.|05MAY2020|NULL

108| |12|F|70|10JUN2020|Hyderabad

109|David|-5|M|95|15JUL2020|Pune

110|Emma|16|F|105|20AUG2020|Bangalore

;

RUN;

PROC PRINT DATA=school_raw;

RUN;

OUTPUT:

ObsStudent_IDNameAgeGenderMarksJoin_DateCity
1101John15M8512JAN2020Delhi
2102NULL-16F9015FEB2020Mumbai
3103Alice.F78.Chennai
4104Bob200M-1025MAR2020Delhi
5105John15M8512JAN2020Delhi
6106Sara14f8801APR2020Kolkata
7107Mike13M.05MAY2020NULL
8108 12F7010JUN2020Hyderabad
9109David-5M9515JUL2020Pune
10110Emma16F10520AUG2020Bangalore

Explanation + Key Points

This SAS dataset is intentionally designed with real-world data quality issues:

  • Missing values (Age, Join_Date, Name)
  • Invalid values (negative age, marks >100)
  • Duplicate record (Student_ID 105 same as 101)
  • Inconsistent text (“NULL”, lowercase gender)
  • Blank name (Student_ID 108)

Using DATALINES, we simulate raw ingestion similar to receiving flat files from external systems. The FORMAT ensures date readability.

Key Insight: Raw data should never be trusted blindly. Always assume errors exist and design validation logic accordingly.

Explanation

The MISSOVER option prevents SAS from reading the next line when data is missing in the current row. Without it, SAS attempts to “borrow” values from the next line, causing misalignment and _ERROR_. While MISSOVER improves robustness, it does not fix structural issues completely it only prevents spillover. You still need consistent placeholders for missing values. In production environments, combining MISSOVERTRUNCOVER, and proper data validation checks ensures safe ingestion of imperfect datasets. However, the best practice remains: clean data at source whenever possible.

Equivalent R Dataset

school_raw <- data.frame(

  Student_ID = c(101,102,103,104,105,106,107,108,109,110),

  Name = c("John","NULL","Alice","Bob","John","Sara","Mike",

           "", "David","Emma"),

  Age = c(15,-16,NA,200,15,14,13,12,-5,16),

  Gender = c("M","F","F","M","M","f","M","F","M","F"),

  Marks = c(85,90,78,-10,85,88,NA,70,95,105),

  Join_Date = as.Date(c("2020-01-12","2020-02-15",NA,"2020-03-25",

                        "2020-01-12","2020-04-01","2020-05-05",

                        "2020-06-10","2020-07-15","2020-08-20")),

  City = c("Delhi","Mumbai","Chennai","Delhi","Delhi","Kolkata",

           "NULL","Hyderabad","Pune","Bangalore")

)

OUTPUT:

 

Student_ID

Name

Age

Gender

Marks

Join_Date

City

1

101

John

15

M

85

12-01-2020

Delhi

2

102

NULL

-16

F

90

15-02-2020

Mumbai

3

103

Alice

NA

F

78

NA

Chennai

4

104

Bob

200

M

-10

25-03-2020

Delhi

5

105

John

15

M

85

12-01-2020

Delhi

6

106

Sara

14

f

88

01-04-2020

Kolkata

7

107

Mike

13

M

NA

05-05-2020

NULL

8

108

 

12

F

70

10-06-2020

Hyderabad

9

109

David

-5

M

95

15-07-2020

Pune

10

110

Emma

16

F

105

20-08-2020

Bangalore

Explanation + Key Points

The R dataset mirrors SAS but uses data.frame() for creation. Missing values are represented using NA, and dates are handled using as.Date().

Notice:

  • Empty string ("") for missing name
  • "NULL" used as string placeholder
  • Invalid numeric values retained intentionally

Key Insight: R provides flexibility but requires explicit handling of missing values and type consistency. Without proper cleaning, statistical functions may produce misleading outputs.

3. Phase 1: Data Cleaning in SAS (IF-THEN Logic Focus)

DATA school_clean1;

SET school_raw;

/* Handle missing and NULL names */

IF Name = "" OR Name = "NULL" THEN Name = "UNKNOWN";

/* Fix Age */

IF Age < 0 THEN Age = ABS(Age);

IF Age > 100 OR Age = . THEN Age = 15;

/* Standardize Gender */

Gender = UPCASE(Gender);

/* Fix Marks */

IF Marks < 0 THEN Marks = ABS(Marks);

IF Marks > 100 THEN Marks = 100;

IF Marks = . THEN Marks = 75;

/* Fix City */

IF City = "NULL" THEN City = "UNKNOWN";

/* Fix missing dates */

IF Join_Date = . THEN Join_Date = TODAY();

RUN;

PROC PRINT DATA=school_clean1;

RUN;

OUTPUT:

ObsStudent_IDNameAgeGenderMarksJoin_DateCity
1101John15M8512JAN2020Delhi
2102UNKNOWN16F9015FEB2020Mumbai
3103Alice15F7810APR2026Chennai
4104Bob15M1025MAR2020Delhi
5105John15M8512JAN2020Delhi
6106Sara14F8801APR2020Kolkata
7107Mike13M7505MAY2020UNKNOWN
8108UNKNOWN12F7010JUN2020Hyderabad
9109David5M9515JUL2020Pune
10110Emma16F10020AUG2020Bangalore

/* Remove duplicates */

PROC SORT DATA=school_clean1 NODUPKEY;

BY Student_ID;

RUN;

LOG:

NOTE: 0 observations with duplicate key values were deleted.

PROC PRINT DATA=school_clean1;

RUN;

OUTPUT:

ObsStudent_IDNameAgeGenderMarksJoin_DateCity
1101John15M8512JAN2020Delhi
2102UNKNOWN16F9015FEB2020Mumbai
3103Alice15F7810APR2026Chennai
4104Bob15M1025MAR2020Delhi
5105John15M8512JAN2020Delhi
6106Sara14F8801APR2020Kolkata
7107Mike13M7505MAY2020UNKNOWN
8108UNKNOWN12F7010JUN2020Hyderabad
9109David5M9515JUL2020Pune
10110Emma16F10020AUG2020Bangalore

Explanation + Key Points

This is where IF-THEN logic becomes the backbone of data cleaning.

  • IF Name = "" OR Name = "NULL" standardizes missing text.
  • ABS(Age) corrects negative values.
  • Conditional logic ensures Age remains within realistic bounds.
  • UPCASE() enforces consistency for categorical variables.
  • Missing dates are imputed using TODAY() common in operational datasets.
  • PROC SORT NODUPKEY removes duplicate records based on primary key.

Key Insight: SAS IF-THEN logic allows deterministic, auditable transformations critical for regulatory environments like clinical trials.

4. Phase 2: Data Cleaning in R

library(dplyr)


school_clean <- school_raw %>%

  mutate(Name = ifelse(Name == "" | Name == "NULL" , "UNKNOWN", Name),

         Age = ifelse(is.na(Age) | Age > 100, 15, abs(Age)),

         Gender = toupper(Gender),

         Marks = ifelse(is.na(Marks), 75,

                        ifelse(Marks > 100, 100, abs(Marks))),

         City = ifelse(City == "NULL", "UNKNOWN", City),

         Join_Date = ifelse(is.na(Join_Date), Sys.Date(), Join_Date)

  ) %>%

  distinct(Student_ID, .keep_all = TRUE)

OUTPUT:

 

Student_ID

Name

Age

Gender

Marks

Join_Date

City

1

101

John

15

M

85

18273

Delhi

2

102

UNKNOWN

16

F

90

18307

Mumbai

3

103

Alice

15

F

78

20553

Chennai

4

104

Bob

15

M

10

18346

Delhi

5

105

John

15

M

85

18273

Delhi

6

106

Sara

14

F

88

18353

Kolkata

7

107

Mike

13

M

75

18387

UNKNOWN

8

108

UNKNOWN

12

F

70

18423

Hyderabad

9

109

David

5

M

95

18458

Pune

10

110

Emma

16

F

100

18494

Bangalore

Explanation + Key Points

R uses dplyr for declarative transformations:

  • mutate() applies column-wise transformations
  • ifelse() replaces SAS IF-THEN logic
  • toupper() standardizes categorical variables
  • distinct() removes duplicates

Nested ifelse() enables multi-condition handling similar to SAS.

Key Insight: R is expressive and compact but can become harder to audit compared to SAS in regulated workflows. However, it excels in rapid prototyping and exploratory analysis.

5. 20 Additional Data Cleaning Best Practices

  1. Always validate against protocol (clinical trials)
  2. Maintain audit trails for transformations
  3. Never overwrite raw data
  4. Use metadata-driven programming
  5. Validate ranges using domain knowledge
  6. Apply controlled terminology (CDISC)
  7. Perform double programming validation
  8. Document assumptions clearly
  9. Use macros for reusable logic
  10. Perform frequency checks
  11. Cross-domain validation (SDTM consistency)
  12. Handle missing data systematically
  13. Flag imputed values
  14. Use version control
  15. Perform QC checks independently
  16. Validate date sequences
  17. Ensure key uniqueness
  18. Avoid hardcoding values
  19. Use logs for error tracking
  20. Test edge cases thoroughly

6. Business Logic Behind Data Cleaning

Data cleaning is not random it is business-rule driven.

  • Missing Age → Default Value (15):
    In school datasets, a typical student age range exists. Missing values are imputed based on expected distribution.
  • Negative Age → ABS():
    Negative values are data entry errors, not real-world possibilities.
  • Marks >100 → Cap at 100:
    Academic systems have defined scoring limits.
  • Missing Dates → TODAY():
    Useful in operational tracking when actual date is unavailable.

Impact:
Incorrect data leads to flawed analytics wrong performance rankings, incorrect dropout predictions, or flawed clinical endpoints.

7. 20 Key Points (Sharp & Impactful)

  • Dirty data leads to wrong conclusions.
  • IF-THEN logic enforces business rules.
  • Missing values must be handled explicitly.
  • Standardization ensures reproducibility.
  • Duplicate records distort analytics.
  • Validation is not optional—it is mandatory.
  • Data cleaning is 70% of analytics work.
  • Consistency beats complexity.
  • Every variable needs domain understanding.
  • Logs are your debugging backbone.
  • Never trust raw data blindly.
  • Audit trails ensure compliance.
  • Automation reduces human error.
  • Edge cases define robustness.
  • Clean data drives better decisions.
  • Reproducibility is key in regulated environments.
  • SAS excels in structured pipelines.
  • R excels in flexibility.
  • Documentation is as important as code.
  • Good data = Good insights.

8. Summary

SAS and R both offer powerful data cleaning capabilities:

Feature

SAS

R

Logic Handling

Strong IF-THEN

Flexible ifelse

Auditability

High

Moderate

Regulatory Use

Preferred

Limited

Flexibility

Moderate

High

SAS is ideal for structured, repeatable workflows, while R is excellent for exploration and flexibility.

9. Conclusion

Mastering IF-THEN logic in SAS is not just about writing conditions it’s about embedding business intelligence into your data pipeline.

A messy dataset is inevitable. A messy analysis is optional.

When you combine:

  • Strong validation rules
  • Structured cleaning logic
  • Domain knowledge

You transform unreliable data into a decision-grade asset.

10. Interview Questions 

1. How would you handle a dataset where age contains negative and missing values in SAS?

Answer:

Handling age requires both data correction logic and business rules.

SAS:

DATA clean_age;

SET raw_data;

/* Handle negative values */

IF Age < 0 THEN Age = ABS(Age);

/* Handle missing or unrealistic values */

IF Age = . OR Age > 100 THEN Age = 30; /* Example imputation */

RUN;

Explanation:

Negative age values are logically invalid and usually result from data entry errors, so applying ABS() corrects them without losing information. Missing values (.) and unrealistic values (e.g., Age > 100 in a school dataset) must be handled using domain-driven imputation rules. For example, replacing with a median or typical value (like 30 or 15 depending on context). In clinical trials, you wouldn’t directly overwrite values—you might derive a new variable (e.g., AGE_IMP) and flag it. The key is ensuring traceability, auditability, and consistency in transformations.

2. Explain how PROC SORT NODUPKEY works and when to use it?

Answer:

SAS Code

PROC SORT DATA=raw_data OUT=dedup_data NODUPKEY;

BY Student_ID;

RUN;

Explanation:

PROC SORT NODUPKEY removes duplicate observations based on the BY variables. It keeps the first occurrence and removes subsequent duplicates. This is useful when you have a unique identifier like Student_ID or USUBJID in clinical trials. However, it does not compare all columns—it only checks the BY variables. So if duplicates have conflicting values, this method may silently discard important data. In regulated environments (like SDTM/ADaM), you should first investigate duplicates using PROC FREQ or PROC SQL before removing them. Always ensure that deduplication aligns with business rules.

3. In R, how would you handle nested conditions similar to SAS IF-THEN?

Answer:

R Code

library(dplyr)

 

clean_data <- raw_data %>%

  mutate(

    Age = ifelse(is.na(Age), 30,

          ifelse(Age < 0, abs(Age),

          ifelse(Age > 100, 30, Age)))

  )

Explanation:

In R, nested conditions are implemented using ifelse() inside mutate(). Each ifelse() acts like an IF-THEN-ELSE block in SAS. The structure flows from top to bottom, similar to SAS logic. However, deeply nested ifelse() can reduce readability. In such cases, case_when() from dplyr is preferred for clarity. The key difference is that SAS executes row-wise in a data step, while R uses vectorized operations. Understanding this distinction helps in writing efficient and readable code. Always validate results after applying nested conditions to avoid logical errors.

4. How do you validate cleaned data in clinical trial datasets (SDTM/ADaM)?

Answer:

Validation Approach

  • Perform double programming (independent validation)
  • Use PROC COMPARE
  • Run range checks and consistency checks
  • Validate against SAP (Statistical Analysis Plan)
  • Cross-check with raw data (SDTM vs ADaM)

SAS Example

PROC COMPARE BASE=adam_dataset COMP=qc_dataset;

RUN;

Explanation:

Validation in clinical trials is critical due to regulatory requirements (FDA, CDISC). After cleaning, datasets must undergo independent QC validation, often by a second programmer. PROC COMPARE is used to compare production and QC datasets. Additionally, range checks (e.g., Age limits), consistency checks (e.g., date sequences), and cross-domain validation (e.g., AE vs DM) are performed. All transformations must align with the SAP. Any discrepancies are documented and resolved. Audit trails are maintained to ensure traceability. The goal is to ensure the dataset is analysis-ready, accurate, and compliant.

5. If duplicate records have conflicting values, how would you resolve them?

Answer:

SAS:

PROC SORT DATA=raw_data;

BY Student_ID DESCENDING Join_Date;

RUN;

 

DATA resolved;

SET raw_data;

BY Student_ID;

IF FIRST.Student_ID;

RUN;

Explanation:

When duplicates have conflicting values, simple removal is risky. Instead, resolution must follow business logic. For example, keeping the latest record based on Join_Date or selecting the most complete record. In SAS, sorting by a priority variable (e.g., descending date) ensures the preferred record appears first, and FIRST. logic retains it. Alternatively, you can use PROC SQL to aggregate or compare records. In clinical trials, discrepancies must be investigated and documented rather than blindly resolved. The chosen approach should be justified, reproducible, and aligned with study rules.

Bonus: Validation Checklist

  • Missing values handled
  • Range checks applied
  • Duplicates removed
  • Formats standardized
  • Business rules applied
  • Logs reviewed
  • QC performed

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about SCHOOL DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy










Comments

Popular posts from this blog

412.Can We Build And Clean A University Course Analytics & Fraud Detection System In Sas Using Only Macros While Intentionally Creating And Fixing Errors?

420.Can We Detect Errors, Prevent Fraud, And Optimize Biometric Access System Security Using Advanced SAS Programming?

418.Can We Design, Debug, Detect Fraud, And Optimize A Smart Parking System Using Advanced SAS Programming Techniques?