From Contaminated Data to Clinical Clarity: Automating Antibiotics Dataset Cleaning in SAS with Macros

1. Introduction

Imagine you are working as a clinical SAS programmer in a pharmaceutical company. A new antibiotics trial dataset lands on your desk. It looks promising patient demographics, drug dosages, outcomes all neatly structured. But the moment you start exploring, reality hits.

A patient aged -5 years. Treatment dates occurring before enrollment. Drug names like amoxicillin, AMOX, null, and even blanks representing the same medication. Duplicate patient records with slightly different spellings. Suddenly, your “clean dataset” turns into a minefield.

This is not hypothetical this is daily reality in clinical trials.

Dirty data is not just inconvenient it is dangerous. It can lead to:

Incorrect efficacy conclusions
Regulatory rejection (FDA/EMA compliance issues)
Misleading safety signals

This is where data cleaning becomes a critical scientific process, not just a technical step.

Tools like SAS and R dominate this space. SAS excels in structured clinical workflows (SDTM/ADaM), while R provides flexibility and exploratory power. But when automation meets SAS macros, we unlock true efficiency.

In this blog, we will build a messy antibiotics dataset, intentionally inject errors, and then clean it using SAS macros and R techniques just like in real-world clinical programming.

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA antibiotics_raw;

INFILE DATALINES DLM=',' DSD;

LENGTH Patient_ID $5 Drug_Name $20 Gender $10;

INPUT Patient_ID $ Age Gender $ Drug_Name $ Dose

Treatment_Date :$12. Outcome :$15.;

DATALINES;

P001,25,M,amoxicillin,500,2023-01-01,Recovered

P002,-5,F,NULL,250,2023-02-01,Recovered

P003,45,m,AMOX,500,2023-01-10,Not Recovered

P004,60,F, ,700,2022-12-01,Recovered

P005,30,M,azithromycin,abc,2023-03-01,Recovered

P001,25,M,amoxicillin,500,2023-01-01,Recovered

P006,200,F,penicillin,500,2023-02-30,Recovered

P007,35,,ciprofloxacin,500,2023-01-15,Recovered

P008,40,M,NULL,500,2023-01-20,Recovered

P009,28,F,amox,500,2023-01-18,Recovered

;

RUN;

PROC PRINT DATA = antibiotics_raw;

RUN;

OUTPUT:


Obs	Patient_ID	Drug_Name	Gender	Age	Dose	Treatment_Date	Outcome
1	P001	amoxicillin	M	25	500	2023-01-01	Recovered
2	P002	NULL	F	-5	250	2023-02-01	Recovered
3	P003	AMOX	m	45	500	2023-01-10	Not Recovered
4	P004		F	60	700	2022-12-01	Recovered
5	P005	azithromycin	M	30	.	2023-03-01	Recovered
6	P001	amoxicillin	M	25	500	2023-01-01	Recovered
7	P006	penicillin	F	200	500	2023-02-30	Recovered
8	P007	ciprofloxacin		35	500	2023-01-15	Recovered
9	P008	NULL	M	40	500	2023-01-20	Recovered
10	P009	amox	F	28	500	2023-01-18	Recovered

Explanation

This dataset simulates real-world clinical trial inconsistencies. We deliberately introduced invalid ages (-5, 200), missing values (blank gender, NULL drug), invalid dose ("abc"), duplicate records (P001), and invalid dates (Feb 30). Drug names are inconsistent (amoxicillin vs AMOX vs amox). This mirrors raw SDTM ingestion challenges where CRF data is messy. Using INFILE DATALINES ensures flexibility in reading unstructured input. The $ and :$12. informats demonstrate mixed-type parsing. This dataset becomes the foundation for demonstrating robust cleaning techniques.

R Code – Equivalent Raw Dataset

antibiotics_raw <- data.frame(

Patient_ID = c("P001","P002","P003","P004","P005","P001","P006",

"P007","P008","P009"),

Age = c(25,-5,45,60,30,25,200,35,40,28),

Gender = c("M","F","m","F","M","M","F",NA,"M","F"),

Drug_Name = c("amoxicillin","NULL","AMOX"," ","azithromycin"

,"amoxicillin","penicillin","ciprofloxacin","NULL",

"amox"),

Dose = c("500","250","500","700","abc","500","500","500","500",

"500"),

Treatment_Date = c("2023-01-01","2023-02-01","2023-01-10",

"2022-12-01","2023-03-01","2023-01-01",

"2023-02-30","2023-01-15","2023-01-20",

"2023-01-18"),

Outcome = c("Recovered","Recovered","Not Recovered","Recovered",

"Recovered","Recovered","Recovered","Recovered",

"Recovered","Recovered")

)

OUTPUT:

	Patient_ID	Age	Gender	Drug_Name	Dose	Treatment_Date	Outcome
1	P001	25	M	amoxicillin	500	01-01-2023	Recovered
2	P002	-5	F	NULL	250	01-02-2023	Recovered
3	P003	45	m	AMOX	500	10-01-2023	Not Recovered
4	P004	60	F		700	01-12-2022	Recovered
5	P005	30	M	azithromycin	abc	01-03-2023	Recovered
6	P001	25	M	amoxicillin	500	01-01-2023	Recovered
7	P006	200	F	penicillin	500	2023-02-30	Recovered
8	P007	35	NA	ciprofloxacin	500	15-01-2023	Recovered
9	P008	40	M	NULL	500	20-01-2023	Recovered
10	P009	28	F	amox	500	18-01-2023	Recovered

Explanation

This R dataset mirrors the SAS structure but uses data.frame() for flexibility. Notice that Dose is stored as character, which can cause numeric conversion issues. Missing values are represented using NA, unlike SAS blanks. Mixed casing and inconsistent drug names highlight the need for string normalization. Invalid date (2023-02-30) will cause parsing issues when converting to Date. This dataset prepares us for demonstrating dplyr transformations and validation steps. R is particularly powerful for exploratory cleaning, but requires explicit handling of types.

3. Phase 1: Data Cleaning in SAS

DATA antibiotics_clean;

SET antibiotics_raw;

/* Handle Missing Drug */

Drug_Name = STRIP(Drug_Name);

IF UPCASE(Drug_Name) = "NULL" THEN Drug_Name = "";

Drug_Name = COALESCEC(Drug_Name, "UNKNOWN");

IF Age < 0 OR Age > 120 THEN Age = . ; /* Fix Age */

/* Standardize Gender and Drug */

Gender = UPCASE(STRIP(Gender));

Drug_Name = UPCASE(STRIP(Drug_Name));

/* Fix Dose */

Dose_num = INPUT(Dose, BEST.);

IF Dose_num = . THEN Dose_num = 0;

/* Convert Date */

Treatment_DT = INPUT(Treatment_Date, YYMMDD10.);

FORMAT Treatment_DT DATE9.;

RUN;

PROC PRINT DATA = antibiotics_clean;

RUN;

OUTPUT:


Obs	Patient_ID	Drug_Name	Gender	Age	Dose	Treatment_Date	Outcome	Dose_num	Treatment_DT
1	P001	AMOXICILLIN	M	25	500	2023-01-01	Recovered	500	01JAN2023
2	P002	UNKNOWN	F	.	250	2023-02-01	Recovered	250	01FEB2023
3	P003	AMOX	M	45	500	2023-01-10	Not Recovered	500	10JAN2023
4	P004	UNKNOWN	F	60	700	2022-12-01	Recovered	700	01DEC2022
5	P005	AZITHROMYCIN	M	30	.	2023-03-01	Recovered	0	01MAR2023
6	P001	AMOXICILLIN	M	25	500	2023-01-01	Recovered	500	01JAN2023
7	P006	PENICILLIN	F	.	500	2023-02-30	Recovered	500	.
8	P007	CIPROFLOXACIN		35	500	2023-01-15	Recovered	500	15JAN2023
9	P008	UNKNOWN	M	40	500	2023-01-20	Recovered	500	20JAN2023
10	P009	AMOX	F	28	500	2023-01-18	Recovered	500	18JAN2023

PROC SORT DATA=antibiotics_clean NODUPKEY;

BY Patient_ID;

RUN;

PROC PRINT DATA = antibiotics_clean;

RUN;

OUTPUT:


Obs	Patient_ID	Drug_Name	Gender	Age	Dose	Treatment_Date	Outcome	Dose_num	Treatment_DT
1	P001	AMOXICILLIN	M	25	500	2023-01-01	Recovered	500	01JAN2023
2	P002	UNKNOWN	F	.	250	2023-02-01	Recovered	250	01FEB2023
3	P003	AMOX	M	45	500	2023-01-10	Not Recovered	500	10JAN2023
4	P004	UNKNOWN	F	60	700	2022-12-01	Recovered	700	01DEC2022
5	P005	AZITHROMYCIN	M	30	.	2023-03-01	Recovered	0	01MAR2023
6	P006	PENICILLIN	F	.	500	2023-02-30	Recovered	500	.
7	P007	CIPROFLOXACIN		35	500	2023-01-15	Recovered	500	15JAN2023
8	P008	UNKNOWN	M	40	500	2023-01-20	Recovered	500	20JAN2023
9	P009	AMOX	F	28	500	2023-01-18	Recovered	500	18JAN2023

Explanation

This step performs structured cleaning. COALESCEC replaces missing drug names, ensuring no null categories. Age validation removes biologically impossible values. UPCASE + STRIP standardizes text for consistency across analysis datasets. The INPUT() function converts character dose to numeric, with fallback logic for invalid entries. Date conversion uses YYMMDD10. informat, converting strings into SAS dates. Finally, PROC SORT NODUPKEY removes duplicate patient records a critical step in SDTM compliance. This pipeline mimics production-grade clinical cleaning logic.

4. Phase 2: Data Cleaning in R

library(dplyr)

antibiotics_clean <- antibiotics_raw %>%

mutate(

Drug_Name = ifelse(is.na(Drug_Name) | trimws(Drug_Name)

%in% c("NULL", "") | grepl("^[0-9]+$",

trimws(Drug_Name)),"UNKNOWN",Drug_Name),

Drug_Name = toupper(trimws(Drug_Name)),

Gender = toupper(trimws(Gender)),

Age = ifelse(Age < 0 | Age > 120, NA, Age),

Dose = suppressWarnings(as.numeric(Dose)),

Dose = ifelse(is.na(Dose), 0, Dose),

Treatment_Date = as.Date(Treatment_Date)

) %>%

distinct(Patient_ID, .keep_all = TRUE)

OUTPUT:

	Patient_ID	Age	Gender	Drug_Name	Dose	Treatment_Date	Outcome
1	P001	25	M	AMOXICILLIN	500	01-01-2023	Recovered
2	P002	NA	F	UNKNOWN	250	01-02-2023	Recovered
3	P003	45	M	AMOX	500	10-01-2023	Not Recovered
4	P004	60	F	UNKNOWN	700	01-12-2022	Recovered
5	P005	30	M	AZITHROMYCIN	0	01-03-2023	Recovered
6	P006	NA	F	PENICILLIN	500	NA	Recovered
7	P007	35	NA	CIPROFLOXACIN	500	15-01-2023	Recovered
8	P008	40	M	UNKNOWN	500	20-01-2023	Recovered
9	P009	28	F	AMOX	500	18-01-2023	Recovered

Explanation

R uses dplyr for declarative transformations. mutate() applies column-wise cleaning. Missing and invalid drug values are replaced using logical conditions. grepl("^[0-9]+$", Drug_Name) detects pure numeric strings. toupper() ensures consistency, critical for grouping operations. Numeric conversion of Dose automatically generates NA for invalid entries, which are then replaced. as.Date() converts string to date—invalid dates become NA, highlighting errors. distinct() removes duplicates efficiently. Compared to SAS, R offers concise syntax but requires careful type handling.

5. Phase 3: Advanced SAS Cleaning Using Macros

%MACRO clean_var(ds,var);

DATA &ds;

SET &ds;

&var = UPCASE(STRIP(&var));

RUN;

PROC PRINT DATA = &ds;

RUN;

%MEND;

%clean_var(antibiotics_clean, Drug_Name);

OUTPUT:


Obs	Patient_ID	Drug_Name	Gender	Age	Dose	Treatment_Date	Outcome	Dose_num	Treatment_DT
1	P001	AMOXICILLIN	M	25	500	2023-01-01	Recovered	500	01JAN2023
2	P002	UNKNOWN	F	.	250	2023-02-01	Recovered	250	01FEB2023
3	P003	AMOX	M	45	500	2023-01-10	Not Recovered	500	10JAN2023
4	P004	UNKNOWN	F	60	700	2022-12-01	Recovered	700	01DEC2022
5	P005	AZITHROMYCIN	M	30	.	2023-03-01	Recovered	0	01MAR2023
6	P006	PENICILLIN	F	.	500	2023-02-30	Recovered	500	.
7	P007	CIPROFLOXACIN		35	500	2023-01-15	Recovered	500	15JAN2023
8	P008	UNKNOWN	M	40	500	2023-01-20	Recovered	500	20JAN2023
9	P009	AMOX	F	28	500	2023-01-18	Recovered	500	18JAN2023

%clean_var(antibiotics_clean, Gender);

OUTPUT:


Obs	Patient_ID	Drug_Name	Gender	Age	Dose	Treatment_Date	Outcome	Dose_num	Treatment_DT
1	P001	AMOXICILLIN	M	25	500	2023-01-01	Recovered	500	01JAN2023
2	P002	UNKNOWN	F	.	250	2023-02-01	Recovered	250	01FEB2023
3	P003	AMOX	M	45	500	2023-01-10	Not Recovered	500	10JAN2023
4	P004	UNKNOWN	F	60	700	2022-12-01	Recovered	700	01DEC2022
5	P005	AZITHROMYCIN	M	30	.	2023-03-01	Recovered	0	01MAR2023
6	P006	PENICILLIN	F	.	500	2023-02-30	Recovered	500	.
7	P007	CIPROFLOXACIN		35	500	2023-01-15	Recovered	500	15JAN2023
8	P008	UNKNOWN	M	40	500	2023-01-20	Recovered	500	20JAN2023
9	P009	AMOX	F	28	500	2023-01-18	Recovered	500	18JAN2023

Explanation

Macros enable automation and scalability. Instead of repeating code, we generalize cleaning logic. The macro clean_var standardizes any variable dynamically. This is crucial in large SDTM domains (AE, LB, VS) where multiple variables require identical transformations. Macros improve maintainability, reduce errors, and support reusable pipelines. In production, macros are often parameterized with metadata, enabling automated dataset processing across studies.

6. 20 Additional Data Cleaning Best Practices

Always validate against SAP specifications
Maintain audit trails
Never overwrite raw datasets
Use controlled terminology (CDISC)
Validate date sequences
Handle partial dates carefully
Use metadata-driven programming
Standardize units (mg, kg)
Flag imputed values
Track derivation logic
Avoid hardcoding
Perform cross-domain validation
Check referential integrity
Use PROC COMPARE for QC
Document assumptions
Validate duplicates across domains
Use macro variables for flexibility
Ensure traceability to CRF
Follow 21 CFR Part 11 compliance
Perform double programming QC

7. Business Logic Behind Data Cleaning

Data cleaning is not arbitrary it is driven by domain-specific logic. In clinical trials, replacing missing values is carefully justified. For instance, if drug dose is missing, assigning zero might indicate “no treatment,” but in some contexts, it must remain missing to avoid bias. Similarly, unrealistic values like age >120 are biologically implausible and must be corrected or removed to maintain dataset integrity.

Date corrections are even more critical. If treatment occurs before enrollment, it violates study protocol and must be flagged or corrected. These inconsistencies directly impact statistical outputs like survival analysis or efficacy endpoints.

Consider patient age correction: a negative age might result from data entry error. If not handled, it could distort demographic summaries. Similarly, salary normalization in business datasets ensures comparability across regions.

Ultimately, clean data ensures accurate decision-making, whether approving a drug or analyzing business performance. Poor cleaning leads to flawed models, incorrect conclusions, and potentially catastrophic decisions.

8. 20 Key Points

Dirty data leads to wrong conclusions
Standardization ensures reproducibility
Missing values must be handled contextually
Macros improve scalability
Validation is as important as cleaning
Duplicate removal is critical
Dates must follow logical sequences
Controlled terminology is mandatory
Automation reduces manual errors
Documentation is non-negotiable
Audit trails ensure compliance
Data cleaning is iterative
SAS excels in structured workflows
R excels in flexibility
Type consistency is crucial
Outliers must be investigated
Always preserve raw data
Use QC checks extensively
Regulatory standards guide cleaning
Clean data builds trust

9. Summary

Data cleaning is the backbone of reliable analytics, especially in high-stakes domains like clinical trials. In this blog, we explored how messy antibiotic datasets filled with missing values, duplicates, inconsistent formats, and invalid entries can be transformed into structured, analysis-ready datasets using SAS and R.

SAS demonstrated its strength in structured, rule-based cleaning, particularly with functions like COALESCEC, INPUT, and PROC SORT. Its macro system adds a powerful layer of automation, enabling scalable solutions across large clinical datasets. This is why SAS remains dominant in regulatory environments like SDTM and ADaM.

R, on the other hand, provided a more flexible and concise approach using dplyr. Its expressive syntax allows rapid transformations and exploratory data analysis. However, it requires careful handling of data types and missing values.

The integration of both tools provides a robust ecosystem SAS for compliance and reproducibility, and R for agility and exploration.

Ultimately, data cleaning is not just a pre-processing step it is a scientific discipline that ensures data integrity, re-producibility, and regulatory compliance. Without it, even the most advanced statistical models fail.

10. Conclusion

In modern data-driven environments, especially in clinical trials involving antibiotics and patient outcomes, data cleaning is no longer optional it is foundational. Every dataset carries the risk of hidden inconsistencies, and without structured cleaning frameworks, these errors propagate into analysis, reporting, and ultimately decision-making.

This blog demonstrated how intentional errors invalid ages, inconsistent drug names, missing values, and duplicate records can compromise data integrity. More importantly, it showed how SAS and R can systematically resolve these issues.

SAS stands out with its robust validation, macro automation, and regulatory alignment, making it indispensable in clinical programming. R complements this by offering flexibility and rapid transformation capabilities, ideal for exploratory phases.

The real power lies in combining both approaches using SAS for standardized pipelines and R for dynamic analysis.

As datasets grow in size and complexity, manual cleaning becomes impractical. Automation through macros and reproducible pipelines is the future. Organizations that invest in structured data cleaning frameworks will gain a competitive edge through reliable insights and faster decision-making.

In the end, clean data is not just about correctness it is about trust. Trust in your analysis, your models, and your conclusions.

11. Interview Questions

Q1: How do you handle missing values in SAS?
Answer: Use COALESCEC for character variables and conditional logic for numeric variables. Always justify imputation based on business rules.

Q2: How would you debug invalid date issues?
Answer: Use INPUT() with proper informat and check for missing outputs. Validate using conditional checks.

Q3: How do macros improve data cleaning?
Answer: They automate repetitive logic, improve scalability, and reduce human error.

Q4: Difference between SAS and R cleaning?
Answer: SAS is structured and compliance-focused; R is flexible and exploratory.

Q5: Real-world scenario:
Dataset has duplicate patients with different outcomes.
Solution: Investigate source, prioritize latest record or flag inconsistencies, document logic.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ANTIBIOTICS DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

The Power of SAS Macros in Antibiotics : Data From Inconsistent Inputs to Insightful Outputs

From Contaminated Data to Clinical Clarity: Automating Antibiotics Dataset Cleaning in SAS with Macros

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

The Power of SAS Macros in Antibiotics : Data From Inconsistent Inputs to Insightful Outputs

From Contaminated Data to Clinical Clarity: Automating Antibiotics Dataset Cleaning in SAS with Macros

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?