The Power of SAS Macros in Antibiotics : Data From Inconsistent Inputs to Insightful Outputs
From Contaminated Data to Clinical Clarity: Automating Antibiotics Dataset Cleaning in SAS with Macros
1. Introduction
Imagine
you are working as a clinical SAS programmer in a pharmaceutical company. A new
antibiotics trial dataset lands on your desk. It looks promising patient
demographics, drug dosages, outcomes all neatly structured. But the moment you
start exploring, reality hits.
A patient
aged -5 years. Treatment dates occurring before enrollment. Drug
names like amoxicillin, AMOX, null, and even blanks
representing the same medication. Duplicate patient records with slightly
different spellings. Suddenly, your “clean dataset” turns into a minefield.
This is
not hypothetical this is daily reality in clinical trials.
Dirty
data is not just inconvenient it is dangerous. It can lead to:
- Incorrect efficacy
conclusions
- Regulatory rejection
(FDA/EMA compliance issues)
- Misleading safety signals
This is
where data cleaning becomes a critical scientific process, not just a
technical step.
Tools
like SAS and R dominate this space. SAS excels in structured
clinical workflows (SDTM/ADaM), while R provides flexibility and exploratory
power. But when automation meets SAS macros, we unlock true efficiency.
In this blog, we will build a messy antibiotics dataset, intentionally inject errors, and then clean it using SAS macros and R techniques just like in real-world clinical programming.
2. Raw Data Creation in SAS and R
SAS Raw Dataset (With Intentional Errors)
DATA antibiotics_raw;
INFILE DATALINES DLM=',' DSD;
LENGTH Patient_ID $5 Drug_Name $20 Gender $10;
INPUT Patient_ID $ Age Gender $ Drug_Name $ Dose
Treatment_Date :$12. Outcome :$15.;
DATALINES;
P001,25,M,amoxicillin,500,2023-01-01,Recovered
P002,-5,F,NULL,250,2023-02-01,Recovered
P003,45,m,AMOX,500,2023-01-10,Not Recovered
P004,60,F, ,700,2022-12-01,Recovered
P005,30,M,azithromycin,abc,2023-03-01,Recovered
P001,25,M,amoxicillin,500,2023-01-01,Recovered
P006,200,F,penicillin,500,2023-02-30,Recovered
P007,35,,ciprofloxacin,500,2023-01-15,Recovered
P008,40,M,NULL,500,2023-01-20,Recovered
P009,28,F,amox,500,2023-01-18,Recovered
;
RUN;
PROC PRINT DATA = antibiotics_raw;
RUN;
OUTPUT:
| Obs | Patient_ID | Drug_Name | Gender | Age | Dose | Treatment_Date | Outcome |
|---|---|---|---|---|---|---|---|
| 1 | P001 | amoxicillin | M | 25 | 500 | 2023-01-01 | Recovered |
| 2 | P002 | NULL | F | -5 | 250 | 2023-02-01 | Recovered |
| 3 | P003 | AMOX | m | 45 | 500 | 2023-01-10 | Not Recovered |
| 4 | P004 | F | 60 | 700 | 2022-12-01 | Recovered | |
| 5 | P005 | azithromycin | M | 30 | . | 2023-03-01 | Recovered |
| 6 | P001 | amoxicillin | M | 25 | 500 | 2023-01-01 | Recovered |
| 7 | P006 | penicillin | F | 200 | 500 | 2023-02-30 | Recovered |
| 8 | P007 | ciprofloxacin | 35 | 500 | 2023-01-15 | Recovered | |
| 9 | P008 | NULL | M | 40 | 500 | 2023-01-20 | Recovered |
| 10 | P009 | amox | F | 28 | 500 | 2023-01-18 | Recovered |
Explanation
This
dataset simulates real-world clinical trial inconsistencies. We deliberately
introduced invalid ages (-5, 200), missing values (blank gender,
NULL drug), invalid dose ("abc"), duplicate records
(P001), and invalid dates (Feb 30). Drug names are inconsistent
(amoxicillin vs AMOX vs amox). This mirrors raw SDTM ingestion challenges where
CRF data is messy. Using INFILE DATALINES ensures flexibility in reading
unstructured input. The $ and :$12. informats demonstrate mixed-type parsing.
This dataset becomes the foundation for demonstrating robust cleaning
techniques.
R Code – Equivalent Raw Dataset
antibiotics_raw <- data.frame(
Patient_ID = c("P001","P002","P003","P004","P005","P001","P006",
"P007","P008","P009"),
Age = c(25,-5,45,60,30,25,200,35,40,28),
Gender = c("M","F","m","F","M","M","F",NA,"M","F"),
Drug_Name = c("amoxicillin","NULL","AMOX"," ","azithromycin"
,"amoxicillin","penicillin","ciprofloxacin","NULL",
"amox"),
Dose = c("500","250","500","700","abc","500","500","500","500",
"500"),
Treatment_Date = c("2023-01-01","2023-02-01","2023-01-10",
"2022-12-01","2023-03-01","2023-01-01",
"2023-02-30","2023-01-15","2023-01-20",
"2023-01-18"),
Outcome = c("Recovered","Recovered","Not Recovered","Recovered",
"Recovered","Recovered","Recovered","Recovered",
"Recovered","Recovered")
)
OUTPUT:
|
|
Patient_ID |
Age |
Gender |
Drug_Name |
Dose |
Treatment_Date |
Outcome |
|
1 |
P001 |
25 |
M |
amoxicillin |
500 |
01-01-2023 |
Recovered |
|
2 |
P002 |
-5 |
F |
NULL |
250 |
01-02-2023 |
Recovered |
|
3 |
P003 |
45 |
m |
AMOX |
500 |
10-01-2023 |
Not Recovered |
|
4 |
P004 |
60 |
F |
|
700 |
01-12-2022 |
Recovered |
|
5 |
P005 |
30 |
M |
azithromycin |
abc |
01-03-2023 |
Recovered |
|
6 |
P001 |
25 |
M |
amoxicillin |
500 |
01-01-2023 |
Recovered |
|
7 |
P006 |
200 |
F |
penicillin |
500 |
2023-02-30 |
Recovered |
|
8 |
P007 |
35 |
NA |
ciprofloxacin |
500 |
15-01-2023 |
Recovered |
|
9 |
P008 |
40 |
M |
NULL |
500 |
20-01-2023 |
Recovered |
|
10 |
P009 |
28 |
F |
amox |
500 |
18-01-2023 |
Recovered |
Explanation
This R
dataset mirrors the SAS structure but uses data.frame() for flexibility. Notice
that Dose is stored as character, which can cause numeric conversion
issues. Missing values are represented using NA, unlike SAS blanks. Mixed
casing and inconsistent drug names highlight the need for string normalization.
Invalid date (2023-02-30) will cause parsing issues when converting to Date.
This dataset prepares us for demonstrating dplyr transformations and validation
steps. R is particularly powerful for exploratory cleaning, but requires
explicit handling of types.
3. Phase 1: Data Cleaning in SAS
DATA antibiotics_clean;
SET antibiotics_raw;
/* Handle Missing Drug */
Drug_Name = STRIP(Drug_Name);
IF UPCASE(Drug_Name) = "NULL" THEN Drug_Name = "";
Drug_Name = COALESCEC(Drug_Name, "UNKNOWN");
IF Age < 0 OR Age > 120 THEN Age = . ; /* Fix Age */
/* Standardize Gender and Drug */
Gender = UPCASE(STRIP(Gender));
Drug_Name = UPCASE(STRIP(Drug_Name));
/* Fix Dose */
Dose_num = INPUT(Dose, BEST.);
IF Dose_num = . THEN Dose_num = 0;
/* Convert Date */
Treatment_DT = INPUT(Treatment_Date, YYMMDD10.);
FORMAT Treatment_DT DATE9.;
RUN;
PROC PRINT DATA = antibiotics_clean;
RUN;
OUTPUT:
| Obs | Patient_ID | Drug_Name | Gender | Age | Dose | Treatment_Date | Outcome | Dose_num | Treatment_DT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | P001 | AMOXICILLIN | M | 25 | 500 | 2023-01-01 | Recovered | 500 | 01JAN2023 |
| 2 | P002 | UNKNOWN | F | . | 250 | 2023-02-01 | Recovered | 250 | 01FEB2023 |
| 3 | P003 | AMOX | M | 45 | 500 | 2023-01-10 | Not Recovered | 500 | 10JAN2023 |
| 4 | P004 | UNKNOWN | F | 60 | 700 | 2022-12-01 | Recovered | 700 | 01DEC2022 |
| 5 | P005 | AZITHROMYCIN | M | 30 | . | 2023-03-01 | Recovered | 0 | 01MAR2023 |
| 6 | P001 | AMOXICILLIN | M | 25 | 500 | 2023-01-01 | Recovered | 500 | 01JAN2023 |
| 7 | P006 | PENICILLIN | F | . | 500 | 2023-02-30 | Recovered | 500 | . |
| 8 | P007 | CIPROFLOXACIN | 35 | 500 | 2023-01-15 | Recovered | 500 | 15JAN2023 | |
| 9 | P008 | UNKNOWN | M | 40 | 500 | 2023-01-20 | Recovered | 500 | 20JAN2023 |
| 10 | P009 | AMOX | F | 28 | 500 | 2023-01-18 | Recovered | 500 | 18JAN2023 |
PROC SORT DATA=antibiotics_clean NODUPKEY;
BY Patient_ID;
RUN;
PROC PRINT DATA = antibiotics_clean;
RUN;
OUTPUT:
| Obs | Patient_ID | Drug_Name | Gender | Age | Dose | Treatment_Date | Outcome | Dose_num | Treatment_DT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | P001 | AMOXICILLIN | M | 25 | 500 | 2023-01-01 | Recovered | 500 | 01JAN2023 |
| 2 | P002 | UNKNOWN | F | . | 250 | 2023-02-01 | Recovered | 250 | 01FEB2023 |
| 3 | P003 | AMOX | M | 45 | 500 | 2023-01-10 | Not Recovered | 500 | 10JAN2023 |
| 4 | P004 | UNKNOWN | F | 60 | 700 | 2022-12-01 | Recovered | 700 | 01DEC2022 |
| 5 | P005 | AZITHROMYCIN | M | 30 | . | 2023-03-01 | Recovered | 0 | 01MAR2023 |
| 6 | P006 | PENICILLIN | F | . | 500 | 2023-02-30 | Recovered | 500 | . |
| 7 | P007 | CIPROFLOXACIN | 35 | 500 | 2023-01-15 | Recovered | 500 | 15JAN2023 | |
| 8 | P008 | UNKNOWN | M | 40 | 500 | 2023-01-20 | Recovered | 500 | 20JAN2023 |
| 9 | P009 | AMOX | F | 28 | 500 | 2023-01-18 | Recovered | 500 | 18JAN2023 |
Explanation
This step
performs structured cleaning. COALESCEC replaces missing drug names, ensuring
no null categories. Age validation removes biologically impossible values. UPCASE
+ STRIP standardizes text for consistency across analysis datasets. The INPUT()
function converts character dose to numeric, with fallback logic for invalid
entries. Date conversion uses YYMMDD10. informat, converting strings into SAS
dates. Finally, PROC SORT NODUPKEY removes duplicate patient records a critical
step in SDTM compliance. This pipeline mimics production-grade clinical
cleaning logic.
4. Phase 2: Data Cleaning in R
library(dplyr)
antibiotics_clean <- antibiotics_raw %>%
mutate(
Drug_Name = ifelse(is.na(Drug_Name) | trimws(Drug_Name)
%in% c("NULL", "") | grepl("^[0-9]+$",
trimws(Drug_Name)),"UNKNOWN",Drug_Name),
Drug_Name = toupper(trimws(Drug_Name)),
Gender = toupper(trimws(Gender)),
Age = ifelse(Age < 0 | Age > 120, NA, Age),
Dose = suppressWarnings(as.numeric(Dose)),
Dose = ifelse(is.na(Dose), 0, Dose),
Treatment_Date = as.Date(Treatment_Date)
) %>%
distinct(Patient_ID, .keep_all = TRUE)
OUTPUT:
|
|
Patient_ID |
Age |
Gender |
Drug_Name |
Dose |
Treatment_Date |
Outcome |
|
1 |
P001 |
25 |
M |
AMOXICILLIN |
500 |
01-01-2023 |
Recovered |
|
2 |
P002 |
NA |
F |
UNKNOWN |
250 |
01-02-2023 |
Recovered |
|
3 |
P003 |
45 |
M |
AMOX |
500 |
10-01-2023 |
Not Recovered |
|
4 |
P004 |
60 |
F |
UNKNOWN |
700 |
01-12-2022 |
Recovered |
|
5 |
P005 |
30 |
M |
AZITHROMYCIN |
0 |
01-03-2023 |
Recovered |
|
6 |
P006 |
NA |
F |
PENICILLIN |
500 |
NA |
Recovered |
|
7 |
P007 |
35 |
NA |
CIPROFLOXACIN |
500 |
15-01-2023 |
Recovered |
|
8 |
P008 |
40 |
M |
UNKNOWN |
500 |
20-01-2023 |
Recovered |
|
9 |
P009 |
28 |
F |
AMOX |
500 |
18-01-2023 |
Recovered |
Explanation
R uses dplyr
for declarative transformations. mutate() applies column-wise cleaning. Missing
and invalid drug values are replaced using logical conditions. grepl("^[0-9]+$", Drug_Name) detects pure numeric strings. toupper()
ensures consistency, critical for grouping operations. Numeric conversion of
Dose automatically generates NA for invalid entries, which are then replaced. as.Date()
converts string to date—invalid dates become NA, highlighting errors. distinct()
removes duplicates efficiently. Compared to SAS, R offers concise syntax but
requires careful type handling.
5. Phase 3: Advanced SAS Cleaning Using Macros
%MACRO clean_var(ds,var);
DATA &ds;
SET &ds;
&var = UPCASE(STRIP(&var));
RUN;
PROC PRINT DATA = &ds;
RUN;
%MEND;
%clean_var(antibiotics_clean, Drug_Name);
OUTPUT:
| Obs | Patient_ID | Drug_Name | Gender | Age | Dose | Treatment_Date | Outcome | Dose_num | Treatment_DT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | P001 | AMOXICILLIN | M | 25 | 500 | 2023-01-01 | Recovered | 500 | 01JAN2023 |
| 2 | P002 | UNKNOWN | F | . | 250 | 2023-02-01 | Recovered | 250 | 01FEB2023 |
| 3 | P003 | AMOX | M | 45 | 500 | 2023-01-10 | Not Recovered | 500 | 10JAN2023 |
| 4 | P004 | UNKNOWN | F | 60 | 700 | 2022-12-01 | Recovered | 700 | 01DEC2022 |
| 5 | P005 | AZITHROMYCIN | M | 30 | . | 2023-03-01 | Recovered | 0 | 01MAR2023 |
| 6 | P006 | PENICILLIN | F | . | 500 | 2023-02-30 | Recovered | 500 | . |
| 7 | P007 | CIPROFLOXACIN | 35 | 500 | 2023-01-15 | Recovered | 500 | 15JAN2023 | |
| 8 | P008 | UNKNOWN | M | 40 | 500 | 2023-01-20 | Recovered | 500 | 20JAN2023 |
| 9 | P009 | AMOX | F | 28 | 500 | 2023-01-18 | Recovered | 500 | 18JAN2023 |
%clean_var(antibiotics_clean, Gender);
OUTPUT:
| Obs | Patient_ID | Drug_Name | Gender | Age | Dose | Treatment_Date | Outcome | Dose_num | Treatment_DT |
|---|---|---|---|---|---|---|---|---|---|
| 1 | P001 | AMOXICILLIN | M | 25 | 500 | 2023-01-01 | Recovered | 500 | 01JAN2023 |
| 2 | P002 | UNKNOWN | F | . | 250 | 2023-02-01 | Recovered | 250 | 01FEB2023 |
| 3 | P003 | AMOX | M | 45 | 500 | 2023-01-10 | Not Recovered | 500 | 10JAN2023 |
| 4 | P004 | UNKNOWN | F | 60 | 700 | 2022-12-01 | Recovered | 700 | 01DEC2022 |
| 5 | P005 | AZITHROMYCIN | M | 30 | . | 2023-03-01 | Recovered | 0 | 01MAR2023 |
| 6 | P006 | PENICILLIN | F | . | 500 | 2023-02-30 | Recovered | 500 | . |
| 7 | P007 | CIPROFLOXACIN | 35 | 500 | 2023-01-15 | Recovered | 500 | 15JAN2023 | |
| 8 | P008 | UNKNOWN | M | 40 | 500 | 2023-01-20 | Recovered | 500 | 20JAN2023 |
| 9 | P009 | AMOX | F | 28 | 500 | 2023-01-18 | Recovered | 500 | 18JAN2023 |
Explanation
Macros
enable automation and scalability. Instead of repeating code, we
generalize cleaning logic. The macro clean_var standardizes any variable
dynamically. This is crucial in large SDTM domains (AE, LB, VS) where multiple
variables require identical transformations. Macros improve maintainability,
reduce errors, and support reusable pipelines. In production, macros are often
parameterized with metadata, enabling automated dataset processing across
studies.
6. 20 Additional Data Cleaning Best Practices
- Always validate against SAP
specifications
- Maintain audit trails
- Never overwrite raw datasets
- Use controlled terminology
(CDISC)
- Validate date sequences
- Handle partial dates
carefully
- Use metadata-driven
programming
- Standardize units (mg, kg)
- Flag imputed values
- Track derivation logic
- Avoid hardcoding
- Perform cross-domain
validation
- Check referential integrity
- Use PROC COMPARE for QC
- Document assumptions
- Validate duplicates across
domains
- Use macro variables for
flexibility
- Ensure traceability to CRF
- Follow 21 CFR Part 11
compliance
- Perform double programming
QC
7. Business Logic Behind Data Cleaning
Data
cleaning is not arbitrary it is driven by domain-specific logic. In clinical
trials, replacing missing values is carefully justified. For instance, if drug
dose is missing, assigning zero might indicate “no treatment,” but in some
contexts, it must remain missing to avoid bias. Similarly, unrealistic values
like age >120 are biologically implausible and must be corrected or removed
to maintain dataset integrity.
Date
corrections are even more critical. If treatment occurs before enrollment, it
violates study protocol and must be flagged or corrected. These inconsistencies
directly impact statistical outputs like survival analysis or efficacy
endpoints.
Consider
patient age correction: a negative age might result from data entry error. If
not handled, it could distort demographic summaries. Similarly, salary
normalization in business datasets ensures comparability across regions.
Ultimately,
clean data ensures accurate decision-making, whether approving a drug or
analyzing business performance. Poor cleaning leads to flawed models, incorrect
conclusions, and potentially catastrophic decisions.
8. 20 Key Points
- Dirty data leads to wrong
conclusions
- Standardization ensures
reproducibility
- Missing values must be
handled contextually
- Macros improve scalability
- Validation is as important
as cleaning
- Duplicate removal is
critical
- Dates must follow logical
sequences
- Controlled terminology is
mandatory
- Automation reduces manual
errors
- Documentation is
non-negotiable
- Audit trails ensure
compliance
- Data cleaning is iterative
- SAS excels in structured
workflows
- R excels in flexibility
- Type consistency is crucial
- Outliers must be
investigated
- Always preserve raw data
- Use QC checks extensively
- Regulatory standards guide
cleaning
- Clean data builds trust
9. Summary
Data
cleaning is the backbone of reliable analytics, especially in high-stakes
domains like clinical trials. In this blog, we explored how messy antibiotic
datasets filled with missing values, duplicates, inconsistent formats, and
invalid entries can be transformed into structured, analysis-ready datasets
using SAS and R.
SAS
demonstrated its strength in structured, rule-based cleaning,
particularly with functions like COALESCEC, INPUT, and PROC SORT. Its macro
system adds a powerful layer of automation, enabling scalable solutions across
large clinical datasets. This is why SAS remains dominant in regulatory
environments like SDTM and ADaM.
R, on the
other hand, provided a more flexible and concise approach using dplyr.
Its expressive syntax allows rapid transformations and exploratory data
analysis. However, it requires careful handling of data types and missing
values.
The
integration of both tools provides a robust ecosystem SAS for compliance and
reproducibility, and R for agility and exploration.
Ultimately,
data cleaning is not just a pre-processing step it is a scientific discipline
that ensures data integrity, re-producibility, and regulatory compliance.
Without it, even the most advanced statistical models fail.
10. Conclusion
In modern
data-driven environments, especially in clinical trials involving antibiotics
and patient outcomes, data cleaning is no longer optional it is foundational.
Every dataset carries the risk of hidden inconsistencies, and without
structured cleaning frameworks, these errors propagate into analysis,
reporting, and ultimately decision-making.
This blog
demonstrated how intentional errors invalid ages, inconsistent drug names,
missing values, and duplicate records can compromise data integrity. More
importantly, it showed how SAS and R can systematically resolve these issues.
SAS
stands out with its robust validation, macro automation, and regulatory
alignment, making it indispensable in clinical programming. R complements
this by offering flexibility and rapid transformation capabilities,
ideal for exploratory phases.
The real
power lies in combining both approaches using SAS for standardized pipelines
and R for dynamic analysis.
As
datasets grow in size and complexity, manual cleaning becomes impractical.
Automation through macros and reproducible pipelines is the future.
Organizations that invest in structured data cleaning frameworks will gain a
competitive edge through reliable insights and faster decision-making.
In the
end, clean data is not just about correctness it is about trust. Trust
in your analysis, your models, and your conclusions.
11. Interview Questions
Q1: How
do you handle missing values in SAS?
Answer: Use COALESCEC for character variables and conditional logic for
numeric variables. Always justify imputation based on business rules.
Q2: How
would you debug invalid date issues?
Answer: Use INPUT() with proper informat and check for missing outputs.
Validate using conditional checks.
Q3: How
do macros improve data cleaning?
Answer: They automate repetitive logic, improve scalability, and reduce
human error.
Q4:
Difference between SAS and R cleaning?
Answer: SAS is structured and compliance-focused; R is flexible and
exploratory.
Q5:
Real-world scenario:
Dataset has duplicate patients with different outcomes.
Solution: Investigate source, prioritize latest record or flag
inconsistencies, document logic.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ANTIBIOTICS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment