The Power of SAS Macros in Antibiotics : Data From Inconsistent Inputs to Insightful Outputs

From Contaminated Data to Clinical Clarity: Automating Antibiotics Dataset Cleaning in SAS with Macros

1. Introduction

Imagine you are working as a clinical SAS programmer in a pharmaceutical company. A new antibiotics trial dataset lands on your desk. It looks promising patient demographics, drug dosages, outcomes all neatly structured. But the moment you start exploring, reality hits.

A patient aged -5 years. Treatment dates occurring before enrollment. Drug names like amoxicillin, AMOX, null, and even blanks representing the same medication. Duplicate patient records with slightly different spellings. Suddenly, your “clean dataset” turns into a minefield.

This is not hypothetical this is daily reality in clinical trials.

Dirty data is not just inconvenient it is dangerous. It can lead to:

  • Incorrect efficacy conclusions
  • Regulatory rejection (FDA/EMA compliance issues)
  • Misleading safety signals

This is where data cleaning becomes a critical scientific process, not just a technical step.

Tools like SAS and R dominate this space. SAS excels in structured clinical workflows (SDTM/ADaM), while R provides flexibility and exploratory power. But when automation meets SAS macros, we unlock true efficiency.

In this blog, we will build a messy antibiotics dataset, intentionally inject errors, and then clean it using SAS macros and R techniques just like in real-world clinical programming.

2. Raw Data Creation in SAS and R

SAS Raw Dataset (With Intentional Errors)

DATA antibiotics_raw;

INFILE DATALINES DLM=',' DSD;

LENGTH Patient_ID $5 Drug_Name $20 Gender $10;

INPUT Patient_ID $ Age Gender $ Drug_Name $ Dose 

      Treatment_Date :$12. Outcome :$15.;

DATALINES;

P001,25,M,amoxicillin,500,2023-01-01,Recovered

P002,-5,F,NULL,250,2023-02-01,Recovered

P003,45,m,AMOX,500,2023-01-10,Not Recovered

P004,60,F, ,700,2022-12-01,Recovered

P005,30,M,azithromycin,abc,2023-03-01,Recovered

P001,25,M,amoxicillin,500,2023-01-01,Recovered

P006,200,F,penicillin,500,2023-02-30,Recovered

P007,35,,ciprofloxacin,500,2023-01-15,Recovered

P008,40,M,NULL,500,2023-01-20,Recovered

P009,28,F,amox,500,2023-01-18,Recovered

;

RUN;

PROC PRINT DATA = antibiotics_raw;

RUN;

OUTPUT:

ObsPatient_IDDrug_NameGenderAgeDoseTreatment_DateOutcome
1P001amoxicillinM255002023-01-01Recovered
2P002NULLF-52502023-02-01Recovered
3P003AMOXm455002023-01-10Not Recovered
4P004 F607002022-12-01Recovered
5P005azithromycinM30.2023-03-01Recovered
6P001amoxicillinM255002023-01-01Recovered
7P006penicillinF2005002023-02-30Recovered
8P007ciprofloxacin 355002023-01-15Recovered
9P008NULLM405002023-01-20Recovered
10P009amoxF285002023-01-18Recovered

Explanation

This dataset simulates real-world clinical trial inconsistencies. We deliberately introduced invalid ages (-5, 200), missing values (blank gender, NULL drug), invalid dose ("abc"), duplicate records (P001), and invalid dates (Feb 30). Drug names are inconsistent (amoxicillin vs AMOX vs amox). This mirrors raw SDTM ingestion challenges where CRF data is messy. Using INFILE DATALINES ensures flexibility in reading unstructured input. The $ and :$12. informats demonstrate mixed-type parsing. This dataset becomes the foundation for demonstrating robust cleaning techniques.

R Code – Equivalent Raw Dataset

antibiotics_raw <- data.frame(

  Patient_ID = c("P001","P002","P003","P004","P005","P001","P006",

                 "P007","P008","P009"),

  Age = c(25,-5,45,60,30,25,200,35,40,28),

  Gender = c("M","F","m","F","M","M","F",NA,"M","F"),

  Drug_Name = c("amoxicillin","NULL","AMOX"," ","azithromycin"

                ,"amoxicillin","penicillin","ciprofloxacin","NULL",

                "amox"),

  Dose = c("500","250","500","700","abc","500","500","500","500",

           "500"),

  Treatment_Date = c("2023-01-01","2023-02-01","2023-01-10",

                     "2022-12-01","2023-03-01","2023-01-01",

                     "2023-02-30","2023-01-15","2023-01-20",

                     "2023-01-18"),

  Outcome = c("Recovered","Recovered","Not Recovered","Recovered",

              "Recovered","Recovered","Recovered","Recovered",

              "Recovered","Recovered")

)

OUTPUT:

 

Patient_ID

Age

Gender

Drug_Name

Dose

Treatment_Date

Outcome

1

P001

25

M

amoxicillin

500

01-01-2023

Recovered

2

P002

-5

F

NULL

250

01-02-2023

Recovered

3

P003

45

m

AMOX

500

10-01-2023

Not Recovered

4

P004

60

F

 

700

01-12-2022

Recovered

5

P005

30

M

azithromycin

abc

01-03-2023

Recovered

6

P001

25

M

amoxicillin

500

01-01-2023

Recovered

7

P006

200

F

penicillin

500

2023-02-30

Recovered

8

P007

35

NA

ciprofloxacin

500

15-01-2023

Recovered

9

P008

40

M

NULL

500

20-01-2023

Recovered

10

P009

28

F

amox

500

18-01-2023

Recovered

Explanation

This R dataset mirrors the SAS structure but uses data.frame() for flexibility. Notice that Dose is stored as character, which can cause numeric conversion issues. Missing values are represented using NA, unlike SAS blanks. Mixed casing and inconsistent drug names highlight the need for string normalization. Invalid date (2023-02-30) will cause parsing issues when converting to Date. This dataset prepares us for demonstrating dplyr transformations and validation steps. R is particularly powerful for exploratory cleaning, but requires explicit handling of types.

3. Phase 1: Data Cleaning in SAS

DATA antibiotics_clean;

    SET antibiotics_raw;

    /* Handle Missing Drug */

    Drug_Name = STRIP(Drug_Name);

    IF UPCASE(Drug_Name) = "NULL"  THEN Drug_Name = "";

    Drug_Name = COALESCEC(Drug_Name, "UNKNOWN");    

    IF Age < 0 OR Age > 120 THEN Age = . ;  /* Fix Age */

    /* Standardize Gender and Drug */

    Gender = UPCASE(STRIP(Gender));

    Drug_Name = UPCASE(STRIP(Drug_Name));

    /* Fix Dose */

    Dose_num = INPUT(Dose, BEST.);

    IF Dose_num = . THEN Dose_num = 0;

    /* Convert Date */

    Treatment_DT = INPUT(Treatment_Date, YYMMDD10.);

    FORMAT Treatment_DT DATE9.;

RUN;

PROC PRINT DATA = antibiotics_clean;

RUN;

OUTPUT:

ObsPatient_IDDrug_NameGenderAgeDoseTreatment_DateOutcomeDose_numTreatment_DT
1P001AMOXICILLINM255002023-01-01Recovered50001JAN2023
2P002UNKNOWNF.2502023-02-01Recovered25001FEB2023
3P003AMOXM455002023-01-10Not Recovered50010JAN2023
4P004UNKNOWNF607002022-12-01Recovered70001DEC2022
5P005AZITHROMYCINM30.2023-03-01Recovered001MAR2023
6P001AMOXICILLINM255002023-01-01Recovered50001JAN2023
7P006PENICILLINF.5002023-02-30Recovered500.
8P007CIPROFLOXACIN 355002023-01-15Recovered50015JAN2023
9P008UNKNOWNM405002023-01-20Recovered50020JAN2023
10P009AMOXF285002023-01-18Recovered50018JAN2023

PROC SORT DATA=antibiotics_clean NODUPKEY;

BY Patient_ID;

RUN;

PROC PRINT DATA = antibiotics_clean;

RUN;

OUTPUT:

ObsPatient_IDDrug_NameGenderAgeDoseTreatment_DateOutcomeDose_numTreatment_DT
1P001AMOXICILLINM255002023-01-01Recovered50001JAN2023
2P002UNKNOWN F.2502023-02-01Recovered25001FEB2023
3P003AMOXM455002023-01-10Not Recovered50010JAN2023
4P004UNKNOWNF607002022-12-01Recovered70001DEC2022
5P005AZITHROMYCINM30.2023-03-01Recovered001MAR2023
6P006PENICILLINF.5002023-02-30Recovered500.
7P007CIPROFLOXACIN 355002023-01-15Recovered50015JAN2023
8P008UNKNOWNM405002023-01-20Recovered50020JAN2023
9P009AMOXF285002023-01-18Recovered50018JAN2023

Explanation

This step performs structured cleaning. COALESCEC replaces missing drug names, ensuring no null categories. Age validation removes biologically impossible values. UPCASE + STRIP standardizes text for consistency across analysis datasets. The INPUT() function converts character dose to numeric, with fallback logic for invalid entries. Date conversion uses YYMMDD10. informat, converting strings into SAS dates. Finally, PROC SORT NODUPKEY removes duplicate patient records a critical step in SDTM compliance. This pipeline mimics production-grade clinical cleaning logic.

4. Phase 2: Data Cleaning in R

library(dplyr)

antibiotics_clean <- antibiotics_raw %>%

  mutate(

 Drug_Name = ifelse(is.na(Drug_Name) | trimws(Drug_Name)

                       %in% c("NULL", "") | grepl("^[0-9]+$", 

                      trimws(Drug_Name)),"UNKNOWN",Drug_Name),

    Drug_Name = toupper(trimws(Drug_Name)),

    Gender = toupper(trimws(Gender)),

    Age = ifelse(Age < 0 | Age > 120, NA, Age),

    Dose = suppressWarnings(as.numeric(Dose)),

    Dose = ifelse(is.na(Dose), 0, Dose),

    Treatment_Date = as.Date(Treatment_Date)

  ) %>%

  distinct(Patient_ID, .keep_all = TRUE)

OUTPUT:

 

Patient_ID

Age

Gender

Drug_Name

Dose

Treatment_Date

Outcome

1

P001

25

M

AMOXICILLIN

500

01-01-2023

Recovered

2

P002

NA

F

UNKNOWN

250

01-02-2023

Recovered

3

P003

45

M

AMOX

500

10-01-2023

Not Recovered

4

P004

60

F

UNKNOWN

700

01-12-2022

Recovered

5

P005

30

M

AZITHROMYCIN

0

01-03-2023

Recovered

6

P006

NA

F

PENICILLIN

500

NA

Recovered

7

P007

35

NA

CIPROFLOXACIN

500

15-01-2023

Recovered

8

P008

40

M

UNKNOWN

500

20-01-2023

Recovered

9

P009

28

F

AMOX

500

18-01-2023

Recovered

Explanation

R uses dplyr for declarative transformations. mutate() applies column-wise cleaning. Missing and invalid drug values are replaced using logical conditions. grepl("^[0-9]+$", Drug_Name) detects pure numeric strings. toupper() ensures consistency, critical for grouping operations. Numeric conversion of Dose automatically generates NA for invalid entries, which are then replaced. as.Date() converts string to date—invalid dates become NA, highlighting errors. distinct() removes duplicates efficiently. Compared to SAS, R offers concise syntax but requires careful type handling.

5. Phase 3: Advanced SAS Cleaning Using Macros

%MACRO clean_var(ds,var);

    DATA &ds;

        SET &ds;

        &var = UPCASE(STRIP(&var));

    RUN;

PROC PRINT DATA = &ds;

    RUN;

%MEND;


%clean_var(antibiotics_clean, Drug_Name);

OUTPUT:

ObsPatient_IDDrug_NameGenderAgeDoseTreatment_DateOutcomeDose_numTreatment_DT
1P001AMOXICILLINM255002023-01-01Recovered50001JAN2023
2P002UNKNOWNF.2502023-02-01Recovered25001FEB2023
3P003AMOXM455002023-01-10Not Recovered50010JAN2023
4P004UNKNOWNF607002022-12-01Recovered70001DEC2022
5P005AZITHROMYCINM30.2023-03-01Recovered001MAR2023
6P006PENICILLINF.5002023-02-30Recovered500.
7P007CIPROFLOXACIN 355002023-01-15Recovered50015JAN2023
8P008UNKNOWNM405002023-01-20Recovered50020JAN2023
9P009AMOXF285002023-01-18Recovered50018JAN2023

%clean_var(antibiotics_clean, Gender);

OUTPUT:

ObsPatient_IDDrug_NameGenderAgeDoseTreatment_DateOutcomeDose_numTreatment_DT
1P001AMOXICILLINM255002023-01-01Recovered50001JAN2023
2P002UNKNOWNF.2502023-02-01Recovered25001FEB2023
3P003AMOXM455002023-01-10Not Recovered50010JAN2023
4P004UNKNOWNF607002022-12-01Recovered70001DEC2022
5P005AZITHROMYCINM30.2023-03-01Recovered001MAR2023
6P006PENICILLINF.5002023-02-30Recovered500.
7P007CIPROFLOXACIN 355002023-01-15Recovered50015JAN2023
8P008UNKNOWNM405002023-01-20Recovered50020JAN2023
9P009AMOXF285002023-01-18Recovered50018JAN2023

Explanation

Macros enable automation and scalability. Instead of repeating code, we generalize cleaning logic. The macro clean_var standardizes any variable dynamically. This is crucial in large SDTM domains (AE, LB, VS) where multiple variables require identical transformations. Macros improve maintainability, reduce errors, and support reusable pipelines. In production, macros are often parameterized with metadata, enabling automated dataset processing across studies.

6. 20 Additional Data Cleaning Best Practices

  1. Always validate against SAP specifications
  2. Maintain audit trails
  3. Never overwrite raw datasets
  4. Use controlled terminology (CDISC)
  5. Validate date sequences
  6. Handle partial dates carefully
  7. Use metadata-driven programming
  8. Standardize units (mg, kg)
  9. Flag imputed values
  10. Track derivation logic
  11. Avoid hardcoding
  12. Perform cross-domain validation
  13. Check referential integrity
  14. Use PROC COMPARE for QC
  15. Document assumptions
  16. Validate duplicates across domains
  17. Use macro variables for flexibility
  18. Ensure traceability to CRF
  19. Follow 21 CFR Part 11 compliance
  20. Perform double programming QC

7. Business Logic Behind Data Cleaning

Data cleaning is not arbitrary it is driven by domain-specific logic. In clinical trials, replacing missing values is carefully justified. For instance, if drug dose is missing, assigning zero might indicate “no treatment,” but in some contexts, it must remain missing to avoid bias. Similarly, unrealistic values like age >120 are biologically implausible and must be corrected or removed to maintain dataset integrity.

Date corrections are even more critical. If treatment occurs before enrollment, it violates study protocol and must be flagged or corrected. These inconsistencies directly impact statistical outputs like survival analysis or efficacy endpoints.

Consider patient age correction: a negative age might result from data entry error. If not handled, it could distort demographic summaries. Similarly, salary normalization in business datasets ensures comparability across regions.

Ultimately, clean data ensures accurate decision-making, whether approving a drug or analyzing business performance. Poor cleaning leads to flawed models, incorrect conclusions, and potentially catastrophic decisions.

8. 20 Key Points

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing values must be handled contextually
  • Macros improve scalability
  • Validation is as important as cleaning
  • Duplicate removal is critical
  • Dates must follow logical sequences
  • Controlled terminology is mandatory
  • Automation reduces manual errors
  • Documentation is non-negotiable
  • Audit trails ensure compliance
  • Data cleaning is iterative
  • SAS excels in structured workflows
  • R excels in flexibility
  • Type consistency is crucial
  • Outliers must be investigated
  • Always preserve raw data
  • Use QC checks extensively
  • Regulatory standards guide cleaning
  • Clean data builds trust

9. Summary

Data cleaning is the backbone of reliable analytics, especially in high-stakes domains like clinical trials. In this blog, we explored how messy antibiotic datasets filled with missing values, duplicates, inconsistent formats, and invalid entries can be transformed into structured, analysis-ready datasets using SAS and R.

SAS demonstrated its strength in structured, rule-based cleaning, particularly with functions like COALESCEC, INPUT, and PROC SORT. Its macro system adds a powerful layer of automation, enabling scalable solutions across large clinical datasets. This is why SAS remains dominant in regulatory environments like SDTM and ADaM.

R, on the other hand, provided a more flexible and concise approach using dplyr. Its expressive syntax allows rapid transformations and exploratory data analysis. However, it requires careful handling of data types and missing values.

The integration of both tools provides a robust ecosystem SAS for compliance and reproducibility, and R for agility and exploration.

Ultimately, data cleaning is not just a pre-processing step it is a scientific discipline that ensures data integrity, re-producibility, and regulatory compliance. Without it, even the most advanced statistical models fail.

10. Conclusion

In modern data-driven environments, especially in clinical trials involving antibiotics and patient outcomes, data cleaning is no longer optional it is foundational. Every dataset carries the risk of hidden inconsistencies, and without structured cleaning frameworks, these errors propagate into analysis, reporting, and ultimately decision-making.

This blog demonstrated how intentional errors invalid ages, inconsistent drug names, missing values, and duplicate records can compromise data integrity. More importantly, it showed how SAS and R can systematically resolve these issues.

SAS stands out with its robust validation, macro automation, and regulatory alignment, making it indispensable in clinical programming. R complements this by offering flexibility and rapid transformation capabilities, ideal for exploratory phases.

The real power lies in combining both approaches using SAS for standardized pipelines and R for dynamic analysis.

As datasets grow in size and complexity, manual cleaning becomes impractical. Automation through macros and reproducible pipelines is the future. Organizations that invest in structured data cleaning frameworks will gain a competitive edge through reliable insights and faster decision-making.

In the end, clean data is not just about correctness it is about trust. Trust in your analysis, your models, and your conclusions.

11. Interview Questions

Q1: How do you handle missing values in SAS?
Answer: Use COALESCEC for character variables and conditional logic for numeric variables. Always justify imputation based on business rules.

Q2: How would you debug invalid date issues?
Answer: Use INPUT() with proper informat and check for missing outputs. Validate using conditional checks.

Q3: How do macros improve data cleaning?
Answer: They automate repetitive logic, improve scalability, and reduce human error.

Q4: Difference between SAS and R cleaning?
Answer: SAS is structured and compliance-focused; R is flexible and exploratory.

Q5: Real-world scenario:
Dataset has duplicate patients with different outcomes.
Solution: Investigate source, prioritize latest record or flag inconsistencies, document logic.

 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ANTIBIOTICS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy

Comments

Popular posts from this blog

458.Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

441.Fixing Negative Data Errors Like A Pro Using SAS ABS Function