Treatment Data Unleashed: Converting Chaos into Clinical Clarity with SAS Intelligence

From Chaotic Care to Clinical Clarity: Mastering Treatment Data with SAS PROC FREQ and Intelligent Cleaning Frameworks

1. Introduction – When Data Lies, Decisions Collapse

Imagine you're working on a clinical trial dataset tracking different types of treatments chemotherapy, immunotherapy, antibiotics, lifestyle interventions. Everything looks fine on the surface. But then…

  • A patient aged -12 appears.
  • Treatment dates occur before enrollment.
  • Some treatments are labeled “chemo,” others “CHEMOTHERAPY,” and some just “NULL”.
  • Duplicate patients distort outcome counts.

Now picture this dataset being used for regulatory submission (CDISC SDTM/ADaM). One wrong frequency count in PROC FREQ could misrepresent safety signals. That’s not just bad analytics that’s regulatory risk.

This is where SAS and R become your surgical tools. They don’t just process data they enforce discipline, reproducibility, and auditability.

In this blog, we’ll build a “Different Types of Treatment” dataset, intentionally inject real-world errors, clean it step-by-step, and finally analyze it using PROC FREQ like an industry-grade programmer.

2. Raw Data Creation in SAS and R 

SAS Raw Dataset

DATA treatments_raw;

INFILE DATALINES DLM='|' MISSOVER DSD;

LENGTH Patient_ID $5 Treatment_Type $20 

       Gender $10 Status $12;

FORMAT Treatment_Date DATE9.;

INPUT Patient_ID $ Age Gender $ Treatment_Type $ 

      Treatment_Date:DATE9. Status $ Cost Outcome_Score;

DATALINES;

P001|25|male|chemotherapy|01JAN2023|Completed|50000|80

P002|-5|FEMALE|Immunotherapy|15FEB2023|Ongoing|60000|90

P003|45|Male|NULL|.|Completed|70000|85

P004|60||Antibiotics|10MAR2023|NULL|3000|70

P001|25|male|chemotherapy|01JAN2023|Completed|50000|80

P005|150|Female|Radiation|05APR2023|Completed|40000|95

P006|30|male|chemo|20MAY2023|Ongoing|45000|88

P007|.|Female|Immunotherapy|15JUN2023|Completed|55000|92

P008|40|male|ANTIBIOTICS|01JUL2023|Completed|2000|65

P009|35|female|lifestyle|12AUG2023|Completed|1000|60

;

RUN;

PROC PRINT DATA = treatments_raw;

RUN;

OUTPUT:

ObsPatient_IDTreatment_TypeGenderStatusTreatment_DateAgeCostOutcome_Score
1P001chemotherapymaleCompleted01JAN2023255000080
2P002ImmunotherapyFEMALEOngoing15FEB2023-56000090
3P003NULLMaleCompleted.457000085
4P004Antibiotics NULL10MAR202360300070
5P001chemotherapymaleCompleted01JAN2023255000080
6P005RadiationFemaleCompleted05APR20231504000095
7P006chemomaleOngoing20MAY2023304500088
8P007ImmunotherapyFemaleCompleted15JUN2023.5500092
9P008ANTIBIOTICSmaleCompleted01JUL202340200065
10P009lifestylefemaleCompleted12AUG202335100060

Explanation (SAS Raw Data)

This dataset mimics a clinical trial structure with intentional flaws:

  • Missing Age (P007)
  • Invalid Age (-5, 150)
  • Duplicate (P001)
  • Inconsistent treatment naming (“chemo”, “ANTIBIOTICS”)
  • Missing values (Gender, Status)
  • NULL placeholders

Key Insight: Real-world clinical data is rarely clean. Designing such flawed datasets prepares you for SDTM/ADaM validation scenarios.

R Code – Equivalent Raw Dataset

treatments_raw <- data.frame(

  Patient_ID = c("P001","P002","P003","P004","P001","P005","P006",

                 "P007","P008","P009"),

  Age = c(25,-5,45,60,25,150,30,NA,40,35),

  Gender = c("male","FEMALE","Male",NA,"male","Female","male",

             "Female","male","female"),

  Treatment_Type = c("chemotherapy","Immunotherapy","NULL",

                     "Antibiotics","chemotherapy","Radiation",

                     "chemo","Immunotherapy","ANTIBIOTICS",

                     "lifestyle"),

  Treatment_Date = as.Date(c("2023-01-01","2023-02-15",NA,

                             "2023-03-10","2023-01-01","2023-04-05",

                             "2023-05-20","2023-06-15","2023-07-01",

                             "2023-08-12")),

  Status = c("Completed","Ongoing","Completed","NULL","Completed",

             "Completed","Ongoing","Completed","Completed","Completed"),

  Cost = c(50000,60000,70000,3000,50000,40000,45000,55000,2000,1000),

  Outcome_Score = c(80,90,85,70,80,95,88,92,65,60)

)

OUTPUT:

 

Patient_ID

Age

Gender

Treatment_Type

Treatment_Date

Status

Cost

Outcome_Score

1

P001

25

male

chemotherapy

01-01-2023

Completed

50000

80

2

P002

-5

FEMALE

Immunotherapy

15-02-2023

Ongoing

60000

90

3

P003

45

Male

NULL

NA

Completed

70000

85

4

P004

60

NA

Antibiotics

10-03-2023

NULL

3000

70

5

P001

25

male

chemotherapy

01-01-2023

Completed

50000

80

6

P005

150

Female

Radiation

05-04-2023

Completed

40000

95

7

P006

30

male

chemo

20-05-2023

Ongoing

45000

88

8

P007

NA

Female

Immunotherapy

15-06-2023

Completed

55000

92

9

P008

40

male

ANTIBIOTICS

01-07-2023

Completed

2000

65

10

P009

35

female

lifestyle

12-08-2023

Completed

1000

60

Explanation (R Raw Data)

The same flawed dataset is recreated using data.frame(). R enables flexible transformations but lacks built-in regulatory structure like SAS.

Key Insight: Parallel dataset creation helps validate consistency between SAS and R pipelines.

3. Phase 1: Data Cleaning in SAS

DATA treatments_clean;

SET treatments_raw;

/* Standardize text */

Gender = UPCASE(STRIP(Gender));

Treatment_Type = PROPCASE(STRIP(Treatment_Type));

/* Handle missing values */

IF Gender = "" THEN Gender = "UNKNOWN";

IF Status = "NULL" THEN Status = "UNKNOWN";

/* Fix invalid age */

IF Age < 0 THEN Age = .;

IF Age > 120 THEN Age = 120;

/* Replace NULL treatment */

IF Treatment_Type = "Null" THEN Treatment_Type = "UNKNOWN";

/* Date correction */

IF Treatment_Date = . THEN Treatment_Date = TODAY();

RUN;

PROC PRINT DATA = treatments_clean;

RUN;

OUTPUT:

ObsPatient_IDTreatment_TypeGenderStatusTreatment_DateAgeCostOutcome_Score
1P001ChemotherapyMALECompleted01JAN2023255000080
2P002ImmunotherapyFEMALEOngoing15FEB2023.6000090
3P003UNKNOWNMALECompleted02MAY2026457000085
4P004AntibioticsUNKNOWNUNKNOWN10MAR202360300070
5P001ChemotherapyMALECompleted01JAN2023255000080
6P005RadiationFEMALECompleted05APR20231204000095
7P006ChemoMALEOngoing20MAY2023304500088
8P007ImmunotherapyFEMALECompleted15JUN2023.5500092
9P008AntibioticsMALECompleted01JUL202340200065
10P009LifestyleFEMALECompleted12AUG202335100060

/* Remove duplicates */

PROC SORT DATA=treatments_clean NODUPKEY;

BY Patient_ID Treatment_Date;

RUN;

PROC PRINT DATA = treatments_clean;

RUN;

OUTPUT:

ObsPatient_IDTreatment_TypeGenderStatusTreatment_DateAgeCostOutcome_Score
1P001ChemotherapyMALECompleted01JAN2023255000080
2P002ImmunotherapyFEMALEOngoing15FEB2023.6000090
3P003UNKNOWNMALECompleted02MAY2026457000085
4P004AntibioticsUNKNOWNUNKNOWN10MAR202360300070
5P005RadiationFEMALECompleted05APR20231204000095
6P006ChemoMALEOngoing20MAY2023304500088
7P007ImmunotherapyFEMALECompleted15JUN2023.5500092
8P008AntibioticsMALECompleted01JUL202340200065
9P009LifestyleFEMALECompleted12AUG202335100060

Explanation

This step introduces data normalization and validation logic:

  • UPCASE, PROPCASE, STRIP ensure text consistency.
  • Logical conditions fix unrealistic values.
  • Missing dates are imputed using TODAY().
  • PROC SORT NODUPKEY ensures unique patient records.

Key Insight: In clinical SAS workflows, every transformation must be traceable and justifiable, especially for regulatory audits.

4. Phase 2: Data Cleaning in R

library(dplyr)

treatments_clean <- treatments_raw %>%

  mutate(

    Gender = toupper(trimws(ifelse(is.na(Gender) | Gender == "", "UNKNOWN",

                                   Gender))),

    Treatment_Type = ifelse(is.na(Treatment_Type) | Treatment_Type == "NULL"

                            | Treatment_Type == "","UNKNOWN",

                            tools::toTitleCase(trimws(Treatment_Type))),

    Status = ifelse(Status == "NULL" | is.na(Status), "UNKNOWN", Status),

    Age = ifelse(Age < 0, NA, Age),

    Age = ifelse(Age > 120, 120, Age),

    Treatment_Date = coalesce(Treatment_Date, Sys.Date())

  ) %>%

  distinct(Patient_ID, Treatment_Date, .keep_all = TRUE)

OUTPUT:

 

Patient_ID

Age

Gender

Treatment_Type

Treatment_Date

Status

Cost

Outcome_Score

1

P001

25

MALE

Chemotherapy

01-01-2023

Completed

50000

80

2

P002

NA

FEMALE

Immunotherapy

15-02-2023

Ongoing

60000

90

3

P003

45

MALE

UNKNOWN

02-05-2026

Completed

70000

85

4

P004

60

UNKNOWN

Antibiotics

10-03-2023

UNKNOWN

3000

70

5

P005

120

FEMALE

Radiation

05-04-2023

Completed

40000

95

6

P006

30

MALE

Chemo

20-05-2023

Ongoing

45000

88

7

P007

NA

FEMALE

Immunotherapy

15-06-2023

Completed

55000

92

8

P008

40

MALE

ANTIBIOTICS

01-07-2023

Completed

2000

65

9

P009

35

FEMALE

Lifestyle

12-08-2023

Completed

1000

60

Explanation

R uses functional pipelines:

  • mutate() applies transformations
  • ifelse() handles conditions
  • coalesce() Keeps original data type
  • distinct() removes duplicates
  • trimws() and toupper() standardize text

Key Insight: R is flexible and expressive, but SAS provides stronger clinical traceability and metadata control.

5. Phase 3: Additional SAS Enhancements

/* Create Age Group */

DATA treatments_final;

SET treatments_clean;

LENGTH Age_Group $10;

SELECT;

WHEN (Age < 18) Age_Group = "CHILD";

WHEN (18 <= Age < 60) Age_Group = "ADULT";

OTHERWISE Age_Group = "SENIOR";

END;

RUN;

PROC PRINT DATA = treatments_final;

RUN;

OUTPUT:

ObsPatient_IDTreatment_TypeGenderStatusTreatment_DateAgeCostOutcome_ScoreAge_Group
1P001ChemotherapyMALECompleted01JAN2023255000080ADULT
2P002ImmunotherapyFEMALEOngoing15FEB2023.6000090CHILD
3P003UNKNOWNMALECompleted02MAY2026457000085ADULT
4P004AntibioticsUNKNOWNUNKNOWN10MAR202360300070SENIOR
5P005RadiationFEMALECompleted05APR20231204000095SENIOR
6P006ChemoMALEOngoing20MAY2023304500088ADULT
7P007ImmunotherapyFEMALECompleted15JUN2023.5500092CHILD
8P008AntibioticsMALECompleted01JUL202340200065ADULT
9P009LifestyleFEMALECompleted12AUG202335100060ADULT

/* Frequency Analysis */

PROC FREQ DATA=treatments_final;

TABLES Treatment_Type*Status / NOCUM NOPERCENT;

RUN;

OUTPUT:

The FREQ Procedure

Frequency
Row Pct
Col Pct
Table of Treatment_Type by Status
Treatment_TypeStatus
CompletedOngoingUNKNOWNTotal
Antibiotics
1
50.00
16.67
0
0.00
0.00
1
50.00
100.00
2
 
 
Chemo
0
0.00
0.00
1
100.00
50.00
0
0.00
0.00
1
 
 
Chemotherapy
1
100.00
16.67
0
0.00
0.00
0
0.00
0.00
1
 
 
Immunotherapy
1
50.00
16.67
1
50.00
50.00
0
0.00
0.00
2
 
 
Lifestyle
1
100.00
16.67
0
0.00
0.00
0
0.00
0.00
1
 
 
Radiation
1
100.00
16.67
0
0.00
0.00
0
0.00
0.00
1
 
 
UNKNOWN
1
100.00
16.67
0
0.00
0.00
0
0.00
0.00
1
 
 
Total
6
2
1
9

Explanation

  • SELECT-WHEN provides structured conditional logic.
  • Age grouping aligns with ADaM derivations.
  • PROC FREQ generates cross-tabulations for treatment vs status.

Key Insight: This step converts cleaned data into analytical insights, critical for safety and efficacy reporting.

6. 20 Data Cleaning Best Practices

  1. Always validate ranges (e.g., Age 0–120)
  2. Use controlled terminology (CDISC standards)
  3. Maintain audit trails for transformations
  4. Avoid hardcoding—use macros
  5. Validate date sequences (start < end)
  6. Use PROC COMPARE for QC
  7. Document assumptions clearly
  8. Separate raw and cleaned datasets
  9. Apply metadata-driven programming
  10. Handle missing values consistently
  11. Avoid overwriting original data
  12. Validate duplicates using keys
  13. Use formats for standardization
  14. Ensure reproducibility
  15. Perform double programming validation
  16. Align with SDTM/ADaM structures
  17. Use logs for debugging
  18. Validate categorical values
  19. Implement automated QC checks
  20. Ensure regulatory compliance (FDA-ready)

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic it’s decision-critical engineering. When we replace missing values, we’re not just filling gaps we’re preserving analytical continuity. For example, imputing a missing treatment date with TODAY() ensures that time-based analyses (like survival curves) remain computable.

Correcting unrealistic values like a patient age of 150 prevents statistical distortion. Imagine calculating average age for treatment response. One outlier could skew results, leading to incorrect clinical conclusions.

In salary datasets, normalization ensures fairness in compensation analysis. In clinical trials, correcting treatment labels (“chemo” → “Chemotherapy”) ensures accurate grouping in PROC FREQ, which directly impacts safety summaries.

Date imputation is particularly critical. If a treatment date is missing, downstream derivations like treatment duration or time-to-event analysis fail. That’s unacceptable in regulatory submissions.

Ultimately, every cleaning decision must align with business rules, clinical logic, and regulatory expectations. It’s not about making data look clean it’s about making it trustworthy, reproducible, and decision-ready.

8. 20 Key Insights

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing values distort analytics
  • Duplicates inflate counts
  • Validation prevents regulatory issues
  • SAS logs are your debugging allies
  • Controlled terminology is critical
  • Date consistency drives time analysis
  • Outliers must be justified or corrected
  • Audit trails ensure transparency
  • PROC FREQ depends on clean categories
  • Text inconsistencies break grouping
  • Cleaning is iterative, not one-time
  • Business logic drives transformations
  • QC is as important as programming
  • Metadata defines structure
  • Clinical data requires precision
  • Automation reduces human error
  • Clean data accelerates insights
  • Trust in data equals trust in decisions

9. Summary

SAS and R both offer powerful capabilities for data cleaning, but they serve slightly different purposes in a professional setting. SAS excels in structured, regulatory-compliant environments, particularly in clinical trials where traceability, auditability, and reproducibility are mandatory. Its procedures like PROC SORT, PROC FREQ, and data step logic provide a robust framework for handling complex transformations with precision.

R, on the other hand, shines in flexibility and rapid prototyping. Its dplyr package allows intuitive data manipulation using pipelines, making it ideal for exploratory data analysis and quick transformations. However, it lacks the built-in regulatory structure that SAS provides, which is critical in industries like pharmaceuticals.

In this blog, we demonstrated how a messy “Different Types of Treatment” dataset can be transformed into a clean, analysis-ready structure. We handled missing values, corrected invalid entries, standardized text, removed duplicates, and finally used PROC FREQ to generate meaningful insights.

The key takeaway is that data cleaning is not optional it’s foundational. Whether you're working with SDTM datasets or business analytics, the quality of your input directly determines the reliability of your output.

By combining SAS’s robustness with R’s flexibility, you can build scalable, accurate, and compliant data pipelines that stand up to both analytical scrutiny and regulatory review.

10. Conclusion

In the world of data analytics especially in clinical trials clean data is not a luxury, it’s a non-negotiable requirement. Every dataset you encounter will carry imperfections: missing values, inconsistent formats, duplicates, and logical anomalies. The difference between a junior programmer and an expert lies in how systematically and intelligently these issues are addressed.

This blog walked you through a realistic scenario involving treatment data arguably one of the most sensitive and impactful domains in healthcare analytics. We didn’t just clean data we engineered trust into it. Using SAS, we applied structured transformations, ensured compliance with clinical standards, and leveraged PROC FREQ to extract meaningful insights. With R, we demonstrated how similar logic can be implemented using modern, flexible tools.

But beyond tools, the real takeaway is mindset. Data cleaning is not a checklist it’s a discipline. Every transformation must be justified, documented, and reproducible. Every assumption must be defensible. And every output must be reliable enough to support critical decisions whether it's approving a drug or optimizing a business strategy.

As datasets grow in complexity and scale, the need for automated, validated, and standardized cleaning frameworks becomes even more critical. SAS macros, QC checks, and metadata-driven programming are not optional they are essential.

If you master this discipline, you don’t just become a programmer you become a data custodian, someone who ensures that decisions are built on a foundation of truth.

11. Interview Questions 

Q1: How would you handle duplicate patient records in SAS?

Answer: Use PROC SORT NODUPKEY with appropriate BY variables (e.g., Patient_ID, Date). Validate using PROC FREQ before and after.

Q2: A patient has age = -10. What would you do?

Answer: Set to missing using conditional logic (IF Age < 0 THEN Age = .;). Document the assumption.

Q3: How do you standardize inconsistent treatment names?

Answer: Use UPCASE, PROPCASE, and mapping logic or formats to align with controlled terminology.

Q4: How would you debug missing values in R?

Answer: Use is.na(), summarize using colSums(is.na(df)), and trace transformations step-by-step.

Q5: Why is PROC FREQ important after cleaning?

Answer: It validates categorical distributions, detects anomalies, and supports clinical summaries.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TREATMENT DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact Privacy Policy

Comments

Popular posts from this blog

458.Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

441.Fixing Negative Data Errors Like A Pro Using SAS ABS Function