From Chaotic Care to Clinical Clarity: Mastering Treatment Data with SAS PROC FREQ and Intelligent Cleaning Frameworks

1. Introduction – When Data Lies, Decisions Collapse

Imagine you're working on a clinical trial dataset tracking different types of treatments chemotherapy, immunotherapy, antibiotics, lifestyle interventions. Everything looks fine on the surface. But then…

A patient aged -12 appears.
Treatment dates occur before enrollment.
Some treatments are labeled “chemo,” others “CHEMOTHERAPY,” and some just “NULL”.
Duplicate patients distort outcome counts.

Now picture this dataset being used for regulatory submission (CDISC SDTM/ADaM). One wrong frequency count in PROC FREQ could misrepresent safety signals. That’s not just bad analytics that’s regulatory risk.

This is where SAS and R become your surgical tools. They don’t just process data they enforce discipline, reproducibility, and auditability.

In this blog, we’ll build a “Different Types of Treatment” dataset, intentionally inject real-world errors, clean it step-by-step, and finally analyze it using PROC FREQ like an industry-grade programmer.

2. Raw Data Creation in SAS and R

SAS Raw Dataset

DATA treatments_raw;

INFILE DATALINES DLM='|' MISSOVER DSD;

LENGTH Patient_ID $5 Treatment_Type $20

Gender $10 Status $12;

FORMAT Treatment_Date DATE9.;

INPUT Patient_ID $ Age Gender $ Treatment_Type $

Treatment_Date:DATE9. Status $ Cost Outcome_Score;

DATALINES;

P001|25|male|chemotherapy|01JAN2023|Completed|50000|80

P002|-5|FEMALE|Immunotherapy|15FEB2023|Ongoing|60000|90

P003|45|Male|NULL|.|Completed|70000|85

P004|60||Antibiotics|10MAR2023|NULL|3000|70

P001|25|male|chemotherapy|01JAN2023|Completed|50000|80

P005|150|Female|Radiation|05APR2023|Completed|40000|95

P006|30|male|chemo|20MAY2023|Ongoing|45000|88

P007|.|Female|Immunotherapy|15JUN2023|Completed|55000|92

P008|40|male|ANTIBIOTICS|01JUL2023|Completed|2000|65

P009|35|female|lifestyle|12AUG2023|Completed|1000|60

;

RUN;

PROC PRINT DATA = treatments_raw;

RUN;

OUTPUT:


Obs	Patient_ID	Treatment_Type	Gender	Status	Treatment_Date	Age	Cost	Outcome_Score
1	P001	chemotherapy	male	Completed	01JAN2023	25	50000	80
2	P002	Immunotherapy	FEMALE	Ongoing	15FEB2023	-5	60000	90
3	P003	NULL	Male	Completed	.	45	70000	85
4	P004	Antibiotics		NULL	10MAR2023	60	3000	70
5	P001	chemotherapy	male	Completed	01JAN2023	25	50000	80
6	P005	Radiation	Female	Completed	05APR2023	150	40000	95
7	P006	chemo	male	Ongoing	20MAY2023	30	45000	88
8	P007	Immunotherapy	Female	Completed	15JUN2023	.	55000	92
9	P008	ANTIBIOTICS	male	Completed	01JUL2023	40	2000	65
10	P009	lifestyle	female	Completed	12AUG2023	35	1000	60

Explanation (SAS Raw Data)

This dataset mimics a clinical trial structure with intentional flaws:

Missing Age (P007)
Invalid Age (-5, 150)
Duplicate (P001)
Inconsistent treatment naming (“chemo”, “ANTIBIOTICS”)
Missing values (Gender, Status)
NULL placeholders

Key Insight: Real-world clinical data is rarely clean. Designing such flawed datasets prepares you for SDTM/ADaM validation scenarios.

R Code – Equivalent Raw Dataset

treatments_raw <- data.frame(

Patient_ID = c("P001","P002","P003","P004","P001","P005","P006",

"P007","P008","P009"),

Age = c(25,-5,45,60,25,150,30,NA,40,35),

Gender = c("male","FEMALE","Male",NA,"male","Female","male",

"Female","male","female"),

Treatment_Type = c("chemotherapy","Immunotherapy","NULL",

"Antibiotics","chemotherapy","Radiation",

"chemo","Immunotherapy","ANTIBIOTICS",

"lifestyle"),

Treatment_Date = as.Date(c("2023-01-01","2023-02-15",NA,

"2023-03-10","2023-01-01","2023-04-05",

"2023-05-20","2023-06-15","2023-07-01",

"2023-08-12")),

Status = c("Completed","Ongoing","Completed","NULL","Completed",

"Completed","Ongoing","Completed","Completed","Completed"),

Cost = c(50000,60000,70000,3000,50000,40000,45000,55000,2000,1000),

Outcome_Score = c(80,90,85,70,80,95,88,92,65,60)

)

OUTPUT:

	Patient_ID	Age	Gender	Treatment_Type	Treatment_Date	Status	Cost	Outcome_Score
1	P001	25	male	chemotherapy	01-01-2023	Completed	50000	80
2	P002	-5	FEMALE	Immunotherapy	15-02-2023	Ongoing	60000	90
3	P003	45	Male	NULL	NA	Completed	70000	85
4	P004	60	NA	Antibiotics	10-03-2023	NULL	3000	70
5	P001	25	male	chemotherapy	01-01-2023	Completed	50000	80
6	P005	150	Female	Radiation	05-04-2023	Completed	40000	95
7	P006	30	male	chemo	20-05-2023	Ongoing	45000	88
8	P007	NA	Female	Immunotherapy	15-06-2023	Completed	55000	92
9	P008	40	male	ANTIBIOTICS	01-07-2023	Completed	2000	65
10	P009	35	female	lifestyle	12-08-2023	Completed	1000	60

Explanation (R Raw Data)

The same flawed dataset is recreated using data.frame(). R enables flexible transformations but lacks built-in regulatory structure like SAS.

Key Insight: Parallel dataset creation helps validate consistency between SAS and R pipelines.

3. Phase 1: Data Cleaning in SAS

DATA treatments_clean;

SET treatments_raw;

/* Standardize text */

Gender = UPCASE(STRIP(Gender));

Treatment_Type = PROPCASE(STRIP(Treatment_Type));

/* Handle missing values */

IF Gender = "" THEN Gender = "UNKNOWN";

IF Status = "NULL" THEN Status = "UNKNOWN";

/* Fix invalid age */

IF Age < 0 THEN Age = .;

IF Age > 120 THEN Age = 120;

/* Replace NULL treatment */

IF Treatment_Type = "Null" THEN Treatment_Type = "UNKNOWN";

/* Date correction */

IF Treatment_Date = . THEN Treatment_Date = TODAY();

RUN;

PROC PRINT DATA = treatments_clean;

RUN;

OUTPUT:


Obs	Patient_ID	Treatment_Type	Gender	Status	Treatment_Date	Age	Cost	Outcome_Score
1	P001	Chemotherapy	MALE	Completed	01JAN2023	25	50000	80
2	P002	Immunotherapy	FEMALE	Ongoing	15FEB2023	.	60000	90
3	P003	UNKNOWN	MALE	Completed	02MAY2026	45	70000	85
4	P004	Antibiotics	UNKNOWN	UNKNOWN	10MAR2023	60	3000	70
5	P001	Chemotherapy	MALE	Completed	01JAN2023	25	50000	80
6	P005	Radiation	FEMALE	Completed	05APR2023	120	40000	95
7	P006	Chemo	MALE	Ongoing	20MAY2023	30	45000	88
8	P007	Immunotherapy	FEMALE	Completed	15JUN2023	.	55000	92
9	P008	Antibiotics	MALE	Completed	01JUL2023	40	2000	65
10	P009	Lifestyle	FEMALE	Completed	12AUG2023	35	1000	60

/* Remove duplicates */

PROC SORT DATA=treatments_clean NODUPKEY;

BY Patient_ID Treatment_Date;

RUN;

PROC PRINT DATA = treatments_clean;

RUN;

OUTPUT:


Obs	Patient_ID	Treatment_Type	Gender	Status	Treatment_Date	Age	Cost	Outcome_Score
1	P001	Chemotherapy	MALE	Completed	01JAN2023	25	50000	80
2	P002	Immunotherapy	FEMALE	Ongoing	15FEB2023	.	60000	90
3	P003	UNKNOWN	MALE	Completed	02MAY2026	45	70000	85
4	P004	Antibiotics	UNKNOWN	UNKNOWN	10MAR2023	60	3000	70
5	P005	Radiation	FEMALE	Completed	05APR2023	120	40000	95
6	P006	Chemo	MALE	Ongoing	20MAY2023	30	45000	88
7	P007	Immunotherapy	FEMALE	Completed	15JUN2023	.	55000	92
8	P008	Antibiotics	MALE	Completed	01JUL2023	40	2000	65
9	P009	Lifestyle	FEMALE	Completed	12AUG2023	35	1000	60

Explanation

This step introduces data normalization and validation logic:

UPCASE, PROPCASE, STRIP ensure text consistency.
Logical conditions fix unrealistic values.
Missing dates are imputed using TODAY().
PROC SORT NODUPKEY ensures unique patient records.

Key Insight: In clinical SAS workflows, every transformation must be traceable and justifiable, especially for regulatory audits.

4. Phase 2: Data Cleaning in R

library(dplyr)

treatments_clean <- treatments_raw %>%

mutate(

Gender = toupper(trimws(ifelse(is.na(Gender) | Gender == "", "UNKNOWN",

Gender))),

Treatment_Type = ifelse(is.na(Treatment_Type) | Treatment_Type == "NULL"

| Treatment_Type == "","UNKNOWN",

tools::toTitleCase(trimws(Treatment_Type))),

Status = ifelse(Status == "NULL" | is.na(Status), "UNKNOWN", Status),

Age = ifelse(Age < 0, NA, Age),

Age = ifelse(Age > 120, 120, Age),

Treatment_Date = coalesce(Treatment_Date, Sys.Date())

) %>%

distinct(Patient_ID, Treatment_Date, .keep_all = TRUE)

OUTPUT:

	Patient_ID	Age	Gender	Treatment_Type	Treatment_Date	Status	Cost	Outcome_Score
1	P001	25	MALE	Chemotherapy	01-01-2023	Completed	50000	80
2	P002	NA	FEMALE	Immunotherapy	15-02-2023	Ongoing	60000	90
3	P003	45	MALE	UNKNOWN	02-05-2026	Completed	70000	85
4	P004	60	UNKNOWN	Antibiotics	10-03-2023	UNKNOWN	3000	70
5	P005	120	FEMALE	Radiation	05-04-2023	Completed	40000	95
6	P006	30	MALE	Chemo	20-05-2023	Ongoing	45000	88
7	P007	NA	FEMALE	Immunotherapy	15-06-2023	Completed	55000	92
8	P008	40	MALE	ANTIBIOTICS	01-07-2023	Completed	2000	65
9	P009	35	FEMALE	Lifestyle	12-08-2023	Completed	1000	60

Explanation

R uses functional pipelines:

mutate() applies transformations
ifelse() handles conditions
coalesce() Keeps original data type
distinct() removes duplicates
trimws() and toupper() standardize text

Key Insight: R is flexible and expressive, but SAS provides stronger clinical traceability and metadata control.

5. Phase 3: Additional SAS Enhancements

/* Create Age Group */

DATA treatments_final;

SET treatments_clean;

LENGTH Age_Group $10;

SELECT;

WHEN (Age < 18) Age_Group = "CHILD";

WHEN (18 <= Age < 60) Age_Group = "ADULT";

OTHERWISE Age_Group = "SENIOR";

END;

RUN;

PROC PRINT DATA = treatments_final;

RUN;

OUTPUT:


Obs	Patient_ID	Treatment_Type	Gender	Status	Treatment_Date	Age	Cost	Outcome_Score	Age_Group
1	P001	Chemotherapy	MALE	Completed	01JAN2023	25	50000	80	ADULT
2	P002	Immunotherapy	FEMALE	Ongoing	15FEB2023	.	60000	90	CHILD
3	P003	UNKNOWN	MALE	Completed	02MAY2026	45	70000	85	ADULT
4	P004	Antibiotics	UNKNOWN	UNKNOWN	10MAR2023	60	3000	70	SENIOR
5	P005	Radiation	FEMALE	Completed	05APR2023	120	40000	95	SENIOR
6	P006	Chemo	MALE	Ongoing	20MAY2023	30	45000	88	ADULT
7	P007	Immunotherapy	FEMALE	Completed	15JUN2023	.	55000	92	CHILD
8	P008	Antibiotics	MALE	Completed	01JUL2023	40	2000	65	ADULT
9	P009	Lifestyle	FEMALE	Completed	12AUG2023	35	1000	60	ADULT

/* Frequency Analysis */

PROC FREQ DATA=treatments_final;

TABLES Treatment_Type*Status / NOCUM NOPERCENT;

RUN;

OUTPUT:

The FREQ Procedure

Frequency Row Pct Col Pct


Table of Treatment_Type by Status
Treatment_Type	Status
Treatment_Type	Completed	Ongoing	UNKNOWN	Total
Antibiotics	1 50.00 16.67	0 0.00 0.00	1 50.00 100.00	2
Chemo	0 0.00 0.00	1 100.00 50.00	0 0.00 0.00	1
Chemotherapy	1 100.00 16.67	0 0.00 0.00	0 0.00 0.00	1
Immunotherapy	1 50.00 16.67	1 50.00 50.00	0 0.00 0.00	2
Lifestyle	1 100.00 16.67	0 0.00 0.00	0 0.00 0.00	1
Radiation	1 100.00 16.67	0 0.00 0.00	0 0.00 0.00	1
UNKNOWN	1 100.00 16.67	0 0.00 0.00	0 0.00 0.00	1
Total	6	2	1	9

Explanation

SELECT-WHEN provides structured conditional logic.
Age grouping aligns with ADaM derivations.
PROC FREQ generates cross-tabulations for treatment vs status.

Key Insight: This step converts cleaned data into analytical insights, critical for safety and efficacy reporting.

6. 20 Data Cleaning Best Practices

Always validate ranges (e.g., Age 0–120)
Use controlled terminology (CDISC standards)
Maintain audit trails for transformations
Avoid hardcoding—use macros
Validate date sequences (start < end)
Use PROC COMPARE for QC
Document assumptions clearly
Separate raw and cleaned datasets
Apply metadata-driven programming
Handle missing values consistently
Avoid overwriting original data
Validate duplicates using keys
Use formats for standardization
Ensure reproducibility
Perform double programming validation
Align with SDTM/ADaM structures
Use logs for debugging
Validate categorical values
Implement automated QC checks
Ensure regulatory compliance (FDA-ready)

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic it’s decision-critical engineering. When we replace missing values, we’re not just filling gaps we’re preserving analytical continuity. For example, imputing a missing treatment date with TODAY() ensures that time-based analyses (like survival curves) remain computable.

Correcting unrealistic values like a patient age of 150 prevents statistical distortion. Imagine calculating average age for treatment response. One outlier could skew results, leading to incorrect clinical conclusions.

In salary datasets, normalization ensures fairness in compensation analysis. In clinical trials, correcting treatment labels (“chemo” → “Chemotherapy”) ensures accurate grouping in PROC FREQ, which directly impacts safety summaries.

Date imputation is particularly critical. If a treatment date is missing, downstream derivations like treatment duration or time-to-event analysis fail. That’s unacceptable in regulatory submissions.

Ultimately, every cleaning decision must align with business rules, clinical logic, and regulatory expectations. It’s not about making data look clean it’s about making it trustworthy, reproducible, and decision-ready.

8. 20 Key Insights

Dirty data leads to wrong conclusions
Standardization ensures reproducibility
Missing values distort analytics
Duplicates inflate counts
Validation prevents regulatory issues
SAS logs are your debugging allies
Controlled terminology is critical
Date consistency drives time analysis
Outliers must be justified or corrected
Audit trails ensure transparency
PROC FREQ depends on clean categories
Text inconsistencies break grouping
Cleaning is iterative, not one-time
Business logic drives transformations
QC is as important as programming
Metadata defines structure
Clinical data requires precision
Automation reduces human error
Clean data accelerates insights
Trust in data equals trust in decisions

9. Summary

SAS and R both offer powerful capabilities for data cleaning, but they serve slightly different purposes in a professional setting. SAS excels in structured, regulatory-compliant environments, particularly in clinical trials where traceability, auditability, and reproducibility are mandatory. Its procedures like PROC SORT, PROC FREQ, and data step logic provide a robust framework for handling complex transformations with precision.

R, on the other hand, shines in flexibility and rapid prototyping. Its dplyr package allows intuitive data manipulation using pipelines, making it ideal for exploratory data analysis and quick transformations. However, it lacks the built-in regulatory structure that SAS provides, which is critical in industries like pharmaceuticals.

In this blog, we demonstrated how a messy “Different Types of Treatment” dataset can be transformed into a clean, analysis-ready structure. We handled missing values, corrected invalid entries, standardized text, removed duplicates, and finally used PROC FREQ to generate meaningful insights.

The key takeaway is that data cleaning is not optional it’s foundational. Whether you're working with SDTM datasets or business analytics, the quality of your input directly determines the reliability of your output.

By combining SAS’s robustness with R’s flexibility, you can build scalable, accurate, and compliant data pipelines that stand up to both analytical scrutiny and regulatory review.

10. Conclusion

In the world of data analytics especially in clinical trials clean data is not a luxury, it’s a non-negotiable requirement. Every dataset you encounter will carry imperfections: missing values, inconsistent formats, duplicates, and logical anomalies. The difference between a junior programmer and an expert lies in how systematically and intelligently these issues are addressed.

This blog walked you through a realistic scenario involving treatment data arguably one of the most sensitive and impactful domains in healthcare analytics. We didn’t just clean data we engineered trust into it. Using SAS, we applied structured transformations, ensured compliance with clinical standards, and leveraged PROC FREQ to extract meaningful insights. With R, we demonstrated how similar logic can be implemented using modern, flexible tools.

But beyond tools, the real takeaway is mindset. Data cleaning is not a checklist it’s a discipline. Every transformation must be justified, documented, and reproducible. Every assumption must be defensible. And every output must be reliable enough to support critical decisions whether it's approving a drug or optimizing a business strategy.

As datasets grow in complexity and scale, the need for automated, validated, and standardized cleaning frameworks becomes even more critical. SAS macros, QC checks, and metadata-driven programming are not optional they are essential.

If you master this discipline, you don’t just become a programmer you become a data custodian, someone who ensures that decisions are built on a foundation of truth.

11. Interview Questions

Q1: How would you handle duplicate patient records in SAS?

Answer: Use PROC SORT NODUPKEY with appropriate BY variables (e.g., Patient_ID, Date). Validate using PROC FREQ before and after.

Q2: A patient has age = -10. What would you do?

Answer: Set to missing using conditional logic (IF Age < 0 THEN Age = .;). Document the assumption.

Q3: How do you standardize inconsistent treatment names?

Answer: Use UPCASE, PROPCASE, and mapping logic or formats to align with controlled terminology.

Q4: How would you debug missing values in R?

Answer: Use is.na(), summarize using colSums(is.na(df)), and trace transformations step-by-step.

Q5: Why is PROC FREQ important after cleaning?

Answer: It validates categorical distributions, detects anomalies, and supports clinical summaries.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TREATMENT DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Treatment Data Unleashed: Converting Chaos into Clinical Clarity with SAS Intelligence

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Treatment Data Unleashed: Converting Chaos into Clinical Clarity with SAS Intelligence

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?