Treatment Data Unleashed: Converting Chaos into Clinical Clarity with SAS Intelligence
From Chaotic Care to Clinical Clarity: Mastering Treatment Data with SAS PROC FREQ and Intelligent Cleaning Frameworks
1. Introduction – When Data Lies, Decisions
Collapse
Imagine
you're working on a clinical trial dataset tracking different types of
treatments chemotherapy, immunotherapy, antibiotics, lifestyle interventions.
Everything looks fine on the surface. But then…
- A patient aged -12
appears.
- Treatment dates occur before
enrollment.
- Some treatments are labeled
“chemo,” others “CHEMOTHERAPY,” and some just “NULL”.
- Duplicate patients distort
outcome counts.
Now
picture this dataset being used for regulatory submission (CDISC SDTM/ADaM).
One wrong frequency count in PROC FREQ could misrepresent safety signals.
That’s not just bad analytics that’s regulatory risk.
This is
where SAS and R become your surgical tools. They don’t just process data
they enforce discipline, reproducibility, and auditability.
In this blog, we’ll build a “Different Types of Treatment” dataset, intentionally inject real-world errors, clean it step-by-step, and finally analyze it using PROC FREQ like an industry-grade programmer.
2. Raw Data Creation in SAS and R
SAS Raw Dataset
DATA treatments_raw;
INFILE DATALINES DLM='|' MISSOVER DSD;
LENGTH Patient_ID $5 Treatment_Type $20
Gender $10 Status $12;
FORMAT Treatment_Date DATE9.;
INPUT Patient_ID $ Age Gender $ Treatment_Type $
Treatment_Date:DATE9. Status $ Cost Outcome_Score;
DATALINES;
P001|25|male|chemotherapy|01JAN2023|Completed|50000|80
P002|-5|FEMALE|Immunotherapy|15FEB2023|Ongoing|60000|90
P003|45|Male|NULL|.|Completed|70000|85
P004|60||Antibiotics|10MAR2023|NULL|3000|70
P001|25|male|chemotherapy|01JAN2023|Completed|50000|80
P005|150|Female|Radiation|05APR2023|Completed|40000|95
P006|30|male|chemo|20MAY2023|Ongoing|45000|88
P007|.|Female|Immunotherapy|15JUN2023|Completed|55000|92
P008|40|male|ANTIBIOTICS|01JUL2023|Completed|2000|65
P009|35|female|lifestyle|12AUG2023|Completed|1000|60
;
RUN;
PROC PRINT DATA = treatments_raw;
RUN;
OUTPUT:
| Obs | Patient_ID | Treatment_Type | Gender | Status | Treatment_Date | Age | Cost | Outcome_Score |
|---|---|---|---|---|---|---|---|---|
| 1 | P001 | chemotherapy | male | Completed | 01JAN2023 | 25 | 50000 | 80 |
| 2 | P002 | Immunotherapy | FEMALE | Ongoing | 15FEB2023 | -5 | 60000 | 90 |
| 3 | P003 | NULL | Male | Completed | . | 45 | 70000 | 85 |
| 4 | P004 | Antibiotics | NULL | 10MAR2023 | 60 | 3000 | 70 | |
| 5 | P001 | chemotherapy | male | Completed | 01JAN2023 | 25 | 50000 | 80 |
| 6 | P005 | Radiation | Female | Completed | 05APR2023 | 150 | 40000 | 95 |
| 7 | P006 | chemo | male | Ongoing | 20MAY2023 | 30 | 45000 | 88 |
| 8 | P007 | Immunotherapy | Female | Completed | 15JUN2023 | . | 55000 | 92 |
| 9 | P008 | ANTIBIOTICS | male | Completed | 01JUL2023 | 40 | 2000 | 65 |
| 10 | P009 | lifestyle | female | Completed | 12AUG2023 | 35 | 1000 | 60 |
Explanation (SAS Raw Data)
This
dataset mimics a clinical trial structure with intentional flaws:
- Missing Age (P007)
- Invalid Age (-5, 150)
- Duplicate (P001)
- Inconsistent treatment
naming (“chemo”, “ANTIBIOTICS”)
- Missing values (Gender,
Status)
- NULL placeholders
Key
Insight:
Real-world clinical data is rarely clean. Designing such flawed datasets
prepares you for SDTM/ADaM validation scenarios.
R Code – Equivalent Raw Dataset
treatments_raw <- data.frame(
Patient_ID = c("P001","P002","P003","P004","P001","P005","P006",
"P007","P008","P009"),
Age = c(25,-5,45,60,25,150,30,NA,40,35),
Gender = c("male","FEMALE","Male",NA,"male","Female","male",
"Female","male","female"),
Treatment_Type = c("chemotherapy","Immunotherapy","NULL",
"Antibiotics","chemotherapy","Radiation",
"chemo","Immunotherapy","ANTIBIOTICS",
"lifestyle"),
Treatment_Date = as.Date(c("2023-01-01","2023-02-15",NA,
"2023-03-10","2023-01-01","2023-04-05",
"2023-05-20","2023-06-15","2023-07-01",
"2023-08-12")),
Status = c("Completed","Ongoing","Completed","NULL","Completed",
"Completed","Ongoing","Completed","Completed","Completed"),
Cost = c(50000,60000,70000,3000,50000,40000,45000,55000,2000,1000),
Outcome_Score = c(80,90,85,70,80,95,88,92,65,60)
)
OUTPUT:
|
|
Patient_ID |
Age |
Gender |
Treatment_Type |
Treatment_Date |
Status |
Cost |
Outcome_Score |
|
1 |
P001 |
25 |
male |
chemotherapy |
01-01-2023 |
Completed |
50000 |
80 |
|
2 |
P002 |
-5 |
FEMALE |
Immunotherapy |
15-02-2023 |
Ongoing |
60000 |
90 |
|
3 |
P003 |
45 |
Male |
NULL |
NA |
Completed |
70000 |
85 |
|
4 |
P004 |
60 |
NA |
Antibiotics |
10-03-2023 |
NULL |
3000 |
70 |
|
5 |
P001 |
25 |
male |
chemotherapy |
01-01-2023 |
Completed |
50000 |
80 |
|
6 |
P005 |
150 |
Female |
Radiation |
05-04-2023 |
Completed |
40000 |
95 |
|
7 |
P006 |
30 |
male |
chemo |
20-05-2023 |
Ongoing |
45000 |
88 |
|
8 |
P007 |
NA |
Female |
Immunotherapy |
15-06-2023 |
Completed |
55000 |
92 |
|
9 |
P008 |
40 |
male |
ANTIBIOTICS |
01-07-2023 |
Completed |
2000 |
65 |
|
10 |
P009 |
35 |
female |
lifestyle |
12-08-2023 |
Completed |
1000 |
60 |
Explanation (R Raw Data)
The same
flawed dataset is recreated using data.frame(). R enables flexible transformations
but lacks built-in regulatory structure like SAS.
Key
Insight: Parallel
dataset creation helps validate consistency between SAS and R pipelines.
3. Phase 1: Data Cleaning in SAS
DATA treatments_clean;
SET treatments_raw;
/* Standardize text */
Gender = UPCASE(STRIP(Gender));
Treatment_Type = PROPCASE(STRIP(Treatment_Type));
/* Handle missing values */
IF Gender = "" THEN Gender = "UNKNOWN";
IF Status = "NULL" THEN Status = "UNKNOWN";
/* Fix invalid age */
IF Age < 0 THEN Age = .;
IF Age > 120 THEN Age = 120;
/* Replace NULL treatment */
IF Treatment_Type = "Null" THEN Treatment_Type = "UNKNOWN";
/* Date correction */
IF Treatment_Date = . THEN Treatment_Date = TODAY();
RUN;
PROC PRINT DATA = treatments_clean;
RUN;
OUTPUT:
| Obs | Patient_ID | Treatment_Type | Gender | Status | Treatment_Date | Age | Cost | Outcome_Score |
|---|---|---|---|---|---|---|---|---|
| 1 | P001 | Chemotherapy | MALE | Completed | 01JAN2023 | 25 | 50000 | 80 |
| 2 | P002 | Immunotherapy | FEMALE | Ongoing | 15FEB2023 | . | 60000 | 90 |
| 3 | P003 | UNKNOWN | MALE | Completed | 02MAY2026 | 45 | 70000 | 85 |
| 4 | P004 | Antibiotics | UNKNOWN | UNKNOWN | 10MAR2023 | 60 | 3000 | 70 |
| 5 | P001 | Chemotherapy | MALE | Completed | 01JAN2023 | 25 | 50000 | 80 |
| 6 | P005 | Radiation | FEMALE | Completed | 05APR2023 | 120 | 40000 | 95 |
| 7 | P006 | Chemo | MALE | Ongoing | 20MAY2023 | 30 | 45000 | 88 |
| 8 | P007 | Immunotherapy | FEMALE | Completed | 15JUN2023 | . | 55000 | 92 |
| 9 | P008 | Antibiotics | MALE | Completed | 01JUL2023 | 40 | 2000 | 65 |
| 10 | P009 | Lifestyle | FEMALE | Completed | 12AUG2023 | 35 | 1000 | 60 |
/* Remove duplicates */
PROC SORT DATA=treatments_clean NODUPKEY;
BY Patient_ID Treatment_Date;
RUN;
PROC PRINT DATA = treatments_clean;
RUN;
OUTPUT:
| Obs | Patient_ID | Treatment_Type | Gender | Status | Treatment_Date | Age | Cost | Outcome_Score |
|---|---|---|---|---|---|---|---|---|
| 1 | P001 | Chemotherapy | MALE | Completed | 01JAN2023 | 25 | 50000 | 80 |
| 2 | P002 | Immunotherapy | FEMALE | Ongoing | 15FEB2023 | . | 60000 | 90 |
| 3 | P003 | UNKNOWN | MALE | Completed | 02MAY2026 | 45 | 70000 | 85 |
| 4 | P004 | Antibiotics | UNKNOWN | UNKNOWN | 10MAR2023 | 60 | 3000 | 70 |
| 5 | P005 | Radiation | FEMALE | Completed | 05APR2023 | 120 | 40000 | 95 |
| 6 | P006 | Chemo | MALE | Ongoing | 20MAY2023 | 30 | 45000 | 88 |
| 7 | P007 | Immunotherapy | FEMALE | Completed | 15JUN2023 | . | 55000 | 92 |
| 8 | P008 | Antibiotics | MALE | Completed | 01JUL2023 | 40 | 2000 | 65 |
| 9 | P009 | Lifestyle | FEMALE | Completed | 12AUG2023 | 35 | 1000 | 60 |
Explanation
This step
introduces data normalization and validation logic:
- UPCASE, PROPCASE, STRIP
ensure text consistency.
- Logical conditions fix
unrealistic values.
- Missing dates are imputed
using TODAY().
- PROC SORT NODUPKEY ensures
unique patient records.
Key
Insight: In
clinical SAS workflows, every transformation must be traceable and
justifiable, especially for regulatory audits.
4. Phase 2: Data Cleaning in R
library(dplyr)
treatments_clean <- treatments_raw %>%
mutate(
Gender = toupper(trimws(ifelse(is.na(Gender) | Gender == "", "UNKNOWN",
Gender))),
Treatment_Type = ifelse(is.na(Treatment_Type) | Treatment_Type == "NULL"
| Treatment_Type == "","UNKNOWN",
tools::toTitleCase(trimws(Treatment_Type))),
Status = ifelse(Status == "NULL" | is.na(Status), "UNKNOWN", Status),
Age = ifelse(Age < 0, NA, Age),
Age = ifelse(Age > 120, 120, Age),
Treatment_Date = coalesce(Treatment_Date, Sys.Date())
) %>%
distinct(Patient_ID, Treatment_Date, .keep_all = TRUE)
OUTPUT:
|
|
Patient_ID |
Age |
Gender |
Treatment_Type |
Treatment_Date |
Status |
Cost |
Outcome_Score |
|
1 |
P001 |
25 |
MALE |
Chemotherapy |
01-01-2023 |
Completed |
50000 |
80 |
|
2 |
P002 |
NA |
FEMALE |
Immunotherapy |
15-02-2023 |
Ongoing |
60000 |
90 |
|
3 |
P003 |
45 |
MALE |
UNKNOWN |
02-05-2026 |
Completed |
70000 |
85 |
|
4 |
P004 |
60 |
UNKNOWN |
Antibiotics |
10-03-2023 |
UNKNOWN |
3000 |
70 |
|
5 |
P005 |
120 |
FEMALE |
Radiation |
05-04-2023 |
Completed |
40000 |
95 |
|
6 |
P006 |
30 |
MALE |
Chemo |
20-05-2023 |
Ongoing |
45000 |
88 |
|
7 |
P007 |
NA |
FEMALE |
Immunotherapy |
15-06-2023 |
Completed |
55000 |
92 |
|
8 |
P008 |
40 |
MALE |
ANTIBIOTICS |
01-07-2023 |
Completed |
2000 |
65 |
|
9 |
P009 |
35 |
FEMALE |
Lifestyle |
12-08-2023 |
Completed |
1000 |
60 |
Explanation
R uses functional
pipelines:
- mutate() applies
transformations
- ifelse() handles conditions
- coalesce() Keeps
original data type
- distinct() removes
duplicates
- trimws() and toupper() standardize
text
Key
Insight: R is
flexible and expressive, but SAS provides stronger clinical traceability and
metadata control.
5. Phase 3: Additional SAS Enhancements
/* Create Age Group */
DATA treatments_final;
SET treatments_clean;
LENGTH Age_Group $10;
SELECT;
WHEN (Age < 18) Age_Group = "CHILD";
WHEN (18 <= Age < 60) Age_Group = "ADULT";
OTHERWISE Age_Group = "SENIOR";
END;
RUN;
PROC PRINT DATA = treatments_final;
RUN;
OUTPUT:
| Obs | Patient_ID | Treatment_Type | Gender | Status | Treatment_Date | Age | Cost | Outcome_Score | Age_Group |
|---|---|---|---|---|---|---|---|---|---|
| 1 | P001 | Chemotherapy | MALE | Completed | 01JAN2023 | 25 | 50000 | 80 | ADULT |
| 2 | P002 | Immunotherapy | FEMALE | Ongoing | 15FEB2023 | . | 60000 | 90 | CHILD |
| 3 | P003 | UNKNOWN | MALE | Completed | 02MAY2026 | 45 | 70000 | 85 | ADULT |
| 4 | P004 | Antibiotics | UNKNOWN | UNKNOWN | 10MAR2023 | 60 | 3000 | 70 | SENIOR |
| 5 | P005 | Radiation | FEMALE | Completed | 05APR2023 | 120 | 40000 | 95 | SENIOR |
| 6 | P006 | Chemo | MALE | Ongoing | 20MAY2023 | 30 | 45000 | 88 | ADULT |
| 7 | P007 | Immunotherapy | FEMALE | Completed | 15JUN2023 | . | 55000 | 92 | CHILD |
| 8 | P008 | Antibiotics | MALE | Completed | 01JUL2023 | 40 | 2000 | 65 | ADULT |
| 9 | P009 | Lifestyle | FEMALE | Completed | 12AUG2023 | 35 | 1000 | 60 | ADULT |
/* Frequency Analysis */
PROC FREQ DATA=treatments_final;
TABLES Treatment_Type*Status / NOCUM NOPERCENT;
RUN;
OUTPUT:
The FREQ Procedure
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||
Explanation
- SELECT-WHEN provides
structured conditional logic.
- Age grouping aligns with ADaM
derivations.
- PROC FREQ generates
cross-tabulations for treatment vs status.
Key
Insight: This
step converts cleaned data into analytical insights, critical for safety
and efficacy reporting.
6. 20 Data Cleaning Best Practices
- Always validate ranges
(e.g., Age 0–120)
- Use controlled terminology
(CDISC standards)
- Maintain audit trails for
transformations
- Avoid hardcoding—use macros
- Validate date sequences
(start < end)
- Use PROC COMPARE for QC
- Document assumptions clearly
- Separate raw and cleaned
datasets
- Apply metadata-driven
programming
- Handle missing values
consistently
- Avoid overwriting original
data
- Validate duplicates using
keys
- Use formats for
standardization
- Ensure reproducibility
- Perform double programming
validation
- Align with SDTM/ADaM
structures
- Use logs for debugging
- Validate categorical values
- Implement automated QC
checks
- Ensure regulatory compliance
(FDA-ready)
7. Business Logic Behind Data Cleaning
Data
cleaning is not cosmetic it’s decision-critical engineering. When we
replace missing values, we’re not just filling gaps we’re preserving analytical
continuity. For example, imputing a missing treatment date with TODAY() ensures
that time-based analyses (like survival curves) remain computable.
Correcting
unrealistic values like a patient age of 150 prevents statistical distortion.
Imagine calculating average age for treatment response. One outlier could skew
results, leading to incorrect clinical conclusions.
In salary
datasets, normalization ensures fairness in compensation analysis. In clinical
trials, correcting treatment labels (“chemo” → “Chemotherapy”) ensures accurate
grouping in PROC FREQ, which directly impacts safety summaries.
Date
imputation is particularly critical. If a treatment date is missing, downstream
derivations like treatment duration or time-to-event analysis
fail. That’s unacceptable in regulatory submissions.
Ultimately,
every cleaning decision must align with business rules, clinical logic, and
regulatory expectations. It’s not about making data look clean it’s about
making it trustworthy, reproducible, and decision-ready.
8. 20 Key Insights
- Dirty data leads to wrong
conclusions
- Standardization ensures
reproducibility
- Missing values distort
analytics
- Duplicates inflate counts
- Validation prevents
regulatory issues
- SAS logs are your debugging
allies
- Controlled terminology is
critical
- Date consistency drives time
analysis
- Outliers must be justified
or corrected
- Audit trails ensure
transparency
- PROC FREQ depends on clean
categories
- Text inconsistencies break
grouping
- Cleaning is iterative, not
one-time
- Business logic drives
transformations
- QC is as important as
programming
- Metadata defines structure
- Clinical data requires
precision
- Automation reduces human
error
- Clean data accelerates
insights
- Trust in data equals trust
in decisions
9. Summary
SAS and R
both offer powerful capabilities for data cleaning, but they serve slightly
different purposes in a professional setting. SAS excels in structured,
regulatory-compliant environments, particularly in clinical trials where traceability,
auditability, and reproducibility are mandatory. Its procedures like PROC
SORT, PROC FREQ, and data step logic provide a robust framework for handling
complex transformations with precision.
R, on the
other hand, shines in flexibility and rapid prototyping. Its dplyr
package allows intuitive data manipulation using pipelines, making it ideal for
exploratory data analysis and quick transformations. However, it lacks the
built-in regulatory structure that SAS provides, which is critical in
industries like pharmaceuticals.
In this
blog, we demonstrated how a messy “Different Types of Treatment” dataset can be
transformed into a clean, analysis-ready structure. We handled missing values,
corrected invalid entries, standardized text, removed duplicates, and finally
used PROC FREQ to generate meaningful insights.
The key
takeaway is that data cleaning is not optional it’s foundational.
Whether you're working with SDTM datasets or business analytics, the quality of
your input directly determines the reliability of your output.
By
combining SAS’s robustness with R’s flexibility, you can build scalable,
accurate, and compliant data pipelines that stand up to both analytical
scrutiny and regulatory review.
10. Conclusion
In the
world of data analytics especially in clinical trials clean data is not a
luxury, it’s a non-negotiable requirement. Every dataset you encounter
will carry imperfections: missing values, inconsistent formats, duplicates, and
logical anomalies. The difference between a junior programmer and an expert
lies in how systematically and intelligently these issues are addressed.
This blog
walked you through a realistic scenario involving treatment data arguably one
of the most sensitive and impactful domains in healthcare analytics. We didn’t
just clean data we engineered trust into it. Using SAS, we applied
structured transformations, ensured compliance with clinical standards, and
leveraged PROC FREQ to extract meaningful insights. With R, we demonstrated how
similar logic can be implemented using modern, flexible tools.
But
beyond tools, the real takeaway is mindset. Data cleaning is not a
checklist it’s a discipline. Every transformation must be justified,
documented, and reproducible. Every assumption must be defensible. And every
output must be reliable enough to support critical decisions whether it's
approving a drug or optimizing a business strategy.
As
datasets grow in complexity and scale, the need for automated, validated,
and standardized cleaning frameworks becomes even more critical. SAS
macros, QC checks, and metadata-driven programming are not optional they are
essential.
If you
master this discipline, you don’t just become a programmer you become a data
custodian, someone who ensures that decisions are built on a foundation of
truth.
11. Interview Questions
Q1: How would you handle duplicate patient records
in SAS?
Answer: Use PROC SORT NODUPKEY with
appropriate BY variables (e.g., Patient_ID, Date). Validate using PROC FREQ
before and after.
Q2: A patient has age = -10. What would you do?
Answer: Set to missing using conditional
logic (IF Age < 0 THEN Age = .;). Document the assumption.
Q3: How do you standardize inconsistent treatment
names?
Answer: Use UPCASE, PROPCASE, and
mapping logic or formats to align with controlled terminology.
Q4: How would you debug missing values in R?
Answer: Use is.na(), summarize using colSums(is.na(df)),
and trace transformations step-by-step.
Q5: Why is PROC FREQ important after cleaning?
Answer: It validates categorical
distributions, detects anomalies, and supports clinical summaries.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TREATMENT DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment