When Brilliant Data Goes Bad: Transforming a “Top Scientists in the World” Dataset into Analytical Gold with SAS & R
1. Introduction
Imagine you're working on a clinical trial dataset for a global oncology study. The stakes are high: a regulatory submission depends on your analysis. But when you open the dataset, chaos greets you:
- Patient ages of -5 and 250
- Dates like 2023-02-30
- Missing treatment groups labeled as "NULL", " ", and NA
- Duplicate patient IDs
- Country names like "india", "INDIA", and "Ind"

This is not just messy; it's dangerous. Bad data doesn't just slow you down; it destroys trust in analytics. In clinical trials, it can lead to incorrect efficacy conclusions, regulatory rejection, or even patient safety risks.

This is where SAS and R shine. SAS dominates regulated environments (CDISC, SDTM, ADaM), while R provides flexibility and speed. Together, they form a powerful toolkit for data transformation and cleaning.

In this blog, we'll simulate a messy dataset of Top Scientists in the World, intentionally inject errors, and then systematically clean and transform it using SAS and R, just as in real-world industry scenarios.
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset with Intentional Errors)
DATA scientists_raw;
INPUT ID Name $ Country $ Field :$12. Age Awards Salary Join_Date :$12.;
DATALINES;
1 Einstein germany Physics 76 5 100000 1955-04-18
2 curie france chemistry -45 2 90000 1934-07-04
3 Newton UK Physics 84 . 120000 1727-03-31
4 Tesla NULL engineering 86 3 110000 1943-01-07
5 Darwin uk biology 73 4 . 1882-04-19
6 Einstein germany Physics 76 5 100000 1955-04-18
7 Raman INDIA Physics 82 3 80000 1970-11-21
8 kalam india aerospace 150 6 95000 2015-07-27
9 NULL usa math 60 2 70000 2000-02-30
10 Bohr denmark physics 77 3 105000 1962-11-18
;
RUN;
PROC PRINT DATA = scientists_raw;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 2 | 2 | curie | france | chemistry | -45 | 2 | 90000 | 1934-07-04 |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 |
| 4 | 4 | Tesla | NULL | engineering | 86 | 3 | 110000 | 1943-01-07 |
| 5 | 5 | Darwin | uk | biology | 73 | 4 | . | 1882-04-19 |
| 6 | 6 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 7 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 |
| 8 | 8 | kalam | india | aerospace | 150 | 6 | 95000 | 2015-07-27 |
| 9 | 9 | NULL | usa | math | 60 | 2 | 70000 | 2000-02-30 |
| 10 | 10 | Bohr | denmark | physics | 77 | 3 | 105000 | 1962-11-18 |
Explanation
This dataset intentionally includes multiple real-world issues: missing values (.), inconsistent casing (india vs INDIA), invalid values (a negative age, an age of 150), duplicate records (Einstein), and impossible dates (Feb 30). These are typical of raw datasets extracted from multiple sources such as CRFs or legacy systems. The goal here is not perfection but simulation of real-world messy data pipelines. SAS DATALINES allows quick prototyping of datasets, which is useful in interviews and learning scenarios. The variables cover demographics and professional metrics, making the dataset suitable for transformation exercises.
R Code – Equivalent Raw Dataset
scientists_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Name = c("Einstein","curie","Newton","Tesla","Darwin","Einstein",
"Raman","kalam","NULL","Bohr"),
Country = c("germany","france","UK","NULL","uk","germany","INDIA",
"india","usa","denmark"),
Field = c("Physics","chemistry","Physics","engineering","biology",
"Physics","Physics","aerospace","math","physics"),
Age = c(76,-45,84,86,73,76,82,150,60,77),
Awards = c(5,2,NA,3,4,5,3,6,2,3),
Salary = c(100000,90000,120000,110000,NA,100000,80000,95000,70000,
105000),
Join_Date = c("1955-04-18","1934-07-04","1727-03-31","1943-01-07",
"1882-04-19","1955-04-18","1970-11-21","2015-07-27",
"2000-02-30","1962-11-18")
)
OUTPUT:
| | ID | Name | Country | Field | Age | Awards | Salary | Join_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 2 | 2 | curie | france | chemistry | -45 | 2 | 90000 | 1934-07-04 |
| 3 | 3 | Newton | UK | Physics | 84 | NA | 120000 | 1727-03-31 |
| 4 | 4 | Tesla | NULL | engineering | 86 | 3 | 110000 | 1943-01-07 |
| 5 | 5 | Darwin | uk | biology | 73 | 4 | NA | 1882-04-19 |
| 6 | 6 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 7 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 |
| 8 | 8 | kalam | india | aerospace | 150 | 6 | 95000 | 2015-07-27 |
| 9 | 9 | NULL | usa | math | 60 | 2 | 70000 | 2000-02-30 |
| 10 | 10 | Bohr | denmark | physics | 77 | 3 | 105000 | 1962-11-18 |
Explanation
In R, we use data.frame() to replicate the same flawed dataset. The issues mirror SAS: NA values, inconsistent casing, and invalid entries. Unlike SAS, R represents missing values explicitly as NA. Character variables like dates are initially stored as strings, a common ingestion issue. This dataset will later be transformed using dplyr and base R functions. The objective is cross-platform consistency, a key requirement in modern analytics pipelines where SAS and R coexist.
3. Phase 1: Data Cleaning in SAS
DATA scientists_clean;
  SET scientists_raw;

  /* Fix Name: treat blanks and the literal "NULL" as unknown */
  IF MISSING(Name) OR STRIP(UPCASE(Name)) = "NULL"
    THEN Name = "UNKNOWN";

  /* Fix Country and standardize casing */
  IF MISSING(Country) OR STRIP(UPCASE(Country)) = "NULL"
    THEN Country = "UNKNOWN";
    ELSE Country = UPCASE(STRIP(Country));

  /* Set implausible ages to missing */
  IF Age < 0 OR Age > 120 THEN Age = .;

  /* Impute missing salary with a business-rule default */
  IF MISSING(Salary) THEN Salary = 50000;

  /* Standardize field casing */
  Field = PROPCASE(Field);

  /* Convert the date string; ?? suppresses log errors for invalid dates */
  Join_dt = INPUT(Join_Date, ?? YYMMDD10.);
  FORMAT Join_dt DATE9.;
  IF MISSING(Join_dt) THEN Date_Flag = "INVALID";
  ELSE Date_Flag = "VALID";
RUN;
PROC PRINT DATA = scientists_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 2 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID |
| 6 | 6 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 7 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID |
| 8 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID |
| 9 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID |
| 10 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID |
/* Remove duplicates */
PROC SORT DATA=scientists_clean NODUPKEY;
BY Name;
RUN;
PROC PRINT DATA = scientists_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID |
| 2 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID |
| 3 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 4 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID |
| 5 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID |
| 6 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID |
| 7 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID |
| 8 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID |
| 9 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID |
PROC SORT DATA=scientists_clean NODUPKEY;
BY ID;
RUN;
PROC PRINT DATA = scientists_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 2 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID |
| 6 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID |
| 7 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID |
| 8 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID |
| 9 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID |
Explanation
This SAS step performs structured cleaning. Conditional IF/THEN logic with MISSING, STRIP, and UPCASE replaces and standardizes character values. Invalid ages are set to missing using logical conditions, which is critical in clinical validation. Missing salary is imputed with a business rule (50,000). Dates are converted with INPUT into SAS date values for accurate analysis. Finally, PROC SORT NODUPKEY removes duplicates, first by Name (dropping the repeated Einstein record) and then by ID, ensuring dataset integrity. This mirrors real-world SDTM/ADaM preprocessing where traceability and consistency are mandatory.

The ?? modifier tells the INPUT() function to suppress error messages for invalid conversions. Instead of writing errors to the log, SAS quietly assigns a missing value (.). This is extremely useful in production environments where logs must remain clean. Additionally, we introduce a Date_Flag variable to explicitly track invalid dates, which is critical for auditability and regulatory compliance (FDA/CDISC). Rather than ignoring bad data, we capture and document it, a best practice in clinical programming that ensures both data integrity and traceability.
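To make the cleaning rules tool-independent, here is a minimal sketch of the same logic in plain Python. This is an illustration only, not part of the original SAS pipeline; the function name and dictionary keys are assumptions chosen to mirror the dataset's columns.

```python
from datetime import date

def clean_record(rec):
    """Apply the same rules as the SAS DATA step: NULL -> UNKNOWN,
    out-of-range ages -> missing, missing salary -> 50000, date flagging.
    Assumes rec has Name, Country, Age, Salary, Join_Date keys."""
    out = dict(rec)

    # Treat blanks and the literal "NULL" as unknown
    for field in ("Name", "Country"):
        val = (out.get(field) or "").strip()
        out[field] = "UNKNOWN" if val.upper() in ("", "NULL") else val
    out["Country"] = out["Country"].upper()

    # Implausible ages become missing (None)
    if out["Age"] is not None and not (0 <= out["Age"] <= 120):
        out["Age"] = None

    # Business-rule salary imputation
    if out["Salary"] is None:
        out["Salary"] = 50000

    # Mimic INPUT(Join_Date, ?? YYMMDD10.): invalid dates become missing
    try:
        y, m, d = map(int, out["Join_Date"].split("-"))
        out["Join_dt"] = date(y, m, d)
        out["Date_Flag"] = "VALID"
    except ValueError:  # e.g. 2000-02-30 fails here
        out["Join_dt"] = None
        out["Date_Flag"] = "INVALID"
    return out
```

Running this on the problem row (ID 9) flags the Feb 30 date as INVALID and replaces the NULL name with UNKNOWN, matching the SAS output above.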
4. Phase 2: Data Cleaning in R
library(dplyr)
scientists_clean <- scientists_raw %>%
mutate(
Name = ifelse(is.na(Name) | Name == "NULL", "UNKNOWN", Name),
Country = toupper(trimws(ifelse(Country == "NULL", "UNKNOWN", Country))),
Age = ifelse(Age < 0 | Age > 120, NA, Age),
Salary = ifelse(is.na(Salary), 50000, Salary),
Field = tools::toTitleCase(Field),
Join_Date = as.Date(Join_Date, format="%Y-%m-%d"),
Date_Flag = ifelse(is.na(Join_Date), "INVALID", "VALID")
) %>%
distinct(Name, .keep_all = TRUE)
OUTPUT:
| | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | VALID |
| 2 | 2 | curie | FRANCE | Chemistry | NA | 2 | 90000 | 1934-07-04 | VALID |
| 3 | 3 | Newton | UK | Physics | 84 | NA | 120000 | 1727-03-31 | VALID |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | VALID |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | VALID |
| 6 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | VALID |
| 7 | 8 | kalam | INDIA | Aerospace | NA | 6 | 95000 | 2015-07-27 | VALID |
| 8 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | NA | INVALID |
| 9 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | VALID |
Explanation
In R, dplyr provides a clean and readable pipeline. mutate() transforms columns, while ifelse() handles conditional replacements. String cleaning uses toupper() and trimws(). Invalid ages are replaced with NA, aligning with statistical best practice. Dates are parsed with as.Date(); invalid dates (like Feb 30) become NA automatically, highlighting R's strict parsing. distinct() removes duplicate names, keeping the first occurrence. This pipeline demonstrates a functional programming style, making transformations reproducible and scalable.
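The two behaviours this R pipeline relies on, strict date parsing and keep-first de-duplication, can be illustrated in a few lines of Python (used here as a neutral third language; the helper names are mine, not from the original code):

```python
from datetime import datetime

def parse_strict(s):
    """Return a date for a valid YYYY-MM-DD string, else None
    (analogous to as.Date() returning NA for Feb 30)."""
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        return None

def distinct_by(rows, key):
    """Keep the first row per key value, like
    dplyr::distinct(Name, .keep_all = TRUE)."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out
```

With this, parse_strict("2000-02-30") returns None, and the duplicate Einstein row (ID 6) is dropped because ID 1 was seen first, mirroring the dplyr output above.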
5. Phase 3: Additional SAS Transformations
DATA scientists_enriched;
SET scientists_clean;
/* Derived variable */
LENGTH Category Salary_Band $15.;
Experience = YEAR(TODAY()) - YEAR(Join_dt);
/* Categorization */
IF Awards >=5 THEN Category = "Elite";
ELSE Category = "Standard";
/* Salary band */
IF Salary > 100000 THEN Salary_Band = "High";
ELSE Salary_Band = "Medium";
RUN;
PROC PRINT DATA = scientists_enriched;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag | Category | Salary_Band | Experience |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID | Elite | Medium | 71 |
| 2 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID | Standard | Medium | 92 |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID | Standard | High | 299 |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID | Standard | High | 83 |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID | Standard | Medium | 144 |
| 6 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID | Standard | Medium | 56 |
| 7 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID | Elite | Medium | 11 |
| 8 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID | Standard | Medium | . |
| 9 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID | Standard | High | 64 |
Explanation
This phase moves beyond cleaning into data transformation. Derived variables like Experience simulate ADaM derivations. Categorization based on awards introduces grouping logic, useful for reporting and TLFs. Salary banding supports segmentation analysis. These transformations reflect real-world requirements where raw data is converted into analysis-ready datasets. SAS excels here with efficient row-wise processing and clear syntax for conditional logic.
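The same derivations can be sketched outside SAS; the thresholds below come from the DATA step above, while the function and parameter names are illustrative assumptions:

```python
from datetime import date

def derive(rec, today=None):
    """Add Experience, Category, and Salary_Band using the same
    thresholds as the SAS enrichment step. Assumes rec has
    Join_dt (a date or None), Awards, and Salary keys."""
    today = today or date.today()
    out = dict(rec)
    # Experience: year difference, missing if the join date was invalid
    out["Experience"] = (today.year - rec["Join_dt"].year
                         if rec.get("Join_dt") else None)
    # Missing awards count as below the Elite threshold, as in SAS
    out["Category"] = "Elite" if (rec.get("Awards") or 0) >= 5 else "Standard"
    out["Salary_Band"] = "High" if rec["Salary"] > 100000 else "Medium"
    return out
```

For example, Einstein's row (joined 1955, 5 awards, salary 100000) derives to Elite / Medium, matching the enriched SAS output.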
6. 20 Additional Data Cleaning Best Practices
- Always validate against protocol-defined ranges
- Maintain audit trails for transformations
- Use controlled terminology (CDISC)
- Never overwrite raw data
- Document every derivation
- Validate dates rigorously
- Use PROC COMPARE for QC
- Handle duplicates explicitly
- Ensure consistent encoding
- Use formats for standardization
- Automate checks with macros
- Validate against CRFs
- Apply referential integrity
- Track missingness patterns
- Avoid hardcoding values
- Use metadata-driven programming
- Cross-validate SAS vs R outputs
- Perform outlier detection
- Ensure reproducibility
- Align with regulatory standards
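Two of these practices, PROC COMPARE-style QC and cross-validating SAS vs R outputs, amount to a field-by-field diff of two datasets. A minimal sketch of that idea (an assumed helper, not a SAS feature) looks like this:

```python
def compare_datasets(a, b, key="ID"):
    """Field-by-field diff of two lists of row dicts, keyed by `key`.
    Returns (key, column, value_a, value_b) tuples for each mismatch,
    loosely analogous to what PROC COMPARE reports."""
    b_by_key = {row[key]: row for row in b}
    diffs = []
    for row in a:
        other = b_by_key.get(row[key])
        if other is None:
            # Row exists in a but not in b
            diffs.append((row[key], None, "present", "missing"))
            continue
        for col in row:
            if row[col] != other.get(col):
                diffs.append((row[key], col, row[col], other.get(col)))
    return diffs
```

An empty result means the two cleaned datasets agree, which is the pass condition you would look for when cross-checking the SAS and R pipelines.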
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business and scientific logic. For example, replacing missing salary values ensures models don't fail on null inputs. However, the replacement value must be justified: mean, median, or a domain-specific default.

Similarly, correcting unrealistic values such as age (-45 or 150) is essential because such entries distort statistical summaries. In clinical trials, incorrect age can impact subgroup analysis, leading to flawed conclusions about drug efficacy.

Date imputation is another critical aspect. Missing or invalid dates can disrupt time-to-event analysis. For instance, survival analysis depends heavily on accurate start and end dates.

Normalization ensures comparability. If country names are inconsistent, grouping becomes unreliable, affecting regional analysis.

Ultimately, data cleaning ensures that decisions are based on reality, not noise. In regulated environments, every transformation must be traceable and justified, making cleaning both a technical and a compliance activity.
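As a concrete alternative to the flat 50,000 default used earlier, median imputation can be sketched in a few lines (illustrative only; the function name is an assumption):

```python
from statistics import median

def impute_salary(salaries):
    """Replace missing (None) salaries with the median of observed values.
    Median is often preferred over mean because it resists outliers."""
    observed = [s for s in salaries if s is not None]
    med = median(observed)
    return [med if s is None else s for s in salaries]
```

Whichever rule is chosen, the key point from the section above stands: the imputation value must be justified and documented, not just convenient.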
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Validation is not optional
- Missing data hides insights
- Standardization ensures reproducibility
- Duplicates inflate metrics
- Invalid dates break timelines
- Text inconsistency ruins grouping
- Outliers distort models
- Clean data builds trust
- Automation reduces errors
- Documentation is critical
- Regulatory compliance demands rigor
- Data lineage matters
- QC is a must, not a choice
- Transformation adds analytical value
- Imputation requires logic
- Clean data accelerates insights
- SAS ensures stability
- R ensures flexibility
- Together, they are powerful
9. Summary
Data cleaning and transformation form the backbone of reliable analytics. In this blog, we explored a messy dataset of top scientists and systematically cleaned it using SAS and R.

SAS demonstrated its strength in structured environments. Conditional DATA step logic, PROC SORT NODUPKEY, and robust date handling make it ideal for regulatory workflows such as SDTM and ADaM. Its deterministic behavior ensures reproducibility, which is critical in clinical trials.

R, on the other hand, offered flexibility and concise syntax. Using dplyr, we performed transformations in a pipeline, making the process intuitive and readable. R's strength lies in rapid prototyping and exploratory data analysis.

Both tools handled missing values, invalid entries, duplicates, and standardization effectively. However, their philosophies differ: SAS is control-driven, while R is function-driven.

The key takeaway is that data cleaning is not just preprocessing; it is a strategic step that determines analysis quality. Whether using SAS or R, the focus should be on consistency, validation, and traceability.
10. Conclusion
In any data-driven environment, whether clinical trials, finance, or research, the quality of insights depends entirely on the quality of data. Raw datasets are rarely perfect. They come with inconsistencies, errors, and gaps that must be addressed systematically.

This blog demonstrated how to transform a flawed dataset into a reliable analytical asset using SAS and R. The process involved identifying issues, applying logical corrections, standardizing formats, and deriving meaningful variables.

SAS provided a robust framework for structured cleaning, particularly suited for regulated industries. Its clarity and control make it indispensable in clinical data management. R complemented this with flexibility and efficiency, enabling rapid transformations and exploratory workflows.

The real lesson is not just about tools; it is about discipline in data handling. Every transformation must be justified, documented, and validated. This ensures that downstream analyses are accurate and defensible.

As datasets grow in size and complexity, the importance of automated, scalable cleaning frameworks will only increase. Professionals who master these techniques will not just clean data; they will enable trustworthy decision-making.
11. Interview Questions
1. How do you handle invalid age values in SAS?
Answer: Use conditional logic:
IF Age < 0 OR Age > 120 THEN Age = .;
This enforces realistic data ranges.
2. How do you remove duplicates in SAS?
Answer: Use PROC SORT with the NODUPKEY option:
PROC SORT DATA=ds NODUPKEY; BY ID; RUN;
This removes duplicate records based on the key variables in the BY statement.
3. How does R handle invalid dates?
Answer:
as.Date() converts invalid dates to NA automatically, helping identify issues.
4. Scenario: Salary is missing for 30% of records. What will you do?
Answer: Analyze the distribution, apply median imputation or domain-based defaults, and document the logic.
5. SAS vs R for cleaning?
Answer:
SAS = stable, regulatory
R = flexible, fast
Best practice: use both for validation and robustness.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The scientists dataset and analysis in this article are simulated and created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics