When Brilliant Data Goes Bad: Transforming a “Top Scientists in the World” Dataset into Analytical Gold with SAS & R
1. Introduction
Imagine you're working on a clinical trial dataset for a global oncology study. The stakes are high: a regulatory submission depends on your analysis. But when you open the dataset, chaos greets you:
- Patient ages of -5 and 250
- Dates like 2023-02-30
- Missing treatment groups labeled as "NULL", " ", and NA
- Duplicate patient IDs
- Country names like "india", "INDIA", and "Ind"

This is not just messy; it's dangerous. Bad data doesn't just slow you down; it destroys trust in analytics. In clinical trials, it can lead to incorrect efficacy conclusions, regulatory rejection, or even patient safety risks.

This is where SAS and R shine. SAS dominates regulated environments (CDISC, SDTM, ADaM), while R provides flexibility and speed. Together, they form a powerful toolkit for data transformation and cleaning.

In this blog, we'll simulate a messy dataset of Top Scientists in the World, intentionally inject errors, and then systematically clean and transform it using SAS and R, just as in real-world industry scenarios.
2. Raw Data Creation in SAS and R
SAS Code (Raw Dataset with Intentional Errors)
DATA scientists_raw;
INPUT ID Name $ Country $ Field :$12. Age Awards Salary Join_Date :$12.;
DATALINES;
1 Einstein germany Physics 76 5 100000 1955-04-18
2 curie france chemistry -45 2 90000 1934-07-04
3 Newton UK Physics 84 . 120000 1727-03-31
4 Tesla NULL engineering 86 3 110000 1943-01-07
5 Darwin uk biology 73 4 . 1882-04-19
6 Einstein germany Physics 76 5 100000 1955-04-18
7 Raman INDIA Physics 82 3 80000 1970-11-21
8 kalam india aerospace 150 6 95000 2015-07-27
9 NULL usa math 60 2 70000 2000-02-30
10 Bohr denmark physics 77 3 105000 1962-11-18
;
RUN;
PROC PRINT DATA = scientists_raw;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 2 | 2 | curie | france | chemistry | -45 | 2 | 90000 | 1934-07-04 |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 |
| 4 | 4 | Tesla | NULL | engineering | 86 | 3 | 110000 | 1943-01-07 |
| 5 | 5 | Darwin | uk | biology | 73 | 4 | . | 1882-04-19 |
| 6 | 6 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 7 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 |
| 8 | 8 | kalam | india | aerospace | 150 | 6 | 95000 | 2015-07-27 |
| 9 | 9 | NULL | usa | math | 60 | 2 | 70000 | 2000-02-30 |
| 10 | 10 | Bohr | denmark | physics | 77 | 3 | 105000 | 1962-11-18 |
Explanation
This dataset intentionally includes multiple real-world issues: missing values (.), inconsistent casing (india vs INDIA), invalid values (a negative age, an age of 150), duplicate records (Einstein), and impossible dates (Feb 30). These are typical of raw datasets extracted from multiple sources such as CRFs or legacy systems. The goal here is not perfection but simulation of real-world messy data pipelines. SAS DATALINES allows quick prototyping of datasets, which is useful in interviews and learning scenarios. The variables cover demographics and professional metrics, making the dataset suitable for transformation exercises.
R Code – Equivalent Raw Dataset
scientists_raw <- data.frame(
ID = c(1,2,3,4,5,6,7,8,9,10),
Name = c("Einstein","curie","Newton","Tesla","Darwin","Einstein",
"Raman","kalam","NULL","Bohr"),
Country = c("germany","france","UK","NULL","uk","germany","INDIA",
"india","usa","denmark"),
Field = c("Physics","chemistry","Physics","engineering","biology",
"Physics","Physics","aerospace","math","physics"),
Age = c(76,-45,84,86,73,76,82,150,60,77),
Awards = c(5,2,NA,3,4,5,3,6,2,3),
Salary = c(100000,90000,120000,110000,NA,100000,80000,95000,70000,
105000),
Join_Date = c("1955-04-18","1934-07-04","1727-03-31","1943-01-07",
"1882-04-19","1955-04-18","1970-11-21","2015-07-27",
"2000-02-30","1962-11-18")
)
OUTPUT:
| | ID | Name | Country | Field | Age | Awards | Salary | Join_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 2 | 2 | curie | france | chemistry | -45 | 2 | 90000 | 1934-07-04 |
| 3 | 3 | Newton | UK | Physics | 84 | NA | 120000 | 1727-03-31 |
| 4 | 4 | Tesla | NULL | engineering | 86 | 3 | 110000 | 1943-01-07 |
| 5 | 5 | Darwin | uk | biology | 73 | 4 | NA | 1882-04-19 |
| 6 | 6 | Einstein | germany | Physics | 76 | 5 | 100000 | 1955-04-18 |
| 7 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 |
| 8 | 8 | kalam | india | aerospace | 150 | 6 | 95000 | 2015-07-27 |
| 9 | 9 | NULL | usa | math | 60 | 2 | 70000 | 2000-02-30 |
| 10 | 10 | Bohr | denmark | physics | 77 | 3 | 105000 | 1962-11-18 |
Explanation
In R, we use data.frame() to replicate the same flawed dataset. The issues mirror SAS: NA values, inconsistent casing, and invalid entries. Unlike SAS, R represents missing values explicitly as NA. Character variables like dates are initially stored as strings, a common ingestion issue. This dataset will later be transformed using dplyr and base R functions. The objective is cross-platform consistency, a key requirement in modern analytics pipelines where SAS and R coexist.
3. Phase 1: Data Cleaning in SAS
DATA scientists_clean;
  SET scientists_raw;

  /* Fix Name: treat blanks and the literal "NULL" as unknown */
  IF MISSING(Name) OR STRIP(UPCASE(Name)) = "NULL"
    THEN Name = "UNKNOWN";

  /* Fix Country and standardize casing */
  IF MISSING(Country) OR STRIP(UPCASE(Country)) = "NULL"
    THEN Country = "UNKNOWN";
    ELSE Country = UPCASE(STRIP(Country));

  /* Set implausible ages to missing */
  IF Age < 0 OR Age > 120 THEN Age = .;

  /* Impute missing salary with a business-rule default */
  IF MISSING(Salary) THEN Salary = 50000;

  /* Standardize field casing */
  Field = PROPCASE(Field);

  /* Convert the date string; ?? suppresses log errors for invalid dates */
  Join_dt = INPUT(Join_Date, ?? YYMMDD10.);
  FORMAT Join_dt DATE9.;
  IF MISSING(Join_dt) THEN Date_Flag = "INVALID";
  ELSE Date_Flag = "VALID";
RUN;
PROC PRINT DATA = scientists_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 2 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID |
| 6 | 6 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 7 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID |
| 8 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID |
| 9 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID |
| 10 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID |
/* Remove duplicates */
PROC SORT DATA=scientists_clean NODUPKEY;
BY Name;
RUN;
PROC PRINT DATA = scientists_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID |
| 2 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID |
| 3 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 4 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID |
| 5 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID |
| 6 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID |
| 7 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID |
| 8 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID |
| 9 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID |
PROC SORT DATA=scientists_clean NODUPKEY;
BY ID;
RUN;
PROC PRINT DATA = scientists_clean;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID |
| 2 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID |
| 6 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID |
| 7 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID |
| 8 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID |
| 9 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID |
Explanation
This SAS step performs structured cleaning. Conditional IF/THEN logic with MISSING, STRIP, and UPCASE replaces and standardizes character values. Invalid ages are set to missing using logical conditions, which is critical in clinical validation. Missing salary is imputed with a business rule (50,000). Dates are converted with INPUT into SAS date values for accurate analysis. Finally, PROC SORT NODUPKEY removes duplicates, first by Name (dropping the repeated Einstein record) and then by ID, ensuring dataset integrity. This mirrors real-world SDTM/ADaM preprocessing where traceability and consistency are mandatory.

The ?? modifier tells the INPUT() function to suppress error messages for invalid conversions. Instead of writing errors to the log, SAS quietly assigns a missing value (.). This is extremely useful in production environments where logs must remain clean. Additionally, we introduce a Date_Flag variable to explicitly track invalid dates, which is critical for auditability and regulatory compliance (FDA/CDISC). Rather than ignoring bad data, we capture and document it, a best practice in clinical programming that ensures both data integrity and traceability.
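To make the cleaning rules tool-independent, here is a minimal sketch of the same logic in plain Python. This is an illustration only, not part of the original SAS pipeline; the function name and dictionary keys are assumptions chosen to mirror the dataset's columns.

```python
from datetime import date

def clean_record(rec):
    """Apply the same rules as the SAS DATA step: NULL -> UNKNOWN,
    out-of-range ages -> missing, missing salary -> 50000, date flagging.
    Assumes rec has Name, Country, Age, Salary, Join_Date keys."""
    out = dict(rec)

    # Treat blanks and the literal "NULL" as unknown
    for field in ("Name", "Country"):
        val = (out.get(field) or "").strip()
        out[field] = "UNKNOWN" if val.upper() in ("", "NULL") else val
    out["Country"] = out["Country"].upper()

    # Implausible ages become missing (None)
    if out["Age"] is not None and not (0 <= out["Age"] <= 120):
        out["Age"] = None

    # Business-rule salary imputation
    if out["Salary"] is None:
        out["Salary"] = 50000

    # Mimic INPUT(Join_Date, ?? YYMMDD10.): invalid dates become missing
    try:
        y, m, d = map(int, out["Join_Date"].split("-"))
        out["Join_dt"] = date(y, m, d)
        out["Date_Flag"] = "VALID"
    except ValueError:  # e.g. 2000-02-30 fails here
        out["Join_dt"] = None
        out["Date_Flag"] = "INVALID"
    return out
```

Running this on the problem row (ID 9) flags the Feb 30 date as INVALID and replaces the NULL name with UNKNOWN, matching the SAS output above.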
4. Phase 2: Data Cleaning in R
library(dplyr)
scientists_clean <- scientists_raw %>%
mutate(
Name = ifelse(is.na(Name) | Name == "NULL", "UNKNOWN", Name),
Country = toupper(trimws(ifelse(Country == "NULL", "UNKNOWN", Country))),
Age = ifelse(Age < 0 | Age > 120, NA, Age),
Salary = ifelse(is.na(Salary), 50000, Salary),
Field = tools::toTitleCase(Field),
Join_Date = as.Date(Join_Date, format="%Y-%m-%d"),
Date_Flag = ifelse(is.na(Join_Date), "INVALID", "VALID")
) %>%
distinct(Name, .keep_all = TRUE)
OUTPUT:
| | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Date_Flag |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | VALID |
| 2 | 2 | curie | FRANCE | Chemistry | NA | 2 | 90000 | 1934-07-04 | VALID |
| 3 | 3 | Newton | UK | Physics | 84 | NA | 120000 | 1727-03-31 | VALID |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | VALID |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | VALID |
| 6 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | VALID |
| 7 | 8 | kalam | INDIA | Aerospace | NA | 6 | 95000 | 2015-07-27 | VALID |
| 8 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | NA | INVALID |
| 9 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | VALID |
Explanation
In R, dplyr provides a clean and readable pipeline. mutate() transforms columns, while ifelse() handles conditional replacements. String cleaning uses toupper() and trimws(). Invalid ages are replaced with NA, aligning with statistical best practice. Dates are parsed with as.Date(); invalid dates (like Feb 30) become NA automatically, highlighting R's strict parsing. distinct() removes duplicate names, keeping the first occurrence. This pipeline demonstrates a functional programming style, making transformations reproducible and scalable.
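The two behaviours this R pipeline relies on, strict date parsing and keep-first de-duplication, can be illustrated in a few lines of Python (used here as a neutral third language; the helper names are mine, not from the original code):

```python
from datetime import datetime

def parse_strict(s):
    """Return a date for a valid YYYY-MM-DD string, else None
    (analogous to as.Date() returning NA for Feb 30)."""
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        return None

def distinct_by(rows, key):
    """Keep the first row per key value, like
    dplyr::distinct(Name, .keep_all = TRUE)."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out
```

With this, parse_strict("2000-02-30") returns None, and the duplicate Einstein row (ID 6) is dropped because ID 1 was seen first, mirroring the dplyr output above.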
5. Phase 3: Additional SAS Transformations
DATA scientists_enriched;
SET scientists_clean;
/* Derived variable */
LENGTH Category Salary_Band $15.;
Experience = YEAR(TODAY()) - YEAR(Join_dt);
/* Categorization */
IF Awards >=5 THEN Category = "Elite";
ELSE Category = "Standard";
/* Salary band */
IF Salary > 100000 THEN Salary_Band = "High";
ELSE Salary_Band = "Medium";
RUN;
PROC PRINT DATA = scientists_enriched;
RUN;
OUTPUT:
| Obs | ID | Name | Country | Field | Age | Awards | Salary | Join_Date | Join_dt | Date_Flag | Category | Salary_Band | Experience |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Einstein | GERMANY | Physics | 76 | 5 | 100000 | 1955-04-18 | 18APR1955 | VALID | Elite | Medium | 71 |
| 2 | 2 | curie | FRANCE | Chemistry | . | 2 | 90000 | 1934-07-04 | 04JUL1934 | VALID | Standard | Medium | 92 |
| 3 | 3 | Newton | UK | Physics | 84 | . | 120000 | 1727-03-31 | 31MAR1727 | VALID | Standard | High | 299 |
| 4 | 4 | Tesla | UNKNOWN | Engineering | 86 | 3 | 110000 | 1943-01-07 | 07JAN1943 | VALID | Standard | High | 83 |
| 5 | 5 | Darwin | UK | Biology | 73 | 4 | 50000 | 1882-04-19 | 19APR1882 | VALID | Standard | Medium | 144 |
| 6 | 7 | Raman | INDIA | Physics | 82 | 3 | 80000 | 1970-11-21 | 21NOV1970 | VALID | Standard | Medium | 56 |
| 7 | 8 | kalam | INDIA | Aerospace | . | 6 | 95000 | 2015-07-27 | 27JUL2015 | VALID | Elite | Medium | 11 |
| 8 | 9 | UNKNOWN | USA | Math | 60 | 2 | 70000 | 2000-02-30 | . | INVALID | Standard | Medium | . |
| 9 | 10 | Bohr | DENMARK | Physics | 77 | 3 | 105000 | 1962-11-18 | 18NOV1962 | VALID | Standard | High | 64 |
Explanation
This phase moves beyond cleaning into data transformation. Derived variables like Experience simulate ADaM derivations. Categorization based on awards introduces grouping logic, useful for reporting and TLFs. Salary banding supports segmentation analysis. These transformations reflect real-world requirements where raw data is converted into analysis-ready datasets. SAS excels here with efficient row-wise processing and clear syntax for conditional logic.
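The same derivations can be sketched outside SAS; the thresholds below come from the DATA step above, while the function and parameter names are illustrative assumptions:

```python
from datetime import date

def derive(rec, today=None):
    """Add Experience, Category, and Salary_Band using the same
    thresholds as the SAS enrichment step. Assumes rec has
    Join_dt (a date or None), Awards, and Salary keys."""
    today = today or date.today()
    out = dict(rec)
    # Experience: year difference, missing if the join date was invalid
    out["Experience"] = (today.year - rec["Join_dt"].year
                         if rec.get("Join_dt") else None)
    # Missing awards count as below the Elite threshold, as in SAS
    out["Category"] = "Elite" if (rec.get("Awards") or 0) >= 5 else "Standard"
    out["Salary_Band"] = "High" if rec["Salary"] > 100000 else "Medium"
    return out
```

For example, Einstein's row (joined 1955, 5 awards, salary 100000) derives to Elite / Medium, matching the enriched SAS output.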
6. 20 Additional Data Cleaning Best Practices
- Always validate against protocol-defined ranges
- Maintain audit trails for transformations
- Use controlled terminology (CDISC)
- Never overwrite raw data
- Document every derivation
- Validate dates rigorously
- Use PROC COMPARE for QC
- Handle duplicates explicitly
- Ensure consistent encoding
- Use formats for standardization
- Automate checks with macros
- Validate against CRFs
- Apply referential integrity
- Track missingness patterns
- Avoid hardcoding values
- Use metadata-driven programming
- Cross-validate SAS vs R outputs
- Perform outlier detection
- Ensure reproducibility
- Align with regulatory standards
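Two of these practices, PROC COMPARE-style QC and cross-validating SAS vs R outputs, amount to a field-by-field diff of two datasets. A minimal sketch of that idea (an assumed helper, not a SAS feature) looks like this:

```python
def compare_datasets(a, b, key="ID"):
    """Field-by-field diff of two lists of row dicts, keyed by `key`.
    Returns (key, column, value_a, value_b) tuples for each mismatch,
    loosely analogous to what PROC COMPARE reports."""
    b_by_key = {row[key]: row for row in b}
    diffs = []
    for row in a:
        other = b_by_key.get(row[key])
        if other is None:
            # Row exists in a but not in b
            diffs.append((row[key], None, "present", "missing"))
            continue
        for col in row:
            if row[col] != other.get(col):
                diffs.append((row[key], col, row[col], other.get(col)))
    return diffs
```

An empty result means the two cleaned datasets agree, which is the pass condition you would look for when cross-checking the SAS and R pipelines.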
7. Business Logic Behind Data Cleaning
Data cleaning is not arbitrary; it is driven by business and scientific logic. For example, replacing missing salary values ensures models don't fail on null inputs. However, the replacement value must be justified: mean, median, or a domain-specific default.

Similarly, correcting unrealistic values such as age (-45 or 150) is essential because such entries distort statistical summaries. In clinical trials, incorrect age can impact subgroup analysis, leading to flawed conclusions about drug efficacy.

Date imputation is another critical aspect. Missing or invalid dates can disrupt time-to-event analysis. For instance, survival analysis depends heavily on accurate start and end dates.

Normalization ensures comparability. If country names are inconsistent, grouping becomes unreliable, affecting regional analysis.

Ultimately, data cleaning ensures that decisions are based on reality, not noise. In regulated environments, every transformation must be traceable and justified, making cleaning both a technical and a compliance activity.
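As a concrete alternative to the flat 50,000 default used earlier, median imputation can be sketched in a few lines (illustrative only; the function name is an assumption):

```python
from statistics import median

def impute_salary(salaries):
    """Replace missing (None) salaries with the median of observed values.
    Median is often preferred over mean because it resists outliers."""
    observed = [s for s in salaries if s is not None]
    med = median(observed)
    return [med if s is None else s for s in salaries]
```

Whichever rule is chosen, the key point from the section above stands: the imputation value must be justified and documented, not just convenient.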
8. 20 Key Points
- Dirty data leads to wrong conclusions
- Validation is not optional
- Missing data hides insights
- Standardization ensures reproducibility
- Duplicates inflate metrics
- Invalid dates break timelines
- Text inconsistency ruins grouping
- Outliers distort models
- Clean data builds trust
- Automation reduces errors
- Documentation is critical
- Regulatory compliance demands rigor
- Data lineage matters
- QC is a must, not a choice
- Transformation adds analytical value
- Imputation requires logic
- Clean data accelerates insights
- SAS ensures stability
- R ensures flexibility
- Together, they are powerful
9. Summary
Data cleaning and transformation form the backbone of reliable analytics. In this blog, we explored a messy dataset of top scientists and systematically cleaned it using SAS and R.

SAS demonstrated its strength in structured environments. Conditional DATA step logic, PROC SORT NODUPKEY, and robust date handling make it ideal for regulatory workflows such as SDTM and ADaM. Its deterministic behavior ensures reproducibility, which is critical in clinical trials.

R, on the other hand, offered flexibility and concise syntax. Using dplyr, we performed transformations in a pipeline, making the process intuitive and readable. R's strength lies in rapid prototyping and exploratory data analysis.

Both tools handled missing values, invalid entries, duplicates, and standardization effectively. However, their philosophies differ: SAS is control-driven, while R is function-driven.

The key takeaway is that data cleaning is not just preprocessing; it is a strategic step that determines analysis quality. Whether using SAS or R, the focus should be on consistency, validation, and traceability.
10. Conclusion
In any data-driven environment, whether clinical trials, finance, or research, the quality of insights depends entirely on the quality of data. Raw datasets are rarely perfect. They come with inconsistencies, errors, and gaps that must be addressed systematically.

This blog demonstrated how to transform a flawed dataset into a reliable analytical asset using SAS and R. The process involved identifying issues, applying logical corrections, standardizing formats, and deriving meaningful variables.

SAS provided a robust framework for structured cleaning, particularly suited for regulated industries. Its clarity and control make it indispensable in clinical data management. R complemented this with flexibility and efficiency, enabling rapid transformations and exploratory workflows.

The real lesson is not just about tools; it is about discipline in data handling. Every transformation must be justified, documented, and validated. This ensures that downstream analyses are accurate and defensible.

As datasets grow in size and complexity, the importance of automated, scalable cleaning frameworks will only increase. Professionals who master these techniques will not just clean data; they will enable trustworthy decision-making.
11. Interview Questions
1. How do you handle invalid age values in SAS?
Answer: Use conditional logic:
IF Age < 0 OR Age > 120 THEN Age = .;
This enforces realistic data ranges.
2. How do you remove duplicates in SAS?
Answer: Use PROC SORT with the NODUPKEY option:
PROC SORT DATA=ds NODUPKEY; BY ID; RUN;
This removes duplicate records based on the key variables in the BY statement.
3. How does R handle invalid dates?
Answer:
as.Date() converts invalid dates to NA automatically, helping identify issues.
4. Scenario: Salary is missing for 30% of records. What will you do?
Answer: Analyze the distribution, apply median imputation or domain-based defaults, and document the logic.
5. SAS vs R for cleaning?
Answer:
SAS = stable, regulatory
R = flexible, fast
Best practice: use both for validation and robustness.
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The scientists dataset and analysis in this article are simulated and created for educational and demonstration purposes only.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics