When Brilliant Data Goes Bad: Mastering Transformation with SAS & R

Brilliant Minds, Broken Data: Transforming a “Top Scientists in the World” Dataset into Analytical Gold with SAS & R

1. Introduction

Imagine you're working on a clinical trial dataset for a global oncology study. The stakes are high: a regulatory submission depends on your analysis. But when you open the dataset, chaos greets you:

  • Patient ages include -5 and 250
  • Dates like 2023-02-30
  • Missing treatment groups labeled as "NULL", " " and NA
  • Duplicate patient IDs
  • Country names like "india", "INDIA", "Ind"

This is not just messy; it's dangerous.

Bad data doesn’t just slow you down; it destroys trust in analytics. In clinical trials, it can lead to incorrect efficacy conclusions, regulatory rejection, or even patient safety risks.

This is where SAS and R shine. SAS dominates regulated environments (CDISC, SDTM, ADaM), while R provides flexibility and speed. Together, they form a powerful toolkit for data transformation and cleaning.

In this blog, we'll simulate a messy dataset of Top Scientists in the World, intentionally inject errors, and then systematically clean and transform it using SAS and R, just like in real-world industry scenarios.

2. Raw Data Creation in SAS and R

SAS Code (Raw Dataset with Intentional Errors)

DATA scientists_raw;
INPUT ID Name $ Country $ Field :$12. Age Awards
      Salary Join_Date :$12.;
DATALINES;
1 Einstein germany Physics 76 5 100000 1955-04-18
2 curie france chemistry -45 2 90000 1934-07-04
3 Newton UK Physics 84 . 120000 1727-03-31
4 Tesla NULL engineering 86 3 110000 1943-01-07
5 Darwin uk biology 73 4 . 1882-04-19
6 Einstein germany Physics 76 5 100000 1955-04-18
7 Raman INDIA Physics 82 3 80000 1970-11-21
8 kalam india aerospace 150 6 95000 2015-07-27
9 NULL usa math 60 2 70000 2000-02-30
10 Bohr denmark physics 77 3 105000 1962-11-18
;
RUN;

PROC PRINT DATA = scientists_raw;
RUN;

OUTPUT:

Obs  ID  Name      Country  Field        Age  Awards  Salary  Join_Date
  1   1  Einstein  germany  Physics       76       5  100000  1955-04-18
  2   2  curie     france   chemistry    -45       2   90000  1934-07-04
  3   3  Newton    UK       Physics       84       .  120000  1727-03-31
  4   4  Tesla     NULL     engineering   86       3  110000  1943-01-07
  5   5  Darwin    uk       biology       73       4       .  1882-04-19
  6   6  Einstein  germany  Physics       76       5  100000  1955-04-18
  7   7  Raman     INDIA    Physics       82       3   80000  1970-11-21
  8   8  kalam     india    aerospace    150       6   95000  2015-07-27
  9   9  NULL      usa      math          60       2   70000  2000-02-30
 10  10  Bohr      denmark  physics       77       3  105000  1962-11-18

Explanation

This dataset intentionally includes multiple real-world issues: missing values (.), inconsistent casing (india vs INDIA), invalid values (negative age, age 150), duplicate records (Einstein), and incorrect dates (Feb 30). These are typical in raw datasets extracted from multiple sources such as CRFs or legacy systems. The goal here is not perfection but simulation of real-world messy data pipelines. SAS DATALINES allows quick prototyping of datasets, useful in interviews and learning scenarios. The variables include demographics and professional metrics, making it suitable for transformation exercises.

R Code – Equivalent Raw Dataset

scientists_raw <- data.frame(
  ID = c(1,2,3,4,5,6,7,8,9,10),
  Name = c("Einstein","curie","Newton","Tesla","Darwin","Einstein",
           "Raman","kalam","NULL","Bohr"),
  Country = c("germany","france","UK","NULL","uk","germany","INDIA",
              "india","usa","denmark"),
  Field = c("Physics","chemistry","Physics","engineering","biology",
            "Physics","Physics","aerospace","math","physics"),
  Age = c(76,-45,84,86,73,76,82,150,60,77),
  Awards = c(5,2,NA,3,4,5,3,6,2,3),
  Salary = c(100000,90000,120000,110000,NA,100000,80000,95000,70000,105000),
  Join_Date = c("1955-04-18","1934-07-04","1727-03-31","1943-01-07",
                "1882-04-19","1955-04-18","1970-11-21","2015-07-27",
                "2000-02-30","1962-11-18")
)

OUTPUT:

   ID     Name Country       Field Age Awards Salary  Join_Date
1   1 Einstein germany     Physics  76      5 100000 1955-04-18
2   2    curie  france   chemistry -45      2  90000 1934-07-04
3   3   Newton      UK     Physics  84     NA 120000 1727-03-31
4   4    Tesla    NULL engineering  86      3 110000 1943-01-07
5   5   Darwin      uk     biology  73      4     NA 1882-04-19
6   6 Einstein germany     Physics  76      5 100000 1955-04-18
7   7    Raman   INDIA     Physics  82      3  80000 1970-11-21
8   8    kalam   india   aerospace 150      6  95000 2015-07-27
9   9     NULL     usa        math  60      2  70000 2000-02-30
10 10     Bohr denmark     physics  77      3 105000 1962-11-18

Explanation

In R, we use data.frame() to replicate the same flawed dataset. The issues mirror SAS: NA values, inconsistent casing, and invalid entries. Unlike SAS, R treats missing values explicitly as NA. Character variables like dates are initially stored as strings, which is a common ingestion issue. This dataset will later require transformation using dplyr and base R functions. The objective here is to ensure cross-platform consistency, a key requirement in modern analytics pipelines where SAS and R coexist.

3. Phase 1: Data Cleaning in SAS

DATA scientists_clean;
SET scientists_raw;

/* Fix Name */
IF MISSING(Name) OR STRIP(UPCASE(Name)) = "NULL" THEN Name = "UNKNOWN";

/* Fix Country */
IF MISSING(Country) OR STRIP(UPCASE(Country)) = "NULL" THEN Country = "UNKNOWN";
ELSE Country = UPCASE(STRIP(Country));

/* Fix invalid age */
IF Age < 0 OR Age > 120 THEN Age = .;

/* Replace missing salary */
IF Salary = . THEN Salary = 50000;

/* Standardize field */
Field = PROPCASE(Field);

/* Fix date (?? suppresses log errors for unreadable values) */
Join_dt = INPUT(Join_Date, ?? YYMMDD10.);
FORMAT Join_dt DATE9.;
IF Join_dt = . THEN Date_Flag = "INVALID";
ELSE Date_Flag = "VALID";

RUN;

PROC PRINT DATA = scientists_clean;
RUN;

OUTPUT:

Obs  ID  Name      Country  Field        Age  Awards  Salary  Join_Date   Join_dt    Date_Flag
  1   1  Einstein  GERMANY  Physics       76       5  100000  1955-04-18  18APR1955  VALID
  2   2  curie     FRANCE   Chemistry      .       2   90000  1934-07-04  04JUL1934  VALID
  3   3  Newton    UK       Physics       84       .  120000  1727-03-31  31MAR1727  VALID
  4   4  Tesla     UNKNOWN  Engineering   86       3  110000  1943-01-07  07JAN1943  VALID
  5   5  Darwin    UK       Biology       73       4   50000  1882-04-19  19APR1882  VALID
  6   6  Einstein  GERMANY  Physics       76       5  100000  1955-04-18  18APR1955  VALID
  7   7  Raman     INDIA    Physics       82       3   80000  1970-11-21  21NOV1970  VALID
  8   8  kalam     INDIA    Aerospace      .       6   95000  2015-07-27  27JUL2015  VALID
  9   9  UNKNOWN   USA      Math          60       2   70000  2000-02-30  .          INVALID
 10  10  Bohr      DENMARK  Physics       77       3  105000  1962-11-18  18NOV1962  VALID

/* Remove duplicates */
PROC SORT DATA=scientists_clean NODUPKEY;
BY Name;
RUN;

PROC PRINT DATA = scientists_clean;
RUN;

OUTPUT:

Obs  ID  Name      Country  Field        Age  Awards  Salary  Join_Date   Join_dt    Date_Flag
  1  10  Bohr      DENMARK  Physics       77       3  105000  1962-11-18  18NOV1962  VALID
  2   5  Darwin    UK       Biology       73       4   50000  1882-04-19  19APR1882  VALID
  3   1  Einstein  GERMANY  Physics       76       5  100000  1955-04-18  18APR1955  VALID
  4   3  Newton    UK       Physics       84       .  120000  1727-03-31  31MAR1727  VALID
  5   7  Raman     INDIA    Physics       82       3   80000  1970-11-21  21NOV1970  VALID
  6   4  Tesla     UNKNOWN  Engineering   86       3  110000  1943-01-07  07JAN1943  VALID
  7   9  UNKNOWN   USA      Math          60       2   70000  2000-02-30  .          INVALID
  8   2  curie     FRANCE   Chemistry      .       2   90000  1934-07-04  04JUL1934  VALID
  9   8  kalam     INDIA    Aerospace      .       6   95000  2015-07-27  27JUL2015  VALID

PROC SORT DATA=scientists_clean NODUPKEY;
BY ID;
RUN;

PROC PRINT DATA = scientists_clean;
RUN;

OUTPUT:

Obs  ID  Name      Country  Field        Age  Awards  Salary  Join_Date   Join_dt    Date_Flag
  1   1  Einstein  GERMANY  Physics       76       5  100000  1955-04-18  18APR1955  VALID
  2   2  curie     FRANCE   Chemistry      .       2   90000  1934-07-04  04JUL1934  VALID
  3   3  Newton    UK       Physics       84       .  120000  1727-03-31  31MAR1727  VALID
  4   4  Tesla     UNKNOWN  Engineering   86       3  110000  1943-01-07  07JAN1943  VALID
  5   5  Darwin    UK       Biology       73       4   50000  1882-04-19  19APR1882  VALID
  6   7  Raman     INDIA    Physics       82       3   80000  1970-11-21  21NOV1970  VALID
  7   8  kalam     INDIA    Aerospace      .       6   95000  2015-07-27  27JUL2015  VALID
  8   9  UNKNOWN   USA      Math          60       2   70000  2000-02-30  .          INVALID
  9  10  Bohr      DENMARK  Physics       77       3  105000  1962-11-18  18NOV1962  VALID

Explanation

This SAS step performs structured cleaning. Conditional IF/THEN logic with MISSING(), STRIP(), and UPCASE() replaces "NULL" placeholders and standardizes text fields. Invalid ages are set to missing using logical conditions, which is critical in clinical validation. Missing salary is imputed with a business rule (50,000). Dates are converted with INPUT() into SAS date values for accurate analysis. Finally, PROC SORT NODUPKEY removes duplicates, first by Name and then by ID, ensuring dataset integrity. This mirrors real-world SDTM/ADaM preprocessing, where traceability and consistency are mandatory. The ?? modifier tells the INPUT() function to suppress error messages for invalid conversions: instead of throwing warnings in the log, SAS quietly assigns a missing value (.), which is extremely useful in production environments where logs must remain clean. Additionally, we introduce a Date_Flag variable to explicitly track invalid dates; this is critical for auditability and regulatory compliance (FDA/CDISC). Rather than ignoring bad data, we capture and document it, a best practice in clinical programming that ensures both data integrity and traceability.

4. Phase 2: Data Cleaning in R

library(dplyr)

scientists_clean <- scientists_raw %>%
  mutate(
    Name = ifelse(is.na(Name) | Name == "NULL", "UNKNOWN", Name),
    Country = toupper(trimws(ifelse(Country == "NULL", "UNKNOWN", Country))),
    Age = ifelse(Age < 0 | Age > 120, NA, Age),
    Salary = ifelse(is.na(Salary), 50000, Salary),
    Field = tools::toTitleCase(Field),
    Join_Date = as.Date(Join_Date, format = "%Y-%m-%d"),
    Date_Flag = ifelse(is.na(Join_Date), "INVALID", "VALID")
  ) %>%
  distinct(Name, .keep_all = TRUE)

OUTPUT:

  ID     Name Country       Field Age Awards Salary  Join_Date Date_Flag
1  1 Einstein GERMANY     Physics  76      5 100000 1955-04-18     VALID
2  2    curie  FRANCE   Chemistry  NA      2  90000 1934-07-04     VALID
3  3   Newton      UK     Physics  84     NA 120000 1727-03-31     VALID
4  4    Tesla UNKNOWN Engineering  86      3 110000 1943-01-07     VALID
5  5   Darwin      UK     Biology  73      4  50000 1882-04-19     VALID
6  7    Raman   INDIA     Physics  82      3  80000 1970-11-21     VALID
7  8    kalam   INDIA   Aerospace  NA      6  95000 2015-07-27     VALID
8  9  UNKNOWN     USA        Math  60      2  70000       <NA>   INVALID
9 10     Bohr DENMARK     Physics  77      3 105000 1962-11-18     VALID

Explanation

In R, dplyr provides a clean and readable pipeline. mutate() transforms columns, while ifelse() handles conditional replacements. String cleaning uses toupper() and trimws(). Invalid ages are replaced with NA, aligning with statistical best practices. Dates are parsed using as.Date(); invalid dates (like Feb 30) become NA automatically, highlighting R's strict parsing. distinct() removes duplicates. This pipeline demonstrates a functional programming style, making transformations reproducible and scalable.
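R's strict date parsing is easy to verify on its own; a quick sketch using only base R:

```r
# An impossible calendar date parses to NA instead of raising an error
bad  <- as.Date("2000-02-30", format = "%Y-%m-%d")
good <- as.Date("1962-11-18", format = "%Y-%m-%d")

is.na(bad)   # TRUE: Feb 30 does not exist
is.na(good)  # FALSE: a valid date parses normally
```

This is why the Date_Flag derivation only needs is.na(Join_Date): the parse step itself marks every invalid date.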

5. Phase 3: Additional SAS Transformations

DATA scientists_enriched;
SET scientists_clean;

LENGTH Category Salary_Band $15.;

/* Derived variable */
Experience = YEAR(TODAY()) - YEAR(Join_dt);

/* Categorization */
IF Awards >= 5 THEN Category = "Elite";
ELSE Category = "Standard";

/* Salary band */
IF Salary > 100000 THEN Salary_Band = "High";
ELSE Salary_Band = "Medium";

RUN;

PROC PRINT DATA = scientists_enriched;
RUN;

OUTPUT:

Obs  ID  Name      Country  Field        Age  Awards  Salary  Join_Date   Join_dt    Date_Flag  Category  Salary_Band  Experience
  1   1  Einstein  GERMANY  Physics       76       5  100000  1955-04-18  18APR1955  VALID      Elite     Medium               71
  2   2  curie     FRANCE   Chemistry      .       2   90000  1934-07-04  04JUL1934  VALID      Standard  Medium               92
  3   3  Newton    UK       Physics       84       .  120000  1727-03-31  31MAR1727  VALID      Standard  High                299
  4   4  Tesla     UNKNOWN  Engineering   86       3  110000  1943-01-07  07JAN1943  VALID      Standard  High                 83
  5   5  Darwin    UK       Biology       73       4   50000  1882-04-19  19APR1882  VALID      Standard  Medium              144
  6   7  Raman     INDIA    Physics       82       3   80000  1970-11-21  21NOV1970  VALID      Standard  Medium               56
  7   8  kalam     INDIA    Aerospace      .       6   95000  2015-07-27  27JUL2015  VALID      Elite     Medium               11
  8   9  UNKNOWN   USA      Math          60       2   70000  2000-02-30  .          INVALID    Standard  Medium                .
  9  10  Bohr      DENMARK  Physics       77       3  105000  1962-11-18  18NOV1962  VALID      Standard  High                 64

Explanation

This phase introduces data transformation beyond cleaning. Derived variables like Experience simulate ADaM derivations. Categorization based on awards introduces grouping logic, useful for reporting and TLFs. Salary banding supports segmentation analysis. These transformations reflect real-world requirements where raw data is converted into analysis-ready datasets. SAS excels here with efficient row-wise processing and clear syntax for conditional logic.
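For cross-platform parity, the same derivations can be sketched in R with dplyr. The two-row frame below is an illustrative stand-in for the cleaned data, not the full dataset:

```r
library(dplyr)

# Illustrative stand-in for the cleaned data (two rows only)
scientists_clean <- data.frame(
  Name      = c("Einstein", "curie"),
  Awards    = c(5, NA),
  Salary    = c(100000, 90000),
  Join_Date = as.Date(c("1955-04-18", "1934-07-04"))
)

scientists_enriched <- scientists_clean %>%
  mutate(
    # Years since joining; an NA join date propagates to NA experience
    Experience  = as.integer(format(Sys.Date(), "%Y")) -
                  as.integer(format(Join_Date, "%Y")),
    # In SAS, a missing Awards value fails "Awards >= 5" and falls to
    # Standard; R needs an explicit !is.na() to reproduce that behavior
    Category    = ifelse(!is.na(Awards) & Awards >= 5, "Elite", "Standard"),
    Salary_Band = ifelse(Salary > 100000, "High", "Medium")
  )
```

The !is.na() guard matters because NA >= 5 evaluates to NA in R, whereas SAS treats missing as smaller than any number.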

6. 20 Additional Data Cleaning Best Practices

  1. Always validate against protocol-defined ranges
  2. Maintain audit trails for transformations
  3. Use controlled terminology (CDISC)
  4. Never overwrite raw data
  5. Document every derivation
  6. Validate dates rigorously
  7. Use PROC COMPARE for QC
  8. Handle duplicates explicitly
  9. Ensure consistent encoding
  10. Use formats for standardization
  11. Automate checks with macros
  12. Validate against CRFs
  13. Apply referential integrity
  14. Track missingness patterns
  15. Avoid hardcoding values
  16. Use metadata-driven programming
  17. Cross-validate SAS vs R outputs
  18. Perform outlier detection
  19. Ensure reproducibility
  20. Align with regulatory standards
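Several of these practices (range validation, duplicate handling, and missingness tracking) can be folded into one reusable check. A minimal R sketch; the function name and limits are illustrative, chosen for this dataset:

```r
# Reports data-quality issues without modifying the data itself
qc_report <- function(df, age_range = c(0, 120)) {
  data.frame(
    n_rows        = nrow(df),
    n_dup_ids     = sum(duplicated(df$ID)),
    n_bad_age     = sum(df$Age < age_range[1] | df$Age > age_range[2],
                        na.rm = TRUE),
    n_missing_sal = sum(is.na(df$Salary))
  )
}

# Quick check on a toy frame with one duplicate ID and two impossible ages
demo <- data.frame(ID = c(1, 2, 2), Age = c(76, -45, 150),
                   Salary = c(100000, NA, 95000))
qc_report(demo)
```

Running the same report before and after cleaning gives a simple, documentable audit trail.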

7. Business Logic Behind Data Cleaning

Data cleaning is not arbitrary; it is driven by business and scientific logic. For example, replacing missing salary values ensures models don't fail due to null inputs. However, the replacement value must be justified: mean, median, or domain-specific defaults.

Similarly, correcting unrealistic values like age (-45 or 150) is essential because such entries distort statistical summaries. In clinical trials, incorrect age can impact subgroup analysis, leading to flawed conclusions about drug efficacy.

Date imputation is another critical aspect. Missing or invalid dates can disrupt time-to-event analysis. For instance, survival analysis depends heavily on accurate start and end dates.

Normalization ensures comparability. If country names are inconsistent, grouping becomes unreliable, affecting regional analysis.

Ultimately, data cleaning ensures that decisions are based on reality, not noise. In regulated environments, every transformation must be traceable and justified, making cleaning both a technical and compliance activity.
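As an alternative to the flat 50,000 default used earlier, median imputation can be sketched in a few lines of R (salary values here are illustrative):

```r
salaries <- c(100000, 90000, 120000, NA, 95000)

# Replace missing salaries with the median of the observed values
med <- median(salaries, na.rm = TRUE)
salaries[is.na(salaries)] <- med

salaries  # the former NA is now 97500, the median of the four observed values
```

Whichever rule is chosen, it should be recorded alongside the data so the imputation remains traceable.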

8. 20 Key Points

  • Dirty data leads to wrong conclusions
  • Validation is not optional
  • Missing data hides insights
  • Standardization ensures reproducibility
  • Duplicates inflate metrics
  • Invalid dates break timelines
  • Text inconsistency ruins grouping
  • Outliers distort models
  • Clean data builds trust
  • Automation reduces errors
  • Documentation is critical
  • Regulatory compliance demands rigor
  • Data lineage matters
  • QC is a must, not a choice
  • Transformation adds analytical value
  • Imputation requires logic
  • Clean data accelerates insights
  • SAS ensures stability
  • R ensures flexibility
  • Together, they are powerful

9. Summary

Data cleaning and transformation form the backbone of reliable analytics. In this blog, we explored a messy dataset of top scientists and systematically cleaned it using SAS and R.

SAS demonstrated its strength in structured environments. Tools like MISSING(), PROC SORT with NODUPKEY, and informat-based date handling make it ideal for regulatory workflows such as SDTM and ADaM. Its deterministic behavior ensures reproducibility, which is critical in clinical trials.

R, on the other hand, offered flexibility and concise syntax. Using dplyr, we performed transformations in a pipeline, making the process intuitive and readable. R’s strength lies in rapid prototyping and exploratory data analysis.

Both tools handled missing values, invalid entries, duplicates, and standardization effectively. However, their philosophies differ: SAS is control-driven, while R is function-driven.

The key takeaway is that data cleaning is not just preprocessing; it is a strategic step that determines analysis quality. Whether using SAS or R, the focus should be on consistency, validation, and traceability.

10. Conclusion

In any data-driven environment, whether clinical trials, finance, or research, the quality of insights depends entirely on the quality of data. Raw datasets are rarely perfect. They come with inconsistencies, errors, and gaps that must be addressed systematically.

This blog demonstrated how to transform a flawed dataset into a reliable analytical asset using SAS and R. The process involved identifying issues, applying logical corrections, standardizing formats, and deriving meaningful variables.

SAS provided a robust framework for structured cleaning, particularly suited for regulated industries. Its clarity and control make it indispensable in clinical data management. R complemented this with flexibility and efficiency, enabling rapid transformations and exploratory workflows.

The real lesson is not just about tools; it's about discipline in data handling. Every transformation must be justified, documented, and validated. This ensures that downstream analyses are accurate and defensible.

As datasets grow in size and complexity, the importance of automated, scalable cleaning frameworks will only increase. Professionals who master these techniques will not just clean data; they will enable trustworthy decision-making.

11. Interview Questions

1. How do you handle invalid age values in SAS?

Answer: Use conditional logic:

IF Age < 0 OR Age > 120 THEN Age = .;

Ensures realistic data ranges.

2. How do you remove duplicates in SAS?

Answer:
Use PROC SORT with the NODUPKEY option and a BY statement:
PROC SORT DATA=ds NODUPKEY; BY ID; RUN;
This removes duplicate records based on the key variables.

3. How does R handle invalid dates?

Answer:
as.Date() converts invalid dates to NA automatically, helping identify issues.

4. Scenario: Salary missing for 30% records what will you do?

Answer:
Analyze distribution, apply median imputation or domain-based defaults, and document logic.

5. SAS vs R for cleaning?

Answer:
SAS = stable, regulatory
R = flexible, fast
Best practice: use both for validation and robustness.

 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. The scientist records used here are illustrative, not biographical facts.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and data quality

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

