Ancient Legends, Modern Analytics: Enterprise Data Cleaning Frameworks in SAS and R

Transforming a Corrupted Dataset Of “Best Warriors in History” into Enterprise-Grade Analytical Intelligence Using SAS (PROC SQL vs DATA Step) and R

Introduction: When Dirty Data Rewrites History

Imagine a global historical analytics company preparing a documentary called Best Warriors in History. Researchers collected information on famous warriors across civilizations Spartans, Samurai, Vikings, Mongols, Rajputs, Zulus, Roman Legionaries, and others.

A week before executive presentation, analysts discover serious problems:

  • Duplicate Warrior IDs
  • Missing battle dates
  • Negative battle counts
  • Invalid ages
  • Corrupted region codes
  • Mixed text formatting
  • NULL strings
  • Invalid email contacts of researchers
  • Inconsistent warrior categories
  • Impossible years

The result?

Executive dashboards show incorrect rankings.

AI models predict inaccurate warrior influence scores.

Historical trend reports become unreliable.

Management begins questioning the entire analytical process.

This situation is remarkably similar to real clinical trial environments where incorrect patient records can jeopardize regulatory submissions.

Whether analyzing warriors, patients, insurance claims, or financial transactions, data quality determines analytical credibility.

Warriors Dataset Structure

Variable

Description

Warrior_ID

Unique Warrior Identifier

Warrior_Name

Warrior Name

Civilization

Warrior Origin

Region_Code

Geographic Region

Battle_Count

Number of Battles

Age

Age at Peak

Last_Battle_Date

Final Battle Date

Rank_Category

Warrior Classification

Research_Email

Research Contact

1.Raw SAS Dataset with Intentional Errors

data warriors_raw;

length Warrior_ID $8 Warrior_Name $40 Civilization $25

       Region_Code $10 Rank_Category $20 Research_Email $50;

infile datalines dlm='|' dsd truncover;

input Warrior_ID $ Warrior_Name $ Civilization $ Region_Code $

      Battle_Count Age Last_Battle_Date $ Rank_Category $ Research_Email $;

datalines;

W001|alexander|MACEDONIAN|EU|45|32|323BC|Elite|alexander@history.com

W002|genghis khan|mongol|APAC|120|65|1227|Legend|genghis@history.com

W003|spartan leonidas|EUROPE|EU|15|300|480BC|Elite|bademail

W004|samurai musashi|japan|APAC|-25|61|1645|Master|musashi@gmail

W005|viking ragnar|nordic|EUR|50|-10|845|Warrior|ragnar@email.com

W006|NULL|rome|EU|35|45||Elite|romehistory.com

W006|NULL|rome|EU|35|45||Elite|romehistory.com

W007|rajput prithviraj|india|APAC|80|40|1192|KING|rajput@history.com

W008|zulu shaka|africa|AFRICA|70|42|1828|Hero|shaka@@mail.com

W009|white space |persia|ME|65|55|651 | elite |persia@history.com

;

run;

proc print data=warriors_raw;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_Date
1W001alexanderMACEDONIANEUElitealexander@history.com4532323BC
2W002genghis khanmongolAPACLegendgenghis@history.com120651227
3W003spartan leonidasEUROPEEUElitebademail15300480BC
4W004samurai musashijapanAPACMastermusashi@gmail-25611645
5W005viking ragnarnordicEURWarriorragnar@email.com50-10845
6W006NULLromeEUEliteromehistory.com3545 
7W006NULLromeEUEliteromehistory.com3545 
8W007rajput prithvirajindiaAPACKINGrajput@history.com80401192
9W008zulu shakaafricaAFRICAHeroshaka@@mail.com70421828
10W009white spacepersiaMEelitepersia@history.com6555651

Understanding Character Truncation Risk

One of the most overlooked SAS issues is character truncation.

InCorrect:

data demo;

name="Alexander The Great";

length name $10;

run;

proc print data=demo;

run;

LOG:

WARNING: Length of character variable name has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP
to declare the length of a character variable.

OUTPUT:

Obsname
1Alexander The Great

SAS already assigned length before LENGTH executes.

Result:

Alexander

Data gets truncated.

Correct:

data demo;

length name $50;

name="Alexander The Great";

run;

proc print data=demo;

run;

OUTPUT:

Obsname
1Alexander The Great

Why This Matters

Clinical trial datasets frequently contain long investigator names, treatment descriptions, and adverse event terms.

Improper LENGTH placement can silently corrupt data.

Unlike SAS, R dynamically handles character vectors and typically avoids fixed-length truncation issues.

2.Data Cleaning Workflow Using DATA Step

data warriors_clean;

set warriors_raw;

Warrior_Name=propcase(strip(Warrior_Name));

Civilization=upcase(strip(Civilization));

Region_Code=upcase(strip(Region_Code));

Rank_Category=propcase(strip(Rank_Category));

Battle_Count=abs(Battle_Count);

if Age<15 or Age>100 then Age=.;

Research_Email=lowcase(strip(Research_Email));

if find(Research_Email,'@')=0 then Research_Email='';

if Warrior_Name='Null' then Warrior_Name='Unknown';

run;

proc print data=warriors_clean;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_Date
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BC
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227
3W003Spartan LeonidasEUROPEEUElite 15.480BC
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845
6W006UnknownROMEEUElite 3545 
7W006UnknownROMEEUElite 3545 
8W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192
9W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828
10W009White SpacePERSIAMEElitepersia@history.com6555651

Explanation

This DATA Step performs standardization and validation.

Functions demonstrated:

  • PROPCASE
  • STRIP
  • UPCASE
  • LOWCASE
  • FIND
  • ABS

Negative battle counts become positive.

Impossible ages become missing.

Emails are validated.

Text values become consistent.

This mirrors real-world SDTM preparation where raw clinical data must be standardized before analysis.

3.Using SELECT-WHEN Logic

data warriors_rank;

set warriors_clean;

select;

when(Battle_Count>=100) Category='Legend';

when(Battle_Count>=70) Category='Elite';

when(Battle_Count>=40) Category='Veteran';

otherwise Category='Novice';

end;

run;

proc print data=warriors_rank;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_DateCategory
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BCVetera
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227Legend
3W003Spartan LeonidasEUROPEEUElite 15.480BCNovice
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645Novice
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845Vetera
6W006UnknownROMEEUElite 3545 Novice
7W006UnknownROMEEUElite 3545 Novice
8W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192Elite
9W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828Elite
10W009White SpacePERSIAMEElitepersia@history.com6555651Vetera

Explanation

SELECT-WHEN is cleaner than long IF-THEN chains.

Benefits:

  • Easier maintenance
  • Better readability
  • Reduced programming errors

Commonly used in treatment categorization and risk scoring.

4.ARRAY Processing

data warriors_array;

set warriors_clean;

array nums Age Battle_Count;

do i=1 to dim(nums);

if nums[i]<0 then nums[i]=abs(nums[i]);

end;

drop i;

run;

proc print data=warriors_array;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_Date
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BC
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227
3W003Spartan LeonidasEUROPEEUElite 15.480BC
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845
6W006UnknownROMEEUElite 3545 
7W006UnknownROMEEUElite 3545 
8W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192
9W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828
10W009White SpacePERSIAMEElitepersia@history.com6555651

Explanation

ARRAYS help process multiple variables efficiently.

Instead of repeating logic, a loop validates all numeric fields.

This technique is heavily used in laboratory datasets containing hundreds of measurements.

5.RETAIN and FIRST./LAST. Processing

proc sort data=warriors_clean;

by Civilization;

run;

proc print data=warriors_clean;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_Date
1W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828
2W003Spartan LeonidasEUROPEEUElite 15.480BC
3W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645
5W001AlexanderMACEDONIANEUElitealexander@history.com4532323BC
6W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227
7W005Viking RagnarNORDICEURWarriorragnar@email.com50.845
8W009White SpacePERSIAMEElitepersia@history.com6555651
9W006UnknownROMEEUElite 3545 
10W006UnknownROMEEUElite 3545 

data civ_summary;

set warriors_clean;

by Civilization;

retain Total_Battles 0;

Total_Battles+Battle_Count;

if last.Civilization;

run;

proc print data=civ_summary;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_DateTotal_Battles
1W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com7042182870
2W003Spartan LeonidasEUROPEEUElite 15.480BC85
3W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192165
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645190
5W001AlexanderMACEDONIANEUElitealexander@history.com4532323BC235
6W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227355
7W005Viking RagnarNORDICEURWarriorragnar@email.com50.845405
8W009White SpacePERSIAMEElitepersia@history.com6555651470
9W006UnknownROMEEUElite 3545 540

Explanation

RETAIN preserves values across observations.

FIRST./LAST. processing creates group-level summaries.

This technique is essential for patient-level analysis in ADaM datasets.

6.Removing Duplicates

proc sort data=warriors_clean

          out=warriors_nodup nodupkey;

by Warrior_ID;

run;

proc print data=warriors_nodup;

run;

LOG:

NOTE: There were 10 observations read from the data set WORK.WARRIORS_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.WARRIORS_NODUP has 9 observations and 9 variables.

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_Date
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BC
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227
3W003Spartan LeonidasEUROPEEUElite 15.480BC
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845
6W006UnknownROMEEUElite 3545 
7W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192
8W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828
9W009White SpacePERSIAMEElitepersia@history.com6555651

Explanation

Duplicate Warrior_ID values are removed.

Equivalent clinical scenario:

Duplicate patient identifiers causing enrollment inflation.

PROC SORT NODUPKEY remains one of the most efficient SAS deduplication techniques.

7.PROC FORMAT Standardization

proc format;

value agegrp low-30='Young'

                         31-50='Experienced'

                      51-high='Veteran';

run;

LOG:

NOTE: Format AGEGRP has been output.

Explanation

Formats improve reporting consistency.

Instead of coding categories repeatedly, business rules become centralized and reusable.

8.PROC SQL Approach

proc sql;

create table warrior_sql as

select a.*,

       put(a.age,agegrp.) as Age_Group length=15,

       b.Category

from warriors_nodup a

left join warriors_rank b

on a.Warrior_ID=b.Warrior_ID;

quit;

proc print data=warrior_sql;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_DateAge_GroupCategory
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BCExperiencedVetera
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227VeteranLegend
3W003Spartan LeonidasEUROPEEUElite 15.480BC.Novice
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645VeteranNovice
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845.Vetera
6W006UnknownROMEEUElite 3545 ExperiencedNovice
7W006UnknownROMEEUElite 3545 ExperiencedNovice
8W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192ExperiencedElite
9W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828ExperiencedElite
10W009White SpacePERSIAMEElitepersia@history.com6555651VeteranVetera

Explanation

PROC SQL simplifies joins and relational operations.

Advantages:

  • Familiar SQL syntax
  • Flexible joins
  • Database-like processing

Ideal when integrating multiple source systems.

9.MERGE Logic Using DATA Step

proc sort data=warriors_nodup;

by Warrior_ID;

run;

proc print data=warriors_nodup;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_Date
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BC
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227
3W003Spartan LeonidasEUROPEEUElite 15.480BC
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845
6W006UnknownROMEEUElite 3545 
7W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192
8W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828
9W009White SpacePERSIAMEElitepersia@history.com6555651

proc sort data=warriors_rank nodupkey;

by Warrior_ID;

run;

proc print data=warriors_rank;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_DateCategory
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BCVetera
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227Legend
3W003Spartan LeonidasEUROPEEUElite 15.480BCNovice
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645Novice
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845Vetera
6W006UnknownROMEEUElite 3545 Novice
7W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192Elite
8W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828Elite
9W009White SpacePERSIAMEElitepersia@history.com6555651Vetera

data warrior_merge;

merge warriors_nodup(in=a)

      warriors_rank(in=b);

by Warrior_ID;

if a;

run;

proc print data=warrior_merge;

run;

OUTPUT:

ObsWarrior_IDWarrior_NameCivilizationRegion_CodeRank_CategoryResearch_EmailBattle_CountAgeLast_Battle_DateCategory
1W001AlexanderMACEDONIANEUElitealexander@history.com4532323BCVetera
2W002Genghis KhanMONGOLAPACLegendgenghis@history.com120651227Legend
3W003Spartan LeonidasEUROPEEUElite 15.480BCNovice
4W004Samurai MusashiJAPANAPACMastermusashi@gmail25611645Novice
5W005Viking RagnarNORDICEURWarriorragnar@email.com50.845Vetera
6W006UnknownROMEEUElite 3545 Novice
7W007Rajput PrithvirajINDIAAPACKingrajput@history.com80401192Elite
8W008Zulu ShakaAFRICAAFRICAHeroshaka@@mail.com70421828Elite
9W009White SpacePERSIAMEElitepersia@history.com6555651Vetera

Explanation

DATA Step MERGE is often faster for sorted SAS datasets.

Comparison:

Feature

DATA Step

PROC SQL

Speed

Faster on SAS tables

Flexible

Readability

Moderate

High

Complex Joins

Limited

Strong

Clinical Usage

Very Common

Common

10.Advanced Data Profiling

10.1.PROC CONTENTS

proc contents data=warriors_clean;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.WARRIORS_CLEANObservations10
Member TypeDATAVariables9
EngineV9Indexes0
Created06/21/2026 15:25:18Observation Length184
Last Modified06/21/2026 15:25:18Deleted Observations0
Protection CompressedNO
Data Set Type SortedYES
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page711
Obs in First Data Page10
Number of Data Set Repairs0
Filename/saswork/SAS_workD37400005FFD_odaws01-apse1-2.oda.sas.com/SAS_work1C2F00005FFD_odaws01-apse1-2.oda.sas.com/warriors_clean.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number1379846
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLen
8AgeNum8
7Battle_CountNum8
3CivilizationChar25
9Last_Battle_DateChar8
5Rank_CategoryChar20
4Region_CodeChar10
6Research_EmailChar50
1Warrior_IDChar8
2Warrior_NameChar40
Sort Information
SortedbyCivilization
ValidatedYES
Character SetASCII

Provides metadata review.

10.2.PROC FREQ

proc freq data=warriors_clean;

tables Region_Code Rank_Category;

run;

OUTPUT:

The FREQ Procedure

Region_CodeFrequencyPercentCumulative
Frequency
Cumulative
Percent
AFRICA110.00110.00
APAC330.00440.00
EU440.00880.00
EUR110.00990.00
ME110.0010100.00
Rank_CategoryFrequencyPercentCumulative
Frequency
Cumulative
Percent
Elite550.00550.00
Hero110.00660.00
King110.00770.00
Legend110.00880.00
Master110.00990.00
Warrior110.0010100.00

Detects unexpected categories.

10.3.PROC MEANS

proc means data=warriors_clean n nmiss mean min max;

var Age Battle_Count;

run;

OUTPUT:

The MEANS Procedure

VariableNN MissMeanMinimumMaximum
Age
Battle_Count
8
10
2
0
48.1250000
54.0000000
32.0000000
15.0000000
65.0000000
120.0000000

Identifies outliers and missing values.

10.4.PROC SUMMARY

proc summary data=warriors_clean nway;

class Civilization;

var Battle_Count;

output out=summary_stats sum=;

run;

proc print data=summary_stats;

run;

OUTPUT:

ObsCivilization_TYPE__FREQ_Battle_Count
1AFRICA1170
2EUROPE1115
3INDIA1180
4JAPAN1125
5MACEDONIAN1145
6MONGOL11120
7NORDIC1150
8PERSIA1165
9ROME1270

Creates aggregated reports.

10.5.PROC TRANSPOSE

proc transpose data=summary_stats out=transposed;

run;

proc print data=transposed;

run;

OUTPUT:

Obs_NAME_COL1COL2COL3COL4COL5COL6COL7COL8COL9
1_TYPE_111111111
2_FREQ_111111112
3Battle_Count7015802545120506570

Reshapes reporting structures.

10.6.PROC REPORT

proc report data=warriors_clean nowd;

column Warrior_Name Civilization Battle_Count;

run;

OUTPUT:

Warrior_NameCivilizationBattle_Count
Zulu ShakaAFRICA70
Spartan LeonidasEUROPE15
Rajput PrithvirajINDIA80
Samurai MusashiJAPAN25
AlexanderMACEDONIAN45
Genghis KhanMONGOL120
Viking RagnarNORDIC50
White SpacePERSIA65
UnknownROME35
UnknownROME35

Generates professional outputs.

11.Reusable SAS Macro

%macro profile(ds);

proc means data=&ds

n nmiss mean std min max;

run;

proc freq data=&ds;

tables _character_;

run;

%mend;

%profile(warriors_clean);

OUTPUT:

The MEANS Procedure

VariableNN MissMeanStd DevMinimumMaximum
Battle_Count
Age
10
8
0
2
54.0000000
48.1250000
30.8940843
11.2178111
15.0000000
32.0000000
120.0000000
65.0000000

The FREQ Procedure

Warrior_IDFrequencyPercentCumulative
Frequency
Cumulative
Percent
W001110.00110.00
W002110.00220.00
W003110.00330.00
W004110.00440.00
W005110.00550.00
W006220.00770.00
W007110.00880.00
W008110.00990.00
W009110.0010100.00
Warrior_NameFrequencyPercentCumulative
Frequency
Cumulative
Percent
Alexander110.00110.00
Genghis Khan110.00220.00
Rajput Prithviraj110.00330.00
Samurai Musashi110.00440.00
Spartan Leonidas110.00550.00
Unknown220.00770.00
Viking Ragnar110.00880.00
White Space110.00990.00
Zulu Shaka110.0010100.00
CivilizationFrequencyPercentCumulative
Frequency
Cumulative
Percent
AFRICA110.00110.00
EUROPE110.00220.00
INDIA110.00330.00
JAPAN110.00440.00
MACEDONIAN110.00550.00
MONGOL110.00660.00
NORDIC110.00770.00
PERSIA110.00880.00
ROME220.0010100.00
Region_CodeFrequencyPercentCumulative
Frequency
Cumulative
Percent
AFRICA110.00110.00
APAC330.00440.00
EU440.00880.00
EUR110.00990.00
ME110.0010100.00
Rank_CategoryFrequencyPercentCumulative
Frequency
Cumulative
Percent
Elite550.00550.00
Hero110.00660.00
King110.00770.00
Legend110.00880.00
Master110.00990.00
Warrior110.0010100.00
Research_EmailFrequencyPercentCumulative
Frequency
Cumulative
Percent
Frequency Missing = 3
alexander@history.com114.29114.29
genghis@history.com114.29228.57
musashi@gmail114.29342.86
persia@history.com114.29457.14
ragnar@email.com114.29571.43
rajput@history.com114.29685.71
shaka@@mail.com114.297100.00
Last_Battle_DateFrequencyPercentCumulative
Frequency
Cumulative
Percent
Frequency Missing = 2
1192112.50112.50
1227112.50225.00
1645112.50337.50
1828112.50450.00
323BC112.50562.50
480BC112.50675.00
651112.50787.50
845112.508100.00

Explanation

Macros standardize validation.

Benefits:

  • Reusability
  • Consistency
  • Reduced coding effort
  • Faster production deployment

12.R Raw Data

library(readr)

txt <- "

Warrior_ID|Warrior_Name|Civilization|Region_Code|Battle_Count|Age|Last_Battle_Date|Rank_Category|Research_Email

W001|alexander|MACEDONIAN|EU|45|32|323BC|Elite|alexander@history.com

W002|genghis khan|mongol|APAC|120|65|1227|Legend|genghis@history.com

W003|spartan leonidas|EUROPE|EU|15|300|480BC|Elite|bademail

W004|samurai musashi|japan|APAC|-25|61|1645|Master|musashi@gmail

W005|viking ragnar|nordic|EUR|50|-10|845|Warrior|ragnar@email.com

W006|NULL|rome|EU|35|45||Elite|romehistory.com

W006|NULL|rome|EU|35|45||Elite|romehistory.com

W007|rajput prithviraj|india|APAC|80|40|1192|KING|rajput@history.com

W008|zulu shaka|africa|AFRICA|70|42|1828|Hero|shaka@@mail.com

W009|white space |persia|ME|65|55|651 | elite |persia@history.com

"

warriors_raw <- read_delim(

  I(txt),

  delim = "|",

  trim_ws = FALSE,

  show_col_types = FALSE

)

OUTPUT:

Warrior_ID

Warrior_Name

Civilization

Region_Code

Battle_Count

Age

Last_Battle_Date

Rank_Category

Research_Email

W001

alexander

MACEDONIAN

EU

45

32

323BC

Elite

alexander@history.com

W002

genghis khan

mongol

APAC

120

65

1227

Legend

genghis@history.com

W003

spartan leonidas

EUROPE

EU

15

300

480BC

Elite

bademail

W004

samurai musashi

japan

APAC

-25

61

1645

Master

musashi@gmail

W005

viking ragnar

nordic

EUR

50

-10

845

Warrior

ragnar@email.com

W006

NULL

rome

EU

35

45

Elite

romehistory.com

W006

NULL

rome

EU

35

45

Elite

romehistory.com

W007

rajput prithviraj

india

APAC

80

40

1192

KING

rajput@history.com

W008

zulu shaka

africa

AFRICA

70

42

1828

Hero

shaka@@mail.com

W009

white space

persia

ME

65

55

651

 elite

persia@history.com


13.Equivalent R Cleaning Workflow

library(tidyverse)

library(janitor)

library(lubridate)

library(stringr)

warriors_clean <- warriors_raw %>%

  clean_names() %>%

  mutate( warrior_name =str_to_title(str_trim(warrior_name)),

          civilization =str_to_upper(str_trim(civilization)),

           region_code =case_when(region_code %in% c("EUR","EU") ~ "EU",

           region_code=="AFRICA" ~ "AFR",

          TRUE ~ region_code),

          battle_count =abs(battle_count),

          age =if_else(age<15 | age>100,

              NA_real_,age),

          research_email =str_to_lower(research_email),

          research_email =if_else(grepl("@", research_email) &

                                  grepl("\\.com$", research_email),

          research_email,NA_character_)

  ) %>%

  distinct(warrior_id,.keep_all=TRUE)

OUTPUT:

warrior_id

warrior_name

civilization

region_code

battle_count

age

last_battle_date

rank_category

research_email

W001

Alexander

MACEDONIAN

EU

45

32

323BC

Elite

alexander@history.com

W002

Genghis Khan

MONGOL

APAC

120

65

1227

Legend

genghis@history.com

W003

Spartan Leonidas

EUROPE

EU

15

480BC

Elite

W004

Samurai Musashi

JAPAN

APAC

25

61

1645

Master

W005

Viking Ragnar

NORDIC

EU

50

845

Warrior

ragnar@email.com

W006

Null

ROME

EU

35

45

Elite

W007

Rajput Prithviraj

INDIA

APAC

80

40

1192

KING

rajput@history.com

W008

Zulu Shaka

AFRICA

AFR

70

42

1828

Hero

shaka@@mail.com

W009

White Space

PERSIA

ME

65

55

651

 elite

persia@history.com

Explanation

Equivalent SAS → R mappings:

SAS

R

PROPCASE

str_to_title

UPCASE

str_to_upper

LOWCASE

str_to_lower

ABS

abs

FIND

grepl

STRIP

str_trim

PROC SORT NODUPKEY

distinct

IF THEN

if_else

SELECT WHEN

case_when

R offers concise syntax while SAS provides stronger auditability.

Validation & Compliance

Clinical environments require:

  • SDTM compliance
  • ADaM traceability
  • Independent QC
  • Audit trails
  • Reproducibility
  • Validation documentation

A major SAS risk:

if value < 10;

Missing numeric values (.) are considered smaller than any number.

Therefore:

. < 10

returns TRUE.

This can unintentionally classify missing patients as high-risk populations.

Proper logic:

if not missing(value) and value<10;

Failure to apply this rule can cause severe regulatory consequences.

Business Logic Behind Cleaning

Data cleaning is not cosmetic it directly impacts business decisions. Consider patient age. If a clinical trial contains an age of 300 years due to data entry errors, statistical summaries become distorted and age-stratified analyses become unreliable. Similar problems occur in banking when loan applicants show negative income values or impossible employment durations. Missing values are often imputed because many analytical models cannot process blanks effectively. For example, a missing visit date may prevent a patient from being included in efficacy analyses, reducing statistical power. Standardized dates ensure accurate interval calculations using INTCK and INTNX functions. Text normalization prevents duplicate categories such as “elite”, “Elite”, and “ELITE” from appearing as separate groups. Region codes such as EU, EUR, and Europe must be standardized to avoid fragmented reporting. Email validation ensures communication workflows remain operational. Numeric normalization removes impossible negative billing amounts and incorrect transaction values. Proper business rules create consistency across dashboards, machine learning systems, regulatory submissions, executive reports, and operational analytics. Ultimately, every cleaning rule must be justified, documented, reproducible, and traceable to business requirements so stakeholders can trust analytical outcomes.

20 Data Cleaning Best Practices

  1. Define metadata before coding.
  2. Standardize variable naming conventions.
  3. Validate incoming source files.
  4. Remove duplicates early.
  5. Check missing-value patterns.
  6. Validate date ranges.
  7. Normalize categorical values.
  8. Use reusable macros.
  9. Maintain audit trails.
  10. Implement independent QC.
  11. Validate joins carefully.
  12. Document derivations.
  13. Track data lineage.
  14. Use version control.
  15. Centralize business rules.
  16. Validate email structures.
  17. Check impossible numeric ranges.
  18. Review PROC CONTENTS outputs.
  19. Create automated validation reports.
  20. Perform production sign-off before deployment.

20 One-Line Insights

  1. Dirty data creates expensive business mistakes.
  2. Validation logic is stronger than visual inspection.
  3. Standardized variables improve reproducibility.
  4. Duplicate records distort analytics.
  5. Missing dates damage timelines.
  6. Auditability builds trust.
  7. Metadata drives consistency.
  8. QC is not optional.
  9. Automation reduces human error.
  10. Macros improve scalability.
  11. SQL excels at integration.
  12. DATA Step excels at transformation.
  13. Traceability supports compliance.
  14. Documentation prevents confusion.
  15. Clean inputs create reliable outputs.
  16. Normalized text improves reporting.
  17. Dates deserve special attention.
  18. Missing values require explicit handling.
  19. Enterprise analytics depends on governance.
  20. Quality data enables quality decisions.

Small Validation Checklist

Check

Status

Duplicate IDs Removed

Missing Values Reviewed

Date Validation Completed

Text Standardized

Numeric Ranges Verified

Metadata Reviewed

QC Completed

Reporting Validated

SAS vs R Cleaning Summary

SAS and R are both powerful platforms for enterprise-grade data cleaning, but they excel in different areas. SAS is renowned for stability, auditability, regulatory acceptance, and large-scale batch processing. In regulated industries such as pharmaceuticals, SAS remains dominant because every transformation can be documented, validated, and traced. DATA Step programming provides exceptional control over row-level processing, while PROC SQL offers powerful relational capabilities. Features such as FIRST./LAST. processing, RETAIN statements, PROC FORMAT, and metadata-driven macros make SAS highly suitable for production environments.

R, meanwhile, offers unmatched flexibility and modern data manipulation capabilities. Packages such as tidyverse, dplyr, stringr, lubridate, janitor, and purrr simplify complex transformations using concise syntax. R excels in exploratory analysis, advanced visualization, machine learning, and rapid prototyping. Functions like mutate(), across(), case_when(), and coalesce() enable highly readable workflows that often require fewer lines of code than SAS.

From a scalability perspective, SAS performs exceptionally well on enterprise infrastructure and validated environments. R provides extensibility and innovation through thousands of open-source packages. Many organizations increasingly adopt hybrid architectures where SAS performs regulatory-compliant clinical processing and R supports exploratory analytics and advanced modeling.

The strongest strategy is not SAS versus R it is SAS and R together. SAS delivers governance, reproducibility, and audit readiness, while R delivers agility, flexibility, and modern analytical capabilities. Combining both technologies creates reliable, scalable, and analytically powerful ecosystems capable of transforming messy operational data into trusted business intelligence.

Conclusion

The journey from corrupted operational data to trusted analytical intelligence is rarely straightforward. Whether working with clinical trial patients, insurance claims, banking transactions, retail purchases, or a historical warriors dataset, poor-quality data introduces substantial risks. Duplicate identifiers can inflate counts, invalid dates can break timelines, malformed categories can fragment reporting, and missing values can quietly distort statistical conclusions. In regulated industries, these problems are not merely technical inconveniences they can affect regulatory submissions, patient safety evaluations, financial decisions, and executive strategy.

A structured data-cleaning framework is therefore a foundational requirement rather than an optional enhancement. SAS provides a highly controlled environment for enterprise transformation through DATA Step programming, PROC SQL, PROC FORMAT, PROC REPORT, validation macros, metadata management, and auditable workflows. Techniques such as ARRAY processing, FIRST./LAST. logic, RETAIN statements, MERGE operations, and automated QC checks help ensure consistency and traceability across large production systems. At the same time, R complements these strengths through modern packages that accelerate cleaning, exploration, visualization, and advanced analytics.

The most successful analytics organizations establish repeatable standards: validated metadata, standardized macros, documented business rules, automated quality checks, independent QC reviews, and strong governance controls. These practices transform raw data into a reliable strategic asset. When implemented correctly, SAS and R together form a powerful ecosystem capable of producing trustworthy dashboards, reproducible analyses, compliant regulatory outputs, and scalable business intelligence. Clean data is not simply a technical objective it is the foundation upon which every credible analytical decision is built.

Interview Questions & Answers

1. A patient appears twice in an SDTM DM dataset. How would you identify it?

Answer: Use PROC SORT NODUPKEY, PROC FREQ, or PROC SQL GROUP BY HAVING COUNT(*)>1. Investigate source records and document corrective action.

2. Why is missing numeric handling dangerous in SAS?

Answer: Missing values are treated as smaller than valid numbers. Conditions like if value<10; may incorrectly include missing observations.

3. When would you choose PROC SQL over DATA Step?

Answer: PROC SQL is preferred for complex joins, aggregation, and relational processing. DATA Step is often preferred for row-level transformations and performance.

4. How would you validate email fields in SAS?

Answer: Use FIND, INDEX, VERIFY, SCAN, and regular-expression functions to confirm valid structures and identify malformed values.

5. How do you ensure R and SAS outputs match?

Answer: Apply identical business rules, compare frequencies, summaries, record counts, missing-value counts, and perform independent QC reconciliation.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about WARRIORS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS