Ancient Legends, Modern Analytics: Enterprise Data Cleaning Frameworks in SAS and R
Transforming a Corrupted Dataset Of “Best Warriors in History” into Enterprise-Grade Analytical Intelligence Using SAS (PROC SQL vs DATA Step) and R
Introduction: When Dirty Data Rewrites History
Imagine a
global historical analytics company preparing a documentary called Best
Warriors in History. Researchers collected information on famous warriors
across civilizations Spartans, Samurai, Vikings, Mongols, Rajputs, Zulus, Roman
Legionaries, and others.
A week
before executive presentation, analysts discover serious problems:
- Duplicate Warrior IDs
- Missing battle dates
- Negative battle counts
- Invalid ages
- Corrupted region codes
- Mixed text formatting
- NULL strings
- Invalid email contacts of
researchers
- Inconsistent warrior
categories
- Impossible years
The
result?
Executive
dashboards show incorrect rankings.
AI models
predict inaccurate warrior influence scores.
Historical
trend reports become unreliable.
Management
begins questioning the entire analytical process.
This
situation is remarkably similar to real clinical trial environments where
incorrect patient records can jeopardize regulatory submissions.
Whether
analyzing warriors, patients, insurance claims, or financial transactions, data
quality determines analytical credibility.
Warriors Dataset Structure
|
Variable |
Description |
|
Warrior_ID |
Unique
Warrior Identifier |
|
Warrior_Name |
Warrior
Name |
|
Civilization |
Warrior
Origin |
|
Region_Code |
Geographic
Region |
|
Battle_Count |
Number
of Battles |
|
Age |
Age at
Peak |
|
Last_Battle_Date |
Final
Battle Date |
|
Rank_Category |
Warrior
Classification |
|
Research_Email |
Research
Contact |
1.Raw SAS Dataset with Intentional Errors
data warriors_raw;
length Warrior_ID $8 Warrior_Name $40 Civilization $25
Region_Code $10 Rank_Category $20 Research_Email $50;
infile datalines dlm='|' dsd truncover;
input Warrior_ID $ Warrior_Name $ Civilization $ Region_Code $
Battle_Count Age Last_Battle_Date $ Rank_Category $ Research_Email $;
datalines;
W001|alexander|MACEDONIAN|EU|45|32|323BC|Elite|alexander@history.com
W002|genghis khan|mongol|APAC|120|65|1227|Legend|genghis@history.com
W003|spartan leonidas|EUROPE|EU|15|300|480BC|Elite|bademail
W004|samurai musashi|japan|APAC|-25|61|1645|Master|musashi@gmail
W005|viking ragnar|nordic|EUR|50|-10|845|Warrior|ragnar@email.com
W006|NULL|rome|EU|35|45||Elite|romehistory.com
W006|NULL|rome|EU|35|45||Elite|romehistory.com
W007|rajput prithviraj|india|APAC|80|40|1192|KING|rajput@history.com
W008|zulu shaka|africa|AFRICA|70|42|1828|Hero|shaka@@mail.com
W009|white space |persia|ME|65|55|651 | elite |persia@history.com
;
run;
proc print data=warriors_raw;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC |
| 2 | W002 | genghis khan | mongol | APAC | Legend | genghis@history.com | 120 | 65 | 1227 |
| 3 | W003 | spartan leonidas | EUROPE | EU | Elite | bademail | 15 | 300 | 480BC |
| 4 | W004 | samurai musashi | japan | APAC | Master | musashi@gmail | -25 | 61 | 1645 |
| 5 | W005 | viking ragnar | nordic | EUR | Warrior | ragnar@email.com | 50 | -10 | 845 |
| 6 | W006 | NULL | rome | EU | Elite | romehistory.com | 35 | 45 | |
| 7 | W006 | NULL | rome | EU | Elite | romehistory.com | 35 | 45 | |
| 8 | W007 | rajput prithviraj | india | APAC | KING | rajput@history.com | 80 | 40 | 1192 |
| 9 | W008 | zulu shaka | africa | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 |
| 10 | W009 | white space | persia | ME | elite | persia@history.com | 65 | 55 | 651 |
Understanding Character Truncation Risk
One of
the most overlooked SAS issues is character truncation.
InCorrect:
data demo;
name="Alexander The Great";
length name $10;
run;
proc print data=demo;
run;
LOG:
OUTPUT:
| Obs | name |
|---|---|
| 1 | Alexander The Great |
SAS
already assigned length before LENGTH executes.
Result:
Alexander
Data gets
truncated.
Correct:
data demo;
length name $50;
name="Alexander The Great";
run;
proc print data=demo;
run;
OUTPUT:
| Obs | name |
|---|---|
| 1 | Alexander The Great |
Why This Matters
Clinical
trial datasets frequently contain long investigator names, treatment
descriptions, and adverse event terms.
Improper
LENGTH placement can silently corrupt data.
Unlike
SAS, R dynamically handles character vectors and typically avoids fixed-length truncation
issues.
2.Data Cleaning Workflow Using DATA Step
data warriors_clean;
set warriors_raw;
Warrior_Name=propcase(strip(Warrior_Name));
Civilization=upcase(strip(Civilization));
Region_Code=upcase(strip(Region_Code));
Rank_Category=propcase(strip(Rank_Category));
Battle_Count=abs(Battle_Count);
if Age<15 or Age>100 then Age=.;
Research_Email=lowcase(strip(Research_Email));
if find(Research_Email,'@')=0 then Research_Email='';
if Warrior_Name='Null' then Warrior_Name='Unknown';
run;
proc print data=warriors_clean;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 7 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 8 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 |
| 9 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 |
| 10 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 |
Explanation
This DATA
Step performs standardization and validation.
Functions
demonstrated:
- PROPCASE
- STRIP
- UPCASE
- LOWCASE
- FIND
- ABS
Negative
battle counts become positive.
Impossible
ages become missing.
Emails
are validated.
Text
values become consistent.
This
mirrors real-world SDTM preparation where raw clinical data must be
standardized before analysis.
3.Using SELECT-WHEN Logic
data warriors_rank;
set warriors_clean;
select;
when(Battle_Count>=100) Category='Legend';
when(Battle_Count>=70) Category='Elite';
when(Battle_Count>=40) Category='Veteran';
otherwise Category='Novice';
end;
run;
proc print data=warriors_rank;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC | Vetera |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 | Legend |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | Novice | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 | Novice |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 | Vetera |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | Novice | ||
| 7 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | Novice | ||
| 8 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 | Elite |
| 9 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 | Elite |
| 10 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 | Vetera |
Explanation
SELECT-WHEN
is cleaner than long IF-THEN chains.
Benefits:
- Easier maintenance
- Better readability
- Reduced programming errors
Commonly
used in treatment categorization and risk scoring.
4.ARRAY Processing
data warriors_array;
set warriors_clean;
array nums Age Battle_Count;
do i=1 to dim(nums);
if nums[i]<0 then nums[i]=abs(nums[i]);
end;
drop i;
run;
proc print data=warriors_array;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 7 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 8 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 |
| 9 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 |
| 10 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 |
Explanation
ARRAYS
help process multiple variables efficiently.
Instead
of repeating logic, a loop validates all numeric fields.
This
technique is heavily used in laboratory datasets containing hundreds of
measurements.
5.RETAIN and FIRST./LAST. Processing
proc sort data=warriors_clean;
by Civilization;
run;
proc print data=warriors_clean;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 |
| 2 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | |
| 3 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 |
| 5 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC |
| 6 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 |
| 7 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 |
| 8 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 |
| 9 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 10 | W006 | Unknown | ROME | EU | Elite | 35 | 45 |
data civ_summary;
set warriors_clean;
by Civilization;
retain Total_Battles 0;
Total_Battles+Battle_Count;
if last.Civilization;
run;
proc print data=civ_summary;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date | Total_Battles |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 | 70 |
| 2 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | 85 | |
| 3 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 | 165 |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 | 190 |
| 5 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC | 235 |
| 6 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 | 355 |
| 7 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 | 405 |
| 8 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 | 470 |
| 9 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | 540 |
Explanation
RETAIN
preserves values across observations.
FIRST./LAST.
processing creates group-level summaries.
This
technique is essential for patient-level analysis in ADaM datasets.
6.Removing Duplicates
proc sort data=warriors_clean
out=warriors_nodup nodupkey;
by Warrior_ID;
run;
proc print data=warriors_nodup;
run;
LOG:
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 7 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 |
| 8 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 |
| 9 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 |
Explanation
Duplicate
Warrior_ID values are removed.
Equivalent
clinical scenario:
Duplicate
patient identifiers causing enrollment inflation.
PROC SORT
NODUPKEY remains one of the most efficient SAS deduplication techniques.
7.PROC FORMAT Standardization
proc format;
value agegrp low-30='Young'
31-50='Experienced'
51-high='Veteran';
run;
LOG:
Explanation
Formats
improve reporting consistency.
Instead
of coding categories repeatedly, business rules become centralized and
reusable.
8.PROC SQL Approach
proc sql;
create table warrior_sql as
select a.*,
put(a.age,agegrp.) as Age_Group length=15,
b.Category
from warriors_nodup a
left join warriors_rank b
on a.Warrior_ID=b.Warrior_ID;
quit;
proc print data=warrior_sql;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date | Age_Group | Category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC | Experienced | Vetera |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 | Veteran | Legend |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | . | Novice | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 | Veteran | Novice |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 | . | Vetera |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | Experienced | Novice | ||
| 7 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | Experienced | Novice | ||
| 8 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 | Experienced | Elite |
| 9 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 | Experienced | Elite |
| 10 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 | Veteran | Vetera |
Explanation
PROC SQL
simplifies joins and relational operations.
Advantages:
- Familiar SQL syntax
- Flexible joins
- Database-like processing
Ideal
when integrating multiple source systems.
9.MERGE Logic Using DATA Step
proc sort data=warriors_nodup;
by Warrior_ID;
run;
proc print data=warriors_nodup;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | ||
| 7 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 |
| 8 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 |
| 9 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 |
proc sort data=warriors_rank nodupkey;
by Warrior_ID;
run;
proc print data=warriors_rank;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC | Vetera |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 | Legend |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | Novice | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 | Novice |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 | Vetera |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | Novice | ||
| 7 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 | Elite |
| 8 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 | Elite |
| 9 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 | Vetera |
data warrior_merge;
merge warriors_nodup(in=a)
warriors_rank(in=b);
by Warrior_ID;
if a;
run;
proc print data=warrior_merge;
run;
OUTPUT:
| Obs | Warrior_ID | Warrior_Name | Civilization | Region_Code | Rank_Category | Research_Email | Battle_Count | Age | Last_Battle_Date | Category |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | W001 | Alexander | MACEDONIAN | EU | Elite | alexander@history.com | 45 | 32 | 323BC | Vetera |
| 2 | W002 | Genghis Khan | MONGOL | APAC | Legend | genghis@history.com | 120 | 65 | 1227 | Legend |
| 3 | W003 | Spartan Leonidas | EUROPE | EU | Elite | 15 | . | 480BC | Novice | |
| 4 | W004 | Samurai Musashi | JAPAN | APAC | Master | musashi@gmail | 25 | 61 | 1645 | Novice |
| 5 | W005 | Viking Ragnar | NORDIC | EUR | Warrior | ragnar@email.com | 50 | . | 845 | Vetera |
| 6 | W006 | Unknown | ROME | EU | Elite | 35 | 45 | Novice | ||
| 7 | W007 | Rajput Prithviraj | INDIA | APAC | King | rajput@history.com | 80 | 40 | 1192 | Elite |
| 8 | W008 | Zulu Shaka | AFRICA | AFRICA | Hero | shaka@@mail.com | 70 | 42 | 1828 | Elite |
| 9 | W009 | White Space | PERSIA | ME | Elite | persia@history.com | 65 | 55 | 651 | Vetera |
Explanation
DATA Step
MERGE is often faster for sorted SAS datasets.
Comparison:
|
Feature |
DATA
Step |
PROC SQL |
|
Speed |
Faster
on SAS tables |
Flexible |
|
Readability |
Moderate |
High |
|
Complex
Joins |
Limited |
Strong |
|
Clinical
Usage |
Very
Common |
Common |
10.Advanced Data Profiling
10.1.PROC CONTENTS
proc contents data=warriors_clean;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.WARRIORS_CLEAN | Observations | 10 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 06/21/2026 15:25:18 | Observation Length | 184 |
| Last Modified | 06/21/2026 15:25:18 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | YES | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 711 |
| Obs in First Data Page | 10 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workD37400005FFD_odaws01-apse1-2.oda.sas.com/SAS_work1C2F00005FFD_odaws01-apse1-2.oda.sas.com/warriors_clean.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 1379846 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 8 | Age | Num | 8 |
| 7 | Battle_Count | Num | 8 |
| 3 | Civilization | Char | 25 |
| 9 | Last_Battle_Date | Char | 8 |
| 5 | Rank_Category | Char | 20 |
| 4 | Region_Code | Char | 10 |
| 6 | Research_Email | Char | 50 |
| 1 | Warrior_ID | Char | 8 |
| 2 | Warrior_Name | Char | 40 |
| Sort Information | |
|---|---|
| Sortedby | Civilization |
| Validated | YES |
| Character Set | ASCII |
Provides
metadata review.
10.2.PROC FREQ
proc freq data=warriors_clean;
tables Region_Code Rank_Category;
run;
OUTPUT:
The FREQ Procedure
| Region_Code | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| AFRICA | 1 | 10.00 | 1 | 10.00 |
| APAC | 3 | 30.00 | 4 | 40.00 |
| EU | 4 | 40.00 | 8 | 80.00 |
| EUR | 1 | 10.00 | 9 | 90.00 |
| ME | 1 | 10.00 | 10 | 100.00 |
| Rank_Category | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Elite | 5 | 50.00 | 5 | 50.00 |
| Hero | 1 | 10.00 | 6 | 60.00 |
| King | 1 | 10.00 | 7 | 70.00 |
| Legend | 1 | 10.00 | 8 | 80.00 |
| Master | 1 | 10.00 | 9 | 90.00 |
| Warrior | 1 | 10.00 | 10 | 100.00 |
Detects
unexpected categories.
10.3.PROC MEANS
proc means data=warriors_clean n nmiss mean min max;
var Age Battle_Count;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | N Miss | Mean | Minimum | Maximum |
|---|---|---|---|---|---|
Age Battle_Count | 8 10 | 2 0 | 48.1250000 54.0000000 | 32.0000000 15.0000000 | 65.0000000 120.0000000 |
Identifies
outliers and missing values.
10.4.PROC SUMMARY
proc summary data=warriors_clean nway;
class Civilization;
var Battle_Count;
output out=summary_stats sum=;
run;
proc print data=summary_stats;
run;
OUTPUT:
| Obs | Civilization | _TYPE_ | _FREQ_ | Battle_Count |
|---|---|---|---|---|
| 1 | AFRICA | 1 | 1 | 70 |
| 2 | EUROPE | 1 | 1 | 15 |
| 3 | INDIA | 1 | 1 | 80 |
| 4 | JAPAN | 1 | 1 | 25 |
| 5 | MACEDONIAN | 1 | 1 | 45 |
| 6 | MONGOL | 1 | 1 | 120 |
| 7 | NORDIC | 1 | 1 | 50 |
| 8 | PERSIA | 1 | 1 | 65 |
| 9 | ROME | 1 | 2 | 70 |
Creates
aggregated reports.
10.5.PROC TRANSPOSE
proc transpose data=summary_stats out=transposed;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | _TYPE_ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | _FREQ_ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 3 | Battle_Count | 70 | 15 | 80 | 25 | 45 | 120 | 50 | 65 | 70 |
Reshapes
reporting structures.
10.6.PROC REPORT
proc report data=warriors_clean nowd;
column Warrior_Name Civilization Battle_Count;
run;
OUTPUT:
| Warrior_Name | Civilization | Battle_Count |
|---|---|---|
| Zulu Shaka | AFRICA | 70 |
| Spartan Leonidas | EUROPE | 15 |
| Rajput Prithviraj | INDIA | 80 |
| Samurai Musashi | JAPAN | 25 |
| Alexander | MACEDONIAN | 45 |
| Genghis Khan | MONGOL | 120 |
| Viking Ragnar | NORDIC | 50 |
| White Space | PERSIA | 65 |
| Unknown | ROME | 35 |
| Unknown | ROME | 35 |
Generates
professional outputs.
11.Reusable SAS Macro
%macro profile(ds);
proc means data=&ds
n nmiss mean std min max;
run;
proc freq data=&ds;
tables _character_;
run;
%mend;
%profile(warriors_clean);
The MEANS Procedure
| Variable | N | N Miss | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|---|
Battle_Count Age | 10 8 | 0 2 | 54.0000000 48.1250000 | 30.8940843 11.2178111 | 15.0000000 32.0000000 | 120.0000000 65.0000000 |
The FREQ Procedure
| Warrior_ID | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| W001 | 1 | 10.00 | 1 | 10.00 |
| W002 | 1 | 10.00 | 2 | 20.00 |
| W003 | 1 | 10.00 | 3 | 30.00 |
| W004 | 1 | 10.00 | 4 | 40.00 |
| W005 | 1 | 10.00 | 5 | 50.00 |
| W006 | 2 | 20.00 | 7 | 70.00 |
| W007 | 1 | 10.00 | 8 | 80.00 |
| W008 | 1 | 10.00 | 9 | 90.00 |
| W009 | 1 | 10.00 | 10 | 100.00 |
| Warrior_Name | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Alexander | 1 | 10.00 | 1 | 10.00 |
| Genghis Khan | 1 | 10.00 | 2 | 20.00 |
| Rajput Prithviraj | 1 | 10.00 | 3 | 30.00 |
| Samurai Musashi | 1 | 10.00 | 4 | 40.00 |
| Spartan Leonidas | 1 | 10.00 | 5 | 50.00 |
| Unknown | 2 | 20.00 | 7 | 70.00 |
| Viking Ragnar | 1 | 10.00 | 8 | 80.00 |
| White Space | 1 | 10.00 | 9 | 90.00 |
| Zulu Shaka | 1 | 10.00 | 10 | 100.00 |
| Civilization | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| AFRICA | 1 | 10.00 | 1 | 10.00 |
| EUROPE | 1 | 10.00 | 2 | 20.00 |
| INDIA | 1 | 10.00 | 3 | 30.00 |
| JAPAN | 1 | 10.00 | 4 | 40.00 |
| MACEDONIAN | 1 | 10.00 | 5 | 50.00 |
| MONGOL | 1 | 10.00 | 6 | 60.00 |
| NORDIC | 1 | 10.00 | 7 | 70.00 |
| PERSIA | 1 | 10.00 | 8 | 80.00 |
| ROME | 2 | 20.00 | 10 | 100.00 |
| Region_Code | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| AFRICA | 1 | 10.00 | 1 | 10.00 |
| APAC | 3 | 30.00 | 4 | 40.00 |
| EU | 4 | 40.00 | 8 | 80.00 |
| EUR | 1 | 10.00 | 9 | 90.00 |
| ME | 1 | 10.00 | 10 | 100.00 |
| Rank_Category | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Elite | 5 | 50.00 | 5 | 50.00 |
| Hero | 1 | 10.00 | 6 | 60.00 |
| King | 1 | 10.00 | 7 | 70.00 |
| Legend | 1 | 10.00 | 8 | 80.00 |
| Master | 1 | 10.00 | 9 | 90.00 |
| Warrior | 1 | 10.00 | 10 | 100.00 |
| Research_Email | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Frequency Missing = 3 | ||||
| alexander@history.com | 1 | 14.29 | 1 | 14.29 |
| genghis@history.com | 1 | 14.29 | 2 | 28.57 |
| musashi@gmail | 1 | 14.29 | 3 | 42.86 |
| persia@history.com | 1 | 14.29 | 4 | 57.14 |
| ragnar@email.com | 1 | 14.29 | 5 | 71.43 |
| rajput@history.com | 1 | 14.29 | 6 | 85.71 |
| shaka@@mail.com | 1 | 14.29 | 7 | 100.00 |
| Last_Battle_Date | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Frequency Missing = 2 | ||||
| 1192 | 1 | 12.50 | 1 | 12.50 |
| 1227 | 1 | 12.50 | 2 | 25.00 |
| 1645 | 1 | 12.50 | 3 | 37.50 |
| 1828 | 1 | 12.50 | 4 | 50.00 |
| 323BC | 1 | 12.50 | 5 | 62.50 |
| 480BC | 1 | 12.50 | 6 | 75.00 |
| 651 | 1 | 12.50 | 7 | 87.50 |
| 845 | 1 | 12.50 | 8 | 100.00 |
Explanation
Macros
standardize validation.
Benefits:
- Reusability
- Consistency
- Reduced coding effort
- Faster production deployment
12.R Raw Data
library(readr)
txt <- "
Warrior_ID|Warrior_Name|Civilization|Region_Code|Battle_Count|Age|Last_Battle_Date|Rank_Category|Research_Email
W001|alexander|MACEDONIAN|EU|45|32|323BC|Elite|alexander@history.com
W002|genghis khan|mongol|APAC|120|65|1227|Legend|genghis@history.com
W003|spartan leonidas|EUROPE|EU|15|300|480BC|Elite|bademail
W004|samurai musashi|japan|APAC|-25|61|1645|Master|musashi@gmail
W005|viking ragnar|nordic|EUR|50|-10|845|Warrior|ragnar@email.com
W006|NULL|rome|EU|35|45||Elite|romehistory.com
W006|NULL|rome|EU|35|45||Elite|romehistory.com
W007|rajput prithviraj|india|APAC|80|40|1192|KING|rajput@history.com
W008|zulu shaka|africa|AFRICA|70|42|1828|Hero|shaka@@mail.com
W009|white space |persia|ME|65|55|651 | elite |persia@history.com
"
warriors_raw <- read_delim(
I(txt),
delim = "|",
trim_ws = FALSE,
show_col_types = FALSE
)
OUTPUT:
|
Warrior_ID |
Warrior_Name |
Civilization |
Region_Code |
Battle_Count |
Age |
Last_Battle_Date |
Rank_Category |
Research_Email |
|
W001 |
alexander |
MACEDONIAN |
EU |
45 |
32 |
323BC |
Elite |
alexander@history.com |
|
W002 |
genghis khan |
mongol |
APAC |
120 |
65 |
1227 |
Legend |
genghis@history.com |
|
W003 |
spartan leonidas |
EUROPE |
EU |
15 |
300 |
480BC |
Elite |
bademail |
|
W004 |
samurai musashi |
japan |
APAC |
-25 |
61 |
1645 |
Master |
musashi@gmail |
|
W005 |
viking ragnar |
nordic |
EUR |
50 |
-10 |
845 |
Warrior |
ragnar@email.com |
|
W006 |
NULL |
rome |
EU |
35 |
45 |
Elite |
romehistory.com |
|
|
W006 |
NULL |
rome |
EU |
35 |
45 |
Elite |
romehistory.com |
|
|
W007 |
rajput prithviraj |
india |
APAC |
80 |
40 |
1192 |
KING |
rajput@history.com |
|
W008 |
zulu shaka |
africa |
AFRICA |
70 |
42 |
1828 |
Hero |
shaka@@mail.com |
|
W009 |
white space |
persia |
ME |
65 |
55 |
651 |
elite |
persia@history.com |
13.Equivalent R Cleaning Workflow
library(tidyverse)
library(janitor)
library(lubridate)
library(stringr)
warriors_clean <- warriors_raw %>%
clean_names() %>%
mutate( warrior_name =str_to_title(str_trim(warrior_name)),
civilization =str_to_upper(str_trim(civilization)),
region_code =case_when(region_code %in% c("EUR","EU") ~ "EU",
region_code=="AFRICA" ~ "AFR",
TRUE ~ region_code),
battle_count =abs(battle_count),
age =if_else(age<15 | age>100,
NA_real_,age),
research_email =str_to_lower(research_email),
research_email =if_else(grepl("@", research_email) &
grepl("\\.com$", research_email),
research_email,NA_character_)
) %>%
distinct(warrior_id,.keep_all=TRUE)
OUTPUT:
|
warrior_id |
warrior_name |
civilization |
region_code |
battle_count |
age |
last_battle_date |
rank_category |
research_email |
|
W001 |
Alexander |
MACEDONIAN |
EU |
45 |
32 |
323BC |
Elite |
alexander@history.com |
|
W002 |
Genghis Khan |
MONGOL |
APAC |
120 |
65 |
1227 |
Legend |
genghis@history.com |
|
W003 |
Spartan Leonidas |
EUROPE |
EU |
15 |
480BC |
Elite |
||
|
W004 |
Samurai Musashi |
JAPAN |
APAC |
25 |
61 |
1645 |
Master |
|
|
W005 |
Viking Ragnar |
NORDIC |
EU |
50 |
845 |
Warrior |
ragnar@email.com |
|
|
W006 |
Null |
ROME |
EU |
35 |
45 |
Elite |
||
|
W007 |
Rajput Prithviraj |
INDIA |
APAC |
80 |
40 |
1192 |
KING |
rajput@history.com |
|
W008 |
Zulu Shaka |
AFRICA |
AFR |
70 |
42 |
1828 |
Hero |
shaka@@mail.com |
|
W009 |
White Space |
PERSIA |
ME |
65 |
55 |
651 |
elite |
persia@history.com |
Explanation
Equivalent
SAS → R mappings:
|
SAS |
R |
|
PROPCASE |
str_to_title |
|
UPCASE |
str_to_upper |
|
LOWCASE |
str_to_lower |
|
ABS |
abs |
|
FIND |
grepl |
|
STRIP |
str_trim |
|
PROC
SORT NODUPKEY |
distinct |
|
IF THEN |
if_else |
|
SELECT
WHEN |
case_when |
R offers
concise syntax while SAS provides stronger auditability.
Validation & Compliance
Clinical
environments require:
- SDTM compliance
- ADaM traceability
- Independent QC
- Audit trails
- Reproducibility
- Validation documentation
A major
SAS risk:
if value < 10;
Missing
numeric values (.) are considered smaller than any number.
Therefore:
. < 10
returns
TRUE.
This can
unintentionally classify missing patients as high-risk populations.
Proper
logic:
if not missing(value) and
value<10;
Failure
to apply this rule can cause severe regulatory consequences.
Business Logic Behind
Cleaning
Data
cleaning is not cosmetic it directly impacts business decisions. Consider
patient age. If a clinical trial contains an age of 300 years due to data entry
errors, statistical summaries become distorted and age-stratified analyses
become unreliable. Similar problems occur in banking when loan applicants show
negative income values or impossible employment durations. Missing values are
often imputed because many analytical models cannot process blanks effectively.
For example, a missing visit date may prevent a patient from being included in
efficacy analyses, reducing statistical power. Standardized dates ensure
accurate interval calculations using INTCK and INTNX functions. Text
normalization prevents duplicate categories such as “elite”, “Elite”, and
“ELITE” from appearing as separate groups. Region codes such as EU, EUR, and
Europe must be standardized to avoid fragmented reporting. Email validation
ensures communication workflows remain operational. Numeric normalization
removes impossible negative billing amounts and incorrect transaction values.
Proper business rules create consistency across dashboards, machine learning
systems, regulatory submissions, executive reports, and operational analytics.
Ultimately, every cleaning rule must be justified, documented, reproducible,
and traceable to business requirements so stakeholders can trust analytical
outcomes.
20 Data Cleaning Best
Practices
- Define metadata before
coding.
- Standardize variable naming
conventions.
- Validate incoming source
files.
- Remove duplicates early.
- Check missing-value
patterns.
- Validate date ranges.
- Normalize categorical
values.
- Use reusable macros.
- Maintain audit trails.
- Implement independent QC.
- Validate joins carefully.
- Document derivations.
- Track data lineage.
- Use version control.
- Centralize business rules.
- Validate email structures.
- Check impossible numeric
ranges.
- Review PROC CONTENTS
outputs.
- Create automated validation
reports.
- Perform production sign-off
before deployment.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Duplicate records distort
analytics.
- Missing dates damage
timelines.
- Auditability builds trust.
- Metadata drives consistency.
- QC is not optional.
- Automation reduces human
error.
- Macros improve scalability.
- SQL excels at integration.
- DATA Step excels at
transformation.
- Traceability supports
compliance.
- Documentation prevents
confusion.
- Clean inputs create reliable
outputs.
- Normalized text improves
reporting.
- Dates deserve special
attention.
- Missing values require
explicit handling.
- Enterprise analytics depends
on governance.
- Quality data enables quality
decisions.
Small Validation Checklist
|
Check |
Status |
|
Duplicate
IDs Removed |
✓ |
|
Missing
Values Reviewed |
✓ |
|
Date
Validation Completed |
✓ |
|
Text
Standardized |
✓ |
|
Numeric
Ranges Verified |
✓ |
|
Metadata
Reviewed |
✓ |
|
QC
Completed |
✓ |
|
Reporting
Validated |
✓ |
SAS vs R Cleaning Summary
SAS and R
are both powerful platforms for enterprise-grade data cleaning, but they excel
in different areas. SAS is renowned for stability, auditability, regulatory
acceptance, and large-scale batch processing. In regulated industries such as
pharmaceuticals, SAS remains dominant because every transformation can be
documented, validated, and traced. DATA Step programming provides exceptional
control over row-level processing, while PROC SQL offers powerful relational
capabilities. Features such as FIRST./LAST. processing, RETAIN statements, PROC
FORMAT, and metadata-driven macros make SAS highly suitable for production
environments.
R,
meanwhile, offers unmatched flexibility and modern data manipulation
capabilities. Packages such as tidyverse, dplyr, stringr, lubridate, janitor,
and purrr simplify complex transformations using concise syntax. R excels in
exploratory analysis, advanced visualization, machine learning, and rapid
prototyping. Functions like mutate(), across(), case_when(), and coalesce()
enable highly readable workflows that often require fewer lines of code than
SAS.
From a
scalability perspective, SAS performs exceptionally well on enterprise
infrastructure and validated environments. R provides extensibility and
innovation through thousands of open-source packages. Many organizations
increasingly adopt hybrid architectures where SAS performs regulatory-compliant
clinical processing and R supports exploratory analytics and advanced modeling.
The
strongest strategy is not SAS versus R it is SAS and R together. SAS delivers
governance, reproducibility, and audit readiness, while R delivers agility,
flexibility, and modern analytical capabilities. Combining both technologies
creates reliable, scalable, and analytically powerful ecosystems capable of
transforming messy operational data into trusted business intelligence.
Conclusion
The
journey from corrupted operational data to trusted analytical intelligence is
rarely straightforward. Whether working with clinical trial patients, insurance
claims, banking transactions, retail purchases, or a historical warriors
dataset, poor-quality data introduces substantial risks. Duplicate identifiers
can inflate counts, invalid dates can break timelines, malformed categories can
fragment reporting, and missing values can quietly distort statistical
conclusions. In regulated industries, these problems are not merely technical
inconveniences they can affect regulatory submissions, patient safety
evaluations, financial decisions, and executive strategy.
A
structured data-cleaning framework is therefore a foundational requirement
rather than an optional enhancement. SAS provides a highly controlled
environment for enterprise transformation through DATA Step programming, PROC
SQL, PROC FORMAT, PROC REPORT, validation macros, metadata management, and
auditable workflows. Techniques such as ARRAY processing, FIRST./LAST. logic,
RETAIN statements, MERGE operations, and automated QC checks help ensure
consistency and traceability across large production systems. At the same time,
R complements these strengths through modern packages that accelerate cleaning,
exploration, visualization, and advanced analytics.
The most
successful analytics organizations establish repeatable standards: validated
metadata, standardized macros, documented business rules, automated quality
checks, independent QC reviews, and strong governance controls. These practices
transform raw data into a reliable strategic asset. When implemented correctly,
SAS and R together form a powerful ecosystem capable of producing trustworthy
dashboards, reproducible analyses, compliant regulatory outputs, and scalable
business intelligence. Clean data is not simply a technical objective it is the
foundation upon which every credible analytical decision is built.
Interview Questions &
Answers
1. A patient appears twice in an SDTM DM dataset.
How would you identify it?
Answer: Use PROC SORT NODUPKEY, PROC
FREQ, or PROC SQL GROUP BY HAVING COUNT(*)>1. Investigate source records and
document corrective action.
2. Why is missing numeric handling dangerous in
SAS?
Answer: Missing values are treated as
smaller than valid numbers. Conditions like if value<10; may incorrectly
include missing observations.
3. When would you choose PROC SQL over DATA Step?
Answer: PROC SQL is preferred for
complex joins, aggregation, and relational processing. DATA Step is often
preferred for row-level transformations and performance.
4. How would you validate email fields in SAS?
Answer: Use FIND, INDEX, VERIFY, SCAN,
and regular-expression functions to confirm valid structures and identify
malformed values.
5. How do you ensure R and SAS outputs match?
Answer: Apply identical business rules,
compare frequencies, summaries, record counts, missing-value counts, and
perform independent QC reconciliation.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about WARRIORS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment