Transforming a Corrupted Dataset Of “Best Warriors in History” into Enterprise-Grade Analytical Intelligence Using SAS (PROC SQL vs DATA Step) and R

Introduction: When Dirty Data Rewrites History

Imagine a global historical analytics company preparing a documentary called Best Warriors in History. Researchers collected information on famous warriors across civilizations Spartans, Samurai, Vikings, Mongols, Rajputs, Zulus, Roman Legionaries, and others.

A week before executive presentation, analysts discover serious problems:

Duplicate Warrior IDs
Missing battle dates
Negative battle counts
Invalid ages
Corrupted region codes
Mixed text formatting
NULL strings
Invalid email contacts of researchers
Inconsistent warrior categories
Impossible years

The result?

Executive dashboards show incorrect rankings.

AI models predict inaccurate warrior influence scores.

Historical trend reports become unreliable.

Management begins questioning the entire analytical process.

This situation is remarkably similar to real clinical trial environments where incorrect patient records can jeopardize regulatory submissions.

Whether analyzing warriors, patients, insurance claims, or financial transactions, data quality determines analytical credibility.

Warriors Dataset Structure

Variable	Description
Warrior_ID	Unique Warrior Identifier
Warrior_Name	Warrior Name
Civilization	Warrior Origin
Region_Code	Geographic Region
Battle_Count	Number of Battles
Age	Age at Peak
Last_Battle_Date	Final Battle Date
Rank_Category	Warrior Classification
Research_Email	Research Contact

1.Raw SAS Dataset with Intentional Errors

data warriors_raw;

length Warrior_ID $8 Warrior_Name $40 Civilization $25

Region_Code $10 Rank_Category $20 Research_Email $50;

infile datalines dlm='|' dsd truncover;

input Warrior_ID $ Warrior_Name $ Civilization $ Region_Code $

Battle_Count Age Last_Battle_Date $ Rank_Category $ Research_Email $;

datalines;

W001|alexander|MACEDONIAN|EU|45|32|323BC|Elite|alexander@history.com

W002|genghis khan|mongol|APAC|120|65|1227|Legend|genghis@history.com

W003|spartan leonidas|EUROPE|EU|15|300|480BC|Elite|bademail

W004|samurai musashi|japan|APAC|-25|61|1645|Master|musashi@gmail

W005|viking ragnar|nordic|EUR|50|-10|845|Warrior|ragnar@email.com

W006|NULL|rome|EU|35|45||Elite|romehistory.com

W007|rajput prithviraj|india|APAC|80|40|1192|KING|rajput@history.com

W008|zulu shaka|africa|AFRICA|70|42|1828|Hero|shaka@@mail.com

W009|white space |persia|ME|65|55|651 | elite |persia@history.com

;

run;

proc print data=warriors_raw;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date
1	W001	alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC
2	W002	genghis khan	mongol	APAC	Legend	genghis@history.com	120	65	1227
3	W003	spartan leonidas	EUROPE	EU	Elite	bademail	15	300	480BC
4	W004	samurai musashi	japan	APAC	Master	musashi@gmail	-25	61	1645
5	W005	viking ragnar	nordic	EUR	Warrior	ragnar@email.com	50	-10	845
6	W006	NULL	rome	EU	Elite	romehistory.com	35	45
7	W006	NULL	rome	EU	Elite	romehistory.com	35	45
8	W007	rajput prithviraj	india	APAC	KING	rajput@history.com	80	40	1192
9	W008	zulu shaka	africa	AFRICA	Hero	shaka@@mail.com	70	42	1828
10	W009	white space	persia	ME	elite	persia@history.com	65	55	651

Understanding Character Truncation Risk

One of the most overlooked SAS issues is character truncation.

InCorrect:

data demo;

name="Alexander The Great";

length name $10;

run;

proc print data=demo;

run;

LOG:

WARNING: Length of character variable name has already been set. 

Use the LENGTH statement as the very first statement in the DATA STEP

to declare the length of a character variable.

OUTPUT:


Obs	name
1	Alexander The Great

SAS already assigned length before LENGTH executes.

Result:

Alexander

Data gets truncated.

Correct:

data demo;

length name $50;

name="Alexander The Great";

run;

proc print data=demo;

run;

OUTPUT:


Obs	name
1	Alexander The Great

Why This Matters

Clinical trial datasets frequently contain long investigator names, treatment descriptions, and adverse event terms.

Improper LENGTH placement can silently corrupt data.

Unlike SAS, R dynamically handles character vectors and typically avoids fixed-length truncation issues.

2.Data Cleaning Workflow Using DATA Step

data warriors_clean;

set warriors_raw;

Warrior_Name=propcase(strip(Warrior_Name));

Civilization=upcase(strip(Civilization));

Region_Code=upcase(strip(Region_Code));

Rank_Category=propcase(strip(Rank_Category));

Battle_Count=abs(Battle_Count);

if Age<15 or Age>100 then Age=.;

Research_Email=lowcase(strip(Research_Email));

if find(Research_Email,'@')=0 then Research_Email='';

if Warrior_Name='Null' then Warrior_Name='Unknown';

run;

proc print data=warriors_clean;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845
6	W006	Unknown	ROME	EU	Elite		35	45
7	W006	Unknown	ROME	EU	Elite		35	45
8	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192
9	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828
10	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651

Explanation

This DATA Step performs standardization and validation.

Functions demonstrated:

PROPCASE
STRIP
UPCASE
LOWCASE
FIND
ABS

Negative battle counts become positive.

Impossible ages become missing.

Emails are validated.

Text values become consistent.

This mirrors real-world SDTM preparation where raw clinical data must be standardized before analysis.

3.Using SELECT-WHEN Logic

data warriors_rank;

set warriors_clean;

select;

when(Battle_Count>=100) Category='Legend';

when(Battle_Count>=70) Category='Elite';

when(Battle_Count>=40) Category='Veteran';

otherwise Category='Novice';

end;

run;

proc print data=warriors_rank;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date	Category
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC	Vetera
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227	Legend
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC	Novice
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645	Novice
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845	Vetera
6	W006	Unknown	ROME	EU	Elite		35	45		Novice
7	W006	Unknown	ROME	EU	Elite		35	45		Novice
8	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192	Elite
9	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828	Elite
10	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651	Vetera

Explanation

SELECT-WHEN is cleaner than long IF-THEN chains.

Benefits:

Easier maintenance
Better readability
Reduced programming errors

Commonly used in treatment categorization and risk scoring.

4.ARRAY Processing

data warriors_array;

set warriors_clean;

array nums Age Battle_Count;

do i=1 to dim(nums);

if nums[i]<0 then nums[i]=abs(nums[i]);

end;

drop i;

run;

proc print data=warriors_array;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845
6	W006	Unknown	ROME	EU	Elite		35	45
7	W006	Unknown	ROME	EU	Elite		35	45
8	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192
9	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828
10	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651

Explanation

ARRAYS help process multiple variables efficiently.

Instead of repeating logic, a loop validates all numeric fields.

This technique is heavily used in laboratory datasets containing hundreds of measurements.

5.RETAIN and FIRST./LAST. Processing

proc sort data=warriors_clean;

by Civilization;

run;

proc print data=warriors_clean;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date
1	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828
2	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC
3	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645
5	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC
6	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227
7	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845
8	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651
9	W006	Unknown	ROME	EU	Elite		35	45
10	W006	Unknown	ROME	EU	Elite		35	45

data civ_summary;

set warriors_clean;

by Civilization;

retain Total_Battles 0;

Total_Battles+Battle_Count;

if last.Civilization;

run;

proc print data=civ_summary;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date	Total_Battles
1	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828	70
2	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC	85
3	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192	165
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645	190
5	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC	235
6	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227	355
7	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845	405
8	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651	470
9	W006	Unknown	ROME	EU	Elite		35	45		540

Explanation

RETAIN preserves values across observations.

FIRST./LAST. processing creates group-level summaries.

This technique is essential for patient-level analysis in ADaM datasets.

6.Removing Duplicates

proc sort data=warriors_clean

out=warriors_nodup nodupkey;

by Warrior_ID;

run;

proc print data=warriors_nodup;

run;

LOG:

 NOTE: There were 10 observations read from the data set WORK.WARRIORS_CLEAN.

 NOTE: 1 observations with duplicate key values were deleted.

 NOTE: The data set WORK.WARRIORS_NODUP has 9 observations and 9 variables.

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845
6	W006	Unknown	ROME	EU	Elite		35	45
7	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192
8	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828
9	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651

Explanation

Duplicate Warrior_ID values are removed.

Equivalent clinical scenario:

Duplicate patient identifiers causing enrollment inflation.

PROC SORT NODUPKEY remains one of the most efficient SAS deduplication techniques.

7.PROC FORMAT Standardization

proc format;

value agegrp low-30='Young'

31-50='Experienced'

51-high='Veteran';

run;

LOG:

 NOTE: Format AGEGRP has been output.

Explanation

Formats improve reporting consistency.

Instead of coding categories repeatedly, business rules become centralized and reusable.

8.PROC SQL Approach

proc sql;

create table warrior_sql as

select a.*,

put(a.age,agegrp.) as Age_Group length=15,

b.Category

from warriors_nodup a

left join warriors_rank b

on a.Warrior_ID=b.Warrior_ID;

quit;

proc print data=warrior_sql;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date	Age_Group	Category
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC	Experienced	Vetera
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227	Veteran	Legend
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC	.	Novice
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645	Veteran	Novice
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845	.	Vetera
6	W006	Unknown	ROME	EU	Elite		35	45		Experienced	Novice
7	W006	Unknown	ROME	EU	Elite		35	45		Experienced	Novice
8	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192	Experienced	Elite
9	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828	Experienced	Elite
10	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651	Veteran	Vetera

Explanation

PROC SQL simplifies joins and relational operations.

Advantages:

Familiar SQL syntax
Flexible joins
Database-like processing

Ideal when integrating multiple source systems.

9.MERGE Logic Using DATA Step

proc sort data=warriors_nodup;

by Warrior_ID;

run;

proc print data=warriors_nodup;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845
6	W006	Unknown	ROME	EU	Elite		35	45
7	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192
8	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828
9	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651

proc sort data=warriors_rank nodupkey;

by Warrior_ID;

run;

proc print data=warriors_rank;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date	Category
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC	Vetera
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227	Legend
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC	Novice
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645	Novice
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845	Vetera
6	W006	Unknown	ROME	EU	Elite		35	45		Novice
7	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192	Elite
8	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828	Elite
9	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651	Vetera

data warrior_merge;

merge warriors_nodup(in=a)

warriors_rank(in=b);

by Warrior_ID;

if a;

run;

proc print data=warrior_merge;

run;

OUTPUT:


Obs	Warrior_ID	Warrior_Name	Civilization	Region_Code	Rank_Category	Research_Email	Battle_Count	Age	Last_Battle_Date	Category
1	W001	Alexander	MACEDONIAN	EU	Elite	alexander@history.com	45	32	323BC	Vetera
2	W002	Genghis Khan	MONGOL	APAC	Legend	genghis@history.com	120	65	1227	Legend
3	W003	Spartan Leonidas	EUROPE	EU	Elite		15	.	480BC	Novice
4	W004	Samurai Musashi	JAPAN	APAC	Master	musashi@gmail	25	61	1645	Novice
5	W005	Viking Ragnar	NORDIC	EUR	Warrior	ragnar@email.com	50	.	845	Vetera
6	W006	Unknown	ROME	EU	Elite		35	45		Novice
7	W007	Rajput Prithviraj	INDIA	APAC	King	rajput@history.com	80	40	1192	Elite
8	W008	Zulu Shaka	AFRICA	AFRICA	Hero	shaka@@mail.com	70	42	1828	Elite
9	W009	White Space	PERSIA	ME	Elite	persia@history.com	65	55	651	Vetera

Explanation

DATA Step MERGE is often faster for sorted SAS datasets.

Comparison:

Feature	DATA Step	PROC SQL
Speed	Faster on SAS tables	Flexible
Readability	Moderate	High
Complex Joins	Limited	Strong
Clinical Usage	Very Common	Common

10.Advanced Data Profiling

10.1.PROC CONTENTS

proc contents data=warriors_clean;

run;

OUTPUT:

The CONTENTS Procedure


Data Set Name	WORK.WARRIORS_CLEAN	Observations	10
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	06/21/2026 15:25:18	Observation Length	184
Last Modified	06/21/2026 15:25:18	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	YES
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	131072
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	711
Obs in First Data Page	10
Number of Data Set Repairs	0
Filename	/saswork/SAS_workD37400005FFD_odaws01-apse1-2.oda.sas.com/SAS_work1C2F00005FFD_odaws01-apse1-2.oda.sas.com/warriors_clean.sas7bdat
Release Created	9.0401M8
Host Created	Linux
Inode Number	1379846
Access Permission	rw-r--r--
Owner Name	u63247146
File Size	256KB
File Size (bytes)	262144


Alphabetic List of Variables and Attributes
#	Variable	Type	Len
8	Age	Num	8
7	Battle_Count	Num	8
3	Civilization	Char	25
9	Last_Battle_Date	Char	8
5	Rank_Category	Char	20
4	Region_Code	Char	10
6	Research_Email	Char	50
1	Warrior_ID	Char	8
2	Warrior_Name	Char	40


Sort Information
Sortedby	Civilization
Validated	YES
Character Set	ASCII

Provides metadata review.

10.2.PROC FREQ

proc freq data=warriors_clean;

tables Region_Code Rank_Category;

run;

OUTPUT:

The FREQ Procedure


Region_Code	Frequency	Percent	Cumulative Frequency	Cumulative Percent
AFRICA	1	10.00	1	10.00
APAC	3	30.00	4	40.00
EU	4	40.00	8	80.00
EUR	1	10.00	9	90.00
ME	1	10.00	10	100.00


Rank_Category	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Elite	5	50.00	5	50.00
Hero	1	10.00	6	60.00
King	1	10.00	7	70.00
Legend	1	10.00	8	80.00
Master	1	10.00	9	90.00
Warrior	1	10.00	10	100.00

Detects unexpected categories.

10.3.PROC MEANS

proc means data=warriors_clean n nmiss mean min max;

var Age Battle_Count;

run;

OUTPUT:

The MEANS Procedure


Variable	N	N Miss	Mean	Minimum	Maximum
Age Battle_Count	8 10	2 0	48.1250000 54.0000000	32.0000000 15.0000000	65.0000000 120.0000000

Identifies outliers and missing values.

10.4.PROC SUMMARY

proc summary data=warriors_clean nway;

class Civilization;

var Battle_Count;

output out=summary_stats sum=;

run;

proc print data=summary_stats;

run;

OUTPUT:


Obs	Civilization	_TYPE_	_FREQ_	Battle_Count
1	AFRICA	1	1	70
2	EUROPE	1	1	15
3	INDIA	1	1	80
4	JAPAN	1	1	25
5	MACEDONIAN	1	1	45
6	MONGOL	1	1	120
7	NORDIC	1	1	50
8	PERSIA	1	1	65
9	ROME	1	2	70

Creates aggregated reports.

10.5.PROC TRANSPOSE

proc transpose data=summary_stats out=transposed;

run;

proc print data=transposed;

run;

OUTPUT:


Obs	_NAME_	COL1	COL2	COL3	COL4	COL5	COL6	COL7	COL8	COL9
1	_TYPE_	1	1	1	1	1	1	1	1	1
2	_FREQ_	1	1	1	1	1	1	1	1	2
3	Battle_Count	70	15	80	25	45	120	50	65	70

Reshapes reporting structures.

10.6.PROC REPORT

proc report data=warriors_clean nowd;

column Warrior_Name Civilization Battle_Count;

run;

OUTPUT:


Warrior_Name	Civilization	Battle_Count
Zulu Shaka	AFRICA	70
Spartan Leonidas	EUROPE	15
Rajput Prithviraj	INDIA	80
Samurai Musashi	JAPAN	25
Alexander	MACEDONIAN	45
Genghis Khan	MONGOL	120
Viking Ragnar	NORDIC	50
White Space	PERSIA	65
Unknown	ROME	35
Unknown	ROME	35

Generates professional outputs.

11.Reusable SAS Macro

%macro profile(ds);

proc means data=&ds

n nmiss mean std min max;

run;

proc freq data=&ds;

tables _character_;

run;

%mend;

%profile(warriors_clean);

OUTPUT:

The MEANS Procedure


Variable	N	N Miss	Mean	Std Dev	Minimum	Maximum
Battle_Count Age	10 8	0 2	54.0000000 48.1250000	30.8940843 11.2178111	15.0000000 32.0000000	120.0000000 65.0000000

The FREQ Procedure


Warrior_ID	Frequency	Percent	Cumulative Frequency	Cumulative Percent
W001	1	10.00	1	10.00
W002	1	10.00	2	20.00
W003	1	10.00	3	30.00
W004	1	10.00	4	40.00
W005	1	10.00	5	50.00
W006	2	20.00	7	70.00
W007	1	10.00	8	80.00
W008	1	10.00	9	90.00
W009	1	10.00	10	100.00


Warrior_Name	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Alexander	1	10.00	1	10.00
Genghis Khan	1	10.00	2	20.00
Rajput Prithviraj	1	10.00	3	30.00
Samurai Musashi	1	10.00	4	40.00
Spartan Leonidas	1	10.00	5	50.00
Unknown	2	20.00	7	70.00
Viking Ragnar	1	10.00	8	80.00
White Space	1	10.00	9	90.00
Zulu Shaka	1	10.00	10	100.00


Civilization	Frequency	Percent	Cumulative Frequency	Cumulative Percent
AFRICA	1	10.00	1	10.00
EUROPE	1	10.00	2	20.00
INDIA	1	10.00	3	30.00
JAPAN	1	10.00	4	40.00
MACEDONIAN	1	10.00	5	50.00
MONGOL	1	10.00	6	60.00
NORDIC	1	10.00	7	70.00
PERSIA	1	10.00	8	80.00
ROME	2	20.00	10	100.00


Region_Code	Frequency	Percent	Cumulative Frequency	Cumulative Percent
AFRICA	1	10.00	1	10.00
APAC	3	30.00	4	40.00
EU	4	40.00	8	80.00
EUR	1	10.00	9	90.00
ME	1	10.00	10	100.00


Rank_Category	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Elite	5	50.00	5	50.00
Hero	1	10.00	6	60.00
King	1	10.00	7	70.00
Legend	1	10.00	8	80.00
Master	1	10.00	9	90.00
Warrior	1	10.00	10	100.00


Research_Email	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 3
alexander@history.com	1	14.29	1	14.29
genghis@history.com	1	14.29	2	28.57
musashi@gmail	1	14.29	3	42.86
persia@history.com	1	14.29	4	57.14
ragnar@email.com	1	14.29	5	71.43
rajput@history.com	1	14.29	6	85.71
shaka@@mail.com	1	14.29	7	100.00


Last_Battle_Date	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Frequency Missing = 2
1192	1	12.50	1	12.50
1227	1	12.50	2	25.00
1645	1	12.50	3	37.50
1828	1	12.50	4	50.00
323BC	1	12.50	5	62.50
480BC	1	12.50	6	75.00
651	1	12.50	7	87.50
845	1	12.50	8	100.00

Explanation

Macros standardize validation.

Benefits:

Reusability
Consistency
Reduced coding effort
Faster production deployment

12.R Raw Data

library(readr)

txt <- "

W001|alexander|MACEDONIAN|EU|45|32|323BC|Elite|alexander@history.com

W002|genghis khan|mongol|APAC|120|65|1227|Legend|genghis@history.com

W003|spartan leonidas|EUROPE|EU|15|300|480BC|Elite|bademail

W004|samurai musashi|japan|APAC|-25|61|1645|Master|musashi@gmail

W005|viking ragnar|nordic|EUR|50|-10|845|Warrior|ragnar@email.com

W006|NULL|rome|EU|35|45||Elite|romehistory.com

W007|rajput prithviraj|india|APAC|80|40|1192|KING|rajput@history.com

W008|zulu shaka|africa|AFRICA|70|42|1828|Hero|shaka@@mail.com

W009|white space |persia|ME|65|55|651 | elite |persia@history.com

warriors_raw <- read_delim(

I(txt),

delim = "|",

trim_ws = FALSE,

show_col_types = FALSE

)

OUTPUT:

Warrior_ID	Warrior_Name	Civilization	Region_Code	Battle_Count	Age	Last_Battle_Date	Rank_Category	Research_Email
W001	alexander	MACEDONIAN	EU	45	32	323BC	Elite	alexander@history.com
W002	genghis khan	mongol	APAC	120	65	1227	Legend	genghis@history.com
W003	spartan leonidas	EUROPE	EU	15	300	480BC	Elite	bademail
W004	samurai musashi	japan	APAC	-25	61	1645	Master	musashi@gmail
W005	viking ragnar	nordic	EUR	50	-10	845	Warrior	ragnar@email.com
W006	NULL	rome	EU	35	45		Elite	romehistory.com
W006	NULL	rome	EU	35	45		Elite	romehistory.com
W007	rajput prithviraj	india	APAC	80	40	1192	KING	rajput@history.com
W008	zulu shaka	africa	AFRICA	70	42	1828	Hero	shaka@@mail.com
W009	white space	persia	ME	65	55	651	elite	persia@history.com

13.Equivalent R Cleaning Workflow

library(tidyverse)

library(janitor)

library(lubridate)

library(stringr)

warriors_clean <- warriors_raw %>%

clean_names() %>%

mutate( warrior_name =str_to_title(str_trim(warrior_name)),

civilization =str_to_upper(str_trim(civilization)),

region_code =case_when(region_code %in% c("EUR","EU") ~ "EU",

region_code=="AFRICA" ~ "AFR",

TRUE ~ region_code),

battle_count =abs(battle_count),

age =if_else(age<15 | age>100,

NA_real_,age),

research_email =str_to_lower(research_email),

research_email =if_else(grepl("@", research_email) &

grepl("\\.com$", research_email),

research_email,NA_character_)

) %>%

distinct(warrior_id,.keep_all=TRUE)

OUTPUT:

warrior_id	warrior_name	civilization	region_code	battle_count	age	last_battle_date	rank_category	research_email
W001	Alexander	MACEDONIAN	EU	45	32	323BC	Elite	alexander@history.com
W002	Genghis Khan	MONGOL	APAC	120	65	1227	Legend	genghis@history.com
W003	Spartan Leonidas	EUROPE	EU	15		480BC	Elite
W004	Samurai Musashi	JAPAN	APAC	25	61	1645	Master
W005	Viking Ragnar	NORDIC	EU	50		845	Warrior	ragnar@email.com
W006	Null	ROME	EU	35	45		Elite
W007	Rajput Prithviraj	INDIA	APAC	80	40	1192	KING	rajput@history.com
W008	Zulu Shaka	AFRICA	AFR	70	42	1828	Hero	shaka@@mail.com
W009	White Space	PERSIA	ME	65	55	651	elite	persia@history.com

Explanation

Equivalent SAS → R mappings:

SAS	R
PROPCASE	str_to_title
UPCASE	str_to_upper
LOWCASE	str_to_lower
ABS	abs
FIND	grepl
STRIP	str_trim
PROC SORT NODUPKEY	distinct
IF THEN	if_else
SELECT WHEN	case_when

R offers concise syntax while SAS provides stronger auditability.

Validation & Compliance

Clinical environments require:

SDTM compliance
ADaM traceability
Independent QC
Audit trails
Reproducibility
Validation documentation

A major SAS risk:

if value < 10;

Missing numeric values (.) are considered smaller than any number.

Therefore:

. < 10

returns TRUE.

This can unintentionally classify missing patients as high-risk populations.

Proper logic:

if not missing(value) and value<10;

Failure to apply this rule can cause severe regulatory consequences.

Business Logic Behind Cleaning

Data cleaning is not cosmetic it directly impacts business decisions. Consider patient age. If a clinical trial contains an age of 300 years due to data entry errors, statistical summaries become distorted and age-stratified analyses become unreliable. Similar problems occur in banking when loan applicants show negative income values or impossible employment durations. Missing values are often imputed because many analytical models cannot process blanks effectively. For example, a missing visit date may prevent a patient from being included in efficacy analyses, reducing statistical power. Standardized dates ensure accurate interval calculations using INTCK and INTNX functions. Text normalization prevents duplicate categories such as “elite”, “Elite”, and “ELITE” from appearing as separate groups. Region codes such as EU, EUR, and Europe must be standardized to avoid fragmented reporting. Email validation ensures communication workflows remain operational. Numeric normalization removes impossible negative billing amounts and incorrect transaction values. Proper business rules create consistency across dashboards, machine learning systems, regulatory submissions, executive reports, and operational analytics. Ultimately, every cleaning rule must be justified, documented, reproducible, and traceable to business requirements so stakeholders can trust analytical outcomes.

20 Data Cleaning Best Practices

Define metadata before coding.
Standardize variable naming conventions.
Validate incoming source files.
Remove duplicates early.
Check missing-value patterns.
Validate date ranges.
Normalize categorical values.
Use reusable macros.
Maintain audit trails.
Implement independent QC.
Validate joins carefully.
Document derivations.
Track data lineage.
Use version control.
Centralize business rules.
Validate email structures.
Check impossible numeric ranges.
Review PROC CONTENTS outputs.
Create automated validation reports.
Perform production sign-off before deployment.

20 One-Line Insights

Dirty data creates expensive business mistakes.
Validation logic is stronger than visual inspection.
Standardized variables improve reproducibility.
Duplicate records distort analytics.
Missing dates damage timelines.
Auditability builds trust.
Metadata drives consistency.
QC is not optional.
Automation reduces human error.
Macros improve scalability.
SQL excels at integration.
DATA Step excels at transformation.
Traceability supports compliance.
Documentation prevents confusion.
Clean inputs create reliable outputs.
Normalized text improves reporting.
Dates deserve special attention.
Missing values require explicit handling.
Enterprise analytics depends on governance.
Quality data enables quality decisions.

Small Validation Checklist

Check	Status
Duplicate IDs Removed	✓
Missing Values Reviewed	✓
Date Validation Completed	✓
Text Standardized	✓
Numeric Ranges Verified	✓
Metadata Reviewed	✓
QC Completed	✓
Reporting Validated	✓

SAS vs R Cleaning Summary

SAS and R are both powerful platforms for enterprise-grade data cleaning, but they excel in different areas. SAS is renowned for stability, auditability, regulatory acceptance, and large-scale batch processing. In regulated industries such as pharmaceuticals, SAS remains dominant because every transformation can be documented, validated, and traced. DATA Step programming provides exceptional control over row-level processing, while PROC SQL offers powerful relational capabilities. Features such as FIRST./LAST. processing, RETAIN statements, PROC FORMAT, and metadata-driven macros make SAS highly suitable for production environments.

R, meanwhile, offers unmatched flexibility and modern data manipulation capabilities. Packages such as tidyverse, dplyr, stringr, lubridate, janitor, and purrr simplify complex transformations using concise syntax. R excels in exploratory analysis, advanced visualization, machine learning, and rapid prototyping. Functions like mutate(), across(), case_when(), and coalesce() enable highly readable workflows that often require fewer lines of code than SAS.

From a scalability perspective, SAS performs exceptionally well on enterprise infrastructure and validated environments. R provides extensibility and innovation through thousands of open-source packages. Many organizations increasingly adopt hybrid architectures where SAS performs regulatory-compliant clinical processing and R supports exploratory analytics and advanced modeling.

The strongest strategy is not SAS versus R it is SAS and R together. SAS delivers governance, reproducibility, and audit readiness, while R delivers agility, flexibility, and modern analytical capabilities. Combining both technologies creates reliable, scalable, and analytically powerful ecosystems capable of transforming messy operational data into trusted business intelligence.

Conclusion

The journey from corrupted operational data to trusted analytical intelligence is rarely straightforward. Whether working with clinical trial patients, insurance claims, banking transactions, retail purchases, or a historical warriors dataset, poor-quality data introduces substantial risks. Duplicate identifiers can inflate counts, invalid dates can break timelines, malformed categories can fragment reporting, and missing values can quietly distort statistical conclusions. In regulated industries, these problems are not merely technical inconveniences they can affect regulatory submissions, patient safety evaluations, financial decisions, and executive strategy.

A structured data-cleaning framework is therefore a foundational requirement rather than an optional enhancement. SAS provides a highly controlled environment for enterprise transformation through DATA Step programming, PROC SQL, PROC FORMAT, PROC REPORT, validation macros, metadata management, and auditable workflows. Techniques such as ARRAY processing, FIRST./LAST. logic, RETAIN statements, MERGE operations, and automated QC checks help ensure consistency and traceability across large production systems. At the same time, R complements these strengths through modern packages that accelerate cleaning, exploration, visualization, and advanced analytics.

The most successful analytics organizations establish repeatable standards: validated metadata, standardized macros, documented business rules, automated quality checks, independent QC reviews, and strong governance controls. These practices transform raw data into a reliable strategic asset. When implemented correctly, SAS and R together form a powerful ecosystem capable of producing trustworthy dashboards, reproducible analyses, compliant regulatory outputs, and scalable business intelligence. Clean data is not simply a technical objective it is the foundation upon which every credible analytical decision is built.

Interview Questions & Answers

1. A patient appears twice in an SDTM DM dataset. How would you identify it?

Answer: Use PROC SORT NODUPKEY, PROC FREQ, or PROC SQL GROUP BY HAVING COUNT(*)>1. Investigate source records and document corrective action.

2. Why is missing numeric handling dangerous in SAS?

Answer: Missing values are treated as smaller than valid numbers. Conditions like if value<10; may incorrectly include missing observations.

3. When would you choose PROC SQL over DATA Step?

Answer: PROC SQL is preferred for complex joins, aggregation, and relational processing. DATA Step is often preferred for row-level transformations and performance.

4. How would you validate email fields in SAS?

Answer: Use FIND, INDEX, VERIFY, SCAN, and regular-expression functions to confirm valid structures and identify malformed values.

5. How do you ensure R and SAS outputs match?

Answer: Apply identical business rules, compare frequencies, summaries, record counts, missing-value counts, and perform independent QC reconciliation.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about WARRIORS DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Ancient Legends, Modern Analytics: Enterprise Data Cleaning Frameworks in SAS and R

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Ancient Legends, Modern Analytics: Enterprise Data Cleaning Frameworks in SAS and R

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?