🐅 Roar of Reliability: Transforming the “Best Tigers in the World” Dataset into Analysis-Ready Intelligence Using SAS and R for Enterprise-Grade Reporting

Introduction: When Dirty Data Creates a Wildlife Analytics Disaster

Imagine a global wildlife conservation organization preparing an executive report ranking the world's most famous tiger populations. The report is intended for conservation funding agencies, biodiversity researchers, AI prediction systems, and government policymakers.

However, just before publication, analysts discover alarming issues:

Duplicate Tiger IDs
Negative tiger weights
Missing observation dates
Invalid conservation statuses
Corrupted region codes
Mixed uppercase/lowercase subspecies names
Invalid GPS region mappings
Malformed researcher emails
Impossible tiger ages
Text fields containing NULL strings

As a result:

Population dashboards become inaccurate
AI habitat prediction models fail
Funding allocations become misleading
Regulatory conservation reports become unreliable

This is exactly why enterprise data cleaning exists.

Today we'll build a complete Best Tigers in the World Dataset and transform it into an analysis-ready dataset using both SAS and R.

Raw Business Dataset

Variable	Description
Tiger_ID	Unique Tiger Identifier
Tiger_Name	Tiger Name
Subspecies	Tiger Type
Country	Country
Region_Code	Region
Age	Tiger Age
Weight_KG	Weight
Observation_Date	Observation Date
Researcher_Email	Researcher Email

1.Raw SAS Dataset with Intentional Errors

data tiger_raw;

length Tiger_ID $8 Tiger_Name $25 Subspecies $30 Country $20

Region_Code $10 Observation_Date $20 Researcher_Email $60;

infile datalines dlm='|' dsd truncover;

input Tiger_ID $ Tiger_Name $ Subspecies $ Country $ Region_Code $

Age Weight_KG Observation_Date $ Researcher_Email $;

datalines;

T001|Sheru|BENGAL TIGER|India|IN|8|220|2024-01-10|research1@gmail.com

T002|Raja|bengal tiger|India|IND|9|-210|2024-02-15|research2gmail.com

T003| Maya |NULL|Nepal|NP|150|190|2024-03-20|research3@yahoo.com

T004|Khan|SIBERIAN TIGER|Russia|RU|10|300|2024-15-40|research4@gmail.com

T005|Sheru|BENGAL TIGER|India|IN|8|220|2024-01-10|research1@gmail.com

T006|Leo|SUMATRAN TIGER|Indonesia|ID|-5|140||research5@gmail.com

T007|Max|MALAYAN TIGER|Malaysia|MY|12|-150|2024-04-12|NULL

T008|Rocky|South China Tiger|China|CHN|9|170|2024-05-22|research8gmail

T009|TigerX|Siberian Tiger|Russia|RUS|7|280|2024-06-01|research9@gmail.com

T010|TigerX|Siberian Tiger|Russia|RUS|7|280|2024-06-01|research9@gmail.com

;

run;

proc print data=tiger_raw;

run;

OUTPUT:


Obs	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG
1	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
2	T002	Raja	bengal tiger	India	IND	2024-02-15	research2gmail.com	9	-210
3	T003	Maya	NULL	Nepal	NP	2024-03-20	research3@yahoo.com	150	190
4	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300
5	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
6	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	-5	140
7	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12	NULL	12	-150
8	T008	Rocky	South China Tiger	China	CHN	2024-05-22	research8gmail	9	170
9	T009	TigerX	Siberian Tiger	Russia	RUS	2024-06-01	research9@gmail.com	7	280
10	T010	TigerX	Siberian Tiger	Russia	RUS	2024-06-01	research9@gmail.com	7	280

Why LENGTH Must Appear First

One of the most overlooked SAS issues is Character Truncation Risk.

Bad example:

data test;

name="Siberian Tiger Population";

length name $10;

run;

proc print data=test;

run;

LOG:

WARNING: Length of character variable name has already been set. 

Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

OUTPUT:


Obs	name
1	Siberian T

The variable length was fixed before expansion.

Correct:

data test;

length name $40;

name="Siberian Tiger Population";

run;

proc print data=test;

run;

OUTPUT:


Obs	name
1	Siberian Tiger Population

Key Point

SAS allocates memory during compilation.

The first encounter determines length unless LENGTH is explicitly defined first.

R behaves differently because character vectors dynamically allocate memory and do not suffer traditional truncation during assignment.

Step 1: Initial Data Profiling

proc contents data=tiger_raw;

run;

OUTPUT:

The CONTENTS Procedure


Data Set Name	WORK.TIGER_RAW	Observations	10
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	06/23/2026 18:43:54	Observation Length	192
Last Modified	06/23/2026 18:43:54	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	131072
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	682
Obs in First Data Page	10
Number of Data Set Repairs	0
Filename	/saswork/SAS_workE4C20001896F_odaws02-apse1-2.oda.sas.com/SAS_work27FB0001896F_odaws02-apse1-2.oda.sas.com/tiger_raw.sas7bdat
Release Created	9.0401M8
Host Created	Linux
Inode Number	134333503
Access Permission	rw-r--r--
Owner Name	u63247146
File Size	256KB
File Size (bytes)	262144


Alphabetic List of Variables and Attributes
#	Variable	Type	Len
8	Age	Num	8
4	Country	Char	20
6	Observation_Date	Char	20
5	Region_Code	Char	10
7	Researcher_Email	Char	60
3	Subspecies	Char	30
1	Tiger_ID	Char	8
2	Tiger_Name	Char	25
9	Weight_KG	Num	8

proc freq data=tiger_raw;

tables subspecies region_code;

run;

OUTPUT:

The FREQ Procedure


Subspecies	Frequency	Percent	Cumulative Frequency	Cumulative Percent
BENGAL TIGER	2	20.00	2	20.00
MALAYAN TIGER	1	10.00	3	30.00
NULL	1	10.00	4	40.00
SIBERIAN TIGER	1	10.00	5	50.00
SUMATRAN TIGER	1	10.00	6	60.00
Siberian Tiger	2	20.00	8	80.00
South China Tiger	1	10.00	9	90.00
bengal tiger	1	10.00	10	100.00


Region_Code	Frequency	Percent	Cumulative Frequency	Cumulative Percent
CHN	1	10.00	1	10.00
ID	1	10.00	2	20.00
IN	2	20.00	4	40.00
IND	1	10.00	5	50.00
MY	1	10.00	6	60.00
NP	1	10.00	7	70.00
RU	1	10.00	8	80.00
RUS	2	20.00	10	100.00

proc means data=tiger_raw n nmiss min max;

var age weight_kg;

run;

OUTPUT:

The MEANS Procedure


Variable	N	N Miss	Minimum	Maximum
Age Weight_KG	10 10	0 0	-5.0000000 -210.0000000	150.0000000 300.0000000

Explanation

This profiling step acts like a medical diagnosis before treatment. PROC CONTENTS validates metadata, PROC FREQ identifies corrupted categorical values, and PROC MEANS reveals impossible numeric ranges such as negative weights and ages above biological limits. Enterprise teams always profile data before cleaning because hidden quality issues often create downstream reporting failures. Metadata validation is the first layer of production-grade data governance.

Step 2: Enterprise Cleaning Using DATA Step

data tiger_clean;

retain Source_System "GLOBAL_WILDLIFE";

set tiger_raw;

Tiger_Name = propcase(strip(Tiger_Name));

Subspecies = upcase(strip(Subspecies));

Country = propcase(strip(Country));

Researcher_Email = lowcase(strip(Researcher_Email));

if age < 0 then age=.;

if age > 30 then age=.;

Weight_KG = abs(Weight_KG);

if Subspecies='NULL' then Subspecies='UNKNOWN';

if Researcher_Email='null' then Researcher_Email='';

select(strip(upcase(region_code)));

when('IND') Region_Code='IN';

when('RUS') Region_Code='RU';

when('CHN') Region_Code='CN';

otherwise Region_Code=Region_Code;

end;

run;

proc print data=tiger_clean;

run;

OUTPUT:


Obs	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG
1	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
2	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210
3	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190
4	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300
5	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
6	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140
7	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150
8	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170
9	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280
10	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280

Explanation

This DATA Step performs standardization, normalization, and error correction. PROPCASE standardizes names, UPCASE harmonizes species classifications, LOWCASE cleans emails, and ABS converts invalid negative weights into usable values. Age validations remove biologically impossible values. RETAIN preserves source lineage information. These transformations create consistency essential for reliable statistical analysis and regulatory reporting.

Step 3: Using SELECT-WHEN

data tiger_clean;

length Tiger_Group $20;

set tiger_clean;

select(upcase(subspecies));

when('BENGAL TIGER')

Tiger_Group='ASIAN';

when('SIBERIAN TIGER')

Tiger_Group='NORTHERN';

otherwise

Tiger_Group='OTHER';

end;

run;

proc print data=tiger_clean;

run;

OUTPUT:


Obs	TIGER_GROUP	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG
1	ASIAN	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
2	ASIAN	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210
3	OTHER	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190
4	NORTHERN	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300
5	ASIAN	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
6	OTHER	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140
7	OTHER	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150
8	OTHER	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170
9	NORTHERN	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280
10	NORTHERN	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280

Explanation

SELECT-WHEN is more efficient than multiple IF-THEN conditions when handling categorical classification logic. It improves readability, maintainability, and enterprise code standardization.

Step 4: Using ARRAY Processing

data tiger_clean;

set tiger_clean;

array nums(*) age weight_kg;

do i=1 to dim(nums);

if nums(i)<0 then nums(i)=.;

end;

drop i;

run;

proc print data=tiger_clean;

run;

OUTPUT:


Obs	TIGER_GROUP	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG
1	ASIAN	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
2	ASIAN	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210
3	OTHER	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190
4	NORTHERN	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300
5	ASIAN	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
6	OTHER	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140
7	OTHER	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150
8	OTHER	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170
9	NORTHERN	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280
10	NORTHERN	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280

Explanation

ARRAYS allow simultaneous validation of multiple variables. Instead of writing repetitive code for every numeric column, enterprise programmers use arrays to scale validation rules efficiently across hundreds of variables.

Step 5: Removing Duplicates

proc sort data=tiger_clean

out=tiger_nodup nodupkey;

by tiger_id;

run;

proc print data=tiger_nodup;

run;

LOG:

 NOTE: There were 10 observations read from the data set WORK.TIGER_CLEAN.

 NOTE: 0 observations with duplicate key values were deleted.

 NOTE: The data set WORK.TIGER_NODUP has 10 observations and 11 variables.

OUTPUT:


Obs	TIGER_GROUP	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG
1	ASIAN	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
2	ASIAN	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210
3	OTHER	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190
4	NORTHERN	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300
5	ASIAN	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
6	OTHER	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140
7	OTHER	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150
8	OTHER	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170
9	NORTHERN	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280
10	NORTHERN	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280

Explanation

Duplicate observations can inflate counts and distort summaries. PROC SORT NODUPKEY removes repeated Tiger_ID values while preserving the first occurrence. This is a standard enterprise deduplication strategy.

Step 6: PROC FORMAT

proc format;

value agegrp low-5='Young'

6-10='Adult'

11-high='Senior';

run;

LOG:

 NOTE: Format AGEGRP has been output.

Explanation

Formats provide business-friendly reporting labels without altering underlying data values. They improve readability and executive presentation quality.

Step 7: PROC SQL Approach

proc sql;

create table tiger_sql as

select distinct a.*

from tiger_nodup a;

quit;

proc print data=tiger_sql;

run;

OUTPUT:


Obs	TIGER_GROUP	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG
1	ASIAN	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
2	ASIAN	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210
3	ASIAN	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220
4	NORTHERN	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300
5	NORTHERN	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280
6	NORTHERN	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280
7	OTHER	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190
8	OTHER	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140
9	OTHER	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150
10	OTHER	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170

Explanation

PROC SQL provides database-style processing. It is especially useful when integrating multiple datasets, creating joins, performing aggregations, and implementing relational business logic.

Step 8: PROC SQL Join

data tiger_rank;

input tiger_id $ population_rank;

datalines;

T001 1

T002 2

T003 3

T004 4

T005 5

T006 6

T007 7

T008 8

T009 9

T010 10

;

run;

proc print data=tiger_rank;

run;

OUTPUT:


Obs	tiger_id	population_rank
1	T001	1
2	T002	2
3	T003	3
4	T004	4
5	T005	5
6	T006	6
7	T007	7
8	T008	8
9	T009	9
10	T010	10

proc sql;

create table tiger_report as

select a.*,

b.population_rank

from tiger_nodup a

left join tiger_rank b

on a.tiger_id=b.tiger_id;

quit;

proc print data=tiger_report;

run;

OUTPUT:


Obs	TIGER_GROUP	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG	population_rank
1	ASIAN	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220	1
2	ASIAN	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210	2
3	OTHER	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190	3
4	NORTHERN	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300	4
5	ASIAN	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220	5
6	OTHER	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140	6
7	OTHER	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150	7
8	OTHER	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170	8
9	NORTHERN	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280	9
10	NORTHERN	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280	10

Explanation

SQL joins combine datasets using business keys. In enterprise environments, this is used to integrate demographic, operational, clinical, financial, or conservation data.

Step 9: Advanced SAS Functions

data tiger_clean;

set tiger_clean;

Email_Domain=scan(researcher_email,2,'@');

Email_Check=index(researcher_email,'@');

Country_Code=substr(region_code,1,2);

Words=countw(tiger_name);

Missing_Count=cmiss(of _character_);

Numeric_Missing=nmiss(of _numeric_);

run;

proc print data=tiger_clean;

run;

OUTPUT:


Obs	TIGER_GROUP	Source_System	Tiger_ID	Tiger_Name	Subspecies	Country	Region_Code	Observation_Date	Researcher_Email	Age	Weight_KG	Email_Domain	Email_Check	Country_Code	Words	Missing_Count	Numeric_Missing
1	ASIAN	GLOBAL_WILDLIFE	T001	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220	gmail.com	10	IN	1	0	1
2	ASIAN	GLOBAL_WILDLIFE	T002	Raja	BENGAL TIGER	India	IN	2024-02-15	research2gmail.com	9	210		0	IN	1	1	1
3	OTHER	GLOBAL_WILDLIFE	T003	Maya	UNKNOWN	Nepal	NP	2024-03-20	research3@yahoo.com	.	190	yahoo.com	10	NP	1	0	2
4	NORTHERN	GLOBAL_WILDLIFE	T004	Khan	SIBERIAN TIGER	Russia	RU	2024-15-40	research4@gmail.com	10	300	gmail.com	10	RU	1	0	1
5	ASIAN	GLOBAL_WILDLIFE	T005	Sheru	BENGAL TIGER	India	IN	2024-01-10	research1@gmail.com	8	220	gmail.com	10	IN	1	0	1
6	OTHER	GLOBAL_WILDLIFE	T006	Leo	SUMATRAN TIGER	Indonesia	ID		research5@gmail.com	.	140	gmail.com	10	ID	1	1	2
7	OTHER	GLOBAL_WILDLIFE	T007	Max	MALAYAN TIGER	Malaysia	MY	2024-04-12		12	150		0	MY	1	2	1
8	OTHER	GLOBAL_WILDLIFE	T008	Rocky	SOUTH CHINA TIGER	China	CN	2024-05-22	research8gmail	9	170		0	CN	1	1	1
9	NORTHERN	GLOBAL_WILDLIFE	T009	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280	gmail.com	10	RU	1	0	1
10	NORTHERN	GLOBAL_WILDLIFE	T010	Tigerx	SIBERIAN TIGER	Russia	RU	2024-06-01	research9@gmail.com	7	280	gmail.com	10	RU	1	0	1

Explanation

Functions like SCAN, INDEX, SUBSTR, CMISS, and NMISS support advanced validation frameworks. They help detect malformed values, missing data, and metadata inconsistencies.

Step 10: R Raw Dataset

library(tidyverse)

tiger_raw <- tibble(

tiger_id=c("T001","T002","T003","T004","T005","T006",

"T007","T008","T009","T010"),

tiger_name=c("Sheru","Raja"," Maya ","Khan","Sheru",

"Leo","Max","Rocky","TigerX","TigerX"),

subspecies=c("BENGAL TIGER","bengal tiger","NULL",

"SIBERIAN TIGER","BENGAL TIGER",

"SUMATRAN TIGER","MALAYAN TIGER",

"South China Tiger","Siberian Tiger",

"Siberian Tiger"),

age=c(8,9,150,10,8,-5,12,9,7,7),

weight_kg=c(220,-210,190,300,220,140,-150,170,280,280))

OUTPUT:

tiger_id	tiger_name	subspecies	age	weight_kg
T001	Sheru	BENGAL TIGER	8	220
T002	Raja	bengal tiger	9	-210
T003	Maya	NULL	150	190
T004	Khan	SIBERIAN TIGER	10	300
T005	Sheru	BENGAL TIGER	8	220
T006	Leo	SUMATRAN TIGER	-5	140
T007	Max	MALAYAN TIGER	12	-150
T008	Rocky	South China Tiger	9	170
T009	TigerX	Siberian Tiger	7	280
T010	TigerX	Siberian Tiger	7	280

Step 11: R Cleaning Workflow

library(janitor)

library(lubridate)

library(stringr

tiger_clean <- tiger_raw %>%

clean_names() %>%

mutate(tiger_name=str_to_title(str_trim(tiger_name)),

subspecies=str_to_upper(str_trim(subspecies)),

age=if_else(age<0 | age>30,

NA_real_,as.numeric(age)),

weight_kg=abs(weight_kg),

subspecies=replace(subspecies,subspecies=="NULL",

"UNKNOWN")

) %>%

distinct(tiger_id,.keep_all=TRUE)

OUTPUT:

tiger_id	tiger_name	subspecies	age	weight_kg
T001	Sheru	BENGAL TIGER	8	220
T002	Raja	BENGAL TIGER	9	210
T003	Maya	UNKNOWN	NA	190
T004	Khan	SIBERIAN TIGER	10	300
T005	Sheru	BENGAL TIGER	8	220
T006	Leo	SUMATRAN TIGER	NA	140
T007	Max	MALAYAN TIGER	12	150
T008	Rocky	SOUTH CHINA TIGER	9	170
T009	Tigerx	SIBERIAN TIGER	7	280
T010	Tigerx	SIBERIAN TIGER	7	280

Explanation

The tidyverse workflow mirrors SAS cleaning logic. mutate() performs transformations, str_trim() removes whitespace, str_to_upper() standardizes text, if_else() validates age, abs() fixes weights, and distinct() removes duplicates. This declarative style makes R highly flexible for exploratory and production analytics.

SAS vs R Cleaning Comparison

Task	SAS	R
Standardize Text	PROPCASE	str_to_title
Trim Spaces	STRIP	str_trim
Missing Values	NMISS	is.na
Deduplicate	PROC SORT	distinct
Joins	PROC SQL	left_join
Formats	PROC FORMAT	factor/case_when
Date Handling	INTNX	lubridate
Validation	DATA Step	mutate

Enterprise Validation & Compliance

For SDTM and ADaM environments:

Every transformation must be traceable.
Audit trails must document all changes.
Independent QC programmers must reproduce results.
Metadata must align with Define.xml specifications.
Regulatory reviewers expect consistent derivation logic.

Critical SAS rule:

if age=. then ...

Missing numeric values in SAS are treated lower than valid numbers.

Example:

if age<18 then flag='Y';

Missing ages will incorrectly become flagged.

Correct:

if age ne . and age<18 then flag='Y';

This single mistake has caused numerous production validation findings.

Business Logic Behind Cleaning

Data cleaning exists because raw operational data rarely reflects reality. Missing values may result from delayed entry, system outages, or incomplete source records. Age values above biological limits indicate entry errors, while negative weights often arise from sign reversal mistakes during file transfers. Standardizing dates ensures consistent chronological calculations, enabling accurate duration, trend, and survival analyses. Text normalization removes unnecessary variability that would otherwise split identical categories into multiple groups. For example, "BENGAL TIGER", "Bengal Tiger", and " bengal tiger " should represent one category. Email validation supports researcher communication workflows and audit readiness. Missing observation dates may require imputation or exclusion depending on business rules. Consistent region coding improves geographic reporting accuracy. Every correction must follow documented business logic, ensuring reproducibility and traceability. Without standardization, analytical outputs become unreliable, executive dashboards become misleading, and AI models learn incorrect patterns.

20 Data Cleaning Best Practices

Profile before cleaning.
Validate metadata.
Standardize variable naming.
Use audit trails.
Create reusable macros.
Validate source-to-target mappings.
Remove duplicates early.
Check date formats.
Validate email structures.
Standardize categorical values.
Flag unrealistic numeric values.
Document assumptions.
Maintain lineage records.
Apply QC independently.
Automate validation rules.
Avoid hardcoding.
Use metadata-driven processing.
Archive raw data.
Version control code.
Reconcile outputs before deployment.

20 One-Line Insights

Dirty data creates expensive business mistakes.
Validation logic is stronger than visual inspection.
Metadata drives consistency.
Deduplication improves trust.
Standardized variables improve reproducibility.
Missing values deserve investigation.
Governance starts with profiling.
Auditability matters.
Clean inputs create reliable outputs.
Traceability protects compliance.
Macros improve scalability.
Defensive programming reduces risk.
Formats improve readability.
Documentation prevents confusion.
QC is not optional.
Date consistency matters.
Text normalization reduces noise.
Automation reduces human error.
Reliable data powers AI.
Analytics is only as good as the source.

Summary: SAS vs R for Enterprise Data Cleaning

SAS excels in regulated environments requiring auditability, traceability, repeatability, and compliance. DATA Step processing remains one of the most powerful row-wise transformation engines available. PROC SQL integrates relational logic efficiently, while PROC FORMAT, PROC REPORT, and ODS provide enterprise reporting capabilities. R provides exceptional flexibility through tidyverse packages, rapid prototyping, visualization, and advanced analytics. Together, SAS and R form a complementary ecosystem where SAS delivers governance and production stability while R contributes innovation and analytical agility. Organizations leveraging both technologies gain scalability, transparency, reproducibility, and performance. The strongest enterprise teams understand not merely how to clean data, but how to build repeatable frameworks that convert operational chaos into analytical intelligence.

Conclusion

The Best Tigers in the World dataset demonstrates a universal truth across wildlife conservation, clinical trials, banking, insurance, and retail analytics: raw data is rarely analysis-ready. Duplicate identifiers, malformed text, invalid dates, unrealistic measurements, and missing values can quietly undermine reporting accuracy, predictive models, regulatory submissions, and executive decision-making. Effective data engineering is therefore not a technical luxury it is a business necessity.

A structured cleaning framework begins with profiling, metadata inspection, and validation before progressing into standardization, correction, enrichment, and reporting. SAS provides industrial-strength capabilities through DATA Step programming, PROC SQL, PROC FORMAT, PROC REPORT, and reusable macros. R complements these strengths through tidyverse-driven flexibility, rapid transformation pipelines, and modern analytical workflows. Together they provide an end-to-end ecosystem for producing trusted datasets.

The most successful organizations treat data cleaning as an engineered process rather than a one-time activity. They implement audit trails, independent QC, metadata governance, lineage tracking, standardized macros, validation checklists, and production controls. Whether the objective is SDTM compliance, financial risk analysis, insurance reporting, retail forecasting, or wildlife conservation intelligence, trustworthy analytics always begin with trustworthy data.

When clean data becomes a strategic asset, dashboards become credible, statistical outputs become reproducible, AI predictions become dependable, and decision-makers gain confidence. That is the real power of transforming raw operational data into analysis-ready intelligence using SAS and R.

Interview Questions and Answers

1. How would you identify duplicate tiger records?

Answer: Use PROC SORT NODUPKEY in SAS or distinct() in R using Tiger_ID as the business key.

2. Why is LENGTH placement important?

Answer: SAS assigns character length during compilation. Incorrect placement can truncate data permanently.

3. How do you validate malformed emails?

Answer: Use INDEX, FIND, VERIFY, PRXMATCH in SAS and grepl() in R.

4. Why is missing-value handling critical in SAS?

Answer: Missing numeric values are treated lower than valid values and can unintentionally satisfy conditional logic.

5. When would you choose PROC SQL over DATA Step?

Answer: PROC SQL is preferred for joins, aggregations, and relational processing; DATA Step excels in row-wise transformations and complex derivations.

---------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TIGERS DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

🐅Best Tigers in the World Dataset into Analysis-Ready Using SAS and R

🐅 Roar of Reliability: Transforming the “Best Tigers in the World” Dataset into Analysis-Ready Intelligence Using SAS and R for Enterprise-Grade Reporting

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

🐅Best Tigers in the World Dataset into Analysis-Ready Using SAS and R

🐅 Roar of Reliability: Transforming the “Best Tigers in the World” Dataset into Analysis-Ready Intelligence Using SAS and R for Enterprise-Grade Reporting

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?