🐅Best Tigers in the World Dataset into Analysis-Ready Using SAS and R
🐅 Roar of Reliability: Transforming the “Best Tigers in the World”
Dataset into Analysis-Ready Intelligence Using SAS and R for Enterprise-Grade
Reporting
Introduction: When Dirty Data Creates a Wildlife
Analytics Disaster
Imagine a
global wildlife conservation organization preparing an executive report ranking
the world's most famous tiger populations. The report is intended for
conservation funding agencies, biodiversity researchers, AI prediction systems,
and government policymakers.
However,
just before publication, analysts discover alarming issues:
- Duplicate Tiger IDs
- Negative tiger weights
- Missing observation dates
- Invalid conservation
statuses
- Corrupted region codes
- Mixed uppercase/lowercase
subspecies names
- Invalid GPS region mappings
- Malformed researcher emails
- Impossible tiger ages
- Text fields containing NULL
strings
As a
result:
- Population dashboards become
inaccurate
- AI habitat prediction models
fail
- Funding allocations become
misleading
- Regulatory conservation
reports become unreliable
This is
exactly why enterprise data cleaning exists.
Today
we'll build a complete Best Tigers in the World Dataset and transform it
into an analysis-ready dataset using both SAS and R.
Raw Business Dataset
|
Variable |
Description |
|
Tiger_ID |
Unique
Tiger Identifier |
|
Tiger_Name |
Tiger
Name |
|
Subspecies |
Tiger
Type |
|
Country |
Country |
|
Region_Code |
Region |
|
Age |
Tiger
Age |
|
Weight_KG |
Weight |
|
Observation_Date |
Observation
Date |
|
Researcher_Email |
Researcher
Email |
1.Raw SAS Dataset with Intentional Errors
data tiger_raw;
length Tiger_ID $8 Tiger_Name $25 Subspecies $30 Country $20
Region_Code $10 Observation_Date $20 Researcher_Email $60;
infile datalines dlm='|' dsd truncover;
input Tiger_ID $ Tiger_Name $ Subspecies $ Country $ Region_Code $
Age Weight_KG Observation_Date $ Researcher_Email $;
datalines;
T001|Sheru|BENGAL TIGER|India|IN|8|220|2024-01-10|research1@gmail.com
T002|Raja|bengal tiger|India|IND|9|-210|2024-02-15|research2gmail.com
T003| Maya |NULL|Nepal|NP|150|190|2024-03-20|research3@yahoo.com
T004|Khan|SIBERIAN TIGER|Russia|RU|10|300|2024-15-40|research4@gmail.com
T005|Sheru|BENGAL TIGER|India|IN|8|220|2024-01-10|research1@gmail.com
T006|Leo|SUMATRAN TIGER|Indonesia|ID|-5|140||research5@gmail.com
T007|Max|MALAYAN TIGER|Malaysia|MY|12|-150|2024-04-12|NULL
T008|Rocky|South China Tiger|China|CHN|9|170|2024-05-22|research8gmail
T009|TigerX|Siberian Tiger|Russia|RUS|7|280|2024-06-01|research9@gmail.com
T010|TigerX|Siberian Tiger|Russia|RUS|7|280|2024-06-01|research9@gmail.com
;
run;
proc print data=tiger_raw;
run;
OUTPUT:
| Obs | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG |
|---|---|---|---|---|---|---|---|---|---|
| 1 | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 2 | T002 | Raja | bengal tiger | India | IND | 2024-02-15 | research2gmail.com | 9 | -210 |
| 3 | T003 | Maya | NULL | Nepal | NP | 2024-03-20 | research3@yahoo.com | 150 | 190 |
| 4 | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 |
| 5 | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 6 | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | -5 | 140 | |
| 7 | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | NULL | 12 | -150 |
| 8 | T008 | Rocky | South China Tiger | China | CHN | 2024-05-22 | research8gmail | 9 | 170 |
| 9 | T009 | TigerX | Siberian Tiger | Russia | RUS | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 10 | T010 | TigerX | Siberian Tiger | Russia | RUS | 2024-06-01 | research9@gmail.com | 7 | 280 |
Why LENGTH Must Appear First
One of
the most overlooked SAS issues is Character Truncation Risk.
Bad example:
data test;
name="Siberian Tiger Population";
length name $10;
run;
proc print data=test;
run;
LOG:
OUTPUT:
| Obs | name |
|---|---|
| 1 | Siberian T |
The
variable length was fixed before expansion.
Correct:
data test;
length name $40;
name="Siberian Tiger Population";
run;
proc print data=test;
run;
OUTPUT:
| Obs | name |
|---|---|
| 1 | Siberian Tiger Population |
Key Point
SAS
allocates memory during compilation.
The first
encounter determines length unless LENGTH is explicitly defined first.
R behaves
differently because character vectors dynamically allocate memory and do not
suffer traditional truncation during assignment.
Step 1: Initial Data Profiling
proc contents data=tiger_raw;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.TIGER_RAW | Observations | 10 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 06/23/2026 18:43:54 | Observation Length | 192 |
| Last Modified | 06/23/2026 18:43:54 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 682 |
| Obs in First Data Page | 10 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workE4C20001896F_odaws02-apse1-2.oda.sas.com/SAS_work27FB0001896F_odaws02-apse1-2.oda.sas.com/tiger_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 134333503 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 8 | Age | Num | 8 |
| 4 | Country | Char | 20 |
| 6 | Observation_Date | Char | 20 |
| 5 | Region_Code | Char | 10 |
| 7 | Researcher_Email | Char | 60 |
| 3 | Subspecies | Char | 30 |
| 1 | Tiger_ID | Char | 8 |
| 2 | Tiger_Name | Char | 25 |
| 9 | Weight_KG | Num | 8 |
proc freq data=tiger_raw;
tables subspecies region_code;
run;
OUTPUT:
The FREQ Procedure
| Subspecies | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| BENGAL TIGER | 2 | 20.00 | 2 | 20.00 |
| MALAYAN TIGER | 1 | 10.00 | 3 | 30.00 |
| NULL | 1 | 10.00 | 4 | 40.00 |
| SIBERIAN TIGER | 1 | 10.00 | 5 | 50.00 |
| SUMATRAN TIGER | 1 | 10.00 | 6 | 60.00 |
| Siberian Tiger | 2 | 20.00 | 8 | 80.00 |
| South China Tiger | 1 | 10.00 | 9 | 90.00 |
| bengal tiger | 1 | 10.00 | 10 | 100.00 |
| Region_Code | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| CHN | 1 | 10.00 | 1 | 10.00 |
| ID | 1 | 10.00 | 2 | 20.00 |
| IN | 2 | 20.00 | 4 | 40.00 |
| IND | 1 | 10.00 | 5 | 50.00 |
| MY | 1 | 10.00 | 6 | 60.00 |
| NP | 1 | 10.00 | 7 | 70.00 |
| RU | 1 | 10.00 | 8 | 80.00 |
| RUS | 2 | 20.00 | 10 | 100.00 |
proc means data=tiger_raw n nmiss min max;
var age weight_kg;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | N Miss | Minimum | Maximum |
|---|---|---|---|---|
Age Weight_KG | 10 10 | 0 0 | -5.0000000 -210.0000000 | 150.0000000 300.0000000 |
Explanation
This
profiling step acts like a medical diagnosis before treatment. PROC CONTENTS
validates metadata, PROC FREQ identifies corrupted categorical values, and PROC
MEANS reveals impossible numeric ranges such as negative weights and ages above
biological limits. Enterprise teams always profile data before cleaning because
hidden quality issues often create downstream reporting failures. Metadata
validation is the first layer of production-grade data governance.
Step 2: Enterprise Cleaning Using DATA Step
data tiger_clean;
retain Source_System "GLOBAL_WILDLIFE";
set tiger_raw;
Tiger_Name = propcase(strip(Tiger_Name));
Subspecies = upcase(strip(Subspecies));
Country = propcase(strip(Country));
Researcher_Email = lowcase(strip(Researcher_Email));
if age < 0 then age=.;
if age > 30 then age=.;
Weight_KG = abs(Weight_KG);
if Subspecies='NULL' then Subspecies='UNKNOWN';
if Researcher_Email='null' then Researcher_Email='';
select(strip(upcase(region_code)));
when('IND') Region_Code='IN';
when('RUS') Region_Code='RU';
when('CHN') Region_Code='CN';
otherwise Region_Code=Region_Code;
end;
run;
proc print data=tiger_clean;
run;
OUTPUT:
| Obs | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 2 | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 |
| 3 | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 |
| 4 | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 |
| 5 | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 6 | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | |
| 7 | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | |
| 8 | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 |
| 9 | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 10 | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
Explanation
Step 3: Using SELECT-WHEN
data tiger_clean;
length Tiger_Group $20;
set tiger_clean;
select(upcase(subspecies));
when('BENGAL TIGER')
Tiger_Group='ASIAN';
when('SIBERIAN TIGER')
Tiger_Group='NORTHERN';
otherwise
Tiger_Group='OTHER';
end;
run;
proc print data=tiger_clean;
run;
OUTPUT:
| Obs | TIGER_GROUP | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ASIAN | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 2 | ASIAN | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 |
| 3 | OTHER | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 |
| 4 | NORTHERN | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 |
| 5 | ASIAN | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 6 | OTHER | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | |
| 7 | OTHER | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | |
| 8 | OTHER | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 |
| 9 | NORTHERN | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 10 | NORTHERN | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
Explanation
SELECT-WHEN
is more efficient than multiple IF-THEN conditions when handling categorical
classification logic. It improves readability, maintainability, and enterprise
code standardization.
Step 4: Using ARRAY Processing
data tiger_clean;
set tiger_clean;
array nums(*) age weight_kg;
do i=1 to dim(nums);
if nums(i)<0 then nums(i)=.;
end;
drop i;
run;
proc print data=tiger_clean;
run;
OUTPUT:
| Obs | TIGER_GROUP | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ASIAN | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 2 | ASIAN | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 |
| 3 | OTHER | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 |
| 4 | NORTHERN | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 |
| 5 | ASIAN | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 6 | OTHER | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | |
| 7 | OTHER | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | |
| 8 | OTHER | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 |
| 9 | NORTHERN | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 10 | NORTHERN | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
Explanation
ARRAYS
allow simultaneous validation of multiple variables. Instead of writing
repetitive code for every numeric column, enterprise programmers use arrays to
scale validation rules efficiently across hundreds of variables.
Step 5: Removing Duplicates
proc sort data=tiger_clean
out=tiger_nodup nodupkey;
by tiger_id;
run;
proc print data=tiger_nodup;
run;
LOG:
OUTPUT:
| Obs | TIGER_GROUP | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ASIAN | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 2 | ASIAN | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 |
| 3 | OTHER | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 |
| 4 | NORTHERN | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 |
| 5 | ASIAN | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 6 | OTHER | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | |
| 7 | OTHER | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | |
| 8 | OTHER | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 |
| 9 | NORTHERN | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 10 | NORTHERN | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
Explanation
Duplicate
observations can inflate counts and distort summaries. PROC SORT NODUPKEY
removes repeated Tiger_ID values while preserving the first occurrence. This is
a standard enterprise deduplication strategy.
Step 6: PROC FORMAT
proc format;
value agegrp low-5='Young'
6-10='Adult'
11-high='Senior';
run;
LOG:
Explanation
Formats
provide business-friendly reporting labels without altering underlying data
values. They improve readability and executive presentation quality.
Step 7: PROC SQL Approach
proc sql;
create table tiger_sql as
select distinct a.*
from tiger_nodup a;
quit;
proc print data=tiger_sql;
run;
OUTPUT:
| Obs | TIGER_GROUP | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ASIAN | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 2 | ASIAN | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 |
| 3 | ASIAN | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 |
| 4 | NORTHERN | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 |
| 5 | NORTHERN | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 6 | NORTHERN | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 |
| 7 | OTHER | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 |
| 8 | OTHER | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | |
| 9 | OTHER | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | |
| 10 | OTHER | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 |
Explanation
PROC SQL
provides database-style processing. It is especially useful when integrating
multiple datasets, creating joins, performing aggregations, and implementing
relational business logic.
Step 8: PROC SQL Join
data tiger_rank;
input tiger_id $ population_rank;
datalines;
T001 1
T002 2
T003 3
T004 4
T005 5
T006 6
T007 7
T008 8
T009 9
T010 10
;
run;
proc print data=tiger_rank;
run;
OUTPUT:
| Obs | tiger_id | population_rank |
|---|---|---|
| 1 | T001 | 1 |
| 2 | T002 | 2 |
| 3 | T003 | 3 |
| 4 | T004 | 4 |
| 5 | T005 | 5 |
| 6 | T006 | 6 |
| 7 | T007 | 7 |
| 8 | T008 | 8 |
| 9 | T009 | 9 |
| 10 | T010 | 10 |
proc sql;
create table tiger_report as
select a.*,
b.population_rank
from tiger_nodup a
left join tiger_rank b
on a.tiger_id=b.tiger_id;
quit;
proc print data=tiger_report;
run;
OUTPUT:
| Obs | TIGER_GROUP | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG | population_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ASIAN | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 | 1 |
| 2 | ASIAN | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 | 2 |
| 3 | OTHER | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 | 3 |
| 4 | NORTHERN | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 | 4 |
| 5 | ASIAN | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 | 5 |
| 6 | OTHER | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | 6 | |
| 7 | OTHER | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | 7 | |
| 8 | OTHER | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 | 8 |
| 9 | NORTHERN | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 | 9 |
| 10 | NORTHERN | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 | 10 |
Explanation
SQL joins
combine datasets using business keys. In enterprise environments, this is used
to integrate demographic, operational, clinical, financial, or conservation
data.
Step 9: Advanced SAS Functions
data tiger_clean;
set tiger_clean;
Email_Domain=scan(researcher_email,2,'@');
Email_Check=index(researcher_email,'@');
Country_Code=substr(region_code,1,2);
Words=countw(tiger_name);
Missing_Count=cmiss(of _character_);
Numeric_Missing=nmiss(of _numeric_);
run;
proc print data=tiger_clean;
run;
OUTPUT:
| Obs | TIGER_GROUP | Source_System | Tiger_ID | Tiger_Name | Subspecies | Country | Region_Code | Observation_Date | Researcher_Email | Age | Weight_KG | Email_Domain | Email_Check | Country_Code | Words | Missing_Count | Numeric_Missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ASIAN | GLOBAL_WILDLIFE | T001 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 | gmail.com | 10 | IN | 1 | 0 | 1 |
| 2 | ASIAN | GLOBAL_WILDLIFE | T002 | Raja | BENGAL TIGER | India | IN | 2024-02-15 | research2gmail.com | 9 | 210 | 0 | IN | 1 | 1 | 1 | |
| 3 | OTHER | GLOBAL_WILDLIFE | T003 | Maya | UNKNOWN | Nepal | NP | 2024-03-20 | research3@yahoo.com | . | 190 | yahoo.com | 10 | NP | 1 | 0 | 2 |
| 4 | NORTHERN | GLOBAL_WILDLIFE | T004 | Khan | SIBERIAN TIGER | Russia | RU | 2024-15-40 | research4@gmail.com | 10 | 300 | gmail.com | 10 | RU | 1 | 0 | 1 |
| 5 | ASIAN | GLOBAL_WILDLIFE | T005 | Sheru | BENGAL TIGER | India | IN | 2024-01-10 | research1@gmail.com | 8 | 220 | gmail.com | 10 | IN | 1 | 0 | 1 |
| 6 | OTHER | GLOBAL_WILDLIFE | T006 | Leo | SUMATRAN TIGER | Indonesia | ID | research5@gmail.com | . | 140 | gmail.com | 10 | ID | 1 | 1 | 2 | |
| 7 | OTHER | GLOBAL_WILDLIFE | T007 | Max | MALAYAN TIGER | Malaysia | MY | 2024-04-12 | 12 | 150 | 0 | MY | 1 | 2 | 1 | ||
| 8 | OTHER | GLOBAL_WILDLIFE | T008 | Rocky | SOUTH CHINA TIGER | China | CN | 2024-05-22 | research8gmail | 9 | 170 | 0 | CN | 1 | 1 | 1 | |
| 9 | NORTHERN | GLOBAL_WILDLIFE | T009 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 | gmail.com | 10 | RU | 1 | 0 | 1 |
| 10 | NORTHERN | GLOBAL_WILDLIFE | T010 | Tigerx | SIBERIAN TIGER | Russia | RU | 2024-06-01 | research9@gmail.com | 7 | 280 | gmail.com | 10 | RU | 1 | 0 | 1 |
Explanation
Functions
like SCAN, INDEX, SUBSTR, CMISS, and NMISS support advanced validation
frameworks. They help detect malformed values, missing data, and metadata
inconsistencies.
Step 10: R Raw Dataset
library(tidyverse)
tiger_raw <- tibble(
tiger_id=c("T001","T002","T003","T004","T005","T006",
"T007","T008","T009","T010"),
tiger_name=c("Sheru","Raja"," Maya ","Khan","Sheru",
"Leo","Max","Rocky","TigerX","TigerX"),
subspecies=c("BENGAL TIGER","bengal tiger","NULL",
"SIBERIAN TIGER","BENGAL TIGER",
"SUMATRAN TIGER","MALAYAN TIGER",
"South China Tiger","Siberian Tiger",
"Siberian Tiger"),
age=c(8,9,150,10,8,-5,12,9,7,7),
weight_kg=c(220,-210,190,300,220,140,-150,170,280,280))
OUTPUT:
|
tiger_id |
tiger_name |
subspecies |
age |
weight_kg |
|
T001 |
Sheru |
BENGAL TIGER |
8 |
220 |
|
T002 |
Raja |
bengal tiger |
9 |
-210 |
|
T003 |
Maya |
NULL |
150 |
190 |
|
T004 |
Khan |
SIBERIAN TIGER |
10 |
300 |
|
T005 |
Sheru |
BENGAL TIGER |
8 |
220 |
|
T006 |
Leo |
SUMATRAN TIGER |
-5 |
140 |
|
T007 |
Max |
MALAYAN TIGER |
12 |
-150 |
|
T008 |
Rocky |
South China Tiger |
9 |
170 |
|
T009 |
TigerX |
Siberian Tiger |
7 |
280 |
|
T010 |
TigerX |
Siberian Tiger |
7 |
280 |
Step 11: R Cleaning Workflow
library(janitor)
library(lubridate)
library(stringr
tiger_clean <- tiger_raw %>%
clean_names() %>%
mutate(tiger_name=str_to_title(str_trim(tiger_name)),
subspecies=str_to_upper(str_trim(subspecies)),
age=if_else(age<0 | age>30,
NA_real_,as.numeric(age)),
weight_kg=abs(weight_kg),
subspecies=replace(subspecies,subspecies=="NULL",
"UNKNOWN")
) %>%
distinct(tiger_id,.keep_all=TRUE)
OUTPUT:
|
tiger_id |
tiger_name |
subspecies |
age |
weight_kg |
|
T001 |
Sheru |
BENGAL TIGER |
8 |
220 |
|
T002 |
Raja |
BENGAL TIGER |
9 |
210 |
|
T003 |
Maya |
UNKNOWN |
NA |
190 |
|
T004 |
Khan |
SIBERIAN TIGER |
10 |
300 |
|
T005 |
Sheru |
BENGAL TIGER |
8 |
220 |
|
T006 |
Leo |
SUMATRAN TIGER |
NA |
140 |
|
T007 |
Max |
MALAYAN TIGER |
12 |
150 |
|
T008 |
Rocky |
SOUTH CHINA TIGER |
9 |
170 |
|
T009 |
Tigerx |
SIBERIAN TIGER |
7 |
280 |
|
T010 |
Tigerx |
SIBERIAN TIGER |
7 |
280 |
Explanation
The
tidyverse workflow mirrors SAS cleaning logic. mutate() performs
transformations, str_trim() removes whitespace, str_to_upper() standardizes
text, if_else() validates age, abs() fixes weights, and distinct() removes
duplicates. This declarative style makes R highly flexible for exploratory and
production analytics.
SAS vs R Cleaning
Comparison
|
Task |
SAS |
R |
|
Standardize
Text |
PROPCASE |
str_to_title |
|
Trim
Spaces |
STRIP |
str_trim |
|
Missing
Values |
NMISS |
is.na |
|
Deduplicate |
PROC
SORT |
distinct |
|
Joins |
PROC
SQL |
left_join |
|
Formats |
PROC
FORMAT |
factor/case_when |
|
Date
Handling |
INTNX |
lubridate |
|
Validation |
DATA
Step |
mutate |
Enterprise Validation &
Compliance
For SDTM
and ADaM environments:
- Every transformation must be
traceable.
- Audit trails must document
all changes.
- Independent QC programmers
must reproduce results.
- Metadata must align with
Define.xml specifications.
- Regulatory reviewers expect
consistent derivation logic.
Critical
SAS rule:
if age=. then ...
Missing
numeric values in SAS are treated lower than valid numbers.
Example:
if age<18 then flag='Y';
Missing
ages will incorrectly become flagged.
Correct:
if age ne . and age<18 then
flag='Y';
This
single mistake has caused numerous production validation findings.
Business Logic Behind
Cleaning
Data
cleaning exists because raw operational data rarely reflects reality. Missing
values may result from delayed entry, system outages, or incomplete source
records. Age values above biological limits indicate entry errors, while
negative weights often arise from sign reversal mistakes during file transfers.
Standardizing dates ensures consistent chronological calculations, enabling
accurate duration, trend, and survival analyses. Text normalization removes
unnecessary variability that would otherwise split identical categories into
multiple groups. For example, "BENGAL TIGER", "Bengal
Tiger", and " bengal tiger " should represent one category. Email
validation supports researcher communication workflows and audit readiness.
Missing observation dates may require imputation or exclusion depending on
business rules. Consistent region coding improves geographic reporting
accuracy. Every correction must follow documented business logic, ensuring
reproducibility and traceability. Without standardization, analytical outputs
become unreliable, executive dashboards become misleading, and AI models learn
incorrect patterns.
20 Data Cleaning Best
Practices
- Profile before cleaning.
- Validate metadata.
- Standardize variable naming.
- Use audit trails.
- Create reusable macros.
- Validate source-to-target
mappings.
- Remove duplicates early.
- Check date formats.
- Validate email structures.
- Standardize categorical
values.
- Flag unrealistic numeric
values.
- Document assumptions.
- Maintain lineage records.
- Apply QC independently.
- Automate validation rules.
- Avoid hardcoding.
- Use metadata-driven
processing.
- Archive raw data.
- Version control code.
- Reconcile outputs before
deployment.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Metadata drives consistency.
- Deduplication improves
trust.
- Standardized variables
improve reproducibility.
- Missing values deserve
investigation.
- Governance starts with
profiling.
- Auditability matters.
- Clean inputs create reliable
outputs.
- Traceability protects
compliance.
- Macros improve scalability.
- Defensive programming
reduces risk.
- Formats improve readability.
- Documentation prevents
confusion.
- QC is not optional.
- Date consistency matters.
- Text normalization reduces
noise.
- Automation reduces human
error.
- Reliable data powers AI.
- Analytics is only as good as
the source.
Summary: SAS vs R for
Enterprise Data Cleaning
SAS
excels in regulated environments requiring auditability, traceability,
repeatability, and compliance. DATA Step processing remains one of the most
powerful row-wise transformation engines available. PROC SQL integrates
relational logic efficiently, while PROC FORMAT, PROC REPORT, and ODS provide
enterprise reporting capabilities. R provides exceptional flexibility through
tidyverse packages, rapid prototyping, visualization, and advanced analytics.
Together, SAS and R form a complementary ecosystem where SAS delivers
governance and production stability while R contributes innovation and
analytical agility. Organizations leveraging both technologies gain
scalability, transparency, reproducibility, and performance. The strongest
enterprise teams understand not merely how to clean data, but how to build
repeatable frameworks that convert operational chaos into analytical
intelligence.
Conclusion
The Best
Tigers in the World dataset demonstrates a universal truth across wildlife
conservation, clinical trials, banking, insurance, and retail analytics: raw
data is rarely analysis-ready. Duplicate identifiers, malformed text, invalid
dates, unrealistic measurements, and missing values can quietly undermine
reporting accuracy, predictive models, regulatory submissions, and executive
decision-making. Effective data engineering is therefore not a technical luxury
it is a business necessity.
A
structured cleaning framework begins with profiling, metadata inspection, and
validation before progressing into standardization, correction, enrichment, and
reporting. SAS provides industrial-strength capabilities through DATA Step
programming, PROC SQL, PROC FORMAT, PROC REPORT, and reusable macros. R
complements these strengths through tidyverse-driven flexibility, rapid
transformation pipelines, and modern analytical workflows. Together they
provide an end-to-end ecosystem for producing trusted datasets.
The most
successful organizations treat data cleaning as an engineered process rather
than a one-time activity. They implement audit trails, independent QC, metadata
governance, lineage tracking, standardized macros, validation checklists, and
production controls. Whether the objective is SDTM compliance, financial risk
analysis, insurance reporting, retail forecasting, or wildlife conservation
intelligence, trustworthy analytics always begin with trustworthy data.
When
clean data becomes a strategic asset, dashboards become credible, statistical
outputs become reproducible, AI predictions become dependable, and
decision-makers gain confidence. That is the real power of transforming raw
operational data into analysis-ready intelligence using SAS and R.
Interview Questions and
Answers
1. How would you identify duplicate tiger records?
Answer: Use PROC SORT NODUPKEY in SAS or
distinct() in R using Tiger_ID as the business key.
2. Why is LENGTH placement important?
Answer: SAS assigns character length
during compilation. Incorrect placement can truncate data permanently.
3. How do you validate malformed emails?
Answer: Use INDEX, FIND, VERIFY,
PRXMATCH in SAS and grepl() in R.
4. Why is missing-value handling critical in SAS?
Answer: Missing numeric values are
treated lower than valid values and can unintentionally satisfy conditional
logic.
5. When would you choose PROC SQL over DATA Step?
Answer: PROC SQL is preferred for joins,
aggregations, and relational processing; DATA Step excels in row-wise
transformations and complex derivations.
---------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TIGERS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment