🐅Best Tigers in the World Dataset into Analysis-Ready Using SAS and R

🐅 Roar of Reliability: Transforming the “Best Tigers in the World” Dataset into Analysis-Ready Intelligence Using SAS and R for Enterprise-Grade Reporting

Introduction: When Dirty Data Creates a Wildlife Analytics Disaster

Imagine a global wildlife conservation organization preparing an executive report ranking the world's most famous tiger populations. The report is intended for conservation funding agencies, biodiversity researchers, AI prediction systems, and government policymakers.

However, just before publication, analysts discover alarming issues:

  • Duplicate Tiger IDs
  • Negative tiger weights
  • Missing observation dates
  • Invalid conservation statuses
  • Corrupted region codes
  • Mixed uppercase/lowercase subspecies names
  • Invalid GPS region mappings
  • Malformed researcher emails
  • Impossible tiger ages
  • Text fields containing NULL strings

As a result:

  • Population dashboards become inaccurate
  • AI habitat prediction models fail
  • Funding allocations become misleading
  • Regulatory conservation reports become unreliable

This is exactly why enterprise data cleaning exists.

Today we'll build a complete Best Tigers in the World Dataset and transform it into an analysis-ready dataset using both SAS and R.

Raw Business Dataset

Variable

Description

Tiger_ID

Unique Tiger Identifier

Tiger_Name

Tiger Name

Subspecies

Tiger Type

Country

Country

Region_Code

Region

Age

Tiger Age

Weight_KG

Weight

Observation_Date

Observation Date

Researcher_Email

Researcher Email

1.Raw SAS Dataset with Intentional Errors

data tiger_raw;

length Tiger_ID $8 Tiger_Name $25 Subspecies $30 Country $20 

Region_Code $10 Observation_Date $20 Researcher_Email $60;

infile datalines dlm='|' dsd truncover;

input Tiger_ID $ Tiger_Name $ Subspecies $ Country $ Region_Code $

Age Weight_KG Observation_Date $ Researcher_Email $;

datalines;

T001|Sheru|BENGAL TIGER|India|IN|8|220|2024-01-10|research1@gmail.com

T002|Raja|bengal tiger|India|IND|9|-210|2024-02-15|research2gmail.com

T003|  Maya |NULL|Nepal|NP|150|190|2024-03-20|research3@yahoo.com

T004|Khan|SIBERIAN TIGER|Russia|RU|10|300|2024-15-40|research4@gmail.com

T005|Sheru|BENGAL TIGER|India|IN|8|220|2024-01-10|research1@gmail.com

T006|Leo|SUMATRAN TIGER|Indonesia|ID|-5|140||research5@gmail.com

T007|Max|MALAYAN TIGER|Malaysia|MY|12|-150|2024-04-12|NULL

T008|Rocky|South China Tiger|China|CHN|9|170|2024-05-22|research8gmail

T009|TigerX|Siberian Tiger|Russia|RUS|7|280|2024-06-01|research9@gmail.com

T010|TigerX|Siberian Tiger|Russia|RUS|7|280|2024-06-01|research9@gmail.com

;

run;

proc print data=tiger_raw;

run;

OUTPUT:

ObsTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KG
1T001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
2T002Rajabengal tigerIndiaIND2024-02-15research2gmail.com9-210
3T003MayaNULLNepalNP2024-03-20research3@yahoo.com150190
4T004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300
5T005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
6T006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com-5140
7T007MaxMALAYAN TIGERMalaysiaMY2024-04-12NULL12-150
8T008RockySouth China TigerChinaCHN2024-05-22research8gmail9170
9T009TigerXSiberian TigerRussiaRUS2024-06-01research9@gmail.com7280
10T010TigerXSiberian TigerRussiaRUS2024-06-01research9@gmail.com7280

Why LENGTH Must Appear First

One of the most overlooked SAS issues is Character Truncation Risk.

Bad example:

data test;

name="Siberian Tiger Population";

length name $10;

run;

proc print data=test;

run;

LOG:

WARNING: Length of character variable name has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

OUTPUT:

Obsname
1Siberian T

The variable length was fixed before expansion.

Correct:

data test;

length name $40;

name="Siberian Tiger Population";

run;

proc print data=test;

run;

OUTPUT:

Obsname
1Siberian Tiger Population

Key Point

SAS allocates memory during compilation.

The first encounter determines length unless LENGTH is explicitly defined first.

R behaves differently because character vectors dynamically allocate memory and do not suffer traditional truncation during assignment.

Step 1: Initial Data Profiling

proc contents data=tiger_raw;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.TIGER_RAWObservations10
Member TypeDATAVariables9
EngineV9Indexes0
Created06/23/2026 18:43:54Observation Length192
Last Modified06/23/2026 18:43:54Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page682
Obs in First Data Page10
Number of Data Set Repairs0
Filename/saswork/SAS_workE4C20001896F_odaws02-apse1-2.oda.sas.com/SAS_work27FB0001896F_odaws02-apse1-2.oda.sas.com/tiger_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number134333503
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLen
8AgeNum8
4CountryChar20
6Observation_DateChar20
5Region_CodeChar10
7Researcher_EmailChar60
3SubspeciesChar30
1Tiger_IDChar8
2Tiger_NameChar25
9Weight_KGNum8

proc freq data=tiger_raw;

tables subspecies region_code;

run;

OUTPUT:

The FREQ Procedure

SubspeciesFrequencyPercentCumulative
Frequency
Cumulative
Percent
BENGAL TIGER220.00220.00
MALAYAN TIGER110.00330.00
NULL110.00440.00
SIBERIAN TIGER110.00550.00
SUMATRAN TIGER110.00660.00
Siberian Tiger220.00880.00
South China Tiger110.00990.00
bengal tiger110.0010100.00
Region_CodeFrequencyPercentCumulative
Frequency
Cumulative
Percent
CHN110.00110.00
ID110.00220.00
IN220.00440.00
IND110.00550.00
MY110.00660.00
NP110.00770.00
RU110.00880.00
RUS220.0010100.00

proc means data=tiger_raw n nmiss min max;

var age weight_kg;

run;

OUTPUT:

The MEANS Procedure

VariableNN MissMinimumMaximum
Age
Weight_KG
10
10
0
0
-5.0000000
-210.0000000
150.0000000
300.0000000

Explanation

This profiling step acts like a medical diagnosis before treatment. PROC CONTENTS validates metadata, PROC FREQ identifies corrupted categorical values, and PROC MEANS reveals impossible numeric ranges such as negative weights and ages above biological limits. Enterprise teams always profile data before cleaning because hidden quality issues often create downstream reporting failures. Metadata validation is the first layer of production-grade data governance.

Step 2: Enterprise Cleaning Using DATA Step

data tiger_clean;

retain Source_System "GLOBAL_WILDLIFE";

set tiger_raw;

Tiger_Name = propcase(strip(Tiger_Name));

Subspecies = upcase(strip(Subspecies));

Country = propcase(strip(Country));

Researcher_Email = lowcase(strip(Researcher_Email));

if age < 0 then age=.;

if age > 30 then age=.;

Weight_KG = abs(Weight_KG);

if Subspecies='NULL' then Subspecies='UNKNOWN';

if Researcher_Email='null' then Researcher_Email='';

select(strip(upcase(region_code)));

when('IND') Region_Code='IN';

when('RUS') Region_Code='RU';

when('CHN') Region_Code='CN';

otherwise   Region_Code=Region_Code;

end;

run;

proc print data=tiger_clean;

run;

OUTPUT:

ObsSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KG
1GLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
2GLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com9210
3GLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.190
4GLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300
5GLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
6GLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.140
7GLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 12150
8GLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail9170
9GLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280
10GLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280

Explanation

This DATA Step performs standardization, normalization, and error correction. PROPCASE standardizes names, UPCASE harmonizes species classifications, LOWCASE cleans emails, and ABS converts invalid negative weights into usable values. Age validations remove biologically impossible values. RETAIN preserves source lineage information. These transformations create consistency essential for reliable statistical analysis and regulatory reporting.

Step 3: Using SELECT-WHEN

data tiger_clean;

length Tiger_Group $20;

set tiger_clean; 

select(upcase(subspecies));

when('BENGAL TIGER')

Tiger_Group='ASIAN';

when('SIBERIAN TIGER')

Tiger_Group='NORTHERN';

otherwise

Tiger_Group='OTHER';

end;

run;

proc print data=tiger_clean;

run;

OUTPUT:

ObsTIGER_GROUPSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KG
1ASIANGLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
2ASIANGLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com9210
3OTHERGLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.190
4NORTHERNGLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300
5ASIANGLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
6OTHERGLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.140
7OTHERGLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 12150
8OTHERGLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail9170
9NORTHERNGLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280
10NORTHERNGLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280

Explanation

SELECT-WHEN is more efficient than multiple IF-THEN conditions when handling categorical classification logic. It improves readability, maintainability, and enterprise code standardization.

Step 4: Using ARRAY Processing

data tiger_clean;

set tiger_clean;

array nums(*) age weight_kg;

do i=1 to dim(nums);

if nums(i)<0 then nums(i)=.;

end;

drop i;

run;

proc print data=tiger_clean;

run;

OUTPUT:

ObsTIGER_GROUPSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KG
1ASIANGLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
2ASIANGLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com9210
3OTHERGLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.190
4NORTHERNGLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300
5ASIANGLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
6OTHERGLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.140
7OTHERGLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 12150
8OTHERGLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail9170
9NORTHERNGLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280
10NORTHERNGLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280

Explanation

ARRAYS allow simultaneous validation of multiple variables. Instead of writing repetitive code for every numeric column, enterprise programmers use arrays to scale validation rules efficiently across hundreds of variables.

Step 5: Removing Duplicates

proc sort data=tiger_clean

          out=tiger_nodup nodupkey;

by tiger_id;

run;

proc print data=tiger_nodup;

run;

LOG:

NOTE: There were 10 observations read from the data set WORK.TIGER_CLEAN.
NOTE: 0 observations with duplicate key values were deleted.
NOTE: The data set WORK.TIGER_NODUP has 10 observations and 11 variables.

OUTPUT:

ObsTIGER_GROUPSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KG
1ASIANGLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
2ASIANGLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com9210
3OTHERGLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.190
4NORTHERNGLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300
5ASIANGLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
6OTHERGLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.140
7OTHERGLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 12150
8OTHERGLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail9170
9NORTHERNGLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280
10NORTHERNGLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280

Explanation

Duplicate observations can inflate counts and distort summaries. PROC SORT NODUPKEY removes repeated Tiger_ID values while preserving the first occurrence. This is a standard enterprise deduplication strategy.

Step 6: PROC FORMAT

proc format;

value agegrp low-5='Young'

                         6-10='Adult'

                    11-high='Senior';

run;

LOG:

NOTE: Format AGEGRP has been output.

Explanation

Formats provide business-friendly reporting labels without altering underlying data values. They improve readability and executive presentation quality.

Step 7: PROC SQL Approach

proc sql;

create table tiger_sql as

select distinct a.*

from tiger_nodup a;

quit;

proc print data=tiger_sql;

run;

OUTPUT:

ObsTIGER_GROUPSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KG
1ASIANGLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
2ASIANGLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com9210
3ASIANGLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220
4NORTHERNGLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300
5NORTHERNGLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280
6NORTHERNGLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280
7OTHERGLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.190
8OTHERGLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.140
9OTHERGLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 12150
10OTHERGLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail9170

Explanation

PROC SQL provides database-style processing. It is especially useful when integrating multiple datasets, creating joins, performing aggregations, and implementing relational business logic.

Step 8: PROC SQL Join

data tiger_rank;

input tiger_id $ population_rank;

datalines;

T001 1

T002 2

T003 3

T004 4

T005 5

T006 6

T007 7

T008 8

T009 9

T010 10

;

run;

proc print data=tiger_rank;

run;

OUTPUT:

Obstiger_idpopulation_rank
1T0011
2T0022
3T0033
4T0044
5T0055
6T0066
7T0077
8T0088
9T0099
10T01010

proc sql;

create table tiger_report as

select a.*,

       b.population_rank

from tiger_nodup a

left join tiger_rank b

on a.tiger_id=b.tiger_id;

quit;

proc print data=tiger_report;

run;

OUTPUT:

ObsTIGER_GROUPSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KGpopulation_rank
1ASIANGLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com82201
2ASIANGLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com92102
3OTHERGLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.1903
4NORTHERNGLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com103004
5ASIANGLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com82205
6OTHERGLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.1406
7OTHERGLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 121507
8OTHERGLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail91708
9NORTHERNGLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com72809
10NORTHERNGLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com728010

Explanation

SQL joins combine datasets using business keys. In enterprise environments, this is used to integrate demographic, operational, clinical, financial, or conservation data.

Step 9: Advanced SAS Functions

data tiger_clean;

set tiger_clean;

Email_Domain=scan(researcher_email,2,'@');

Email_Check=index(researcher_email,'@');

Country_Code=substr(region_code,1,2);

Words=countw(tiger_name);

Missing_Count=cmiss(of _character_);

Numeric_Missing=nmiss(of _numeric_);

run;

proc print data=tiger_clean;

run;

OUTPUT:

ObsTIGER_GROUPSource_SystemTiger_IDTiger_NameSubspeciesCountryRegion_CodeObservation_DateResearcher_EmailAgeWeight_KGEmail_DomainEmail_CheckCountry_CodeWordsMissing_CountNumeric_Missing
1ASIANGLOBAL_WILDLIFET001SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220gmail.com10IN101
2ASIANGLOBAL_WILDLIFET002RajaBENGAL TIGERIndiaIN2024-02-15research2gmail.com9210 0IN111
3OTHERGLOBAL_WILDLIFET003MayaUNKNOWNNepalNP2024-03-20research3@yahoo.com.190yahoo.com10NP102
4NORTHERNGLOBAL_WILDLIFET004KhanSIBERIAN TIGERRussiaRU2024-15-40research4@gmail.com10300gmail.com10RU101
5ASIANGLOBAL_WILDLIFET005SheruBENGAL TIGERIndiaIN2024-01-10research1@gmail.com8220gmail.com10IN101
6OTHERGLOBAL_WILDLIFET006LeoSUMATRAN TIGERIndonesiaID research5@gmail.com.140gmail.com10ID112
7OTHERGLOBAL_WILDLIFET007MaxMALAYAN TIGERMalaysiaMY2024-04-12 12150 0MY121
8OTHERGLOBAL_WILDLIFET008RockySOUTH CHINA TIGERChinaCN2024-05-22research8gmail9170 0CN111
9NORTHERNGLOBAL_WILDLIFET009TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280gmail.com10RU101
10NORTHERNGLOBAL_WILDLIFET010TigerxSIBERIAN TIGERRussiaRU2024-06-01research9@gmail.com7280gmail.com10RU101

Explanation

Functions like SCAN, INDEX, SUBSTR, CMISS, and NMISS support advanced validation frameworks. They help detect malformed values, missing data, and metadata inconsistencies.

Step 10: R Raw Dataset

library(tidyverse)

tiger_raw <- tibble(

  tiger_id=c("T001","T002","T003","T004","T005","T006",

             "T007","T008","T009","T010"),

  tiger_name=c("Sheru","Raja"," Maya ","Khan","Sheru",

               "Leo","Max","Rocky","TigerX","TigerX"),

  subspecies=c("BENGAL TIGER","bengal tiger","NULL",

               "SIBERIAN TIGER","BENGAL TIGER",

               "SUMATRAN TIGER","MALAYAN TIGER",

               "South China Tiger","Siberian Tiger",

               "Siberian Tiger"),

  age=c(8,9,150,10,8,-5,12,9,7,7),

  weight_kg=c(220,-210,190,300,220,140,-150,170,280,280))

OUTPUT:

tiger_id

tiger_name

subspecies

age

weight_kg

T001

Sheru

BENGAL TIGER

8

220

T002

Raja

bengal tiger

9

-210

T003

 Maya

NULL

150

190

T004

Khan

SIBERIAN TIGER

10

300

T005

Sheru

BENGAL TIGER

8

220

T006

Leo

SUMATRAN TIGER

-5

140

T007

Max

MALAYAN TIGER

12

-150

T008

Rocky

South China Tiger

9

170

T009

TigerX

Siberian Tiger

7

280

T010

TigerX

Siberian Tiger

7

280


Step 11: R Cleaning Workflow

library(janitor)

library(lubridate)

library(stringr

tiger_clean <- tiger_raw %>%

  clean_names() %>%

  mutate(tiger_name=str_to_title(str_trim(tiger_name)),

         subspecies=str_to_upper(str_trim(subspecies)),

                age=if_else(age<0 | age>30,

                NA_real_,as.numeric(age)),

          weight_kg=abs(weight_kg),

        subspecies=replace(subspecies,subspecies=="NULL",

                       "UNKNOWN")

  ) %>%

  distinct(tiger_id,.keep_all=TRUE)

OUTPUT:

tiger_id

tiger_name

subspecies

age

weight_kg

T001

Sheru

BENGAL TIGER

8

220

T002

Raja

BENGAL TIGER

9

210

T003

Maya

UNKNOWN

         NA

190

T004

Khan

SIBERIAN TIGER

10

300

T005

Sheru

BENGAL TIGER

8

220

T006

Leo

SUMATRAN TIGER

         NA

140

T007

Max

MALAYAN TIGER

12

150

T008

Rocky

SOUTH CHINA TIGER

9

170

T009

Tigerx

SIBERIAN TIGER

7

280

T010

Tigerx

SIBERIAN TIGER

7

280

Explanation

The tidyverse workflow mirrors SAS cleaning logic. mutate() performs transformations, str_trim() removes whitespace, str_to_upper() standardizes text, if_else() validates age, abs() fixes weights, and distinct() removes duplicates. This declarative style makes R highly flexible for exploratory and production analytics.

SAS vs R Cleaning Comparison

Task

SAS

R

Standardize Text

PROPCASE

str_to_title

Trim Spaces

STRIP

str_trim

Missing Values

NMISS

is.na

Deduplicate

PROC SORT

distinct

Joins

PROC SQL

left_join

Formats

PROC FORMAT

factor/case_when

Date Handling

INTNX

lubridate

Validation

DATA Step

mutate

Enterprise Validation & Compliance

For SDTM and ADaM environments:

  • Every transformation must be traceable.
  • Audit trails must document all changes.
  • Independent QC programmers must reproduce results.
  • Metadata must align with Define.xml specifications.
  • Regulatory reviewers expect consistent derivation logic.

Critical SAS rule:

if age=. then ...

Missing numeric values in SAS are treated lower than valid numbers.

Example:

if age<18 then flag='Y';

Missing ages will incorrectly become flagged.

Correct:

if age ne . and age<18 then flag='Y';

This single mistake has caused numerous production validation findings.

Business Logic Behind Cleaning

Data cleaning exists because raw operational data rarely reflects reality. Missing values may result from delayed entry, system outages, or incomplete source records. Age values above biological limits indicate entry errors, while negative weights often arise from sign reversal mistakes during file transfers. Standardizing dates ensures consistent chronological calculations, enabling accurate duration, trend, and survival analyses. Text normalization removes unnecessary variability that would otherwise split identical categories into multiple groups. For example, "BENGAL TIGER", "Bengal Tiger", and " bengal tiger " should represent one category. Email validation supports researcher communication workflows and audit readiness. Missing observation dates may require imputation or exclusion depending on business rules. Consistent region coding improves geographic reporting accuracy. Every correction must follow documented business logic, ensuring reproducibility and traceability. Without standardization, analytical outputs become unreliable, executive dashboards become misleading, and AI models learn incorrect patterns.

20 Data Cleaning Best Practices

  1. Profile before cleaning.
  2. Validate metadata.
  3. Standardize variable naming.
  4. Use audit trails.
  5. Create reusable macros.
  6. Validate source-to-target mappings.
  7. Remove duplicates early.
  8. Check date formats.
  9. Validate email structures.
  10. Standardize categorical values.
  11. Flag unrealistic numeric values.
  12. Document assumptions.
  13. Maintain lineage records.
  14. Apply QC independently.
  15. Automate validation rules.
  16. Avoid hardcoding.
  17. Use metadata-driven processing.
  18. Archive raw data.
  19. Version control code.
  20. Reconcile outputs before deployment.

20 One-Line Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic is stronger than visual inspection.
  • Metadata drives consistency.
  • Deduplication improves trust.
  • Standardized variables improve reproducibility.
  • Missing values deserve investigation.
  • Governance starts with profiling.
  • Auditability matters.
  • Clean inputs create reliable outputs.
  • Traceability protects compliance.
  • Macros improve scalability.
  • Defensive programming reduces risk.
  • Formats improve readability.
  • Documentation prevents confusion.
  • QC is not optional.
  • Date consistency matters.
  • Text normalization reduces noise.
  • Automation reduces human error.
  • Reliable data powers AI.
  • Analytics is only as good as the source.

Summary: SAS vs R for Enterprise Data Cleaning

SAS excels in regulated environments requiring auditability, traceability, repeatability, and compliance. DATA Step processing remains one of the most powerful row-wise transformation engines available. PROC SQL integrates relational logic efficiently, while PROC FORMAT, PROC REPORT, and ODS provide enterprise reporting capabilities. R provides exceptional flexibility through tidyverse packages, rapid prototyping, visualization, and advanced analytics. Together, SAS and R form a complementary ecosystem where SAS delivers governance and production stability while R contributes innovation and analytical agility. Organizations leveraging both technologies gain scalability, transparency, reproducibility, and performance. The strongest enterprise teams understand not merely how to clean data, but how to build repeatable frameworks that convert operational chaos into analytical intelligence.

Conclusion

The Best Tigers in the World dataset demonstrates a universal truth across wildlife conservation, clinical trials, banking, insurance, and retail analytics: raw data is rarely analysis-ready. Duplicate identifiers, malformed text, invalid dates, unrealistic measurements, and missing values can quietly undermine reporting accuracy, predictive models, regulatory submissions, and executive decision-making. Effective data engineering is therefore not a technical luxury it is a business necessity.

A structured cleaning framework begins with profiling, metadata inspection, and validation before progressing into standardization, correction, enrichment, and reporting. SAS provides industrial-strength capabilities through DATA Step programming, PROC SQL, PROC FORMAT, PROC REPORT, and reusable macros. R complements these strengths through tidyverse-driven flexibility, rapid transformation pipelines, and modern analytical workflows. Together they provide an end-to-end ecosystem for producing trusted datasets.

The most successful organizations treat data cleaning as an engineered process rather than a one-time activity. They implement audit trails, independent QC, metadata governance, lineage tracking, standardized macros, validation checklists, and production controls. Whether the objective is SDTM compliance, financial risk analysis, insurance reporting, retail forecasting, or wildlife conservation intelligence, trustworthy analytics always begin with trustworthy data.

When clean data becomes a strategic asset, dashboards become credible, statistical outputs become reproducible, AI predictions become dependable, and decision-makers gain confidence. That is the real power of transforming raw operational data into analysis-ready intelligence using SAS and R.

Interview Questions and Answers

1. How would you identify duplicate tiger records?

Answer: Use PROC SORT NODUPKEY in SAS or distinct() in R using Tiger_ID as the business key.

2. Why is LENGTH placement important?

Answer: SAS assigns character length during compilation. Incorrect placement can truncate data permanently.

3. How do you validate malformed emails?

Answer: Use INDEX, FIND, VERIFY, PRXMATCH in SAS and grepl() in R.

4. Why is missing-value handling critical in SAS?

Answer: Missing numeric values are treated lower than valid values and can unintentionally satisfy conditional logic.

5. When would you choose PROC SQL over DATA Step?

Answer: PROC SQL is preferred for joins, aggregations, and relational processing; DATA Step excels in row-wise transformations and complex derivations.

---------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about TIGERS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS