Global Fruit Dataset Chaos into Analysis-Ready SAS and R Pipelines for Regulatory-Grade Reporting Excellence

Introduction

In real-world analytics environments, dirty data is not just an inconvenience it is a direct business risk. I have seen pharmaceutical companies delay clinical submissions because patient visit dates were corrupted. I have seen insurance firms reject valid claims because region codes were inconsistent. I have seen banking fraud systems fail because transaction IDs were duplicated.

Now imagine a global fruit-export intelligence platform collecting operational data from multiple countries about rare fruits such as Durian, Rambutan, Mangosteen, Miracle Fruit, Buddha’s Hand, Jabuticaba, Salak, and Ackee. The business wants to combine retail intelligence, shipment quality, healthcare nutrition research, and customer analytics into one centralized reporting system.

Unfortunately, the incoming datasets are corrupted.

Some fruit shipment IDs are duplicated.
Some supplier emails are malformed.
Some harvest dates are invalid.
Some fruit prices are negative.
Some region codes contain whitespace corruption.
Some category labels are inconsistent like:

tropical
Tropical
TROPI CAL
NULL
???

This type of poor-quality data creates catastrophic downstream problems:

Executive dashboards become misleading
AI prediction models learn incorrect patterns
Statistical summaries become unreliable
Regulatory reports fail validation
Clinical nutrition studies lose traceability
Enterprise reporting becomes inconsistent

This is where enterprise-grade SAS and R data engineering becomes critical.

Business Crisis Scenario

A multinational nutrition-research organization launched a global study analyzing rare fruits and their impact on metabolic health. The platform merged:

Retail sales
Clinical nutrition records
Shipment data
Supplier intelligence
Regional inventory systems

After deployment, executives discovered shocking inconsistencies:

Duplicate Fruit IDs inflated sales
Invalid harvest dates broke time-series forecasts
Negative prices caused financial losses in dashboards
Missing regions corrupted geographical analysis
Incorrect supplier emails blocked compliance alerts
Mixed categorical labels failed reporting aggregation

Even worse, machine-learning models incorrectly predicted demand because corrupted records distorted historical patterns.

This is exactly why enterprise cleaning workflows are mandatory.

1.Raw SAS Dataset with Intentional Errors

data fruit_raw;

length Fruit_ID $8 Fruit_Name $25 Country $20 Category $20

Supplier_Email $50 Region_Code $12 Harvest_Date $20

Price $12 Rating $8;

infile datalines dlm='|' truncover;

input Fruit_ID $ Fruit_Name $ Country $ Category $

Harvest_Date $ Price $ Supplier_Email $ Region_Code $

Rating $;

datalines;

FR001|Durian|Thailand|Tropical|12JAN2025|450|durian@gmail.com| ap01 |4.5

FR001|durian|THAILAND|tropi cal|31FEB2025|-450|wrongmail.com|AP01|5

FR002|Mangosteen|India|NULL|15MAR2025|550|mango@test.com|IN01|4.8

FR003|Rambutan|Malaysia|Tropical|.|700|rambutan#mail.com|MY 02|abc

FR004|Ackee|Jamaica|Exotic|21APR2025|-100|ackee@gmail|JM01|3.7

FR005|MiracleFruit|Ghana|???|15MAY2025|850| miracle@gmail.com |GH01|4.9

FR006|Salak|Indonesia|Tropical|44DEC2025|620|salak@gmail.com|ID01|5.2

FR007|Jabuticaba|Brazil|Berry|17JUN2025|NULL|jab@gmail.com|BR01|4.4

FR008|BuddhasHand|China|Citrus|19JUL2025|710|buddha@mail.com| CN01|4.6

FR009|Cherimoya|Peru|tropical|20AUG2025|800|cherimoya@gmail.com|PE01|4.1

FR010|Langsat|Malaysia|TROPICAL|22SEP2025|780|langsat@gmail.com|MY01|4.0

FR011|Soursop|India|Herbal|23OCT2025|-900|soursopgmail.com|IN01|2.1

FR012|Feijoa|Brazil|Berry|24NOV2025|640|fei@gmail.com|BR01|4.2

FR013|Cupuacu|Brazil|NULL|25DEC2025|770|cupuacu@gmail.com|BR02|4.5

FR014|Pulasan|Malaysia|Tropical|15JAN2025|720|pulasan@gmail.com|MY03|4.7

FR015|SnakeFruit|Indonesia|Tropical|16FEB2025|680|snake@gmail.com|ID01|4.3

;

run;

proc print data = fruit_raw;

run;

OUTPUT:


Obs	Fruit_ID	Fruit_Name	Country	Category	Supplier_Email	Region_Code	Harvest_Date	Price	Rating
1	FR001	Durian	Thailand	Tropical	durian@gmail.com	ap01	12JAN2025	450	4.5
2	FR001	durian	THAILAND	tropi cal	wrongmail.com	AP01	31FEB2025	-450	5
3	FR002	Mangosteen	India	NULL	mango@test.com	IN01	15MAR2025	550	4.8
4	FR003	Rambutan	Malaysia	Tropical	rambutan#mail.com	MY 02		700	abc
5	FR004	Ackee	Jamaica	Exotic	ackee@gmail	JM01	21APR2025	-100	3.7
6	FR005	MiracleFruit	Ghana	???	miracle@gmail.com	GH01	15MAY2025	850	4.9
7	FR006	Salak	Indonesia	Tropical	salak@gmail.com	ID01	44DEC2025	620	5.2
8	FR007	Jabuticaba	Brazil	Berry	jab@gmail.com	BR01	17JUN2025	NULL	4.4
9	FR008	BuddhasHand	China	Citrus	buddha@mail.com	CN01	19JUL2025	710	4.6
10	FR009	Cherimoya	Peru	tropical	cherimoya@gmail.com	PE01	20AUG2025	800	4.1
11	FR010	Langsat	Malaysia	TROPICAL	langsat@gmail.com	MY01	22SEP2025	780	4.0
12	FR011	Soursop	India	Herbal	soursopgmail.com	IN01	23OCT2025	-900	2.1
13	FR012	Feijoa	Brazil	Berry	fei@gmail.com	BR01	24NOV2025	640	4.2
14	FR013	Cupuacu	Brazil	NULL	cupuacu@gmail.com	BR02	25DEC2025	770	4.5
15	FR014	Pulasan	Malaysia	Tropical	pulasan@gmail.com	MY03	15JAN2025	720	4.7
16	FR015	SnakeFruit	Indonesia	Tropical	snake@gmail.com	ID01	16FEB2025	680	4.3

Why LENGTH Must Come Before Assignments

One of the biggest SAS mistakes beginners make is placing the LENGTH statement after assignments.

Incorrect:

Fruit_Status="Excellent";

length Fruit_Status $5;

This truncates the value to:

Excel

because SAS already assigned default length before LENGTH executed.

Correct:

length Fruit_Status $20;

Fruit_Status="Excellent";

In enterprise clinical trials, truncation can destroy SDTM mapping integrity.

R behaves differently because character vectors dynamically resize memory allocation. SAS, however, fixes variable length at compile time.

That difference is critical.

2.SAS Enterprise Cleaning Workflow

data fruit_clean;

retain Data_Source "GLOBAL_FRUIT_SYSTEM";

set fruit_raw;

length Clean_Category $20 Clean_Email $60 Parsed_Date 8;

format Parsed_Date yymmdd10.;

Fruit_Name=propcase(strip(Fruit_Name));

Country=upcase(compbl(strip(Country)));

Category=upcase(compress(Category));

if Category in ("NULL","???","") then

Clean_Category="UNKNOWN";

else Clean_Category=Category;

Price_Num=input(compress(Price),best12.);

if Price_Num < 0 then Price_Num=abs(Price_Num);

if missing(Price_Num) then

Price_Num=round(500 + rannor(2)*100,.01);

Parsed_Date=input(Harvest_Date,date9.);

if missing(Parsed_Date) then

Parsed_Date=today();

Clean_Email=strip(lowcase(Supplier_Email));

if index(Clean_Email,'@')=0 then

Clean_Email="invalid@email.com";

Region_Code=upcase(compress(Region_Code));

Rating_Num=input(Rating,best12.);

if Rating_Num > 5 then Rating_Num=5;

if Rating_Num=. then Rating_Num=3.5;

Days_To_Harvest=intck('day',today(),Parsed_Date);

drop Harvest_Date Price Rating Supplier_Email Category;

rename Parsed_Date = Harvest_Date

Price_Num = Price

Rating_Num = Rating

Clean_Email = Supplier_Email

Clean_Category = Category;

run;

proc print data = fruit_clean;

run;

OUTPUT:


Obs	Data_Source	Fruit_ID	Fruit_Name	Country	Region_Code	Category	Supplier_Email	Harvest_Date	Price	Rating	Days_To_Harvest
1	GLOBAL_FRUIT_SYSTEM	FR001	Durian	THAILAND	AP01	TROPICAL	durian@gmail.com	2025-01-12	450.00	4.5	-510
2	GLOBAL_FRUIT_SYSTEM	FR001	Durian	THAILAND	AP01	TROPICAL	invalid@email.com	2026-06-06	450.00	5.0	0
3	GLOBAL_FRUIT_SYSTEM	FR002	Mangosteen	INDIA	IN01	UNKNOWN	mango@test.com	2025-03-15	550.00	4.8	-448
4	GLOBAL_FRUIT_SYSTEM	FR003	Rambutan	MALAYSIA	MY02	TROPICAL	invalid@email.com	2026-06-06	700.00	3.5	0
5	GLOBAL_FRUIT_SYSTEM	FR004	Ackee	JAMAICA	JM01	EXOTIC	ackee@gmail	2025-04-21	100.00	3.7	-411
6	GLOBAL_FRUIT_SYSTEM	FR005	Miraclefruit	GHANA	GH01	UNKNOWN	miracle@gmail.com	2025-05-15	850.00	4.9	-387
7	GLOBAL_FRUIT_SYSTEM	FR006	Salak	INDONESIA	ID01	TROPICAL	salak@gmail.com	2026-06-06	620.00	5.0	0
8	GLOBAL_FRUIT_SYSTEM	FR007	Jabuticaba	BRAZIL	BR01	BERRY	jab@gmail.com	2025-06-17	631.18	4.4	-354
9	GLOBAL_FRUIT_SYSTEM	FR008	Buddhashand	CHINA	CN01	CITRUS	buddha@mail.com	2025-07-19	710.00	4.6	-322
10	GLOBAL_FRUIT_SYSTEM	FR009	Cherimoya	PERU	PE01	TROPICAL	cherimoya@gmail.com	2025-08-20	800.00	4.1	-290
11	GLOBAL_FRUIT_SYSTEM	FR010	Langsat	MALAYSIA	MY01	TROPICAL	langsat@gmail.com	2025-09-22	780.00	4.0	-257
12	GLOBAL_FRUIT_SYSTEM	FR011	Soursop	INDIA	IN01	HERBAL	invalid@email.com	2025-10-23	900.00	2.1	-226
13	GLOBAL_FRUIT_SYSTEM	FR012	Feijoa	BRAZIL	BR01	BERRY	fei@gmail.com	2025-11-24	640.00	4.2	-194
14	GLOBAL_FRUIT_SYSTEM	FR013	Cupuacu	BRAZIL	BR02	UNKNOWN	cupuacu@gmail.com	2025-12-25	770.00	4.5	-163
15	GLOBAL_FRUIT_SYSTEM	FR014	Pulasan	MALAYSIA	MY03	TROPICAL	pulasan@gmail.com	2025-01-15	720.00	4.7	-507
16	GLOBAL_FRUIT_SYSTEM	FR015	Snakefruit	INDONESIA	ID01	TROPICAL	snake@gmail.com	2025-02-16	680.00	4.3	-475

Explanation and Key Points

This DATA step demonstrates real-world production-grade cleaning logic. We used:

RETAIN for metadata persistence
PROPCASE for standardized names
COMPRESS for whitespace corruption
INPUT for character-to-numeric conversion
ABS for correcting negative values
INDEX for email validation
INTCK for temporal analytics
ROUND for financial precision

In regulated clinical environments, similar logic is applied to patient-level SDTM and ADaM derivations. Improper conversions can create FDA submission failures. This step converts chaotic operational intelligence into analysis-ready structured data.

3.PROC SORT Deduplication

proc sort data=fruit_clean nodupkey out=fruit_nodup;

by Fruit_ID;

run;

proc print data = fruit_nodup;

run;

LOG:

 NOTE: There were 16 observations read from the data set WORK.FRUIT_CLEAN.

 NOTE: 1 observations with duplicate key values were deleted.

 NOTE: The data set WORK.FRUIT_NODUP has 15 observations and 11 variables.

OUTPUT:


Obs	Data_Source	Fruit_ID	Fruit_Name	Country	Region_Code	Category	Supplier_Email	Harvest_Date	Price	Rating	Days_To_Harvest
1	GLOBAL_FRUIT_SYSTEM	FR001	Durian	THAILAND	AP01	TROPICAL	durian@gmail.com	2025-01-12	450.00	4.5	-510
2	GLOBAL_FRUIT_SYSTEM	FR002	Mangosteen	INDIA	IN01	UNKNOWN	mango@test.com	2025-03-15	550.00	4.8	-448
3	GLOBAL_FRUIT_SYSTEM	FR003	Rambutan	MALAYSIA	MY02	TROPICAL	invalid@email.com	2026-06-06	700.00	3.5	0
4	GLOBAL_FRUIT_SYSTEM	FR004	Ackee	JAMAICA	JM01	EXOTIC	ackee@gmail	2025-04-21	100.00	3.7	-411
5	GLOBAL_FRUIT_SYSTEM	FR005	Miraclefruit	GHANA	GH01	UNKNOWN	miracle@gmail.com	2025-05-15	850.00	4.9	-387
6	GLOBAL_FRUIT_SYSTEM	FR006	Salak	INDONESIA	ID01	TROPICAL	salak@gmail.com	2026-06-06	620.00	5.0	0
7	GLOBAL_FRUIT_SYSTEM	FR007	Jabuticaba	BRAZIL	BR01	BERRY	jab@gmail.com	2025-06-17	631.18	4.4	-354
8	GLOBAL_FRUIT_SYSTEM	FR008	Buddhashand	CHINA	CN01	CITRUS	buddha@mail.com	2025-07-19	710.00	4.6	-322
9	GLOBAL_FRUIT_SYSTEM	FR009	Cherimoya	PERU	PE01	TROPICAL	cherimoya@gmail.com	2025-08-20	800.00	4.1	-290
10	GLOBAL_FRUIT_SYSTEM	FR010	Langsat	MALAYSIA	MY01	TROPICAL	langsat@gmail.com	2025-09-22	780.00	4.0	-257
11	GLOBAL_FRUIT_SYSTEM	FR011	Soursop	INDIA	IN01	HERBAL	invalid@email.com	2025-10-23	900.00	2.1	-226
12	GLOBAL_FRUIT_SYSTEM	FR012	Feijoa	BRAZIL	BR01	BERRY	fei@gmail.com	2025-11-24	640.00	4.2	-194
13	GLOBAL_FRUIT_SYSTEM	FR013	Cupuacu	BRAZIL	BR02	UNKNOWN	cupuacu@gmail.com	2025-12-25	770.00	4.5	-163
14	GLOBAL_FRUIT_SYSTEM	FR014	Pulasan	MALAYSIA	MY03	TROPICAL	pulasan@gmail.com	2025-01-15	720.00	4.7	-507
15	GLOBAL_FRUIT_SYSTEM	FR015	Snakefruit	INDONESIA	ID01	TROPICAL	snake@gmail.com	2025-02-16	680.00	4.3	-475

Explanation and Key Points

PROC SORT NODUPKEY removes duplicate business keys. Duplicate IDs can inflate revenue calculations, patient counts, or insurance claims. In enterprise analytics, deduplication is one of the first QC checkpoints. Clinical studies particularly require unique subject identifiers for traceability and regulatory integrity.

4.PROC SQL Validation

proc sql;

create table fruit_summary as

select Country,Category,

count(*) as Total_Records,

avg(Price) as Avg_Price,

avg(Rating) as Avg_Rating

from fruit_nodup

group by Country,Category;

quit;

proc print data = fruit_summary;

run;

OUTPUT:


Obs	Country	Category	Total_Records	Avg_Price	Avg_Rating
1	BRAZIL	BERRY	2	635.590	4.30000
2	BRAZIL	UNKNOWN	1	770.000	4.50000
3	CHINA	CITRUS	1	710.000	4.60000
4	GHANA	UNKNOWN	1	850.000	4.90000
5	INDIA	HERBAL	1	900.000	2.10000
6	INDIA	UNKNOWN	1	550.000	4.80000
7	INDONESIA	TROPICAL	2	650.000	4.65000
8	JAMAICA	EXOTIC	1	100.000	3.70000
9	MALAYSIA	TROPICAL	3	733.333	4.06667
10	PERU	TROPICAL	1	800.000	4.10000
11	THAILAND	TROPICAL	1	450.000	4.50000

Explanation and Key Points

PROC SQL provides relational-style aggregation logic similar to enterprise databases. This workflow helps:

summarize operational intelligence
identify country-level anomalies
support executive dashboards
create analytical reporting layers

Compared to DATA step BY-group processing, PROC SQL is often easier for multidimensional aggregations and joins.

Advanced SAS Techniques

5.ARRAY Cleaning

data fruit_array_clean;

set fruit_nodup;

array chars(*) Fruit_Name Country Category;

do i=1 to dim(chars);

chars(i)=strip(upcase(chars(i)));

end;

drop i;

run;

proc print data = fruit_array_clean;

run;

OUTPUT:


Obs	Data_Source	Fruit_ID	Fruit_Name	Country	Region_Code	Category	Supplier_Email	Harvest_Date	Price	Rating	Days_To_Harvest
1	GLOBAL_FRUIT_SYSTEM	FR001	DURIAN	THAILAND	AP01	TROPICAL	durian@gmail.com	2025-01-12	450.00	4.5	-510
2	GLOBAL_FRUIT_SYSTEM	FR002	MANGOSTEEN	INDIA	IN01	UNKNOWN	mango@test.com	2025-03-15	550.00	4.8	-448
3	GLOBAL_FRUIT_SYSTEM	FR003	RAMBUTAN	MALAYSIA	MY02	TROPICAL	invalid@email.com	2026-06-06	700.00	3.5	0
4	GLOBAL_FRUIT_SYSTEM	FR004	ACKEE	JAMAICA	JM01	EXOTIC	ackee@gmail	2025-04-21	100.00	3.7	-411
5	GLOBAL_FRUIT_SYSTEM	FR005	MIRACLEFRUIT	GHANA	GH01	UNKNOWN	miracle@gmail.com	2025-05-15	850.00	4.9	-387
6	GLOBAL_FRUIT_SYSTEM	FR006	SALAK	INDONESIA	ID01	TROPICAL	salak@gmail.com	2026-06-06	620.00	5.0	0
7	GLOBAL_FRUIT_SYSTEM	FR007	JABUTICABA	BRAZIL	BR01	BERRY	jab@gmail.com	2025-06-17	631.18	4.4	-354
8	GLOBAL_FRUIT_SYSTEM	FR008	BUDDHASHAND	CHINA	CN01	CITRUS	buddha@mail.com	2025-07-19	710.00	4.6	-322
9	GLOBAL_FRUIT_SYSTEM	FR009	CHERIMOYA	PERU	PE01	TROPICAL	cherimoya@gmail.com	2025-08-20	800.00	4.1	-290
10	GLOBAL_FRUIT_SYSTEM	FR010	LANGSAT	MALAYSIA	MY01	TROPICAL	langsat@gmail.com	2025-09-22	780.00	4.0	-257
11	GLOBAL_FRUIT_SYSTEM	FR011	SOURSOP	INDIA	IN01	HERBAL	invalid@email.com	2025-10-23	900.00	2.1	-226
12	GLOBAL_FRUIT_SYSTEM	FR012	FEIJOA	BRAZIL	BR01	BERRY	fei@gmail.com	2025-11-24	640.00	4.2	-194
13	GLOBAL_FRUIT_SYSTEM	FR013	CUPUACU	BRAZIL	BR02	UNKNOWN	cupuacu@gmail.com	2025-12-25	770.00	4.5	-163
14	GLOBAL_FRUIT_SYSTEM	FR014	PULASAN	MALAYSIA	MY03	TROPICAL	pulasan@gmail.com	2025-01-15	720.00	4.7	-507
15	GLOBAL_FRUIT_SYSTEM	FR015	SNAKEFRUIT	INDONESIA	ID01	TROPICAL	snake@gmail.com	2025-02-16	680.00	4.3	-475

Explanation

This program demonstrates one of the most powerful DATA Step techniques in SAS called ARRAY processing. Arrays allow programmers to apply the same logic to multiple variables efficiently without repeatedly writing individual statements.

In this example, the array named chars(*) contains three character variables:

Fruit_Name
Country
Clean_Category

The (*) tells SAS to automatically determine the number of variables inside the array.

The DO loop:

do i = 1 to dim(chars);

iterates through each variable in the array dynamically. The DIM() function returns the total number of variables inside the array.

Inside the loop:

chars(i) = strip(upcase(chars(i)));

two cleaning operations occur:

UPCASE() converts text to uppercase
STRIP() removes leading and trailing spaces

For example:

Before Cleaning	After Cleaning
durian	DURIAN
thailand	THAILAND
tropical	TROPICAL

Finally:

drop i;

removes the temporary loop counter variable from the final dataset.

Key Points

Arrays reduce repetitive coding
Useful for bulk variable transformations
Improves maintainability in enterprise projects
Frequently used in SDTM and ADaM standardization
DIM() makes loops dynamic and scalable
Excellent for cleaning large clinical datasets

6.SELECT-WHEN Logic

data fruit_segment;

set fruit_array_clean;

Length Segment $12;

select;

when (Price >=800) Segment="PREMIUM";

when (Price >=600) Segment="STANDARD";

otherwise Segment="BASIC";

end;

run;

proc print data = fruit_segment;

run;

OUTPUT:


Obs	Data_Source	Fruit_ID	Fruit_Name	Country	Region_Code	Category	Supplier_Email	Harvest_Date	Price	Rating	Days_To_Harvest	Segment
1	GLOBAL_FRUIT_SYSTEM	FR001	DURIAN	THAILAND	AP01	TROPICAL	durian@gmail.com	2025-01-12	450.00	4.5	-510	BASIC
2	GLOBAL_FRUIT_SYSTEM	FR002	MANGOSTEEN	INDIA	IN01	UNKNOWN	mango@test.com	2025-03-15	550.00	4.8	-448	BASIC
3	GLOBAL_FRUIT_SYSTEM	FR003	RAMBUTAN	MALAYSIA	MY02	TROPICAL	invalid@email.com	2026-06-06	700.00	3.5	0	STANDARD
4	GLOBAL_FRUIT_SYSTEM	FR004	ACKEE	JAMAICA	JM01	EXOTIC	ackee@gmail	2025-04-21	100.00	3.7	-411	BASIC
5	GLOBAL_FRUIT_SYSTEM	FR005	MIRACLEFRUIT	GHANA	GH01	UNKNOWN	miracle@gmail.com	2025-05-15	850.00	4.9	-387	PREMIUM
6	GLOBAL_FRUIT_SYSTEM	FR006	SALAK	INDONESIA	ID01	TROPICAL	salak@gmail.com	2026-06-06	620.00	5.0	0	STANDARD
7	GLOBAL_FRUIT_SYSTEM	FR007	JABUTICABA	BRAZIL	BR01	BERRY	jab@gmail.com	2025-06-17	631.18	4.4	-354	STANDARD
8	GLOBAL_FRUIT_SYSTEM	FR008	BUDDHASHAND	CHINA	CN01	CITRUS	buddha@mail.com	2025-07-19	710.00	4.6	-322	STANDARD
9	GLOBAL_FRUIT_SYSTEM	FR009	CHERIMOYA	PERU	PE01	TROPICAL	cherimoya@gmail.com	2025-08-20	800.00	4.1	-290	PREMIUM
10	GLOBAL_FRUIT_SYSTEM	FR010	LANGSAT	MALAYSIA	MY01	TROPICAL	langsat@gmail.com	2025-09-22	780.00	4.0	-257	STANDARD
11	GLOBAL_FRUIT_SYSTEM	FR011	SOURSOP	INDIA	IN01	HERBAL	invalid@email.com	2025-10-23	900.00	2.1	-226	PREMIUM
12	GLOBAL_FRUIT_SYSTEM	FR012	FEIJOA	BRAZIL	BR01	BERRY	fei@gmail.com	2025-11-24	640.00	4.2	-194	STANDARD
13	GLOBAL_FRUIT_SYSTEM	FR013	CUPUACU	BRAZIL	BR02	UNKNOWN	cupuacu@gmail.com	2025-12-25	770.00	4.5	-163	STANDARD
14	GLOBAL_FRUIT_SYSTEM	FR014	PULASAN	MALAYSIA	MY03	TROPICAL	pulasan@gmail.com	2025-01-15	720.00	4.7	-507	STANDARD
15	GLOBAL_FRUIT_SYSTEM	FR015	SNAKEFRUIT	INDONESIA	ID01	TROPICAL	snake@gmail.com	2025-02-16	680.00	4.3	-475	STANDARD

Explanation

This program demonstrates the SELECT-WHEN statement in SAS, which is similar to:

IF-ELSE logic
CASE statements in SQL
case_when() in R

The purpose of this code is to categorize fruits into business segments based on their price.

Step-by-Step Logic

Condition 1

when (Price_Num >= 800)

If the fruit price is greater than or equal to 800, SAS assigns:

Segment = "PREMIUM";

Condition 2

when (Price_Num >= 600)

If the first condition fails but the price is at least 600, SAS assigns:

Segment = "STANDARD";

Otherwise Condition

Otherwise Segment = "BASIC";

If none of the above conditions are true, the fruit is classified as BASIC.

Example Output

Fruit_Name	Price_Num	Segment
DURIAN	850	PREMIUM
RAMBUTAN	650	STANDARD
ACKEE	300	BASIC

Why SELECT-WHEN is Preferred

Compared to long IF-ELSE chains, SELECT-WHEN:

improves readability
simplifies debugging
provides cleaner business-rule implementation
is easier to maintain in production systems

Key Points

Cleaner alternative to multiple IF statements
Excellent for business categorization
Common in risk scoring systems
Used in banking, insurance, and clinical classification logic
Easier to validate during QC review

7.FIRST./LAST. Processing

proc sort data=fruit_segment;

by Country;

run;

proc print data = fruit_segment;

run;

OUTPUT:


Obs	Data_Source	Fruit_ID	Fruit_Name	Country	Region_Code	Category	Supplier_Email	Harvest_Date	Price	Rating	Days_To_Harvest	Segment
1	GLOBAL_FRUIT_SYSTEM	FR007	JABUTICABA	BRAZIL	BR01	BERRY	jab@gmail.com	2025-06-17	631.18	4.4	-354	STANDARD
2	GLOBAL_FRUIT_SYSTEM	FR012	FEIJOA	BRAZIL	BR01	BERRY	fei@gmail.com	2025-11-24	640.00	4.2	-194	STANDARD
3	GLOBAL_FRUIT_SYSTEM	FR013	CUPUACU	BRAZIL	BR02	UNKNOWN	cupuacu@gmail.com	2025-12-25	770.00	4.5	-163	STANDARD
4	GLOBAL_FRUIT_SYSTEM	FR008	BUDDHASHAND	CHINA	CN01	CITRUS	buddha@mail.com	2025-07-19	710.00	4.6	-322	STANDARD
5	GLOBAL_FRUIT_SYSTEM	FR005	MIRACLEFRUIT	GHANA	GH01	UNKNOWN	miracle@gmail.com	2025-05-15	850.00	4.9	-387	PREMIUM
6	GLOBAL_FRUIT_SYSTEM	FR002	MANGOSTEEN	INDIA	IN01	UNKNOWN	mango@test.com	2025-03-15	550.00	4.8	-448	BASIC
7	GLOBAL_FRUIT_SYSTEM	FR011	SOURSOP	INDIA	IN01	HERBAL	invalid@email.com	2025-10-23	900.00	2.1	-226	PREMIUM
8	GLOBAL_FRUIT_SYSTEM	FR006	SALAK	INDONESIA	ID01	TROPICAL	salak@gmail.com	2026-06-06	620.00	5.0	0	STANDARD
9	GLOBAL_FRUIT_SYSTEM	FR015	SNAKEFRUIT	INDONESIA	ID01	TROPICAL	snake@gmail.com	2025-02-16	680.00	4.3	-475	STANDARD
10	GLOBAL_FRUIT_SYSTEM	FR004	ACKEE	JAMAICA	JM01	EXOTIC	ackee@gmail	2025-04-21	100.00	3.7	-411	BASIC
11	GLOBAL_FRUIT_SYSTEM	FR003	RAMBUTAN	MALAYSIA	MY02	TROPICAL	invalid@email.com	2026-06-06	700.00	3.5	0	STANDARD
12	GLOBAL_FRUIT_SYSTEM	FR010	LANGSAT	MALAYSIA	MY01	TROPICAL	langsat@gmail.com	2025-09-22	780.00	4.0	-257	STANDARD
13	GLOBAL_FRUIT_SYSTEM	FR014	PULASAN	MALAYSIA	MY03	TROPICAL	pulasan@gmail.com	2025-01-15	720.00	4.7	-507	STANDARD
14	GLOBAL_FRUIT_SYSTEM	FR009	CHERIMOYA	PERU	PE01	TROPICAL	cherimoya@gmail.com	2025-08-20	800.00	4.1	-290	PREMIUM
15	GLOBAL_FRUIT_SYSTEM	FR001	DURIAN	THAILAND	AP01	TROPICAL	durian@gmail.com	2025-01-12	450.00	4.5	-510	BASIC

data country_counts;

set fruit_segment;

by Country;

retain Count;

if first.Country then Count=0;

Count+1;

if last.Country;

run;

proc print data = country_counts;

run;

OUTPUT:


Obs	Data_Source	Fruit_ID	Fruit_Name	Country	Region_Code	Category	Supplier_Email	Harvest_Date	Price	Rating	Days_To_Harvest	Segment	Count
1	GLOBAL_FRUIT_SYSTEM	FR013	CUPUACU	BRAZIL	BR02	UNKNOWN	cupuacu@gmail.com	2025-12-25	770	4.5	-163	STANDARD	3
2	GLOBAL_FRUIT_SYSTEM	FR008	BUDDHASHAND	CHINA	CN01	CITRUS	buddha@mail.com	2025-07-19	710	4.6	-322	STANDARD	1
3	GLOBAL_FRUIT_SYSTEM	FR005	MIRACLEFRUIT	GHANA	GH01	UNKNOWN	miracle@gmail.com	2025-05-15	850	4.9	-387	PREMIUM	1
4	GLOBAL_FRUIT_SYSTEM	FR011	SOURSOP	INDIA	IN01	HERBAL	invalid@email.com	2025-10-23	900	2.1	-226	PREMIUM	2
5	GLOBAL_FRUIT_SYSTEM	FR015	SNAKEFRUIT	INDONESIA	ID01	TROPICAL	snake@gmail.com	2025-02-16	680	4.3	-475	STANDARD	2
6	GLOBAL_FRUIT_SYSTEM	FR004	ACKEE	JAMAICA	JM01	EXOTIC	ackee@gmail	2025-04-21	100	3.7	-411	BASIC	1
7	GLOBAL_FRUIT_SYSTEM	FR014	PULASAN	MALAYSIA	MY03	TROPICAL	pulasan@gmail.com	2025-01-15	720	4.7	-507	STANDARD	3
8	GLOBAL_FRUIT_SYSTEM	FR009	CHERIMOYA	PERU	PE01	TROPICAL	cherimoya@gmail.com	2025-08-20	800	4.1	-290	PREMIUM	1
9	GLOBAL_FRUIT_SYSTEM	FR001	DURIAN	THAILAND	AP01	TROPICAL	durian@gmail.com	2025-01-12	450	4.5	-510	BASIC	1

Explanation

This program demonstrates one of the most important SAS concepts called BY-group processing using:

FIRST.variable
LAST.variable

These are temporary automatic variables created by SAS whenever a dataset is processed with a BY statement.

Step 1.Sorting Requirement

proc sort data = fruit_segment;
by Country;
run;

Before BY-group processing, datasets must be sorted by the grouping variable.

Here, records are sorted by:

Country

Without sorting, SAS generates incorrect BY-group behavior.

Step 2.RETAIN Statement

retain Count;

Normally SAS resets variables to missing for each observation.

RETAIN preserves the value across rows.

This allows cumulative counting.

Step 3.FIRST.Country Logic

if first.Country then Count = 0;

When SAS encounters the first record of a new country group:

FIRST.Country = 1
Counter resets to zero

Example:

Country	FIRST.Country
INDIA	1
INDIA	0
MALAYSIA	1

Step 4.Increment Counter

Count + 1;

This cumulative statement increments the counter automatically.

Step 5.LAST.Country Logic

if last.Country;

This keeps only the final observation for each country group.

At that point, Count contains the total number of records for that country.

Example Output

Country	Count
BRAZIL	3
INDIA	2
MALAYSIA	4

Key Points

FIRST./LAST. variables are temporary automatic variables
Requires sorted data
RETAIN preserves values across observations
Commonly used for grouped summaries
Faster than some SQL aggregations
Essential in clinical trial derivations

8.PROC REPORT for Professional Outputs

proc report data=fruit_segment nowd;

column Country Fruit_Name Price Rating Segment;

define Country / group;

define Fruit_Name / display;

define Price / analysis mean;

define Rating / analysis mean;

run;

OUTPUT:


Country	Fruit_Name	Price	Rating	Segment
BRAZIL	JABUTICABA	631.18	4.4	STANDARD
	FEIJOA	640	4.2	STANDARD
	CUPUACU	770	4.5	STANDARD
CHINA	BUDDHASHAND	710	4.6	STANDARD
GHANA	MIRACLEFRUIT	850	4.9	PREMIUM
INDIA	MANGOSTEEN	550	4.8	BASIC
	SOURSOP	900	2.1	PREMIUM
INDONESIA	SALAK	620	5	STANDARD
	SNAKEFRUIT	680	4.3	STANDARD
JAMAICA	ACKEE	100	3.7	BASIC
MALAYSIA	RAMBUTAN	700	3.5	STANDARD
	LANGSAT	780	4	STANDARD
	PULASAN	720	4.7	STANDARD
PERU	CHERIMOYA	800	4.1	PREMIUM
THAILAND	DURIAN	450	4.5	BASIC

Explanation and Key Points

PROC REPORT is widely used in pharmaceutical TLF generation. It creates professional tabular outputs for executive reporting, compliance reporting, and operational intelligence dashboards. Unlike PROC PRINT, it provides advanced formatting and grouping flexibility.

9.R Data Cleaning Workflow

library(tidyverse)

library(lubridate)

library(janitor)

fruit_raw <- read.delim(

text="

FR001|Durian|Thailand|Tropical|12JAN2025|450|durian@gmail.com| ap01 |4.5

FR001|durian|THAILAND|tropi cal|31FEB2025|-450|wrongmail.com|AP01|5

FR002|Mangosteen|India|NULL|15MAR2025|550|mango@test.com|IN01|4.8

sep="|",

header=FALSE

)

OUTPUT:

	V1	V2	V3	V4	V5	V6	V7	V8	V9
1	FR001	Durian	Thailand	Tropical	12-Jan-2025	450	durian@gmail.com	ap01	4.5
2	FR001	durian	THAILAND	tropi cal	31FEB2025	-450	wrongmail.com	AP01	5
3	FR002	Mangosteen	India	NULL	15-Mar-2025	550	mango@test.com	IN01	4.8

colnames(fruit_raw) <- c(

"fruit_id","fruit_name","country","category",

"harvest_date","price","email",

"region_code","rating"

)

OUTPUT:

	fruit_id	fruit_name	country	category	harvest_date	price	email	region_code	rating
1	FR001	Durian	Thailand	Tropical	12-Jan-2025	450	durian@gmail.com	ap01	4.5
2	FR001	durian	THAILAND	tropi cal	31FEB2025	-450	wrongmail.com	AP01	5
3	FR002	Mangosteen	India	NULL	15-Mar-2025	550	mango@test.com	IN01	4.8

fruit_clean <- fruit_raw %>%

clean_names() %>%

mutate(

fruit_name = str_to_title(str_trim(fruit_name)),

country = str_to_upper(str_trim(country)),

category = str_replace_all(category," ",""),

category = case_when(

category %in% c("NULL","") ~ "UNKNOWN",

TRUE ~ category),

price = abs(as.numeric(price)),

email = if_else(grepl("@",email),

email,"invalid@email.com"),

region_code = str_to_upper(str_trim(region_code)),

rating = coalesce(as.numeric(rating),3.5),

parsed_date = suppressWarnings(

parse_date_time(harvest_date,

orders="dby")

date_flag =if_else(is.na(parsed_date),

"INVALID_DATE","VALID_DATE"

)

OUTPUT:

	fruit_id	fruit_name	country	category	harvest_date	price	email	region_code	rating	parsed_date	date_flag
1	FR001	Durian	THAILAND	Tropical	12JAN2025	450	durian@gmail.com	AP01	4.5	12-01-2025	VALID_DATE
2	FR001	Durian	THAILAND	tropical	31FEB2025	450	invalid@email.com	AP01	5	NA	INVALID_DATE
3	FR002	Mangosteen	INDIA	UNKNOWN	15MAR20 25	550	mango@test.com	IN01	4.8	15-03-2025	VALID_DATE

Explanation and Key Points

This R pipeline mirrors SAS cleaning logic using tidyverse functions. Key parallels include:

SAS	R
PROPCASE	str_to_title
COMPRESS	str_replace_all
INPUT	as.numeric
IF-THEN	case_when
COALESCEC	coalesce
STRIP	str_trim

R provides exceptional flexibility for exploratory data engineering, while SAS excels in governed enterprise environments.

Enterprise Validation & Compliance

In SDTM and ADaM environments, every transformation must be traceable.

Key enterprise requirements include:

Audit trails
QC independence
Metadata governance
Reproducibility
Variable lineage
Controlled terminology
Validation traceability

One dangerous SAS behavior is:

if Age < 18 then delete;

Because SAS treats missing numeric values as smaller than valid numbers, records with missing age also get deleted unintentionally.

This can create catastrophic analytical bias in clinical trials.

Defensive programming is mandatory.

20 Real-World Data Cleaning Best Practices

Standardize variable naming conventions
Validate dates before conversion
Remove duplicate business keys
Track audit metadata
Use reusable macros
Avoid hardcoded logic
Standardize categorical labels
Validate numeric ranges
Implement QC independence
Preserve raw source datasets
Log all transformations
Validate email structures
Standardize region codes
Use metadata-driven mappings
Separate raw and curated layers
Implement production checkpoints
Use defensive IF conditions
Validate missing-value assumptions
Automate reconciliation checks
Document derivation logic thoroughly

Business Logic Behind Data Cleaning

Business logic is the foundation of analytical reliability. Missing values are often imputed because downstream models and reporting systems cannot process null structures consistently. For example, if a fruit price is missing, revenue summaries become distorted. In clinical trials, missing patient ages may exclude subjects from eligibility analysis.

Unrealistic values must also be corrected. Negative shipment costs or impossible patient ages create misleading analytical outcomes. Date standardization is equally critical because reporting timelines, treatment durations, and seasonal analyses depend heavily on accurate temporal calculations.

Text normalization ensures that “Tropical,” “tropical,” and “TROPICAL” are treated as identical business categories. Without normalization, aggregation logic fragments analytical summaries.

These corrections are not cosmetic. They directly influence executive decisions, AI predictions, operational KPIs, and regulatory submissions.

20 One-Line Insights

Dirty data creates expensive business mistakes.
Validation logic is stronger than visual inspection.
Missing dates silently break forecasting models.
Duplicate IDs inflate analytical metrics.
Standardized variables improve reproducibility.
Regulatory submissions demand traceability.
Defensive programming prevents hidden failures.
Audit readiness begins with metadata discipline.
PROC SQL simplifies relational intelligence.
DATA step excels in row-wise control.
R accelerates exploratory transformations.
SAS dominates governed enterprise reporting.
Poor text normalization fragments dashboards.
Controlled terminology improves aggregation quality.
QC independence increases analytical trust.
Macros improve scalability and consistency.
Missing numeric values require special handling in SAS.
Enterprise analytics depends on reproducible pipelines.
Standardization improves machine-learning stability.
Reliable analytics begins with reliable data.

SAS vs R Comparison

Capability	SAS	R
Auditability	Excellent	Moderate
Regulatory Acceptance	Industry Standard	Growing
Flexibility	Moderate	Very High
Performance	Strong	Strong
Visualization	Moderate	Excellent
Governance	Excellent	Moderate
Macro Automation	Powerful	Functional
Exploratory Analysis	Moderate	Excellent
Production Stability	Excellent	Good
Open Source	No	Yes

SAS dominates highly regulated enterprise ecosystems because of auditability, reproducibility, and metadata control. R excels in exploratory analytics, statistical flexibility, and rapid prototyping. Together, they create an extremely powerful hybrid analytics framework.

Validation Checklist

1.Duplicate check completed
2.Date conversion validated
3.Missing-value review performed
4.Email structure verified
5.Region normalization applied
6.Numeric range validation completed
7.Metadata reviewed
8.QC reconciliation performed

Summary

Creating analysis-ready datasets is one of the most critical responsibilities in enterprise analytics, especially in industries such as clinical research, banking, insurance, retail, and healthcare. Raw operational data often contains serious quality issues including duplicate records, invalid dates, negative numeric values, inconsistent text formatting, missing variables, malformed emails, corrupted categorical values, and mixed data types. These problems can severely impact dashboards, statistical outputs, AI predictions, regulatory submissions, and executive decision-making.

SAS and R together provide a powerful ecosystem for solving these challenges. SAS is highly trusted in regulated industries because of its auditability, metadata governance, reproducibility, validation capabilities, and scalable production workflows. Features like DATA step processing, PROC SQL, PROC REPORT, PROC SORT, arrays, macros, and validation procedures help convert corrupted datasets into reliable analytical intelligence.

R complements SAS by providing flexible exploratory data analysis, advanced string processing, dynamic transformations, and modern packages such as tidyverse, dplyr, stringr, lubridate, and janitor. Functions like mutate(), case_when(), coalesce(), and parse_date_time() make data cleaning efficient and readable.

Together, SAS and R enable organizations to build trustworthy, scalable, production-grade analytical systems that improve reporting accuracy, regulatory compliance, operational intelligence, and enterprise decision-making reliability.

Conclusion

Modern analytics ecosystems depend entirely on structured, trustworthy, and analysis-ready data. Whether the business domain involves clinical trials, insurance claims, retail intelligence, banking fraud systems, or global fruit supply-chain analytics, poor-quality data introduces enormous operational and regulatory risk.

Duplicate identifiers distort reporting. Invalid dates destroy time-series analysis. Missing values silently corrupt statistical outputs. Inconsistent text formatting fragments dashboards. Malformed emails break compliance workflows. These issues are not theoretical they happen every day in enterprise production systems.

This is why SAS and R remain foundational technologies in enterprise data engineering. SAS provides unmatched governance, auditability, traceability, and production-grade reporting capabilities. Its DATA step architecture enables highly controlled transformation pipelines essential for regulated industries such as pharmaceuticals and banking. PROC SQL, PROC REPORT, PROC SUMMARY, and macro automation create scalable frameworks capable of supporting global operational intelligence.

R complements SAS by offering highly flexible exploratory analytics, modern string manipulation, advanced visualization ecosystems, and rapid transformation pipelines using tidyverse frameworks. Together, SAS and R create a hybrid analytics ecosystem that balances enterprise stability with analytical agility.

The future of analytics does not belong to organizations with the largest datasets. It belongs to organizations with the cleanest, most reliable, and most reproducible datasets.

Because in enterprise analytics, trustworthy insights begin long before dashboards they begin with disciplined data engineering.

Interview Questions and Answers

1. What is the difference between PROC SQL and DATA Step in SAS?

Answer:
DATA Step is primarily used for row-by-row processing and sequential logic, while PROC SQL is used for relational operations such as joins, aggregations, and subqueries. DATA Step performs better for complex row transformations, whereas PROC SQL is easier for database-style operations.

2. Why is LENGTH statement important in SAS?

Answer:
The LENGTH statement defines variable storage size before assignment. If LENGTH is declared after assigning values, SAS may truncate character variables because variable length is determined during compilation. Proper LENGTH placement prevents data loss and reporting inconsistencies.

3. What is the difference between WHERE and IF statements in SAS?

Answer:
WHERE filters observations before they enter the Program Data Vector (PDV), making it more efficient for large datasets. IF conditions are applied after observations are read into memory. WHERE cannot be used with newly created variables in the same DATA step.

4. Explain RETAIN statement in SAS?

Answer:
By default, SAS resets variable values to missing for every iteration. RETAIN preserves values across observations. It is commonly used for cumulative calculations, counters, and carrying forward previous values.

5. What are FIRST. and LAST. variables in SAS?

Answer:
FIRST.variable and LAST.variable are temporary variables created during BY-group processing. They help identify the first and last observation within grouped data and are widely used for summaries, counts, and accumulations.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent GLOBAL FRUITS DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can Advanced SAS Programming Detect, Clean, and Optimize Grid Stability Data While Identifying Fraud Patterns?
2.Can Modern Art Data Explain What Actually Sells? – A Real-World SAS Analytics Project
3.Can Advanced R Programming Detect, Clean, and Optimize Public Library Data While Identifying Fraud Patterns?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Fruit Supply Chain Corruption Solved Through SAS & R Automation

Explanation

Step-by-Step Logic

Condition 1

Condition 2

Otherwise Condition

Example Output

Why SELECT-WHEN is Preferred

Key Points

Summary

Interview Questions and Answers

1. What is the difference between PROC SQL and DATA Step in SAS?

2. Why is LENGTH statement important in SAS?

3. What is the difference between WHERE and IF statements in SAS?

4. Explain RETAIN statement in SAS?

5. What are FIRST. and LAST. variables in SAS?

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS