Fruit Supply Chain Corruption Solved Through SAS & R Automation

Global Fruit Dataset Chaos into Analysis-Ready SAS and R Pipelines for Regulatory-Grade Reporting Excellence

Introduction

In real-world analytics environments, dirty data is not just an inconvenience it is a direct business risk. I have seen pharmaceutical companies delay clinical submissions because patient visit dates were corrupted. I have seen insurance firms reject valid claims because region codes were inconsistent. I have seen banking fraud systems fail because transaction IDs were duplicated.

Now imagine a global fruit-export intelligence platform collecting operational data from multiple countries about rare fruits such as Durian, Rambutan, Mangosteen, Miracle Fruit, Buddha’s Hand, Jabuticaba, Salak, and Ackee. The business wants to combine retail intelligence, shipment quality, healthcare nutrition research, and customer analytics into one centralized reporting system.

Unfortunately, the incoming datasets are corrupted.

Some fruit shipment IDs are duplicated.
Some supplier emails are malformed.
Some harvest dates are invalid.
Some fruit prices are negative.
Some region codes contain whitespace corruption.
Some category labels are inconsistent like:

  • tropical
  • Tropical
  • TROPI CAL
  • NULL
  • ???

This type of poor-quality data creates catastrophic downstream problems:

  • Executive dashboards become misleading
  • AI prediction models learn incorrect patterns
  • Statistical summaries become unreliable
  • Regulatory reports fail validation
  • Clinical nutrition studies lose traceability
  • Enterprise reporting becomes inconsistent

This is where enterprise-grade SAS and R data engineering becomes critical.

Business Crisis Scenario

A multinational nutrition-research organization launched a global study analyzing rare fruits and their impact on metabolic health. The platform merged:

  • Retail sales
  • Clinical nutrition records
  • Shipment data
  • Supplier intelligence
  • Regional inventory systems

After deployment, executives discovered shocking inconsistencies:

  • Duplicate Fruit IDs inflated sales
  • Invalid harvest dates broke time-series forecasts
  • Negative prices caused financial losses in dashboards
  • Missing regions corrupted geographical analysis
  • Incorrect supplier emails blocked compliance alerts
  • Mixed categorical labels failed reporting aggregation

Even worse, machine-learning models incorrectly predicted demand because corrupted records distorted historical patterns.

This is exactly why enterprise cleaning workflows are mandatory.

1.Raw SAS Dataset with Intentional Errors

data fruit_raw;

length Fruit_ID $8 Fruit_Name $25 Country $20 Category $20

Supplier_Email $50 Region_Code $12 Harvest_Date $20

Price $12 Rating $8;

infile datalines dlm='|' truncover;

input Fruit_ID $ Fruit_Name $ Country $ Category $

Harvest_Date $ Price $ Supplier_Email $ Region_Code $

Rating $;

datalines;

FR001|Durian|Thailand|Tropical|12JAN2025|450|durian@gmail.com| ap01 |4.5

FR001|durian|THAILAND|tropi cal|31FEB2025|-450|wrongmail.com|AP01|5

FR002|Mangosteen|India|NULL|15MAR2025|550|mango@test.com|IN01|4.8

FR003|Rambutan|Malaysia|Tropical|.|700|rambutan#mail.com|MY 02|abc

FR004|Ackee|Jamaica|Exotic|21APR2025|-100|ackee@gmail|JM01|3.7

FR005|MiracleFruit|Ghana|???|15MAY2025|850| miracle@gmail.com |GH01|4.9

FR006|Salak|Indonesia|Tropical|44DEC2025|620|salak@gmail.com|ID01|5.2

FR007|Jabuticaba|Brazil|Berry|17JUN2025|NULL|jab@gmail.com|BR01|4.4

FR008|BuddhasHand|China|Citrus|19JUL2025|710|buddha@mail.com| CN01|4.6

FR009|Cherimoya|Peru|tropical|20AUG2025|800|cherimoya@gmail.com|PE01|4.1

FR010|Langsat|Malaysia|TROPICAL|22SEP2025|780|langsat@gmail.com|MY01|4.0

FR011|Soursop|India|Herbal|23OCT2025|-900|soursopgmail.com|IN01|2.1

FR012|Feijoa|Brazil|Berry|24NOV2025|640|fei@gmail.com|BR01|4.2

FR013|Cupuacu|Brazil|NULL|25DEC2025|770|cupuacu@gmail.com|BR02|4.5

FR014|Pulasan|Malaysia|Tropical|15JAN2025|720|pulasan@gmail.com|MY03|4.7

FR015|SnakeFruit|Indonesia|Tropical|16FEB2025|680|snake@gmail.com|ID01|4.3

;

run;

proc print data = fruit_raw;

run;

OUTPUT:

ObsFruit_IDFruit_NameCountryCategorySupplier_EmailRegion_CodeHarvest_DatePriceRating
1FR001DurianThailandTropicaldurian@gmail.comap0112JAN20254504.5
2FR001durianTHAILANDtropi calwrongmail.comAP0131FEB2025-4505
3FR002MangosteenIndiaNULLmango@test.comIN0115MAR20255504.8
4FR003RambutanMalaysiaTropicalrambutan#mail.comMY 02 700abc
5FR004AckeeJamaicaExoticackee@gmailJM0121APR2025-1003.7
6FR005MiracleFruitGhana???miracle@gmail.comGH0115MAY20258504.9
7FR006SalakIndonesiaTropicalsalak@gmail.comID0144DEC20256205.2
8FR007JabuticabaBrazilBerryjab@gmail.comBR0117JUN2025NULL4.4
9FR008BuddhasHandChinaCitrusbuddha@mail.comCN0119JUL20257104.6
10FR009CherimoyaPerutropicalcherimoya@gmail.comPE0120AUG20258004.1
11FR010LangsatMalaysiaTROPICALlangsat@gmail.comMY0122SEP20257804.0
12FR011SoursopIndiaHerbalsoursopgmail.comIN0123OCT2025-9002.1
13FR012FeijoaBrazilBerryfei@gmail.comBR0124NOV20256404.2
14FR013CupuacuBrazilNULLcupuacu@gmail.comBR0225DEC20257704.5
15FR014PulasanMalaysiaTropicalpulasan@gmail.comMY0315JAN20257204.7
16FR015SnakeFruitIndonesiaTropicalsnake@gmail.comID0116FEB20256804.3

Why LENGTH Must Come Before Assignments

One of the biggest SAS mistakes beginners make is placing the LENGTH statement after assignments.

Incorrect:

Fruit_Status="Excellent";

length Fruit_Status $5;

This truncates the value to:

Excel

because SAS already assigned default length before LENGTH executed.

Correct:

length Fruit_Status $20;

Fruit_Status="Excellent";

In enterprise clinical trials, truncation can destroy SDTM mapping integrity.

R behaves differently because character vectors dynamically resize memory allocation. SAS, however, fixes variable length at compile time.

That difference is critical.

2.SAS Enterprise Cleaning Workflow

data fruit_clean;

retain Data_Source "GLOBAL_FRUIT_SYSTEM";

set fruit_raw;

length Clean_Category $20 Clean_Email $60 Parsed_Date 8;

format Parsed_Date yymmdd10.;

Fruit_Name=propcase(strip(Fruit_Name));

Country=upcase(compbl(strip(Country)));

Category=upcase(compress(Category));

if Category in ("NULL","???","") then

    Clean_Category="UNKNOWN";

else Clean_Category=Category;

Price_Num=input(compress(Price),best12.);

if Price_Num < 0 then Price_Num=abs(Price_Num);

if missing(Price_Num) then

    Price_Num=round(500 + rannor(2)*100,.01);

Parsed_Date=input(Harvest_Date,date9.); 

if missing(Parsed_Date) then

    Parsed_Date=today();

Clean_Email=strip(lowcase(Supplier_Email));

if index(Clean_Email,'@')=0 then

    Clean_Email="invalid@email.com";

Region_Code=upcase(compress(Region_Code));

Rating_Num=input(Rating,best12.);

if Rating_Num > 5 then Rating_Num=5;

if Rating_Num=. then Rating_Num=3.5;

Days_To_Harvest=intck('day',today(),Parsed_Date);

drop Harvest_Date Price Rating Supplier_Email Category;

rename Parsed_Date = Harvest_Date

       Price_Num = Price

       Rating_Num = Rating

       Clean_Email = Supplier_Email

       Clean_Category = Category; 

run;

proc print data = fruit_clean;

run;

OUTPUT:

ObsData_SourceFruit_IDFruit_NameCountryRegion_CodeCategorySupplier_EmailHarvest_DatePriceRatingDays_To_Harvest
1GLOBAL_FRUIT_SYSTEMFR001DurianTHAILANDAP01TROPICALdurian@gmail.com2025-01-12450.004.5-510
2GLOBAL_FRUIT_SYSTEMFR001DurianTHAILANDAP01TROPICALinvalid@email.com2026-06-06450.005.00
3GLOBAL_FRUIT_SYSTEMFR002MangosteenINDIAIN01UNKNOWNmango@test.com2025-03-15550.004.8-448
4GLOBAL_FRUIT_SYSTEMFR003RambutanMALAYSIAMY02TROPICALinvalid@email.com2026-06-06700.003.50
5GLOBAL_FRUIT_SYSTEMFR004AckeeJAMAICAJM01EXOTICackee@gmail2025-04-21100.003.7-411
6GLOBAL_FRUIT_SYSTEMFR005MiraclefruitGHANAGH01UNKNOWNmiracle@gmail.com2025-05-15850.004.9-387
7GLOBAL_FRUIT_SYSTEMFR006SalakINDONESIAID01TROPICALsalak@gmail.com2026-06-06620.005.00
8GLOBAL_FRUIT_SYSTEMFR007JabuticabaBRAZILBR01BERRYjab@gmail.com2025-06-17631.184.4-354
9GLOBAL_FRUIT_SYSTEMFR008BuddhashandCHINACN01CITRUSbuddha@mail.com2025-07-19710.004.6-322
10GLOBAL_FRUIT_SYSTEMFR009CherimoyaPERUPE01TROPICALcherimoya@gmail.com2025-08-20800.004.1-290
11GLOBAL_FRUIT_SYSTEMFR010LangsatMALAYSIAMY01TROPICALlangsat@gmail.com2025-09-22780.004.0-257
12GLOBAL_FRUIT_SYSTEMFR011SoursopINDIAIN01HERBALinvalid@email.com2025-10-23900.002.1-226
13GLOBAL_FRUIT_SYSTEMFR012FeijoaBRAZILBR01BERRYfei@gmail.com2025-11-24640.004.2-194
14GLOBAL_FRUIT_SYSTEMFR013CupuacuBRAZILBR02UNKNOWNcupuacu@gmail.com2025-12-25770.004.5-163
15GLOBAL_FRUIT_SYSTEMFR014PulasanMALAYSIAMY03TROPICALpulasan@gmail.com2025-01-15720.004.7-507
16GLOBAL_FRUIT_SYSTEMFR015SnakefruitINDONESIAID01TROPICALsnake@gmail.com2025-02-16680.004.3-475

Explanation and Key Points

This DATA step demonstrates real-world production-grade cleaning logic. We used:

  • RETAIN for metadata persistence
  • PROPCASE for standardized names
  • COMPRESS for whitespace corruption
  • INPUT for character-to-numeric conversion
  • ABS for correcting negative values
  • INDEX for email validation
  • INTCK for temporal analytics
  • ROUND for financial precision

In regulated clinical environments, similar logic is applied to patient-level SDTM and ADaM derivations. Improper conversions can create FDA submission failures. This step converts chaotic operational intelligence into analysis-ready structured data.

3.PROC SORT Deduplication

proc sort data=fruit_clean nodupkey out=fruit_nodup;

by Fruit_ID;

run;

proc print data = fruit_nodup;

run;

LOG:

NOTE: There were 16 observations read from the data set WORK.FRUIT_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.FRUIT_NODUP has 15 observations and 11 variables.

OUTPUT:

ObsData_SourceFruit_IDFruit_NameCountryRegion_CodeCategorySupplier_EmailHarvest_DatePriceRatingDays_To_Harvest
1GLOBAL_FRUIT_SYSTEMFR001DurianTHAILANDAP01TROPICALdurian@gmail.com2025-01-12450.004.5-510
2GLOBAL_FRUIT_SYSTEMFR002MangosteenINDIAIN01UNKNOWNmango@test.com2025-03-15550.004.8-448
3GLOBAL_FRUIT_SYSTEMFR003RambutanMALAYSIAMY02TROPICALinvalid@email.com2026-06-06700.003.50
4GLOBAL_FRUIT_SYSTEMFR004AckeeJAMAICAJM01EXOTICackee@gmail2025-04-21100.003.7-411
5GLOBAL_FRUIT_SYSTEMFR005MiraclefruitGHANAGH01UNKNOWNmiracle@gmail.com2025-05-15850.004.9-387
6GLOBAL_FRUIT_SYSTEMFR006SalakINDONESIAID01TROPICALsalak@gmail.com2026-06-06620.005.00
7GLOBAL_FRUIT_SYSTEMFR007JabuticabaBRAZILBR01BERRYjab@gmail.com2025-06-17631.184.4-354
8GLOBAL_FRUIT_SYSTEMFR008BuddhashandCHINACN01CITRUSbuddha@mail.com2025-07-19710.004.6-322
9GLOBAL_FRUIT_SYSTEMFR009CherimoyaPERUPE01TROPICALcherimoya@gmail.com2025-08-20800.004.1-290
10GLOBAL_FRUIT_SYSTEMFR010LangsatMALAYSIAMY01TROPICALlangsat@gmail.com2025-09-22780.004.0-257
11GLOBAL_FRUIT_SYSTEMFR011SoursopINDIAIN01HERBALinvalid@email.com2025-10-23900.002.1-226
12GLOBAL_FRUIT_SYSTEMFR012FeijoaBRAZILBR01BERRYfei@gmail.com2025-11-24640.004.2-194
13GLOBAL_FRUIT_SYSTEMFR013CupuacuBRAZILBR02UNKNOWNcupuacu@gmail.com2025-12-25770.004.5-163
14GLOBAL_FRUIT_SYSTEMFR014PulasanMALAYSIAMY03TROPICALpulasan@gmail.com2025-01-15720.004.7-507
15GLOBAL_FRUIT_SYSTEMFR015SnakefruitINDONESIAID01TROPICALsnake@gmail.com2025-02-16680.004.3-475

Explanation and Key Points

PROC SORT NODUPKEY removes duplicate business keys. Duplicate IDs can inflate revenue calculations, patient counts, or insurance claims. In enterprise analytics, deduplication is one of the first QC checkpoints. Clinical studies particularly require unique subject identifiers for traceability and regulatory integrity.

4.PROC SQL Validation

proc sql;

create table fruit_summary as

select Country,Category,

       count(*) as Total_Records,

       avg(Price) as Avg_Price,

       avg(Rating) as Avg_Rating

from fruit_nodup

group by Country,Category;

quit;

proc print data = fruit_summary;

run;

OUTPUT:

ObsCountryCategoryTotal_RecordsAvg_PriceAvg_Rating
1BRAZILBERRY2635.5904.30000
2BRAZILUNKNOWN1770.0004.50000
3CHINACITRUS1710.0004.60000
4GHANAUNKNOWN1850.0004.90000
5INDIAHERBAL1900.0002.10000
6INDIAUNKNOWN1550.0004.80000
7INDONESIATROPICAL2650.0004.65000
8JAMAICAEXOTIC1100.0003.70000
9MALAYSIATROPICAL3733.3334.06667
10PERUTROPICAL1800.0004.10000
11THAILANDTROPICAL1450.0004.50000

Explanation and Key Points

PROC SQL provides relational-style aggregation logic similar to enterprise databases. This workflow helps:

  • summarize operational intelligence
  • identify country-level anomalies
  • support executive dashboards
  • create analytical reporting layers

Compared to DATA step BY-group processing, PROC SQL is often easier for multidimensional aggregations and joins.

Advanced SAS Techniques

5.ARRAY Cleaning

data fruit_array_clean;

set fruit_nodup;

array chars(*) Fruit_Name Country Category;

do i=1 to dim(chars);

   chars(i)=strip(upcase(chars(i)));

end;

drop i;

run;

proc print data = fruit_array_clean;

run;

OUTPUT:

ObsData_SourceFruit_IDFruit_NameCountryRegion_CodeCategorySupplier_EmailHarvest_DatePriceRatingDays_To_Harvest
1GLOBAL_FRUIT_SYSTEMFR001DURIANTHAILANDAP01TROPICALdurian@gmail.com2025-01-12450.004.5-510
2GLOBAL_FRUIT_SYSTEMFR002MANGOSTEENINDIAIN01UNKNOWNmango@test.com2025-03-15550.004.8-448
3GLOBAL_FRUIT_SYSTEMFR003RAMBUTANMALAYSIAMY02TROPICALinvalid@email.com2026-06-06700.003.50
4GLOBAL_FRUIT_SYSTEMFR004ACKEEJAMAICAJM01EXOTICackee@gmail2025-04-21100.003.7-411
5GLOBAL_FRUIT_SYSTEMFR005MIRACLEFRUITGHANAGH01UNKNOWNmiracle@gmail.com2025-05-15850.004.9-387
6GLOBAL_FRUIT_SYSTEMFR006SALAKINDONESIAID01TROPICALsalak@gmail.com2026-06-06620.005.00
7GLOBAL_FRUIT_SYSTEMFR007JABUTICABABRAZILBR01BERRYjab@gmail.com2025-06-17631.184.4-354
8GLOBAL_FRUIT_SYSTEMFR008BUDDHASHANDCHINACN01CITRUSbuddha@mail.com2025-07-19710.004.6-322
9GLOBAL_FRUIT_SYSTEMFR009CHERIMOYAPERUPE01TROPICALcherimoya@gmail.com2025-08-20800.004.1-290
10GLOBAL_FRUIT_SYSTEMFR010LANGSATMALAYSIAMY01TROPICALlangsat@gmail.com2025-09-22780.004.0-257
11GLOBAL_FRUIT_SYSTEMFR011SOURSOPINDIAIN01HERBALinvalid@email.com2025-10-23900.002.1-226
12GLOBAL_FRUIT_SYSTEMFR012FEIJOABRAZILBR01BERRYfei@gmail.com2025-11-24640.004.2-194
13GLOBAL_FRUIT_SYSTEMFR013CUPUACUBRAZILBR02UNKNOWNcupuacu@gmail.com2025-12-25770.004.5-163
14GLOBAL_FRUIT_SYSTEMFR014PULASANMALAYSIAMY03TROPICALpulasan@gmail.com2025-01-15720.004.7-507
15GLOBAL_FRUIT_SYSTEMFR015SNAKEFRUITINDONESIAID01TROPICALsnake@gmail.com2025-02-16680.004.3-475

Explanation

This program demonstrates one of the most powerful DATA Step techniques in SAS called ARRAY processing. Arrays allow programmers to apply the same logic to multiple variables efficiently without repeatedly writing individual statements.

In this example, the array named chars(*) contains three character variables:

  • Fruit_Name
  • Country
  • Clean_Category

The (*) tells SAS to automatically determine the number of variables inside the array.

The DO loop:

do i = 1 to dim(chars);

iterates through each variable in the array dynamically. The DIM() function returns the total number of variables inside the array.

Inside the loop:

chars(i) = strip(upcase(chars(i)));

two cleaning operations occur:

  • UPCASE() converts text to uppercase
  • STRIP() removes leading and trailing spaces

For example:

Before Cleaning

After Cleaning

durian

DURIAN

thailand

THAILAND

tropical

TROPICAL

Finally:

drop i;

removes the temporary loop counter variable from the final dataset.

Key Points

  • Arrays reduce repetitive coding
  • Useful for bulk variable transformations
  • Improves maintainability in enterprise projects
  • Frequently used in SDTM and ADaM standardization
  • DIM() makes loops dynamic and scalable
  • Excellent for cleaning large clinical datasets 

6.SELECT-WHEN Logic

data fruit_segment;

set fruit_array_clean;

Length Segment $12;

select;

   when (Price >=800) Segment="PREMIUM";

   when (Price >=600) Segment="STANDARD";

   otherwise Segment="BASIC";

end;

run;

proc print data = fruit_segment;

run;

OUTPUT:

ObsData_SourceFruit_IDFruit_NameCountryRegion_CodeCategorySupplier_EmailHarvest_DatePriceRatingDays_To_HarvestSegment
1GLOBAL_FRUIT_SYSTEMFR001DURIANTHAILANDAP01TROPICALdurian@gmail.com2025-01-12450.004.5-510BASIC
2GLOBAL_FRUIT_SYSTEMFR002MANGOSTEENINDIAIN01UNKNOWNmango@test.com2025-03-15550.004.8-448BASIC
3GLOBAL_FRUIT_SYSTEMFR003RAMBUTANMALAYSIAMY02TROPICALinvalid@email.com2026-06-06700.003.50STANDARD
4GLOBAL_FRUIT_SYSTEMFR004ACKEEJAMAICAJM01EXOTICackee@gmail2025-04-21100.003.7-411BASIC
5GLOBAL_FRUIT_SYSTEMFR005MIRACLEFRUITGHANAGH01UNKNOWNmiracle@gmail.com2025-05-15850.004.9-387PREMIUM
6GLOBAL_FRUIT_SYSTEMFR006SALAKINDONESIAID01TROPICALsalak@gmail.com2026-06-06620.005.00STANDARD
7GLOBAL_FRUIT_SYSTEMFR007JABUTICABABRAZILBR01BERRYjab@gmail.com2025-06-17631.184.4-354STANDARD
8GLOBAL_FRUIT_SYSTEMFR008BUDDHASHANDCHINACN01CITRUSbuddha@mail.com2025-07-19710.004.6-322STANDARD
9GLOBAL_FRUIT_SYSTEMFR009CHERIMOYAPERUPE01TROPICALcherimoya@gmail.com2025-08-20800.004.1-290PREMIUM
10GLOBAL_FRUIT_SYSTEMFR010LANGSATMALAYSIAMY01TROPICALlangsat@gmail.com2025-09-22780.004.0-257STANDARD
11GLOBAL_FRUIT_SYSTEMFR011SOURSOPINDIAIN01HERBALinvalid@email.com2025-10-23900.002.1-226PREMIUM
12GLOBAL_FRUIT_SYSTEMFR012FEIJOABRAZILBR01BERRYfei@gmail.com2025-11-24640.004.2-194STANDARD
13GLOBAL_FRUIT_SYSTEMFR013CUPUACUBRAZILBR02UNKNOWNcupuacu@gmail.com2025-12-25770.004.5-163STANDARD
14GLOBAL_FRUIT_SYSTEMFR014PULASANMALAYSIAMY03TROPICALpulasan@gmail.com2025-01-15720.004.7-507STANDARD
15GLOBAL_FRUIT_SYSTEMFR015SNAKEFRUITINDONESIAID01TROPICALsnake@gmail.com2025-02-16680.004.3-475STANDARD

Explanation

This program demonstrates the SELECT-WHEN statement in SAS, which is similar to:

  • IF-ELSE logic
  • CASE statements in SQL
  • case_when() in R

The purpose of this code is to categorize fruits into business segments based on their price.

Step-by-Step Logic

Condition 1

when (Price_Num >= 800)

If the fruit price is greater than or equal to 800, SAS assigns:

Segment = "PREMIUM";

Condition 2

when (Price_Num >= 600)

If the first condition fails but the price is at least 600, SAS assigns:

Segment = "STANDARD";

Otherwise Condition

Otherwise Segment = "BASIC";

If none of the above conditions are true, the fruit is classified as BASIC.

Example Output

Fruit_Name

Price_Num

Segment

DURIAN

850

PREMIUM

RAMBUTAN

650

STANDARD

ACKEE

300

BASIC

Why SELECT-WHEN is Preferred

Compared to long IF-ELSE chains, SELECT-WHEN:

  • improves readability
  • simplifies debugging
  • provides cleaner business-rule implementation
  • is easier to maintain in production systems

Key Points

  • Cleaner alternative to multiple IF statements
  • Excellent for business categorization
  • Common in risk scoring systems
  • Used in banking, insurance, and clinical classification logic
  • Easier to validate during QC review

7.FIRST./LAST. Processing

proc sort data=fruit_segment;

by Country;

run;

proc print data = fruit_segment;

run;

OUTPUT:

ObsData_SourceFruit_IDFruit_NameCountryRegion_CodeCategorySupplier_EmailHarvest_DatePriceRatingDays_To_HarvestSegment
1GLOBAL_FRUIT_SYSTEMFR007JABUTICABABRAZILBR01BERRYjab@gmail.com2025-06-17631.184.4-354STANDARD
2GLOBAL_FRUIT_SYSTEMFR012FEIJOABRAZILBR01BERRYfei@gmail.com2025-11-24640.004.2-194STANDARD
3GLOBAL_FRUIT_SYSTEMFR013CUPUACUBRAZILBR02UNKNOWNcupuacu@gmail.com2025-12-25770.004.5-163STANDARD
4GLOBAL_FRUIT_SYSTEMFR008BUDDHASHANDCHINACN01CITRUSbuddha@mail.com2025-07-19710.004.6-322STANDARD
5GLOBAL_FRUIT_SYSTEMFR005MIRACLEFRUITGHANAGH01UNKNOWNmiracle@gmail.com2025-05-15850.004.9-387PREMIUM
6GLOBAL_FRUIT_SYSTEMFR002MANGOSTEENINDIAIN01UNKNOWNmango@test.com2025-03-15550.004.8-448BASIC
7GLOBAL_FRUIT_SYSTEMFR011SOURSOPINDIAIN01HERBALinvalid@email.com2025-10-23900.002.1-226PREMIUM
8GLOBAL_FRUIT_SYSTEMFR006SALAKINDONESIAID01TROPICALsalak@gmail.com2026-06-06620.005.00STANDARD
9GLOBAL_FRUIT_SYSTEMFR015SNAKEFRUITINDONESIAID01TROPICALsnake@gmail.com2025-02-16680.004.3-475STANDARD
10GLOBAL_FRUIT_SYSTEMFR004ACKEEJAMAICAJM01EXOTICackee@gmail2025-04-21100.003.7-411BASIC
11GLOBAL_FRUIT_SYSTEMFR003RAMBUTANMALAYSIAMY02TROPICALinvalid@email.com2026-06-06700.003.50STANDARD
12GLOBAL_FRUIT_SYSTEMFR010LANGSATMALAYSIAMY01TROPICALlangsat@gmail.com2025-09-22780.004.0-257STANDARD
13GLOBAL_FRUIT_SYSTEMFR014PULASANMALAYSIAMY03TROPICALpulasan@gmail.com2025-01-15720.004.7-507STANDARD
14GLOBAL_FRUIT_SYSTEMFR009CHERIMOYAPERUPE01TROPICALcherimoya@gmail.com2025-08-20800.004.1-290PREMIUM
15GLOBAL_FRUIT_SYSTEMFR001DURIANTHAILANDAP01TROPICALdurian@gmail.com2025-01-12450.004.5-510BASIC

data country_counts;

set fruit_segment;

by Country;

retain Count;

if first.Country then Count=0;

Count+1;

if last.Country;

run;

proc print data = country_counts;

run;

OUTPUT:

ObsData_SourceFruit_IDFruit_NameCountryRegion_CodeCategorySupplier_EmailHarvest_DatePriceRatingDays_To_HarvestSegmentCount
1GLOBAL_FRUIT_SYSTEMFR013CUPUACUBRAZILBR02UNKNOWNcupuacu@gmail.com2025-12-257704.5-163STANDARD3
2GLOBAL_FRUIT_SYSTEMFR008BUDDHASHANDCHINACN01CITRUSbuddha@mail.com2025-07-197104.6-322STANDARD1
3GLOBAL_FRUIT_SYSTEMFR005MIRACLEFRUITGHANAGH01UNKNOWNmiracle@gmail.com2025-05-158504.9-387PREMIUM1
4GLOBAL_FRUIT_SYSTEMFR011SOURSOPINDIAIN01HERBALinvalid@email.com2025-10-239002.1-226PREMIUM2
5GLOBAL_FRUIT_SYSTEMFR015SNAKEFRUITINDONESIAID01TROPICALsnake@gmail.com2025-02-166804.3-475STANDARD2
6GLOBAL_FRUIT_SYSTEMFR004ACKEEJAMAICAJM01EXOTICackee@gmail2025-04-211003.7-411BASIC1
7GLOBAL_FRUIT_SYSTEMFR014PULASANMALAYSIAMY03TROPICALpulasan@gmail.com2025-01-157204.7-507STANDARD3
8GLOBAL_FRUIT_SYSTEMFR009CHERIMOYAPERUPE01TROPICALcherimoya@gmail.com2025-08-208004.1-290PREMIUM1
9GLOBAL_FRUIT_SYSTEMFR001DURIANTHAILANDAP01TROPICALdurian@gmail.com2025-01-124504.5-510BASIC1

Explanation

This program demonstrates one of the most important SAS concepts called BY-group processing using:

  • FIRST.variable
  • LAST.variable

These are temporary automatic variables created by SAS whenever a dataset is processed with a BY statement.

Step 1.Sorting Requirement

proc sort data = fruit_segment;
by Country;
run;

Before BY-group processing, datasets must be sorted by the grouping variable.

Here, records are sorted by:

Country

Without sorting, SAS generates incorrect BY-group behavior.

Step 2.RETAIN Statement

retain Count;

Normally SAS resets variables to missing for each observation.

RETAIN preserves the value across rows.

This allows cumulative counting.

Step 3.FIRST.Country Logic

if first.Country then Count = 0;

When SAS encounters the first record of a new country group:

  • FIRST.Country = 1
  • Counter resets to zero

Example:

Country

FIRST.Country

INDIA

1

INDIA

0

MALAYSIA

1

Step 4.Increment Counter

Count + 1;

This cumulative statement increments the counter automatically.

Step 5.LAST.Country Logic

if last.Country;

This keeps only the final observation for each country group.

At that point, Count contains the total number of records for that country.

Example Output

Country

Count

BRAZIL

3

INDIA

2

MALAYSIA

4

Key Points

  • FIRST./LAST. variables are temporary automatic variables
  • Requires sorted data
  • RETAIN preserves values across observations
  • Commonly used for grouped summaries
  • Faster than some SQL aggregations
  • Essential in clinical trial derivations

8.PROC REPORT for Professional Outputs

proc report data=fruit_segment nowd;

column Country Fruit_Name Price Rating Segment;

define Country / group;

define Fruit_Name / display;

define Price / analysis mean;

define Rating / analysis mean;

run;

OUTPUT:
CountryFruit_NamePriceRatingSegment
BRAZILJABUTICABA631.184.4STANDARD
 FEIJOA6404.2STANDARD
 CUPUACU7704.5STANDARD
CHINABUDDHASHAND7104.6STANDARD
GHANAMIRACLEFRUIT8504.9PREMIUM
INDIAMANGOSTEEN5504.8BASIC
 SOURSOP9002.1PREMIUM
INDONESIASALAK6205STANDARD
 SNAKEFRUIT6804.3STANDARD
JAMAICAACKEE1003.7BASIC
MALAYSIARAMBUTAN7003.5STANDARD
 LANGSAT7804STANDARD
 PULASAN7204.7STANDARD
PERUCHERIMOYA8004.1PREMIUM
THAILANDDURIAN4504.5BASIC

Explanation and Key Points

PROC REPORT is widely used in pharmaceutical TLF generation. It creates professional tabular outputs for executive reporting, compliance reporting, and operational intelligence dashboards. Unlike PROC PRINT, it provides advanced formatting and grouping flexibility.

9.R Data Cleaning Workflow

library(tidyverse)

library(lubridate)

library(janitor)

fruit_raw <- read.delim(

  text="

FR001|Durian|Thailand|Tropical|12JAN2025|450|durian@gmail.com| ap01 |4.5

FR001|durian|THAILAND|tropi cal|31FEB2025|-450|wrongmail.com|AP01|5

FR002|Mangosteen|India|NULL|15MAR2025|550|mango@test.com|IN01|4.8

",

  sep="|",

  header=FALSE

)

OUTPUT:

 

V1

V2

V3

V4

V5

V6

V7

V8

V9

1

FR001

Durian

Thailand

Tropical

12-Jan-2025

450

durian@gmail.com

 ap01 

4.5

2

FR001

durian

THAILAND

tropi cal

31FEB2025

-450

wrongmail.com

AP01

5

3

FR002

Mangosteen

India

NULL

15-Mar-2025

550

mango@test.com

IN01

4.8


colnames(fruit_raw) <- c(

  "fruit_id","fruit_name","country","category",

  "harvest_date","price","email",

  "region_code","rating"

)

OUTPUT:

 

fruit_id

fruit_name

country

category

harvest_date

price

email

region_code

rating

1

FR001

Durian

Thailand

Tropical

12-Jan-2025

450

durian@gmail.com

 ap01 

4.5

2

FR001

durian

THAILAND

tropi cal

31FEB2025

-450

wrongmail.com

AP01

5

3

FR002

Mangosteen

India

NULL

15-Mar-2025

550

mango@test.com

IN01

4.8


fruit_clean <- fruit_raw %>%

  clean_names() %>%

  mutate(

    fruit_name = str_to_title(str_trim(fruit_name)),

    country = str_to_upper(str_trim(country)),

    category = str_replace_all(category," ",""),

    category = case_when(

        category %in% c("NULL","") ~ "UNKNOWN",

        TRUE ~ category),

    price = abs(as.numeric(price)),

    email = if_else(grepl("@",email),

        email,"invalid@email.com"),

    region_code = str_to_upper(str_trim(region_code)),

    rating = coalesce(as.numeric(rating),3.5),

    parsed_date = suppressWarnings(

        parse_date_time(harvest_date,

          orders="dby")

      ),

    date_flag =if_else(is.na(parsed_date),

        "INVALID_DATE","VALID_DATE"

      )

  )

OUTPUT:

 

fruit_id

fruit_name

country

category

harvest_date

price

email

region_code

rating

parsed_date

date_flag

1

FR001

Durian

THAILAND

Tropical

12JAN2025

450

durian@gmail.com

AP01

4.5

12-01-2025

VALID_DATE

2

FR001

Durian

THAILAND

tropical

31FEB2025

450

invalid@email.com

AP01

5

NA

INVALID_DATE

3

FR002

Mangosteen

INDIA

UNKNOWN

15MAR20 25

550

mango@test.com

IN01

4.8

15-03-2025

VALID_DATE

Explanation and Key Points

This R pipeline mirrors SAS cleaning logic using tidyverse functions. Key parallels include:

SAS

R

PROPCASE

str_to_title

COMPRESS

str_replace_all

INPUT

as.numeric

IF-THEN

case_when

COALESCEC

coalesce

STRIP

str_trim

R provides exceptional flexibility for exploratory data engineering, while SAS excels in governed enterprise environments.

Enterprise Validation & Compliance

In SDTM and ADaM environments, every transformation must be traceable.

Key enterprise requirements include:

  • Audit trails
  • QC independence
  • Metadata governance
  • Reproducibility
  • Variable lineage
  • Controlled terminology
  • Validation traceability

One dangerous SAS behavior is:

if Age < 18 then delete;

Because SAS treats missing numeric values as smaller than valid numbers, records with missing age also get deleted unintentionally.

This can create catastrophic analytical bias in clinical trials.

Defensive programming is mandatory.

20 Real-World Data Cleaning Best Practices

  1. Standardize variable naming conventions
  2. Validate dates before conversion
  3. Remove duplicate business keys
  4. Track audit metadata
  5. Use reusable macros
  6. Avoid hardcoded logic
  7. Standardize categorical labels
  8. Validate numeric ranges
  9. Implement QC independence
  10. Preserve raw source datasets
  11. Log all transformations
  12. Validate email structures
  13. Standardize region codes
  14. Use metadata-driven mappings
  15. Separate raw and curated layers
  16. Implement production checkpoints
  17. Use defensive IF conditions
  18. Validate missing-value assumptions
  19. Automate reconciliation checks
  20. Document derivation logic thoroughly

Business Logic Behind Data Cleaning

Business logic is the foundation of analytical reliability. Missing values are often imputed because downstream models and reporting systems cannot process null structures consistently. For example, if a fruit price is missing, revenue summaries become distorted. In clinical trials, missing patient ages may exclude subjects from eligibility analysis.

Unrealistic values must also be corrected. Negative shipment costs or impossible patient ages create misleading analytical outcomes. Date standardization is equally critical because reporting timelines, treatment durations, and seasonal analyses depend heavily on accurate temporal calculations.

Text normalization ensures that “Tropical,” “tropical,” and “TROPICAL” are treated as identical business categories. Without normalization, aggregation logic fragments analytical summaries.

These corrections are not cosmetic. They directly influence executive decisions, AI predictions, operational KPIs, and regulatory submissions.

20 One-Line Insights

  1. Dirty data creates expensive business mistakes.
  2. Validation logic is stronger than visual inspection.
  3. Missing dates silently break forecasting models.
  4. Duplicate IDs inflate analytical metrics.
  5. Standardized variables improve reproducibility.
  6. Regulatory submissions demand traceability.
  7. Defensive programming prevents hidden failures.
  8. Audit readiness begins with metadata discipline.
  9. PROC SQL simplifies relational intelligence.
  10. DATA step excels in row-wise control.
  11. R accelerates exploratory transformations.
  12. SAS dominates governed enterprise reporting.
  13. Poor text normalization fragments dashboards.
  14. Controlled terminology improves aggregation quality.
  15. QC independence increases analytical trust.
  16. Macros improve scalability and consistency.
  17. Missing numeric values require special handling in SAS.
  18. Enterprise analytics depends on reproducible pipelines.
  19. Standardization improves machine-learning stability.
  20. Reliable analytics begins with reliable data.

SAS vs R Comparison

Capability

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Industry Standard

Growing

Flexibility

Moderate

Very High

Performance

Strong

Strong

Visualization

Moderate

Excellent

Governance

Excellent

Moderate

Macro Automation

Powerful

Functional

Exploratory Analysis

Moderate

Excellent

Production Stability

Excellent

Good

Open Source

No

Yes

SAS dominates highly regulated enterprise ecosystems because of auditability, reproducibility, and metadata control. R excels in exploratory analytics, statistical flexibility, and rapid prototyping. Together, they create an extremely powerful hybrid analytics framework.

Validation Checklist

1.Duplicate check completed
2.Date conversion validated
3.Missing-value review performed
4.Email structure verified
5.Region normalization applied
6.Numeric range validation completed
7.Metadata reviewed
8.QC reconciliation performed

Summary

Creating analysis-ready datasets is one of the most critical responsibilities in enterprise analytics, especially in industries such as clinical research, banking, insurance, retail, and healthcare. Raw operational data often contains serious quality issues including duplicate records, invalid dates, negative numeric values, inconsistent text formatting, missing variables, malformed emails, corrupted categorical values, and mixed data types. These problems can severely impact dashboards, statistical outputs, AI predictions, regulatory submissions, and executive decision-making.

SAS and R together provide a powerful ecosystem for solving these challenges. SAS is highly trusted in regulated industries because of its auditability, metadata governance, reproducibility, validation capabilities, and scalable production workflows. Features like DATA step processing, PROC SQL, PROC REPORT, PROC SORT, arrays, macros, and validation procedures help convert corrupted datasets into reliable analytical intelligence.

R complements SAS by providing flexible exploratory data analysis, advanced string processing, dynamic transformations, and modern packages such as tidyverse, dplyr, stringr, lubridate, and janitor. Functions like mutate(), case_when(), coalesce(), and parse_date_time() make data cleaning efficient and readable.

Together, SAS and R enable organizations to build trustworthy, scalable, production-grade analytical systems that improve reporting accuracy, regulatory compliance, operational intelligence, and enterprise decision-making reliability.

Conclusion

Modern analytics ecosystems depend entirely on structured, trustworthy, and analysis-ready data. Whether the business domain involves clinical trials, insurance claims, retail intelligence, banking fraud systems, or global fruit supply-chain analytics, poor-quality data introduces enormous operational and regulatory risk.

Duplicate identifiers distort reporting. Invalid dates destroy time-series analysis. Missing values silently corrupt statistical outputs. Inconsistent text formatting fragments dashboards. Malformed emails break compliance workflows. These issues are not theoretical they happen every day in enterprise production systems.

This is why SAS and R remain foundational technologies in enterprise data engineering. SAS provides unmatched governance, auditability, traceability, and production-grade reporting capabilities. Its DATA step architecture enables highly controlled transformation pipelines essential for regulated industries such as pharmaceuticals and banking. PROC SQL, PROC REPORT, PROC SUMMARY, and macro automation create scalable frameworks capable of supporting global operational intelligence.

R complements SAS by offering highly flexible exploratory analytics, modern string manipulation, advanced visualization ecosystems, and rapid transformation pipelines using tidyverse frameworks. Together, SAS and R create a hybrid analytics ecosystem that balances enterprise stability with analytical agility.

The future of analytics does not belong to organizations with the largest datasets. It belongs to organizations with the cleanest, most reliable, and most reproducible datasets.

Because in enterprise analytics, trustworthy insights begin long before dashboards they begin with disciplined data engineering.

Interview Questions and Answers

1. What is the difference between PROC SQL and DATA Step in SAS?

Answer:
DATA Step is primarily used for row-by-row processing and sequential logic, while PROC SQL is used for relational operations such as joins, aggregations, and subqueries. DATA Step performs better for complex row transformations, whereas PROC SQL is easier for database-style operations.

2. Why is LENGTH statement important in SAS?

Answer:
The LENGTH statement defines variable storage size before assignment. If LENGTH is declared after assigning values, SAS may truncate character variables because variable length is determined during compilation. Proper LENGTH placement prevents data loss and reporting inconsistencies.

3. What is the difference between WHERE and IF statements in SAS?

Answer:
WHERE filters observations before they enter the Program Data Vector (PDV), making it more efficient for large datasets. IF conditions are applied after observations are read into memory. WHERE cannot be used with newly created variables in the same DATA step.

4. Explain RETAIN statement in SAS?

Answer:
By default, SAS resets variable values to missing for every iteration. RETAIN preserves values across observations. It is commonly used for cumulative calculations, counters, and carrying forward previous values.

5. What are FIRST. and LAST. variables in SAS?

Answer:
FIRST.variable and LAST.variable are temporary variables created during BY-group processing. They help identify the first and last observation within grouped data and are widely used for summaries, counts, and accumulations.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent GLOBAL FRUITS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can Advanced SAS Programming Detect, Clean, and Optimize Grid Stability Data While Identifying Fraud Patterns?

2.Can Modern Art Data Explain What Actually Sells? – A Real-World SAS Analytics Project

3.Can Advanced R Programming Detect, Clean, and Optimize Public Library Data While Identifying Fraud Patterns?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS