Fruit Supply Chain Corruption Solved Through SAS & R Automation
Global Fruit Dataset Chaos into Analysis-Ready SAS and R Pipelines for Regulatory-Grade Reporting Excellence
Introduction
In
real-world analytics environments, dirty data is not just an inconvenience it
is a direct business risk. I have seen pharmaceutical companies delay clinical
submissions because patient visit dates were corrupted. I have seen insurance
firms reject valid claims because region codes were inconsistent. I have seen
banking fraud systems fail because transaction IDs were duplicated.
Now
imagine a global fruit-export intelligence platform collecting operational data
from multiple countries about rare fruits such as Durian, Rambutan, Mangosteen,
Miracle Fruit, Buddha’s Hand, Jabuticaba, Salak, and Ackee. The business wants
to combine retail intelligence, shipment quality, healthcare nutrition
research, and customer analytics into one centralized reporting system.
Unfortunately,
the incoming datasets are corrupted.
Some
fruit shipment IDs are duplicated.
Some supplier emails are malformed.
Some harvest dates are invalid.
Some fruit prices are negative.
Some region codes contain whitespace corruption.
Some category labels are inconsistent like:
- tropical
- Tropical
- TROPI CAL
- NULL
- ???
This type
of poor-quality data creates catastrophic downstream problems:
- Executive dashboards become
misleading
- AI prediction models learn
incorrect patterns
- Statistical summaries become
unreliable
- Regulatory reports fail
validation
- Clinical nutrition studies
lose traceability
- Enterprise reporting becomes
inconsistent
This is
where enterprise-grade SAS and R data engineering becomes critical.
Business Crisis Scenario
A
multinational nutrition-research organization launched a global study analyzing
rare fruits and their impact on metabolic health. The platform merged:
- Retail sales
- Clinical nutrition records
- Shipment data
- Supplier intelligence
- Regional inventory systems
After
deployment, executives discovered shocking inconsistencies:
- Duplicate Fruit IDs inflated
sales
- Invalid harvest dates broke
time-series forecasts
- Negative prices caused
financial losses in dashboards
- Missing regions corrupted
geographical analysis
- Incorrect supplier emails
blocked compliance alerts
- Mixed categorical labels
failed reporting aggregation
Even
worse, machine-learning models incorrectly predicted demand because corrupted
records distorted historical patterns.
This is
exactly why enterprise cleaning workflows are mandatory.
1.Raw SAS Dataset with Intentional Errors
data fruit_raw;
length Fruit_ID $8 Fruit_Name $25 Country $20 Category $20
Supplier_Email $50 Region_Code $12 Harvest_Date $20
Price $12 Rating $8;
infile datalines dlm='|' truncover;
input Fruit_ID $ Fruit_Name $ Country $ Category $
Harvest_Date $ Price $ Supplier_Email $ Region_Code $
Rating $;
datalines;
FR001|Durian|Thailand|Tropical|12JAN2025|450|durian@gmail.com| ap01 |4.5
FR001|durian|THAILAND|tropi cal|31FEB2025|-450|wrongmail.com|AP01|5
FR002|Mangosteen|India|NULL|15MAR2025|550|mango@test.com|IN01|4.8
FR003|Rambutan|Malaysia|Tropical|.|700|rambutan#mail.com|MY 02|abc
FR004|Ackee|Jamaica|Exotic|21APR2025|-100|ackee@gmail|JM01|3.7
FR005|MiracleFruit|Ghana|???|15MAY2025|850| miracle@gmail.com |GH01|4.9
FR006|Salak|Indonesia|Tropical|44DEC2025|620|salak@gmail.com|ID01|5.2
FR007|Jabuticaba|Brazil|Berry|17JUN2025|NULL|jab@gmail.com|BR01|4.4
FR008|BuddhasHand|China|Citrus|19JUL2025|710|buddha@mail.com| CN01|4.6
FR009|Cherimoya|Peru|tropical|20AUG2025|800|cherimoya@gmail.com|PE01|4.1
FR010|Langsat|Malaysia|TROPICAL|22SEP2025|780|langsat@gmail.com|MY01|4.0
FR011|Soursop|India|Herbal|23OCT2025|-900|soursopgmail.com|IN01|2.1
FR012|Feijoa|Brazil|Berry|24NOV2025|640|fei@gmail.com|BR01|4.2
FR013|Cupuacu|Brazil|NULL|25DEC2025|770|cupuacu@gmail.com|BR02|4.5
FR014|Pulasan|Malaysia|Tropical|15JAN2025|720|pulasan@gmail.com|MY03|4.7
FR015|SnakeFruit|Indonesia|Tropical|16FEB2025|680|snake@gmail.com|ID01|4.3
;
run;
proc print data = fruit_raw;
run;
OUTPUT:
| Obs | Fruit_ID | Fruit_Name | Country | Category | Supplier_Email | Region_Code | Harvest_Date | Price | Rating |
|---|---|---|---|---|---|---|---|---|---|
| 1 | FR001 | Durian | Thailand | Tropical | durian@gmail.com | ap01 | 12JAN2025 | 450 | 4.5 |
| 2 | FR001 | durian | THAILAND | tropi cal | wrongmail.com | AP01 | 31FEB2025 | -450 | 5 |
| 3 | FR002 | Mangosteen | India | NULL | mango@test.com | IN01 | 15MAR2025 | 550 | 4.8 |
| 4 | FR003 | Rambutan | Malaysia | Tropical | rambutan#mail.com | MY 02 | 700 | abc | |
| 5 | FR004 | Ackee | Jamaica | Exotic | ackee@gmail | JM01 | 21APR2025 | -100 | 3.7 |
| 6 | FR005 | MiracleFruit | Ghana | ??? | miracle@gmail.com | GH01 | 15MAY2025 | 850 | 4.9 |
| 7 | FR006 | Salak | Indonesia | Tropical | salak@gmail.com | ID01 | 44DEC2025 | 620 | 5.2 |
| 8 | FR007 | Jabuticaba | Brazil | Berry | jab@gmail.com | BR01 | 17JUN2025 | NULL | 4.4 |
| 9 | FR008 | BuddhasHand | China | Citrus | buddha@mail.com | CN01 | 19JUL2025 | 710 | 4.6 |
| 10 | FR009 | Cherimoya | Peru | tropical | cherimoya@gmail.com | PE01 | 20AUG2025 | 800 | 4.1 |
| 11 | FR010 | Langsat | Malaysia | TROPICAL | langsat@gmail.com | MY01 | 22SEP2025 | 780 | 4.0 |
| 12 | FR011 | Soursop | India | Herbal | soursopgmail.com | IN01 | 23OCT2025 | -900 | 2.1 |
| 13 | FR012 | Feijoa | Brazil | Berry | fei@gmail.com | BR01 | 24NOV2025 | 640 | 4.2 |
| 14 | FR013 | Cupuacu | Brazil | NULL | cupuacu@gmail.com | BR02 | 25DEC2025 | 770 | 4.5 |
| 15 | FR014 | Pulasan | Malaysia | Tropical | pulasan@gmail.com | MY03 | 15JAN2025 | 720 | 4.7 |
| 16 | FR015 | SnakeFruit | Indonesia | Tropical | snake@gmail.com | ID01 | 16FEB2025 | 680 | 4.3 |
Why LENGTH Must Come Before
Assignments
One of
the biggest SAS mistakes beginners make is placing the LENGTH statement after
assignments.
Incorrect:
Fruit_Status="Excellent";
length Fruit_Status $5;
This truncates
the value to:
Excel
because
SAS already assigned default length before LENGTH executed.
Correct:
length Fruit_Status $20;
Fruit_Status="Excellent";
In
enterprise clinical trials, truncation can destroy SDTM mapping integrity.
R behaves
differently because character vectors dynamically resize memory allocation.
SAS, however, fixes variable length at compile time.
That
difference is critical.
2.SAS Enterprise Cleaning Workflow
data fruit_clean;
retain Data_Source "GLOBAL_FRUIT_SYSTEM";
set fruit_raw;
length Clean_Category $20 Clean_Email $60 Parsed_Date 8;
format Parsed_Date yymmdd10.;
Fruit_Name=propcase(strip(Fruit_Name));
Country=upcase(compbl(strip(Country)));
Category=upcase(compress(Category));
if Category in ("NULL","???","") then
Clean_Category="UNKNOWN";
else Clean_Category=Category;
Price_Num=input(compress(Price),best12.);
if Price_Num < 0 then Price_Num=abs(Price_Num);
if missing(Price_Num) then
Price_Num=round(500 + rannor(2)*100,.01);
Parsed_Date=input(Harvest_Date,date9.);
if missing(Parsed_Date) then
Parsed_Date=today();
Clean_Email=strip(lowcase(Supplier_Email));
if index(Clean_Email,'@')=0 then
Clean_Email="invalid@email.com";
Region_Code=upcase(compress(Region_Code));
Rating_Num=input(Rating,best12.);
if Rating_Num > 5 then Rating_Num=5;
if Rating_Num=. then Rating_Num=3.5;
Days_To_Harvest=intck('day',today(),Parsed_Date);
drop Harvest_Date Price Rating Supplier_Email Category;
rename Parsed_Date = Harvest_Date
Price_Num = Price
Rating_Num = Rating
Clean_Email = Supplier_Email
Clean_Category = Category;
run;
proc print data = fruit_clean;
run;
OUTPUT:
| Obs | Data_Source | Fruit_ID | Fruit_Name | Country | Region_Code | Category | Supplier_Email | Harvest_Date | Price | Rating | Days_To_Harvest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRUIT_SYSTEM | FR001 | Durian | THAILAND | AP01 | TROPICAL | durian@gmail.com | 2025-01-12 | 450.00 | 4.5 | -510 |
| 2 | GLOBAL_FRUIT_SYSTEM | FR001 | Durian | THAILAND | AP01 | TROPICAL | invalid@email.com | 2026-06-06 | 450.00 | 5.0 | 0 |
| 3 | GLOBAL_FRUIT_SYSTEM | FR002 | Mangosteen | INDIA | IN01 | UNKNOWN | mango@test.com | 2025-03-15 | 550.00 | 4.8 | -448 |
| 4 | GLOBAL_FRUIT_SYSTEM | FR003 | Rambutan | MALAYSIA | MY02 | TROPICAL | invalid@email.com | 2026-06-06 | 700.00 | 3.5 | 0 |
| 5 | GLOBAL_FRUIT_SYSTEM | FR004 | Ackee | JAMAICA | JM01 | EXOTIC | ackee@gmail | 2025-04-21 | 100.00 | 3.7 | -411 |
| 6 | GLOBAL_FRUIT_SYSTEM | FR005 | Miraclefruit | GHANA | GH01 | UNKNOWN | miracle@gmail.com | 2025-05-15 | 850.00 | 4.9 | -387 |
| 7 | GLOBAL_FRUIT_SYSTEM | FR006 | Salak | INDONESIA | ID01 | TROPICAL | salak@gmail.com | 2026-06-06 | 620.00 | 5.0 | 0 |
| 8 | GLOBAL_FRUIT_SYSTEM | FR007 | Jabuticaba | BRAZIL | BR01 | BERRY | jab@gmail.com | 2025-06-17 | 631.18 | 4.4 | -354 |
| 9 | GLOBAL_FRUIT_SYSTEM | FR008 | Buddhashand | CHINA | CN01 | CITRUS | buddha@mail.com | 2025-07-19 | 710.00 | 4.6 | -322 |
| 10 | GLOBAL_FRUIT_SYSTEM | FR009 | Cherimoya | PERU | PE01 | TROPICAL | cherimoya@gmail.com | 2025-08-20 | 800.00 | 4.1 | -290 |
| 11 | GLOBAL_FRUIT_SYSTEM | FR010 | Langsat | MALAYSIA | MY01 | TROPICAL | langsat@gmail.com | 2025-09-22 | 780.00 | 4.0 | -257 |
| 12 | GLOBAL_FRUIT_SYSTEM | FR011 | Soursop | INDIA | IN01 | HERBAL | invalid@email.com | 2025-10-23 | 900.00 | 2.1 | -226 |
| 13 | GLOBAL_FRUIT_SYSTEM | FR012 | Feijoa | BRAZIL | BR01 | BERRY | fei@gmail.com | 2025-11-24 | 640.00 | 4.2 | -194 |
| 14 | GLOBAL_FRUIT_SYSTEM | FR013 | Cupuacu | BRAZIL | BR02 | UNKNOWN | cupuacu@gmail.com | 2025-12-25 | 770.00 | 4.5 | -163 |
| 15 | GLOBAL_FRUIT_SYSTEM | FR014 | Pulasan | MALAYSIA | MY03 | TROPICAL | pulasan@gmail.com | 2025-01-15 | 720.00 | 4.7 | -507 |
| 16 | GLOBAL_FRUIT_SYSTEM | FR015 | Snakefruit | INDONESIA | ID01 | TROPICAL | snake@gmail.com | 2025-02-16 | 680.00 | 4.3 | -475 |
Explanation and Key Points
This DATA
step demonstrates real-world production-grade cleaning logic. We used:
- RETAIN for metadata
persistence
- PROPCASE for standardized
names
- COMPRESS for whitespace
corruption
- INPUT for
character-to-numeric conversion
- ABS for correcting negative
values
- INDEX for email validation
- INTCK for temporal analytics
- ROUND for financial
precision
In
regulated clinical environments, similar logic is applied to patient-level SDTM
and ADaM derivations. Improper conversions can create FDA submission failures.
This step converts chaotic operational intelligence into analysis-ready structured
data.
3.PROC SORT Deduplication
proc sort data=fruit_clean nodupkey out=fruit_nodup;
by Fruit_ID;
run;
proc print data = fruit_nodup;
run;
LOG:
OUTPUT:
| Obs | Data_Source | Fruit_ID | Fruit_Name | Country | Region_Code | Category | Supplier_Email | Harvest_Date | Price | Rating | Days_To_Harvest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRUIT_SYSTEM | FR001 | Durian | THAILAND | AP01 | TROPICAL | durian@gmail.com | 2025-01-12 | 450.00 | 4.5 | -510 |
| 2 | GLOBAL_FRUIT_SYSTEM | FR002 | Mangosteen | INDIA | IN01 | UNKNOWN | mango@test.com | 2025-03-15 | 550.00 | 4.8 | -448 |
| 3 | GLOBAL_FRUIT_SYSTEM | FR003 | Rambutan | MALAYSIA | MY02 | TROPICAL | invalid@email.com | 2026-06-06 | 700.00 | 3.5 | 0 |
| 4 | GLOBAL_FRUIT_SYSTEM | FR004 | Ackee | JAMAICA | JM01 | EXOTIC | ackee@gmail | 2025-04-21 | 100.00 | 3.7 | -411 |
| 5 | GLOBAL_FRUIT_SYSTEM | FR005 | Miraclefruit | GHANA | GH01 | UNKNOWN | miracle@gmail.com | 2025-05-15 | 850.00 | 4.9 | -387 |
| 6 | GLOBAL_FRUIT_SYSTEM | FR006 | Salak | INDONESIA | ID01 | TROPICAL | salak@gmail.com | 2026-06-06 | 620.00 | 5.0 | 0 |
| 7 | GLOBAL_FRUIT_SYSTEM | FR007 | Jabuticaba | BRAZIL | BR01 | BERRY | jab@gmail.com | 2025-06-17 | 631.18 | 4.4 | -354 |
| 8 | GLOBAL_FRUIT_SYSTEM | FR008 | Buddhashand | CHINA | CN01 | CITRUS | buddha@mail.com | 2025-07-19 | 710.00 | 4.6 | -322 |
| 9 | GLOBAL_FRUIT_SYSTEM | FR009 | Cherimoya | PERU | PE01 | TROPICAL | cherimoya@gmail.com | 2025-08-20 | 800.00 | 4.1 | -290 |
| 10 | GLOBAL_FRUIT_SYSTEM | FR010 | Langsat | MALAYSIA | MY01 | TROPICAL | langsat@gmail.com | 2025-09-22 | 780.00 | 4.0 | -257 |
| 11 | GLOBAL_FRUIT_SYSTEM | FR011 | Soursop | INDIA | IN01 | HERBAL | invalid@email.com | 2025-10-23 | 900.00 | 2.1 | -226 |
| 12 | GLOBAL_FRUIT_SYSTEM | FR012 | Feijoa | BRAZIL | BR01 | BERRY | fei@gmail.com | 2025-11-24 | 640.00 | 4.2 | -194 |
| 13 | GLOBAL_FRUIT_SYSTEM | FR013 | Cupuacu | BRAZIL | BR02 | UNKNOWN | cupuacu@gmail.com | 2025-12-25 | 770.00 | 4.5 | -163 |
| 14 | GLOBAL_FRUIT_SYSTEM | FR014 | Pulasan | MALAYSIA | MY03 | TROPICAL | pulasan@gmail.com | 2025-01-15 | 720.00 | 4.7 | -507 |
| 15 | GLOBAL_FRUIT_SYSTEM | FR015 | Snakefruit | INDONESIA | ID01 | TROPICAL | snake@gmail.com | 2025-02-16 | 680.00 | 4.3 | -475 |
Explanation and Key Points
PROC SORT
NODUPKEY removes duplicate business keys. Duplicate IDs can inflate revenue
calculations, patient counts, or insurance claims. In enterprise analytics,
deduplication is one of the first QC checkpoints. Clinical studies particularly
require unique subject identifiers for traceability and regulatory integrity.
4.PROC SQL Validation
proc sql;
create table fruit_summary as
select Country,Category,
count(*) as Total_Records,
avg(Price) as Avg_Price,
avg(Rating) as Avg_Rating
from fruit_nodup
group by Country,Category;
quit;
proc print data = fruit_summary;
run;
OUTPUT:
| Obs | Country | Category | Total_Records | Avg_Price | Avg_Rating |
|---|---|---|---|---|---|
| 1 | BRAZIL | BERRY | 2 | 635.590 | 4.30000 |
| 2 | BRAZIL | UNKNOWN | 1 | 770.000 | 4.50000 |
| 3 | CHINA | CITRUS | 1 | 710.000 | 4.60000 |
| 4 | GHANA | UNKNOWN | 1 | 850.000 | 4.90000 |
| 5 | INDIA | HERBAL | 1 | 900.000 | 2.10000 |
| 6 | INDIA | UNKNOWN | 1 | 550.000 | 4.80000 |
| 7 | INDONESIA | TROPICAL | 2 | 650.000 | 4.65000 |
| 8 | JAMAICA | EXOTIC | 1 | 100.000 | 3.70000 |
| 9 | MALAYSIA | TROPICAL | 3 | 733.333 | 4.06667 |
| 10 | PERU | TROPICAL | 1 | 800.000 | 4.10000 |
| 11 | THAILAND | TROPICAL | 1 | 450.000 | 4.50000 |
Explanation and Key Points
PROC SQL
provides relational-style aggregation logic similar to enterprise databases.
This workflow helps:
- summarize operational
intelligence
- identify country-level
anomalies
- support executive dashboards
- create analytical reporting
layers
Compared
to DATA step BY-group processing, PROC SQL is often easier for multidimensional
aggregations and joins.
Advanced SAS Techniques
5.ARRAY Cleaning
data fruit_array_clean;
set fruit_nodup;
array chars(*) Fruit_Name Country Category;
do i=1 to dim(chars);
chars(i)=strip(upcase(chars(i)));
end;
drop i;
run;
proc print data = fruit_array_clean;
run;
OUTPUT:
| Obs | Data_Source | Fruit_ID | Fruit_Name | Country | Region_Code | Category | Supplier_Email | Harvest_Date | Price | Rating | Days_To_Harvest |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRUIT_SYSTEM | FR001 | DURIAN | THAILAND | AP01 | TROPICAL | durian@gmail.com | 2025-01-12 | 450.00 | 4.5 | -510 |
| 2 | GLOBAL_FRUIT_SYSTEM | FR002 | MANGOSTEEN | INDIA | IN01 | UNKNOWN | mango@test.com | 2025-03-15 | 550.00 | 4.8 | -448 |
| 3 | GLOBAL_FRUIT_SYSTEM | FR003 | RAMBUTAN | MALAYSIA | MY02 | TROPICAL | invalid@email.com | 2026-06-06 | 700.00 | 3.5 | 0 |
| 4 | GLOBAL_FRUIT_SYSTEM | FR004 | ACKEE | JAMAICA | JM01 | EXOTIC | ackee@gmail | 2025-04-21 | 100.00 | 3.7 | -411 |
| 5 | GLOBAL_FRUIT_SYSTEM | FR005 | MIRACLEFRUIT | GHANA | GH01 | UNKNOWN | miracle@gmail.com | 2025-05-15 | 850.00 | 4.9 | -387 |
| 6 | GLOBAL_FRUIT_SYSTEM | FR006 | SALAK | INDONESIA | ID01 | TROPICAL | salak@gmail.com | 2026-06-06 | 620.00 | 5.0 | 0 |
| 7 | GLOBAL_FRUIT_SYSTEM | FR007 | JABUTICABA | BRAZIL | BR01 | BERRY | jab@gmail.com | 2025-06-17 | 631.18 | 4.4 | -354 |
| 8 | GLOBAL_FRUIT_SYSTEM | FR008 | BUDDHASHAND | CHINA | CN01 | CITRUS | buddha@mail.com | 2025-07-19 | 710.00 | 4.6 | -322 |
| 9 | GLOBAL_FRUIT_SYSTEM | FR009 | CHERIMOYA | PERU | PE01 | TROPICAL | cherimoya@gmail.com | 2025-08-20 | 800.00 | 4.1 | -290 |
| 10 | GLOBAL_FRUIT_SYSTEM | FR010 | LANGSAT | MALAYSIA | MY01 | TROPICAL | langsat@gmail.com | 2025-09-22 | 780.00 | 4.0 | -257 |
| 11 | GLOBAL_FRUIT_SYSTEM | FR011 | SOURSOP | INDIA | IN01 | HERBAL | invalid@email.com | 2025-10-23 | 900.00 | 2.1 | -226 |
| 12 | GLOBAL_FRUIT_SYSTEM | FR012 | FEIJOA | BRAZIL | BR01 | BERRY | fei@gmail.com | 2025-11-24 | 640.00 | 4.2 | -194 |
| 13 | GLOBAL_FRUIT_SYSTEM | FR013 | CUPUACU | BRAZIL | BR02 | UNKNOWN | cupuacu@gmail.com | 2025-12-25 | 770.00 | 4.5 | -163 |
| 14 | GLOBAL_FRUIT_SYSTEM | FR014 | PULASAN | MALAYSIA | MY03 | TROPICAL | pulasan@gmail.com | 2025-01-15 | 720.00 | 4.7 | -507 |
| 15 | GLOBAL_FRUIT_SYSTEM | FR015 | SNAKEFRUIT | INDONESIA | ID01 | TROPICAL | snake@gmail.com | 2025-02-16 | 680.00 | 4.3 | -475 |
Explanation
This
program demonstrates one of the most powerful DATA Step techniques in SAS
called ARRAY processing. Arrays allow programmers to apply the same
logic to multiple variables efficiently without repeatedly writing individual
statements.
In this
example, the array named chars(*) contains three character variables:
- Fruit_Name
- Country
- Clean_Category
The (*)
tells SAS to automatically determine the number of variables inside the array.
The DO
loop:
do i = 1 to dim(chars);
iterates
through each variable in the array dynamically. The DIM() function returns the
total number of variables inside the array.
Inside
the loop:
chars(i) =
strip(upcase(chars(i)));
two
cleaning operations occur:
- UPCASE() converts text to
uppercase
- STRIP() removes leading and
trailing spaces
For
example:
|
Before
Cleaning |
After
Cleaning |
|
durian |
DURIAN |
|
thailand |
THAILAND |
|
tropical |
TROPICAL |
Finally:
drop i;
removes
the temporary loop counter variable from the final dataset.
Key Points
- Arrays reduce repetitive
coding
- Useful for bulk variable
transformations
- Improves maintainability in
enterprise projects
- Frequently used in SDTM and
ADaM standardization
- DIM() makes loops dynamic
and scalable
- Excellent for cleaning large
clinical datasets
6.SELECT-WHEN Logic
data fruit_segment;
set fruit_array_clean;
Length Segment $12;
select;
when (Price >=800) Segment="PREMIUM";
when (Price >=600) Segment="STANDARD";
otherwise Segment="BASIC";
end;
run;
proc print data = fruit_segment;
run;
OUTPUT:
| Obs | Data_Source | Fruit_ID | Fruit_Name | Country | Region_Code | Category | Supplier_Email | Harvest_Date | Price | Rating | Days_To_Harvest | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRUIT_SYSTEM | FR001 | DURIAN | THAILAND | AP01 | TROPICAL | durian@gmail.com | 2025-01-12 | 450.00 | 4.5 | -510 | BASIC |
| 2 | GLOBAL_FRUIT_SYSTEM | FR002 | MANGOSTEEN | INDIA | IN01 | UNKNOWN | mango@test.com | 2025-03-15 | 550.00 | 4.8 | -448 | BASIC |
| 3 | GLOBAL_FRUIT_SYSTEM | FR003 | RAMBUTAN | MALAYSIA | MY02 | TROPICAL | invalid@email.com | 2026-06-06 | 700.00 | 3.5 | 0 | STANDARD |
| 4 | GLOBAL_FRUIT_SYSTEM | FR004 | ACKEE | JAMAICA | JM01 | EXOTIC | ackee@gmail | 2025-04-21 | 100.00 | 3.7 | -411 | BASIC |
| 5 | GLOBAL_FRUIT_SYSTEM | FR005 | MIRACLEFRUIT | GHANA | GH01 | UNKNOWN | miracle@gmail.com | 2025-05-15 | 850.00 | 4.9 | -387 | PREMIUM |
| 6 | GLOBAL_FRUIT_SYSTEM | FR006 | SALAK | INDONESIA | ID01 | TROPICAL | salak@gmail.com | 2026-06-06 | 620.00 | 5.0 | 0 | STANDARD |
| 7 | GLOBAL_FRUIT_SYSTEM | FR007 | JABUTICABA | BRAZIL | BR01 | BERRY | jab@gmail.com | 2025-06-17 | 631.18 | 4.4 | -354 | STANDARD |
| 8 | GLOBAL_FRUIT_SYSTEM | FR008 | BUDDHASHAND | CHINA | CN01 | CITRUS | buddha@mail.com | 2025-07-19 | 710.00 | 4.6 | -322 | STANDARD |
| 9 | GLOBAL_FRUIT_SYSTEM | FR009 | CHERIMOYA | PERU | PE01 | TROPICAL | cherimoya@gmail.com | 2025-08-20 | 800.00 | 4.1 | -290 | PREMIUM |
| 10 | GLOBAL_FRUIT_SYSTEM | FR010 | LANGSAT | MALAYSIA | MY01 | TROPICAL | langsat@gmail.com | 2025-09-22 | 780.00 | 4.0 | -257 | STANDARD |
| 11 | GLOBAL_FRUIT_SYSTEM | FR011 | SOURSOP | INDIA | IN01 | HERBAL | invalid@email.com | 2025-10-23 | 900.00 | 2.1 | -226 | PREMIUM |
| 12 | GLOBAL_FRUIT_SYSTEM | FR012 | FEIJOA | BRAZIL | BR01 | BERRY | fei@gmail.com | 2025-11-24 | 640.00 | 4.2 | -194 | STANDARD |
| 13 | GLOBAL_FRUIT_SYSTEM | FR013 | CUPUACU | BRAZIL | BR02 | UNKNOWN | cupuacu@gmail.com | 2025-12-25 | 770.00 | 4.5 | -163 | STANDARD |
| 14 | GLOBAL_FRUIT_SYSTEM | FR014 | PULASAN | MALAYSIA | MY03 | TROPICAL | pulasan@gmail.com | 2025-01-15 | 720.00 | 4.7 | -507 | STANDARD |
| 15 | GLOBAL_FRUIT_SYSTEM | FR015 | SNAKEFRUIT | INDONESIA | ID01 | TROPICAL | snake@gmail.com | 2025-02-16 | 680.00 | 4.3 | -475 | STANDARD |
Explanation
This program demonstrates the SELECT-WHEN
statement in SAS, which is similar to:
- IF-ELSE
logic
- CASE
statements in SQL
- case_when()
in R
The purpose of this code is to categorize fruits into business segments
based on their price.
Step-by-Step Logic
Condition
1
when (Price_Num >= 800)
If the fruit price is greater than or equal to 800, SAS assigns:
Segment = "PREMIUM";
Condition
2
when (Price_Num >= 600)
If the first condition fails but the price is at least 600, SAS assigns:
Segment = "STANDARD";
Otherwise
Condition
OtherwiseSegment = "BASIC";
If none of the above conditions are true, the fruit is classified as BASIC.
Example Output
|
Fruit_Name |
Price_Num |
Segment |
|
DURIAN |
850 |
PREMIUM |
|
RAMBUTAN |
650 |
STANDARD |
|
ACKEE |
300 |
BASIC |
Why SELECT-WHEN is Preferred
Compared to long IF-ELSE chains, SELECT-WHEN:
- improves
readability
- simplifies
debugging
- provides
cleaner business-rule implementation
- is easier
to maintain in production systems
Key Points
- Cleaner
alternative to multiple IF statements
- Excellent
for business categorization
- Common in
risk scoring systems
- Used in
banking, insurance, and clinical classification logic
- Easier to
validate during QC review
7.FIRST./LAST. Processing
proc sort data=fruit_segment;
by Country;
run;
proc print data = fruit_segment;
run;
OUTPUT:
| Obs | Data_Source | Fruit_ID | Fruit_Name | Country | Region_Code | Category | Supplier_Email | Harvest_Date | Price | Rating | Days_To_Harvest | Segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRUIT_SYSTEM | FR007 | JABUTICABA | BRAZIL | BR01 | BERRY | jab@gmail.com | 2025-06-17 | 631.18 | 4.4 | -354 | STANDARD |
| 2 | GLOBAL_FRUIT_SYSTEM | FR012 | FEIJOA | BRAZIL | BR01 | BERRY | fei@gmail.com | 2025-11-24 | 640.00 | 4.2 | -194 | STANDARD |
| 3 | GLOBAL_FRUIT_SYSTEM | FR013 | CUPUACU | BRAZIL | BR02 | UNKNOWN | cupuacu@gmail.com | 2025-12-25 | 770.00 | 4.5 | -163 | STANDARD |
| 4 | GLOBAL_FRUIT_SYSTEM | FR008 | BUDDHASHAND | CHINA | CN01 | CITRUS | buddha@mail.com | 2025-07-19 | 710.00 | 4.6 | -322 | STANDARD |
| 5 | GLOBAL_FRUIT_SYSTEM | FR005 | MIRACLEFRUIT | GHANA | GH01 | UNKNOWN | miracle@gmail.com | 2025-05-15 | 850.00 | 4.9 | -387 | PREMIUM |
| 6 | GLOBAL_FRUIT_SYSTEM | FR002 | MANGOSTEEN | INDIA | IN01 | UNKNOWN | mango@test.com | 2025-03-15 | 550.00 | 4.8 | -448 | BASIC |
| 7 | GLOBAL_FRUIT_SYSTEM | FR011 | SOURSOP | INDIA | IN01 | HERBAL | invalid@email.com | 2025-10-23 | 900.00 | 2.1 | -226 | PREMIUM |
| 8 | GLOBAL_FRUIT_SYSTEM | FR006 | SALAK | INDONESIA | ID01 | TROPICAL | salak@gmail.com | 2026-06-06 | 620.00 | 5.0 | 0 | STANDARD |
| 9 | GLOBAL_FRUIT_SYSTEM | FR015 | SNAKEFRUIT | INDONESIA | ID01 | TROPICAL | snake@gmail.com | 2025-02-16 | 680.00 | 4.3 | -475 | STANDARD |
| 10 | GLOBAL_FRUIT_SYSTEM | FR004 | ACKEE | JAMAICA | JM01 | EXOTIC | ackee@gmail | 2025-04-21 | 100.00 | 3.7 | -411 | BASIC |
| 11 | GLOBAL_FRUIT_SYSTEM | FR003 | RAMBUTAN | MALAYSIA | MY02 | TROPICAL | invalid@email.com | 2026-06-06 | 700.00 | 3.5 | 0 | STANDARD |
| 12 | GLOBAL_FRUIT_SYSTEM | FR010 | LANGSAT | MALAYSIA | MY01 | TROPICAL | langsat@gmail.com | 2025-09-22 | 780.00 | 4.0 | -257 | STANDARD |
| 13 | GLOBAL_FRUIT_SYSTEM | FR014 | PULASAN | MALAYSIA | MY03 | TROPICAL | pulasan@gmail.com | 2025-01-15 | 720.00 | 4.7 | -507 | STANDARD |
| 14 | GLOBAL_FRUIT_SYSTEM | FR009 | CHERIMOYA | PERU | PE01 | TROPICAL | cherimoya@gmail.com | 2025-08-20 | 800.00 | 4.1 | -290 | PREMIUM |
| 15 | GLOBAL_FRUIT_SYSTEM | FR001 | DURIAN | THAILAND | AP01 | TROPICAL | durian@gmail.com | 2025-01-12 | 450.00 | 4.5 | -510 | BASIC |
data country_counts;
set fruit_segment;
by Country;
retain Count;
if first.Country then Count=0;
Count+1;
if last.Country;
run;
proc print data = country_counts;
run;
OUTPUT:
| Obs | Data_Source | Fruit_ID | Fruit_Name | Country | Region_Code | Category | Supplier_Email | Harvest_Date | Price | Rating | Days_To_Harvest | Segment | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLOBAL_FRUIT_SYSTEM | FR013 | CUPUACU | BRAZIL | BR02 | UNKNOWN | cupuacu@gmail.com | 2025-12-25 | 770 | 4.5 | -163 | STANDARD | 3 |
| 2 | GLOBAL_FRUIT_SYSTEM | FR008 | BUDDHASHAND | CHINA | CN01 | CITRUS | buddha@mail.com | 2025-07-19 | 710 | 4.6 | -322 | STANDARD | 1 |
| 3 | GLOBAL_FRUIT_SYSTEM | FR005 | MIRACLEFRUIT | GHANA | GH01 | UNKNOWN | miracle@gmail.com | 2025-05-15 | 850 | 4.9 | -387 | PREMIUM | 1 |
| 4 | GLOBAL_FRUIT_SYSTEM | FR011 | SOURSOP | INDIA | IN01 | HERBAL | invalid@email.com | 2025-10-23 | 900 | 2.1 | -226 | PREMIUM | 2 |
| 5 | GLOBAL_FRUIT_SYSTEM | FR015 | SNAKEFRUIT | INDONESIA | ID01 | TROPICAL | snake@gmail.com | 2025-02-16 | 680 | 4.3 | -475 | STANDARD | 2 |
| 6 | GLOBAL_FRUIT_SYSTEM | FR004 | ACKEE | JAMAICA | JM01 | EXOTIC | ackee@gmail | 2025-04-21 | 100 | 3.7 | -411 | BASIC | 1 |
| 7 | GLOBAL_FRUIT_SYSTEM | FR014 | PULASAN | MALAYSIA | MY03 | TROPICAL | pulasan@gmail.com | 2025-01-15 | 720 | 4.7 | -507 | STANDARD | 3 |
| 8 | GLOBAL_FRUIT_SYSTEM | FR009 | CHERIMOYA | PERU | PE01 | TROPICAL | cherimoya@gmail.com | 2025-08-20 | 800 | 4.1 | -290 | PREMIUM | 1 |
| 9 | GLOBAL_FRUIT_SYSTEM | FR001 | DURIAN | THAILAND | AP01 | TROPICAL | durian@gmail.com | 2025-01-12 | 450 | 4.5 | -510 | BASIC | 1 |
Explanation
This
program demonstrates one of the most important SAS concepts called BY-group
processing using:
- FIRST.variable
- LAST.variable
These are
temporary automatic variables created by SAS whenever a dataset is processed
with a BY statement.
Step 1.Sorting Requirement
proc sort data = fruit_segment;
by Country;
run;
Before
BY-group processing, datasets must be sorted by the grouping variable.
Here,
records are sorted by:
Country
Without
sorting, SAS generates incorrect BY-group behavior.
Step 2.RETAIN Statement
retain Count;
Normally
SAS resets variables to missing for each observation.
RETAIN
preserves the value across rows.
This
allows cumulative counting.
Step 3.FIRST.Country Logic
if first.Country then Count = 0;
When SAS
encounters the first record of a new country group:
- FIRST.Country = 1
- Counter resets to zero
Example:
|
Country |
FIRST.Country |
|
INDIA |
1 |
|
INDIA |
0 |
|
MALAYSIA |
1 |
Step 4.Increment Counter
Count + 1;
This
cumulative statement increments the counter automatically.
Step 5.LAST.Country Logic
if last.Country;
This
keeps only the final observation for each country group.
At that
point, Count contains the total number of records for that country.
Example Output
|
Country |
Count |
|
BRAZIL |
3 |
|
INDIA |
2 |
|
MALAYSIA |
4 |
Key Points
- FIRST./LAST. variables are
temporary automatic variables
- Requires sorted data
- RETAIN preserves values
across observations
- Commonly used for grouped
summaries
- Faster than some SQL
aggregations
- Essential in clinical trial
derivations
8.PROC REPORT for Professional Outputs
proc report data=fruit_segment nowd;
column Country Fruit_Name Price Rating Segment;
define Country / group;
define Fruit_Name / display;
define Price / analysis mean;
define Rating / analysis mean;
run;
| Country | Fruit_Name | Price | Rating | Segment |
|---|---|---|---|---|
| BRAZIL | JABUTICABA | 631.18 | 4.4 | STANDARD |
| FEIJOA | 640 | 4.2 | STANDARD | |
| CUPUACU | 770 | 4.5 | STANDARD | |
| CHINA | BUDDHASHAND | 710 | 4.6 | STANDARD |
| GHANA | MIRACLEFRUIT | 850 | 4.9 | PREMIUM |
| INDIA | MANGOSTEEN | 550 | 4.8 | BASIC |
| SOURSOP | 900 | 2.1 | PREMIUM | |
| INDONESIA | SALAK | 620 | 5 | STANDARD |
| SNAKEFRUIT | 680 | 4.3 | STANDARD | |
| JAMAICA | ACKEE | 100 | 3.7 | BASIC |
| MALAYSIA | RAMBUTAN | 700 | 3.5 | STANDARD |
| LANGSAT | 780 | 4 | STANDARD | |
| PULASAN | 720 | 4.7 | STANDARD | |
| PERU | CHERIMOYA | 800 | 4.1 | PREMIUM |
| THAILAND | DURIAN | 450 | 4.5 | BASIC |
Explanation and Key Points
PROC
REPORT is widely used in pharmaceutical TLF generation. It creates professional
tabular outputs for executive reporting, compliance reporting, and operational
intelligence dashboards. Unlike PROC PRINT, it provides advanced formatting and
grouping flexibility.
9.R Data Cleaning Workflow
library(tidyverse)
library(lubridate)
library(janitor)
fruit_raw <- read.delim(
text="
FR001|Durian|Thailand|Tropical|12JAN2025|450|durian@gmail.com| ap01 |4.5
FR001|durian|THAILAND|tropi cal|31FEB2025|-450|wrongmail.com|AP01|5
FR002|Mangosteen|India|NULL|15MAR2025|550|mango@test.com|IN01|4.8
",
sep="|",
header=FALSE
)
OUTPUT:
|
|
V1 |
V2 |
V3 |
V4 |
V5 |
V6 |
V7 |
V8 |
V9 |
|
1 |
FR001 |
Durian |
Thailand |
Tropical |
12-Jan-2025 |
450 |
durian@gmail.com |
ap01 |
4.5 |
|
2 |
FR001 |
durian |
THAILAND |
tropi cal |
31FEB2025 |
-450 |
wrongmail.com |
AP01 |
5 |
|
3 |
FR002 |
Mangosteen |
India |
NULL |
15-Mar-2025 |
550 |
mango@test.com |
IN01 |
4.8 |
colnames(fruit_raw) <- c(
"fruit_id","fruit_name","country","category",
"harvest_date","price","email",
"region_code","rating"
)
OUTPUT:
|
|
fruit_id |
fruit_name |
country |
category |
harvest_date |
price |
email |
region_code |
rating |
|
1 |
FR001 |
Durian |
Thailand |
Tropical |
12-Jan-2025 |
450 |
durian@gmail.com |
ap01 |
4.5 |
|
2 |
FR001 |
durian |
THAILAND |
tropi cal |
31FEB2025 |
-450 |
wrongmail.com |
AP01 |
5 |
|
3 |
FR002 |
Mangosteen |
India |
NULL |
15-Mar-2025 |
550 |
mango@test.com |
IN01 |
4.8 |
fruit_clean <- fruit_raw %>%
clean_names() %>%
mutate(
fruit_name = str_to_title(str_trim(fruit_name)),
country = str_to_upper(str_trim(country)),
category = str_replace_all(category," ",""),
category = case_when(
category %in% c("NULL","") ~ "UNKNOWN",
TRUE ~ category),
price = abs(as.numeric(price)),
email = if_else(grepl("@",email),
email,"invalid@email.com"),
region_code = str_to_upper(str_trim(region_code)),
rating = coalesce(as.numeric(rating),3.5),
parsed_date = suppressWarnings(
parse_date_time(harvest_date,
orders="dby")
),
date_flag =if_else(is.na(parsed_date),
"INVALID_DATE","VALID_DATE"
)
)
OUTPUT:
|
|
fruit_id |
fruit_name |
country |
category |
harvest_date |
price |
email |
region_code |
rating |
parsed_date |
date_flag |
|
1 |
FR001 |
Durian |
THAILAND |
Tropical |
12JAN2025 |
450 |
durian@gmail.com |
AP01 |
4.5 |
12-01-2025 |
VALID_DATE |
|
2 |
FR001 |
Durian |
THAILAND |
tropical |
31FEB2025 |
450 |
invalid@email.com |
AP01 |
5 |
NA |
INVALID_DATE |
|
3 |
FR002 |
Mangosteen |
INDIA |
UNKNOWN |
15MAR20 25 |
550 |
mango@test.com |
IN01 |
4.8 |
15-03-2025 |
VALID_DATE |
Explanation and Key Points
This R
pipeline mirrors SAS cleaning logic using tidyverse functions. Key parallels
include:
|
SAS |
R |
|
PROPCASE |
str_to_title |
|
COMPRESS |
str_replace_all |
|
INPUT |
as.numeric |
|
IF-THEN |
case_when |
|
COALESCEC |
coalesce |
|
STRIP |
str_trim |
R
provides exceptional flexibility for exploratory data engineering, while SAS
excels in governed enterprise environments.
Enterprise Validation &
Compliance
In SDTM
and ADaM environments, every transformation must be traceable.
Key
enterprise requirements include:
- Audit trails
- QC independence
- Metadata governance
- Reproducibility
- Variable lineage
- Controlled terminology
- Validation traceability
One
dangerous SAS behavior is:
if Age < 18 then delete;
Because
SAS treats missing numeric values as smaller than valid numbers, records with
missing age also get deleted unintentionally.
This can
create catastrophic analytical bias in clinical trials.
Defensive
programming is mandatory.
20 Real-World Data Cleaning
Best Practices
- Standardize variable naming
conventions
- Validate dates before
conversion
- Remove duplicate business
keys
- Track audit metadata
- Use reusable macros
- Avoid hardcoded logic
- Standardize categorical
labels
- Validate numeric ranges
- Implement QC independence
- Preserve raw source datasets
- Log all transformations
- Validate email structures
- Standardize region codes
- Use metadata-driven mappings
- Separate raw and curated
layers
- Implement production
checkpoints
- Use defensive IF conditions
- Validate missing-value
assumptions
- Automate reconciliation
checks
- Document derivation logic
thoroughly
Business Logic Behind Data
Cleaning
Business
logic is the foundation of analytical reliability. Missing values are often
imputed because downstream models and reporting systems cannot process null
structures consistently. For example, if a fruit price is missing, revenue
summaries become distorted. In clinical trials, missing patient ages may
exclude subjects from eligibility analysis.
Unrealistic
values must also be corrected. Negative shipment costs or impossible patient
ages create misleading analytical outcomes. Date standardization is equally
critical because reporting timelines, treatment durations, and seasonal
analyses depend heavily on accurate temporal calculations.
Text
normalization ensures that “Tropical,” “tropical,” and “TROPICAL” are treated
as identical business categories. Without normalization, aggregation logic
fragments analytical summaries.
These
corrections are not cosmetic. They directly influence executive decisions, AI
predictions, operational KPIs, and regulatory submissions.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Missing dates silently break
forecasting models.
- Duplicate IDs inflate
analytical metrics.
- Standardized variables
improve reproducibility.
- Regulatory submissions
demand traceability.
- Defensive programming
prevents hidden failures.
- Audit readiness begins with
metadata discipline.
- PROC SQL simplifies
relational intelligence.
- DATA step excels in row-wise
control.
- R accelerates exploratory
transformations.
- SAS dominates governed
enterprise reporting.
- Poor text normalization
fragments dashboards.
- Controlled terminology
improves aggregation quality.
- QC independence increases
analytical trust.
- Macros improve scalability
and consistency.
- Missing numeric values
require special handling in SAS.
- Enterprise analytics depends
on reproducible pipelines.
- Standardization improves
machine-learning stability.
- Reliable analytics begins
with reliable data.
SAS vs R Comparison
|
Capability |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Industry
Standard |
Growing |
|
Flexibility |
Moderate |
Very
High |
|
Performance |
Strong |
Strong |
|
Visualization |
Moderate |
Excellent |
|
Governance |
Excellent |
Moderate |
|
Macro
Automation |
Powerful |
Functional |
|
Exploratory
Analysis |
Moderate |
Excellent |
|
Production
Stability |
Excellent |
Good |
|
Open
Source |
No |
Yes |
SAS
dominates highly regulated enterprise ecosystems because of auditability,
reproducibility, and metadata control. R excels in exploratory analytics,
statistical flexibility, and rapid prototyping. Together, they create an
extremely powerful hybrid analytics framework.
Validation Checklist
1.Duplicate
check completed
2.Date conversion validated
3.Missing-value review performed
4.Email structure verified
5.Region normalization applied
6.Numeric range validation completed
7.Metadata reviewed
8.QC reconciliation performed
Summary
Creating analysis-ready datasets is one of the most critical responsibilities
in enterprise analytics, especially in industries such as clinical research,
banking, insurance, retail, and healthcare. Raw operational data often contains
serious quality issues including duplicate records, invalid dates, negative
numeric values, inconsistent text formatting, missing variables, malformed
emails, corrupted categorical values, and mixed data types. These problems can
severely impact dashboards, statistical outputs, AI predictions, regulatory
submissions, and executive decision-making.
SAS and R together provide a powerful ecosystem for solving these
challenges. SAS is highly trusted in regulated industries because of its
auditability, metadata governance, reproducibility, validation capabilities,
and scalable production workflows. Features like DATA step processing, PROC
SQL, PROC REPORT, PROC SORT, arrays, macros, and validation procedures help
convert corrupted datasets into reliable analytical intelligence.
R complements SAS by providing flexible exploratory data analysis, advanced
string processing, dynamic transformations, and modern packages such as
tidyverse, dplyr, stringr, lubridate, and janitor. Functions like mutate(),
case_when(), coalesce(), and parse_date_time() make data cleaning efficient and
readable.
Together, SAS and R enable organizations to build trustworthy, scalable,
production-grade analytical systems that improve reporting accuracy, regulatory
compliance, operational intelligence, and enterprise decision-making
reliability.
Conclusion
Modern
analytics ecosystems depend entirely on structured, trustworthy, and
analysis-ready data. Whether the business domain involves clinical trials,
insurance claims, retail intelligence, banking fraud systems, or global fruit
supply-chain analytics, poor-quality data introduces enormous operational and
regulatory risk.
Duplicate
identifiers distort reporting. Invalid dates destroy time-series analysis.
Missing values silently corrupt statistical outputs. Inconsistent text
formatting fragments dashboards. Malformed emails break compliance workflows.
These issues are not theoretical they happen every day in enterprise production
systems.
This is
why SAS and R remain foundational technologies in enterprise data engineering.
SAS provides unmatched governance, auditability, traceability, and
production-grade reporting capabilities. Its DATA step architecture enables
highly controlled transformation pipelines essential for regulated industries
such as pharmaceuticals and banking. PROC SQL, PROC REPORT, PROC SUMMARY, and
macro automation create scalable frameworks capable of supporting global
operational intelligence.
R
complements SAS by offering highly flexible exploratory analytics, modern
string manipulation, advanced visualization ecosystems, and rapid
transformation pipelines using tidyverse frameworks. Together, SAS and R create
a hybrid analytics ecosystem that balances enterprise stability with analytical
agility.
The
future of analytics does not belong to organizations with the largest datasets.
It belongs to organizations with the cleanest, most reliable, and most
reproducible datasets.
Because
in enterprise analytics, trustworthy insights begin long before dashboards they
begin with disciplined data engineering.
Interview Questions and Answers
1. What is the difference between PROC SQL
and DATA Step in SAS?
Answer:
DATA Step is primarily used for row-by-row processing and sequential logic,
while PROC SQL is used for relational operations such as joins, aggregations,
and subqueries. DATA Step performs better for complex row transformations,
whereas PROC SQL is easier for database-style operations.
2. Why is LENGTH statement important in SAS?
Answer:
The LENGTH statement defines variable storage size before assignment. If LENGTH
is declared after assigning values, SAS may truncate character variables
because variable length is determined during compilation. Proper LENGTH
placement prevents data loss and reporting inconsistencies.
3. What is the difference between WHERE and
IF statements in SAS?
Answer:
WHERE filters observations before they enter the Program Data Vector (PDV),
making it more efficient for large datasets. IF conditions are applied after
observations are read into memory. WHERE cannot be used with newly created
variables in the same DATA step.
4. Explain RETAIN statement in SAS?
Answer:
By default, SAS resets variable values to missing for every iteration. RETAIN
preserves values across observations. It is commonly used for cumulative
calculations, counters, and carrying forward previous values.
5. What are FIRST. and LAST. variables in
SAS?
Answer:
FIRST.variable and LAST.variable are temporary variables created during
BY-group processing. They help identify the first and last observation within
grouped data and are widely used for summaries, counts, and accumulations.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent GLOBAL FRUITS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment