World's Most Expensive Flowers Dataset into Analysis-Ready Intelligence Using SAS and R
🌸 PETALS OF BILLIONS AND PIXELS OF PRECISION: Transforming the World's Most Expensive Flowers Dataset into Analysis-Ready Intelligence Using SAS and R
Introduction:When Luxury Meets Data Chaos
Imagine a
luxury global flower auction platform specializing in ultra-premium flowers
such as:
- Juliet Rose
- Shenzhen Nongke Orchid
- Kadupul Flower
- Gold of Kinabalu Orchid
- Saffron Crocus
The
company prepares a quarterly executive report showing global luxury flower
demand trends and pricing forecasts using AI models.
Suddenly
disaster strikes.
The
analytics dashboard reports:
- Negative flower prices.
- Duplicate auction IDs.
- Missing auction dates.
- Invalid buyer emails.
- Impossible flower ages.
- Region codes such as apac, APC,
asia-pacific, and A P A C.
- Corrupted timestamps.
- NULL values stored as text.
- Category labels such as Rare,
rare, RARE, and R@RE.
The AI
forecasting engine predicts a 300% increase in orchid demand because duplicate
records were counted multiple times.
The
finance department approves inventory purchases worth millions.
Regulators
question reporting integrity.
Executives
lose confidence in analytics.
This is
not a technology failure.
This is a
data quality failure.
Enterprise Dataset Design
Dataset: Most Expensive Flowers in World
Dataset Characteristics
|
Attribute |
Value |
|
Observations |
10 |
|
Variables |
9 |
|
Domain |
Luxury
Retail |
|
Purpose |
Pricing
analytics |
|
Complexity |
Enterprise
grade |
Variables
|
Variable |
Description |
|
Auction_ID |
Transaction
identifier |
|
Flower_Name |
Luxury
flower |
|
Country |
Origin
country |
|
Auction_Date |
Sale
date |
|
Region |
Market
region |
|
Buyer_Email |
Customer
email |
|
Price_USD |
Auction
price |
|
Category |
Flower
type |
|
Record_TS |
Load
timestamp |
1.Corrupted SAS Raw Dataset
data flower_raw;
length Auction_ID $8 Flower_Name $40 Country $25 Region $20
Buyer_Email $60 Category $20 Auction_Date $20 Record_TS $30;
infile datalines dlm='|' dsd truncover;
input Auction_ID $ Flower_Name $ Country $ Auction_Date $
Region $ Buyer_Email $ Price_USD Category $ Record_TS $;
datalines;
A001|Juliet Rose|UK|2025-01-05| apac |buyer1@gmail.com|15000000|Rare|2025-01-05T10:30:00
A002|Shenzhen Nongke Orchid|China||APC|buyer2gmail.com|200000|rare|2025-01-06T11:00:00
A003|Kadupul Flower|Sri Lanka|2025-02-10|asia-pacific|NULL|-5000|RARE|INVALID_TS
A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00
A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00
A005|Saffron Crocus|India|2025-03-15| ApAc |buyer5@|4500|premium|2025-03-15T15:00:00
A006|Blue Rose|Japan|2025-03-18|APAC|buyer6@gmail.com|12000|Rare|2025-03-18T09:00:00
A007|Tulip Bulb Emperor|Netherlands|2025-03-19|EMEA|buyer7@yahoo.com|15000|Premium|2025-03-19T10:00:00
A008|Ghost Orchid|USA|2025-03-20|AMER|buyer8gmail.com|-8000|rare|BAD_TS
A009|Black Baccara Rose|France||emea|NULL|25000|Luxury|2025-03-21T11:00:00
A010|Middlemist Red|UK|2025-03-22| EMEA |buyer10@gmail.com|50000|RARE|2025-03-22T12:00:00
;
run;
proc print data=flower_raw;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Auction_Date | Record_TS | Price_USD |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | UK | apac | buyer1@gmail.com | Rare | 2025-01-05 | 2025-01-05T10:30:00 | 15000000 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APC | buyer2gmail.com | rare | 2025-01-06T11:00:00 | 200000 | |
| 3 | A003 | Kadupul Flower | Sri Lanka | asia-pacific | NULL | RARE | 2025-02-10 | INVALID_TS | -5000 |
| 4 | A004 | Gold of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12 | 2025-02-12T12:30:00 | 30000 |
| 5 | A004 | Gold of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12 | 2025-02-12T12:30:00 | 30000 |
| 6 | A005 | Saffron Crocus | India | ApAc | buyer5@ | premium | 2025-03-15 | 2025-03-15T15:00:00 | 4500 |
| 7 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18 | 2025-03-18T09:00:00 | 12000 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19 | 2025-03-19T10:00:00 | 15000 |
| 9 | A008 | Ghost Orchid | USA | AMER | buyer8gmail.com | rare | 2025-03-20 | BAD_TS | -8000 |
| 10 | A009 | Black Baccara Rose | France | emea | NULL | Luxury | 2025-03-21T11:00:00 | 25000 | |
| 11 | A010 | Middlemist Red | UK | EMEA | buyer10@gmail.com | RARE | 2025-03-22 | 2025-03-22T12:00:00 | 50000 |
Why LENGTH Must Come First
One of
the most expensive mistakes in SAS programming involves character truncation.
Example:
data example;
region='Asia Pacific Region';
length region $5;
run;
Result:
Asia
Because
SAS assigns the variable length during first creation.
Correct
approach:
data example;
length region $30;
region='Asia Pacific Region';
run;
Unlike
SAS, R dynamically allocates character vector memory and therefore does not
suffer from compile-time truncation behavior.
This
difference becomes extremely important in SDTM and ADaM production
environments.
2.SAS Cleaning Workflow
Standardization Layer
data flower_clean;
set flower_raw;
Auction_ID=strip(upcase(Auction_ID));
Flower_Name=propcase(strip(Flower_Name));
Country=propcase(strip(Country));
Region=compress(upcase(region));
Category=propcase(Category);
Buyer_Email=lowcase(strip(Buyer_Email));
Price_USD=abs(Price_USD);
Buyer_Email=lowcase(strip(Buyer_Email));
Auction_Date_Num=input(Auction_Date,yymmdd10.);
format Auction_Date_Num yymmdd10.;
drop Auction_Date;
rename Auction_Date_Num=Auction_Date;
run;
proc print data=flower_clean;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APC | buyer2gmail.com | Rare | 2025-01-06T11:00:00 | 200000 | . |
| 3 | A003 | Kadupul Flower | Sri Lanka | ASIA-PACIFIC | null | Rare | INVALID_TS | 5000 | 2025-02-10 |
| 4 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 |
| 5 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 |
| 6 | A005 | Saffron Crocus | India | APAC | buyer5@ | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 |
| 7 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 |
| 9 | A008 | Ghost Orchid | Usa | AMER | buyer8gmail.com | Rare | BAD_TS | 8000 | 2025-03-20 |
| 10 | A009 | Black Baccara Rose | France | EMEA | null | Luxury | 2025-03-21T11:00:00 | 25000 | . |
| 11 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 |
Explanation
This
layer performs normalization of case sensitivity, whitespace corruption, and
negative values.
Functions
used:
- STRIP
- PROPCASE
- COMPRESS
- LOWCASE
- UPCASE
- ABS
These
functions ensure identical business entities are recognized as identical analytical
entities.
Without
normalization, "Rare" and "rare" become separate
categories.
3.Quick Validation Code
3.1 Number of observations
proc sql;
select count(*) as Obs_Count
from flower_raw;
quit;
OUTPUT:
| Obs_Count |
|---|
| 11 |
3.2 Number of variables
proc contents data=flower_raw position;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.FLOWER_RAW | Observations | 11 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 06/29/2026 11:49:45 | Observation Length | 232 |
| Last Modified | 06/29/2026 11:49:45 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 564 |
| Obs in First Data Page | 11 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_work47EF0000C0E8_odaws02-apse1-2.oda.sas.com/SAS_work7D190000C0E8_odaws02-apse1-2.oda.sas.com/flower_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 201367403 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 7 | Auction_Date | Char | 20 |
| 1 | Auction_ID | Char | 8 |
| 5 | Buyer_Email | Char | 60 |
| 6 | Category | Char | 20 |
| 3 | Country | Char | 25 |
| 2 | Flower_Name | Char | 40 |
| 9 | Price_USD | Num | 8 |
| 8 | Record_TS | Char | 30 |
| 4 | Region | Char | 20 |
| Variables in Creation Order | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 1 | Auction_ID | Char | 8 |
| 2 | Flower_Name | Char | 40 |
| 3 | Country | Char | 25 |
| 4 | Region | Char | 20 |
| 5 | Buyer_Email | Char | 60 |
| 6 | Category | Char | 20 |
| 7 | Auction_Date | Char | 20 |
| 8 | Record_TS | Char | 30 |
| 9 | Price_USD | Num | 8 |
3.3 Duplicate IDs
proc sql;
select Auction_ID,
count(*) as Frequency
from flower_raw
group by Auction_ID
having count(*)>1;
quit;
OUTPUT:
| Auction_ID | Frequency |
|---|---|
| A004 | 2 |
3.4 IF THEN ELSE Validation
data flower_clean1;
set flower_clean;
if missing(Auction_Date) then
Auction_Date='01JAN2025'd;
if Price_USD<100 then
Price_USD=100;
if index(Buyer_Email,'@')=0 or index(Buyer_Email,'.')=0
then Buyer_Email='unknown@domain.com';
format Auction_Date yymmdd10.;
run;
proc print data=flower_clean1;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APC | unknown@domain.com | Rare | 2025-01-06T11:00:00 | 200000 | 2025-01-01 |
| 3 | A003 | Kadupul Flower | Sri Lanka | ASIA-PACIFIC | unknown@domain.com | Rare | INVALID_TS | 5000 | 2025-02-10 |
| 4 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 |
| 5 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 |
| 6 | A005 | Saffron Crocus | India | APAC | unknown@domain.com | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 |
| 7 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 |
| 9 | A008 | Ghost Orchid | Usa | AMER | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 |
| 10 | A009 | Black Baccara Rose | France | EMEA | unknown@domain.com | Luxury | 2025-03-21T11:00:00 | 25000 | 2025-01-01 |
| 11 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 |
Explanation
This
logic handles:
- Missing dates
- Invalid prices
- Malformed emails
Defensive
programming is mandatory in production systems because upstream systems rarely
guarantee perfect data.
3.5 SELECT WHEN Logic
data flower_clean2;
set flower_clean1;
select(compress(upcase(region)));
when('APAC','APC','ASIA-PACIFIC') Region='APAC';
when('EMEA') Region='EMEA';
otherwise Region='UNKNOWN';
end;
run;
proc print data=flower_clean2;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APAC | unknown@domain.com | Rare | 2025-01-06T11:00:00 | 200000 | 2025-01-01 |
| 3 | A003 | Kadupul Flower | Sri Lanka | APAC | unknown@domain.com | Rare | INVALID_TS | 5000 | 2025-02-10 |
| 4 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 |
| 5 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 |
| 6 | A005 | Saffron Crocus | India | APAC | unknown@domain.com | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 |
| 7 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 |
| 9 | A008 | Ghost Orchid | Usa | UNKNOWN | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 |
| 10 | A009 | Black Baccara Rose | France | EMEA | unknown@domain.com | Luxury | 2025-03-21T11:00:00 | 25000 | 2025-01-01 |
| 11 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 |
Explanation
SELECT-WHEN
provides cleaner logic than multiple IF statements.
This
approach improves maintainability and readability.
3.6 ARRAY Validation
data flower_clean3;
set flower_clean2;
array chars(*) Auction_ID Flower_Name Country Region Category;
do i=1 to dim(chars);
chars(i)=strip(chars(i));
end;
run;
proc print data=flower_clean3;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date | i |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 | 6 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APAC | unknown@domain.com | Rare | 2025-01-06T11:00:00 | 200000 | 2025-01-01 | 6 |
| 3 | A003 | Kadupul Flower | Sri Lanka | APAC | unknown@domain.com | Rare | INVALID_TS | 5000 | 2025-02-10 | 6 |
| 4 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 |
| 5 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 |
| 6 | A005 | Saffron Crocus | India | APAC | unknown@domain.com | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 | 6 |
| 7 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 | 6 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 | 6 |
| 9 | A008 | Ghost Orchid | Usa | UNKNOWN | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 | 6 |
| 10 | A009 | Black Baccara Rose | France | EMEA | unknown@domain.com | Luxury | 2025-03-21T11:00:00 | 25000 | 2025-01-01 | 6 |
| 11 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 | 6 |
Explanation
Arrays
eliminate repetitive coding and support scalable enterprise workflows.
4.RETAIN Example
data Retain1;
set flower_clean3;
retain Batch_Number 0;
Batch_Number+1;
run;
proc print data=Retain1;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date | i | Batch_Number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 | 6 | 1 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APAC | unknown@domain.com | Rare | 2025-01-06T11:00:00 | 200000 | 2025-01-01 | 6 | 2 |
| 3 | A003 | Kadupul Flower | Sri Lanka | APAC | unknown@domain.com | Rare | INVALID_TS | 5000 | 2025-02-10 | 6 | 3 |
| 4 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 | 4 |
| 5 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 | 5 |
| 6 | A005 | Saffron Crocus | India | APAC | unknown@domain.com | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 | 6 | 6 |
| 7 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 | 6 | 7 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 | 6 | 8 |
| 9 | A008 | Ghost Orchid | Usa | UNKNOWN | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 | 6 | 9 |
| 10 | A009 | Black Baccara Rose | France | EMEA | unknown@domain.com | Luxury | 2025-03-21T11:00:00 | 25000 | 2025-01-01 | 6 | 10 |
| 11 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 | 6 | 11 |
Explanation
RETAIN
preserves values between observations and is heavily used in sequence
generation and audit trails.
5.FIRST. and LAST. Processing
proc sort data=Retain1;
by Region Auction_ID;
run;
proc print data=Retain1;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date | i | Batch_Number |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 | 6 | 1 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APAC | unknown@domain.com | Rare | 2025-01-06T11:00:00 | 200000 | 2025-01-01 | 6 | 2 |
| 3 | A003 | Kadupul Flower | Sri Lanka | APAC | unknown@domain.com | Rare | INVALID_TS | 5000 | 2025-02-10 | 6 | 3 |
| 4 | A005 | Saffron Crocus | India | APAC | unknown@domain.com | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 | 6 | 6 |
| 5 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 | 6 | 7 |
| 6 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 | 4 |
| 7 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 | 5 |
| 8 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 | 6 | 8 |
| 9 | A009 | Black Baccara Rose | France | EMEA | unknown@domain.com | Luxury | 2025-03-21T11:00:00 | 25000 | 2025-01-01 | 6 | 10 |
| 10 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 | 6 | 11 |
| 11 | A008 | Ghost Orchid | Usa | UNKNOWN | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 | 6 | 9 |
data region_summary;
set Retain1;
by Region;
if first.Region then Count=0;
Count+1;
if last.Region;
run;
proc print data=region_summary;
run;
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date | i | Batch_Number | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 | 6 | 7 | 5 |
| 2 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 | 6 | 11 | 5 |
| 3 | A008 | Ghost Orchid | Usa | UNKNOWN | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 | 6 | 9 | 1 |
Explanation
FIRST./LAST.
variables enable grouped processing without SQL overhead.
6.PROC SORT NODUPKEY
proc sort data=flower_clean3 nodupkey;
by Auction_ID;
run;
proc print data=flower_clean3;
run;
LOG:
OUTPUT:
| Obs | Auction_ID | Flower_Name | Country | Region | Buyer_Email | Category | Record_TS | Price_USD | Auction_Date | i |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A001 | Juliet Rose | Uk | APAC | buyer1@gmail.com | Rare | 2025-01-05T10:30:00 | 15000000 | 2025-01-05 | 6 |
| 2 | A002 | Shenzhen Nongke Orchid | China | APAC | unknown@domain.com | Rare | 2025-01-06T11:00:00 | 200000 | 2025-01-01 | 6 |
| 3 | A003 | Kadupul Flower | Sri Lanka | APAC | unknown@domain.com | Rare | INVALID_TS | 5000 | 2025-02-10 | 6 |
| 4 | A004 | Gold Of Kinabalu Orchid | Malaysia | EMEA | buyer4@yahoo.com | Luxury | 2025-02-12T12:30:00 | 30000 | 2025-02-12 | 6 |
| 5 | A005 | Saffron Crocus | India | APAC | unknown@domain.com | Premium | 2025-03-15T15:00:00 | 4500 | 2025-03-15 | 6 |
| 6 | A006 | Blue Rose | Japan | APAC | buyer6@gmail.com | Rare | 2025-03-18T09:00:00 | 12000 | 2025-03-18 | 6 |
| 7 | A007 | Tulip Bulb Emperor | Netherlands | EMEA | buyer7@yahoo.com | Premium | 2025-03-19T10:00:00 | 15000 | 2025-03-19 | 6 |
| 8 | A008 | Ghost Orchid | Usa | UNKNOWN | unknown@domain.com | Rare | BAD_TS | 8000 | 2025-03-20 | 6 |
| 9 | A009 | Black Baccara Rose | France | EMEA | unknown@domain.com | Luxury | 2025-03-21T11:00:00 | 25000 | 2025-01-01 | 6 |
| 10 | A010 | Middlemist Red | Uk | EMEA | buyer10@gmail.com | Rare | 2025-03-22T12:00:00 | 50000 | 2025-03-22 | 6 |
Explanation
Removes
duplicate transactions while preserving first occurrence.
7.PROC FORMAT
proc format;
value pricegrp low-10000='Low'
10001-50000='Premium'
50001-high='Ultra Luxury';
run;
LOG:
Explanation
Formats
improve reporting consistency and executive readability.
8.PROC SQL Approach
proc sql;
create table flower_sql as
select Region,
count(*) as Transactions,
mean(Price_USD) as Avg_Price format=pricegrp.
from flower_clean3
group by Region;
quit;
proc print data=flower_sql;
run;
OUTPUT:
| Obs | Region | Transactions | Avg_Price |
|---|---|---|---|
| 1 | APAC | 5 | Ultra Luxury |
| 2 | EMEA | 4 | Premium |
| 3 | UNKNOWN | 1 | Low |
Explanation
PROC SQL
simplifies aggregation logic and resembles ANSI SQL databases.
9.DATA Step Alternative
proc summary data=flower_clean3 nway;
class Region;
var Price_USD;
output out=summary mean=;
run;
proc print data=summary;
run;
OUTPUT:
| Obs | Region | _TYPE_ | _FREQ_ | Price_USD |
|---|---|---|---|---|
| 1 | APAC | 1 | 5 | 3044300 |
| 2 | EMEA | 1 | 4 | 30000 |
| 3 | UNKNOWN | 1 | 1 | 8000 |
Explanation
DATA Step
and PROC SUMMARY often outperform SQL on extremely large SAS datasets.
10.R Raw Dataset
flower_raw <- read.delim(
text="
Auction_ID|Flower_Name|Country|Auction_Date|Region|Buyer_Email|Price_USD|Category|Record_TS
A001|Juliet Rose|UK|2025-01-05| apac |buyer1@gmail.com|15000000|Rare|2025-01-05T10:30:00
A002|Shenzhen Nongke Orchid|China||APC|buyer2gmail.com|200000|rare|2025-01-06T11:00:00
A003|Kadupul Flower|Sri Lanka|2025-02-10|asia-pacific|NULL|-5000|RARE|INVALID_TS
A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00
A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00
A005|Saffron Crocus|India|2025-03-15| ApAc |buyer5@|4500|premium|2025-03-15T15:00:00
A006|Blue Rose|Japan|2025-03-18|APAC|buyer6@gmail.com|12000|Rare|2025-03-18T09:00:00
A007|Tulip Bulb Emperor|Netherlands|2025-03-19|EMEA|buyer7@yahoo.com|15000|Premium|2025-03-19T10:00:00
A008|Ghost Orchid|USA|2025-03-20|AMER|buyer8gmail.com|-8000|rare|BAD_TS
A009|Black Baccara Rose|France||emea|NULL|25000|Luxury|2025-03-21T11:00:00
A010|Middlemist Red|UK|2025-03-22| EMEA |buyer10@gmail.com|50000|RARE|2025-03-22T12:00:00
",
sep="|",
header=TRUE,
stringsAsFactors=FALSE
)
OUTPUT:
|
Auction_ID |
Flower_Name |
Country |
Auction_Date |
Region |
Buyer_Email |
Price_USD |
Category |
Record_TS |
|
A001 |
Juliet
Rose |
UK |
2025-01-05 |
apac |
buyer1@gmail.com |
15000000 |
Rare |
2025-01-05T10:30:00 |
|
A002 |
Shenzhen
Nongke Orchid |
China |
APC |
buyer2gmail.com |
200000 |
rare |
2025-01-06T11:00:00 |
|
|
A003 |
Kadupul
Flower |
Sri
Lanka |
2025-02-10 |
asia-pacific |
NULL |
-5000 |
RARE |
INVALID_TS |
|
A004 |
Gold of
Kinabalu Orchid |
Malaysia |
2025-02-12 |
EMEA |
buyer4@yahoo.com |
30000 |
Luxury |
2025-02-12T12:30:00 |
|
A004 |
Gold of
Kinabalu Orchid |
Malaysia |
2025-02-12 |
EMEA |
buyer4@yahoo.com |
30000 |
Luxury |
2025-02-12T12:30:00 |
|
A005 |
Saffron
Crocus |
India |
2025-03-15 |
ApAc |
buyer5@ |
4500 |
premium |
2025-03-15T15:00:00 |
|
A006 |
Blue
Rose |
Japan |
2025-03-18 |
APAC |
buyer6@gmail.com |
12000 |
Rare |
2025-03-18T09:00:00 |
|
A007 |
Tulip
Bulb Emperor |
Netherlands |
2025-03-19 |
EMEA |
buyer7@yahoo.com |
15000 |
Premium |
2025-03-19T10:00:00 |
|
A008 |
Ghost
Orchid |
USA |
2025-03-20 |
AMER |
buyer8gmail.com |
-8000 |
rare |
BAD_TS |
|
A009 |
Black
Baccara Rose |
France |
emea |
NULL |
25000 |
Luxury |
2025-03-21T11:00:00 |
|
|
A010 |
Middlemist
Red |
UK |
2025-03-22 |
EMEA |
buyer10@gmail.com |
50000 |
RARE |
2025-03-22T12:00:00 |
11.R Cleaning Layer
library(tidyverse)
library(janitor)
library(lubridate)
options(scipen = 999)
flower_clean <- flower_raw %>%
clean_names() %>%
mutate(
flower_name = str_to_title(str_trim(flower_name)),
country = str_to_title(country),
region = str_replace_all(region, " ", ""),
region = str_to_upper(region),
buyer_email = str_to_lower(buyer_email),
price_usd = abs(price_usd),
buyer_email = if_else(buyer_email == "null" |
!grepl("@", buyer_email) | !grepl("\\.", buyer_email),
"unknown@domain.com",buyer_email),
auction_date = coalesce(auction_date, "2025-01-01")
) %>%
distinct(auction_id, .keep_all = TRUE)
OUTPUT:
|
auction_id |
flower_name |
country |
auction_date |
region |
buyer_email |
price_usd |
category |
record_ts |
|
A001 |
Juliet
Rose |
Uk |
2025-01-05 |
APAC |
buyer1@gmail.com |
15000000 |
Rare |
2025-01-05T10:30:00 |
|
A002 |
Shenzhen
Nongke Orchid |
China |
APC |
unknown@domain.com |
200000 |
rare |
2025-01-06T11:00:00 |
|
|
A003 |
Kadupul
Flower |
Sri
Lanka |
2025-02-10 |
ASIA-PACIFIC |
unknown@domain.com |
5000 |
RARE |
INVALID_TS |
|
A004 |
Gold Of
Kinabalu Orchid |
Malaysia |
2025-02-12 |
EMEA |
buyer4@yahoo.com |
30000 |
Luxury |
2025-02-12T12:30:00 |
|
A005 |
Saffron
Crocus |
India |
2025-03-15 |
APAC |
unknown@domain.com |
4500 |
premium |
2025-03-15T15:00:00 |
|
A006 |
Blue
Rose |
Japan |
2025-03-18 |
APAC |
buyer6@gmail.com |
12000 |
Rare |
2025-03-18T09:00:00 |
|
A007 |
Tulip
Bulb Emperor |
Netherlands |
2025-03-19 |
EMEA |
buyer7@yahoo.com |
15000 |
Premium |
2025-03-19T10:00:00 |
|
A008 |
Ghost
Orchid |
Usa |
2025-03-20 |
AMER |
unknown@domain.com |
8000 |
rare |
BAD_TS |
|
A009 |
Black
Baccara Rose |
France |
EMEA |
unknown@domain.com |
25000 |
Luxury |
2025-03-21T11:00:00 |
|
|
A010 |
Middlemist
Red |
Uk |
2025-03-22 |
EMEA |
buyer10@gmail.com |
50000 |
RARE |
2025-03-22T12:00:00 |
Explanation
Equivalent
mappings:
|
SAS |
R |
|
STRIP |
str_trim |
|
PROPCASE |
str_to_title |
|
LOWCASE |
str_to_lower |
|
ABS |
abs |
|
COMPRESS |
str_replace_all |
|
COALESCEC |
coalesce |
R
provides elegant pipeline syntax while SAS provides stronger auditability.
Enterprise Validation and
Compliance
In
regulated environments such as clinical research:
- SDTM submissions require
traceability.
- ADaM datasets require
reproducibility.
- QC programmers must
independently validate outputs.
- Audit trails must preserve
lineage.
Missing
numeric values in SAS are treated as smaller than every number.
Example:
if Age<18 then Minor='Y';
If Age is
missing:
Result = Y
This
creates catastrophic analytical errors.
Correct
approach:
if not missing(age) and age<18
then Minor='Y';
Twenty Data Cleaning Best
Practices
- Validate metadata before
coding.
- Never overwrite source
datasets.
- Use reusable macros.
- Apply standard formats.
- Maintain audit trails.
- Use version control.
- Validate duplicates early.
- Separate raw and clean
layers.
- Standardize date formats.
- Normalize text variables.
- Validate ranges.
- Use defensive programming.
- Automate QC checks.
- Document derivations.
- Validate joins.
- Use controlled terminology.
- Protect lineage.
- Review missing values.
- Build reusable utilities.
- Maintain deployment
controls.
Business Logic Behind
Cleaning
Missing
values are rarely random.
A missing
patient visit date may indicate site entry delay rather than missed treatment.
Negative
invoice values may indicate reversal transactions rather than actual refunds.
A patient
age of 350 years likely reflects a data-entry issue.
Normalizing
text variables prevents duplicate categories.
Date
standardization enables time-series analytics.
Without
these corrections, downstream AI models inherit upstream errors.
Data
science cannot compensate for fundamentally broken source data.
Twenty One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic beats
visual inspection.
- Metadata drives
reproducibility.
- Standardization reduces
ambiguity.
- Missing values have business
meaning.
- Duplicate records distort AI
models.
- Auditable code builds trust.
- Traceability protects
submissions.
- Defensive programming
prevents disasters.
- Controlled terminology
improves consistency.
- Data lineage matters.
- Automation reduces risk.
- Validation is continuous.
- Reproducibility drives
confidence.
- Governance enables scale.
- Macros reduce redundancy.
- Formats improve readability.
- Documentation saves
projects.
- Clean data powers AI.
- Quality precedes analytics.
SAS V/S R
|
Capability |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
acceptance |
Excellent |
Growing |
|
Flexibility |
Moderate |
Excellent |
|
Visualization |
Moderate |
Excellent |
|
Scalability |
Excellent |
Excellent |
|
Traceability |
Excellent |
Moderate |
SAS
dominates highly regulated environments while R excels in exploratory analytics
and advanced modeling.
Together
they create a powerful analytical ecosystem.
Final Thoughts
Whether
you work in luxury flower analytics, banking fraud detection, insurance
underwriting, or clinical trials, poor-quality data quietly destroys trust long
before dashboards fail visibly.
The real
value of SAS and R is not simply cleaning datasets.
It is
creating confidence.
SAS
provides industrial-strength governance, reproducibility, and regulatory
compliance.
R
provides flexibility, innovation, and rapid experimentation.
Together
they transform corrupted operational records into trustworthy business
intelligence.
In modern
analytics ecosystems, organizations do not compete on who owns the most data.
They
compete on who trusts their data enough to act on it.
And that
journey begins with disciplined, repeatable, production-grade data cleaning
frameworks.
Interview Questions
1. How would you remove duplicates in SAS?
Use:
proc sort nodupkey;
or
proc sql
select distinct
2. Why are missing values dangerous in SAS?
Because
missing numeric values are lower than all valid values.
3. When would you prefer PROC SQL over DATA Step?
Complex
joins and aggregations.
4. Why use arrays?
To
eliminate repetitive validation logic.
5. How do you validate malformed emails in R?
grepl("@",email)
or
regular expressions using stringr.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent FLOWERS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment