World's Most Expensive Flowers Dataset into Analysis-Ready Intelligence Using SAS and R

 🌸 PETALS OF BILLIONS AND PIXELS OF PRECISION: Transforming the World's Most Expensive Flowers Dataset into Analysis-Ready Intelligence Using SAS and R

Introduction:When Luxury Meets Data Chaos

Imagine a luxury global flower auction platform specializing in ultra-premium flowers such as:

  • Juliet Rose
  • Shenzhen Nongke Orchid
  • Kadupul Flower
  • Gold of Kinabalu Orchid
  • Saffron Crocus

The company prepares a quarterly executive report showing global luxury flower demand trends and pricing forecasts using AI models.

Suddenly disaster strikes.

The analytics dashboard reports:

  • Negative flower prices.
  • Duplicate auction IDs.
  • Missing auction dates.
  • Invalid buyer emails.
  • Impossible flower ages.
  • Region codes such as apac, APC, asia-pacific, and A P A C.
  • Corrupted timestamps.
  • NULL values stored as text.
  • Category labels such as Rare, rare, RARE, and R@RE.

The AI forecasting engine predicts a 300% increase in orchid demand because duplicate records were counted multiple times.

The finance department approves inventory purchases worth millions.

Regulators question reporting integrity.

Executives lose confidence in analytics.

This is not a technology failure.

This is a data quality failure.

Enterprise Dataset Design

Dataset: Most Expensive Flowers in World

Dataset Characteristics

Attribute

Value

Observations

10

Variables

9

Domain

Luxury Retail

Purpose

Pricing analytics

Complexity

Enterprise grade

Variables

Variable

Description

Auction_ID

Transaction identifier

Flower_Name

Luxury flower

Country

Origin country

Auction_Date

Sale date

Region

Market region

Buyer_Email

Customer email

Price_USD

Auction price

Category

Flower type

Record_TS

Load timestamp

1.Corrupted SAS Raw Dataset

data flower_raw;

length Auction_ID $8 Flower_Name $40 Country $25 Region $20

Buyer_Email $60 Category $20 Auction_Date $20 Record_TS $30;

infile datalines dlm='|' dsd truncover;

input Auction_ID $ Flower_Name $ Country $ Auction_Date $

Region $ Buyer_Email $ Price_USD Category $ Record_TS $;

datalines;

A001|Juliet Rose|UK|2025-01-05| apac |buyer1@gmail.com|15000000|Rare|2025-01-05T10:30:00

A002|Shenzhen Nongke Orchid|China||APC|buyer2gmail.com|200000|rare|2025-01-06T11:00:00

A003|Kadupul Flower|Sri Lanka|2025-02-10|asia-pacific|NULL|-5000|RARE|INVALID_TS

A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00

A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00

A005|Saffron Crocus|India|2025-03-15| ApAc |buyer5@|4500|premium|2025-03-15T15:00:00

A006|Blue Rose|Japan|2025-03-18|APAC|buyer6@gmail.com|12000|Rare|2025-03-18T09:00:00

A007|Tulip Bulb Emperor|Netherlands|2025-03-19|EMEA|buyer7@yahoo.com|15000|Premium|2025-03-19T10:00:00

A008|Ghost Orchid|USA|2025-03-20|AMER|buyer8gmail.com|-8000|rare|BAD_TS

A009|Black Baccara Rose|France||emea|NULL|25000|Luxury|2025-03-21T11:00:00

A010|Middlemist Red|UK|2025-03-22| EMEA |buyer10@gmail.com|50000|RARE|2025-03-22T12:00:00

;

run;

proc print data=flower_raw;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryAuction_DateRecord_TSPrice_USD
1A001Juliet RoseUKapacbuyer1@gmail.comRare2025-01-052025-01-05T10:30:0015000000
2A002Shenzhen Nongke OrchidChinaAPCbuyer2gmail.comrare 2025-01-06T11:00:00200000
3A003Kadupul FlowerSri Lankaasia-pacificNULLRARE2025-02-10INVALID_TS-5000
4A004Gold of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-122025-02-12T12:30:0030000
5A004Gold of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-122025-02-12T12:30:0030000
6A005Saffron CrocusIndiaApAcbuyer5@premium2025-03-152025-03-15T15:00:004500
7A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-182025-03-18T09:00:0012000
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-192025-03-19T10:00:0015000
9A008Ghost OrchidUSAAMERbuyer8gmail.comrare2025-03-20BAD_TS-8000
10A009Black Baccara RoseFranceemeaNULLLuxury 2025-03-21T11:00:0025000
11A010Middlemist RedUKEMEAbuyer10@gmail.comRARE2025-03-222025-03-22T12:00:0050000

Why LENGTH Must Come First

One of the most expensive mistakes in SAS programming involves character truncation.

Example:

data example;

region='Asia Pacific Region';

length region $5; 

run;

Result:

Asia

Because SAS assigns the variable length during first creation.

Correct approach:

data example;

length region $30;

region='Asia Pacific Region';

run;

Unlike SAS, R dynamically allocates character vector memory and therefore does not suffer from compile-time truncation behavior.

This difference becomes extremely important in SDTM and ADaM production environments.

2.SAS Cleaning Workflow

Standardization Layer

data flower_clean;

set flower_raw;

Auction_ID=strip(upcase(Auction_ID));

Flower_Name=propcase(strip(Flower_Name));

Country=propcase(strip(Country));

Region=compress(upcase(region));

Category=propcase(Category);

Buyer_Email=lowcase(strip(Buyer_Email));

Price_USD=abs(Price_USD);

Buyer_Email=lowcase(strip(Buyer_Email));

Auction_Date_Num=input(Auction_Date,yymmdd10.);

format Auction_Date_Num yymmdd10.;

drop Auction_Date;

rename Auction_Date_Num=Auction_Date;

run;

proc print data=flower_clean;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_Date
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-05
2A002Shenzhen Nongke OrchidChinaAPCbuyer2gmail.comRare2025-01-06T11:00:00200000.
3A003Kadupul FlowerSri LankaASIA-PACIFICnullRareINVALID_TS50002025-02-10
4A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-12
5A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-12
6A005Saffron CrocusIndiaAPACbuyer5@Premium2025-03-15T15:00:0045002025-03-15
7A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-18
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-19
9A008Ghost OrchidUsaAMERbuyer8gmail.comRareBAD_TS80002025-03-20
10A009Black Baccara RoseFranceEMEAnullLuxury2025-03-21T11:00:0025000.
11A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-22

Explanation

This layer performs normalization of case sensitivity, whitespace corruption, and negative values.

Functions used:

  • STRIP
  • PROPCASE
  • COMPRESS
  • LOWCASE
  • UPCASE
  • ABS

These functions ensure identical business entities are recognized as identical analytical entities.

Without normalization, "Rare" and "rare" become separate categories.

3.Quick Validation Code

3.1 Number of observations

proc sql;

select count(*) as Obs_Count

from flower_raw;

quit;

OUTPUT:

Obs_Count
11

3.2 Number of variables

proc contents data=flower_raw position;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.FLOWER_RAWObservations11
Member TypeDATAVariables9
EngineV9Indexes0
Created06/29/2026 11:49:45Observation Length232
Last Modified06/29/2026 11:49:45Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page564
Obs in First Data Page11
Number of Data Set Repairs0
Filename/saswork/SAS_work47EF0000C0E8_odaws02-apse1-2.oda.sas.com/SAS_work7D190000C0E8_odaws02-apse1-2.oda.sas.com/flower_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number201367403
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLen
7Auction_DateChar20
1Auction_IDChar8
5Buyer_EmailChar60
6CategoryChar20
3CountryChar25
2Flower_NameChar40
9Price_USDNum8
8Record_TSChar30
4RegionChar20
Variables in Creation Order
#VariableTypeLen
1Auction_IDChar8
2Flower_NameChar40
3CountryChar25
4RegionChar20
5Buyer_EmailChar60
6CategoryChar20
7Auction_DateChar20
8Record_TSChar30
9Price_USDNum8

3.3 Duplicate IDs

proc sql;

select Auction_ID,

       count(*) as Frequency

from flower_raw

group by Auction_ID

having count(*)>1;

quit;

OUTPUT:

Auction_IDFrequency
A0042

3.4 IF THEN ELSE Validation

data flower_clean1;

set flower_clean;

if missing(Auction_Date) then

Auction_Date='01JAN2025'd;

if Price_USD<100 then

Price_USD=100;

if index(Buyer_Email,'@')=0 or index(Buyer_Email,'.')=0 

then Buyer_Email='unknown@domain.com';

format Auction_Date yymmdd10.;

run;

proc print data=flower_clean1;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_Date
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-05
2A002Shenzhen Nongke OrchidChinaAPCunknown@domain.comRare2025-01-06T11:00:002000002025-01-01
3A003Kadupul FlowerSri LankaASIA-PACIFICunknown@domain.comRareINVALID_TS50002025-02-10
4A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-12
5A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-12
6A005Saffron CrocusIndiaAPACunknown@domain.comPremium2025-03-15T15:00:0045002025-03-15
7A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-18
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-19
9A008Ghost OrchidUsaAMERunknown@domain.comRareBAD_TS80002025-03-20
10A009Black Baccara RoseFranceEMEAunknown@domain.comLuxury2025-03-21T11:00:00250002025-01-01
11A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-22

Explanation

This logic handles:

  • Missing dates
  • Invalid prices
  • Malformed emails

Defensive programming is mandatory in production systems because upstream systems rarely guarantee perfect data.

3.5 SELECT WHEN Logic

data flower_clean2;

set flower_clean1;

select(compress(upcase(region)));

when('APAC','APC','ASIA-PACIFIC') Region='APAC';

when('EMEA') Region='EMEA';

otherwise Region='UNKNOWN';

end;

run;

proc print data=flower_clean2;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_Date
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-05
2A002Shenzhen Nongke OrchidChinaAPACunknown@domain.comRare2025-01-06T11:00:002000002025-01-01
3A003Kadupul FlowerSri LankaAPACunknown@domain.comRareINVALID_TS50002025-02-10
4A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-12
5A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-12
6A005Saffron CrocusIndiaAPACunknown@domain.comPremium2025-03-15T15:00:0045002025-03-15
7A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-18
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-19
9A008Ghost OrchidUsaUNKNOWNunknown@domain.comRareBAD_TS80002025-03-20
10A009Black Baccara RoseFranceEMEAunknown@domain.comLuxury2025-03-21T11:00:00250002025-01-01
11A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-22

Explanation

SELECT-WHEN provides cleaner logic than multiple IF statements.

This approach improves maintainability and readability.

3.6 ARRAY Validation

data flower_clean3;

set flower_clean2;

array chars(*) Auction_ID Flower_Name Country Region Category;

do i=1 to dim(chars);

chars(i)=strip(chars(i));

end;

run;

proc print data=flower_clean3;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_Datei
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-056
2A002Shenzhen Nongke OrchidChinaAPACunknown@domain.comRare2025-01-06T11:00:002000002025-01-016
3A003Kadupul FlowerSri LankaAPACunknown@domain.comRareINVALID_TS50002025-02-106
4A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-126
5A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-126
6A005Saffron CrocusIndiaAPACunknown@domain.comPremium2025-03-15T15:00:0045002025-03-156
7A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-186
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-196
9A008Ghost OrchidUsaUNKNOWNunknown@domain.comRareBAD_TS80002025-03-206
10A009Black Baccara RoseFranceEMEAunknown@domain.comLuxury2025-03-21T11:00:00250002025-01-016
11A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-226

Explanation

Arrays eliminate repetitive coding and support scalable enterprise workflows.

4.RETAIN Example

data Retain1;

set flower_clean3;

retain Batch_Number 0;

Batch_Number+1;

run;

proc print data=Retain1;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_DateiBatch_Number
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-0561
2A002Shenzhen Nongke OrchidChinaAPACunknown@domain.comRare2025-01-06T11:00:002000002025-01-0162
3A003Kadupul FlowerSri LankaAPACunknown@domain.comRareINVALID_TS50002025-02-1063
4A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-1264
5A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-1265
6A005Saffron CrocusIndiaAPACunknown@domain.comPremium2025-03-15T15:00:0045002025-03-1566
7A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-1867
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-1968
9A008Ghost OrchidUsaUNKNOWNunknown@domain.comRareBAD_TS80002025-03-2069
10A009Black Baccara RoseFranceEMEAunknown@domain.comLuxury2025-03-21T11:00:00250002025-01-01610
11A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-22611

Explanation

RETAIN preserves values between observations and is heavily used in sequence generation and audit trails.

5.FIRST. and LAST. Processing

proc sort data=Retain1;

by Region Auction_ID;

run;

proc print data=Retain1;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_DateiBatch_Number
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-0561
2A002Shenzhen Nongke OrchidChinaAPACunknown@domain.comRare2025-01-06T11:00:002000002025-01-0162
3A003Kadupul FlowerSri LankaAPACunknown@domain.comRareINVALID_TS50002025-02-1063
4A005Saffron CrocusIndiaAPACunknown@domain.comPremium2025-03-15T15:00:0045002025-03-1566
5A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-1867
6A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-1264
7A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-1265
8A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-1968
9A009Black Baccara RoseFranceEMEAunknown@domain.comLuxury2025-03-21T11:00:00250002025-01-01610
10A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-22611
11A008Ghost OrchidUsaUNKNOWNunknown@domain.comRareBAD_TS80002025-03-2069

data region_summary;

set Retain1;

by Region;

if first.Region then Count=0;

Count+1;

if last.Region;

run;

proc print data=region_summary;

run;

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_DateiBatch_NumberCount
1A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-18675
2A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-226115
3A008Ghost OrchidUsaUNKNOWNunknown@domain.comRareBAD_TS80002025-03-20691

Explanation

FIRST./LAST. variables enable grouped processing without SQL overhead.

6.PROC SORT NODUPKEY

proc sort data=flower_clean3 nodupkey;

by Auction_ID;

run;

proc print data=flower_clean3;

run;

LOG:

NOTE: There were 11 observations read from the data set WORK.FLOWER_CLEAN3.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.FLOWER_CLEAN3 has 10 observations and 10 variables.

OUTPUT:

ObsAuction_IDFlower_NameCountryRegionBuyer_EmailCategoryRecord_TSPrice_USDAuction_Datei
1A001Juliet RoseUkAPACbuyer1@gmail.comRare2025-01-05T10:30:00150000002025-01-056
2A002Shenzhen Nongke OrchidChinaAPACunknown@domain.comRare2025-01-06T11:00:002000002025-01-016
3A003Kadupul FlowerSri LankaAPACunknown@domain.comRareINVALID_TS50002025-02-106
4A004Gold Of Kinabalu OrchidMalaysiaEMEAbuyer4@yahoo.comLuxury2025-02-12T12:30:00300002025-02-126
5A005Saffron CrocusIndiaAPACunknown@domain.comPremium2025-03-15T15:00:0045002025-03-156
6A006Blue RoseJapanAPACbuyer6@gmail.comRare2025-03-18T09:00:00120002025-03-186
7A007Tulip Bulb EmperorNetherlandsEMEAbuyer7@yahoo.comPremium2025-03-19T10:00:00150002025-03-196
8A008Ghost OrchidUsaUNKNOWNunknown@domain.comRareBAD_TS80002025-03-206
9A009Black Baccara RoseFranceEMEAunknown@domain.comLuxury2025-03-21T11:00:00250002025-01-016
10A010Middlemist RedUkEMEAbuyer10@gmail.comRare2025-03-22T12:00:00500002025-03-226

Explanation

Removes duplicate transactions while preserving first occurrence.

7.PROC FORMAT

proc format;

value pricegrp low-10000='Low'

             10001-50000='Premium'

              50001-high='Ultra Luxury';

run;

LOG:

NOTE: Format PRICEGRP has been output.

Explanation

Formats improve reporting consistency and executive readability.

8.PROC SQL Approach

proc sql;

create table flower_sql as

select Region,

count(*) as Transactions,

mean(Price_USD) as Avg_Price format=pricegrp.

from flower_clean3

group by Region;

quit;

proc print data=flower_sql;

run;

OUTPUT:

ObsRegionTransactionsAvg_Price
1APAC5Ultra Luxury
2EMEA4Premium
3UNKNOWN1Low

Explanation

PROC SQL simplifies aggregation logic and resembles ANSI SQL databases.

9.DATA Step Alternative

proc summary data=flower_clean3 nway;

class Region;

var Price_USD;

output out=summary mean=;

run;

proc print data=summary;

run;

OUTPUT:

ObsRegion_TYPE__FREQ_Price_USD
1APAC153044300
2EMEA1430000
3UNKNOWN118000

Explanation

DATA Step and PROC SUMMARY often outperform SQL on extremely large SAS datasets.

10.R Raw Dataset

flower_raw <- read.delim(

  text="

Auction_ID|Flower_Name|Country|Auction_Date|Region|Buyer_Email|Price_USD|Category|Record_TS

A001|Juliet Rose|UK|2025-01-05| apac |buyer1@gmail.com|15000000|Rare|2025-01-05T10:30:00

A002|Shenzhen Nongke Orchid|China||APC|buyer2gmail.com|200000|rare|2025-01-06T11:00:00

A003|Kadupul Flower|Sri Lanka|2025-02-10|asia-pacific|NULL|-5000|RARE|INVALID_TS

A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00

A004|Gold of Kinabalu Orchid|Malaysia|2025-02-12|EMEA|buyer4@yahoo.com|30000|Luxury|2025-02-12T12:30:00

A005|Saffron Crocus|India|2025-03-15| ApAc |buyer5@|4500|premium|2025-03-15T15:00:00

A006|Blue Rose|Japan|2025-03-18|APAC|buyer6@gmail.com|12000|Rare|2025-03-18T09:00:00

A007|Tulip Bulb Emperor|Netherlands|2025-03-19|EMEA|buyer7@yahoo.com|15000|Premium|2025-03-19T10:00:00

A008|Ghost Orchid|USA|2025-03-20|AMER|buyer8gmail.com|-8000|rare|BAD_TS

A009|Black Baccara Rose|France||emea|NULL|25000|Luxury|2025-03-21T11:00:00

A010|Middlemist Red|UK|2025-03-22| EMEA |buyer10@gmail.com|50000|RARE|2025-03-22T12:00:00

",

  sep="|",

  header=TRUE,

  stringsAsFactors=FALSE

)

OUTPUT:

Auction_ID

Flower_Name

Country

Auction_Date

Region

Buyer_Email

Price_USD

Category

Record_TS

A001

Juliet Rose

UK

2025-01-05

 apac

buyer1@gmail.com

15000000

Rare

2025-01-05T10:30:00

A002

Shenzhen Nongke Orchid

China

APC

buyer2gmail.com

200000

rare

2025-01-06T11:00:00

A003

Kadupul Flower

Sri Lanka

2025-02-10

asia-pacific

NULL

-5000

RARE

INVALID_TS

A004

Gold of Kinabalu Orchid

Malaysia

2025-02-12

EMEA

buyer4@yahoo.com

30000

Luxury

2025-02-12T12:30:00

A004

Gold of Kinabalu Orchid

Malaysia

2025-02-12

EMEA

buyer4@yahoo.com

30000

Luxury

2025-02-12T12:30:00

A005

Saffron Crocus

India

2025-03-15

 ApAc

buyer5@

4500

premium

2025-03-15T15:00:00

A006

Blue Rose

Japan

2025-03-18

APAC

buyer6@gmail.com

12000

Rare

2025-03-18T09:00:00

A007

Tulip Bulb Emperor

Netherlands

2025-03-19

EMEA

buyer7@yahoo.com

15000

Premium

2025-03-19T10:00:00

A008

Ghost Orchid

USA

2025-03-20

AMER

buyer8gmail.com

-8000

rare

BAD_TS

A009

Black Baccara Rose

France

emea

NULL

25000

Luxury

2025-03-21T11:00:00

A010

Middlemist Red

UK

2025-03-22

 EMEA

buyer10@gmail.com

50000

RARE

2025-03-22T12:00:00


11.R Cleaning Layer

library(tidyverse)

library(janitor)

library(lubridate)

options(scipen = 999)

flower_clean <- flower_raw %>%

  clean_names() %>%

  mutate(

    flower_name = str_to_title(str_trim(flower_name)),

    country = str_to_title(country),

    region = str_replace_all(region, " ", ""),

    region = str_to_upper(region),

    buyer_email = str_to_lower(buyer_email),

    price_usd = abs(price_usd),

    buyer_email = if_else(buyer_email == "null" |

        !grepl("@", buyer_email) | !grepl("\\.", buyer_email),

      "unknown@domain.com",buyer_email),

    auction_date = coalesce(auction_date, "2025-01-01")

  ) %>%

  distinct(auction_id, .keep_all = TRUE)

OUTPUT:

auction_id

flower_name

country

auction_date

region

buyer_email

price_usd

category

record_ts

A001

Juliet Rose

Uk

2025-01-05

APAC

buyer1@gmail.com

15000000

Rare

2025-01-05T10:30:00

A002

Shenzhen Nongke Orchid

China

APC

unknown@domain.com

200000

rare

2025-01-06T11:00:00

A003

Kadupul Flower

Sri Lanka

2025-02-10

ASIA-PACIFIC

unknown@domain.com

5000

RARE

INVALID_TS

A004

Gold Of Kinabalu Orchid

Malaysia

2025-02-12

EMEA

buyer4@yahoo.com

30000

Luxury

2025-02-12T12:30:00

A005

Saffron Crocus

India

2025-03-15

APAC

unknown@domain.com

4500

premium

2025-03-15T15:00:00

A006

Blue Rose

Japan

2025-03-18

APAC

buyer6@gmail.com

12000

Rare

2025-03-18T09:00:00

A007

Tulip Bulb Emperor

Netherlands

2025-03-19

EMEA

buyer7@yahoo.com

15000

Premium

2025-03-19T10:00:00

A008

Ghost Orchid

Usa

2025-03-20

AMER

unknown@domain.com

8000

rare

BAD_TS

A009

Black Baccara Rose

France

EMEA

unknown@domain.com

25000

Luxury

2025-03-21T11:00:00

A010

Middlemist Red

Uk

2025-03-22

EMEA

buyer10@gmail.com

50000

RARE

2025-03-22T12:00:00

Explanation

Equivalent mappings:

SAS

R

STRIP

str_trim

PROPCASE

str_to_title

LOWCASE

str_to_lower

ABS

abs

COMPRESS

str_replace_all

COALESCEC

coalesce

R provides elegant pipeline syntax while SAS provides stronger auditability.

Enterprise Validation and Compliance

In regulated environments such as clinical research:

  • SDTM submissions require traceability.
  • ADaM datasets require reproducibility.
  • QC programmers must independently validate outputs.
  • Audit trails must preserve lineage.

Missing numeric values in SAS are treated as smaller than every number.

Example:

if Age<18 then Minor='Y';

If Age is missing:

Result = Y

This creates catastrophic analytical errors.

Correct approach:

if not missing(age) and age<18 then Minor='Y';

Twenty Data Cleaning Best Practices

  1. Validate metadata before coding.
  2. Never overwrite source datasets.
  3. Use reusable macros.
  4. Apply standard formats.
  5. Maintain audit trails.
  6. Use version control.
  7. Validate duplicates early.
  8. Separate raw and clean layers.
  9. Standardize date formats.
  10. Normalize text variables.
  11. Validate ranges.
  12. Use defensive programming.
  13. Automate QC checks.
  14. Document derivations.
  15. Validate joins.
  16. Use controlled terminology.
  17. Protect lineage.
  18. Review missing values.
  19. Build reusable utilities.
  20. Maintain deployment controls.

Business Logic Behind Cleaning

Missing values are rarely random.

A missing patient visit date may indicate site entry delay rather than missed treatment.

Negative invoice values may indicate reversal transactions rather than actual refunds.

A patient age of 350 years likely reflects a data-entry issue.

Normalizing text variables prevents duplicate categories.

Date standardization enables time-series analytics.

Without these corrections, downstream AI models inherit upstream errors.

Data science cannot compensate for fundamentally broken source data.

Twenty One-Line Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic beats visual inspection.
  • Metadata drives reproducibility.
  • Standardization reduces ambiguity.
  • Missing values have business meaning.
  • Duplicate records distort AI models.
  • Auditable code builds trust.
  • Traceability protects submissions.
  • Defensive programming prevents disasters.
  • Controlled terminology improves consistency.
  • Data lineage matters.
  • Automation reduces risk.
  • Validation is continuous.
  • Reproducibility drives confidence.
  • Governance enables scale.
  • Macros reduce redundancy.
  • Formats improve readability.
  • Documentation saves projects.
  • Clean data powers AI.
  • Quality precedes analytics.

SAS V/S R

Capability

SAS

R

Auditability

Excellent

Moderate

Regulatory acceptance

Excellent

Growing

Flexibility

Moderate

Excellent

Visualization

Moderate

Excellent

Scalability

Excellent

Excellent

Traceability

Excellent

Moderate

SAS dominates highly regulated environments while R excels in exploratory analytics and advanced modeling.

Together they create a powerful analytical ecosystem.

Final Thoughts

Whether you work in luxury flower analytics, banking fraud detection, insurance underwriting, or clinical trials, poor-quality data quietly destroys trust long before dashboards fail visibly.

The real value of SAS and R is not simply cleaning datasets.

It is creating confidence.

SAS provides industrial-strength governance, reproducibility, and regulatory compliance.

R provides flexibility, innovation, and rapid experimentation.

Together they transform corrupted operational records into trustworthy business intelligence.

In modern analytics ecosystems, organizations do not compete on who owns the most data.

They compete on who trusts their data enough to act on it.

And that journey begins with disciplined, repeatable, production-grade data cleaning frameworks.

Interview Questions

1. How would you remove duplicates in SAS?

Use:

proc sort nodupkey;

or

proc sql

select distinct

2. Why are missing values dangerous in SAS?

Because missing numeric values are lower than all valid values.

3. When would you prefer PROC SQL over DATA Step?

Complex joins and aggregations.

4. Why use arrays?

To eliminate repetitive validation logic.

5. How do you validate malformed emails in R?

grepl("@",email)

or regular expressions using stringr.


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent FLOWERS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics 

·  Clinical SAS Programmer

·  Research Data Analyst

·  Regulatory Data Validator


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS