☕ Espresso, Arabica, Robusta and Broken Data: Building Enterprise Analytics with SAS and R

 ☕ From Bitter Beans to Brilliant Insights: Transforming Global Coffee Data into Analytical Intelligence with SAS and R

Introduction:The Business Crisis Nobody Saw Coming

A multinational coffee retail company launched an AI recommendation engine to predict customer purchasing behavior across 40 countries. Within weeks, executives noticed strange results.

Customers who preferred premium Ethiopian Yirgacheffe coffee were receiving recommendations for instant coffee sachets.

Financial dashboards reported negative revenues.

Regional sales reports showed "asia", "ASIA", " Apac ", and "APAC" as four different business regions.

Duplicate coffee identifiers caused inventory inflation.

Malformed supplier emails prevented procurement alerts from being delivered.

Some roasting dates were recorded as 2045 while others appeared as 1920.

The AI system classified several luxury coffee brands as unknown categories simply because one dataset contained "Espresso", another contained "espresso", and another contained "ESPRESSO ".

This is exactly how poor-quality operational data quietly becomes a strategic business risk.

Dirty data damages:

  • Executive dashboards
  • AI prediction models
  • Forecasting engines
  • Regulatory submissions
  • Statistical outputs
  • Inventory planning
  • Customer segmentation
  • Financial reporting

As Clinical SAS Programmers and Data Scientists, we face identical problems in clinical trials where a single incorrect patient age or missing visit date can invalidate an entire analysis population.

Today we will solve these challenges using a global coffee dataset containing intentional corruption.

Global Coffee Dataset

20 Coffee Types Across the World

Our dataset contains more than twenty coffee varieties:

  • Espresso
  • Americano
  • Cappuccino
  • Latte
  • Mocha
  • Flat White
  • Macchiato
  • Ristretto
  • Affogato
  • Irish Coffee
  • Turkish Coffee
  • Arabic Coffee
  • Cold Brew
  • Nitro Coffee
  • Doppio
  • Cortado
  • Red Eye
  • Viennese Coffee
  • Frappuccino
  • Cafe Au Lait
  • Lungo
  • Piccolo Latte

Dataset Variables

Our enterprise dataset contains nine variables:

  1. Coffee_ID
  2. Coffee_Name
  3. Country
  4. Region
  5. Supplier_Email
  6. Roast_Date
  7. Price_USD
  8. Customer_Rating
  9. Category

Intentional Data Corruption

The raw dataset intentionally contains:

  • Duplicate Coffee IDs
  • Missing roast dates
  • Negative prices
  • Invalid ratings
  • Mixed upper and lowercase values
  • NULL strings
  • Extra whitespace
  • Invalid timestamps
  • Corrupted regions
  • Malformed email addresses
  • Invalid category labels
  • Mixed character and numeric values

1.SAS Raw Dataset Creation

data coffee_raw;

length Coffee_ID $10 Coffee_Name $40 Country $25 Region $20

       Supplier_Email $60 Category $20;

informat Roast_Date anydtdte20.;

format Roast_Date date9.;

infile datalines dlm='|' dsd truncover;

input Coffee_ID $ Coffee_Name :$40. Country :$25. Region :$15.

Supplier_Email :$60. Roast_Date :anydtdte20. Price_USD $

Customer_Rating Category :$20.;

datalines;

CF001| espresso |italy| apac |sales@coffee.com|2025-01-05|4.50|5|Premium

CF002|LATTE|Italy|APAC|invalidemail|2025-02-01|-5.25|6|premium

CF003|Mocha|usa|Amer|supplier@email|.|3.75|4|Regular

CF003|Mocha|USA|AMER|supplier@email|2025-03-10|3.75|4|Regular

CF004|NULL|Brazil|LATAM|contact@coffee.com|2045-10-20|2.50|3|regular

CF005|Cold Brew|India| asia |coffee@vendor.com|1920-01-01|5.00|2|Unknown

CF006|Cappuccino|Italy|EMEA|cappuccino@gmail.com|2025-04-15|-6.25|5|Premium

CF007| Americano |USA|amer|americano.com|2025-04-20|3.95|4|Regular

CF008|Flat White|Australia|APAC|flat@coffee.com|2025-15-01|4.25|5|Premium

CF009|Macchiato|Italy|EUROPE|macchiato@coffee|2025-05-01|5.50|7|Luxury

CF010|Affogato|Italy| EMEA |affogato@coffee.com||6.25|4|Premium

CF011|Irish Coffee|Ireland|EMEA|irish@coffee.com|2025-05-15|-8.75|-1|Alcoholic

CF012|Turkish Coffee|Turkey|MiddleEast|turkish@coffee.com|2025-05-20|4.20|5|Traditional

CF013|Arabic Coffee|Saudi Arabia|MEA|arabic@coffee.com|2025-05-25|3.80|4|Traditional

CF014|Nitro Coffee|USA|americas|nitrocoffee.com|2025-06-01|5.90|5|Premium

CF015|Doppio|Italy|APAC|doppio@coffee.com|2025-06-05|-3.60|2|premium

CF016|Cortado|Spain|Europe|cortado@coffee.com|2025-06-10|4.60|4|Regular

CF017|Red Eye|USA|AMER|redeye@coffee.com|2025-06-15|NULL|3|Regular

CF018|Viennese Coffee|Austria|EMEA|vienna@coffee.com|2025-06-20|5.75|5|Luxury

CF019|Frappuccino|USA|APAC|frap@coffee.com|2025-13-01|7.20|4|Frozen

CF020|Cafe Au Lait|France|EU|cafeaulait@coffee.com|2025-06-25|4.80|3|Regular

CF021|Lungo|Italy|EMEA|lungo@@coffee.com|2025-07-01|3.90|5|Regular

CF022|Piccolo Latte|Australia|APAC|piccolo@coffee.com|2025-07-05|-4.10|6|PREMIUM

;

run;

proc print data=coffee_raw;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountryRegionSupplier_EmailCategoryRoast_DatePrice_USDCustomer_Rating
1CF001espressoitalyapacsales@coffee.comPremium05JAN20254.505
2CF002LATTEItalyAPACinvalidemailpremium01FEB2025-5.256
3CF003MochausaAmersupplier@emailRegular.3.754
4CF003MochaUSAAMERsupplier@emailRegular10MAR20253.754
5CF004NULLBrazilLATAMcontact@coffee.comregular20OCT20452.503
6CF005Cold BrewIndiaasiacoffee@vendor.comUnknown01JAN19205.002
7CF006CappuccinoItalyEMEAcappuccino@gmail.comPremium15APR2025-6.255
8CF007AmericanoUSAameramericano.comRegular20APR20253.954
9CF008Flat WhiteAustraliaAPACflat@coffee.comPremium.4.255
10CF009MacchiatoItalyEUROPEmacchiato@coffeeLuxury01MAY20255.507
11CF010AffogatoItalyEMEAaffogato@coffee.comPremium.6.254
12CF011Irish CoffeeIrelandEMEAirish@coffee.comAlcoholic15MAY2025-8.75-1
13CF012Turkish CoffeeTurkeyMiddleEastturkish@coffee.comTraditional20MAY20254.205
14CF013Arabic CoffeeSaudi ArabiaMEAarabic@coffee.comTraditional25MAY20253.804
15CF014Nitro CoffeeUSAamericasnitrocoffee.comPremium01JUN20255.905
16CF015DoppioItalyAPACdoppio@coffee.compremium05JUN2025-3.602
17CF016CortadoSpainEuropecortado@coffee.comRegular10JUN20254.604
18CF017Red EyeUSAAMERredeye@coffee.comRegular15JUN2025NULL3
19CF018Viennese CoffeeAustriaEMEAvienna@coffee.comLuxury20JUN20255.755
20CF019FrappuccinoUSAAPACfrap@coffee.comFrozen.7.204
21CF020Cafe Au LaitFranceEUcafeaulait@coffee.comRegular25JUN20254.803
22CF021LungoItalyEMEAlungo@@coffee.comRegular01JUL20253.905
23CF022Piccolo LatteAustraliaAPACpiccolo@coffee.comPREMIUM05JUL2025-4.106

Explanation

The LENGTH statement appears before INPUT processing to prevent character truncation risk. If SAS encounters a variable assignment before LENGTH declaration, it automatically allocates storage based on the first observed value. A value such as "Espresso" may allocate only eight bytes, causing later values such as "Viennese Coffee" to be truncated. This behavior is fundamentally different from R where character vectors are dynamically managed in memory and are not fixed-width structures.

2.Enterprise Cleaning Workflow

2.1 Standardizing Text

data coffee_clean;

set coffee_raw;

coffee_name = propcase(strip(coffee_name));

country = propcase(country);

region = upcase(strip(region));

supplier_email = lowcase(strip(supplier_email));

run;

proc print data=coffee_clean;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountryRegionSupplier_EmailCategoryRoast_DatePrice_USDCustomer_Rating
1CF001EspressoItalyAPACsales@coffee.comPremium05JAN20254.505
2CF002LatteItalyAPACinvalidemailpremium01FEB2025-5.256
3CF003MochaUsaAMERsupplier@emailRegular.3.754
4CF003MochaUsaAMERsupplier@emailRegular10MAR20253.754
5CF004NullBrazilLATAMcontact@coffee.comregular20OCT20452.503
6CF005Cold BrewIndiaASIAcoffee@vendor.comUnknown01JAN19205.002
7CF006CappuccinoItalyEMEAcappuccino@gmail.comPremium15APR2025-6.255
8CF007AmericanoUsaAMERamericano.comRegular20APR20253.954
9CF008Flat WhiteAustraliaAPACflat@coffee.comPremium.4.255
10CF009MacchiatoItalyEUROPEmacchiato@coffeeLuxury01MAY20255.507
11CF010AffogatoItalyEMEAaffogato@coffee.comPremium.6.254
12CF011Irish CoffeeIrelandEMEAirish@coffee.comAlcoholic15MAY2025-8.75-1
13CF012Turkish CoffeeTurkeyMIDDLEEASTturkish@coffee.comTraditional20MAY20254.205
14CF013Arabic CoffeeSaudi ArabiaMEAarabic@coffee.comTraditional25MAY20253.804
15CF014Nitro CoffeeUsaAMERICASnitrocoffee.comPremium01JUN20255.905
16CF015DoppioItalyAPACdoppio@coffee.compremium05JUN2025-3.602
17CF016CortadoSpainEUROPEcortado@coffee.comRegular10JUN20254.604
18CF017Red EyeUsaAMERredeye@coffee.comRegular15JUN2025NULL3
19CF018Viennese CoffeeAustriaEMEAvienna@coffee.comLuxury20JUN20255.755
20CF019FrappuccinoUsaAPACfrap@coffee.comFrozen.7.204
21CF020Cafe Au LaitFranceEUcafeaulait@coffee.comRegular25JUN20254.803
22CF021LungoItalyEMEAlungo@@coffee.comRegular01JUL20253.905
23CF022Piccolo LatteAustraliaAPACpiccolo@coffee.comPREMIUM05JUL2025-4.106

Explanation

STRIP removes leading and trailing spaces, PROPCASE standardizes capitalization, while UPCASE ensures grouping consistency during aggregation. Without normalization, PROC FREQ treats "asia", "ASIA", and " Asia " as independent categories producing incorrect statistics.

2.2 Correcting Negative Prices

data coffee_clean;

set coffee_clean;

price_usd_Num = input(price_usd,best32.);

drop price_usd;

rename price_usd_Num=price_Usd;

price_usd = abs(price_usd);

price_usd = round(price_usd,.01);

run;

proc print data=coffee_clean;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountryRegionSupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usd
1CF001EspressoItalyAPACsales@coffee.comPremium05JAN202554.50
2CF002LatteItalyAPACinvalidemailpremium01FEB20256-5.25
3CF003MochaUsaAMERsupplier@emailRegular.43.75
4CF003MochaUsaAMERsupplier@emailRegular10MAR202543.75
5CF004NullBrazilLATAMcontact@coffee.comregular20OCT204532.50
6CF005Cold BrewIndiaASIAcoffee@vendor.comUnknown01JAN192025.00
7CF006CappuccinoItalyEMEAcappuccino@gmail.comPremium15APR20255-6.25
8CF007AmericanoUsaAMERamericano.comRegular20APR202543.95
9CF008Flat WhiteAustraliaAPACflat@coffee.comPremium.54.25
10CF009MacchiatoItalyEUROPEmacchiato@coffeeLuxury01MAY202575.50
11CF010AffogatoItalyEMEAaffogato@coffee.comPremium.46.25
12CF011Irish CoffeeIrelandEMEAirish@coffee.comAlcoholic15MAY2025-1-8.75
13CF012Turkish CoffeeTurkeyMIDDLEEASTturkish@coffee.comTraditional20MAY202554.20
14CF013Arabic CoffeeSaudi ArabiaMEAarabic@coffee.comTraditional25MAY202543.80
15CF014Nitro CoffeeUsaAMERICASnitrocoffee.comPremium01JUN202555.90
16CF015DoppioItalyAPACdoppio@coffee.compremium05JUN20252-3.60
17CF016CortadoSpainEUROPEcortado@coffee.comRegular10JUN202544.60
18CF017Red EyeUsaAMERredeye@coffee.comRegular15JUN20253.
19CF018Viennese CoffeeAustriaEMEAvienna@coffee.comLuxury20JUN202555.75
20CF019FrappuccinoUsaAPACfrap@coffee.comFrozen.47.20
21CF020Cafe Au LaitFranceEUcafeaulait@coffee.comRegular25JUN202534.80
22CF021LungoItalyEMEAlungo@@coffee.comRegular01JUL202553.90
23CF022Piccolo LatteAustraliaAPACpiccolo@coffee.comPREMIUM05JUL20256-4.10

Explanation

ABS converts negative values into positive billing amounts while ROUND enforces financial precision. Negative transaction values can severely distort forecasting and revenue recognition calculations.

2.3 Missing Value Detection

data coffee_clean;

set coffee_clean;

missing_count=nmiss(price_usd,customer_rating)

             +cmiss(coffee_name,country,region);

run;

proc print data=coffee_clean;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountryRegionSupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_count
1CF001EspressoItalyAPACsales@coffee.comPremium05JAN202554.500
2CF002LatteItalyAPACinvalidemailpremium01FEB20256-5.250
3CF003MochaUsaAMERsupplier@emailRegular.43.750
4CF003MochaUsaAMERsupplier@emailRegular10MAR202543.750
5CF004NullBrazilLATAMcontact@coffee.comregular20OCT204532.500
6CF005Cold BrewIndiaASIAcoffee@vendor.comUnknown01JAN192025.000
7CF006CappuccinoItalyEMEAcappuccino@gmail.comPremium15APR20255-6.250
8CF007AmericanoUsaAMERamericano.comRegular20APR202543.950
9CF008Flat WhiteAustraliaAPACflat@coffee.comPremium.54.250
10CF009MacchiatoItalyEUROPEmacchiato@coffeeLuxury01MAY202575.500
11CF010AffogatoItalyEMEAaffogato@coffee.comPremium.46.250
12CF011Irish CoffeeIrelandEMEAirish@coffee.comAlcoholic15MAY2025-1-8.750
13CF012Turkish CoffeeTurkeyMIDDLEEASTturkish@coffee.comTraditional20MAY202554.200
14CF013Arabic CoffeeSaudi ArabiaMEAarabic@coffee.comTraditional25MAY202543.800
15CF014Nitro CoffeeUsaAMERICASnitrocoffee.comPremium01JUN202555.900
16CF015DoppioItalyAPACdoppio@coffee.compremium05JUN20252-3.600
17CF016CortadoSpainEUROPEcortado@coffee.comRegular10JUN202544.600
18CF017Red EyeUsaAMERredeye@coffee.comRegular15JUN20253.1
19CF018Viennese CoffeeAustriaEMEAvienna@coffee.comLuxury20JUN202555.750
20CF019FrappuccinoUsaAPACfrap@coffee.comFrozen.47.200
21CF020Cafe Au LaitFranceEUcafeaulait@coffee.comRegular25JUN202534.800
22CF021LungoItalyEMEAlungo@@coffee.comRegular01JUL202553.900
23CF022Piccolo LatteAustraliaAPACpiccolo@coffee.comPREMIUM05JUL20256-4.100

Explanation

NMISS evaluates numeric variables while CMISS handles character variables. Using both functions provides a complete missingness assessment framework.

2.4 SELECT-WHEN Logic

data coffee_clean;

set coffee_clean;

select(upcase(region));

when('APAC') region_group='Asia Pacific';

when('AMER') region_group='Americas';

when('LATAM') region_group='LatinAmerica';

otherwise region_group='Unknown';

end;

drop region;

rename region_group=region;

run;

proc print data=coffee_clean;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregion
1CF001EspressoItalysales@coffee.comPremium05JAN202554.500Asia Pacific
2CF002LatteItalyinvalidemailpremium01FEB20256-5.250Asia Pacific
3CF003MochaUsasupplier@emailRegular.43.750Americas
4CF003MochaUsasupplier@emailRegular10MAR202543.750Americas
5CF004NullBrazilcontact@coffee.comregular20OCT204532.500LatinAmerica
6CF005Cold BrewIndiacoffee@vendor.comUnknown01JAN192025.000Unknown
7CF006CappuccinoItalycappuccino@gmail.comPremium15APR20255-6.250Unknown
8CF007AmericanoUsaamericano.comRegular20APR202543.950Americas
9CF008Flat WhiteAustraliaflat@coffee.comPremium.54.250Asia Pacific
10CF009MacchiatoItalymacchiato@coffeeLuxury01MAY202575.500Unknown
11CF010AffogatoItalyaffogato@coffee.comPremium.46.250Unknown
12CF011Irish CoffeeIrelandirish@coffee.comAlcoholic15MAY2025-1-8.750Unknown
13CF012Turkish CoffeeTurkeyturkish@coffee.comTraditional20MAY202554.200Unknown
14CF013Arabic CoffeeSaudi Arabiaarabic@coffee.comTraditional25MAY202543.800Unknown
15CF014Nitro CoffeeUsanitrocoffee.comPremium01JUN202555.900Unknown
16CF015DoppioItalydoppio@coffee.compremium05JUN20252-3.600Asia Pacific
17CF016CortadoSpaincortado@coffee.comRegular10JUN202544.600Unknown
18CF017Red EyeUsaredeye@coffee.comRegular15JUN20253.1Americas
19CF018Viennese CoffeeAustriavienna@coffee.comLuxury20JUN202555.750Unknown
20CF019FrappuccinoUsafrap@coffee.comFrozen.47.200Asia Pacific
21CF020Cafe Au LaitFrancecafeaulait@coffee.comRegular25JUN202534.800Unknown
22CF021LungoItalylungo@@coffee.comRegular01JUL202553.900Unknown
23CF022Piccolo LatteAustraliapiccolo@coffee.comPREMIUM05JUL20256-4.100Asia Pacific

Explanation

SELECT-WHEN improves readability compared with large IF-THEN chains and performs particularly well when handling numerous category mappings.

2.5 ARRAY Processing

data coffee_clean;

set coffee_clean;

array charvars(*) coffee_name country region category;

do i=1 to dim(charvars);

charvars(i)=strip(propcase(charvars(i)));

end;

run;

proc print data=coffee_clean;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregioni
1CF001EspressoItalysales@coffee.comPremium05JAN202554.500Asia Pacific5
2CF002LatteItalyinvalidemailPremium01FEB20256-5.250Asia Pacific5
3CF003MochaUsasupplier@emailRegular.43.750Americas5
4CF003MochaUsasupplier@emailRegular10MAR202543.750Americas5
5CF004NullBrazilcontact@coffee.comRegular20OCT204532.500Latinamerica5
6CF005Cold BrewIndiacoffee@vendor.comUnknown01JAN192025.000Unknown5
7CF006CappuccinoItalycappuccino@gmail.comPremium15APR20255-6.250Unknown5
8CF007AmericanoUsaamericano.comRegular20APR202543.950Americas5
9CF008Flat WhiteAustraliaflat@coffee.comPremium.54.250Asia Pacific5
10CF009MacchiatoItalymacchiato@coffeeLuxury01MAY202575.500Unknown5
11CF010AffogatoItalyaffogato@coffee.comPremium.46.250Unknown5
12CF011Irish CoffeeIrelandirish@coffee.comAlcoholic15MAY2025-1-8.750Unknown5
13CF012Turkish CoffeeTurkeyturkish@coffee.comTraditional20MAY202554.200Unknown5
14CF013Arabic CoffeeSaudi Arabiaarabic@coffee.comTraditional25MAY202543.800Unknown5
15CF014Nitro CoffeeUsanitrocoffee.comPremium01JUN202555.900Unknown5
16CF015DoppioItalydoppio@coffee.comPremium05JUN20252-3.600Asia Pacific5
17CF016CortadoSpaincortado@coffee.comRegular10JUN202544.600Unknown5
18CF017Red EyeUsaredeye@coffee.comRegular15JUN20253.1Americas5
19CF018Viennese CoffeeAustriavienna@coffee.comLuxury20JUN202555.750Unknown5
20CF019FrappuccinoUsafrap@coffee.comFrozen.47.200Asia Pacific5
21CF020Cafe Au LaitFrancecafeaulait@coffee.comRegular25JUN202534.800Unknown5
22CF021LungoItalylungo@@coffee.comRegular01JUL202553.900Unknown5
23CF022Piccolo LatteAustraliapiccolo@coffee.comPremium05JUL20256-4.100Asia Pacific5

Explanation

ARRAYS eliminate repetitive programming and significantly improve maintainability in production pipelines containing hundreds of variables.

3.PROC SORT Deduplication

proc sort data=coffee_clean nodupkey;

by coffee_id;

run;

proc print data=coffee_clean;

run;

LOG:

NOTE: There were 23 observations read from the data set WORK.COFFEE_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.COFFEE_CLEAN has 22 observations and 11 variables.

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregioni
1CF001EspressoItalysales@coffee.comPremium05JAN202554.500Asia Pacific5
2CF002LatteItalyinvalidemailPremium01FEB20256-5.250Asia Pacific5
3CF003MochaUsasupplier@emailRegular.43.750Americas5
4CF004NullBrazilcontact@coffee.comRegular20OCT204532.500Latinamerica5
5CF005Cold BrewIndiacoffee@vendor.comUnknown01JAN192025.000Unknown5
6CF006CappuccinoItalycappuccino@gmail.comPremium15APR20255-6.250Unknown5
7CF007AmericanoUsaamericano.comRegular20APR202543.950Americas5
8CF008Flat WhiteAustraliaflat@coffee.comPremium.54.250Asia Pacific5
9CF009MacchiatoItalymacchiato@coffeeLuxury01MAY202575.500Unknown5
10CF010AffogatoItalyaffogato@coffee.comPremium.46.250Unknown5
11CF011Irish CoffeeIrelandirish@coffee.comAlcoholic15MAY2025-1-8.750Unknown5
12CF012Turkish CoffeeTurkeyturkish@coffee.comTraditional20MAY202554.200Unknown5
13CF013Arabic CoffeeSaudi Arabiaarabic@coffee.comTraditional25MAY202543.800Unknown5
14CF014Nitro CoffeeUsanitrocoffee.comPremium01JUN202555.900Unknown5
15CF015DoppioItalydoppio@coffee.comPremium05JUN20252-3.600Asia Pacific5
16CF016CortadoSpaincortado@coffee.comRegular10JUN202544.600Unknown5
17CF017Red EyeUsaredeye@coffee.comRegular15JUN20253.1Americas5
18CF018Viennese CoffeeAustriavienna@coffee.comLuxury20JUN202555.750Unknown5
19CF019FrappuccinoUsafrap@coffee.comFrozen.47.200Asia Pacific5
20CF020Cafe Au LaitFrancecafeaulait@coffee.comRegular25JUN202534.800Unknown5
21CF021LungoItalylungo@@coffee.comRegular01JUL202553.900Unknown5
22CF022Piccolo LatteAustraliapiccolo@coffee.comPremium05JUL20256-4.100Asia Pacific5

Explanation

NODUPKEY retains the first occurrence of each business key. Duplicate records are among the leading causes of incorrect KPIs and inventory distortions.

4.PROC FORMAT

proc format;

value ratingfmt 1-2='Poor'

                  3='Average'

                  4='Good'

                  5='Excellent';

run;

LOG:

NOTE: Format RATINGFMT has been output.

Explanation

PROC FORMAT separates presentation logic from storage logic which simplifies reporting and improves maintainability.

5.PROC SQL Validation

proc sql;

create table invalid_email as

select *

from coffee_clean

where index(Supplier_Email,'@')=0;

quit;

proc print data=invalid_email;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregioni
1CF002LatteItalyinvalidemailPremium01FEB20256-5.250Asia Pacific5
2CF007AmericanoUsaamericano.comRegular20APR202543.950Americas5
3CF014Nitro CoffeeUsanitrocoffee.comPremium01JUN202555.900Unknown5

Explanation

PROC SQL excels in exception reporting, joins, and relational validation tasks that are cumbersome in DATA step processing.

6.DATA Step Alternative

data valid_email invalid_email;

set coffee_clean;

if find(supplier_email,'@')=0 then

output invalid_email;

else output valid_email;

run;

proc print data=invalid_email;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregioni
1CF002LatteItalyinvalidemailPremium01FEB20256-5.250Asia Pacific5
2CF007AmericanoUsaamericano.comRegular20APR202543.950Americas5
3CF014Nitro CoffeeUsanitrocoffee.comPremium01JUN202555.900Unknown5

proc print data=valid_email;

run;

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregioni
1CF001EspressoItalysales@coffee.comPremium05JAN202554.500Asia Pacific5
2CF003MochaUsasupplier@emailRegular.43.750Americas5
3CF004NullBrazilcontact@coffee.comRegular20OCT204532.500Latinamerica5
4CF005Cold BrewIndiacoffee@vendor.comUnknown01JAN192025.000Unknown5
5CF006CappuccinoItalycappuccino@gmail.comPremium15APR20255-6.250Unknown5
6CF008Flat WhiteAustraliaflat@coffee.comPremium.54.250Asia Pacific5
7CF009MacchiatoItalymacchiato@coffeeLuxury01MAY202575.500Unknown5
8CF010AffogatoItalyaffogato@coffee.comPremium.46.250Unknown5
9CF011Irish CoffeeIrelandirish@coffee.comAlcoholic15MAY2025-1-8.750Unknown5
10CF012Turkish CoffeeTurkeyturkish@coffee.comTraditional20MAY202554.200Unknown5
11CF013Arabic CoffeeSaudi Arabiaarabic@coffee.comTraditional25MAY202543.800Unknown5
12CF015DoppioItalydoppio@coffee.comPremium05JUN20252-3.600Asia Pacific5
13CF016CortadoSpaincortado@coffee.comRegular10JUN202544.600Unknown5
14CF017Red EyeUsaredeye@coffee.comRegular15JUN20253.1Americas5
15CF018Viennese CoffeeAustriavienna@coffee.comLuxury20JUN202555.750Unknown5
16CF019FrappuccinoUsafrap@coffee.comFrozen.47.200Asia Pacific5
17CF020Cafe Au LaitFrancecafeaulait@coffee.comRegular25JUN202534.800Unknown5
18CF021LungoItalylungo@@coffee.comRegular01JUL202553.900Unknown5
19CF022Piccolo LatteAustraliapiccolo@coffee.comPremium05JUL20256-4.100Asia Pacific5

Explanation

DATA step processing often outperforms SQL for sequential row-based transformations and provides greater procedural control.

Advanced DATA Step Features

We also leverage:

  • RETAIN for cumulative inventory totals.
  • FIRST./LAST. processing for supplier summaries.
  • MERGE statements for reference enrichment.
  • CATX for composite identifiers.
  • COALESCEC for fallback values.
  • VERIFY for invalid characters.
  • SUBSTR and SCAN for parsing identifiers.
  • INPUT and PUT conversions.
  • INTCK and INTNX for aging calculations.

7.Reusable SAS Macro

%macro standardize(ds,variable);

data &ds;

set coffee_clean;

&variable=upcase(strip(&variable));

run;

proc print data=&ds;

run;

%mend;

%standardize(coffee_clean,supplier_email)

OUTPUT:

ObsCoffee_IDCoffee_NameCountrySupplier_EmailCategoryRoast_DateCustomer_Ratingprice_Usdmissing_countregioni
1CF001EspressoItalySALES@COFFEE.COMPremium05JAN202554.500Asia Pacific5
2CF002LatteItalyINVALIDEMAILPremium01FEB20256-5.250Asia Pacific5
3CF003MochaUsaSUPPLIER@EMAILRegular.43.750Americas5
4CF004NullBrazilCONTACT@COFFEE.COMRegular20OCT204532.500Latinamerica5
5CF005Cold BrewIndiaCOFFEE@VENDOR.COMUnknown01JAN192025.000Unknown5
6CF006CappuccinoItalyCAPPUCCINO@GMAIL.COMPremium15APR20255-6.250Unknown5
7CF007AmericanoUsaAMERICANO.COMRegular20APR202543.950Americas5
8CF008Flat WhiteAustraliaFLAT@COFFEE.COMPremium.54.250Asia Pacific5
9CF009MacchiatoItalyMACCHIATO@COFFEELuxury01MAY202575.500Unknown5
10CF010AffogatoItalyAFFOGATO@COFFEE.COMPremium.46.250Unknown5
11CF011Irish CoffeeIrelandIRISH@COFFEE.COMAlcoholic15MAY2025-1-8.750Unknown5
12CF012Turkish CoffeeTurkeyTURKISH@COFFEE.COMTraditional20MAY202554.200Unknown5
13CF013Arabic CoffeeSaudi ArabiaARABIC@COFFEE.COMTraditional25MAY202543.800Unknown5
14CF014Nitro CoffeeUsaNITROCOFFEE.COMPremium01JUN202555.900Unknown5
15CF015DoppioItalyDOPPIO@COFFEE.COMPremium05JUN20252-3.600Asia Pacific5
16CF016CortadoSpainCORTADO@COFFEE.COMRegular10JUN202544.600Unknown5
17CF017Red EyeUsaREDEYE@COFFEE.COMRegular15JUN20253.1Americas5
18CF018Viennese CoffeeAustriaVIENNA@COFFEE.COMLuxury20JUN202555.750Unknown5
19CF019FrappuccinoUsaFRAP@COFFEE.COMFrozen.47.200Asia Pacific5
20CF020Cafe Au LaitFranceCAFEAULAIT@COFFEE.COMRegular25JUN202534.800Unknown5
21CF021LungoItalyLUNGO@@COFFEE.COMRegular01JUL202553.900Unknown5
22CF022Piccolo LatteAustraliaPICCOLO@COFFEE.COMPremium05JUL20256-4.100Asia Pacific5

Explanation

Macros promote standardization and reduce maintenance costs. Enterprise environments often manage thousands of programs where duplicated logic becomes a validation nightmare.

8.R Raw Dataset

library(tibble)

coffee_raw <- tibble(

  coffee_id = c(

    "CF001","CF002","CF003","CF003","CF004","CF005",

    "CF006","CF007","CF008","CF009","CF010","CF011",

    "CF012","CF013","CF014","CF015","CF016","CF017",

    "CF018","CF019","CF020","CF021","CF022"),

  coffee_name = c(" espresso ","LATTE","Mocha","Mocha","NULL",

    "Cold Brew","Cappuccino"," Americano ","Flat White","Macchiato",

    "Affogato","Irish Coffee","Turkish Coffee","Arabic Coffee",

    "Nitro Coffee","Doppio","Cortado","Red Eye","Viennese Coffee",

    "Frappuccino","Cafe Au Lait","Lungo","Piccolo Latte"),

  country = c("italy","Italy","usa","USA","Brazil","India",

    "Italy","USA","Australia","Italy","Italy","Ireland",

    "Turkey","Saudi Arabia","USA","Italy","Spain",

    "USA","Austria","USA","France","Italy","Australia"),

  region = c(" apac ","APAC","Amer","AMER","LATAM"," asia ",

    "EMEA","amer","APAC","EUROPE"," EMEA ","EMEA","MiddleEast",

    "MEA","americas","APAC","Europe","AMER","EMEA","APAC","EU",

    "EMEA","APAC"),

  supplier_email = c("sales@coffee.com","invalidemail",

    "supplier@email","supplier@email","contact@coffee.com",

    "coffee@vendor.com","cappuccino@gmail.com","americano.com",

    "flat@coffee.com","macchiato@coffee","affogato@coffee.com",

    "irish@coffee.com","turkish@coffee.com","arabic@coffee.com",

    "nitrocoffee.com","doppio@coffee.com","cortado@coffee.com",

    "redeye@coffee.com","vienna@coffee.com","frap@coffee.com",

    "cafeaulait@coffee.com","lungo@@coffee.com","piccolo@coffee.com"),

  roast_date = c("2025-01-05","2025-02-01",NA,"2025-03-10","2045-10-20",

    "1920-01-01","2025-04-15","2025-04-20","2025-15-01","2025-05-01",

    NA,"2025-05-15","2025-05-20","2025-05-25","2025-06-01","2025-06-05",

    "2025-06-10","2025-06-15","2025-06-20","2025-13-01","2025-06-25",

    "2025-07-01","2025-07-05"),

  price_usd = c(4.50,-5.25,3.75,3.75,2.50,5.00,-6.25,3.95,4.25,5.50,

    6.25,-8.75,4.20,3.80,5.90,-3.60,4.60,NA,5.75,7.20,4.80,3.90,-4.10),

  customer_rating = c(5,6,4,4,3,2,5,4,5,7,4,-1,5,4,5,2,4,3,5,4,3,5,6),

  category = c("Premium","premium","Regular","Regular","regular",

    "Unknown","Premium","Regular","Premium","Luxury","Premium",

    "Alcoholic","Traditional","Traditional","Premium","premium",

    "Regular","Regular","Luxury","Frozen","Regular","Regular","PREMIUM")

)

OUTPUT:

coffee_id

coffee_name

country

region

supplier_email

roast_date

price_usd

customer_rating

category

CF001

 espresso

italy

 apac

sales@coffee.com

2025-01-05

4.5

5

Premium

CF002

LATTE

Italy

APAC

invalidemail

2025-02-01

-5.25

6

premium

CF003

Mocha

usa

Amer

supplier@email

3.75

4

Regular

CF003

Mocha

USA

AMER

supplier@email

2025-03-10

3.75

4

Regular

CF004

NULL

Brazil

LATAM

contact@coffee.com

2045-10-20

2.5

3

regular

CF005

Cold Brew

India

 asia

coffee@vendor.com

1920-01-01

5

2

Unknown

CF006

Cappuccino

Italy

EMEA

cappuccino@gmail.com

2025-04-15

-6.25

5

Premium

CF007

 Americano

USA

amer

americano.com

2025-04-20

3.95

4

Regular

CF008

Flat White

Australia

APAC

flat@coffee.com

2025-15-01

4.25

5

Premium

CF009

Macchiato

Italy

EUROPE

macchiato@coffee

2025-05-01

5.5

7

Luxury

CF010

Affogato

Italy

 EMEA

affogato@coffee.com

6.25

4

Premium

CF011

Irish Coffee

Ireland

EMEA

irish@coffee.com

2025-05-15

-8.75

-1

Alcoholic

CF012

Turkish Coffee

Turkey

MiddleEast

turkish@coffee.com

2025-05-20

4.2

5

Traditional

CF013

Arabic Coffee

Saudi Arabia

MEA

arabic@coffee.com

2025-05-25

3.8

4

Traditional

CF014

Nitro Coffee

USA

americas

nitrocoffee.com

2025-06-01

5.9

5

Premium

CF015

Doppio

Italy

APAC

doppio@coffee.com

2025-06-05

-3.6

2

premium

CF016

Cortado

Spain

Europe

cortado@coffee.com

2025-06-10

4.6

4

Regular

CF017

Red Eye

USA

AMER

redeye@coffee.com

2025-06-15

3

Regular

CF018

Viennese Coffee

Austria

EMEA

vienna@coffee.com

2025-06-20

5.75

5

Luxury

CF019

Frappuccino

USA

APAC

frap@coffee.com

2025-13-01

7.2

4

Frozen

CF020

Cafe Au Lait

France

EU

cafeaulait@coffee.com

2025-06-25

4.8

3

Regular

CF021

Lungo

Italy

EMEA

lungo@@coffee.com

2025-07-01

3.9

5

Regular

CF022

Piccolo Latte

Australia

APAC

piccolo@coffee.com

2025-07-05

-4.1

6

PREMIUM


9.R Equivalent Cleaning Pipeline

library(janitor)

library(lubridate)

library(dplyr)

library(stringr)

coffee_clean <- coffee_raw %>%

  janitor::clean_names() %>%

  mutate(

    coffee_id  = str_trim(coffee_id),

    coffee_name=str_to_title(str_trim(coffee_name)),

    country=str_to_title(country),

    region=str_to_upper(str_trim(region)),

    supplier_email=str_to_lower(supplier_email),

    price_usd=abs(price_usd),

    price_usd=round(price_usd,2),

    category=coalesce(category,"Unknown"),

    customer_rating=if_else(customer_rating>5,5,customer_rating)

  ) %>%

  distinct(coffee_id,.keep_all = TRUE)

OUTPUT:

coffee_id

coffee_name

country

region

supplier_email

roast_date

price_usd

customer_rating

category

CF001

Espresso

Italy

APAC

sales@coffee.com

2025-01-05

4.5

5

Premium

CF002

Latte

Italy

APAC

invalidemail

2025-02-01

5.25

5

premium

CF003

Mocha

Usa

AMER

supplier@email

3.75

4

Regular

CF004

Null

Brazil

LATAM

contact@coffee.com

2045-10-20

2.5

3

regular

CF005

Cold Brew

India

ASIA

coffee@vendor.com

1920-01-01

5

2

Unknown

CF006

Cappuccino

Italy

EMEA

cappuccino@gmail.com

2025-04-15

6.25

5

Premium

CF007

Americano

Usa

AMER

americano.com

2025-04-20

3.95

4

Regular

CF008

Flat White

Australia

APAC

flat@coffee.com

2025-15-01

4.25

5

Premium

CF009

Macchiato

Italy

EUROPE

macchiato@coffee

2025-05-01

5.5

5

Luxury

CF010

Affogato

Italy

EMEA

affogato@coffee.com

6.25

4

Premium

CF011

Irish Coffee

Ireland

EMEA

irish@coffee.com

2025-05-15

8.75

-1

Alcoholic

CF012

Turkish Coffee

Turkey

MIDDLEEAST

turkish@coffee.com

2025-05-20

4.2

5

Traditional

CF013

Arabic Coffee

Saudi Arabia

MEA

arabic@coffee.com

2025-05-25

3.8

4

Traditional

CF014

Nitro Coffee

Usa

AMERICAS

nitrocoffee.com

2025-06-01

5.9

5

Premium

CF015

Doppio

Italy

APAC

doppio@coffee.com

2025-06-05

3.6

2

premium

CF016

Cortado

Spain

EUROPE

cortado@coffee.com

2025-06-10

4.6

4

Regular

CF017

Red Eye

Usa

AMER

redeye@coffee.com

2025-06-15

3

Regular

CF018

Viennese Coffee

Austria

EMEA

vienna@coffee.com

2025-06-20

5.75

5

Luxury

CF019

Frappuccino

Usa

APAC

frap@coffee.com

2025-13-01

7.2

4

Frozen

CF020

Cafe Au Lait

France

EU

cafeaulait@coffee.com

2025-06-25

4.8

3

Regular

CF021

Lungo

Italy

EMEA

lungo@@coffee.com

2025-07-01

3.9

5

Regular

CF022

Piccolo Latte

Australia

APAC

piccolo@coffee.com

2025-07-05

4.1

5

PREMIUM

Explanation

mutate() mirrors DATA step assignments, case_when() behaves similarly to SELECT-WHEN, while across() provides ARRAY-like transformations across multiple columns simultaneously.

PROC SQL versus DATA Step

PROC SQL is ideal for:

  • Joins
  • Aggregations
  • Validation reports
  • Relational operations

DATA Step excels at:

  • Sequential processing
  • Complex business logic
  • FIRST./LAST. analysis
  • Retained calculations

Modern enterprises use both.

Enterprise Validation and Compliance

In clinical research, these concepts become regulatory requirements.

Poor cleaning can affect:

  • SDTM datasets
  • ADaM derivations
  • TLF outputs
  • Define.xml traceability

Regulators expect:

  • Audit trails
  • Independent QC
  • Reproducibility
  • Metadata governance
  • End-to-end lineage

One dangerous SAS behavior is that missing numeric values are smaller than every valid number.

if age <18 then pediatric='Y';

A missing age will incorrectly classify patients as pediatric unless explicitly handled.

Business Logic Behind Cleaning

Organizations do not clean data for aesthetics.

They clean data because business decisions depend on it.

Missing dates affect time-to-event analysis.

Incorrect ages distort patient demographics.

Negative revenue corrupts financial statements.

Inconsistent text values inflate category counts.

Missing values may require imputation to preserve statistical power.

Date standardization ensures reproducibility across global systems.

Text normalization enables machine learning models to recognize identical concepts.

Age corrections prevent incorrect treatment assignment.

Salary normalization improves compensation benchmarking.

Every cleaning decision should have documented business justification and traceability.

Twenty Best Practices

  1. Maintain metadata repositories.
  2. Validate before transformation.
  3. Standardize macro libraries.
  4. Preserve raw datasets.
  5. Implement audit trails.
  6. Separate development and production.
  7. Use defensive programming.
  8. Validate all joins.
  9. Review duplicates independently.
  10. Document assumptions.
  11. Version control programs.
  12. Standardize missing values.
  13. Build reusable formats.
  14. Maintain lineage.
  15. Create QC programs.
  16. Avoid hardcoding.
  17. Use parameterized macros.
  18. Validate date ranges.
  19. Profile data continuously.
  20. Automate reporting checks.

Twenty Enterprise Insights

  • Dirty data creates expensive business mistakes.
  • Validation logic beats visual inspection.
  • Standardization improves reproducibility.
  • Metadata is a strategic asset.
  • Duplicate records silently destroy trust.
  • Missing values are business signals.
  • Auditability matters.
  • Reusable code reduces risk.
  • Defensive programming saves projects.
  • Macros improve consistency.
  • Governance improves scalability.
  • Data lineage supports compliance.
  • Profiling prevents surprises.
  • Clean data powers AI.
  • Reporting accuracy drives confidence.
  • Automation improves quality.
  • Traceability protects submissions.
  • Documentation saves time.
  • Quality begins upstream.
  • Reliable analytics start with reliable data.

SAS versus R

SAS dominates regulated industries because of auditability, validation discipline, and reproducible outputs.

R offers unmatched flexibility, visualization capability, and modern data engineering workflows.

SAS provides exceptional scalability for production reporting while R accelerates experimentation and advanced analytics.

Together they form an ideal partnership:

  • SAS ensures governance.
  • R enables innovation.
  • SAS delivers compliance.
  • R delivers flexibility.
  • SAS provides operational stability.
  • R provides analytical agility.

Organizations increasingly deploy hybrid architectures combining both technologies.

Conclusion

Data cleaning is not a preprocessing task.

It is risk management.

Whether analyzing coffee purchasing behavior or clinical trial outcomes, poor-quality data introduces uncertainty into every downstream decision.

A duplicate identifier can inflate sales.

An invalid date can destroy survival analysis.

A malformed email can disrupt procurement operations.

A missing age can misclassify treatment populations.

SAS offers industrial-strength governance, reproducibility, and compliance controls.

R provides speed, flexibility, and modern transformation capabilities.

Together they transform fragmented operational records into reliable analytical intelligence.

The future belongs to organizations capable of building structured, auditable, scalable cleaning frameworks that convert raw information into trusted business decisions.

The companies that invest in data quality today build competitive advantage tomorrow.

Clean data is not merely technical hygiene.

It is enterprise strategy.

Interview Questions

1. A duplicate patient record appears in SDTM DM. How would you handle it?

Use PROC SORT NODUPKEY, investigate source systems, document decisions, and perform independent QC.

2. Why is LENGTH placement important in SAS?

SAS assigns storage during compilation. Incorrect placement causes truncation and irreversible data loss.

3. When would you prefer PROC SQL over DATA step?

For joins, aggregations, and relational logic.

4. Why are missing numeric values dangerous in SAS?

Missing values are considered smaller than valid values and can trigger incorrect conditions.

5. How does R's coalesce() compare with SAS COALESCEC?

Both select the first non-missing value and are heavily used in enterprise imputation workflows.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about COFFEE DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1. Are We Losing Wildlife Faster Than We Think? – Animal Conservation Analysis Using SAS

2.Can Modern Art Data Explain What Actually Sells? – A Real-World SAS Analytics Project

3.Which Vehicles Truly Perform Better on Indian Roads? – A Real-World SAS Analytics Project

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS