☕ Espresso, Arabica, Robusta and Broken Data: Building Enterprise Analytics with SAS and R
☕ From Bitter Beans to Brilliant Insights: Transforming Global Coffee Data into Analytical Intelligence with SAS and R
Introduction:The Business Crisis Nobody Saw Coming
A
multinational coffee retail company launched an AI recommendation engine to
predict customer purchasing behavior across 40 countries. Within weeks,
executives noticed strange results.
Customers
who preferred premium Ethiopian Yirgacheffe coffee were receiving
recommendations for instant coffee sachets.
Financial
dashboards reported negative revenues.
Regional
sales reports showed "asia", "ASIA", " Apac ",
and "APAC" as four different business regions.
Duplicate
coffee identifiers caused inventory inflation.
Malformed
supplier emails prevented procurement alerts from being delivered.
Some
roasting dates were recorded as 2045 while others appeared as 1920.
The AI
system classified several luxury coffee brands as unknown categories simply
because one dataset contained "Espresso", another contained
"espresso", and another contained "ESPRESSO ".
This is
exactly how poor-quality operational data quietly becomes a strategic business
risk.
Dirty
data damages:
- Executive dashboards
- AI prediction models
- Forecasting engines
- Regulatory submissions
- Statistical outputs
- Inventory planning
- Customer segmentation
- Financial reporting
As
Clinical SAS Programmers and Data Scientists, we face identical problems in
clinical trials where a single incorrect patient age or missing visit date can
invalidate an entire analysis population.
Today we
will solve these challenges using a global coffee dataset containing
intentional corruption.
Global Coffee Dataset
20 Coffee Types Across the World
Our
dataset contains more than twenty coffee varieties:
- Espresso
- Americano
- Cappuccino
- Latte
- Mocha
- Flat White
- Macchiato
- Ristretto
- Affogato
- Irish Coffee
- Turkish Coffee
- Arabic Coffee
- Cold Brew
- Nitro Coffee
- Doppio
- Cortado
- Red Eye
- Viennese Coffee
- Frappuccino
- Cafe Au Lait
- Lungo
- Piccolo Latte
Dataset Variables
Our
enterprise dataset contains nine variables:
- Coffee_ID
- Coffee_Name
- Country
- Region
- Supplier_Email
- Roast_Date
- Price_USD
- Customer_Rating
- Category
Intentional Data Corruption
The raw
dataset intentionally contains:
- Duplicate Coffee IDs
- Missing roast dates
- Negative prices
- Invalid ratings
- Mixed upper and lowercase
values
- NULL strings
- Extra whitespace
- Invalid timestamps
- Corrupted regions
- Malformed email addresses
- Invalid category labels
- Mixed character and numeric
values
1.SAS Raw Dataset Creation
data coffee_raw;
length Coffee_ID $10 Coffee_Name $40 Country $25 Region $20
Supplier_Email $60 Category $20;
informat Roast_Date anydtdte20.;
format Roast_Date date9.;
infile datalines dlm='|' dsd truncover;
input Coffee_ID $ Coffee_Name :$40. Country :$25. Region :$15.
Supplier_Email :$60. Roast_Date :anydtdte20. Price_USD $
Customer_Rating Category :$20.;
datalines;
CF001| espresso |italy| apac |sales@coffee.com|2025-01-05|4.50|5|Premium
CF002|LATTE|Italy|APAC|invalidemail|2025-02-01|-5.25|6|premium
CF003|Mocha|usa|Amer|supplier@email|.|3.75|4|Regular
CF003|Mocha|USA|AMER|supplier@email|2025-03-10|3.75|4|Regular
CF004|NULL|Brazil|LATAM|contact@coffee.com|2045-10-20|2.50|3|regular
CF005|Cold Brew|India| asia |coffee@vendor.com|1920-01-01|5.00|2|Unknown
CF006|Cappuccino|Italy|EMEA|cappuccino@gmail.com|2025-04-15|-6.25|5|Premium
CF007| Americano |USA|amer|americano.com|2025-04-20|3.95|4|Regular
CF008|Flat White|Australia|APAC|flat@coffee.com|2025-15-01|4.25|5|Premium
CF009|Macchiato|Italy|EUROPE|macchiato@coffee|2025-05-01|5.50|7|Luxury
CF010|Affogato|Italy| EMEA |affogato@coffee.com||6.25|4|Premium
CF011|Irish Coffee|Ireland|EMEA|irish@coffee.com|2025-05-15|-8.75|-1|Alcoholic
CF012|Turkish Coffee|Turkey|MiddleEast|turkish@coffee.com|2025-05-20|4.20|5|Traditional
CF013|Arabic Coffee|Saudi Arabia|MEA|arabic@coffee.com|2025-05-25|3.80|4|Traditional
CF014|Nitro Coffee|USA|americas|nitrocoffee.com|2025-06-01|5.90|5|Premium
CF015|Doppio|Italy|APAC|doppio@coffee.com|2025-06-05|-3.60|2|premium
CF016|Cortado|Spain|Europe|cortado@coffee.com|2025-06-10|4.60|4|Regular
CF017|Red Eye|USA|AMER|redeye@coffee.com|2025-06-15|NULL|3|Regular
CF018|Viennese Coffee|Austria|EMEA|vienna@coffee.com|2025-06-20|5.75|5|Luxury
CF019|Frappuccino|USA|APAC|frap@coffee.com|2025-13-01|7.20|4|Frozen
CF020|Cafe Au Lait|France|EU|cafeaulait@coffee.com|2025-06-25|4.80|3|Regular
CF021|Lungo|Italy|EMEA|lungo@@coffee.com|2025-07-01|3.90|5|Regular
CF022|Piccolo Latte|Australia|APAC|piccolo@coffee.com|2025-07-05|-4.10|6|PREMIUM
;
run;
proc print data=coffee_raw;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Region | Supplier_Email | Category | Roast_Date | Price_USD | Customer_Rating |
|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | espresso | italy | apac | sales@coffee.com | Premium | 05JAN2025 | 4.50 | 5 |
| 2 | CF002 | LATTE | Italy | APAC | invalidemail | premium | 01FEB2025 | -5.25 | 6 |
| 3 | CF003 | Mocha | usa | Amer | supplier@email | Regular | . | 3.75 | 4 |
| 4 | CF003 | Mocha | USA | AMER | supplier@email | Regular | 10MAR2025 | 3.75 | 4 |
| 5 | CF004 | NULL | Brazil | LATAM | contact@coffee.com | regular | 20OCT2045 | 2.50 | 3 |
| 6 | CF005 | Cold Brew | India | asia | coffee@vendor.com | Unknown | 01JAN1920 | 5.00 | 2 |
| 7 | CF006 | Cappuccino | Italy | EMEA | cappuccino@gmail.com | Premium | 15APR2025 | -6.25 | 5 |
| 8 | CF007 | Americano | USA | amer | americano.com | Regular | 20APR2025 | 3.95 | 4 |
| 9 | CF008 | Flat White | Australia | APAC | flat@coffee.com | Premium | . | 4.25 | 5 |
| 10 | CF009 | Macchiato | Italy | EUROPE | macchiato@coffee | Luxury | 01MAY2025 | 5.50 | 7 |
| 11 | CF010 | Affogato | Italy | EMEA | affogato@coffee.com | Premium | . | 6.25 | 4 |
| 12 | CF011 | Irish Coffee | Ireland | EMEA | irish@coffee.com | Alcoholic | 15MAY2025 | -8.75 | -1 |
| 13 | CF012 | Turkish Coffee | Turkey | MiddleEast | turkish@coffee.com | Traditional | 20MAY2025 | 4.20 | 5 |
| 14 | CF013 | Arabic Coffee | Saudi Arabia | MEA | arabic@coffee.com | Traditional | 25MAY2025 | 3.80 | 4 |
| 15 | CF014 | Nitro Coffee | USA | americas | nitrocoffee.com | Premium | 01JUN2025 | 5.90 | 5 |
| 16 | CF015 | Doppio | Italy | APAC | doppio@coffee.com | premium | 05JUN2025 | -3.60 | 2 |
| 17 | CF016 | Cortado | Spain | Europe | cortado@coffee.com | Regular | 10JUN2025 | 4.60 | 4 |
| 18 | CF017 | Red Eye | USA | AMER | redeye@coffee.com | Regular | 15JUN2025 | NULL | 3 |
| 19 | CF018 | Viennese Coffee | Austria | EMEA | vienna@coffee.com | Luxury | 20JUN2025 | 5.75 | 5 |
| 20 | CF019 | Frappuccino | USA | APAC | frap@coffee.com | Frozen | . | 7.20 | 4 |
| 21 | CF020 | Cafe Au Lait | France | EU | cafeaulait@coffee.com | Regular | 25JUN2025 | 4.80 | 3 |
| 22 | CF021 | Lungo | Italy | EMEA | lungo@@coffee.com | Regular | 01JUL2025 | 3.90 | 5 |
| 23 | CF022 | Piccolo Latte | Australia | APAC | piccolo@coffee.com | PREMIUM | 05JUL2025 | -4.10 | 6 |
Explanation
The
LENGTH statement appears before INPUT processing to prevent character
truncation risk. If SAS encounters a variable assignment before LENGTH
declaration, it automatically allocates storage based on the first observed
value. A value such as "Espresso" may allocate only eight bytes,
causing later values such as "Viennese Coffee" to be truncated. This
behavior is fundamentally different from R where character vectors are
dynamically managed in memory and are not fixed-width structures.
2.Enterprise Cleaning Workflow
2.1 Standardizing Text
data coffee_clean;
set coffee_raw;
coffee_name = propcase(strip(coffee_name));
country = propcase(country);
region = upcase(strip(region));
supplier_email = lowcase(strip(supplier_email));
run;
proc print data=coffee_clean;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Region | Supplier_Email | Category | Roast_Date | Price_USD | Customer_Rating |
|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | APAC | sales@coffee.com | Premium | 05JAN2025 | 4.50 | 5 |
| 2 | CF002 | Latte | Italy | APAC | invalidemail | premium | 01FEB2025 | -5.25 | 6 |
| 3 | CF003 | Mocha | Usa | AMER | supplier@email | Regular | . | 3.75 | 4 |
| 4 | CF003 | Mocha | Usa | AMER | supplier@email | Regular | 10MAR2025 | 3.75 | 4 |
| 5 | CF004 | Null | Brazil | LATAM | contact@coffee.com | regular | 20OCT2045 | 2.50 | 3 |
| 6 | CF005 | Cold Brew | India | ASIA | coffee@vendor.com | Unknown | 01JAN1920 | 5.00 | 2 |
| 7 | CF006 | Cappuccino | Italy | EMEA | cappuccino@gmail.com | Premium | 15APR2025 | -6.25 | 5 |
| 8 | CF007 | Americano | Usa | AMER | americano.com | Regular | 20APR2025 | 3.95 | 4 |
| 9 | CF008 | Flat White | Australia | APAC | flat@coffee.com | Premium | . | 4.25 | 5 |
| 10 | CF009 | Macchiato | Italy | EUROPE | macchiato@coffee | Luxury | 01MAY2025 | 5.50 | 7 |
| 11 | CF010 | Affogato | Italy | EMEA | affogato@coffee.com | Premium | . | 6.25 | 4 |
| 12 | CF011 | Irish Coffee | Ireland | EMEA | irish@coffee.com | Alcoholic | 15MAY2025 | -8.75 | -1 |
| 13 | CF012 | Turkish Coffee | Turkey | MIDDLEEAST | turkish@coffee.com | Traditional | 20MAY2025 | 4.20 | 5 |
| 14 | CF013 | Arabic Coffee | Saudi Arabia | MEA | arabic@coffee.com | Traditional | 25MAY2025 | 3.80 | 4 |
| 15 | CF014 | Nitro Coffee | Usa | AMERICAS | nitrocoffee.com | Premium | 01JUN2025 | 5.90 | 5 |
| 16 | CF015 | Doppio | Italy | APAC | doppio@coffee.com | premium | 05JUN2025 | -3.60 | 2 |
| 17 | CF016 | Cortado | Spain | EUROPE | cortado@coffee.com | Regular | 10JUN2025 | 4.60 | 4 |
| 18 | CF017 | Red Eye | Usa | AMER | redeye@coffee.com | Regular | 15JUN2025 | NULL | 3 |
| 19 | CF018 | Viennese Coffee | Austria | EMEA | vienna@coffee.com | Luxury | 20JUN2025 | 5.75 | 5 |
| 20 | CF019 | Frappuccino | Usa | APAC | frap@coffee.com | Frozen | . | 7.20 | 4 |
| 21 | CF020 | Cafe Au Lait | France | EU | cafeaulait@coffee.com | Regular | 25JUN2025 | 4.80 | 3 |
| 22 | CF021 | Lungo | Italy | EMEA | lungo@@coffee.com | Regular | 01JUL2025 | 3.90 | 5 |
| 23 | CF022 | Piccolo Latte | Australia | APAC | piccolo@coffee.com | PREMIUM | 05JUL2025 | -4.10 | 6 |
Explanation
STRIP
removes leading and trailing spaces, PROPCASE standardizes capitalization,
while UPCASE ensures grouping consistency during aggregation. Without
normalization, PROC FREQ treats "asia", "ASIA", and "
Asia " as independent categories producing incorrect statistics.
2.2 Correcting Negative Prices
data coffee_clean;
set coffee_clean;
price_usd_Num = input(price_usd,best32.);
drop price_usd;
rename price_usd_Num=price_Usd;
price_usd = abs(price_usd);
price_usd = round(price_usd,.01);
run;
proc print data=coffee_clean;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Region | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd |
|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | APAC | sales@coffee.com | Premium | 05JAN2025 | 5 | 4.50 |
| 2 | CF002 | Latte | Italy | APAC | invalidemail | premium | 01FEB2025 | 6 | -5.25 |
| 3 | CF003 | Mocha | Usa | AMER | supplier@email | Regular | . | 4 | 3.75 |
| 4 | CF003 | Mocha | Usa | AMER | supplier@email | Regular | 10MAR2025 | 4 | 3.75 |
| 5 | CF004 | Null | Brazil | LATAM | contact@coffee.com | regular | 20OCT2045 | 3 | 2.50 |
| 6 | CF005 | Cold Brew | India | ASIA | coffee@vendor.com | Unknown | 01JAN1920 | 2 | 5.00 |
| 7 | CF006 | Cappuccino | Italy | EMEA | cappuccino@gmail.com | Premium | 15APR2025 | 5 | -6.25 |
| 8 | CF007 | Americano | Usa | AMER | americano.com | Regular | 20APR2025 | 4 | 3.95 |
| 9 | CF008 | Flat White | Australia | APAC | flat@coffee.com | Premium | . | 5 | 4.25 |
| 10 | CF009 | Macchiato | Italy | EUROPE | macchiato@coffee | Luxury | 01MAY2025 | 7 | 5.50 |
| 11 | CF010 | Affogato | Italy | EMEA | affogato@coffee.com | Premium | . | 4 | 6.25 |
| 12 | CF011 | Irish Coffee | Ireland | EMEA | irish@coffee.com | Alcoholic | 15MAY2025 | -1 | -8.75 |
| 13 | CF012 | Turkish Coffee | Turkey | MIDDLEEAST | turkish@coffee.com | Traditional | 20MAY2025 | 5 | 4.20 |
| 14 | CF013 | Arabic Coffee | Saudi Arabia | MEA | arabic@coffee.com | Traditional | 25MAY2025 | 4 | 3.80 |
| 15 | CF014 | Nitro Coffee | Usa | AMERICAS | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 |
| 16 | CF015 | Doppio | Italy | APAC | doppio@coffee.com | premium | 05JUN2025 | 2 | -3.60 |
| 17 | CF016 | Cortado | Spain | EUROPE | cortado@coffee.com | Regular | 10JUN2025 | 4 | 4.60 |
| 18 | CF017 | Red Eye | Usa | AMER | redeye@coffee.com | Regular | 15JUN2025 | 3 | . |
| 19 | CF018 | Viennese Coffee | Austria | EMEA | vienna@coffee.com | Luxury | 20JUN2025 | 5 | 5.75 |
| 20 | CF019 | Frappuccino | Usa | APAC | frap@coffee.com | Frozen | . | 4 | 7.20 |
| 21 | CF020 | Cafe Au Lait | France | EU | cafeaulait@coffee.com | Regular | 25JUN2025 | 3 | 4.80 |
| 22 | CF021 | Lungo | Italy | EMEA | lungo@@coffee.com | Regular | 01JUL2025 | 5 | 3.90 |
| 23 | CF022 | Piccolo Latte | Australia | APAC | piccolo@coffee.com | PREMIUM | 05JUL2025 | 6 | -4.10 |
Explanation
ABS
converts negative values into positive billing amounts while ROUND enforces
financial precision. Negative transaction values can severely distort
forecasting and revenue recognition calculations.
2.3 Missing Value Detection
data coffee_clean;
set coffee_clean;
missing_count=nmiss(price_usd,customer_rating)
+cmiss(coffee_name,country,region);
run;
proc print data=coffee_clean;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Region | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | APAC | sales@coffee.com | Premium | 05JAN2025 | 5 | 4.50 | 0 |
| 2 | CF002 | Latte | Italy | APAC | invalidemail | premium | 01FEB2025 | 6 | -5.25 | 0 |
| 3 | CF003 | Mocha | Usa | AMER | supplier@email | Regular | . | 4 | 3.75 | 0 |
| 4 | CF003 | Mocha | Usa | AMER | supplier@email | Regular | 10MAR2025 | 4 | 3.75 | 0 |
| 5 | CF004 | Null | Brazil | LATAM | contact@coffee.com | regular | 20OCT2045 | 3 | 2.50 | 0 |
| 6 | CF005 | Cold Brew | India | ASIA | coffee@vendor.com | Unknown | 01JAN1920 | 2 | 5.00 | 0 |
| 7 | CF006 | Cappuccino | Italy | EMEA | cappuccino@gmail.com | Premium | 15APR2025 | 5 | -6.25 | 0 |
| 8 | CF007 | Americano | Usa | AMER | americano.com | Regular | 20APR2025 | 4 | 3.95 | 0 |
| 9 | CF008 | Flat White | Australia | APAC | flat@coffee.com | Premium | . | 5 | 4.25 | 0 |
| 10 | CF009 | Macchiato | Italy | EUROPE | macchiato@coffee | Luxury | 01MAY2025 | 7 | 5.50 | 0 |
| 11 | CF010 | Affogato | Italy | EMEA | affogato@coffee.com | Premium | . | 4 | 6.25 | 0 |
| 12 | CF011 | Irish Coffee | Ireland | EMEA | irish@coffee.com | Alcoholic | 15MAY2025 | -1 | -8.75 | 0 |
| 13 | CF012 | Turkish Coffee | Turkey | MIDDLEEAST | turkish@coffee.com | Traditional | 20MAY2025 | 5 | 4.20 | 0 |
| 14 | CF013 | Arabic Coffee | Saudi Arabia | MEA | arabic@coffee.com | Traditional | 25MAY2025 | 4 | 3.80 | 0 |
| 15 | CF014 | Nitro Coffee | Usa | AMERICAS | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 | 0 |
| 16 | CF015 | Doppio | Italy | APAC | doppio@coffee.com | premium | 05JUN2025 | 2 | -3.60 | 0 |
| 17 | CF016 | Cortado | Spain | EUROPE | cortado@coffee.com | Regular | 10JUN2025 | 4 | 4.60 | 0 |
| 18 | CF017 | Red Eye | Usa | AMER | redeye@coffee.com | Regular | 15JUN2025 | 3 | . | 1 |
| 19 | CF018 | Viennese Coffee | Austria | EMEA | vienna@coffee.com | Luxury | 20JUN2025 | 5 | 5.75 | 0 |
| 20 | CF019 | Frappuccino | Usa | APAC | frap@coffee.com | Frozen | . | 4 | 7.20 | 0 |
| 21 | CF020 | Cafe Au Lait | France | EU | cafeaulait@coffee.com | Regular | 25JUN2025 | 3 | 4.80 | 0 |
| 22 | CF021 | Lungo | Italy | EMEA | lungo@@coffee.com | Regular | 01JUL2025 | 5 | 3.90 | 0 |
| 23 | CF022 | Piccolo Latte | Australia | APAC | piccolo@coffee.com | PREMIUM | 05JUL2025 | 6 | -4.10 | 0 |
Explanation
NMISS
evaluates numeric variables while CMISS handles character variables. Using both
functions provides a complete missingness assessment framework.
2.4 SELECT-WHEN Logic
data coffee_clean;
set coffee_clean;
select(upcase(region));
when('APAC') region_group='Asia Pacific';
when('AMER') region_group='Americas';
when('LATAM') region_group='LatinAmerica';
otherwise region_group='Unknown';
end;
drop region;
rename region_group=region;
run;
proc print data=coffee_clean;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | sales@coffee.com | Premium | 05JAN2025 | 5 | 4.50 | 0 | Asia Pacific |
| 2 | CF002 | Latte | Italy | invalidemail | premium | 01FEB2025 | 6 | -5.25 | 0 | Asia Pacific |
| 3 | CF003 | Mocha | Usa | supplier@email | Regular | . | 4 | 3.75 | 0 | Americas |
| 4 | CF003 | Mocha | Usa | supplier@email | Regular | 10MAR2025 | 4 | 3.75 | 0 | Americas |
| 5 | CF004 | Null | Brazil | contact@coffee.com | regular | 20OCT2045 | 3 | 2.50 | 0 | LatinAmerica |
| 6 | CF005 | Cold Brew | India | coffee@vendor.com | Unknown | 01JAN1920 | 2 | 5.00 | 0 | Unknown |
| 7 | CF006 | Cappuccino | Italy | cappuccino@gmail.com | Premium | 15APR2025 | 5 | -6.25 | 0 | Unknown |
| 8 | CF007 | Americano | Usa | americano.com | Regular | 20APR2025 | 4 | 3.95 | 0 | Americas |
| 9 | CF008 | Flat White | Australia | flat@coffee.com | Premium | . | 5 | 4.25 | 0 | Asia Pacific |
| 10 | CF009 | Macchiato | Italy | macchiato@coffee | Luxury | 01MAY2025 | 7 | 5.50 | 0 | Unknown |
| 11 | CF010 | Affogato | Italy | affogato@coffee.com | Premium | . | 4 | 6.25 | 0 | Unknown |
| 12 | CF011 | Irish Coffee | Ireland | irish@coffee.com | Alcoholic | 15MAY2025 | -1 | -8.75 | 0 | Unknown |
| 13 | CF012 | Turkish Coffee | Turkey | turkish@coffee.com | Traditional | 20MAY2025 | 5 | 4.20 | 0 | Unknown |
| 14 | CF013 | Arabic Coffee | Saudi Arabia | arabic@coffee.com | Traditional | 25MAY2025 | 4 | 3.80 | 0 | Unknown |
| 15 | CF014 | Nitro Coffee | Usa | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 | 0 | Unknown |
| 16 | CF015 | Doppio | Italy | doppio@coffee.com | premium | 05JUN2025 | 2 | -3.60 | 0 | Asia Pacific |
| 17 | CF016 | Cortado | Spain | cortado@coffee.com | Regular | 10JUN2025 | 4 | 4.60 | 0 | Unknown |
| 18 | CF017 | Red Eye | Usa | redeye@coffee.com | Regular | 15JUN2025 | 3 | . | 1 | Americas |
| 19 | CF018 | Viennese Coffee | Austria | vienna@coffee.com | Luxury | 20JUN2025 | 5 | 5.75 | 0 | Unknown |
| 20 | CF019 | Frappuccino | Usa | frap@coffee.com | Frozen | . | 4 | 7.20 | 0 | Asia Pacific |
| 21 | CF020 | Cafe Au Lait | France | cafeaulait@coffee.com | Regular | 25JUN2025 | 3 | 4.80 | 0 | Unknown |
| 22 | CF021 | Lungo | Italy | lungo@@coffee.com | Regular | 01JUL2025 | 5 | 3.90 | 0 | Unknown |
| 23 | CF022 | Piccolo Latte | Australia | piccolo@coffee.com | PREMIUM | 05JUL2025 | 6 | -4.10 | 0 | Asia Pacific |
Explanation
SELECT-WHEN
improves readability compared with large IF-THEN chains and performs
particularly well when handling numerous category mappings.
2.5 ARRAY Processing
data coffee_clean;
set coffee_clean;
array charvars(*) coffee_name country region category;
do i=1 to dim(charvars);
charvars(i)=strip(propcase(charvars(i)));
end;
run;
proc print data=coffee_clean;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region | i |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | sales@coffee.com | Premium | 05JAN2025 | 5 | 4.50 | 0 | Asia Pacific | 5 |
| 2 | CF002 | Latte | Italy | invalidemail | Premium | 01FEB2025 | 6 | -5.25 | 0 | Asia Pacific | 5 |
| 3 | CF003 | Mocha | Usa | supplier@email | Regular | . | 4 | 3.75 | 0 | Americas | 5 |
| 4 | CF003 | Mocha | Usa | supplier@email | Regular | 10MAR2025 | 4 | 3.75 | 0 | Americas | 5 |
| 5 | CF004 | Null | Brazil | contact@coffee.com | Regular | 20OCT2045 | 3 | 2.50 | 0 | Latinamerica | 5 |
| 6 | CF005 | Cold Brew | India | coffee@vendor.com | Unknown | 01JAN1920 | 2 | 5.00 | 0 | Unknown | 5 |
| 7 | CF006 | Cappuccino | Italy | cappuccino@gmail.com | Premium | 15APR2025 | 5 | -6.25 | 0 | Unknown | 5 |
| 8 | CF007 | Americano | Usa | americano.com | Regular | 20APR2025 | 4 | 3.95 | 0 | Americas | 5 |
| 9 | CF008 | Flat White | Australia | flat@coffee.com | Premium | . | 5 | 4.25 | 0 | Asia Pacific | 5 |
| 10 | CF009 | Macchiato | Italy | macchiato@coffee | Luxury | 01MAY2025 | 7 | 5.50 | 0 | Unknown | 5 |
| 11 | CF010 | Affogato | Italy | affogato@coffee.com | Premium | . | 4 | 6.25 | 0 | Unknown | 5 |
| 12 | CF011 | Irish Coffee | Ireland | irish@coffee.com | Alcoholic | 15MAY2025 | -1 | -8.75 | 0 | Unknown | 5 |
| 13 | CF012 | Turkish Coffee | Turkey | turkish@coffee.com | Traditional | 20MAY2025 | 5 | 4.20 | 0 | Unknown | 5 |
| 14 | CF013 | Arabic Coffee | Saudi Arabia | arabic@coffee.com | Traditional | 25MAY2025 | 4 | 3.80 | 0 | Unknown | 5 |
| 15 | CF014 | Nitro Coffee | Usa | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 | 0 | Unknown | 5 |
| 16 | CF015 | Doppio | Italy | doppio@coffee.com | Premium | 05JUN2025 | 2 | -3.60 | 0 | Asia Pacific | 5 |
| 17 | CF016 | Cortado | Spain | cortado@coffee.com | Regular | 10JUN2025 | 4 | 4.60 | 0 | Unknown | 5 |
| 18 | CF017 | Red Eye | Usa | redeye@coffee.com | Regular | 15JUN2025 | 3 | . | 1 | Americas | 5 |
| 19 | CF018 | Viennese Coffee | Austria | vienna@coffee.com | Luxury | 20JUN2025 | 5 | 5.75 | 0 | Unknown | 5 |
| 20 | CF019 | Frappuccino | Usa | frap@coffee.com | Frozen | . | 4 | 7.20 | 0 | Asia Pacific | 5 |
| 21 | CF020 | Cafe Au Lait | France | cafeaulait@coffee.com | Regular | 25JUN2025 | 3 | 4.80 | 0 | Unknown | 5 |
| 22 | CF021 | Lungo | Italy | lungo@@coffee.com | Regular | 01JUL2025 | 5 | 3.90 | 0 | Unknown | 5 |
| 23 | CF022 | Piccolo Latte | Australia | piccolo@coffee.com | Premium | 05JUL2025 | 6 | -4.10 | 0 | Asia Pacific | 5 |
Explanation
ARRAYS
eliminate repetitive programming and significantly improve maintainability in
production pipelines containing hundreds of variables.
3.PROC SORT Deduplication
proc sort data=coffee_clean nodupkey;
by coffee_id;
run;
proc print data=coffee_clean;
run;
LOG:
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region | i |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | sales@coffee.com | Premium | 05JAN2025 | 5 | 4.50 | 0 | Asia Pacific | 5 |
| 2 | CF002 | Latte | Italy | invalidemail | Premium | 01FEB2025 | 6 | -5.25 | 0 | Asia Pacific | 5 |
| 3 | CF003 | Mocha | Usa | supplier@email | Regular | . | 4 | 3.75 | 0 | Americas | 5 |
| 4 | CF004 | Null | Brazil | contact@coffee.com | Regular | 20OCT2045 | 3 | 2.50 | 0 | Latinamerica | 5 |
| 5 | CF005 | Cold Brew | India | coffee@vendor.com | Unknown | 01JAN1920 | 2 | 5.00 | 0 | Unknown | 5 |
| 6 | CF006 | Cappuccino | Italy | cappuccino@gmail.com | Premium | 15APR2025 | 5 | -6.25 | 0 | Unknown | 5 |
| 7 | CF007 | Americano | Usa | americano.com | Regular | 20APR2025 | 4 | 3.95 | 0 | Americas | 5 |
| 8 | CF008 | Flat White | Australia | flat@coffee.com | Premium | . | 5 | 4.25 | 0 | Asia Pacific | 5 |
| 9 | CF009 | Macchiato | Italy | macchiato@coffee | Luxury | 01MAY2025 | 7 | 5.50 | 0 | Unknown | 5 |
| 10 | CF010 | Affogato | Italy | affogato@coffee.com | Premium | . | 4 | 6.25 | 0 | Unknown | 5 |
| 11 | CF011 | Irish Coffee | Ireland | irish@coffee.com | Alcoholic | 15MAY2025 | -1 | -8.75 | 0 | Unknown | 5 |
| 12 | CF012 | Turkish Coffee | Turkey | turkish@coffee.com | Traditional | 20MAY2025 | 5 | 4.20 | 0 | Unknown | 5 |
| 13 | CF013 | Arabic Coffee | Saudi Arabia | arabic@coffee.com | Traditional | 25MAY2025 | 4 | 3.80 | 0 | Unknown | 5 |
| 14 | CF014 | Nitro Coffee | Usa | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 | 0 | Unknown | 5 |
| 15 | CF015 | Doppio | Italy | doppio@coffee.com | Premium | 05JUN2025 | 2 | -3.60 | 0 | Asia Pacific | 5 |
| 16 | CF016 | Cortado | Spain | cortado@coffee.com | Regular | 10JUN2025 | 4 | 4.60 | 0 | Unknown | 5 |
| 17 | CF017 | Red Eye | Usa | redeye@coffee.com | Regular | 15JUN2025 | 3 | . | 1 | Americas | 5 |
| 18 | CF018 | Viennese Coffee | Austria | vienna@coffee.com | Luxury | 20JUN2025 | 5 | 5.75 | 0 | Unknown | 5 |
| 19 | CF019 | Frappuccino | Usa | frap@coffee.com | Frozen | . | 4 | 7.20 | 0 | Asia Pacific | 5 |
| 20 | CF020 | Cafe Au Lait | France | cafeaulait@coffee.com | Regular | 25JUN2025 | 3 | 4.80 | 0 | Unknown | 5 |
| 21 | CF021 | Lungo | Italy | lungo@@coffee.com | Regular | 01JUL2025 | 5 | 3.90 | 0 | Unknown | 5 |
| 22 | CF022 | Piccolo Latte | Australia | piccolo@coffee.com | Premium | 05JUL2025 | 6 | -4.10 | 0 | Asia Pacific | 5 |
Explanation
NODUPKEY
retains the first occurrence of each business key. Duplicate records are among
the leading causes of incorrect KPIs and inventory distortions.
4.PROC FORMAT
proc format;
value ratingfmt 1-2='Poor'
3='Average'
4='Good'
5='Excellent';
run;
LOG:
Explanation
PROC
FORMAT separates presentation logic from storage logic which simplifies
reporting and improves maintainability.
5.PROC SQL Validation
proc sql;
create table invalid_email as
select *
from coffee_clean
where index(Supplier_Email,'@')=0;
quit;
proc print data=invalid_email;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region | i |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF002 | Latte | Italy | invalidemail | Premium | 01FEB2025 | 6 | -5.25 | 0 | Asia Pacific | 5 |
| 2 | CF007 | Americano | Usa | americano.com | Regular | 20APR2025 | 4 | 3.95 | 0 | Americas | 5 |
| 3 | CF014 | Nitro Coffee | Usa | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 | 0 | Unknown | 5 |
Explanation
PROC SQL
excels in exception reporting, joins, and relational validation tasks that are
cumbersome in DATA step processing.
6.DATA Step Alternative
data valid_email invalid_email;
set coffee_clean;
if find(supplier_email,'@')=0 then
output invalid_email;
else output valid_email;
run;
proc print data=invalid_email;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region | i |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF002 | Latte | Italy | invalidemail | Premium | 01FEB2025 | 6 | -5.25 | 0 | Asia Pacific | 5 |
| 2 | CF007 | Americano | Usa | americano.com | Regular | 20APR2025 | 4 | 3.95 | 0 | Americas | 5 |
| 3 | CF014 | Nitro Coffee | Usa | nitrocoffee.com | Premium | 01JUN2025 | 5 | 5.90 | 0 | Unknown | 5 |
proc print data=valid_email;
run;
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region | i |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | sales@coffee.com | Premium | 05JAN2025 | 5 | 4.50 | 0 | Asia Pacific | 5 |
| 2 | CF003 | Mocha | Usa | supplier@email | Regular | . | 4 | 3.75 | 0 | Americas | 5 |
| 3 | CF004 | Null | Brazil | contact@coffee.com | Regular | 20OCT2045 | 3 | 2.50 | 0 | Latinamerica | 5 |
| 4 | CF005 | Cold Brew | India | coffee@vendor.com | Unknown | 01JAN1920 | 2 | 5.00 | 0 | Unknown | 5 |
| 5 | CF006 | Cappuccino | Italy | cappuccino@gmail.com | Premium | 15APR2025 | 5 | -6.25 | 0 | Unknown | 5 |
| 6 | CF008 | Flat White | Australia | flat@coffee.com | Premium | . | 5 | 4.25 | 0 | Asia Pacific | 5 |
| 7 | CF009 | Macchiato | Italy | macchiato@coffee | Luxury | 01MAY2025 | 7 | 5.50 | 0 | Unknown | 5 |
| 8 | CF010 | Affogato | Italy | affogato@coffee.com | Premium | . | 4 | 6.25 | 0 | Unknown | 5 |
| 9 | CF011 | Irish Coffee | Ireland | irish@coffee.com | Alcoholic | 15MAY2025 | -1 | -8.75 | 0 | Unknown | 5 |
| 10 | CF012 | Turkish Coffee | Turkey | turkish@coffee.com | Traditional | 20MAY2025 | 5 | 4.20 | 0 | Unknown | 5 |
| 11 | CF013 | Arabic Coffee | Saudi Arabia | arabic@coffee.com | Traditional | 25MAY2025 | 4 | 3.80 | 0 | Unknown | 5 |
| 12 | CF015 | Doppio | Italy | doppio@coffee.com | Premium | 05JUN2025 | 2 | -3.60 | 0 | Asia Pacific | 5 |
| 13 | CF016 | Cortado | Spain | cortado@coffee.com | Regular | 10JUN2025 | 4 | 4.60 | 0 | Unknown | 5 |
| 14 | CF017 | Red Eye | Usa | redeye@coffee.com | Regular | 15JUN2025 | 3 | . | 1 | Americas | 5 |
| 15 | CF018 | Viennese Coffee | Austria | vienna@coffee.com | Luxury | 20JUN2025 | 5 | 5.75 | 0 | Unknown | 5 |
| 16 | CF019 | Frappuccino | Usa | frap@coffee.com | Frozen | . | 4 | 7.20 | 0 | Asia Pacific | 5 |
| 17 | CF020 | Cafe Au Lait | France | cafeaulait@coffee.com | Regular | 25JUN2025 | 3 | 4.80 | 0 | Unknown | 5 |
| 18 | CF021 | Lungo | Italy | lungo@@coffee.com | Regular | 01JUL2025 | 5 | 3.90 | 0 | Unknown | 5 |
| 19 | CF022 | Piccolo Latte | Australia | piccolo@coffee.com | Premium | 05JUL2025 | 6 | -4.10 | 0 | Asia Pacific | 5 |
Explanation
DATA step
processing often outperforms SQL for sequential row-based transformations and
provides greater procedural control.
Advanced DATA Step Features
We also
leverage:
- RETAIN for cumulative
inventory totals.
- FIRST./LAST. processing for
supplier summaries.
- MERGE statements for
reference enrichment.
- CATX for composite
identifiers.
- COALESCEC for fallback
values.
- VERIFY for invalid
characters.
- SUBSTR and SCAN for parsing
identifiers.
- INPUT and PUT conversions.
- INTCK and INTNX for aging
calculations.
7.Reusable SAS Macro
%macro standardize(ds,variable);
data &ds;
set coffee_clean;
&variable=upcase(strip(&variable));
run;
proc print data=&ds;
run;
%mend;
%standardize(coffee_clean,supplier_email)
OUTPUT:
| Obs | Coffee_ID | Coffee_Name | Country | Supplier_Email | Category | Roast_Date | Customer_Rating | price_Usd | missing_count | region | i |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | CF001 | Espresso | Italy | SALES@COFFEE.COM | Premium | 05JAN2025 | 5 | 4.50 | 0 | Asia Pacific | 5 |
| 2 | CF002 | Latte | Italy | INVALIDEMAIL | Premium | 01FEB2025 | 6 | -5.25 | 0 | Asia Pacific | 5 |
| 3 | CF003 | Mocha | Usa | SUPPLIER@EMAIL | Regular | . | 4 | 3.75 | 0 | Americas | 5 |
| 4 | CF004 | Null | Brazil | CONTACT@COFFEE.COM | Regular | 20OCT2045 | 3 | 2.50 | 0 | Latinamerica | 5 |
| 5 | CF005 | Cold Brew | India | COFFEE@VENDOR.COM | Unknown | 01JAN1920 | 2 | 5.00 | 0 | Unknown | 5 |
| 6 | CF006 | Cappuccino | Italy | CAPPUCCINO@GMAIL.COM | Premium | 15APR2025 | 5 | -6.25 | 0 | Unknown | 5 |
| 7 | CF007 | Americano | Usa | AMERICANO.COM | Regular | 20APR2025 | 4 | 3.95 | 0 | Americas | 5 |
| 8 | CF008 | Flat White | Australia | FLAT@COFFEE.COM | Premium | . | 5 | 4.25 | 0 | Asia Pacific | 5 |
| 9 | CF009 | Macchiato | Italy | MACCHIATO@COFFEE | Luxury | 01MAY2025 | 7 | 5.50 | 0 | Unknown | 5 |
| 10 | CF010 | Affogato | Italy | AFFOGATO@COFFEE.COM | Premium | . | 4 | 6.25 | 0 | Unknown | 5 |
| 11 | CF011 | Irish Coffee | Ireland | IRISH@COFFEE.COM | Alcoholic | 15MAY2025 | -1 | -8.75 | 0 | Unknown | 5 |
| 12 | CF012 | Turkish Coffee | Turkey | TURKISH@COFFEE.COM | Traditional | 20MAY2025 | 5 | 4.20 | 0 | Unknown | 5 |
| 13 | CF013 | Arabic Coffee | Saudi Arabia | ARABIC@COFFEE.COM | Traditional | 25MAY2025 | 4 | 3.80 | 0 | Unknown | 5 |
| 14 | CF014 | Nitro Coffee | Usa | NITROCOFFEE.COM | Premium | 01JUN2025 | 5 | 5.90 | 0 | Unknown | 5 |
| 15 | CF015 | Doppio | Italy | DOPPIO@COFFEE.COM | Premium | 05JUN2025 | 2 | -3.60 | 0 | Asia Pacific | 5 |
| 16 | CF016 | Cortado | Spain | CORTADO@COFFEE.COM | Regular | 10JUN2025 | 4 | 4.60 | 0 | Unknown | 5 |
| 17 | CF017 | Red Eye | Usa | REDEYE@COFFEE.COM | Regular | 15JUN2025 | 3 | . | 1 | Americas | 5 |
| 18 | CF018 | Viennese Coffee | Austria | VIENNA@COFFEE.COM | Luxury | 20JUN2025 | 5 | 5.75 | 0 | Unknown | 5 |
| 19 | CF019 | Frappuccino | Usa | FRAP@COFFEE.COM | Frozen | . | 4 | 7.20 | 0 | Asia Pacific | 5 |
| 20 | CF020 | Cafe Au Lait | France | CAFEAULAIT@COFFEE.COM | Regular | 25JUN2025 | 3 | 4.80 | 0 | Unknown | 5 |
| 21 | CF021 | Lungo | Italy | LUNGO@@COFFEE.COM | Regular | 01JUL2025 | 5 | 3.90 | 0 | Unknown | 5 |
| 22 | CF022 | Piccolo Latte | Australia | PICCOLO@COFFEE.COM | Premium | 05JUL2025 | 6 | -4.10 | 0 | Asia Pacific | 5 |
Explanation
Macros
promote standardization and reduce maintenance costs. Enterprise environments
often manage thousands of programs where duplicated logic becomes a validation
nightmare.
8.R Raw Dataset
library(tibble)
coffee_raw <- tibble(
coffee_id = c(
"CF001","CF002","CF003","CF003","CF004","CF005",
"CF006","CF007","CF008","CF009","CF010","CF011",
"CF012","CF013","CF014","CF015","CF016","CF017",
"CF018","CF019","CF020","CF021","CF022"),
coffee_name = c(" espresso ","LATTE","Mocha","Mocha","NULL",
"Cold Brew","Cappuccino"," Americano ","Flat White","Macchiato",
"Affogato","Irish Coffee","Turkish Coffee","Arabic Coffee",
"Nitro Coffee","Doppio","Cortado","Red Eye","Viennese Coffee",
"Frappuccino","Cafe Au Lait","Lungo","Piccolo Latte"),
country = c("italy","Italy","usa","USA","Brazil","India",
"Italy","USA","Australia","Italy","Italy","Ireland",
"Turkey","Saudi Arabia","USA","Italy","Spain",
"USA","Austria","USA","France","Italy","Australia"),
region = c(" apac ","APAC","Amer","AMER","LATAM"," asia ",
"EMEA","amer","APAC","EUROPE"," EMEA ","EMEA","MiddleEast",
"MEA","americas","APAC","Europe","AMER","EMEA","APAC","EU",
"EMEA","APAC"),
supplier_email = c("sales@coffee.com","invalidemail",
"supplier@email","supplier@email","contact@coffee.com",
"coffee@vendor.com","cappuccino@gmail.com","americano.com",
"flat@coffee.com","macchiato@coffee","affogato@coffee.com",
"irish@coffee.com","turkish@coffee.com","arabic@coffee.com",
"nitrocoffee.com","doppio@coffee.com","cortado@coffee.com",
"redeye@coffee.com","vienna@coffee.com","frap@coffee.com",
"cafeaulait@coffee.com","lungo@@coffee.com","piccolo@coffee.com"),
roast_date = c("2025-01-05","2025-02-01",NA,"2025-03-10","2045-10-20",
"1920-01-01","2025-04-15","2025-04-20","2025-15-01","2025-05-01",
NA,"2025-05-15","2025-05-20","2025-05-25","2025-06-01","2025-06-05",
"2025-06-10","2025-06-15","2025-06-20","2025-13-01","2025-06-25",
"2025-07-01","2025-07-05"),
price_usd = c(4.50,-5.25,3.75,3.75,2.50,5.00,-6.25,3.95,4.25,5.50,
6.25,-8.75,4.20,3.80,5.90,-3.60,4.60,NA,5.75,7.20,4.80,3.90,-4.10),
customer_rating = c(5,6,4,4,3,2,5,4,5,7,4,-1,5,4,5,2,4,3,5,4,3,5,6),
category = c("Premium","premium","Regular","Regular","regular",
"Unknown","Premium","Regular","Premium","Luxury","Premium",
"Alcoholic","Traditional","Traditional","Premium","premium",
"Regular","Regular","Luxury","Frozen","Regular","Regular","PREMIUM")
)
OUTPUT:
|
coffee_id |
coffee_name |
country |
region |
supplier_email |
roast_date |
price_usd |
customer_rating |
category |
|
CF001 |
espresso |
italy |
apac |
sales@coffee.com |
2025-01-05 |
4.5 |
5 |
Premium |
|
CF002 |
LATTE |
Italy |
APAC |
invalidemail |
2025-02-01 |
-5.25 |
6 |
premium |
|
CF003 |
Mocha |
usa |
Amer |
supplier@email |
3.75 |
4 |
Regular |
|
|
CF003 |
Mocha |
USA |
AMER |
supplier@email |
2025-03-10 |
3.75 |
4 |
Regular |
|
CF004 |
NULL |
Brazil |
LATAM |
contact@coffee.com |
2045-10-20 |
2.5 |
3 |
regular |
|
CF005 |
Cold
Brew |
India |
asia |
coffee@vendor.com |
1920-01-01 |
5 |
2 |
Unknown |
|
CF006 |
Cappuccino |
Italy |
EMEA |
cappuccino@gmail.com |
2025-04-15 |
-6.25 |
5 |
Premium |
|
CF007 |
Americano |
USA |
amer |
americano.com |
2025-04-20 |
3.95 |
4 |
Regular |
|
CF008 |
Flat
White |
Australia |
APAC |
flat@coffee.com |
2025-15-01 |
4.25 |
5 |
Premium |
|
CF009 |
Macchiato |
Italy |
EUROPE |
macchiato@coffee |
2025-05-01 |
5.5 |
7 |
Luxury |
|
CF010 |
Affogato |
Italy |
EMEA |
affogato@coffee.com |
6.25 |
4 |
Premium |
|
|
CF011 |
Irish
Coffee |
Ireland |
EMEA |
irish@coffee.com |
2025-05-15 |
-8.75 |
-1 |
Alcoholic |
|
CF012 |
Turkish
Coffee |
Turkey |
MiddleEast |
turkish@coffee.com |
2025-05-20 |
4.2 |
5 |
Traditional |
|
CF013 |
Arabic
Coffee |
Saudi
Arabia |
MEA |
arabic@coffee.com |
2025-05-25 |
3.8 |
4 |
Traditional |
|
CF014 |
Nitro
Coffee |
USA |
americas |
nitrocoffee.com |
2025-06-01 |
5.9 |
5 |
Premium |
|
CF015 |
Doppio |
Italy |
APAC |
doppio@coffee.com |
2025-06-05 |
-3.6 |
2 |
premium |
|
CF016 |
Cortado |
Spain |
Europe |
cortado@coffee.com |
2025-06-10 |
4.6 |
4 |
Regular |
|
CF017 |
Red Eye |
USA |
AMER |
redeye@coffee.com |
2025-06-15 |
3 |
Regular |
|
|
CF018 |
Viennese
Coffee |
Austria |
EMEA |
vienna@coffee.com |
2025-06-20 |
5.75 |
5 |
Luxury |
|
CF019 |
Frappuccino |
USA |
APAC |
frap@coffee.com |
2025-13-01 |
7.2 |
4 |
Frozen |
|
CF020 |
Cafe Au
Lait |
France |
EU |
cafeaulait@coffee.com |
2025-06-25 |
4.8 |
3 |
Regular |
|
CF021 |
Lungo |
Italy |
EMEA |
lungo@@coffee.com |
2025-07-01 |
3.9 |
5 |
Regular |
|
CF022 |
Piccolo
Latte |
Australia |
APAC |
piccolo@coffee.com |
2025-07-05 |
-4.1 |
6 |
PREMIUM |
9.R Equivalent Cleaning Pipeline
library(janitor)
library(lubridate)
library(dplyr)
library(stringr)
coffee_clean <- coffee_raw %>%
janitor::clean_names() %>%
mutate(
coffee_id = str_trim(coffee_id),
coffee_name=str_to_title(str_trim(coffee_name)),
country=str_to_title(country),
region=str_to_upper(str_trim(region)),
supplier_email=str_to_lower(supplier_email),
price_usd=abs(price_usd),
price_usd=round(price_usd,2),
category=coalesce(category,"Unknown"),
customer_rating=if_else(customer_rating>5,5,customer_rating)
) %>%
distinct(coffee_id,.keep_all = TRUE)
OUTPUT:
|
coffee_id |
coffee_name |
country |
region |
supplier_email |
roast_date |
price_usd |
customer_rating |
category |
|
CF001 |
Espresso |
Italy |
APAC |
sales@coffee.com |
2025-01-05 |
4.5 |
5 |
Premium |
|
CF002 |
Latte |
Italy |
APAC |
invalidemail |
2025-02-01 |
5.25 |
5 |
premium |
|
CF003 |
Mocha |
Usa |
AMER |
supplier@email |
3.75 |
4 |
Regular |
|
|
CF004 |
Null |
Brazil |
LATAM |
contact@coffee.com |
2045-10-20 |
2.5 |
3 |
regular |
|
CF005 |
Cold
Brew |
India |
ASIA |
coffee@vendor.com |
1920-01-01 |
5 |
2 |
Unknown |
|
CF006 |
Cappuccino |
Italy |
EMEA |
cappuccino@gmail.com |
2025-04-15 |
6.25 |
5 |
Premium |
|
CF007 |
Americano |
Usa |
AMER |
americano.com |
2025-04-20 |
3.95 |
4 |
Regular |
|
CF008 |
Flat
White |
Australia |
APAC |
flat@coffee.com |
2025-15-01 |
4.25 |
5 |
Premium |
|
CF009 |
Macchiato |
Italy |
EUROPE |
macchiato@coffee |
2025-05-01 |
5.5 |
5 |
Luxury |
|
CF010 |
Affogato |
Italy |
EMEA |
affogato@coffee.com |
6.25 |
4 |
Premium |
|
|
CF011 |
Irish
Coffee |
Ireland |
EMEA |
irish@coffee.com |
2025-05-15 |
8.75 |
-1 |
Alcoholic |
|
CF012 |
Turkish
Coffee |
Turkey |
MIDDLEEAST |
turkish@coffee.com |
2025-05-20 |
4.2 |
5 |
Traditional |
|
CF013 |
Arabic
Coffee |
Saudi
Arabia |
MEA |
arabic@coffee.com |
2025-05-25 |
3.8 |
4 |
Traditional |
|
CF014 |
Nitro
Coffee |
Usa |
AMERICAS |
nitrocoffee.com |
2025-06-01 |
5.9 |
5 |
Premium |
|
CF015 |
Doppio |
Italy |
APAC |
doppio@coffee.com |
2025-06-05 |
3.6 |
2 |
premium |
|
CF016 |
Cortado |
Spain |
EUROPE |
cortado@coffee.com |
2025-06-10 |
4.6 |
4 |
Regular |
|
CF017 |
Red Eye |
Usa |
AMER |
redeye@coffee.com |
2025-06-15 |
3 |
Regular |
|
|
CF018 |
Viennese
Coffee |
Austria |
EMEA |
vienna@coffee.com |
2025-06-20 |
5.75 |
5 |
Luxury |
|
CF019 |
Frappuccino |
Usa |
APAC |
frap@coffee.com |
2025-13-01 |
7.2 |
4 |
Frozen |
|
CF020 |
Cafe Au
Lait |
France |
EU |
cafeaulait@coffee.com |
2025-06-25 |
4.8 |
3 |
Regular |
|
CF021 |
Lungo |
Italy |
EMEA |
lungo@@coffee.com |
2025-07-01 |
3.9 |
5 |
Regular |
|
CF022 |
Piccolo
Latte |
Australia |
APAC |
piccolo@coffee.com |
2025-07-05 |
4.1 |
5 |
PREMIUM |
Explanation
mutate()
mirrors DATA step assignments, case_when() behaves similarly to SELECT-WHEN,
while across() provides ARRAY-like transformations across multiple columns
simultaneously.
PROC SQL versus DATA Step
PROC SQL
is ideal for:
- Joins
- Aggregations
- Validation reports
- Relational operations
DATA Step
excels at:
- Sequential processing
- Complex business logic
- FIRST./LAST. analysis
- Retained calculations
Modern
enterprises use both.
Enterprise Validation and
Compliance
In
clinical research, these concepts become regulatory requirements.
Poor
cleaning can affect:
- SDTM datasets
- ADaM derivations
- TLF outputs
- Define.xml traceability
Regulators
expect:
- Audit trails
- Independent QC
- Reproducibility
- Metadata governance
- End-to-end lineage
One
dangerous SAS behavior is that missing numeric values are smaller than every
valid number.
if age <18 then pediatric='Y';
A missing
age will incorrectly classify patients as pediatric unless explicitly handled.
Business Logic Behind
Cleaning
Organizations
do not clean data for aesthetics.
They
clean data because business decisions depend on it.
Missing
dates affect time-to-event analysis.
Incorrect
ages distort patient demographics.
Negative
revenue corrupts financial statements.
Inconsistent
text values inflate category counts.
Missing
values may require imputation to preserve statistical power.
Date
standardization ensures reproducibility across global systems.
Text
normalization enables machine learning models to recognize identical concepts.
Age
corrections prevent incorrect treatment assignment.
Salary
normalization improves compensation benchmarking.
Every
cleaning decision should have documented business justification and
traceability.
Twenty Best Practices
- Maintain metadata
repositories.
- Validate before
transformation.
- Standardize macro libraries.
- Preserve raw datasets.
- Implement audit trails.
- Separate development and
production.
- Use defensive programming.
- Validate all joins.
- Review duplicates
independently.
- Document assumptions.
- Version control programs.
- Standardize missing values.
- Build reusable formats.
- Maintain lineage.
- Create QC programs.
- Avoid hardcoding.
- Use parameterized macros.
- Validate date ranges.
- Profile data continuously.
- Automate reporting checks.
Twenty Enterprise Insights
- Dirty data creates expensive
business mistakes.
- Validation logic beats
visual inspection.
- Standardization improves
reproducibility.
- Metadata is a strategic
asset.
- Duplicate records silently
destroy trust.
- Missing values are business
signals.
- Auditability matters.
- Reusable code reduces risk.
- Defensive programming saves
projects.
- Macros improve consistency.
- Governance improves
scalability.
- Data lineage supports
compliance.
- Profiling prevents
surprises.
- Clean data powers AI.
- Reporting accuracy drives
confidence.
- Automation improves quality.
- Traceability protects
submissions.
- Documentation saves time.
- Quality begins upstream.
- Reliable analytics start
with reliable data.
SAS versus R
SAS
dominates regulated industries because of auditability, validation discipline,
and reproducible outputs.
R offers
unmatched flexibility, visualization capability, and modern data engineering
workflows.
SAS
provides exceptional scalability for production reporting while R accelerates
experimentation and advanced analytics.
Together
they form an ideal partnership:
- SAS ensures governance.
- R enables innovation.
- SAS delivers compliance.
- R delivers flexibility.
- SAS provides operational
stability.
- R provides analytical
agility.
Organizations
increasingly deploy hybrid architectures combining both technologies.
Conclusion
Data
cleaning is not a preprocessing task.
It is
risk management.
Whether
analyzing coffee purchasing behavior or clinical trial outcomes, poor-quality
data introduces uncertainty into every downstream decision.
A
duplicate identifier can inflate sales.
An
invalid date can destroy survival analysis.
A
malformed email can disrupt procurement operations.
A missing
age can misclassify treatment populations.
SAS
offers industrial-strength governance, reproducibility, and compliance
controls.
R
provides speed, flexibility, and modern transformation capabilities.
Together
they transform fragmented operational records into reliable analytical intelligence.
The
future belongs to organizations capable of building structured, auditable,
scalable cleaning frameworks that convert raw information into trusted business
decisions.
The
companies that invest in data quality today build competitive advantage tomorrow.
Clean
data is not merely technical hygiene.
It is
enterprise strategy.
Interview Questions
1. A duplicate patient record appears in SDTM DM.
How would you handle it?
Use PROC
SORT NODUPKEY, investigate source systems, document decisions, and perform
independent QC.
2. Why is LENGTH placement important in SAS?
SAS
assigns storage during compilation. Incorrect placement causes truncation and
irreversible data loss.
3. When would you prefer PROC SQL over DATA step?
For
joins, aggregations, and relational logic.
4. Why are missing numeric values dangerous in SAS?
Missing
values are considered smaller than valid values and can trigger incorrect
conditions.
5. How does R's coalesce() compare with SAS
COALESCEC?
Both
select the first non-missing value and are heavily used in enterprise
imputation workflows.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about COFFEE DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment