Behind the World's Best Alcohol Brands: Building Analysis-Ready Datasets with SAS, R, and Data Validation

From Premium Spirits to Trusted Analytics: Transforming the Best Alcohol Brands in the World Dataset into Analysis-Ready Intelligence Using SAS and R

Introduction:Why Global Alcohol Brand Data Can Become a Business Disaster

Imagine a multinational beverage company preparing its annual executive performance dashboard covering premium alcohol brands across multiple countries. The dataset contains sales transactions from brands such as:

  • Johnnie Walker
  • Jack Daniel's
  • Hennessy
  • Bacardi
  • Smirnoff
  • Corona
  • Heineken
  • Absolut

Everything appears normal until analysts discover:

  • Duplicate transaction IDs
  • Negative sales revenue
  • Missing launch dates
  • Invalid email contacts
  • Mixed region coding standards
  • Corrupted product categories
  • Impossible ratings
  • Extra spaces in brand names
  • Character values stored as numbers
  • Numeric values stored as character strings

The result?

Executive dashboards become inaccurate. AI demand forecasting models produce misleading predictions. Regulatory reporting becomes unreliable. Marketing teams allocate budgets incorrectly. Financial audits fail validation checks.

This is why data cleaning is not a cosmetic activity—it is a business-critical engineering discipline.

1.Raw SAS Dataset with Intentional Errors

SAS Raw Data

data alcohol_raw;

length Brand $30 Country $20 Category $20 Launch_Date $12

       Region $10 Contact_Email $50 Revenue $15;

infile datalines dlm='|' dsd truncover;

input Transaction_ID Brand $ Country $ Category $ Rating

      Revenue $ Launch_Date $ Contact_Email $ Region $;

datalines;

1001| johnnie walker |Scotland|Whisky|9|250000|15JAN2024|contact@jw.com|EU

1001|JOHNNIE WALKER|SCOTLAND|whisky|9|-250000|32JAN2024|wrongmail|EUR

1002|Hennessy|France|Cognac|11|350000|10FEB2024|sales@hennessy.com|EU

1003| Bacardi |Cuba|Rum|-2|280000|.|info@bacardi.com|LATAM

1004|Smirnoff|Russia|vodka|8|NULL|15MAR2024|support@smirnoff.com|EU01

1005|Jack Daniels|USA|WHISKEY|7|500000|28FEB2024|jack@@mail.com|US

1006|Heineken|Netherlands|Beer|10|450000|31APR2024|sales@heineken.com|EU

1007|Corona|Mexico|BEER|9|420000|11MAY2024|contact@corona.com|LA

1008|Absolut|Sweden|Vodka|8|370000|12JUN2024|NULL|EUR

1009|Jameson|Ireland|Whisky|9|390000|15JUL2024|sales@jameson.com|EU

1010|NULL|India|Beer|5|200000|01AUG2024|beer@abc.com|APAC

1011|Kingfisher|India|Beer|4|-10000|15SEP2024|invalidmail.com|AP

1012|Budweiser|USA|Beer|8|550000|10OCT2024|sales@bud.com|USA

1013|Carlsberg|Denmark|Beer|9|470000|11NOV2024|sales@carlsberg.com|EU

1014|Chivas Regal|Scotland|Whisky|7|460000|12DEC2024|chivas@brand.com|EU

1015|Grey Goose|France|Vodka|8|430000|13JAN2025|goose@mail.com|EU

;

run;

proc print data=alcohol_raw;

run;

OUTPUT:

ObsBrandCountryCategoryLaunch_DateRegionContact_EmailRevenueTransaction_IDRating
1johnnie walkerScotlandWhisky15JAN2024EUcontact@jw.com25000010019
2JOHNNIE WALKERSCOTLANDwhisky32JAN2024EURwrongmail-25000010019
3HennessyFranceCognac10FEB2024EUsales@hennessy.com350000100211
4BacardiCubaRum LATAMinfo@bacardi.com2800001003-2
5SmirnoffRussiavodka15MAR2024EU01support@smirnoff.comNULL10048
6Jack DanielsUSAWHISKEY28FEB2024USjack@@mail.com50000010057
7HeinekenNetherlandsBeer31APR2024EUsales@heineken.com450000100610
8CoronaMexicoBEER11MAY2024LAcontact@corona.com42000010079
9AbsolutSwedenVodka12JUN2024EURNULL37000010088
10JamesonIrelandWhisky15JUL2024EUsales@jameson.com39000010099
11NULLIndiaBeer01AUG2024APACbeer@abc.com20000010105
12KingfisherIndiaBeer15SEP2024APinvalidmail.com-1000010114
13BudweiserUSABeer10OCT2024USAsales@bud.com55000010128
14CarlsbergDenmarkBeer11NOV2024EUsales@carlsberg.com47000010139
15Chivas RegalScotlandWhisky12DEC2024EUchivas@brand.com46000010147
16Grey GooseFranceVodka13JAN2025EUgoose@mail.com43000010158

Why LENGTH Must Come First

One of the most overlooked SAS production risks is character truncation.

Incorrect

data test;

name='Johnnie Walker Premium Edition';

length name $10;

run;

proc print data=test;

run;

LOG:

WARNING: Length of character variable name has already been set.
Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

OUTPUT:

Obsname
1Johnnie Wa

Correct

data test;

length name $40;

name='Johnnie Walker Premium Edition';

run;

proc print data=test;

run;

OUTPUT:

Obsname
1Johnnie Walker Premium Edition

Explanation

SAS determines character length during compilation. If LENGTH appears after assignment, the variable may already be defined with insufficient storage. Large pharmaceutical and banking systems frequently suffer silent truncation issues, causing failed joins and reconciliation problems.

In R, character vectors dynamically resize and do not suffer this exact problem, making string handling more flexible.

Step 1: Data Profiling

PROC CONTENTS

proc contents data=alcohol_raw;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.ALCOHOL_RAWObservations16
Member TypeDATAVariables9
EngineV9Indexes0
Created06/24/2026 11:50:05Observation Length176
Last Modified06/24/2026 11:50:05Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page743
Obs in First Data Page16
Number of Data Set Repairs0
Filename/saswork/SAS_work6F2C00008D88_odaws02-apse1-2.oda.sas.com/SAS_workBEA500008D88_odaws02-apse1-2.oda.sas.com/alcohol_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number201333005
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLen
1BrandChar30
3CategoryChar20
6Contact_EmailChar50
2CountryChar20
4Launch_DateChar12
9RatingNum8
5RegionChar10
7RevenueChar15
8Transaction_IDNum8

Explanation

PROC CONTENTS acts as the first validation checkpoint. It identifies variable attributes, storage lengths, formats, informats, and types. Before cleaning any enterprise dataset, programmers should inspect metadata because many downstream failures originate from incorrect variable definitions. In regulated clinical trials, metadata verification is part of audit readiness and traceability requirements.

Step 2: Enterprise Data Cleaning

DATA Step Cleaning Workflow

data alcohol_clean;

length Brand_Clean $30 Category_Clean $20

       Region_Clean $15;

set alcohol_raw;

Brand_Clean=propcase(strip(Brand));

if upcase(Brand_Clean)="NULL" then Brand_Clean="Unknown";

Category_Clean=propcase(strip(Category));

Revenue_Num=input(compress(Revenue),best12.);

Revenue_Num=abs(Revenue_Num);

Rating=max(min(Rating,10),1);

select(upcase(region));

   when('EU','EUR','EU01') Region_Clean='EUROPE';

   when('US','USA') Region_Clean='NORTH_AMERICA';

   when('AP','APAC') Region_Clean='ASIA_PACIFIC';

   when('LA','LATAM') Region_Clean='LATIN_AMERICA';

   otherwise Region_Clean='UNKNOWN';

end;

Email_Flag=

(prxmatch('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$/i',

Contact_Email)>0);

drop Revenue Region Brand Category;

rename Revenue_Num=Revenue Region_Clean=Region

       Brand_Clean=Brand Category_Clean=Category;

run;

proc print data=alcohol_clean;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_Flag
1Johnnie WalkerWhiskyEUROPEScotland15JAN2024contact@jw.com100192500000
2Johnnie WalkerWhiskyEUROPESCOTLAND32JAN2024wrongmail100192500000
3HennessyCognacEUROPEFrance10FEB2024sales@hennessy.com1002103500000
4BacardiRumLATIN_AMERICACuba info@bacardi.com100312800000
5SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.0
6Jack DanielsWhiskeyNORTH_AMERICAUSA28FEB2024jack@@mail.com100575000000
7HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com1006104500000
8CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com100794200000
9AbsolutVodkaEUROPESweden12JUN2024NULL100883700000
10JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com100993900000
11UnknownBeerASIA_PACIFICIndia01AUG2024beer@abc.com101052000000
12KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com10114100000
13BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com101285500000
14CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com101394700000
15Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com101474600000
16Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com101584300000

Explanation

This DATA step demonstrates production-quality standardization. We normalize text using STRIP and PROPCASE, convert revenue from character to numeric using INPUT, eliminate negative values using ABS, and harmonize region codes through SELECT-WHEN logic. Email validation uses PRXMATCH regular expressions. Such transformations ensure consistent reporting, improve join accuracy, and prevent analytical bias caused by inconsistent categories.

Step 3: Deduplication

proc sort data=alcohol_clean

          out=alcohol_nodup nodupkey;

by Transaction_ID;

run;

proc print data=alcohol_nodup;

run;

LOG:

NOTE: There were 16 observations read from the data set WORK.ALCOHOL_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.ALCOHOL_NODUP has 15 observations and 10 variables.

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_Flag
1Johnnie WalkerWhiskyEUROPEScotland15JAN2024contact@jw.com100192500000
2HennessyCognacEUROPEFrance10FEB2024sales@hennessy.com1002103500000
3BacardiRumLATIN_AMERICACuba info@bacardi.com100312800000
4SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.0
5Jack DanielsWhiskeyNORTH_AMERICAUSA28FEB2024jack@@mail.com100575000000
6HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com1006104500000
7CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com100794200000
8AbsolutVodkaEUROPESweden12JUN2024NULL100883700000
9JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com100993900000
10UnknownBeerASIA_PACIFICIndia01AUG2024beer@abc.com101052000000
11KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com10114100000
12BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com101285500000
13CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com101394700000
14Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com101474600000
15Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com101584300000

Explanation

Duplicate records often distort revenue calculations and KPI reporting. PROC SORT NODUPKEY retains the first occurrence of a BY-group while removing duplicates. In clinical trials, duplicate subject IDs can invalidate statistical outputs and compromise regulatory submissions.

Step 4: ARRAY Processing

data alcohol_array;

set alcohol_nodup;

array nums {*} Rating Revenue;

do i=1 to dim(nums);

 if nums{i}<0 then nums{i}=abs(nums{i});

end;

drop i;

run;

proc print data=alcohol_array;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_Flag
1Johnnie WalkerWhiskyEUROPEScotland15JAN2024contact@jw.com100192500000
2HennessyCognacEUROPEFrance10FEB2024sales@hennessy.com1002103500000
3BacardiRumLATIN_AMERICACuba info@bacardi.com100312800000
4SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.0
5Jack DanielsWhiskeyNORTH_AMERICAUSA28FEB2024jack@@mail.com100575000000
6HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com1006104500000
7CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com100794200000
8AbsolutVodkaEUROPESweden12JUN2024NULL100883700000
9JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com100993900000
10UnknownBeerASIA_PACIFICIndia01AUG2024beer@abc.com101052000000
11KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com10114100000
12BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com101285500000
13CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com101394700000
14Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com101474600000
15Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com101584300000

Explanation

Arrays allow efficient bulk processing across multiple variables. Instead of writing repetitive IF statements, programmers can iterate through variables dynamically. Arrays improve maintainability and reduce coding errors in large-scale data cleaning workflows.

Step 5: FIRST. and LAST. Processing

proc sort data=alcohol_array;

by Country;

run;

proc print data=alcohol_array;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_Flag
1BacardiRumLATIN_AMERICACuba info@bacardi.com100312800000
2CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com101394700000
3HennessyCognacEUROPEFrance10FEB2024sales@hennessy.com1002103500000
4Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com101584300000
5UnknownBeerASIA_PACIFICIndia01AUG2024beer@abc.com101052000000
6KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com10114100000
7JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com100993900000
8CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com100794200000
9HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com1006104500000
10SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.0
11Johnnie WalkerWhiskyEUROPEScotland15JAN2024contact@jw.com100192500000
12Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com101474600000
13AbsolutVodkaEUROPESweden12JUN2024NULL100883700000
14Jack DanielsWhiskeyNORTH_AMERICAUSA28FEB2024jack@@mail.com100575000000
15BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com101285500000

data country_summary;

set alcohol_array;

by Country;

retain Brand_Count 0;

if first.Country then Brand_Count=0;

Brand_Count+1;

if last.Country;

run;

proc print data=country_summary;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_FlagBrand_Count
1BacardiRumLATIN_AMERICACuba info@bacardi.com1003128000001
2CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com1013947000001
3Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com1015843000002
4KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com101141000002
5JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com1009939000001
6CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com1007942000001
7HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com10061045000001
8SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.01
9Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com1014746000002
10AbsolutVodkaEUROPESweden12JUN2024NULL1008837000001
11BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com1012855000002

Explanation

FIRST. and LAST. processing enables group-wise calculations without PROC SQL. This technique is widely used in healthcare studies for patient visit tracking, treatment exposure calculations, and longitudinal analyses.

Step 6: PROC FORMAT

proc format;

value ratefmt 1-4='Low'

              5-7='Medium'

             8-10='Premium';

run;

LOG:

NOTE: Format RATEFMT has been output.

Explanation

Formats provide business-friendly labels without modifying underlying data. This separation preserves raw values while improving presentation quality in reports and dashboards.

2.PROC SQL vs DATA Step

Step 1: SQL Join

proc sql;

create table alcohol_final as

select a.*,

       b.Brand_Count

from alcohol_array a

left join country_summary b

on a.Country=b.Country;

quit;

proc print data=alcohol_final;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_FlagBrand_Count
1BacardiRumLATIN_AMERICACuba info@bacardi.com1003128000001
2CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com1013947000001
3HennessyCognacEUROPEFrance10FEB2024sales@hennessy.com10021035000002
4Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com1015843000002
5UnknownBeerASIA_PACIFICIndia01AUG2024beer@abc.com1010520000002
6KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com101141000002
7JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com1009939000001
8CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com1007942000001
9HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com10061045000001
10SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.01
11Johnnie WalkerWhiskyEUROPEScotland15JAN2024contact@jw.com1001925000002
12Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com1014746000002
13AbsolutVodkaEUROPESweden12JUN2024NULL1008837000001
14Jack DanielsWhiskeyNORTH_AMERICAUSA28FEB2024jack@@mail.com1005750000002
15BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com1012855000002

Step 2: DATA Step Merge

proc sort data=alcohol_array;

by Country;

run;

proc print data=alcohol_array;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_Flag
1BacardiRumLATIN_AMERICACuba info@bacardi.com100312800000
2CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com101394700000
3HennessyCognacEUROPEFrance10FEB2024sales@hennessy.com1002103500000
4Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com101584300000
5UnknownBeerASIA_PACIFICIndia01AUG2024beer@abc.com101052000000
6KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com10114100000
7JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com100993900000
8CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com100794200000
9HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com1006104500000
10SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.0
11Johnnie WalkerWhiskyEUROPEScotland15JAN2024contact@jw.com100192500000
12Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com101474600000
13AbsolutVodkaEUROPESweden12JUN2024NULL100883700000
14Jack DanielsWhiskeyNORTH_AMERICAUSA28FEB2024jack@@mail.com100575000000
15BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com101285500000

proc sort data=country_summary;

by Country;

run;

proc print data=country_summary;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_FlagBrand_Count
1BacardiRumLATIN_AMERICACuba info@bacardi.com1003128000001
2CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com1013947000001
3Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com1015843000002
4KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com101141000002
5JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com1009939000001
6CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com1007942000001
7HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com10061045000001
8SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.01
9Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com1014746000002
10AbsolutVodkaEUROPESweden12JUN2024NULL1008837000001
11BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com1012855000002

data alcohol_merge;

merge alcohol_array(in=a)

      country_summary(in=b);

by Country;

if a;

run;

proc print data=alcohol_merge;

run;

OUTPUT:

ObsBrandCategoryRegionCountryLaunch_DateContact_EmailTransaction_IDRatingRevenueEmail_FlagBrand_Count
1BacardiRumLATIN_AMERICACuba info@bacardi.com1003128000001
2CarlsbergBeerEUROPEDenmark11NOV2024sales@carlsberg.com1013947000001
3Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com1015843000002
4Grey GooseVodkaEUROPEFrance13JAN2025goose@mail.com1015843000002
5KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com101141000002
6KingfisherBeerASIA_PACIFICIndia15SEP2024invalidmail.com101141000002
7JamesonWhiskyEUROPEIreland15JUL2024sales@jameson.com1009939000001
8CoronaBeerLATIN_AMERICAMexico11MAY2024contact@corona.com1007942000001
9HeinekenBeerEUROPENetherlands31APR2024sales@heineken.com10061045000001
10SmirnoffVodkaEUROPERussia15MAR2024support@smirnoff.com10048.01
11Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com1014746000002
12Chivas RegalWhiskyEUROPEScotland12DEC2024chivas@brand.com1014746000002
13AbsolutVodkaEUROPESweden12JUN2024NULL1008837000001
14BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com1012855000002
15BudweiserBeerNORTH_AMERICAUSA10OCT2024sales@bud.com1012855000002

Explanation

PROC SQL offers relational database-style syntax and flexibility for complex joins. DATA Step MERGE is often faster for sorted datasets and provides greater row-level control. Enterprise SAS programmers must understand both approaches because performance requirements vary by workload.

3.Additional SAS Procedures

Step 1: PROC FREQ

proc freq data=alcohol_final;

tables Region Category;

run;

OUTPUT:

The FREQ Procedure

RegionFrequencyPercentCumulative
Frequency
Cumulative
Percent
ASIA_PACIFIC213.33213.33
EUROPE960.001173.33
LATIN_AMERICA213.331386.67
NORTH_AMERICA213.3315100.00
CategoryFrequencyPercentCumulative
Frequency
Cumulative
Percent
Beer640.00640.00
Cognac16.67746.67
Rum16.67853.33
Vodka320.001173.33
Whiskey16.671280.00
Whisky320.0015100.00

Step 2: PROC MEANS

proc means data=alcohol_final n mean median min max;

var Revenue Rating;

run;

OUTPUT:

The MEANS Procedure

VariableNMeanMedianMinimumMaximum
Revenue
Rating
14
15
366428.57
7.4666667
405000.00
8.0000000
10000.00
1.0000000
550000.00
10.0000000

Step 3: PROC SUMMARY

proc summary data=alcohol_final nway;

class Region;

var Revenue;

output out=region_rev sum=;

run;

proc print data=region_rev;

run;

OUTPUT:

ObsRegion_TYPE__FREQ_Revenue
1ASIA_PACIFIC12210000
2EUROPE193170000
3LATIN_AMERICA12700000
4NORTH_AMERICA121050000

Step 4: PROC TRANSPOSE

proc transpose data=region_rev out=transpose_rev;

by Region;

var Revenue;

run;

proc print data=transpose_rev;

run;

OUTPUT:

ObsRegion_NAME_COL1
1ASIA_PACIFICRevenue210000
2EUROPERevenue3170000
3LATIN_AMERICARevenue700000
4NORTH_AMERICARevenue1050000

Step 5: PROC REPORT

proc report data=alcohol_final nowd;

column Brand Region Revenue Rating;

run;

OUTPUT:

BrandRegionRevenueRating
BacardiLATIN_AMERICA2800001
CarlsbergEUROPE4700009
HennessyEUROPE35000010
Grey GooseEUROPE4300008
UnknownASIA_PACIFIC2000005
KingfisherASIA_PACIFIC100004
JamesonEUROPE3900009
CoronaLATIN_AMERICA4200009
HeinekenEUROPE45000010
SmirnoffEUROPE.8
Johnnie WalkerEUROPE2500009
Chivas RegalEUROPE4600007
AbsolutEUROPE3700008
Jack DanielsNORTH_AMERICA5000007
BudweiserNORTH_AMERICA5500008

Step 6: SAS Macro for Reusable Validation

%macro check_missing(ds);

data _missing;

set &ds;

if cmiss(of _character_)>0

or nmiss(of _numeric_)>0;

run;

proc print data=transpose_rev;

run;


proc sql;

select count(*) as Missing_Count

from _missing;

quit;

%mend;

%check_missing(alcohol_final);

OUTPUT:

ObsRegion_NAME_COL1
1ASIA_PACIFICRevenue210000
2EUROPERevenue3170000
3LATIN_AMERICARevenue700000
4NORTH_AMERICARevenue1050000
Missing_Count
2

Explanation

Macros standardize validation logic. Instead of repeating code across hundreds of studies or business projects, organizations maintain reusable libraries. This improves consistency, auditability, and productivity.

Why this is better

  • Checks all character variables using CMISS
  • Checks all numeric variables using NMISS
  • Works for any dataset
  • Commonly used in clinical programming QC

4.Raw Dataset in R 

alcohol_raw <- read.delim(

  text="

Transaction_ID|Brand|Country|Category|Rating|Revenue|Launch_Date|Email|Region

1001| johnnie walker |Scotland|Whisky|9|250000|15JAN2024|contact@jw.com|EU

1001|JOHNNIE WALKER|SCOTLAND|whisky|9|-250000|32JAN2024|wrongmail|EUR

1002|Hennessy|France|Cognac|11|350000|10FEB2024|sales@hennessy.com|EU

1003| Bacardi |Cuba|Rum|-2|280000|NA|info@bacardi.com|LATAM

1004|Smirnoff|Russia|vodka|8|NULL|15MAR2024|support@smirnoff.com|EU01

1005|Jack Daniels|USA|WHISKEY|7|500000|28FEB2024|jack@@mail.com|US

1006|Heineken|Netherlands|Beer|10|450000|31APR2024|sales@heineken.com|EU

1007|Corona|Mexico|BEER|9|420000|11MAY2024|contact@corona.com|LA

1008|Absolut|Sweden|Vodka|8|370000|12JUN2024|NULL|EUR

1009|Jameson|Ireland|Whisky|9|390000|15JUL2024|sales@jameson.com|EU

",

  sep="|",

  header=TRUE

)

OUTPUT:

Transaction_ID

Brand

Country

Category

Rating

Revenue

Launch_Date

Email

Region

1001

 johnnie walker

Scotland

Whisky

9

250000

15JAN2024

contact@jw.com

EU

1001

JOHNNIE WALKER

SCOTLAND

whisky

9

-250000

32JAN2024

wrongmail

EUR

1002

Hennessy

France

Cognac

11

350000

10FEB2024

sales@hennessy.com

EU

1003

 Bacardi

Cuba

Rum

-2

280000

info@bacardi.com

LATAM

1004

Smirnoff

Russia

vodka

8

NULL

15MAR2024

support@smirnoff.com

EU01

1005

Jack Daniels

USA

WHISKEY

7

500000

28FEB2024

jack@@mail.com

US

1006

Heineken

Netherlands

Beer

10

450000

31APR2024

sales@heineken.com

EU

1007

Corona

Mexico

BEER

9

420000

11MAY2024

contact@corona.com

LA

1008

Absolut

Sweden

Vodka

8

370000

12JUN2024

NULL

EUR

1009

Jameson

Ireland

Whisky

9

390000

15JUL2024

sales@jameson.com

EU


5.R Cleaning Workflow

library(tidyverse)

library(janitor)

library(lubridate)

alcohol_clean <- alcohol_raw %>%

  clean_names() %>%

  mutate(brand=str_to_title(str_trim(brand)),

      category=str_to_title(category),

       revenue=as.numeric(replace(revenue,revenue=="NULL",NA)),

       revenue=abs(revenue),

        rating=case_when(rating>10 ~ 10,

        rating<1 ~ 1,TRUE ~ rating),

        region=case_when(

        region %in% c("EU","EUR","EU01") ~ "EUROPE",

        region %in% c("US","USA") ~ "NORTH_AMERICA",

        TRUE ~ region),

         email=coalesce(email,"unknown@email.com")

      ) %>% 

         distinct(transaction_id,.keep_all=TRUE)

OUTPUT:

transaction_id

brand

country

category

rating

revenue

launch_date

email

region

1001

Johnnie Walker

Scotland

Whisky

9

250000

15JAN2024

contact@jw.com

EUROPE

1002

Hennessy

France

Cognac

10

350000

10FEB2024

sales@hennessy.com

EUROPE

1003

Bacardi

Cuba

Rum

1

280000

info@bacardi.com

LATAM

1004

Smirnoff

Russia

Vodka

8

15MAR2024

support@smirnoff.com

EUROPE

1005

Jack Daniels

USA

Whiskey

7

500000

28FEB2024

jack@@mail.com

NORTH_AMERICA

1006

Heineken

Netherlands

Beer

10

450000

31APR2024

sales@heineken.com

EUROPE

1007

Corona

Mexico

Beer

9

420000

11MAY2024

contact@corona.com

LA

1008

Absolut

Sweden

Vodka

8

370000

12JUN2024

NULL

EUROPE

1009

Jameson

Ireland

Whisky

9

390000

15JUL2024

sales@jameson.com

EUROPE

Explanation

The tidyverse workflow mirrors SAS cleaning logic. mutate() resembles DATA step assignments. case_when() behaves like SELECT-WHEN. distinct() performs deduplication similar to PROC SORT NODUPKEY. coalesce() resembles SAS COALESCEC. These functions create concise, readable, and reproducible cleaning pipelines.

Validation & Compliance

In regulated industries such as clinical research:

  • SDTM ensures submission-standardized datasets.
  • ADaM supports statistical analyses.
  • Audit trails document every transformation.
  • QC programming must remain independent.
  • Traceability must exist from raw source to final output.
  • Metadata must be version controlled.

A critical SAS risk:

if Revenue_Num < 1000 then Flag='Y';

Missing numeric values are treated lower than any valid number.

Therefore:

<  1000

returns TRUE.

This can accidentally classify missing values as low revenue or high risk, causing serious reporting errors.

20 Data Cleaning Best Practices

  1. Validate metadata before coding.
  2. Standardize naming conventions.
  3. Remove duplicates early.
  4. Preserve raw datasets.
  5. Create audit trails.
  6. Document assumptions.
  7. Validate date ranges.
  8. Check categorical domains.
  9. Standardize missing values.
  10. Use reusable macros.
  11. Perform independent QC.
  12. Apply defensive programming.
  13. Maintain lineage tracking.
  14. Version control code.
  15. Validate joins.
  16. Use controlled terminology.
  17. Review truncation risks.
  18. Validate formats.
  19. Automate reporting checks.
  20. Test production deployments.

Business Logic Behind Cleaning

Business rules convert raw observations into reliable information. Missing values are often imputed because analytical models require complete observations. For example, if a product launch date is unavailable, business stakeholders may use the first known transaction date as a proxy. Unrealistic values require correction because they distort metrics. A rating of 11 on a 10-point scale is impossible and must be capped. Negative revenue may indicate accounting reversals or data-entry issues; validation determines whether correction or exclusion is appropriate.

Text normalization is equally important. “johnnie walker,” “JOHNNIE WALKER,” and “ Johnnie Walker ” should represent the same brand. Without standardization, frequency counts and aggregations become fragmented. Date standardization ensures consistent calculations such as months-on-market and year-over-year growth. In healthcare, patient age values such as -5 or 250 years create impossible demographic distributions. In banking, salary values stored as text prevent credit-risk calculations. Missing visit dates can affect treatment exposure calculations and statistical analyses. Every cleaning decision should be governed by documented business rules, reviewed by stakeholders, and validated independently. The objective is not merely to repair data but to create a trusted analytical asset supporting reliable reporting, predictive modeling, regulatory submissions, and executive decision-making.

20 One-Line Insights

  1. Dirty data creates expensive business mistakes.
  2. Standardized variables improve reproducibility.
  3. Validation logic is stronger than visual inspection.
  4. Metadata drives analytics quality.
  5. Every duplicate changes a metric.
  6. Missing values deserve investigation.
  7. Clean joins require clean keys.
  8. Auditability builds trust.
  9. Traceability supports compliance.
  10. Formats improve usability.
  11. Macros improve consistency.
  12. Arrays reduce repetitive code.
  13. PROC SQL simplifies integration.
  14. DATA Steps provide granular control.
  15. R excels at flexible transformations.
  16. SAS excels at governed workflows.
  17. Automated QC reduces risk.
  18. Controlled terminology improves reporting.
  19. Production code must be defensive.
  20. Analytics is only as good as its source data.

SAS vs R Comparison

Feature

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Excellent

Growing

Scalability

Excellent

Excellent

Flexibility

High

Very High

Metadata Control

Excellent

Moderate

Visualization

Good

Excellent

Reproducibility

Excellent

Excellent

Learning Curve

Moderate

Moderate

Enterprise Deployment

Excellent

High

Open Source Ecosystem

Limited

Massive

Validation Checklist

1.Duplicate IDs removed

2.Invalid dates investigated

3.Negative amounts reviewed

4.Missing values standardized

5.Email validation completed

6.Region codes harmonized

7.Metadata verified

8.QC review completed

9.Traceability documented

10.Reporting outputs validated

Summary: SAS and R for Enterprise Data Engineering

SAS and R complement each other exceptionally well in modern analytics ecosystems. SAS provides structured governance, strong metadata management, repeatable validation frameworks, and audit-ready programming capabilities required in highly regulated industries such as pharmaceuticals, banking, and insurance. Features such as PROC SQL, DATA Step processing, formats, macros, and validation procedures create highly reliable production workflows.

R brings flexibility, rapid development, advanced data wrangling, and a rich ecosystem of packages. The tidyverse framework simplifies transformations while enabling powerful exploratory analysis and visualization. Functions such as mutate(), across(), case_when(), coalesce(), replace_na(), and parse_date_time() support efficient data engineering pipelines.

Together, SAS and R create a balanced architecture. SAS governs, validates, and standardizes enterprise datasets. R accelerates transformation, exploration, and advanced analytics. Organizations increasingly use both technologies because they combine regulatory confidence with analytical agility. When integrated properly, they produce scalable, reproducible, and trustworthy analytical assets capable of supporting executive reporting, machine learning, regulatory submissions, and strategic decision-making.

Conclusion

The world's leading alcohol brands generate enormous volumes of operational, sales, distribution, and marketing data. However, raw data rarely arrives in a form suitable for analysis. Duplicate transaction IDs, malformed emails, inconsistent region codes, invalid dates, negative revenue values, corrupted categories, and missing fields can quietly undermine dashboards, forecasts, compliance reports, and executive decisions. What appears to be a small data-quality issue can become a significant financial, operational, or regulatory problem when propagated through enterprise systems.

A structured data-cleaning framework transforms unreliable records into trusted analytical intelligence. In SAS, this transformation is achieved through disciplined metadata management, DATA Step programming, PROC SQL integration, validation macros, formats, deduplication routines, and comprehensive reporting procedures. Techniques such as FIRST./LAST. processing, ARRAY logic, INPUT/PUT conversions, RETAIN statements, and PROC REPORT enable production-grade workflows that are scalable and auditable.

R complements this foundation by offering modern data engineering capabilities through tidyverse packages. Functions such as mutate(), case_when(), distinct(), coalesce(), replace_na(), separate(), unite(), and parse_date_time() provide expressive, maintainable pipelines for rapid transformation and exploratory analysis. The ability to combine SAS governance with R flexibility gives organizations a powerful hybrid ecosystem.

The ultimate objective of data cleaning is not merely correcting errors. It is establishing trust. Trust in metrics, trust in statistical analyses, trust in predictive models, trust in regulatory submissions, and trust in executive decision-making. Whether the dataset represents global alcohol brands, clinical trial subjects, insurance claims, banking transactions, or retail operations, the same principle applies: reliable analytics begin with reliable data. Organizations that invest in robust cleaning, validation, governance, and traceability frameworks build a sustainable competitive advantage because every insight generated from their systems is grounded in accuracy, consistency, and reproducibility.

Interview Questions & Answers

1. A revenue dashboard suddenly doubles sales figures. How would you investigate?

Answer:
First, check for duplicate transaction IDs using PROC SORT NODUPKEY or PROC SQL COUNT(*) GROUP BY logic. Compare row counts before and after deduplication. Validate joins because one-to-many joins frequently create duplicate records. In R, use distinct() and anti_join() to identify duplicate sources.

2. How would you handle invalid dates such as 32JAN2024?

Answer:
Use INPUT() with date informats in SAS and parse_date_time() in R. Invalid dates should become missing and be reviewed through validation reports rather than forcefully corrected without business approval.

3. Explain a real-world risk of missing numeric values in SAS?

Answer:
Missing numeric values are considered lower than valid numbers. A condition such as if amount < 1000 will also capture missing values unless explicitly checked with missing(amount)=0.

4. When would you choose PROC SQL over a DATA Step MERGE?

Answer:
Use PROC SQL for complex joins, aggregations, and relational operations. Use DATA Step MERGE when datasets are sorted and row-level processing efficiency is required.

5. How do you validate that a cleaned dataset is production-ready?

Answer:
Perform metadata checks, row-count reconciliation, duplicate validation, missing-value assessments, date-range validation, frequency reviews, PROC COMPARE verification, independent QC review, and final reporting sign-off. In R, supplement this with summary statistics, distinct counts, and validation scripts to ensure reproducibility.

6. Why does CMISS(OF _ALL_) fail inside PROC SQL?

Answer:
OF _ALL_, OF _CHARACTER_, and OF _NUMERIC_ are DATA Step variable lists. PROC SQL does not support DATA Step variable list syntax. Therefore CMISS(OF _ALL_) generates ERROR 22-322. The solution is to use a DATA Step before PROC SQL or explicitly list variables in the SQL WHERE clause.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ALCOHOL DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS