From Premium Spirits to Trusted Analytics: Transforming the Best Alcohol Brands in the World Dataset into Analysis-Ready Intelligence Using SAS and R

Introduction:Why Global Alcohol Brand Data Can Become a Business Disaster

Imagine a multinational beverage company preparing its annual executive performance dashboard covering premium alcohol brands across multiple countries. The dataset contains sales transactions from brands such as:

Johnnie Walker
Jack Daniel's
Hennessy
Bacardi
Smirnoff
Corona
Heineken
Absolut

Everything appears normal until analysts discover:

Duplicate transaction IDs
Negative sales revenue
Missing launch dates
Invalid email contacts
Mixed region coding standards
Corrupted product categories
Impossible ratings
Extra spaces in brand names
Character values stored as numbers
Numeric values stored as character strings

The result?

Executive dashboards become inaccurate. AI demand forecasting models produce misleading predictions. Regulatory reporting becomes unreliable. Marketing teams allocate budgets incorrectly. Financial audits fail validation checks.

This is why data cleaning is not a cosmetic activity—it is a business-critical engineering discipline.

1.Raw SAS Dataset with Intentional Errors

SAS Raw Data

data alcohol_raw;

length Brand $30 Country $20 Category $20 Launch_Date $12

Region $10 Contact_Email $50 Revenue $15;

infile datalines dlm='|' dsd truncover;

input Transaction_ID Brand $ Country $ Category $ Rating

Revenue $ Launch_Date $ Contact_Email $ Region $;

datalines;

1001| johnnie walker |Scotland|Whisky|9|250000|15JAN2024|contact@jw.com|EU

1001|JOHNNIE WALKER|SCOTLAND|whisky|9|-250000|32JAN2024|wrongmail|EUR

1002|Hennessy|France|Cognac|11|350000|10FEB2024|sales@hennessy.com|EU

1003| Bacardi |Cuba|Rum|-2|280000|.|info@bacardi.com|LATAM

1004|Smirnoff|Russia|vodka|8|NULL|15MAR2024|support@smirnoff.com|EU01

1005|Jack Daniels|USA|WHISKEY|7|500000|28FEB2024|jack@@mail.com|US

1006|Heineken|Netherlands|Beer|10|450000|31APR2024|sales@heineken.com|EU

1007|Corona|Mexico|BEER|9|420000|11MAY2024|contact@corona.com|LA

1008|Absolut|Sweden|Vodka|8|370000|12JUN2024|NULL|EUR

1009|Jameson|Ireland|Whisky|9|390000|15JUL2024|sales@jameson.com|EU

1010|NULL|India|Beer|5|200000|01AUG2024|beer@abc.com|APAC

1011|Kingfisher|India|Beer|4|-10000|15SEP2024|invalidmail.com|AP

1012|Budweiser|USA|Beer|8|550000|10OCT2024|sales@bud.com|USA

1013|Carlsberg|Denmark|Beer|9|470000|11NOV2024|sales@carlsberg.com|EU

1014|Chivas Regal|Scotland|Whisky|7|460000|12DEC2024|chivas@brand.com|EU

1015|Grey Goose|France|Vodka|8|430000|13JAN2025|goose@mail.com|EU

;

run;

proc print data=alcohol_raw;

run;

OUTPUT:


Obs	Brand	Country	Category	Launch_Date	Region	Contact_Email	Revenue	Transaction_ID	Rating
1	johnnie walker	Scotland	Whisky	15JAN2024	EU	contact@jw.com	250000	1001	9
2	JOHNNIE WALKER	SCOTLAND	whisky	32JAN2024	EUR	wrongmail	-250000	1001	9
3	Hennessy	France	Cognac	10FEB2024	EU	sales@hennessy.com	350000	1002	11
4	Bacardi	Cuba	Rum		LATAM	info@bacardi.com	280000	1003	-2
5	Smirnoff	Russia	vodka	15MAR2024	EU01	support@smirnoff.com	NULL	1004	8
6	Jack Daniels	USA	WHISKEY	28FEB2024	US	jack@@mail.com	500000	1005	7
7	Heineken	Netherlands	Beer	31APR2024	EU	sales@heineken.com	450000	1006	10
8	Corona	Mexico	BEER	11MAY2024	LA	contact@corona.com	420000	1007	9
9	Absolut	Sweden	Vodka	12JUN2024	EUR	NULL	370000	1008	8
10	Jameson	Ireland	Whisky	15JUL2024	EU	sales@jameson.com	390000	1009	9
11	NULL	India	Beer	01AUG2024	APAC	beer@abc.com	200000	1010	5
12	Kingfisher	India	Beer	15SEP2024	AP	invalidmail.com	-10000	1011	4
13	Budweiser	USA	Beer	10OCT2024	USA	sales@bud.com	550000	1012	8
14	Carlsberg	Denmark	Beer	11NOV2024	EU	sales@carlsberg.com	470000	1013	9
15	Chivas Regal	Scotland	Whisky	12DEC2024	EU	chivas@brand.com	460000	1014	7
16	Grey Goose	France	Vodka	13JAN2025	EU	goose@mail.com	430000	1015	8

Why LENGTH Must Come First

One of the most overlooked SAS production risks is character truncation.

Incorrect

data test;

name='Johnnie Walker Premium Edition';

length name $10;

run;

proc print data=test;

run;

LOG:

WARNING: Length of character variable name has already been set. 

Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

OUTPUT:


Obs	name
1	Johnnie Wa

Correct

data test;

length name $40;

name='Johnnie Walker Premium Edition';

run;

proc print data=test;

run;

OUTPUT:


Obs	name
1	Johnnie Walker Premium Edition

Explanation

SAS determines character length during compilation. If LENGTH appears after assignment, the variable may already be defined with insufficient storage. Large pharmaceutical and banking systems frequently suffer silent truncation issues, causing failed joins and reconciliation problems.

In R, character vectors dynamically resize and do not suffer this exact problem, making string handling more flexible.

Step 1: Data Profiling

PROC CONTENTS

proc contents data=alcohol_raw;

run;

OUTPUT:

The CONTENTS Procedure


Data Set Name	WORK.ALCOHOL_RAW	Observations	16
Member Type	DATA	Variables	9
Engine	V9	Indexes	0
Created	06/24/2026 11:50:05	Observation Length	176
Last Modified	06/24/2026 11:50:05	Deleted Observations	0
Protection		Compressed	NO
Data Set Type		Sorted	NO
Label
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding	utf-8 Unicode (UTF-8)


Engine/Host Dependent Information
Data Set Page Size	131072
Number of Data Set Pages	1
First Data Page	1
Max Obs per Page	743
Obs in First Data Page	16
Number of Data Set Repairs	0
Filename	/saswork/SAS_work6F2C00008D88_odaws02-apse1-2.oda.sas.com/SAS_workBEA500008D88_odaws02-apse1-2.oda.sas.com/alcohol_raw.sas7bdat
Release Created	9.0401M8
Host Created	Linux
Inode Number	201333005
Access Permission	rw-r--r--
Owner Name	u63247146
File Size	256KB
File Size (bytes)	262144


Alphabetic List of Variables and Attributes
#	Variable	Type	Len
1	Brand	Char	30
3	Category	Char	20
6	Contact_Email	Char	50
2	Country	Char	20
4	Launch_Date	Char	12
9	Rating	Num	8
5	Region	Char	10
7	Revenue	Char	15
8	Transaction_ID	Num	8

Explanation

PROC CONTENTS acts as the first validation checkpoint. It identifies variable attributes, storage lengths, formats, informats, and types. Before cleaning any enterprise dataset, programmers should inspect metadata because many downstream failures originate from incorrect variable definitions. In regulated clinical trials, metadata verification is part of audit readiness and traceability requirements.

Step 2: Enterprise Data Cleaning

DATA Step Cleaning Workflow

data alcohol_clean;

length Brand_Clean $30 Category_Clean $20

Region_Clean $15;

set alcohol_raw;

Brand_Clean=propcase(strip(Brand));

if upcase(Brand_Clean)="NULL" then Brand_Clean="Unknown";

Category_Clean=propcase(strip(Category));

Revenue_Num=input(compress(Revenue),best12.);

Revenue_Num=abs(Revenue_Num);

Rating=max(min(Rating,10),1);

select(upcase(region));

when('EU','EUR','EU01') Region_Clean='EUROPE';

when('US','USA') Region_Clean='NORTH_AMERICA';

when('AP','APAC') Region_Clean='ASIA_PACIFIC';

when('LA','LATAM') Region_Clean='LATIN_AMERICA';

otherwise Region_Clean='UNKNOWN';

end;

Email_Flag=

(prxmatch('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$/i',

Contact_Email)>0);

drop Revenue Region Brand Category;

rename Revenue_Num=Revenue Region_Clean=Region

Brand_Clean=Brand Category_Clean=Category;

run;

proc print data=alcohol_clean;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue
1	Johnnie Walker	Whisky	EUROPE	Scotland	15JAN2024	contact@jw.com	1001	9	250000
2	Johnnie Walker	Whisky	EUROPE	SCOTLAND	32JAN2024	wrongmail	1001	9	250000
3	Hennessy	Cognac	EUROPE	France	10FEB2024	sales@hennessy.com	1002	10	350000
4	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000
5	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.
6	Jack Daniels	Whiskey	NORTH_AMERICA	USA	28FEB2024	jack@@mail.com	1005	7	500000
7	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000
8	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000
9	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000
10	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000
11	Unknown	Beer	ASIA_PACIFIC	India	01AUG2024	beer@abc.com	1010	5	200000
12	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000
13	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000
14	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000
15	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000
16	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000

Explanation

This DATA step demonstrates production-quality standardization. We normalize text using STRIP and PROPCASE, convert revenue from character to numeric using INPUT, eliminate negative values using ABS, and harmonize region codes through SELECT-WHEN logic. Email validation uses PRXMATCH regular expressions. Such transformations ensure consistent reporting, improve join accuracy, and prevent analytical bias caused by inconsistent categories.

Step 3: Deduplication

proc sort data=alcohol_clean

out=alcohol_nodup nodupkey;

by Transaction_ID;

run;

proc print data=alcohol_nodup;

run;

LOG:

 NOTE: There were 16 observations read from the data set WORK.ALCOHOL_CLEAN.

 NOTE: 1 observations with duplicate key values were deleted.

 NOTE: The data set WORK.ALCOHOL_NODUP has 15 observations and 10 variables.

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue
1	Johnnie Walker	Whisky	EUROPE	Scotland	15JAN2024	contact@jw.com	1001	9	250000
2	Hennessy	Cognac	EUROPE	France	10FEB2024	sales@hennessy.com	1002	10	350000
3	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000
4	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.
5	Jack Daniels	Whiskey	NORTH_AMERICA	USA	28FEB2024	jack@@mail.com	1005	7	500000
6	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000
7	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000
8	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000
9	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000
10	Unknown	Beer	ASIA_PACIFIC	India	01AUG2024	beer@abc.com	1010	5	200000
11	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000
12	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000
13	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000
14	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000
15	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000

Explanation

Duplicate records often distort revenue calculations and KPI reporting. PROC SORT NODUPKEY retains the first occurrence of a BY-group while removing duplicates. In clinical trials, duplicate subject IDs can invalidate statistical outputs and compromise regulatory submissions.

Step 4: ARRAY Processing

data alcohol_array;

set alcohol_nodup;

array nums {*} Rating Revenue;

do i=1 to dim(nums);

if nums{i}<0 then nums{i}=abs(nums{i});

end;

drop i;

run;

proc print data=alcohol_array;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue
1	Johnnie Walker	Whisky	EUROPE	Scotland	15JAN2024	contact@jw.com	1001	9	250000
2	Hennessy	Cognac	EUROPE	France	10FEB2024	sales@hennessy.com	1002	10	350000
3	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000
4	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.
5	Jack Daniels	Whiskey	NORTH_AMERICA	USA	28FEB2024	jack@@mail.com	1005	7	500000
6	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000
7	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000
8	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000
9	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000
10	Unknown	Beer	ASIA_PACIFIC	India	01AUG2024	beer@abc.com	1010	5	200000
11	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000
12	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000
13	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000
14	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000
15	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000

Explanation

Arrays allow efficient bulk processing across multiple variables. Instead of writing repetitive IF statements, programmers can iterate through variables dynamically. Arrays improve maintainability and reduce coding errors in large-scale data cleaning workflows.

Step 5: FIRST. and LAST. Processing

proc sort data=alcohol_array;

by Country;

run;

proc print data=alcohol_array;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue
1	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000
2	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000
3	Hennessy	Cognac	EUROPE	France	10FEB2024	sales@hennessy.com	1002	10	350000
4	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000
5	Unknown	Beer	ASIA_PACIFIC	India	01AUG2024	beer@abc.com	1010	5	200000
6	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000
7	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000
8	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000
9	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000
10	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.
11	Johnnie Walker	Whisky	EUROPE	Scotland	15JAN2024	contact@jw.com	1001	9	250000
12	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000
13	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000
14	Jack Daniels	Whiskey	NORTH_AMERICA	USA	28FEB2024	jack@@mail.com	1005	7	500000
15	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000

data country_summary;

set alcohol_array;

by Country;

retain Brand_Count 0;

if first.Country then Brand_Count=0;

Brand_Count+1;

if last.Country;

run;

proc print data=country_summary;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue	Brand_Count
1	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000	1
2	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000	1
3	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000	2
4	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000	2
5	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000	1
6	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000	1
7	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000	1
8	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.	1
9	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000	2
10	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000	1
11	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000	2

Explanation

FIRST. and LAST. processing enables group-wise calculations without PROC SQL. This technique is widely used in healthcare studies for patient visit tracking, treatment exposure calculations, and longitudinal analyses.

Step 6: PROC FORMAT

proc format;

value ratefmt 1-4='Low'

5-7='Medium'

8-10='Premium';

run;

LOG:

 NOTE: Format RATEFMT has been output.

Explanation

Formats provide business-friendly labels without modifying underlying data. This separation preserves raw values while improving presentation quality in reports and dashboards.

2.PROC SQL vs DATA Step

Step 1: SQL Join

proc sql;

create table alcohol_final as

select a.*,

b.Brand_Count

from alcohol_array a

left join country_summary b

on a.Country=b.Country;

quit;

proc print data=alcohol_final;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue	Brand_Count
1	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000	1
2	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000	1
3	Hennessy	Cognac	EUROPE	France	10FEB2024	sales@hennessy.com	1002	10	350000	2
4	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000	2
5	Unknown	Beer	ASIA_PACIFIC	India	01AUG2024	beer@abc.com	1010	5	200000	2
6	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000	2
7	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000	1
8	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000	1
9	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000	1
10	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.	1
11	Johnnie Walker	Whisky	EUROPE	Scotland	15JAN2024	contact@jw.com	1001	9	250000	2
12	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000	2
13	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000	1
14	Jack Daniels	Whiskey	NORTH_AMERICA	USA	28FEB2024	jack@@mail.com	1005	7	500000	2
15	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000	2

Step 2: DATA Step Merge

proc sort data=alcohol_array;

by Country;

run;

proc print data=alcohol_array;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue
1	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000
2	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000
3	Hennessy	Cognac	EUROPE	France	10FEB2024	sales@hennessy.com	1002	10	350000
4	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000
5	Unknown	Beer	ASIA_PACIFIC	India	01AUG2024	beer@abc.com	1010	5	200000
6	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000
7	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000
8	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000
9	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000
10	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.
11	Johnnie Walker	Whisky	EUROPE	Scotland	15JAN2024	contact@jw.com	1001	9	250000
12	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000
13	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000
14	Jack Daniels	Whiskey	NORTH_AMERICA	USA	28FEB2024	jack@@mail.com	1005	7	500000
15	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000

proc sort data=country_summary;

by Country;

run;

proc print data=country_summary;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue	Brand_Count
1	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000	1
2	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000	1
3	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000	2
4	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000	2
5	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000	1
6	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000	1
7	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000	1
8	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.	1
9	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000	2
10	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000	1
11	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000	2

data alcohol_merge;

merge alcohol_array(in=a)

country_summary(in=b);

by Country;

if a;

run;

proc print data=alcohol_merge;

run;

OUTPUT:


Obs	Brand	Category	Region	Country	Launch_Date	Contact_Email	Transaction_ID	Rating	Revenue	Brand_Count
1	Bacardi	Rum	LATIN_AMERICA	Cuba		info@bacardi.com	1003	1	280000	1
2	Carlsberg	Beer	EUROPE	Denmark	11NOV2024	sales@carlsberg.com	1013	9	470000	1
3	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000	2
4	Grey Goose	Vodka	EUROPE	France	13JAN2025	goose@mail.com	1015	8	430000	2
5	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000	2
6	Kingfisher	Beer	ASIA_PACIFIC	India	15SEP2024	invalidmail.com	1011	4	10000	2
7	Jameson	Whisky	EUROPE	Ireland	15JUL2024	sales@jameson.com	1009	9	390000	1
8	Corona	Beer	LATIN_AMERICA	Mexico	11MAY2024	contact@corona.com	1007	9	420000	1
9	Heineken	Beer	EUROPE	Netherlands	31APR2024	sales@heineken.com	1006	10	450000	1
10	Smirnoff	Vodka	EUROPE	Russia	15MAR2024	support@smirnoff.com	1004	8	.	1
11	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000	2
12	Chivas Regal	Whisky	EUROPE	Scotland	12DEC2024	chivas@brand.com	1014	7	460000	2
13	Absolut	Vodka	EUROPE	Sweden	12JUN2024	NULL	1008	8	370000	1
14	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000	2
15	Budweiser	Beer	NORTH_AMERICA	USA	10OCT2024	sales@bud.com	1012	8	550000	2

Explanation

PROC SQL offers relational database-style syntax and flexibility for complex joins. DATA Step MERGE is often faster for sorted datasets and provides greater row-level control. Enterprise SAS programmers must understand both approaches because performance requirements vary by workload.

3.Additional SAS Procedures

Step 1: PROC FREQ

proc freq data=alcohol_final;

tables Region Category;

run;

OUTPUT:

The FREQ Procedure


Region	Frequency	Percent	Cumulative Frequency	Cumulative Percent
ASIA_PACIFIC	2	13.33	2	13.33
EUROPE	9	60.00	11	73.33
LATIN_AMERICA	2	13.33	13	86.67
NORTH_AMERICA	2	13.33	15	100.00


Category	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Beer	6	40.00	6	40.00
Cognac	1	6.67	7	46.67
Rum	1	6.67	8	53.33
Vodka	3	20.00	11	73.33
Whiskey	1	6.67	12	80.00
Whisky	3	20.00	15	100.00

Step 2: PROC MEANS

proc means data=alcohol_final n mean median min max;

var Revenue Rating;

run;

OUTPUT:

The MEANS Procedure


Variable	N	Mean	Median	Minimum	Maximum
Revenue Rating	14 15	366428.57 7.4666667	405000.00 8.0000000	10000.00 1.0000000	550000.00 10.0000000

Step 3: PROC SUMMARY

proc summary data=alcohol_final nway;

class Region;

var Revenue;

output out=region_rev sum=;

run;

proc print data=region_rev;

run;

OUTPUT:


Obs	Region	_TYPE_	_FREQ_	Revenue
1	ASIA_PACIFIC	1	2	210000
2	EUROPE	1	9	3170000
3	LATIN_AMERICA	1	2	700000
4	NORTH_AMERICA	1	2	1050000

Step 4: PROC TRANSPOSE

proc transpose data=region_rev out=transpose_rev;

by Region;

var Revenue;

run;

proc print data=transpose_rev;

run;

OUTPUT:


Obs	Region	_NAME_	COL1
1	ASIA_PACIFIC	Revenue	210000
2	EUROPE	Revenue	3170000
3	LATIN_AMERICA	Revenue	700000
4	NORTH_AMERICA	Revenue	1050000

Step 5: PROC REPORT

proc report data=alcohol_final nowd;

column Brand Region Revenue Rating;

run;

OUTPUT:


Brand	Region	Revenue	Rating
Bacardi	LATIN_AMERICA	280000	1
Carlsberg	EUROPE	470000	9
Hennessy	EUROPE	350000	10
Grey Goose	EUROPE	430000	8
Unknown	ASIA_PACIFIC	200000	5
Kingfisher	ASIA_PACIFIC	10000	4
Jameson	EUROPE	390000	9
Corona	LATIN_AMERICA	420000	9
Heineken	EUROPE	450000	10
Smirnoff	EUROPE	.	8
Johnnie Walker	EUROPE	250000	9
Chivas Regal	EUROPE	460000	7
Absolut	EUROPE	370000	8
Jack Daniels	NORTH_AMERICA	500000	7
Budweiser	NORTH_AMERICA	550000	8

Step 6: SAS Macro for Reusable Validation

%macro check_missing(ds);

data _missing;

set &ds;

if cmiss(of _character_)>0

or nmiss(of _numeric_)>0;

run;

proc print data=transpose_rev;

run;

proc sql;

select count(*) as Missing_Count

from _missing;

quit;

%mend;

%check_missing(alcohol_final);

OUTPUT:


Obs	Region	_NAME_	COL1
1	ASIA_PACIFIC	Revenue	210000
2	EUROPE	Revenue	3170000
3	LATIN_AMERICA	Revenue	700000
4	NORTH_AMERICA	Revenue	1050000


Missing_Count
2

Explanation

Macros standardize validation logic. Instead of repeating code across hundreds of studies or business projects, organizations maintain reusable libraries. This improves consistency, auditability, and productivity.

Why this is better

Checks all character variables using CMISS
Checks all numeric variables using NMISS
Works for any dataset
Commonly used in clinical programming QC

4.Raw Dataset in R

alcohol_raw <- read.delim(

text="

1001| johnnie walker |Scotland|Whisky|9|250000|15JAN2024|contact@jw.com|EU

1001|JOHNNIE WALKER|SCOTLAND|whisky|9|-250000|32JAN2024|wrongmail|EUR

1002|Hennessy|France|Cognac|11|350000|10FEB2024|sales@hennessy.com|EU

1003| Bacardi |Cuba|Rum|-2|280000|NA|info@bacardi.com|LATAM

1004|Smirnoff|Russia|vodka|8|NULL|15MAR2024|support@smirnoff.com|EU01

1005|Jack Daniels|USA|WHISKEY|7|500000|28FEB2024|jack@@mail.com|US

1006|Heineken|Netherlands|Beer|10|450000|31APR2024|sales@heineken.com|EU

1007|Corona|Mexico|BEER|9|420000|11MAY2024|contact@corona.com|LA

1008|Absolut|Sweden|Vodka|8|370000|12JUN2024|NULL|EUR

1009|Jameson|Ireland|Whisky|9|390000|15JUL2024|sales@jameson.com|EU

sep="|",

header=TRUE

)

OUTPUT:

Transaction_ID	Brand	Country	Category	Rating	Revenue	Launch_Date	Email	Region
1001	johnnie walker	Scotland	Whisky	9	250000	15JAN2024	contact@jw.com	EU
1001	JOHNNIE WALKER	SCOTLAND	whisky	9	-250000	32JAN2024	wrongmail	EUR
1002	Hennessy	France	Cognac	11	350000	10FEB2024	sales@hennessy.com	EU
1003	Bacardi	Cuba	Rum	-2	280000		info@bacardi.com	LATAM
1004	Smirnoff	Russia	vodka	8	NULL	15MAR2024	support@smirnoff.com	EU01
1005	Jack Daniels	USA	WHISKEY	7	500000	28FEB2024	jack@@mail.com	US
1006	Heineken	Netherlands	Beer	10	450000	31APR2024	sales@heineken.com	EU
1007	Corona	Mexico	BEER	9	420000	11MAY2024	contact@corona.com	LA
1008	Absolut	Sweden	Vodka	8	370000	12JUN2024	NULL	EUR
1009	Jameson	Ireland	Whisky	9	390000	15JUL2024	sales@jameson.com	EU

5.R Cleaning Workflow

library(tidyverse)

library(janitor)

library(lubridate)

alcohol_clean <- alcohol_raw %>%

clean_names() %>%

mutate(brand=str_to_title(str_trim(brand)),

category=str_to_title(category),

revenue=as.numeric(replace(revenue,revenue=="NULL",NA)),

revenue=abs(revenue),

rating=case_when(rating>10 ~ 10,

rating<1 ~ 1,TRUE ~ rating),

region=case_when(

region %in% c("EU","EUR","EU01") ~ "EUROPE",

region %in% c("US","USA") ~ "NORTH_AMERICA",

TRUE ~ region),

email=coalesce(email,"unknown@email.com")

) %>%

distinct(transaction_id,.keep_all=TRUE)

OUTPUT:

transaction_id	brand	country	category	rating	revenue	launch_date	email	region
1001	Johnnie Walker	Scotland	Whisky	9	250000	15JAN2024	contact@jw.com	EUROPE
1002	Hennessy	France	Cognac	10	350000	10FEB2024	sales@hennessy.com	EUROPE
1003	Bacardi	Cuba	Rum	1	280000		info@bacardi.com	LATAM
1004	Smirnoff	Russia	Vodka	8		15MAR2024	support@smirnoff.com	EUROPE
1005	Jack Daniels	USA	Whiskey	7	500000	28FEB2024	jack@@mail.com	NORTH_AMERICA
1006	Heineken	Netherlands	Beer	10	450000	31APR2024	sales@heineken.com	EUROPE
1007	Corona	Mexico	Beer	9	420000	11MAY2024	contact@corona.com	LA
1008	Absolut	Sweden	Vodka	8	370000	12JUN2024	NULL	EUROPE
1009	Jameson	Ireland	Whisky	9	390000	15JUL2024	sales@jameson.com	EUROPE

Explanation

The tidyverse workflow mirrors SAS cleaning logic. mutate() resembles DATA step assignments. case_when() behaves like SELECT-WHEN. distinct() performs deduplication similar to PROC SORT NODUPKEY. coalesce() resembles SAS COALESCEC. These functions create concise, readable, and reproducible cleaning pipelines.

Validation & Compliance

In regulated industries such as clinical research:

SDTM ensures submission-standardized datasets.
ADaM supports statistical analyses.
Audit trails document every transformation.
QC programming must remain independent.
Traceability must exist from raw source to final output.
Metadata must be version controlled.

A critical SAS risk:

if Revenue_Num < 1000 then Flag='Y';

Missing numeric values are treated lower than any valid number.

Therefore:

. < 1000

returns TRUE.

This can accidentally classify missing values as low revenue or high risk, causing serious reporting errors.

20 Data Cleaning Best Practices

Validate metadata before coding.
Standardize naming conventions.
Remove duplicates early.
Preserve raw datasets.
Create audit trails.
Document assumptions.
Validate date ranges.
Check categorical domains.
Standardize missing values.
Use reusable macros.
Perform independent QC.
Apply defensive programming.
Maintain lineage tracking.
Version control code.
Validate joins.
Use controlled terminology.
Review truncation risks.
Validate formats.
Automate reporting checks.
Test production deployments.

Business Logic Behind Cleaning

Business rules convert raw observations into reliable information. Missing values are often imputed because analytical models require complete observations. For example, if a product launch date is unavailable, business stakeholders may use the first known transaction date as a proxy. Unrealistic values require correction because they distort metrics. A rating of 11 on a 10-point scale is impossible and must be capped. Negative revenue may indicate accounting reversals or data-entry issues; validation determines whether correction or exclusion is appropriate.

Text normalization is equally important. “johnnie walker,” “JOHNNIE WALKER,” and “ Johnnie Walker ” should represent the same brand. Without standardization, frequency counts and aggregations become fragmented. Date standardization ensures consistent calculations such as months-on-market and year-over-year growth. In healthcare, patient age values such as -5 or 250 years create impossible demographic distributions. In banking, salary values stored as text prevent credit-risk calculations. Missing visit dates can affect treatment exposure calculations and statistical analyses. Every cleaning decision should be governed by documented business rules, reviewed by stakeholders, and validated independently. The objective is not merely to repair data but to create a trusted analytical asset supporting reliable reporting, predictive modeling, regulatory submissions, and executive decision-making.

20 One-Line Insights

Dirty data creates expensive business mistakes.
Standardized variables improve reproducibility.
Validation logic is stronger than visual inspection.
Metadata drives analytics quality.
Every duplicate changes a metric.
Missing values deserve investigation.
Clean joins require clean keys.
Auditability builds trust.
Traceability supports compliance.
Formats improve usability.
Macros improve consistency.
Arrays reduce repetitive code.
PROC SQL simplifies integration.
DATA Steps provide granular control.
R excels at flexible transformations.
SAS excels at governed workflows.
Automated QC reduces risk.
Controlled terminology improves reporting.
Production code must be defensive.
Analytics is only as good as its source data.

SAS vs R Comparison

Feature	SAS	R
Auditability	Excellent	Moderate
Regulatory Acceptance	Excellent	Growing
Scalability	Excellent	Excellent
Flexibility	High	Very High
Metadata Control	Excellent	Moderate
Visualization	Good	Excellent
Reproducibility	Excellent	Excellent
Learning Curve	Moderate	Moderate
Enterprise Deployment	Excellent	High
Open Source Ecosystem	Limited	Massive

Validation Checklist

1.Duplicate IDs removed

2.Invalid dates investigated

3.Negative amounts reviewed

4.Missing values standardized

5.Email validation completed

6.Region codes harmonized

7.Metadata verified

8.QC review completed

9.Traceability documented

10.Reporting outputs validated

Summary: SAS and R for Enterprise Data Engineering

SAS and R complement each other exceptionally well in modern analytics ecosystems. SAS provides structured governance, strong metadata management, repeatable validation frameworks, and audit-ready programming capabilities required in highly regulated industries such as pharmaceuticals, banking, and insurance. Features such as PROC SQL, DATA Step processing, formats, macros, and validation procedures create highly reliable production workflows.

R brings flexibility, rapid development, advanced data wrangling, and a rich ecosystem of packages. The tidyverse framework simplifies transformations while enabling powerful exploratory analysis and visualization. Functions such as mutate(), across(), case_when(), coalesce(), replace_na(), and parse_date_time() support efficient data engineering pipelines.

Together, SAS and R create a balanced architecture. SAS governs, validates, and standardizes enterprise datasets. R accelerates transformation, exploration, and advanced analytics. Organizations increasingly use both technologies because they combine regulatory confidence with analytical agility. When integrated properly, they produce scalable, reproducible, and trustworthy analytical assets capable of supporting executive reporting, machine learning, regulatory submissions, and strategic decision-making.

Conclusion

The world's leading alcohol brands generate enormous volumes of operational, sales, distribution, and marketing data. However, raw data rarely arrives in a form suitable for analysis. Duplicate transaction IDs, malformed emails, inconsistent region codes, invalid dates, negative revenue values, corrupted categories, and missing fields can quietly undermine dashboards, forecasts, compliance reports, and executive decisions. What appears to be a small data-quality issue can become a significant financial, operational, or regulatory problem when propagated through enterprise systems.

A structured data-cleaning framework transforms unreliable records into trusted analytical intelligence. In SAS, this transformation is achieved through disciplined metadata management, DATA Step programming, PROC SQL integration, validation macros, formats, deduplication routines, and comprehensive reporting procedures. Techniques such as FIRST./LAST. processing, ARRAY logic, INPUT/PUT conversions, RETAIN statements, and PROC REPORT enable production-grade workflows that are scalable and auditable.

R complements this foundation by offering modern data engineering capabilities through tidyverse packages. Functions such as mutate(), case_when(), distinct(), coalesce(), replace_na(), separate(), unite(), and parse_date_time() provide expressive, maintainable pipelines for rapid transformation and exploratory analysis. The ability to combine SAS governance with R flexibility gives organizations a powerful hybrid ecosystem.

The ultimate objective of data cleaning is not merely correcting errors. It is establishing trust. Trust in metrics, trust in statistical analyses, trust in predictive models, trust in regulatory submissions, and trust in executive decision-making. Whether the dataset represents global alcohol brands, clinical trial subjects, insurance claims, banking transactions, or retail operations, the same principle applies: reliable analytics begin with reliable data. Organizations that invest in robust cleaning, validation, governance, and traceability frameworks build a sustainable competitive advantage because every insight generated from their systems is grounded in accuracy, consistency, and reproducibility.

Interview Questions & Answers

1. A revenue dashboard suddenly doubles sales figures. How would you investigate?

Answer:
First, check for duplicate transaction IDs using PROC SORT NODUPKEY or PROC SQL COUNT(*) GROUP BY logic. Compare row counts before and after deduplication. Validate joins because one-to-many joins frequently create duplicate records. In R, use distinct() and anti_join() to identify duplicate sources.

2. How would you handle invalid dates such as 32JAN2024?

Answer:
Use INPUT() with date informats in SAS and parse_date_time() in R. Invalid dates should become missing and be reviewed through validation reports rather than forcefully corrected without business approval.

3. Explain a real-world risk of missing numeric values in SAS?

Answer:
Missing numeric values are considered lower than valid numbers. A condition such as if amount < 1000 will also capture missing values unless explicitly checked with missing(amount)=0.

4. When would you choose PROC SQL over a DATA Step MERGE?

Answer:
Use PROC SQL for complex joins, aggregations, and relational operations. Use DATA Step MERGE when datasets are sorted and row-level processing efficiency is required.

5. How do you validate that a cleaned dataset is production-ready?

Answer:
Perform metadata checks, row-count reconciliation, duplicate validation, missing-value assessments, date-range validation, frequency reviews, PROC COMPARE verification, independent QC review, and final reporting sign-off. In R, supplement this with summary statistics, distinct counts, and validation scripts to ensure reproducibility.

6. Why does CMISS(OF _ALL_) fail inside PROC SQL?

Answer:
OF _ALL_, OF _CHARACTER_, and OF _NUMERIC_ are DATA Step variable lists. PROC SQL does not support DATA Step variable list syntax. Therefore CMISS(OF _ALL_) generates ERROR 22-322. The solution is to use a DATA Step before PROC SQL or explicitly list variables in the SQL WHERE clause.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.

Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ALCOHOL DATA.

Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.

This project is suitable for:
·  Students learning SAS
·  Data analysts building portfolios
·  Professionals preparing for SAS interviews
·  Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On :

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Search This Blog

SAS Learning Hub

Behind the World's Best Alcohol Brands: Building Analysis-Ready Datasets with SAS, R, and Data Validation

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Behind the World's Best Alcohol Brands: Building Analysis-Ready Datasets with SAS, R, and Data Validation

Follow Us On :

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Post a Comment

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

1.How Can SAS Map The Journey And Influence Of India’s Freedom Fighters?

2.Can SAS Identify The Best Project And Tool Combinations For Software Teams?

3.Can SAS Reveal Which Office Environments Are Truly Productive And Which Are Not?