Behind the World's Best Alcohol Brands: Building Analysis-Ready Datasets with SAS, R, and Data Validation
From Premium Spirits to Trusted Analytics: Transforming the Best Alcohol Brands in the World Dataset into Analysis-Ready Intelligence Using SAS and R
Introduction:Why Global Alcohol Brand Data Can
Become a Business Disaster
Imagine a
multinational beverage company preparing its annual executive performance
dashboard covering premium alcohol brands across multiple countries. The
dataset contains sales transactions from brands such as:
- Johnnie Walker
- Jack Daniel's
- Hennessy
- Bacardi
- Smirnoff
- Corona
- Heineken
- Absolut
Everything
appears normal until analysts discover:
- Duplicate transaction IDs
- Negative sales revenue
- Missing launch dates
- Invalid email contacts
- Mixed region coding
standards
- Corrupted product categories
- Impossible ratings
- Extra spaces in brand names
- Character values stored as
numbers
- Numeric values stored as
character strings
The
result?
Executive
dashboards become inaccurate. AI demand forecasting models produce misleading
predictions. Regulatory reporting becomes unreliable. Marketing teams allocate budgets
incorrectly. Financial audits fail validation checks.
This is why data cleaning is not a cosmetic activity—it is a business-critical engineering discipline.
1.Raw SAS Dataset with Intentional Errors
SAS Raw Data
data alcohol_raw;
length Brand $30 Country $20 Category $20 Launch_Date $12
Region $10 Contact_Email $50 Revenue $15;
infile datalines dlm='|' dsd truncover;
input Transaction_ID Brand $ Country $ Category $ Rating
Revenue $ Launch_Date $ Contact_Email $ Region $;
datalines;
1001| johnnie walker |Scotland|Whisky|9|250000|15JAN2024|contact@jw.com|EU
1001|JOHNNIE WALKER|SCOTLAND|whisky|9|-250000|32JAN2024|wrongmail|EUR
1002|Hennessy|France|Cognac|11|350000|10FEB2024|sales@hennessy.com|EU
1003| Bacardi |Cuba|Rum|-2|280000|.|info@bacardi.com|LATAM
1004|Smirnoff|Russia|vodka|8|NULL|15MAR2024|support@smirnoff.com|EU01
1005|Jack Daniels|USA|WHISKEY|7|500000|28FEB2024|jack@@mail.com|US
1006|Heineken|Netherlands|Beer|10|450000|31APR2024|sales@heineken.com|EU
1007|Corona|Mexico|BEER|9|420000|11MAY2024|contact@corona.com|LA
1008|Absolut|Sweden|Vodka|8|370000|12JUN2024|NULL|EUR
1009|Jameson|Ireland|Whisky|9|390000|15JUL2024|sales@jameson.com|EU
1010|NULL|India|Beer|5|200000|01AUG2024|beer@abc.com|APAC
1011|Kingfisher|India|Beer|4|-10000|15SEP2024|invalidmail.com|AP
1012|Budweiser|USA|Beer|8|550000|10OCT2024|sales@bud.com|USA
1013|Carlsberg|Denmark|Beer|9|470000|11NOV2024|sales@carlsberg.com|EU
1014|Chivas Regal|Scotland|Whisky|7|460000|12DEC2024|chivas@brand.com|EU
1015|Grey Goose|France|Vodka|8|430000|13JAN2025|goose@mail.com|EU
;
run;
proc print data=alcohol_raw;
run;
OUTPUT:
| Obs | Brand | Country | Category | Launch_Date | Region | Contact_Email | Revenue | Transaction_ID | Rating |
|---|---|---|---|---|---|---|---|---|---|
| 1 | johnnie walker | Scotland | Whisky | 15JAN2024 | EU | contact@jw.com | 250000 | 1001 | 9 |
| 2 | JOHNNIE WALKER | SCOTLAND | whisky | 32JAN2024 | EUR | wrongmail | -250000 | 1001 | 9 |
| 3 | Hennessy | France | Cognac | 10FEB2024 | EU | sales@hennessy.com | 350000 | 1002 | 11 |
| 4 | Bacardi | Cuba | Rum | LATAM | info@bacardi.com | 280000 | 1003 | -2 | |
| 5 | Smirnoff | Russia | vodka | 15MAR2024 | EU01 | support@smirnoff.com | NULL | 1004 | 8 |
| 6 | Jack Daniels | USA | WHISKEY | 28FEB2024 | US | jack@@mail.com | 500000 | 1005 | 7 |
| 7 | Heineken | Netherlands | Beer | 31APR2024 | EU | sales@heineken.com | 450000 | 1006 | 10 |
| 8 | Corona | Mexico | BEER | 11MAY2024 | LA | contact@corona.com | 420000 | 1007 | 9 |
| 9 | Absolut | Sweden | Vodka | 12JUN2024 | EUR | NULL | 370000 | 1008 | 8 |
| 10 | Jameson | Ireland | Whisky | 15JUL2024 | EU | sales@jameson.com | 390000 | 1009 | 9 |
| 11 | NULL | India | Beer | 01AUG2024 | APAC | beer@abc.com | 200000 | 1010 | 5 |
| 12 | Kingfisher | India | Beer | 15SEP2024 | AP | invalidmail.com | -10000 | 1011 | 4 |
| 13 | Budweiser | USA | Beer | 10OCT2024 | USA | sales@bud.com | 550000 | 1012 | 8 |
| 14 | Carlsberg | Denmark | Beer | 11NOV2024 | EU | sales@carlsberg.com | 470000 | 1013 | 9 |
| 15 | Chivas Regal | Scotland | Whisky | 12DEC2024 | EU | chivas@brand.com | 460000 | 1014 | 7 |
| 16 | Grey Goose | France | Vodka | 13JAN2025 | EU | goose@mail.com | 430000 | 1015 | 8 |
Why LENGTH Must Come First
One of
the most overlooked SAS production risks is character truncation.
Incorrect
data test;
name='Johnnie Walker Premium Edition';
length name $10;
run;
proc print data=test;
run;
LOG:
OUTPUT:
| Obs | name |
|---|---|
| 1 | Johnnie Wa |
Correct
data test;
length name $40;
name='Johnnie Walker Premium Edition';
run;
proc print data=test;
run;
OUTPUT:
| Obs | name |
|---|---|
| 1 | Johnnie Walker Premium Edition |
Explanation
SAS
determines character length during compilation. If LENGTH appears after
assignment, the variable may already be defined with insufficient storage.
Large pharmaceutical and banking systems frequently suffer silent truncation
issues, causing failed joins and reconciliation problems.
In R,
character vectors dynamically resize and do not suffer this exact problem,
making string handling more flexible.
Step 1: Data Profiling
PROC CONTENTS
proc contents data=alcohol_raw;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.ALCOHOL_RAW | Observations | 16 |
|---|---|---|---|
| Member Type | DATA | Variables | 9 |
| Engine | V9 | Indexes | 0 |
| Created | 06/24/2026 11:50:05 | Observation Length | 176 |
| Last Modified | 06/24/2026 11:50:05 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 743 |
| Obs in First Data Page | 16 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_work6F2C00008D88_odaws02-apse1-2.oda.sas.com/SAS_workBEA500008D88_odaws02-apse1-2.oda.sas.com/alcohol_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 201333005 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 1 | Brand | Char | 30 |
| 3 | Category | Char | 20 |
| 6 | Contact_Email | Char | 50 |
| 2 | Country | Char | 20 |
| 4 | Launch_Date | Char | 12 |
| 9 | Rating | Num | 8 |
| 5 | Region | Char | 10 |
| 7 | Revenue | Char | 15 |
| 8 | Transaction_ID | Num | 8 |
Explanation
PROC
CONTENTS acts as the first validation checkpoint. It identifies variable
attributes, storage lengths, formats, informats, and types. Before cleaning any
enterprise dataset, programmers should inspect metadata because many downstream
failures originate from incorrect variable definitions. In regulated clinical
trials, metadata verification is part of audit readiness and traceability
requirements.
Step 2: Enterprise Data Cleaning
DATA Step Cleaning Workflow
data alcohol_clean;
length Brand_Clean $30 Category_Clean $20
Region_Clean $15;
set alcohol_raw;
Brand_Clean=propcase(strip(Brand));
if upcase(Brand_Clean)="NULL" then Brand_Clean="Unknown";
Category_Clean=propcase(strip(Category));
Revenue_Num=input(compress(Revenue),best12.);
Revenue_Num=abs(Revenue_Num);
Rating=max(min(Rating,10),1);
select(upcase(region));
when('EU','EUR','EU01') Region_Clean='EUROPE';
when('US','USA') Region_Clean='NORTH_AMERICA';
when('AP','APAC') Region_Clean='ASIA_PACIFIC';
when('LA','LATAM') Region_Clean='LATIN_AMERICA';
otherwise Region_Clean='UNKNOWN';
end;
Email_Flag=
(prxmatch('/^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$/i',
Contact_Email)>0);
drop Revenue Region Brand Category;
rename Revenue_Num=Revenue Region_Clean=Region
Brand_Clean=Brand Category_Clean=Category;
run;
proc print data=alcohol_clean;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Johnnie Walker | Whisky | EUROPE | Scotland | 15JAN2024 | contact@jw.com | 1001 | 9 | 250000 | 0 |
| 2 | Johnnie Walker | Whisky | EUROPE | SCOTLAND | 32JAN2024 | wrongmail | 1001 | 9 | 250000 | 0 |
| 3 | Hennessy | Cognac | EUROPE | France | 10FEB2024 | sales@hennessy.com | 1002 | 10 | 350000 | 0 |
| 4 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | |
| 5 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 |
| 6 | Jack Daniels | Whiskey | NORTH_AMERICA | USA | 28FEB2024 | jack@@mail.com | 1005 | 7 | 500000 | 0 |
| 7 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 |
| 8 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 |
| 9 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 |
| 10 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 |
| 11 | Unknown | Beer | ASIA_PACIFIC | India | 01AUG2024 | beer@abc.com | 1010 | 5 | 200000 | 0 |
| 12 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 |
| 13 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 |
| 14 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 |
| 15 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 |
| 16 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 |
Explanation
This DATA
step demonstrates production-quality standardization. We normalize text using
STRIP and PROPCASE, convert revenue from character to numeric using INPUT,
eliminate negative values using ABS, and harmonize region codes through
SELECT-WHEN logic. Email validation uses PRXMATCH regular expressions. Such
transformations ensure consistent reporting, improve join accuracy, and prevent
analytical bias caused by inconsistent categories.
Step 3: Deduplication
proc sort data=alcohol_clean
out=alcohol_nodup nodupkey;
by Transaction_ID;
run;
proc print data=alcohol_nodup;
run;
LOG:
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Johnnie Walker | Whisky | EUROPE | Scotland | 15JAN2024 | contact@jw.com | 1001 | 9 | 250000 | 0 |
| 2 | Hennessy | Cognac | EUROPE | France | 10FEB2024 | sales@hennessy.com | 1002 | 10 | 350000 | 0 |
| 3 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | |
| 4 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 |
| 5 | Jack Daniels | Whiskey | NORTH_AMERICA | USA | 28FEB2024 | jack@@mail.com | 1005 | 7 | 500000 | 0 |
| 6 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 |
| 7 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 |
| 8 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 |
| 9 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 |
| 10 | Unknown | Beer | ASIA_PACIFIC | India | 01AUG2024 | beer@abc.com | 1010 | 5 | 200000 | 0 |
| 11 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 |
| 12 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 |
| 13 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 |
| 14 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 |
| 15 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 |
Explanation
Duplicate
records often distort revenue calculations and KPI reporting. PROC SORT
NODUPKEY retains the first occurrence of a BY-group while removing duplicates.
In clinical trials, duplicate subject IDs can invalidate statistical outputs
and compromise regulatory submissions.
Step 4: ARRAY Processing
data alcohol_array;
set alcohol_nodup;
array nums {*} Rating Revenue;
do i=1 to dim(nums);
if nums{i}<0 then nums{i}=abs(nums{i});
end;
drop i;
run;
proc print data=alcohol_array;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Johnnie Walker | Whisky | EUROPE | Scotland | 15JAN2024 | contact@jw.com | 1001 | 9 | 250000 | 0 |
| 2 | Hennessy | Cognac | EUROPE | France | 10FEB2024 | sales@hennessy.com | 1002 | 10 | 350000 | 0 |
| 3 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | |
| 4 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 |
| 5 | Jack Daniels | Whiskey | NORTH_AMERICA | USA | 28FEB2024 | jack@@mail.com | 1005 | 7 | 500000 | 0 |
| 6 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 |
| 7 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 |
| 8 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 |
| 9 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 |
| 10 | Unknown | Beer | ASIA_PACIFIC | India | 01AUG2024 | beer@abc.com | 1010 | 5 | 200000 | 0 |
| 11 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 |
| 12 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 |
| 13 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 |
| 14 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 |
| 15 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 |
Explanation
Arrays
allow efficient bulk processing across multiple variables. Instead of writing
repetitive IF statements, programmers can iterate through variables
dynamically. Arrays improve maintainability and reduce coding errors in
large-scale data cleaning workflows.
Step 5: FIRST. and LAST. Processing
proc sort data=alcohol_array;
by Country;
run;
proc print data=alcohol_array;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | |
| 2 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 |
| 3 | Hennessy | Cognac | EUROPE | France | 10FEB2024 | sales@hennessy.com | 1002 | 10 | 350000 | 0 |
| 4 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 |
| 5 | Unknown | Beer | ASIA_PACIFIC | India | 01AUG2024 | beer@abc.com | 1010 | 5 | 200000 | 0 |
| 6 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 |
| 7 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 |
| 8 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 |
| 9 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 |
| 10 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 |
| 11 | Johnnie Walker | Whisky | EUROPE | Scotland | 15JAN2024 | contact@jw.com | 1001 | 9 | 250000 | 0 |
| 12 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 |
| 13 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 |
| 14 | Jack Daniels | Whiskey | NORTH_AMERICA | USA | 28FEB2024 | jack@@mail.com | 1005 | 7 | 500000 | 0 |
| 15 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 |
data country_summary;
set alcohol_array;
by Country;
retain Brand_Count 0;
if first.Country then Brand_Count=0;
Brand_Count+1;
if last.Country;
run;
proc print data=country_summary;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag | Brand_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | 1 | |
| 2 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 | 1 |
| 3 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 | 2 |
| 4 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 | 2 |
| 5 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 | 1 |
| 6 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 | 1 |
| 7 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 | 1 |
| 8 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 | 1 |
| 9 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 | 2 |
| 10 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 | 1 |
| 11 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 | 2 |
Explanation
FIRST.
and LAST. processing enables group-wise calculations without PROC SQL. This
technique is widely used in healthcare studies for patient visit tracking,
treatment exposure calculations, and longitudinal analyses.
Step 6: PROC FORMAT
proc format;
value ratefmt 1-4='Low'
5-7='Medium'
8-10='Premium';
run;
LOG:
Explanation
Formats
provide business-friendly labels without modifying underlying data. This
separation preserves raw values while improving presentation quality in reports
and dashboards.
2.PROC SQL vs DATA Step
Step 1: SQL Join
proc sql;
create table alcohol_final as
select a.*,
b.Brand_Count
from alcohol_array a
left join country_summary b
on a.Country=b.Country;
quit;
proc print data=alcohol_final;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag | Brand_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | 1 | |
| 2 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 | 1 |
| 3 | Hennessy | Cognac | EUROPE | France | 10FEB2024 | sales@hennessy.com | 1002 | 10 | 350000 | 0 | 2 |
| 4 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 | 2 |
| 5 | Unknown | Beer | ASIA_PACIFIC | India | 01AUG2024 | beer@abc.com | 1010 | 5 | 200000 | 0 | 2 |
| 6 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 | 2 |
| 7 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 | 1 |
| 8 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 | 1 |
| 9 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 | 1 |
| 10 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 | 1 |
| 11 | Johnnie Walker | Whisky | EUROPE | Scotland | 15JAN2024 | contact@jw.com | 1001 | 9 | 250000 | 0 | 2 |
| 12 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 | 2 |
| 13 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 | 1 |
| 14 | Jack Daniels | Whiskey | NORTH_AMERICA | USA | 28FEB2024 | jack@@mail.com | 1005 | 7 | 500000 | 0 | 2 |
| 15 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 | 2 |
Step 2: DATA Step Merge
proc sort data=alcohol_array;
by Country;
run;
proc print data=alcohol_array;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | |
| 2 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 |
| 3 | Hennessy | Cognac | EUROPE | France | 10FEB2024 | sales@hennessy.com | 1002 | 10 | 350000 | 0 |
| 4 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 |
| 5 | Unknown | Beer | ASIA_PACIFIC | India | 01AUG2024 | beer@abc.com | 1010 | 5 | 200000 | 0 |
| 6 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 |
| 7 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 |
| 8 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 |
| 9 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 |
| 10 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 |
| 11 | Johnnie Walker | Whisky | EUROPE | Scotland | 15JAN2024 | contact@jw.com | 1001 | 9 | 250000 | 0 |
| 12 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 |
| 13 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 |
| 14 | Jack Daniels | Whiskey | NORTH_AMERICA | USA | 28FEB2024 | jack@@mail.com | 1005 | 7 | 500000 | 0 |
| 15 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 |
proc sort data=country_summary;
by Country;
run;
proc print data=country_summary;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag | Brand_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | 1 | |
| 2 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 | 1 |
| 3 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 | 2 |
| 4 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 | 2 |
| 5 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 | 1 |
| 6 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 | 1 |
| 7 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 | 1 |
| 8 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 | 1 |
| 9 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 | 2 |
| 10 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 | 1 |
| 11 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 | 2 |
data alcohol_merge;
merge alcohol_array(in=a)
country_summary(in=b);
by Country;
if a;
run;
proc print data=alcohol_merge;
run;
OUTPUT:
| Obs | Brand | Category | Region | Country | Launch_Date | Contact_Email | Transaction_ID | Rating | Revenue | Email_Flag | Brand_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Bacardi | Rum | LATIN_AMERICA | Cuba | info@bacardi.com | 1003 | 1 | 280000 | 0 | 1 | |
| 2 | Carlsberg | Beer | EUROPE | Denmark | 11NOV2024 | sales@carlsberg.com | 1013 | 9 | 470000 | 0 | 1 |
| 3 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 | 2 |
| 4 | Grey Goose | Vodka | EUROPE | France | 13JAN2025 | goose@mail.com | 1015 | 8 | 430000 | 0 | 2 |
| 5 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 | 2 |
| 6 | Kingfisher | Beer | ASIA_PACIFIC | India | 15SEP2024 | invalidmail.com | 1011 | 4 | 10000 | 0 | 2 |
| 7 | Jameson | Whisky | EUROPE | Ireland | 15JUL2024 | sales@jameson.com | 1009 | 9 | 390000 | 0 | 1 |
| 8 | Corona | Beer | LATIN_AMERICA | Mexico | 11MAY2024 | contact@corona.com | 1007 | 9 | 420000 | 0 | 1 |
| 9 | Heineken | Beer | EUROPE | Netherlands | 31APR2024 | sales@heineken.com | 1006 | 10 | 450000 | 0 | 1 |
| 10 | Smirnoff | Vodka | EUROPE | Russia | 15MAR2024 | support@smirnoff.com | 1004 | 8 | . | 0 | 1 |
| 11 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 | 2 |
| 12 | Chivas Regal | Whisky | EUROPE | Scotland | 12DEC2024 | chivas@brand.com | 1014 | 7 | 460000 | 0 | 2 |
| 13 | Absolut | Vodka | EUROPE | Sweden | 12JUN2024 | NULL | 1008 | 8 | 370000 | 0 | 1 |
| 14 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 | 2 |
| 15 | Budweiser | Beer | NORTH_AMERICA | USA | 10OCT2024 | sales@bud.com | 1012 | 8 | 550000 | 0 | 2 |
Explanation
PROC SQL
offers relational database-style syntax and flexibility for complex joins. DATA
Step MERGE is often faster for sorted datasets and provides greater row-level
control. Enterprise SAS programmers must understand both approaches because
performance requirements vary by workload.
3.Additional SAS Procedures
Step 1: PROC FREQ
proc freq data=alcohol_final;
tables Region Category;
run;
OUTPUT:
The FREQ Procedure
| Region | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| ASIA_PACIFIC | 2 | 13.33 | 2 | 13.33 |
| EUROPE | 9 | 60.00 | 11 | 73.33 |
| LATIN_AMERICA | 2 | 13.33 | 13 | 86.67 |
| NORTH_AMERICA | 2 | 13.33 | 15 | 100.00 |
| Category | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| Beer | 6 | 40.00 | 6 | 40.00 |
| Cognac | 1 | 6.67 | 7 | 46.67 |
| Rum | 1 | 6.67 | 8 | 53.33 |
| Vodka | 3 | 20.00 | 11 | 73.33 |
| Whiskey | 1 | 6.67 | 12 | 80.00 |
| Whisky | 3 | 20.00 | 15 | 100.00 |
Step 2: PROC MEANS
proc means data=alcohol_final n mean median min max;
var Revenue Rating;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Median | Minimum | Maximum |
|---|---|---|---|---|---|
Revenue Rating | 14 15 | 366428.57 7.4666667 | 405000.00 8.0000000 | 10000.00 1.0000000 | 550000.00 10.0000000 |
Step 3: PROC SUMMARY
proc summary data=alcohol_final nway;
class Region;
var Revenue;
output out=region_rev sum=;
run;
proc print data=region_rev;
run;
OUTPUT:
| Obs | Region | _TYPE_ | _FREQ_ | Revenue |
|---|---|---|---|---|
| 1 | ASIA_PACIFIC | 1 | 2 | 210000 |
| 2 | EUROPE | 1 | 9 | 3170000 |
| 3 | LATIN_AMERICA | 1 | 2 | 700000 |
| 4 | NORTH_AMERICA | 1 | 2 | 1050000 |
Step 4: PROC TRANSPOSE
proc transpose data=region_rev out=transpose_rev;
by Region;
var Revenue;
run;
proc print data=transpose_rev;
run;
OUTPUT:
| Obs | Region | _NAME_ | COL1 |
|---|---|---|---|
| 1 | ASIA_PACIFIC | Revenue | 210000 |
| 2 | EUROPE | Revenue | 3170000 |
| 3 | LATIN_AMERICA | Revenue | 700000 |
| 4 | NORTH_AMERICA | Revenue | 1050000 |
Step 5: PROC REPORT
proc report data=alcohol_final nowd;
column Brand Region Revenue Rating;
run;
OUTPUT:
| Brand | Region | Revenue | Rating |
|---|---|---|---|
| Bacardi | LATIN_AMERICA | 280000 | 1 |
| Carlsberg | EUROPE | 470000 | 9 |
| Hennessy | EUROPE | 350000 | 10 |
| Grey Goose | EUROPE | 430000 | 8 |
| Unknown | ASIA_PACIFIC | 200000 | 5 |
| Kingfisher | ASIA_PACIFIC | 10000 | 4 |
| Jameson | EUROPE | 390000 | 9 |
| Corona | LATIN_AMERICA | 420000 | 9 |
| Heineken | EUROPE | 450000 | 10 |
| Smirnoff | EUROPE | . | 8 |
| Johnnie Walker | EUROPE | 250000 | 9 |
| Chivas Regal | EUROPE | 460000 | 7 |
| Absolut | EUROPE | 370000 | 8 |
| Jack Daniels | NORTH_AMERICA | 500000 | 7 |
| Budweiser | NORTH_AMERICA | 550000 | 8 |
Step 6: SAS Macro for Reusable Validation
%macro check_missing(ds);
data _missing;
set &ds;
if cmiss(of _character_)>0
or nmiss(of _numeric_)>0;
run;
proc print data=transpose_rev;
run;
proc sql;
select count(*) as Missing_Count
from _missing;
quit;
%mend;
%check_missing(alcohol_final);
OUTPUT:
| Obs | Region | _NAME_ | COL1 |
|---|---|---|---|
| 1 | ASIA_PACIFIC | Revenue | 210000 |
| 2 | EUROPE | Revenue | 3170000 |
| 3 | LATIN_AMERICA | Revenue | 700000 |
| 4 | NORTH_AMERICA | Revenue | 1050000 |
| Missing_Count |
|---|
| 2 |
Explanation
Macros
standardize validation logic. Instead of repeating code across hundreds of
studies or business projects, organizations maintain reusable libraries. This
improves consistency, auditability, and productivity.
Why this is better
- Checks all character
variables using CMISS
- Checks all numeric variables
using NMISS
- Works for any dataset
- Commonly used in clinical
programming QC
4.Raw Dataset in R
alcohol_raw <- read.delim(
text="
Transaction_ID|Brand|Country|Category|Rating|Revenue|Launch_Date|Email|Region
1001| johnnie walker |Scotland|Whisky|9|250000|15JAN2024|contact@jw.com|EU
1001|JOHNNIE WALKER|SCOTLAND|whisky|9|-250000|32JAN2024|wrongmail|EUR
1002|Hennessy|France|Cognac|11|350000|10FEB2024|sales@hennessy.com|EU
1003| Bacardi |Cuba|Rum|-2|280000|NA|info@bacardi.com|LATAM
1004|Smirnoff|Russia|vodka|8|NULL|15MAR2024|support@smirnoff.com|EU01
1005|Jack Daniels|USA|WHISKEY|7|500000|28FEB2024|jack@@mail.com|US
1006|Heineken|Netherlands|Beer|10|450000|31APR2024|sales@heineken.com|EU
1007|Corona|Mexico|BEER|9|420000|11MAY2024|contact@corona.com|LA
1008|Absolut|Sweden|Vodka|8|370000|12JUN2024|NULL|EUR
1009|Jameson|Ireland|Whisky|9|390000|15JUL2024|sales@jameson.com|EU
",
sep="|",
header=TRUE
)
OUTPUT:
|
Transaction_ID |
Brand |
Country |
Category |
Rating |
Revenue |
Launch_Date |
Email |
Region |
|
1001 |
johnnie walker |
Scotland |
Whisky |
9 |
250000 |
15JAN2024 |
contact@jw.com |
EU |
|
1001 |
JOHNNIE
WALKER |
SCOTLAND |
whisky |
9 |
-250000 |
32JAN2024 |
wrongmail |
EUR |
|
1002 |
Hennessy |
France |
Cognac |
11 |
350000 |
10FEB2024 |
sales@hennessy.com |
EU |
|
1003 |
Bacardi |
Cuba |
Rum |
-2 |
280000 |
info@bacardi.com |
LATAM |
|
|
1004 |
Smirnoff |
Russia |
vodka |
8 |
NULL |
15MAR2024 |
support@smirnoff.com |
EU01 |
|
1005 |
Jack
Daniels |
USA |
WHISKEY |
7 |
500000 |
28FEB2024 |
jack@@mail.com |
US |
|
1006 |
Heineken |
Netherlands |
Beer |
10 |
450000 |
31APR2024 |
sales@heineken.com |
EU |
|
1007 |
Corona |
Mexico |
BEER |
9 |
420000 |
11MAY2024 |
contact@corona.com |
LA |
|
1008 |
Absolut |
Sweden |
Vodka |
8 |
370000 |
12JUN2024 |
NULL |
EUR |
|
1009 |
Jameson |
Ireland |
Whisky |
9 |
390000 |
15JUL2024 |
sales@jameson.com |
EU |
5.R Cleaning Workflow
library(tidyverse)
library(janitor)
library(lubridate)
alcohol_clean <- alcohol_raw %>%
clean_names() %>%
mutate(brand=str_to_title(str_trim(brand)),
category=str_to_title(category),
revenue=as.numeric(replace(revenue,revenue=="NULL",NA)),
revenue=abs(revenue),
rating=case_when(rating>10 ~ 10,
rating<1 ~ 1,TRUE ~ rating),
region=case_when(
region %in% c("EU","EUR","EU01") ~ "EUROPE",
region %in% c("US","USA") ~ "NORTH_AMERICA",
TRUE ~ region),
email=coalesce(email,"unknown@email.com")
) %>%
distinct(transaction_id,.keep_all=TRUE)
|
transaction_id |
brand |
country |
category |
rating |
revenue |
launch_date |
email |
region |
|
1001 |
Johnnie Walker |
Scotland |
Whisky |
9 |
250000 |
15JAN2024 |
contact@jw.com |
EUROPE |
|
1002 |
Hennessy |
France |
Cognac |
10 |
350000 |
10FEB2024 |
sales@hennessy.com |
EUROPE |
|
1003 |
Bacardi |
Cuba |
Rum |
1 |
280000 |
info@bacardi.com |
LATAM |
|
|
1004 |
Smirnoff |
Russia |
Vodka |
8 |
15MAR2024 |
support@smirnoff.com |
EUROPE |
|
|
1005 |
Jack Daniels |
USA |
Whiskey |
7 |
500000 |
28FEB2024 |
jack@@mail.com |
NORTH_AMERICA |
|
1006 |
Heineken |
Netherlands |
Beer |
10 |
450000 |
31APR2024 |
sales@heineken.com |
EUROPE |
|
1007 |
Corona |
Mexico |
Beer |
9 |
420000 |
11MAY2024 |
contact@corona.com |
LA |
|
1008 |
Absolut |
Sweden |
Vodka |
8 |
370000 |
12JUN2024 |
NULL |
EUROPE |
|
1009 |
Jameson |
Ireland |
Whisky |
9 |
390000 |
15JUL2024 |
sales@jameson.com |
EUROPE |
Explanation
The tidyverse
workflow mirrors SAS cleaning logic. mutate() resembles DATA step assignments.
case_when() behaves like SELECT-WHEN. distinct() performs deduplication similar
to PROC SORT NODUPKEY. coalesce() resembles SAS COALESCEC. These functions
create concise, readable, and reproducible cleaning pipelines.
Validation & Compliance
In
regulated industries such as clinical research:
- SDTM ensures
submission-standardized datasets.
- ADaM supports statistical
analyses.
- Audit trails document every
transformation.
- QC programming must remain
independent.
- Traceability must exist from
raw source to final output.
- Metadata must be version
controlled.
A
critical SAS risk:
if Revenue_Num < 1000 then
Flag='Y';
Missing
numeric values are treated lower than any valid number.
Therefore:
. <
returns
TRUE.
This can
accidentally classify missing values as low revenue or high risk, causing
serious reporting errors.
20 Data Cleaning Best
Practices
- Validate metadata before
coding.
- Standardize naming
conventions.
- Remove duplicates early.
- Preserve raw datasets.
- Create audit trails.
- Document assumptions.
- Validate date ranges.
- Check categorical domains.
- Standardize missing values.
- Use reusable macros.
- Perform independent QC.
- Apply defensive programming.
- Maintain lineage tracking.
- Version control code.
- Validate joins.
- Use controlled terminology.
- Review truncation risks.
- Validate formats.
- Automate reporting checks.
- Test production deployments.
Business Logic Behind
Cleaning
Business
rules convert raw observations into reliable information. Missing values are
often imputed because analytical models require complete observations. For
example, if a product launch date is unavailable, business stakeholders may use
the first known transaction date as a proxy. Unrealistic values require
correction because they distort metrics. A rating of 11 on a 10-point scale is
impossible and must be capped. Negative revenue may indicate accounting
reversals or data-entry issues; validation determines whether correction or
exclusion is appropriate.
Text
normalization is equally important. “johnnie walker,” “JOHNNIE WALKER,” and “
Johnnie Walker ” should represent the same brand. Without standardization,
frequency counts and aggregations become fragmented. Date standardization
ensures consistent calculations such as months-on-market and year-over-year
growth. In healthcare, patient age values such as -5 or 250 years create
impossible demographic distributions. In banking, salary values stored as text
prevent credit-risk calculations. Missing visit dates can affect treatment
exposure calculations and statistical analyses. Every cleaning decision should
be governed by documented business rules, reviewed by stakeholders, and
validated independently. The objective is not merely to repair data but to
create a trusted analytical asset supporting reliable reporting, predictive
modeling, regulatory submissions, and executive decision-making.
20 One-Line Insights
- Dirty data creates expensive
business mistakes.
- Standardized variables
improve reproducibility.
- Validation logic is stronger
than visual inspection.
- Metadata drives analytics
quality.
- Every duplicate changes a
metric.
- Missing values deserve
investigation.
- Clean joins require clean
keys.
- Auditability builds trust.
- Traceability supports
compliance.
- Formats improve usability.
- Macros improve consistency.
- Arrays reduce repetitive
code.
- PROC SQL simplifies
integration.
- DATA Steps provide granular
control.
- R excels at flexible
transformations.
- SAS excels at governed
workflows.
- Automated QC reduces risk.
- Controlled terminology improves
reporting.
- Production code must be
defensive.
- Analytics is only as good as
its source data.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Excellent |
Growing |
|
Scalability |
Excellent |
Excellent |
|
Flexibility |
High |
Very
High |
|
Metadata
Control |
Excellent |
Moderate |
|
Visualization |
Good |
Excellent |
|
Reproducibility |
Excellent |
Excellent |
|
Learning
Curve |
Moderate |
Moderate |
|
Enterprise
Deployment |
Excellent |
High |
|
Open
Source Ecosystem |
Limited |
Massive |
Validation Checklist
1.Duplicate
IDs removed
2.Invalid
dates investigated
3.Negative
amounts reviewed
4.Missing
values standardized
5.Email
validation completed
6.Region
codes harmonized
7.Metadata
verified
8.QC
review completed
9.Traceability
documented
10.Reporting
outputs validated
Summary: SAS and R for
Enterprise Data Engineering
SAS and R
complement each other exceptionally well in modern analytics ecosystems. SAS
provides structured governance, strong metadata management, repeatable
validation frameworks, and audit-ready programming capabilities required in
highly regulated industries such as pharmaceuticals, banking, and insurance.
Features such as PROC SQL, DATA Step processing, formats, macros, and
validation procedures create highly reliable production workflows.
R brings flexibility,
rapid development, advanced data wrangling, and a rich ecosystem of packages.
The tidyverse framework simplifies transformations while enabling powerful
exploratory analysis and visualization. Functions such as mutate(), across(),
case_when(), coalesce(), replace_na(), and parse_date_time() support efficient
data engineering pipelines.
Together,
SAS and R create a balanced architecture. SAS governs, validates, and
standardizes enterprise datasets. R accelerates transformation, exploration,
and advanced analytics. Organizations increasingly use both technologies
because they combine regulatory confidence with analytical agility. When
integrated properly, they produce scalable, reproducible, and trustworthy
analytical assets capable of supporting executive reporting, machine learning,
regulatory submissions, and strategic decision-making.
Conclusion
The
world's leading alcohol brands generate enormous volumes of operational, sales,
distribution, and marketing data. However, raw data rarely arrives in a form
suitable for analysis. Duplicate transaction IDs, malformed emails,
inconsistent region codes, invalid dates, negative revenue values, corrupted
categories, and missing fields can quietly undermine dashboards, forecasts,
compliance reports, and executive decisions. What appears to be a small
data-quality issue can become a significant financial, operational, or
regulatory problem when propagated through enterprise systems.
A
structured data-cleaning framework transforms unreliable records into trusted
analytical intelligence. In SAS, this transformation is achieved through
disciplined metadata management, DATA Step programming, PROC SQL integration,
validation macros, formats, deduplication routines, and comprehensive reporting
procedures. Techniques such as FIRST./LAST. processing, ARRAY logic, INPUT/PUT
conversions, RETAIN statements, and PROC REPORT enable production-grade
workflows that are scalable and auditable.
R
complements this foundation by offering modern data engineering capabilities
through tidyverse packages. Functions such as mutate(), case_when(),
distinct(), coalesce(), replace_na(), separate(), unite(), and
parse_date_time() provide expressive, maintainable pipelines for rapid
transformation and exploratory analysis. The ability to combine SAS governance
with R flexibility gives organizations a powerful hybrid ecosystem.
The
ultimate objective of data cleaning is not merely correcting errors. It is
establishing trust. Trust in metrics, trust in statistical analyses, trust in
predictive models, trust in regulatory submissions, and trust in executive
decision-making. Whether the dataset represents global alcohol brands, clinical
trial subjects, insurance claims, banking transactions, or retail operations,
the same principle applies: reliable analytics begin with reliable data.
Organizations that invest in robust cleaning, validation, governance, and
traceability frameworks build a sustainable competitive advantage because every
insight generated from their systems is grounded in accuracy, consistency, and
reproducibility.
Interview Questions &
Answers
1. A revenue dashboard suddenly doubles sales
figures. How would you investigate?
Answer:
First, check for duplicate transaction IDs using PROC SORT NODUPKEY or PROC SQL
COUNT(*) GROUP BY logic. Compare row counts before and after deduplication.
Validate joins because one-to-many joins frequently create duplicate records.
In R, use distinct() and anti_join() to identify duplicate sources.
2. How would you handle invalid dates such as
32JAN2024?
Answer:
Use INPUT() with date informats in SAS and parse_date_time() in R. Invalid
dates should become missing and be reviewed through validation reports rather
than forcefully corrected without business approval.
3. Explain a real-world risk of missing numeric
values in SAS?
Answer:
Missing numeric values are considered lower than valid numbers. A condition
such as if amount < 1000 will also capture missing values unless explicitly
checked with missing(amount)=0.
4. When would you choose PROC SQL over a DATA Step
MERGE?
Answer:
Use PROC SQL for complex joins, aggregations, and relational operations. Use
DATA Step MERGE when datasets are sorted and row-level processing efficiency is
required.
5. How do you validate that a cleaned dataset is
production-ready?
Answer:
Perform metadata checks, row-count reconciliation, duplicate validation,
missing-value assessments, date-range validation, frequency reviews, PROC
COMPARE verification, independent QC review, and final reporting sign-off. In
R, supplement this with summary statistics, distinct counts, and validation
scripts to ensure reproducibility.
6. Why does CMISS(OF _ALL_) fail
inside PROC SQL?
Answer:
OF _ALL_, OF _CHARACTER_, and OF _NUMERIC_ are DATA Step variable lists. PROC
SQL does not support DATA Step variable list syntax. Therefore CMISS(OF _ALL_)
generates ERROR 22-322. The solution is to use a DATA Step before PROC SQL or
explicitly list variables in the SQL WHERE clause.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about ALCOHOL DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment