India’s Fuel Economy Decoded: Advanced Petrol Price Cleansing and Reporting with SAS and R
Transforming India’s Petrol Price History (2000–2026) into Analytical Intelligence with SAS WHERE vs IF, PROC SQL, DATA Step and R
1. Introduction -When One Bad Record Destroys an
Entire Analysis
Imagine a
national petroleum analytics company preparing a strategic fuel pricing report
for the Government of India. The dataset contains petrol prices from 2000–2026
across multiple Indian states. Executives depend on this analysis to forecast
inflation, transportation costs, and subsidy planning.
Everything
looks professional until one hidden issue changes the story completely.
A petrol
price entered as -105 instead of 105.
A state
name recorded as "delhi " instead of "DELHI".
A missing
year accidentally treated as valid.
A duplicate
record inflates the national average.
Suddenly,
dashboards become misleading. Regional fuel trends become inaccurate. Business
decisions become dangerous.
This is
exactly why professional data cleaning matters.
In modern
analytics, raw data is rarely perfect. Whether in clinical trials, banking,
insurance, or petroleum analytics, data arrives broken, inconsistent,
duplicated, and incomplete. The role of a SAS programmer or data analyst is not
merely to generate reports it is to establish trust in data.
Two
technologies dominate enterprise-grade data preparation:
- SAS trusted heavily in
pharmaceuticals, banking, and regulatory industries
- R powerful for modern data
wrangling and visualization
This
article demonstrates how to build, clean, validate, filter, and report petrol
price datasets using:
- SAS DATA Step
- PROC SQL
- WHERE vs IF filtering
- R tidyverse methods
- Advanced validation logic
- Real-world business rules
We
intentionally create messy datasets and solve them professionally.
2. Raw Data Creation in SAS
— Building the Messy Dataset
Business Scenario
A
petroleum intelligence firm collected petrol price records from 2000–2026
across India. The raw flat file contains:
- Missing petrol prices
- Duplicate IDs
- Invalid negative prices
- Wrong date formats
- Mixed case state names
- NULL text values
- Blank region names
Raw Flat File Simulation Using INFILE
filename fueldata temp;
data _null_;
file fueldata;
put "Fuel_ID|State|Region|Year|Petrol_Price|Record_Date|Fuel_Type
|Source_System|Remarks";
put "101|Delhi |north|2000|30|12-01-2000|PETROL|SYS_A|VALID";
put "102|MUMBAI|WEST|2001|-32|15-02-2001|PETROL|SYS_B|NEGATIVE";
put "103|Chennai|south|2002|NULL|22-03-2002|petrol|SYS_C|MISSING";
put "104|Kolkata|EAST|2003|35|31-13-2003|PETROL|SYS_A|BADDATE";
put "105|Delhi |NORTH|2004|40|01-05-2004|PETROL|SYS_A|VALID";
put "105|Delhi |NORTH|2004|40|01-05-2004|PETROL|SYS_A|DUPLICATE";
put "106| Hyderabad|south|2005|45|14-07-2005|PETROL|SYS_B|VALID";
put "107|NULL|WEST|2006|48|18-08-2006|PETROL|SYS_C|BADSTATE";
put "108|Pune| |2007|52|20-09-2007|PETROL|SYS_A|BLANKREGION";
put "109|Bangalore|south|2008|-55|15-10-2008|PETROL|SYS_B|NEGATIVE";
put "110|Mumbai|WEST|2026|105|11-01-2026|PETROL|SYS_C|LATEST";
run;
LOG:
Why This Matters
This
simulates real enterprise flat-file ingestion.
Key Professional Concepts
- FILENAME TEMP creates
temporary external files.
- PUT statements simulate raw
source systems.
- Delimiter (|) mimics
production feeds.
- Intentional data corruption
tests cleaning logic.
- Regulatory systems often
validate these exact issues.
Importing Raw Data
data petrol_raw;
retain Fuel_ID State Region Year Petrol_Price Record_Date
Fuel_Type Source_System Remarks;
infile fueldata dlm='|' dsd firstobs=2;
length State $20 Region $15 Fuel_Type $15 Source_System $15
Record_Date $10 Remarks $20;
input Fuel_ID State $ Region $ Year Petrol_Price $ Record_Date $
Fuel_Type $ Source_System $ Remarks $;
run;
proc print data = petrol_raw;
run;
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Petrol_Price | Record_Date | Fuel_Type | Source_System | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Delhi | north | 2000 | 30 | 12-01-2000 | PETROL | SYS_A | VALID |
| 2 | 102 | MUMBAI | WEST | 2001 | -32 | 15-02-2001 | PETROL | SYS_B | NEGATIVE |
| 3 | 103 | Chennai | south | 2002 | NULL | 22-03-2002 | petrol | SYS_C | MISSING |
| 4 | 104 | Kolkata | EAST | 2003 | 35 | 31-13-2003 | PETROL | SYS_A | BADDATE |
| 5 | 105 | Delhi | NORTH | 2004 | 40 | 01-05-2004 | PETROL | SYS_A | VALID |
| 6 | 105 | Delhi | NORTH | 2004 | 40 | 01-05-2004 | PETROL | SYS_A | DUPLICATE |
| 7 | 106 | Hyderabad | south | 2005 | 45 | 14-07-2005 | PETROL | SYS_B | VALID |
| 8 | 107 | NULL | WEST | 2006 | 48 | 18-08-2006 | PETROL | SYS_C | BADSTATE |
| 9 | 108 | Pune | 2007 | 52 | 20-09-2007 | PETROL | SYS_A | BLANKREGION | |
| 10 | 109 | Bangalore | south | 2008 | -55 | 15-10-2008 | PETROL | SYS_B | NEGATIVE |
| 11 | 110 | Mumbai | WEST | 2026 | 105 | 11-01-2026 | PETROL | SYS_C | LATEST |
The Truncation Trap — Why
LENGTH Must Come First
One of
the biggest hidden dangers in SAS is character truncation.
Incorrect:
if State='Delhi' then
Region='NORTH';
length Region $15;
Correct:
length Region $15;
if State='Delhi' then
Region='NORTH';
Why?
Because
SAS determines variable length during first compilation. If Region='N'
initially appears, SAS permanently allocates length 1 unless explicitly defined
earlier.
This
silently destroys data integrity.
DATA STEP Cleaning Logic
data petrol_clean;
set petrol_raw;
retain Fuel_ID State Region Year Petrol_Price Record_Date
Fuel_Type Source_System Remarks;
length State $20 Price_Category $15;
State=upcase(strip(State));
if State='NULL' or State='' then State='UNKNOWN';
Region=upcase(strip(Region));
if Region='' then Region='UNKNOWN';
if Petrol_Price='NULL' then Petrol_Price='';
Petrol_Num=input(Petrol_Price,best12.);
Petrol_Num=abs(Petrol_Num);
month_num=input(scan(Record_Date,2,'-'),best12.);
if month_num >= 1 and month_num <= 12 then
Date_Num =input(Record_Date,ddmmyy10.);
else Date_Num=.;
format Date_Num date9.;
Fuel_Type=upcase(Fuel_Type);
select;
when (Petrol_Num < 40) Price_Category='LOW';
when (40 <= Petrol_Num <= 70) Price_Category='MEDIUM';
otherwise Price_Category='HIGH';
end;
drop Petrol_Price Record_Date;
rename Petrol_Num = Petrol_Price Date_Num = Record_Date ;
run;
proc print data = petrol_clean;
run;
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Fuel_Type | Source_System | Remarks | Price_Category | Petrol_Price | month_num | Record_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | DELHI | NORTH | 2000 | PETROL | SYS_A | VALID | LOW | 30 | 1 | 12JAN2000 |
| 2 | 102 | MUMBAI | WEST | 2001 | PETROL | SYS_B | NEGATIVE | LOW | 32 | 2 | 15FEB2001 |
| 3 | 103 | CHENNAI | SOUTH | 2002 | PETROL | SYS_C | MISSING | LOW | . | 3 | 22MAR2002 |
| 4 | 104 | KOLKATA | EAST | 2003 | PETROL | SYS_A | BADDATE | LOW | 35 | 13 | . |
| 5 | 105 | DELHI | NORTH | 2004 | PETROL | SYS_A | VALID | MEDIUM | 40 | 5 | 01MAY2004 |
| 6 | 105 | DELHI | NORTH | 2004 | PETROL | SYS_A | DUPLICATE | MEDIUM | 40 | 5 | 01MAY2004 |
| 7 | 106 | HYDERABAD | SOUTH | 2005 | PETROL | SYS_B | VALID | MEDIUM | 45 | 7 | 14JUL2005 |
| 8 | 107 | UNKNOWN | WEST | 2006 | PETROL | SYS_C | BADSTATE | MEDIUM | 48 | 8 | 18AUG2006 |
| 9 | 108 | PUNE | UNKNOWN | 2007 | PETROL | SYS_A | BLANKREGION | MEDIUM | 52 | 9 | 20SEP2007 |
| 10 | 109 | BANGALORE | SOUTH | 2008 | PETROL | SYS_B | NEGATIVE | MEDIUM | 55 | 10 | 15OCT2008 |
| 11 | 110 | MUMBAI | WEST | 2026 | PETROL | SYS_C | LATEST | HIGH | 105 | 1 | 11JAN2026 |
Why This DATA STEP Is
Professional
Key Functions Explained
ABS()
Converts
negative petrol prices into valid positive values.
Example:
-55 → 55
Useful
in:
- Banking corrections
- Financial normalization
- Sensor anomaly handling
INPUT()
Converts
character dates into numeric SAS dates.
Date_Num=input(Record_Date,ddmmyy10.);
Without
conversion:
- Sorting fails
- Date arithmetic fails
- Reporting becomes unreliable
SELECT-WHEN vs IF-THEN
IF-THEN
Best for:
- Complex Boolean conditions
- Nested logic
- Dynamic calculations
SELECT-WHEN
Best for:
- Categorical grouping
- Cleaner readability
- Faster maintenance
This
project uses SELECT-WHEN because price categories are mutually exclusive.
Deduplication Using PROC SORT
proc sort data=petrol_clean out=petrol_nodup nodupkey;
by Fuel_ID;
run;
proc print data = petrol_nodup;
run;
LOG:
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Fuel_Type | Source_System | Remarks | Price_Category | Petrol_Price | month_num | Record_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | DELHI | NORTH | 2000 | PETROL | SYS_A | VALID | LOW | 30 | 1 | 12JAN2000 |
| 2 | 102 | MUMBAI | WEST | 2001 | PETROL | SYS_B | NEGATIVE | LOW | 32 | 2 | 15FEB2001 |
| 3 | 103 | CHENNAI | SOUTH | 2002 | PETROL | SYS_C | MISSING | LOW | . | 3 | 22MAR2002 |
| 4 | 104 | KOLKATA | EAST | 2003 | PETROL | SYS_A | BADDATE | LOW | 35 | 13 | . |
| 5 | 105 | DELHI | NORTH | 2004 | PETROL | SYS_A | VALID | MEDIUM | 40 | 5 | 01MAY2004 |
| 6 | 106 | HYDERABAD | SOUTH | 2005 | PETROL | SYS_B | VALID | MEDIUM | 45 | 7 | 14JUL2005 |
| 7 | 107 | UNKNOWN | WEST | 2006 | PETROL | SYS_C | BADSTATE | MEDIUM | 48 | 8 | 18AUG2006 |
| 8 | 108 | PUNE | UNKNOWN | 2007 | PETROL | SYS_A | BLANKREGION | MEDIUM | 52 | 9 | 20SEP2007 |
| 9 | 109 | BANGALORE | SOUTH | 2008 | PETROL | SYS_B | NEGATIVE | MEDIUM | 55 | 10 | 15OCT2008 |
| 10 | 110 | MUMBAI | WEST | 2026 | PETROL | SYS_C | LATEST | HIGH | 105 | 1 | 11JAN2026 |
Why PROC SORT NODUPKEY Matters
Duplicate
fuel records can inflate:
- State averages
- Inflation estimates
- Government subsidy
calculations
NODUPKEY
preserves only one unique observation per key.
WHERE vs IF in SAS
Using WHERE
proc print data=petrol_nodup;
where Region='SOUTH';
run;
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Fuel_Type | Source_System | Remarks | Price_Category | Petrol_Price | month_num | Record_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 103 | CHENNAI | SOUTH | 2002 | PETROL | SYS_C | MISSING | LOW | . | 3 | 22MAR2002 |
| 6 | 106 | HYDERABAD | SOUTH | 2005 | PETROL | SYS_B | VALID | MEDIUM | 45 | 7 | 14JUL2005 |
| 9 | 109 | BANGALORE | SOUTH | 2008 | PETROL | SYS_B | NEGATIVE | MEDIUM | 55 | 10 | 15OCT2008 |
WHERE Characteristics
- Filters BEFORE reading data
- Faster for large datasets
- Cannot use newly created
variables
Using IF
data south_prices;
set petrol_nodup;
if Region='SOUTH';
run;
proc print data=south_prices;
run;
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Fuel_Type | Source_System | Remarks | Price_Category | Petrol_Price | month_num | Record_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 103 | CHENNAI | SOUTH | 2002 | PETROL | SYS_C | MISSING | LOW | . | 3 | 22MAR2002 |
| 2 | 106 | HYDERABAD | SOUTH | 2005 | PETROL | SYS_B | VALID | MEDIUM | 45 | 7 | 14JUL2005 |
| 3 | 109 | BANGALORE | SOUTH | 2008 | PETROL | SYS_B | NEGATIVE | MEDIUM | 55 | 10 | 15OCT2008 |
IF Characteristics
- Filters AFTER reading
observations
- Can use computed variables
- More flexible
Performance Difference
|
Feature |
WHERE |
IF |
|
Execution
Stage |
Before
PDV |
After
PDV |
|
Speed |
Faster |
Slower |
|
Uses
Derived Variables |
No |
Yes |
|
Best
For |
Extraction |
Transformation |
PROC SQL Alternative
proc sql;
create table petrol_sql as
select Fuel_ID,State,Region,Year,Petrol_Price format=8.2,
Price_Category
from petrol_nodup
where Petrol_Price > 50;
quit;
proc print data=petrol_sql;
run;
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Petrol_Price | Price_Category |
|---|---|---|---|---|---|---|
| 1 | 108 | PUNE | UNKNOWN | 2007 | 52.00 | MEDIUM |
| 2 | 109 | BANGALORE | SOUTH | 2008 | 55.00 | MEDIUM |
| 3 | 110 | MUMBAI | WEST | 2026 | 105.00 | HIGH |
Why PROC SQL Is Powerful
PROC SQL
is ideal for:
- Joins
- Aggregation
- Filtering
- Enterprise reporting
DATA Step
excels at:
- Row-wise transformations
- Sequential processing
- Complex derivations
Professional
programmers use both strategically.
3. The R Refinement Layer — Tidyverse Cleaning
library(dplyr)
library(stringr)
library(tidyr)
petrol_raw <- data.frame(
Fuel_ID=c(101,102,103,104,105),
State=c("Delhi ","MUMBAI","NULL","Chennai","Delhi"),
Price=c(30,-32,NA,35,40)
)
OUTPUT:
|
|
Fuel_ID |
State |
Price |
|
1 |
101 |
Delhi |
30 |
|
2 |
102 |
MUMBAI |
-32 |
|
3 |
103 |
NULL |
NA |
|
4 |
104 |
Chennai |
35 |
|
5 |
105 |
Delhi |
40 |
petrol_clean <- petrol_raw %>%
mutate(
State=str_trim(toupper(State)),
State=ifelse(State=="NULL","UNKNOWN",State),
Price=abs(Price),
Price=replace_na(Price,0)
)
OUTPUT:
|
|
Fuel_ID |
State |
Price |
|
1 |
101 |
DELHI |
30 |
|
2 |
102 |
MUMBAI |
32 |
|
3 |
103 |
UNKNOWN |
0 |
|
4 |
104 |
CHENNAI |
35 |
|
5 |
105 |
DELHI |
40 |
R Functions Explained
|
R
Function |
Purpose |
SAS
Equivalent |
|
mutate() |
Create
variables |
DATA
STEP |
|
filter() |
Subset
rows |
WHERE/IF |
|
replace_na() |
Handle
missing |
IF
MISSING() |
|
toupper() |
Standardization |
UPCASE() |
|
str_trim() |
Remove
spaces |
STRIP() |
Advanced Regex Cleaning
petrol_clean$State <- gsub("[^A-Z ]","",petrol_clean$State)
OUTPUT:
| Fuel_ID | State | Price |
1 | 101 | DELHI | 30 |
2 | 102 | MUMBAI | 32 |
3 | 103 | UNKNOWN | 0 |
4 | 104 | CHENNAI | 35 |
5 | 105 | DELHI | 40 |
petrol_clean$State <- trimws(petrol_clean$State)
OUTPUT:
| Fuel_ID | State | Price |
1 | 101 | DELHI | 30 |
2 | 102 | MUMBAI | 32 |
3 | 103 | UNKNOWN | 0 |
4 | 104 | CHENNAI | 35 |
5 | 105 | DELHI | 40 |
Why Regex Cleaning Matters
Real-world
systems contain:
- Hidden tabs
- Extra punctuation
- Invalid symbols
Regex
removes these systematically.
4. Business Logic & The Missing Value Trap
High-Stakes Loan Approval Scenario
Imagine a
bank approving fuel dealership loans.
Eligibility
Rule:
Petrol Price > 50
Now
imagine missing petrol prices.
In SAS:
data petrol_clean2;
retain Fuel_ID State Region Year Petrol_Price Record_Date
Fuel_Type Source_System Remarks Reject;
set petrol_clean;
if missing(Petrol_Price) then Reject="Review";
else if Petrol_Price > 50 then Reject="Yes";
else Reject="No";
run;
proc print data=petrol_clean2;
run;
Problem:
Missing
numeric values are treated as smaller than any number.
Meaning:
. < 50
= TRUE
A missing
value accidentally becomes eligible or rejected incorrectly.
This can:
- Trigger regulatory
violations
- Cause financial losses
- Destroy audit credibility
Always
explicitly check missing values:
if missing(Petrol_Num) then Reject='REVIEW';
Key Point
SET
imports variable attributes immediately.
Therefore:
- RETAIN after SET may not
reorder variables properly
- LENGTH, FORMAT, ATTRIB,
RETAIN are best placed before SET
OUTPUT:
| Obs | Fuel_ID | State | Region | Year | Petrol_Price | Record_Date | Fuel_Type | Source_System | Remarks | Reject | Price_Category | month_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | DELHI | NORTH | 2000 | 30 | 12JAN2000 | PETROL | SYS_A | VALID | No | LOW | 1 |
| 2 | 102 | MUMBAI | WEST | 2001 | 32 | 15FEB2001 | PETROL | SYS_B | NEGATIVE | No | LOW | 2 |
| 3 | 103 | CHENNAI | SOUTH | 2002 | . | 22MAR2002 | PETROL | SYS_C | MISSING | Review | LOW | 3 |
| 4 | 104 | KOLKATA | EAST | 2003 | 35 | . | PETROL | SYS_A | BADDATE | No | LOW | 13 |
| 5 | 105 | DELHI | NORTH | 2004 | 40 | 01MAY2004 | PETROL | SYS_A | VALID | No | MEDIUM | 5 |
| 6 | 105 | DELHI | NORTH | 2004 | 40 | 01MAY2004 | PETROL | SYS_A | DUPLICATE | No | MEDIUM | 5 |
| 7 | 106 | HYDERABAD | SOUTH | 2005 | 45 | 14JUL2005 | PETROL | SYS_B | VALID | No | MEDIUM | 7 |
| 8 | 107 | UNKNOWN | WEST | 2006 | 48 | 18AUG2006 | PETROL | SYS_C | BADSTATE | No | MEDIUM | 8 |
| 9 | 108 | PUNE | UNKNOWN | 2007 | 52 | 20SEP2007 | PETROL | SYS_A | BLANKREGION | Yes | MEDIUM | 9 |
| 10 | 109 | BANGALORE | SOUTH | 2008 | 55 | 15OCT2008 | PETROL | SYS_B | NEGATIVE | Yes | MEDIUM | 10 |
| 11 | 110 | MUMBAI | WEST | 2026 | 105 | 11JAN2026 | PETROL | SYS_C | LATEST | Yes | HIGH | 1 |
5. 20 Golden Rules for
Professional SAS Projects
- Always define LENGTH before
assignments.
- Never trust raw imported
data.
- Standardize character casing
immediately.
- Remove duplicates before
aggregation.
- Validate dates rigorously.
- Use WHERE for faster
filtering.
- Use IF for derived-variable
logic.
- Never overwrite raw
datasets.
- Preserve audit copies.
- Use formats consistently.
- Validate missing values
explicitly.
- Document derivation rules.
- Use meaningful variable
names.
- Avoid hardcoded business
logic.
- Validate source-to-target
mapping.
- Test edge-case scenarios.
- Use PROC CONTENTS
frequently.
- Normalize text using
STRIP().
- Prefer SELECT-WHEN for
categories.
- Always create QC validation
reports.
Extended Reporting & Aggregation
proc summary data=petrol_nodup nway;
class Region;
var Petrol_price;
output out=region_summary mean=Avg_Price
max=Max_Price min=Min_Price;
run;
proc print data=region_summary;
run;
OUTPUT:
| Obs | Region | _TYPE_ | _FREQ_ | Avg_Price | Max_Price | Min_Price |
|---|---|---|---|---|---|---|
| 1 | EAST | 1 | 1 | 35.0000 | 35 | 35 |
| 2 | NORTH | 1 | 2 | 35.0000 | 40 | 30 |
| 3 | SOUTH | 1 | 3 | 50.0000 | 55 | 45 |
| 4 | UNKNOWN | 1 | 1 | 52.0000 | 52 | 52 |
| 5 | WEST | 1 | 3 | 61.6667 | 105 | 32 |
Professional Reporting
proc report data=region_summary nowd;
columns Region Avg_Price Max_Price Min_Price;
define Region / group;
define Avg_Price / analysis;
define Max_Price / analysis;
define Min_Price / analysis;
title "Regional Petrol Price Summary Report";
run;
OUTPUT:
| Region | Avg_Price | Max_Price | Min_Price |
|---|---|---|---|
| EAST | 35 | 35 | 35 |
| NORTH | 35 | 40 | 30 |
| SOUTH | 50 | 55 | 45 |
| UNKNOWN | 52 | 52 | 52 |
| WEST | 61.666667 | 105 | 32 |
Why PROC REPORT Is
Enterprise Preferred
PROC
REPORT provides:
- Controlled layouts
- Regulatory formatting
- Executive-ready outputs
- Dynamic summarization
Widely
used in:
- Clinical TLFs
- Financial reporting
- Government analytics
6. 20 Additional Data
Cleaning Best Practices
- Validate SDTM compliance
before submission.
- Maintain Define.xml
traceability.
- Use controlled terminology
consistently.
- Preserve original raw
values.
- Validate unit consistency.
- Create automated QC macros.
- Avoid manual spreadsheet
edits.
- Reconcile cross-domain
mismatches.
- Verify subject uniqueness.
- Validate treatment dates
carefully.
- Maintain audit logs.
- Track derivation origins.
- Use metadata-driven
programming.
- Validate protocol
deviations.
- Perform double programming
QC.
- Use checksum comparisons.
- Detect impossible date
sequences.
- Standardize categorical
variables.
- Document transformation
assumptions.
- Ensure reproducibility
across environments.
7. Business Logic Behind
Data Cleaning
Data
cleaning is not cosmetic it directly impacts decision-making.
Suppose
petrol prices contain negative values. Without correction, average price
calculations become distorted. Governments may incorrectly estimate inflation
trends. Oil companies may misprice supply contracts.
Missing
values are equally dangerous. A blank patient age in a clinical trial could
accidentally qualify an ineligible participant. Similarly, missing petrol
pricing records could distort national fuel subsidy calculations.
Date
correction is critical because incorrect timelines affect forecasting. If a
2026 record is accidentally stored as 2006, trend analysis becomes unreliable.
Salary
normalization in banking, age validation in healthcare, and price
standardization in petroleum analytics all follow the same principle:
Reliable decisions
require reliable data.
That is
why professional SAS programmers prioritize:
- Validation
- Standardization
- Auditability
- Reproducibility
before
any reporting begins.
8. 20 Sharp Key Insights
- Dirty data leads to wrong
conclusions.
- Standardization ensures
reproducibility.
- Missing values are hidden
business risks.
- WHERE filtering improves
performance.
- IF filtering improves
flexibility.
- PROC SQL simplifies
aggregation.
- DATA Step excels at
transformations.
- Duplicate records inflate
metrics.
- LENGTH prevents truncation
disasters.
- Regex cleaning removes
hidden corruption.
- Audit trails protect
regulatory trust.
- Validation improves business
credibility.
- ABS() corrects numerical
anomalies.
- INPUT() enables true date
analytics.
- PROC REPORT creates
executive-ready outputs.
- Null handling prevents
logical failures.
- Text normalization improves
joins.
- QC checks reduce compliance
risk.
- Structured programming
improves maintenance.
- Reliable analytics begin
with clean data.
9. Summary
SAS and R
both provide exceptional capabilities for enterprise-grade data cleaning, but
each has distinct strengths.
SAS
dominates highly regulated industries such as pharmaceuticals, banking,
insurance, and government analytics because of its:
- Stability
- Auditability
- Regulatory trust
- Structured processing
Its DATA
Step architecture makes row-wise transformations highly efficient. Features
like WHERE processing, PROC SORT, PROC SQL, and PROC REPORT allow scalable
enterprise workflows.
Meanwhile,
R excels in:
- Modern wrangling
- Visualization
- Open-source flexibility
- Rapid exploratory analysis
Packages
such as dplyr, tidyr, and stringr simplify complex transformations using
readable syntax.
In
professional environments, the strongest analysts understand both ecosystems.
SAS
provides enterprise-grade reliability.
R
provides analytical flexibility.
Together,
they create scalable, accurate, and maintainable analytics pipelines capable of
handling real-world messy data.
From
petrol pricing intelligence to clinical trial validation, the core principle
remains unchanged:
Clean
data drives trustworthy decisions.
10. Conclusion
The
journey from raw petrol price records to professional analytical intelligence
demonstrates a critical truth about modern data science:
Analytics
is only as reliable as the quality of the underlying data.
A single
duplicate record can inflate averages.
A missing
value can invalidate eligibility decisions.
A
truncated character variable can silently corrupt reporting.
These are
not theoretical problems they occur daily across banking systems, healthcare
platforms, petroleum analytics, and government reporting infrastructures.
This
project showcased how professional programmers use:
- SAS DATA Step
- PROC SQL
- WHERE vs IF logic
- PROC SORT
- PROC REPORT
- R tidyverse pipelines
to
transform chaotic datasets into structured, validated, business-ready
intelligence.
More
importantly, we explored the reasoning behind each operation.
Professional
data cleaning is not merely technical coding.
It is:
- Risk management
- Business protection
- Regulatory defense
- Decision assurance
The
distinction between amateur and enterprise-grade analytics often lies not in
dashboards, but in how rigorously the underlying data was validated.
Organizations
trust analysts who:
- Preserve audit trails
- Anticipate edge cases
- Handle missing values safely
- Document transformations
clearly
- Build reproducible pipelines
Whether
you work in clinical trials, petroleum analytics, banking, or insurance,
mastering structured data cleaning frameworks will always remain one of the
most valuable skills in analytics engineering.
Clean
data is not a luxury.
It is the
foundation of trustworthy intelligence.
11. Interview Questions
& Answers
1. Why is WHERE faster than IF in SAS?
Answer:
WHERE
filters observations before entering the Program Data Vector (PDV), reducing
I/O operations. IF evaluates after data is read into memory.
2. When would you prefer DATA Step over PROC SQL?
Answer:
Use DATA
Step for sequential row-level transformations and complex derivations. Use PROC
SQL for joins, aggregations, and relational operations.
3. How does SAS treat missing numeric values?
Answer:
SAS
treats missing numeric values as smaller than any valid number. This can create
logical errors if not explicitly handled.
4. Explain a real-world duplicate record issue.
Answer:
Duplicate
patient or petrol records can inflate averages, distort KPIs, and produce
misleading executive reports. PROC SORT NODUPKEY helps resolve this.
5. What is the SAS equivalent of mutate() in R?
Answer:
The SAS
DATA Step performs variable creation and modification similarly to mutate() in
R.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about PETROL DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment