India’s Most Famous Chief Ministers Reimagined Through Clean Data, Smart Filtering and Enterprise Analytics with SAS and R
India’s Most Famous Chief Ministers Dataset (2000–2026) into Professional Analytical Power Using PROC DATASETS, PROC SQL, DATA Step, SAS and R
1. Introduction — When Political Data Becomes an
Analytical Disaster
Imagine a
national media analytics company preparing an election intelligence dashboard
containing information about India’s most famous Chief Ministers from
2000–2026. The organization receives data from multiple regional offices. One
branch enters “Yogi Adityanath,” another writes “yogi adityanath,” and another
uses “NULL.” Some tenure years are negative due to migration errors. Duplicate
CM IDs exist because two analysts merged Excel sheets incorrectly. State names
appear as “UP,” “Uttar Pradesh,” and “uttar pradesh.”
Now
imagine this corrupted dataset being used for:
- Election forecasting
- Political popularity
indexing
- Government policy analytics
- Budget allocation research
- Television debate
intelligence systems
One
incorrect value could distort an entire trend analysis.
This is
where SAS and R become indispensable. Professional analytics is not about
beautiful dashboards alone it begins with disciplined data engineering and
intelligent filtering.
In
enterprise environments, analysts rarely receive perfect datasets. Real-world
datasets are chaotic, fragmented, duplicated, and structurally inconsistent.
The real skill lies in transforming broken information into trustworthy
analytical intelligence.
This
project demonstrates how to build, clean, validate, filter, manage, and report
a dataset about India’s most famous Chief Ministers from 2000–2026 using:
- PROC DATASETS
- PROC SQL
- DATA STEP
- PROC SORT
- PROC REPORT
- PROC SUMMARY
- Advanced SAS functions
- Tidyverse in R
We will
intentionally create errors, solve them professionally, and explain the
business logic behind every correction.
2. Raw Data Creation in SAS and R
Raw SAS Dataset with Intentional Errors
data cm_raw;
length CM_ID 8 CM_Name $25 State $20 Party $20 Region $15 Status $12;
infile datalines dlm='|' dsd truncover;
input CM_ID CM_Name $ State $ Party $ Region $ Popularity_Score
Years_in_Power Election_Year $ Status $ Approval_Rating;
datalines;
101|Narendra Modi|Gujarat|BJP|WEST|95|13|2001|ACTIVE|98
102|yogi adityanath|UP|BJP|north|88|8|2017|ACTIVE|92
103|NULL|Tamil Nadu|DMK|south|85|5|2021|ACTIVE|90
104|Mamata Banerjee|West Bengal|TMC|EAST|-91|15|2011|ACTIVE|96
105|Arvind Kejriwal|Delhi|AAP|North|89|-10|2015|ACTIVE|93
106|Nitish Kumar|Bihar|JDU|EAST|84|20|20AB|ACTIVE|89
106|Nitish Kumar|Bihar|JDU|EAST|84|20|20AB|ACTIVE|89
107| |Maharashtra|Shiv Sena|WEST|82|4|2022|ACTIVE|85
108|KCR|Telangana|BRS|south|.|10|2014|ACTIVE|91
109|Pinarayi Vijayan|Kerala|LDF|south|86|9|2021|ACTIVE|90
110|Shivraj Singh|MP|BJP|CENTRAL|87|17|2005|ACTIVE|94
111|N Chandrababu Naidu|AP|TDP|south|90|14|2014|ACTIVE|95
;
run;
proc print data=cm_raw;
run;
OUTPUT:
| Obs | CM_ID | CM_Name | State | Party | Region | Status | Popularity_Score | Years_in_Power | Election_Year | Approval_Rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Narendra Modi | Gujarat | BJP | WEST | ACTIVE | 95 | 13 | 2001 | 98 |
| 2 | 102 | yogi adityanath | UP | BJP | north | ACTIVE | 88 | 8 | 2017 | 92 |
| 3 | 103 | NULL | Tamil Nadu | DMK | south | ACTIVE | 85 | 5 | 2021 | 90 |
| 4 | 104 | Mamata Banerjee | West Bengal | TMC | EAST | ACTIVE | -91 | 15 | 2011 | 96 |
| 5 | 105 | Arvind Kejriwal | Delhi | AAP | North | ACTIVE | 89 | -10 | 2015 | 93 |
| 6 | 106 | Nitish Kumar | Bihar | JDU | EAST | ACTIVE | 84 | 20 | 20AB | 89 |
| 7 | 106 | Nitish Kumar | Bihar | JDU | EAST | ACTIVE | 84 | 20 | 20AB | 89 |
| 8 | 107 | Maharashtra | Shiv Sena | WEST | ACTIVE | 82 | 4 | 2022 | 85 | |
| 9 | 108 | KCR | Telangana | BRS | south | ACTIVE | . | 10 | 2014 | 91 |
| 10 | 109 | Pinarayi Vijayan | Kerala | LDF | south | ACTIVE | 86 | 9 | 2021 | 90 |
| 11 | 110 | Shivraj Singh | MP | BJP | CENTRAL | ACTIVE | 87 | 17 | 2005 | 94 |
| 12 | 111 | N Chandrababu Naidu | AP | TDP | south | ACTIVE | 90 | 14 | 2014 | 95 |
#Create raw CM dataset in R
cm_raw <- data.frame(
CM_ID = c(101,102,103,104,105,106,106,107,108,109,110,111),
CM_Name = c("Narendra Modi","yogi adityanath","NULL","Mamata Banerjee",
"Arvind Kejriwal","Nitish Kumar","Nitish Kumar","","KCR",
"Pinarayi Vijayan","Shivraj Singh","N Chandrababu Naidu"),
State = c("Gujarat","UP","Tamil Nadu","West Bengal","Delhi","Bihar",
"Bihar","Maharashtra","Telangana","Kerala","MP","AP"),
Party = c("BJP","BJP","DMK","TMC","AAP","JDU","JDU","Shiv Sena","BRS",
"LDF","BJP","TDP"),
Region = c("WEST","north","south","EAST","North","EAST","EAST","WEST",
"south","south","CENTRAL","south"),
Popularity_Score = c(95,88,85,-91,89,84,84,82,NA,86,87,90),
Years_in_Power = c(13,8,5,15,-10,20,20,4,10,9,17,14),
Election_Year = c("2001","2017","2021","2011","2015","20AB","20AB",
"2022","2014","2021","2005","2014"),
Status = c("ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE",
"ACTIVE","ACTIVE","ACTIVE","ACTIVE","ACTIVE"),
Approval_Rating = c(98,92,90,96,93,89,89,85,91,90,94,95),
stringsAsFactors = FALSE
)
OUTPUT:
|
|
CM_ID |
CM_Name |
State |
Party |
Region |
Popularity_Score |
Years_in_Power |
Election_Year |
Status |
Approval_Rating |
|
1 |
101 |
Narendra Modi |
Gujarat |
BJP |
WEST |
95 |
13 |
2001 |
ACTIVE |
98 |
|
2 |
102 |
yogi adityanath |
UP |
BJP |
north |
88 |
8 |
2017 |
ACTIVE |
92 |
|
3 |
103 |
NULL |
Tamil Nadu |
DMK |
south |
85 |
5 |
2021 |
ACTIVE |
90 |
|
4 |
104 |
Mamata Banerjee |
West Bengal |
TMC |
EAST |
-91 |
15 |
2011 |
ACTIVE |
96 |
|
5 |
105 |
Arvind Kejriwal |
Delhi |
AAP |
North |
89 |
-10 |
2015 |
ACTIVE |
93 |
|
6 |
106 |
Nitish Kumar |
Bihar |
JDU |
EAST |
84 |
20 |
20AB |
ACTIVE |
89 |
|
7 |
106 |
Nitish Kumar |
Bihar |
JDU |
EAST |
84 |
20 |
20AB |
ACTIVE |
89 |
|
8 |
107 |
|
Maharashtra |
Shiv Sena |
WEST |
82 |
4 |
2022 |
ACTIVE |
85 |
|
9 |
108 |
KCR |
Telangana |
BRS |
south |
NA |
10 |
2014 |
ACTIVE |
91 |
|
10 |
109 |
Pinarayi Vijayan |
Kerala |
LDF |
south |
86 |
9 |
2021 |
ACTIVE |
90 |
|
11 |
110 |
Shivraj Singh |
MP |
BJP |
CENTRAL |
87 |
17 |
2005 |
ACTIVE |
94 |
|
12 |
111 |
N Chandrababu Naidu |
AP |
TDP |
south |
90 |
14 |
2014 |
ACTIVE |
95 |
Explanation
This
dataset intentionally contains multiple enterprise-level data quality issues:
- Duplicate CM_ID
- Invalid year (“20AB”)
- Missing CM name
- NULL values
- Negative popularity score
- Negative years in power
- Inconsistent capitalization
- Abbreviated state names
- Blank observations
The LENGTH
statement is critical because SAS assigns storage length during variable
creation. If conditional logic executes before proper length allocation, SAS
may truncate strings permanently. For example, “N Chandrababu Naidu” may become
“N Chandr” if the variable length defaults incorrectly. This is known as the
truncation trap.
Professional
SAS programmers always define LENGTH before transformations.
SAS vs R Equivalents
|
SAS Function |
R Equivalent |
Purpose |
|
|
|
Proper capitalization |
|
|
|
Convert to uppercase |
|
|
|
Convert to lowercase |
|
|
|
Remove spaces |
|
|
|
Remove negative sign |
PROC DATASETS for Efficient Dataset Management
proc datasets library=work nolist;
modify cm_raw;
label CM_Name="Chief Minister Name"
Popularity_Score="Public Popularity Score"
Years_in_Power="Political Tenure";
format Approval_Rating 8.2;
quit;
proc print data=cm_raw label;
run;
OUTPUT:
| Obs | CM_ID | Chief Minister Name | State | Party | Region | Status | Public Popularity Score | Political Tenure | Election_Year | Approval_Rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Narendra Modi | Gujarat | BJP | WEST | ACTIVE | 95 | 13 | 2001 | 98.00 |
| 2 | 102 | yogi adityanath | UP | BJP | north | ACTIVE | 88 | 8 | 2017 | 92.00 |
| 3 | 103 | NULL | Tamil Nadu | DMK | south | ACTIVE | 85 | 5 | 2021 | 90.00 |
| 4 | 104 | Mamata Banerjee | West Bengal | TMC | EAST | ACTIVE | -91 | 15 | 2011 | 96.00 |
| 5 | 105 | Arvind Kejriwal | Delhi | AAP | North | ACTIVE | 89 | -10 | 2015 | 93.00 |
| 6 | 106 | Nitish Kumar | Bihar | JDU | EAST | ACTIVE | 84 | 20 | 20AB | 89.00 |
| 7 | 106 | Nitish Kumar | Bihar | JDU | EAST | ACTIVE | 84 | 20 | 20AB | 89.00 |
| 8 | 107 | Maharashtra | Shiv Sena | WEST | ACTIVE | 82 | 4 | 2022 | 85.00 | |
| 9 | 108 | KCR | Telangana | BRS | south | ACTIVE | . | 10 | 2014 | 91.00 |
| 10 | 109 | Pinarayi Vijayan | Kerala | LDF | south | ACTIVE | 86 | 9 | 2021 | 90.00 |
| 11 | 110 | Shivraj Singh | MP | BJP | CENTRAL | ACTIVE | 87 | 17 | 2005 | 94.00 |
| 12 | 111 | N Chandrababu Naidu | AP | TDP | south | ACTIVE | 90 | 14 | 2014 | 95.00 |
Explanation
PROC
DATASETS is significantly faster than repeatedly recreating datasets through
DATA steps because it modifies metadata directly without rewriting
observations.
Professional
advantages:
- Faster execution
- Efficient memory usage
- Enterprise-scale dataset
maintenance
- Variable renaming without
recreating data
- Labeling and formatting
optimization
DATA STEP Cleaning Logic
data cm_clean;
set cm_raw;
length Clean_State $25 Clean_Region $15;
CM_Name=propcase(strip(CM_Name));
if CM_Name="Null" or CM_Name="" then CM_Name="Unknown CM";
Popularity_Score=abs(Popularity_Score);
Years_in_Power=abs(Years_in_Power);
Clean_Region=upcase(Region);
select(upcase(State));
when('UP') Clean_State='Uttar Pradesh';
when('AP') Clean_State='Andhra Pradesh';
when('MP') Clean_State='Madhya Pradesh';
otherwise Clean_State=propcase(State);
end;
Election_Year_Num=input(Election_Year,8.);
if missing(Election_Year_Num) then Election_Year_Num=2020;
drop State Region Election_Year;
rename Clean_State = State
Clean_Region= Region
Election_Year_Num= Election_Year;
run;
proc print data=cm_clean;
run;
OUTPUT:
| Obs | CM_ID | CM_Name | Party | Status | Popularity_Score | Years_in_Power | Approval_Rating | State | Region | Election_Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Narendra Modi | BJP | ACTIVE | 95 | 13 | 98.00 | Gujarat | WEST | 2001 |
| 2 | 102 | Yogi Adityanath | BJP | ACTIVE | 88 | 8 | 92.00 | Uttar Pradesh | NORTH | 2017 |
| 3 | 103 | Unknown CM | DMK | ACTIVE | 85 | 5 | 90.00 | Tamil Nadu | SOUTH | 2021 |
| 4 | 104 | Mamata Banerjee | TMC | ACTIVE | 91 | 15 | 96.00 | West Bengal | EAST | 2011 |
| 5 | 105 | Arvind Kejriwal | AAP | ACTIVE | 89 | 10 | 93.00 | Delhi | NORTH | 2015 |
| 6 | 106 | Nitish Kumar | JDU | ACTIVE | 84 | 20 | 89.00 | Bihar | EAST | 2020 |
| 7 | 106 | Nitish Kumar | JDU | ACTIVE | 84 | 20 | 89.00 | Bihar | EAST | 2020 |
| 8 | 107 | Unknown CM | Shiv Sena | ACTIVE | 82 | 4 | 85.00 | Maharashtra | WEST | 2022 |
| 9 | 108 | Kcr | BRS | ACTIVE | . | 10 | 91.00 | Telangana | SOUTH | 2014 |
| 10 | 109 | Pinarayi Vijayan | LDF | ACTIVE | 86 | 9 | 90.00 | Kerala | SOUTH | 2021 |
| 11 | 110 | Shivraj Singh | BJP | ACTIVE | 87 | 17 | 94.00 | Madhya Pradesh | CENTRAL | 2005 |
| 12 | 111 | N Chandrababu Naidu | TDP | ACTIVE | 90 | 14 | 95.00 | Andhra Pradesh | SOUTH | 2014 |
Explanation
This DATA
STEP demonstrates industrial-grade transformation logic.
Technical Highlights
- ABS() corrects negative
values
- INPUT() converts character
years to numeric
- COALESCEC() prioritizes
nonmissing character values
- SELECT-WHEN improves
efficiency for categorical mapping
- PROPCASE() standardizes
capitalization
IF-THEN vs SELECT-WHEN
|
Feature |
IF-THEN |
SELECT-WHEN |
|
Best
for |
Complex
conditions |
Multiple
categorical mappings |
|
Readability |
Moderate |
Excellent |
|
Speed |
Slower
for many categories |
Faster |
|
Enterprise
Use |
Validation
logic |
Lookup
standardization |
SELECT-WHEN
is preferred when many discrete category mappings exist because SAS processes
it more efficiently.
Duplicate Removal
proc sort data=cm_clean nodupkey;
by CM_ID;
run;
proc print data=cm_clean;
run;
LOG:
OUTPUT:
| Obs | CM_ID | CM_Name | Party | Status | Popularity_Score | Years_in_Power | Approval_Rating | State | Region | Election_Year |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Narendra Modi | BJP | ACTIVE | 95 | 13 | 98.00 | Gujarat | WEST | 2001 |
| 2 | 102 | Yogi Adityanath | BJP | ACTIVE | 88 | 8 | 92.00 | Uttar Pradesh | NORTH | 2017 |
| 3 | 103 | Unknown CM | DMK | ACTIVE | 85 | 5 | 90.00 | Tamil Nadu | SOUTH | 2021 |
| 4 | 104 | Mamata Banerjee | TMC | ACTIVE | 91 | 15 | 96.00 | West Bengal | EAST | 2011 |
| 5 | 105 | Arvind Kejriwal | AAP | ACTIVE | 89 | 10 | 93.00 | Delhi | NORTH | 2015 |
| 6 | 106 | Nitish Kumar | JDU | ACTIVE | 84 | 20 | 89.00 | Bihar | EAST | 2020 |
| 7 | 107 | Unknown CM | Shiv Sena | ACTIVE | 82 | 4 | 85.00 | Maharashtra | WEST | 2022 |
| 8 | 108 | Kcr | BRS | ACTIVE | . | 10 | 91.00 | Telangana | SOUTH | 2014 |
| 9 | 109 | Pinarayi Vijayan | LDF | ACTIVE | 86 | 9 | 90.00 | Kerala | SOUTH | 2021 |
| 10 | 110 | Shivraj Singh | BJP | ACTIVE | 87 | 17 | 94.00 | Madhya Pradesh | CENTRAL | 2005 |
| 11 | 111 | N Chandrababu Naidu | TDP | ACTIVE | 90 | 14 | 95.00 | Andhra Pradesh | SOUTH | 2014 |
Explanation
NODUPKEY removes
duplicate IDs while preserving the first occurrence. This is essential in:
- Clinical trials
- Banking systems
- Government registries
- Election databases
Duplicate
records can inflate counts and produce false analytics.
PROC SQL Version
proc sql;
create table cm_sql as
select distinct CM_ID,propcase(CM_Name) as CM_Name length=30,
abs(Popularity_Score) as Popularity_Score,
abs(Years_in_Power) as Years_in_Power,
upcase(Region) as Region
from cm_raw
where Approval_Rating >= 90;
quit;
proc print data=cm_sql;
run;
OUTPUT:
| Obs | CM_ID | CM_Name | Popularity_Score | Years_in_Power | Region |
|---|---|---|---|---|---|
| 1 | 101 | Narendra Modi | 95 | 13 | WEST |
| 2 | 102 | Yogi Adityanath | 88 | 8 | NORTH |
| 3 | 103 | Null | 85 | 5 | SOUTH |
| 4 | 104 | Mamata Banerjee | 91 | 15 | EAST |
| 5 | 105 | Arvind Kejriwal | 89 | 10 | NORTH |
| 6 | 108 | Kcr | . | 10 | SOUTH |
| 7 | 109 | Pinarayi Vijayan | 86 | 9 | SOUTH |
| 8 | 110 | Shivraj Singh | 87 | 17 | CENTRAL |
| 9 | 111 | N Chandrababu Naidu | 90 | 14 | SOUTH |
Explanation
PROC SQL
provides relational-style querying and is ideal for:
- Joins
- Aggregation
- Dynamic filtering
- Subqueries
- Multi-table analytics
DATA STEP
is procedural, while PROC SQL is declarative.
Professional
programmers use both strategically.
3. The R Refinement Layer (Tidyverse Approach)
library(dplyr)
library(stringr)
library(tidyr)
cm_clean <- cm_raw %>%
mutate(CM_Name=str_to_title(trimws(CM_Name)),
CM_Name=ifelse(CM_Name=="Null" | CM_Name=="","Unknown CM",
CM_Name),
Popularity_Score=abs(Popularity_Score),
Years_in_Power=abs(Years_in_Power),
Region = toupper(Region),
State=case_when(State=="UP" ~ "Uttar Pradesh",
State=="AP" ~ "Andhra Pradesh",
State=="MP" ~ "Madhya Pradesh",
TRUE ~ State),
Election_Year=suppressWarnings(as.numeric(Election_Year)),
Election_Year=replace_na(Election_Year,2020)
) %>%
distinct(CM_ID,.keep_all=TRUE)
OUTPUT:
|
|
CM_ID |
CM_Name |
State |
Party |
Region |
Popularity_Score |
Years_in_Power |
Election_Year |
Status |
Approval_Rating |
|
1 |
101 |
Narendra Modi |
Gujarat |
BJP |
WEST |
95 |
13 |
2001 |
ACTIVE |
98 |
|
2 |
102 |
Yogi Adityanath |
Uttar Pradesh |
BJP |
NORTH |
88 |
8 |
2017 |
ACTIVE |
92 |
|
3 |
103 |
Unknown CM |
Tamil Nadu |
DMK |
SOUTH |
85 |
5 |
2021 |
ACTIVE |
90 |
|
4 |
104 |
Mamata Banerjee |
West Bengal |
TMC |
EAST |
91 |
15 |
2011 |
ACTIVE |
96 |
|
5 |
105 |
Arvind Kejriwal |
Delhi |
AAP |
NORTH |
89 |
10 |
2015 |
ACTIVE |
93 |
|
6 |
106 |
Nitish Kumar |
Bihar |
JDU |
EAST |
84 |
20 |
2020 |
ACTIVE |
89 |
|
7 |
107 |
Unknown CM |
Maharashtra |
Shiv Sena |
WEST |
82 |
4 |
2022 |
ACTIVE |
85 |
|
8 |
108 |
Kcr |
Telangana |
BRS |
SOUTH |
NA |
10 |
2014 |
ACTIVE |
91 |
|
9 |
109 |
Pinarayi Vijayan |
Kerala |
LDF |
SOUTH |
86 |
9 |
2021 |
ACTIVE |
90 |
|
10 |
110 |
Shivraj Singh |
Madhya Pradesh |
BJP |
CENTRAL |
87 |
17 |
2005 |
ACTIVE |
94 |
|
11 |
111 |
N Chandrababu Naidu |
Andhra Pradesh |
TDP |
SOUTH |
90 |
14 |
2014 |
ACTIVE |
95 |
Explanation
The
tidyverse ecosystem simplifies modern data wrangling.
SAS vs R Logic Bridge
|
R
Function |
SAS
Equivalent |
|
mutate() |
DATA
STEP assignment |
|
case_when() |
SELECT-WHEN |
|
filter() |
WHERE
statement |
|
distinct() |
PROC
SORT NODUPKEY |
|
replace_na() |
COALESCE/IF
MISSING |
|
|
CM_ID |
CM_Name |
State |
Party |
Region |
Popularity_Score |
Years_in_Power |
Election_Year |
Status |
Approval_Rating |
|
1 |
101 |
Narendra Modi |
Gujarat |
BJP |
WEST |
95 |
13 |
2001 |
ACTIVE |
98 |
|
2 |
102 |
Yogi Adityanath |
Uttar Pradesh |
BJP |
NORTH |
88 |
8 |
2017 |
ACTIVE |
92 |
|
3 |
103 |
Unknown CM |
Tamil Nadu |
DMK |
SOUTH |
85 |
5 |
2021 |
ACTIVE |
90 |
|
4 |
104 |
Mamata Banerjee |
West Bengal |
TMC |
EAST |
91 |
15 |
2011 |
ACTIVE |
96 |
|
5 |
105 |
Arvind Kejriwal |
Delhi |
AAP |
NORTH |
89 |
10 |
2015 |
ACTIVE |
93 |
|
6 |
106 |
Nitish Kumar |
Bihar |
JDU |
EAST |
84 |
20 |
2020 |
ACTIVE |
89 |
|
7 |
107 |
Unknown CM |
Maharashtra |
Shiv Sena |
WEST |
82 |
4 |
2022 |
ACTIVE |
85 |
|
8 |
108 |
Kcr |
Telangana |
BRS |
SOUTH |
NA |
10 |
2014 |
ACTIVE |
91 |
|
9 |
109 |
Pinarayi Vijayan |
Kerala |
LDF |
SOUTH |
86 |
9 |
2021 |
ACTIVE |
90 |
|
10 |
110 |
Shivraj Singh |
Madhya Pradesh |
BJP |
CENTRAL |
87 |
17 |
2005 |
ACTIVE |
94 |
|
11 |
111 |
N Chandrababu Naidu |
Andhra Pradesh |
TDP |
SOUTH |
90 |
14 |
2014 |
ACTIVE |
95 |
Explanation
gsub()
performs regex-based replacement.
trimws()
removes hidden whitespace.
These
functions are essential because hidden blanks often cause merge failures.
4. Business Logic & The
“Why”
High-Stakes Political Analytics Scenario
Suppose a
national election intelligence company uses this dataset to identify
top-performing leaders for campaign funding allocation.
Now
imagine:
- Negative popularity values
remain uncorrected
- Missing years treated
incorrectly
- Duplicate CM records inflate
rankings
A Chief
Minister could appear less popular than reality, causing incorrect strategic
decisions.
The Missing Value Trap in SAS
In SAS,
missing numeric values are treated as smaller than any valid number.
Example:
data cm_flag;
retain CM_ID CM_Name Party Status Popularity_Score Years_in_Power
Approval_Rating State Region Election_Year Flag;
length Flag $5.;
set cm_clean;
if Approval_Rating < 90 then Flag="NO";
else Flag="YES";
run;
proc print data=cm_flag;
run;
OUTPUT:
| Obs | CM_ID | CM_Name | Party | Status | Popularity_Score | Years_in_Power | Approval_Rating | State | Region | Election_Year | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Narendra Modi | BJP | ACTIVE | 95 | 13 | 98.00 | Gujarat | WEST | 2001 | YES |
| 2 | 102 | Yogi Adityanath | BJP | ACTIVE | 88 | 8 | 92.00 | Uttar Pradesh | NORTH | 2017 | YES |
| 3 | 103 | Unknown CM | DMK | ACTIVE | 85 | 5 | 90.00 | Tamil Nadu | SOUTH | 2021 | YES |
| 4 | 104 | Mamata Banerjee | TMC | ACTIVE | 91 | 15 | 96.00 | West Bengal | EAST | 2011 | YES |
| 5 | 105 | Arvind Kejriwal | AAP | ACTIVE | 89 | 10 | 93.00 | Delhi | NORTH | 2015 | YES |
| 6 | 106 | Nitish Kumar | JDU | ACTIVE | 84 | 20 | 89.00 | Bihar | EAST | 2020 | NO |
| 7 | 107 | Unknown CM | Shiv Sena | ACTIVE | 82 | 4 | 85.00 | Maharashtra | WEST | 2022 | NO |
| 8 | 108 | Kcr | BRS | ACTIVE | . | 10 | 91.00 | Telangana | SOUTH | 2014 | YES |
| 9 | 109 | Pinarayi Vijayan | LDF | ACTIVE | 86 | 9 | 90.00 | Kerala | SOUTH | 2021 | YES |
| 10 | 110 | Shivraj Singh | BJP | ACTIVE | 87 | 17 | 94.00 | Madhya Pradesh | CENTRAL | 2005 | YES |
| 11 | 111 | N Chandrababu Naidu | TDP | ACTIVE | 90 | 14 | 95.00 | Andhra Pradesh | SOUTH | 2014 | YES |
If
Approval_Rating is missing, SAS still evaluates it as smaller than 90.
This can
accidentally classify missing candidates as low-performing.
Correct
logic:
if not missing(Approval_Rating)
and Approval_Rating < 90 then Flag='LOW';
This is
critical in:
- Loan approvals
- Clinical trial eligibility
- Government subsidies
- Insurance risk scoring
5. Key Points of
Implementation
- Always define LENGTH before
transformations.
- Never trust raw imported
data.
- Standardize text before
joins.
- Remove duplicates
immediately.
- Validate dates rigorously.
- Use PROC DATASETS for
metadata optimization.
- Avoid hardcoding whenever
possible.
- Document every
transformation.
- Maintain audit trails.
- Use consistent naming
conventions.
- Validate NULL and blank
values separately.
- Use SELECT-WHEN for
categorical mapping.
- Apply formats
professionally.
- Separate raw and cleaned
layers.
- Never overwrite source data.
- Use ABS carefully with
business approval.
- Validate missing numeric
logic in SAS.
- Optimize sorting operations.
- Build reusable macros.
- Always cross-check SAS vs R
outputs.
Aggregation with PROC SUMMARY
proc summary data=cm_clean nway;
class Region;
var Popularity_Score Approval_Rating;
output out=region_summary
mean=Avg_Popularity Avg_Approval;
run;
proc print data=region_summary;
run;
OUTPUT:
| Obs | Region | _TYPE_ | _FREQ_ | Avg_Popularity | Avg_Approval |
|---|---|---|---|---|---|
| 1 | CENTRAL | 1 | 1 | 87.0 | 94.00 |
| 2 | EAST | 1 | 2 | 87.5 | 92.50 |
| 3 | NORTH | 1 | 2 | 88.5 | 92.50 |
| 4 | SOUTH | 1 | 4 | 87.0 | 91.50 |
| 5 | WEST | 1 | 2 | 88.5 | 91.50 |
Explanation
PROC
SUMMARY efficiently aggregates regional political metrics.
Used
heavily in:
- Clinical trials
- Sales analytics
- Government dashboards
Professional Reporting
proc report data=region_summary nowd;
columns Region Avg_Popularity Avg_Approval;
define Region / group;
define Avg_Popularity / analysis;
define Avg_Approval / analysis;
run;
OUTPUT:
| Region | Public Popularity Score | Avg_Approval |
|---|---|---|
| CENTRAL | 87 | 94.00 |
| EAST | 87.5 | 92.50 |
| NORTH | 88.5 | 92.50 |
| SOUTH | 87 | 91.50 |
| WEST | 88.5 | 91.50 |
Explanation
PROC
REPORT produces enterprise-grade formatted reports suitable for executives and
regulators.
6. 20 Additional Data
Cleaning Best Practices
- Maintain SDTM traceability.
- Preserve raw datasets
untouched.
- Track derivation logic.
- Validate controlled
terminology.
- Audit every transformation
step.
- Standardize date formats.
- Validate key uniqueness.
- Use metadata-driven
programming.
- Store validation logs.
- Reconcile SAS and R outputs.
- Use version control.
- Validate clinical ranges.
- Flag impossible values.
- Avoid silent truncation.
- Apply consistent coding
standards.
- Test macros independently.
- Use reusable validation
functions.
- Separate business rules from
code.
- Maintain reviewer comments.
- Ensure regulatory
reproducibility.
7. Business Logic Behind
Data Cleaning
Data
cleaning is not cosmetic it directly affects business outcomes. Missing values
can distort statistical models and produce inaccurate forecasting. For example,
if patient age is missing in a clinical trial, the patient may incorrectly
qualify for eligibility criteria. Similarly, negative salary values in a loan
approval system could incorrectly reduce debt-risk calculations.
Date
correction is equally critical. Invalid dates disrupt trend analysis, survival
analysis, and forecasting models. Standardized dates ensure proper sequencing
of events.
Replacing
missing values improves analytical continuity. However, imputation must follow
documented business rules. Blind replacement introduces analytical bias.
Normalization
techniques such as ABS() are used carefully. While converting negative revenue
to positive may correct import errors, doing so without validation could hide
fraud.
In
political analytics, incorrect popularity metrics can distort election
strategies. In healthcare, the same issue could affect patient safety.
Therefore, professional data cleaning combines technical skill with domain
understanding.
8. 20 Key Points — Sharp
& Impactful
- Dirty data destroys trust.
- Standardization improves
reproducibility.
- Missing values can silently
corrupt logic.
- PROC DATASETS improves
efficiency.
- DATA STEP offers procedural
flexibility.
- PROC SQL simplifies
relational logic.
- R excels in modern
wrangling.
- Duplicates inflate
analytics.
- LENGTH prevents truncation
disasters.
- Validation is not optional.
- Documentation saves
projects.
- Audit trails protect
organizations.
- Metadata matters.
- Business rules drive
cleaning logic.
- Efficient filtering improves
scalability.
- Sorting impacts performance.
- Text normalization prevents
merge failures.
- Date conversion requires
precision.
- Enterprise SAS prioritizes
reproducibility.
- Clean data creates reliable
intelligence.
9. Summary
This
project demonstrated how enterprise-grade data engineering transforms chaotic
political datasets into trustworthy analytical assets. Using a dataset of
India’s most famous Chief Ministers from 2000–2026, we intentionally introduced
real-world errors including duplicates, invalid years, missing values,
inconsistent capitalization, and structural anomalies.
Through
SAS, we explored the power of:
- PROC DATASETS for metadata
optimization
- DATA STEP for procedural
transformations
- PROC SORT for deduplication
- PROC SUMMARY and REPORT for
professional reporting
- PROC SQL for relational
analytics
We also
replicated the same business logic using R and tidyverse packages,
demonstrating modern data wrangling approaches using mutate(), case_when(), replace_na(),
and regex functions.
A major
lesson from this project is that data cleaning is deeply tied to business
logic. Poorly handled missing values can incorrectly qualify candidates,
distort forecasting, or create regulatory risks.
SAS
remains exceptionally strong for enterprise governance, auditability, and
large-scale regulated analytics. R provides flexibility, readability, and rapid
exploratory transformations.
Together,
SAS and R create a highly scalable analytical ecosystem capable of supporting
banking, clinical research, political intelligence, and government analytics.
The true
value of analytics begins not with visualization but with disciplined,
reproducible, validated data engineering.
10. Conclusion
Modern
analytics projects fail more often because of poor data quality than poor
algorithms. Organizations invest millions into dashboards, AI systems, and
predictive engines, yet overlook the foundational layer that powers everything:
structured, validated, governed datasets.
This
project showed how even a seemingly simple political dataset can contain dangerous
inconsistencies capable of damaging enterprise decision-making. Duplicate IDs,
inconsistent state names, missing popularity scores, invalid dates, and
truncation errors may appear small individually, but together they create
analytical chaos.
Using SAS
and R together provides an exceptionally powerful framework for solving these
issues professionally.
SAS
offers:
- Regulatory-grade
reproducibility
- High-performance enterprise
processing
- Advanced metadata control
- Stable auditability
R
contributes:
- Flexible transformation
pipelines
- Elegant text handling
- Rapid exploratory workflows
- Modern functional
programming patterns
The
combination allows analysts to build scalable, production-ready cleaning
frameworks capable of handling healthcare, finance, government, and political
analytics.
One of
the most important lessons is that cleaning data is not merely technical work.
Every correction must align with business logic. Converting negative values,
imputing dates, replacing NULLs, and standardizing text all require contextual
understanding.
Professional
analytics teams do not simply “fix data.” They engineer trust.
In the
real world, executives, regulators, researchers, and citizens make decisions
based on analytical outputs. Therefore, data engineers and SAS programmers
carry enormous responsibility.
Clean
data is not just operational hygiene.
It is the
foundation of reliable intelligence, ethical decision-making, and scalable
analytical success.
11. Interview Questions
& Answers
1. Why would you use PROC DATASETS instead of a
DATA STEP?
Answer:
PROC
DATASETS modifies metadata directly without rewriting observations. It is
faster and more memory-efficient for renaming, labeling, formatting, and
deleting datasets.
2. Explain the truncation trap in SAS.
Answer:
If LENGTH
is not declared before assignment logic, SAS may allocate insufficient variable
storage, permanently truncating character values.
3. When would you prefer SELECT-WHEN over IF-THEN?
Answer:
SELECT-WHEN
is preferred for multiple categorical mappings because it improves readability
and execution efficiency.
4. What is the SAS missing numeric trap?
Answer:
SAS
treats missing numeric values as smaller than all valid numbers. This can
incorrectly qualify records unless explicit missing checks are added.
5. Compare PROC SQL and DATA STEP.
Answer:
|
PROC
SQL |
DATA
STEP |
|
Declarative |
Procedural |
|
Better
for joins |
Better
for row-wise logic |
|
Easier
aggregation |
Better
sequential processing |
|
SQL
standard syntax |
SAS-native
flexibility |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent CM DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment