Kings, Kingdoms & Knowledge Engineering: Converting Dirty Historical Data into High-Trust SAS and R Analytical Assets
Famous Kings in India Data into Analysis-Ready SAS and R Pipelines for Enterprise Reporting Excellence
Introduction
In
enterprise analytics, data rarely arrives in a clean, standardized,
regulatory-ready format. Whether working in banking, insurance, retail, or
clinical research, organizations constantly battle corrupted records, duplicate
identifiers, invalid dates, inconsistent categories, malformed text, and
missing values. As experienced Clinical SAS Programmers and Data Scientists
know, poor-quality data silently destroys dashboards, machine learning
predictions, executive decisions, and regulatory submissions.
Imagine a
healthcare analytics company supporting a historical genomic research project
studying hereditary diseases across royal bloodlines of famous Indian kings.
The raw dataset contains duplicate king IDs, impossible reign years, negative
treasury amounts, malformed researcher emails, inconsistent dynasty labels, mixed-case
text, corrupted timestamps, and missing region codes. A single invalid record
incorrectly associates a king with the wrong dynasty lineage, causing
downstream statistical models to fail validation.
This is
not merely a technical inconvenience.
In SDTM/ADaM
environments, corrupted data can trigger FDA rejection risks, audit findings,
inaccurate survival analysis, incorrect treatment arm assignment, and broken
traceability. In banking, dirty data misclassifies fraud-risk customers. In
insurance, claim-processing systems reject valid claims because of malformed
categorical fields. In retail analytics, duplicate transactions inflate revenue
metrics and mislead executive strategy.
The
difference between chaos and trusted analytical intelligence is structured data
engineering.
In this
project, we will create a corrupted “FAMOUS KINGS IN INDIA” dataset with
intentional enterprise-grade data quality problems and transform it into an
analysis-ready professional dataset using both SAS and R. We will compare PROC
SQL vs DATA Step methodologies while demonstrating enterprise validation
frameworks, reusable macros, professional reporting workflows, and
regulatory-grade cleaning strategies.
Raw Enterprise Dataset
Creation in SAS
Business Problem
The
historical analytics division receives operational data collected from multiple
archives. The files contain:
- Duplicate King IDs
- Invalid reign dates
- Mixed dynasty spellings
- Missing region values
- Corrupted email addresses
- Negative treasury values
- Invalid category labels
- Mixed character/numeric
corruption
- Whitespace contamination
- NULL string values
SAS Raw Dataset Creation
data kings_raw;
length King_ID $12 King_Name $40 Dynasty $30 Region $15
Researcher_Email $60 Kingdom_Type $20 Treasury 8
Reign_Start $20 Reign_End $20 Notes $50;
infile datalines dlm='|' truncover;
input King_ID $ King_Name $ Dynasty $ Region $ Researcher_Email $
Kingdom_Type $ Treasury Reign_Start $ Reign_End $ Notes $;
datalines;
K001|ashoka |Maurya|north|ashoka.gmail.com|Empire|500000|268BC|232BC|NULL
K002|CHANDRAGUPTA maurya|Maurya |NORTH |chandra@royal|Empire|-20000|321BC|297BC|duplicate
K003|Samudragupta|gupta|south|samudra@gmail.com|kingdom|750000|335BC|380BC|valid
K004|Harsha|null|east|harsha@@mail.com|KINGDOM|450000|606AD|647AD|missing dynasty
K005|Krishnadevaraya|Vijayanagara|south |krishna.com|Empire|900000|1509AD|1529AD|bad email
K005|Krishnadevaraya|Vijayanagara|south |krishna.com|Empire|900000|1509AD|1529AD|duplicate
K006|Rajaraja Chola|CHOLA|SOUTH|raja@gmail.com|empire|850000|985AD|1014AD|valid
K007|Tipu Sultan|Mysore|NULL|tipu@gmail|Sultanate|650000|1782AD|1799AD|bad region
K008|Shivaji|Maratha|west|shivaji@gmail.com|EMPIRE|9999999|1674AD|1680AD|valid
K009|Akbar|Mughal|north|akbarmail.com|Empire|-100|1556AD|1605AD|negative treasury
K010|NULL|Mughal|north|admin@gmail.com|Empire|400000|1605AD|1627AD|missing king
;
run;
proc print data = kings_raw;
run;
OUTPUT:
| Obs | King_ID | King_Name | Dynasty | Region | Researcher_Email | Kingdom_Type | Treasury | Reign_Start | Reign_End | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | K001 | ashoka | Maurya | north | ashoka.gmail.com | Empire | 500000 | 268BC | 232BC | NULL |
| 2 | K002 | CHANDRAGUPTA maurya | Maurya | NORTH | chandra@royal | Empire | -20000 | 321BC | 297BC | duplicate |
| 3 | K003 | Samudragupta | gupta | south | samudra@gmail.com | kingdom | 750000 | 335BC | 380BC | valid |
| 4 | K004 | Harsha | null | east | harsha@@mail.com | KINGDOM | 450000 | 606AD | 647AD | missing dynasty |
| 5 | K005 | Krishnadevaraya | Vijayanagara | south | krishna.com | Empire | 900000 | 1509AD | 1529AD | bad email |
| 6 | K005 | Krishnadevaraya | Vijayanagara | south | krishna.com | Empire | 900000 | 1509AD | 1529AD | duplicate |
| 7 | K006 | Rajaraja Chola | CHOLA | SOUTH | raja@gmail.com | empire | 850000 | 985AD | 1014AD | valid |
| 8 | K007 | Tipu Sultan | Mysore | NULL | tipu@gmail | Sultanate | 650000 | 1782AD | 1799AD | bad region |
| 9 | K008 | Shivaji | Maratha | west | shivaji@gmail.com | EMPIRE | 9999999 | 1674AD | 1680AD | valid |
| 10 | K009 | Akbar | Mughal | north | akbarmail.com | Empire | -100 | 1556AD | 1605AD | negative treasury |
| 11 | K010 | NULL | Mughal | north | admin@gmail.com | Empire | 400000 | 1605AD | 1627AD | missing king |
Explanation and Key Points
This
dataset intentionally mimics real enterprise corruption scenarios. The LENGTH statement
appears before assignments because SAS determines variable storage length
during compilation. If LENGTH is declared later, character truncation may
silently occur. For example, “Vijayanagara” could be truncated if Dynasty were
initialized with smaller lengths earlier.
The
INFILE and DATALINES combination simulates operational flat-file ingestion
pipelines commonly used in healthcare and banking systems. TRUNCOVER prevents
SAS from reading the next line when fields are incomplete.
Notice
the intentional corruption:
- NULL strings
- malformed emails
- inconsistent
uppercase/lowercase text
- duplicate IDs
- negative treasury values
These
issues directly affect analytical reliability and reporting accuracy.
Unlike
SAS fixed-length character allocation, R dynamically handles character vectors,
reducing truncation risks but increasing runtime memory flexibility tradeoffs.
FAMOUS KINGS IN INDIA Raw Dataset in R
library(tibble)
kings_raw <- tribble(
~King_ID, ~King_Name, ~Dynasty, ~Region, ~Researcher_Email, ~Kingdom_Type, ~Treasury, ~Reign_Start, ~Reign_End, ~Notes,
"K001", "ashoka ", "Maurya", "north", "ashoka.gmail.com", "Empire", 500000, "268BC", "232BC", "NULL",
"K002", "CHANDRAGUPTA maurya", "Maurya ", "NORTH ", "chandra@royal", "Empire", -20000, "321BC", "297BC", "duplicate",
"K003", "Samudragupta", "gupta", "south", "samudra@gmail.com", "kingdom", 750000, "335BC", "380BC", "valid",
"K004", "Harsha", "null", "east", "harsha@@mail.com", "KINGDOM", 450000, "606AD", "647AD", "missing dynasty",
"K005", "Krishnadevaraya", "Vijayanagara", "south ", "krishna.com", "Empire", 900000, "1509AD", "1529AD", "bad email",
"K005", "Krishnadevaraya", "Vijayanagara", "south ", "krishna.com", "Empire", 900000, "1509AD", "1529AD", "duplicate",
"K006", "Rajaraja Chola", "CHOLA", "SOUTH", "raja@gmail.com", "empire", 850000, "985AD", "1014AD", "valid",
"K007", "Tipu Sultan", "Mysore", "NULL", "tipu@gmail", "Sultanate", 650000, "1782AD", "1799AD", "bad region",
"K008", "Shivaji", "Maratha", "west", "shivaji@gmail.com", "EMPIRE", 9999999, "1674AD", "1680AD", "valid",
"K009", "Akbar", "Mughal", "north", "akbarmail.com", "Empire", -100, "1556AD", "1605AD", "negative treasury",
"K010", "NULL", "Mughal", "north", "admin@gmail.com", "Empire", 400000, "1605AD", "1627AD", "missing king"
)
OUTPUT:
|
|
King_ID |
King_Name |
Dynasty |
Region |
Researcher_Email |
Kingdom_Type |
Treasury |
Reign_Start |
Reign_End |
Notes |
|
1 |
K001 |
ashoka |
Maurya |
north |
ashoka.gmail.com |
Empire |
500000 |
268BC |
232BC |
NULL |
|
2 |
K002 |
CHANDRAGUPTA maurya |
Maurya |
NORTH |
chandra@royal |
Empire |
-20000 |
321BC |
297BC |
duplicate |
|
3 |
K003 |
Samudragupta |
gupta |
south |
samudra@gmail.com |
kingdom |
750000 |
335BC |
380BC |
valid |
|
4 |
K004 |
Harsha |
null |
east |
harsha@@mail.com |
KINGDOM |
450000 |
606AD |
647AD |
missing dynasty |
|
5 |
K005 |
Krishnadevaraya |
Vijayanagara |
south |
krishna.com |
Empire |
900000 |
1509AD |
1529AD |
bad email |
|
6 |
K005 |
Krishnadevaraya |
Vijayanagara |
south |
krishna.com |
Empire |
900000 |
1509AD |
1529AD |
duplicate |
|
7 |
K006 |
Rajaraja Chola |
CHOLA |
SOUTH |
raja@gmail.com |
empire |
850000 |
985AD |
1014AD |
valid |
|
8 |
K007 |
Tipu Sultan |
Mysore |
NULL |
tipu@gmail |
Sultanate |
650000 |
1782AD |
1799AD |
bad region |
|
9 |
K008 |
Shivaji |
Maratha |
west |
shivaji@gmail.com |
EMPIRE |
9999999 |
1674AD |
1680AD |
valid |
|
10 |
K009 |
Akbar |
Mughal |
north |
akbarmail.com |
Empire |
-100 |
1556AD |
1605AD |
negative treasury |
|
11 |
K010 |
NULL |
Mughal |
north |
admin@gmail.com |
Empire |
400000 |
1605AD |
1627AD |
missing king |
Explanation and Key Points
This R
program creates the same corrupted enterprise-style dataset that was originally
developed in SAS. Here, the tribble() function from the tibble package is used
because it provides a cleaner and more readable row-wise dataset structure,
especially for interview demonstrations and enterprise prototypes.
The
dataset intentionally contains multiple data-quality issues:
- Duplicate King_ID values (K005)
- Negative treasury amounts
- Mixed uppercase/lowercase
text
- Invalid email formats
- NULL string contamination
- Whitespace corruption
- Inconsistent categorical
labels
PROC CONTENTS Validation
proc contents data=kings_raw;
run;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.KINGS_RAW | Observations | 11 |
|---|---|---|---|
| Member Type | DATA | Variables | 10 |
| Engine | V9 | Indexes | 0 |
| Created | 05/20/2026 07:43:04 | Observation Length | 280 |
| Last Modified | 05/20/2026 07:43:04 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 467 |
| Obs in First Data Page | 11 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workDFD200001319_odaws01-apse1-2.oda.sas.com/SAS_work140F00001319_odaws01-apse1-2.oda.sas.com/kings_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 134352353 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Alphabetic List of Variables and Attributes | |||
|---|---|---|---|
| # | Variable | Type | Len |
| 3 | Dynasty | Char | 30 |
| 1 | King_ID | Char | 12 |
| 2 | King_Name | Char | 40 |
| 6 | Kingdom_Type | Char | 20 |
| 10 | Notes | Char | 50 |
| 4 | Region | Char | 15 |
| 9 | Reign_End | Char | 20 |
| 8 | Reign_Start | Char | 20 |
| 5 | Researcher_Email | Char | 60 |
| 7 | Treasury | Num | 8 |
Explanation and Key Points
PROC
CONTENTS is one of the most underestimated validation procedures in SAS. Before
cleaning begins, programmers must inspect variable types, lengths, formats, and
informats.
In
regulated environments, incorrect variable type assignments cause catastrophic
downstream failures:
- Character-to-numeric
conversion issues
- Broken joins
- Invalid derivations
- Reporting mismatches
For
example:
- Treasury must remain numeric
- Reign_Start must eventually
standardize into analyzable date formats
- King_Name requires enough
storage length
Clinical
trial programmers frequently use PROC CONTENTS during SDTM validation because
metadata mismatches trigger Define.xml inconsistencies and FDA rejection risks.
Enterprise Data Cleaning Using DATA Step
data kings_clean;
retain Source_System "ROYAL_ANALYTICS";
set kings_raw;
King_Name = propcase(strip(King_Name));
Dynasty = upcase(strip(Dynasty));
Region = upcase(strip(Region));
King_Name = tranwrd(King_Name,"Null","");
Dynasty = tranwrd(Dynasty,"NULL","UNKNOWN");
Treasury = abs(Treasury);
if missing(King_Name) then King_Name="UNKNOWN";
if Region="NULL" then Region="UNKNOWN";
if index(Researcher_Email,'@')=0 then
Researcher_Email="INVALID_EMAIL";
Treasury = round(Treasury,.01);
King_Full = catx('-',King_ID,King_Name);
Length_Flag = length(King_Name);
if Length_Flag < 7 then Flag='CHECK';
drop Length_Flag;
run;
proc print data = kings_clean;
run;
OUTPUT:
| Obs | Source_System | King_ID | King_Name | Dynasty | Region | Researcher_Email | Kingdom_Type | Treasury | Reign_Start | Reign_End | Notes | King_Full | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ROYAL_ANALYTICS | K001 | Ashoka | MAURYA | NORTH | INVALID_EMAIL | Empire | 500000 | 268BC | 232BC | NULL | K001-Ashoka | CHECK |
| 2 | ROYAL_ANALYTICS | K002 | Chandragupta Maurya | MAURYA | NORTH | chandra@royal | Empire | 20000 | 321BC | 297BC | duplicate | K002-Chandragupta Maurya | |
| 3 | ROYAL_ANALYTICS | K003 | Samudragupta | GUPTA | SOUTH | samudra@gmail.com | kingdom | 750000 | 335BC | 380BC | valid | K003-Samudragupta | |
| 4 | ROYAL_ANALYTICS | K004 | Harsha | UNKNOWN | EAST | harsha@@mail.com | KINGDOM | 450000 | 606AD | 647AD | missing dynasty | K004-Harsha | CHECK |
| 5 | ROYAL_ANALYTICS | K005 | Krishnadevaraya | VIJAYANAGARA | SOUTH | INVALID_EMAIL | Empire | 900000 | 1509AD | 1529AD | bad email | K005-Krishnadevaraya | |
| 6 | ROYAL_ANALYTICS | K005 | Krishnadevaraya | VIJAYANAGARA | SOUTH | INVALID_EMAIL | Empire | 900000 | 1509AD | 1529AD | duplicate | K005-Krishnadevaraya | |
| 7 | ROYAL_ANALYTICS | K006 | Rajaraja Chola | CHOLA | SOUTH | raja@gmail.com | empire | 850000 | 985AD | 1014AD | valid | K006-Rajaraja Chola | |
| 8 | ROYAL_ANALYTICS | K007 | Tipu Sultan | MYSORE | UNKNOWN | tipu@gmail | Sultanate | 650000 | 1782AD | 1799AD | bad region | K007-Tipu Sultan | |
| 9 | ROYAL_ANALYTICS | K008 | Shivaji | MARATHA | WEST | shivaji@gmail.com | EMPIRE | 9999999 | 1674AD | 1680AD | valid | K008-Shivaji | |
| 10 | ROYAL_ANALYTICS | K009 | Akbar | MUGHAL | NORTH | INVALID_EMAIL | Empire | 100 | 1556AD | 1605AD | negative treasury | K009-Akbar | CHECK |
| 11 | ROYAL_ANALYTICS | K010 | UNKNOWN | MUGHAL | NORTH | admin@gmail.com | Empire | 400000 | 1605AD | 1627AD | missing king | K010-UNKNOWN |
Explanation and Key Points
This DATA
step demonstrates enterprise-grade transformation logic.
Key
techniques include:
- RETAIN preserves Source_System
metadata
- PROPCASE standardizes
mixed-case names
- STRIP removes whitespace
contamination
- UPCASE harmonizes
categorical variables
- TRANWRD replaces corrupted
text
- ABS corrects negative
treasury values
- INDEX validates malformed
emails
- CATX creates traceable
composite identifiers
In
healthcare submissions, similar logic standardizes treatment names, adverse
event categories, and investigator records.
One
critical SAS behavior:
Missing numeric values are treated as smaller than valid numbers. This can
silently distort filtering logic:
if amount < 100;
This
statement includes missing values unless explicitly excluded.
That
behavior is a major enterprise risk during patient eligibility screening or
fraud analytics.
Removing Duplicates Using PROC SORT
proc sort data=kings_clean nodupkey out=kings_nodup;
by King_ID;
run;
proc print data = kings_nodup;
run;
LOG:
OUTPUT:
| Obs | Source_System | King_ID | King_Name | Dynasty | Region | Researcher_Email | Kingdom_Type | Treasury | Reign_Start | Reign_End | Notes | King_Full | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ROYAL_ANALYTICS | K001 | Ashoka | MAURYA | NORTH | INVALID_EMAIL | Empire | 500000 | 268BC | 232BC | NULL | K001-Ashoka | CHECK |
| 2 | ROYAL_ANALYTICS | K002 | Chandragupta Maurya | MAURYA | NORTH | chandra@royal | Empire | 20000 | 321BC | 297BC | duplicate | K002-Chandragupta Maurya | |
| 3 | ROYAL_ANALYTICS | K003 | Samudragupta | GUPTA | SOUTH | samudra@gmail.com | kingdom | 750000 | 335BC | 380BC | valid | K003-Samudragupta | |
| 4 | ROYAL_ANALYTICS | K004 | Harsha | UNKNOWN | EAST | harsha@@mail.com | KINGDOM | 450000 | 606AD | 647AD | missing dynasty | K004-Harsha | CHECK |
| 5 | ROYAL_ANALYTICS | K005 | Krishnadevaraya | VIJAYANAGARA | SOUTH | INVALID_EMAIL | Empire | 900000 | 1509AD | 1529AD | bad email | K005-Krishnadevaraya | |
| 6 | ROYAL_ANALYTICS | K006 | Rajaraja Chola | CHOLA | SOUTH | raja@gmail.com | empire | 850000 | 985AD | 1014AD | valid | K006-Rajaraja Chola | |
| 7 | ROYAL_ANALYTICS | K007 | Tipu Sultan | MYSORE | UNKNOWN | tipu@gmail | Sultanate | 650000 | 1782AD | 1799AD | bad region | K007-Tipu Sultan | |
| 8 | ROYAL_ANALYTICS | K008 | Shivaji | MARATHA | WEST | shivaji@gmail.com | EMPIRE | 9999999 | 1674AD | 1680AD | valid | K008-Shivaji | |
| 9 | ROYAL_ANALYTICS | K009 | Akbar | MUGHAL | NORTH | INVALID_EMAIL | Empire | 100 | 1556AD | 1605AD | negative treasury | K009-Akbar | CHECK |
| 10 | ROYAL_ANALYTICS | K010 | UNKNOWN | MUGHAL | NORTH | admin@gmail.com | Empire | 400000 | 1605AD | 1627AD | missing king | K010-UNKNOWN |
Explanation and Key Points
PROC SORT
NODUPKEY is essential in enterprise deduplication frameworks.
Duplicate
records create:
- inflated revenue metrics
- duplicate patient enrollment
- repeated insurance claims
- inaccurate adverse-event
counts
In SDTM
domains, duplicate USUBJID values may invalidate entire submission packages.
The BY
statement determines uniqueness logic. Here, King_ID acts as the enterprise
business key.
Always
validate deduplication carefully because aggressive removal can unintentionally
discard valid longitudinal observations.
PROC FORMAT Standardization
proc format;
value $regionfmt 'NORTH'='Northern India'
'SOUTH'='Southern India'
'EAST' ='Eastern India'
'WEST' ='Western India'
'UNKNOWN'='Unknown Region';
run;
LOG:
NOTE: Format $REGIONFMT has been output.
Explanation and Key Points
PROC
FORMAT centralizes metadata governance.
Benefits
include:
- reusable enterprise
standards
- reporting consistency
- regulatory traceability
- easier maintenance
Instead
of repeatedly hardcoding logic across programs, formats create reusable
controlled terminology layers similar to CDISC controlled terminology.
In
clinical systems:
- gender mappings
- treatment codes
- severity classifications
- laboratory categories
are
frequently standardized using PROC FORMAT.
PROC SQL vs DATA Step Merge
proc sql;
create table kingdom_summary as
select Dynasty,
count(*) as Total_Kings,
avg(Treasury) as Avg_Treasury
from kings_nodup
group by Dynasty;
quit;
proc print data = kingdom_summary;
run;
OUTPUT:
| Obs | Dynasty | Total_Kings | Avg_Treasury |
|---|---|---|---|
| 1 | CHOLA | 1 | 850000 |
| 2 | GUPTA | 1 | 750000 |
| 3 | MARATHA | 1 | 9999999 |
| 4 | MAURYA | 2 | 260000 |
| 5 | MUGHAL | 2 | 200050 |
| 6 | MYSORE | 1 | 650000 |
| 7 | UNKNOWN | 1 | 450000 |
| 8 | VIJAYANAGARA | 1 | 900000 |
Explanation and Key Points
PROC SQL
provides relational flexibility and concise aggregation syntax.
Advantages:
- easier joins
- grouped summaries
- dynamic queries
- database compatibility
DATA Step
excels in row-wise sequential processing and FIRST./LAST. logic.
PROC SQL
resembles modern R dplyr summarization pipelines, making it intuitive for
analysts transitioning between SAS and R ecosystems.
FIRST./LAST. Processing
proc sort data=kings_nodup;
by Dynasty;
run;
proc print data = kings_nodup;
run;
OUTPUT:
| Obs | Source_System | King_ID | King_Name | Dynasty | Region | Researcher_Email | Kingdom_Type | Treasury | Reign_Start | Reign_End | Notes | King_Full | Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ROYAL_ANALYTICS | K006 | Rajaraja Chola | CHOLA | SOUTH | raja@gmail.com | empire | 850000 | 985AD | 1014AD | valid | K006-Rajaraja Chola | |
| 2 | ROYAL_ANALYTICS | K003 | Samudragupta | GUPTA | SOUTH | samudra@gmail.com | kingdom | 750000 | 335BC | 380BC | valid | K003-Samudragupta | |
| 3 | ROYAL_ANALYTICS | K008 | Shivaji | MARATHA | WEST | shivaji@gmail.com | EMPIRE | 9999999 | 1674AD | 1680AD | valid | K008-Shivaji | |
| 4 | ROYAL_ANALYTICS | K001 | Ashoka | MAURYA | NORTH | INVALID_EMAIL | Empire | 500000 | 268BC | 232BC | NULL | K001-Ashoka | CHECK |
| 5 | ROYAL_ANALYTICS | K002 | Chandragupta Maurya | MAURYA | NORTH | chandra@royal | Empire | 20000 | 321BC | 297BC | duplicate | K002-Chandragupta Maurya | |
| 6 | ROYAL_ANALYTICS | K009 | Akbar | MUGHAL | NORTH | INVALID_EMAIL | Empire | 100 | 1556AD | 1605AD | negative treasury | K009-Akbar | CHECK |
| 7 | ROYAL_ANALYTICS | K010 | UNKNOWN | MUGHAL | NORTH | admin@gmail.com | Empire | 400000 | 1605AD | 1627AD | missing king | K010-UNKNOWN | |
| 8 | ROYAL_ANALYTICS | K007 | Tipu Sultan | MYSORE | UNKNOWN | tipu@gmail | Sultanate | 650000 | 1782AD | 1799AD | bad region | K007-Tipu Sultan | |
| 9 | ROYAL_ANALYTICS | K004 | Harsha | UNKNOWN | EAST | harsha@@mail.com | KINGDOM | 450000 | 606AD | 647AD | missing dynasty | K004-Harsha | CHECK |
| 10 | ROYAL_ANALYTICS | K005 | Krishnadevaraya | VIJAYANAGARA | SOUTH | INVALID_EMAIL | Empire | 900000 | 1509AD | 1529AD | bad email | K005-Krishnadevaraya |
data dynasty_rank;
set kings_nodup;
by Dynasty;
if first.Dynasty then Rank=1;
else Rank+1;
format Region regionfmt.;
run;
proc print data = dynasty_rank;
run;
OUTPUT:
| Obs | Source_System | King_ID | King_Name | Dynasty | Region | Researcher_Email | Kingdom_Type | Treasury | Reign_Start | Reign_End | Notes | King_Full | Flag | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ROYAL_ANALYTICS | K006 | Rajaraja Chola | CHOLA | Southern India | raja@gmail.com | empire | 850000 | 985AD | 1014AD | valid | K006-Rajaraja Chola | 1 | |
| 2 | ROYAL_ANALYTICS | K003 | Samudragupta | GUPTA | Southern India | samudra@gmail.com | kingdom | 750000 | 335BC | 380BC | valid | K003-Samudragupta | 1 | |
| 3 | ROYAL_ANALYTICS | K008 | Shivaji | MARATHA | Western India | shivaji@gmail.com | EMPIRE | 9999999 | 1674AD | 1680AD | valid | K008-Shivaji | 1 | |
| 4 | ROYAL_ANALYTICS | K001 | Ashoka | MAURYA | Northern India | INVALID_EMAIL | Empire | 500000 | 268BC | 232BC | NULL | K001-Ashoka | CHECK | 1 |
| 5 | ROYAL_ANALYTICS | K002 | Chandragupta Maurya | MAURYA | Northern India | chandra@royal | Empire | 20000 | 321BC | 297BC | duplicate | K002-Chandragupta Maurya | 2 | |
| 6 | ROYAL_ANALYTICS | K009 | Akbar | MUGHAL | Northern India | INVALID_EMAIL | Empire | 100 | 1556AD | 1605AD | negative treasury | K009-Akbar | CHECK | 1 |
| 7 | ROYAL_ANALYTICS | K010 | UNKNOWN | MUGHAL | Northern India | admin@gmail.com | Empire | 400000 | 1605AD | 1627AD | missing king | K010-UNKNOWN | 2 | |
| 8 | ROYAL_ANALYTICS | K007 | Tipu Sultan | MYSORE | Unknown Region | tipu@gmail | Sultanate | 650000 | 1782AD | 1799AD | bad region | K007-Tipu Sultan | 1 | |
| 9 | ROYAL_ANALYTICS | K004 | Harsha | UNKNOWN | Eastern India | harsha@@mail.com | KINGDOM | 450000 | 606AD | 647AD | missing dynasty | K004-Harsha | CHECK | 1 |
| 10 | ROYAL_ANALYTICS | K005 | Krishnadevaraya | VIJAYANAGARA | Southern India | INVALID_EMAIL | Empire | 900000 | 1509AD | 1529AD | bad email | K005-Krishnadevaraya | 1 |
Explanation and Key Points
FIRST./LAST.
processing is one of SAS’s most powerful enterprise features.
It
enables:
- longitudinal tracking
- patient visit sequencing
- transaction ranking
- grouped calculations
Healthcare
examples include:
- identifying first dose dates
- last follow-up visits
- treatment discontinuation
logic
R
typically requires grouped mutate() operations to replicate this behavior.
Enterprise Macro Framework
%macro validate(dsname=);
proc freq data=&dsname;
tables Dynasty Region Kingdom_Type / missing;
run;
proc means data=&dsname n nmiss mean min max;
var Treasury;
run;
%mend;
%validate(dsname=kings_nodup);
OUTPUT:
The FREQ Procedure
| Dynasty | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| CHOLA | 1 | 10.00 | 1 | 10.00 |
| GUPTA | 1 | 10.00 | 2 | 20.00 |
| MARATHA | 1 | 10.00 | 3 | 30.00 |
| MAURYA | 2 | 20.00 | 5 | 50.00 |
| MUGHAL | 2 | 20.00 | 7 | 70.00 |
| MYSORE | 1 | 10.00 | 8 | 80.00 |
| UNKNOWN | 1 | 10.00 | 9 | 90.00 |
| VIJAYANAGARA | 1 | 10.00 | 10 | 100.00 |
| Region | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| EAST | 1 | 10.00 | 1 | 10.00 |
| NORTH | 4 | 40.00 | 5 | 50.00 |
| SOUTH | 3 | 30.00 | 8 | 80.00 |
| UNKNOWN | 1 | 10.00 | 9 | 90.00 |
| WEST | 1 | 10.00 | 10 | 100.00 |
| Kingdom_Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| EMPIRE | 1 | 10.00 | 1 | 10.00 |
| Empire | 5 | 50.00 | 6 | 60.00 |
| KINGDOM | 1 | 10.00 | 7 | 70.00 |
| Sultanate | 1 | 10.00 | 8 | 80.00 |
| empire | 1 | 10.00 | 9 | 90.00 |
| kingdom | 1 | 10.00 | 10 | 100.00 |
The MEANS Procedure
| Analysis Variable : Treasury | ||||
|---|---|---|---|---|
| N | N Miss | Mean | Minimum | Maximum |
| 10 | 0 | 1452009.90 | 100.0000000 | 9999999.00 |
Explanation and Key Points
Macros
create reusable enterprise validation frameworks.
Benefits:
- standardization
- scalability
- reproducibility
- automation
Large
pharmaceutical companies maintain macro libraries for:
- SDTM validation
- ADaM derivations
- TLF generation
- QC automation
Macros
reduce coding duplication while improving governance consistency.
R Equivalent Cleaning Workflow
library(dplyr)
library(stringr)
library(janitor)
library(tidyr)
kings_clean <- kings_raw %>%
clean_names() %>%
mutate(
king_name = str_to_title(str_trim(king_name)),
dynasty = str_to_upper(str_trim(dynasty)),
region = str_to_upper(str_trim(region)),
treasury = abs(treasury),
researcher_email =
if_else(grepl("@", researcher_email),
researcher_email,"INVALID_EMAIL"),
dynasty = coalesce(dynasty,"UNKNOWN"),
kingdom_type = case_when(kingdom_type == "EMPIRE" ~ "Empire",
TRUE ~ kingdom_type)
) %>%
distinct(king_id,.keep_all=TRUE)
OUTPUT:
|
|
king_id |
king_name |
dynasty |
region |
researcher_email |
kingdom_type |
treasury |
reign_start |
reign_end |
notes |
|
1 |
K001 |
Ashoka |
MAURYA |
NORTH |
INVALID_EMAIL |
Empire |
500000 |
268BC |
232BC |
NULL |
|
2 |
K002 |
Chandragupta Maurya |
MAURYA |
NORTH |
chandra@royal |
Empire |
20000 |
321BC |
297BC |
duplicate |
|
3 |
K003 |
Samudragupta |
GUPTA |
SOUTH |
samudra@gmail.com |
kingdom |
750000 |
335BC |
380BC |
valid |
|
4 |
K004 |
Harsha |
NULL |
EAST |
harsha@@mail.com |
KINGDOM |
450000 |
606AD |
647AD |
missing dynasty |
|
5 |
K005 |
Krishnadevaraya |
VIJAYANAGARA |
SOUTH |
INVALID_EMAIL |
Empire |
900000 |
1509AD |
1529AD |
bad email |
|
6 |
K006 |
Rajaraja Chola |
CHOLA |
SOUTH |
raja@gmail.com |
empire |
850000 |
985AD |
1014AD |
valid |
|
7 |
K007 |
Tipu Sultan |
MYSORE |
NULL |
tipu@gmail |
Sultanate |
650000 |
1782AD |
1799AD |
bad region |
|
8 |
K008 |
Shivaji |
MARATHA |
WEST |
shivaji@gmail.com |
Empire |
9999999 |
1674AD |
1680AD |
valid |
|
9 |
K009 |
Akbar |
MUGHAL |
NORTH |
INVALID_EMAIL |
Empire |
100 |
1556AD |
1605AD |
negative treasury |
|
10 |
K010 |
Null |
MUGHAL |
NORTH |
admin@gmail.com |
Empire |
400000 |
1605AD |
1627AD |
missing king |
Explanation and Key Points
R
provides flexible vectorized transformations with tidyverse pipelines.
Distinct
is part of the dplyr package. If library(dplyr)
is missing, R cannot recognize the function.
Equivalent
mappings:
- mutate() → SAS assignment
logic
- case_when() → SELECT-WHEN
- distinct() → PROC SORT
NODUPKEY
- coalesce() → COALESCEC
- str_trim() → STRIP
- grepl() → INDEX/FIND
R excels
in:
- exploratory analytics
- rapid transformation
- visualization integration
- flexible string handling
SAS
excels in:
- auditability
- production stability
- controlled metadata
governance
Together,
they create enterprise-grade analytical ecosystems.
Enterprise Validation &
Compliance
In SDTM
and ADaM environments, every transformation must maintain traceability.
Critical
compliance requirements include:
- Audit trails
- Metadata lineage
- QC independence
- Reproducibility
- Controlled terminology
- Regulatory transparency
One
dangerous SAS behavior:
Missing numeric values sort lower than valid numbers.
Example:
if age < 18;
This
unintentionally includes missing ages unless explicitly checked.
Such
logic errors can:
- enroll ineligible patients
- distort efficacy analysis
- invalidate statistical
outputs
Validation
must always include:
- PROC COMPARE
- independent QC programming
- metadata reconciliation
- derivation traceability
Business Logic Behind
Cleaning
Data
cleaning is not cosmetic it directly influences analytical truth.
Suppose a
king’s treasury value appears as -500000. Negative treasury values may
represent corrupted ingestion logic, OCR extraction failures, or accounting
sign reversals. Correcting such values using ABS() ensures financial analytics
remain interpretable.
Missing
dates disrupt trend analysis and duration calculations. Standardized dates
allow INTCK and INTNX calculations to derive reign durations accurately.
Text
normalization matters because “Maurya,” “maurya,” and “MAURYA” represent the
same analytical category. Without standardization, summary reports split
identical categories into separate groups.
Malformed
emails affect operational communication workflows and audit tracking.
Duplicate
records inflate counts and distort statistical summaries.
In
clinical research, similar corrections impact:
- adverse event frequency
- patient exposure
calculations
- treatment compliance
analysis
- endpoint derivations
Cleaning
logic ensures downstream dashboards, AI models, statistical procedures, and
executive decisions are based on trusted analytical intelligence instead of
corrupted operational noise.
20 Enterprise Data-Cleaning
Best Practices
- Always validate metadata
before transformations.
- Define LENGTH before
assignments.
- Standardize controlled
terminology early.
- Never trust source-system
casing.
- Separate raw and cleaned
datasets.
- Maintain immutable raw
archives.
- Use reusable macro
frameworks.
- Validate duplicates before
removal.
- Document all derivation
logic.
- Use QC-independent
programming.
- Centralize PROC FORMAT
mappings.
- Audit missing-value behavior
carefully.
- Validate joins using row
counts.
- Use defensive programming
patterns.
- Maintain enterprise naming
standards.
- Avoid hardcoded business
rules.
- Track lineage from source to
report.
- Standardize date formats
globally.
- Automate validation reports.
- Preserve reproducibility
across environments.
20 Sharp One-Line Insights
- Dirty data creates expensive
business mistakes.
- Standardized variables
improve reproducibility.
- Validation logic is stronger
than visual inspection.
- Duplicate IDs destroy
analytical trust.
- Metadata drives enterprise
consistency.
- Missing dates silently
corrupt timelines.
- PROC FORMAT improves
governance.
- R accelerates exploratory
cleaning.
- SAS dominates regulated
production workflows.
- Character truncation creates
hidden corruption.
- Audit trails matter more
than dashboards.
- QC independence prevents
analytical bias.
- Controlled terminology
improves compliance.
- Data lineage protects
regulatory credibility.
- PROC SQL simplifies
relational summaries.
- DATA Step excels in
sequential processing.
- Macros improve enterprise
scalability.
- Standardization reduces
reporting ambiguity.
- Cleaning logic impacts AI
predictions.
- Reliable analytics begin
with trusted data.
SAS vs R Comparison
|
Feature |
SAS |
R |
|
Auditability |
Excellent |
Moderate |
|
Regulatory
Acceptance |
Industry
Standard |
Growing |
|
Metadata
Governance |
Strong |
Flexible |
|
Scalability |
Excellent |
High |
|
Visualization |
Moderate |
Excellent |
|
Macro
Automation |
Powerful |
Flexible |
|
String
Handling |
Structured |
Dynamic |
|
Validation
Frameworks |
Enterprise-grade |
Customizable |
|
Clinical
Trial Usage |
Dominant |
Supplemental |
|
Learning
Curve |
Moderate |
Moderate-High |
Small Enterprise Validation
Checklist
- Validate duplicates
- Check missing values
- Standardize categorical
variables
- Validate formats/informats
- Reconcile row counts
- Validate joins
- Check date integrity
- Review truncation risks
- Perform QC comparison
- Archive raw datasets
Summary
SAS and R
each play critical roles in enterprise data engineering ecosystems. SAS
provides unmatched auditability, regulatory alignment, metadata governance, and
production stability. This is why pharmaceutical companies continue relying
heavily on SAS for SDTM, ADaM, and TLF workflows.
R offers
extraordinary flexibility, modern string processing, exploratory analytics,
machine learning integration, and visualization capabilities. Its tidyverse
ecosystem dramatically simplifies transformation pipelines and accelerates
exploratory development.
In
enterprise environments, the strongest strategy is rarely SAS versus R.
It is SAS
plus R.
SAS
ensures:
- governance
- validation
- traceability
- reproducibility
R
enables:
- agility
- advanced analytics
- interactive exploration
- modern data science
workflows
Together,
they create scalable, production-grade analytical intelligence systems capable
of supporting healthcare analytics, banking fraud detection, insurance claim
validation, retail forecasting, and regulatory reporting.
The true
value of data cleaning lies not in cosmetic formatting but in preserving
analytical truth, business credibility, and decision-making reliability.
Conclusion
Modern
analytics ecosystems are built on one foundational principle: trusted data
creates trusted decisions.
Without
structured cleaning frameworks, organizations risk catastrophic analytical
failures. Duplicate records inflate business metrics. Missing values distort
predictive models. Corrupted dates break timelines. Invalid categorical
mappings produce misleading dashboards. In regulated industries such as
clinical research, these failures can trigger rejected submissions, audit
findings, patient safety risks, and financial penalties.
This
project demonstrated how a corrupted “FAMOUS KINGS IN INDIA” dataset can be
transformed into analysis-ready intelligence using enterprise SAS and R
workflows. We explored advanced DATA Step engineering, PROC SQL summarization,
reusable macro frameworks, PROC FORMAT governance, FIRST./LAST. processing,
validation logic, and tidyverse-based R transformations.
The
deeper lesson extends beyond syntax.
Professional
data engineering requires:
- governance
- traceability
- reproducibility
- validation independence
- defensive programming
- enterprise metadata
discipline
SAS
remains the gold standard for regulated production environments because of its
auditability, stability, and compliance readiness. R complements SAS through
flexibility, exploratory analytics, advanced transformations, and modern data
science integration.
Organizations
that combine both technologies intelligently achieve:
- scalable reporting
- trusted AI pipelines
- reproducible analytics
- enterprise-grade governance
- high-quality executive
intelligence
Ultimately,
analytical excellence is never accidental.
It is
engineered through disciplined cleaning frameworks, standardized transformation
logic, rigorous validation, and production-ready data architecture.
Reliable
intelligence begins where dirty data ends.
Interview Questions & Answers
1. Explain FIRST. and LAST. processing in
SAS?
Answer
FIRST. and LAST. variables are temporary variables automatically created
during BY-group processing.
Example
proc sort data=sales;by Region;run;
data summary;set sales;by Region;if first.Region then Total=0;Total + Sales;if last.Region;run;
Key
Points
- Used for
grouped calculations.
- Helpful in
cumulative totals.
- Common in
patient visit tracking.
- Efficient
alternative to SQL aggregation.
2. How does SAS treat missing numeric
values?
Answer
SAS treats missing numeric values (.)
as smaller than all valid numbers.
Dangerous
Example
if score < 50 then flag='FAIL';
Missing scores also become FAIL.
Correct
Logic
if score ne . and score < 50 then flag='FAIL';
Key
Points
- Very
important in clinical validation.
- Missing
values can silently corrupt outputs.
- Always
explicitly check missing conditions.
- Regulatory
submissions require proper handling.
3. What is the difference between CMISS and
NMISS?
Answer
|
Function |
Purpose |
|
NMISS |
Counts missing numeric values |
|
CMISS |
Counts missing character + numeric values |
Example
data test;set raw;missing_count=nmiss(score,age);all_missing=cmiss(name,city,score);run;
Key
Points
- CMISS is
more flexible.
- Useful in
validation checks.
- Frequently
used in SDTM QC programming.
4. Explain RETAIN statement in SAS?
Answer
Normally SAS resets variables to missing for every iteration. RETAIN
preserves values across observations.
Example
data running_total;set sales;retain total_sales 0;total_sales + sales;run;
Key
Points
- Useful for
cumulative calculations.
- Common in
financial analysis.
- Important
for longitudinal patient tracking.
5. What are arrays in SAS and why are they
useful?
Answer
Arrays allow multiple variables to be processed using loops.
Example
data clean_scores;set exam;array scores(*) score1-score5;do i=1 to dim(scores);if scores(i) < 0 then scores(i)=0;end;run;
Key
Points
- Reduces
repetitive code.
- Useful in
bulk cleaning operations.
- Common in
laboratory data processing.
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about KINGS DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment