Kings, Kingdoms & Knowledge Engineering: Converting Dirty Historical Data into High-Trust SAS and R Analytical Assets

Famous Kings in India Data into Analysis-Ready SAS and R Pipelines for Enterprise Reporting Excellence

Introduction

In enterprise analytics, data rarely arrives in a clean, standardized, regulatory-ready format. Whether working in banking, insurance, retail, or clinical research, organizations constantly battle corrupted records, duplicate identifiers, invalid dates, inconsistent categories, malformed text, and missing values. As experienced Clinical SAS Programmers and Data Scientists know, poor-quality data silently destroys dashboards, machine learning predictions, executive decisions, and regulatory submissions.

Imagine a healthcare analytics company supporting a historical genomic research project studying hereditary diseases across royal bloodlines of famous Indian kings. The raw dataset contains duplicate king IDs, impossible reign years, negative treasury amounts, malformed researcher emails, inconsistent dynasty labels, mixed-case text, corrupted timestamps, and missing region codes. A single invalid record incorrectly associates a king with the wrong dynasty lineage, causing downstream statistical models to fail validation.

This is not merely a technical inconvenience.

In SDTM/ADaM environments, corrupted data can trigger FDA rejection risks, audit findings, inaccurate survival analysis, incorrect treatment arm assignment, and broken traceability. In banking, dirty data misclassifies fraud-risk customers. In insurance, claim-processing systems reject valid claims because of malformed categorical fields. In retail analytics, duplicate transactions inflate revenue metrics and mislead executive strategy.

The difference between chaos and trusted analytical intelligence is structured data engineering.

In this project, we will create a corrupted “FAMOUS KINGS IN INDIA” dataset with intentional enterprise-grade data quality problems and transform it into an analysis-ready professional dataset using both SAS and R. We will compare PROC SQL vs DATA Step methodologies while demonstrating enterprise validation frameworks, reusable macros, professional reporting workflows, and regulatory-grade cleaning strategies.

Raw Enterprise Dataset Creation in SAS

Business Problem

The historical analytics division receives operational data collected from multiple archives. The files contain:

  • Duplicate King IDs
  • Invalid reign dates
  • Mixed dynasty spellings
  • Missing region values
  • Corrupted email addresses
  • Negative treasury values
  • Invalid category labels
  • Mixed character/numeric corruption
  • Whitespace contamination
  • NULL string values

SAS Raw Dataset Creation

data kings_raw;

length King_ID $12 King_Name $40 Dynasty $30 Region $15

       Researcher_Email $60 Kingdom_Type $20 Treasury 8

       Reign_Start $20 Reign_End $20 Notes $50;

infile datalines dlm='|' truncover;

input King_ID $ King_Name $ Dynasty $ Region $ Researcher_Email $

      Kingdom_Type $ Treasury Reign_Start $ Reign_End $ Notes $;

datalines;

K001|ashoka |Maurya|north|ashoka.gmail.com|Empire|500000|268BC|232BC|NULL

K002|CHANDRAGUPTA maurya|Maurya |NORTH |chandra@royal|Empire|-20000|321BC|297BC|duplicate

K003|Samudragupta|gupta|south|samudra@gmail.com|kingdom|750000|335BC|380BC|valid

K004|Harsha|null|east|harsha@@mail.com|KINGDOM|450000|606AD|647AD|missing dynasty

K005|Krishnadevaraya|Vijayanagara|south |krishna.com|Empire|900000|1509AD|1529AD|bad email

K005|Krishnadevaraya|Vijayanagara|south |krishna.com|Empire|900000|1509AD|1529AD|duplicate

K006|Rajaraja Chola|CHOLA|SOUTH|raja@gmail.com|empire|850000|985AD|1014AD|valid

K007|Tipu Sultan|Mysore|NULL|tipu@gmail|Sultanate|650000|1782AD|1799AD|bad region

K008|Shivaji|Maratha|west|shivaji@gmail.com|EMPIRE|9999999|1674AD|1680AD|valid

K009|Akbar|Mughal|north|akbarmail.com|Empire|-100|1556AD|1605AD|negative treasury

K010|NULL|Mughal|north|admin@gmail.com|Empire|400000|1605AD|1627AD|missing king

;

run;

proc print data = kings_raw;

run;

OUTPUT:

ObsKing_IDKing_NameDynastyRegionResearcher_EmailKingdom_TypeTreasuryReign_StartReign_EndNotes
1K001ashokaMauryanorthashoka.gmail.comEmpire500000268BC232BCNULL
2K002CHANDRAGUPTA mauryaMauryaNORTHchandra@royalEmpire-20000321BC297BCduplicate
3K003Samudraguptaguptasouthsamudra@gmail.comkingdom750000335BC380BCvalid
4K004Harshanulleastharsha@@mail.comKINGDOM450000606AD647ADmissing dynasty
5K005KrishnadevarayaVijayanagarasouthkrishna.comEmpire9000001509AD1529ADbad email
6K005KrishnadevarayaVijayanagarasouthkrishna.comEmpire9000001509AD1529ADduplicate
7K006Rajaraja CholaCHOLASOUTHraja@gmail.comempire850000985AD1014ADvalid
8K007Tipu SultanMysoreNULLtipu@gmailSultanate6500001782AD1799ADbad region
9K008ShivajiMarathawestshivaji@gmail.comEMPIRE99999991674AD1680ADvalid
10K009AkbarMughalnorthakbarmail.comEmpire-1001556AD1605ADnegative treasury
11K010NULLMughalnorthadmin@gmail.comEmpire4000001605AD1627ADmissing king

Explanation and Key Points

This dataset intentionally mimics real enterprise corruption scenarios. The LENGTH statement appears before assignments because SAS determines variable storage length during compilation. If LENGTH is declared later, character truncation may silently occur. For example, “Vijayanagara” could be truncated if Dynasty were initialized with smaller lengths earlier.

The INFILE and DATALINES combination simulates operational flat-file ingestion pipelines commonly used in healthcare and banking systems. TRUNCOVER prevents SAS from reading the next line when fields are incomplete.

Notice the intentional corruption:

  • NULL strings
  • malformed emails
  • inconsistent uppercase/lowercase text
  • duplicate IDs
  • negative treasury values

These issues directly affect analytical reliability and reporting accuracy.

Unlike SAS fixed-length character allocation, R dynamically handles character vectors, reducing truncation risks but increasing runtime memory flexibility tradeoffs.

FAMOUS KINGS IN INDIA Raw Dataset in R

library(tibble)

kings_raw <- tribble(

  ~King_ID, ~King_Name,              ~Dynasty,       ~Region, ~Researcher_Email,    ~Kingdom_Type, ~Treasury, ~Reign_Start, ~Reign_End, ~Notes,

  "K001",   "ashoka ",              "Maurya",       "north", "ashoka.gmail.com",   "Empire",       500000,    "268BC",      "232BC",    "NULL",

  "K002",   "CHANDRAGUPTA maurya",  "Maurya ",      "NORTH ", "chandra@royal",     "Empire",      -20000,     "321BC",      "297BC",    "duplicate",

  "K003",   "Samudragupta",         "gupta",        "south", "samudra@gmail.com",  "kingdom",      750000,    "335BC",      "380BC",    "valid",

  "K004",   "Harsha",               "null",         "east",  "harsha@@mail.com",   "KINGDOM",           450000,    "606AD",      "647AD",    "missing dynasty",

  "K005",   "Krishnadevaraya",      "Vijayanagara", "south ", "krishna.com",       "Empire",       900000,    "1509AD",     "1529AD",   "bad email",

  "K005",   "Krishnadevaraya",      "Vijayanagara", "south ", "krishna.com",       "Empire",       900000,    "1509AD",     "1529AD",   "duplicate",

  "K006",   "Rajaraja Chola",       "CHOLA",        "SOUTH", "raja@gmail.com",     "empire",       850000,    "985AD",      "1014AD",   "valid",

  "K007",   "Tipu Sultan",          "Mysore",       "NULL",  "tipu@gmail",         "Sultanate",               650000,    "1782AD",     "1799AD",   "bad region",

  "K008",   "Shivaji",              "Maratha",      "west",  "shivaji@gmail.com",  "EMPIRE",               9999999,    "1674AD",     "1680AD",   "valid",

  "K009",   "Akbar",                "Mughal",       "north", "akbarmail.com",      "Empire",      -100,       "1556AD",     "1605AD",   "negative treasury", 

  "K010",   "NULL",                 "Mughal",       "north", "admin@gmail.com",    "Empire",       400000,    "1605AD",     "1627AD",   "missing king"

)

OUTPUT:

 

King_ID

King_Name

Dynasty

Region

Researcher_Email

Kingdom_Type

Treasury

Reign_Start

Reign_End

Notes

1

K001

ashoka 

Maurya

north

ashoka.gmail.com

Empire

500000

268BC

232BC

NULL

2

K002

CHANDRAGUPTA maurya

Maurya 

NORTH 

chandra@royal

Empire

-20000

321BC

297BC

duplicate

3

K003

Samudragupta

gupta

south

samudra@gmail.com

kingdom

750000

335BC

380BC

valid

4

K004

Harsha

null

east

harsha@@mail.com

KINGDOM

450000

606AD

647AD

missing dynasty

5

K005

Krishnadevaraya

Vijayanagara

south 

krishna.com

Empire

900000

1509AD

1529AD

bad email

6

K005

Krishnadevaraya

Vijayanagara

south 

krishna.com

Empire

900000

1509AD

1529AD

duplicate

7

K006

Rajaraja Chola

CHOLA

SOUTH

raja@gmail.com

empire

850000

985AD

1014AD

valid

8

K007

Tipu Sultan

Mysore

NULL

tipu@gmail

Sultanate

650000

1782AD

1799AD

bad region

9

K008

Shivaji

Maratha

west

shivaji@gmail.com

EMPIRE

9999999

1674AD

1680AD

valid

10

K009

Akbar

Mughal

north

akbarmail.com

Empire

-100

1556AD

1605AD

negative treasury

11

K010

NULL

Mughal

north

admin@gmail.com

Empire

400000

1605AD

1627AD

missing king









Explanation and Key Points

This R program creates the same corrupted enterprise-style dataset that was originally developed in SAS. Here, the tribble() function from the tibble package is used because it provides a cleaner and more readable row-wise dataset structure, especially for interview demonstrations and enterprise prototypes.

The dataset intentionally contains multiple data-quality issues:

  • Duplicate King_ID values (K005)
  • Negative treasury amounts
  • Mixed uppercase/lowercase text
  • Invalid email formats
  • NULL string contamination
  • Whitespace corruption
  • Inconsistent categorical labels

PROC CONTENTS Validation

proc contents data=kings_raw;

run;

OUTPUT:

The CONTENTS Procedure

Data Set NameWORK.KINGS_RAWObservations11
Member TypeDATAVariables10
EngineV9Indexes0
Created05/20/2026 07:43:04Observation Length280
Last Modified05/20/2026 07:43:04Deleted Observations0
Protection CompressedNO
Data Set Type SortedNO
Label   
Data RepresentationSOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64  
Encodingutf-8 Unicode (UTF-8)  
Engine/Host Dependent Information
Data Set Page Size131072
Number of Data Set Pages1
First Data Page1
Max Obs per Page467
Obs in First Data Page11
Number of Data Set Repairs0
Filename/saswork/SAS_workDFD200001319_odaws01-apse1-2.oda.sas.com/SAS_work140F00001319_odaws01-apse1-2.oda.sas.com/kings_raw.sas7bdat
Release Created9.0401M8
Host CreatedLinux
Inode Number134352353
Access Permissionrw-r--r--
Owner Nameu63247146
File Size256KB
File Size (bytes)262144
Alphabetic List of Variables and Attributes
#VariableTypeLen
3DynastyChar30
1King_IDChar12
2King_NameChar40
6Kingdom_TypeChar20
10NotesChar50
4RegionChar15
9Reign_EndChar20
8Reign_StartChar20
5Researcher_EmailChar60
7TreasuryNum8

Explanation and Key Points

PROC CONTENTS is one of the most underestimated validation procedures in SAS. Before cleaning begins, programmers must inspect variable types, lengths, formats, and informats.

In regulated environments, incorrect variable type assignments cause catastrophic downstream failures:

  • Character-to-numeric conversion issues
  • Broken joins
  • Invalid derivations
  • Reporting mismatches

For example:

  • Treasury must remain numeric
  • Reign_Start must eventually standardize into analyzable date formats
  • King_Name requires enough storage length

Clinical trial programmers frequently use PROC CONTENTS during SDTM validation because metadata mismatches trigger Define.xml inconsistencies and FDA rejection risks.

Enterprise Data Cleaning Using DATA Step

data kings_clean;

retain Source_System "ROYAL_ANALYTICS";

set kings_raw;

King_Name = propcase(strip(King_Name));

Dynasty = upcase(strip(Dynasty));

Region = upcase(strip(Region));

King_Name = tranwrd(King_Name,"Null","");

Dynasty = tranwrd(Dynasty,"NULL","UNKNOWN");

Treasury = abs(Treasury);

if missing(King_Name) then King_Name="UNKNOWN";

if Region="NULL" then Region="UNKNOWN";

if index(Researcher_Email,'@')=0 then

Researcher_Email="INVALID_EMAIL";

Treasury = round(Treasury,.01);

King_Full = catx('-',King_ID,King_Name);

Length_Flag = length(King_Name);

if Length_Flag < 7 then Flag='CHECK';

drop Length_Flag;

run;

proc print data = kings_clean;

run;

OUTPUT:

ObsSource_SystemKing_IDKing_NameDynastyRegionResearcher_EmailKingdom_TypeTreasuryReign_StartReign_EndNotesKing_FullFlag
1ROYAL_ANALYTICSK001AshokaMAURYANORTHINVALID_EMAILEmpire500000268BC232BCNULLK001-AshokaCHECK
2ROYAL_ANALYTICSK002Chandragupta MauryaMAURYANORTHchandra@royalEmpire20000321BC297BCduplicateK002-Chandragupta Maurya 
3ROYAL_ANALYTICSK003SamudraguptaGUPTASOUTHsamudra@gmail.comkingdom750000335BC380BCvalidK003-Samudragupta 
4ROYAL_ANALYTICSK004HarshaUNKNOWNEASTharsha@@mail.comKINGDOM450000606AD647ADmissing dynastyK004-HarshaCHECK
5ROYAL_ANALYTICSK005KrishnadevarayaVIJAYANAGARASOUTHINVALID_EMAILEmpire9000001509AD1529ADbad emailK005-Krishnadevaraya 
6ROYAL_ANALYTICSK005KrishnadevarayaVIJAYANAGARASOUTHINVALID_EMAILEmpire9000001509AD1529ADduplicateK005-Krishnadevaraya 
7ROYAL_ANALYTICSK006Rajaraja CholaCHOLASOUTHraja@gmail.comempire850000985AD1014ADvalidK006-Rajaraja Chola 
8ROYAL_ANALYTICSK007Tipu SultanMYSOREUNKNOWNtipu@gmailSultanate6500001782AD1799ADbad regionK007-Tipu Sultan 
9ROYAL_ANALYTICSK008ShivajiMARATHAWESTshivaji@gmail.comEMPIRE99999991674AD1680ADvalidK008-Shivaji 
10ROYAL_ANALYTICSK009AkbarMUGHALNORTHINVALID_EMAILEmpire1001556AD1605ADnegative treasuryK009-AkbarCHECK
11ROYAL_ANALYTICSK010UNKNOWNMUGHALNORTHadmin@gmail.comEmpire4000001605AD1627ADmissing kingK010-UNKNOWN 

Explanation and Key Points

This DATA step demonstrates enterprise-grade transformation logic.

Key techniques include:

  • RETAIN preserves Source_System metadata
  • PROPCASE standardizes mixed-case names
  • STRIP removes whitespace contamination
  • UPCASE harmonizes categorical variables
  • TRANWRD replaces corrupted text
  • ABS corrects negative treasury values
  • INDEX validates malformed emails
  • CATX creates traceable composite identifiers

In healthcare submissions, similar logic standardizes treatment names, adverse event categories, and investigator records.

One critical SAS behavior:
Missing numeric values are treated as smaller than valid numbers. This can silently distort filtering logic:

if amount < 100;

This statement includes missing values unless explicitly excluded.

That behavior is a major enterprise risk during patient eligibility screening or fraud analytics.

Removing Duplicates Using PROC SORT

proc sort data=kings_clean nodupkey out=kings_nodup;

by King_ID;

run;

proc print data = kings_nodup;

run;

LOG:

NOTE: There were 11 observations read from the data set WORK.KINGS_CLEAN.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.KINGS_NODUP has 10 observations and 13 variables.

OUTPUT:

ObsSource_SystemKing_IDKing_NameDynastyRegionResearcher_EmailKingdom_TypeTreasuryReign_StartReign_EndNotesKing_FullFlag
1ROYAL_ANALYTICSK001AshokaMAURYANORTHINVALID_EMAILEmpire500000268BC232BCNULLK001-AshokaCHECK
2ROYAL_ANALYTICSK002Chandragupta MauryaMAURYANORTHchandra@royalEmpire20000321BC297BCduplicateK002-Chandragupta Maurya 
3ROYAL_ANALYTICSK003SamudraguptaGUPTASOUTHsamudra@gmail.comkingdom750000335BC380BCvalidK003-Samudragupta 
4ROYAL_ANALYTICSK004HarshaUNKNOWNEASTharsha@@mail.comKINGDOM450000606AD647ADmissing dynastyK004-HarshaCHECK
5ROYAL_ANALYTICSK005KrishnadevarayaVIJAYANAGARASOUTHINVALID_EMAILEmpire9000001509AD1529ADbad emailK005-Krishnadevaraya 
6ROYAL_ANALYTICSK006Rajaraja CholaCHOLASOUTHraja@gmail.comempire850000985AD1014ADvalidK006-Rajaraja Chola 
7ROYAL_ANALYTICSK007Tipu SultanMYSOREUNKNOWNtipu@gmailSultanate6500001782AD1799ADbad regionK007-Tipu Sultan 
8ROYAL_ANALYTICSK008ShivajiMARATHAWESTshivaji@gmail.comEMPIRE99999991674AD1680ADvalidK008-Shivaji 
9ROYAL_ANALYTICSK009AkbarMUGHALNORTHINVALID_EMAILEmpire1001556AD1605ADnegative treasuryK009-AkbarCHECK
10ROYAL_ANALYTICSK010UNKNOWNMUGHALNORTHadmin@gmail.comEmpire4000001605AD1627ADmissing kingK010-UNKNOWN 

Explanation and Key Points

PROC SORT NODUPKEY is essential in enterprise deduplication frameworks.

Duplicate records create:

  • inflated revenue metrics
  • duplicate patient enrollment
  • repeated insurance claims
  • inaccurate adverse-event counts

In SDTM domains, duplicate USUBJID values may invalidate entire submission packages.

The BY statement determines uniqueness logic. Here, King_ID acts as the enterprise business key.

Always validate deduplication carefully because aggressive removal can unintentionally discard valid longitudinal observations.

PROC FORMAT Standardization

proc format;

value $regionfmt 'NORTH'='Northern India'

                 'SOUTH'='Southern India'

                 'EAST' ='Eastern India'

                 'WEST' ='Western India'

               'UNKNOWN'='Unknown Region';

run;

LOG:

NOTE: Format $REGIONFMT has been output.

Explanation and Key Points

PROC FORMAT centralizes metadata governance.

Benefits include:

  • reusable enterprise standards
  • reporting consistency
  • regulatory traceability
  • easier maintenance

Instead of repeatedly hardcoding logic across programs, formats create reusable controlled terminology layers similar to CDISC controlled terminology.

In clinical systems:

  • gender mappings
  • treatment codes
  • severity classifications
  • laboratory categories

are frequently standardized using PROC FORMAT.

PROC SQL vs DATA Step Merge

proc sql;

create table kingdom_summary as

select Dynasty,

       count(*) as Total_Kings,

       avg(Treasury) as Avg_Treasury

from kings_nodup

group by Dynasty;

quit;

proc print data = kingdom_summary;

run;

OUTPUT:

ObsDynastyTotal_KingsAvg_Treasury
1CHOLA1850000
2GUPTA1750000
3MARATHA19999999
4MAURYA2260000
5MUGHAL2200050
6MYSORE1650000
7UNKNOWN1450000
8VIJAYANAGARA1900000

Explanation and Key Points

PROC SQL provides relational flexibility and concise aggregation syntax.

Advantages:

  • easier joins
  • grouped summaries
  • dynamic queries
  • database compatibility

DATA Step excels in row-wise sequential processing and FIRST./LAST. logic.

PROC SQL resembles modern R dplyr summarization pipelines, making it intuitive for analysts transitioning between SAS and R ecosystems.

FIRST./LAST. Processing

proc sort data=kings_nodup;

by Dynasty;

run;

proc print data = kings_nodup;

run;

OUTPUT:

ObsSource_SystemKing_IDKing_NameDynastyRegionResearcher_EmailKingdom_TypeTreasuryReign_StartReign_EndNotesKing_FullFlag
1ROYAL_ANALYTICSK006Rajaraja CholaCHOLASOUTHraja@gmail.comempire850000985AD1014ADvalidK006-Rajaraja Chola 
2ROYAL_ANALYTICSK003SamudraguptaGUPTASOUTHsamudra@gmail.comkingdom750000335BC380BCvalidK003-Samudragupta 
3ROYAL_ANALYTICSK008ShivajiMARATHAWESTshivaji@gmail.comEMPIRE99999991674AD1680ADvalidK008-Shivaji 
4ROYAL_ANALYTICSK001AshokaMAURYANORTHINVALID_EMAILEmpire500000268BC232BCNULLK001-AshokaCHECK
5ROYAL_ANALYTICSK002Chandragupta MauryaMAURYANORTHchandra@royalEmpire20000321BC297BCduplicateK002-Chandragupta Maurya 
6ROYAL_ANALYTICSK009AkbarMUGHALNORTHINVALID_EMAILEmpire1001556AD1605ADnegative treasuryK009-AkbarCHECK
7ROYAL_ANALYTICSK010UNKNOWNMUGHALNORTHadmin@gmail.comEmpire4000001605AD1627ADmissing kingK010-UNKNOWN 
8ROYAL_ANALYTICSK007Tipu SultanMYSOREUNKNOWNtipu@gmailSultanate6500001782AD1799ADbad regionK007-Tipu Sultan 
9ROYAL_ANALYTICSK004HarshaUNKNOWNEASTharsha@@mail.comKINGDOM450000606AD647ADmissing dynastyK004-HarshaCHECK
10ROYAL_ANALYTICSK005KrishnadevarayaVIJAYANAGARASOUTHINVALID_EMAILEmpire9000001509AD1529ADbad emailK005-Krishnadevaraya 

data dynasty_rank;

set kings_nodup;

by Dynasty;

if first.Dynasty then Rank=1;

else Rank+1;

format Region regionfmt.;

run;

proc print data = dynasty_rank;

run;

OUTPUT:

ObsSource_SystemKing_IDKing_NameDynastyRegionResearcher_EmailKingdom_TypeTreasuryReign_StartReign_EndNotesKing_FullFlagRank
1ROYAL_ANALYTICSK006Rajaraja CholaCHOLASouthern Indiaraja@gmail.comempire850000985AD1014ADvalidK006-Rajaraja Chola 1
2ROYAL_ANALYTICSK003SamudraguptaGUPTASouthern Indiasamudra@gmail.comkingdom750000335BC380BCvalidK003-Samudragupta 1
3ROYAL_ANALYTICSK008ShivajiMARATHAWestern Indiashivaji@gmail.comEMPIRE99999991674AD1680ADvalidK008-Shivaji 1
4ROYAL_ANALYTICSK001AshokaMAURYANorthern IndiaINVALID_EMAILEmpire500000268BC232BCNULLK001-AshokaCHECK1
5ROYAL_ANALYTICSK002Chandragupta MauryaMAURYANorthern Indiachandra@royalEmpire20000321BC297BCduplicateK002-Chandragupta Maurya 2
6ROYAL_ANALYTICSK009AkbarMUGHALNorthern IndiaINVALID_EMAILEmpire1001556AD1605ADnegative treasuryK009-AkbarCHECK1
7ROYAL_ANALYTICSK010UNKNOWNMUGHALNorthern Indiaadmin@gmail.comEmpire4000001605AD1627ADmissing kingK010-UNKNOWN 2
8ROYAL_ANALYTICSK007Tipu SultanMYSOREUnknown Regiontipu@gmailSultanate6500001782AD1799ADbad regionK007-Tipu Sultan 1
9ROYAL_ANALYTICSK004HarshaUNKNOWNEastern Indiaharsha@@mail.comKINGDOM450000606AD647ADmissing dynastyK004-HarshaCHECK1
10ROYAL_ANALYTICSK005KrishnadevarayaVIJAYANAGARASouthern IndiaINVALID_EMAILEmpire9000001509AD1529ADbad emailK005-Krishnadevaraya 1

Explanation and Key Points

FIRST./LAST. processing is one of SAS’s most powerful enterprise features.

It enables:

  • longitudinal tracking
  • patient visit sequencing
  • transaction ranking
  • grouped calculations

Healthcare examples include:

  • identifying first dose dates
  • last follow-up visits
  • treatment discontinuation logic

R typically requires grouped mutate() operations to replicate this behavior.

Enterprise Macro Framework

%macro validate(dsname=);

proc freq data=&dsname;

tables Dynasty Region Kingdom_Type / missing;

run;


proc means data=&dsname n nmiss mean min max;

var Treasury;

run;

%mend;

%validate(dsname=kings_nodup);

OUTPUT:

The FREQ Procedure

DynastyFrequencyPercentCumulative
Frequency
Cumulative
Percent
CHOLA110.00110.00
GUPTA110.00220.00
MARATHA110.00330.00
MAURYA220.00550.00
MUGHAL220.00770.00
MYSORE110.00880.00
UNKNOWN110.00990.00
VIJAYANAGARA110.0010100.00
RegionFrequencyPercentCumulative
Frequency
Cumulative
Percent
EAST110.00110.00
NORTH440.00550.00
SOUTH330.00880.00
UNKNOWN110.00990.00
WEST110.0010100.00
Kingdom_TypeFrequencyPercentCumulative
Frequency
Cumulative
Percent
EMPIRE110.00110.00
Empire550.00660.00
KINGDOM110.00770.00
Sultanate110.00880.00
empire110.00990.00
kingdom110.0010100.00

The MEANS Procedure

Analysis Variable : Treasury
NN MissMeanMinimumMaximum
1001452009.90100.00000009999999.00

Explanation and Key Points

Macros create reusable enterprise validation frameworks.

Benefits:

  • standardization
  • scalability
  • reproducibility
  • automation

Large pharmaceutical companies maintain macro libraries for:

  • SDTM validation
  • ADaM derivations
  • TLF generation
  • QC automation

Macros reduce coding duplication while improving governance consistency.

R Equivalent Cleaning Workflow

library(dplyr)

library(stringr)

library(janitor)

library(tidyr)

kings_clean <- kings_raw %>%

  clean_names() %>%

  mutate(

    king_name = str_to_title(str_trim(king_name)),

    dynasty = str_to_upper(str_trim(dynasty)),

    region = str_to_upper(str_trim(region)),

    treasury = abs(treasury),

    researcher_email =

      if_else(grepl("@", researcher_email),

              researcher_email,"INVALID_EMAIL"),

    dynasty = coalesce(dynasty,"UNKNOWN"),

    kingdom_type = case_when(kingdom_type == "EMPIRE" ~ "Empire",

                             TRUE ~ kingdom_type)

  ) %>%

  distinct(king_id,.keep_all=TRUE)

OUTPUT:

 

king_id

king_name

dynasty

region

researcher_email

kingdom_type

treasury

reign_start

reign_end

notes

1

K001

Ashoka

MAURYA

NORTH

INVALID_EMAIL

Empire

500000

268BC

232BC

NULL

2

K002

Chandragupta Maurya

MAURYA

NORTH

chandra@royal

Empire

20000

321BC

297BC

duplicate

3

K003

Samudragupta

GUPTA

SOUTH

samudra@gmail.com

kingdom

750000

335BC

380BC

valid

4

K004

Harsha

NULL

EAST

harsha@@mail.com

KINGDOM

450000

606AD

647AD

missing dynasty

5

K005

Krishnadevaraya

VIJAYANAGARA

SOUTH

INVALID_EMAIL

Empire

900000

1509AD

1529AD

bad email

6

K006

Rajaraja Chola

CHOLA

SOUTH

raja@gmail.com

empire

850000

985AD

1014AD

valid

7

K007

Tipu Sultan

MYSORE

NULL

tipu@gmail

Sultanate

650000

1782AD

1799AD

bad region

8

K008

Shivaji

MARATHA

WEST

shivaji@gmail.com

Empire

9999999

1674AD

1680AD

valid

9

K009

Akbar

MUGHAL

NORTH

INVALID_EMAIL

Empire

100

1556AD

1605AD

negative treasury

10

K010

Null

MUGHAL

NORTH

admin@gmail.com

Empire

400000

1605AD

1627AD

missing king

Explanation and Key Points

R provides flexible vectorized transformations with tidyverse pipelines.

Distinct is part of the dplyr package. If library(dplyr) is missing, R cannot recognize the function.

Equivalent mappings:

  • mutate() → SAS assignment logic
  • case_when() → SELECT-WHEN
  • distinct() → PROC SORT NODUPKEY
  • coalesce() → COALESCEC
  • str_trim() → STRIP
  • grepl() → INDEX/FIND

R excels in:

  • exploratory analytics
  • rapid transformation
  • visualization integration
  • flexible string handling

SAS excels in:

  • auditability
  • production stability
  • controlled metadata governance

Together, they create enterprise-grade analytical ecosystems.

Enterprise Validation & Compliance

In SDTM and ADaM environments, every transformation must maintain traceability.

Critical compliance requirements include:

  • Audit trails
  • Metadata lineage
  • QC independence
  • Reproducibility
  • Controlled terminology
  • Regulatory transparency

One dangerous SAS behavior:
Missing numeric values sort lower than valid numbers.

Example:

if age < 18;

This unintentionally includes missing ages unless explicitly checked.

Such logic errors can:

  • enroll ineligible patients
  • distort efficacy analysis
  • invalidate statistical outputs

Validation must always include:

  • PROC COMPARE
  • independent QC programming
  • metadata reconciliation
  • derivation traceability

Business Logic Behind Cleaning

Data cleaning is not cosmetic it directly influences analytical truth.

Suppose a king’s treasury value appears as -500000. Negative treasury values may represent corrupted ingestion logic, OCR extraction failures, or accounting sign reversals. Correcting such values using ABS() ensures financial analytics remain interpretable.

Missing dates disrupt trend analysis and duration calculations. Standardized dates allow INTCK and INTNX calculations to derive reign durations accurately.

Text normalization matters because “Maurya,” “maurya,” and “MAURYA” represent the same analytical category. Without standardization, summary reports split identical categories into separate groups.

Malformed emails affect operational communication workflows and audit tracking.

Duplicate records inflate counts and distort statistical summaries.

In clinical research, similar corrections impact:

  • adverse event frequency
  • patient exposure calculations
  • treatment compliance analysis
  • endpoint derivations

Cleaning logic ensures downstream dashboards, AI models, statistical procedures, and executive decisions are based on trusted analytical intelligence instead of corrupted operational noise.

20 Enterprise Data-Cleaning Best Practices

  1. Always validate metadata before transformations.
  2. Define LENGTH before assignments.
  3. Standardize controlled terminology early.
  4. Never trust source-system casing.
  5. Separate raw and cleaned datasets.
  6. Maintain immutable raw archives.
  7. Use reusable macro frameworks.
  8. Validate duplicates before removal.
  9. Document all derivation logic.
  10. Use QC-independent programming.
  11. Centralize PROC FORMAT mappings.
  12. Audit missing-value behavior carefully.
  13. Validate joins using row counts.
  14. Use defensive programming patterns.
  15. Maintain enterprise naming standards.
  16. Avoid hardcoded business rules.
  17. Track lineage from source to report.
  18. Standardize date formats globally.
  19. Automate validation reports.
  20. Preserve reproducibility across environments.

20 Sharp One-Line Insights

  1. Dirty data creates expensive business mistakes.
  2. Standardized variables improve reproducibility.
  3. Validation logic is stronger than visual inspection.
  4. Duplicate IDs destroy analytical trust.
  5. Metadata drives enterprise consistency.
  6. Missing dates silently corrupt timelines.
  7. PROC FORMAT improves governance.
  8. R accelerates exploratory cleaning.
  9. SAS dominates regulated production workflows.
  10. Character truncation creates hidden corruption.
  11. Audit trails matter more than dashboards.
  12. QC independence prevents analytical bias.
  13. Controlled terminology improves compliance.
  14. Data lineage protects regulatory credibility.
  15. PROC SQL simplifies relational summaries.
  16. DATA Step excels in sequential processing.
  17. Macros improve enterprise scalability.
  18. Standardization reduces reporting ambiguity.
  19. Cleaning logic impacts AI predictions.
  20. Reliable analytics begin with trusted data.

SAS vs R Comparison

Feature

SAS

R

Auditability

Excellent

Moderate

Regulatory Acceptance

Industry Standard

Growing

Metadata Governance

Strong

Flexible

Scalability

Excellent

High

Visualization

Moderate

Excellent

Macro Automation

Powerful

Flexible

String Handling

Structured

Dynamic

Validation Frameworks

Enterprise-grade

Customizable

Clinical Trial Usage

Dominant

Supplemental

Learning Curve

Moderate

Moderate-High

Small Enterprise Validation Checklist

  • Validate duplicates
  • Check missing values
  • Standardize categorical variables
  • Validate formats/informats
  • Reconcile row counts
  • Validate joins
  • Check date integrity
  • Review truncation risks
  • Perform QC comparison
  • Archive raw datasets

Summary

SAS and R each play critical roles in enterprise data engineering ecosystems. SAS provides unmatched auditability, regulatory alignment, metadata governance, and production stability. This is why pharmaceutical companies continue relying heavily on SAS for SDTM, ADaM, and TLF workflows.

R offers extraordinary flexibility, modern string processing, exploratory analytics, machine learning integration, and visualization capabilities. Its tidyverse ecosystem dramatically simplifies transformation pipelines and accelerates exploratory development.

In enterprise environments, the strongest strategy is rarely SAS versus R.

It is SAS plus R.

SAS ensures:

  • governance
  • validation
  • traceability
  • reproducibility

R enables:

  • agility
  • advanced analytics
  • interactive exploration
  • modern data science workflows

Together, they create scalable, production-grade analytical intelligence systems capable of supporting healthcare analytics, banking fraud detection, insurance claim validation, retail forecasting, and regulatory reporting.

The true value of data cleaning lies not in cosmetic formatting but in preserving analytical truth, business credibility, and decision-making reliability.

Conclusion

Modern analytics ecosystems are built on one foundational principle: trusted data creates trusted decisions.

Without structured cleaning frameworks, organizations risk catastrophic analytical failures. Duplicate records inflate business metrics. Missing values distort predictive models. Corrupted dates break timelines. Invalid categorical mappings produce misleading dashboards. In regulated industries such as clinical research, these failures can trigger rejected submissions, audit findings, patient safety risks, and financial penalties.

This project demonstrated how a corrupted “FAMOUS KINGS IN INDIA” dataset can be transformed into analysis-ready intelligence using enterprise SAS and R workflows. We explored advanced DATA Step engineering, PROC SQL summarization, reusable macro frameworks, PROC FORMAT governance, FIRST./LAST. processing, validation logic, and tidyverse-based R transformations.

The deeper lesson extends beyond syntax.

Professional data engineering requires:

  • governance
  • traceability
  • reproducibility
  • validation independence
  • defensive programming
  • enterprise metadata discipline

SAS remains the gold standard for regulated production environments because of its auditability, stability, and compliance readiness. R complements SAS through flexibility, exploratory analytics, advanced transformations, and modern data science integration.

Organizations that combine both technologies intelligently achieve:

  • scalable reporting
  • trusted AI pipelines
  • reproducible analytics
  • enterprise-grade governance
  • high-quality executive intelligence

Ultimately, analytical excellence is never accidental.

It is engineered through disciplined cleaning frameworks, standardized transformation logic, rigorous validation, and production-ready data architecture.

Reliable intelligence begins where dirty data ends.

Interview Questions & Answers

1. Explain FIRST. and LAST. processing in SAS?

Answer

FIRST. and LAST. variables are temporary variables automatically created during BY-group processing.

Example

proc sort data=sales;
by Region;
run;

data summary; set sales; by Region; if first.Region then Total=0; Total + Sales; if last.Region; run;

Key Points

  • Used for grouped calculations.
  • Helpful in cumulative totals.
  • Common in patient visit tracking.
  • Efficient alternative to SQL aggregation.

2. How does SAS treat missing numeric values?

Answer

SAS treats missing numeric values (.) as smaller than all valid numbers.

Dangerous Example

if score < 50 then flag='FAIL';

Missing scores also become FAIL.

Correct Logic

if score ne . and score < 50 then flag='FAIL';

Key Points

  • Very important in clinical validation.
  • Missing values can silently corrupt outputs.
  • Always explicitly check missing conditions.
  • Regulatory submissions require proper handling.

3. What is the difference between CMISS and NMISS?

Answer

Function

Purpose

NMISS

      Counts missing numeric values

CMISS

      Counts missing character + numeric values

Example

data test;
set raw;
missing_count=nmiss(score,age);
all_missing=cmiss(name,city,score);
run;

Key Points

  • CMISS is more flexible.
  • Useful in validation checks.
  • Frequently used in SDTM QC programming.

4. Explain RETAIN statement in SAS?

Answer

Normally SAS resets variables to missing for every iteration. RETAIN preserves values across observations.

Example

data running_total;
set sales;
retain total_sales 0;
total_sales + sales;
run;

Key Points

  • Useful for cumulative calculations.
  • Common in financial analysis.
  • Important for longitudinal patient tracking.

5. What are arrays in SAS and why are they useful?

Answer

Arrays allow multiple variables to be processed using loops.

Example

data clean_scores;
set exam;
array scores(*) score1-score5;
do i=1 to dim(scores);
if scores(i) < 0 then scores(i)=0; end; run;

Key Points

  • Reduces repetitive code.
  • Useful in bulk cleaning operations.
  • Common in laboratory data processing.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about KINGS DATA.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Follow Us On : 


 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:



3.Elevating SAS Validation to a New Standard : When Vintage Cameras Meet Dirty Data 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About Us | Contact | Privacy Policy

Comments

Popular posts from this blog

Beyond Fabric and Fashion: Turning the World’s Most Beautiful Sarees Dataset into Structured Intelligence with SAS and R

Data Cleaning Secrets Using Famous Food Dataset:Handling Duplicate Records in SAS

Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS