Global Bird Intelligence Engineering: PROC SQL vs DATA Step Cleaning Strategies in SAS and R
Bird Kingdom Analytics Chaos: Transforming Dirty Wildlife Data into Regulatory-Ready Intelligence
Introduction
In modern
analytics ecosystems, dirty data is not a small inconvenience it is an
enterprise risk. As Clinical SAS Programmers and Data Scientists, we frequently
encounter datasets where duplicate identifiers, corrupted dates, inconsistent
category labels, malformed emails, and impossible numeric ranges silently
destroy analytical accuracy.
Imagine a
global wildlife conservation organization tracking endangered birds across
multiple continents. Their executive dashboard suddenly shows that the Bald
Eagle population declined by 3000% overnight. Simultaneously, AI prediction
models classify healthy regions as extinction-risk zones.
After
investigation, the root causes were shocking:
- Duplicate bird IDs
- Negative population counts
- Invalid migration dates
- Corrupted country codes
- Mixed uppercase/lowercase
species names
- NULL strings stored as
character values
- Malformed researcher emails
- Impossible age ranges
- Missing observation dates
- Incorrect region mappings
This
scenario is surprisingly similar to real clinical trial failures where:
- patients are enrolled twice,
- treatment dates are invalid,
- adverse event timings fail,
- SDTM submissions are
rejected,
- and regulatory audits
uncover missing traceability.
Dirty
data destroys:
- dashboards,
- AI predictions,
- statistical outputs,
- compliance submissions,
- operational intelligence,
- and executive trust.
That is
why enterprise-grade data engineering using SAS and R is essential.
Creating Raw Corrupted Bird Dataset in SAS
1.SAS Raw Dataset with Intentional Errors
DATA bird_raw;
LENGTH Bird_ID $12 Bird_Name $35 Country $20 Region_Code $10
Researcher_Email $50 Bird_Category $20 Observation_Status $15;
INFILE DATALINES DLM='|' MISSOVER;
INPUT Bird_ID $ Bird_Name $ Country $ Population_Count Bird_Age
Observation_Date :ANYDTDTE20. Region_Code $ Researcher_Email $
Bird_Category $ Observation_Status $;
FORMAT Observation_Date DATE9.;
DATALINES;
BR001|bald eagle|usa|500|12|12JAN2025|NA01|eagle.gmail.com|Raptor|Active
BR002|PARROT|India|-50|5|15FEB2025|IN-1|parrot@bird.org|PetBird|ACTIVE
BR003| snowy owl |canada|120|200|.|CA_01|owl#mail.com|WildBird|Closed
BR004|Penguin|Antarctica|3000|15|32JAN2025|ANT01|penguin@mail.com|Bird|Open
BR004|Penguin|Antarctica|3000|15|32JAN2025|ANT01|penguin@mail.com|Bird|Open
BR005|NULL|Brazil|800|8|15MAR2025|BR-01|macaw@mail|Exotic|active
BR006|Flamingo|Kenya|-999|10|18APR2025|AF01|flamingo@mail.com|WaterBird|Pending
BR007|Peacock|india|1500|25|.|IND01|peacock.com|NationalBird|ACTIVE
BR008|Kiwi|NewZealand|50|-5|01MAY2025|NZ01|kiwi@mail.com|RareBird|Closed
BR009|Falcon|UAE|400|7|15JUN2025|UAE01|falcon@mail.com|Raptor|NULL
;
RUN;
PROC PRINT DATA=bird_raw;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR001 | bald eagle | usa | NA01 | eagle.gmail.com | Raptor | Active | 500 | 12 | 12JAN2025 |
| 2 | BR002 | PARROT | India | IN-1 | parrot@bird.org | PetBird | ACTIVE | -50 | 5 | 15FEB2025 |
| 3 | BR003 | snowy owl | canada | CA_01 | owl#mail.com | WildBird | Closed | 120 | 200 | . |
| 4 | BR004 | Penguin | Antarctica | ANT01 | penguin@mail.com | Bird | Open | 3000 | 15 | . |
| 5 | BR004 | Penguin | Antarctica | ANT01 | penguin@mail.com | Bird | Open | 3000 | 15 | . |
| 6 | BR005 | NULL | Brazil | BR-01 | macaw@mail | Exotic | active | 800 | 8 | 15MAR2025 |
| 7 | BR006 | Flamingo | Kenya | AF01 | flamingo@mail.com | WaterBird | Pending | -999 | 10 | 18APR2025 |
| 8 | BR007 | Peacock | india | IND01 | peacock.com | NationalBird | ACTIVE | 1500 | 25 | . |
| 9 | BR008 | Kiwi | NewZealand | NZ01 | kiwi@mail.com | RareBird | Closed | 50 | -5 | 01MAY2025 |
| 10 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
Explanation and Key Points
This raw
dataset intentionally simulates real-world enterprise corruption. Notice
several production-grade issues:
- Duplicate Bird_ID (BR004)
- Negative population counts
(-50, -999)
- Impossible ages (200, -5)
- Missing dates (.)
- Invalid dates (32JAN2025)
- Malformed emails
- Mixed text formatting
- NULL stored as text
- Inconsistent category labels
The
LENGTH statement appears before assignment logic because SAS determines
character variable storage length during compilation. If LENGTH is declared
after assignment, SAS may permanently truncate strings. This is called Character
Truncation Risk.
Unlike
SAS fixed-length character allocation, R dynamically manages character vector
memory, reducing truncation risk but increasing runtime memory variability.
Using
ANYDTDTE20. informat allows flexible date interpretation from inconsistent
source systems.
2.PROC CONTENTS for Metadata Validation
PROC CONTENTS DATA=bird_raw VARNUM;
RUN;
OUTPUT:
The CONTENTS Procedure
| Data Set Name | WORK.BIRD_RAW | Observations | 10 |
|---|---|---|---|
| Member Type | DATA | Variables | 10 |
| Engine | V9 | Indexes | 0 |
| Created | 06/11/2026 15:57:29 | Observation Length | 192 |
| Last Modified | 06/11/2026 15:57:29 | Deleted Observations | 0 |
| Protection | Compressed | NO | |
| Data Set Type | Sorted | NO | |
| Label | |||
| Data Representation | SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64 | ||
| Encoding | utf-8 Unicode (UTF-8) |
| Engine/Host Dependent Information | |
|---|---|
| Data Set Page Size | 131072 |
| Number of Data Set Pages | 1 |
| First Data Page | 1 |
| Max Obs per Page | 682 |
| Obs in First Data Page | 10 |
| Number of Data Set Repairs | 0 |
| Filename | /saswork/SAS_workE87500010BED_odaws02-apse1-2.oda.sas.com/SAS_work4DE200010BED_odaws02-apse1-2.oda.sas.com/bird_raw.sas7bdat |
| Release Created | 9.0401M8 |
| Host Created | Linux |
| Inode Number | 201326726 |
| Access Permission | rw-r--r-- |
| Owner Name | u63247146 |
| File Size | 256KB |
| File Size (bytes) | 262144 |
| Variables in Creation Order | ||||
|---|---|---|---|---|
| # | Variable | Type | Len | Format |
| 1 | Bird_ID | Char | 12 | |
| 2 | Bird_Name | Char | 35 | |
| 3 | Country | Char | 20 | |
| 4 | Region_Code | Char | 10 | |
| 5 | Researcher_Email | Char | 50 | |
| 6 | Bird_Category | Char | 20 | |
| 7 | Observation_Status | Char | 15 | |
| 8 | Population_Count | Num | 8 | |
| 9 | Bird_Age | Num | 8 | |
| 10 | Observation_Date | Num | 8 | DATE9. |
Explanation and Key Points
PROC
CONTENTS is one of the most underused but critical validation procedures in
SAS. It verifies:
- variable type,
- length,
- format,
- informat,
- label structure,
- and dataset metadata
integrity.
In
clinical trials, incorrect variable types can break SDTM mapping and Define.xml
generation. For example:
- character dates instead of
numeric SAS dates
- numeric IDs accidentally
imported as character values
- truncation caused by
incorrect LENGTH statements
Metadata
validation should always occur before transformation logic.
SAS Cleaning Workflow
3.Standardization and Cleaning
DATA bird_clean;
SET bird_raw;
Bird_Name = PROPCASE(STRIP(Bird_Name));
Country = UPCASE(STRIP(Country));
Bird_Category = COMPRESS(PROPCASE(Bird_Category));
Observation_Status = COALESCEC(UPCASE(Observation_Status),'UNKNOWN');
Researcher_Email = LOWCASE(COMPRESS(Researcher_Email));
IF FIND(Researcher_Email,'@')=0 THEN
Researcher_Email='invalid@mail.com';
Population_Count = ABS(Population_Count);
IF Bird_Age < 0 OR Bird_Age > 120 THEN Bird_Age=.;
IF Bird_Name='Null' THEN Bird_Name='Unknown Bird';
IF MISSING(Observation_Date) THEN Observation_Date='01JAN2025'D;
RUN;
PROC PRINT DATA=bird_clean;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 |
| 2 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 |
| 4 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 5 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 6 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 |
| 7 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 |
| 8 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 |
| 9 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 |
| 10 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
Explanation and Key Points
This DATA
step demonstrates core enterprise cleaning architecture.
Key
transformations:
- PROPCASE() standardizes
names
- STRIP() removes whitespace
corruption
- UPCASE() normalizes country
values
- COMPRESS() removes embedded spaces
- COALESCEC() handles missing
character values
- ABS() fixes negative numeric
corruption
- FIND() validates email
structure
- MISSING() identifies missing
dates
In
clinical programming, missing dates frequently cause catastrophic downstream
issues:
- incorrect treatment exposure
windows,
- adverse event
misclassification,
- survival analysis failures,
- invalid TLF outputs.
Date
imputation is often required under SAP-defined business rules.
4.Deduplication Using PROC SORT
PROC SORT DATA=bird_clean OUT=bird_nodup NODUPKEY;
BY Bird_ID;
RUN;
PROC PRINT DATA=bird_nodup;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 |
| 2 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 |
| 4 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 5 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 |
| 6 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 |
| 7 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 |
| 8 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 |
| 9 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
Explanation and Key Points
PROC SORT
NODUPKEY removes duplicate records based on BY-group logic.
In real
enterprise systems:
- duplicate patients,
- duplicate transactions,
- duplicate claims,
- duplicate policy IDs
can
inflate analytics and produce false trends.
Clinical
trials are especially sensitive because duplicate subject records may
invalidate statistical analysis populations.
5.Create Region Lookup Dataset
DATA region_lookup;
SET bird_nodup(KEEP=Bird_ID Region_Code);
LENGTH Clean_Region $15;
Clean_Region = UPCASE(COMPRESS(Region_Code));
RUN;
PROC PRINT DATA=region_lookup;
RUN;
OUTPUT:
| Obs | Bird_ID | Region_Code | Clean_Region |
|---|---|---|---|
| 1 | BR001 | NA01 | NA01 |
| 2 | BR002 | IN-1 | IN-1 |
| 3 | BR003 | CA_01 | CA_01 |
| 4 | BR004 | ANT01 | ANT01 |
| 5 | BR005 | BR-01 | BR-01 |
| 6 | BR006 | AF01 | AF01 |
| 7 | BR007 | IND01 | IND01 |
| 8 | BR008 | NZ01 | NZ01 |
| 9 | BR009 | UAE01 | UAE01 |
1. Separate Lookup Dataset
DATA region_lookup;
This
simulates production environments where:
- region mappings,
- protocol mappings,
- treatment lookups,
- country standards
come from
separate datasets.
2. KEEP Statement Optimization
KEEP=Bird_ID Region_Code
reduces
memory usage and improves performance.
In large
clinical datasets with millions of records, removing unnecessary variables
significantly improves processing efficiency.
3. LENGTH Before Assignment
LENGTH Clean_Region $15;
prevents
character truncation risk.
If LENGTH
is omitted, SAS assigns length based on the first encountered value.
4. COMPRESS + UPCASE Standardization
UPCASE(COMPRESS(Region_Code))
removes
hidden spaces and standardizes region values.
Example:
|
Raw
Value |
Clean
Value |
|
na 01 |
NA01 |
|
IN-1 |
IN-1 |
5. IN= Dataset Flags
IN=a
IN=b
allow
validation of:
- matched records,
- unmatched records,
- orphan keys,
- missing joins.
This is
heavily used in SDTM and ADaM validation workflows.
PROC SQL vs DATA Step
6.PROC SQL Join
PROC SQL;
CREATE TABLE bird_sql AS
SELECT a.*,
b.Clean_Region AS Region_Code
FROM bird_nodup AS a
LEFT JOIN region_lookup AS b
ON a.Bird_ID = b.Bird_ID;
QUIT;
PROC PRINT DATA=bird_sql;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 |
| 2 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 |
| 4 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 5 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 |
| 6 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 |
| 7 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 |
| 8 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 |
| 9 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
Explanation and Key Points
PROC SQL Advantages
- Easier joins
- Complex filtering
- Database-style processing
- Dynamic transformations
DATA Step Advantages
- Faster sequential processing
- Better row-level control
- FIRST./LAST. logic
- RETAIN and ARRAY
functionality
Clinical
programmers frequently combine both approaches depending on:
- volume,
- traceability,
- and sponsor standards.
7.DATA Step Merge
PROC SORT DATA=bird_nodup;
BY Bird_ID;
RUN;
PROC PRINT DATA=bird_nodup;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 |
| 2 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 |
| 4 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 5 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 |
| 6 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 |
| 7 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 |
| 8 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 |
| 9 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
PROC SORT DATA=region_lookup;
BY Bird_ID;
RUN;
PROC PRINT DATA=region_lookup;
RUN;
OUTPUT:
| Obs | Bird_ID | Region_Code | Clean_Region |
|---|---|---|---|
| 1 | BR001 | NA01 | NA01 |
| 2 | BR002 | IN-1 | IN-1 |
| 3 | BR003 | CA_01 | CA_01 |
| 4 | BR004 | ANT01 | ANT01 |
| 5 | BR005 | BR-01 | BR-01 |
| 6 | BR006 | AF01 | AF01 |
| 7 | BR007 | IND01 | IND01 |
| 8 | BR008 | NZ01 | NZ01 |
| 9 | BR009 | UAE01 | UAE01 |
DATA bird_merge;
MERGE bird_nodup(IN=a)
region_lookup(IN=b);
BY Bird_ID;
IF a;
RUN;
PROC PRINT DATA=bird_merge;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date | Clean_Region |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 | NA01 |
| 2 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 | IN-1 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 | CA_01 |
| 4 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 | ANT01 |
| 5 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 | BR-01 |
| 6 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 | AF01 |
| 7 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 | IND01 |
| 8 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 | NZ01 |
| 9 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 | UAE01 |
DATA Step vs PROC SQL for
Merge
|
Feature |
DATA
Step MERGE |
PROC
SQL JOIN |
|
Speed |
Faster
sequential processing |
Good
for relational joins |
|
Requires
Sorting |
Yes |
No |
|
FIRST./LAST.
Logic |
Excellent |
Not
available |
|
Complex
Conditions |
Moderate |
Easier |
|
Clinical
Trial Usage |
Very
common |
Very
common |
|
Debugging |
Easier
row tracing |
Easier
join readability |
Advanced SAS Techniques
8.FIRST./LAST. Processing
PROC SORT DATA=bird_nodup;
BY Country;
RUN;
PROC PRINT DATA=bird_nodup;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 2 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 |
| 4 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 |
| 5 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 |
| 6 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 |
| 7 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 |
| 8 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
| 9 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 |
DATA country_summary;
SET bird_nodup;
BY Country;
RETAIN Total_Birds 0;
IF FIRST.Country THEN Total_Birds=0;
Total_Birds + Population_Count;
IF LAST.Country;
RUN;
PROC PRINT DATA=country_summary;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date | Total_Birds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 | 3000 |
| 2 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 | 800 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 | 120 |
| 4 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 | 1550 |
| 5 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 | 999 |
| 6 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 | 50 |
| 7 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 | 400 |
| 8 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 | 500 |
9.ARRAY Processing
DATA bird_array;
SET bird_nodup;
ARRAY nums {*} Population_Count Bird_Age;
DO i=1 TO DIM(nums);
IF nums{i}<0 THEN nums{i}=ABS(nums{i});
END;
DROP i;
RUN;
PROC PRINT DATA=bird_array;
RUN;
OUTPUT:
| Obs | Bird_ID | Bird_Name | Country | Region_Code | Researcher_Email | Bird_Category | Observation_Status | Population_Count | Bird_Age | Observation_Date |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BR004 | Penguin | ANTARCTICA | ANT01 | penguin@mail.com | Bird | OPEN | 3000 | 15 | 01JAN2025 |
| 2 | BR005 | Unknown Bird | BRAZIL | BR-01 | macaw@mail | Exotic | ACTIVE | 800 | 8 | 15MAR2025 |
| 3 | BR003 | Snowy Owl | CANADA | CA_01 | invalid@mail.com | Wildbird | CLOSED | 120 | . | 01JAN2025 |
| 4 | BR002 | Parrot | INDIA | IN-1 | parrot@bird.org | Petbird | ACTIVE | 50 | 5 | 15FEB2025 |
| 5 | BR007 | Peacock | INDIA | IND01 | invalid@mail.com | Nationalbird | ACTIVE | 1500 | 25 | 01JAN2025 |
| 6 | BR006 | Flamingo | KENYA | AF01 | flamingo@mail.com | Waterbird | PENDING | 999 | 10 | 18APR2025 |
| 7 | BR008 | Kiwi | NEWZEALAND | NZ01 | kiwi@mail.com | Rarebird | CLOSED | 50 | . | 01MAY2025 |
| 8 | BR009 | Falcon | UAE | UAE01 | falcon@mail.com | Raptor | NULL | 400 | 7 | 15JUN2025 |
| 9 | BR001 | Bald Eagle | USA | NA01 | invalid@mail.com | Raptor | ACTIVE | 500 | 12 | 12JAN2025 |
R Data Cleaning Workflow
10.Creating Corrupted Dataset in R
library(tidyverse)
library(lubridate)
library(janitor)
bird_raw <- tibble(
bird_id = c("BR001","BR002","BR003","BR004"),
bird_name = c("bald eagle","PARROT"," snowy owl ","NULL"),
country = c("usa","India","canada","Brazil"),
population_count = c(500,-50,120,-999),
bird_age = c(12,5,200,-5),
researcher_email = c("eagle.gmail.com","parrot@bird.org",
"owl#mail.com","macaw@mail")
)
OUTPUT:
|
|
bird_id |
bird_name |
country |
population_count |
bird_age |
researcher_email |
|
1 |
BR001 |
bald eagle |
usa |
500 |
12 |
eagle.gmail.com |
|
2 |
BR002 |
PARROT |
India |
-50 |
5 |
parrot@bird.org |
|
3 |
BR003 |
snowy owl |
canada |
120 |
200 |
owl#mail.com |
|
4 |
BR004 |
NULL |
Brazil |
-999 |
-5 |
macaw@mail |
11.R Cleaning Logic
bird_clean <- bird_raw %>%
clean_names() %>%
mutate(bird_name = str_to_title(str_trim(bird_name)),
country = str_to_upper(str_trim(country)),
population_count = abs(population_count),
bird_age = if_else(bird_age < 0 |
bird_age > 120,NA_real_,
as.numeric(bird_age)),
researcher_email = if_else(grepl("@", researcher_email),
researcher_email,"invalid@mail.com"),
bird_name = if_else(bird_name=="Null","Unknown Bird",bird_name)
)
OUTPUT:
|
|
bird_id |
bird_name |
country |
population_count |
bird_age |
researcher_email |
|
1 |
BR001 |
Bald Eagle |
USA |
500 |
12 |
invalid@mail.com |
|
2 |
BR002 |
Parrot |
INDIA |
50 |
5 |
parrot@bird.org |
|
3 |
BR003 |
Snowy Owl |
CANADA |
120 |
NA |
invalid@mail.com |
|
4 |
BR004 |
Unknown Bird |
BRAZIL |
999 |
NA |
macaw@mail |
Explanation and Key Points
This R
pipeline mirrors SAS cleaning logic using tidyverse architecture.
SAS vs R Equivalents
|
SAS |
R
Equivalent |
|
DATA
Step |
mutate() |
|
IF-THEN |
if_else() |
|
PROPCASE |
str_to_title() |
|
STRIP |
str_trim() |
|
UPCASE |
str_to_upper() |
|
COMPRESS |
str_replace_all() |
|
PROC
SORT NODUPKEY |
distinct() |
|
MERGE |
left_join() |
R
provides exceptional flexibility for exploratory transformations, while SAS
dominates regulated production environments due to auditability.
Enterprise Validation &
Compliance
In SDTM
and ADaM workflows:
- traceability is mandatory,
- transformations must be
reproducible,
- QC independence is required,
- and every derivation must be
documented.
One
dangerous SAS behavior:
numeric missing values (.) are treated as smaller than valid numbers.
Example:
IF score < 50 THEN
Risk='High';
Missing
scores will incorrectly become High Risk.
This can:
- misclassify patients,
- reject insurance claims,
- distort fraud models,
- and corrupt regulatory
outputs.
Always
explicitly validate missingness.
20 Enterprise Data Cleaning
Best Practices
- Validate metadata before
transformation.
- Standardize variable naming
conventions.
- Remove duplicate identifiers
early.
- Never trust raw source
systems.
- Normalize text values
consistently.
- Validate email structures.
- Standardize date formats
enterprise-wide.
- Apply defensive programming.
- Use reusable macros.
- Document all derivations.
- Separate raw and cleaned
layers.
- Maintain audit trails.
- Use QC independence.
- Avoid hardcoded business
logic.
- Track lineage across
datasets.
- Validate impossible ranges.
- Use automated reconciliation
checks.
- Standardize missing value
handling.
- Archive production versions.
- Validate joins before
reporting.
Business Logic Section
Business
logic transformation is the backbone of analytical reliability. Missing values
are rarely “empty”; they usually represent operational failures, delayed
reporting, system outages, or incomplete workflows. In clinical trials, a
missing treatment date can invalidate patient exposure calculations and distort
survival analysis. In banking systems, missing transaction timestamps may
trigger fraud-detection failures.
Imputation
strategies are therefore critical. For example:
- missing observation dates
may be imputed using protocol-defined rules,
- negative population counts
may be converted using ABS(),
- impossible bird ages like
200 years are set to missing,
- inconsistent country labels
are standardized using UPCASE(),
- and corrupted text like
“NULL” is converted into meaningful categories.
Normalization
improves downstream analytics:
- dashboards aggregate correctly,
- PROC REPORT outputs align,
- machine learning models
avoid category fragmentation,
- and executive reports remain
trustworthy.
Without
normalization:
“india,” “INDIA,” and “India ” become three different categories.
Similarly,
malformed emails compromise communication workflows and audit readiness.
Business
logic transforms operational chaos into enterprise intelligence.
20 Sharp One-Line Insights
- Dirty data creates expensive
business mistakes.
- Validation logic is stronger
than visual inspection.
- Standardized variables
improve reproducibility.
- Duplicate IDs silently
destroy analytics.
- Missing dates corrupt
timelines.
- SAS excels in regulated
environments.
- R dominates exploratory
flexibility.
- Metadata governance prevents
production chaos.
- Audit trails protect
regulatory trust.
- PROC SQL simplifies
enterprise joins.
- DATA Step offers superior
row-level control.
- Character truncation is a
hidden SAS risk.
- Defensive programming
prevents future failures.
- Traceability is mandatory in
clinical trials.
- Normalized categories
improve AI performance.
- Missing values require
explicit handling.
- QC independence strengthens
compliance.
- Macros improve enterprise
scalability.
- Reusable code reduces
validation risk.
- Clean data powers reliable
decisions.
SAS vs R Comparison Summary
SAS and R
are both powerful enterprise data-engineering ecosystems, but their strengths
differ significantly.
SAS is
dominant in:
- clinical trials,
- pharmaceutical analytics,
- insurance reporting,
- and regulated environments.
Its
strengths include:
- reproducibility,
- auditability,
- stable metadata management,
- enterprise deployment
controls,
- and deterministic execution.
Features
like:
- PROC FORMAT,
- DATA Step processing,
- FIRST./LAST. logic,
- RETAIN statements,
- and macro-driven automation
make SAS
exceptionally strong for production pipelines.
R,
however, provides unmatched flexibility for:
- exploratory analytics,
- machine learning,
- advanced visualization,
- and rapid transformation
workflows.
Packages
like:
- dplyr,
- stringr,
- tidyr,
- lubridate,
- and purrr
allow
highly expressive data engineering.
In
enterprise environments, SAS is typically used for:
- validated pipelines,
- SDTM/ADaM generation,
- TLF production,
- and regulatory submissions.
R is
frequently integrated for:
- exploratory QC,
- advanced analytics,
- AI experimentation,
- and visualization.
The
strongest organizations combine both.
SAS
provides governance and compliance stability.
R
provides analytical innovation and flexibility.
Together,
they create scalable, trustworthy, production-grade intelligence ecosystems.
Conclusion
Modern
analytics systems cannot survive on raw operational data alone. Whether the
domain is clinical research, wildlife intelligence, banking, retail, or
insurance, corrupted datasets create dangerous analytical distortions that
silently damage business decisions.
This
project demonstrated how intentionally corrupted global bird datasets can be
transformed into reliable analytical intelligence using both SAS and R.
We
explored:
- duplicate handling,
- date standardization,
- text normalization,
- invalid value correction,
- metadata validation,
- email verification,
- PROC SQL joins,
- DATA Step engineering,
- arrays,
- macros,
- and tidyverse
transformations.
The most
important lesson is this:
Data
cleaning is not cosmetic work.
It is
enterprise risk management.
Incorrect
missing-value handling can:
- invalidate clinical trials,
- corrupt AI predictions,
- trigger financial losses,
- misclassify risk,
- and produce regulatory
rejection.
SAS
remains one of the most trusted enterprise-grade systems because of its:
- auditability,
- deterministic processing,
- validation frameworks,
- and production reliability.
R
complements SAS with:
- flexible transformations,
- modern data engineering,
- scalable exploratory
workflows,
- and advanced analytics.
Organizations
that combine both technologies create stronger analytical ecosystems.
Clean
data improves:
- dashboards,
- AI models,
- operational intelligence,
- executive reporting,
- and strategic
decision-making.
Ultimately,
reliable analytics begins long before machine learning or visualization.
It begins
with disciplined, structured, traceable data engineering.
That is
the true foundation of trustworthy enterprise intelligence.
Scenario-Based Interview
Questions & Answers
1. How would you identify duplicate records in SAS?
Answer:
I would
use PROC SORT NODUPKEY or PROC SQL GROUP BY HAVING COUNT(*)>1. In enterprise
clinical environments, duplicate patient IDs can distort subject counts and
analysis populations.
2. Why is missing-value handling dangerous in SAS?
Answer:
SAS
treats numeric missing values as smaller than valid numbers. Without explicit
validation, missing scores may incorrectly enter conditional logic and produce
analytical misclassification.
3. When would you choose PROC SQL over DATA Step?
Answer:
I prefer
PROC SQL for complex joins and aggregation. DATA Step is better for row-level
sequential logic, FIRST./LAST. processing, RETAIN statements, and performance
optimization.
4. How do you validate malformed emails in SAS and
R?
Answer:
In SAS, I
use FIND(), INDEX(), VERIFY(), or PRXMATCH(). In R, I use grepl() or stringr
regex functions for email pattern validation.
5. How do you ensure audit readiness in enterprise
cleaning workflows?
Answer:
I
maintain:
- traceable transformations,
- independent QC,
- metadata validation,
- standardized macros,
- version-controlled code,
- and reproducible derivation
logic aligned with SDTM/ADaM requirements.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent BIRD DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and Exams Reviewers and Observers
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment