453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Global AI Nations, Broken Data & Hidden Patterns: Mastering SCAN and SUBSTR in SAS for Real-World Data Rescue

1. Introduction

Imagine you’re working as a SAS programmer on a global analytics project analyzing AI adoption across countries. The dataset looks promising: countries, AI usage percentages, launch dates, sector details. But within minutes, reality hits:

  • “India” appears as india, INDIA, and even NULL
  • Dates look like 2023-01-01, 01/01/2023, and 20230101
  • AI sector column contains strings like "Healthcare-AI", "Finance_AI", "Edu AI"
  • Duplicate records quietly distort trends
  • Negative usage percentages exist (!)

Now imagine presenting insights from this dataset to stakeholders. One wrong number → flawed decision → business loss.

This is exactly why data cleaning is not optional; it’s foundational.

In both SAS and R, data cleaning transforms chaos into intelligence. And in SAS, functions like SCAN and SUBSTR are powerful tools to extract meaningful patterns from messy text.

2. Raw Data Creation in SAS and R (With Intentional Errors)

SAS Raw Dataset

DATA ai_raw;
  INFILE DATALINES TRUNCOVER;
  LENGTH line $200;
  INPUT line $200.;

  /* Parse using SCAN */
  ID       = INPUT(SCAN(line,1,' '),8.);
  Country  = SCAN(line,2,' ');
  AI_Usage = INPUT(SCAN(line,3,' '),8.);

  /* Handle Sector (can be 1 or 2 words) */
  word4 = SCAN(line,4,' ');
  word5 = SCAN(line,5,' ');

  /* Check if word5 is a date */
  IF ANYDIGIT(word5) THEN DO;
     Sector      = word4;
     Launch_Date = word5;
     Employees   = INPUT(SCAN(line,6,' '),8.);
     Revenue     = INPUT(SCAN(line,7,' '),8.);
     Rating      = INPUT(SCAN(line,8,' '),8.);
  END;
  ELSE DO;
     Sector      = CATX(' ',word4,word5);
     Launch_Date = SCAN(line,6,' ');
     Employees   = INPUT(SCAN(line,7,' '),8.);
     Revenue     = INPUT(SCAN(line,8,' '),8.);
     Rating      = INPUT(SCAN(line,9,' '),8.);
  END;

  DROP line word4 word5;
DATALINES;
1 india 75 Healthcare-AI 2023-01-01 5000 100000 4.5
2 USA -20 Finance_AI 01/15/2023 12000 500000 4.7
3 china . Edu AI 20230120 8000 300000 4.8
4 NULL 60 Retail-AI 2023/02/10 7000 . 4.2
5 Germany 85 Health_AI 2023-13-01 6000 200000 4.6
6 india 75 Healthcare-AI 2023-01-01 5000 100000 4.5
7 UK 55 NULL 2023-03-05 -4000 150000 4.3
8 france 65 Finance-AI 2023-02-30 4500 120000 4.4
9 Japan 90 Edu_AI 2023-01-25 10000 400000 4.9
10 brazil 50 Retail AI 2023-02-15 3000 80000 4.1
;
RUN;

PROC PRINT DATA=ai_raw;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating
  1   1  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  2   2  USA           -20  Finance_AI     01/15/2023       12000   500000     4.7
  3   3  china           .  Edu AI         20230120          8000   300000     4.8
  4   4  NULL           60  Retail-AI      2023/02/10        7000        .     4.2
  5   5  Germany        85  Health_AI      2023-13-01        6000   200000     4.6
  6   6  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  7   7  UK             55  NULL           2023-03-05       -4000   150000     4.3
  8   8  france         65  Finance-AI     2023-02-30        4500   120000     4.4
  9   9  Japan          90  Edu_AI         2023-01-25       10000   400000     4.9
 10  10  brazil         50  Retail AI      2023-02-15        3000    80000     4.1

Explanation

This dataset intentionally simulates real-world inconsistencies: missing values (.), invalid entries (negative AI usage, impossible dates like Feb 30), inconsistent delimiters (-, _, space), duplicate records (ID 1 & 6), and inconsistent casing. The Sector variable is particularly messy, which makes it ideal for demonstrating SCAN and SUBSTR functions. These inconsistencies mimic clinical trial raw datasets (SDTM source), where raw CRF data often contains unstandardized formats. Without cleaning, downstream ADaM datasets and TLF outputs would be unreliable.

Step 1: Read Full Line as Text

INFILE DATALINES TRUNCOVER;
LENGTH line $200;
INPUT line $200.;

What this does:

  • Reads entire row into one variable (line)
  • No column confusion, no shifting

Why important:

If you use list INPUT with multiple variables directly, SAS may assign values to the wrong columns when the layout is inconsistent.

👉 Here, you take full control

Step 2: Extract Basic Columns Using SCAN

ID = INPUT(SCAN(line,1,' '),8.);
Country = SCAN(line,2,' ');
AI_Usage = INPUT(SCAN(line,3,' '),8.);

What this does:

  • SCAN(line,1,' ') → picks 1st word → ID
  • SCAN(line,2,' ') → Country
  • SCAN(line,3,' ') → AI Usage

Why INPUT():

SCAN always returns a character value, but ID and AI_Usage must be numeric, so INPUT() converts the extracted text.

Step 3: Capture Sector Parts

word4 = SCAN(line,4,' ');
word5 = SCAN(line,5,' ');

What this does:

  • Reads possible sector values

Examples:

  • "Healthcare-AI" → word4 = Healthcare-AI
  • "Edu AI" → word4 = Edu, word5 = AI

Step 4: Smart Logic

IF ANYDIGIT(word5) THEN DO;

Meaning:

ANYDIGIT(word5) returns the position of the first digit in word5 (0 if there is none), so a nonzero result means word5 contains a digit.

👉 If YES → it is a date
👉 If NO → it is part of the sector name

Case 1: Sector is ONE word

Sector = word4;
Launch_Date = word5;
Employees = INPUT(SCAN(line,6,' '),8.);
Revenue = INPUT(SCAN(line,7,' '),8.);
Rating = INPUT(SCAN(line,8,' '),8.);

Example:

Healthcare-AI 2023-01-01

  • Sector = Healthcare-AI
  • Date = next column

Case 2: Sector is TWO words

Sector = CATX(' ',word4,word5);
Launch_Date = SCAN(line,6,' ');
Employees = INPUT(SCAN(line,7,' '),8.);
Revenue = INPUT(SCAN(line,8,' '),8.);
Rating = INPUT(SCAN(line,9,' '),8.);

Example:

Edu AI 20230120

  • Sector = "Edu AI"
  • Date shifts to next column

Step 5: Clean Temporary Variables

DROP line word4 word5;

Why:

  • These were only used for processing
  • Not needed in final dataset

Step 6: DATALINES (Raw Data)

DATALINES;
...
;

Contains:

  • Mixed formats
  • Missing values (.)
  • Multi-word sectors
  • Invalid values

👉 This simulates real-world messy data

Step 7: Print Dataset

PROC PRINT DATA=ai_raw;
RUN;

This displays the parsed dataset so you can verify the structure before cleaning begins.

What Problems This Code Solves

Problem                  Solution
Multi-word values        SCAN + conditional logic
Column shifting          Read the full line first
Mixed formats            Flexible parsing
Missing values           Handled safely
Real-world messy data    Fully controlled

Key Concepts You Should Remember

SCAN

  • Extracts word by position
  • Based on delimiter

INPUT

  • Converts text → numeric

ANYDIGIT

  • Detects if value contains numbers

CATX

  • Combines words with space
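The four functions above can be approximated in Python for readers coming from general-purpose languages (a rough sketch only; SAS's default delimiter list and blank-handling rules differ in edge cases):

```python
import re

def scan(s, n, delims=' '):
    # Rough analogue of SAS SCAN: split on any of the delimiter
    # characters, drop empty pieces, return the n-th word (1-based).
    words = [w for w in re.split('[' + re.escape(delims) + ']', s) if w]
    return words[n - 1] if 0 < n <= len(words) else ''

def anydigit(s):
    # Rough analogue of SAS ANYDIGIT: 1-based position of the first
    # digit, or 0 if the string contains no digit.
    for i, ch in enumerate(s, start=1):
        if ch.isdigit():
            return i
    return 0

def catx(sep, *parts):
    # Rough analogue of SAS CATX: strip each part, drop blanks,
    # join the rest with the separator.
    return sep.join(p.strip() for p in parts if p.strip())

print(scan('Healthcare-AI', 1, '-_ '))  # Healthcare
print(anydigit('2023-01-01'))           # 1
print(anydigit('AI'))                   # 0
print(catx(' ', 'Edu', 'AI'))           # Edu AI
```

INPUT has no single analogue here; in Python, float() with error handling plays the same role of converting extracted text to numbers.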

Interview Explanation

If interviewer asks:

Why did you use this approach?

You say:

Because raw data had inconsistent structure and multi-word fields, I avoided direct INPUT parsing. Instead, I read the entire row as a string and used SCAN with conditional logic to dynamically assign variables. This prevents column shifting and ensures accurate data extraction.

R Raw Dataset

ai_raw <- data.frame(
  ID = 1:10,
  Country = c("india","USA","china","NULL","Germany","india","UK",
              "france","Japan","brazil"),
  AI_Usage = c(75,-20,NA,60,85,75,55,65,90,50),
  Sector = c("Healthcare-AI","Finance_AI","Edu AI","Retail-AI",
             "Health_AI","Healthcare-AI","NULL","Finance-AI",
             "Edu_AI","Retail AI"),
  Launch_Date = c("2023-01-01","01/15/2023","20230120","2023/02/10",
                  "2023-13-01","2023-01-01","2023-03-05","2023-02-30",
                  "2023-01-25","2023-02-15"),
  Employees = c(5000,12000,8000,7000,6000,5000,-4000,4500,10000,3000),
  Revenue = c(100000,500000,300000,NA,200000,100000,150000,120000,
              400000,80000),
  Rating = c(4.5,4.7,4.8,4.2,4.6,4.5,4.3,4.4,4.9,4.1)
)

OUTPUT:

    ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating
1    1  india          75  Healthcare-AI  01-01-2023        5000   100000     4.5
2    2  USA           -20  Finance_AI     01/15/2023       12000   500000     4.7
3    3  china          NA  Edu AI         20230120          8000   300000     4.8
4    4  NULL           60  Retail-AI      10-02-2023        7000       NA     4.2
5    5  Germany        85  Health_AI      2023-13-01        6000   200000     4.6
6    6  india          75  Healthcare-AI  01-01-2023        5000   100000     4.5
7    7  UK             55  NULL           05-03-2023       -4000   150000     4.3
8    8  france         65  Finance-AI     2023-02-30        4500   120000     4.4
9    9  Japan          90  Edu_AI         25-01-2023       10000   400000     4.9
10  10  brazil         50  Retail AI      15-02-2023        3000    80000     4.1

Explanation

The R dataset mirrors the SAS data for cross-platform consistency. It includes NA, invalid values, duplicates, and inconsistent strings. In real-world analytics pipelines, teams often use SAS for regulatory submission and R for exploratory analysis, so maintaining identical raw datasets ensures reproducibility and validation across systems. This dataset will later be cleaned using dplyr functions, highlighting how R handles transformations differently but achieves the same objective: data integrity.

3. Phase 1: Data Cleaning in SAS

PROC SORT DATA=ai_raw NODUPKEY OUT=ai_nodup;
  BY ID;
RUN;

PROC PRINT DATA=ai_nodup;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating
  1   1  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  2   2  USA           -20  Finance_AI     01/15/2023       12000   500000     4.7
  3   3  china           .  Edu AI         20230120          8000   300000     4.8
  4   4  NULL           60  Retail-AI      2023/02/10        7000        .     4.2
  5   5  Germany        85  Health_AI      2023-13-01        6000   200000     4.6
  6   6  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  7   7  UK             55  NULL           2023-03-05       -4000   150000     4.3
  8   8  france         65  Finance-AI     2023-02-30        4500   120000     4.4
  9   9  Japan          90  Edu_AI         2023-01-25       10000   400000     4.9
 10  10  brazil         50  Retail AI      2023-02-15        3000    80000     4.1

DATA ai_clean;
  SET ai_nodup;

  /* Fix Country */
  IF Country = "NULL" OR MISSING(Country) THEN Country = "UNKNOWN";
  Country = UPCASE(Country);

  /* Fix AI Usage */
  IF AI_Usage < 0 THEN AI_Usage = ABS(AI_Usage);

  /* Fix Employees */
  IF Employees < 0 THEN Employees = .;

  /* Fix Sector */
  IF Sector = "NULL" OR MISSING(Sector) THEN Sector = "UNKNOWN";

  /* Standardize Sector: first word before -, _ or space */
  Sector_Main = SCAN(Sector,1,'-_ ');

  /* Sector code: first 4 characters of the main category */
  Sector_Code = SUBSTR(Sector_Main,1,4);

  /* -------- DATE FIX -------- */
  Launch_Date_fix = STRIP(Launch_Date);

  /* Pick an informat from the delimiter / length */
  IF INDEX(Launch_Date_fix,'-') THEN
      Launch_Date_Clean = INPUT(Launch_Date_fix,YYMMDD10.);
  ELSE IF INDEX(Launch_Date_fix,'/') THEN
      Launch_Date_Clean = INPUT(Launch_Date_fix,MMDDYY10.);
  ELSE IF LENGTH(Launch_Date_fix)=8 THEN
      Launch_Date_Clean = INPUT(Launch_Date_fix,YYMMDD8.);
  ELSE
      Launch_Date_Clean = .;

  /* -------- HANDLE INVALID / MISSING -------- */
  /* If still missing, impute a date (keep only ONE option) */
  IF Launch_Date_Clean = . THEN DO;
      /* Option 1: a default constant
         Launch_Date_Clean = '01JAN2023'd; */
      /* Option 2 (used here): today's date */
      Launch_Date_Clean = TODAY();
  END;

  FORMAT Launch_Date_Clean DATE9.;
  DROP Launch_Date_fix;
RUN;

PROC PRINT DATA=ai_clean;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating  Sector_Main  Sector_Code  Launch_Date_Clean
  1   1  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023
  2   2  USA            20  Finance_AI     01/15/2023       12000   500000     4.7  Finance      Fina         15JAN2023
  3   3  CHINA           .  Edu AI         20230120          8000   300000     4.8  Edu          Edu          20JAN2023
  4   4  UNKNOWN        60  Retail-AI      2023/02/10        7000        .     4.2  Retail       Reta         16APR2026
  5   5  GERMANY        85  Health_AI      2023-13-01        6000   200000     4.6  Health       Heal         16APR2026
  6   6  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023
  7   7  UK             55  UNKNOWN        2023-03-05           .   150000     4.3  UNKNOWN      UNKN         05MAR2023
  8   8  FRANCE         65  Finance-AI     2023-02-30        4500   120000     4.4  Finance      Fina         16APR2026
  9   9  JAPAN          90  Edu_AI         2023-01-25       10000   400000     4.9  Edu          Edu          25JAN2023
 10  10  BRAZIL         50  Retail AI      2023-02-15        3000    80000     4.1  Retail       Reta         15FEB2023

Explanation

This phase demonstrates core SAS cleaning techniques. PROC SORT NODUPKEY with BY ID removes records that share an ID. Note that the content duplicate (the india record repeated under IDs 1 and 6) survives, because its ID differs; deduplicating on content would require listing the content columns (or, e.g., BY _ALL_) in the BY statement. An IF/THEN check with MISSING() replaces missing or NULL country values with “UNKNOWN,” followed by UPCASE to standardize casing. Invalid numeric values (negative AI usage) are corrected using ABS, while impossible employee counts are set to missing.

The highlight is SCAN and SUBSTR:

  • SCAN(Sector,1,'-_ ') extracts the primary category (e.g., "Healthcare")
  • SUBSTR(Sector_Main,1,4) creates a short sector code for grouping

Date inconsistencies are handled by inspecting the delimiter (or the 8-digit length) and applying the matching informat (YYMMDD10., MMDDYY10., YYMMDD8.); SAS's ANYDTDTE. informat is the single-informat alternative for mixed date formats. Impossible dates such as 2023-13-01 and 2023-02-30 fail every informat and are imputed with TODAY() here, a demonstration shortcut; in production such values would be queried rather than silently replaced. This mirrors real SDTM transformation logic where raw text dates must be standardized.
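The delimiter-based dispatch in the DATA step above can also be sketched outside SAS; this Python version (the function name parse_launch_date is my own, not from the article's code) mirrors the same three branches and the fall-through to a missing value:

```python
from datetime import date, datetime

def parse_launch_date(raw):
    # Same dispatch as the DATA step: choose a parse pattern from the
    # delimiter (or the 8-digit length), and return None when the value
    # cannot be parsed, mirroring a missing SAS date.
    s = raw.strip()
    try:
        if '-' in s:
            return datetime.strptime(s, '%Y-%m-%d').date()
        if '/' in s:
            return datetime.strptime(s, '%m/%d/%Y').date()
        if len(s) == 8 and s.isdigit():
            return datetime.strptime(s, '%Y%m%d').date()
    except ValueError:
        pass
    return None

print(parse_launch_date('01/15/2023'))  # 2023-01-15
print(parse_launch_date('2023-13-01'))  # None (month 13 is invalid)
```

Note that '2023/02/10' also comes back as None, exactly as the SAS version fails it with MMDDYY10., which is why the fallback imputation step is needed.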

4. Phase 2: Data Cleaning in R

library(dplyr)

ai_clean <- ai_raw %>%
  distinct(ID, .keep_all = TRUE) %>%
  mutate(
    Country = toupper(ifelse(Country == "NULL" | is.na(Country),
                             "UNKNOWN", Country)),
    AI_Usage = ifelse(AI_Usage < 0, abs(AI_Usage), AI_Usage),
    Employees = ifelse(Employees < 0, NA, Employees),
    Sector_Main = sapply(strsplit(Sector, "[-_ ]"), `[`, 1),
    Sector_Code = substr(Sector, 1, 4),
    Launch_Date_Clean = as.Date(Launch_Date, format = "%Y-%m-%d")
  )

OUTPUT:

    ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating  Sector_Main  Sector_Code  Launch_Date_Clean
1    1  INDIA          75  Healthcare-AI  01-01-2023        5000   100000     4.5  Healthcare   Heal         01-01-2023
2    2  USA            20  Finance_AI     01/15/2023       12000   500000     4.7  Finance      Fina         NA
3    3  CHINA          NA  Edu AI         20230120          8000   300000     4.8  Edu          Edu          NA
4    4  UNKNOWN        60  Retail-AI      10-02-2023        7000       NA     4.2  Retail       Reta         NA
5    5  GERMANY        85  Health_AI      2023-13-01        6000   200000     4.6  Health       Heal         NA
6    6  INDIA          75  Healthcare-AI  01-01-2023        5000   100000     4.5  Healthcare   Heal         01-01-2023
7    7  UK             55  NULL           05-03-2023          NA   150000     4.3  NULL         NULL         05-03-2023
8    8  FRANCE         65  Finance-AI     2023-02-30        4500   120000     4.4  Finance      Fina         NA
9    9  JAPAN          90  Edu_AI         25-01-2023       10000   400000     4.9  Edu          Edu_         25-01-2023
10  10  BRAZIL         50  Retail AI      15-02-2023        3000    80000     4.1  Retail       Reta         15-02-2023
Explanation

R achieves similar results using dplyr. distinct() removes duplicate IDs, while mutate() performs the transformations. Missing values are handled using ifelse() and is.na(). String operations replicate the SAS logic:

  • strsplit(Sector, "[-_ ]") splits on the delimiter class, like SCAN
  • substr() mimics SUBSTR (note it is applied here to Sector rather than Sector_Main, which is why codes like "Edu_" retain the delimiter character)

Base R's as.Date() parses only the single format you pass it, so the slash-delimited and 8-digit dates come back as NA unless each format is handled explicitly (packages such as lubridate help here). This highlights a key difference: SAS informats make heterogeneous clinical date ingestion more robust out of the box, while R offers greater flexibility for exploratory transformations.

5. Phase 3: Additional SAS Processing Using Raw Data

DATA ai_enhanced;
  SET ai_clean;

  /* Derive Region */
  LENGTH Region $15;
  IF Country IN ("INDIA","CHINA","JAPAN") THEN Region="ASIA";
  ELSE IF Country IN ("USA","UK") THEN Region="WEST";
  ELSE IF Country IN ("BRAZIL") THEN Region="SOUTH AMERICA";
  ELSE IF Country = "UNKNOWN" THEN Region="UNKNOWN";
  ELSE Region="OTHER";

  /* Create Flag */
  IF Rating >= 4.5 THEN High_Performance="YES";
  ELSE High_Performance="NO";
RUN;

PROC PRINT DATA=ai_enhanced;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating  Sector_Main  Sector_Code  Launch_Date_Clean  Region         High_Performance
  1   1  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023          ASIA           YES
  2   2  USA            20  Finance_AI     01/15/2023       12000   500000     4.7  Finance      Fina         15JAN2023          WEST           YES
  3   3  CHINA           .  Edu AI         20230120          8000   300000     4.8  Edu          Edu          20JAN2023          ASIA           YES
  4   4  UNKNOWN        60  Retail-AI      2023/02/10        7000        .     4.2  Retail       Reta         16APR2026          UNKNOWN        NO
  5   5  GERMANY        85  Health_AI      2023-13-01        6000   200000     4.6  Health       Heal         16APR2026          OTHER          YES
  6   6  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023          ASIA           YES
  7   7  UK             55  UNKNOWN        2023-03-05           .   150000     4.3  UNKNOWN      UNKN         05MAR2023          WEST           NO
  8   8  FRANCE         65  Finance-AI     2023-02-30        4500   120000     4.4  Finance      Fina         16APR2026          OTHER          NO
  9   9  JAPAN          90  Edu_AI         2023-01-25       10000   400000     4.9  Edu          Edu          25JAN2023          ASIA           YES
 10  10  BRAZIL         50  Retail AI      2023-02-15        3000    80000     4.1  Retail       Reta         15FEB2023          SOUTH AMERICA  NO

Explanation

This phase introduces derivations similar to ADaM dataset creation. Business logic is applied to derive Region and performance flags. Such transformations are critical in clinical trials where derived variables (e.g., treatment flags, population flags) drive analysis datasets. This step shows how cleaned data becomes analytics-ready.

6. 20 Additional Data Cleaning Best Practices

  1. Always validate raw data against source (CRF)
  2. Maintain audit trail for every transformation
  3. Use standardized formats (ISO dates)
  4. Apply controlled terminology (CDISC)
  5. Validate ranges (age, lab values)
  6. Remove duplicates carefully
  7. Use macros for reusable cleaning
  8. Log all warnings/errors
  9. Handle NULL explicitly
  10. Standardize case sensitivity
  11. Use validation datasets
  12. Cross-check SDTM vs ADaM
  13. Perform QC programming
  14. Maintain metadata documentation
  15. Use PROC COMPARE for validation
  16. Avoid hardcoding values
  17. Ensure reproducibility
  18. Document derivations
  19. Use version control
  20. Follow regulatory compliance (FDA)

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic; it directly impacts decisions.

  • Missing country → replaced with UNKNOWN → avoids data loss
  • Negative AI usage → corrected → prevents incorrect averages
  • Date standardization → enables time-series analysis
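To see how a single sign error skews an aggregate, here is a tiny illustration (Python, with hypothetical usage values):

```python
# Hypothetical usage percentages with one sign error (-20 should be 20).
usage = [75, -20, 60, 85]
print(sum(usage) / len(usage))   # 50.0 -- the average is dragged down

fixed = [abs(x) for x in usage]  # the same fix as ABS() in the SAS step
print(sum(fixed) / len(fixed))   # 60.0 -- corrected average
```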

In clinical trials:

  • Wrong patient age → incorrect efficacy grouping
  • Missing treatment date → invalid survival analysis

Thus, cleaning ensures data integrity, accuracy, and trust.

8. 20 Key Insights

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing values distort analytics
  • Duplicate data inflates metrics
  • Text inconsistency breaks joins
  • Dates drive timelines
  • Validation is non-negotiable
  • Cleaning is iterative
  • Automation reduces errors
  • Audit trails build trust
  • SAS excels in structured cleaning
  • R excels in flexibility
  • Metadata is critical
  • Regulatory compliance is mandatory
  • Transformation logic must be transparent
  • Always question raw data
  • Edge cases matter most
  • QC is as important as coding
  • Clean data = reliable insights
  • Garbage in = garbage out

9. Summary (SAS vs R)

Feature           SAS                 R
Data Handling     Strong (clinical)   Flexible
Date Parsing      Excellent           Moderate
String Functions  Powerful (SCAN)     Flexible
Regulatory Use    Industry standard   Limited

10. Conclusion

Data cleaning is where real analytics begins. Functions like SCAN and SUBSTR may seem simple, but in practice, they unlock hidden structure in messy datasets. Whether you're building SDTM domains or business dashboards, mastering these tools ensures your outputs are accurate, compliant, and decision-ready.

11. Interview Questions

Q1: How do you extract domain from “AE-SEVERE”?
👉 Use SCAN(var,1,'-')

Q2: How to fix negative values in SAS?
👉 IF var < 0 THEN var = ABS(var);

Q3: Difference between SCAN and SUBSTR?
👉 SCAN = delimiter-based extraction
👉 SUBSTR = position-based extraction
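A quick Python analogue of the distinction (illustrative only, not SAS syntax):

```python
s = 'AE-SEVERE'

# Delimiter-based, like SCAN: the first '-'-separated word.
print(s.split('-')[0])  # AE

# Position-based, like SUBSTR(s,4,6): start at position 4 (1-based),
# take 6 characters.
print(s[3:9])           # SEVERE
```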

Q4: How to remove duplicates?
👉 PROC SORT NODUPKEY

Q5: R equivalent of SCAN?
👉 strsplit()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only; the AI adoption figures are synthetic, not real-world statistics.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Dead Mobile Brands to Clean Data Mastery Using INPUT and PUT in SAS

2.Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?

3.Why Even Small Data Errors Can Collapse Enterprise Analytics

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
