453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS

Global AI Nations, Broken Data & Hidden Patterns: Mastering SCAN and SUBSTR in SAS for Real-World Data Rescue

1. Introduction

Imagine you’re working as a SAS programmer on a global analytics project analyzing AI adoption across countries. The dataset looks promising: countries, AI usage percentages, launch dates, sector details. But within minutes, reality hits:

  • “India” appears as india, INDIA, and even NULL
  • Dates look like 2023-01-01, 01/01/2023, and 20230101
  • AI sector column contains strings like "Healthcare-AI", "Finance_AI", "Edu AI"
  • Duplicate records quietly distort trends
  • Negative usage percentages exist (!)

Now imagine presenting insights from this dataset to stakeholders. One wrong number → flawed decision → business loss.

This is exactly why data cleaning is not optional; it’s foundational.

In both SAS and R, data cleaning transforms chaos into intelligence. And in SAS, functions like SCAN and SUBSTR are powerful tools to extract meaningful patterns from messy text.

2. Raw Data Creation in SAS and R (With Intentional Errors)

SAS Raw Dataset

DATA ai_raw;
  INFILE DATALINES TRUNCOVER;
  LENGTH line $200;
  INPUT line $200.;

  /* Parse using SCAN */
  ID       = INPUT(SCAN(line,1,' '),8.);
  Country  = SCAN(line,2,' ');
  AI_Usage = INPUT(SCAN(line,3,' '),8.);

  /* Handle Sector (can be 1 or 2 words) */
  word4 = SCAN(line,4,' ');
  word5 = SCAN(line,5,' ');

  /* Check if word5 is a date */
  IF ANYDIGIT(word5) THEN DO;
     Sector      = word4;
     Launch_Date = word5;
     Employees   = INPUT(SCAN(line,6,' '),8.);
     Revenue     = INPUT(SCAN(line,7,' '),8.);
     Rating      = INPUT(SCAN(line,8,' '),8.);
  END;
  ELSE DO;
     Sector      = CATX(' ',word4,word5);
     Launch_Date = SCAN(line,6,' ');
     Employees   = INPUT(SCAN(line,7,' '),8.);
     Revenue     = INPUT(SCAN(line,8,' '),8.);
     Rating      = INPUT(SCAN(line,9,' '),8.);
  END;

  DROP line word4 word5;
DATALINES;
1 india 75 Healthcare-AI 2023-01-01 5000 100000 4.5
2 USA -20 Finance_AI 01/15/2023 12000 500000 4.7
3 china . Edu AI 20230120 8000 300000 4.8
4 NULL 60 Retail-AI 2023/02/10 7000 . 4.2
5 Germany 85 Health_AI 2023-13-01 6000 200000 4.6
6 india 75 Healthcare-AI 2023-01-01 5000 100000 4.5
7 UK 55 NULL 2023-03-05 -4000 150000 4.3
8 france 65 Finance-AI 2023-02-30 4500 120000 4.4
9 Japan 90 Edu_AI 2023-01-25 10000 400000 4.9
10 brazil 50 Retail AI 2023-02-15 3000 80000 4.1
;
RUN;

PROC PRINT DATA=ai_raw;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating
  1   1  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  2   2  USA           -20  Finance_AI     01/15/2023       12000   500000     4.7
  3   3  china           .  Edu AI         20230120          8000   300000     4.8
  4   4  NULL           60  Retail-AI      2023/02/10        7000        .     4.2
  5   5  Germany        85  Health_AI      2023-13-01        6000   200000     4.6
  6   6  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  7   7  UK             55  NULL           2023-03-05       -4000   150000     4.3
  8   8  france         65  Finance-AI     2023-02-30        4500   120000     4.4
  9   9  Japan          90  Edu_AI         2023-01-25       10000   400000     4.9
 10  10  brazil         50  Retail AI      2023-02-15        3000    80000     4.1

Explanation

This dataset intentionally simulates real-world inconsistencies: missing values (.), invalid entries (negative AI usage, impossible dates like Feb 30), inconsistent delimiters (-, _, space), duplicate records (ID 1 & 6), and inconsistent casing. The Sector variable is particularly messy, which makes it ideal for demonstrating SCAN and SUBSTR functions. These inconsistencies mimic clinical trial raw datasets (SDTM source), where raw CRF data often contains unstandardized formats. Without cleaning, downstream ADaM datasets and TLF outputs would be unreliable.

Step 1: Read Full Line as Text

INFILE DATALINES TRUNCOVER;
LENGTH line $200;
INPUT line $200.;

What this does:

  • Reads entire row into one variable (line)
  • No column confusion, no shifting

Why important:

If you use list INPUT with multiple variables directly, SAS may assign values to the wrong columns when the layout is inconsistent.

👉 Here, you take full control

Step 2: Extract Basic Columns Using SCAN

ID = INPUT(SCAN(line,1,' '),8.);
Country = SCAN(line,2,' ');
AI_Usage = INPUT(SCAN(line,3,' '),8.);

What this does:

  • SCAN(line,1,' ') → picks 1st word → ID
  • SCAN(line,2,' ') → Country
  • SCAN(line,3,' ') → AI Usage

Why INPUT():

SCAN always returns a character value, but ID and AI_Usage must be numeric, so INPUT() converts the extracted text.

Step 3: Capture Sector Parts

word4 = SCAN(line,4,' ');
word5 = SCAN(line,5,' ');

What this does:

  • Reads possible sector values

Examples:

  • "Healthcare-AI" → word4 = Healthcare-AI
  • "Edu AI" → word4 = Edu, word5 = AI

Step 4: Smart Logic

IF ANYDIGIT(word5) THEN DO;

Meaning:

ANYDIGIT(word5) returns the position of the first digit in word5 (0 if there is none), so a nonzero result means word5 contains a digit.

👉 If YES → it is a date
👉 If NO → it is part of the sector name

Case 1: Sector is ONE word

Sector = word4;
Launch_Date = word5;
Employees = INPUT(SCAN(line,6,' '),8.);
Revenue = INPUT(SCAN(line,7,' '),8.);
Rating = INPUT(SCAN(line,8,' '),8.);

Example:

Healthcare-AI 2023-01-01

  • Sector = Healthcare-AI
  • Date = next column

Case 2: Sector is TWO words

Sector = CATX(' ',word4,word5);
Launch_Date = SCAN(line,6,' ');
Employees = INPUT(SCAN(line,7,' '),8.);
Revenue = INPUT(SCAN(line,8,' '),8.);
Rating = INPUT(SCAN(line,9,' '),8.);

Example:

Edu AI 20230120

  • Sector = "Edu AI"
  • Date shifts to next column

Step 5: Clean Temporary Variables

DROP line word4 word5;

Why:

  • These were only used for processing
  • Not needed in final dataset

Step 6: DATALINES (Raw Data)

DATALINES;
...
;

Contains:

  • Mixed formats
  • Missing values (.)
  • Multi-word sectors
  • Invalid values

👉 This simulates real-world messy data

Step 7: Print Dataset

PROC PRINT DATA=ai_raw;
RUN;

This displays the parsed dataset so you can verify the structure before cleaning begins.

What Problems This Code Solves

Problem                  Solution
Multi-word values        SCAN + conditional logic
Column shifting          Read the full line first
Mixed formats            Flexible parsing
Missing values           Handled safely
Real-world messy data    Fully controlled

Key Concepts You Should Remember

SCAN

  • Extracts word by position
  • Based on delimiter

INPUT

  • Converts text → numeric

ANYDIGIT

  • Detects if value contains numbers

CATX

  • Combines words with space
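The four functions above can be approximated in Python for readers coming from general-purpose languages (a rough sketch only; SAS's default delimiter list and blank-handling rules differ in edge cases):

```python
import re

def scan(s, n, delims=' '):
    # Rough analogue of SAS SCAN: split on any of the delimiter
    # characters, drop empty pieces, return the n-th word (1-based).
    words = [w for w in re.split('[' + re.escape(delims) + ']', s) if w]
    return words[n - 1] if 0 < n <= len(words) else ''

def anydigit(s):
    # Rough analogue of SAS ANYDIGIT: 1-based position of the first
    # digit, or 0 if the string contains no digit.
    for i, ch in enumerate(s, start=1):
        if ch.isdigit():
            return i
    return 0

def catx(sep, *parts):
    # Rough analogue of SAS CATX: strip each part, drop blanks,
    # join the rest with the separator.
    return sep.join(p.strip() for p in parts if p.strip())

print(scan('Healthcare-AI', 1, '-_ '))  # Healthcare
print(anydigit('2023-01-01'))           # 1
print(anydigit('AI'))                   # 0
print(catx(' ', 'Edu', 'AI'))           # Edu AI
```

INPUT has no single analogue here; in Python, float() with error handling plays the same role of converting extracted text to numbers.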

Interview Explanation

If interviewer asks:

Why did you use this approach?

You say:

Because raw data had inconsistent structure and multi-word fields, I avoided direct INPUT parsing. Instead, I read the entire row as a string and used SCAN with conditional logic to dynamically assign variables. This prevents column shifting and ensures accurate data extraction.

R Raw Dataset

ai_raw <- data.frame(
  ID = 1:10,
  Country = c("india","USA","china","NULL","Germany","india","UK",
              "france","Japan","brazil"),
  AI_Usage = c(75,-20,NA,60,85,75,55,65,90,50),
  Sector = c("Healthcare-AI","Finance_AI","Edu AI","Retail-AI",
             "Health_AI","Healthcare-AI","NULL","Finance-AI",
             "Edu_AI","Retail AI"),
  Launch_Date = c("2023-01-01","01/15/2023","20230120","2023/02/10",
                  "2023-13-01","2023-01-01","2023-03-05","2023-02-30",
                  "2023-01-25","2023-02-15"),
  Employees = c(5000,12000,8000,7000,6000,5000,-4000,4500,10000,3000),
  Revenue = c(100000,500000,300000,NA,200000,100000,150000,120000,
              400000,80000),
  Rating = c(4.5,4.7,4.8,4.2,4.6,4.5,4.3,4.4,4.9,4.1)
)

OUTPUT:

    ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating
1    1  india          75  Healthcare-AI  01-01-2023        5000   100000     4.5
2    2  USA           -20  Finance_AI     01/15/2023       12000   500000     4.7
3    3  china          NA  Edu AI         20230120          8000   300000     4.8
4    4  NULL           60  Retail-AI      10-02-2023        7000       NA     4.2
5    5  Germany        85  Health_AI      2023-13-01        6000   200000     4.6
6    6  india          75  Healthcare-AI  01-01-2023        5000   100000     4.5
7    7  UK             55  NULL           05-03-2023       -4000   150000     4.3
8    8  france         65  Finance-AI     2023-02-30        4500   120000     4.4
9    9  Japan          90  Edu_AI         25-01-2023       10000   400000     4.9
10  10  brazil         50  Retail AI      15-02-2023        3000    80000     4.1

Explanation

The R dataset mirrors the SAS data for cross-platform consistency. It includes NA, invalid values, duplicates, and inconsistent strings. In real-world analytics pipelines, teams often use SAS for regulatory submission and R for exploratory analysis, so maintaining identical raw datasets ensures reproducibility and validation across systems. This dataset will later be cleaned using dplyr functions, highlighting how R handles transformations differently but achieves the same objective: data integrity.

3. Phase 1: Data Cleaning in SAS

PROC SORT DATA=ai_raw NODUPKEY OUT=ai_nodup;
  BY ID;
RUN;

PROC PRINT DATA=ai_nodup;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating
  1   1  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  2   2  USA           -20  Finance_AI     01/15/2023       12000   500000     4.7
  3   3  china           .  Edu AI         20230120          8000   300000     4.8
  4   4  NULL           60  Retail-AI      2023/02/10        7000        .     4.2
  5   5  Germany        85  Health_AI      2023-13-01        6000   200000     4.6
  6   6  india          75  Healthcare-AI  2023-01-01        5000   100000     4.5
  7   7  UK             55  NULL           2023-03-05       -4000   150000     4.3
  8   8  france         65  Finance-AI     2023-02-30        4500   120000     4.4
  9   9  Japan          90  Edu_AI         2023-01-25       10000   400000     4.9
 10  10  brazil         50  Retail AI      2023-02-15        3000    80000     4.1

DATA ai_clean;
  SET ai_nodup;

  /* Fix Country */
  IF Country = "NULL" OR MISSING(Country) THEN Country = "UNKNOWN";
  Country = UPCASE(Country);

  /* Fix AI Usage */
  IF AI_Usage < 0 THEN AI_Usage = ABS(AI_Usage);

  /* Fix Employees */
  IF Employees < 0 THEN Employees = .;

  /* Fix Sector */
  IF Sector = "NULL" OR MISSING(Sector) THEN Sector = "UNKNOWN";

  /* Standardize Sector: first word before -, _ or space */
  Sector_Main = SCAN(Sector,1,'-_ ');

  /* Sector code: first 4 characters of the main category */
  Sector_Code = SUBSTR(Sector_Main,1,4);

  /* -------- DATE FIX -------- */
  Launch_Date_fix = STRIP(Launch_Date);

  /* Pick an informat from the delimiter / length */
  IF INDEX(Launch_Date_fix,'-') THEN
      Launch_Date_Clean = INPUT(Launch_Date_fix,YYMMDD10.);
  ELSE IF INDEX(Launch_Date_fix,'/') THEN
      Launch_Date_Clean = INPUT(Launch_Date_fix,MMDDYY10.);
  ELSE IF LENGTH(Launch_Date_fix)=8 THEN
      Launch_Date_Clean = INPUT(Launch_Date_fix,YYMMDD8.);
  ELSE
      Launch_Date_Clean = .;

  /* -------- HANDLE INVALID / MISSING -------- */
  /* If still missing, impute a date (keep only ONE option) */
  IF Launch_Date_Clean = . THEN DO;
      /* Option 1: a default constant
         Launch_Date_Clean = '01JAN2023'd; */
      /* Option 2 (used here): today's date */
      Launch_Date_Clean = TODAY();
  END;

  FORMAT Launch_Date_Clean DATE9.;
  DROP Launch_Date_fix;
RUN;

PROC PRINT DATA=ai_clean;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating  Sector_Main  Sector_Code  Launch_Date_Clean
  1   1  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023
  2   2  USA            20  Finance_AI     01/15/2023       12000   500000     4.7  Finance      Fina         15JAN2023
  3   3  CHINA           .  Edu AI         20230120          8000   300000     4.8  Edu          Edu          20JAN2023
  4   4  UNKNOWN        60  Retail-AI      2023/02/10        7000        .     4.2  Retail       Reta         16APR2026
  5   5  GERMANY        85  Health_AI      2023-13-01        6000   200000     4.6  Health       Heal         16APR2026
  6   6  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023
  7   7  UK             55  UNKNOWN        2023-03-05           .   150000     4.3  UNKNOWN      UNKN         05MAR2023
  8   8  FRANCE         65  Finance-AI     2023-02-30        4500   120000     4.4  Finance      Fina         16APR2026
  9   9  JAPAN          90  Edu_AI         2023-01-25       10000   400000     4.9  Edu          Edu          25JAN2023
 10  10  BRAZIL         50  Retail AI      2023-02-15        3000    80000     4.1  Retail       Reta         15FEB2023

Explanation

This phase demonstrates core SAS cleaning techniques. PROC SORT NODUPKEY with BY ID removes records that share an ID. Note that the content duplicate (the india record repeated under IDs 1 and 6) survives, because its ID differs; deduplicating on content would require listing the content columns (or, e.g., BY _ALL_) in the BY statement. An IF/THEN check with MISSING() replaces missing or NULL country values with “UNKNOWN,” followed by UPCASE to standardize casing. Invalid numeric values (negative AI usage) are corrected using ABS, while impossible employee counts are set to missing.

The highlight is SCAN and SUBSTR:

  • SCAN(Sector,1,'-_ ') extracts the primary category (e.g., "Healthcare")
  • SUBSTR(Sector_Main,1,4) creates a short sector code for grouping

Date inconsistencies are handled by inspecting the delimiter (or the 8-digit length) and applying the matching informat (YYMMDD10., MMDDYY10., YYMMDD8.); SAS's ANYDTDTE. informat is the single-informat alternative for mixed date formats. Impossible dates such as 2023-13-01 and 2023-02-30 fail every informat and are imputed with TODAY() here, a demonstration shortcut; in production such values would be queried rather than silently replaced. This mirrors real SDTM transformation logic where raw text dates must be standardized.
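The delimiter-based dispatch in the DATA step above can also be sketched outside SAS; this Python version (the function name parse_launch_date is my own, not from the article's code) mirrors the same three branches and the fall-through to a missing value:

```python
from datetime import date, datetime

def parse_launch_date(raw):
    # Same dispatch as the DATA step: choose a parse pattern from the
    # delimiter (or the 8-digit length), and return None when the value
    # cannot be parsed, mirroring a missing SAS date.
    s = raw.strip()
    try:
        if '-' in s:
            return datetime.strptime(s, '%Y-%m-%d').date()
        if '/' in s:
            return datetime.strptime(s, '%m/%d/%Y').date()
        if len(s) == 8 and s.isdigit():
            return datetime.strptime(s, '%Y%m%d').date()
    except ValueError:
        pass
    return None

print(parse_launch_date('01/15/2023'))  # 2023-01-15
print(parse_launch_date('2023-13-01'))  # None (month 13 is invalid)
```

Note that '2023/02/10' also comes back as None, exactly as the SAS version fails it with MMDDYY10., which is why the fallback imputation step is needed.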

4. Phase 2: Data Cleaning in R

library(dplyr)

ai_clean <- ai_raw %>%
  distinct(ID, .keep_all = TRUE) %>%
  mutate(
    Country = toupper(ifelse(Country == "NULL" | is.na(Country),
                             "UNKNOWN", Country)),
    AI_Usage = ifelse(AI_Usage < 0, abs(AI_Usage), AI_Usage),
    Employees = ifelse(Employees < 0, NA, Employees),
    Sector_Main = sapply(strsplit(Sector, "[-_ ]"), `[`, 1),
    Sector_Code = substr(Sector, 1, 4),
    Launch_Date_Clean = as.Date(Launch_Date, format = "%Y-%m-%d")
  )

OUTPUT:

    ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating  Sector_Main  Sector_Code  Launch_Date_Clean
1    1  INDIA          75  Healthcare-AI  01-01-2023        5000   100000     4.5  Healthcare   Heal         01-01-2023
2    2  USA            20  Finance_AI     01/15/2023       12000   500000     4.7  Finance      Fina         NA
3    3  CHINA          NA  Edu AI         20230120          8000   300000     4.8  Edu          Edu          NA
4    4  UNKNOWN        60  Retail-AI      10-02-2023        7000       NA     4.2  Retail       Reta         NA
5    5  GERMANY        85  Health_AI      2023-13-01        6000   200000     4.6  Health       Heal         NA
6    6  INDIA          75  Healthcare-AI  01-01-2023        5000   100000     4.5  Healthcare   Heal         01-01-2023
7    7  UK             55  NULL           05-03-2023          NA   150000     4.3  NULL         NULL         05-03-2023
8    8  FRANCE         65  Finance-AI     2023-02-30        4500   120000     4.4  Finance      Fina         NA
9    9  JAPAN          90  Edu_AI         25-01-2023       10000   400000     4.9  Edu          Edu_         25-01-2023
10  10  BRAZIL         50  Retail AI      15-02-2023        3000    80000     4.1  Retail       Reta         15-02-2023
Explanation

R achieves similar results using dplyr. distinct() removes duplicate IDs, while mutate() performs the transformations. Missing values are handled using ifelse() and is.na(). String operations replicate the SAS logic:

  • strsplit(Sector, "[-_ ]") splits on the delimiter class, like SCAN
  • substr() mimics SUBSTR (note it is applied here to Sector rather than Sector_Main, which is why codes like "Edu_" retain the delimiter character)

Base R's as.Date() parses only the single format you pass it, so the slash-delimited and 8-digit dates come back as NA unless each format is handled explicitly (packages such as lubridate help here). This highlights a key difference: SAS informats make heterogeneous clinical date ingestion more robust out of the box, while R offers greater flexibility for exploratory transformations.

5. Phase 3: Additional SAS Processing Using Raw Data

DATA ai_enhanced;
  SET ai_clean;

  /* Derive Region */
  LENGTH Region $15;
  IF Country IN ("INDIA","CHINA","JAPAN") THEN Region="ASIA";
  ELSE IF Country IN ("USA","UK") THEN Region="WEST";
  ELSE IF Country IN ("BRAZIL") THEN Region="SOUTH AMERICA";
  ELSE IF Country = "UNKNOWN" THEN Region="UNKNOWN";
  ELSE Region="OTHER";

  /* Create Flag */
  IF Rating >= 4.5 THEN High_Performance="YES";
  ELSE High_Performance="NO";
RUN;

PROC PRINT DATA=ai_enhanced;
RUN;

OUTPUT:

Obs  ID  Country  AI_Usage  Sector         Launch_Date  Employees  Revenue  Rating  Sector_Main  Sector_Code  Launch_Date_Clean  Region         High_Performance
  1   1  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023          ASIA           YES
  2   2  USA            20  Finance_AI     01/15/2023       12000   500000     4.7  Finance      Fina         15JAN2023          WEST           YES
  3   3  CHINA           .  Edu AI         20230120          8000   300000     4.8  Edu          Edu          20JAN2023          ASIA           YES
  4   4  UNKNOWN        60  Retail-AI      2023/02/10        7000        .     4.2  Retail       Reta         16APR2026          UNKNOWN        NO
  5   5  GERMANY        85  Health_AI      2023-13-01        6000   200000     4.6  Health       Heal         16APR2026          OTHER          YES
  6   6  INDIA          75  Healthcare-AI  2023-01-01        5000   100000     4.5  Healthcare   Heal         01JAN2023          ASIA           YES
  7   7  UK             55  UNKNOWN        2023-03-05           .   150000     4.3  UNKNOWN      UNKN         05MAR2023          WEST           NO
  8   8  FRANCE         65  Finance-AI     2023-02-30        4500   120000     4.4  Finance      Fina         16APR2026          OTHER          NO
  9   9  JAPAN          90  Edu_AI         2023-01-25       10000   400000     4.9  Edu          Edu          25JAN2023          ASIA           YES
 10  10  BRAZIL         50  Retail AI      2023-02-15        3000    80000     4.1  Retail       Reta         15FEB2023          SOUTH AMERICA  NO

Explanation

This phase introduces derivations similar to ADaM dataset creation. Business logic is applied to derive Region and performance flags. Such transformations are critical in clinical trials where derived variables (e.g., treatment flags, population flags) drive analysis datasets. This step shows how cleaned data becomes analytics-ready.

6. 20 Additional Data Cleaning Best Practices

  1. Always validate raw data against source (CRF)
  2. Maintain audit trail for every transformation
  3. Use standardized formats (ISO dates)
  4. Apply controlled terminology (CDISC)
  5. Validate ranges (age, lab values)
  6. Remove duplicates carefully
  7. Use macros for reusable cleaning
  8. Log all warnings/errors
  9. Handle NULL explicitly
  10. Standardize case sensitivity
  11. Use validation datasets
  12. Cross-check SDTM vs ADaM
  13. Perform QC programming
  14. Maintain metadata documentation
  15. Use PROC COMPARE for validation
  16. Avoid hardcoding values
  17. Ensure reproducibility
  18. Document derivations
  19. Use version control
  20. Follow regulatory compliance (FDA)

7. Business Logic Behind Data Cleaning

Data cleaning is not cosmetic; it directly impacts decisions.

  • Missing country → replaced with UNKNOWN → avoids data loss
  • Negative AI usage → corrected → prevents incorrect averages
  • Date standardization → enables time-series analysis
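To see how a single sign error skews an aggregate, here is a tiny illustration (Python, with hypothetical usage values):

```python
# Hypothetical usage percentages with one sign error (-20 should be 20).
usage = [75, -20, 60, 85]
print(sum(usage) / len(usage))   # 50.0 -- the average is dragged down

fixed = [abs(x) for x in usage]  # the same fix as ABS() in the SAS step
print(sum(fixed) / len(fixed))   # 60.0 -- corrected average
```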

In clinical trials:

  • Wrong patient age → incorrect efficacy grouping
  • Missing treatment date → invalid survival analysis

Thus, cleaning ensures data integrity, accuracy, and trust.

8. 20 Key Insights

  • Dirty data leads to wrong conclusions
  • Standardization ensures reproducibility
  • Missing values distort analytics
  • Duplicate data inflates metrics
  • Text inconsistency breaks joins
  • Dates drive timelines
  • Validation is non-negotiable
  • Cleaning is iterative
  • Automation reduces errors
  • Audit trails build trust
  • SAS excels in structured cleaning
  • R excels in flexibility
  • Metadata is critical
  • Regulatory compliance is mandatory
  • Transformation logic must be transparent
  • Always question raw data
  • Edge cases matter most
  • QC is as important as coding
  • Clean data = reliable insights
  • Garbage in = garbage out

9. Summary (SAS vs R)

Feature           SAS                 R
Data Handling     Strong (clinical)   Flexible
Date Parsing      Excellent           Moderate
String Functions  Powerful (SCAN)     Flexible
Regulatory Use    Industry standard   Limited

10. Conclusion

Data cleaning is where real analytics begins. Functions like SCAN and SUBSTR may seem simple, but in practice, they unlock hidden structure in messy datasets. Whether you're building SDTM domains or business dashboards, mastering these tools ensures your outputs are accurate, compliant, and decision-ready.

11. Interview Questions

Q1: How do you extract domain from “AE-SEVERE”?
👉 Use SCAN(var,1,'-')

Q2: How to fix negative values in SAS?
👉 IF var < 0 THEN var = ABS(var);

Q3: Difference between SCAN and SUBSTR?
👉 SCAN = delimiter-based extraction
👉 SUBSTR = position-based extraction
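A quick Python analogue of the distinction (illustrative only, not SAS syntax):

```python
s = 'AE-SEVERE'

# Delimiter-based, like SCAN: the first '-'-separated word.
print(s.split('-')[0])  # AE

# Position-based, like SUBSTR(s,4,6): start at position 4 (1-based),
# take 6 characters.
print(s[3:9])           # SEVERE
```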

Q4: How to remove duplicates?
👉 PROC SORT NODUPKEY

Q5: R equivalent of SCAN?
👉 strsplit()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only; the AI adoption figures are synthetic, not real-world statistics.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

  • Students learning SAS
  • Data analysts building portfolios
  • Professionals preparing for SAS interviews
  • Bloggers writing about analytics and smart cities

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.From Dead Mobile Brands to Clean Data Mastery Using INPUT and PUT in SAS

2.Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?

3.Why Even Small Data Errors Can Collapse Enterprise Analytics

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
