453.Global AI Trends Unlocked Through SCAN and SUBSTR Precision in SAS
Global AI Nations, Broken Data & Hidden Patterns: Mastering SCAN and SUBSTR in SAS for Real-World Data Rescue
1. Introduction
Imagine
you’re working as a SAS programmer on a global analytics project analyzing AI
adoption across countries. The dataset looks promising countries, AI usage
percentages, launch dates, sector details but within minutes, reality hits:
- “India” appears as india, INDIA,
and even NULL
- Dates look like 2023-01-01, 01/01/2023,
and 20230101
- AI sector column contains
strings like "Healthcare-AI", "Finance_AI", "Edu
AI"
- Duplicate records quietly
distort trends
- Negative usage percentages
exist (!)
Now
imagine presenting insights from this dataset to stakeholders. One wrong number
→ flawed decision → business loss.
This is
exactly why data cleaning is not optional it’s foundational.
In both SAS
and R, data cleaning transforms chaos into intelligence. And in SAS,
functions like SCAN and SUBSTR are powerful tools to extract
meaningful patterns from messy text.
2. Raw Data Creation in SAS and R (With Intentional Errors)
SAS Raw Dataset
DATA ai_raw;
INFILE DATALINES TRUNCOVER;
LENGTH line $200;
INPUT line $200.;
/* Parse using SCAN */
ID = INPUT(SCAN(line,1,' '),8.);
Country = SCAN(line,2,' ');
AI_Usage = INPUT(SCAN(line,3,' '),8.);
/* Handle Sector (can be 1 or 2 words) */
word4 = SCAN(line,4,' ');
word5 = SCAN(line,5,' ');
/* Check if word5 is date */
IF ANYDIGIT(word5) THEN DO;
Sector = word4;
Launch_Date = word5;
Employees = INPUT(SCAN(line,6,' '),8.);
Revenue = INPUT(SCAN(line,7,' '),8.);
Rating = INPUT(SCAN(line,8,' '),8.);
END;
ELSE DO;
Sector = CATX(' ',word4,word5);
Launch_Date = SCAN(line,6,' ');
Employees = INPUT(SCAN(line,7,' '),8.);
Revenue = INPUT(SCAN(line,8,' '),8.);
Rating = INPUT(SCAN(line,9,' '),8.);
END;
DROP line word4 word5;
DATALINES;
1 india 75 Healthcare-AI 2023-01-01 5000 100000 4.5
2 USA -20 Finance_AI 01/15/2023 12000 500000 4.7
3 china . Edu AI 20230120 8000 300000 4.8
4 NULL 60 Retail-AI 2023/02/10 7000 . 4.2
5 Germany 85 Health_AI 2023-13-01 6000 200000 4.6
6 india 75 Healthcare-AI 2023-01-01 5000 100000 4.5
7 UK 55 NULL 2023-03-05 -4000 150000 4.3
8 france 65 Finance-AI 2023-02-30 4500 120000 4.4
9 Japan 90 Edu_AI 2023-01-25 10000 400000 4.9
10 brazil 50 Retail AI 2023-02-15 3000 80000 4.1
;
RUN;
PROC PRINT DATA=ai_raw;
RUN;
OUTPUT:
| Obs | ID | Country | AI_Usage | Sector | Launch_Date | Employees | Revenue | Rating |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | india | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 |
| 2 | 2 | USA | -20 | Finance_AI | 01/15/2023 | 12000 | 500000 | 4.7 |
| 3 | 3 | china | . | Edu AI | 20230120 | 8000 | 300000 | 4.8 |
| 4 | 4 | NULL | 60 | Retail-AI | 2023/02/10 | 7000 | . | 4.2 |
| 5 | 5 | Germany | 85 | Health_AI | 2023-13-01 | 6000 | 200000 | 4.6 |
| 6 | 6 | india | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 |
| 7 | 7 | UK | 55 | NULL | 2023-03-05 | -4000 | 150000 | 4.3 |
| 8 | 8 | france | 65 | Finance-AI | 2023-02-30 | 4500 | 120000 | 4.4 |
| 9 | 9 | Japan | 90 | Edu_AI | 2023-01-25 | 10000 | 400000 | 4.9 |
| 10 | 10 | brazil | 50 | Retail AI | 2023-02-15 | 3000 | 80000 | 4.1 |
Explanation
This dataset intentionally simulates real-world inconsistencies: missing values (.), invalid entries (negative AI usage, impossible dates like Feb 30), inconsistent delimiters (-, _, space), duplicate records (ID 1 & 6), and inconsistent casing. The Sector variable is particularly messy, which makes it ideal for demonstrating SCAN and SUBSTR functions. These inconsistencies mimic clinical trial raw datasets (SDTM source), where raw CRF data often contains unstandardized formats. Without cleaning, downstream ADaM datasets and TLF outputs would be unreliable.
Step 1: Read Full Line as
Text
INFILE DATALINES TRUNCOVER;
LENGTH line $200;
INPUT line $200.;
What this does:
- Reads entire row into one
variable (line)
- No column confusion, no
shifting
Why important:
If you
directly use INPUT, SAS may misplace values when data is inconsistent.
👉 Here, you take full control
Step 2: Extract Basic
Columns Using SCAN
ID = INPUT(SCAN(line,1,' '),8.);
Country = SCAN(line,2,' ');
AI_Usage = INPUT(SCAN(line,3,'
'),8.);
What this does:
- SCAN(line,1,' ') → picks 1st
word → ID
- SCAN(line,2,' ') → Country
- SCAN(line,3,' ') → AI Usage
Why INPUT():
Because
SCAN returns text, but ID and AI_Usage must be numeric
Step 3: Capture Sector
Parts
word4 = SCAN(line,4,' ');
word5 = SCAN(line,5,' ');
What this does:
- Reads possible sector values
Examples:
- "Healthcare-AI" →
word4 = Healthcare-AI
- "Edu AI" → word4 =
Edu, word5 = AI
Step 4: Smart Logic
IF ANYDIGIT(word5) THEN DO;
Meaning:
Check if word5
contains numbers
👉 If YES → it is a date
👉 If NO → it is part of sector
Case 1: Sector is ONE word
Sector = word4;
Launch_Date = word5;
Employees = INPUT(SCAN(line,6,'
'),8.);
Revenue = INPUT(SCAN(line,7,'
'),8.);
Rating = INPUT(SCAN(line,8,'
'),8.);
Example:
Healthcare-AI 2023-01-01
- Sector = Healthcare-AI
- Date = next column
Case 2: Sector is TWO words
Sector = CATX(' ',word4,word5);
Launch_Date = SCAN(line,6,' ');
Employees = INPUT(SCAN(line,7,'
'),8.);
Revenue = INPUT(SCAN(line,8,'
'),8.);
Rating = INPUT(SCAN(line,9,'
'),8.);
Example:
Edu AI 20230120
- Sector = "Edu AI"
- Date shifts to next column
Step 5: Clean Temporary
Variables
DROP line word4 word5;
Why:
- These were only used for
processing
- Not needed in final dataset
Step 6: DATALINES (Raw
Data)
DATALINES;
...
;
Contains:
- Mixed formats
- Missing values (.)
- Multi-word sectors
- Invalid values
👉 This simulates real-world messy data
Step 7: Print Dataset
PROC PRINT DATA=ai_raw;
RUN;
Final
cleaned structure
What Problems This Code
Solves
|
Problem |
Solution |
|
Multi-word
values |
SCAN +
logic |
|
Column
shifting |
Read
full line |
|
Mixed
formats |
Flexible
parsing |
|
Missing
values |
Handled
safely |
|
Real-world
messy data |
Fully
controlled |
Key Concepts You Should
Remember
✔SCAN
- Extracts word by position
- Based on delimiter
✔INPUT
- Converts text → numeric
✔ANYDIGIT
- Detects if value contains
numbers
✔CATX
- Combines words with space
Interview Explanation
If
interviewer asks:
Why
did you use this approach?
You say:
Because
raw data had inconsistent structure and multi-word fields, I avoided direct
INPUT parsing. Instead, I read the entire row as a string and used SCAN with
conditional logic to dynamically assign variables. This prevents column
shifting and ensures accurate data extraction.
R Raw Dataset
ai_raw <- data.frame(
ID = 1:10,
Country = c("india","USA","china","NULL","Germany","india","UK",
"france","Japan","brazil"),
AI_Usage = c(75,-20,NA,60,85,75,55,65,90,50),
Sector = c("Healthcare-AI","Finance_AI","Edu AI","Retail-AI",
"Health_AI","Healthcare-AI","NULL","Finance-AI",
"Edu_AI","Retail AI"),
Launch_Date = c("2023-01-01","01/15/2023","20230120","2023/02/10",
"2023-13-01","2023-01-01","2023-03-05","2023-02-30",
"2023-01-25","2023-02-15"),
Employees = c(5000,12000,8000,7000,6000,5000,-4000,4500,10000,3000),
Revenue = c(100000,500000,300000,NA,200000,100000,150000,120000,
400000,80000),
Rating = c(4.5,4.7,4.8,4.2,4.6,4.5,4.3,4.4,4.9,4.1)
)
OUTPUT:
|
|
ID |
Country |
AI_Usage |
Sector |
Launch_Date |
Employees |
Revenue |
Rating |
|
1 |
1 |
india |
75 |
Healthcare-AI |
01-01-2023 |
5000 |
100000 |
4.5 |
|
2 |
2 |
USA |
-20 |
Finance_AI |
01/15/2023 |
12000 |
500000 |
4.7 |
|
3 |
3 |
china |
NA |
Edu AI |
20230120 |
8000 |
300000 |
4.8 |
|
4 |
4 |
NULL |
60 |
Retail-AI |
10-02-2023 |
7000 |
NA |
4.2 |
|
5 |
5 |
Germany |
85 |
Health_AI |
2023-13-01 |
6000 |
200000 |
4.6 |
|
6 |
6 |
india |
75 |
Healthcare-AI |
01-01-2023 |
5000 |
100000 |
4.5 |
|
7 |
7 |
UK |
55 |
NULL |
05-03-2023 |
-4000 |
150000 |
4.3 |
|
8 |
8 |
france |
65 |
Finance-AI |
2023-02-30 |
4500 |
120000 |
4.4 |
|
9 |
9 |
Japan |
90 |
Edu_AI |
25-01-2023 |
10000 |
400000 |
4.9 |
|
10 |
10 |
brazil |
50 |
Retail AI |
15-02-2023 |
3000 |
80000 |
4.1 |
Explanation
The R
dataset mirrors SAS data for cross-platform consistency. It includes NA,
invalid values, duplicates, and inconsistent strings. In real-world analytics
pipelines, teams often use SAS for regulatory submission and R for exploratory
analysis. Therefore, maintaining identical raw datasets ensures reproducibility
and validation across systems. This dataset will later be cleaned using dplyr
functions, highlighting how R handles transformations differently but achieves
the same objective data integrity.
3. Phase 1: Data Cleaning in SAS
PROC SORT DATA=ai_raw NODUPKEY OUT=ai_nodup;
BY ID;
RUN;
PROC PRINT DATA=ai_nodup;
RUN;
OUTPUT:
| Obs | ID | Country | AI_Usage | Sector | Launch_Date | Employees | Revenue | Rating |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | india | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 |
| 2 | 2 | USA | -20 | Finance_AI | 01/15/2023 | 12000 | 500000 | 4.7 |
| 3 | 3 | china | . | Edu AI | 20230120 | 8000 | 300000 | 4.8 |
| 4 | 4 | NULL | 60 | Retail-AI | 2023/02/10 | 7000 | . | 4.2 |
| 5 | 5 | Germany | 85 | Health_AI | 2023-13-01 | 6000 | 200000 | 4.6 |
| 6 | 6 | india | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 |
| 7 | 7 | UK | 55 | NULL | 2023-03-05 | -4000 | 150000 | 4.3 |
| 8 | 8 | france | 65 | Finance-AI | 2023-02-30 | 4500 | 120000 | 4.4 |
| 9 | 9 | Japan | 90 | Edu_AI | 2023-01-25 | 10000 | 400000 | 4.9 |
| 10 | 10 | brazil | 50 | Retail AI | 2023-02-15 | 3000 | 80000 | 4.1 |
DATA ai_clean;
SET ai_nodup;
/* Fix Country */
IF Country = "NULL" OR MISSING(Country) THEN Country = "UNKNOWN";
Country = UPCASE(Country);
/* Fix AI Usage */
IF AI_Usage < 0 THEN AI_Usage = ABS(AI_Usage);
/* Fix Employees */
IF Employees < 0 THEN Employees = .;
/* Fix Sector */
IF Sector = "NULL" OR MISSING(Sector) THEN Sector = "UNKNOWN";
/* Standardize Sector */
Sector_Main = SCAN(Sector,1,'-_ ');
/* Clean Sector_Code */
Sector_Code = SUBSTR(Sector_Main,1,4);
/* -------- DATE FIX WITHOUT MISSING -------- */
Launch_Date_fix = STRIP(Launch_Date);
/* Convert normally */
IF INDEX(Launch_Date_fix,'-') THEN
Launch_Date_Clean = INPUT(Launch_Date_fix,YYMMDD10.);
ELSE IF INDEX(Launch_Date_fix,'/') THEN
Launch_Date_Clean = INPUT(Launch_Date_fix,MMDDYY10.);
ELSE IF LENGTH(Launch_Date_fix)=8 THEN
Launch_Date_Clean = INPUT(Launch_Date_fix,YYMMDD8.);
ELSE
Launch_Date_Clean = .;
/* -------- HANDLE INVALID / MISSING -------- */
/* If still missing → assign default date */
IF Launch_Date_Clean = . THEN DO;
/* Option 1: Assign default constant */
Launch_Date_Clean = '01JAN2023'd;
/* Option 2 (better): assign today's date */
Launch_Date_Clean = TODAY();
END;
FORMAT Launch_Date_Clean DATE9.;
DROP Launch_Date_fix;
RUN;
PROC PRINT DATA=ai_clean;
RUN;
OUTPUT:
| Obs | ID | Country | AI_Usage | Sector | Launch_Date | Employees | Revenue | Rating | Sector_Main | Sector_Code | Launch_Date_Clean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 | Healthcare | Heal | 01JAN2023 |
| 2 | 2 | USA | 20 | Finance_AI | 01/15/2023 | 12000 | 500000 | 4.7 | Finance | Fina | 15JAN2023 |
| 3 | 3 | CHINA | . | Edu AI | 20230120 | 8000 | 300000 | 4.8 | Edu | Edu | 20JAN2023 |
| 4 | 4 | UNKNOWN | 60 | Retail-AI | 2023/02/10 | 7000 | . | 4.2 | Retail | Reta | 16APR2026 |
| 5 | 5 | GERMANY | 85 | Health_AI | 2023-13-01 | 6000 | 200000 | 4.6 | Health | Heal | 16APR2026 |
| 6 | 6 | INDIA | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 | Healthcare | Heal | 01JAN2023 |
| 7 | 7 | UK | 55 | UNKNOWN | 2023-03-05 | . | 150000 | 4.3 | UNKNOWN | UNKN | 05MAR2023 |
| 8 | 8 | FRANCE | 65 | Finance-AI | 2023-02-30 | 4500 | 120000 | 4.4 | Finance | Fina | 16APR2026 |
| 9 | 9 | JAPAN | 90 | Edu_AI | 2023-01-25 | 10000 | 400000 | 4.9 | Edu | Edu | 25JAN2023 |
| 10 | 10 | BRAZIL | 50 | Retail AI | 2023-02-15 | 3000 | 80000 | 4.1 | Retail | Reta | 15FEB2023 |
Explanation
This
phase demonstrates core SAS cleaning techniques. PROC SORT NODUPKEY removes
duplicates based on ID, ensuring uniqueness. COALESCEC replaces missing or NULL
country values with “UNKNOWN,” followed by UPCASE to standardize casing.
Invalid numeric values (negative AI usage) are corrected using ABS, while
logically incorrect employee counts are nullified.
The
highlight is SCAN and SUBSTR:
- SCAN(Sector,1,'-_ ')
extracts the primary category (e.g., "Healthcare")
- SUBSTR(Sector,1,4) creates a
sector code for grouping
Date
inconsistencies are handled using ANYDTDTE. informat, a powerful SAS feature
for multiple date formats. This mirrors real SDTM transformation logic where
raw text dates must be standardized.
4. Phase 2: Data Cleaning in R
library(dplyr)
ai_clean <- ai_raw %>%
distinct(ID, .keep_all = TRUE) %>%
mutate(
Country = toupper(ifelse(Country == "NULL" | is.na(Country),
"UNKNOWN", Country)),
AI_Usage = ifelse(AI_Usage < 0, abs(AI_Usage), AI_Usage),
Employees = ifelse(Employees < 0, NA, Employees),
Sector_Main = sapply(strsplit(Sector, "[-_ ]"), `[`, 1),
Sector_Code = substr(Sector,1,4),
Launch_Date_Clean = as.Date(Launch_Date, format="%Y-%m-%d")
)
OUTPUT:
|
|
ID |
Country |
AI_Usage |
Sector |
Launch_Date |
Employees |
Revenue |
Rating |
Sector_Main |
Sector_Code |
Launch_Date_Clean |
|
1 |
1 |
INDIA |
75 |
Healthcare-AI |
01-01-2023 |
5000 |
100000 |
4.5 |
Healthcare |
Heal |
01-01-2023 |
|
2 |
2 |
USA |
20 |
Finance_AI |
01/15/2023 |
12000 |
500000 |
4.7 |
Finance |
Fina |
NA |
|
3 |
3 |
CHINA |
NA |
Edu AI |
20230120 |
8000 |
300000 |
4.8 |
Edu |
Edu |
NA |
|
4 |
4 |
UNKNOWN |
60 |
Retail-AI |
10-02-2023 |
7000 |
NA |
4.2 |
Retail |
Reta |
NA |
|
5 |
5 |
GERMANY |
85 |
Health_AI |
2023-13-01 |
6000 |
200000 |
4.6 |
Health |
Heal |
NA |
|
6 |
6 |
INDIA |
75 |
Healthcare-AI |
01-01-2023 |
5000 |
100000 |
4.5 |
Healthcare |
Heal |
01-01-2023 |
|
7 |
7 |
UK |
55 |
NULL |
05-03-2023 |
NA |
150000 |
4.3 |
NULL |
NULL |
05-03-2023 |
|
8 |
8 |
FRANCE |
65 |
Finance-AI |
2023-02-30 |
4500 |
120000 |
4.4 |
Finance |
Fina |
NA |
|
9 |
9 |
JAPAN |
90 |
Edu_AI |
25-01-2023 |
10000 |
400000 |
4.9 |
Edu |
Edu_ |
25-01-2023 |
|
10 |
10 |
BRAZIL |
50 |
Retail AI |
15-02-2023 |
3000 |
80000 |
4.1 |
Retail |
Reta |
15-02-2023 |
Explanation
R
achieves similar results using dplyr. distinct() removes duplicates, while mutate()
performs transformations. Missing values are handled using ifelse and is.na.
String operations replicate SAS logic:
- strsplit() extracts sector
categories (like SCAN)
- substr() mimics SUBSTR
Date
conversion in R is less flexible than SAS unless multiple formats are
explicitly handled. This highlights a key difference: SAS is often more robust
for heterogeneous clinical data ingestion. However, R provides greater
flexibility for exploratory transformations.
5. Phase 3: Additional SAS Processing Using Raw Data
DATA ai_enhanced;
SET ai_clean;
/* Derive Region */
LENGTH Region $15.;
IF Country IN ("INDIA","CHINA","JAPAN") THEN Region="ASIA";
ELSE IF Country IN ("USA","UK") THEN Region="WEST";
ELSE IF Country IN ("BRAZIL") THEN Region="SOUTH AMERICA";
ELSE IF Country = "UNKNOWN" THEN Region="UNKNOWN";
ELSE Region="OTHER";
/* Create Flag */
IF Rating >= 4.5 THEN High_Performance="YES";
ELSE High_Performance="NO";
RUN;
PROC PRINT DATA=ai_enhanced;
RUN;
OUTPUT:
| Obs | ID | Country | AI_Usage | Sector | Launch_Date | Employees | Revenue | Rating | Sector_Main | Sector_Code | Launch_Date_Clean | Region | High_Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | INDIA | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 | Healthcare | Heal | 01JAN2023 | ASIA | YES |
| 2 | 2 | USA | 20 | Finance_AI | 01/15/2023 | 12000 | 500000 | 4.7 | Finance | Fina | 15JAN2023 | WEST | YES |
| 3 | 3 | CHINA | . | Edu AI | 20230120 | 8000 | 300000 | 4.8 | Edu | Edu | 20JAN2023 | ASIA | YES |
| 4 | 4 | UNKNOWN | 60 | Retail-AI | 2023/02/10 | 7000 | . | 4.2 | Retail | Reta | 16APR2026 | UNKNOWN | NO |
| 5 | 5 | GERMANY | 85 | Health_AI | 2023-13-01 | 6000 | 200000 | 4.6 | Health | Heal | 16APR2026 | OTHER | YES |
| 6 | 6 | INDIA | 75 | Healthcare-AI | 2023-01-01 | 5000 | 100000 | 4.5 | Healthcare | Heal | 01JAN2023 | ASIA | YES |
| 7 | 7 | UK | 55 | UNKNOWN | 2023-03-05 | . | 150000 | 4.3 | UNKNOWN | UNKN | 05MAR2023 | WEST | NO |
| 8 | 8 | FRANCE | 65 | Finance-AI | 2023-02-30 | 4500 | 120000 | 4.4 | Finance | Fina | 16APR2026 | OTHER | NO |
| 9 | 9 | JAPAN | 90 | Edu_AI | 2023-01-25 | 10000 | 400000 | 4.9 | Edu | Edu | 25JAN2023 | ASIA | YES |
| 10 | 10 | BRAZIL | 50 | Retail AI | 2023-02-15 | 3000 | 80000 | 4.1 | Retail | Reta | 15FEB2023 | SOUTH AMERICA | NO |
Explanation
This
phase introduces derivations similar to ADaM dataset creation. Business logic
is applied to derive Region and performance flags. Such transformations
are critical in clinical trials where derived variables (e.g., treatment flags,
population flags) drive analysis datasets. This step shows how cleaned data
becomes analytics-ready.
6. 20 Additional Data
Cleaning Best Practices
- Always validate raw data
against source (CRF)
- Maintain audit trail for
every transformation
- Use standardized formats
(ISO dates)
- Apply controlled terminology
(CDISC)
- Validate ranges (age, lab
values)
- Remove duplicates carefully
- Use macros for reusable
cleaning
- Log all warnings/errors
- Handle NULL explicitly
- Standardize case sensitivity
- Use validation datasets
- Cross-check SDTM vs ADaM
- Perform QC programming
- Maintain metadata
documentation
- Use PROC COMPARE for
validation
- Avoid hardcoding values
- Ensure reproducibility
- Document derivations
- Use version control
- Follow regulatory compliance
(FDA)
7. Business Logic Behind
Data Cleaning
Data
cleaning is not cosmetic it directly impacts decisions.
- Missing country → replaced
with UNKNOWN → avoids data loss
- Negative AI usage →
corrected → prevents incorrect averages
- Date standardization →
enables time-series analysis
In
clinical trials:
- Wrong patient age →
incorrect efficacy grouping
- Missing treatment date →
invalid survival analysis
Thus,
cleaning ensures data integrity, accuracy, and trust.
8. 20 Key Insights
- Dirty data leads to wrong
conclusions
- Standardization ensures
reproducibility
- Missing values distort
analytics
- Duplicate data inflates
metrics
- Text inconsistency breaks
joins
- Dates drive timelines
- Validation is non-negotiable
- Cleaning is iterative
- Automation reduces errors
- Audit trails build trust
- SAS excels in structured
cleaning
- R excels in flexibility
- Metadata is critical
- Regulatory compliance is
mandatory
- Transformation logic must be
transparent
- Always question raw data
- Edge cases matter most
- QC is as important as coding
- Clean data = reliable
insights
- Garbage in = garbage out
9. Summary (SAS vs R)
|
Feature |
SAS |
R |
|
Data
Handling |
Strong
(clinical) |
Flexible |
|
Date
Parsing |
Excellent |
Moderate |
|
String
Functions |
Powerful
(SCAN) |
Flexible |
|
Regulatory
Use |
Industry
standard |
Limited |
10. Conclusion
Data
cleaning is where real analytics begins. Functions like SCAN and SUBSTR
may seem simple, but in practice, they unlock hidden structure in messy
datasets. Whether you're building SDTM domains or business dashboards,
mastering these tools ensures your outputs are accurate, compliant, and
decision-ready.
11. Interview Questions
Q1: How do you extract domain from
“AE-SEVERE”?
👉 Use SCAN(var,1,'-')
Q2: How to fix negative values in
SAS?
👉 IF var < 0 THEN var = ABS(var);
Q3: Difference between SCAN and
SUBSTR?
👉 SCAN = delimiter-based extraction
👉 SUBSTR = position-based extraction
Q4: How to remove duplicates?
👉 PROC SORT NODUPKEY
Q5: R equivalent of SCAN?
👉 strsplit()
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. Here we learn about AI DATA.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics and smart cities
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment