426.Can Advanced SAS Programming Detect And Fix Errors In Semiconductor Fabrication Plant Data?
Cracking Semiconductor Fabrication Data Errors Using Advanced SAS Programming
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA STEP | LENGTH | INPUT | DATALINES | FORMAT | SET | IF-THEN | MERGE | PROC SORT | PROC PRINT | PROC APPEND | PROC TRANSPOSE | PROC DATASETS | MACRO | NUMERIC AND CHARACTER FUNCTIONS
Table of Contents
- Introduction
- Semiconductor Industry
Business Context
- Dataset Design Overview
- Variables Description
- Intentional Errors
Introduced
- SAS Environment Setup
- Code 1 – Creating Raw
Semiconductor Dataset
- Code 2 – Using Character Functions
to Clean Data
- Code 3 – Numeric
Calculations and Derived Metrics
- Code 4 – Date Functions
(MDY, INTCK, INTNX)
- Code 5 – Utilization
Classification
- Code 6 – Fraud Detection
Macro
- Code 7 – Dataset Merge
Example
- Code 8 – Dataset Append
Example
- Code 9 – Dataset Transpose
- Code 10 – PROC DATASETS
Delete
- Code 11 – Detecting Data
Errors
- Code 12 – Creating Final
Cleaned Dataset
- Final Dataset Output
- Key Project Insights
- Summary
- Conclusion
1. Introduction
Semiconductor
fabrication plants (commonly called fabs) manufacture microchips used in
electronics, automobiles, smartphones, and satellites. These facilities
generate massive amounts of process data such as wafer batches, defect
densities, yield percentages, and machine downtime.
Data
integrity is extremely important in semiconductor manufacturing because:
·
Small data errors can lead to millions of dollars in production
losses
·
Incorrect yield calculations may hide equipment failures
·
Fraudulent data manipulation could hide fabrication defects
This
project demonstrates how Advanced SAS Programming can be used to:
·
Create semiconductor plant datasets
·
Introduce intentional data errors
·
Detect fraud and inconsistencies
·
Correct those errors using SAS functions and procedures
·
Generate a clean analytical dataset
The
project will also demonstrate:
·
Character functions
·
Numeric functions
·
Macros
·
Date functions
· Dataset manipulation techniques
2. Semiconductor Industry
Business Context
Semiconductor
fabs operate with highly precise manufacturing processes. A typical fabrication
facility tracks:
|
Metric |
Meaning |
|
Wafer
Batch |
Group
of wafers processed together |
|
Defect
Density |
Number
of defects per wafer area |
|
Yield
Percentage |
Percentage
of functional chips |
|
Process
Node |
Technology
size (5nm, 7nm, etc) |
|
Downtime
Minutes |
Machine
stoppage time |
|
Utilization |
Machine
usage efficiency |
Companies
such as:
·
Intel
·
TSMC
·
Samsung
·
GlobalFoundries
track
these metrics continuously.
However
real-world manufacturing datasets often contain:
·
Missing values
·
Incorrect date formats
·
Fraudulent yield values
·
Duplicate records
· Character formatting errors
3. Dataset Design Overview
Variables
|
Variable |
Type |
Description |
|
Fab_ID |
Character |
Fabrication
plant identifier |
|
Wafer_Batch |
Character |
Wafer
batch ID |
|
Defect_Density |
Numeric |
Defects
per wafer |
|
Yield_Percentage |
Numeric |
Production
yield |
|
Process_Node |
Character |
Fabrication
technology |
|
Downtime_Minutes |
Numeric |
Equipment
downtime |
|
Utilization |
Numeric |
Equipment
utilization |
|
Production_Fees |
Numeric |
Production
cost |
|
Production_Date |
Character |
Production
date |
Code 1 — Creating Raw Semiconductor Dataset (With Intentional Errors)
data semiconductor_fab_raw;
input Fab_ID $ Wafer_Batch $ Defect_Density Yield_Percentage
Process_Node $ Downtime_Minutes Utilization Production_Fees Production_Date $;
datalines;
fab01 WB1001 0.32 92 7nm 120 85 45000 02-15-2024
fab02 WB1002 . 95 5nm 90 88 47000 03-10-2024
fab03 WB1003 0.55 105 10nm 300 110 42000 04-05-2024
fab04 WB1004 0.40 89 7nm -50 80 41000 05-01-2024
fab05 WB1005 0.28 94 5nm 60 75 39000 06-18-2024
fab06 WB1006 0.90 60 14nm 500 40 35000 07-22-2024
fab07 WB1007 . 97 5nm 40 92 48000 08-11-2024
fab08 WB1008 0.25 96 3nm 20 95 50000 09-15-2024
fab09 WB1009 0.45 91 7nm 200 70 43000 10-20-2024
fab10 WB1010 0.60 88 10nm 350 65 42000 11-12-2024
;
run;
proc print data=semiconductor_fab_raw;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | fab01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 |
| 2 | fab02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 |
| 3 | fab03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 |
| 4 | fab04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 |
| 5 | fab05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 |
| 6 | fab06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 |
| 7 | fab07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 |
| 8 | fab08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 |
| 9 | fab09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 |
| 10 | fab10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 |
Explanation
This program creates the initial semiconductor dataset.
However it intentionally includes several data
quality issues.
Intentional Errors Introduced
Error 1 — Missing Defect Density
.
Meaning missing value.
Error 2 — Yield Above 100%
105
Yield cannot exceed 100%.
Error 3 — Negative Downtime
-50
Downtime cannot be negative.
Error 4 — Utilization Above 100%
110
Machine utilization cannot exceed 100%.
Why Create Intentional Errors?
In real manufacturing datasets:
· Data entry
mistakes occur
·
Sensor malfunctions produce invalid
values
·
Fraudulent adjustments may hide process
failures
Detecting such issues is an important SAS programmer skill.
Code 2 — Character Data Cleaning
data fab_clean_char;
set semiconductor_fab_raw;
Fab_ID = upcase(strip(Fab_ID));
Process_Node = propcase(Process_Node);
Batch_Label = catx("_", Fab_ID, Wafer_Batch);
run;
proc print data=fab_clean_char;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 |
Explanation
This step cleans and standardizes character variables.
Character Functions Used
STRIP
Removes leading and trailing spaces.
strip(Fab_ID)
UPCASE
Converts characters to uppercase.
Example:
fab01 → FAB01
PROPERCASE
Converts text into readable format.
7nm → 7Nm
CATX
Concatenates values with delimiter.
Example output:
FAB01_WB1001
Why Character Cleaning Matters
In large datasets inconsistent formatting
causes:
· Merge
failures
·
Duplicate keys
·
Incorrect groupings
Cleaning character variables ensures consistency.
Code 3 — Numeric Calculations
data fab_numeric;
set fab_clean_char;
Effective_Yield = Yield_Percentage - Defect_Density*10;
Downtime_Hours = Downtime_Minutes/60;
run;
proc print data=fab_numeric;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 |
Explanation
This step derives additional numeric metrics.
Numeric Calculations
Effective Yield
Yield_Percentage – Defect_Density × 10
Used to estimate yield impact from defects.
Downtime Hours
Converts minutes to hours.
Why Numeric Calculations Are Useful
Engineers analyze derived metrics to evaluate:
· Production
efficiency
·
Equipment reliability
·
Yield performance
Code 4 — Date Conversion Using MDY
data fab_dates;
set fab_numeric;
Prod_Date = input(Production_Date, mmddyy10.);
format Prod_Date date9.;
run;
proc print data=fab_dates;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 |
Explanation
The dataset currently stores dates as character values.
This step converts them to SAS date format.
Code 5 — Using INTCK
data fab_dates2;
set fab_dates;
Days_Since_Production = intck('day', Prod_Date, today());
run;
proc print data=fab_dates2;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 |
Purpose
INTCK calculates time intervals between dates.
Example:
INTCK('DAY', StartDate, EndDate)
Used in:
·
equipment aging analysis
·
production monitoring
Code 6 — Using INTNX
data fab_dates3;
set fab_dates2;
Next_Service_Date = intnx('month', Prod_Date, 3);
format Next_Service_Date date9.;
run;
proc print data=fab_dates3;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 |
Purpose
INTNX calculates future or past dates.
Here it schedules maintenance every 3 months.
Code 7 — Utilization Classification
data fab_utilization;
set fab_dates3;
length Util_Class $20;
if Utilization >=90 then Util_Class="High Utilization";
else if Utilization >=70 then Util_Class="Moderate Utilization";
else Util_Class="Low Utilization";
run;
proc print data=fab_utilization;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date | Util_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 | Moderate Utilization |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 | Moderate Utilization |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 | High Utilization |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 | Moderate Utilization |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 | Moderate Utilization |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 | Low Utilization |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 | High Utilization |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 | High Utilization |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 | Moderate Utilization |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 | Low Utilization |
Explanation
Machines
are categorized based on utilization rate.
|
Utilization |
Classification |
|
≥ 90 |
High |
|
70-89 |
Moderate |
|
< 70 |
Low |
Code 8 — Fraud Detection Macro
%macro fraud_check;
data fab_fraud_flag;
set fab_utilization;
length Fraud_Flag $20.;
if Yield_Percentage >100 then Fraud_Flag="Yield Error";
else if Downtime_Minutes <0 then Fraud_Flag="Downtime Error";
else if Utilization >100 then Fraud_Flag="Utilization Error";
else Fraud_Flag="Valid";
run;
proc print data=fab_fraud_flag;
run;
%mend;
%fraud_check;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date | Util_Class | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 | Moderate Utilization | Valid |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 | Moderate Utilization | Valid |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 | High Utilization | Yield Error |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 | Moderate Utilization | Downtime Error |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 | Moderate Utilization | Valid |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 | Low Utilization | Valid |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 | High Utilization | Valid |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 | High Utilization | Valid |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 | Moderate Utilization | Valid |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 | Low Utilization | Valid |
Explanation
This macro detects suspicious values.
Possible fraud indicators:
·
Impossible yield values
·
Negative downtime
·
Utilization exceeding limits
Macros allow reusable validation logic.
Code 9 — Dataset Merge
data fab_cost;
input Fab_ID $ Production_Fees;
datalines;
FAB01 45000
FAB02 47000
FAB03 42000
FAB04 41000
;
run;
proc print data=fab_cost;
run;
OUTPUT:
| Obs | Fab_ID | Production_Fees |
|---|---|---|
| 1 | FAB01 | 45000 |
| 2 | FAB02 | 47000 |
| 3 | FAB03 | 42000 |
| 4 | FAB04 | 41000 |
proc sort data=fab_fraud_flag;by Fab_ID;run;
proc print data=fab_fraud_flag;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date | Util_Class | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 | Moderate Utilization | Valid |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 | Moderate Utilization | Valid |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 | High Utilization | Yield Error |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 | Moderate Utilization | Downtime Error |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 | Moderate Utilization | Valid |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 | Low Utilization | Valid |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 | High Utilization | Valid |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 | High Utilization | Valid |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 | Moderate Utilization | Valid |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 | Low Utilization | Valid |
proc sort data=fab_cost;by Fab_ID;run;
proc print data=fab_cost;
run;
OUTPUT:
| Obs | Fab_ID | Production_Fees |
|---|---|---|
| 1 | FAB01 | 45000 |
| 2 | FAB02 | 47000 |
| 3 | FAB03 | 42000 |
| 4 | FAB04 | 41000 |
data fab_merged;
merge fab_fraud_flag
fab_cost;
by Fab_ID;
run;
proc print data=fab_merged;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date | Util_Class | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 | Moderate Utilization | Valid |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 | Moderate Utilization | Valid |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 | High Utilization | Yield Error |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 | Moderate Utilization | Downtime Error |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 | Moderate Utilization | Valid |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 | Low Utilization | Valid |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 | High Utilization | Valid |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 | High Utilization | Valid |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 | Moderate Utilization | Valid |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 | Low Utilization | Valid |
Purpose
MERGE combines datasets horizontally.
Used when datasets share a common key.
Code 10 — Dataset Append
proc append base=fab_merged
data=fab_cost force;
run;
proc print data=fab_merged;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date | Util_Class | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 | Moderate Utilization | Valid |
| 2 | FAB02 | WB1002 | . | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 | Moderate Utilization | Valid |
| 3 | FAB03 | WB1003 | 0.55 | 105 | 10nm | 300 | 110 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 | High Utilization | Yield Error |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | -50 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 | Moderate Utilization | Downtime Error |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 | Moderate Utilization | Valid |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 | Low Utilization | Valid |
| 7 | FAB07 | WB1007 | . | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 | High Utilization | Valid |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 | High Utilization | Valid |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 | Moderate Utilization | Valid |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 | Low Utilization | Valid |
| 11 | FAB01 | . | . | . | . | 45000 | . | . | . | . | . | ||||||
| 12 | FAB02 | . | . | . | . | 47000 | . | . | . | . | . | ||||||
| 13 | FAB03 | . | . | . | . | 42000 | . | . | . | . | . | ||||||
| 14 | FAB04 | . | . | . | . | 41000 | . | . | . | . | . |
Purpose
APPEND adds observations to an existing
dataset.
Code 11 — Dataset Transpose
proc transpose data=fab_fraud_flag out=fab_transposed;
var Yield_Percentage Utilization;
run;
proc print data=fab_transposed;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Yield_Percentage | 92 | 95 | 105 | 89 | 94 | 60 | 97 | 96 | 91 | 88 |
| 2 | Utilization | 85 | 88 | 110 | 80 | 75 | 40 | 92 | 95 | 70 | 65 |
Purpose
TRANSPOSE converts rows to columns.
Useful for report generation.
Code 12 — PROC DATASETS DELETE
proc datasets library=work nolist;
delete fab_clean_char fab_numeric fab_dates;
quit;
LOG:
Purpose
Removes intermediate datasets to keep
workspace clean.
Final Corrected Dataset Code
data semiconductor_final;
set fab_utilization;
if Yield_Percentage >100 then Yield_Percentage=100;
if Downtime_Minutes <0 then Downtime_Minutes=0;
if Utilization >100 then Utilization=100;
if missing(Defect_Density) then Defect_Density=0.3;
run;
proc print data=semiconductor_final;
run;
OUTPUT:
| Obs | Fab_ID | Wafer_Batch | Defect_Density | Yield_Percentage | Process_Node | Downtime_Minutes | Utilization | Production_Fees | Production_Date | Batch_Label | Effective_Yield | Downtime_Hours | Prod_Date | Days_Since_Production | Next_Service_Date | Util_Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | FAB01 | WB1001 | 0.32 | 92 | 7nm | 120 | 85 | 45000 | 02-15-20 | FAB01_WB1001 | 88.8 | 2.00000 | 15FEB2020 | 2217 | 01MAY2020 | Moderate Utilization |
| 2 | FAB02 | WB1002 | 0.30 | 95 | 5nm | 90 | 88 | 47000 | 03-10-20 | FAB02_WB1002 | . | 1.50000 | 10MAR2020 | 2193 | 01JUN2020 | Moderate Utilization |
| 3 | FAB03 | WB1003 | 0.55 | 100 | 10nm | 300 | 100 | 42000 | 04-05-20 | FAB03_WB1003 | 99.5 | 5.00000 | 05APR2020 | 2167 | 01JUL2020 | High Utilization |
| 4 | FAB04 | WB1004 | 0.40 | 89 | 7nm | 0 | 80 | 41000 | 05-01-20 | FAB04_WB1004 | 85.0 | -0.83333 | 01MAY2020 | 2141 | 01AUG2020 | Moderate Utilization |
| 5 | FAB05 | WB1005 | 0.28 | 94 | 5nm | 60 | 75 | 39000 | 06-18-20 | FAB05_WB1005 | 91.2 | 1.00000 | 18JUN2020 | 2093 | 01SEP2020 | Moderate Utilization |
| 6 | FAB06 | WB1006 | 0.90 | 60 | 14nm | 500 | 40 | 35000 | 07-22-20 | FAB06_WB1006 | 51.0 | 8.33333 | 22JUL2020 | 2059 | 01OCT2020 | Low Utilization |
| 7 | FAB07 | WB1007 | 0.30 | 97 | 5nm | 40 | 92 | 48000 | 08-11-20 | FAB07_WB1007 | . | 0.66667 | 11AUG2020 | 2039 | 01NOV2020 | High Utilization |
| 8 | FAB08 | WB1008 | 0.25 | 96 | 3nm | 20 | 95 | 50000 | 09-15-20 | FAB08_WB1008 | 93.5 | 0.33333 | 15SEP2020 | 2004 | 01DEC2020 | High Utilization |
| 9 | FAB09 | WB1009 | 0.45 | 91 | 7nm | 200 | 70 | 43000 | 10-20-20 | FAB09_WB1009 | 86.5 | 3.33333 | 20OCT2020 | 1969 | 01JAN2021 | Moderate Utilization |
| 10 | FAB10 | WB1010 | 0.60 | 88 | 10nm | 350 | 65 | 42000 | 11-12-20 | FAB10_WB1010 | 82.0 | 5.83333 | 12NOV2020 | 1946 | 01FEB2021 | Low Utilization |
This program fixes all data issues.
Corrections applied:
|
Issue |
Correction |
|
Missing defect density |
replaced with average |
|
Yield above 100 |
capped |
|
Negative downtime |
reset to 0 |
|
Utilization above 100 |
capped |
Key Project Insights
- Semiconductor fabs produce
complex manufacturing datasets.
- Data errors can
significantly impact yield analysis.
- SAS character functions
standardize text variables.
- Numeric functions help
derive operational metrics.
- Date functions allow
production tracking.
- Macros automate fraud
detection.
- Merge and append combine
datasets.
- Transpose helps in
reporting.
- PROC DATASETS manages
workspace efficiently.
- Data validation ensures
reliable analytics.
This project demonstrates how Advanced SAS programming can be used
to create, analyze, detect errors, and clean data related to semiconductor
fabrication plants. The dataset represents operational information
from semiconductor fabs, including variables such as Fab_ID,
Wafer_Batch, Defect_Density, Yield_Percentage, Process_Node, Downtime_Minutes,
Utilization, Production_Fees, and Production_Date. To simulate
real-world manufacturing data challenges, several intentional errors
were introduced into the raw dataset, such as missing defect density values,
yield percentages greater than 100%, negative downtime values, and machine
utilization exceeding 100%.
Using SAS programming techniques, the project demonstrates how to identify
and correct these data issues systematically. Character functions like
STRIP, TRIM, CATX, UPCASE, LOWCASE, and PROPERCASE were used
to clean and standardize text variables. Numeric calculations were applied to
derive useful metrics such as effective yield and downtime in hours. Date
functions including MDY, INTCK, and INTNX were used to convert
and analyze production dates and calculate future maintenance schedules.
The project also illustrates advanced dataset management techniques such as SET,
MERGE, APPEND, and TRANSPOSE, along with workspace cleanup using PROC
DATASETS DELETE. Additionally, a macro-based fraud detection
logic was implemented to automatically flag suspicious or unrealistic
values in the dataset.
Overall, the project highlights how SAS can be used to build reliable,
error-free datasets for semiconductor manufacturing analytics,
ensuring accurate operational insights and improved decision-making.
Conclusion
Semiconductor manufacturing is one of the
most data-intensive industries. Reliable analytics requires accurate and
validated datasets.
This project demonstrated how Advanced SAS programming techniques
can be used to detect, analyze, and correct data errors in semiconductor
fabrication plant datasets.
Using a combination of:
·
character functions
·
numeric calculations
·
date functions
·
macros
·
dataset manipulation techniques
SAS programmers can build robust data pipelines that
ensure data quality and support high-value manufacturing decisions.
SAS INTERVIEW QUESTIONS
1. What is the difference between CALL SYMPUT and
CALL SYMPUTX?
Answer:
CALL SYMPUT may include extra spaces, while CALL SYMPUTX removes leading
and trailing spaces.
2. What is the purpose of %LET?
Answer:
%LET assigns values to macro variables.
3. What is %DO loop in macros?
Answer:
%DO loop repeats macro code multiple times.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent SEMICONDUCTOR data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment