411.Can We Detect And Correct Fraudulent And Erroneous Air Quality Sensor Network Data Using Only Proc Sql In Sas?Air Quality Sensor Network Fraud Detection And Error Correction Using PROC SQL
Air Quality Sensor Network Fraud Detection And Error Correction Using PROC SQL
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —PROC SQL | CREATE TABLE | INSERT INTO | SELECT | FROM | WHERE | CASE WHEN THEN ELSE END | COALESCE | STRIP | TRIM | PROPCASE | UPCASE | LOWCASE | CAT | CATX | CALCULATED | INTCK | INTNX | MDY | TODAY | FORMAT | GROUP BY | HAVING | ORDER BY | UNION ALL | JOIN | LEFT JOIN | ON | AVG | SUM | COUNT | MIN | MAX | DISTINCT | CREATE TABLE AS SELECT | ALTER TABLE | ADD COLUMN | UPDATE | SET | DELETE FROM | IS NULL | IS NOT NULL | BETWEEN | LIKE | %MACRO | %MEND | PROC DATASETS DELETE | QUIT
Introduction
Air
quality sensor networks are widely used in smart cities, environmental
monitoring, and industrial compliance systems. These sensors continuously
measure pollutants such as PM2.5 (fine particulate matter) and PM10
(coarse particulate matter). Governments use this information to issue
health warnings, regulate industries, and protect public health.
However,
real-world sensor data is rarely perfect. Many problems occur, such as:
- Incorrect calibration
- Missing values
- Negative or impossible
readings
- Incorrect dates
- Fraudulent manipulation
- Inconsistent text formats
- Duplicate records
- Sensors reporting
unrealistic accuracy
A SAS
programmer must detect, correct, and validate such issues before analysis.
In this
project, we will build a complete Air Quality Sensor Network analytics
system using only PROC SQL We will:
- Create a dataset with 15+
observations
- Add intentional errors
- Detect errors using SQL
- Correct errors using SQL
functions
- Create utilization
classification
- Detect fraud using macros
- Derive dates using MDY,
INTCK, INTNX logic
- Perform append, merge,
transpose using SQL
- Use numeric and character
functions
- Delete temporary datasets
- Explain every line clearly
Table Of Contents
- Business Context
- Dataset Design
- Raw Dataset Creation With
Intentional Errors
- Identifying Errors Using
PROC SQL
- Correcting Errors Using PROC
SQL Functions
- Character Functions
Explanation
- Numeric Functions
Explanation
- Date Derivations Using MDY,
INTCK, INTNX
- Utilization Classification
Logic
- Fraud Detection Using Macro
+ SQL
- Append Using PROC SQL
- Merge Using PROC SQL
- Transpose Using PROC SQL
- Statistical Preparation
Tables
- PROC DATASETS DELETE Cleanup
- Final Clean Dataset
- Conclusion
1. Business Context
Suppose
the National Air Monitoring Authority installed sensors in Indian
cities:
Delhi
Mumbai
Hyderabad
Chennai
Kolkata
Pune
Each
sensor reports:
- Pollution levels
- Calibration status
- Accuracy
- Battery level
- Service dates
Problems
reported:
- Some sensors show negative
pollution
- Some sensors show accuracy
>100%
- Some sensors not calibrated
- Some sensors inactive for
long time
Goal:
Build fraud
detection and quality validation system using PROC SQL
2. Dataset Structure Design
Variables:
|
Variable |
Description |
|
Sensor_ID |
Unique
sensor identifier |
|
City |
Sensor
location |
|
PM2_5 |
Fine
particle level |
|
PM10 |
Coarse
particle level |
|
Calibration_Status |
Calibration
condition |
|
Data_Gaps |
Missing
data days |
|
Accuracy_Score |
Accuracy
percentage |
|
Install_Date |
Installation
date |
|
Service_Date |
Last
service date |
|
Battery_Level |
Battery
% |
|
Signal_Strength |
Signal
quality % |
3. Raw Dataset Creation With Intentional Errors
PROC SQL;
CREATE TABLE air_quality_raw
(Sensor_ID CHAR(6),
City CHAR(20),
PM2_5 NUM,
PM10 NUM,
Calibration_Status CHAR(20),
Data_Gaps NUM,
Accuracy_Score NUM,
Install_Date DATE,
Service_Date DATE,
Battery_Level NUM,
Signal_Strength NUM);
INSERT INTO air_quality_raw
VALUES('S001','delhi',250,400,'notcalibrated',15,55,'01JAN2023'd,'01JAN2024'd,45,60);
INSERT INTO air_quality_raw
VALUES('S002','Mumbai ',35,60,'Calibrated',2,98,'15FEB2023'd,'15JAN2024'd,90,85);
INSERT INTO air_quality_raw
VALUES('S003','CHENNAI',-10,80,'Calibrated',5,88,'10MAR2023'd,'10FEB2024'd,75,70);
INSERT INTO air_quality_raw
VALUES('S004','Kolkata',120,.,'Calibrated',25,40,'01APR2023'd,'01JAN2024'd,30,40);
INSERT INTO air_quality_raw
VALUES('S005','Hyderabad',80,120,'unknown',0,105,'05MAY2023'd,'05JAN2024'd,95,90);
INSERT INTO air_quality_raw
VALUES('S006','Pune',60,95,'Calibrated',50,65,'07JUN2023'd,'07JAN2024'd,50,55);
INSERT INTO air_quality_raw
VALUES('S007','Delhi',300,500,'Calibrated',3,92,'01JUL2023'd,'01FEB2024'd,88,88);
INSERT INTO air_quality_raw
VALUES('S008','mumbai',45,70,'NotCalibrated',40,45,'01AUG2023'd,'01FEB2024'd,35,45);
INSERT INTO air_quality_raw
VALUES('S009','Chennai',55,85,'Calibrated',1,99,'01SEP2023'd,'01FEB2024'd,92,91);
INSERT INTO air_quality_raw
VALUES('S010','Kolkata',65,110,'Calibrated',10,85,'01OCT2023'd,'01FEB2024'd,80,75);
INSERT INTO air_quality_raw
VALUES('S011','Hyderabad',75,115,'Calibrated',0,97,'01NOV2023'd,'01FEB2024'd,95,95);
INSERT INTO air_quality_raw
VALUES('S012','Pune',85,140,'Calibrated',60,35,'01DEC2023'd,'01FEB2024'd,25,35);
INSERT INTO air_quality_raw
VALUES('S013','Delhi',95,150,'Calibrated',5,93,'01JAN2024'd,'01FEB2024'd,90,88);
INSERT INTO air_quality_raw
VALUES('S014','Mumbai',105,170,'Calibrated',7,89,'01JAN2024'd,'01FEB2024'd,85,82);
INSERT INTO air_quality_raw
VALUES('S015','Chennai',115,180,'Calibrated',9,87,'01JAN2024'd,'01FEB2024'd,82,80);
QUIT;
proc print data=air_quality_raw;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | delhi | 250 | 400 | notcalibrated | 15 | 55 | 01JAN23 | 01JAN24 | 45 | 60 |
| 2 | S002 | Mumbai | 35 | 60 | Calibrated | 2 | 98 | 15FEB23 | 15JAN24 | 90 | 85 |
| 3 | S003 | CHENNAI | -10 | 80 | Calibrated | 5 | 88 | 10MAR23 | 10FEB24 | 75 | 70 |
| 4 | S004 | Kolkata | 120 | . | Calibrated | 25 | 40 | 01APR23 | 01JAN24 | 30 | 40 |
| 5 | S005 | Hyderabad | 80 | 120 | unknown | 0 | 105 | 05MAY23 | 05JAN24 | 95 | 90 |
| 6 | S006 | Pune | 60 | 95 | Calibrated | 50 | 65 | 07JUN23 | 07JAN24 | 50 | 55 |
| 7 | S007 | Delhi | 300 | 500 | Calibrated | 3 | 92 | 01JUL23 | 01FEB24 | 88 | 88 |
| 8 | S008 | mumbai | 45 | 70 | NotCalibrated | 40 | 45 | 01AUG23 | 01FEB24 | 35 | 45 |
| 9 | S009 | Chennai | 55 | 85 | Calibrated | 1 | 99 | 01SEP23 | 01FEB24 | 92 | 91 |
| 10 | S010 | Kolkata | 65 | 110 | Calibrated | 10 | 85 | 01OCT23 | 01FEB24 | 80 | 75 |
| 11 | S011 | Hyderabad | 75 | 115 | Calibrated | 0 | 97 | 01NOV23 | 01FEB24 | 95 | 95 |
| 12 | S012 | Pune | 85 | 140 | Calibrated | 60 | 35 | 01DEC23 | 01FEB24 | 25 | 35 |
| 13 | S013 | Delhi | 95 | 150 | Calibrated | 5 | 93 | 01JAN24 | 01FEB24 | 90 | 88 |
| 14 | S014 | Mumbai | 105 | 170 | Calibrated | 7 | 89 | 01JAN24 | 01FEB24 | 85 | 82 |
| 15 | S015 | Chennai | 115 | 180 | Calibrated | 9 | 87 | 01JAN24 | 01FEB24 | 82 | 80 |
Explanation Line-by-Line
PROC SQL
→ Starts SQL procedure
CREATE TABLE
→ Creates dataset
CHAR / NUM
→ Defines variable types
INSERT INTO
→ Adds observations
Intentional Errors Present
Error 1
City inconsistent case
delhi, Delhi, DELHI
Error 2
Trailing space
Mumbai(space)
Error 3
Negative PM2_5
-10 impossible
Error 4
Missing PM10
Error 5
Accuracy >100
Error 6
Calibration inconsistent
unknown
NotCalibrated
notcalibrated
Error 7
Excessive Data gaps
4. Identify Errors Using PROC SQL
PROC SQL;
SELECT *
FROM air_quality_raw
WHERE PM2_5 <0 OR PM10 IS NULL OR Accuracy_Score>100;
QUIT;
OUTPUT:
| Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength |
|---|---|---|---|---|---|---|---|---|---|---|
| S003 | CHENNAI | -10 | 80 | Calibrated | 5 | 88 | 10MAR23 | 10FEB24 | 75 | 70 |
| S004 | Kolkata | 120 | . | Calibrated | 25 | 40 | 01APR23 | 01JAN24 | 30 | 40 |
| S005 | Hyderabad | 80 | 120 | unknown | 0 | 105 | 05MAY23 | 05JAN24 | 95 | 90 |
5. Correct Dataset Using PROC SQL
PROC SQL;
CREATE TABLE air_quality_clean AS
SELECT
STRIP(Sensor_ID) AS Sensor_ID,
PROPCASE(STRIP(City)) AS City,
CASE
WHEN PM2_5<0 THEN . ELSE PM2_5 END AS PM2_5,
COALESCE(PM10,PM2_5*1.5) AS PM10,
UPCASE(STRIP(Calibration_Status)) AS Calibration_Status,
Data_Gaps,
CASE
WHEN Accuracy_Score>100 THEN 100 ELSE Accuracy_Score END AS Accuracy_Score,
Install_Date FORMAT=DATE9.,
Service_Date FORMAT=DATE9.,
Battery_Level,
Signal_Strength
FROM air_quality_raw;
QUIT;
proc print data=air_quality_clean;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 |
WHERE filters bad records
Explanation of Character Functions
STRIP
removes spaces
PROPCASE
Delhi correct format
UPCASE
CALIBRATED standard format
COALESCE
replaces missing value
Example
PM10 missing
replace PM2.5 *1.5
Explanation Numeric Logic
CASE
conditional correction
Accuracy >100
set to 100
Negative pollution
set missing
6. Date Derivations Using INTCK and INTNX
PROC SQL;
CREATE TABLE air_quality_dates AS
SELECT *,
INTCK('DAY',Install_Date,Service_Date) AS Service_Days,
INTNX('MONTH',Service_Date,6,'SAME') FORMAT=DATE9. AS Next_Service_Date
FROM air_quality_clean;
QUIT;
proc print data=air_quality_dates;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength | Service_Days | Next_Service_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 | 365 | 01JUL2024 |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 | 334 | 15JUL2024 |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 | 337 | 10AUG2024 |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 | 275 | 01JUL2024 |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 | 245 | 05JUL2024 |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 | 214 | 07JUL2024 |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 | 215 | 01AUG2024 |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 | 184 | 01AUG2024 |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 | 153 | 01AUG2024 |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 | 123 | 01AUG2024 |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 | 92 | 01AUG2024 |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 | 62 | 01AUG2024 |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 | 31 | 01AUG2024 |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 | 31 | 01AUG2024 |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 | 31 | 01AUG2024 |
Explanation
INTCK
difference between dates
INTNX
adds future date
7. Utilization Classification
PROC SQL;
CREATE TABLE utilization AS
SELECT *,
CASE
WHEN Data_Gaps<=5 THEN 'HIGH'
WHEN Data_Gaps<=20 THEN 'MEDIUM' ELSE 'LOW'
END AS Utilization
FROM air_quality_dates;
QUIT;
proc print data=utilization;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength | Service_Days | Next_Service_Date | Utilization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 | 365 | 01JUL2024 | MEDIUM |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 | 334 | 15JUL2024 | HIGH |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 | 337 | 10AUG2024 | HIGH |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 | 275 | 01JUL2024 | LOW |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 | 245 | 05JUL2024 | HIGH |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 | 214 | 07JUL2024 | LOW |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 | 215 | 01AUG2024 | HIGH |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 | 184 | 01AUG2024 | LOW |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 | 153 | 01AUG2024 | HIGH |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 | 123 | 01AUG2024 | MEDIUM |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 | 92 | 01AUG2024 | HIGH |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 | 62 | 01AUG2024 | LOW |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 | 31 | 01AUG2024 | HIGH |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 | 31 | 01AUG2024 | MEDIUM |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 | 31 | 01AUG2024 | MEDIUM |
Logic:
Low gaps
= high utilization
Explanation
CASE
creates classification
8. Fraud Detection Macro
%MACRO FRAUD_CHECK;
PROC SQL;
CREATE TABLE fraud_detection AS
SELECT *,
CASE
WHEN Accuracy_Score<60 AND Data_Gaps>30 THEN 1
WHEN Calibration_Status NE 'CALIBRATED' THEN 1
ELSE 0
END AS Fraud_Flag
FROM utilization;
QUIT;
proc print data=fraud_detection;
run;
%MEND;
%FRAUD_CHECK;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength | Service_Days | Next_Service_Date | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 | 365 | 01JUL2024 | MEDIUM | 1 |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 | 334 | 15JUL2024 | HIGH | 0 |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 | 337 | 10AUG2024 | HIGH | 0 |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 | 275 | 01JUL2024 | LOW | 0 |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 | 245 | 05JUL2024 | HIGH | 1 |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 | 214 | 07JUL2024 | LOW | 0 |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 | 215 | 01AUG2024 | HIGH | 0 |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 | 184 | 01AUG2024 | LOW | 1 |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 | 153 | 01AUG2024 | HIGH | 0 |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 | 123 | 01AUG2024 | MEDIUM | 0 |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 | 92 | 01AUG2024 | HIGH | 0 |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 | 62 | 01AUG2024 | LOW | 1 |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 | 31 | 01AUG2024 | HIGH | 0 |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 | 31 | 01AUG2024 | MEDIUM | 0 |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 | 31 | 01AUG2024 | MEDIUM | 0 |
Fraud
indicators
Low
accuracy
Not calibrated
High data gaps
Explanation
Macro
reusable logic
Fraud_Flag
1 fraud
0 normal
9. Append Example Using PROC SQL
proc sql;
create table new_sensor as
select
'S016' as Sensor_ID length=5,
'Delhi' as City length=10,
80 as PM2_5,
120 as PM10,
'CALIBRATED' as Calibration_Status length=12,
3 as Data_Gaps,
95 as Accuracy_Score,
today() as Install_Date format=date9.,
today() as Service_Date format=date9.,
90 as Battery_Level,
90 as Signal_Strength
from work.fraud_detection(obs=1);/*By Creating Dummy Record*/
quit;
proc print data=new_sensor;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S016 | Delhi | 80 | 120 | CALIBRATED | 3 | 95 | 25FEB2026 | 25FEB2026 | 90 | 90 |
PROC SQL;
CREATE TABLE combined AS
SELECT * FROM fraud_detection
UNION ALL
SELECT * FROM new_sensor;
QUIT;
proc print data=combined;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength | Service_Days | Next_Service_Date | Utilization | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 | 365 | 01JUL2024 | MEDIUM | 1 |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 | 334 | 15JUL2024 | HIGH | 0 |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 | 337 | 10AUG2024 | HIGH | 0 |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 | 275 | 01JUL2024 | LOW | 0 |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 | 245 | 05JUL2024 | HIGH | 1 |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 | 214 | 07JUL2024 | LOW | 0 |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 | 215 | 01AUG2024 | HIGH | 0 |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 | 184 | 01AUG2024 | LOW | 1 |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 | 153 | 01AUG2024 | HIGH | 0 |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 | 123 | 01AUG2024 | MEDIUM | 0 |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 | 92 | 01AUG2024 | HIGH | 0 |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 | 62 | 01AUG2024 | LOW | 1 |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 | 31 | 01AUG2024 | HIGH | 0 |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 | 31 | 01AUG2024 | MEDIUM | 0 |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 | 31 | 01AUG2024 | MEDIUM | 0 |
| 16 | S016 | Delhi | 80 | 120 | CALIBRATED | 3 | 95 | 25FEB2026 | 25FEB2026 | 90 | 90 | . | . | . |
10. Merge Example Using PROC SQL
Maintenance table
PROC SQL;
CREATE TABLE maintenance
(Sensor_ID CHAR(6),
Technician CHAR(20));
INSERT INTO maintenance
VALUES('S001','Ravi');
INSERT INTO maintenance
VALUES('S002','Amit');
QUIT;
proc print data=maintenance;
run;
OUTPUT:
| Obs | Sensor_ID | Technician |
|---|---|---|
| 1 | S001 | Ravi |
| 2 | S002 | Amit |
Merge
PROC SQL;
CREATE TABLE merged AS
SELECT
A.*,
B.Technician
FROM combined A LEFT JOIN
maintenance B
ON A.Sensor_ID=B.Sensor_ID;
QUIT;
proc print data=merged;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength | Service_Days | Next_Service_Date | Utilization | Fraud_Flag | Technician |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 | 365 | 01JUL2024 | MEDIUM | 1 | Ravi |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 | 334 | 15JUL2024 | HIGH | 0 | Amit |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 | 337 | 10AUG2024 | HIGH | 0 | |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 | 275 | 01JUL2024 | LOW | 0 | |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 | 245 | 05JUL2024 | HIGH | 1 | |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 | 214 | 07JUL2024 | LOW | 0 | |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 | 215 | 01AUG2024 | HIGH | 0 | |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 | 184 | 01AUG2024 | LOW | 1 | |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 | 153 | 01AUG2024 | HIGH | 0 | |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 | 123 | 01AUG2024 | MEDIUM | 0 | |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 | 92 | 01AUG2024 | HIGH | 0 | |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 | 62 | 01AUG2024 | LOW | 1 | |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 | 31 | 01AUG2024 | HIGH | 0 | |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 | 31 | 01AUG2024 | MEDIUM | 0 | |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 | 31 | 01AUG2024 | MEDIUM | 0 | |
| 16 | S016 | Delhi | 80 | 120 | CALIBRATED | 3 | 95 | 25FEB2026 | 25FEB2026 | 90 | 90 | . | . | . |
Explanation
LEFT JOIN
merge datasets
11. Transpose Example Using PROC SQL
PROC SQL;
CREATE TABLE city_summary AS
SELECT
City,
AVG(PM2_5) AS AVG_PM25,
AVG(PM10) AS AVG_PM10
FROM merged
GROUP BY City;
QUIT;
proc print data=city_summary;
run;
OUTPUT:
| Obs | City | AVG_PM25 | AVG_PM10 |
|---|---|---|---|
| 1 | Chennai | 85.000 | 115.0 |
| 2 | Delhi | 181.250 | 292.5 |
| 3 | Hyderabad | 77.500 | 117.5 |
| 4 | Kolkata | 92.500 | 145.0 |
| 5 | Mumbai | 61.667 | 100.0 |
| 6 | Pune | 72.500 | 117.5 |
Explanation
GROUP BY
aggregation
12. Create Statistical Ready Dataset
PROC SQL;
CREATE TABLE final_dataset AS
SELECT *,
PM10-PM2_5 AS Pollution_Difference
FROM merged;
QUIT;
proc print data=final_dataset;
run;
OUTPUT:
| Obs | Sensor_ID | City | PM2_5 | PM10 | Calibration_Status | Data_Gaps | Accuracy_Score | Install_Date | Service_Date | Battery_Level | Signal_Strength | Service_Days | Next_Service_Date | Utilization | Fraud_Flag | Technician | Pollution_Difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | S001 | Delhi | 250 | 400 | NOTCALIBRATED | 15 | 55 | 01JAN2023 | 01JAN2024 | 45 | 60 | 365 | 01JUL2024 | MEDIUM | 1 | Ravi | 150 |
| 2 | S002 | Mumbai | 35 | 60 | CALIBRATED | 2 | 98 | 15FEB2023 | 15JAN2024 | 90 | 85 | 334 | 15JUL2024 | HIGH | 0 | Amit | 25 |
| 3 | S003 | Chennai | . | 80 | CALIBRATED | 5 | 88 | 10MAR2023 | 10FEB2024 | 75 | 70 | 337 | 10AUG2024 | HIGH | 0 | . | |
| 4 | S004 | Kolkata | 120 | 180 | CALIBRATED | 25 | 40 | 01APR2023 | 01JAN2024 | 30 | 40 | 275 | 01JUL2024 | LOW | 0 | 60 | |
| 5 | S005 | Hyderabad | 80 | 120 | UNKNOWN | 0 | 100 | 05MAY2023 | 05JAN2024 | 95 | 90 | 245 | 05JUL2024 | HIGH | 1 | 40 | |
| 6 | S006 | Pune | 60 | 95 | CALIBRATED | 50 | 65 | 07JUN2023 | 07JAN2024 | 50 | 55 | 214 | 07JUL2024 | LOW | 0 | 35 | |
| 7 | S007 | Delhi | 300 | 500 | CALIBRATED | 3 | 92 | 01JUL2023 | 01FEB2024 | 88 | 88 | 215 | 01AUG2024 | HIGH | 0 | 200 | |
| 8 | S008 | Mumbai | 45 | 70 | NOTCALIBRATED | 40 | 45 | 01AUG2023 | 01FEB2024 | 35 | 45 | 184 | 01AUG2024 | LOW | 1 | 25 | |
| 9 | S009 | Chennai | 55 | 85 | CALIBRATED | 1 | 99 | 01SEP2023 | 01FEB2024 | 92 | 91 | 153 | 01AUG2024 | HIGH | 0 | 30 | |
| 10 | S010 | Kolkata | 65 | 110 | CALIBRATED | 10 | 85 | 01OCT2023 | 01FEB2024 | 80 | 75 | 123 | 01AUG2024 | MEDIUM | 0 | 45 | |
| 11 | S011 | Hyderabad | 75 | 115 | CALIBRATED | 0 | 97 | 01NOV2023 | 01FEB2024 | 95 | 95 | 92 | 01AUG2024 | HIGH | 0 | 40 | |
| 12 | S012 | Pune | 85 | 140 | CALIBRATED | 60 | 35 | 01DEC2023 | 01FEB2024 | 25 | 35 | 62 | 01AUG2024 | LOW | 1 | 55 | |
| 13 | S013 | Delhi | 95 | 150 | CALIBRATED | 5 | 93 | 01JAN2024 | 01FEB2024 | 90 | 88 | 31 | 01AUG2024 | HIGH | 0 | 55 | |
| 14 | S014 | Mumbai | 105 | 170 | CALIBRATED | 7 | 89 | 01JAN2024 | 01FEB2024 | 85 | 82 | 31 | 01AUG2024 | MEDIUM | 0 | 65 | |
| 15 | S015 | Chennai | 115 | 180 | CALIBRATED | 9 | 87 | 01JAN2024 | 01FEB2024 | 82 | 80 | 31 | 01AUG2024 | MEDIUM | 0 | 65 | |
| 16 | S016 | Delhi | 80 | 120 | CALIBRATED | 3 | 95 | 25FEB2026 | 25FEB2026 | 90 | 90 | . | . | . | 40 |
13. Delete Temporary Tables
PROC DATASETS LIB=WORK;
DELETE
air_quality_raw
air_quality_clean
air_quality_dates
utilization;
QUIT;
LOG:
NOTE: Deleting WORK.AIR_QUALITY_RAW (memtype=DATA).NOTE: Deleting WORK.AIR_QUALITY_CLEAN (memtype=DATA).NOTE: Deleting WORK.AIR_QUALITY_DATES (memtype=DATA).NOTE: Deleting WORK.UTILIZATION (memtype=DATA).
Explanation
Deletes unused tables
Improves performance
Final Dataset Contains
Sensor_ID
City
PM2_5
PM10
Calibration_Status
Accuracy_Score
Utilization
Fraud_Flag
Service_Days
Next_Service_Date
Interview Questions Answered
1.Question
How detect fraud?
Answer
CASE logic
2.Question
How merge datasets?
Answer
JOIN
3.Question
How clean text?
Answer
STRIP
PROPCASE
4.Question
How handle missing numeric?
Answer
COALESCE
Benefits
1.Improves data quality
2.Detects faulty sensors
3.Supports government decisions
4.Prevents fraud
Conclusion
This project demonstrated a complete
real-world air quality sensor network analytics system using only PROC SQL in
SAS. We created a dataset with intentional errors such as negative pollution
values, missing PM10 readings, inconsistent calibration status, and accuracy
scores exceeding 100 percent. Using SQL functions such as STRIP, PROPCASE,
UPCASE, COALESCE, and CASE, we cleaned and standardized the data.
We derived service intervals using INTCK and
future service schedules using INTNX. We classified utilization based on data
gaps and implemented fraud detection using reusable macro logic. We simulated
append operations using UNION ALL, merge operations using JOIN, and
transpose-style summaries using GROUP BY aggregation.
Finally, we removed temporary datasets to
maintain a clean SAS environment.
This approach reflects real industry
workflows used in:
Environmental monitoring
Clinical trials
Bank fraud detection
IoT analytics
and demonstrates strong PROC SQL expertise
required for SAS programmer interviews.
SAS INTERVIEW QUESTIONS
1. How do you insert records?
2. What is PROC SQL in SAS?
3. How do you
create a dataset using PROC SQL?
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent AIR QUALITY data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Follow Us On :
Follow Us On :
Comments
Post a Comment