432.Can Advanced SAS Programming Detect, Clean, and Optimize Deep-Sea Exploration Vehicle Data While Identifying Fraud Patterns?
Advanced SAS Programming for Deep-Sea Vehicle Data Optimization and Fraud Detection
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HERE IN THIS PROJECT WE USED THESE SAS STATEMENTS —DATA | SET | MERGE | INPUT | DATALINES | IF | PROC SORT | PROC SQL | PROC APPEND | PROC TRANSPOSE | PROC DATASETS | RUN | QUIT | %MACRO | %MEND | FUNCTIONS ETC
Table of Contents
- Introduction
- Business Context
- Dataset Design
- Initial Dataset Creation
- Error Identification
- Data Cleaning (Numeric Fixes)
- Character Standardization
- Date Handling (MDY, INTCK, INTNX)
- Fraud Detection Macro
- Data Transformation using SET
- MERGE for Enrichment
- PROC SQL Analysis
- PROC APPEND
- PROC TRANSPOSE
- Utilization Classification
- PROC DATASETS DELETE
- Final Clean Dataset Code
- 20 Key points About The Project
- Summary
- Conclusion
1. Introduction
Deep-sea exploration is one of the most technically demanding fields in modern science. Vehicles used in these missions—such as submersibles and remotely operated vehicles (ROVs)—generate massive volumes of operational and sensor data. This data must be accurate, consistent, and validated, especially when used for scientific research, defense, or commercial exploration.
However, real-world datasets are rarely perfect. They often contain:
- Missing values
- Incorrect formats
- Outliers and anomalies
- Fraudulent or manipulated entries
In this project, we will simulate a deep-sea exploration vehicle dataset, intentionally introduce errors, and then use Advanced SAS Programming techniques to clean, validate, transform, and analyze the data.
This is designed as a real-world, interview-ready SAS project with:
- 12+ independent SAS programs
- Macros for fraud detection
- Usage of DATA step, PROC SQL, PROC TRANSPOSE, PROC DATASETS, etc.
- Numeric and character functions
- Date handling with MDY, INTCK, INTNX
- Error detection and correction
2. Business Context
Deep-sea companies operate vehicles costing millions of dollars. Incorrect data may lead to:
- Wrong mission decisions
- Equipment damage
- Financial fraud
- Regulatory issues
This project simulates:
- Vehicle performance tracking
- Mission analytics
- Fraud detection (fake data entries, unrealistic values)
3. Dataset Design
Variables:
- Vehicle_Name
- Dive_Depth (meters)
- Pressure_Resistance (bars)
- Mission_Duration (hours)
- Data_Collected_GB
- Navigation_Error (%)
- Price (USD)
- Utilization_Status
- Mission_Start_Date
- Mission_End_Date
4. Initial Dataset Creation (With Intentional Errors)
data deep_sea_raw;
infile datalines truncover;
input Vehicle_Name &$25. Dive_Depth Pressure_Resistance Mission_Duration
Data_Collected_GB Navigation_Error Price Mission_Start_Date : mmddyy10.;
format Mission_Start_Date mmddyy10.;
datalines;
nautilus 11000 1200 48 500 2.5 5000000 01/15/2025
abyssX -500 1500 72 700 . 6000000 02/10/2025
deep explorer 9000 abc 36 450 1.2 4500000 03/05/2025
poseidon 12000 1800 60 -200 0.5 8000000 04/20/2025
triton 10000 1600 50 600 150 7000000 05/25/2025
;
run;
LOG:
Explanation
This is our raw dataset, intentionally filled with errors:
· Negative Dive_Depth
· Character value “abc” in numeric field
· Missing Navigation_Error
· Negative Data_Collected_GB
· Unrealistic Navigation_Error (150%)
Why Used
DATA step is the foundation of SAS for dataset creation.
Comparison of Input Modifiers
Modifier | Behavior | Use Case |
| Reads one word | Simple names |
| Reads until delimiter | Risky with spaces |
| Reads multiple words | Best for names with spaces |
5. Error Identification
proc print data=deep_sea_raw;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | nautilus | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 |
| 2 | abyssX | -500 | 1500 | 72 | 700 | . | 6000000 | 02/10/2025 |
| 3 | deep explorer | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 |
| 4 | poseidon | 12000 | 1800 | 60 | -200 | 0.5 | 8000000 | 04/20/2025 |
| 5 | triton | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 |
Explanation
Used to visually inspect dataset.
Points
· Quick validation
· Identifies obvious anomalies
6. Numeric Data Cleaning
data clean_numeric;
set deep_sea_raw;
if Dive_Depth < 0 then Dive_Depth = abs(Dive_Depth);
if Data_Collected_GB < 0 then Data_Collected_GB = .;
if Navigation_Error = . then Navigation_Error = 0;
run;
proc print data=clean_numeric;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | nautilus | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 |
| 2 | abyssX | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 |
| 3 | deep explorer | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 |
| 4 | poseidon | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 |
| 5 | triton | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 |
Explanation
Fixes:
· Negative values using abs()
· Missing values replaced
Numeric Functions Used:
· abs()
Why Used
Ensures numeric consistency.
7. Character Functions Standardization
data clean_char;
set clean_numeric;
Vehicle_Name = propcase(strip(Vehicle_Name));
Vehicle_Name = catx('_', Vehicle_Name, 'Vehicle');
run;
proc print data=clean_char;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 |
| 2 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 |
| 3 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 |
Functions Used
· strip() → removes spaces
· propcase() → proper capitalization
· catx() → concatenation
Why Used
Standard naming improves reporting.
8. Date Handling
data date_handling;
set clean_char;
Mission_End_Date = intnx('day', Mission_Start_Date, Mission_Duration);
Mission_Length = intck('day', Mission_Start_Date, Mission_End_Date);
format Mission_End_Date mmddyy10.;
run;
proc print data=date_handling;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 |
| 2 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 |
| 3 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 |
8.
Functions
· INTNX → add duration
· INTCK → calculate difference
Why Used
Critical for mission timeline analysis.
9. Fraud Detection Macro
%macro fraud_check;
data fraud_flag;
set date_handling;
if Navigation_Error > 100 then Fraud_Flag = 'YES';
else if Dive_Depth > 11000 then Fraud_Flag = 'YES';
else Fraud_Flag = 'NO';
run;
proc print data=fraud_flag;
run;
%mend;
%fraud_check;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO |
| 2 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO |
| 3 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES |
Explanation
Flags suspicious entries.
Why Used
Reusable logic across datasets.
10. SET Statement Usage
data combined_data;
set fraud_flag;
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO |
| 2 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO |
| 3 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES |
Explanation
Creates new dataset.
Why Used
Basic dataset copying and transformation.
11. MERGE Example
data pricing;
input Vehicle_Name:$25. Price_Category $;
datalines;
Nautilus_Vehicle High
Poseidon_Vehicle Premium
;
run;
proc print data=pricing;
run;
OUTPUT:
| Obs | Vehicle_Name | Price_Category |
|---|---|---|
| 1 | Nautilus_Vehicle | High |
| 2 | Poseidon_Vehicle | Premium |
proc sort data=combined_data;by Vehicle_Name;run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO |
| 2 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO |
| 3 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES |
proc sort data=pricing;by Vehicle_Name;run;
proc print data=pricing;
run;
OUTPUT:
| Obs | Vehicle_Name | Price_Category |
|---|---|---|
| 1 | Nautilus_Vehicle | High |
| 2 | Poseidon_Vehicle | Premium |
Explanation
Before using a MERGE statement with a BY variable, SAS requires both datasets to be sorted by that variable. Here, Vehicle_Name is the key used to combine records from combined_data and pricing.
Why It Is Used
· Ensures correct row-to-row matching
· Prevents data misalignment
· Avoids SAS warnings like:
ERROR: BY variables are not properly sorted
data merged_data;
merge combined_data
pricing;
by Vehicle_Name;
if price < 5000000 then Price_Category="Moderate";
else if 5000000 <= Price < 8000000 then Price_Category="High";
else if Price_Category="Premium";
run;
proc print data=merged_data;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag | Price_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO | High |
| 2 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO | Moderate |
| 3 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO | High |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES | Premium |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES | High |
Explanation
This logic categorizes vehicles into pricing tiers based on the numeric variable Price.
Why It Is Used
· Creates a derived variable (Price_Category)
· Converts continuous numeric data → categorical groups
· Helps in:
· Reporting
· Analysis
· Visualization
Why NOT use
5000000 < Price < 8000000
Because:
· SAS does not support chained comparisons properly
· It evaluates left to right → gives wrong results
Final Concept Summary
Concept | Why Used | Risk if Not Used |
PROC SORT | Correct merging | Wrong data mapping |
IF with AND | Accurate categorization | Logical errors |
12. PROC SQL Analysis
proc sql;
select Vehicle_Name, avg(Data_Collected_GB) as Avg_Data
from merged_data
group by Vehicle_Name;
quit;
OUTPUT:
| Vehicle_Name | Avg_Data |
|---|---|
| Abyssx_Vehicle | 700 |
| Deep Explorer_Vehicle | 450 |
| Nautilus_Vehicle | 500 |
| Poseidon_Vehicle | . |
| Triton_Vehicle | 600 |
Why Used
Advanced querying.
13. PROC APPEND
proc append base=combined_data
data=merged_data force;
run;
proc print data=combined_data;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO |
| 2 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO |
| 3 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES |
| 6 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO |
| 7 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO |
| 8 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO |
| 9 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES |
| 10 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES |
Why Used
Adds records efficiently.
14. PROC TRANSPOSE
proc transpose data=combined_data out=transposed;
var Dive_Depth Data_Collected_GB;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Dive_Depth | 500 | 9000 | 11000 | 12000 | 10000 | 500 | 9000 | 11000 | 12000 | 10000 |
| 2 | Data_Collected_GB | 700 | 450 | 500 | . | 600 | 700 | 450 | 500 | . | 600 |
Why Used
Reshapes dataset.
15. Utilization Classification
data utilization;
set combined_data;
length Utilization_Status $10.;
if Mission_Duration > 60 then Utilization_Status='High';
else if Mission_Duration > 40 then Utilization_Status='Medium';
else Utilization_Status='Low';
run;
proc print data=utilization;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag | Utilization_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO | High |
| 2 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO | Low |
| 3 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO | Medium |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES | Medium |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES | Medium |
| 6 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO | High |
| 7 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO | Low |
| 8 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO | Medium |
| 9 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES | Medium |
| 10 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES | Medium |
Why Used
Removes temporary datasets.
16. PROC DATASETS DELETE
proc datasets library=work;
delete clean_numeric clean_char;
quit;
LOG:
17. Final Corrected Dataset Code (Full Length)
data final_dataset;
set deep_sea_raw;
Dive_Depth = abs(Dive_Depth);
if Data_Collected_GB < 0 then Data_Collected_GB = .;
if Navigation_Error = . then Navigation_Error = 0;
Vehicle_Name = propcase(strip(Vehicle_Name));
Vehicle_Name = catx('_', Vehicle_Name, 'Vehicle');
Mission_End_Date = intnx('day', Mission_Start_Date, Mission_Duration);
Mission_Length = intck('day', Mission_Start_Date, Mission_End_Date);
if Navigation_Error > 100 or Dive_Depth > 11000 then Fraud_Flag='YES';
else Fraud_Flag='NO';
length Utilization_Status $10.;
if Mission_Duration > 60 then Utilization_Status='High';
else if Mission_Duration > 40 then Utilization_Status='Medium';
else Utilization_Status='Low';
format Mission_End_Date mmddyy10.;
run;
proc print data=final_dataset;
run;
OUTPUT:
| Obs | Vehicle_Name | Dive_Depth | Pressure_Resistance | Mission_Duration | Data_Collected_GB | Navigation_Error | Price | Mission_Start_Date | Mission_End_Date | Mission_Length | Fraud_Flag | Utilization_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Nautilus_Vehicle | 11000 | 1200 | 48 | 500 | 2.5 | 5000000 | 01/15/2025 | 03/04/2025 | 48 | NO | Medium |
| 2 | Abyssx_Vehicle | 500 | 1500 | 72 | 700 | 0.0 | 6000000 | 02/10/2025 | 04/23/2025 | 72 | NO | High |
| 3 | Deep Explorer_Vehicle | 9000 | . | 36 | 450 | 1.2 | 4500000 | 03/05/2025 | 04/10/2025 | 36 | NO | Low |
| 4 | Poseidon_Vehicle | 12000 | 1800 | 60 | . | 0.5 | 8000000 | 04/20/2025 | 06/19/2025 | 60 | YES | Medium |
| 5 | Triton_Vehicle | 10000 | 1600 | 50 | 600 | 150.0 | 7000000 | 05/25/2025 | 07/14/2025 | 50 | YES | Medium |
1. Designed a high-quality simulated dataset representing deep-sea exploration vehicles with operational and financial variables.
2. Introduced real-world data issues such as missing values, invalid entries, and outliers to mimic production-level challenges.
3. Applied DATA step techniques to ingest, structure, and pre-process raw mission data efficiently.
4. Utilized robust INPUT methods (&, informats, TRUNCOVER) to correctly parse complex character fields like multi-word vehicle names.
5. Implemented numeric data validation using functions like ABS, conditional checks, and missing value handling.
6. Standardized character variables using STRIP, PROPCASE, UPCASE, and LOWCASE for consistency across datasets.
7. Leveraged concatenation functions (CAT, CATX) to create structured and readable derived variables.
8. Applied date handling techniques using MDY, INTNX, and INTCK for mission timeline calculations and duration tracking.
9. Detected and corrected data anomalies such as negative dive depths, unrealistic navigation errors, and invalid pressure values.
10. Built reusable SAS macros to automate fraud detection logic based on predefined thresholds and business rules.
11. Flagged suspicious records such as extreme navigation errors (>100%) and abnormal dive depths beyond engineering limits.
12. Performed dataset transformations using SET statements for streamlined data flow and intermediate processing.
13. Combined multiple datasets using MERGE with proper sorting to ensure accurate record alignment.
14. Conducted advanced querying and aggregation using PROC SQL for analytical insights like average data collection.
15. Appended incremental datasets using PROC APPEND to simulate real-time data integration scenarios.
16. Reshaped data structures using PROC TRANSPOSE for reporting flexibility and downstream analytics.
17. Classified vehicle utilization levels (Low, Medium, High) based on mission duration thresholds.
18. Managed workspace efficiently by removing temporary datasets using PROC DATASETS DELETE.
19. Ensured data integrity and readiness for reporting by applying end-to-end cleaning, validation, and transformation workflows.
20. Delivered an interview-ready, production-style SAS project demonstrating data engineering, validation, and fraud detection expertise in a single pipeline.
Summary
In this project, we built a complete SAS workflow for deep-sea exploration data:
- Created dataset with intentional errors
- Applied cleaning using numeric & character functions
- Used date functions for mission tracking
- Built fraud detection macro
- Performed transformations using SET, MERGE
- Used PROC SQL, APPEND, TRANSPOSE
- Classified utilization
Conclusion
Working with real-world data is never as clean as textbooks make it seem. In this deep-sea exploration project, we saw how even a small dataset can contain multiple layers of errors from incorrect numeric values to inconsistent naming and unrealistic operational metrics. Using SAS, we didn’t just clean the data.we transformed it into something meaningful and reliable. The use of macros for fraud detection adds a powerful layer of automation, making the system scalable for large datasets. Functions like INTNX and INTCK helped us handle time-based logic efficiently, which is crucial in mission-based industries. Overall, this project reflects what actually happens in real jobs: identifying problems, fixing them systematically, and ensuring the final dataset is trustworthy. Mastering these techniques will definitely give you an edge in SAS interviews and real-world projects.
SAS INTERVIEW QUESTIONS
1. What is automatic macro variable?
Answer:
Automatic macro variables are system-created variables such as &SYSDATE, &SYSUSERID.
2. What is macro debugging?
Answer:
Macro debugging uses options like MPRINT, MLOGIC, SYMBOLGEN to track macro execution.
3. What is the difference between local and global macro variables?
Answer:
Global variables are available throughout the program, while local variables exist inside a macro only.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent DEEP-SEA VEHICLE data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers writing about analytics
· Clinical SAS Programmer
· Research Data Analyst
· Regulatory Data Validator
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Comments
Post a Comment