437.Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?
Data-Driven Shipping Excellence: Detecting Errors and Boosting Route Performance with SAS
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
IN THIS PROJECT WE USE THESE SAS STATEMENTS AND TOOLS: DATA | SET | MERGE | INPUT | DATALINES | IF | BY | OUTPUT | PROC SORT | PROC APPEND | PROC TRANSPOSE | PROC DATASETS | RUN | %MACRO | %MEND | SAS FUNCTIONS | R LANGUAGE
Table of Contents
1. Introduction
2. Business Context
3. Raw Dataset (SAS)
4. Raw Dataset in R
5. Intentional Error Dataset
6. Error Correction Code
7. Derived Variables
8. Sorting
9. Merge Example
10. PROC SQL
11. PROC TRANSPOSE
12. PROC MEANS
13. PROC FREQ
14. PROC SGPLOT
15. Macro Implementation
16. PROC APPEND
17. PROC DATASETS
18. Final Clean Dataset Code
19. Key Insights
20. 20 Key Points About The Project
21. Summary
22. Conclusion
1. Introduction
Global cargo shipping plays a critical role in enabling international trade,
connecting continents, and supporting the global economy. Every day, thousands
of vessels transport goods such as oil, food products, machinery, and raw
materials across vast ocean routes. Managing these operations efficiently
requires accurate and well-structured data related to shipping routes, fuel
consumption, cargo capacity, weather conditions, and operational costs.
However, in real-world scenarios, shipping data is often prone to
inconsistencies such as missing values, incorrect entries, unrealistic
efficiency percentages, and data type mismatches. These issues can lead to poor
decision-making, increased fuel costs, delays in delivery, and reduced
operational efficiency. Therefore, ensuring high-quality data is essential for
logistics optimization and business success.
This project focuses on creating a simulated global cargo shipping dataset
and applying Advanced SAS Programming techniques to handle such challenges.
Using SAS statements, we demonstrate how to identify data errors, clean
and transform datasets, and derive meaningful insights such as route efficiency
and cost optimization. Additionally, intentional errors are introduced to
replicate real-world data issues, allowing us to showcase practical debugging
and correction strategies.
By the end of this project, we will have a clean, optimized dataset and a
deeper understanding of how SAS can be used as a powerful tool for data
validation, transformation, and analytics in the shipping and logistics domain.
2. Business Context
Shipping companies rely on accurate route data to:
- Optimize fuel consumption
- Minimize weather delays
- Improve route efficiency
- Reduce operational costs
Poor data quality leads to:
- Wrong decisions
- Financial losses
- Inefficient logistics
3. Raw Dataset (SAS)
data cargo_raw;
input Route_ID $ Distance_nm Fuel_Consumption Cargo_Tons Weather_Delay
Route_Efficiency Fees;
datalines;
R001 1200 500 200 2 85 10000
R002 1500 700 300 3 78 15000
R003 1800 800 350 5 90 20000
R004 2000 900 400 7 95 25000
R005 2200 950 450 8 92 27000
R006 2500 1000 500 10 88 30000
R007 2700 1100 550 12 80 32000
R008 3000 1200 600 15 75 35000
R009 3200 1300 650 18 70 38000
R010 3500 1400 700 20 65 40000
R011 3700 1500 750 22 60 42000
R012 4000 1600 800 25 55 45000
;
run;
proc print data=cargo_raw;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees |
|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500 | 200 | 2 | 85 | 10000 |
| 2 | R002 | 1500 | 700 | 300 | 3 | 78 | 15000 |
| 3 | R003 | 1800 | 800 | 350 | 5 | 90 | 20000 |
| 4 | R004 | 2000 | 900 | 400 | 7 | 95 | 25000 |
| 5 | R005 | 2200 | 950 | 450 | 8 | 92 | 27000 |
| 6 | R006 | 2500 | 1000 | 500 | 10 | 88 | 30000 |
| 7 | R007 | 2700 | 1100 | 550 | 12 | 80 | 32000 |
| 8 | R008 | 3000 | 1200 | 600 | 15 | 75 | 35000 |
| 9 | R009 | 3200 | 1300 | 650 | 18 | 70 | 38000 |
| 10 | R010 | 3500 | 1400 | 700 | 20 | 65 | 40000 |
| 11 | R011 | 3700 | 1500 | 750 | 22 | 60 | 42000 |
| 12 | R012 | 4000 | 1600 | 800 | 25 | 55 | 45000 |
Explanation (SAS)
· DATA cargo_raw; → Creates the dataset
· INPUT → Defines the structure (same columns as the R version)
· DATALINES → Inline raw data entry
· $ → Character variable (Route_ID)
· Remaining variables → Numeric
Why used?
· Fundamental dataset creation in SAS
4. Raw Dataset in R
cargo_raw <- data.frame(
Route_ID = c("R001","R002","R003","R004","R005","R006",
"R007","R008","R009","R010","R011","R012"),
Distance_nm = c(1200,1500,1800,2000,2200,2500,
2700,3000,3200,3500,3700,4000),
Fuel_Consumption = c(500,700,800,900,950,1000,
1100,1200,1300,1400,1500,1600),
Cargo_Tons = c(200,300,350,400,450,500,
550,600,650,700,750,800),
Weather_Delay = c(2,3,5,7,8,10,
12,15,18,20,22,25),
Route_Efficiency = c(85,78,90,95,92,88,
80,75,70,65,60,55),
Fees = c(10000,15000,20000,25000,27000,30000,
32000,35000,38000,40000,42000,45000)
)
OUTPUT:
| | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees |
|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500 | 200 | 2 | 85 | 10000 |
| 2 | R002 | 1500 | 700 | 300 | 3 | 78 | 15000 |
| 3 | R003 | 1800 | 800 | 350 | 5 | 90 | 20000 |
| 4 | R004 | 2000 | 900 | 400 | 7 | 95 | 25000 |
| 5 | R005 | 2200 | 950 | 450 | 8 | 92 | 27000 |
| 6 | R006 | 2500 | 1000 | 500 | 10 | 88 | 30000 |
| 7 | R007 | 2700 | 1100 | 550 | 12 | 80 | 32000 |
| 8 | R008 | 3000 | 1200 | 600 | 15 | 75 | 35000 |
| 9 | R009 | 3200 | 1300 | 650 | 18 | 70 | 38000 |
| 10 | R010 | 3500 | 1400 | 700 | 20 | 65 | 40000 |
| 11 | R011 | 3700 | 1500 | 750 | 22 | 60 | 42000 |
| 12 | R012 | 4000 | 1600 | 800 | 25 | 55 | 45000 |
Explanation (R)
· data.frame() → Creates a tabular dataset
· Each column matches the SAS dataset exactly
· Order of observations is identical
· Data types: character for Route_ID, numeric for all other variables
5. Intentional Error Dataset
data cargo_error;
set cargo_raw;
if Route_ID='R005' then Fuel_Consumption=.;
if Route_ID='R007' then Distance_nm='ABC';
if Route_ID='R009' then Route_Efficiency=150;
run;
proc print data=cargo_error;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees |
|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500 | 200 | 2 | 85 | 10000 |
| 2 | R002 | 1500 | 700 | 300 | 3 | 78 | 15000 |
| 3 | R003 | 1800 | 800 | 350 | 5 | 90 | 20000 |
| 4 | R004 | 2000 | 900 | 400 | 7 | 95 | 25000 |
| 5 | R005 | 2200 | . | 450 | 8 | 92 | 27000 |
| 6 | R006 | 2500 | 1000 | 500 | 10 | 88 | 30000 |
| 7 | R007 | . | 1100 | 550 | 12 | 80 | 32000 |
| 8 | R008 | 3000 | 1200 | 600 | 15 | 75 | 35000 |
| 9 | R009 | 3200 | 1300 | 650 | 18 | 150 | 38000 |
| 10 | R010 | 3500 | 1400 | 700 | 20 | 65 | 40000 |
| 11 | R011 | 3700 | 1500 | 750 | 22 | 60 | 42000 |
| 12 | R012 | 4000 | 1600 | 800 | 25 | 55 | 45000 |
Errors Explanation
1. Missing Fuel Consumption: "." marks a missing numeric value (R005).
2. Character in Numeric Field: 'ABC' is assigned to the numeric variable Distance_nm; SAS writes an error note and stores a missing value (R007).
3. Invalid Efficiency: Route_Efficiency of 150 exceeds the realistic 0-100% range (R009).
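For readers following along outside SAS, the same three intentional errors can be sketched in plain Python. This is only an illustration of the error-injection idea, not part of the SAS workflow; the list name `cargo` and the reduced set of columns are chosen here for brevity.

```python
# A plain-Python copy of three affected rows (clean starting values).
cargo = [
    {"Route_ID": "R005", "Fuel_Consumption": 950, "Distance_nm": 2200, "Route_Efficiency": 92},
    {"Route_ID": "R007", "Fuel_Consumption": 1100, "Distance_nm": 2700, "Route_Efficiency": 80},
    {"Route_ID": "R009", "Fuel_Consumption": 1300, "Distance_nm": 3200, "Route_Efficiency": 70},
]

# Inject the same three intentional errors as the SAS DATA step.
for row in cargo:
    if row["Route_ID"] == "R005":
        row["Fuel_Consumption"] = None   # missing value (SAS '.')
    if row["Route_ID"] == "R007":
        row["Distance_nm"] = "ABC"       # character stored in a numeric field
    if row["Route_ID"] == "R009":
        row["Route_Efficiency"] = 150    # impossible efficiency (> 100%)
```

In Python the type error is silently allowed, whereas SAS flags it in the log; that difference is exactly why explicit validation rules are needed on either platform.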
6. Error Correction Code
data cargo_clean;
set cargo_error;
if Fuel_Consumption=. then Fuel_Consumption=avg(500,700,800);
if notdigit(Distance_nm)>0 then Distance_nm=2000;
if Route_Efficiency>100 then Route_Efficiency=100;
run;
proc print data=cargo_clean;
run;
Corrected Code:
data cargo_clean;
set cargo_error;
/* Fix missing fuel consumption */
if Fuel_Consumption = . then Fuel_Consumption = mean(500,700,800);
/* Fix invalid numeric values */
if notdigit(strip(put(Distance_nm, best.))) > 0 then Distance_nm = 2000;
/* Fix efficiency > 100 */
if Route_Efficiency > 100 then Route_Efficiency = 100;
run;
proc print data=cargo_clean;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees |
|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500.00 | 200 | 2 | 85 | 10000 |
| 2 | R002 | 1500 | 700.00 | 300 | 3 | 78 | 15000 |
| 3 | R003 | 1800 | 800.00 | 350 | 5 | 90 | 20000 |
| 4 | R004 | 2000 | 900.00 | 400 | 7 | 95 | 25000 |
| 5 | R005 | 2200 | 666.67 | 450 | 8 | 92 | 27000 |
| 6 | R006 | 2500 | 1000.00 | 500 | 10 | 88 | 30000 |
| 7 | R007 | 2000 | 1100.00 | 550 | 12 | 80 | 32000 |
| 8 | R008 | 3000 | 1200.00 | 600 | 15 | 75 | 35000 |
| 9 | R009 | 3200 | 1300.00 | 650 | 18 | 100 | 38000 |
| 10 | R010 | 3500 | 1400.00 | 700 | 20 | 65 | 40000 |
| 11 | R011 | 3700 | 1500.00 | 750 | 22 | 60 | 42000 |
| 12 | R012 | 4000 | 1600.00 | 800 | 25 | 55 | 45000 |
Explanation
1. Fix for the AVG Error
Wrong: avg(500,700,800)
Correct: mean(500,700,800)
Why? MEAN() is a valid DATA step function: it calculates the average of its arguments and automatically ignores missing values. AVG() is available only as a summary function in PROC SQL.
2. NOTDIGIT Issue Fix
Original: notdigit(Distance_nm)
Problem: NOTDIGIT() works only on character values, and Distance_nm is numeric, so the call relies on an implicit numeric-to-character conversion that produces log notes and unreliable results.
Fixed: notdigit(strip(put(Distance_nm, best.)))
Why? PUT() converts the numeric value to character, STRIP() removes surrounding spaces, and NOTDIGIT() then checks for invalid characters.
3. Efficiency Correction
if Route_Efficiency > 100 then Route_Efficiency = 100;
Why? Efficiency cannot exceed 100%, so capping the value enforces a realistic business rule.
Key Learnings
- AVG() is not valid in the DATA step; MEAN() is the correct function.
- NOTDIGIT() is for character variables only.
- Always validate data types before applying functions.
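The same three correction rules can be cross-checked in a short Python sketch (illustrative only; the function name `clean_row` is invented here, and the digits-only check mirrors NOTDIGIT for whole-number distances):

```python
from statistics import mean

def clean_row(row):
    """Apply the three correction rules from the SAS DATA step."""
    if row["Fuel_Consumption"] is None:          # mean imputation, like MEAN(500,700,800)
        row["Fuel_Consumption"] = mean([500, 700, 800])
    if not str(row["Distance_nm"]).isdigit():    # NOTDIGIT-style validity check
        row["Distance_nm"] = 2000
    if row["Route_Efficiency"] > 100:            # cap efficiency at 100%
        row["Route_Efficiency"] = 100
    return row

# The R005/R007/R009 errors, all in one hypothetical row for compactness:
fixed = clean_row({"Fuel_Consumption": None, "Distance_nm": "ABC", "Route_Efficiency": 150})
```

After cleaning, the imputed fuel value is 666.67, the invalid distance becomes 2000, and the efficiency is capped at 100, matching the corrected SAS output above.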
7. Derived Variables
data cargo_derived;
set cargo_clean;
Fuel_Efficiency = Distance_nm / Fuel_Consumption;
Delay_Impact = Weather_Delay * 2;
run;
proc print data=cargo_derived;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees | Fuel_Efficiency | Delay_Impact |
|---|---|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500.00 | 200 | 2 | 85 | 10000 | 2.40000 | 4 |
| 2 | R002 | 1500 | 700.00 | 300 | 3 | 78 | 15000 | 2.14286 | 6 |
| 3 | R003 | 1800 | 800.00 | 350 | 5 | 90 | 20000 | 2.25000 | 10 |
| 4 | R004 | 2000 | 900.00 | 400 | 7 | 95 | 25000 | 2.22222 | 14 |
| 5 | R005 | 2200 | 666.67 | 450 | 8 | 92 | 27000 | 3.30000 | 16 |
| 6 | R006 | 2500 | 1000.00 | 500 | 10 | 88 | 30000 | 2.50000 | 20 |
| 7 | R007 | 2000 | 1100.00 | 550 | 12 | 80 | 32000 | 1.81818 | 24 |
| 8 | R008 | 3000 | 1200.00 | 600 | 15 | 75 | 35000 | 2.50000 | 30 |
| 9 | R009 | 3200 | 1300.00 | 650 | 18 | 100 | 38000 | 2.46154 | 36 |
| 10 | R010 | 3500 | 1400.00 | 700 | 20 | 65 | 40000 | 2.50000 | 40 |
| 11 | R011 | 3700 | 1500.00 | 750 | 22 | 60 | 42000 | 2.46667 | 44 |
| 12 | R012 | 4000 | 1600.00 | 800 | 25 | 55 | 45000 | 2.50000 | 50 |
Why?
· Fuel_Efficiency (nautical miles per unit of fuel) and Delay_Impact (a doubled-delay penalty score) make routes directly comparable for business analysis.
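The two derived variables are simple arithmetic, so a plain-Python sketch of the same formulas (illustrative, with two sample rows) reads:

```python
rows = [
    {"Route_ID": "R001", "Distance_nm": 1200, "Fuel_Consumption": 500, "Weather_Delay": 2},
    {"Route_ID": "R005", "Distance_nm": 2200, "Fuel_Consumption": 666.67, "Weather_Delay": 8},
]
for r in rows:
    # Nautical miles covered per unit of fuel burned.
    r["Fuel_Efficiency"] = r["Distance_nm"] / r["Fuel_Consumption"]
    # Simple penalty score: twice the delay hours.
    r["Delay_Impact"] = r["Weather_Delay"] * 2
```

R001 yields 1200/500 = 2.4 and R005 yields 2200/666.67 ≈ 3.30, the same figures as the SAS output table.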
8. Sorting
proc sort data=cargo_derived;
by descending Route_Efficiency;
run;
proc print data=cargo_derived;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees | Fuel_Efficiency | Delay_Impact |
|---|---|---|---|---|---|---|---|---|---|
| 1 | R009 | 3200 | 1300.00 | 650 | 18 | 100 | 38000 | 2.46154 | 36 |
| 2 | R004 | 2000 | 900.00 | 400 | 7 | 95 | 25000 | 2.22222 | 14 |
| 3 | R005 | 2200 | 666.67 | 450 | 8 | 92 | 27000 | 3.30000 | 16 |
| 4 | R003 | 1800 | 800.00 | 350 | 5 | 90 | 20000 | 2.25000 | 10 |
| 5 | R006 | 2500 | 1000.00 | 500 | 10 | 88 | 30000 | 2.50000 | 20 |
| 6 | R001 | 1200 | 500.00 | 200 | 2 | 85 | 10000 | 2.40000 | 4 |
| 7 | R007 | 2000 | 1100.00 | 550 | 12 | 80 | 32000 | 1.81818 | 24 |
| 8 | R002 | 1500 | 700.00 | 300 | 3 | 78 | 15000 | 2.14286 | 6 |
| 9 | R008 | 3000 | 1200.00 | 600 | 15 | 75 | 35000 | 2.50000 | 30 |
| 10 | R010 | 3500 | 1400.00 | 700 | 20 | 65 | 40000 | 2.50000 | 40 |
| 11 | R011 | 3700 | 1500.00 | 750 | 22 | 60 | 42000 | 2.46667 | 44 |
| 12 | R012 | 4000 | 1600.00 | 800 | 25 | 55 | 45000 | 2.50000 | 50 |
Key Points
· DATA step → Creates datasets; used for transformation
· SET → Reads an existing dataset
· IF → Conditional logic
· PROC SORT → Orders data
· PROC SQL → Advanced querying
· Macros → Reusability
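PROC SORT with DESCENDING corresponds to an ordinary descending sort on the efficiency column; a minimal Python sketch (illustrative route/efficiency pairs taken from the table above):

```python
# (Route_ID, Route_Efficiency) pairs; sort best-performing route first.
routes = [("R001", 85), ("R004", 95), ("R009", 100), ("R012", 55)]
ranked = sorted(routes, key=lambda r: r[1], reverse=True)  # like BY DESCENDING Route_Efficiency
```

`ranked` starts with R009 (efficiency 100) and ends with R012 (efficiency 55), matching the order of the PROC SORT output.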
9. Merge Example
data extra;
input Route_ID $ Region $;
datalines;
R001 Asia
R002 Europe
;
run;
proc print data=extra;
run;
OUTPUT:
| Obs | Route_ID | Region |
|---|---|---|
| 1 | R001 | Asia |
| 2 | R002 | Europe |
Before a MERGE, both datasets must be sorted by the BY variable, here Route_ID. (Sorting extra by Route_Efficiency would fail, because that variable does not exist in extra.)
proc sort data=cargo_derived;
by Route_ID;
run;
proc sort data=extra;
by Route_ID;
run;
data merged_data;
merge cargo_derived extra;
by Route_ID;
run;
proc print data=merged_data;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees | Fuel_Efficiency | Delay_Impact | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500.00 | 200 | 2 | 85 | 10000 | 2.40000 | 4 | Asia |
| 2 | R002 | 1500 | 700.00 | 300 | 3 | 78 | 15000 | 2.14286 | 6 | Europe |
| 3 | R003 | 1800 | 800.00 | 350 | 5 | 90 | 20000 | 2.25000 | 10 | |
| 4 | R004 | 2000 | 900.00 | 400 | 7 | 95 | 25000 | 2.22222 | 14 | |
| 5 | R005 | 2200 | 666.67 | 450 | 8 | 92 | 27000 | 3.30000 | 16 | |
| 6 | R006 | 2500 | 1000.00 | 500 | 10 | 88 | 30000 | 2.50000 | 20 | |
| 7 | R007 | 2000 | 1100.00 | 550 | 12 | 80 | 32000 | 1.81818 | 24 | |
| 8 | R008 | 3000 | 1200.00 | 600 | 15 | 75 | 35000 | 2.50000 | 30 | |
| 9 | R009 | 3200 | 1300.00 | 650 | 18 | 100 | 38000 | 2.46154 | 36 | |
| 10 | R010 | 3500 | 1400.00 | 700 | 20 | 65 | 40000 | 2.50000 | 40 | |
| 11 | R011 | 3700 | 1500.00 | 750 | 22 | 60 | 42000 | 2.46667 | 44 | |
| 12 | R012 | 4000 | 1600.00 | 800 | 25 | 55 | 45000 | 2.50000 | 50 | |
Routes without a matching row in extra keep a blank Region.
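The merge is logically a keyed lookup: each route picks up its Region where one exists. A plain-Python sketch of the same idea (illustrative names, three sample routes):

```python
# Lookup table corresponding to the 'extra' dataset.
region = {"R001": "Asia", "R002": "Europe"}
derived = [{"Route_ID": rid} for rid in ["R001", "R002", "R003"]]

# Left-merge by Route_ID: unmatched routes get None (blank Region in SAS).
merged = [{**row, "Region": region.get(row["Route_ID"])} for row in derived]
```

A dict lookup needs no sorting, whereas the SAS MERGE is sequential and therefore requires both inputs sorted by the BY variable.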
10. PROC SQL
proc sql;
select Route_ID, avg(Fuel_Consumption) as AvgFuel
from cargo_clean
group by Route_ID;
quit;
OUTPUT:
| Route_ID | AvgFuel |
|---|---|
| R001 | 500 |
| R002 | 700 |
| R003 | 800 |
| R004 | 900 |
| R005 | 666.6667 |
| R006 | 1000 |
| R007 | 1100 |
| R008 | 1200 |
| R009 | 1300 |
| R010 | 1400 |
| R011 | 1500 |
| R012 | 1600 |
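Because each Route_ID occurs exactly once in cargo_clean, GROUP BY Route_ID returns each route's own fuel value; an average only differs from the raw value when a key repeats. A Python sketch makes this visible using a hypothetical second fuel reading for R001 (the repeated value 520 is invented for illustration):

```python
from collections import defaultdict
from statistics import mean

# (Route_ID, Fuel_Consumption) pairs; R001 appears twice on purpose.
fuel = [("R001", 500), ("R001", 520), ("R002", 700)]

by_route = defaultdict(list)
for rid, value in fuel:
    by_route[rid].append(value)

# Equivalent of: select Route_ID, avg(Fuel_Consumption) ... group by Route_ID
avg_fuel = {rid: mean(values) for rid, values in by_route.items()}
```

R001 averages to 510 while R002, with a single row, simply returns 700, which is why every AvgFuel in the output above equals the route's original value.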
11. PROC TRANSPOSE
proc transpose data=cargo_clean out=transposed;
var Distance_nm Fuel_Consumption;
run;
proc print data=transposed;
run;
OUTPUT:
| Obs | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 | COL8 | COL9 | COL10 | COL11 | COL12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Distance_nm | 1200 | 1500 | 1800 | 2000 | 2200.00 | 2500 | 2000 | 3000 | 3200 | 3500 | 3700 | 4000 |
| 2 | Fuel_Consumption | 500 | 700 | 800 | 900 | 666.67 | 1000 | 1100 | 1200 | 1300 | 1400 | 1500 | 1600 |
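PROC TRANSPOSE turns each selected column into a row named by _NAME_, with the observations spread across COL1, COL2, and so on. A small Python sketch of that reshaping (three observations, illustrative only):

```python
# Column-oriented data, as in cargo_clean (first three routes shown).
table = {"Distance_nm": [1200, 1500, 1800], "Fuel_Consumption": [500, 700, 800]}

# One output row per input column, values fanned out into COL1..COLn.
transposed = [
    {"_NAME_": name, **{f"COL{i + 1}": v for i, v in enumerate(values)}}
    for name, values in table.items()
]
```

The first output row carries Distance_nm's values and the second carries Fuel_Consumption's, mirroring the _NAME_/COLn layout of the SAS output above.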
12. PROC MEANS
proc means data=cargo_clean;
var Distance_nm Fuel_Consumption Cargo_Tons;
run;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Distance_nm | 12 | 2550.00 | 913.0368906 | 1200.00 | 4000.00 |
| Fuel_Consumption | 12 | 1055.56 | 353.4343290 | 500.0000000 | 1600.00 |
| Cargo_Tons | 12 | 520.8333333 | 187.6388375 | 200.0000000 | 800.0000000 |
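The PROC MEANS figures can be reproduced with Python's standard statistics module (a cross-check, not part of the SAS workflow; note the cleaned Distance_nm values, with R007 set to 2000):

```python
from statistics import mean, stdev

# Distance_nm from cargo_clean (R007 corrected to 2000).
distance = [1200, 1500, 1800, 2000, 2200, 2500, 2000, 3000, 3200, 3500, 3700, 4000]

summary = {
    "N": len(distance),
    "Mean": mean(distance),
    "StdDev": stdev(distance),  # sample standard deviation, as PROC MEANS reports
    "Min": min(distance),
    "Max": max(distance),
}
```

The result (N=12, Mean=2550, Std Dev≈913.04, Min=1200, Max=4000) matches the Distance_nm row of the table above.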
13. PROC FREQ
proc freq data=cargo_clean;
tables Weather_Delay;
run;
OUTPUT:
The FREQ Procedure
| Weather_Delay | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
|---|---|---|---|---|
| 2 | 1 | 8.33 | 1 | 8.33 |
| 3 | 1 | 8.33 | 2 | 16.67 |
| 5 | 1 | 8.33 | 3 | 25.00 |
| 7 | 1 | 8.33 | 4 | 33.33 |
| 8 | 1 | 8.33 | 5 | 41.67 |
| 10 | 1 | 8.33 | 6 | 50.00 |
| 12 | 1 | 8.33 | 7 | 58.33 |
| 15 | 1 | 8.33 | 8 | 66.67 |
| 18 | 1 | 8.33 | 9 | 75.00 |
| 20 | 1 | 8.33 | 10 | 83.33 |
| 22 | 1 | 8.33 | 11 | 91.67 |
| 25 | 1 | 8.33 | 12 | 100.00 |
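Every delay value in this dataset is unique, so each PROC FREQ row shows frequency 1 and percent 8.33. A Python sketch with a deliberately repeated value (the four-element list is hypothetical) shows how the cumulative columns are built:

```python
from collections import Counter

delays = [2, 3, 5, 2]          # hypothetical sample with one repeated delay
freq = Counter(delays)
total = len(delays)

cum = 0
table = []
for value in sorted(freq):
    cum += freq[value]
    # (value, frequency, percent, cumulative frequency, cumulative percent)
    table.append((value, freq[value], round(100 * freq[value] / total, 2),
                  cum, round(100 * cum / total, 2)))
```

The repeated delay of 2 gets frequency 2 and percent 50, and the cumulative percent reaches 100 on the last row, exactly as in a PROC FREQ table.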
14. PROC SGPLOT
proc sgplot data=cargo_clean;
scatter x=Distance_nm y=Fuel_Consumption;
run;
OUTPUT: (scatter plot of Fuel_Consumption against Distance_nm)
15. Macro Implementation
%macro summary;
proc means data=cargo_clean;
run;
%mend;
%summary;
OUTPUT:
The MEANS Procedure
| Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|
| Distance_nm | 12 | 2550.00 | 913.0368906 | 1200.00 | 4000.00 |
| Fuel_Consumption | 12 | 1055.56 | 353.4343290 | 500.0000000 | 1600.00 |
| Cargo_Tons | 12 | 520.8333333 | 187.6388375 | 200.0000000 | 800.0000000 |
| Weather_Delay | 12 | 12.2500000 | 7.7003542 | 2.0000000 | 25.0000000 |
| Route_Efficiency | 12 | 80.2500000 | 14.2581715 | 55.0000000 | 100.0000000 |
| Fees | 12 | 29916.67 | 10958.25 | 10000.00 | 45000.00 |
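A SAS macro packages a block of code for reuse, the same role a function plays in most languages. A Python sketch of the macro idea (the function name `summary` and the two-column sample are illustrative):

```python
from statistics import mean

def summary(dataset):
    """Reusable summary, like %macro summary: define once, call per dataset."""
    return {name: mean(values) for name, values in dataset.items()}

# One call per 'dataset', analogous to invoking %summary;
stats = summary({"Distance_nm": [1200, 1500], "Fees": [10000, 15000]})
```

A more flexible SAS macro would take the dataset name as a parameter rather than hard-coding cargo_clean, just as this function takes `dataset` as an argument.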
16. PROC APPEND
proc append base=cargo_clean
data=extra Force;
run;
proc print data=cargo_clean;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees |
|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500.00 | 200 | 2 | 85 | 10000 |
| 2 | R002 | 1500 | 700.00 | 300 | 3 | 78 | 15000 |
| 3 | R003 | 1800 | 800.00 | 350 | 5 | 90 | 20000 |
| 4 | R004 | 2000 | 900.00 | 400 | 7 | 95 | 25000 |
| 5 | R005 | 2200 | 666.67 | 450 | 8 | 92 | 27000 |
| 6 | R006 | 2500 | 1000.00 | 500 | 10 | 88 | 30000 |
| 7 | R007 | 2000 | 1100.00 | 550 | 12 | 80 | 32000 |
| 8 | R008 | 3000 | 1200.00 | 600 | 15 | 75 | 35000 |
| 9 | R009 | 3200 | 1300.00 | 650 | 18 | 100 | 38000 |
| 10 | R010 | 3500 | 1400.00 | 700 | 20 | 65 | 40000 |
| 11 | R011 | 3700 | 1500.00 | 750 | 22 | 60 | 42000 |
| 12 | R012 | 4000 | 1600.00 | 800 | 25 | 55 | 45000 |
| 13 | R001 | . | . | . | . | . | . |
| 14 | R002 | . | . | . | . | . | . |
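The FORCE option lets PROC APPEND accept rows whose variables differ from the base table: Region is dropped (it is not in cargo_clean) and every base variable absent from extra is filled with missing values, which is why observations 13 and 14 are mostly "." above. A Python sketch of that behavior (illustrative names, one base column kept small):

```python
base_cols = ["Route_ID", "Distance_nm"]
base = [{"Route_ID": "R001", "Distance_nm": 1200}]
extra = [{"Route_ID": "R001", "Region": "Asia"}]

# FORCE-style append: keep only the base table's columns,
# filling variables absent from the appended rows with None (SAS '.').
for row in extra:
    base.append({col: row.get(col) for col in base_cols})
```

Only Route_ID survives the append; Distance_nm is None and Region is silently discarded, mirroring the SAS result.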
17. PROC DATASETS
proc datasets library=work;
delete cargo_error;
run;
quit;
LOG:
NOTE: Deleting WORK.CARGO_ERROR (memtype=DATA).
18. Final Clean Dataset Code
data final_cargo;
set cargo_raw;
if Fuel_Consumption=. then Fuel_Consumption=600;
if Route_Efficiency>100 then Route_Efficiency=100;
Fuel_Efficiency = Distance_nm / Fuel_Consumption;
Total_Cost = Fees + (Fuel_Consumption * 10);
run;
proc print data=final_cargo;
run;
OUTPUT:
| Obs | Route_ID | Distance_nm | Fuel_Consumption | Cargo_Tons | Weather_Delay | Route_Efficiency | Fees | Fuel_Efficiency | Total_Cost |
|---|---|---|---|---|---|---|---|---|---|
| 1 | R001 | 1200 | 500 | 200 | 2 | 85 | 10000 | 2.40000 | 15000 |
| 2 | R002 | 1500 | 700 | 300 | 3 | 78 | 15000 | 2.14286 | 22000 |
| 3 | R003 | 1800 | 800 | 350 | 5 | 90 | 20000 | 2.25000 | 28000 |
| 4 | R004 | 2000 | 900 | 400 | 7 | 95 | 25000 | 2.22222 | 34000 |
| 5 | R005 | 2200 | 950 | 450 | 8 | 92 | 27000 | 2.31579 | 36500 |
| 6 | R006 | 2500 | 1000 | 500 | 10 | 88 | 30000 | 2.50000 | 40000 |
| 7 | R007 | 2700 | 1100 | 550 | 12 | 80 | 32000 | 2.45455 | 43000 |
| 8 | R008 | 3000 | 1200 | 600 | 15 | 75 | 35000 | 2.50000 | 47000 |
| 9 | R009 | 3200 | 1300 | 650 | 18 | 70 | 38000 | 2.46154 | 51000 |
| 10 | R010 | 3500 | 1400 | 700 | 20 | 65 | 40000 | 2.50000 | 54000 |
| 11 | R011 | 3700 | 1500 | 750 | 22 | 60 | 42000 | 2.46667 | 57000 |
| 12 | R012 | 4000 | 1600 | 800 | 25 | 55 | 45000 | 2.50000 | 61000 |
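The Total_Cost formula is fees plus fuel consumption priced at 10 per unit (the constant used in the DATA step above). A one-function Python sketch of the same arithmetic (the function name and the `fuel_price` parameter are illustrative):

```python
def total_cost(fees, fuel_consumption, fuel_price=10):
    """Total route cost: fixed fees plus fuel priced per unit consumed."""
    return fees + fuel_consumption * fuel_price

# R001: 10000 + 500 * 10  and  R012: 45000 + 1600 * 10
costs = [total_cost(10000, 500), total_cost(45000, 1600)]
```

These reproduce the first and last Total_Cost values in the table (15000 and 61000).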
19. Key Insights
· High-distance routes → higher fuel usage
· Weather delays impact efficiency
· Incorrect data leads to wrong cost estimation
· Cleaning improves decision accuracy
20. 20 Key Points About The Project
1. The project simulates a global cargo shipping dataset representing real-world logistics operations across multiple routes.
2. It includes critical variables such as Distance, Fuel Consumption, Cargo Tons, Weather Delay, Route Efficiency, and Fees.
3. The dataset is created in both SAS and R with identical structures, ensuring cross-platform consistency.
4. Intentional data errors are introduced to mimic real-world data quality issues like missing values and invalid entries.
5. Advanced SAS DATA step techniques are used to identify missing and incorrect values efficiently.
6. Errors such as invalid numeric values and unrealistic efficiency percentages (>100%) are detected and corrected.
7. The project demonstrates proper use of SAS statements including DATA, SET, IF, MERGE, and PROC SORT.
8. Incorrect function usage (such as AVG() in the DATA step) is identified and corrected using appropriate SAS functions like MEAN().
9. Data type mismatches are handled using functions like PUT(), INPUT(), and NOTDIGIT().
10. Missing values are imputed using logical replacements and statistical averages.
11. Derived variables such as Fuel Efficiency and Delay Impact are created to enhance analytical insights.
12. PROC SORT is used to organize routes based on efficiency, aiding decision-making.
13. PROC SQL is leveraged for aggregation, filtering, and advanced querying of shipping data.
14. PROC MEANS and PROC FREQ are used for statistical summaries and frequency analysis.
15. PROC TRANSPOSE reshapes data for reporting and visualization flexibility.
16. SAS macros are implemented to automate repetitive processes, improving efficiency and scalability.
17. Data merging techniques integrate additional route-level information such as regions.
18. Visualization using PROC SGPLOT helps identify relationships between distance and fuel consumption.
19. The cleaned dataset enables better cost optimization by analyzing fuel usage and route delays.
20. Overall, the project showcases how SAS programming can transform raw, error-prone data into high-quality, decision-ready datasets for logistics optimization.
Project Insights
This project demonstrates:
- Data Cleaning
- Error Handling
- Data Transformation
- Business Insight Generation
- Automation using Macros
21. Summary
This
project demonstrates how Advanced SAS Programming can effectively handle
real-world cargo shipping data challenges. A raw dataset was created containing
route-level shipping metrics such as distance, fuel consumption, and
efficiency. Intentional errors like missing values, invalid data types, and
unrealistic values were introduced to simulate real-world data quality issues.
Using core SAS statements, these errors were identified and corrected
systematically. Various procedures such as PROC SORT, PROC SQL, PROC MEANS, and
PROC TRANSPOSE were applied to analyze and restructure the data. Derived
variables such as fuel efficiency and delay impact were calculated to provide
deeper business insights. Macros were used to automate repetitive tasks,
improving efficiency. Overall, the project highlights how structured SAS
programming enhances data quality, supports accurate analytics, and enables
better decision-making in global shipping operations.
22. Conclusion
In the
shipping industry, data accuracy is critical for operational success and cost
optimization. This project illustrates how Advanced SAS techniques can detect
inconsistencies, correct errors, and enhance data usability. By leveraging
SAS statements, we built a robust workflow that transforms raw, error-prone
data into a clean and analyzable dataset. The inclusion of intentional errors
helped demonstrate practical debugging strategies that are highly relevant in
real-world scenarios. Furthermore, advanced procedures and macros enabled
scalable and efficient data processing. The final dataset provided meaningful
insights into route efficiency, fuel consumption, and delay impacts, allowing
stakeholders to make informed decisions. This project not only strengthens SAS
programming skills but also showcases its importance in solving complex
business problems in logistics and global shipping.
INTERVIEW QUESTIONS FOR YOU
1. The RETAIN Statement
Question:
What does the RETAIN statement do, and
how is it different from a regular assignment?
Short
Answer: By default, SAS resets variables to missing at the start of each
iteration of the DATA step. The RETAIN statement tells SAS
to hold the value of a variable from the previous observation. I use it most
often for calculating cumulative
totals (running sums) or carrying a non-missing value forward across
multiple rows.
2. FIRST. and LAST. Variables
Question:
How do FIRST.variable
and LAST.variable
work, and what is required to use them?
Short
Answer: These are temporary 'flag' variables created by SAS during BY-group
processing in a DATA step. FIRST.var is 1 for the first record in a group, and
LAST.var is 1 for the final record. To use them, the data must be sorted by
that variable first. They are essential for identifying duplicates or
performing calculations within specific groups.
3. DROP vs. KEEP (Statement vs. Dataset Option)
Question:
What is the difference between using DROP as a statement
versus a dataset option?
Short
Answer: A DROP statement
applies to the entire DATA step and removes variables from the final output
dataset. A dataset option
(e.g., (DROP=var)) is more efficient because it can be
used on the SET
statement to prevent a variable from even entering the Program Data Vector
(PDV), saving memory and processing time.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
About the Author:
SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.
Disclaimer:
The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent real cargo shipping data.
Our Mission:
This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.
This project is suitable for:
· Students learning SAS
· Data analysts building portfolios
· Professionals preparing for SAS interviews
· Bloggers, exam reviewers, and observers writing about analytics
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------