437. Can Advanced SAS Programming Detect, Correct, and Optimize Global Cargo Shipping Route Data While Improving Efficiency and Reducing Costs?

Data-Driven Shipping Excellence: Detecting Errors and Boosting Route Performance with SAS

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

IN THIS PROJECT WE USED THESE SAS STATEMENTS AND FEATURES — DATA | SET | MERGE | INPUT | DATALINES | IF | BY | OUTPUT | PROC SORT | PROC APPEND | PROC TRANSPOSE | PROC DATASETS | RUN | %MACRO | %MEND | FUNCTIONS | R LANGUAGE

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Table of Contents

1. Introduction

2. Business Context

3. Raw Dataset (SAS)

4. Raw Dataset in R

5. Intentional Error Dataset

6. Error Correction Code

7. Derived Variables

8. Sorting

9. Merge Example

10. PROC SQL

11. PROC TRANSPOSE

12. PROC MEANS

13. PROC FREQ

14. PROC SGPLOT

15. Macro Implementation

16. PROC APPEND

17. PROC DATASETS

18. Final Clean Dataset Code

19. Key Insights

20. 20 Key Points About The Project

21. Summary

22. Conclusion

1. Introduction

Global cargo shipping plays a critical role in enabling international trade, connecting continents, and supporting the global economy. Every day, thousands of vessels transport goods such as oil, food products, machinery, and raw materials across vast ocean routes. Managing these operations efficiently requires accurate and well-structured data related to shipping routes, fuel consumption, cargo capacity, weather conditions, and operational costs.

However, in real-world scenarios, shipping data is often prone to inconsistencies such as missing values, incorrect entries, unrealistic efficiency percentages, and data type mismatches. These issues can lead to poor decision-making, increased fuel costs, delays in delivery, and reduced operational efficiency. Therefore, ensuring high-quality data is essential for logistics optimization and business success.

This project focuses on creating a simulated global cargo shipping dataset and applying Advanced SAS Programming techniques to handle such challenges. Using SAS statements, we will demonstrate how to identify data errors, clean and transform datasets, and derive meaningful insights such as route efficiency and cost optimization. Additionally, intentional errors are introduced to replicate real-world data issues, allowing us to showcase practical debugging and correction strategies.

By the end of this project, we will have a clean, optimized dataset and a deeper understanding of how SAS can be used as a powerful tool for data validation, transformation, and analytics in the shipping and logistics domain.

2. Business Context

Shipping companies rely on accurate route data to:

  • Optimize fuel consumption
  • Minimize weather delays
  • Improve route efficiency
  • Reduce operational costs

Poor data quality leads to:

  • Wrong decisions
  • Financial losses
  • Inefficient logistics

3. Raw Dataset (SAS) 

data cargo_raw;

input Route_ID $ Distance_nm Fuel_Consumption Cargo_Tons Weather_Delay 

      Route_Efficiency Fees;

datalines;

R001 1200 500 200 2 85 10000

R002 1500 700 300 3 78 15000

R003 1800 800 350 5 90 20000

R004 2000 900 400 7 95 25000

R005 2200 950 450 8 92 27000

R006 2500 1000 500 10 88 30000

R007 2700 1100 550 12 80 32000

R008 3000 1200 600 15 75 35000

R009 3200 1300 650 18 70 38000

R010 3500 1400 700 20 65 40000

R011 3700 1500 750 22 60 42000

R012 4000 1600 800 25 55 45000

;

run;

proc print data=cargo_raw;

run;

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees
  1  R001      1200         500               200          2             85                10000
  2  R002      1500         700               300          3             78                15000
  3  R003      1800         800               350          5             90                20000
  4  R004      2000         900               400          7             95                25000
  5  R005      2200         950               450          8             92                27000
  6  R006      2500         1000              500         10             88                30000
  7  R007      2700         1100              550         12             80                32000
  8  R008      3000         1200              600         15             75                35000
  9  R009      3200         1300              650         18             70                38000
 10  R010      3500         1400              700         20             65                40000
 11  R011      3700         1500              750         22             60                42000
 12  R012      4000         1600              800         25             55                45000

Explanation (SAS)

·  DATA cargo_raw; → Creates dataset

·  INPUT → Defines structure (same as R columns)

·  DATALINES → Inline raw data entry

·  $ → Character variable (Route_ID)

·  Remaining variables → Numeric

Why used?

·  Fundamental dataset creation in SAS

4. Raw Dataset in R

cargo_raw <- data.frame(

  Route_ID = c("R001","R002","R003","R004","R005","R006",

               "R007","R008","R009","R010","R011","R012"),

  Distance_nm = c(1200,1500,1800,2000,2200,2500,

                  2700,3000,3200,3500,3700,4000),

  Fuel_Consumption = c(500,700,800,900,950,1000,

                       1100,1200,1300,1400,1500,1600),

  Cargo_Tons = c(200,300,350,400,450,500,

                 550,600,650,700,750,800),

  Weather_Delay = c(2,3,5,7,8,10,

                    12,15,18,20,22,25),

  Route_Efficiency = c(85,78,90,95,92,88,

                       80,75,70,65,60,55),

  Fees = c(10000,15000,20000,25000,27000,30000,

           32000,35000,38000,40000,42000,45000)

)

OUTPUT:

 

   Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees
1      R001         1200               500         200              2                85  10000
2      R002         1500               700         300              3                78  15000
3      R003         1800               800         350              5                90  20000
4      R004         2000               900         400              7                95  25000
5      R005         2200               950         450              8                92  27000
6      R006         2500              1000         500             10                88  30000
7      R007         2700              1100         550             12                80  32000
8      R008         3000              1200         600             15                75  35000
9      R009         3200              1300         650             18                70  38000
10     R010         3500              1400         700             20                65  40000
11     R011         3700              1500         750             22                60  42000
12     R012         4000              1600         800             25                55  45000

Explanation (R)

·  data.frame() → Creates tabular dataset

·  Each column matches SAS exactly

·  Order of observations is identical

·  Data types:

     ·  Character → Route_ID

     ·  Numeric → all others


5. Intentional Error Dataset

data cargo_error;

set cargo_raw;

if Route_ID='R005' then Fuel_Consumption=.;

if Route_ID='R007' then Distance_nm='ABC';

if Route_ID='R009' then Route_Efficiency=150;

run;

proc print data=cargo_error;

run;

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees
  1  R001      1200         500               200          2             85                10000
  2  R002      1500         700               300          3             78                15000
  3  R003      1800         800               350          5             90                20000
  4  R004      2000         900               400          7             95                25000
  5  R005      2200         .                 450          8             92                27000
  6  R006      2500         1000              500         10             88                30000
  7  R007      .            1100              550         12             80                32000
  8  R008      3000         1200              600         15             75                35000
  9  R009      3200         1300              650         18            150                38000
 10  R010      3500         1400              700         20             65                40000
 11  R011      3700         1500              750         22             60                42000
 12  R012      4000         1600              800         25             55                45000

Errors Explanation

1. Missing Fuel Consumption (R005)

  • "." indicates a missing numeric value

2. Character in a Numeric Field (R007)

  • Assigning 'ABC' to the numeric variable Distance_nm triggers a character-to-numeric conversion note in the log, and the value is stored as missing

3. Invalid Efficiency (R009)

  • Route_Efficiency = 150 exceeds the logical maximum of 100%

6. Error Correction Code

data cargo_clean;

set cargo_error;

if Fuel_Consumption=. then Fuel_Consumption=avg(500,700,800);

if notdigit(Distance_nm)>0 then Distance_nm=2000;

if Route_Efficiency>100 then Route_Efficiency=100;

run;

proc print data=cargo_clean;

run; 

LOG:

ERROR 68-185: The function AVG is unknown, or cannot be accessed.


data cargo_clean;

set cargo_error;

/* Fix missing fuel consumption */

if Fuel_Consumption = . then Fuel_Consumption = mean(500,700,800);

/* Fix invalid numeric values */

if notdigit(strip(put(Distance_nm, best.))) > 0 then Distance_nm = 2000;

/* Fix efficiency > 100 */

if Route_Efficiency > 100 then Route_Efficiency = 100;

run;

proc print data=cargo_clean;

run; 

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees
  1  R001      1200          500.00           200          2             85                10000
  2  R002      1500          700.00           300          3             78                15000
  3  R003      1800          800.00           350          5             90                20000
  4  R004      2000          900.00           400          7             95                25000
  5  R005      2200          666.67           450          8             92                27000
  6  R006      2500         1000.00           500         10             88                30000
  7  R007      2000         1100.00           550         12             80                32000
  8  R008      3000         1200.00           600         15             75                35000
  9  R009      3200         1300.00           650         18            100                38000
 10  R010      3500         1400.00           700         20             65                40000
 11  R011      3700         1500.00           750         22             60                42000
 12  R012      4000         1600.00           800         25             55                45000

Explanation

1. Fix for AVG Error

Wrong:

avg(500,700,800)

Correct:

mean(500,700,800)

Why?

  • MEAN() is a DATA step function
  • It calculates average of values
  • Automatically ignores missing values
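For example, a one-line check (a minimal sketch; the values are illustrative) confirms that MEAN() skips missing arguments:

data _null_;
   x = mean(10, ., 20);   /* the missing argument is ignored */
   put x=;                /* writes x=15 to the log */
run;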

2. NOTDIGIT Issue Fix

Original:

notdigit(Distance_nm)

Problem:

  • NOTDIGIT() expects a character argument
  • Distance_nm is numeric, so SAS performs an implicit numeric-to-character conversion (noted in the log), which can give misleading results

Fixed:

notdigit(strip(put(Distance_nm, best.)))

 Why?

  • PUT() → converts numeric → character
  • STRIP() → removes spaces
  • NOTDIGIT() → checks invalid characters

3. Efficiency Correction

if Route_Efficiency > 100 then Route_Efficiency = 100;

Why?

  • Efficiency cannot exceed 100%
  • Ensures realistic business rule

Key Learnings

  • AVG() → Not valid in DATA step
  • MEAN() → Correct function
  • NOTDIGIT() → Only for character variables
  • Always validate data types before applying functions
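The type-conversion pattern used above can be sketched on its own (variable names are illustrative):

data _null_;
   num = 2700;
   char_val = strip(put(num, best.));   /* numeric -> character */
   back = input(char_val, best12.);     /* character -> numeric */
   put char_val= back=;
run;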

7. Derived Variables

data cargo_derived;

set cargo_clean;

Fuel_Efficiency = Distance_nm / Fuel_Consumption;

Delay_Impact = Weather_Delay * 2;

run;

proc print data=cargo_derived;

run; 

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees  Fuel_Efficiency  Delay_Impact
  1  R001      1200          500.00           200          2             85                10000  2.40000           4
  2  R002      1500          700.00           300          3             78                15000  2.14286           6
  3  R003      1800          800.00           350          5             90                20000  2.25000          10
  4  R004      2000          900.00           400          7             95                25000  2.22222          14
  5  R005      2200          666.67           450          8             92                27000  3.30000          16
  6  R006      2500         1000.00           500         10             88                30000  2.50000          20
  7  R007      2000         1100.00           550         12             80                32000  1.81818          24
  8  R008      3000         1200.00           600         15             75                35000  2.50000          30
  9  R009      3200         1300.00           650         18            100                38000  2.46154          36
 10  R010      3500         1400.00           700         20             65                40000  2.50000          40
 11  R011      3700         1500.00           750         22             60                42000  2.46667          44
 12  R012      4000         1600.00           800         25             55                45000  2.50000          50

Why?

·  Fuel_Efficiency (nautical miles per unit of fuel) and Delay_Impact quantify route economy and the cost of weather delays, supporting business analysis

8. Sorting

proc sort data=cargo_derived;

by descending Route_Efficiency;

run;


proc print data=cargo_derived;

run; 

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees  Fuel_Efficiency  Delay_Impact
  1  R009      3200         1300.00           650         18            100                38000  2.46154          36
  2  R004      2000          900.00           400          7             95                25000  2.22222          14
  3  R005      2200          666.67           450          8             92                27000  3.30000          16
  4  R003      1800          800.00           350          5             90                20000  2.25000          10
  5  R006      2500         1000.00           500         10             88                30000  2.50000          20
  6  R001      1200          500.00           200          2             85                10000  2.40000           4
  7  R007      2000         1100.00           550         12             80                32000  1.81818          24
  8  R002      1500          700.00           300          3             78                15000  2.14286           6
  9  R008      3000         1200.00           600         15             75                35000  2.50000          30
 10  R010      3500         1400.00           700         20             65                40000  2.50000          40
 11  R011      3700         1500.00           750         22             60                42000  2.46667          44
 12  R012      4000         1600.00           800         25             55                45000  2.50000          50

Key Points

DATA Step

·  Creates dataset

·  Used for transformation

SET

·  Reads existing dataset

IF

·  Conditional logic

PROC SORT

·  Orders data

PROC SQL

·  Advanced querying

MACROS

·  Reusability

9. Merge Example

data extra;

input Route_ID $ Region $;

datalines;

R001 Asia

R002 Europe

;

run;

proc print data=extra;

run; 

OUTPUT:

Obs  Route_ID  Region
  1  R001      Asia
  2  R002      Europe

Before merging, both datasets must be sorted by the BY variable (Route_ID). Sorting by any other variable would cause a "BY variables are not properly sorted" error during the MERGE.

proc sort data=cargo_derived;

by Route_ID;

run;

proc sort data=extra;

by Route_ID;

run;

data merged_data;

merge cargo_derived extra;

by Route_ID;

run;

proc print data=merged_data;

run; 

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees  Fuel_Efficiency  Delay_Impact  Region
  1  R001      1200          500.00           200          2             85                10000  2.40000           4            Asia
  2  R002      1500          700.00           300          3             78                15000  2.14286           6            Europe
  3  R003      1800          800.00           350          5             90                20000  2.25000          10
  4  R004      2000          900.00           400          7             95                25000  2.22222          14
  5  R005      2200          666.67           450          8             92                27000  3.30000          16
  6  R006      2500         1000.00           500         10             88                30000  2.50000          20
  7  R007      2000         1100.00           550         12             80                32000  1.81818          24
  8  R008      3000         1200.00           600         15             75                35000  2.50000          30
  9  R009      3200         1300.00           650         18            100                38000  2.46154          36
 10  R010      3500         1400.00           700         20             65                40000  2.50000          40
 11  R011      3700         1500.00           750         22             60                42000  2.46667          44
 12  R012      4000         1600.00           800         25             55                45000  2.50000          50

Region is populated only for R001 and R002, because the extra dataset contains just those two routes.

10. PROC SQL

proc sql;

select Route_ID, avg(Fuel_Consumption) as AvgFuel

from cargo_clean

group by Route_ID;

quit;

OUTPUT:

Route_ID  AvgFuel
R001      500
R002      700
R003      800
R004      900
R005      666.6667
R006      1000
R007      1100
R008      1200
R009      1300
R010      1400
R011      1500
R012      1600
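Because each Route_ID appears only once, the group-wise average simply returns each row's own value. A more informative query (a sketch using the same dataset) aggregates across all routes and flags low-efficiency ones:

proc sql;
   select mean(Fuel_Consumption) as Overall_AvgFuel,
          mean(Route_Efficiency) as Overall_AvgEfficiency
   from cargo_clean;

   select Route_ID, Route_Efficiency
   from cargo_clean
   where Route_Efficiency < 70;
quit;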

11. PROC TRANSPOSE

proc transpose data=cargo_clean out=transposed;

var Distance_nm Fuel_Consumption;

run;

proc print data=transposed;

run; 

OUTPUT:

Obs  _NAME_            COL1  COL2  COL3  COL4  COL5     COL6  COL7  COL8  COL9  COL10  COL11  COL12
  1  Distance_nm       1200  1500  1800  2000  2200.00  2500  2000  3000  3200  3500   3700   4000
  2  Fuel_Consumption   500   700   800   900   666.67  1000  1100  1200  1300  1400   1500   1600

12. PROC MEANS

proc means data=cargo_clean;

var Distance_nm Fuel_Consumption Cargo_Tons;

run;

OUTPUT:

The MEANS Procedure

Variable           N  Mean         Std Dev      Minimum      Maximum
Distance_nm       12  2550.00      913.0368906  1200.00      4000.00
Fuel_Consumption  12  1055.56      353.4343290  500.0000000  1600.00
Cargo_Tons        12  520.8333333  187.6388375  200.0000000  800.0000000

13. PROC FREQ

proc freq data=cargo_clean;

tables Weather_Delay;

run;

OUTPUT:

The FREQ Procedure

Weather_Delay  Frequency  Percent  Cumulative Frequency  Cumulative Percent
 2             1          8.33      1                      8.33
 3             1          8.33      2                     16.67
 5             1          8.33      3                     25.00
 7             1          8.33      4                     33.33
 8             1          8.33      5                     41.67
10             1          8.33      6                     50.00
12             1          8.33      7                     58.33
15             1          8.33      8                     66.67
18             1          8.33      9                     75.00
20             1          8.33     10                     83.33
22             1          8.33     11                     91.67
25             1          8.33     12                    100.00

14. PROC SGPLOT

proc sgplot data=cargo_clean;

scatter x=Distance_nm y=Fuel_Consumption;

run;

OUTPUT:

The SGPLOT Procedure (scatter plot of Fuel_Consumption versus Distance_nm; the rendered graph is not reproduced here)

15. Macro Implementation

%macro summary;

proc means data=cargo_clean;

run;

%mend;


%summary;

OUTPUT:

The MEANS Procedure

Variable           N  Mean         Std Dev      Minimum      Maximum
Distance_nm       12  2550.00      913.0368906  1200.00      4000.00
Fuel_Consumption  12  1055.56      353.4343290  500.0000000  1600.00
Cargo_Tons        12  520.8333333  187.6388375  200.0000000  800.0000000
Weather_Delay     12  12.2500000   7.7003542    2.0000000    25.0000000
Route_Efficiency  12  80.2500000   14.2581715   55.0000000   100.0000000
Fees              12  29916.67     10958.25     10000.00     45000.00
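A parameterized variant (macro and parameter names are illustrative) shows how macros make such summaries reusable across datasets and variable lists:

%macro summary_by(ds=, vars=);
   proc means data=&ds n mean min max;
      var &vars;
   run;
%mend summary_by;

%summary_by(ds=cargo_clean, vars=Distance_nm Fuel_Consumption);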

16. PROC APPEND

proc append base=cargo_clean 

            data=extra Force;

run;

proc print data=cargo_clean;

run; 

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees
  1  R001      1200          500.00           200          2             85                10000
  2  R002      1500          700.00           300          3             78                15000
  3  R003      1800          800.00           350          5             90                20000
  4  R004      2000          900.00           400          7             95                25000
  5  R005      2200          666.67           450          8             92                27000
  6  R006      2500         1000.00           500         10             88                30000
  7  R007      2000         1100.00           550         12             80                32000
  8  R008      3000         1200.00           600         15             75                35000
  9  R009      3200         1300.00           650         18            100                38000
 10  R010      3500         1400.00           700         20             65                40000
 11  R011      3700         1500.00           750         22             60                42000
 12  R012      4000         1600.00           800         25             55                45000
 13  R001      .             .                .            .              .                    .
 14  R002      .             .                .            .              .                    .

The FORCE option lets the append proceed even though the two datasets have different structures: Region (absent from the base dataset) is dropped, and the numeric variables are missing for the two appended rows.

17. PROC DATASETS

proc datasets library=work;

delete cargo_error;

run;

quit;

LOG:

NOTE: Deleting WORK.CARGO_ERROR (memtype=DATA).

18. Final Clean Dataset Code

data final_cargo;

set cargo_raw;

if Fuel_Consumption=. then Fuel_Consumption=600;

if Route_Efficiency>100 then Route_Efficiency=100;

Fuel_Efficiency = Distance_nm / Fuel_Consumption;

Total_Cost = Fees + (Fuel_Consumption * 10);

run;

proc print data=final_cargo;

run; 

OUTPUT:

Obs  Route_ID  Distance_nm  Fuel_Consumption  Cargo_Tons  Weather_Delay  Route_Efficiency   Fees  Fuel_Efficiency  Total_Cost
  1  R001      1200          500              200          2             85                10000  2.40000          15000
  2  R002      1500          700              300          3             78                15000  2.14286          22000
  3  R003      1800          800              350          5             90                20000  2.25000          28000
  4  R004      2000          900              400          7             95                25000  2.22222          34000
  5  R005      2200          950              450          8             92                27000  2.31579          36500
  6  R006      2500         1000              500         10             88                30000  2.50000          40000
  7  R007      2700         1100              550         12             80                32000  2.45455          43000
  8  R008      3000         1200              600         15             75                35000  2.50000          47000
  9  R009      3200         1300              650         18             70                38000  2.46154          51000
 10  R010      3500         1400              700         20             65                40000  2.50000          54000
 11  R011      3700         1500              750         22             60                42000  2.46667          57000
 12  R012      4000         1600              800         25             55                45000  2.50000          61000

19. Key Insights

·  High distance routes → higher fuel usage

·  Weather delays impact efficiency

·  Incorrect data leads to wrong cost estimation

·  Cleaning improves decision accuracy

20. 20 Key Points About The Project
  1. The project simulates a global cargo shipping dataset representing real-world logistics operations across multiple routes.
  2. It includes critical variables such as Distance, Fuel Consumption, Cargo Tons, Weather Delay, Route Efficiency, and Fees.
  3. The dataset is created in both SAS and R with identical structures, ensuring cross-platform consistency.
  4. Intentional data errors are introduced to mimic real-world data quality issues like missing values and invalid entries.
  5. Advanced SAS DATA step techniques are used to identify missing and incorrect values efficiently.
  6. Errors such as invalid numeric values and unrealistic efficiency percentages (>100%) are detected and corrected.
  7. The project demonstrates proper use of foundational SAS Programming 1 (PROG1) statements, including DATA, SET, IF, MERGE, and PROC SORT.
  8. Incorrect function usage (like AVG() in DATA step) is identified and corrected using appropriate SAS functions like MEAN().
  9. Data type mismatches are handled using functions like PUT(), INPUT(), and NOTDIGIT().
  10. Missing values are imputed using logical replacements and statistical averages.
  11. Derived variables such as Fuel Efficiency and Delay Impact are created to enhance analytical insights.
  12. PROC SORT is used to organize routes based on efficiency, aiding decision-making.
  13. PROC SQL is leveraged for aggregation, filtering, and advanced querying of shipping data.
  14. PROC MEANS and PROC FREQ are used for statistical summaries and frequency analysis.
  15. PROC TRANSPOSE reshapes data for reporting and visualization flexibility.
  16. SAS Macros are implemented to automate repetitive processes, improving efficiency and scalability.
  17. Data merging techniques are used to integrate additional route-level information like regions.
  18. Visualization using PROC SGPLOT helps identify relationships between distance and fuel consumption.
  19. The cleaned dataset enables better cost optimization by analyzing fuel usage and route delays.
  20. Overall, the project showcases how SAS programming can transform raw, error-prone data into high-quality, decision-ready datasets for logistics optimization.

Project Insights

This project demonstrates:

  • Data Cleaning
  • Error Handling
  • Data Transformation
  • Business Insight Generation
  • Automation using Macros

21. Summary

This project demonstrates how Advanced SAS Programming can effectively handle real-world cargo shipping data challenges. A raw dataset was created containing route-level shipping metrics such as distance, fuel consumption, and efficiency. Intentional errors like missing values, invalid data types, and unrealistic values were introduced to simulate real-world data quality issues. Using foundational SAS Programming 1 (PROG1) statements, these errors were identified and corrected systematically. Various procedures such as PROC SORT, PROC SQL, PROC MEANS, and PROC TRANSPOSE were applied to analyze and restructure the data. Derived variables such as fuel efficiency and delay impact were calculated to provide deeper business insights. Macros were used to automate repetitive tasks, improving efficiency. Overall, the project highlights how structured SAS programming enhances data quality, supports accurate analytics, and enables better decision-making in global shipping operations.

22. Conclusion

In the shipping industry, data accuracy is critical for operational success and cost optimization. This project illustrates how Advanced SAS techniques can detect inconsistencies, correct errors, and enhance data usability. By leveraging SAS statements, we built a robust workflow that transforms raw, error-prone data into a clean and analyzable dataset. The inclusion of intentional errors helped demonstrate practical debugging strategies that are highly relevant in real-world scenarios. Furthermore, advanced procedures and macros enabled scalable and efficient data processing. The final dataset provided meaningful insights into route efficiency, fuel consumption, and delay impacts, allowing stakeholders to make informed decisions. This project not only strengthens SAS programming skills but also showcases its importance in solving complex business problems in logistics and global shipping.


INTERVIEW QUESTIONS FOR YOU

1. The RETAIN Statement

Question: What does the RETAIN statement do, and how is it different from a regular assignment?

Short Answer: By default, SAS resets variables to missing at the start of each iteration of the DATA step. The RETAIN statement tells SAS to hold the value of a variable from the previous observation. I use it most often for calculating cumulative totals (running sums) or carrying a non-missing value forward across multiple rows.
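A minimal sketch (the cumulative variable name is illustrative) using the project's dataset:

data running_total;
   set cargo_raw;
   retain Cum_Fees 0;           /* hold the value across iterations */
   Cum_Fees = Cum_Fees + Fees;  /* running total of Fees */
run;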

2. FIRST. and LAST. Variables

Question: How do FIRST.variable and LAST.variable work, and what is required to use them?

Short Answer: These are temporary 'flag' variables created by SAS during BY-group processing in a DATA step. FIRST.var is 1 for the first record in a group, and LAST.var is 1 for the final record. To use them, the data must first be sorted by that variable. They are essential for identifying duplicates or performing calculations within specific groups.
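A minimal sketch (dataset names are illustrative) that keeps one record per route:

proc sort data=cargo_raw out=sorted_routes;
   by Route_ID;
run;

data first_per_route;
   set sorted_routes;
   by Route_ID;
   if first.Route_ID;   /* keep only the first record in each BY group */
run;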

3. DROP vs. KEEP (Statement vs. Dataset Option)

Question: What is the difference between using DROP as a statement versus a dataset option?

Short Answer: A DROP statement applies to the entire DATA step and removes variables from the final output dataset. A dataset option (e.g., (DROP=var)) is more efficient because it can be used on the SET statement to prevent a variable from even entering the Program Data Vector (PDV), saving memory and processing time.
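A minimal sketch (dataset names are illustrative) contrasting the two forms:

data slim;
   set cargo_raw(drop=Fees);   /* dataset option: Fees never enters the PDV */
   drop Weather_Delay;         /* statement: dropped from the output only   */
run;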

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

About the Author:

SAS Learning Hub is a data analytics and SAS programming platform focused on clinical, financial, and real-world data analysis. The content is created by professionals with academic training in Pharmaceutics and hands-on experience in Base SAS, PROC SQL, Macros, SDTM, and ADaM, providing practical and industry-relevant SAS learning resources.


Disclaimer:

The datasets and analysis in this article are created for educational and demonstration purposes only. They do not represent real cargo shipping data.


Our Mission:

This blog provides industry-focused SAS programming tutorials and analytics projects covering finance, healthcare, and technology.


This project is suitable for:

·  Students learning SAS

·  Data analysts building portfolios

·  Professionals preparing for SAS interviews

·  Bloggers writing about analytics, exam reviewers, and observers


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--->Follow our blog for more SAS-based analytics projects and industry data models.

---> Support Us By Following Our Blog..

To deepen your understanding of SAS analytics, please refer to our other data science and industry-focused projects listed below:

1.Can we analyze planets using SAS like NASA scientists?

2.Can SAS Compare How 3 Different Families Live in Their Homes?

3.Can SAS Help Us Choose the Most Efficient Cooler?

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------





