SPORTS PLAYER PERFORMANCE DATA CREATION AND ANALYSIS USING PROC SQL | PROC MEANS | PROC UNIVARIATE | PROC SGPLOT | PROC SGPANEL WITH MACRO-BASED WIN RATE AND DATE DERIVATIONS USING INTCK AND INTNX
options nocenter;
STEP 1: CREATE A RAW SPORTS DATASET USING PROC SQL
proc sql;
create table work.sports_raw as
select
"P001" as Player_ID length=5,
"India" as Country length=20,
85 as Matches_Played,
55 as Wins,
3150 as Ranking_Points,
"Grand Slam" as Tournament_Type length=20,
'15JAN2015'd as Debut_Date,
'20OCT2024'd as Last_Played_Date
from sashelp.class(obs=1)
union all
select
"P002", "Australia", 72, 48, 2980, "Grand Slam",
'10MAR2016'd, '05SEP2024'd
from sashelp.class(obs=1)
union all
select
"P003", "Spain", 102, 79, 4020, "Masters",
'01MAY2013'd, '30NOV2024'd
from sashelp.class(obs=1)
union all
select
"P004", "USA", 60, 36, 2500, "International",
'25AUG2017'd, '18JUL2024'd
from sashelp.class(obs=1)
union all
select
"P005", "France", 95, 63, 3300, "Masters",
'12DEC2014'd, '09SEP2024'd
from sashelp.class(obs=1)
union all
select
"P006", "Germany", 54, 29, 2100, "International",
'03FEB2018'd, '14AUG2024'd
from sashelp.class(obs=1)
union all
select
"P007", "Japan", 77, 50, 2900, "Grand Slam",
'19JUL2016'd, '02NOV2024'd
from sashelp.class(obs=1)
union all
select
"P008", "Brazil", 65, 38, 2400, "International",
'08JAN2017'd, '28OCT2024'd
from sashelp.class(obs=1)
union all
select
"P009", "UK", 110, 81, 4200, "Masters",
'21APR2012'd, '15NOV2024'd
from sashelp.class(obs=1)
union all
select
"P010", "Italy", 58, 31, 2050, "International",
'07JUN2018'd, '01SEP2024'd
from sashelp.class(obs=1)
union all
select
"P011", "Russia", 92, 60, 3200, "Grand Slam",
'29OCT2014'd, '30AUG2024'd
from sashelp.class(obs=1)
union all
select
"P012", "Argentina", 70, 44, 2700, "International",
'11JAN2016'd, '22OCT2024'd
from sashelp.class(obs=1)
union all
select
"P013", "Canada", 49, 27, 1950, "International",
'05SEP2019'd, '05NOV2024'd
from sashelp.class(obs=1)
union all
select
"P014", "China", 88, 62, 3400, "Masters",
'17NOV2015'd, '10OCT2024'd
from sashelp.class(obs=1)
;
quit;
/* Apply date formats */
data work.sports_raw;
set work.sports_raw;
format Debut_Date Last_Played_Date date9.;
run;
proc print data=work.sports_raw;
run;
OUTPUT:
| Obs | Player_ID | Country | Matches_Played | Wins | Ranking_Points | Tournament_Type | Debut_Date | Last_Played_Date |
|---|---|---|---|---|---|---|---|---|
| 1 | P001 | India | 85 | 55 | 3150 | Grand Slam | 15JAN2015 | 20OCT2024 |
| 2 | P002 | Australia | 72 | 48 | 2980 | Grand Slam | 10MAR2016 | 05SEP2024 |
| 3 | P003 | Spain | 102 | 79 | 4020 | Masters | 01MAY2013 | 30NOV2024 |
| 4 | P004 | USA | 60 | 36 | 2500 | International | 25AUG2017 | 18JUL2024 |
| 5 | P005 | France | 95 | 63 | 3300 | Masters | 12DEC2014 | 09SEP2024 |
| 6 | P006 | Germany | 54 | 29 | 2100 | International | 03FEB2018 | 14AUG2024 |
| 7 | P007 | Japan | 77 | 50 | 2900 | Grand Slam | 19JUL2016 | 02NOV2024 |
| 8 | P008 | Brazil | 65 | 38 | 2400 | International | 08JAN2017 | 28OCT2024 |
| 9 | P009 | UK | 110 | 81 | 4200 | Masters | 21APR2012 | 15NOV2024 |
| 10 | P010 | Italy | 58 | 31 | 2050 | International | 07JUN2018 | 01SEP2024 |
| 11 | P011 | Russia | 92 | 60 | 3200 | Grand Slam | 29OCT2014 | 30AUG2024 |
| 12 | P012 | Argentina | 70 | 44 | 2700 | International | 11JAN2016 | 22OCT2024 |
| 13 | P013 | Canada | 49 | 27 | 1950 | International | 05SEP2019 | 05NOV2024 |
| 14 | P014 | China | 88 | 62 | 3400 | Masters | 17NOV2015 | 10OCT2024 |
STEP 2: DEFINE A MACRO TO DERIVE WIN RATE & DATE-BASED METRICS
%macro derive_winrate(indata=, outdata=);
data &outdata.;
set &indata.;
/* Basic derived metrics */
if Matches_Played > 0 then Win_Rate = Wins / Matches_Played;
else Win_Rate = .;
Losses = Matches_Played - Wins;
/* Career duration using INTCK (month level) */
Career_Months = intck('month', Debut_Date, Last_Played_Date, 'c');
Career_Years = Career_Months / 12;
/* Align debut to start of debut season using INTNX */
Season_Start = intnx('year', Debut_Date, 0, 'b'); /* 'b' = beginning */
format Win_Rate percent8.2
Career_Years 8.1
Season_Start date9.;
run;
proc print data=&outdata.;
run;
%mend derive_winrate;
%derive_winrate(indata=work.sports_raw, outdata=work.sports_final);
OUTPUT:
| Obs | Player_ID | Country | Matches_Played | Wins | Ranking_Points | Tournament_Type | Debut_Date | Last_Played_Date | Win_Rate | Losses | Career_Months | Career_Years | Season_Start |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P001 | India | 85 | 55 | 3150 | Grand Slam | 15JAN2015 | 20OCT2024 | 64.71% | 30 | 117 | 9.8 | 01JAN2015 |
| 2 | P002 | Australia | 72 | 48 | 2980 | Grand Slam | 10MAR2016 | 05SEP2024 | 66.67% | 24 | 101 | 8.4 | 01JAN2016 |
| 3 | P003 | Spain | 102 | 79 | 4020 | Masters | 01MAY2013 | 30NOV2024 | 77.45% | 23 | 138 | 11.5 | 01JAN2013 |
| 4 | P004 | USA | 60 | 36 | 2500 | International | 25AUG2017 | 18JUL2024 | 60.00% | 24 | 82 | 6.8 | 01JAN2017 |
| 5 | P005 | France | 95 | 63 | 3300 | Masters | 12DEC2014 | 09SEP2024 | 66.32% | 32 | 116 | 9.7 | 01JAN2014 |
| 6 | P006 | Germany | 54 | 29 | 2100 | International | 03FEB2018 | 14AUG2024 | 53.70% | 25 | 78 | 6.5 | 01JAN2018 |
| 7 | P007 | Japan | 77 | 50 | 2900 | Grand Slam | 19JUL2016 | 02NOV2024 | 64.94% | 27 | 99 | 8.3 | 01JAN2016 |
| 8 | P008 | Brazil | 65 | 38 | 2400 | International | 08JAN2017 | 28OCT2024 | 58.46% | 27 | 93 | 7.8 | 01JAN2017 |
| 9 | P009 | UK | 110 | 81 | 4200 | Masters | 21APR2012 | 15NOV2024 | 73.64% | 29 | 150 | 12.5 | 01JAN2012 |
| 10 | P010 | Italy | 58 | 31 | 2050 | International | 07JUN2018 | 01SEP2024 | 53.45% | 27 | 74 | 6.2 | 01JAN2018 |
| 11 | P011 | Russia | 92 | 60 | 3200 | Grand Slam | 29OCT2014 | 30AUG2024 | 65.22% | 32 | 118 | 9.8 | 01JAN2014 |
| 12 | P012 | Argentina | 70 | 44 | 2700 | International | 11JAN2016 | 22OCT2024 | 62.86% | 26 | 105 | 8.8 | 01JAN2016 |
| 13 | P013 | Canada | 49 | 27 | 1950 | International | 05SEP2019 | 05NOV2024 | 55.10% | 22 | 62 | 5.2 | 01JAN2019 |
| 14 | P014 | China | 88 | 62 | 3400 | Masters | 17NOV2015 | 10OCT2024 | 70.45% | 26 | 106 | 8.8 | 01JAN2015 |
STEP 3: USE PROC SQL FOR SOME BASIC QUERIES & VIEWS
proc sql outobs=5;
/* Top 5 players by Win_Rate */
create table work.top5_winrate as
select Player_ID, Country, Tournament_Type,
Matches_Played, Wins, Win_Rate, Ranking_Points
from work.sports_final
order by Win_Rate desc, Ranking_Points desc;
proc print data=work.top5_winrate;
run;
OUTPUT:
| Obs | Player_ID | Country | Tournament_Type | Matches_Played | Wins | Win_Rate | Ranking_Points |
|---|---|---|---|---|---|---|---|
| 1 | P003 | Spain | Masters | 102 | 79 | 77.45% | 4020 |
| 2 | P009 | UK | Masters | 110 | 81 | 73.64% | 4200 |
| 3 | P014 | China | Masters | 88 | 62 | 70.45% | 3400 |
| 4 | P002 | Australia | Grand Slam | 72 | 48 | 66.67% | 2980 |
| 5 | P005 | France | Masters | 95 | 63 | 66.32% | 3300 |
/* Average ranking points by Tournament_Type */
create table work.avg_points_by_type as
select Tournament_Type,
count(*) as Num_Players,
avg(Ranking_Points) as Avg_Ranking_Points
from work.sports_final
group by Tournament_Type;
quit;
proc print data=work.avg_points_by_type;
run;
OUTPUT:
| Obs | Tournament_Type | Num_Players | Avg_Ranking_Points |
|---|---|---|---|
| 1 | Grand Slam | 4 | 3057.50 |
| 2 | International | 6 | 2283.33 |
| 3 | Masters | 4 | 3730.00 |
STEP 4: SUMMARY STATISTICS USING PROC MEANS
proc means data=work.sports_final
n mean std min max maxdec=2;
class Tournament_Type;
var Matches_Played Wins Win_Rate Ranking_Points Career_Years;
run;
OUTPUT:
The MEANS Procedure
| Tournament_Type | N Obs | Variable | N | Mean | Std Dev | Minimum | Maximum |
|---|---|---|---|---|---|---|---|
| Grand Slam | 4 | Matches_Played Wins Win_Rate Ranking_Points Career_Years | 4 4 4 4 4 | 81.50 53.25 0.65 3057.50 9.06 | 8.81 5.38 0.01 141.04 0.85 | 72.00 48.00 0.65 2900.00 8.25 | 92.00 60.00 0.67 3200.00 9.83 |
| International | 6 | Matches_Played Wins Win_Rate Ranking_Points Career_Years | 6 6 6 6 6 | 59.33 34.17 0.57 2283.33 6.86 | 7.53 6.37 0.04 294.39 1.25 | 49.00 27.00 0.53 1950.00 5.17 | 70.00 44.00 0.63 2700.00 8.75 |
| Masters | 4 | Matches_Played Wins Win_Rate Ranking_Points Career_Years | 4 4 4 4 4 | 98.75 71.25 0.72 3730.00 10.63 | 9.43 10.14 0.05 446.77 1.67 | 88.00 62.00 0.66 3300.00 8.83 | 110.00 81.00 0.77 4200.00 12.50 |
STEP 5: DISTRIBUTION ANALYSIS USING PROC UNIVARIATE
proc univariate data=work.sports_final;
var Ranking_Points;
histogram Ranking_Points / normal;
inset n mean std min max / position=ne;
run;
OUTPUT:
The UNIVARIATE Procedure
Variable: Ranking_Points
| Moments | |||
|---|---|---|---|
| N | 14 | Sum Weights | 14 |
| Mean | 2917.85714 | Sum Observations | 40850 |
| Std Deviation | 691.92239 | Variance | 478756.593 |
| Skewness | 0.35762409 | Kurtosis | -0.4989298 |
| Uncorrected SS | 125418300 | Corrected SS | 6223835.71 |
| Coeff Variation | 23.7133744 | Std Error Mean | 184.924037 |
| Basic Statistical Measures | |||
|---|---|---|---|
| Location | Variability | ||
| Mean | 2917.857 | Std Deviation | 691.92239 |
| Median | 2940.000 | Variance | 478757 |
| Mode | . | Range | 2250 |
| Interquartile Range | 900.00000 | ||
| Tests for Location: Mu0=0 | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Student's t | t | 15.77868 | Pr > |t| | <.0001 |
| Sign | M | 7 | Pr >= |M| | 0.0001 |
| Signed Rank | S | 52.5 | Pr >= |S| | 0.0001 |
| Quantiles (Definition 5) | |
|---|---|
| Level | Quantile |
| 100% Max | 4200 |
| 99% | 4200 |
| 95% | 4200 |
| 90% | 4020 |
| 75% Q3 | 3300 |
| 50% Median | 2940 |
| 25% Q1 | 2400 |
| 10% | 2050 |
| 5% | 1950 |
| 1% | 1950 |
| 0% Min | 1950 |
| Extreme Observations | |||
|---|---|---|---|
| Lowest | Highest | ||
| Value | Obs | Value | Obs |
| 1950 | 13 | 3200 | 11 |
| 2050 | 10 | 3300 | 5 |
| 2100 | 6 | 3400 | 14 |
| 2400 | 8 | 4020 | 3 |
| 2500 | 4 | 4200 | 9 |
The UNIVARIATE Procedure
The UNIVARIATE Procedure
Fitted Normal Distribution for Ranking_Points
| Parameters for Normal Distribution | ||
|---|---|---|
| Parameter | Symbol | Estimate |
| Mean | Mu | 2917.857 |
| Std Dev | Sigma | 691.9224 |
| Goodness-of-Fit Tests for Normal Distribution | ||||
|---|---|---|---|---|
| Test | Statistic | p Value | ||
| Kolmogorov-Smirnov | D | 0.10010171 | Pr > D | >0.150 |
| Cramer-von Mises | W-Sq | 0.02483572 | Pr > W-Sq | >0.250 |
| Anderson-Darling | A-Sq | 0.21887937 | Pr > A-Sq | >0.250 |
| Quantiles for Normal Distribution | ||
|---|---|---|
| Percent | Quantile | |
| Observed | Estimated | |
| 1.0 | 1950.00 | 1308.20 |
| 5.0 | 1950.00 | 1779.75 |
| 10.0 | 2050.00 | 2031.12 |
| 25.0 | 2400.00 | 2451.16 |
| 50.0 | 2940.00 | 2917.86 |
| 75.0 | 3300.00 | 3384.55 |
| 90.0 | 4020.00 | 3804.59 |
| 95.0 | 4200.00 | 4055.97 |
| 99.0 | 4200.00 | 4527.51 |
STEP 6: VISUALIZATION USING PROC SGPLOT
/* Bar chart of total wins by Country */
proc sgplot data=work.sports_final;
vbar Country / response=Wins stat=sum;
xaxis label="Country";
yaxis label="Total Wins";
title "Total Wins by Country";
run;
OUTPUT:
/* Scatter plot of Win_Rate vs. Ranking_Points */
proc sgplot data=work.sports_final;
scatter x=Win_Rate y=Ranking_Points;
xaxis label="Win Rate";
yaxis label="Ranking Points";
title "Relationship Between Win Rate and Ranking Points";
run;
OUTPUT:
STEP 7: BOXPLOTS BY TOURNAMENT TYPE USING PROC SGPANEL
proc sgpanel data=work.sports_final;
panelby Tournament_Type / columns=2;
vbox Win_Rate;
colaxis label="Tournament Type";
rowaxis label="Win Rate";
title "Distribution of Win Rate by Tournament Type";
run;
No comments:
Post a Comment