Regression of NCAA Basketball

Introduction
The problem of interest, or research objective, is to identify what variables contribute to a
team's winning percentage and how coaches and players can increase their chances of winning a
game by improving specific team measures. The primary reason for creating this sport-related
model is not to win bracket challenges or predict betting odds, but to allow coaches
and players to better evaluate and improve their performance. Our study aims to be useful to
coaches and players in their attempt to increase their win percentage. In recent years, statistical
models for sports prediction have become increasingly complex. Many articles have been written
using various statistical methods to predict sport tournament outcomes. One such article from Ed
Feng, a researcher at Sandia National Laboratory, uses an algorithm to rank sports teams and
predict outcomes, similar to how internet search engines like Google rank websites (Feng, 1). As
highly complex predictive analytics models have already been created, we will not attempt to
match these models, but strive to find specific team measures that coaches and players can use to
increase their chances of winning as opposed to simply picking a winning team. Our research
question is what actionable goals coaches and players should focus on to improve their chances
of winning a game. Because of this, a Multiple Linear Regression is a good model to use. MLR
models allow us to interpret coefficients and the general effect different variables are having on a
team's success. For our project we will create a Multiple Linear Regression for NCAA Division I
top 25 teams, as well as BYU, in order to determine what aspects of the game have the most
effect on team success.
Methods
Response Variables and Explanatory Variables
Response Variable: As a measure of team success, number of regular season wins for the
2014-2015 season was selected. This variable excludes tournament play both in conference
tournaments and the NCAA tournament, since each tournament results in one loss for all but the
winning team.
Explanatory Variables: Our rationale for the selection of explanatory variables was to choose
variables that reflect items that coaches have control over. This meant that statistics
directly related to the score (i.e. scoring more points in a game leads to more wins) were
excluded. Each variable was selected because it is tied to a direct actionable goal for the
team's coach. Our explanatory variables were as follows:
All American: A binary variable that has a value of 1 when a team has a former All-American
high school player. Each year 24 high school students are selected as the top players
of their class. Since our sample includes only the top 25 teams, it is assumed that the vast
majority of these teams have the prestige to attract top players. The actionable goal for a team's
coach in this case is the recruitment of top players. Is it worth expending team and school
resources in order to recruit a star player, or is time better spent recruiting a more varied team?
Free Throws: A continuous variable giving the percentage of free throws a team makes over the
course of a season. Practicing free throws is not always easy for college coaches, as it is a very
individual skill. Is it worth the time for a coach to pull below-average free throw shooters out of
team practice, or is time better spent on team activities?
Offensive Rebounds: Offensive rebounds is a measure of the average number of rebounds
a team collects from its own shots over the course of a game. While the actionable point here is
more abstract, coaches often must choose whether to use their taller players as extra shooters or to
place them near the basket to collect rebounds on offense. The main choice a coach must make
here is what strategy to use on offense.
Personal Fouls Per Game: A measure of the average number of fouls a team commits per game.
Fouling is an integral part of a basketball team's defense. Coaching players who can defend a
lay-up or other shot taken near the basket without fouling is difficult. Many teams are content to
simply collect a lot of fouls over the course of the game. A team's fouls per game is very much a
measure of coaching style, because fewer fouls often means giving up more shots. The question for
a coach is whether it is worth giving up extra shots to keep from being penalized as a
team for reaching a certain number of fouls in a given game, or whether it is worth fouling and
accepting these penalties.
Possession: The average number of times a team has control of the ball in a given game.
Because the number of possessions in a game is always equal or near equal (if a team starts and
ends the game with the ball they will have one more possession) for both teams, this is
essentially a measure of the speed at which a team plays. A new style has recently developed in
college basketball where teams attempt to play faster in order to keep their opponent from
setting up their defense. BYU is a classic example of this kind of play. The question for a coach
is whether or not to adopt this style and thereby accept the errors that his team will commit when
playing at this speed.
Blocks: The average number of times per game a defender touches a ball that is heading towards the
basket and keeps his opponent from scoring. Blocking is a difficult skill, and time-consuming to
teach. How important is this skill, and what emphasis should a coach put on it during practice?
Assists: Assists is the average number of times during a game a pass directly leads to a
basket without the scorer repositioning himself after the pass to attempt the shot. This is a
measure of good passing and communication between players. A coach must often decide
between coaching individual skills or team unity.
Data Collection
Data was collected from the website Team Rankings (teamrankings.com), an aggregator
of data for various sports. Because data is aggregated by team, data was entered manually. The
teams measured were those on the postseason AP poll top 25 list
(http://espn.go.com/mens-college-basketball/rankings). The data for the top 10 All-American
players was also collected from ESPN.com.
Data for the variables assist-to-turnover ratio and offensive efficiency were also
collected, but were removed before analysis when it was decided they did not lead to actionable
goals. Data for BYU basketball was also collected, even though BYU was not listed in the AP
poll top 25. While it is outside of the range of most of our data, it had a below-average Cook's
distance, and its inclusion did not have an adverse effect on the model. Coefficients of the
model fit to the collected data are given below.
Coefficients:
                      Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)           1.180133   0.463863     2.544   0.01888
Assist                0.013882   0.008996     1.543   0.13776
Free Throw            0.008717   0.004525     1.926   0.06769
All-American          0.065512   0.035786     1.831   0.08138
Possession           -0.01791    0.004859    -3.686   0.00137
Personal Fouls/Game  -0.006603   0.011659    -0.566   0.5781
Blocks                0.001169   0.010149     0.115   0.9096
Off.Rebounds          0.006456   0.013287     0.486   0.6329

Residual standard error: 0.07862 on 18 degrees of freedom
Multiple R-squared: 0.5194
Adjusted R-squared: 0.3326
F-statistic: 2.779 on 7 and 18 DF, p-value: 0.03805
Significance levels used: alpha = .05 and alpha = .1
Analysis and Results

Analysis Method
As stated above, we used an MLR model that includes one binary variable and six continuous
variables. Our response variable is a continuous variable of a team's win percentage in the
2014-2015 season. Interactions to test were chosen using a pairs plot as well as expertise in the
research area. As our data is correlational and not experimental, a regression model is appropriate,
and since there is no significant interaction between our binary variable and other variables, a
parallel model is appropriate.
Model Selection and Assumptions
Scatter Plot:
The pairs plot of all variables was created to check for strong linear relationships
(collinearity) between variables. R^2 values for all pairs of variables are listed in the bottom
panel of the pairs plot. No strong linear relationship was found. This plot was also used to
choose which interactions to test, since our degrees of freedom are limited.
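As a minimal sketch of the pairwise check described above, the R^2 for one pair of variables can be computed as the squared Pearson correlation. The function name and the two short lists are hypothetical illustrations, not data from this study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical per-team averages (NOT the study's data).
fouls_per_game = [16.0, 17.5, 18.0, 19.5, 21.0]
possessions = [62.0, 65.0, 64.0, 69.0, 71.0]

pair_r2 = r_squared(fouls_per_game, possessions)
print(f"R^2 = {pair_r2:.3f}")
```

Repeating this for every pair of columns gives the values shown in the lower panel of a pairs plot. A pair with a high R^2 is a candidate for a collinearity problem or, as in this study, for an interaction term worth testing.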
Figure: Pairs plot showing each variable plotted against all other variables, including the response variable (Win.Pct)
Heteroskedasticity:
From the plot of Residuals vs. Fitted values, we see that the residuals are distributed in a
reasonably cloud-like manner. This suggests that the variance of the residuals is roughly constant
across the fitted values. This makes sense, as our data was collected from a fairly small and
elite set of teams.
Outliers/Leverage Points:
From the above plot we see that there are no studentized residuals with an absolute value
greater than 3, which implies that there are no outliers in the data. From the graph of Residuals
vs. Leverage, which includes the Cook's distance bands, we see that there are no overly
influential points. Point #1 (Kentucky) is the most influential point, but this makes sense as
they are the highest ranked team, and the only team with a perfect record. To exclude Kentucky
from the data would be to exclude the most exceptional team, and as that is the kind of
performance we are hoping to promote, it seems intuitive that Kentucky is the most influential point.
Normality:
From the Q-Q Plot we see that the errors of the data are essentially normally distributed. While
the number of data points is not large, we do not find any non-normality significant enough to
transform Y.
Global F-test:
The global F-test comparing our model to the naive model (which predicts the mean win percent
for every team) is significant, with a p-value of .038. The ANOVA table is given below:
Analysis of Variance Table
Model 1: Win.Pct ~ 1
Model 2: Win.Pct ~ (All.American + Free.Throw + Off.Rebounds + Pers.Fouls.p.game + Poss +
Blocks + Assist)
         Res.Df  RSS      Df  SS       F       Pr(>F)
Model 1  25      0.23154
Model 2  18      0.11127  7   0.12027  2.7795  0.03805
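The F statistic in the ANOVA table can be reproduced from the two residual sums of squares. The sketch below is an illustrative check using only the values printed in the table; the variable names are our own. Because the printed RSS values are rounded, the result should agree with the reported F only to about three decimal places.

```python
# Values copied from the ANOVA table in the text.
rss_null, df_null = 0.23154, 25   # Model 1: Win.Pct ~ 1
rss_full, df_full = 0.11127, 18   # Model 2: seven explanatory variables

extra_df = df_null - df_full      # number of slopes added to the naive model
sum_sq = rss_null - rss_full      # variation explained by those slopes

# Explained variation per added parameter, relative to the
# unexplained variation per residual degree of freedom.
f_stat = (sum_sq / extra_df) / (rss_full / df_full)

# The same two RSS values also give the multiple R-squared.
r_squared = 1 - rss_full / rss_null

print(f"Sum of Sq = {sum_sq:.5f}, F = {f_stat:.3f}, R^2 = {r_squared:.4f}")
```

The p-value then comes from comparing F to an F distribution with 7 and 18 degrees of freedom. That step needs a statistics library and is not shown here.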
Variable Selection:
Since our degrees of freedom did not allow us to check all interactions we used the pairs plot
(see above) to select interactions to test. The interaction between personal fouls per game and
possessions, as well as fouls per game and offensive rebounds were chosen because they were
the only interactions with R^2 values above .60, and because both possessions (fast play) and
offensive rebounds are aggressive and could lead to more fouls. In order to select variables for
our model, we used both forward and backward selection. Both yielded the same model. Both the
first and last steps of backward selection are shown below:
Start: AIC=-122.73
Win.Pct ~ (All.American + Free.Throw + Off.Rebounds + Assist.Turnover +
    Off.Efficiency + Pers.Fouls.p.game + Poss + Blocks + Assist) -
    Off.Efficiency - Assist.Turnover + Poss:Pers.Fouls.p.game +
    Off.Rebounds:Pers.Fouls.p.game

                                  Df  Sum of Sq  RSS      AIC
- Blocks                          1   0.0000648  0.10745  -124.71
- Pers.Fouls.p.game:Poss          1   0.0017769  0.10916  -124.30
- Off.Rebounds:Pers.Fouls.p.game  1   0.0036634  0.11105  -123.85
<none>                                           0.10738  -122.72
- All.American                    1   0.0110104  0.11840  -122.19
- Free.Throw                      1   0.0121651  0.11955  -121.94
- Assist                          1   0.0123097  0.11969  -121.90

Step: AIC=-131.14
Win.Pct ~ All.American + Free.Throw + Poss + Assist

                Df  Sum of Sq  RSS      AIC
<none>                         0.11412  -131.14
- Assist        1   0.012939   0.12706  -130.35
- All.American  1   0.018212   0.13234  -129.29
- Free.Throw    1   0.020168   0.13429  -128.91
- Poss          1   0.073832   0.18796  -120.17
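The AIC column can be checked by hand. This is a sketch under an assumption: that the output came from R's step() with its default penalty, which for linear models scores a candidate as n*log(RSS/n) + 2p, with p counting the intercept and slopes. The function name is hypothetical, and n = 26 is taken from the 25 ranked teams plus BYU.

```python
import math

def step_aic(rss, n_obs, n_params):
    """AIC up to an additive constant: n*log(RSS/n) + 2*p."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

N_TEAMS = 26  # 25 ranked teams plus BYU

# Final model: intercept + 4 predictors, RSS from the "<none>" row.
aic_final = step_aic(0.11412, N_TEAMS, 5)

# Same model with Poss dropped: intercept + 3 predictors.
aic_without_poss = step_aic(0.18796, N_TEAMS, 4)

print(f"final: {aic_final:.2f}, without Poss: {aic_without_poss:.2f}")
```

Backward selection stops when every remaining row has a higher (worse) AIC than the "<none>" row, which is the situation in the last step shown. Dropping Poss raises the AIC the most, which matches its role as the strongest predictor in the final model.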
Results
Our study yielded the following model:
Win Percent = 1.180133 + 0.013882(Assists) + 0.008717(Free Throws) + 0.065512(All-American) - 0.01791(Possession)
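To show how the fitted equation is used, the sketch below evaluates it for a single team. The function name and the example inputs are hypothetical. We assume assists and possessions are per-game averages, free throws are a percentage on a 0 to 100 scale, and win percent is a proportion between 0 and 1.

```python
# Coefficients copied from the fitted model above.
COEFFICIENTS = {
    "intercept": 1.180133,
    "assists": 0.013882,
    "free_throw_pct": 0.008717,
    "all_american": 0.065512,
    "possessions": -0.01791,
}

def predict_win_pct(assists, free_throw_pct, all_american, possessions):
    """Evaluate the fitted linear model for one team.

    all_american is 1 if the team has an All-American, otherwise 0.
    """
    c = COEFFICIENTS
    return (c["intercept"]
            + c["assists"] * assists
            + c["free_throw_pct"] * free_throw_pct
            + c["all_american"] * all_american
            + c["possessions"] * possessions)

# Hypothetical team: 15 assists/game, 70% free throws,
# one All-American, 68 possessions/game.
example = predict_win_pct(15, 70, 1, 68)
print(f"predicted win percent: {example:.3f}")
```

Predictions are only meaningful for inputs inside the range of the sampled teams. With few possessions the equation can return values above 1, which is one reason the intercept itself has no practical interpretation.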
The coefficient table for our model is given below:
Coefficients:
               Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)    1.180133   0.463863     2.544   0.01888
Assist         0.013882   0.008996     1.543   0.13776
Free Throw     0.008717   0.004525     1.926   0.06769
All-American   0.065512   0.035786     1.831   0.08138
Possession    -0.01791    0.004859    -3.686   0.00137

Residual standard error: 0.07372 on 21 degrees of freedom
Multiple R-squared: 0.5071
Adjusted R-squared: 0.4132
F-statistic: 5.401 on 4 and 21 DF, p-value: 0.003758
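The summary statistics of the full and reduced models can be compared using only the residual sums of squares reported in the ANOVA table and in the final backward-selection step. The sketch below is an illustrative check; the function name is hypothetical, and small differences from the reported values come from rounding in the printed RSS.

```python
import math

def fit_summary(rss, rss_null, n_obs, n_predictors):
    """Return (R^2, adjusted R^2, residual standard error) for a linear fit."""
    resid_df = n_obs - n_predictors - 1          # subtract slopes and intercept
    r2 = 1 - rss / rss_null
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / resid_df
    rse = math.sqrt(rss / resid_df)
    return r2, adj_r2, rse

RSS_NULL = 0.23154   # intercept-only model
N_TEAMS = 26         # 25 ranked teams plus BYU

full = fit_summary(0.11127, RSS_NULL, N_TEAMS, 7)      # seven predictors
reduced = fit_summary(0.11412, RSS_NULL, N_TEAMS, 4)   # four predictors

for name, (r2, adj_r2, rse) in (("full", full), ("reduced", reduced)):
    print(f"{name}: R^2={r2:.4f}, adjusted R^2={adj_r2:.4f}, RSE={rse:.5f}")
```

Dropping three predictors lowers the multiple R-squared slightly but raises the adjusted R-squared, because the adjustment penalizes parameters that explain little variation. This is consistent with backward selection preferring the four-variable model.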
The coefficients can be interpreted as follows:
Intercept: The intercept is the predicted win percentage for a team with zero assists, a free
throw percentage of zero, no All-American, and zero possessions per game. It has no interpretable
meaning and is of no real value, as all teams have at least 55 possessions per game.
Assists: When all other variables are held constant, we expect an increase of one assist per game
to correspond to an increase in win percentage of about 1.4 percentage points.
Free Throw: When all other variables are held constant, we expect an increase of one percentage
point in season free throw percentage to correspond to an increase in win percentage of about 0.9
percentage points.
All American: When all other variables are held constant, we expect a team with an All-American
to have a win percentage about 6.6 percentage points higher than a team without an All-American.
Possession: When all other variables are held constant, we expect an increase of one possession
per game to correspond to a win percentage about 1.8 percentage points lower.
When evaluating these variables using a t-test we see that possessions is the only statistically
significant variable at the .05 significance level. Both free throws and All-American are
significant at the .1 significance level. While included in the model, assists is not statistically
significant.
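The t values and the significance statements above can be reproduced from the coefficient table. This sketch recomputes each t value as estimate divided by standard error and classifies the reported p-values at the two alpha levels used in this report. The p-values are copied from the table rather than recomputed, and the helper name is hypothetical.

```python
# (estimate, std. error, reported p-value) copied from the coefficient table.
FINAL_MODEL = {
    "Assist": (0.013882, 0.008996, 0.13776),
    "Free Throw": (0.008717, 0.004525, 0.06769),
    "All-American": (0.065512, 0.035786, 0.08138),
    "Possession": (-0.01791, 0.004859, 0.00137),
}

def significance(p_value):
    """Label a p-value using the report's alpha levels of .05 and .1."""
    if p_value < 0.05:
        return "significant at .05"
    if p_value < 0.10:
        return "significant at .1"
    return "not significant"

# The t statistic for each slope is its estimate divided by its standard error.
t_values = {name: est / se for name, (est, se, _) in FINAL_MODEL.items()}
labels = {name: significance(p) for name, (_, _, p) in FINAL_MODEL.items()}

for name in FINAL_MODEL:
    print(f"{name:13s} t = {t_values[name]:7.3f}  {labels[name]}")
```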
