
NBA Shot Result Prediction

Alex Dai

Abstract— This report details the use of statistical learning methods to predict whether a basketball shot is made or missed in the 2014-2015 NBA season. We attempt to predict the result based on information about the game scenario at the time of the shot.

I. INTRODUCTION

The use of data analysis to influence decisions in basketball has increased rapidly in popularity over the past few years. This trend is due to two main reasons. The first is improved hardware for handling big-data tasks. The second is the advent of advanced play-by-play tracking technology, facilitated by a camera system hung from the rafters that collects data on the movements of each player. Now that teams are able to gather a plethora of detailed and descriptive data, they are relying more and more on statistical and quantitative measures to influence game-time decisions.
Obviously, the objective of basketball is to score more points than your opponent. The point of this analysis is to determine whether we can build a model that can predict whether a shot will fall or not based on the advanced data gathered by the SportVU camera system mentioned above. A model that can do this provides invaluable information to the coaching staff of a team about how the conditions of the game at the time of the shot influence the outcome of the shot.

II. DATA

A. Data Source

The data we will be looking at is shot log data gathered from the SportVU camera systems for every game spanning a majority of the 2014-2015 NBA season, from Oct 2014 to Mar 2015. The data was taken from https://www.kaggle.com/dansbecker/nba-shot-logs/home, which compiled this dataset using the NBA API. The reason we are looking at data from the 2014-2015 season is that the NBA API stopped providing this data permanently after the 2015 season due to cost concerns.

B. Data Characteristics

The Shot Log dataset contains information about every single shot taken in 904 games spanning the aforementioned time period, totaling 128,069 rows with 21 features in each row. Figure 1 shows the features available, along with their corresponding data types.

Fig. 1. Feature Summary in Shot Log Dataset

C. Data Preprocessing

First we cleaned the dataset by handling any NaN values as well as any nonsensical values. After inspection, only the SHOT_CLOCK column contained NaNs, with around 5,500 out of 128,000 entries missing. Instead of deleting all of the rows with no SHOT_CLOCK information, we found a way to impute this value without losing too much valuable information. After some close inspection, there are only 10 entries in the dataset where the SHOT_CLOCK is equal to the game clock. This is far fewer than expected, since there should be many instances where a team gets possession of the ball with less than 24 seconds on the game clock, causing the shot clock to equal the time remaining on the game clock. After re-reading the NBA rules, it turns out the shot clock is actually turned off when a team gets possession of the ball with the game clock under 24 seconds. Therefore, we can replace the shot clock with the value of the game clock whenever the shot clock is NaN and the game clock is less than 24 seconds. This accounts for 3,554 of our 5,500 NaN entries. The remaining roughly 2,000 NaNs don't have as easy an explanation and are probably due to some recording error within the camera system, so we delete those rows from our analysis, as SHOT_CLOCK intuitively should be an important predictor.

The next preprocessing step involved handling nonsensical values. The TOUCH_TIME feature describes how long the shooter has the ball before releasing his shot. There are 300 entries in which this value is negative, which makes no sense in this context. Figure 2 shows the distribution of the negative TOUCH_TIME values compared to the normal TOUCH_TIME values. It appears that the negative values, once flipped, follow the same distribution as the normal ones, suggesting that the negative values were a random entry error sampled from the original distribution. Therefore, it made sense to flip these negative values in order to preserve as much information as possible.

Lastly, before doing any further analysis, we factor our response variable, SHOT_RESULT, into 0/1 values, as well as the other binary categorical variables in the feature set. A minimal sketch of these preprocessing steps follows.
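The sketch below assumes a pandas DataFrame with the column names from Figure 1; the file name, the conversion of GAME_CLOCK to seconds remaining, and the "made"/"missed" label strings are assumptions rather than details from the report.

```python
import pandas as pd

# Load the shot logs (file name hypothetical; GAME_CLOCK is assumed to
# already be converted to seconds remaining in the period).
df = pd.read_csv("shot_logs.csv")

# The shot clock is switched off when a possession starts with under
# 24 seconds left, so impute SHOT_CLOCK from the game clock there.
off_clock = df["SHOT_CLOCK"].isna() & (df["GAME_CLOCK"] < 24)
df.loc[off_clock, "SHOT_CLOCK"] = df.loc[off_clock, "GAME_CLOCK"]

# Drop the ~2,000 remaining NaNs, which look like recording errors.
df = df.dropna(subset=["SHOT_CLOCK"])

# Flip the ~300 negative TOUCH_TIME values back into the normal range.
df["TOUCH_TIME"] = df["TOUCH_TIME"].abs()

# Factor the response into 0/1 (assuming "made"/"missed" raw labels).
df["SHOT_RESULT"] = (df["SHOT_RESULT"] == "made").astype(int)
```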
III. DATA EXPLORATION

A. Feature Distribution

Figure 3 shows the distributions of the features that may be important for prediction in the final model, serving as a launchpad for exploration. It also provides interesting information about the current state of the NBA that may help in structuring our model. Approximately 45% of the shots in the 2014-2015 season went in, giving us a good baseline to compare our models against: if our model guessed the majority class of "MISS" every single time, it would achieve an accuracy of approximately 55%.

Fig. 3. Feature Distributions

One important feature that we are missing is whether or not the shooter was fouled during the shot. This drastically affects the chances of a miss or a make, and represents a latent source of bias we must be aware of.

The distribution plots also reveal some interesting things about shot selection in the modern NBA. Ever since 2010, more and more emphasis has been placed on the 3-pt shot. As can be seen in the SHOT_DIST plot, a significant number of shots came from the 22-24 ft area, which is right at the border of the 3-pt line.

Another interesting feature is the SHOT_CLOCK distribution, which looks relatively normal except for a large spike of shots at the 24-second mark. This spike makes sense: players chuck up shots at the end of the shot clock to avoid a 24-second violation.

Another important feature this dataset lacks is whether the shot is the result of an initial possession or a recovered possession, perhaps from a rebound or an early steal off of the entry pass. I believe shots coming off of second-chance possessions should have a higher success rate, because many such shots are easy put-backs or tip-ins after the defense is scrambled by the initial shot. Since we don't have this information within the dataset, we need to account for this latent source of bias as well.

B. Feature Correlations

In Figure 4, we created a correlation heatmap that visualizes the correlations between all of the features, as well as with the response variable, SHOT_RESULT. Quite a few variables were heavily correlated with each other; after closer inspection, it's clear that this is because there is some redundant information within the features. For example, FINAL_MARGIN and W are heavily correlated because when FINAL_MARGIN is positive, the game obviously results in a win. Furthermore, neither of those features is something we wanted to use in a predictive model, since they aren't ascertained until after a game is over.

Fig. 4. Feature Correlation Heatmap

Another pair of heavily correlated features is PTS_TYPE and SHOT_DIST, for obvious reasons. Therefore, it made the most sense for our model to use only one of these.

Additionally, DRIBBLES and TOUCH_TIME are very correlated with each other, but they also contain non-overlapping information that could be useful to the predictor. For example, a possession by a player like Carmelo Anthony, who likes to hold on to the ball for a long time without dribbling before shooting, should be treated differently than one by a player like Kyrie Irving, who tends to dribble a lot before each shot. Therefore, it makes sense to retain both in the model.
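A heatmap like Figure 4 can be generated directly from the cleaned frame; this is a minimal sketch assuming seaborn and the preprocessed df from the earlier sketch, restricted to numeric columns.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlations between the numeric features and the 0/1
# response, rendered as a heatmap in the style of Figure 4.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```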
One interesting relationship was between SHOT_DIST and CLOSE_DEF_DIST. It makes sense that the farther a shooter is from the basket, the farther away the closest defender tends to be as well. Furthermore, the farther away the shot occurs, the harder it is to make, incentivizing the defender to contest the shot less and less as the shot distance increases. Figure 5 shows a heatmap that visualizes this relationship. Most shots are either layups or 3-pt shots, which really highlights the trend the modern NBA is moving towards; if this plot were created for the early 2000s, there would be much more red in the mid-range portion. A majority of the 2-pt shots are defended tightly, with the closest defender within 2-5 feet. In contrast, 3-pt shots are defended much less tightly, with the closest defender distance centered around 5 feet away.

Fig. 5. Shot Distance vs. Closest Defender Distance Heatmap

C. Feature Analysis

Figure 6 shows the relationship between the closest defender distance and our response, SHOT_RESULT. Surprisingly, there is very little relationship between field goal percentage and distance from the closest defender unless the distance is absurdly high, something like 20+ ft. This happens very rarely, probably only when a player gets a fast break off of a steal and is completely ahead of the pack. There also seems to be a dip in field goal percentage at around 28 ft away.

Fig. 6. Shot Distance Feature Analysis

We don't want our model to overfit on these large distances, so instead of treating this predictor as a numerical feature, we bin it into the following 5 categories recommended by www.NBA.com (a concrete encoding sketch appears after the list):

• 0-2 feet away - Very Tightly Defended
• 2-4 feet away - Tightly Defended
• 4-6 feet away - Open
• 6-10 feet away - Very Open
• 10+ feet away - Completely Open

Hopefully this encoding will reduce the variance of our model while minimizing the potential bias it introduces.
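As a concrete version of the encoding above, here is a minimal pd.cut sketch; the bin edges come from the list, while the output column name is our own.

```python
import pandas as pd

# Bin closest-defender distance into the five NBA.com categories.
bins = [0, 2, 4, 6, 10, float("inf")]
labels = ["Very Tightly Defended", "Tightly Defended", "Open",
          "Very Open", "Completely Open"]
df["DEF_DIST_BIN"] = pd.cut(df["CLOSE_DEF_DIST"], bins=bins,
                            labels=labels, include_lowest=True)
```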

Figure 7 indicates that there is a strong relationship between SHOT_CLOCK and our response variable. It appears that shots that occur very early in the shot clock have a significantly higher percentage than normal. This is probably aided by high-percentage fast-break plays that score very early in the shot clock. As the shot clock approaches 0, the accuracy dives, probably for the reason mentioned earlier: the shooter is pressured to put up a shot before the shot clock expires. This feature is thus important to include in our model.

Fig. 7. Shot Clock vs. Accuracy
Figure 8 highlights the obvious relationship between SHOT_DIST and our response. As the shooter gets further away from the basket, the chance of success drops rapidly. One surprising thing to notice is that the efficiency of a shot 10 ft away is almost the same as that of a shot 22 ft away. This trend explains why teams are moving away from the mid-range shot: though it is just as likely to fall as the 3-pt shot, it rewards the team only 2 points, while a 3-pt shot rewards 50% more per made basket. Furthermore, the shot percentage takes a dramatic drop from the 0-5 ft range to 10 ft with no increase in points awarded for the added difficulty. This trend further explains why the NBA is moving towards taking shots with a high expected value, namely layups and 3-pt shots. It also shows that including SHOT_DIST as a feature will improve our model.

Fig. 8. Shot Distance vs. Accuracy
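A curve like Figure 8 can be computed by grouping the response over rounded shot distance; a minimal sketch using the preprocessed df from the earlier sketches:

```python
# Field goal percentage as a function of shot distance (1-ft bins),
# i.e. the relationship plotted in Figure 8.
fg_pct = df.groupby(df["SHOT_DIST"].round())["SHOT_RESULT"].mean()
print(fg_pct)
```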
IV. DIMENSIONALITY REDUCTION

One important feature to include in our dataset is the specific player performing the shot, as well as the specific player defending the shot. Unfortunately, this presents a computational concern, as there are 281 unique shooters and 487 unique defenders in the shot logs. Once these are one-hot encoded and joined into the dataset, our dataset shape is (125,000, 705). This presents a challenge, as some models we tried to use, such as logistic regression, were unstable and had overflow errors. This section explores how we can reduce the dimensionality of our data. One idea was to use PCA to find a low-rank approximation of our data matrix that might ease our computational woes. First we scaled all of our data to values between 0 and 1 using the following equation:

(x - x_min) / (x_max - x_min)    (1)

Figure 9 shows the variance retained vs. the number of components used, after performing 5-fold cross-validation. It seems that PCA isn't appropriate in this situation; I hypothesize that the sparsity of this dataset makes it hard to find a good low-rank approximation.

Fig. 9. Results vs. PCA

I talked to Su in class about possible alternatives, and one possible alternative involved fitting a regression on the non-player-specific features and then treating each player as an intercept. I couldn't quite figure out how to do this for both the shooter and the defender. I should have at least tried that method for just the shooter, as that is one of the most important predictors; however, due to time constraints, I was unable to add it. Thus our model does not factor in the shooting player or closest defending player at all.
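A sketch of the scaling and PCA diagnostic described above, assuming scikit-learn; the shooter/defender column names (player_name, CLOSEST_DEFENDER) follow the Kaggle shot logs, and the remaining columns are assumed to be numeric after preprocessing.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the shooter and closest defender (281 + 487 levels),
# which expands the design matrix to roughly (125,000, 705).
X = pd.get_dummies(df.drop(columns=["SHOT_RESULT"]),
                   columns=["player_name", "CLOSEST_DEFENDER"])

# Equation (1): min-max scale every column into [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# Cumulative variance retained vs. number of components (Figure 9).
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_.cumsum())
```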
V. MODEL RESULTS

A. Logistic Regression

Figure 10 shows the results of logistic regression over all of the data. With an R² score of 0.044, our model isn't very good at explaining the variability within the dataset. All of the predictors were significant except for GAME_CLOCK. Some interesting results can be derived from this summary. Interpreting the coefficients, we see that as SHOT_DIST increases with all other predictors held constant, the chance of shot success decreases. Furthermore, as shooters are less tightly defended, the chance of shot success increases. Interestingly, TOUCH_TIME and DRIBBLES both seem to have a positive correlation with shot success; personally, I would have imagined the opposite.

Fig. 10. Logistic Regression Summary
Figure 11 describes the results from 5-fold cross-validation. After averaging the threshold that performs best on each fold, we arrive at a threshold of 0.54 with a cross-validated accuracy of 61.3%. Furthermore, we achieved a cross-validated precision of 64.2% and recall of 32.9%. These results aren't that impressive compared to the baseline accuracy of 55% from always guessing the majority class.

Fig. 11. Logistic Regression Threshold Selection
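A minimal sketch of this cross-validated threshold evaluation, assuming scikit-learn, with X the scaled matrix of non-player features and df the preprocessed frame; the 0.54 cutoff is the averaged best threshold reported above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Out-of-fold predicted probabilities from 5-fold cross-validation.
y = df["SHOT_RESULT"].to_numpy()
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]

# Classify at the averaged best threshold (0.54) instead of 0.5.
pred = (proba >= 0.54).astype(int)
print(accuracy_score(y, pred),
      precision_score(y, pred),
      recall_score(y, pred))
```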
B. Gradient Boosted Trees

In an effort to improve on the results of logistic regression, we tried applying gradient boosted trees to this dataset using the XGBoost package. Gradient boosted trees are among the best-performing architectures in Kaggle competitions for classification tasks. Since feature selection is somewhat more important for decision trees, we decided to use only SHOT_NUMBER, SHOT_CLOCK, DRIBBLES, TOUCH_TIME, and the binned CLOSE_DEF_DIST as input features to the classifier. Using the default parameters for training, we achieved a cross-validated accuracy of 62.1% with a precision of 65.2% and a recall of 65.2%. The gradient boosted trees saw an improvement in all 3 metrics.
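A sketch of this setup with XGBoost's scikit-learn wrapper and default parameters; DEF_DIST_BIN is the binned defender distance from the earlier sketch, one-hot encoded since its categories are strings.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Hand-picked feature subset plus the one-hot encoded defender bins.
cols = ["SHOT_NUMBER", "SHOT_CLOCK", "DRIBBLES", "TOUCH_TIME"]
X_gbt = pd.concat([df[cols], pd.get_dummies(df["DEF_DIST_BIN"])],
                  axis=1)

# Default-parameter gradient boosted trees, scored with 5-fold CV.
scores = cross_val_score(XGBClassifier(), X_gbt, df["SHOT_RESULT"],
                         cv=5, scoring="accuracy")
print(scores.mean())
```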

VI. FUTURE WORK

We were slightly disappointed with the lackluster results of our classifier, but given the limited information about each shot, we didn't expect much more. In terms of improving the quality of our classification, we would want to obtain certain features such as whether or not the shooter was fouled, which possession the shot happened on, and general descriptions of the play leading up to the shot. The exclusion of these features puts a hard cap on how well our classifier can possibly do. Furthermore, being able to include the shooting player and defending player in our model would potentially improve results, but this couldn't be done due to lack of time. It may be possible to categorize each player into 5-6 offensive and defensive categories based on their offensive and defensive stats and use that as a lightweight alternative to treating each specific player as a feature.

Despite that, this analysis was important in highlighting important trends in the current NBA landscape. Furthermore, if the base model were improved to around 80% accuracy, it could provide a very powerful tool for shot evaluation for teams to use in the future.