How costly are turnovers in a basketball game? How does a 1% increase in the FG percentage affect the probability of winning a game? These are questions that a data-oriented, sports-loving person might have. While the qualitative answers to such questions might be obvious to dedicated fans of the game, the exact quantitative answers require data analysis. Questions like these also make a fun vehicle for teaching basic statistical modeling techniques.

In this post I will show how one can use NBA boxscore data to model the probability of winning a basketball game. The goal is not to identify the new crucial factor that teams need to pay attention to (well, even if I knew that, I guess I would try to monetize it privately), but rather to show (a) the power of data and (b) the power of simple statistical models in helping us understand the game better. For this purpose I collected boxscore data from basketball-reference for the current 2015-2016 season. The data includes the statistics of every regular season and postseason game, i.e., basic information such as FG%, number of rebounds, steals, blocks, assists, etc.

Let us denote with $y$ a binary variable that captures whether a team won a specific matchup (i.e., $y = 1$) or not (i.e., $y = 0$). This binary variable depends on a number of factors such as the field goal percentage of the team, its ability to get the boards, the turnovers forced on defense, etc. These factors form our independent variables, represented by the vector $\mathbf{x}$. Logistic regression is a generalized linear model that is used to capture the dependencies between $y$ and $\mathbf{x}$. In particular, the output of the logistic regression provides us with the probability of $y$ being 1, i.e., $\Pr(y = 1 \mid \mathbf{x})$. More specifically, the logistic regression model for $y$ is given by:

$$\Pr(y = 1 \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\beta}^{T}\mathbf{x})}{1 + \exp(\boldsymbol{\beta}^{T}\mathbf{x})}$$

The coefficient vector $\boldsymbol{\beta}$ includes the weights for each independent variable of the input vector $\mathbf{x}$. From the boxscore we will use the following information as our independent variables: (1) whether the team plays at home or on the road, (2) field goal percentage, (3) free throw percentage, (4) number of offensive rebounds, (5) number of defensive rebounds, (6) number of assists, (7) number of turnovers committed and (8) number of personal fouls committed.
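To make the model concrete, here is a minimal Python sketch of how a win probability would be computed from one boxscore row, i.e., the logistic function applied to a weighted sum of the eight features. The feature names, the intercept and every coefficient value below are placeholders I made up for illustration; they are not the fitted values from this analysis.

```python
import math

# The eight boxscore features used as independent variables (names are illustrative).
FEATURES = ["home", "fg_pct", "ft_pct", "off_reb", "def_reb",
            "assists", "turnovers", "fouls"]

def win_probability(x, beta, intercept=0.0):
    """Logistic model: Pr(win) = 1 / (1 + exp(-(intercept + beta . x)))."""
    z = intercept + sum(beta[name] * x[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder coefficients, NOT estimated from the data.
beta = {"home": 0.4, "fg_pct": 0.25, "ft_pct": 0.0, "off_reb": 0.05,
        "def_reb": 0.15, "assists": 0.0, "turnovers": -0.15, "fouls": 0.0}

# A hypothetical boxscore row for a home team.
game = {"home": 1, "fg_pct": 46.0, "ft_pct": 75.0, "off_reb": 10,
        "def_reb": 33, "assists": 22, "turnovers": 14, "fouls": 20}

p = win_probability(game, beta, intercept=-15.0)

# Under this model, one extra FG percentage point multiplies the odds of
# winning by exp(beta["fg_pct"]), whatever the other features are.
fg_odds_ratio = math.exp(beta["fg_pct"])
```

The odds-ratio line is one way to approach the opening question about a 1% increase in FG percentage: the effect on the odds is constant, while the effect on the probability itself depends on where the team already stands.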

First I did some very basic exploratory analysis with simple pairwise comparisons of the boxscore features for the winning and losing teams. The following figures present the empirical density of the differential of each feature between the winning and losing team of each matchup in the season.
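As a rough sketch of how these differentials can be computed, the snippet below subtracts the loser's value from the winner's value for each matchup. The data layout (a list of dictionaries with `winner` and `loser` entries) and the numbers are assumptions made up for this example, not the format or content of the actual dataset.

```python
# Made-up placeholder boxscores for three matchups (not real games).
matchups = [
    {"winner": {"turnovers": 12, "def_reb": 35}, "loser": {"turnovers": 16, "def_reb": 30}},
    {"winner": {"turnovers": 15, "def_reb": 33}, "loser": {"turnovers": 13, "def_reb": 31}},
    {"winner": {"turnovers": 10, "def_reb": 29}, "loser": {"turnovers": 14, "def_reb": 32}},
]

def differentials(matchups, feature):
    """Winner-minus-loser value of one boxscore feature for every matchup."""
    return [m["winner"][feature] - m["loser"][feature] for m in matchups]

def share_positive(values):
    """Fraction of matchups in which the winner had the higher value."""
    return sum(v > 0 for v in values) / len(values)

turnover_diff = differentials(matchups, "turnovers")  # [-4, 2, -4]
def_reb_diff = differentials(matchups, "def_reb")     # [5, 2, -3]
```

Plotting the empirical density of lists like `turnover_diff` over a full season gives the kind of figure described above: a distribution shifted below zero suggests that winners tend to have the lower value.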

These pairwise comparisons show that winning teams tend to commit fewer turnovers, have a better field goal percentage and collect more defensive rebounds. Home court seems to provide some advantage as well. In any case, these are just exploratory comparisons. The importance of each of these game statistics will be revealed through the logistic regression coefficients. For each coefficient we obtain two values: its “magnitude” and its p-value. The p-value comes from a simple hypothesis test (typically a Wald test in logistic regression software) of whether the coefficient is 0 or not. However, here is the tricky part that many people get wrong: a significant coefficient is not necessarily an important one! A coefficient can be statistically significant and still be too small to have any meaningful impact on the dependent variable. This is especially true in the case of “big data”, where even very small effects can be picked up by the large dataset. Things are simply “too big to fail”. One way to avoid such problems is to run the regression repeatedly on smaller samples drawn at random from the original dataset. So in our case we use roughly 1/3 of the games to build the model (i.e., 500 data points instead of the 1,310 games). The following are the coefficients obtained.

As we can see, some coefficients are practically 0 (i.e., personal fouls, assists and free throw percentage), while others, even though statistically significant, have a very small impact (e.g., offensive rebounds). The most interesting result might be that the number of assists is insignificant for the probability of winning. It might be the case that the game has changed for good and individual ability trumps teamwork (at least it seems worth tracking this change with boxscores from earlier seasons). Furthermore, the accuracy at predicting the winner of the game, given the team performance features, is about 81% (not bad considering how simple the features are). Of course, the problem is how you predict the numbers that the team will put up…
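For readers who want to experiment with the repeated-subsample idea, here is a minimal, self-contained Python sketch. It is not the code used for this post: the data are synthetic (two standardized features with assumed true effects of +1.0 and -0.5), the fit uses a plain gradient-ascent routine written for this example rather than a statistics package, and the accuracy is measured on the same subsample used for fitting.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=150):
    """Gradient ascent on the mean log-likelihood; w[0] is the intercept."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def accuracy(w, X, y):
    """Share of games whose outcome is predicted correctly at a 0.5 threshold."""
    hits = 0
    for xi, yi in zip(X, y):
        p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
        hits += int((p >= 0.5) == (yi == 1))
    return hits / len(y)

# Synthetic stand-in for the boxscore data (placeholders, not the real dataset).
rng = random.Random(0)
X_all = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1310)]
y_all = [int(rng.random() < sigmoid(1.0 * a - 0.5 * b)) for a, b in X_all]

# Refit on several random subsamples of 500 games and collect the results.
runs = []
for _ in range(5):
    idx = rng.sample(range(len(X_all)), 500)
    X_sub = [X_all[i] for i in idx]
    y_sub = [y_all[i] for i in idx]
    w = fit_logistic(X_sub, y_sub)
    runs.append((w, accuracy(w, X_sub, y_sub)))

# Average each coefficient over the runs to see how stable it is.
mean_w = [sum(w[j] for w, _ in runs) / len(runs) for j in range(3)]
```

Comparing the coefficients across `runs` shows which effects keep their sign and size from subsample to subsample and which ones hover around zero. A stricter check of predictive accuracy would score each fitted model on games left out of its subsample.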

One thing to always be careful about is interpreting the results. Logistic regression only captures correlations, so always watch out for reverse causation. For example, the turnover coefficient is negative, which means that teams that commit fewer turnovers have a better chance of winning. However, this might be an artifact: teams that are trailing (for reasons other than turnovers) towards the end of the game are in “hurry-up” mode and hence commit more turnovers. The same might be true for the coefficient of personal fouls. Even though its p-value is 0.11 (i.e., one can consider the coefficient to be pretty much 0), if we ignore this fact, the negative sign of the coefficient might just be an artifact of trailing teams committing fouls towards the end of the game to stop the clock, and hence ending up with more fouls overall. Sometimes this problem of reverse causation can be addressed by checking, for example, the distribution of turnovers over the course of a game (e.g., if the turnovers are uniformly distributed over the game, then reverse causation is less likely). You can find some more details on this issue and how to work around it in my analysis of the factors affecting the probability of winning an NFL game.
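One simple way to check whether turnovers are spread uniformly over a game is a chi-square goodness-of-fit test on the counts per quarter. The sketch below is only an illustration of that idea: the per-quarter counts are made-up placeholders, and splitting by quarter is my assumption about how the game would be binned.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts per bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Made-up turnovers per quarter for teams that ended up losing (placeholder numbers).
turnovers_by_quarter = [30, 28, 31, 51]

stat = chi_square_uniform(turnovers_by_quarter)

# 7.815 is the 5% critical value of the chi-square distribution
# with 3 degrees of freedom (four quarters minus one).
looks_uniform = stat < 7.815
```

With these placeholder counts the statistic exceeds the critical value, so uniformity would be rejected. A pile-up of turnovers in the fourth quarter like this one is the pattern that would make the reverse causation story more plausible.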

I know that boxscore data are not easy to find in a readily usable form, so you can download them here.