Isn’t it fun when you know the result of a game before it actually happens? Of course it is, and everyone has their own subjective opinion on how a game will play out. In this blog post I will describe, in as much detail as possible, the Basketball Prediction Matchup (BPM) model for obtaining the win probability of an NBA matchup. BPM’s architecture is based on FPM, our corresponding model for the NFL, with the features of course being different.
The backbone of our prediction engine is the Bradley-Terry model. The Bradley-Terry model is used to rank a given set of items based on their characteristics and to understand the impact of these characteristics on the ranking. In our case the items are the NBA teams, and the output of the model for items i and j is essentially the probability of team i (assuming without loss of generality that i is the home team) beating team j. In particular, the Bradley-Terry model is described by:
$$\Pr(T_i \succ T_j) = \frac{e^{\pi_i - \pi_j}}{1 + e^{\pi_i - \pi_j}}$$
where $\pi_i$ is the ability of team i. Given a set of k team-specific independent variables $Z_i = (Z_{i1}, \dots, Z_{ik})$, the difference in the ability of teams i and j can be expressed by:
$$\pi_i - \pi_j = \sum_{r=1}^{k} \alpha_r \,(Z_{ir} - Z_{jr}) + U$$
where $\alpha_r$ is the coefficient of the r-th variable and U represents the error term, which is normally distributed with zero mean. The Bradley-Terry model is then a generalized linear model that can be used to predict the probability of team i beating team j, using the differences in the team-specific independent variables as the model features.
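To make the mapping from feature differences to a win probability concrete, here is a minimal Python sketch of the logistic form of the Bradley-Terry model. The function name, the `coefficients` argument and the optional `intercept` (standing in for a home-court term) are illustrative assumptions, not the fitted BPM model.

```python
import math


def win_probability(home_features, away_features, coefficients, intercept=0.0):
    """Probability that the home team beats the away team.

    The ability gap is a weighted sum of feature differences (home minus away);
    `intercept` is an assumed home-court term and defaults to zero.
    """
    ability_gap = intercept + sum(
        c * (h - a)
        for c, h, a in zip(coefficients, home_features, away_features)
    )
    # e^x / (1 + e^x) written in its equivalent form 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-ability_gap))
```

With no intercept, two identical teams get a probability of exactly 0.5, and swapping the two teams gives the complementary probability.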
The independent variables Z that we use to model the win probability are based on Dean Oliver’s “four factors”. In his book “Basketball on Paper”, he identifies four factors that best summarize the performance of a team in basketball. They have been used to project the total number of wins for a team during a season, but here I am using them to estimate the probability of winning a specific matchup. These factors are:
- The Effective Field Goal percentage (eFG%). eFG% adjusts the raw field goal percentage to account for the fact that 3-point shots are worth 1 more point. The formula is eFG% = (FG + 0.5*3PM)/FGA, where 3PM is the number of 3-point shots made
- Turnover percentage (TOV%) is the percentage of possessions that end with a turnover. It is calculated as TOV% = TO/(FGA + 0.44*FTA - OR + TO)
- Offensive rebounding percentage (OR%) is the fraction of the available offensive rebounds that the team grabs, i.e., OR% = OR/(OR + DRopp), where DRopp is the number of defensive rebounds of the opponent
- The free throw factor aims at measuring both the ability to get to the charity stripe and the ability to convert once there. It is calculated as the ratio of free throws made to field goal attempts, i.e., FT/FGA
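The four factors can be computed directly from a team's box score. The sketch below is only an illustration: the function and argument names are made up for this example, and eFG% uses made 3-point shots, which is the standard definition.

```python
def four_factors(fg, fg3, fga, ft, fta, orb, tov, opp_drb):
    """Four factors from box score totals.

    fg, fg3  : field goals made, 3-point field goals made
    fga      : field goal attempts
    ft, fta  : free throws made, free throw attempts
    orb, tov : offensive rebounds, turnovers
    opp_drb  : opponent's defensive rebounds
    """
    return {
        "efg": (fg + 0.5 * fg3) / fga,
        "tov_pct": tov / (fga + 0.44 * fta - orb + tov),
        "orb_pct": orb / (orb + opp_drb),
        "ft_rate": ft / fga,
    }


# Made-up box score, for illustration only
factors = four_factors(fg=40, fg3=10, fga=85, ft=18, fta=24,
                       orb=10, tov=14, opp_drb=32)
```

The model features are then the differences between the home and the away team in each of these values.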
I use one additional feature, the win percentage of each team at the time of the matchup. For every matchup, a binary variable represents whether the home team won or not, and the independent variables of the model are the differences in the above factors (e.g., the eFG% differential between the home and away teams). I downloaded the data from basketball-reference for last season (i.e., 2015-16) and the current season, and using the above setting we obtain the following model:
As we can see, all of the independent variables are statistically significant. As one might have expected, the impact of every variable is positive except for the turnover rate, where a higher turnover rate is associated with a lower win probability. The cross-validation accuracy is approximately 96%, while the Brier score is 0.02! The Brier score is a measure of the accuracy of a probabilistic prediction task, where probabilities need to be assigned to a set of mutually exclusive outcomes. In order to calculate the Brier score we need the predicted probability $p_i$ assigned to instance i and the actual outcome $y_i$. In our case, these map to the predicted probability of the home team winning game i and to whether the home team actually won (1) or not (0). With this setting the Brier score is given by:
$$BS = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2$$
where N is the total number of instances (i.e., games). Accurate predictions have a small Brier score: the minimum possible value is 0, while the maximum value is 1, which corresponds to the worst possible performance. As alluded to above, the cross-validation Brier score of the model is 0.02, which means that the model not only accurately predicts the winner, but also provides a very good estimate of the probability of winning (i.e., it is a well-calibrated model).
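As a quick illustration, the Brier score is just the mean squared difference between the predicted home-win probabilities and the 0/1 outcomes. A minimal sketch, with a function name chosen for this example:

```python
def brier_score(predicted, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(outcomes) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, outcomes)) / len(predicted)
```

Perfect, fully confident predictions score 0, always predicting 0.5 scores 0.25, and being fully confident and wrong every time scores 1.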
However, there is a problem with the above. The model assumes that we know the input features. While this is obviously true for which team is at home and for the current win percentages of the teams, the same does not hold for the performance-based features of the model. For instance, we will not know beforehand that the Celtics will have an eFG% of 0.54 against the Cavs! Therefore, the above performance evaluation is both misleading from the perspective of a realistic system and impractical. Nevertheless, it serves as a good indicator that, as long as we are able to estimate these features, we will be able to obtain a fairly accurate probability of winning.
In order to make BPM applicable in a real-world prediction setting we need a way to project the performance of a team with respect to the four factors included in the model. For that we will rely on historic data. One possibility would be to use the mean/median value of these features over the previous games in the season. While this seems reasonable, it does not consider any variability in the performance. Therefore, in order to add this variability to our predictions, we use the previous games of a team in the season to estimate the joint distribution of its features (we assume a multivariate normal) and then simulate its performance by repeatedly sampling from this distribution.
While there are many different factors to consider when trying to decide which previous games are more relevant to the upcoming matchup (e.g., opponent strength, roster absences etc.), in the current version of BPM we only consider whether a game is a home or a road game. For instance, if we want to estimate the features for the Spurs on the road, we will build the distribution based only on the road games of the Spurs. Stratifying the games on too many variables can lead to very small samples that provide us with very little information. In contrast, considering only the field (home vs road) does not reduce the sample significantly, while at the same time there are teams that exhibit statistically significant differences in these indicators between home and away games. In particular, by using Hotelling’s T² test (in brief, a generalization of the t-test for the mean to multiple dimensions) we examine which teams exhibit statistically significant differences in their performance:
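The sampling step can be sketched as follows, assuming NumPy is available. Each team's past games (home games for the home team, road games for the away team) are summarized by a mean vector and a covariance matrix, feature vectors are drawn from the resulting multivariate normal, and each draw is pushed through the logistic model. The coefficients are placeholders, and averaging the simulated probabilities is one plausible way to aggregate the draws rather than a detail stated above.

```python
import numpy as np


def simulate_win_probability(home_games, away_games, coefficients,
                             intercept=0.0, n_sims=10_000, seed=0):
    """Average home-win probability over simulated four-factor performances.

    home_games / away_games: arrays of shape (n_games, n_features) holding the
    home team's past home games and the away team's past road games.
    coefficients / intercept: placeholder model parameters (assumptions).
    """
    rng = np.random.default_rng(seed)
    home = np.asarray(home_games, dtype=float)
    away = np.asarray(away_games, dtype=float)

    # Fit a multivariate normal to each team's past games and sample from it
    home_draws = rng.multivariate_normal(
        home.mean(axis=0), np.cov(home, rowvar=False), size=n_sims)
    away_draws = rng.multivariate_normal(
        away.mean(axis=0), np.cov(away, rowvar=False), size=n_sims)

    # Logistic model applied to the simulated feature differentials
    gap = intercept + (home_draws - away_draws) @ np.asarray(coefficients, dtype=float)
    return float(np.mean(1.0 / (1.0 + np.exp(-gap))))
```

The win percentage differential is known before the game, so it would be added to the ability gap as a fixed term and not sampled.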
We can see that there are several teams whose performance with regard to the “four factors” differs between home and road games. Therefore, when building the empirical distribution for these features we only use the home games of the home team and the road games of the away team.
With this setting in place, we use last year’s NBA data to evaluate the accuracy of BPM. We use the first 200 games of the season as our training set, and for each following game we first infer the “performance” of each team using the corresponding multivariate distribution from the historic data and then compute the estimated win probability for the home team. After each game prediction, we use the actual boxscore data from that game to re-train the model, and the newly trained model is used for the upcoming matchups.
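This predict-then-retrain procedure is a walk-forward evaluation. The sketch below shows only the loop structure: `fit` and `predict` are hypothetical callables supplied by the caller, and each game is assumed to be a dict with a `home_won` key plus whatever pre-game and boxscore data those callables need.

```python
def walk_forward_accuracy(games, n_initial, fit, predict):
    """Accuracy of predicting each game using only the games played before it.

    games     : chronologically ordered list of dicts with a "home_won" key
    n_initial : number of games used for the first training set
    fit       : hypothetical callable, fit(past_games) -> model
    predict   : hypothetical callable, predict(model, game) -> P(home win);
                it must rely only on information available before the game
    """
    correct = 0
    n_predicted = 0
    for k in range(n_initial, len(games)):
        model = fit(games[:k])  # re-train on every game played so far
        p_home = predict(model, games[k])
        correct += int((p_home >= 0.5) == bool(games[k]["home_won"]))
        n_predicted += 1
    return correct / n_predicted if n_predicted else float("nan")
```

Re-fitting from scratch before every game is the simplest choice; with a few thousand games per season it is cheap enough for a logistic model.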
BPM has a 67% accuracy with a standard error of 1.5% for the 2015-16 season (regular season and playoffs), while during the current season the running accuracy of BPM is 65%. This is on par with the performance reported for various prediction systems online; overall, the reported prediction accuracy for the NBA ranges between 60% and 70%. We also built a baseline model where the only features used to predict the win probability for the home team are, well, whether the team is playing at home or not and the win percentage differential. Note that these two features do not need to be sampled from any distribution since they are known prior to the matchup. The accuracy of this system was just 59%, right at the lower end of the typical accuracy for NBA prediction. Adding the four factors therefore improves the accuracy over this baseline by approximately 8 percentage points. Furthermore, we computed the Brier score for the full model and the baseline, and this further supports the above results: the Brier score for the full model is 0.15, while that of the baseline model is 0.24 (remember that the smaller the Brier score, the better the probability prediction from the model!). A Brier score of 0.248 is also obtained from the typical “climatology” reference strategy for estimating the probabilities. In this strategy, every game is predicted to be a home-team win with a probability equal to the fraction of games that the home team wins. This probability is approximately 52%.
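To see why the climatology strategy lands so close to 0.25, consider a constant forecast p for every game. The symbols below are introduced only for this short derivation: N is the number of games, $y_i$ is 1 if the home team won game i and 0 otherwise, and $\bar{y}$ is the fraction of games won by the home team.

$$BS_{\text{clim}} = \frac{1}{N}\sum_{i=1}^{N}(p - y_i)^2 = \bar{y}\,(1-p)^2 + (1-\bar{y})\,p^2$$

When the forecast equals the home-win rate, $p = \bar{y}$, this simplifies to

$$BS_{\text{clim}} = \bar{y}\,(1-\bar{y}) \le \tfrac{1}{4}$$

so any home-win rate close to one half gives a score just under 0.25, which is in line with the value reported above.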
It will be interesting to see how BPM would perform if we used the projected point differential as the dependent variable. The true test would then be to compare with the Vegas line 🙂