Did high seeds get the bad end of the bubble deal in the NBA? Can we “recreate” the equivalent of home court in the bubble? These are questions that I have seen discussed quite often in the media (social and traditional). There are also articles (like this one on ESPN’s site) suggesting that the home court advantage is still present in the bubble. This question (together with the hot hand) is one that possibly transcends sports analytics and has “fascinated” people for decades. I took this opportunity to introduce Bayesian linear regression in this context, and at the same time try to answer the question of the bubble court advantage.
To look into that we will start with a regression-based rating system for teams. As I described here, one way to identify team ratings is through a linear regression, where the independent variables are “dummy” variables corresponding to teams and the dependent variable is the final point margin. When you fit this regression, the constant term corresponds to the home court advantage (HCA). Using OLS is the “frequentist” approach, and even though I am not taking sides here (supposedly), let’s look at the Bayesian approach to linear regression.
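As a rough illustration of the setup just described, the sketch below builds one design-matrix row per game and evaluates the predicted margin as an inner product. The +1/-1 encoding for the home and visiting team, the three team abbreviations, and the coefficient values are all assumptions made for illustration; the original code may encode the dummies differently.

```python
from typing import Dict, List

TEAMS: List[str] = ["LAL", "DEN", "MIL"]  # hypothetical subset of teams


def design_row(home: str, away: str, teams: List[str] = TEAMS) -> List[float]:
    """Return [intercept, dummy_1, ..., dummy_n] for a single game.

    Assumed encoding: +1 for the home team, -1 for the visiting team,
    0 for every team not involved. The leading 1 is the constant term,
    whose coefficient plays the role of the home court advantage.
    """
    row = [1.0] + [0.0] * len(teams)
    row[1 + teams.index(home)] = 1.0
    row[1 + teams.index(away)] = -1.0
    return row


def predicted_margin(row: List[float], hca: float, ratings: Dict[str, float],
                     teams: List[str] = TEAMS) -> float:
    """Inner product of the coefficients (HCA and ratings) with one row."""
    coefs = [hca] + [ratings[t] for t in teams]
    return sum(c * x for c, x in zip(coefs, row))


# Made-up coefficients, only to show the arithmetic.
example_ratings = {"LAL": 3.0, "DEN": 1.0, "MIL": 5.0}
row = design_row(home="LAL", away="DEN")
margin = predicted_margin(row, hca=2.0, ratings=example_ratings)
print(row, margin)
```

With this encoding the predicted home margin is simply HCA plus the home rating minus the visiting rating, which is why the constant term can be read as the home edge.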
In the Bayesian formulation the dependent variable is a sample from a normal distribution, $y \sim \mathcal{N}(\beta^T x, \sigma^2)$. The mean of the distribution is the inner product of the coefficients $\beta$ to be learned and the independent variables $x$, while the standard deviation $\sigma$ is also to be learned. The model coefficients are considered to be drawn from a distribution as well, which essentially allows us to get an estimate of their uncertainty. The goal of a Bayesian linear regression is to obtain the posterior distribution for the model parameters given the data at hand, $P(\beta \mid y, X)$. Using Bayes’ rule we have: $P(\beta \mid y, X) = \frac{P(y \mid \beta, X) \, P(\beta)}{P(y \mid X)}$. The prior $P(\beta)$ can incorporate domain knowledge. Alternatively it can be a non-informative prior, e.g., a uniform distribution. These models are estimated through Markov chain Monte Carlo (MCMC) methods, and a great library to do that in Python is `pyMC3`. You can download the code and data for the Bayesian linear regression here.
We start with the situation pre-covid and build the regression for identifying the ratings and the HCA. We use 10 chains of 2000 samples each, with the following priors for the model parameters:
- Team ratings: normal distribution with average 0 and standard deviation 5
- HCA: uniform distribution from -5 to 5
- Model standard deviation: half Cauchy with beta equal to 10.
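To make the combination of prior and likelihood concrete, here is a minimal pure-Python sketch of the unnormalized log posterior implied by the three priors listed above together with a normal likelihood. It is not the PyMC3 code from the post (which handles the sampling); the function names, the toy rows and margins, and the +1/-1 row encoding are hypothetical.

```python
import math
from typing import Sequence


def log_normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Log density of Normal(mu, sigma) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_prior(hca: float, ratings: Sequence[float], sigma: float) -> float:
    """Sum of the log densities of the three priors listed above."""
    if not -5.0 <= hca <= 5.0 or sigma <= 0.0:
        return -math.inf  # outside the support of the priors
    lp = -math.log(10.0)  # Uniform(-5, 5) has constant density 1/10
    lp += sum(log_normal_pdf(r, 0.0, 5.0) for r in ratings)  # Normal(0, 5)
    beta = 10.0  # half-Cauchy scale
    lp += math.log(2.0 / (math.pi * beta)) - math.log1p((sigma / beta) ** 2)
    return lp


def log_likelihood(margins: Sequence[float], rows: Sequence[Sequence[float]],
                   hca: float, ratings: Sequence[float], sigma: float) -> float:
    """Normal likelihood with mean = inner product of coefficients and row."""
    total = 0.0
    for y, row in zip(margins, rows):
        mu = hca * row[0] + sum(r * x for r, x in zip(ratings, row[1:]))
        total += log_normal_pdf(y, mu, sigma)
    return total


def log_posterior(margins, rows, hca, ratings, sigma) -> float:
    """Unnormalized log posterior: log prior + log likelihood."""
    lp = log_prior(hca, ratings, sigma)
    if lp == -math.inf:
        return lp
    return lp + log_likelihood(margins, rows, hca, ratings, sigma)


# Toy data: two games between two hypothetical teams, rows = [1, team_A, team_B].
toy_rows = [[1.0, 1.0, -1.0], [1.0, -1.0, 1.0]]
toy_margins = [5.0, -2.0]
lp_inside = log_posterior(toy_margins, toy_rows, hca=2.0, ratings=[1.0, -1.0], sigma=12.0)
lp_outside = log_posterior(toy_margins, toy_rows, hca=6.0, ratings=[1.0, -1.0], sigma=12.0)
print(lp_inside, lp_outside)
```

An MCMC sampler only needs this quantity up to a constant, which is why the normalizing term in Bayes' rule never has to be computed explicitly.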
The output for every coefficient is its posterior distribution. The following is the posterior distribution for the HCA during the regular season:
The expected value of the posterior distribution is 2.3, while the 94% credible interval (3% to 97%) is [1.5, 3.0]. Essentially, a home team is expected to gain about 2.3 points from its home edge (traditionally attributed to travel/rest and refereeing bias). Moving to the situation in the bubble, we used the same priors for the HCA and the model standard deviation, but for the team ratings we used as priors the posteriors identified by the previous regression. This leads to the following results for the posterior distribution of the HCA in the bubble:
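For readers who want to see how such summaries come out of MCMC draws, the sketch below computes the mean, standard deviation, and an interval from a list of samples. The draws here are synthetic stand-ins generated from a normal distribution (the spread of 0.4 is made up), the 3% and 97% bounds are an assumption about how the interval was reported, and an equal-tailed interval is used for simplicity, which can differ slightly from a highest-density interval.

```python
import random
import statistics
from typing import List, Tuple


def summarize(samples: List[float], lower_pct: int = 3, upper_pct: int = 97
              ) -> Tuple[float, float, Tuple[float, float]]:
    """Mean, standard deviation and an equal-tailed interval of the draws."""
    # 99 cut points; cuts[i - 1] is the i-th percentile of the samples.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    interval = (cuts[lower_pct - 1], cuts[upper_pct - 1])
    return statistics.fmean(samples), statistics.stdev(samples), interval


# Synthetic stand-in for posterior draws of the HCA (not the post's samples).
rng = random.Random(0)
draws = [rng.gauss(2.3, 0.4) for _ in range(20_000)]

post_mean, post_sd, (low, high) = summarize(draws)
print(round(post_mean, 2), round(post_sd, 2), round(low, 2), round(high, 2))
```

One possible way to carry a rating posterior over as a prior for the bubble model is to moment-match it, i.e., use a normal prior with `post_mean` and `post_sd` computed from that team's draws. The post does not say how this step was done, so treat that as an assumption.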
As we can see, the average of the posterior is now smaller than 1 point, while the credible interval spans both sides of 0, i.e., [-1.1, 2.9]. This essentially means that some effect seems to remain, but it is small and not robust based on the credible interval of the posterior. What is this small (non-robust) effect? Many things are possible (with the most plausible explanation being a statistical “anomaly” from the small sample size compared to the pre-covid data). Sure, there is no travel, but there are still referees. Now it is not clear how they would be biased by some logos on the floor or some cyber-fans, but I looked into the data from the L2M (the NBA’s Last Two Minute reports), courtesy of a Hawks’ fan. In particular, I looked at the probability of an incorrect decision (call or no-call) based on whether the side disadvantaged by the call was the “home” or the “visiting” team. In the case of the home team, this probability is 4.6%, while for the visiting team it is 6.2% (the following 7 lines of code, in R for inclusivity, will get you these results). The p-value for the difference is 0.11, so it is “marginally robust” (you can make your own interpretations), which seems to be in agreement with the home court advantage results (maybe a little bit, but not robust).

    l2m <- read.csv("https://raw.githubusercontent.com/atlhawksfanatic/L2M/master/1-tidy/L2M/L2M.csv")
    l2m_bubble <- l2m[which(as.character(l2m$date) > "2020-07-01"),]
    l2m_bubble$incorrect = rep(0, nrow(l2m_bubble))
    l2m_bubble[which(l2m_bubble$decision == "INC" | l2m_bubble$decision == "IC"),]$incorrect = 1
    mod = glm(incorrect ~ disadvantaged_side, data = l2m_bubble, family = "binomial")
    # probability of incorrect call when home team disadvantaged
    predict(mod, data.frame(disadvantaged_side = "home"), type = "response")[1]
    # probability of incorrect call when away team disadvantaged
    predict(mod, data.frame(disadvantaged_side = "away"), type = "response")[1]

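A logistic regression with a single binary predictor reproduces the two group proportions, so a quick way to sanity-check this kind of comparison is a two-proportion z-test. The sketch below swaps in that test in place of the `glm` fit, so its p-value will be close to, but not identical with, the one from the regression. The counts are hypothetical placeholders, not the L2M numbers.

```python
import math
from statistics import NormalDist
from typing import Tuple


def two_proportion_test(k1: int, n1: int, k2: int, n2: int
                        ) -> Tuple[float, float, float, float]:
    """Two-sided z-test for the difference between two proportions.

    k1/n1 and k2/n2 are the incorrect-decision counts and totals for the
    two groups (e.g. home team disadvantaged vs. visiting team disadvantaged).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)  # pooled rate under the null hypothesis
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1, p2, z, p_value


# Hypothetical counts for illustration only.
p_home, p_away, z_stat, p_val = two_proportion_test(k1=10, n1=200, k2=20, n2=200)
print(p_home, p_away, round(z_stat, 2), round(p_val, 3))
```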
Now let’s see how we can use this model to make predictions. Let’s say we want to see how Denver is going to fare in LA (virtual LA) against the Lakers. We can get an estimate for the distribution of the expected point margin for the Lakers, which will be $\mathcal{N}(4.9, 13.4^2)$, i.e., a normal distribution with mean 4.9 points and standard deviation 13.4 (the $\sigma$ parameter of the model). This gives a win probability of approximately 64% for the Lakers in their “home court” and 61% when they cross half court to go to the mile-high bench (i.e., practically the same probability). One can also sample the posterior distributions for each of the variables involved and make the prediction this way, but as one might expect these two will match given enough samples.
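The win probability quoted above follows directly from the normal distribution of the margin, and can be checked with a few lines of standard-library Python. The mean and standard deviation are the values from the post. The Monte Carlo part is a simplified assumption: it draws margins with those two parameters held fixed, rather than drawing the parameters themselves from their posteriors.

```python
import random
from statistics import NormalDist

mean_margin = 4.9  # expected Lakers margin as the "home" team (from the post)
sigma = 13.4       # model standard deviation (from the post)

# Analytic: P(margin > 0) under Normal(mean_margin, sigma).
p_win = 1.0 - NormalDist(mu=mean_margin, sigma=sigma).cdf(0.0)

# Simulation: fraction of simulated margins that are positive.
rng = random.Random(42)
n_draws = 100_000
p_win_mc = sum(rng.gauss(mean_margin, sigma) > 0 for _ in range(n_draws)) / n_draws

print(round(p_win, 3), round(p_win_mc, 3))
```

Both numbers land near the 64% figure, which illustrates the point that the analytic and the sampling-based predictions agree once enough samples are drawn.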