The Athlytics Blog

College Hoops and March Madness

kpele — Fri, 22 Mar 2024 22:03:55 +0000

It is that time of the year when pretty much everyone in college basketball is complaining. There are complaints about the importance of NCSOS, the NET, or any other metric the committee uses to justify which teams make the tournament. Even the teams that make it complain about their seeding. Everyone complains. That is the human thing to do when people are making the decisions. Even though there is a belief out there that the only things that matter to the committee today are analytics and metrics, this is far from true. At the end of the day humans make the decisions, and they will use the numbers in whatever subjective way they see fit. Otherwise Virginia would not have made the tournament this year over, say, Pitt or any of the 3 snubbed Big East teams (St John’s, Providence and Seton Hall). The way the committee operates makes it hard to be consistent from year to year, even if supposedly the same things are looked at. To make the whole selection process more transparent and consistent, I think the NCAA needs to draw parallels with, and design a process similar to, how UEFA decides which teams from each country compete in its international competitions (e.g., the Champions League).

At a high level, every country has a predetermined number of teams that qualify for the Champions League based on their final position in their national league table. Of course, not all national leagues are equal; it is tougher to finish 3rd in the Spanish Primera División than to win the Greek league. How does UEFA decide how many teams from each country will participate? It uses a ranking system based on the results of each country’s teams in its international competitions over the last 5 years. In many cases, not all of a country’s teams qualify directly for the group stage of the Champions League; instead, they have to go through qualifying rounds played in the summer, before the group stage starts in September. What does all this have to do with college basketball and the NCAA tournament?

Well, the setting is similar. Instead of countries you have conferences, and at the end of the day you want teams to participate in a bigger tournament (the Champions League and the NCAA tournament, respectively) to decide the ultimate winner. However, conferences are of varying strength, just as soccer national leagues are. So the NCAA can take a similar approach, and the first step is to take control of the non-conference schedule. The NCAA should create a scheme that allows it to obtain a conference ranking similar to the UEFA ranking system (one possible way to build such a ranking was described on this blog before). This could be based on random draws among teams from different conferences, possibly with a yearly rotation among conferences, similar to how the NFL does its non-division scheduling. At the end of the non-conference schedule, each conference will know how many teams it will send directly to the 64-team tournament, and how many teams will participate in the qualifiers.

Qualifiers? Well, these would replace the conference tournaments. Conferences would only have regular seasons, and while I understand the appeal of conference tournaments with their surprising runs etc., the same is true for these qualifier tournaments. The schedule could have several rounds, with specific teams getting byes based on their conference and their seed within the conference (even in the Champions League, some teams/countries have to go through 4 rounds and others through only 1 round to reach the group stage).

All of this would keep basically the same number of teams in the tournament and would not take away the excitement of a single-elimination tournament and its Cinderella stories, but it would make things (in my opinion) more transparent and consistent. I am sure no one will complain …

Upping the Ante at the NBA In-Season Tournament

kpele — Mon, 10 Jul 2023 03:14:48 +0000

If you have not heard yet, the NBA will introduce its first-ever in-season tournament, which essentially tries to “copy” the national cups of European soccer leagues (without, however, adding new games for the teams, except for the final). While you can find all the details here, the main point is that teams are placed in 6 groups (3 in each conference), and after group play, where each team plays every other team in its group once, 8 teams will compete for the award in a final-8, single-elimination format.

While I do not have strong feelings about the tournament itself, I still think the league should probably try to strengthen interest in the G League by incorporating its teams in the tournament. But since we are here now, the question is whether the “bragging” rights and the monetary award for the players are enough to make teams take it seriously. After all, we are talking about 4 group-stage games and then potentially another 3 games (only 2 of which count for the regular season). So are these enough to make teams care and give 100% in these few games? Or, from the league’s perspective, if the objective is to increase interest in the regular season, isn’t this just a drop (or two) in the ocean, since we are talking about at most 7 games per team, less than 10% of an 82-game season?

I personally think that the motivation is not enough, and the impact on interest will again be small (of course, small is better than nothing). Remember that teams that win the soccer cups are awarded a spot in next season’s international tournaments, which comes with financial gains (which in turn can help the team improve its roster, etc.). So it is not only about bragging rights. But what if there was a “prize” worth playing for? Something that could help teams improve. I guess you know where I am heading. What if the team that won the tournament was awarded 2 (tradeable) second-round picks in the next draft? The order of these picks would be chosen via a random draw. This sounds easy to implement, and the award is valuable enough for teams to be interested in winning the tournament. While I like this as bait for teams, one of the problems is that teams that decided to tank before the season started (or that are in constant tank mode) most probably will not worry about 2 second-round picks, since they are going after the big prize of the top overall pick. Of course, the number of tournament games is low enough that they might still try to win those games, since these games are not expected to have a huge impact on their tanking efforts. What else could make these “tanking” teams really try to win the tournament? A first-round pick swap; that’s what. Before the tournament (or the season) begins, each team submits a sealed bid naming the team it wishes to swap first-round picks with if it wins the tournament. The bids are made public only after the regular season ends. What is the impact of this? Playoff teams would like the chance to make the playoffs and still get a higher draft pick, which means they will try to win these games.
Tanking teams, on the other hand, want to avoid having their precious pick snatched by another team, and hence they want to win the tournament themselves. These teams might also be “forced” to try to win the rest of their games if they do not win the in-season tournament, since they will be candidates for a pick swap! So everyone wants to win, and most probably for the whole season.

Obviously, there are many details that need to be considered. For example, what if the winning team’s bid names a team that ends up with a worse draft pick than the winner’s own? Then we could fall back to the 2 second-round picks. Also, instead of 1 game against every other team in the group, we could go to 2, or even better 3, games between each pair of teams in the group phase (as they are in the same conference). One problem this might create is that the final-8 pairings might involve teams that have already played their regular-season series. Scheduling in this case can be trickier (but still doable), or, if the teams have already played in the regular season, the results of those games could automatically fill the bracket. This has the added benefit of “forcing” teams to care about other regular-season games that might indirectly impact the tournament. I do not think that anyone will agree to this (i.e., using earlier results), but I still think the scheduling can be tweaked to accommodate these intricacies.

The main issue I see teams having is that the pick swap could hurt competitive balance. E.g., teams that are truly bad and need these high draft picks might be swapped out of their chance to improve. This is certainly a possibility, but on the other hand it might discourage teams from getting really bad and relying on prolonged tanking to improve. An alternative to the pick swap would be to give the winner of the in-season tournament a lottery probability (if they are already in the lottery, their odds are adjusted upwards). This way the truly bad teams still have a (lowered) chance to get a high draft pick. However, this does not necessarily discourage tanking, as those teams still have a pretty good shot at a high draft pick, while the pick swap might send them towards the end of the first round.

Bottom line, in my opinion, the league needs to up the ante (well, I had to use the title) to make sure that teams are really interested in winning the in-season tournament. Depending on the structure of the reward, it might have trickle-down effects in other areas that the league has been trying to improve for a long time (e.g., tanking). I am sure someone will come up with a better reward structure than the one I mentioned here, but I just do not think “bragging” rights for a new cup are enough to make teams try harder during these games.

Leading Effects in the NFL: A mixed effects model approach

kpele — Tue, 27 Sep 2022 13:09:48 +0000

Have you ever felt that a team, after jumping to an early lead or even blowing out its opponent, starts playing worse? If so, you are not alone, and it turns out that this is not some cognitive illusion but a robust phenomenon. Jeremias Engelmann was the first, to the best of my knowledge, to show that there are (linear) leading effects in the NBA; that is, the team in the lead plays approximately 0.35 pts/100 possessions worse for every point of lead. So I thought: let’s explore this for the NFL, and while we are at it, let’s talk a bit about mixed effects models.

A linear mixed effects model takes the following form: $y = X\beta + Zu + \varepsilon$, where $\beta$ and $u$ are the fixed and random effects vectors respectively, $X$ and $Z$ are the corresponding design matrices (i.e., the data) relating the dependent variable $y$ to the mixed effects, and $\varepsilon$ is the error term. Without going into the theory behind these models, intuitively you can think of the fixed effects as variables whose “impact”/relationship with $y$ does not vary across the population. The random effects, on the other hand, are parameters that are themselves random variables. For example, we could say that our random effects follow a normal distribution with a given mean and variance, e.g., $u \sim \mathcal{N}(0, \sigma_u^2 I)$. Compared with traditional linear regression, where the data are random variables and the parameters are all fixed effects, in mixed effects models the data are still random variables, but the parameters are random variables at one level and fixed at the highest level. This is why these models are also called multilevel or hierarchical models. The benefit of random effects is that their estimation provides inference about the specific levels (similar to a fixed effect), but also population-level information.

The fixed effects essentially define the slope of the line (with respect to each variable), while the random effects define a different intercept for each “subpopulation”. These different intercepts can be thought of as draws from a random variable (hence the name “random effects”). In our case we have two fixed effects: the score differential at the beginning of the drive and the field position at the beginning of the drive (distance from the goal line). We also have random effects for the team in possession and the team on defense. So you can imagine that these random effects capture the (constant) offensive and defensive ability of each team (points scored per drive), while the fixed effects capture the relationship of score differential and field position with the points scored per drive. Other settings are possible too, but we do not use them here. For example, if we believe that different teams will be impacted differently by the point differential or the starting field position, we can have random slopes as well, where the slopes differ across groups (in our case, teams). Nevertheless, in our analysis we only use fixed effects and random intercepts.
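To make this specification concrete, here is a minimal Python sketch on simulated drives (it is not the linked code, which may well use a dedicated mixed-model package). It solves Henderson’s mixed model equations, which return the fixed effects and the predicted offensive and defensive random intercepts when the ratio of residual variance to random-effect variance is known. That ratio, the team abilities, and a true leading effect of -0.01 points per drive (i.e., -0.1 pts/10 drives) are all assumptions of the simulation; a full mixed-model fit would also estimate the variance components.

```python
import numpy as np


def fit_mixed_model(X, Z, y, var_ratio):
    """Solve Henderson's mixed model equations.

    X: fixed-effect design (intercept, score differential, yards to goal)
    Z: random-effect design (one-hot offense team, one-hot defense team)
    var_ratio: sigma_e^2 / sigma_u^2, assumed known in this sketch
    Returns (beta, u): fixed effects and predicted random intercepts.
    """
    p, q = X.shape[1], Z.shape[1]
    lhs = np.block([
        [X.T @ X, X.T @ Z],
        [Z.T @ X, Z.T @ Z + var_ratio * np.eye(q)],
    ])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:p], sol[p:]


# Simulated drives (toy data: matchups are random; all numbers are assumptions)
rng = np.random.default_rng(0)
n_teams, n_drives = 8, 20_000
off_team = rng.integers(0, n_teams, n_drives)
def_team = rng.integers(0, n_teams, n_drives)
score_diff = rng.integers(-21, 22, n_drives)      # offense lead at drive start
yards_to_goal = rng.integers(20, 81, n_drives)    # field position at drive start
off_ability = rng.normal(0, 0.5, n_teams)
def_ability = rng.normal(0, 0.5, n_teams)
true_lead_effect = -0.01                          # points per drive per point of lead

points = (2.0 + true_lead_effect * score_diff - 0.03 * (yards_to_goal - 50)
          + off_ability[off_team] + def_ability[def_team]
          + rng.normal(0, 2.5, n_drives))

X = np.column_stack([np.ones(n_drives), score_diff, yards_to_goal])
Z = np.zeros((n_drives, 2 * n_teams))
Z[np.arange(n_drives), off_team] = 1.0             # offense random intercepts
Z[np.arange(n_drives), n_teams + def_team] = 1.0   # defense random intercepts

beta, u = fit_mixed_model(X, Z, points, var_ratio=2.5**2 / 0.5**2)
print(f"leading effect: {10 * beta[1]:.3f} pts/10 drives per point of lead")
```

Because the random intercepts are penalized by the variance ratio, each team’s offensive and defensive estimate is shrunk toward the population mean, which is exactly the “population-level information” that random effects provide.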

We used play-by-play data since 2000 and estimated the mixed effects models on a 10-year rolling window (you can find the code here). We excluded drives that were not completed before the end of the half or game. Overall, during the past 10 years, teams have played 0.1 pts/10 drives worse for every point of lead. Furthermore, these leading effects have been relatively stable over the period we examined (despite the slight decrease we see, the estimated confidence intervals overlap), with values ranging from -0.07 to -0.1 pts/10 drives, as we can see in the figure below.

Now, if you have been following my analysis closely, you might be wondering “wait, did he just do what I think he did?” and you would be right. I used a linear model when I should have used a multinomial one, where the classes would be the outcome at the end of the drive. That would have been the right thing to do, but I went with the linear model simply because its coefficients are easier to interpret. However, if you run a multinomial mixed effects model, you will see that for classes that correspond to higher scores, their probability is higher with an increase in the score differential.

A detailed guide for developing player ratings for WNBA (and other leagues)

kpele — Thu, 18 Aug 2022 14:28:11 +0000

DARKO, RPM, RAPTOR, CARMELO, PIPM, LEBRON and many others are just some of the player rating models that exist for the NBA. Unfortunately, I have not managed to find similar player ratings for the WNBA (this does not mean they do not exist; I am not particularly known for my search skills, in real or digital life), so I thought this was a nice opportunity to (i) create WNBA player ratings and (ii) provide a fairly detailed description of the underlying model. In a nutshell, the model combines two components:

A boxscore-based player rating that serves as a prior for final player ratings
An on/off component based on adjusted +/- regression

These two components are then combined through a regularized linear regression, where the player ratings are shrunk towards their (boxscore) prior rating. We will see each one of these components in detail in what follows.

Boxscore prior

Statistical plus-minus (SPM) models can be thought of as an approximation of traditional adjusted plus-minus models when on/off lineup data are not available. For example, basketball-reference’s Box Plus-Minus (BPM) is one such SPM. In general, these models estimate plus-minus ratings from boxscore inputs. One of the problems with the WNBA is that we (I) do not have historic data to build an SPM, so we take a slight reroute. We start by building a simple model that estimates offensive and defensive BPM for the NBA, and we then apply this same model to the WNBA. While there are certainly other approaches one can take, this was convenient since basketball-reference has all the data needed for both leagues (and yes, I know that BPM is itself an SPM, but we are talking about convenience here!). For estimating offensive BPM we used the following model:

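obpm_fit <- lm(OBPM ~ as.factor(Big) + MPG + TS. + TPRate + FTr + ORB. + AST. + TOV. + USG. + OnCourt*OnOff, data = nba)  # nba: hypothetical name for the NBA player-season data frame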

where Big is a binary variable indicating whether the player is a PF/C. We also scale minutes per game (MPG) to a 48-minute game in order to have a common point of reference (since the model was trained on NBA data, while WNBA games last 40 minutes). For estimating defensive BPM we used the following model:

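dbpm_fit <- lm(DBPM ~ as.factor(Big) + MPG + DRB. + STL. + BLK. + OnCourt*OnOff + PF, data = nba)  # nba: hypothetical name for the NBA player-season data frame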

The estimated models fit well, with an out-of-sample standard error of 2.8 points for BPM. The correlation (see the following figure) between BPM and the estimated version is 0.86, with the offensive component having an even higher correlation (0.88) and the defensive component a lower one (~0.7).

The out-of-sample estimation for BPM has a standard error of 2.8 points and a correlation of 0.86 with the actual BPM.

Now we use these models to create the boxscore component of our WNBA player ratings. Basketball-reference provides the same statistics for the WNBA, so the estimation is straightforward. However, the reason I mentioned a slight reroute above is that it is not clear whether this “transfer learning” (abusing the terminology a bit) is accurate in this situation (i.e., translating from the NBA to the WNBA). Nevertheless, we are going to ride with it.
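As a rough Python analogue of this NBA-to-WNBA step, the sketch below fits an ordinary least-squares model (the counterpart of `lm()`) on simulated “NBA” rows and applies it to hypothetical WNBA players after putting their minutes on a 48-minute scale. It assumes a simple proportional rescaling (MPG × 48/40), uses only three of the predictors above, and every coefficient and player row is made up; it illustrates the mechanics, not the actual models.

```python
import numpy as np


def to_48_minutes(mpg, game_length=40.0):
    """Rescale minutes per game to a 48-minute game (assumed proportional)."""
    return np.asarray(mpg, dtype=float) * 48.0 / game_length


# Simulated "NBA" training rows: [Big (0/1), MPG on a 48-min scale, USG%]
rng = np.random.default_rng(1)
n = 500
nba = np.column_stack([
    rng.integers(0, 2, n),
    rng.uniform(10, 38, n),
    rng.uniform(12, 35, n),
])
# Made-up "true" relationship, used only to generate a synthetic OBPM target
obpm = -6.0 - 1.0 * nba[:, 0] + 0.08 * nba[:, 1] + 0.15 * nba[:, 2] + rng.normal(0, 1.5, n)

# OLS with an intercept, the analogue of lm(OBPM ~ ...)
A = np.column_stack([np.ones(n), nba])
coef, *_ = np.linalg.lstsq(A, obpm, rcond=None)

# Hypothetical WNBA players (minutes come from 40-minute games)
wnba = np.array([[1.0, 32.0, 28.0],
                 [0.0, 20.0, 18.0]])
wnba[:, 1] = to_48_minutes(wnba[:, 1])  # put minutes on the NBA scale
obpm_prior = np.column_stack([np.ones(len(wnba)), wnba]) @ coef
print(obpm_prior.round(2))
```

The rescaling matters because the minutes coefficient was learned on 48-minute games; feeding raw WNBA minutes would systematically understate the playing-time term.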

On/Off regression

One of the early innovations in basketball analytics was the development of adjusted plus-minus metrics. The idea was to improve upon raw plus-minus by controlling for the other 9 players on the court with a particular player. There are several ways to set up the adjusted plus-minus model, but here every data point is a single possession (other options include using as an observation the whole stint between substitutions, or a chance). To do this, one needs to know all the offensive and defensive players on the court for each possession. Using these data we can build the design matrix $X \in \mathbb{R}^{n \times (2m+1)}$, where $n$ is the number of possessions in our data and $m$ is the number of players. The $2m+1$ columns correspond to the player ratings (2 for each player: one for the player’s offensive contributions and one for their defensive contributions) and to the intercept of the model, which captures the average points scored per possession. The intercept column is 1 in every row. The rest of the elements are based on the possessions. In particular, for possession $i$, we have $X_{i,j} = 1$ for every column $j$ that corresponds to the offensive rating of one of the 5 players on offense, and $X_{i,k} = 1$ for every column $k$ that corresponds to the defensive rating of one of the 5 players on defense (so a positive defensive coefficient means more points allowed). The rest of the elements in row $i$ are 0. The response variable $y \in \mathbb{R}^{n}$ is the number of points scored in each possession. The OLS solution attempts to identify the vector of coefficients $\beta$ such that $X\beta \approx y$ in the least-squares sense, i.e., it solves the normal equations $X^\top X \beta = X^\top y$. The solution to this linear system is:

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

The above equation gives the unique least-squares solution only if $X^\top X$ is invertible. If this is not the case, we can still use the Moore-Penrose pseudoinverse $X^{+}$ to get the solution $\hat{\beta} = X^{+} y$, which is the minimum-norm least-squares solution (approximate, but still unique). However, regardless of whether we use the inverse or the pseudoinverse, there is a specific problem that will break this computation. In particular, if there is multicollinearity between the columns of $X$, the variance of the estimated coefficients blows up (and the stronger the multicollinearity, the larger the variance). Consequently, these coefficient estimates will not be accurate (and some could be extreme as well). Essentially, small changes in the input can lead to huge swings in the coefficients. This problem is common in our setting, where, for example, columns of $X$ are correlated because two players share the floor the majority of the time.
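The possession-level design matrix described above can be sketched in Python as follows. The rosters, lineups and “true” ratings are simulated (assumptions for illustration), and the solve uses NumPy’s pseudoinverse. Note one property of this particular encoding: in every row the 5 offensive entries sum to 5 times the intercept, so $X^\top X$ is always singular here, which is one more reason the pseudoinverse (or regularization) is needed.

```python
import numpy as np


def build_design_matrix(possessions, n_players):
    """Possession-level APM design matrix.

    Column 0 is the intercept; column 1 + p is player p's offensive rating and
    column 1 + n_players + p is player p's defensive rating.
    """
    X = np.zeros((len(possessions), 2 * n_players + 1))
    X[:, 0] = 1.0
    for i, (offense, defense) in enumerate(possessions):
        X[i, 1 + np.asarray(offense)] = 1.0
        X[i, 1 + n_players + np.asarray(defense)] = 1.0
    return X


# Simulated data: two 10-player rosters with hypothetical per-possession ratings
rng = np.random.default_rng(2)
n_players = 20
rosters = (np.arange(10), np.arange(10, 20))
true_off = rng.normal(0, 0.05, n_players)
true_def = rng.normal(0, 0.05, n_players)

possessions, points = [], []
for _ in range(5000):
    a, b = rosters if rng.random() < 0.5 else rosters[::-1]
    offense = rng.choice(a, 5, replace=False)
    defense = rng.choice(b, 5, replace=False)
    possessions.append((offense, defense))
    points.append(1.0 + true_off[offense].sum() + true_def[defense].sum() + rng.normal(0, 1.0))

X = build_design_matrix(possessions, n_players)
y = np.array(points)

# Offensive columns sum to 5 x intercept, so X^T X is rank deficient
print(np.linalg.matrix_rank(X.T @ X), "<", X.shape[1])
beta_hat = np.linalg.pinv(X) @ y   # minimum-norm least-squares solution
```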

One of the major approaches to solving this problem is regularization, and in particular a specific type of regularization called Ridge regression. As mentioned above, when we cannot solve the normal equations exactly, we get an approximate solution using the Moore-Penrose pseudoinverse. This approximation is with respect to a specific error, namely the squared error:

$$\|y - X\beta\|_2^2 = \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2$$

Ridge regression changes this error slightly by adding the following term:

$$\lambda \|\beta\|_2^2 = \lambda \sum_{j} \beta_j^2, \quad \lambda \ge 0$$

If we express Ridge regression in matrix form, we get the following linear system: $(X^\top X + \lambda I)\beta = X^\top y$, where $I$ is the identity matrix.

The impact of the regularization term on the solution is that it shrinks the coefficients toward 0. For example, if the first term of the error function tries to set a coefficient to a large value (one of the ways multicollinearity manifests itself, as we mentioned above), the regularization term limits this. Regularization reduces the variance (and hence the standard errors) of the coefficient estimates at the cost of adding some bias, which makes the estimates more reliable. An important note here is that the intercept term is not regularized.
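A minimal NumPy sketch of Ridge regression with an unpenalized intercept, on simulated data where two hypothetical players share the floor on roughly 97% of possessions. It illustrates the shrinkage described above: as $\lambda$ grows, the two player coefficients move toward 0 while the intercept is left free. The data-generating numbers are assumptions for illustration.

```python
import numpy as np


def ridge(X, y, lam):
    """Solve (X^T X + lam * D) b = X^T y, where D is the identity except
    that the intercept (column 0) is not penalized."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)


rng = np.random.default_rng(3)
n = 400
a = rng.integers(0, 2, n).astype(float)    # is player A on the court?
b = a.copy()
flip = rng.random(n) < 0.03                 # B is almost always on with A
flip[0] = True                              # guarantee the two columns differ
b[flip] = 1.0 - b[flip]
X = np.column_stack([np.ones(n), a, b])
y = 1.0 + 0.1 * a + 0.1 * b + rng.normal(0, 1.0, n)

for lam in [0.0, 10.0, 100.0]:
    print(lam, ridge(X, y, lam).round(3))
```

With $\lambda = 0$ the two nearly collinear coefficients are free to take large offsetting values; increasing $\lambda$ pulls them toward 0, and for a very large $\lambda$ the intercept approaches the mean of $y$.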

Connection with Bayesian inference: The regularization term in Ridge regression is closely related to Bayesian inference. In particular, we can think of the regularization term as placing a prior centered at 0 on the coefficients; as we obtain more observations (data points), we update the posterior of the coefficients. In fact, the relationship between linear regression and Ridge regression mirrors the relationship between Maximum Likelihood Estimation (MLE) and Maximum A Posteriori estimation (MAP). MAP can be seen as a regularization of MLE. For MLE we are trying to find the parameter $\theta$ that maximizes the likelihood of observing the data $y$, i.e., $p(y \mid \theta)$. For MAP, the likelihood function is now weighted by a prior $p(\theta)$, i.e., we want to maximize $p(y \mid \theta)\,p(\theta)$. Another way to see this is that MLE is a special case of MAP when the prior is uniform. Taking the logarithm of the MAP objective, we finally have to maximize $\log p(y \mid \theta) + \log p(\theta)$, which has the same form as Ridge regression: with a Gaussian likelihood $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$ and a Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$, maximizing it is equivalent to minimizing $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ with $\lambda = \sigma^2/\tau^2$.

Total Player Ratings

Now that we have seen the two separate components of our player ratings, let’s see how we can combine them to get our final/total player ratings. The on/off lineup data can clearly be used to create the design matrix $X$. What about the boxscore component? Well, I have given you a hint by calling it “Boxscore prior”. The total player ratings combine the two through ridge regression. However, instead of shrinking every coefficient toward 0 (as ridge regression does by default), we shrink every coefficient toward its boxscore-based rating. This way the statistical plus-minus acts as a prior, and we also mitigate the collinearity that might exist between players. The optimization problem is now written as:

$$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta - \beta_{prior}\|_2^2$$

where $\beta_{prior}$ is the vector with the boxscore-based ratings. In matrix form we have the following linear system:

$$(X^\top X + \lambda I)\beta = X^\top y + \lambda \beta_{prior}$$

Choosing $\lambda$: The value of $\lambda$ controls the balance between the contributions of the two components of the player ratings, so we need to decide what value it should take. If we set it close to 0, we essentially get the traditional linear regression solution. If we set it to an extremely large value, we essentially get the boxscore priors as our solution. To choose $\lambda$ we rely on a validation set; that is, we randomly split the matrix $X$ (and the response vector $y$) into a training set $(X_{train}, y_{train})$ and a validation set $(X_{val}, y_{val})$. For different values of $\lambda$, we train the model on $X_{train}$ and evaluate its (predictive) performance on $X_{val}$. We then choose the value of $\lambda$ with the best performance. To be completely correct in this approach, though, we also need to build the boxscore priors purely from the training possessions. Otherwise we could have data leakage, since we would have used the possessions in the validation set to build the prior. This might be hard to achieve depending on how you collect your data. For example, if you are collecting your boxscores from basketball-reference (as I do), it is practically impossible to break them down by possession. One solution is to do the train-validation split chronologically, i.e., train on possessions up to a specific date and use the rest of the data as the validation set. This allows you to get the boxscore data up to the date of the split. While data leakage is certainly a big problem in general, I am not certain how much it impacts this specific setting (especially if you are not interested in predictions but rather in descriptive player ratings). After all, we use the boxscores just as a prior, and the more possessions we observe, the less it matters.
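Here is a minimal NumPy sketch of the prior-shrinkage system $(X^\top X + \lambda I)\beta = X^\top y + \lambda \beta_{prior}$ (with the intercept left unpenalized) and of choosing $\lambda$ on a chronological validation split. The design matrix, the “true” ratings and the boxscore prior are simulated, the rows are assumed to be in chronological order, and the candidate grid for $\lambda$ is arbitrary. In a real pipeline the prior would also have to be built only from the training period, as discussed above.

```python
import numpy as np


def ridge_to_prior(X, y, prior, lam):
    """Minimize ||y - X b||^2 + lam * ||b - prior||^2, intercept (column 0) unpenalized."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y + lam * D @ prior)


rng = np.random.default_rng(4)
n, m = 3000, 12                                   # possessions, rating columns
X = np.column_stack([np.ones(n), (rng.random((n, m)) < 0.4).astype(float)])
true_beta = np.concatenate([[1.0], rng.normal(0, 0.1, m)])
y = X @ true_beta + rng.normal(0, 1.0, n)
# Hypothetical boxscore prior: a noisy version of the truth (intercept entry unused)
prior = np.concatenate([[0.0], true_beta[1:] + rng.normal(0, 0.05, m)])

# Chronological split: rows are assumed to be ordered by date
split = int(0.8 * n)
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
val_mse = [np.mean((y_va - X_va @ ridge_to_prior(X_tr, y_tr, prior, lam)) ** 2)
           for lam in lams]
best_lam = lams[int(np.argmin(val_mse))]
print("validation MSE:", [round(float(e), 4) for e in val_mse], "best lambda:", best_lam)
```

For very large $\lambda$ the solution collapses onto the prior, and for $\lambda = 0$ it is plain least squares, matching the two extremes described in the text.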

WNBA 2022 season

Now that we have seen the process of estimating the player ratings, let’s look at the ratings themselves for the current WNBA season (full regular season). I used data from basketball-reference for the boxscore priors, and play-by-play data with on/off information generously provided by Andrew Patton. You can find the full list of ratings for 2022 here (and hopefully I will update the data with historic ratings from past seasons). Below are the top 15 players according to the total rating.

Top-15 WNBA players for the 2022 (regular) season (Note that a negative rating for defense is good, while a positive is bad)

As we can see, at the top of the list are the two frontrunners for the MVP award, A’ja Wilson and Breanna Stewart, and overall the list passes the “eye test”. Wilson also has the best defensive rating at -5.86, and she is a candidate for DPOY as well. If we were going purely by these ratings (which we do not and should not), they suggest that Wilson has an excellent chance of winning both MVP and DPOY, a historic double. The On/Off ratings use regularization to deal with the multicollinearity, but they shrink every coefficient toward 0 (i.e., they do not use the boxscore prior). Furthermore, the out-of-sample error of the total player ratings was smaller than that of the On/Off-only ratings. This emphasizes the importance of the boxscore prior, particularly for leagues with few games (e.g., the Euroleague and other international leagues). We also ran a pure linear regression (with no regularization), and here are the ratings we got for the above players:

Using pure linear regression on On/Off data results in possibly noisy estimates

As we can see, the ratings we got vary much more. This by itself is not necessarily bad, but given what we know about collinearity, these ratings are more likely to capture noise than signal, reiterating the importance of the boxscore prior in this setting.

There are many other approaches one can take to improve these metrics. For example, one might want to account for garbage time. One way to do that is to remove garbage-time possessions from the data. Another, more principled (and my personally preferred) approach is to solve a weighted version of the regression above, where the weights are a function of win probability, or of point differential and time remaining, etc. With $W$ being an $n \times n$ diagonal matrix whose $i$-th diagonal element $w_i$ is the weight for data point $i$, we have the following:

$$\min_{\beta} \; (y - X\beta)^\top W (y - X\beta) + \lambda \|\beta - \beta_{prior}\|_2^2 \quad\Longrightarrow\quad (X^\top W X + \lambda I)\beta = X^\top W y + \lambda \beta_{prior}$$
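The weighted version can be sketched in NumPy as follows. Here `garbage_time_weight` is a hypothetical weighting function (not from the original analysis) that down-weights possessions with large margins late in games; any function of win probability, point differential and time remaining could take its place. $W$ is never formed explicitly, since scaling the columns of $X^\top$ by the weight vector is equivalent to computing $X^\top W$. The data are simulated.

```python
import numpy as np


def weighted_ridge_to_prior(X, y, w, prior, lam):
    """Solve (X^T W X + lam*D) b = X^T W y + lam*D prior, with W = diag(w)
    and D the identity except for an unpenalized intercept in column 0."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0
    XtW = X.T * w                                  # equals X.T @ np.diag(w)
    return np.linalg.solve(XtW @ X + lam * D, XtW @ y + lam * D @ prior)


def garbage_time_weight(point_diff, seconds_left):
    """Hypothetical weight in (0, 1]: shrinks as the margin grows relative to
    the square root of the minutes remaining."""
    minutes_left = np.maximum(np.asarray(seconds_left, dtype=float) / 60.0, 1.0)
    excess = np.maximum(np.abs(point_diff) / np.sqrt(minutes_left) - 3.0, 0.0)
    return 1.0 / (1.0 + excess)


rng = np.random.default_rng(5)
n, m = 2000, 6
X = np.column_stack([np.ones(n), (rng.random((n, m)) < 0.5).astype(float)])
y = X @ np.concatenate([[1.0], rng.normal(0, 0.1, m)]) + rng.normal(0, 1.0, n)
w = garbage_time_weight(rng.integers(-30, 31, n), rng.integers(0, 2881, n))
prior = np.zeros(m + 1)
print(weighted_ridge_to_prior(X, y, w, prior, lam=50.0).round(3))
```

With all weights equal to 1 this reduces to the unweighted prior-shrinkage system, so the weighting only changes how much each possession counts.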

Another possibility is to build metrics where the dependent variable is not the points scored but the win probability added (this automatically takes care of garbage time as well). This is what the IMPACT ratings we discuss in Mathletics do. You can also create different priors based on play type or tracking data. In general, the options are limited only by our imagination (and the data available).

We need to start counting assists that lead to made free throws.

kpele — Fri, 01 Jul 2022 14:04:33 +0000

Yesterday I was looking at the play-by-play of the game between Greece and Great Britain in the 2023 World Cup qualifiers, and I noticed the following:

Dorsey passed to Antetokounmpo, who drew a shooting foul. The moment Antetokounmpo made the second free throw, Dorsey was credited with an assist. I am not sure when FIBA started counting these types of assists, but I like it! NBA scorekeepers, despite sometimes being too liberal in crediting assists, do not credit such assists. This is understandable for the sake of comparisons with earlier seasons, when these were not counted, but I still do not think it makes sense not to count them.

So I took a look at these assists for this past NBA season, and you can find the data here. As you will see, Darius Garland played 70 games this season (regular season and play-in) and had 83 foul assists. This would have moved his assists per game to 9.7 instead of his current 8.5. Obviously this would be the case for other players as well, but these “lost assists” are part of “creating for teammates”. Now, I have heard the argument that if the fouled player scores only 1 FT, then the assist does not have the same “value” as one on a made FG. While this is (partially) true, we still count passes that lead to made 2-point and 3-point shots as 1 assist each. In fact, the difference in points created between foul assists and FG2 assists is only 0.33 pts per assist, which is much less than the 1-point difference between FG2 and FG3 assists. Maybe we should start counting fractional assists (but this will never happen :)). Points generated from assists are a better indicator of creation, although other things need to be considered there (e.g., sure, a player created a wide-open three-point shot, but making that shot is not a gimme like a layup).
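The per-game adjustment above is simple arithmetic: spread the foul assists over the games played and add them to the official assists per game. A tiny Python check using the Garland numbers from the text (since 8.5 is itself a rounded average, the result is approximate):

```python
def apg_with_foul_assists(official_apg, games, foul_assists):
    """Assists per game if assists leading to made free throws were also credited."""
    return official_apg + foul_assists / games


print(round(apg_with_foul_assists(8.5, 70, 83), 1))  # 9.7
```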

European Basketball Leagues Ranking: An optimization approach

kpele — Fri, 05 Nov 2021 15:30:15 +0000

A common question is how to translate the performance of a team in its national league to an international competition. E.g., if AEK Athens had a net rating of +10 in the Greek league (wishful thinking these days, but one can only dream), how does this translate to the Euroleague? Having this translation can also allow us to translate player performance. There are various ranking lists online, based on a variety of factors. For example, for domestic league rankings, TV revenues/ratings, fan attendance and facilities have been considered as inputs to the rankings from FIBA and later from the Euroleague Basketball Company.

In this article, I will provide an optimization-based approach to obtaining league ratings (and consequently rankings). The idea is simple: for teams that play in both domestic and international competitions, we calculate the difference between the net ratings they posted in each of the competitions they participated in. Then we find league ratings that fit these differences as closely as possible. For example, if AEK Athens had a net rating of +5.4 in the Greek league during the season and a net rating of +1.2 in the Basketball Champions League, then the difference between these ratings is 4.2, and we want to find ratings $R_{GR}$ for the Greek league and $R_{BCL}$ for the Basketball Champions League such that $R_{BCL} - R_{GR} = 4.2$. Formally, with $r_{t,l}$ representing the net rating of team $t$ in league $l$ and $R_l$ being the rating of league $l$, we find the ratings by solving the following constrained optimization problem:

$$\min_{R} \; \sum_{t} \sum_{\{i, j\} \subseteq \mathcal{L}_t} \Big( (r_{t,i} - r_{t,j}) - (R_j - R_i) \Big)^2$$

$$\text{s.t.} \quad \sum_{l \in \mathcal{L}} R_l = 0$$

where $\mathcal{L}$ is the set of leagues and $\mathcal{L}_t \subseteq \mathcal{L}$ is the set of leagues team $t$ played in. The constraint “forces” an average league to have a rating of 0. Using data from the last three pre-COVID seasons, I got the following rankings.

The way to interpret these ratings is as follows: with the Euroleague having a rating of +33.11 and the Greek HEBA A1 a rating of +14.66, an average Euroleague team is better than an average Greek HEBA A1 team by 33.11 - 14.66 = 18.45 points per 100 possessions.
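A minimal NumPy sketch of this constrained least-squares problem, solved through its KKT (Lagrange multiplier) system. The first observation is the AEK example from the text; the other two teams and their net ratings are hypothetical. The KKT system is solvable as long as all leagues are connected through teams that played in more than one of them.

```python
import numpy as np


def league_ratings(observations, leagues):
    """Constrained least squares for league ratings.

    observations: (net_i, league_i, net_j, league_j) for a team that played in
    both leagues. Minimizes sum(((net_i - net_j) - (R_j - R_i))**2) s.t. sum(R) == 0.
    """
    idx = {name: k for k, name in enumerate(leagues)}
    L = len(leagues)
    A = np.zeros((len(observations), L))
    b = np.zeros(len(observations))
    for row, (net_i, league_i, net_j, league_j) in enumerate(observations):
        A[row, idx[league_j]] = 1.0
        A[row, idx[league_i]] = -1.0
        b[row] = net_i - net_j
    # KKT system: [[2 A^T A, 1], [1^T, 0]] [R; mu] = [2 A^T b; 0]
    kkt = np.zeros((L + 1, L + 1))
    kkt[:L, :L] = 2.0 * A.T @ A
    kkt[:L, L] = 1.0
    kkt[L, :L] = 1.0
    rhs = np.concatenate([2.0 * A.T @ b, [0.0]])
    solution = np.linalg.solve(kkt, rhs)
    return dict(zip(leagues, solution[:L]))


obs = [
    (5.4, "Greek", 1.2, "BCL"),           # AEK example from the text
    (8.0, "Greek", -2.0, "Euroleague"),   # hypothetical team
    (3.0, "BCL", -6.0, "Euroleague"),     # hypothetical team
]
ratings = league_ratings(obs, ["Greek", "BCL", "Euroleague"])
print({k: round(float(v), 2) for k, v in ratings.items()})
# {'Greek': -4.73, 'BCL': -1.6, 'Euroleague': 6.33}
```

The three hypothetical differences are mutually inconsistent (4.2 + 9.0 is not 10.0), so the solver returns the ratings that fit them best in the least-squares sense, and the sum-to-zero constraint only fixes the overall level.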

One thing to consider is that league “power” can change over the years, so these rankings should be recalculated periodically. Also, and most importantly, teams (at least those playing at a high level of international competition) often treat domestic games differently, in terms of rotations etc., and hence they can underperform. However, overall I do not anticipate large differences, at least in terms of ranking. Finally, the optimization problem can be slightly tuned to solve a weighted version of the objective function, where the weight of each data point is a function of the number of games the team played in these leagues. For now, I have just considered cases where the team played at least 10 games in each league, but more sophisticated approaches would certainly improve the ratings.

Single game player contribution metrics in basketball

kpele — Sat, 02 Jan 2021 15:54:49 +0000

While in the NBA you will rarely see a game’s stat sheet mention metrics like PER and PIR, this is quite the norm for leagues around the world. In fact, these metrics are used to award game MVPs in these leagues – and as you can imagine, many times the results are strange, to say the least. Overall, the pitfalls of these types of metrics have been well documented. Another alternative to PER and PIR, one that you can also find in an NBA box score, is the raw +/-. When you look at the single-game +/- and put it in the context of what happened in the game, it can provide some useful information. However, if you just take this number at face value, you run into problems similar to those of PER and PIR. It should also be evident that an adjusted plus-minus regression approach (regularized or not) is going to be very noisy for a single game. Can we develop a method that emulates adjusted plus/minus over whole season(s) but operates on a single-game unit? (I will pretend that single-game box plus-minus does not exist, so I have an excuse to use some ideas from game theory.)

Even though basketball (like sports in general) is a game, game theory has not been utilized enough – imo – to model interactions between teams, players, lineups etc. Given our objective, and the fact that teammates cooperate (at least we would hope they do) to achieve their goal of winning a game, it is natural to look at tools from cooperative game theory. An important concept – Nobel-important concept – in cooperative games is the Shapley value, a solution concept for such games. In brief, it allocates the surplus generated by a coalition of players according to each player’s contribution to this total surplus. But wait, this sounds pretty much like what we are trying to do for the players of a team in a single basketball game!

Let’s start with some definitions that will be useful for mapping our problem and data to the concept of Shapley values. In a coalition game we have a set of players $N$ and a characteristic function $v$ that is defined over the power set of $N$: $v: 2^N \rightarrow \mathbb{R}$, with $v(\emptyset) = 0$. The characteristic function captures the value of a coalition $S \subseteq N$, and it essentially represents the expected sum of payoffs from the cooperation of the players in $S$. The Shapley value of a player $i$ then distributes the total gain of the game to the players in a fair manner. Fairness here is related to specific mathematical properties satisfied by the solution. While this is interesting by itself, it is out of the scope of this post (you can read more about this and Shapley values in general here). However, just for completeness, here is the credit each player $i$ is allocated according to the Shapley value:

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \big( v(S \cup \{i\}) - v(S) \big)$$

Let’s now get back to the specific problem we want to solve using Shapley values. In our case we have a team of basketball players, and the coach decides during the game which 5-player coalitions to play. For each of those we also observe their value, which is the net rating the coalition (lineup) had during the game; given the small sample sizes, a Bayesian adjustment of these ratings is more appropriate. So far so good, but from the definition of the characteristic function above, we can see that we need to define $v(S)$ for each possible subset $S$ of the players on the team (those who played in the game). For coalitions of up to 5 players this is straightforward: a 5-player coalition is a lineup, and we can also obtain the net ratings for 2-, 3- and 4-man lineups (and the on-off ratings for single players). The question is how to define the value of coalitions with more than 5 players. For this case we can aggregate the results of every lineup consisting exclusively of players in the coalition at hand. For example, to estimate the characteristic function for a coalition of 7 players, we calculate the average net rating of all 5-man lineups that played and consist only of players from this set of 7.
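A minimal sketch of exact single-game Shapley values in plain Python, under several assumptions that are not from the post: stints are hypothetical `(lineup, points_for, points_against, possessions)` tuples rather than bigballR output, a single player’s value is his on-court net rating (not on-off), the average for coalitions larger than 5 is possession-weighted, coalitions that never appear get a value of 0, and no Bayesian adjustment is applied.

```python
from functools import lru_cache
from itertools import combinations
from math import factorial

def net_rating(stints):
    """Point differential per 100 possessions over a list of stints."""
    diff = sum(pf - pa for _, pf, pa, _ in stints)
    poss = sum(p for _, _, _, p in stints)
    return 100.0 * diff / poss if poss > 0 else 0.0

def make_characteristic_function(stints):
    """stints: (lineup, points_for, points_against, possessions), lineup a frozenset of 5 names."""
    @lru_cache(maxsize=None)
    def v(coalition):
        if not coalition:
            return 0.0
        if len(coalition) <= 5:
            # possessions where every coalition member was on the floor together
            matched = [s for s in stints if coalition <= s[0]]
        else:
            # possessions whose 5-man lineup used only coalition members
            matched = [s for s in stints if s[0] <= coalition]
        return net_rating(matched)  # 0.0 if the coalition never played
    return v

def shapley_values(players, v):
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                S = frozenset(subset)
                total += weight * (v(S | {p}) - v(S))
        phi[p] = total
    return phi

# Hypothetical game with six players and two stints
stints = [
    (frozenset("ABCDE"), 20, 10, 20),
    (frozenset("ABCDF"), 10, 15, 10),
]
players = list("ABCDEF")
v = make_characteristic_function(stints)
phi = shapley_values(players, v)
print({p: round(x, 1) for p, x in phi.items()})
```

By the efficiency property the values add up to $v(N)$, which here is the team’s net rating for the whole game (+16.7 in this toy example). The loop visits $2^{n-1}$ coalitions per player, which is why the approach gets expensive beyond 10-12 players.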

Let’s see an example here. Of course, I will choose one of Pitt’s games and let’s go with the first conference game of the season against Miami. Despite the fact that Miami had only 6 or 7 scholarship players available for the game, they fought well (partially due to early foul troubles for Champagnie and Johnson) before Pitt eventually got away in the latter part of the second half with a 70-55 win. Using the excellent R package from Jake Flancer, bigballR, I was able to obtain the lineup information needed to calculate the Shapley Player Value (SPV) for that game (code is available here).

Player	SPV
Abdul Karim Coulibaly	+9.6
Audiese Toney	+13.8
Femi Odukale	+3.6
John Hugley	-3.6
Ithiel Horton	+3.9
Nike Sibande	-4.0
Terrell Brown	-4.5
Noah Collier	+5.5
Justin Champagnie	+5.5
William Jeffress	-1.9
Onyebuchi Ezeakudo	(-13.2)
Xavier Johnson	+6.1

SPV for the Pitt players in the ACC opening game at Miami. Values in parentheses correspond to players who played fewer than 3 minutes.

These SPV values essentially consider all the different lineups that played for Pitt in that game and how they performed, and assign a net rating to each player. According to SPV, Toney was Pitt’s game MVP (which also passes my eye test for that game). Coulibaly also played well, making contributions that would not be captured by bean-counting his points and rebounds, and had an SPV of +9.6. Now, this is not to say that SPV is better than any other single-game metric. Far from it. How would one even go about showing this? SPV is just a “credit allocation” approach and there is no ground truth (if there were, we would not need SPV or any other metric). However, I believe it looks at the contribution of players during a game more holistically than other single-game player ratings. Furthermore, Shapley values deal with collinearity fairly well.

One limitation of SPV is that it does not consider who was on the court for the opposing team when a specific player “coalition” was on the court. However, I would argue that for a single game – and for a descriptive metric – this might not be a big problem (if at all). At the end of the day, SPV allocates credit to players according to who contributed to the team in that specific game. If a player faced the opponent’s second unit but made significant contributions toward winning that specific game, so be it. SPV’s goal is not to predict the future performance of this player; for that, regularized adjusted plus/minus approaches are clearly better. However, there is an interesting question here: can a similar approach be used to estimate player ratings when we only have information about a team’s lineups (e.g., from the NBA stats website) and not about which opponents they faced, so that we cannot run the +/- regression? One of the challenges is that the computation becomes intractable when more than 10-12 players are involved, since the number of coalitions doubles with every additional player. For a single game this is rarely the case, but for lineups over a season the number of players involved can easily exceed this. Of course, time/possession limits can be applied and players with few minutes can be grouped together, but certainly more work is needed if one wants to apply the concept of Shapley values to season-long player ratings.

Home Court Advantage, Team Ratings and Bayesian Regression

kpele — Thu, 17 Sep 2020 18:32:47 +0000

Did high seeds get the bad end of the bubble deal in the NBA? Can we “recreate” the equivalent of home court in the bubble? These are questions that I have seen discussed quite often in the media (social and traditional). There are also articles (like this one on ESPN’s site) suggesting that home court advantage is still present in the bubble. This question (together with the hot hand) is one that possibly transcends sports analytics and has “fascinated” people for decades. I took this opportunity to introduce Bayesian linear regression in this context, and at the same time try to answer the question of the bubble court advantage.

To look into this we will start with a regression-based rating system for teams. As I described here, one way to identify team ratings is through a linear regression, where the independent variables are “dummy” variables corresponding to teams and the dependent variable is the final point margin. When you fit this regression, the constant term corresponds to the home court advantage. Using OLS is the “frequentist” approach, and even though I am not taking sides here (supposedly), let’s look at the Bayesian approach to linear regression.
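A minimal OLS sketch of the regression-based ratings in Python with NumPy. It assumes a common coding (possibly not the one in the linked post) where each game row has +1 for the home team, -1 for the away team and a constant column for the HCA. Because each row’s team columns sum to zero, the ratings are only identified up to a constant; `np.linalg.lstsq` returns the minimum-norm solution, which makes them sum to zero. The games and margins are synthetic.

```python
import numpy as np

def design_matrix(games, n_teams):
    """games: list of (home, away) team indices.
    Column 0 is the constant (HCA); column 1 + t is +1 if team t is home, -1 if away."""
    X = np.zeros((len(games), n_teams + 1))
    X[:, 0] = 1.0
    for g, (home, away) in enumerate(games):
        X[g, 1 + home] = 1.0
        X[g, 1 + away] = -1.0
    return X

def fit_ratings(games, margins, n_teams):
    """OLS fit of margin = HCA + r_home - r_away; returns (hca, ratings)."""
    X = design_matrix(games, n_teams)
    coef, *_ = np.linalg.lstsq(X, np.asarray(margins, dtype=float), rcond=None)
    return coef[0], coef[1:]

# Synthetic, noise-free example: 4 teams, everyone hosts everyone once
true_ratings = np.array([3.0, 1.0, -1.0, -3.0])
games = [(h, a) for h in range(4) for a in range(4) if h != a]
margins = [2.5 + true_ratings[h] - true_ratings[a] for h, a in games]
hca, ratings = fit_ratings(games, margins, n_teams=4)
print(round(hca, 3), np.round(ratings, 3))
```

With real margins the fit is no longer exact, but the constant column still plays the role of the home court advantage.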

In this case the dependent variable $y$ is a sample from a normal distribution $y \sim \mathcal{N}(\boldsymbol{\beta}^T \mathbf{x}, \sigma)$. The mean of the distribution is the inner product of the coefficients $\boldsymbol{\beta}$ to be learned and the independent variables $\mathbf{x}$, while the standard deviation $\sigma$ is also learned. The model coefficients are considered to be drawn from a distribution as well, which essentially allows us to get an estimate of their uncertainty. The goal of Bayesian linear regression is to obtain the posterior distribution of the model parameters given the data at hand, $p(\boldsymbol{\beta}, \sigma \mid \mathbf{y}, X)$. Using Bayes’ rule we have $p(\boldsymbol{\beta}, \sigma \mid \mathbf{y}, X) \propto p(\mathbf{y} \mid X, \boldsymbol{\beta}, \sigma)\, p(\boldsymbol{\beta}, \sigma)$. The prior can incorporate domain knowledge, or it can be non-informative, e.g., a uniform distribution. These models are estimated through Markov chain Monte Carlo (MCMC) methods, and a great library for that in Python is `pyMC3`. You can download the code and data for the Bayesian linear regression here.

We start with the situation pre-COVID and build the regression for identifying the ratings and the HCA. We use 10 chains of 2000 samples each, with the following priors for the model parameters:

Team ratings: normal distribution with average 0 and standard deviation 5
HCA: uniform distribution from -5 to 5
Model standard deviation $\sigma$: half-Cauchy with $\beta = 10$.
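The post fits this model with MCMC in `pyMC3`. As a self-contained illustration without that dependency, here is a sketch of a different, closed-form technique: a conjugate Gaussian Bayesian regression with $\sigma$ treated as known and normal priors on every coefficient (instead of the uniform HCA prior and half-Cauchy $\sigma$ above). The data, priors and $\sigma = 12$ are hypothetical. It also shows the “posterior becomes the next prior” step used for the bubble; for simplicity the whole posterior is carried over here, which makes the two-stage result identical to one batch fit, whereas the post carried over only the team ratings and reset the HCA prior.

```python
import numpy as np

def gaussian_posterior(X, y, prior_mean, prior_cov, sigma):
    """Posterior N(mean, cov) of beta for y ~ N(X @ beta, sigma^2 I),
    with prior beta ~ N(prior_mean, prior_cov) and sigma treated as known."""
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + X.T @ X / sigma**2)
    post_mean = post_cov @ (prior_prec @ prior_mean + X.T @ y / sigma**2)
    return post_mean, post_cov

# Synthetic games: column 0 is HCA, column 1 + t is +1 (team t home) / -1 (team t away)
rng = np.random.default_rng(0)
n_teams = 6
true_beta = np.concatenate([[2.0], rng.normal(0.0, 4.0, n_teams)])
games = [(h, a) for h in range(n_teams) for a in range(n_teams) if h != a] * 4
X = np.zeros((len(games), n_teams + 1))
X[:, 0] = 1.0
for g, (h, a) in enumerate(games):
    X[g, 1 + h], X[g, 1 + a] = 1.0, -1.0
sigma = 12.0  # assumed known here; the post learns it with a half-Cauchy prior
y = X @ true_beta + rng.normal(0.0, sigma, len(games))

# Hypothetical priors: HCA ~ N(0, 3), team ratings ~ N(0, 5)
prior_mean = np.zeros(n_teams + 1)
prior_cov = np.diag([3.0**2] + [5.0**2] * n_teams)

half = len(games) // 2
# stage 1 ("before"), then stage 2 ("after") using the stage-1 posterior as its prior
m1, c1 = gaussian_posterior(X[:half], y[:half], prior_mean, prior_cov, sigma)
m2, c2 = gaussian_posterior(X[half:], y[half:], m1, c1, sigma)
# the same games processed in a single batch
mb, cb = gaussian_posterior(X, y, prior_mean, prior_cov, sigma)

hca_sd = np.sqrt(c2[0, 0])
print(f"HCA posterior mean {m2[0]:.2f}, central 94% interval "
      f"[{m2[0] - 1.88 * hca_sd:.2f}, {m2[0] + 1.88 * hca_sd:.2f}]")
```

The factor 1.88 is the 97% quantile of the standard normal, so the printed interval is the central 94% interval of the (normal) posterior.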

The output for every coefficient is the posterior distribution aforementioned. The following is the posterior distribution for the HCA during the regular season:

Posterior distribution for the home court advantage during the regular season.

The expected value of the posterior distribution is 2.3, while the 94% credible interval (reported as the 3% and 97% bounds) is [1.5, 3.0]. Essentially, a home team is expected to get about 2.3 points from its home edge (traditionally attributed to travel/rest and refereeing bias). Moving to the bubble, we used the same priors for the HCA and the model standard deviation, but as priors for the team ratings we used the posteriors identified by the previous regression. This leads to the following posterior distribution for the HCA in the bubble:

Posterior distribution for the home court advantage during the bubble (seeding and playoffs until 9/15/20)

As we can see, the mean of the posterior is now smaller than 1 point, while the credible interval spans both sides of 0, i.e., [-1.1, 2.9]. This essentially means that there seems to be some effect remaining, but it is rather small and not robust based on the credible interval of the posterior. What is this small (non-robust) effect? Many things are possible (the most plausible explanation being a statistical “anomaly” from the small sample size compared to the pre-COVID data). Sure, there is no travel, but there are still referees. Now, it is not clear how they would be biased by some logos on the floor or some cyber-fans, but I looked into the data from the L2M reports, courtesy of a Hawks fan. In particular, I looked at the probability of an incorrect decision (call or no-call) based on whether the side disadvantaged by the decision was the “home” or the “visiting” team. When the home team was disadvantaged, this probability is 4.6%, while for the visiting team it is 6.2% (the following 7 lines of code – in R for inclusivity – will get you these results). The p-value for the difference is 0.11, so it is “marginally robust” (you can make your own interpretations), which seems to be in agreement with the home court advantage results (maybe a small effect, but not robust).

l2m <- read.csv("https://raw.githubusercontent.com/atlhawksfanatic/L2M/master/1-tidy/L2M/L2M.csv")
# keep only games played after July 1, 2020, i.e., in the bubble
l2m_bubble <- l2m[which(as.character(l2m$date) > "2020-07-01"), ]
# flag incorrect calls (IC) and incorrect non-calls (INC)
l2m_bubble$incorrect <- rep(0, nrow(l2m_bubble))
l2m_bubble$incorrect[l2m_bubble$decision %in% c("INC", "IC")] <- 1
mod <- glm(incorrect ~ disadvantaged_side, data = l2m_bubble, family = "binomial")
# probability of an incorrect decision when the home team is disadvantaged
predict(mod, data.frame(disadvantaged_side = "home"), type = "response")[[1]]
# probability of an incorrect decision when the away team is disadvantaged
predict(mod, data.frame(disadvantaged_side = "away"), type = "response")[[1]]

Making Predictions

Now let’s see how we can use this model to make predictions. Let’s say we want to see how Denver is going to fare in LA (virtual LA) against the Lakers. We can get an estimate for the distribution of the Lakers’ point margin, which will be $\mathcal{N}(\text{HCA} + r_{LAL} - r_{DEN}, \sigma) = \mathcal{N}(4.9, 13.4)$, i.e., a normal distribution with mean 4.9 points and standard deviation 13.4 (the $\sigma$ parameter of the model). This gives a win probability of approximately 64% for the Lakers on their “home court” and 61% when they cross half court to the Mile High bench (i.e., practically the same probability). One can also sample the posterior distributions of each of the variables involved and make the prediction that way, but as one might expect, the two approaches will match given enough samples.

Posterior distribution for the Denver Nuggets’ rating
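A short sketch of the win-probability step in plain Python: the Lakers win when the margin is positive, so the probability is the normal CDF evaluated at mean/sd. The Monte Carlo check only samples game noise around the fixed mean of 4.9; the post’s alternative would additionally sample the team ratings and HCA from their posteriors.

```python
import math
import random

def win_probability(mean_margin, sd):
    """P(margin > 0) when the margin is Normal(mean_margin, sd)."""
    return 0.5 * (1.0 + math.erf(mean_margin / (sd * math.sqrt(2.0))))

# Lakers' margin at their "home court", using the numbers from the text
p_home = win_probability(4.9, 13.4)

# Monte Carlo check: sample margins around the same fixed mean
rng = random.Random(42)
draws = [rng.gauss(4.9, 13.4) for _ in range(200_000)]
p_mc = sum(d > 0 for d in draws) / len(draws)
print(f"analytic {p_home:.3f}, sampled {p_mc:.3f}")
```

The analytic value is about 0.64, matching the 64% quoted above.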

Some (philosophical) thoughts on coach challenges

kpele — Mon, 09 Mar 2020 03:29:02 +0000

I have been thinking about this problem from an analytics perspective since the beginning of the season, but a recent WSJ article triggered some more thoughts. Here I am going to put down some of my thoughts on the topic. Bear with me, because the thoughts will be rather unstructured at this point and mainly aim at seeing what other people think about the problem.

Win probability added – vs – expected points added

In the WSJ article the importance of a challenge is measured through the win probability added from it. Win probability added is a natural choice of course, but are there any possible shortcomings using it?

First, let’s compare the two objectives of maximizing win probability – vs – maximizing expected points. At the beginning of a game there is more uncertainty about the outcome, with win probabilities close to 50% (or, in general, the pre-game probabilities based on team strength). As a result, any given possession will not swing the win probability by a large percentage. Toward the end of the game the uncertainty is smaller, so a given possession can have a much larger impact on the win probability. For example, an offensive foul in the first possession of the game will most probably not change the win probability at all, since there is high uncertainty at that point. However, the same charge in the last 5 seconds of a game while down 1 point will most probably swing the win probability by about 20-30%. This is an artifact of the finite duration of the game.

Depending on which view one takes, the corresponding on-court strategy might differ as well. For example, early in the game maximizing the win probability and maximizing the expected points will most probably yield similar decisions. In an extremely simplified version of decision making, a corner three in the first possession of the game is a better option than a long 2, both in terms of win probability and expected points. However, at the end of a tied game these two objectives diverge: a long 2 is the better option in terms of maximizing win probability, while a corner three still provides the best choice in terms of expected points. So the game clock clearly impacts the leverage of each call in terms of win probability. In terms of expected points, however, the value of each call is not impacted by the game clock: a challenge that is worth 1.5 expected points is worth that whether it comes in the first possession or the last one of the game.

At the same time, a point scored in the first possession of the game counts the same (i.e., 1 point) as one scored at the end of the game. Their (win probability) leverage at the time they are scored will be different, just as explained above. However, their contribution to the final score of the game is the same (1 point). So in close games, where challenges have the potential to make a difference, a point scored or saved in the first quarter is as important as one in the last 2 minutes. For example, the WSJ article ranks the challenge used by POR in the game against DAL as the one with the largest swing in win probability (which is true). During that challenge, the referees had called a shooting foul against POR with 9 seconds left in the game. DAL was down 1 point and the challenge was worth approximately 1.5 expected points if won (Finney-Smith is about a 75% FT shooter, and 2 × 0.75 = 1.5), without considering the ensuing jump-ball result if the challenge was successful. However, if POR had used their challenge earlier in the game on a situation worth at least 1.5 expected points, they would have been up by 2.5 (expected) points 9 seconds before the end of the game, which would give a very similar win probability at that point as the one after the reversal of the challenged call. Note that this is not any type of hindsight bias: 2 points scored in the last 10 seconds give you the same lead as if these points were scored during the first half. Of course, this assumes that the game would have proceeded along the same path if a challenge was called earlier in the game. However, given how small a change any given basket makes to the overall score, this assumption might not be too much of a stretch (cases where a star player gets into early foul trouble etc. might be different, but these are also things that are hard to include in the expected value of a challenge).

So overall, the fact that the highest win probability added comes from late-game challenges is to be expected, given the high win-probability leverage of late-game situations. However, looking at the issue through an expected-points lens might also give us more insight into how one can approach this strategically.

The secretary problem

The coach challenge fits well under problems studied by optimal stopping theory. The most popular example, and a very relevant one for our setting, is the secretary problem. You have one secretary position to fill and start interviewing candidates (whose number $n$ is known beforehand) for the position. Candidates come in random order (in terms of their skills/quality). We have to decide whether to hire each candidate immediately after the interview. Once we reject a candidate we cannot call him/her back, while once a hire is made the process stops and no more candidates are interviewed. The question is when we should make a hire. This is similar to the coach challenge problem (with some important differences we will discuss later): we get a series of calls that the coach can challenge (hire), and once a challenge is made no other challenges (hires) can be made.

In the case of the secretary problem, the optimal strategy is the following stopping rule: interview the first $n/e$ of the $n$ candidates (about 37% of them) without hiring anyone, and let $s^*$ be the “skill rating” of the best candidate among them. Then interview the rest of the candidates and hire the first one with a “skill rating” better than $s^*$ (or the last candidate if this never happens). This simple rule selects the best candidate with probability approximately $1/e \approx 37\%$; in other words, the best candidate is chosen 37% of the time.
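A quick simulation of the cutoff rule in plain Python, using $n = 20$ to match the number of challengeable calls assumed below (so the cutoff is $\text{round}(20/e) = 7$). Skill ratings are simply the ranks $0, \dots, n-1$ in random order, and the closed-form success probability of the rule is computed alongside for comparison.

```python
import math
import random

def simulate_secretary(n, cutoff, trials=100_000, seed=0):
    """Fraction of trials in which the cutoff rule hires the best of n candidates."""
    rng = random.Random(seed)
    best = n - 1
    successes = 0
    for _ in range(trials):
        skills = list(range(n))  # distinct skill ratings; n - 1 is the best
        rng.shuffle(skills)      # candidates arrive in random order
        benchmark = max(skills[:cutoff]) if cutoff > 0 else -1  # s* from the observation phase
        hired = skills[-1]       # fall back to the last candidate
        for s in skills[cutoff:]:
            if s > benchmark:
                hired = s
                break
        successes += int(hired == best)
    return successes / trials

def exact_success(n, cutoff):
    """Closed-form success probability of the cutoff rule (valid for 1 <= cutoff < n)."""
    return cutoff / n * sum(1.0 / (i - 1) for i in range(cutoff + 1, n + 1))

n = 20
cutoff = round(n / math.e)  # 20 / e = 7.36, so observe the first 7
p_exact = exact_success(n, cutoff)
p_sim = simulate_secretary(n, cutoff)
print(cutoff, round(p_exact, 3), round(p_sim, 3))
```

For small $n$ the success rate is a bit above 37% (about 38% for $n = 20$); it converges to $1/e$ as $n$ grows.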

Now the coach challenge problem has a few differences. For one, if we use win probability as our metric for qualifying calls (candidates), then the order in which these calls arrive is not random. Based on the above discussion, one can argue that earlier calls might hold less value in terms of win probability than later calls (e.g., in the last 5 minutes). However, in terms of expected points the order can be assumed to be random. The expected points of a challenge need to account for the probability of the call being wrong (which might not be trivial for subjective calls, e.g., block – vs – charge). The expected value of a call reversal then depends on the type of call. For example, an out-of-bounds reversal essentially provides the expected points per possession for the opponent. A shooting foul reversal provides the expected points of two free throws from the player who would take them. It should be clear that there is an upper limit on the expected points a challenge can yield. For example, if a three-pointer was scored but an offensive foul was called on the shooter and the coach challenges it to be a shooting foul, a successful challenge will yield 3+ points (say 3.8 points). These cases might not appear very often, but there is still an upper bound on the value, which does not exist in the secretary problem. If there is an absolute maximum that you can obtain, then once you observe this maximum you know there is no way to find a better situation and you should make the challenge. So if we assume that the maximum expected points that can be gained is 3 (again, in reality getting this expected value from a call is extremely unlikely) and that the number of calls a coach can challenge is 20 (this number can be learned from data, and we can also consider only calls that have, say, more than a 40-50% probability of being overturned), then we can use the following decision process:

For the first $\lfloor 20/e \rfloor = 7$ “challengeable” calls, do nothing unless one of them yields the maximum expected points; in that case, challenge.
After the first 7 calls, with $xP_{max}$ being the maximum expected points among those first 7 calls, if a call’s expected points $xP$ satisfy $xP > xP_{max}$ or $xP = 3$, challenge. Otherwise, continue.
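The two-step rule above can be sketched as a small function in plain Python. The expected-points values in the example are hypothetical, and returning `None` (keep the challenge) when no call qualifies is an assumption: the post does not say what to do in that case, and unlike the secretary problem there is no forced “last candidate”.

```python
def challenge_index(expected_points, cutoff=7, xp_cap=3.0):
    """Index of the call to challenge under the cutoff rule, or None if no call qualifies.

    expected_points: xP of each challengeable call, in the order the calls occur.
    """
    xp_max = float("-inf")  # benchmark from the observation phase
    for i, xp in enumerate(expected_points):
        if xp >= xp_cap:
            return i  # nothing can beat the maximum possible value: challenge now
        if i < cutoff:
            xp_max = max(xp_max, xp)  # observation phase: only record the benchmark
        elif xp > xp_max:
            return i  # first call after the cutoff that beats the benchmark
    return None

calls = [1.0, 1.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.2, 1.6, 2.0]
print(challenge_index(calls))  # 8: the 1.6 call is the first to beat xP_max = 1.5
```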

Now, this is of course just a roadmap for how one could start thinking about when to challenge. High-profile, high-leverage challenges like the one in the Portland – Dallas game are outliers, and personally I think that keeping the challenge in case a call/situation like this appears is a bad strategy (it is essentially based on the availability heuristic: it is easier to recall high-leverage successes than run-of-the-mill ones).

Incorrect calls – vs – incorrect non-calls

One of the limitations of the challenge process at the moment is the inability to challenge non-calls. There are practical reasons for this (i.e., continuation of play etc.), but future iterations of the rule (if it is to remain) need to consider this. This is particularly important since, based on the L2M reports since 2015, 92% of the incorrect referee decisions are non-calls, and hence non-challengeable!

Again, this text is just a dump of ideas and philosophical-level thoughts about win probability and expected points. I am looking forward to hearing your comments.

Not all free throws are created equal anymore … at the G-League

kpele — Fri, 27 Sep 2019 23:19:08 +0000

Yesterday the G-league announced that it will experiment with a new free throw rule in order to reduce the duration of the game. In particular, instead of awarding one, two or three free throws depending on the violation, it will award one free throw that will be worth one, two or three points respectively. Many interesting things came to my mind when I first heard that. For example, teams sometimes attempt to ice the shooter for their second free throw shot by taking a time-out. And it seems to work. Not anymore though!

There is also a difference in FT% across the free throws of a trip (the splits). The following are the percentages from last season over the whole league:

As we can see there are some differences (statistically significant as well) between consecutive free throws in a FT trip. Others have noticed similar behavior. Furthermore, we see that overall the FTs at a 3-point shooting foul have higher percentage since most probably the players that are awarded these fouls are better shooters. With the new rule change the unit of interest is the FT trip. Last year there were on average around 3 trips for 1 FT, 10 trips for 2 FTs and 0.4 trips for a triplet of FTs per team per game. Now all of these trips will be awarded a single FT that will be worth 1, 2 or 3 points respectively. What will be the impact on scoring? Assuming (and this is a big assumption that I will expand a bit on later) that teams keep getting the same type of FT trips and have the same percentages (the ones that matter here are the percentages of the first FT of the trip), the following figure shows the expected number of points per game per team from FTs compared with what they get now:

As we can see, almost all teams are expected to get fewer points from the charity stripe if this rule were introduced in the NBA. The degree of reduction depends on many factors (type of trips for a team, actual FT%, size of the FT split differences etc.). However, we could start looking at pairs of teams and examine who would benefit in their matchup. For example, Memphis is expected to see barely any decrease in its expected points from FTs, but Brooklyn will see a decrease of almost 1 point. So in a matchup between BKN and MEM, the new rule would give Memphis a 1-point advantage.
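A minimal sketch of the per-team calculation in plain Python, under the post’s assumption that trip mix and percentages stay the same. The trips per game (3, 10 and 0.4) are the league averages quoted above; the FT percentages are hypothetical placeholders, not last season’s actual splits.

```python
# Trips per team per game are the league averages quoted in the post;
# the FT percentages below are hypothetical placeholders, not last season's splits.
TRIPS = {1: 3.0, 2: 10.0, 3: 0.4}
FT_PCT = {1: [0.74], 2: [0.74, 0.77], 3: [0.80, 0.82, 0.84]}

def points_current(trips, ft_pct):
    """Current rule: a k-FT trip scores the sum of its k FT percentages."""
    return sum(n * sum(ft_pct[k]) for k, n in trips.items())

def points_new_rule(trips, ft_pct):
    """New rule: one FT worth k points, made at the first-FT percentage of a k-FT trip."""
    return sum(n * k * ft_pct[k][0] for k, n in trips.items())

current = points_current(TRIPS, FT_PCT)
new = points_new_rule(TRIPS, FT_PCT)
print(f"current {current:.2f}, new {new:.2f}, change {new - current:+.2f}")
```

If every FT within a trip had the same percentage, the two totals would coincide; under these assumptions the whole reduction comes from the first FT of a trip being made less often than the later ones.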

Of course, here come the assumptions. The rule change is going to change the way teams foul (I think), and hence who gets what kind of trips. How? I don’t know. But consider that a league-average player (even including the splits) currently has around a 6% chance of not scoring from a pair of FTs. With the new rule he will have a 25% chance of not scoring. This can make Euro fouls a thing again. Of course, someone will point out that there is also a 75% chance of scoring 2 points, compared to the 56% chance of that happening today under the current NBA rule. True; that’s why I said I don’t know how this will change fouling behavior. Moreover, teams might put more emphasis on practicing FTs (which I would hope they do anyway, but even more so now), not to mention personnel moves. What is clear, though, is that with the current levels of FT% teams are expected to score less from the line. Scoring is another variable in fan satisfaction (as is the duration of the game). Of course, the reduction will be less than 2 points per game, but fans do like more points (maybe not from the FT line anyway). Also, we have not accounted for the fact that the last 2 minutes will keep the current rule, which will further reduce the impact on scoring (it might also bring an uptick in fouls just before the last two minutes, so that teams can benefit from the increased variance of the new rule, but we cannot know until we see).
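The 6%/56% versus 25%/75% figures can be reproduced with a short sketch in plain Python, assuming a 75% shooter and independent makes on the two current FTs (i.e., ignoring the splits). It also shows the variance increase mentioned above: the mean stays at 1.5 points, while the variance doubles.

```python
def pair_outcomes(p1, p2):
    """Current rule: P(0), P(1), P(2) points from two FTs, assuming independent makes."""
    return ((1 - p1) * (1 - p2), p1 * (1 - p2) + (1 - p1) * p2, p1 * p2)

def single_ft_worth_two(p):
    """New rule: P(0), P(1), P(2) points from one FT worth 2 points."""
    return (1 - p, 0.0, p)

def mean_and_variance(dist):
    """dist[k] is the probability of scoring k points."""
    mean = sum(k * q for k, q in enumerate(dist))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(dist))
    return mean, var

p = 0.75  # roughly a league-average FT shooter
current = pair_outcomes(p, p)
new = single_ft_worth_two(p)
print(current, mean_and_variance(current))  # P(0) = 0.0625, P(2) = 0.5625
print(new, mean_and_variance(new))          # P(0) = 0.25, P(2) = 0.75
```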

We will see how things will go in the G league this season but it is certainly an intriguing change.