Now that the dust from Super Bowl LI has settled (and Brady has even found his jersey), let’s talk a bit about a topic that was discussed a lot in the aftermath of the game – at least within the sports analytics community. That is none other than the in-game win probability models that several media outlets and websites (not to mention betting sites) provide as the game progresses. In the case of Super Bowl LI, ESPN’s model gave Atlanta a 99.5% win probability 16:30 minutes before the end of the game. While 0.5% is not 0, at that point it certainly seemed highly unlikely that the Patriots would be hoisting their 5th Lombardi trophy. Similar situations appeared during this year’s NFL regular season as well (e.g., take a look at this Reddit thread). However, do these instances really justify the heat that these models have taken? My personal viewpoint is that they do not, *but* there are a few suggestions that I can provide.

First of all, people tend to easily recall and focus on extreme situations, such as one where a team with a 99.8% win probability ends up losing the game! This is an inherent cognitive bias, and it is present even when we are aware of it and actively try to avoid falling into its trap.

Second, the probability is exactly what it says, i.e., a probability, not a “definite-ability”. A 0.2% probability means that you will see this event once in every 500 times that this setting appears. Even here, people can fall into various cognitive traps. For example, this situation might never have appeared before, and yet the event with the 0.2% probability prevails the very first time. This is entirely possible, since the probability does not tell you that you will first see 499 failures and that the 500th time will be the “success”. In other words, the probability says nothing about the order of events (at least in the setting of in-game win probabilities).
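To make the order-of-events point concrete, here is a minimal simulation sketch (the 0.2% event is the one from the text; everything else, including the trial count, is an illustrative choice of mine): the long-run frequency converges to roughly 1-in-500, but the first occurrence can show up at any trial, early or late.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

p = 0.002           # the 0.2% event from the text
n_trials = 500_000  # many independent "games" reaching this exact situation

outcomes = [random.random() < p for _ in range(n_trials)]

# The long-run frequency is close to 1 in 500 ...
print(sum(outcomes) / n_trials)

# ... but the probability says nothing about *when* the event occurs:
# the waiting time until the first occurrence is random (geometric),
# and it can easily be much earlier (or later) than trial 500.
first_hit = outcomes.index(True) + 1
print(first_hit)
```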

Of course, the most crucial question is whether the chances were indeed 1:500, or whether they were more like 1:100 or even 1:50. Someone might argue that all of these describe essentially impossible events, but to a trained “eye” they are significantly different. From a practical point of view, when it comes to NFL in-game win probabilities this difference might not be *important*, but in the context of a medical decision, for example, these chances represent completely different levels of risk.
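One way to see why 1:500 and 1:50 are very different in practice is to ask how likely the event is to show up at least once over many repeated exposures. The sketch below uses nothing beyond basic arithmetic; the 256 exposures are my illustrative choice (one full NFL regular season of that era).

```python
# Chance of seeing the unlikely event at least once over n independent
# exposures: 1 - (1 - p)^n.
def at_least_once(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# A 1:500 event at least once in 256 games is likely-but-not-expected,
# while a 1:50 event over the same stretch is nearly certain.
for odds in (500, 100, 50):
    prob = at_least_once(1 / odds, 256)
    print(f"1:{odds} event over 256 games -> {prob:.1%}")
```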

The best way to evaluate probability models is through reliability diagrams. Briefly, a reliability diagram plots the predicted probability from the model against the observed probability, as measured by the fraction of instances with a given predicted probability that ended up being “true”. Typically, a logistic regression model returns well-calibrated probabilities, while other classifiers are often well-calibrated over a large part of the probability range but not at the extremes (i.e., very small or very large probabilities). There can be several reasons for this, and the most plausible one is the lack of data points for these cases. Nevertheless, that would be true for a logistic regression model as well, hence there might be other reasons too. For instance, a much more complicated model might have less generalization power, especially when there are few data points for specific situations, leading to overfitting. Of course, all of these are speculations, but it is always good to step back and be critical of our models. In the next part of my post on in-game NFL win probabilities I will present the design of a simple model (based mainly on logistic regression) that exhibits very good calibration over the whole probability range.
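As an illustration of how a reliability diagram is computed, here is a minimal sketch (the function name `reliability_curve` and the synthetic data are mine, not from any particular library): bin the predicted probabilities, and compare each bin’s mean prediction with the observed frequency of the positive outcome in that bin.

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """For each probability bin, compare the mean predicted probability
    with the observed frequency of the positive outcome."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    mean_pred, obs_freq = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            obs_freq.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(obs_freq)

# Synthetic sanity check: outcomes are drawn from the predicted
# probabilities themselves, so the "model" is calibrated by construction
# and the two curves should nearly coincide (points on the diagonal).
rng = np.random.default_rng(0)
p = rng.uniform(size=100_000)
y = (rng.uniform(size=100_000) < p).astype(int)
mean_pred, obs_freq = reliability_curve(y, p)
print(np.max(np.abs(mean_pred - obs_freq)))  # small gap = good calibration
```

For a real model one would plot `obs_freq` against `mean_pred` and look for deviations from the diagonal, especially near 0 and 1 where the extreme win probabilities live.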

**Other readings:**