Probabilistic predictions are everywhere, and sports are no exception. After all, the holy grail of sports analytics is predicting the outcome of a game. Predictions are typically evaluated on accuracy, that is, how many correct predictions the prediction engine made. This is reasonable, especially when the predictions come with no probabilistic justification (e.g., experts on sports networks making their picks). Accuracy is known to have issues when the binary outcome is unbalanced, though this is typically not a problem for sports predictions. However, when the picks come from a probabilistic model (e.g., logistic regression), each pick is a quantization of an underlying probability. The typical threshold is 0.5, i.e., if the output of the model is greater than 0.5, label 1 (e.g., home team wins) is assigned to the specific instance, while if the output is less than 0.5, label 0 (e.g., road team wins) is assigned. The problem I have with this is that an output of, say, 0.58 means that the specific data instance has a 58% probability of being of class 1 (e.g., the home team has a 58% probability of winning). This is far from certain, and there is a non-negligible 42% probability of the instance being of class 0 (e.g., the road team winning). This uncertainty is not captured in the accuracy results typically presented.
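To make the point concrete, here is a minimal sketch of what thresholding at 0.5 throws away. The function name and the numbers are made up for illustration and are not FPM output: two sets of predictions, one very confident and one barely above a coin flip, end up with identical accuracy.

```python
def accuracy_at_threshold(probs, outcomes, threshold=0.5):
    """Fraction of games whose thresholded pick matches the actual outcome."""
    # Quantize each probability into a pick: 1 = home team wins, 0 = road team wins.
    picks = [1 if p > threshold else 0 for p in probs]
    hits = sum(1 for pick, y in zip(picks, outcomes) if pick == y)
    return hits / len(outcomes)


# Made-up outcomes (1 = home team won) and two made-up sets of predictions.
outcomes = [1, 1, 0, 1]
confident = [0.95, 0.90, 0.10, 0.85]
hesitant = [0.58, 0.55, 0.45, 0.51]

acc_confident = accuracy_at_threshold(confident, outcomes)
acc_hesitant = accuracy_at_threshold(hesitant, outcomes)
print(acc_confident, acc_hesitant)  # the two accuracies are identical
```

Accuracy cannot tell these two sets of predictions apart, even though they make very different claims about how certain each game is.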
Here I will discuss another way to evaluate the performance of a logistic regression model. While this is a generic evaluation metric, I will focus on the NFL prediction engine (FPM) I have developed with Vagelis Papalexakis. The details of this engine will most probably be part of a different post, but in brief, the engine consists of two modules: (i) a bootstrap module that predicts teams' future performance through a biased resampling with replacement, and (ii) a Bradley-Terry regression model that gives the probability of the home team winning a matchup. The features we used are simple box-score statistics (e.g., total offensive yards, turnovers, etc.). The output of FPM is a probability of the home team winning the game. For example, in yesterday’s matchup at Heinz Field, the Steelers had a 72% probability of winning based on FPM. The Steelers did win, so someone would say that the prediction was accurate. But wouldn’t it also have been accurate if the Giants had won? After all, FPM gave them a 28% probability of winning the game. This calls for a better way to evaluate the actual output of the logistic regression model, i.e., the probability.
What does a 72% probability of the Steelers winning this matchup mean? If we could replay the game 1000 times under the same conditions, the Steelers would be expected to win about 720 times and the Giants about 280. Obviously we cannot replay the game 1000 times. So one way to evaluate how accurate this 72% probability is, is the following: for all the games in which FPM predicted a win for the home team with probability x%, what percentage of the time did the home team actually win? An accurate model for the probability would have this percentage be around x%. In other words, when plotting the predicted probability from the model against the observed fraction of home wins in those games, the slope of the linear fit line should be statistically indistinguishable from 1. We will term this slope the regression probability slope. One problem is that there might not be many games with a predicted probability of exactly 0.67, for example. Hence, one can quantize the probabilities (e.g., in increments of 0.05). The figure shows the results from our FPM model for the last seven NFL seasons (2009-2015).
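The binning and the slope computation described above can be sketched roughly as follows. This is an illustration under my own assumptions, not the FPM code: the function names are hypothetical, each bin is represented by its center, the fit is a plain unweighted least-squares line, and the sample data are made up.

```python
def bin_predictions(probs, outcomes, width=0.05):
    """Group games into probability bins of the given width.

    Assumes width divides 1 evenly (e.g., 0.05 gives 20 bins).
    Returns (bin_center, home_win_fraction, n_games) for each non-empty bin.
    """
    n_bins = round(1 / width)
    wins = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        wins[idx] += y
        counts[idx] += 1
    return [((i + 0.5) / n_bins, wins[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i] > 0]


def fit_line(xs, ys):
    """Unweighted ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values to fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up predictions and outcomes (1 = home team won), for illustration only.
probs = [0.22, 0.23, 0.21, 0.24, 0.72, 0.73, 0.71, 0.74]
outcomes = [0, 0, 1, 0, 1, 1, 1, 0]

bins = bin_predictions(probs, outcomes)
slope, intercept = fit_line([b[0] for b in bins], [b[1] for b in bins])
print(bins)
print(slope, intercept)
```

With real data one would probably also want to account for the number of games in each bin, since sparsely populated bins give noisy fractions.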
As we can see, for all the ranges of predicted probabilities the confidence interval of the observed fraction touches the y=x line. In fact, the linear fit line exhibits a regression probability slope of 0.96 ($latex R^2 = 0.94$), with a confidence interval of [0.76, 1.14], i.e., the slope is not statistically different from 1, and hence the predicted win probabilities are accurate.
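For readers who want to reproduce the per-bin check, here is one way to compute a confidence interval for the observed fraction in a bin and test whether it covers the y=x line. I am assuming a Wilson score interval at roughly the 95% level (z = 1.96); this is only one reasonable choice and not necessarily the interval used for the figure. The function names and the example counts are hypothetical.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("need at least one game in the bin")
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def covers_diagonal(predicted, wins, n, z=1.96):
    """True if the interval for the observed fraction contains the predicted probability."""
    low, high = wilson_interval(wins, n, z)
    return low <= predicted <= high


# Made-up bin: 100 games predicted at about 0.725, of which the home team won 72.
low, high = wilson_interval(72, 100)
print(low, high)
print(covers_diagonal(0.725, 72, 100))
```

A bin whose interval does not contain its predicted probability is a sign that the model is over- or under-confident in that range.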
More generally, for any classification problem, the predicted probabilities of a data instance being of class 1 can be used to generate similar plots and compute the corresponding slope. I strongly believe that this is a better measure for evaluating a logistic regression (or any other probability prediction) model, and hopefully we will see more researchers and practitioners using it.
P.S. Another metric that is used for evaluating probability predictions is the Brier score. However, I refrain from describing it here since it is less intuitive, and it is also more a metric of the calibration of the predicted probabilities than of their accuracy.
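For reference only, the Brier score for binary outcomes is the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# A 72% prediction for a game the home team won contributes (0.72 - 1)^2.
single_game = brier_score([0.72], [1])
# Always predicting 0.5 gives 0.25 regardless of the outcomes.
coin_flip = brier_score([0.5, 0.5], [1, 0])
print(single_game, coin_flip)
```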