posted from - http://www.rawbw.com/~deano/articles/kalman.html, 20 January, 1997
Introduction: Why Do We Need Predictions?
In the work I do, I find little reason to estimate point spreads or to predict scores. This is because those methods are mostly used to try to win in Las Vegas, an objective I have never had. No, my goal is and always has been to figure out how to construct a good team, “basketball engineering,” as I’ve pitched it to a few people.
This objective is carried out by studying how individuals work or by studying how teams work, with the intention that the two approaches will merge and lead to the same conclusions.
On the team side, one of my focuses has been to understand how points scored and points allowed relate to win/loss records: if a team averages 3 points per game more than their opponents, what is their expected winning percentage? I have, in fact, gained numerous insights through the development of the Correlated Gaussian method, which answers questions just like this. This method, in combination with matchup probabilities, also does a good job of prediction and has been used in one public study showing that there is no distinct added home court advantage in the playoffs. Methods like this, however, do not explicitly account for one piece of information that seems important: strength of schedule.
Let me first hedge a little and say: There is actually no hard evidence proving that a 10 point win over a strong opponent means anything more than a 10 point win over a weak opponent. Intuitively, we believe that this _must be true_ and I have little doubt that an empirical study would show that it is true (something that you yourself can do). What I will present here relies on this unproven belief and, for those theoretical types out there, actually proves it if you read carefully and think about it.
In large studies, the strength of opponents balances out or can be factored out in some way. In small studies, the strength of opponents does not balance out, which is a primary motivation for this work. For example, during the 1992-93 season, Michael Jordan missed 4 games and the Bulls went 1-3 in those games. Given that the Bulls were 57-25 on the season, this immediately implies that Jordan is _much better_ than his replacement, B.J. Armstrong. But does it? The Bulls’ three losses were all to playoff teams, including one to the Knicks who had the best record in the regular season that year. In addition, none of these losses was by more than 6 points. The Bulls’ one win was a 28 point blowout. How do we put all this together to create some picture of the relative value of Armstrong to Jordan? We can do it in our minds, but we wouldn’t all agree. Or we can use a mathematical method whose basis we can agree upon.
Let me now take a quick diversion into the benefits of numerical methods whose results are consistent from person to person, not subjective. If you don’t want to hear me sound off like Billy Graham after a physics class, I suggest you just leap past my preachings to the rest of the article.
Are you sure you want to read this? It could be _worse_ than Billy Graham on physics. It could be Bill Clinton on health care! It could be Newt Gingrich on ethics! It could be Bob Dole on anything! Last chance. Click here or forever hold your peace…
For many issues, it is fine that we make subjective judgments and that we don’t all agree on those judgments. Argument is underrated in this world, as long as it’s rational and we don’t start killing each other because you think Jordan is 20 points better than Armstrong and I think he’s only 10 points better. (No subtle reference to the Middle East intended there.) I make my living off people disagreeing and, no, I am _not_ a lawyer.
But when things have to get done, we have to agree on some basic rules. We have to have a consistent set of methods for characterizing the truth. Usually, if we cannot agree on the big picture, we can start by looking at the details. We may not agree how much better Jordan is than Armstrong, but we can agree that if Armstrong took Jordan’s place for four games and the Bulls won all four, that is an _indication_ that Armstrong isn’t much worse than Jordan. We should also agree that if all four games were against very weak teams, then that indication isn’t as strong. Finally, we should agree that if all four games were against strong teams, then we have reason to wonder what’s going on: Armstrong is beginning to look pretty good.
If we can find a mathematical method that adequately characterizes those details we agree upon, we have made a step towards agreeing upon the big picture. Often, there are several mathematical methods that can characterize agreed-upon details. Sometimes those methods disagree on the big picture. Many times they don’t. The more information that they account for, the more likely they are to agree upon the big picture.
Of course, these methods can still be wrong. Some of the best models in environmental engineering agree on many things, but they can’t predict real circumstances very well. It frustrates me to no end when people argue over methods whose predictions are all pretty close to one another but that are also all quite far from predicting reality. That’s why I am not trying to place the method of this article in competition with other similar ones which use the same information to get similar results. All of the methods have about equal value for doing what we want: taking scores and assigning “ratings” based on those scores, the opponents, and whether the game was at home or on the road. They all make similar predictions. They all eliminate a large part of the subjectivity. Arguing over what is left of the subjectivity within the methods is foolish and left to people who like to call themselves fools.
Now back to our regularly scheduled article …
On the individual side of my research, I also have never taken explicit account of the strength of opponents. Specifically, Michael Jordan drives and jukes against the toughest defenders every night, just as Joe Dumars tries to contain the toughest offensive players every night. Direct measurements of what they do then show Jordan and Dumars to be somewhat worse than they actually are. What I will present here is the skeleton of a method that can account for this bias.
The Method: A Kalman Filter
The method I will present here to handle varying strengths of opponents is called a statistical filter. Statistical filters are methods for estimating _something_ using statistics. In this case, the filters can be used to estimate the strength of a team using a team’s game-to-game progression of points scored, points allowed, whether they were at home or on the road, and who they played against. There are several methods out there that take only this information to produce rankings and/or predict scores – Doug Norris used to have one but I can’t find him on the web anymore, ESPNet has one, and World Wide Rankings and Ratings (WWRR) has four or five. I believe that Doug’s method and the ones from WWRR are “original”, meaning that they dreamed them up on their own, a feat for which I applaud them until my hands turn red. However, an “optimal” technique for using this information has been around for a long time. This technique is called a Kalman Filter. Even though I said the Kalman Filter is “optimal”, I am not claiming a Kalman Filter is any “better” than anything else. Every paper that has ever been written about the Kalman Filter has stated that it is “optimal”, so I’m just regurgitating.
Kalman filters are used by NASA to predict the path of missiles and planes. They are also used to predict weather. They are used on Wall Street. Recently, they were introduced to environmental law by yours truly. A Kalman filter is clearly a very practical tool and it only makes sense that it has applications in basketball. It was actually used in a football prediction program when it was introduced to me about five years ago.
As another illustration of how one might use the Kalman Filter, I present the following chicken scratch:
For this strip, I owe a debt of gratitude to Scott Adams, writer of Dilbert. No, he didn’t draw this, nor did he write the dialog. Actually he didn’t do diddley except get popular enough so that you could tell what I drew even though I am a lousy artist.
So How Does a Kalman Filter Work in Basketball?
Conceptually, we know that a good offense on average will do relatively better against a poor defense and relatively worse against a good defense. Let’s start with that concept and attach some numbers. Last year’s Utah Jazz had a good offense on average with an offensive rating of 111.7 in a league where the average rating was 105.9. They played against the following “good defensive teams” a total of 18 times: Chicago (2), Miami (2), New York (2), Portland (4), San Antonio (4), and Seattle (4). These teams had a weighted average defensive rating of 101.5 (weighted by games played against Utah). The Jazz played against the following “poor defensive teams” a total of 16 times: Charlotte (2), Dallas (4), LA Clippers (4), Milwaukee (2), Philadelphia (2), and Toronto (2). These teams had a weighted average defensive rating of 109.6.
According to my methods, the Jazz offensive rating should have been about 107 against the good defensive teams (their rating dropped) and about 115 against the poor defensive teams. In actuality, the Jazz ratings were 109.3 and 112.0, respectively. The results aren’t as good as I had hoped, but this sample was small and I have not done an extensive analysis to determine whether my methods work on larger samples than those here. I believe they will work, but I’d like to check eventually, unless someone else would like to do it (ask me!).
One of the methods says to predict the Jazz offense vs. Team _B_ defense as
$$\text{Jazz Off. Rating vs. Tm } B \text{ Def.} = \frac{(\text{Jazz Off. Rtg}) \times (B \text{ Def. Rtg})}{\text{League Avg. Rtg.}} \qquad (1)$$
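For readers who want to try equation (1) with their own numbers, here is a minimal Python sketch. The function name `predict_matchup` and its argument names are my own labels for illustration, not anything from the original page; the inputs are the Jazz figures quoted above.

```python
def predict_matchup(off_rating: float, def_rating: float, league_avg: float) -> float:
    """Expected points per 100 possessions for an offense against a defense, per equation (1)."""
    return off_rating * def_rating / league_avg


# Utah's season offense (111.7) in a league averaging 105.9,
# against the weighted-average "good" defenses (101.5) ...
vs_good = predict_matchup(111.7, 101.5, 105.9)
# ... and against the weighted-average "poor" defenses (109.6).
vs_poor = predict_matchup(111.7, 109.6, 105.9)

print(f"vs good defenses: {vs_good:.1f}")
print(f"vs poor defenses: {vs_poor:.1f}")
```

Because the relationship is a simple product divided by the league average, an opponent with an exactly average defense leaves the offensive rating unchanged.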
Technical note: Mathematically, this relationship says that Utah’s offensive performance is _linearly_ related to both the average Utah offense and to the opposing defense. “Linearly” means that if the average Jazz offense improves by 10% then the Jazz offense vs. Team B’s defense also improves by 10%, not 8% or 50%. I only introduce this because a Kalman Filter is strictly only “optimal” if this relationship is linear. The second method I introduce is not linear. (For people with a statistics background, note that a Kalman Filter is optimal only if offensive ratings and defensive ratings are Gaussian distributed. As I showed in Basketball’s Bell Curve, this is also essentially true. The success of the Correlated Gaussian method substantiates this.)
Here is how the Kalman Filter will work for a game where team _A_ plays at team B:
- Evaluate the offensive and defensive ratings for both teams. If possible, evaluate team A’s ratings on the road and team B’s ratings at home. If this is not possible, take team A’s ratings and make each of them worse by 1 point per 100 possessions; take team B’s ratings and make them better by 1 point per 100 possessions. For example, if team _A_ is Utah and team _B_ is Seattle, we evaluate Utah’s ratings as 110.7(=111.7-1.0) and 105.5(=104.5+1.0) and Seattle’s ratings as 109.7(=108.7+1.0) and 99.6(=100.6-1.0).
- Predict how team _A_ will do against team _B_ using the equation above. Using the Utah-Seattle example, we find that Utah’s offense vs. Seattle’s defense should have a rating of 104.1(=110.7*99.6⁄105.9) and that Utah’s defense vs. Seattle’s offense should have a rating of 109.3(=105.5*109.7⁄105.9).
- After the game, input the actual ratings. In this example, let’s assume that Seattle won 99-94 with 88 possessions each. Utah’s actual ratings were then 106.8(=94⁄88*100) offensively and 112.5(=99⁄88*100) defensively.
- Adjust the offensive and defensive ratings for both teams according to these formulas (slight revision on 11/16/97), which essentially tell you how strongly to weight the game results. Here, Utah’s offense exceeded predictions, so its offensive rating improves, but its defense was worse than predicted, so its defensive rating gets worse. Similarly, Seattle’s offensive rating improves and its defensive rating gets worse. (I will attach numbers in the real example on the Bulls below.) Note that if we had not accounted for quality of competition, it would look like Utah’s offense got worse because it only scored 106.8 points per 100 possessions compared to 111.7 against the league. But by recognizing that Seattle is a good opponent and that Utah is on the road, we see that Utah’s offense actually did well.
(Points can be used in place of ratings above. I like to use ratings rather than points because they do not fluctuate as much as points. But, in terms of ease of use, points are preferable because no calculation of possessions is necessary. Specifically, we could have used Utah’s points per game on the road, Seattle’s points per game at home, the league average of points per game, and the final score of the game to replace Utah’s road ratings, Seattle’s home ratings, the league average rating, and the final ratings of the game.)
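The first three steps of the Utah at Seattle example can be sketched in a few lines of Python. This is only an illustration of the arithmetic quoted above; the helper names (`venue_adjust`, `predict`, `game_rating`) are hypothetical, and the 1-point home/road adjustment is the fallback rule from the first step, not a fitted value.

```python
LEAGUE_AVG = 105.9  # league average rating from the Jazz example
HOME_EDGE = 1.0     # points per 100 possessions, the fallback adjustment


def venue_adjust(offense, defense, at_home, edge=HOME_EDGE):
    """Home teams score `edge` more and allow `edge` fewer; road teams the reverse."""
    sign = 1.0 if at_home else -1.0
    return offense + sign * edge, defense - sign * edge


def predict(offense, opp_defense, league_avg=LEAGUE_AVG):
    """Equation (1): expected rating of an offense against a given defense."""
    return offense * opp_defense / league_avg


def game_rating(points, possessions):
    """Points per 100 possessions in a single game."""
    return 100.0 * points / possessions


# Step 1: season ratings adjusted for venue (Utah on the road, Seattle at home).
utah_off, utah_def = venue_adjust(111.7, 104.5, at_home=False)
sea_off, sea_def = venue_adjust(108.7, 100.6, at_home=True)

# Step 2: predicted ratings for this game.
pred_utah_off = predict(utah_off, sea_def)  # Utah offense vs. Seattle defense
pred_utah_def = predict(sea_off, utah_def)  # Seattle offense vs. Utah defense

# Step 3: actual ratings from the assumed 99-94 Seattle win on 88 possessions.
actual_utah_off = game_rating(94, 88)
actual_utah_def = game_rating(99, 88)

# Positive offensive surprise = better than predicted;
# positive defensive surprise = more points allowed than predicted.
off_surprise = actual_utah_off - pred_utah_off
def_surprise = actual_utah_def - pred_utah_def
print(f"offense: predicted {pred_utah_off:.1f}, actual {actual_utah_off:.1f}")
print(f"defense: predicted {pred_utah_def:.1f}, actual {actual_utah_def:.1f}")
```

The two surprise terms are what the fourth step weights when it adjusts the ratings.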
The Bulls Example
As an example of this entire method, let’s return to the four Bulls games that Jordan missed where they went 1-3 against fairly tough competition. (Note of 11/16/97: The numbers have been revised below due to a fix in the variance of the predicted rating.)
Game 1 at Boston
- Chicago’s offensive and defensive ratings during the ‘92-93 season were 110.8 and 104.2, respectively. Their first game without Jordan was against Boston in the Garden, so we are approximating Chicago’s ratings as 109.8 and 105.2 for this game. Boston’s season ratings were 106.7 and 105.8, which get adjusted to 107.7 and 104.8 for this home game.
- The predicted ratings for this game are 108.5 for Chicago and 106.8 for Boston: Chicago is the predicted winner. This uses the league average rating of 106.1.
- In reality, Boston beat Chicago 101-96. With a pace of 96.2 possessions, this means that Chicago’s actual offensive and defensive ratings were 99.8 and 105.0, respectively. The offense was worse, but the defense was actually slightly better than predicted.
- Using a prior variance of 20 for both Chicago’s ratings and the Celtics’ ratings, the variance of the expected ratings is about 40 [=(20*20 + 109.8^2*20 + 104.8^2*20)/(106.1^2) for the offense, =(20*20 + 105.2^2*20 + 107.7^2*20)/(106.1^2) for the defense]. In general, ratings fluctuate from game to game with a standard deviation of about 12 (or a variance of about 150). Hence, the Kalman weight is 0.2145 [=41/(41+150)]. The updated road offensive rating for the Bulls is then 107.9 [=109.8+0.2145(99.8-108.5)]. The variance on this new estimate is 15.7 [=(1-0.2145)*20], only a slight drop from before. For the defense, the updated rating is 104.8 [=105.2+0.2145(105.0-106.8)], a slight improvement. The variance decreased to 15.8.
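The update in the last step can be written as two small functions. This is a sketch of the arithmetic as the Game 1 numbers apply it, not a general Kalman Filter implementation; the function and variable names are mine. The variance formula is the standard variance of a product of two independent quantities, divided by the square of the league average.

```python
def prediction_variance(rating, rating_var, opp_rating, opp_var, league_avg):
    """Variance of rating * opp_rating / league_avg for independent ratings."""
    return (rating_var * opp_var
            + rating**2 * opp_var
            + opp_rating**2 * rating_var) / league_avg**2


def kalman_update(rating, rating_var, predicted, pred_var, observed, game_var):
    """Move the rating toward the observed game by the Kalman weight."""
    weight = pred_var / (pred_var + game_var)
    new_rating = rating + weight * (observed - predicted)
    new_var = (1.0 - weight) * rating_var
    return new_rating, new_var, weight


# Game 1 at Boston, Chicago's offense (road rating 109.8) vs. Boston's defense (104.8).
league_avg = 106.1
predicted = 109.8 * 104.8 / league_avg            # about 108.5
observed = 100.0 * 96 / 96.2                      # 96 points on 96.2 possessions
pred_var = prediction_variance(109.8, 20.0, 104.8, 20.0, league_avg)

new_off, new_off_var, weight = kalman_update(
    109.8, 20.0, predicted, pred_var, observed, game_var=150.0
)
print(f"weight {weight:.4f}, offense {new_off:.1f}, variance {new_off_var:.1f}")
```

A larger game variance shrinks the weight, so a single game moves the rating less; a larger prior variance does the opposite.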
Game 2 vs New York
- As mentioned above, Chicago’s offensive and defensive ratings during the ‘92-93 season were 110.8 and 104.2, respectively. Their second game was against New York at home, so we are approximating Chicago’s ratings as 111.8 and 103.2 for this game. New York’s season ratings were 104.4 and 98.1, which get adjusted to 103.4 and 99.1 for this game in Chicago.
- The predicted ratings for this game are 104.4 for Chicago and 100.6 for New York: Chicago again is the predicted winner.
- In reality, New York beat Chicago 104-98. With a pace of 88.7 possessions, this means that Chicago’s actual offensive and defensive ratings were 110.5 and 117.2, respectively. The offense was better, but the defense was much worse than predicted.
- Again using a prior variance of 20 for both Chicago’s ratings and the Knicks’ ratings, the variance of the expected ratings is about 39 [=(20*20 + 111.8^2*20 + 99.1^2*20)/(106.1^2) for the offense, =(20*20 + 103.2^2*20 + 103.4^2*20)/(106.1^2) for the defense]. Again with a score variance of 150, the Kalman weight is 0.2092 [=39.7/(39.7+150), roundoff differences]. The updated _home_ offensive rating for the Bulls is then 113.1 [=111.8+0.2092(110.5-104.4)]. The variance on this new estimate is 15.8 [=(1-0.2092)*20]. For the defense, since it played so poorly, the updated rating jumps from 103.2 to 106.6 [=103.2+0.2092(117.2-100.6)]. The variance of the Chicago defensive estimate decreased to 16.0.
Game 3 vs San Antonio
- Going into the second home game without Jordan, the Bulls’ offensive rating at home is 113.1 and their defensive rating is 106.6. Their opponent, San Antonio, had season ratings of 107.8 and 105.1, which get adjusted to 106.8 and 106.1 for this game in Chicago.
- The predicted offensive ratings for this game are 113.1 for Chicago and 107.3 for San Antonio: Chicago should win by about 6.
- In the game, San Antonio won 107-102, using 93.0 possessions. This means that Chicago’s actual offensive and defensive ratings were 109.7 and 115.1, respectively. It was a bad game for the Bulls on both ends, not by a terrible amount, but bad enough to lose a game they should have won.
- From the previous home game, our uncertainties in the Chicago offensive and defensive ratings are 15.8 and 16.0, respectively. The variances of the predicted offensive and defensive ratings are 38.6 [=(15.8*20 + 113.1^2*20 + 106.1^2*15.8)/(106.1^2)] and 36.4 [=(16.0*20 + 106.6^2*20 + 106.8^2*16.0)/(106.1^2)]. With a score variance of 150, the Kalman weight for the offensive estimate is 0.2045 [=38.6/(38.6+150)] and that for the defensive estimate is 0.1952 [=36.4/(36.4+150)]. The updated _home_ offensive rating for the Bulls is then 112.4 [=113.1+0.2045(109.7-113.1)]. The variance on this new estimate is 12.6 [=(1-0.2045)*15.8]. For the defense, the updated rating once again gets worse, going from 106.6 to 108.1 [=106.6+0.1952(115.1-107.3)]. The variance of the Chicago defensive estimate decreased to 12.8 [=(1-0.1952)*16.0].
Game 4 vs Dallas
- Going into the last home game without Jordan, the Bulls’ offensive rating at home is 112.4 and their defensive rating is 108.1. Their opponent, Dallas, had the worst season ratings of anyone in the league at 98.2 and 113.2, which get adjusted to 97.2 and 114.2 for this game in Chicago.
- The predicted offensive ratings for this game are 121.0 for Chicago and 99.0 for Dallas: Chicago should blow out Dallas by more than 20 points (depending on pace)….
- …And they did, winning 125-97, using 91.8 possessions. This means that Chicago’s actual offensive and defensive ratings were 136.2 and 105.7, respectively. The Bulls exceeded offensive expectations, but allowed Dallas a few points extra.
- From the previous home game, our uncertainties in the Chicago offensive and defensive ratings are 12.6 and 12.8, respectively. The variances of the predicted offensive and defensive ratings are 37.0 [=(12.6*20 + 112.4^2*20 + 114.2^2*12.6)/(106.1^2)] and 31.6 [=(12.8*20 + 108.1^2*20 + 97.2^2*12.8)/(106.1^2)]. With a score variance of 150, the Kalman weight for the offensive estimate is 0.1980 [=37.0/(37.0+150)] and that for the defensive estimate is 0.1738 [=31.6/(31.6+150)]. The updated _home_ offensive rating for the Bulls is then 115.4 [=112.4+0.1980(136.2-121.0)]. The variance on this new estimate is 10.1 [=(1-0.1980)*12.6]. For the defense, the updated rating once again gets worse, going from 108.1 to 109.2 [=108.1+0.1738(105.7-99.0)]. The variance of the Chicago defensive estimate decreased to 10.6 [=(1-0.1738)*12.8].
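To see how the three home games chain together, here is a short Python sketch that runs the same update game after game. It makes the assumptions stated in the worked example: a league average of 106.1, a game variance of 150, and a prior variance of 20 for the Bulls and for every opponent. Only the Bulls’ ratings are updated; opponent ratings are held at their road-adjusted season values. The names are mine, and unrounded intermediate values are carried forward, so results can differ from hand calculations in the last digit.

```python
LEAGUE_AVG = 106.1  # league average rating used in the Bulls example
GAME_VAR = 150.0    # game-to-game variance of a single-game rating
OPP_VAR = 20.0      # prior variance assumed for every opponent rating


def update(rating, var, opp_rating, observed):
    """One filter step for a rating matched against an opposing rating."""
    predicted = rating * opp_rating / LEAGUE_AVG
    pred_var = (var * OPP_VAR
                + rating**2 * OPP_VAR
                + opp_rating**2 * var) / LEAGUE_AVG**2
    weight = pred_var / (pred_var + GAME_VAR)
    return rating + weight * (observed - predicted), (1.0 - weight) * var


# (opponent road offense, opponent road defense, Bulls points, opponent points, possessions)
home_games = [
    (103.4, 99.1, 98, 104, 88.7),    # New York
    (106.8, 106.1, 102, 107, 93.0),  # San Antonio
    (97.2, 114.2, 125, 97, 91.8),    # Dallas
]

# Bulls' home ratings before the first home game without Jordan.
offense, off_var = 111.8, 20.0
defense, def_var = 103.2, 20.0

history = []
for opp_off, opp_def, pts_for, pts_against, poss in home_games:
    offense, off_var = update(offense, off_var, opp_def, 100.0 * pts_for / poss)
    defense, def_var = update(defense, def_var, opp_off, 100.0 * pts_against / poss)
    history.append((offense, defense, off_var, def_var))
    print(f"off {offense:.1f} (var {off_var:.1f}), def {defense:.1f} (var {def_var:.1f})")
```

Changing `GAME_VAR` or the starting variances shows how sensitive the final ratings are to the two subjective parameters.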
The above calculations depend on two somewhat subjective parameters: the variance in the prior estimate of all ratings (which I set to 20) and the variance of the game ratings (which I set to 150). The variance of the game ratings is quite consistent with my records. The variance of the prior estimates states how sure we are of those estimates; since we are not sure of these estimates due to Jordan’s absence, I set these relatively high. Varying these parameters changes how strongly each game moves the ratings.
Results with these parameters (updated offensive rating, updated defensive rating):
Bulls at Home
- After G1 vs Knicks (L, 110.5-117.2 in ratings): 113.1, 106.6
- After G2 vs Spurs (L, 109.7-115.1 in ratings): 112.4, 108.1
- After G3 vs Mavs (W, 136.2-105.7 in ratings): 115.4, 109.2
Bulls on the Road
- After G1 vs Celtics (L, 99.8-105.0 in ratings): 107.9, 104.8
Even though not many games were played, we can already get some idea that the Bulls were not as good without Jordan. The offense went up slightly, but not enough to be certain about, and the defense went down quite a bit. For that season, my numbers had Armstrong’s offense being just as efficient as Jordan’s, but his defense being considerably worse, so this Kalman result is consistent with that. Overall, these few games indicated that the Bulls’ expected winning percentage dropped from about 0.762, which translates to additional losses over the course of a season. The drop seems small to me based on the difference in talent between Jordan and Armstrong, but seems about right given the Bulls’ performances in the games he missed, which is the only information the method uses.
This raises the issue of uncertainty. These four games cannot present a perfect picture of the difference between Armstrong and Jordan. Just the noise of basketball (players getting hurt, teams playing back to back nights, Dennis Rodman “not being interested”) prevents us from being sure about _any_ rating. The Kalman Filter “knows this” and tells us roughly how sure we should be with the ratings it gives us. With the parameters above, our final variances in the offensive and defensive ratings of the Bulls at home are 10.1 and 10.6, respectively. These have gone down by about half from our prior estimate of 20, so we feel only somewhat more confident about the estimate than before. But it gives us a foothold for other comparisons we might make between Jordan and Armstrong.
(Technical remark: I could have estimated the prior Bulls’ offensive and defensive ratings differently, for instance, by just using those games in which Jordan played. The prior ratings are the ‘null hypothesis’ we are testing against, as traditional statisticians phrase it; our null hypothesis would then have been that not having Jordan made no difference in the Bulls and we were seeing if we could disprove this hypothesis.)
This Kalman Filter is a powerful tool for evaluating situations where strength of opponents is important. This is actually quite common in basketball, where teams don’t play a fully balanced schedule, some teams certainly playing a more difficult schedule than others, even over the course of an entire season. I hope to use it quite a bit, though it is still a little labor intensive for me to implement.
There are a couple of weaknesses of the filter that I will mention here at the end. First, the reason I hadn’t really introduced it before was that I never saw it as a very good predictor when someone like Jordan was missing from a team. Because the filter looks only at teams, it cannot account for teams that change, like when a player is injured. When people put out team ratings, what are they really measuring if significant players miss a few games? We know that significant players make a difference in our predictions, but methods like this don’t explicitly account for those players’ absence or presence. I took advantage of this “weakness” above by turning it around and using the method to identify the difference between the Bulls with Jordan and the Bulls without Jordan.
A second weakness is also a strength. The Kalman Filter’s general applicability (to other fields) is great, but it also means that the filter does not come with the details of those fields built in. I had to build a simple model to “predict” basketball games to use in the Kalman Filter. This simple model is not precisely what happens in basketball; a more complex model may be more accurate, but then it becomes much more difficult to implement in a Kalman Filter.
Finally, this method has the weakness that it says that a team that blows out another team always improves its overall rating. Unless you read Can the Bulls Be Perfect?, you are probably wondering “How is that a weakness?”. A recent finding I made in writing that article indicated that a blowout doesn’t necessarily make you a better team and can actually imply that you’re not as good. This was a very unusual result, but one that I cannot dismiss. I also think that it can be built into the Kalman Filter. The thoughts on how to do that will have to wait a while because they are technical enough that most people won’t want to hear them. Besides, this article is long enough.
In trying to end on a positive note, I want to mention that this method holds a key to defensive ratings. Because good defensive players are often assigned to guard the best players, their defensive numbers may not look very good unless we take into account the quality of the players they have to guard. Doug Steele does something like this in his defensive ratings, but he has indicated to me that it is a lot of work. Hopefully, this is an easier way to do it.
Kalman Filter References
A reference on the history of the Kalman Filter is this military page. The military does use Kalman Filters for a lot, so they should know about it.
Another reference for the Kalman Filter is this fairly technical paper by two people from North Carolina. I found this paper to be very useful to refresh my memory on this topic. If you know the Kalman Filter well, this paper is too trivial for you. If you don’t know it and are not technically inclined, this paper is probably too advanced, but the example is still pretty good.
Most importantly, I owe thanks to Dick Donald for introducing me to this topic many years ago. Second, I want to thank George Pinder for reviving my need to know this stuff and one of his students, Graciela Herrera, for helping me to relearn it quickly. Finally, I make a second mention of this University of North Carolina paper, by Welch and Bishop, who did a good job with it. I hope that this work adequately reflects these people’s abilities to teach it.