What is WAR? Everything you need to know about hockey's 'wins above replacement' revolution


By Arvind V. Shrivats

We’re finally back into hockey season after a slow couple of months of actual hockey news, during which the hockey community was all worked up about the latest #fancystats. In particular, there was a lot of noise around wins above replacement (WAR), which actually isn’t new to hockey (and certainly isn’t new to other sports), but is now more than ever monopolizing the discussion on the direction of stats in hockey.


So given all the controversy, I figured it was time to answer some questions about the war on WAR.

Hang on, I was actually out enjoying my summer – what’s been going on with WAR?

There was a roundtable discussion at The Athletic with James Mirtle, Tyler Dellow, and Matt Cane which, to put it lightly, was … incendiary. The Athletic followed that up with some great discussions with Brian MacDonald (formerly of the Florida Panthers) and Michael Schuckers (a statistics professor at St. Lawrence University).

There’s probably been more public coverage on WAR in the last few months than in the previous three years. I think, by and large, people understand what WAR is, but there seems to be a great deal of uncertainty around exactly how WAR is calculated, and that can translate into a poor understanding of its advantages, limitations, and how to best interpret it. Consider this an attempt to help shed some light on what’s under the hood of WAR, to help drive some more productive discussions around it. 

All right, fine – let’s level-set a little. What is WAR?

At its core, WAR is a one-number estimate of the value (in wins) an individual player provided over a “replacement level player.” People often use the term wins above replacement (WAR) interchangeably with goals above replacement (GAR), and justifiably so. Hockey models typically calculate a player’s GAR, and use a translation factor (roughly five goals per win) to convert that to WAR. There are parallels to baseball, where player contributions to wins can be attributed to runs.
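Since that translation is just a division, a minimal sketch looks like this (the five-goals-per-win factor is the rough rule of thumb cited above, not a fixed constant any specific model uses):

```python
# Convert goals above replacement (GAR) to wins above replacement (WAR).
# The ~5 goals-per-win factor is an approximation; real models estimate
# the translation from the league's scoring environment.
def gar_to_war(gar: float, goals_per_win: float = 5.0) -> float:
    return gar / goals_per_win

# A player worth 15 goals above replacement is worth about 3 wins.
print(gar_to_war(15.0))  # → 3.0
```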

What is a replacement level player?

It’s a broad term to describe the kinds of players a team can acquire and play at a moment’s notice. These are players who shuttle through waivers every year, or are emergency call-ups from the AHL. Some examples would include Ben Smith and Jaycob Megna — good players in the AHL who become limited at the NHL level.


Why is WAR controversial?

Any new stat generally faces some resistance as it makes its way into the mainstream, but WAR is interesting in that it’s contentious even among the stats community.

Many in the hockey analytics community believe that the methods by which we are currently calculating WAR are flawed (or relying on flawed assumptions), and that the data we currently have available to us isn’t rich enough to properly solve the problem of how to fully capture player value. This is valid, and we’ll get into that in a bit.

With the fans and media, though, WAR also does itself no favours due to its complexity. When shot attempts (i.e., Corsi, Fenwick) were first introduced, they were seen as pretty controversial, but I think we can all agree that a motivated individual could probably understand the spirit and math behind the stat and why it might be important. WAR is much more opaque, and there’s a learning curve to fully understand the math.

All right, so what WAR models are out there today?

At this time, the two most commonly referenced public WAR models are by Corsica and EvolvingWild. These models are similar in some ways and very different in others, but broadly speaking they both attempt to do the same thing: determine the value of a team’s on-ice play, attribute it to the individual players on the ice, and tether that to what a replacement level player would have done.

There are other hockey models out there (such as Dom Luszczyszyn’s Game Score Value Added, Chace McCallum’s WAR, among others), but let’s focus on these two for now.

Before we get into WAR too much, how is it different from the other player evaluation tools?

There are a ton of differences, but the fundamental one is that WAR attempts to come up with a one-number, all-encompassing representation of player value. Many evaluative mediums, such as HERO or SKATR charts, attempt to provide insight into various components of player value, but they keep them separate, leaving the question of how to weight them unanswered.


So who actually builds these models?

Depending on your perspective, this is the coolest part. There’s an emerging trend of public hockey analytics being pushed forward by fans and hobbyists (“fanalysts,” as they’re called), and that’s very much the case here. Emmanuel Perry (the founder of Corsica) is a Senators fan, while Josh and Luke Younggren (the people behind the pseudonym EvolvingWild) are Wild fans. Perry has long been at the forefront of hockey analytics, and is a strong data scientist, competing and placing highly in a number of Kaggle (model building) competitions. The Younggrens are also veterans of the hockey analytics scene, having contributed their knowledge to aging curves, re-defining relative teammate metrics, and aggregate statistics, among other areas.

Okay, beyond them being smart and mathy, how do they do it?

The short answer is that it’s model-specific. The best resource here is the creators themselves. Perry has published his WAR model, complete with the code he used to run it. EvolvingWild introduced their model quite recently at the Rochester Institute of Technology Sports Analytics Conference (RITSAC) and while there is no detailed write-up, their slides are here and here.

Any attempt I make to explain the inner workings of these models is going to be an oversimplification, but I do think there’s still value in that. Corsica and EvolvingWild WAR models don’t work in the same way, but they share the critical common thread of being regression-based. Regression is a statistical tool to estimate the relationship between a target variable (for example, goals for rate or Corsi for rate) and a set of explanatory variables (such as the players on the ice and contextual factors like score, game state, and venue). Regressions find the impact of a change in each explanatory variable on the target, holding all other explanatory variables constant.

Micah McCurdy (creator of HockeyViz) had an excellent description of regression in an article by Daniel Wagner of the Vancouver Courier:

“Regression is just drawing a line of best fit through a bunch of points,” said McCurdy. “If the points were on a piece of paper a dextrous four-year-old could do a job which wouldn’t be markedly worse than you’d do with a computer. A typical NHL season has about nine hundred different skaters in it and you probably want to model defence and offence separately, so when you move from two dimensions to eighteen hundred dimensions, the computer starts to help a lot. But it’s still just drawing a best fit line through data.”

In the context of a model, the benefit of regression is that we’re able to find player impacts on components of gameplay that we value, and essentially strip out the effect of every other player on the ice and other contextual factors included in the regression.

To bring this back to hockey, a simple regression model could attempt to explain scoring rates in a given year, as a function of scoring rates in a prior year. Here, the target variable could be 5v5 points per 60 minutes in year X and an explanatory variable could be 5v5 points per 60 minutes in year X – 1. Seeing that there’s a significant positive correlation in the plot can justify using prior scoring rates as an explanatory variable in a model trying to predict future scoring rates.
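A toy version of that single-variable regression, using synthetic scoring rates in place of real NHL data, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 5v5 points/60 for 100 players: year X is built from year X-1
# plus noise, mimicking the persistence described above.
pts_prior = rng.normal(1.8, 0.5, 100)                       # year X-1 rates
pts_now = 0.7 * pts_prior + 0.5 + rng.normal(0, 0.3, 100)   # year X rates

# Least-squares line of best fit: pts_now ≈ slope * pts_prior + intercept
slope, intercept = np.polyfit(pts_prior, pts_now, 1)
print(slope)  # positive: prior scoring rates predict future scoring rates
```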

As mentioned earlier, WAR models attempt to aggregate value from multiple components of hockey gameplay. A defining question of a WAR model then becomes: what components of hockey gameplay do I want to build into my model? To understand any model, it’s helpful to break down these components.


Corsica is structured around five player components: impact on shot rates (for and against), impact on shot quality (for and against), shooting talent, propensity to draw or take penalties, and propensity for zone transitions between whistles. EvolvingWild is structured across three game states: even strength, power play, and short-handed. All facets of play are included within each game state, though they aren’t explicitly split out.

Let’s take a more detailed look at exactly how these components are calculated, considering Corsica WAR first.

Corsica – Impact on shot rates:

This component attempts to identify players that provide the most value through their ability to drive unblocked shots and tilt possession in their team’s favour (think Patrice Bergeron). Here, Perry uses regression to model the time until a shot (defined as an unblocked shot attempt) occurs for the home team and away team, in separate regressions. The data is broken up into ‘observations,’ which for this component, are defined as uninterrupted periods of play where all players on the ice remain constant.

Perry’s target variables are the length of the observation and a binary response if there was a shot by the home team during the observation (1 if yes, 0 if no). This effectively models the shot rate for the home team during the observation. The explanatory variables are the players on the ice in that observation and a set of contextual factors (e.g., score, zone start, the number of skaters on each team). We can do the same for the away team shots and ultimately, the regressions essentially spit out a relationship between each player on both teams and their impact on shot rates (both for and against).
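To make the setup concrete, here is a toy sketch of how observations might be encoded into a design matrix. The player names, values, and +1/-1 encoding are illustrative assumptions, not Perry’s actual implementation:

```python
import numpy as np

# Each row is one 'observation': a stretch of play with no line change.
# Columns are indicators for (hypothetical) players: +1 for home skaters
# on the ice, -1 for away skaters, 0 for players on the bench.
players = ["home_A", "home_B", "away_C", "away_D"]
observations = [
    {"on_home": ["home_A"], "on_away": ["away_C"], "seconds": 40, "home_shot": 1},
    {"on_home": ["home_B"], "on_away": ["away_D"], "seconds": 25, "home_shot": 0},
    {"on_home": ["home_A", "home_B"], "on_away": ["away_C"], "seconds": 55, "home_shot": 1},
]

X = np.zeros((len(observations), len(players)))
y = np.array([obs["home_shot"] for obs in observations])      # shot occurred?
exposure = np.array([obs["seconds"] for obs in observations])  # obs length

for i, obs in enumerate(observations):
    for p in obs["on_home"]:
        X[i, players.index(p)] = 1.0   # home skaters can push shot rate up
    for p in obs["on_away"]:
        X[i, players.index(p)] = -1.0  # away skaters can suppress home shots

print(X.shape)  # (3, 4): one row per observation, one column per player
```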

As an aside, this naturally leads to questions around how to separate the effects of players that are frequently on the ice at the same time, and don’t play much without each other (the Sedins are a good example). There’s a term for this in stats called collinearity. Perry uses a statistical tool called regularization that helps with this problem (notice I say ‘helps’ and not ‘solves’) by biasing estimated shot impacts towards 0, which helps ensure the model doesn’t over-fit to a small sample of data where two players are away from one another. The thinking here is that this reduces the error of the model when generalizing to data it’s never seen before.
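Here is a minimal numpy illustration of that shrinkage effect, using a closed-form ridge estimator and two hypothetical linemates who are almost never apart. This is a sketch of the idea, not Corsica’s actual code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two (hypothetical) linemates who are almost always on the ice together:
# their indicator columns are nearly identical, so OLS struggles to tell
# them apart.
n = 200
together = rng.integers(0, 2, n).astype(float)
p1 = together.copy()
p2 = together.copy()
p2[:3] = 1 - p2[:3]  # apart for only a handful of shifts
X = np.column_stack([p1, p2])
y = 1.0 * p1 + 0.0 * p2 + rng.normal(0, 0.5, n)  # only player 1 adds value

def ridge(X, y, alpha):
    # Closed-form ridge: (X'X + alpha*I)^-1 X'y; alpha=0 recovers OLS.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

ols = ridge(X, y, 0.0)
reg = ridge(X, y, 10.0)
# Ridge pulls the estimates toward 0, trading a little bias for much lower
# variance on the near-collinear columns.
print(ols, reg)
```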

In a hockey context, consider this an advanced version of watching every shift, and recording who is on the ice when good or bad things (shots for or against) happen to a team. The players who are on the ice for many good things and few bad things will have the best estimated impacts. Since computers can hold much more in memory than humans can, their accounting of who was on the ice for good and bad things is better than ours.

Corsica – Impact on shot quality:

This component attempts to model the impact players have in generating higher quality chances for themselves or their linemates (think Connor McDavid) or their ability to suppress the opposition from doing the same (think Ryan Suter). Perry uses linear regression where the target variable is the probability of an unblocked shot becoming a goal (using Corsica’s proprietary expected goals model), and each observation is an unblocked shot. The explanatory variables are consistent with impact on shot rates and again, Perry uses regularization to reduce overfitting. This regression provides estimates of the impacts of players on the expected goals added (or taken away) both offensively and defensively.


Corsica – Shooting Talent:

This component represents the impact that players may have to consistently score above their expected goals (think Patrik Laine). Perry uses logistic regression on all unblocked shots from every player, with the target variable being binary (1 if a goal is scored, 0 otherwise). The explanatory variables are the shooter, the goaltender, the expected goal rate (xG) of the shot, and the contextual factors we mentioned before. The estimated impact for the player can be interpreted in terms of goals added above what was expected given their shot quality.
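A toy version of this kind of shooting-talent regression, fit by Newton’s method on synthetic shots. The “sniper” effect and all numbers are invented for illustration; this is not Perry’s specification:

```python
import numpy as np

rng = np.random.default_rng(2)

# Each row is an unblocked shot; features are the shot's xG (on the
# log-odds scale) and an indicator for a hypothetical sniper who converts
# above expectation (true effect +0.8 on the log-odds scale).
n = 2000
xg = rng.uniform(0.02, 0.3, n)
is_sniper = rng.integers(0, 2, n).astype(float)
true_logit = np.log(xg / (1 - xg)) + 0.8 * is_sniper
goal = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-true_logit))).astype(float)

X = np.column_stack([np.ones(n), np.log(xg / (1 - xg)), is_sniper])

# Fit logistic regression by Newton's method (a few iterations suffice).
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (goal - p)
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

sniper_coef = beta[2]
print(sniper_coef)  # positive: goals added above what xG alone expects
```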

Corsica – Penalty Impact:

Skaters have their time on ice divided into distinct game states based on skater advantage, score advantage, venue, and the cumulative imbalance in penalties awarded within that game. Player TOI is tallied for each game state, along with the number of penalties drawn or taken. Perry runs two Poisson regressions (for penalties drawn vs. taken), with the players and game state combinations as explanatory variables, and penalties drawn/taken as the target variables.
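A minimal sketch of a Poisson regression for penalties drawn, with TOI as an exposure offset. The setup and numbers are illustrative assumptions, not Perry’s actual specification:

```python
import numpy as np

rng = np.random.default_rng(3)

# Penalty counts modeled from player TOI (as an exposure offset) and an
# indicator for a hypothetical pest who draws penalties at roughly
# double the base rate (true effect +0.7 on the log scale).
n = 500
toi_hours = rng.uniform(5, 20, n)            # TOI in a game-state bucket
is_pest = rng.integers(0, 2, n).astype(float)
base_rate = 0.3                              # penalties drawn per hour
drawn = rng.poisson(base_rate * toi_hours * np.exp(0.7 * is_pest))

X = np.column_stack([np.ones(n), is_pest])
offset = np.log(toi_hours)

# Fit log(E[drawn]) = offset + X @ beta by Newton's method.
beta = np.zeros(2)
for _ in range(30):
    mu = np.exp(offset + X @ beta)
    grad = X.T @ (drawn - mu)
    hess = X.T @ (X * mu[:, None])
    beta += np.linalg.solve(hess, grad)

pest_coef = beta[1]
print(pest_coef)  # recovers a value close to the simulated 0.7
```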

Corsica – Zone Transitions:

This component only has a minor impact on player WAR, but briefly, it attempts to model the value provided or lost by a player who transitions the puck to a more or less advantageous zone for a faceoff. For example, a player who takes a defensive zone faceoff and converts it to an offensive zone faceoff by some means has put his team in a better situation to win.

Ultimately, Perry obtains estimates on goals contributed and lost for each player across these five components and from there, it’s as simple as adding the results for a total GAR, which can be converted to WAR. There is more work to do in terms of defining replacement level, but for that, I will refer you to Perry’s write-up.

Whew, that took a while. Okay, now onto EvolvingWild’s WAR model.

EvolvingWild:

EvolvingWild takes a different approach using two levels of regression within each game state (even strength, power play, short-handed). It is very similar in concept to Box Plus-Minus (BPM), which is a popular stat in NBA circles.

Using data spanning 2007-2018 (the time span for which we have reasonable NHL data), they run a weighted regression where each observation is a period of play where no substitutions occur (let’s call this a shift). Regressions are run for both offence and defence and the weights on each observation correspond to shift length.


The target variable for the offensive regression is goals for rate, while the target variable for the defence is expected goals against rate (EvolvingWild has their own xG model, separate from Corsica). Like many of the regressions detailed above, the explanatory variables include contextual factors like score, venue, zone start, and so on, as well as the players on the ice. In their presentation at RITSAC, the Younggrens had the following graphic, which explains the setup of this regression well.
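The shift-weighted ridge setup they describe can be sketched in a few lines of numpy. This is a toy with random shifts and invented player impacts, not the actual RAPM implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

# Rows are shifts, columns are player indicators (all hypothetical), the
# target is a goals-for rate, and each shift is weighted by its length.
n_shifts, n_players = 400, 6
X = rng.integers(0, 2, (n_shifts, n_players)).astype(float)
true_impact = np.array([0.5, 0.2, 0.0, 0.0, -0.2, -0.5])  # goal-rate impact
shift_len = rng.uniform(20, 90, n_shifts)                  # seconds
y = X @ true_impact + rng.normal(0, 1.0, n_shifts)

def weighted_ridge(X, y, w, alpha):
    # Shift-length weights enter as W in (X'WX + alpha*I)^-1 X'Wy.
    XtW = X.T * w
    return np.linalg.solve(XtW @ X + alpha * np.eye(X.shape[1]), XtW @ y)

rapm = weighted_ridge(X, y, shift_len, alpha=50.0)
print(rapm)  # regularized estimates of each player's goal-rate impact
```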

Source: Younggren, Younggren (2018)

In this regression, they also use regularization to combat collinearity (again, note that regularization doesn’t always completely alleviate the issue).

The output of this regression is what the Younggrens refer to as Regularized Adjusted Plus Minus (RAPM), which measures the goal impact of players accounting for teammates, competition, and usage (it’s similar in method to the shot rates regression in Corsica’s WAR). This is obtained for even strength (offence and defence), power play (offence only), and short-handed (defence only) game states, with explanatory and target variables being consistent across.

A second set of regressions is run to map a player’s long-term individual stats (such as goals, assists, on-ice stats, and many others) to their impacts from the first regression. This creates a function that can map a player’s individual stats over any arbitrary subset of data into a set of predicted RAPMs (for even strength offence and defence, power play offence, and short-handed defence). RAPM represents a context-adjusted goal impact, so getting to wins above replacement simply requires a definition of replacement level and some subtraction. This second set of regressions also helps mitigate the collinearity we discussed before, and stabilizes the model outputs.
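A stripped-down sketch of that second stage, with synthetic stats and RAPM values standing in for the real inputs:

```python
import numpy as np

rng = np.random.default_rng(5)

# Given first-stage RAPM values for many players, learn a linear map from
# long-run individual stats to RAPM, then apply it to a new sample.
# All numbers are synthetic; the three stat columns are hypothetical
# stand-ins for things like G/60, A1/60, shots/60.
n_players = 300
stats = rng.normal(0, 1, (n_players, 3))
weights_true = np.array([0.4, 0.3, 0.1])
rapm = stats @ weights_true + rng.normal(0, 0.2, n_players)

# Stage two: regress first-stage RAPM on individual stats.
coef, *_ = np.linalg.lstsq(stats, rapm, rcond=None)

# Predicted RAPM from a player's stats over any arbitrary subset of games.
new_player_stats = np.array([1.0, 0.5, 0.0])
predicted_rapm = new_player_stats @ coef
print(predicted_rapm)
```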

What are reasonable benefits and criticisms of each model?

Corsica:

I think Corsica’s WAR model is holistic and well-reasoned, and it uses a first-principles definition of value, determined entirely by on-ice results; I really like that approach. You may have noticed that it never uses points as an input anywhere (except implicitly with the shooting talent component), which distinguishes it from other WAR models, including EvolvingWild; some see that as a bug, but I see it as a feature. A hypothetical player that got no points but guaranteed a 100 percent CF% would be the most valuable player in existence, and this model would pick that up. I also like the modular, component-by-component design that provides a sense of where players shine and where they don’t.

In getting to an output, however, there are some modeling decisions that I don’t fully agree with. My main one is around power play and short-handed impacts, which I would like to see explicitly split out. To be clear, skater advantage is included in each WAR component, but its contributions aren’t transparent. Essentially, it’s unclear how much of a player’s WAR comes from even strength play versus special teams and since players play different proportions of their time at even strength versus special teams, comparing two players’ WAR is not always fair.


Moving from a theoretical concern to a practical one, Perry’s WAR has been controversial among many because of the unexpected and unintuitive outputs. At a high level, a list of Corsica WAR leaders just doesn’t map to who we’d expect to see there. That alone isn’t a reason to dismiss a model, but it’s a reason to look into it. Looking deeper into the model, it seems that the bar for replacement level is unusually high. Perry’s barometer of replacement level is players who made the NHL minimum salary, which sounds reasonable, though the outputted WAR numbers for 2017-2018 indicate that only 120 defencemen had a positive WAR. Given that there are about 180 defencemen playing in the NHL at any time, it implies that a third of defencemen in the league are actually worse than replacement level (if only by a small amount), which raises some red flags to me.

I did some digging, and noticed that Corsica WAR generated by shooting talent is almost universally negative for defencemen. In 2017-2018, only two defenders had a positive WAR from shooting talent (Kevin Connauton and Alex Goligoski) and in 2016-2017, only eight defenders achieved the feat. Something feels off here, as I wouldn’t expect a minimum salary defender who I call up from the AHL to immediately become one of the best shooters among defencemen in the league. Ignoring the shooting talent component, 150 defencemen have positive WAR, which essentially implies that the sixth defenceman on each team is about replacement level. That seems much more reasonable. I am not entirely sure of the cause of this peculiarity, but it merits further investigation.

EvolvingWild:

To start off, I love the idea of RAPM; as a context-independent stat that tells us which players are driving play, I think it is excellent. While they only use the version of RAPM with goals for as the target variable for offence and expected goals against as the target variable for defence, they have produced RAPM values for other target variables (such as expected goals and Corsi) on their website, EvolvingHockey.com. I find these figures tremendously useful in general, even if they’re not explicitly used in their WAR calculation.

A major strength of this model is that it can produce WAR estimates for any arbitrary time frame, due to the regression that maps individual player stats to RAPM (you could theoretically find RAPM for arbitrary samples too, but it would be unstable). I also enjoy that it splits out WAR by strength state, allowing a user to make an apples-to-apples comparison of multiple players’ value. These features are not present in Corsica WAR. All in all, I think it is a well thought out model, where I can understand the decisions made and why they made them.

My concerns regarding EvolvingWild’s model are more conceptual in nature. RAPM makes a lot of sense in the NBA, even over the course of a single season given the relative level of noise in basketball data. Simply put, it’s rare for an elite player in the NBA to get outscored over the course of the season and when it does happen, it’s usually the result of unsustainably hot or cold shooting runs.

Hockey, however, is higher variance (given fewer events/goals) and since RAPM is based on goals (at least offensively), EvolvingWild often deems ‘strong’ offensive players who suffer from low shooting percentages in a given season as rather weak (and vice-versa). As a result, good players who are unlucky will see their WAR suffer, and bad players who are fortunate will see the opposite. This is not as much of an issue with Corsica’s WAR, as the only component that factors in goals is Shooting Talent.

As an example, in 2017-18, Sidney Crosby had a negative offensive RAPM. This isn’t the model misbehaving – it’s by design. Crosby didn’t drive goals for in 2017-18 – the Penguins goal rate at 5v5 was worse with him on the ice than off. However, the main culprit of this was an on-ice shooting percentage of just 8 percent (Crosby’s career average is 10.6 percent). Since EvolvingWild’s model regresses individual statistics to predict RAPM, and uses predicted RAPM to get to WAR, Crosby only has a barely-positive WAR at even strength (with his individual scoring likely making up for his poor goal impacts).


This is a case of the model working as intended and in a descriptive sense, it’s hard to argue a certain player provided high value in a season where they got outscored on the ice. But knowing what we know about hockey, Crosby likely wasn’t the reason the Penguins weren’t great when he was on the ice. The most likely culprit is ‘shit happens.’

This itself is not a deal breaker. If you’re attempting to make a descriptive statistic (as the Younggrens are), it is reasonable to want the outputs to align with who actually outscored the competition over the time-span you’re calculating WAR over. Furthermore, taking a larger sample of WAR can alleviate this issue, as many unsustainable runs of good or bad luck will revert to normal given a larger sample size.

However, it means their WAR is not a measure of the best players, but rather the players who had and contributed to the best on-ice results in a given time period. This choice is not inherently wrong, but it must be kept in mind when interpreting the outputs of the model, particularly in single-season (and smaller) samples.

Who are players each model likes and dislikes, and why?

Corsica:

Generally, elite shooters are rated very highly by Corsica’s model. This makes sense, as the ability to consistently turn shots into goals at an above average rate is one of the rarest and most valuable skills in hockey. Atop a single-season WAR list, you’ll generally find players on (potentially unsustainable) shooting benders. William Karlsson is an obvious example. He has legitimate talent, but I think most would feel he’s not a proven elite shooter just yet; rather, he’s a player who experienced an extreme jump in his shooting percentage. Nonetheless, his 2017-18 Corsica WAR credits him for his ability to beat goaltenders at an elite rate.

On the flip side, I’d say that elite passers are somewhat undervalued (Joe Thornton and Nicklas Backstrom, as examples). The model attributes the ability to outperform the expected value of a shot to the shooter and not the passer, and it’s likely that some elite passers create chances that are undervalued by expected goals.

EvolvingWild:

As covered above, strong offensive players who have poor shooting percentage in a given year will look mediocre. In addition to Crosby, Mikael Backlund is a very obvious example. The inverse is true as well – Michael Raffl looks better than you’d expect, in part because of his inflated percentages.

Additionally, EvolvingWild WAR suffers from the lack of good defensive boxscore stats. As a result, some players’ defensive prowess won’t be fully recognized by the model that translates individual stats into predicted defensive RAPM measures.


Any other criticisms worth mentioning?

A lot of what we’ve discussed aren’t really criticisms – just context that I believe makes WAR more useful. With that in mind, though, there are a few other things to keep in mind when interpreting the value of WAR. They include, but are not limited to, the following:

  • Models are not objective and reflect the biases of their makers. If you disagree with modeling assumptions or decisions, you’re probably not going to like the output. And understanding how a model works tends to go a long way toward explaining outputs that may be unintuitive.
  • The player impacts that arise from regressions are estimates and have ‘error bars’ around them that we don’t typically see. In interpreting WAR results, we need to be respectful of the variance of these estimates. The common rule of thumb is that results within one win of one another are within the ‘margin of error,’ but in general, the variance of these estimates is not published. As a side note, Tyrel Stokes, a PhD student in statistics at McGill, recently wrote about how to formally test whether the difference between two players’ WAR is statistically meaningful, which is pertinent to this discussion.
  • Hockey is incredibly high variance (duh), so results can be noisy from season to season.
  • Despite the fancy math used to mitigate collinearity, it’s extremely hard to properly allocate value to teammates who spend the vast majority of their time on ice together and very little of it apart.
  • There are edge situations that are not captured by current models due to the unique property of hockey in allowing changes on the fly (see here and here for an example). It is worth noting that the precise quantitative impact of these situations has not been meaningfully established, so this is more of a pre-emptive concern than anything.

Given all of these constraints on using WAR results, does it still have value?

Entire articles could be spent on this issue right here, and you could argue this entire article is an attempt to help set up this discussion. I’ve mentioned a lot of potential limitations to how we use these stats, and a natural question is whether they’re so significant that they make the strengths irrelevant. In my opinion, they are not. The limitations of WAR are real, but every other stat in the world suffers from significant limitations of its own.

These limitations simply mean that WAR isn’t perfect, and that there’s a responsibility in how we use it. That doesn’t invalidate the stat, or make it devoid of value. There’s a common saying in statistics that all models are wrong, but some are useful. I believe that applies here.

To me, the value in WAR is that it’s really the only public, published stat that attempts to account for all components of player value independent of teammates, competition, and usage. It provides a reasonable interpretation and justification for player value relative to one another, and as such, is an excellent starting point for assessing player performance.

It’s even more useful when you consider the modular design discussed above, which lets us explore where players shine and how that has changed over time. WAR, for example, could really let us analyze player aging curves across various components, and in the aggregate.

A natural application of this would be to determine whether the distribution of NHL roster spots aligns with what these WAR models imply about aging.

Thinking about age this morning. Here’s the ~breakdown by age of NHLers on rosters for next season. The bar colours represent median goals above replacement value for each age. pic.twitter.com/WZYgRjOLyz

— Sean Tierney (@ChartingHockey) August 30, 2018

WAR also provides us with insight into the distribution of player value across the league. In Corsica, Perry notes that the truly elite forwards seem to distinguish themselves from the rank and file more through offensive WAR than defensive. Put another way, being elite offensively is more valuable than being elite defensively for forwards, which if true, could have implications in roster construction.



Overall, I think the key to using these models is to understand, at least at a basic level, what goes into them and what they’re trying to do. As the hockey community starts to get a better handle on that, my optimistic take is that the discussions we have will elevate a level or two, and more closely mimic what we see in baseball or basketball. Sophisticated mathematical tools have been used to revolutionize just about every industry in existence. It’s kind of crazy to assume that this won’t be the case for hockey. The trick will be in understanding what exactly those tools are doing, and the decisions that are being made in using them. My hope is that this piece aids in the interpretation and discussion of these stats.

Acknowledgements

Thank you to Josh and Luke Younggren and Emmanuel Perry for their tremendous insight and help with this piece.

(Top photo credit: Bruce Bennett/Getty Images)

