Mark Glickman

Lichess ratings are not Glicko-2

12 Oct 2024 · 19,252 views · English (US)

Ratings shouldn't be stable achievements. Glicko-2 ratings are volatile after upsets.

As always, opinions are my own, not those of Lichess.org.

Players have complained about problems with the rating system many times in Lichess Feedback, and Dr. Glickman (also author of the US Chess Rating system) proposes a peer-reviewed solution targeting those problems along with an explanation of how it works.

Reading Lichess' source and commit history clearly indicates an attempt to use Glicko-2 without defining a "rating period" during which all games are concurrently rated (instead of games being rated one at a time). US Chess and FIDE rate games in batches, demonstrating value in publishing stable ratings as well as live rating previews.

Rating games in "rating period" batches, as Glickman suggests, would yield the predictable outcomes he describes: volatility increases when upsets occur and decreases otherwise, and RD increases based upon volatility.
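To make "rating a batch of games concurrently" concrete, here is a minimal sketch of one Glicko-2 rating-period update, following the steps in Glickman's published Glicko-2 description as I understand them. The function name `rate_period`, the `games` tuple layout, and the default `tau=0.5` are my own choices for illustration; this is not Lichess code.

```python
import math

SCALE = 173.7178  # conversion factor between the Glicko and Glicko-2 scales


def g(phi):
    """Dampen an opponent's impact according to their rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected(mu, mu_j, phi_j):
    """Expected score against an opponent with rating mu_j, deviation phi_j."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


def rate_period(rating, rd, sigma, games, tau=0.5, eps=1e-6):
    """Rate every game of one period together.

    games is a list of (opponent_rating, opponent_rd, score) tuples,
    with score 1 for a win, 0.5 for a draw, 0 for a loss.
    Returns the new (rating, rd, sigma).
    """
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    if not games:
        # No games this period: only the deviation grows.
        return rating, math.sqrt(phi ** 2 + sigma ** 2) * SCALE, sigma

    opps = [((r - 1500.0) / SCALE, d / SCALE, s) for r, d, s in games]
    # Estimated variance of the rating based only on game outcomes.
    v = 1.0 / sum(
        g(pj) ** 2 * expected(mu, mj, pj) * (1.0 - expected(mu, mj, pj))
        for mj, pj, _ in opps
    )
    # Sum of (actual - expected) scores, weighted by opponent reliability.
    surprise = sum(g(pj) * (s - expected(mu, mj, pj)) for mj, pj, s in opps)
    delta = v * surprise

    # Solve for the new volatility with the iterative (Illinois) procedure.
    a = math.log(sigma ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        return num / (2.0 * (phi ** 2 + v + ex) ** 2) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    new_sigma = math.exp(A / 2.0)

    # Inflate the deviation by the new volatility, then shrink it by the
    # information gained from this period's games.
    phi_star = math.sqrt(phi ** 2 + new_sigma ** 2)
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    new_mu = mu + new_phi ** 2 * surprise
    return new_mu * SCALE + 1500.0, new_phi * SCALE, new_sigma


# The worked example from Glickman's Glicko-2 description: one win, two losses.
print(rate_period(1500, 200, 0.06,
                  [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]))
```

Glickman's worked example reports roughly 1464.06 for the new rating, 151.52 for the new RD, and 0.05999 for the new volatility, which is what this sketch should reproduce. Note that all three games feed one update; calling the function once per game instead is exactly the one-at-a-time shortcut discussed above.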

The problem with the Elo system that the Glicko system addresses has to do with the reliability of a player’s rating. Suppose two players, both rated 1700, played a tournament game with the first player defeating the second. Under the US Chess Federation’s version of the Elo system, the first player would gain 16 rating points and the second player would lose 16 points. But suppose that the first player had just returned to tournament play after many years, while the second player plays every weekend...
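The 16-point swing in that passage is what the standard Elo update gives for two equally rated players when K = 32. The passage does not state K, so K = 32 is my assumption, and the function names below are purely illustrative.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32):
    """New rating for A after one game; k=32 is an assumed K-factor."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))


winner = elo_update(1700, 1700, 1.0)  # 1716.0
loser = elo_update(1700, 1700, 0.0)   # 1684.0
print(winner, loser)
```

Nothing in this update knows how recently either player was active, which is the gap the passage goes on to describe.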

Lichess hasn't implemented true Glicko-2 (or Glicko-Boost) since it's costly to implement (in terms of both coding/maintenance and the overhead of computing stable and live ratings); Lichess instead prefers to moderate at greater cost than necessary. The entire purpose of Glicko-2 is to mitigate ratings errors (including abuse) that are less mitigated by Glicko-1 and even less mitigated by the Lichess rating system:

Every player in the Glicko-2 system has a rating, r, a rating deviation, RD, and a rating volatility σ. The volatility measure indicates the degree of expected fluctuation in a player’s rating. The volatility measure is high when a player has erratic performances (e.g., when the player has had exceptionally strong results after a period of stability), and the volatility measure is low when the player performs at a consistent level.

As with the original Glicko system, it is usually informative to summarize a player’s strength in the form of an interval (rather than merely report a rating). One way to do this is to report a 95% confidence interval. The lowest value in the interval is the player’s rating minus twice the RD, and the highest value is the player’s rating plus twice the RD. So, for example, if a player’s rating is 1850 and the RD is 50, the interval would go from 1750 to 1950. We would then say that we’re 95% confident that the player’s actual strength is between 1750 and 1950.
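The interval in that passage is simple arithmetic: rating minus and plus twice the RD. A small sketch, with a function name of my own choosing:

```python
def rating_interval(rating, rd):
    """Approximate 95% interval: rating minus/plus twice the RD."""
    return rating - 2 * rd, rating + 2 * rd


print(rating_interval(1850, 50))  # (1750, 1950)
```

The same rating with an RD of 150 would span 1550 to 2150, which is why a bare rating number says so little about a player who has barely played.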

Even with an ideal rating system, nobody would want to play against an opponent who artificially manipulates their rating. But most of an abuser's "fun" is removed (and most of the damage mitigated) when their volatile rating quickly rebounds to normal with minimal damage to opponents' stable ratings.

Years ago, cheaters and "peaksitters" easily appeared on leaderboards, and we at Lichess easily implemented a Pareto-optimal mitigation: decreasing the RD cutoff for leaderboard eligibility, so that cheaters could be more easily detected and "peaksitters" would quickly fall off. We at Lichess mitigated nontrivial problems, too!

Introducing RD aging, so inactive players don't heavily influence opponents' ratings (as Glickman explained, RD aging is a fundamental difference between Elo and Glicko-1)
Decreasing the artificial RD floor to stabilize ratings of active players who otherwise may be tempted to "peaksit" after a winning streak or throw insults after a losing streak (but because "volatility" is not computed the Glicko-2 way, a floor still exists)
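For readers unfamiliar with RD aging, here is a minimal sketch of the Glicko-1 rule as Glickman describes it: RD grows with the number of rating periods of inactivity and is capped at the RD of an unrated player (350). The function name and the choice of c are assumptions for illustration, not Lichess's actual parameters; the value of c below follows Glickman's example of a typical RD of 50 returning to 350 after 100 idle periods.

```python
import math


def age_rd(rd, idle_periods, c=math.sqrt(1200), max_rd=350.0):
    """Glicko-1 style RD aging: uncertainty grows while a player is idle.

    c controls how fast RD grows; sqrt(1200) (about 34.6) is an assumed
    value that takes an RD of 50 back to 350 after 100 idle periods.
    """
    return min(math.sqrt(rd ** 2 + c ** 2 * idle_periods), max_rd)


for periods in (0, 10, 50, 100):
    print(periods, round(age_rd(50, periods), 1))
```

A returning player's inflated RD means their own rating moves quickly while their opponents' ratings move less, which is the protection the first item describes.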

In theory, ratings abuse might be further mitigated by showing a false rating to a would-be abuser, and showing their "true" rating to their opponent and using that "true" rating to rate games, similar to how US Chess uses rating floors. Of course abusers would still be banned, but could they be stopped from damaging others in the first place?

"Trying is the first step toward failure." - Homer Simpson

Image credit: Mark Glickman