Are antichess ratings deflating?

Leveraging the expanse of chess data: Part 2

A cute little introduction

Hey, I’m Erin. Part 1 has finally arrived, but chronologically it's now part 2.

One of my long-term interests has been how move quality correlates with rating. It’s one of those things that has been studied extensively, but there’s still so much to learn from it. There’s this quote from an FAQ page that I found interesting, though.

You got one part of that wrong

"No. Ratings can't be compared from a server to another. They only serve to compare the skill of players within the same rating pool." (Lichess Rating Systems FAQ Page)

This is not meth correct (Breaking Bad S1E6, please don't sue). But after talking to a mod about it, my anger subsided because it's a matter of context. Lots of people try to directly compare Lichess ratings to chess.com ratings or Chess24 ratings, but the direct comparisons just don’t work. For example, 1500 on Lichess isn’t necessarily 1500 on chess.com. These people then get upset when they figure this out and say that one of the two rating systems is wrong.

Both rating systems are correct, mathematically speaking. And both rating systems measure relative chess skill. With enough players sampled from the entire chess playerbase, which there are, ratings can represent absolute chess skill. This means you can use absolute chess skill as a bridge between two rating systems to a certain degree of precision. Here are a few examples of people doing exactly that.

  1. This is a neat project on the Lichess database website.
  2. Some Lichessians managed to do it years ago.
  3. I did it myself in my wild 45+45 days, and it's still pretty accurate.
  4. Even chess.com’s statisticians are doing it.

This got me thinking. Can we even compare the same rating pool at different points in time? The rating pool changes literally every day. Some new players join, some old players leave, and different games are played every day. The changes may be small, but they're changes nonetheless. Can we compare Lichess 2011 ratings to Lichess 2021 ratings? Probably not. Right?

Why does anyone care?

Instead of inflation or deflation, I'll call it rating drift. Ratings can drift for a bunch of reasons, and inflation is a specific kind of rating drift in my mind (when absolute skill is no longer represented by the same number). So, yes, the title is click bait. But I feel like it's easier to call everything rating drift and then qualify reasons behind it, as opposed to switching between terminology.

There are many reasons why someone might want to care about rating drift, but I’ll list two.

1) People want to know if they've improved. It’s natural for people to check if their rating is authentic. Many players have never played tournament chess in their life, and they just want to know how good they’d be if they started playing competitively. “You can’t compare apples and oranges” is the worst answer you can give someone like this. It's certainly not a very welcoming response.

2) Lichess ratings matter. And I mean they matter for international titles. Take for example the International Antichess Federation (IAF), which awards antichess titles using arbitrary Lichess rating cutoffs. Presumably they wouldn't want to award titles to weak players, right? That'd be like FIDE issuing FM titles to a generation of 2100s. The titles depreciate in value, and so do the ratings. Levon Aronian even has similar concerns about the GM title in chess. So there really is a legitimate need for maintaining the integrity of the rating system, and not just by forcing the median to be 1500 whenever it strays away.

Let’s use antichess as an example. Are ratings drifting downwards? Well, obviously. That's been known for years. But to what extent? And can we find out why?

A few common myths

You can map ratings between platforms, but it won't be accurate at all.

This one is actually true, but it depends on what you mean by accurate. Normal playing strength variations can be 100 points in either direction on any given day, so there's going to be variability just because humans are humans. We can't get a 1-to-1 mapping, but we can hopefully get something that makes us go "yeah, this seems about right".

Lichess ratings aren't inflated. It just uses a different system.

Correct, but a small clarification. There are two ways I see inflation. The first is within the same rating system (legit inflation), and the second is between rating systems. Between rating systems, the word inflation is used in a much more lax manner - that one rating is higher than another. For example, my little sister's 3000 My Little Pony chess minigame rating is numerically higher than your 2000 Lichess blitz rating. I’d feel comfortable calling that inflated.

In general, I feel like the term inflation is used pejoratively. It shouldn’t be. I'm cool with a rating system as long as it's consistent. If it starts drifting, then that's very interesting, but not necessarily wrong.

Comparing rating systems is like comparing apples and oranges. There is no correlation.

There's also no correlation between your mom.

The ways rating pool distortions occur

This is simplified, so please excuse me. From talking to a few top antichess players, I've learned that ratings can fluctuate for a number of reasons. Time controls, cheaters, rating deviations of provisional players, whatever. It all affects ratings. I had this long discussion with nevergonnaberserk, and it came out similar to what tolius had to say. In fact, tolius is literally the authority on antichess ratings, given that he has a website where he recalculates the ratings of every antichess game ever played. This stuff goes very deep, and unfortunately I lack the technical prowess and patience to get into it. But below I've explained my own model. I think of it as a high-level overview that works satisfactorily.

In my defense, oftentimes nobody knows why rating distortions occur, even when they clearly have occurred. For example, FICS identified that their blitz ratings decrease every year, but there's still speculation as to why.

Let's look at 4 cases where ratings can go haywire.

  1. The system is broken.
  2. Players join the pool.
  3. Players leave the pool.
  4. Someone dares improve.

Case 1 - The ratings system just can't even. This was suspected in the FICS study as an "artifact of the rating system". Basically, rounding errors can sometimes happen, and they destroy the conservation of rating. Lichess has taken great care to avoid this particular issue, but the reverse can also happen, and sometimes does: rating points can be externally injected into the pool, for example. Both of these changes mean that player ratings change, but player strength does not.

The funniest case to me is when the mathematics of a rating system fail to model reality. Rating systems rely on sigmoidal expected score curves to calculate rating changes. At the extremes, however, they get it wrong.

Winrates and expected antichess winrates in July 2021.

In an Elo calculation (as above, but with the 400 parameter replaced by whatever fit the data best), weaker players win slightly more often than predicted. This drags everyone's ratings closer together. There's some rating drift upwards among weaker players, and some rating drift downwards among stronger players. Player strength does not change at all, though. Lichess uses Glicko 2 instead of Elo to better model reality, but some version of this tracking error will always exist. A model can't predict everything.
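
To make that "expected score curve" concrete, here's a minimal Python sketch of the Elo logistic model. The 400 scale is the textbook value; in the plot above it was instead treated as a free parameter and fit to the data.

```python
# Minimal sketch of the Elo expected-score curve. The 400 scale is the textbook
# value; the plot above fits this parameter to the observed winrates instead.
def expected_score(rating_diff: float, scale: float = 400.0) -> float:
    """Expected score for the higher-rated side under the Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / scale))

# e.g. a 200-point favourite is expected to score about 0.76 with scale=400;
# at large rating gaps, observed winrates tend to fall short of this curve.
for diff in (0, 100, 200, 400, 800):
    print(diff, round(expected_score(diff), 3))
```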

Case 2 - New players change the ratings. Such was postulated in the FICS study as well, where weaker players were suspected of dragging ratings down. Why might this happen? My guess is the provisional ratings feature. During the first few games, a weak player will lose more points than their opponents gain, and the reverse applies when strong players join. Either way, the result is drift.

The nuance here is that while overall rating goes down if a weak player joins the pool, there is actually a slight drift upwards towards the higher rating range. If a 1700 plays a provisional 1500 who's actually 1300-strength, the 1700 is gaining more rating than they should have; they're now 1705 instead of 1701. The stronger player is farming the overrated weaker player while they're stuck at a provisional rating. Both overall skill level and rating go down, but not uniformly across all skill levels.
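
To see why, here's a rough one-game Glicko-1 update in Python. Lichess actually uses Glicko-2, and this omits RD decay between rating periods, so treat it as an illustration rather than the site's implementation. The point is that the provisional player's large rating deviation makes their rating move far more per game than their established opponent's, so rating points aren't conserved.

```python
import math

Q = math.log(10) / 400.0

def g(rd: float) -> float:
    """Glicko-1 weighting factor: opponents with a high RD count for less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def glicko1_update(r, rd, r_opp, rd_opp, score):
    """One-game Glicko-1 rating update (RD decay between rating periods omitted)."""
    e = 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))
    d2 = 1.0 / (Q**2 * g(rd_opp) ** 2 * e * (1.0 - e))
    denom = 1.0 / rd**2 + 1.0 / d2
    new_r = r + (Q / denom) * g(rd_opp) * (score - e)
    new_rd = math.sqrt(1.0 / denom)
    return new_r, new_rd

# Established 1700 (RD 60) beats a provisional 1500 (RD 300) who's really ~1300 strength:
print(glicko1_update(1700, 60, 1500, 300, 1.0))   # winner gains only ~4-5 points
print(glicko1_update(1500, 300, 1700, 60, 0.0))   # provisional loser drops ~80 points
```

Repeat that mismatch over a provisional player's first handful of games and you get exactly the non-uniform drift described above.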

Case 3 - Sometimes players leave the pool. When a strong player leaves, both the average skill and rating go down, and vice versa for a weak player leaving.

Case 4 - Weak players can improve. Take this example: two players are rated 1200 and 2400, respectively. The 1200 improves to 2400. After enough games, both players' ratings become 1800 despite them both actually being 2400-strength. The average rating has not changed, but the average player strength went up. This results in a 600-point rating drift. It could be due to cheating or genuine improvement; I don’t really care as long as the player’s skill increases (this is a point of contention). If everyone improves, however, then this effect is diminished. My theory is that weaker players improve more quickly than stronger players, so Case 4 has some bite to it.
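
Here's a toy simulation of that two-player example under plain Elo (the K-factor and game count are arbitrary). Because Elo with a shared K conserves the total rating and the two players are now equal in strength, both ratings drift toward 1800 even though both players are 2400-strength.

```python
import random

def elo_expected(ra: float, rb: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

def case4_sim(games: int = 2000, k: float = 20.0, seed: int = 1):
    """The 1200 has quietly improved to 2400-strength, so every game is a coin flip."""
    random.seed(seed)
    improver, veteran = 1200.0, 2400.0
    for _ in range(games):
        score = 1.0 if random.random() < 0.5 else 0.0
        e = elo_expected(improver, veteran)
        improver += k * (score - e)
        veteran -= k * (score - e)   # Elo with a shared K conserves total rating
    return round(improver), round(veteran)

print(case4_sim())  # both end up near 1800: a ~600-point gap between rating and strength
```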

These Case 4 effects have actually happened in real life. A very recent study on FIDE ratings concluded that ratings were deflating because rating floors decreased. One can surmise that something similar happened when Lichess dropped their rating floor from 800 to 600. The United States Chess Federation also fell victim to rating deflation due to an improving youth, which even made some veteran players quit over the unfair rating losses. Their reactive measure to placate the populace was instating individual rating floors, which are a controversy of their own.

The question of rating drift is going to be a complicated mesh of these factors, and even more. In the IAF’s case, now they have to make a decision. Should a player be awarded an ACM title because they’re objectively 2000-strength, or should they not be given a title because they aren’t 2000 relative to the current player base?

So, how do we figure out:

  1. Are antichess ratings drifting up or down?
  2. Are these swings accompanied by swings in player skill?

We need to measure ratings and skill over time. Well, ratings are easy. We'll also be tracking player skill using average centipawn loss (ACPL). There are many issues with this, but just go along with it.

How we get the data

I got all the PGNs from the Lichess Database. I took all the antichess files from May 2018 to May 2021. I excluded files after May 2021 because Lichess switched their variant engine around that time. The engine switch means that I can't necessarily trust that those position evaluations are consistent with the rest of the data.

I used a slightly modified version of the open-source program Welgaia to read all of those millions of games, find the games with engine analysis, and collect average centipawn losses (ACPLs) and rating information from those games. Positions calculated as forced mate or exceeding +/-10 in evaluation were capped at +/-10. I didn't account for time control, though I could have if I were feeling meticulous. There are other optimizations that I avoided for simplicity, for example most anything to do with ACPL corrections. For those who don't know, ACPL is a move quality metric: for each of a player's moves, take how much the engine evaluation dropped because of that move, then average those losses over the whole game. It has a whole host of problems associated with it (understatement), but it's still probably the most popular move quality metric anyway.
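
For the curious, here's roughly what that processing looks like. This is a simplified stand-in for what Welgaia does, based only on the description above: evals are pulled straight out of Lichess's [%eval ...] annotations, forced mates and anything past +/-10 get capped at +/-10, and ACPL is the average evaluation drop from the mover's perspective. The starting position is assumed to sit at roughly 0.00, and games whose evaluations stop partway through aren't handled.

```python
import re

EVAL_RE = re.compile(r"\[%eval (#?-?\d+(?:\.\d+)?)\]")

def parse_evals(movetext: str) -> list:
    """Evals (in pawns, from White's perspective) after each move, capped at +/-10."""
    evals = []
    for raw in EVAL_RE.findall(movetext):
        if raw.startswith("#"):                       # forced mate -> capped at +/-10
            evals.append(-10.0 if raw.startswith("#-") else 10.0)
        else:
            evals.append(max(-10.0, min(10.0, float(raw))))
    return evals

def acpl(evals: list, white: bool) -> float:
    """Average centipawn loss for one side; starting position assumed to be ~0.00."""
    seq = [0.0] + evals
    losses = []
    for i in range(len(seq) - 1):
        if (i % 2 == 0) != white:                     # move i is White's iff i is even
            continue
        drop = (seq[i] - seq[i + 1]) if white else (seq[i + 1] - seq[i])
        losses.append(max(0.0, drop) * 100.0)         # pawns -> centipawns
    return sum(losses) / len(losses) if losses else 0.0

movetext = '1. e4 { [%eval -0.5] } b5 { [%eval -1.2] } 2. Bxb5 { [%eval -2.0] } *'
print(acpl(parse_evals(movetext), white=True), acpl(parse_evals(movetext), white=False))
```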

When processing that data, I only selected the player rating information from analyzed games. This is not totally accurate, but I'm doing it anyway because while analyzed games are generally of lower quality than the average game (more about this in a future blog), the rating trends should still match up. Take for example the figure below, where the two scatterplots are translatable.

Looks like weaker players analyze more games?

Fair warning: the data for the rest of this article is noisy. The precision of this dataset hinges on people analyzing their games. Thank you to everyone who analyzes their games. The more analysis, the better the numbers.

What happened on a macroscopic scale?

Let's try Case 1. Are there any rating distortions due to artifacts in the rating system? We can try plotting ACPL against rating and see if it goes vertical at any moment. A change in slope means something interesting is happening. We'll plot average rating of analyzed games and ACPL for every month between May 2018 and May 2021. Every point is one month.
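
For reference, a plot like this boils down to a straightforward monthly aggregation. The file and column names below are hypothetical; the idea is simply one (mean rating, mean ACPL) point per month.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV: one row per player per analyzed game, with date, rating and ACPL.
games = pd.read_csv("antichess_analyzed_games.csv", parse_dates=["date"])

monthly = (
    games.assign(month=games["date"].dt.to_period("M"))
         .groupby("month")
         .agg(mean_rating=("rating", "mean"), mean_acpl=("acpl", "mean"))
)

plt.scatter(monthly["mean_rating"], monthly["mean_acpl"])
plt.xlabel("Average rating of analyzed games")
plt.ylabel("Average centipawn loss")
plt.title("One point per month, May 2018 to May 2021")
plt.show()
```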

Ratings and ACPL are pretty consistent with each other

It doesn't go funky ever, really. Probably this is a bad way to approach this problem. Lichess using Glicko 2 and decimal ratings really helps avoid these kinds of nonlinearities, and the site probably didn't inject a substantial number of points into the rating system during this time period - otherwise we'd have seen it.

At face value, this kind of linear relationship indicates that nothing crazy is happening with the rating system's mathematics. In other words, player ratings are very closely tied to their actual skill level on a month-to-month basis. However, it does indicate that average ratings are drifting alongside player skill.

Macroscopically, anyway. Doesn't it kind of look like the points on the right are following a different slope than the points on the left?

Let's try Case 2. Below is a plot that shows ACPL and average rating in analyzed games over May 2018 - May 2021.

Neat little intersection

It's funny how the two curves match up so closely, including that large spike right around when quarantine season was trendy. Ratings were very slightly on the decline and ACPLs were very slightly on the upswing before coronavirus, but after that the rate accelerated incredibly.

The explanation I'm offering for the sudden coronavirus spike is that many new (and weak) players started playing antichess. This is consistent with the number of games being played every month nearly doubling around that time period. It's also consistent with the rising ACPL, because weak players don't really play that well. The influx of new players brought the average rating down, but it didn’t break the mathematics of the rating system. This must have skyrocketed the percentile of top players even if nothing really changed from their end.

This also highlights why it’s probably a bad idea to just hope for Case 1 rating drift. The recent drift in median (mean in this case, but whatever) rating is very much explainable as a totally natural part of rating systems. Median ratings don’t have to be Case 1-shifted back to 1500. At least, there might be a more nuanced approach, for example only Case 1-shifting certain rating bands.

What happened at individual rating bands?

One more thing to check. Let's try plotting ACPLs of player ratings over time. Case 4.

A horrific mess of a rainbow

Players at the higher end of the rating system were getting better and better until last year, which meant a higher standard of play (ACPL) was required to hold your rating. It's a bit hard to see at lower ratings because of the noise, but for 2000- and 2200-rated players it's quite striking to my eyes.

Perhaps it's because new players tend to have a low retention rate, so it's difficult for the standard of a 1500-rated player to improve? But the 2000-rated veterans stick around for a while and gradually reap the rewards of their improvements. It's also interesting that the ACPL volatility is so much lower for higher-rated players despite the bulk of analyzed games being from lower-rated players.

Is that prolonged ACPL drop really due to improvement? It depends. But devil's advocate, let's say that the lower ACPL may also be due to strong players farming weak players (there will be a blog about this later on). After all, we do know players are constantly joining because overall ACPL always trends upwards. I'd claim that both are happening at the same time; new players are generally weak and have ridiculous ACPLs, so they're dragging the overall ACPL up. Simultaneously, everyone is improving slowly but surely, but not enough for that to show on the overall ACPL plot. This improvement does show up when zooming in to certain rating bands, though, especially at the higher rating bands where there are few but reputable players.

I think this means that rating drift doesn't have to be uniformly distributed across the entire player pool. The player skill distribution for a pool can be stretched or squished over time depending on who joins or leaves, which causes local rating drift wherever the stretch occurred. A Case 1 artificial correction may not lead to the desired effect in such cases because it acts uniformly on the player base.

The ijh spike

Let's talk about the ACPL increase in 2020. Look at the rating graphs of elite players such as Ogul1, PepsiNGaming, EUROPROFESIONAL, firebatprime, Zooey, Dreadful... why are all of them dropping 200 points from 2018 to 2021 like in the image below? They didn't get worse - quite the contrary, they probably got better if we go off of ACPL history. Why's the ACPL going up when ratings trajectories aren't changing?

Classic antichess rating drift

As a quick check, let's plot move time against player strength over the months. Weak players tend to play slower than strong players; just check out Welgaia plots, though I won't show those here (but will in another blog). But, basically, we'd expect to see a nice function just like with the ACPL versus strength plot. This time we'll be using all games - not just those that have fishnet analysis - so the average ratings are higher than they were earlier.
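
For context, move times come from the [%clk ...] comments in Lichess PGN exports: each side's think time is roughly the previous clock reading minus the current one, plus the increment. This sketch assumes the clock is recorded after the increment is applied; check your own data before trusting it.

```python
import re

CLK_RE = re.compile(r"\[%clk (\d+):(\d{2}):(\d{2})\]")

def clock_seconds(movetext: str) -> list:
    """Remaining clock in seconds after each move, parsed from [%clk ...] comments."""
    return [int(h) * 3600 + int(m) * 60 + int(s) for h, m, s in CLK_RE.findall(movetext)]

def move_times(clocks: list, start: int, increment: int = 0) -> list:
    """Think time per move for one side, given that side's clock readings in order."""
    times, prev = [], start
    for clk in clocks:
        times.append(prev + increment - clk)   # assumes clock is recorded post-increment
        prev = clk
    return times

clocks = clock_seconds("1. e4 { [%clk 0:01:00] } b5 { [%clk 0:00:58] } 2. Bxb5 { [%clk 0:00:55] }")
print(move_times(clocks[0::2], start=60), move_times(clocks[1::2], start=60))
```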

What?

Wait, what? Why are there two distinct pockets? Weak players ought to be playing slower, not faster. This indicates a disturbance. Let's make the x-axis a time dimension to see if we can find it.

ijh's time to shine

Move times will do whatever they want in the long run, but there's this funny divergence characterized by the ACPL hike in April 2020 for low-rated players. These ACPL increases cascaded to high-rated players until August 2020. I call it the ijh spike because, well, it's named after ijh, but also because it has to do with ijh shenanigans. April 2020 marked the introduction of antichess hourly arenas, and many of these are 1+0 or 2+0. These are fast time controls, and players are bound to make more mistakes under time pressure. Weaker players tend to participate in hourlies the most. Many of these weak players are also new players because of quarantine, and thus the fast time control culture was born. This slowly propagated to the entire community over several months.

This means that ijh singlehandedly caused a permanent meta shift in antichess time controls that has lasting effects on aggregate antichess game quality. The influence is so strong that there's quantitative evidence of it that will live on forever in the Lichess games database.

I don't think that the change in time control alone accounts for the accelerated rating drift in 2020. The time control change is permanent, but the rating drift rate returned to normal after a few months. I maintain that the bulk of the rating drift in 2020 happened because the coronavirus kids started playing antichess. That's Case 2, and it isn't deflation as we usually think of it: top antichess players' ratings didn't collapse because the rating system broke down - the average rating simply went down along with the average skill level.

The gradual rating drift that we see among top players, being accompanied by a reduction in ACPL, probably happened because of player improvement via Case 4 - this I would consider actual deflation. It'd be consistent with the FIDE and USCF cases as well.

Going back to the first ACPL vs rating figure, it's this interesting thing where somehow ratings are going down along with move quality. In a true deflation case, move quality would be about the same while ratings are going down. In other words, we shouldn't be seeing something linear if deflation is happening. But while player quality was going up, time control was simultaneously correcting for that ACPL drop. The result is a plot that looks like nothing spectacular happened. In reality, it's two opposite forces competing against each other. Or so I claim.

So are antichess ratings drifting because of improvement?

Check out this interview

I think it's safe to say that the standard of play has been steadily increasing over the years. There are a few interesting things that happened due to quarantining and due to ijh, but it's this gradual and consistent improvement among the player base, focused on weak players quickly achieving par results, that pushes ratings downward. It's the easier explanation to swallow, anyway, and it makes sense. What that does mean, however, is that antichess knowledge is growing every day. More strong players means more ideas, more innovations, more high-level games, more coaches, and more motivation. Check out this interview with firebatprime, one of the elites of the game, who explains why it's easier to improve now than ever before.

Why does anyone care, again?

Well, nobody does, but I do. Everyone already knew that antichess ratings have been deflating for years, and everyone (some people... at least one person) already knew about the ijh spike. The community has been voicing their concerns about deflation for years, and I didn't need to show any proof of the obvious.

I think that every variants community that wants to award titles to their players should be monitoring their variant's rating drift. Rating drift is evident just from looking at strong player rating trends, has happened in other rating systems, and hopefully we've shown here that there are many other ways to visualize it. I don't even care if all of my own analysis is bogus (which it may very well be) - I still think the rating drift problem merits consideration. This may include analyzing games locally to ensure engine consistency over time, filtering by time control to avoid ijh spikes, and making a very specific decision about whether antichess titles should be awarded on absolute or relative merit. Maybe the IAF, for example, already does this. In that case, I look silly. Whoops.

But what I care about is developing the tools that allow us to come to these conclusions quantitatively. Forget the actual conclusion in this blog post, if there even was one anyway. Imagine if we can use this kind of approach to predict how strong joddle or pminear would be if they played today. What if we could compare chess.com antichess ratings to Lichess antichess ratings? What about different variants, like crazyhouse? What does it mean when someone says they're a beginner, intermediate, or decent? Can we build a system that can map ratings between any variant, site, time control, or time period? A system that measures your objective chess skill?

Of course we can. But first, I need to publish part 3.

Recommendations to the FAQ page

Half of the point of this blog is to challenge the rating systems FAQ page. I understand why it’s written the way it is. I still think it's worth amending, though. Such is the cost of having an opinion. Here are my suggestions.

“Comparing between federations or servers” - You can compare rating systems between servers, but it’s not trivial. A lot of statisticians try their hand at drawing correlations between rating systems. We recommend this one.

Of course, have a hyperlink to a good comparison website. Our friends at Chess Goals (hyperlinked earlier) have a promising correlation scheme in my opinion. As an aside, I think Lichess should make their own rating correlation chart. With the resources the site has, I wouldn't expect it to be too intensive. I mean, anyone can generate a passable correlation chart with Welgaia in like 15 minutes. And I think it'd look pretty good on the site in any case.
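
To illustrate what such a chart boils down to, here's a tiny sketch with made-up paired ratings (players who have accounts on both sites) and a least-squares linear fit. A real version would use thousands of pairs and probably something smarter than a straight line.

```python
import numpy as np

# Made-up paired ratings for players with accounts on both sites (illustration only).
lichess  = np.array([1400, 1650, 1800, 2000, 2150, 2300, 2500], dtype=float)
chesscom = np.array([1050, 1300, 1450, 1700, 1850, 2000, 2250], dtype=float)

a, b = np.polyfit(lichess, chesscom, 1)   # least-squares fit: chesscom ~ a * lichess + b

def lichess_to_chesscom(rating: float) -> float:
    return a * rating + b

print(round(lichess_to_chesscom(2000)))   # a rough estimate, nothing more
```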

"Are Lichess ratings inflated?" - Lichess ratings start at 1500, as is recommended by the Glicko system definition. Lichess ratings can often be higher than ratings from other systems that start at 1200, such as FIDE or chess.com. This is in part due to initial rating but also how ratings are calculated.

There does appear to be a significant median deviation from 1500 over time, so I'd remove the part that denies it. At the very least, it's not obvious that no such deviation exists. As of time of writing, the median antichess rating of active players is around 1275. I consider this a significant deviation from 1500.

“Are players from server X stronger/weaker, because their ratings are higher/lower?” - No. Ratings on different servers are calculated differently. There is more to consider than just the numerical values.

I feel like the site text kind of shoots itself in the foot considering rating pools (Lichess, FICS, ICC, whatever) aren't even comparable between themselves from month to month. Unless we're comparing percentiles, in which case it seems fine, but most people want to compare absolute strength instead of relative strength.

“Which rating system is best?” - Practically speaking, all rating systems do their job. If there were any one truly better rating system then everyone would be using it. But technically, Glicko 1 makes better predictions than Elo, and Glicko 2 makes better predictions than Glicko 1.

I added a sentence about how all rating systems are practically identical. FIDE isn’t suffering because they use Elo, and chess.com is doing just fine using Glicko 1. All three rating systems do a reasonable job at what they do.

“Why don’t they all use the same rating system?” - Because the first rating system historically used, Elo, has been improved upon over the decades. Glicko 1, then Glicko 2, make considerable improvements, and offer greater accuracy. Chess servers didn't want to pay the cost of legacy forever, so they moved on to more state-of-the-art systems.

My word preference, I suppose. It hurts to call the Elo system bad when, as the pioneer, it had no predecessors to learn from. For being the first of its kind, the Elo system holds up remarkably well against the newer technologies.

Acknowledgements

Shoutout to @plentifulnoodles for the cover picture. Plentiful noodles best doodles.
Shoutout to @lakinwecker for reading over an earlier revision to make sure I don't embarrass myself. I tried to fix it.