Stockfish absolute strength

Finding the absolute strength of existing rating systems - my proposal

tl;dr: establish an absolute rating pool based on fixed depth neural network engines, then convert existing rating systems to that metric

I posted this on reddit trying to get someone interested in it, but I made the mistake of posting it after midnight on Christmas, so it got buried before anyone saw it. I'm posting it again here so maybe someone takes an interest and tries to implement this before I eventually do.

An argument I hear often is that rating systems are not objective metrics of chess strength, merely ranking systems. A common supporting argument is that if you simply added 500 points to every player in the rating pool, everyone's relative rating and ranking would remain the same; therefore there is no objective or absolute meaning in any of the rating systems. Some have also quoted chess.com saying this in a blog post or somewhere on their website.
I propose a new method to measure the objective strength of a rating system and/or the players in it. Imagine taking a snapshot of the entire rating pool and then applying my method to determine the objective strength of each rating in that pool. I say take a snapshot because players could theoretically be getting stronger all the time; a 1300 on lichess today might be objectively stronger than a 1300 from a year ago if all 1300-level players (or every single player in the pool) increase in strength for some reason. Therefore you have to apply my method continuously to get an up-to-date measure of the absolute skill of any rating in the rating system.
Here is how you do it:
First, let's find an absolute, objective metric of chess skill. In other words, we need fixed, immutable chess entities that never change their absolute strength.
How do we get that? Well, we use a collection of neural network engines designed to play human-like chess (like Maia), combined with a fixed search depth. The code never changes, and because the search depth is fixed the engine is not hardware dependent, so advances in computer technology will not change its strength.
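To make that concrete, here is a minimal sketch of a fixed-depth matchup in Python, assuming python-chess and two UCI engine binaries. The paths and depth numbers are placeholders, not a tested setup; something like an lc0 build loaded with Maia weights would be one option for the human-like engines.

# Minimal sketch: one game between two UCI engines at fixed search depths.
# The engine paths and depth values are placeholders, not a tested setup.
import chess
import chess.engine

ENGINE_A = "/path/to/engine_a"  # e.g. an lc0 binary loaded with Maia weights
ENGINE_B = "/path/to/engine_b"

def play_game(white_path, white_depth, black_path, black_depth):
    board = chess.Board()
    white = chess.engine.SimpleEngine.popen_uci(white_path)
    black = chess.engine.SimpleEngine.popen_uci(black_path)
    try:
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                engine, depth = white, white_depth
            else:
                engine, depth = black, black_depth
            # Fixed depth limit: the same position always gets the same search budget,
            # regardless of the hardware the game is played on.
            result = engine.play(board, chess.engine.Limit(depth=depth))
            board.push(result.move)
    finally:
        white.quit()
        black.quit()
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

if __name__ == "__main__":
    print(play_game(ENGINE_A, 6, ENGINE_B, 8))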
You create an entire collection of these engines and play them against each other to establish their own rating pool. Let's say we have 20 bots in their own rating pool rated 0-2000 (the numbers are arbitrary, but Elo formulas determine the ratings, because the stronger engines beat the weaker ones).
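A plain Elo update over round-robin results is enough to build that bot pool. Here's a rough sketch; the starting value, K-factor and the 0-2000 range are arbitrary choices, as in the proposal above.

# Rough sketch: rate the fixed bot pool with a standard Elo update.
# Starting rating, K-factor and the 0-2000 range are arbitrary choices.
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings, a, b, score_a, k=16):
    # score_a is 1 for a win by bot a, 0.5 for a draw, 0 for a loss
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

ratings = {f"bot_{i}": 1000.0 for i in range(20)}
# for every round-robin game between bot i and bot j:
#     update(ratings, f"bot_{i}", f"bot_{j}", score_for_i)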
Now that we have our collection of engines, we have an absolute, objective measure of chess skill in a rating pool.
Now, to measure the absolute strength of a player, say from lichess, we look at his rating and then have him play against these bots until we determine what his absolute rating is. For example, a 1200 lichess player might map onto an 800 absolute bot-pool rating.
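One simple way to do that mapping is a performance rating over a batch of games against bots near the player's expected level. A hedged sketch: this uses the basic linear performance-rating approximation (not something from the original proposal), and the example numbers are made up.

# Sketch: estimate a player's absolute (bot-pool) rating from a batch of
# games against the fixed bots, using the linear performance-rating formula.
def performance_rating(opponent_ratings, scores):
    # opponent_ratings: bot-pool ratings of the bots faced
    # scores: 1 for a win, 0.5 for a draw, 0 for a loss
    n = len(scores)
    avg_opp = sum(opponent_ratings) / n
    wins = sum(1 for s in scores if s == 1)
    losses = sum(1 for s in scores if s == 0)
    return avg_opp + 400.0 * (wins - losses) / n

# Made-up example: a lichess 1200 scoring 6.5/10 against bots rated around 800
print(performance_rating(
    [700, 750, 800, 800, 850, 800, 750, 850, 800, 900],
    [1, 1, 0.5, 0, 1, 1, 0.5, 0, 1, 0.5],
))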
While it's true the meaning of any rating pool is relative, we are now mapping that rating pool in its current form to the absolute rating pool of the engines.
In this fashion, we can measure when someone's absolute skill level changes, even if weird things happen on lichess, like 400 points being injected into every player's rating, or everyone just happening to get stronger at the same time. We just need to keep testing current lichess players against the objective engine pool to get the most up-to-date picture of the objective strength of various ratings on lichess.
Possible objections I can think of:
a) The bots' strengths are not fixed, and in some games they play stronger than in others. I don't think this is usually the case; a bot like Maia, for example, does a pretty good job of playing at roughly the same strength every game, although I could be mistaken. This may depend on the programming of the engines, the neural networks, etc.
b) Chess engines change all the time. This is a silly argument, because you can simply keep the exact same version and code for any bot you create, and like I said, it is hardware independent if you are using a fixed depth. I'm not really sure about deterministic vs. non-deterministic play, but I think if a bot were fully deterministic it would be too easy to game the system, because it would play the same moves every time.
An additional benefit of this line of reasoning is that you would be able to create conversions between rating systems that may be more accurate than other methods of analysis.
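For example, if you've sampled how lichess ratings and FIDE ratings each map onto the absolute bot-pool scale, you can compose the two mappings. Below is a sketch with made-up anchor points (not measured data), using simple linear interpolation.

# Sketch: convert between rating systems by composing each system's mapping
# onto the absolute bot-pool scale. All anchor points below are made up.
import numpy as np

lichess_rating = np.array([800, 1200, 1600, 2000, 2400])   # sampled lichess ratings
lichess_absolute = np.array([300, 800, 1200, 1500, 1800])  # their measured bot-pool ratings
fide_rating = np.array([1000, 1400, 1800, 2200])
fide_absolute = np.array([500, 900, 1300, 1600])

def lichess_to_fide(r):
    absolute = np.interp(r, lichess_rating, lichess_absolute)  # lichess -> absolute
    return np.interp(absolute, fide_absolute, fide_rating)     # absolute -> FIDE

print(lichess_to_fide(1500))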
Let me know what you think. If anyone would like to attempt to create this, please do; otherwise, if no one does, I'll take a crack at it myself, but it would take me a while, because ideally you'd train your own version of a Stockfish or Leela neural network incrementally, so it slowly builds up in strength during training and you get a wide range of strengths across the different versions of the engine you use.