Exact Ratings for Everyone on Lichess

mfw my real rating is 250 points lower than shown on lichess
There is no such thing. It would have to use external referentials, probably (hehehe!), and anyway it would be a multidimensional vector of numbers, chess being big. I wonder how our internal models could have just one number on their forehead. And exact... I might have just read the title, again...
@ShogiButItSimplifies
Thank you for your feedback!

First, to address some confusion: I did not develop Ordo, so this was not really my research. I actually don't even understand statistics well enough to know exactly how Ordo works. I simply ran it on the Lichess dataset and then spent the whole blog post complaining about Lichess ratings. My evaluation was not very rigorous, you're correct. It was mostly an afterthought where I just quickly checked how many outcomes it guessed correctly, to see whether these ratings weren't actually worse. The sample size is 30 million decisive blitz games.

For the December rating release I will be more rigorous and I will test the November ratings on the December dataset. This will be an actual test of rating quality. I'm currently experimenting with my own program that does similar optimization to Ordo, and I seem to be scoring better than Ordo on the fitness function I specified.

I think when discussing rating systems, we need to set out a metric for how accurate the ratings are. Then, whatever the best system is shall prevail. Let me lay out a modest proposal for what that might look like:

Use the historical Lichess data except the newest month. Calculate a rating for every player (except those with all wins and all losses). Evaluate every blitz game in the newest month with your own E(player_rating, opponent_rating, player_color) function for the expected score. Take the sum of E for all games played by a player, compare it to the sum of points the player scored that rating period (month). Take the absolute differences and sum those for all players. Now you have the total score misprediction. The lower it is, the better the ratings.
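A minimal sketch of this metric in Python (the tuple fields and the exact E signature here are placeholders I made up, not anything from the actual dataset):

from collections import defaultdict

def total_score_misprediction(games, ratings, E):
    # games: iterable of (player, opponent, color, score) tuples, one per game
    # ratings: dict of player -> rating computed from all earlier months
    # E: function (player_rating, opponent_rating, player_color) -> expected score
    expected = defaultdict(float)
    actual = defaultdict(float)
    for player, opponent, color, score in games:
        expected[player] += E(ratings[player], ratings[opponent], color)
        actual[player] += score
    # Total absolute misprediction, summed per player: lower is better.
    return sum(abs(expected[p] - actual[p]) for p in expected)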

This is certainly not perfect. This fitness function is what Elo uses when adjusting player ratings. Do you think it is better to count the total absolute error on a per-game basis, without summing the score of a player? We are certainly losing information by summing but my experiments with my own program show that the ratings converge very quickly with this metric, so it can't be that bad.

Another improvement would be to use only 3+0 games, the most common time control, rather than all blitz games. Tournament games should also be excluded, as they feature berserking and other upset-inducing tendencies. My dataset did not include tournament games.

A better system would spit out probabilities for a win, a draw and a loss. But perhaps let's stick to a simpler system for now?

I am interested in best predicting game outcomes in the near future, like a month out. This is distinct from what Ordo, BayesElo and EloStat do: they are made for engine games, where the skill levels are constant. I propose a month-long testing period as it is not long enough for most players to significantly improve their "playing strength", yet offers a large enough sample size for testing. But maybe one week is more appropriate? I think it could be too noisy and miss too many players.

Think of "playing strength" as the g-factor in psychometrics. It is simply a construct that correlates with game outcomes. It cannot be perfectly quantified by one or more variables.

Let me know what you think. You're welcome to join the discussion on the Official Lichess server in #general-programming-forum > Rating research unit. Here is a direct link: discord.com/channels/280713822073913354/1190085963339272344

Now on to address your specific points.

> The issue is that there is an unquestioning assumption present in the whole argument that the Elo curve model and bell shaped normal or logistic distribution models for player ratings are correct and suitably statistically consistent with reality, but in actual fact they are not!
I'm not making that assumption, but Ordo, Glicko and Elo sure are :). I do not fully understand the statistics, maybe you can explain. There seem to be two types of distributions we are talking about: the distribution of playing strengths (ratings) and the distribution of per-game performances by a given player. To my understanding, Elo assumes that a player's performances are normally distributed with the mean being his playing strength. Two players play each other, each playing at some random strength level drawn from their distribution, and the winner is the one with the higher value. Now how does this relate to the distribution of ratings? I'm not sure. Am I correct in understanding that the 400 constant in Elo is meant to be the standard deviation of the distribution of player ratings? Consider the domain of engine chess: even though it uses the same E = 1 / (1 + 10^(-Δ/400)), the distribution of engine ratings will most likely not be normal. The ratings themselves can be scaled and shifted arbitrarily depending on what the function E is, no? So I'm just not sure how rating distributions and the imaginary playing-strength distribution relate.
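As a sanity check on that last point, a tiny sketch (assuming the standard logistic E with no colour term):

def elo_expected(r_player, r_opponent):
    # Standard logistic Elo expectation; depends only on the rating difference.
    return 1 / (1 + 10 ** (-(r_player - r_opponent) / 400))

# Shifting every rating by the same constant changes no prediction,
# so the absolute numbers carry no meaning on their own.
assert elo_expected(1700, 1500) == elo_expected(700, 500)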

> I am in the process of working on an improvement on the status quo of chess rating systems and imo we should be basing ratings on computer analysis of player performance for whole games rather than just the outcome.
I'm looking forward to seeing your research. Certainly you're not forgetting that chess strength is not just about move accuracy. "The winner of the game is the player who makes the next-to-last mistake." I think my metric for playing strength, the ability to predict future game outcomes, is appropriate. It will be interesting to see whether engine-based evaluations do better on that metric. Don't forget about the computational effort involved though! Lichess certainly cannot afford to compute engine accuracy for all player games. Also beware, when computing historical accuracies on Lichess data, that different Stockfish versions have different evaluation scales. I suggest using a fairly short period of games to avoid changes in the engine. Note that Stockfish will be sticking to "+1 means 50% win-rate" as its evaluation scale for the time being.

> Here is an excellent blog post from world leading chess statistician giving his statistical data science take on FIDE ELO and his proposed optimisations and refinements
Sure, I am familiar with Jeff Sonas and I am a big enjoyer of the beautiful http://chessmetrics.com/ website. I've been reading more of his stuff recently. Unfortunately I don't think we have access to his exact dataset, correct me if I'm wrong. Hence I think doing research on Lichess games is more appropriate, as the results can be independently verified. That doesn't undermine his work tho! You might also be familiar with the Universal Rating System, which he developed together with Mark Glickman (of Glicko fame). It seems that not only is the dataset not public, but neither is the system itself. Typical FIDE-like behavior.

> your software engineering skills are clearly great (I assume this is what your background is in?)
Thank you, tho credit for Ordo goes to Miguel A. Ballicora, who wrote it in 2012-2016. I am self-taught but work as a programmer, yes :)

> you have to work on your statistical and scientific critical thinking in how you conduct and disseminate results. If you want to test beliefs and get closer to the truth you have to use a more formal approach with statistical hypothesis tests and you have to be more self aware of when your statistical arguments are too vague and unsupported, you have to go deeper.
Again, I wasn't really making any statistical arguments; I think it's impossible for Ordo not to do better than Glicko, given that Ordo optimizes the ratings for whatever fitness function it uses. I did a quick test showing that the ratings predicted the winners of games more accurately. When you think about it, this evaluates the rankings, not the ratings per se. I have since also tested the Elo-style fitness function that I described above, and Ordo performs much better than players' live Glicko ratings. But ofc in either test, Ordo cheats by knowing the future. The real test will be predicting the next month's outcomes.

> For instance you make a tenuous claim at the start appealing to the law of large numbers that as the number of games tend to infinity each player's rating will eventually tend towards some exact true value but this is not the case
This was my understanding of how Glicko works. But perhaps I'm mistaken. Are you claiming that Glicko-2 does not converge for players of consistent, but randomly distributed, strength as they play an endless number of games? In that case I have misunderstood Glicko-2.

> even IF the true Elo value remains constant there will always be a non-zero minimum amount of noise
Doesn't RD approach 0, so change in ratings will approach 0? I'm talking about normal Glicko-2, not the modified Lichess version. I don't fully understand how Glicko-2 works so maybe RD doesn't go down to 0.

> The point is whether this question crossed your mind when you made the claim in the first place?
Well I just assumed that Glicko-2 worked that way. My focus was not on the details of how Glicko-2 works, which I glossed over.

> there is also the danger that you will succumb to your own self serving biases if you are not disciplined with statistics because ofc you WANT your system you have worked hard developing to be effective and work well, so you may be overly selective seeing patterns and significance when it benefits you or your pride in your work (I can relate).
Again, to clarify: I didn't develop Ordo, and I'm not sure exactly what function it tries to optimize. Nor is it appropriate to train and test ratings on the same data. The true test of ratings is their ability to predict future outcomes. (This is not the case in engine chess, where Ordo is used; there the focus is on measuring the exact strength difference between two or more engines, and its statistical significance.)

> This can damage the objective validity of your own work so be careful when drawing conclusions without first laying out the statistical success/fail criteria of tests BEFORE you collect the data. This is what makes it so easy to lie with statistics when someone goes about it the wrong way, intentionally or unintentionally.
Yes, let me know what you think of my proposed test. I will include this proposal, or a better one, in the next month's blog post.

> If you think about this for a minute this obviously cannot be true, the game outcome is partially random and the only way for a data analysis prediction to be 100% accurate in the face of uncertainty is either being able to read the future or retrospective overfitting.
I was not claiming that 100% accurate outcome predictions are possible, maybe my sentence there was a bit imprecise. They most certainly are not possible, and absolutely not when using a single variable. I was just showing how this is a valid scale for rating (or rather ranking) accuracy.

Anyways, I hope to hear back from you, and I'll look forward to seeing your research! I hope you join the discussion on the Lichess Discord as well, where I'll be posting progress on my own program.
@justaz
> "The winner of the game is the player who makes the next-to-last mistake."
Or the player who doesn't get flagged in a winning position, as I was painfully reminded playing some bullet games to fit into next month's set. It's still a type of mistake I guess, but I wonder how that would be calculated in a computer-analysis-based rating.
@justaz Hi, thanks for replying in full and Happy New Year!

I think I'll start off by apologising for the tone of my first message XD. I think I misinterpreted what type of project you were doing and, well, attacked it from a data science and statistics perspective (my background is in physics but I am an avid stats and data science enthusiast; a bit autistic when it comes to getting stats right, so I got a bit triggered, whoopsie :P), when it was really more of a fun lil' preliminary exploration of "hmm, what if we use this software for player games?" and a first step towards wondering whether it could be used in the implementation of a better rating system. I thought it was being presented as the conclusion of a big project that seemed incomplete by my standards, but ofc you had no intention of just stopping there.

I think Ordo probably just uses a machine learning technique to incrementally improve the fit as you suggested; the precise method probably isn't critical for the statistics as long as it finds the same global optimum. They probably worked out which one works best for the context.

@justaz said in #55:
> Do you think it is better to count the total absolute error on a per-game basis, without summing the score of a player? We are certainly losing information by summing but my experiments with my own program show that the ratings converge very quickly with this metric, so it can't be that bad.
I think it doesn't really matter so much whether or not you sum it; it's more that you're accumulating data into a single test statistic rather than losing information. But summing does make more sense for use in a statistical test, as the sum or average is more well behaved thanks to the central limit theorem (hopefully) making it normally distributed. You could do something similar treating the games individually as a much larger number of separate data points and using a binomial model, but then each individual data point is less accurate and less reliable. In general, the methodology you are suggesting sounds good, but the success/fail criteria are still missing.
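For concreteness, the per-game variant would look something like this (reusing your placeholder names, purely a sketch):

def per_game_abs_error(games, ratings, E):
    # One data point per game: take |expected - actual| before any
    # per-player aggregation, then sum over all games.
    return sum(abs(E(ratings[p], ratings[o], c) - s)
               for p, o, c, s in games)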

@justaz said in #55:
> I do not fully understand the statistics, maybe you can explain. There seem to be two types of distributions we are talking about.
Yes, there is quite a bit to unpack here, I'll try my best. There are indeed 3 distributions in total that we have discussed, and they are all related in interesting, intuitive ways:
1) The distribution of the hidden "strength level", or what I like to call game "performance": how well a given player played in a given game (the game outcome is theoretically decided by whoever had the higher performance). Elo sensibly assumed this was normally distributed, but based on Jeff Sonas' findings in the article I shared, indicating that a truncated linear function gives a more accurate game outcome estimator, I personally believe it is more closely modelled by a continuous uniform distribution plus some real-world outliers (I will explain my intuitive reasoning for this in a bit).
2) The uncertainty distribution of the inferred true rating of an individual. I believe Glicko is based on a normal distribution for this, which makes sense.
3) The distribution of measured or observed ratings across the population of players. This is what my own rating regression analysis focussed on, and what you briefly discussed in your blog when comparing the Lichess rating distribution with the Ordo ratings.
Now onto how they relate to each other. The game outcome estimator and the concept of Elo chess ratings are derived from the normally distributed "performance" distributions. The Elo ratings can be understood as the positions of the means of the 2 players' performance bell curves, and since the sum or difference of 2 normally distributed random variables is itself normally distributed, the criterion of a positive/negative performance difference leads to a game predictor, based on rating difference, that is equivalent to a scaled version of the cumulative distribution function of the normal distribution. Elo used a logistic sigmoid as a simplified approximation to this, classic physics professor behaviour from Elo. The 400 constant is an arbitrary scale factor proportional to the standard deviation but not equal to it, and yes, as it is a purely relative scale without much of an absolute basis at all, you can shift all ratings around as much as you like; we tend to keep the centre around 1500 to keep ratings positive (although mathematically setting the average to 0 makes more sense).
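To spell out that step: if player A's performance is drawn from N(R_A, σ²) and player B's from N(R_B, σ²), the difference A - B is distributed as N(R_A - R_B, 2σ²), so P(A outperforms B) = Φ((R_A - R_B) / (σ√2)), where Φ is the standard normal CDF. Swap Φ for the logistic sigmoid and absorb the σ√2 into the scale constant, and you land on the familiar E = 1 / (1 + 10^(-Δ/400)).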
Now, the specific shape of the game estimator function used by the rating system (which is partially arbitrary, as discussed) will determine the stable shape of the distribution of observed ratings that the player population converges to. As I explained briefly in my original message, the outcome estimator function will resemble the cumulative distribution function of this population rating distribution as well.

This can be intuitively understood through a simplified model: assume a higher-percentile player always beats a lower-percentile player, and all players play each other until the population distribution stabilises. The uniform distribution of percentiles will then be transformed by the estimator function running in reverse (the percentiles will be fed into the inverse of the estimator) to generate the player ratings around an arbitrary mean, which means the estimator function must be equivalent to the CDF of the population ratings in this idealised case.

This is why I would expect the CDF of the real population distribution to resemble whatever game estimator function was selected (and why I find Sonas' result with the truncated linear estimator so significant on a theoretical level: a model of that form could only imply that the hidden player performance quantity resembles a uniform distribution), although the distribution is not uniquely defined from this, because we have yet to consider the distribution of true population ratings. The game estimator, along with the personal uncertainty distribution, will likely add statistical fuzz over this true distribution of ratings to produce the measured distribution of live ratings.
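If you want to play with this idealised model, here is a small sketch (logistic estimator assumed, numbers arbitrary):

import math
import random

def inverse_logistic_estimator(p, scale=400 / math.log(10)):
    # Inverse of E(delta) = 1 / (1 + 10 ** (-delta / 400)), centred on 0:
    # maps a win-percentile back to a rating offset from the mean.
    return scale * math.log(p / (1 - p))

# Idealised population: uniform percentiles, higher always beats lower.
# Feeding the percentiles through the inverse estimator yields ratings
# whose CDF is, by construction, the estimator itself.
percentiles = [random.uniform(0.001, 0.999) for _ in range(100_000)]
ratings = [1500 + inverse_logistic_estimator(p) for p in percentiles]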
@justaz said in #55:
> I don't fully understand how Glicko-2 works so maybe RD doesn't go down to 0.
@justaz said in #55:
> Well I just assumed that Glicko-2 worked that way. My focus was not on the details of how Glicko-2 works, which I glossed over.

If you are proposing your Ordo ratings as an improvement on Glicko-2, then it's important to take the time to understand more about how it works, so you can argue whether or not yours really is a better system.

http://www.glicko.net/glicko/glicko2.pdf (The important parts are the estimator and updating procedures, the volatility updating procedure is very complicated but can be skimmed as it's not as crucial as RD updating)

Although playing games reduces RD, and over time the expected rating of the player should approach their true rating, the RD will never reach 0 no matter how frequently you play, because of the way it is updated every rating period using the game outcomes and the volatility. Playing games has the effect of reducing RD, but the volatility acts as an opposing effect that naturally increases your RD over time whether you play or not. The RD will stabilise towards an equilibrium where these effects are balanced. Because of the precise formula used to update it, a sum inside a square root, the RD will never reach 0: the RD-increasing effect of the volatility per unit time grows without bound as the RD tends to 0, so playing more actively will shift the equilibrium, but not to 0. The volatility is also updated (through a very complicated and unclear root-finding procedure of some function they don't explain), but I think it's always positive. In short, it's not just about playing lots of games but playing lots of games ludicrously frequently.
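For reference, the two relevant steps in the paper's notation (φ is the RD on the internal scale, σ the volatility, v the variance estimate from the period's games):

φ* = sqrt(φ² + σ²)            (volatility inflates the RD before the games)
φ' = 1 / sqrt(1/φ*² + 1/v)    (the period's games then shrink it again)

Since σ > 0 keeps φ* strictly positive and v is finite for any finite number of games, φ' can never reach 0; the two effects settle into the positive equilibrium described above.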
@justaz said in #45:
> Anyways, I've got my own program cooking. Stay tuned for the December release for perhaps even more accurate ratings :)

Excellent. Will you put it on GitHub? (Or similar)

It's a shame that Ordo is totally opaque. In practice it works well, but the code is completely undocumented, and there is no accompanying article explaining what it does.

It would be great to have something collaborative, transparent and modern/fast to replace Ordo.
The graphics are very low quality; I can't see anything in them. Can you replace them with high-definition graphics?