
Exact Ratings for Everyone on Lichess

@EmaciatedSpaniard Thank you!

Yep, we are working on faster rating algorithms.

> The difference seems quite modest, only about 1% improvement for ORDO.
I was waiting for someone to question the math in that section. This is not a perfect way to measure rating accuracy; when the December data comes out soon, there will be new ratings with more accurate measurements. The intuition behind the 13.1% figure is that a useless rating system predicts 50% of outcomes and a perfect system predicts 100%, so accuracy lives on a scale between 0.5 and 1. On that scale, one system is 8.423pp over random chance and the other is 9.527pp over random chance, hence 13.1% better. But don't worry, I'll do a much more accurate evaluation next month!
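The arithmetic behind the 13.1% figure can be checked directly from the two percentage-point lifts over random chance:

```python
# Sketch of the arithmetic behind the 13.1% figure.
# Accuracies live on a 0.5 (random guessing) to 1.0 (perfect) scale.
baseline = 0.5

acc_old = 0.5 + 0.08423   # 8.423pp above random chance
acc_new = 0.5 + 0.09527   # 9.527pp above random chance

lift_old = acc_old - baseline
lift_new = acc_new - baseline

relative_improvement = (lift_new - lift_old) / lift_old
print(f"{relative_improvement:.1%}")  # → 13.1%
```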
@btdmaster50 said in #37:
> I looked at your Makefile, it looked pretty good. One idea is you could try -Ofast instead of -O3, which will give less accurate (incorrect for some purposes) floating point math with a potential big gain in performance.
>
> Of course, you'd need to check its chess accuracy, but it might even improve (less overfitting) so it might be interesting.

Thank you for the feedback! I'm pretty sure I tried -Ofast and the difference was not statistically significant. I'm working on a GPU-based approach anyway, but someone else is working on improving the regular code. If you're interested, you can head over to the Lichess discord server and check out the Rating research unit in #general-programming-forum. Here is a direct link: discord.com/channels/280713822073913354/1190085963339272344
Interesting post. I recommend using larger font sizes in your graphs; it will improve their quality greatly.
@BananaBeaver said in #44:
> Have you tried BayesElo?

I know about the other programs but didn't bother trying them. Wouldn't it be funny if one of them was a lot faster and didn't take 11 hours to compute ratings for the blitz games XD

Anyways, I've got my own program cooking. Stay tuned for the December release for perhaps even more accurate ratings :)
@justaz There is a major problem with the massive machine-learning data-crunching Ordo methodology you are suggesting, though it isn't your fault at all in how you have gone about this project. The issue is the unquestioned assumption running through the whole argument that the Elo curve model and the bell-shaped normal or logistic distribution models for player ratings are correct and statistically consistent with reality; in actual fact they are not! You are not alone in thinking this way, and it's an easy trap to fall into. There is a lot of bureaucratic resistance to changing the rating systems because it's a lot of work, and most people would prefer to ignore the obvious signs that the current outdated system doesn't work; they are stuck in a particular paradigm for thinking quantitatively about chess skill without applying the scientific method to test their beliefs.

I am in the process of working on an improvement to the status quo of chess rating systems, and imo we should be basing ratings on computer analysis of player performance over whole games rather than just the outcome. We have the technology now to analyse and quantify how strong moves are within games, so why we haven't switched over to using it in ratings is a mystery. I will be making some blog posts about it soon, so keep an eye out ;)

Here is an excellent blog post from a world-leading chess statistician giving his statistical data-science take on FIDE Elo and his proposed optimisations and refinements: en.chessbase.com/post/the-sonas-rating-formula-better-than-elo
(tl;dr: the Elo curve as a predictor of game outcome is absolutely inconsistent with real-world data; a simple truncated linear model gives a better fit. An optimised K-factor and different weightings for different time controls help produce a more accurate overall rating too.)
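To make the comparison in that tl;dr concrete, here is a minimal sketch of the two predictor shapes. The linear slope used below (1/850) is illustrative only, not necessarily the fitted value from the linked post:

```python
# Logistic Elo expected-score curve vs. a truncated linear predictor.
# The 1/850 slope is an illustrative stand-in; see the linked post for
# the actual fitted parameters.

def elo_expected(diff: float) -> float:
    """Standard logistic Elo predictor for a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def linear_expected(diff: float, slope: float = 1.0 / 850.0) -> float:
    """Truncated linear predictor: a straight line clamped to [0, 1]."""
    return min(1.0, max(0.0, 0.5 + slope * diff))

# Compare the two curves at a few rating differences.
for d in (0, 100, 200, 400, 800):
    print(d, round(elo_expected(d), 3), round(linear_expected(d), 3))
```

The interesting region is the middle of the curve, where the linked analysis argues real-world scores track the straight line better than the logistic S-curve.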

Seeing your graphs of the distribution of Lichess ratings alongside the fitted Ordo ratings reminded me of my own rather in-depth regression analysis project, which analysed how well various theoretical models fit the distribution of ratings on Lichess, in an attempt to test personal beliefs about the impact the recent influx of new players has had on the rating distribution. I think you may find the statistical conclusions raise important questions about some side assumptions hidden in claims about how chess ratings "should" be distributed:
(colab.research.google.com/drive/1aRXuT1RJaestpPC9L2lo6InG3OT8w5eD?usp=drive_link)
tl;dr: chess ratings are absolutely NOT normally distributed! A logistic distribution fits better, which makes sense given the logistic function used by Elo/Glicko systems to predict game outcome: if higher-percentile players beat lower-percentile players, you would expect the cumulative distribution function to resemble whatever predictor of game outcome you are using. Even so, a simple single-peaked two-parameter model will not fit the data within its errors and confidence intervals. It appears you need a superposition of two such logistic peaks close together to get a good fit, which supports the hypothesis that the player population has split into two sub-populations: pre and post "The Queen's Gambit" / the rise of chess YouTube and the mainstreaming of chess.
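The normal-vs-logistic comparison can be sketched in a few lines with scipy: fit both distributions by maximum likelihood and compare log-likelihoods. Synthetic logistic-distributed "ratings" stand in for the real Lichess sample here, so the numbers are purely illustrative:

```python
# Sketch: compare a normal vs. a logistic fit by log-likelihood.
# Synthetic logistic "ratings" stand in for the real data, so this only
# illustrates the method, not the actual Lichess result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ratings = rng.logistic(loc=1500, scale=180, size=20_000)

for dist in (stats.norm, stats.logistic):
    params = dist.fit(ratings)                  # maximum-likelihood fit
    log_lik = np.sum(dist.logpdf(ratings, *params))
    print(dist.name, round(log_lik, 1))

# The better-fitting model has the higher log-likelihood; a formal
# comparison would also use confidence intervals or an information
# criterion, as in the linked notebook.
```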

Some constructive criticism about your data science methodology: your software engineering skills are clearly great (I assume that's your background?), but you have to work on the statistical and scientific critical thinking in how you conduct and disseminate results. If you want to test beliefs and get closer to the truth, you have to use a more formal approach with statistical hypothesis tests, and you have to be more self-aware of when your statistical arguments are too vague and unsupported; you have to go deeper.

For instance, you make a tenuous claim at the start, appealing to the law of large numbers, that as the number of games tends to infinity each player's rating will eventually converge to some exact true value. This is not the case: at best the player's expected or average rating will tend to an exact value, but not the live rating itself. On average the rating adjustments push you in the right direction, but the fluctuations and uncertainties never die down to an exact limiting value; even IF the true Elo value remains constant, there will always be a non-zero minimum amount of noise. (My pure-maths skills aren't good enough to prove this from first principles, but if you think about it on a deep intuitive mathematical level you can see it makes more sense than what you were suggesting. The real point is whether this question crossed your mind when you made the claim in the first place.)

Another example is when you stated figures about guessing the outcome right 58% of the time and then called that a significant "good" result without stating the sample size; without that information, the effect size has no context of statistical significance to compare against. Given the number of data points you seem to be using, I'll give you the benefit of the doubt, as a few percentage points probably is significant in this context, but remember we don't know that from your blog!
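The noise-floor claim is easy to demonstrate by simulation, even without a first-principles proof. Here is a toy constant-K Elo model (K, ratings, and the draw-free outcome model are all illustrative, not Lichess's Glicko-2): the player's true strength never changes, yet the live rating keeps fluctuating with a spread that does not shrink toward zero.

```python
# Sketch: with a constant K-factor, a player's live rating never
# converges to an exact value even when true strength is fixed; it
# fluctuates around the truth with a non-vanishing spread.
# K, ratings, and the no-draws outcome model are illustrative only.
import random
import statistics

random.seed(0)
K = 20
true_rating = 1800.0   # fixed true strength
opp = 1800.0           # always faces an 1800-rated opponent

def expected(a: float, b: float) -> float:
    """Logistic Elo expected score for player a against player b."""
    return 1 / (1 + 10 ** ((b - a) / 400))

rating = 1500.0
trajectory = []
for _ in range(200_000):
    p_win = expected(true_rating, opp)          # truth decides the result
    score = 1.0 if random.random() < p_win else 0.0
    rating += K * (score - expected(rating, opp))
    trajectory.append(rating)

tail = trajectory[-50_000:]                     # long after convergence
print(round(statistics.mean(tail)), round(statistics.stdev(tail), 1))
# mean hovers near 1800, but the spread stays at tens of points
```

The average over the tail sits near the true 1800, but the standard deviation of the live rating plateaus at tens of points: the expected rating converges, the rating itself never does.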
But besides the issue of whether it is persuasive enough to an attentive reader, there is also the danger of succumbing to your own self-serving biases if you are not disciplined with statistics. Of course you WANT the system you have worked hard developing to be effective and work well, so you may be overly selective, seeing patterns and significance when it benefits you or your pride in your work (I can relate). This can damage the objective validity of your work, so be careful about drawing conclusions without first laying out the statistical success/fail criteria of your tests BEFORE you collect the data. This is what makes it so easy to lie with statistics when someone goes about it the wrong way, intentionally or unintentionally.
The issues you raised about Lichess volatility are very real. I suffer from unintentional volatility farming because I can't play as actively as the Lichess system expects, so my ratings never really stabilise, which is a pain. I don't play all the time because I only want to play when I feel I can play at my best; otherwise my rating suffers. I think they need to reduce the player volatility, because even for a really active player (unlike me) the RD can't go far below 50: the volatility increases RD too fast, so even when you play lots of games your rating supposedly never gets much more accurate than to the nearest 100, which I don't think reflects the real stability of Lichess ratings. I'm quite confident from how well I play relative to my opponents that my Rapid rating is around 1980±10, but no matter how many games I play it will never fully stabilise there, despite swinging around that value.
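The RD-floor effect can be sketched with the simpler Glicko-1 update rules (Lichess actually runs Glicko-2, and the constants below — c, games per period, opponent RD — are illustrative guesses, not Lichess's parameters): between rating periods RD is inflated by the time-decay constant c, and each game shrinks it, so RD settles at a steady-state floor well above zero no matter how long you play.

```python
# Sketch of the Glicko-1 RD steady state: per-period inflation by c
# balances per-game shrinkage, so RD plateaus above zero.
# Constants are illustrative; Lichess actually uses Glicko-2.
import math

Q = math.log(10) / 400

def g(rd: float) -> float:
    """Glicko-1 attenuation factor for an opponent's RD."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def play_period(rd: float, games: int, opp_rd: float = 50.0) -> float:
    """Shrink RD over `games` evenly matched games (E = 0.5)."""
    for _ in range(games):
        d2 = 1 / (Q**2 * g(opp_rd) ** 2 * 0.25)  # E*(1-E) = 0.25
        rd = 1 / math.sqrt(1 / rd**2 + 1 / d2)
    return rd

rd, c = 350.0, 35.0                 # starting RD, per-period inflation
for period in range(52):            # e.g. a year of weekly periods
    rd = math.sqrt(rd**2 + c**2)    # time decay inflates RD...
    rd = play_period(rd, games=10)  # ...and play shrinks it again
print(round(rd, 1))                 # plateaus in the 50s, never near 0
```

With these (made-up) numbers the equilibrium lands in the 50s, which matches the complaint: more games stop buying you more precision once inflation and shrinkage balance.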
@justaz said in #41:
> The intuition behind the 13.1% figure is that, a useless rating system predicts 50% of outcomes, and a perfect system predicts 100%.

If you think about this for a minute, it obviously cannot be true: the game outcome is partially random, and the only way for a prediction to be 100% accurate in the face of genuine uncertainty is either being able to read the future or retrospective overfitting.
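A toy illustration of the point: even an oracle that knows the true win probability of every game can only be right max(p, 1-p) of the time on each one, so 100% accuracy is possible only if every game is a foregone conclusion. The win probability below is a made-up example:

```python
# Toy illustration: an oracle that knows the true win probability
# still cannot exceed max(p, 1-p) accuracy per game, so 100% overall
# accuracy requires every game to be a foregone conclusion.
import random

random.seed(1)
true_p = 0.65                       # hypothetical true win probability
games = [random.random() < true_p for _ in range(100_000)]

# Oracle strategy: always predict the more likely outcome (a win here).
oracle_accuracy = sum(games) / len(games)
print(round(oracle_accuracy, 3))    # ≈ 0.65, nowhere near 1.0
```

So a realistic ceiling for any outcome predictor is set by how lopsided the games are, not by how clever the rating system is.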