Exact Ratings for Everyone on Lichess

@BananaBeaver said in #59:
> Excellent. Will you put it on GitHub? (Or similar)
>
> It's a shame that Ordo is totally opaque. In practice, it works well, but the code is completely undocumented, and there is no article that comes with it to explain what it does.
>
> It would be great to have something collaborative, transparent and modern/fast to replace Ordo.

Thank you! Yes, it will go on GitHub. I might even build a website to view player rating graphs. Right now I'll be posting updates in the Discord channel I've linked previously. I finally finished a work project, so I have more free time now :)
@ShogiButItSimplifies

Thank you for your response; I'll try to wrap my mind around the rating math later. Happy new year!

Also, Ordo calculates ± intervals for ratings. To do that it runs a bunch of simulations, which is fine when you have 50 engines, but with 1 million players, not so much. Surely there is a smart way to calculate the ± values for every player; this would be crucial for a real leaderboard. Every rating deviation depends on every opponent's deviation, so they would probably have to converge in a similar way to how the ratings themselves do now. Something for you to think about, perhaps...

Let's imagine a simple case: Player A has a rating of 2000 ± 100 (SD = 50, ± is the 95% confidence interval). Player B has only played A and scored +64-36=0, giving him a rating of 2100 ± x. What would x be here? Even simpler, let's imagine there are no draws. (I believe Ordo calculates the "draw rate between two equal opponents" precisely for this type of question, when draws are included.)
@justaz said in #62:
> Ordo calculates ± intervals for ratings. To do that it runs a bunch of simulations, which is fine when you have 50 engines, but 1 million players, not so much.

Right, so it's a Monte Carlo simulation of some kind combined with some machine learning. Interesting.

@justaz said in #62:
> Surely there is a smart way to calculate the ± values for every player.

There probably is, but as you said it would likely end up very similar to what Glicko-1/2 already do, and the most mathematically correct way to do it might not be the best engineering choice, either from a practical development perspective (it probably wouldn't scale computationally to large numbers of players) or in terms of how well the mathematical modelling holds up against the messy real world. A robust statistical approach with uncertainty estimates and an updating procedure is the way to go.

But if you want a smart way, I would suggest using a simple approximation for the error based on the number of games played by the given player in the period of time you are interested in, which would greatly reduce the computation needed. I would recommend 130/sqrt(n), since errors typically decay with sample size according to a 1/sqrt(n) relationship because of the way averaging affects uncertainty. The thing is, the only way to bring down the uncertainty is more data, and it will never reach 0, so calling them "exact ratings" can be a bit misleading.
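
To make that concrete, here is a minimal sketch of the approximation in Python; the function name is made up, and the 130 scale factor is just the value suggested above rather than anything fitted to Lichess data:

```python
import math

def rating_error(n_games, scale=130.0):
    """Rough +/- estimate for a player's rating error, assuming it shrinks
    like 1/sqrt(n) with the number of games played in the rating period."""
    if n_games == 0:
        return float("inf")  # no games, no meaningful uncertainty estimate
    return scale / math.sqrt(n_games)

print(rating_error(25), rating_error(100))  # 26.0 and 13.0
```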

@justaz said in #62:
> Let's imagine a simple case: Player A has rating 2000 ± 100 (SD = 50, ± is 95% confidence) Player B has only played A and scored +64-36=0 giving him a rating of 2100 ± x. What would be x here? Even simpler, let's imagine there are no draws.

This is definitely solvable with some statistical error propagation; these sorts of problems come up all the time in physics experiments. The crucial piece of information missing is the sample size, i.e. the number of games (n); without it there is no meaningful notion of uncertainty about the game outcomes.

I would solve this problem by first considering the uncertainty on player B's expected score from a game (p), based on all the real game outcomes. This can be modelled using a beta distribution, which in this case is a binomial distribution rearranged to represent the distribution of the probability parameter (the inclusion of a small number of draws will not cause too many problems with this choice of model). The error on this expected score will be sqrt(p(1-p)/n), and it is then propagated (along with p) through the sigmoid curve used to determine expected scores from the rating difference. The propagation is done with simple dy/dx calculus: the gradient converts uncertainty in one variable into uncertainty in the other, assuming the errors are small and the curve is locally smooth and linear. This gives you a value for the rating difference between the two players along with a ± error on that difference.

The difference is then added to A's rating, and the error on A's rating is combined with the error on the rating difference by "adding in quadrature", which is just summing the squares and then square rooting. This is the correct way to combine uncorrelated errors, since when two uncorrelated random variables are added you get the combined mean and variance by adding the component means and variances, NOT the standard deviations.
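
Here is a rough worked version of that propagation in Python, using the numbers from #62 and assuming the +64-36=0 score means n = 100 games with no draws (the function name is only for illustration):

```python
import math

def propagate_rating_error(rating_a, sd_a, wins, losses):
    """Toy version of the propagation described above, with no draws.
    Returns player B's estimated rating and its standard deviation."""
    n = wins + losses
    p = wins / n                          # B's observed score against A
    sd_p = math.sqrt(p * (1 - p) / n)     # binomial/beta error on that score

    # Invert the Elo logistic E = 1 / (1 + 10**(-d/400)) to get the rating gap d
    d = 400 * math.log10(p / (1 - p))
    # Local gradient dd/dp converts the score error into a rating-gap error
    dd_dp = 400 / (math.log(10) * p * (1 - p))
    sd_d = dd_dp * sd_p

    # Add A's uncertainty and the gap's uncertainty in quadrature
    rating_b = rating_a + d
    sd_b = math.sqrt(sd_a**2 + sd_d**2)
    return rating_b, sd_b

# The example from #62: A is 2000 with SD 50, B scores +64-36=0 against A
rating_b, sd_b = propagate_rating_error(2000, 50, 64, 36)
print(rating_b, sd_b, 1.96 * sd_b)  # ~2100, SD ~62, 95% interval ~ +/- 121
```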

This method would not scale up to larger numbers of players well at all, but it can give you an idea of what the analytical approach would look like.
I think what you're trying to do should be equivalent to calculating the maximum likelihood estimator of the Bradley-Terry model (but I'm not sure if everything works out correctly with draws). There is some pre-existing work on calculating it which might be useful.
If it is equivalent then you might be able to get a simple approximation for "error bars" by taking the second derivative of the likelihood w.r.t. the rating (with some rescaling). This tells you how quickly the likelihood shrinks if you change the calculated rating slightly, which gives you an idea of how stable the result is.
Edit: Actually, I think the correct standard error should be 1/sqrt(-(second derivative of the log likelihood)) evaluated at the maximum (the curvature is negative there, so the minus sign makes it positive), but admittedly I'm out of my depth here
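
For what it's worth, a toy sketch of that idea in Python: hold the opponents' ratings fixed, evaluate the log-likelihood under the usual Elo/logistic expected-score formula, and take 1/sqrt of the negative second derivative at the fitted rating. The game list and the step size h are made-up illustration values:

```python
import math

def expected(r, r_opp):
    """Elo/logistic expected score for a player rated r against r_opp."""
    return 1 / (1 + 10 ** ((r_opp - r) / 400))

def log_likelihood(r, games):
    """games is a list of (opponent_rating, score) pairs with score 0 or 1."""
    return sum(s * math.log(expected(r, ro)) + (1 - s) * math.log(1 - expected(r, ro))
               for ro, s in games)

def rating_std_error(r_hat, games, h=0.5):
    """1 / sqrt(-d^2 logL / dr^2) at the fitted rating: the standard error
    from the curvature (observed Fisher information), estimated here with a
    finite-difference second derivative."""
    d2 = (log_likelihood(r_hat + h, games)
          - 2 * log_likelihood(r_hat, games)
          + log_likelihood(r_hat - h, games)) / h ** 2
    return 1 / math.sqrt(-d2)

# Made-up example: 100 games against a fixed 2000-rated opponent, 64 wins
games = [(2000, 1)] * 64 + [(2000, 0)] * 36
print(rating_std_error(2100, games))  # ~36, matching the propagation sketch above
```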

Also, I don't really think assuming that the result of a match between two players is logistically distributed is a huge mistake. It's a simplification of the model, but so is assigning a single number to the ability of a player. (Imo, taking Sonas's numbers, which come from games between masters (I think?), and assuming that games between amateurs and games between top chess engines behave the same is just as big a mistake.) Though it might be interesting to redo Sonas's numbers with your ratings on the Lichess dataset (and then maybe recalculate your ratings with the improved expected score formula and repeat? :P)
I can't find the quote from @justaz, but the idea that the winner is decided by one sufficient mistake, made last in a string of tactical segments, is likely all that one unique game can tell us from just one instance of the same player pair. Each game is a sample from the pool of players, and with totally random pairing it might take a lot of sampling of the whole pool before the exact same pair of experience sets meets again. On the other hand, if the two kept playing only each other, they might increase their experience sets within their own bubble of the big game of chess.

So, in conclusion, there is some insight in that statement. I was going to agree fully, but the above thoughts forced me to dissect that response (from a reply quote). I guess I should read the whole post later. Happy new year, everyone (wishes, of course!).
@ant_artic Thank you for sharing that insight about the Bradley-Terry model! I completely see where you are coming from. For anyone struggling to follow the maths: assuming the Elo model is true, the logistic estimator for the game outcome used by Elo can be rearranged into the form of the Bradley-Terry condition by using exp(rating) as the Bradley-Terry model p values (with a 400 scaling constant somewhere in there). After reading up about it on Wikipedia, I would conclude that it is a useful transformation and the existing work on solving the maximum likelihood estimation problem for it is useful, but unfortunately it isn't going to be much better than what Ordo is doing already (in fact Ordo might even be doing the same thing), since there is no known clever closed-form solution for the optimal parameters. The iterative procedure converges slowly, but it is thankfully guaranteed to find the global optimum, as there is only one optimum.
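
A quick numerical sanity check of that equivalence (purely an illustration, not anyone's actual code):

```python
def elo_expected(r_i, r_j):
    """Elo/logistic expected score."""
    return 1 / (1 + 10 ** ((r_j - r_i) / 400))

def bradley_terry_expected(r_i, r_j):
    """Bradley-Terry win probability with p_i = 10**(R_i / 400)."""
    p_i, p_j = 10 ** (r_i / 400), 10 ** (r_j / 400)
    return p_i / (p_i + p_j)

print(elo_expected(2100, 2000), bradley_terry_expected(2100, 2000))  # both ~0.640
```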

I think the most practical way of using this approach to find errors would be to find the half width at half maximum of the likelihood peaks around the optimal values and use these as the errors. Simple error propagation can then translate the uncertainty in the new Bradley-Terry p variables back into the ratings, again assuming small errors.
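
A tiny sketch of how that half-width measurement could look, checked against a Gaussian-shaped log-likelihood where the answer is known; everything here, including the hwhm_error name, is a made-up toy:

```python
import math

def hwhm_error(log_like, x_hat, step=0.5, max_width=2000.0):
    """Scan outward from the fitted value until the likelihood falls to half
    its peak (log-likelihood drops by ln 2). For a roughly Gaussian peak this
    half width is about 1.18 standard deviations."""
    target = log_like(x_hat) - math.log(2)
    width = step
    while log_like(x_hat + width) > target and width < max_width:
        width += step
    return width

# Toy check against a Gaussian log-likelihood with sigma = 36
sigma = 36.0
gaussian_loglike = lambda x: -0.5 * ((x - 2100) / sigma) ** 2
print(hwhm_error(gaussian_loglike, 2100))           # ~42.5, i.e. ~1.18 * sigma
print(hwhm_error(gaussian_loglike, 2100) / 1.177)   # ~36, recovering sigma
```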

@ant_artic said in #64:
> Also I don't really think assuming that the result of a match between two players is logistically distributed is a huge mistake.

Indeed, even in Sonas' results the effect size was not awful, so the model is still approximately correct, but given the sample size there is no question about the statistical significance of the result. So we know, scientifically, that Elo is not a correct model at least for master-level players, and even that result alone discredits the claim that the rating system is correct for all playing strengths, which is ideally what we are looking for in a simplified univariate chess rating system. It fits with our experience too: I know I stand no chance against a 2400+ player, no matter the supposedly small probability of me winning according to Elo, so in that case the Sonas predictor seems to make more sense practically speaking (the off chance that I win can pretty much always be regarded as an outlier ;P).
@C_a__l___e____b
Thanks for the tip! I'll try it. I assume you're referring to my new program and not Ordo. As it stands, my code is already very fast, taking 50 ms per iteration (using Cython). I did try using numpy and doing vector operations, but .sum() was somehow the bottleneck. You can check out the discussion on the Lichess Discord, which I've linked previously in this thread :)
@justaz I guess optimization is moving along as we speak :) I meant your program on GitHub. OK, I'll move over to Discord.