As a computer chess aficionado, I have long been thinking about the disadvantages of the incremental rating methods used for human play (in chess and in other games). The assumption of strength not changing between games is obviously not correct for humans, but it's usually close enough to the truth.
This blog post and the work behind it are very interesting.
In theory, it is still possible to get another substantial improvement in accuracy over this model, but it involves complications: instead of converging on one value (the rating), you need to converge on two (or more) values. The simplest such model pairs a rating with a "volatility" value that determines the spread of the expected outcomes (for example, a high-volatility player would be more likely to beat a much higher-rated player, but also more likely to lose to a much lower-rated one; obviously this property fundamentally depends on the player pool). But as far as I know, nobody has really bothered to build a rating tool (or update an existing one such as Ordo) to support this, even though computer chess offers easily repeatable examples of engines with a noticeably different "volatility".
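To make the idea concrete, here is a minimal Python sketch of what an expected-score function with per-player volatility could look like. This is purely my own illustration, not the blog post's model and not how any existing tool (e.g. Glicko-2) actually handles volatility; the 400-point logistic scale and the default volatility of 200 are arbitrary assumptions.

```python
import math

def expected_score(rating_a, rating_b, vol_a=200.0, vol_b=200.0):
    """Expected score of A against B under a logistic curve whose width grows
    with the players' combined volatility: the higher the volatility, the
    closer the prediction moves to 0.5, so upsets in both directions become
    more likely. With vol_a == vol_b == 200 this reduces to the familiar
    Elo formula on a 400-point scale."""
    # Combine the two per-player spreads into a single scale parameter.
    scale = 400.0 * math.sqrt((vol_a ** 2 + vol_b ** 2) / (2 * 200.0 ** 2))
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))
```

The point of the sketch is only that the same rating difference should translate into a prediction closer to 0.5 when either player is volatile, which is exactly the behaviour a single-value rating cannot express.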
By the way, I don't think that the 58%/59% numbers you computed are the best reflection of a rating algorithm's accuracy. As far as I can tell from your description in the blog post, the only thing that matters there is the ordering: it tests "did the higher-rated player win", but disregards by how much that player is higher rated. It is probably better to measure the mean squared error (MSE) between the expected outcome and the actual outcome of each game.
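To illustrate the difference between the two measures, here is a small Python sketch (again my own, not the blog post's code). I'm assuming games come as (rating_a, rating_b, score_a) tuples with score_a in {0, 0.5, 1}, and I skip draws and equal ratings in the ordering metric, which may not match how the post counted them; `expected_score` is the hypothetical function from the earlier sketch.

```python
def ordering_accuracy(games):
    """Fraction of games in which the higher-rated side scored better.
    Ignores the size of the rating gap; skips draws and equal ratings."""
    hits = total = 0
    for rating_a, rating_b, score_a in games:
        if rating_a == rating_b or score_a == 0.5:
            continue
        total += 1
        hits += (rating_a > rating_b) == (score_a == 1.0)
    return hits / total if total else float("nan")

def mse(games, expected=expected_score):
    """Mean squared error between the predicted and the actual score:
    a 600-point favourite who loses costs far more than a 10-point
    favourite who loses, unlike with the ordering metric."""
    return sum((expected(a, b) - s) ** 2 for a, b, s in games) / len(games)

# Example: both games are "upsets" for the ordering metric, but the MSE
# treats the near-coin-flip loss much more leniently than the big one.
games = [(2800, 2200, 0.0), (2505, 2495, 0.0)]
print(ordering_accuracy(games), mse(games))
```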