@impruuve Thanks for your thoughtful post. I just want to clarify a few things.
My formula and Elometer estimate the exact same thing: FIDE ratings. In principle, the final number produced by Elometer and the one produced by my formula should be interpreted in exactly the same way.
Think about the three quantities involved:
1. Observed measure of performance
2. Benchmark FIDE rating
3. Predicted FIDE rating
Both Elometer and my formula compare some measure of #1 to a self-reported measure of #2, in order to make predictions about #3. The quantities #2 and #3 are the same in both models. What differs is only the type of input data in #1: I used observed game results whereas Elometer uses puzzle results. So, again, the estimates of #3 that the two models produce should be interpreted in the same way.
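To make the setup concrete, here is a minimal sketch of the three quantities, assuming the formula amounts to a simple linear fit of self-reported FIDE ratings (#2) on an observed performance measure (#1), which is then used to predict FIDE ratings (#3). The functional form and the sample numbers here are illustrative assumptions, not the exact formula:

```python
import numpy as np

# Hypothetical sample: (#1 observed Lichess rating, #2 self-reported FIDE rating)
lichess = np.array([1500, 1700, 1900, 2100, 2300], dtype=float)
fide    = np.array([1350, 1560, 1740, 1950, 2140], dtype=float)

# Fit: predicted_FIDE = slope * lichess + intercept (least squares)
slope, intercept = np.polyfit(lichess, fide, 1)

def predict_fide(lichess_rating):
    """#3: predicted FIDE rating for a player with a given Lichess rating."""
    return slope * lichess_rating + intercept

print(predict_fide(2000))
```

Elometer does the same thing with puzzle scores in place of the Lichess rating on the x-axis; the y-axis (self-reported FIDE) and the output (predicted FIDE) are identical.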
You argue that Elometer's benchmark FIDE ratings are of higher quality. In particular, you claim that:
(a) Lichess self-reported FIDE ratings may be incorrect
(b) Lichess self-reported FIDE ratings may be outdated
(a) is obviously true from the graph that I linked to above. Some users report a 3000 rating; that can’t be correct. But it’s important to realize that Elometer ALSO uses self-reported FIDE ratings as its benchmark (see element #2 above). So there is nothing that tells us that their data is more reliable than mine. Why would people be more truthful in their self-reports over there? In fact, one reason Elometer might be overshooting could be that many people’s self-reports are inflated. Unfortunately, Elometer is not transparent about how it deals with outliers and fake self-reports. In this thread, I have discussed several ways to exclude outliers, and I can show that the results are not sensitive to “bad input data”.
(b) is a trickier issue. Imagine that user X improves by 50 FIDE points and 50 Lichess points, but forgets to update the self-reported FIDE rating in her profile. In that case, my formula will tend to *underestimate* the FIDE ratings of other players. In contrast, if user Y loses 50 FIDE points and 50 Lichess points but forgets to update her profile, my formula will *overestimate* the FIDE ratings of other players. If similar numbers of players move in both directions, the two biases cancel out, and the formula remains accurate.
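The cancellation argument is easy to check with a toy simulation (all numbers below are hypothetical, including the assumed FIDE-Lichess relation): half the players drift up 50 points in both systems and half drift down 50 points, with stale self-reports either way. A line fit on the contaminated data still predicts accurately on average:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
lichess_old = rng.uniform(1200, 2400, n)
fide_old = 0.9 * lichess_old + 100 + rng.normal(0, 30, n)  # assumed true relation

drift = rng.choice([-50.0, 50.0], n)   # half improve, half decline
lichess_now = lichess_old + drift      # current Lichess rating we observe
fide_reported = fide_old               # stale profile: self-report not updated

# Fit the formula on the contaminated (stale) data
slope, intercept = np.polyfit(lichess_now, fide_reported, 1)

# Evaluate predictions on players whose ratings are accurate and up to date
x = rng.uniform(1200, 2400, n)
true_fide = 0.9 * x + 100
mean_error = np.mean(slope * x + intercept - true_fide)
print(f"mean prediction error: {mean_error:.1f}")  # close to zero
```

The symmetric drift does cancel in the average prediction; it only adds a little noise around the fitted line.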
Moreover, it’s important to realize that baby Karjakin was a huge outlier. Some young people’s FIDE ratings change really quickly, but that’s definitely not the most common situation. Most people’s chess skills move extremely slowly over time.
Finally, as I explained above, I took several steps to remove outliers. If some kid improves by 400 FIDE and Lichess points but forgets to update her profile, that data point is automatically flagged as an outlier and excluded from the analysis. Again, not a big deal.
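For concreteness, automatic flagging along these lines can be sketched with a standard residual-based rule (the 3-sigma threshold and the sample data here are illustrative assumptions, not necessarily the exact procedure used):

```python
import numpy as np

def fit_excluding_outliers(lichess, fide, k=3.0):
    """Fit FIDE ~ Lichess, then refit after dropping points whose residual
    from the initial fit exceeds k standard deviations."""
    slope, intercept = np.polyfit(lichess, fide, 1)
    resid = fide - (slope * lichess + intercept)
    keep = np.abs(resid) < k * resid.std()
    return np.polyfit(lichess[keep], fide[keep], 1), keep

rng = np.random.default_rng(1)
lichess = rng.uniform(1200, 2400, 500)
fide = 0.9 * lichess + 100 + rng.normal(0, 30, 500)
fide[0] = 3000.0  # a fake or badly stale self-report, like the 3000s in the graph

(slope, intercept), keep = fit_excluding_outliers(lichess, fide)
print(f"flagged {np.count_nonzero(~keep)} outlier(s); slope = {slope:.2f}")
```

A self-report that is 400+ points off the fitted line gets dropped before the final fit, so it has no influence on the estimates.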
In sum, Elometer and my formula measure the exact same concept, and I’m not 100% convinced that my input data is worse than theirs (especially since we don’t know how much data they have), or that it should matter in the estimation.