
rating deviation and streaks

I really appreciate the probabilistic approach of the Glicko-2 rating system adopted by Lichess, and I think it does a fairly good job at its aim of matching players of similar strength and modeling winning/losing probabilities. However, if we see it purely as a tool for this aim (i.e., as a model for estimating the probability of the different outcomes of a game), I think this method (as well as all the alternatives, to be fair) falls short in one respect: it assumes that the outcome of a game is predicted only by the ratings of the two players, ignoring the existence of hot and cold streaks (see, e.g., Adilov et al., 2022; Chowdhary et al., 2023).

To overcome this, one could implement a rating system which takes into account the fact that winning probabilities are predicted not only by the rating difference, but also by the outcomes of the two players' recent games. I think neural networks or other supervised learning methods could easily fold this information into a rating system, given some training data. This way, when a player is having an unusually good or bad day, their rating deviation would likely be dampened, limiting the effect of hot and cold streaks on their own and their opponents' rating changes. This would make sense not only theoretically but also pragmatically: I think everyone has had the experience of winning against a higher-rated opponent, only to find out later, looking at their profile, that they were on a huge losing streak. In this and similar cases, I think a more accurate model would predict the outcome from recent playing history in addition to rating values.

A further advantage of this method would be to limit something we might call "involuntary smurfing", by which I mean what happens when someone ends up with a low rating after a big losing streak and then plays lower-rated players for some time. Of course their rating eventually climbs back to its "true" level and none of this is a big deal, but overall it might make for a better playing experience.
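Just to make the idea concrete, here is a rough sketch of what such a model could look like. This is purely illustrative: the feature set, the logistic-regression choice, and every name in it are my own assumptions, not anything Lichess uses.

```python
# Sketch of a streak-aware outcome model (illustrative only -- not Lichess code).
# Features: rating difference plus each player's last K results, used to
# predict the probability that White wins.
import numpy as np
from sklearn.linear_model import LogisticRegression

K = 5  # how many recent results to look at (an assumption)

def features(rating_diff, white_recent, black_recent):
    """rating_diff: White's rating minus Black's rating.
    white_recent / black_recent: the last K results, +1 win, 0 draw, -1 loss."""
    return np.array([rating_diff / 400.0, *white_recent[-K:], *black_recent[-K:]])

# Toy training data; in practice this would come from historical games.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1 + 2 * K))
true_w = np.array([2.0] + [0.2] * K + [-0.2] * K)  # rating difference matters most
y = (rng.random(1000) < 1 / (1 + np.exp(-X @ true_w))).astype(int)

model = LogisticRegression().fit(X, y)
x = features(-100, [1, 1, 1, 1, 1], [-1, -1, -1, -1, -1])  # lower rated, but on a hot streak
print("P(White wins) ≈", model.predict_proba(x.reshape(1, -1))[0, 1])
```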
With that being said, I can also see arguments for rejecting such a proposal. In particular, because for some irrational reason people tend to treat their rating as a measure of their ability, or even of their intelligence, rather than as a measure of past performance and a matchmaking tool, implementing such a method might, for example, lead people to take games less seriously once they are on a losing streak (since their rating would not be strongly impacted).
I'm curious how people feel about this. Have you ever given it a thought?

References

Adilov, N., Tierney, H. L., Gerring, J. V., & Yu, T. (2022). The Bounce-Back Effect: Checkmating Competitors with the Cold Hand. Global Business & Finance Review, 27(6), 16.

Chowdhary, S., Iacopini, I., & Battiston, F. (2023). Author Correction: Quantifying human performance in chess. Scientific Reports, 13, 4787. https://doi.org/10.1038/s41598-023-31622-8


You could easily argue for the opposite direction as well.

If you lose many games in a row, your rating should adjust even more quickly to reflect your current playing strength. After all, the rating should predict the *next* outcome. When will your streak end? Nobody knows. A five-minute break might be enough, or you might have a bad month full of stress.


Glicko-2 accounts for this with a "volatility" factor; however, Lichess does not implement Glicko-2 or anything similar, despite claiming otherwise: https://lichess.org/faq#ratings


@Toadofsky said in #3:

> Glicko-2 accounts for this with a "volatility" factor; however, Lichess does not implement Glicko-2 or anything similar, despite claiming otherwise: lichess.org/faq#ratings

Is there a TL;DR somewhere of what's actually used, or of how it differs from Glicko-2?


@nadjarostowa said in #2:

> You could easily argue for the opposite direction as well.
>
> If you lose many games in a row, your rating should adjust even more quickly to reflect your current playing strength. After all, the rating should predict the *next* outcome. When will your streak end? Nobody knows. A five-minute break might be enough, or you might have a bad month full of stress.

This is why I said that in that case your RD would *likely* be dampened: different patterns of performance (in terms of intervals between games, opponents' ratings, and the distribution of wins/losses/draws) could be predictive of different outcomes, and a supervised learning algorithm (e.g., an SVM or a neural network) should be pretty effective at modeling such patterns. I didn't know about the volatility factor mentioned by @Toadofsky, nor that Lichess apparently does not use Glicko-2. I agree that it would be interesting to know more about the rationale behind the choices Lichess has made in this regard, although, as I stated in my original post, I find the current method perfectly suitable for its purposes and my curiosity is probably a case of professional deformation.
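To illustrate what I mean by "patterns of performance", here is a minimal sketch of the kind of features such an algorithm could be trained on. All names and the exact feature choices are hypothetical.

```python
# Illustrative feature extraction for the "patterns of performance" mentioned
# above (intervals between games, opponents' ratings, win/loss/draw distribution).
from dataclasses import dataclass
from typing import List
import statistics

@dataclass
class GameRecord:
    timestamp: float        # seconds since epoch
    opponent_rating: float
    score: float            # 1.0 win, 0.5 draw, 0.0 loss

def recent_form_features(history: List[GameRecord], k: int = 10) -> List[float]:
    """Summarize a player's last k games as a small feature vector."""
    recent = sorted(history, key=lambda g: g.timestamp)[-k:]
    if not recent:
        return [0.0] * 5
    gaps = [b.timestamp - a.timestamp for a, b in zip(recent, recent[1:])]
    scores = [g.score for g in recent]
    return [
        statistics.mean(gaps) if gaps else 0.0,              # average time between games
        statistics.mean(g.opponent_rating for g in recent),  # typical opposition strength
        scores.count(1.0) / len(recent),                     # win rate
        scores.count(0.0) / len(recent),                     # loss rate
        sum(scores[-3:]),                                     # very recent form
    ]
```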


Glickman's explanations of Glicko-2 highlight the differences between it and other rating systems. Lichess implements something like Glicko (Elo with an RD), however...

Many times in this very forum I have asked "what is the rating period Lichess uses for treating a collection of games to have occurred simultaneously?" and received no answer, because Lichess does not rate collections of games... each game is rated one at a time, in the sequence they are played, and there is no re-calculation/correction of ratings at the end of a rating period:

> To apply the rating algorithm, we treat a collection of games within a “rating period” to have occurred simultaneously. Players would have ratings, RD’s, and volatilities at the beginning of the rating period, game outcomes would be observed, and then updated ratings, RD’s and volatilities would be computed at the end of the rating period (which would then be used as the pre-period information for the subsequent rating period). The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period. The length of time for a rating period is at the discretion of the administrator.

http://www.glicko.net/glicko/glicko2.pdf

Because Lichess doesn't use rating periods, the "volatility" factor (of a player being on a winning/losing streak) is not captured by Lichess' rating system in the way that Glicko-2 would capture it.
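For reference, the relevant piece is Step 6 of Glickman's paper linked above: at the start of each rating period, the rating deviation is inflated by the (newly updated) volatility, so streaky results make the rating move faster in the following period. A minimal sketch of just that step (the example numbers are mine):

```python
import math

def pre_period_rd(phi: float, sigma: float) -> float:
    """Step 6 of the Glicko-2 paper linked above: at the start of a rating
    period, the rating deviation phi is inflated by the volatility sigma,
    so streaky results (which push sigma up) make the rating move faster."""
    return math.sqrt(phi ** 2 + sigma ** 2)

# Example on Glicko-2's internal scale; an RD of 50 on the familiar scale is
# roughly phi = 50 / 173.7178.
print(pre_period_rd(50 / 173.7178, 0.06) * 173.7178)  # a bit above 50
```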

Arguably Glicko-2 doesn't work well for online game servers (where a player can play dozens or hundreds of games per rating period), and so Glicko-Boost was created (which online-go.com used for a time). It would be wonderful if Lichess could implement such a thing, so that the question "during a rating period, is the rating system accurately predicting outcomes, or should the player's RD be increased?" could be asked and acted on.
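To make that question concrete (this is only my own sketch of the general idea, not Glicko-Boost and not anything Lichess does), one could compare the system's expected scores against the actual results within a period and widen the RD when the period looks too surprising:

```python
import math

def expected_score(rating: float, opp_rating: float, opp_rd: float) -> float:
    """Glicko-style expected score against one opponent, using the standard
    g(RD) attenuation from Glickman's papers."""
    q = math.log(10) / 400
    g = 1 / math.sqrt(1 + 3 * q * q * opp_rd * opp_rd / math.pi ** 2)
    return 1 / (1 + 10 ** (-g * (rating - opp_rating) / 400))

def rd_after_period(rating, rd, results, surprise_threshold=2.0, boost=1.3):
    """Hypothetical end-of-period check (names and constants are mine): if the
    player's actual score deviates from the expected total score by more than
    `surprise_threshold` standard deviations, inflate the RD so the rating can
    move faster in the next period.  `results` holds tuples of
    (opponent_rating, opponent_rd, score)."""
    exps = [expected_score(rating, r, d) for r, d, _ in results]
    variance = sum(e * (1 - e) for e in exps)        # variance of the total score
    deviation = abs(sum(s for _, _, s in results) - sum(exps))
    if variance > 0 and deviation / math.sqrt(variance) > surprise_threshold:
        return min(rd * boost, 350.0)
    return rd
```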

Years ago I encouraged Lichess to improve the accuracy of its rating system by:

  • Increasing player RD over time, even for players who aren't playing games (but this isn't the same as Glicko or Glicko-2, as noted above... collections of games are not rated concurrently as a collection, but instead ratings are highly volatile)
  • Decreasing the minimum RD for highly active players, to help stabilize ratings
  • Decreasing the RD cutoff for appearing on the "top 100" etc. leaderboards, so inactive cheaters don't appear on the leaderboard (a cheater would need to play many games to appear, thereby providing ample evidence to judge them by)

@Toadofsky said in #6:

> Many times in this very forum I have asked "what is the rating period Lichess uses for treating a collection of games to have occurred simultaneously?" and received no answer, because Lichess does not rate collections of games... each game is rated one at a time, in the sequence they are played, and there is no re-calculation/correction of ratings at the end of a rating period
Rating periods similar to those used by FIDE (one month) or national rating systems (often longer) would be hard to swallow for an impatient online chess user base. That's why Lichess takes a different approach and updates ratings after each game. I can't find the source where this was explained best, but IIRC the idea is that a time factor is still applied to adjust RD (in addition to result (in)stability), and there are adjustments to the Glicko-2 formulas for a variable "rating period". As an example, one's RD would grow from 60 to 110 after a year of inactivity, according to https://lichess.org/forum/lichess-feedback/query-about-the-rating-period-used-for-glicko-2-on-lichess
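For what it's worth, here is a rough sketch of how time-based RD growth of that kind could look, in the spirit of the original Glicko inactivity formula RD' = min(sqrt(RD^2 + c^2 * t), 350). The constant below is simply back-solved from the 60 -> 110 example, and Lichess's actual parameters may well differ.

```python
import math

RD_MAX = 350.0
# Back-solved so that an RD of 60 grows to 110 over 365 days of inactivity
# (purely illustrative; not Lichess's real constant).
C_SQUARED = (110 ** 2 - 60 ** 2) / 365

def inflated_rd(rd: float, days_inactive: float) -> float:
    """RD growth during inactivity, capped at the conventional maximum of 350."""
    return min(math.sqrt(rd ** 2 + C_SQUARED * days_inactive), RD_MAX)

print(inflated_rd(60, 365))   # 110.0
print(inflated_rd(60, 3650))  # ≈ 298, still below the 350 cap
```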


@mkubecek said in #7:

> Rating periods similar to those used by FIDE (one month) or national rating systems (often longer) would be hard to swallow for an impatient online chess user base...

There are "live" and "official" rating lists. FIDE official ratings are (re-)computed at the conclusion of a rating period.

The bit of Glickman's paper I quoted mentions "an average of at least 10-15 games per rating period" so if Lichess had a rating period it certainly wouldn't be one month or longer.

@mkubecek said in #7:

> That's why Lichess takes a different approach and updates ratings after each game. I can't find the source where this was explained best, but IIRC...

I appreciate your desire to explain my own code change to me (but actually, someone else coded it while I was pointing out how buggy the rating system was).


@Toadofsky said in #8:

> at least 10-15 games per rating period" so if Lichess had a rating period it certainly wouldn't be one month or longer.
For me, 10-15 games would indeed mean a month or so. For others, 10-15 games is one evening (a much more frequent case, I believe). For some, half an hour...


Indeed...

Arguably Glicko-2 doesn't work well for online game servers (where a player can play dozens or hundreds of games per rating period), and so Glicko-Boost was created (which online-go.com used for a time). It would be wonderful if Lichess could implement such a thing, so that the question "during a rating period, is the rating system accurately predicting outcomes, or should the player's RD be increased?" could be asked and acted on.


This topic has been archived and can no longer be replied to.