
What is a Rating?

Analysis · Chess engine
Given decades of research, how can we measure performance?

Rather than belabor the US Chess rating system or the Glicko-boost system, which you already know about, let's examine alternatives for measuring player or engine performance.

Intrinsic Performance

Prof. Ken Regan at the University at Buffalo documented Intrinsic Chess Ratings, so I refer you to his detailed paper. I've frequently heard players claim, based on similar self-assessment, that their "true skill" is stronger than their USCF/FIDE ratings indicate. By comparing a player's actual moves to a strong engine's recommendations, one can form a biased estimate of player strength (biased in two respects: the engine-recommended move is sometimes inappropriate against a human opponent, and the bias from opponents' moves and move times can only be corrected with a large sample size).
Informally, many coaches use a similar technique: observe the moves a student plays, find the best moves themselves (in place of the engine), compare them to what the student actually played, and guess the student's skill level.
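As a toy illustration of this engine-comparison idea (a drastic simplification of Regan's model, which also weighs how much worse each played move is than the engine's choice), one can compute the fraction of positions where the player matched the engine's first choice. The function below is a hypothetical sketch and assumes you have already collected the engine's top move for each position:

```python
def move_match_rate(player_moves, engine_moves):
    """Fraction of positions where the player's move equals the engine's
    first choice. A crude proxy for intrinsic strength; Regan's actual
    model is far richer (it also weighs the evaluation gap of each
    non-matching move)."""
    if len(player_moves) != len(engine_moves):
        raise ValueError("expected one engine move per player move")
    if not player_moves:
        return 0.0
    matches = sum(p == e for p, e in zip(player_moves, engine_moves))
    return matches / len(player_moves)
```

In practice the `engine_moves` list would come from running a strong engine over each position, e.g. via python-chess's engine-analysis API.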

Bratko-Kopec

The Bratko-Kopec Test was designed by Dr. Ivan Bratko and Dr. Danny Kopec in 1982 at the University of Edinburgh. It is the most popular test I know of for grading engine strength; I have used python-chess/bratko_kopec.py to test Multi-Variant Stockfish patches (checking for bugs). How does it work?

  1. Create a corpus (data set) of problems with correct solutions.
  2. For each problem, measure whether the solver (player or engine) can find the best move within the time limit. For each correct answer, score 1 point.
    • The test as explained on Dr. Kopec's site awards fractional credit if the best move appears among the player's top four preferred moves (ranked by preference).
    • Another form of this test (for engines) awards fractional credit for answers found after the time limit.
    • Chess software, magazines, books, etc. sometimes award partial credit for second-best answers. Strictly speaking, that's a different, subjective test; but strong titled players have written or approved tests of this form, so you may find them useful despite the subjectivity.
  3. Compare results with solvers whose ratings are approximated through other means.
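The scoring in step 2 can be sketched as follows. The fractional weights (1, 1/2, 1/3, 1/4 for the best move appearing as first through fourth preference) are the commonly cited Bratko-Kopec values; verify them against Dr. Kopec's site before relying on them:

```python
def bk_score(ranked_moves, best_move):
    """Score one problem: 1 point if best_move is the solver's first choice;
    1/2, 1/3, or 1/4 if it is the second, third, or fourth choice; else 0."""
    for rank, move in enumerate(ranked_moves[:4], start=1):
        if move == best_move:
            return 1.0 / rank
    return 0.0

def bk_total(problems):
    """Total score over a corpus. problems: iterable of
    (ranked_moves, best_move) pairs."""
    return sum(bk_score(ranked, best) for ranked, best in problems)
```

A solver that always finds the best move outright scores one point per problem; the total is then compared against solvers of known strength (step 3).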

Since this test is so popular, many Extended Position Description (EPD)-formatted corpuses of varying difficulty are readily available. If those don't satisfy your needs (for example, if you are writing an engine), you can select positions from Lichess games in Forsyth-Edwards Notation (FEN) format and easily write new EPD test files!
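To get a feel for the format, here is a minimal hand-rolled EPD reader. This is an illustrative sketch only; real corpuses have quirks, and python-chess parses EPD properly, so prefer it in practice:

```python
def parse_epd(line):
    """Split an EPD line into its four FEN-like position fields plus an
    opcode dict (e.g. "bm" = best move, "id" = problem identifier)."""
    fields = line.split(None, 4)
    position = " ".join(fields[:4])
    ops = {}
    if len(fields) == 5:
        for op in fields[4].split(";"):
            op = op.strip()
            if op:
                key, _, value = op.partition(" ")
                ops[key] = value.strip().strip('"')
    return position, ops

# The first problem of the Bratko-Kopec corpus:
epd = '1k1r4/pp1b1R2/3q2pp/4p3/2B5/4Q3/PPP2B2/2K5 b - - bm Qd1+; id "BK.01";'
position, ops = parse_epd(epd)
```

Here `ops["bm"]` yields the expected best move to compare against the solver's answer.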

Astoundingly, the other performance tests I discovered seem not to differ substantively from the two above! Have I overlooked something? (No, I don't care about "CAPS" due to its secrecy.)

In my mind, chess players exhibit distinct performance levels in openings, middlegames, and endgames. See SimplerEval's chess insights for a rather extreme example. Outside of:

... I cannot find any high-dimensional player model. I even found 2014mchidamb/AdversarialChess: Style transfer for chess. (github.com), which attempts to emulate human play... and perhaps that's what I am interested in, although it doesn't mention player strength or game phases? I don't know... I've tried it before without being impressed, but maybe I need to run it longer?
xkcd: Machine Learning
At Chess Game Dataset (Lichess) | Kaggle, I loosely defined a test to measure which openings improve player performance. Still, there must be some way of gauging whether a player understands an opening, pawn structures, etc., since playing "Sicilian Defense: Old Sicilian" isn't going to improve everyone's rating!

Photo credit: ThisisEngineering RAEng


Footnote: "Glicko"

According to Chess rating systems • lichess.org, "Lichess.org uses the Glicko 2 system." This claim is invalid according to Prof. Glickman at Boston University (not that other chess sites do any better):

To apply the [Glicko-2] rating algorithm, we treat a collection of games within a “rating period” to have occurred simultaneously. Players would have ratings, RD’s, and volatilities at the beginning of the rating period, game outcomes would be observed, and then updated ratings, RD’s and volatilities would be computed at the end of the rating period (which would then be used as the pre-period information for the subsequent rating period). The Glicko-2 system works best when the number of games in a rating period is moderate to large, say an average of at least 10-15 games per player in a rating period. The length of time for a rating period is at the discretion of the administrator.
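The key point in Glickman's paragraph is that updates happen once per rating period, not after every game. A minimal sketch of that batching, with the Glicko-2 math itself deliberately left out (the tuple layout and period length are assumptions for illustration):

```python
from collections import defaultdict

def group_by_period(games, period_length):
    """Group (timestamp, white, black, result) tuples into rating periods.
    Within a period, games are treated as simultaneous; a real system would
    run the Glicko-2 update once per player at each period boundary, using
    every game in that period at once."""
    periods = defaultdict(list)
    for timestamp, white, black, result in games:
        periods[timestamp // period_length].append((white, black, result))
    return [periods[key] for key in sorted(periods)]
```

Contrast this with updating a rating immediately after every game, which is what most chess sites actually do while still calling the result "Glicko-2".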

OGS has a new Glicko-2 based rating system! - (online-go.com) explains possibly the only valid Glicko-based implementation on any online gaming site: since games during a rating period are treated as being played concurrently, the (re-)calculation at the end of each rating period may retroactively change a player's live rating.
EDIT: as of 2021, per Rating and rank adjustments - Announcements (online-go.com), OGS discovered typos in their rating code (which at one time was correct) and opted to ditch the 10-15 game window rather than reimplement Glicko/Glicko-2 properly. I haven't played on OGS this past year, and given my disappointment I'd rather not return.