
Flaws in PGN-Spy and T% Analysis

Chess engine, Software Development
Cheat detection -- sometimes the old methods are bad

Introduction

T% analysis is what most people picture when they think of cheat detection. It measures how correlated a player's moves are with those of a chess engine in terms of Multi PV. For those who are unfamiliar, Multi PV tells the engine to report its top N candidate lines instead of just one; with a Multi PV of 3, Lichess's analysis board displays the engine's three best moves for the position.
T% analysis is probably the oldest and most basic idea in cheat detection, and it rests on a few assumptions: (1) cheaters seek to augment their game, (2) chess engines are accessible and offer unrivaled playing strength, and (3) cheaters using an engine will, on aggregate, show a statistically significantly larger proportion of top engine moves than their baseline skill level would produce. Computing these proportions of top engine moves, named T1 (matches the 1st best move), T2 (matches the 1st or 2nd best move), and so on, was done manually on ICC and FICS back in the day. Since then much has been automated: the PGN-Spy and ChessReanalysis programs can run through a PGN file and analyze every game in it, which has sped things up considerably. This automation helped spread the popularity of T% analysis because it is easy to use, yet many have blatantly ignored its flaws and made accusations which are at best delusional. So, let's get into some of the problems with T% analysis...
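To make the T-value idea concrete, here is a minimal sketch of this kind of counting, using the python-chess library with a local Stockfish binary. The engine path, depth, and ply cutoff are illustrative assumptions, not PGN-Spy's actual settings, and for brevity it scores both players' moves rather than a single suspect's.

```python
# Minimal T% counter: how often did the played move appear among the
# engine's top 1, 2, or 3 choices? (A sketch, not PGN-Spy's method.)
import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "stockfish"  # assumption: a Stockfish binary on PATH
DEPTH = 18                 # results only hold for this engine at this depth
SKIP_PLIES = 16            # skip the first 8 moves (opening theory)

def t_percentages(pgn_path: str) -> dict:
    counts = {1: 0, 2: 0, 3: 0}
    total = 0
    with open(pgn_path) as f, chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
        while (game := chess.pgn.read_game(f)) is not None:
            board = game.board()
            for ply, move in enumerate(game.mainline_moves()):
                if ply >= SKIP_PLIES:
                    # Multi PV = 3 returns one info dict per candidate line.
                    infos = engine.analyse(board, chess.engine.Limit(depth=DEPTH), multipv=3)
                    top = [info["pv"][0] for info in infos]
                    total += 1
                    for n in (1, 2, 3):
                        if move in top[:n]:
                            counts[n] += 1
                board.push(move)
    return {f"T{n}": counts[n] / max(total, 1) for n in (1, 2, 3)}

print(t_percentages("games.pgn"))  # e.g. {'T1': 0.42, 'T2': 0.61, 'T3': 0.72}
```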

Flaws

The first and foremost flaw of T% analysis is that many people are unaware of just how significant the overlap between human and engine play is, and the Maia team has quantified some of this. Let's take a look at one of their graphs:


[Figure: rate at which Stockfish's moves (at various depths) match human moves, by rating range (McIlroy-Young et al., 2020)]

This graph shows the overlap between Stockfish at various depths and players across rating ranges; the overlap appears to increase roughly linearly, and the graph only goes up to 1900. McIlroy-Young et al. (2020) reported that Stockfish matched human moves 33-41% of the time, matching 1900-rated players about 5% more often than 1100-rated players. Even with good benchmarks of skill, it would take uncomfortably large sample sizes to overcome an intrinsic 8% spread; a cheater needs to be quite blatant for you to catch them in only a handful of games. However, the cases we actually debate rarely involve players in this rating range. More often these users are 2400+, so if we assume the linear trend continues, we basically have to expect overlap between Stockfish and humans closer to the high 40s, or even 50%. Various programs have attempted to reduce this problem by excluding forced moves and winning endgames, but the problem never completely goes away. There is only so much sample engineering that can happen before T% results lose credibility.
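To see why the sample sizes get uncomfortable, here is a back-of-the-envelope power calculation for a standard two-proportion z-test; the 45% baseline and 53% "subtle cheater" match rates are purely illustrative assumptions.

```python
# How many analyzed moves are needed to reliably separate a baseline T1
# rate from a modestly elevated one? Standard two-proportion approximation.
from math import ceil
from statistics import NormalDist

def moves_needed(p_base: float, p_cheat: float,
                 alpha: float = 0.05, power: float = 0.80) -> int:
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_b = NormalDist().inv_cdf(power)          # desired statistical power
    variance = p_base * (1 - p_base) + p_cheat * (1 - p_cheat)
    return ceil((z_a + z_b) ** 2 * variance / (p_base - p_cheat) ** 2)

# Illustrative: an honest 45% match rate vs. a subtle cheater at 53%.
print(moves_needed(0.45, 0.53))  # ~600 moves, roughly 20 games of non-book moves
```

Hundreds of scored moves per group is already many games' worth, and a smaller boost in match rate pushes the requirement into the thousands.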

A second flaw of T% analysis, which is often overlooked, is that its results are only valid for a specific engine at a specific depth. As you can see in the figure above, depth influences the correlation with humans; different depths play different moves, which in turn changes T% results. An even bigger problem arises when the cheater uses an entirely different engine, to say nothing of the cheater's processing power and machine-specific settings. This can, and does, lead to bogus results; the sketch below shows how easily depth alone shifts the "best move".
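As a quick illustration of depth dependence, this sketch (same python-chess and Stockfish assumptions as before) counts the positions in a single game where the engine's top choice at a shallow depth differs from its top choice at a deeper one.

```python
# Depth dependence: how often does Stockfish's "best move" at depth 8
# differ from its "best move" at depth 20 in the same positions?
import chess
import chess.engine
import chess.pgn

with open("games.pgn") as f, chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    game = chess.pgn.read_game(f)
    board = game.board()
    total = disagreements = 0
    for move in game.mainline_moves():
        shallow = engine.analyse(board, chess.engine.Limit(depth=8))["pv"][0]
        deep = engine.analyse(board, chess.engine.Limit(depth=20))["pv"][0]
        disagreements += shallow != deep  # True counts as 1
        total += 1
        board.push(move)
    print(f"{disagreements}/{total} positions where depth 8 and depth 20 disagree")
```

With that in mind, let's look at some examples.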



[Table: Stockfish T% results for sunfish_rs, penguingim1, MiniHuman, and Konevlad]

Note: these results exclude the first 8 moves of each game due to openings and forced moves

First, we have sunfish_rs, a BOT account on Lichess running the Sunfish chess engine in Rust. Then we have penguingim1, GM Andrew Tang, whom we all know and love and who is obviously legitimate. Then we have MiniHuman, another BOT account on Lichess running a modified Leela. Lastly, we have Konevlad, GM Vladislav Artemiev, another legitimate player. Yet with no other information, going solely off of T% analysis, you would say sunfish_rs was by far the "most legitimate", and while the modified Leela is more consistent than Konevlad, Konevlad has more moves matching Stockfish's top choices. Knowing what we know about these accounts, it is obvious this is not true. This leads us to something many seem to have dismissed as just a disclaimer to avoid liability and nothing else...

"This software is intended as a tool to help detect cheating in chess by providing information on engine correlation and other statistics. Information obtained using this software should not be taken as evidence of engine use without proper statistical analysis involving comparison to appropriate benchmarks and consideration of other relevant evidence.” (MGleason1, recovered January 4, 2022)

While the author of the software clearly states that one cannot simply accuse based on these results, many people still decide whether or not someone is cheating on these numbers alone. Cheat detection is not restricted to analyzing move quality. This should have been obvious, but common sense has been sacrificed in the pursuit of convenience, i.e. "my computer-generated numbers can ban you". Even then, you have to do your move quality analysis correctly, which requires at least an ounce of scientific acumen.

However, as more engines have been created, the flaws of PGN-Spy and T% analysis as a whole have been exposed, and they go much deeper than simply not using appropriate benchmarks. The entire philosophy of T% analysis cheat detection is flawed; it strives to measure quality of play, but its results are compared to those of a fallible agent, namely the chess engine of choice. While this philosophy worked in the days of old when there weren’t many engines available to the public, it is flawed to the point that it is nearly useless against cheaters today.

An interesting fact further demonstrates this problem: the CCRL 40/40 rating list at http://ccrl.chessdom.com/ccrl/4040/ lists 78 engines rated 3000 or higher, and 269 engines rated at least 2400.

While T% analysis will show a degree of overlap between some of these engines, it is unreasonable to run enough of them to catch even a cheater using a weak engine to evade T% analysis, as running engine analysis over and over on the same games is extremely heavy on CPU power (a rough estimate follows below). This gargantuan computational expense would, at best, catch a very specific kind of cheater: one who copies an unorthodox engine. And if someone is copying Stockfish, one would hope you wouldn't need T% analysis to figure that out. It should be noted, though, that cheaters using these tactics can for the most part be detected with relative ease via other methods, so don't be stupid and try cheating thinking you can evade detection just because you read this.
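For a sense of scale, here is a crude cost estimate; apart from the engine count taken from the CCRL list above, every number in it is an illustrative assumption.

```python
# Rough CPU cost of brute-forcing T% across many candidate engines.
ENGINES = 78          # CCRL engines rated 3000+
GAMES = 20            # games in the suspect's sample (assumption)
MOVES_PER_GAME = 30   # non-book moves analyzed per game (assumption)
SECONDS_PER_MOVE = 5  # per-position analysis time at a useful depth (assumption)

total_hours = ENGINES * GAMES * MOVES_PER_GAME * SECONDS_PER_MOVE / 3600
print(f"~{total_hours:.0f} CPU-hours per suspect")  # ~65 CPU-hours
```

And that is for a single suspect against the strongest engines only; extending to the full 2400+ list multiplies the bill several times over.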

References

  1. McIlroy-Young, R., Sen, S., Kleinberg, J., & Anderson, A. (2020, August). Aligning Superhuman AI with Human Behavior: Chess as a Model System. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1677-1687).
  2. MGleason1. PGN-Spy [computer software]. https://github.com/MGleason1/PGN-Spy (retrieved January 4, 2022).