Analysis of Lichess' cheating detection with Machine Learning (ML) - a misuse of ML that doesn't work

TLDR/Abstract
-------------
Lichess tries to combat cheating at online chess using two machine-learning-based components: Irwin & Kaladin. As a researcher and machine learning (ML) expert, I have taken these mechanisms apart by studying their source code and reverse-engineering them to assess their effectiveness. I do not play on Lichess (or other sites), so do not expect this post to be about chess. Both of these mechanisms are fundamentally flawed and do not even remotely achieve the standards that such systems should reach. The rate of false positives (players identified as cheaters who are likely not cheating) is about 1-2%, and due to conceptual issues with the learning & model-building approach, this is getting worse. Supplementing the cheating algorithms with user-submitted cheating reports likely results in even more false positives. This post provides details on these aspects and shows some of the results I have obtained. While the system catches obvious cheaters, it very likely punishes a lot of non-cheating players as well.

About myself
------------
Let's get this out of the way: I do not play chess, I barely know how the pieces move. My wife plays, but on "that other site". I am, however, a senior principal Machine Learning Specialist at Amazon Web Services. I have worked on machine learning (ML) projects in various industries for more than 15 years in a professional environment. I have also published research papers on ML, hold a PhD in Computer Science based on a thesis in ML - the use of neural networks in particular, for classification problems (see below) - and I give lectures on ML. So in a nutshell, everyone here is probably better at chess than me, but having worked on some of the toughest ML problems, I know much better how ML works. I am writing this post because a.) it is worrying that ML is misused here and b.) I wanted to understand the approach used on Lichess for this ML challenge. It was a semi-professional pet project of mine that lasted more than 9 months.

NOTE: I am going to try to explain this in the most non-technical way possible, so for the (few) ML experts in this forum: forgive me some inaccuracies that I make in the following to spare the average reader details that would just confuse matters here.
I have also decided to keep some details hidden to protect the platform from actual cheaters, as this post is not about making cheating easier. I will mention this explicitly where it is the case.

Cheat Detection at Lichess (50k ft view)
----------------------------------------
Irwin (source code: github.com/clarkerubber/irwin) is the tool used by Lichess moderators to compute a "cheater score" that (allegedly) determines how likely it is that someone is cheating. Irwin works mostly from game reports; Kaladin uses the Insights generated for a player. In a nutshell, both of these services have an ML model into which inputs such as the moves of game(s), move times, etc. are passed, and the neural network then returns a value between 0 and 1 depending on the likelihood of player X being a cheater in a set of games. For instance, in Irwin, the corresponding code is this (from github.com/clarkerubber/irwin/blob/master/modules/irwin/BasicGameModel.py):

> def predict(self, playerId: PlayerID, games: List[Game]) -> Opt[int]:
>     tensors = [game.tensor(playerId) for game in games]
>     return [None if t is None else int(100*self.model.predict([np.array([t[0]]),np.array([t[1]])])[0][0]) for t in tensors]

The result of the neural network is then multiplied by 100 to provide a nice score between 0 and 100, i.e., the prediction indicates, for a set of games, whether the player has (allegedly) cheated. Since the multiplication by 100 is an implementation detail, I will instead refer to the result of the neural network as the "cheater score" or CS, which lies in the interval [0;1]. Note that CS is not the accuracy value or anything you get shown in the game analysis on Lichess. The score threshold for a cheater to be banned is not published, but one can estimate it by using the published games (see: database.lichess.org/) and then checking which accounts got banned/closed in the meantime using the API lichess.org/api/user/{username} (documented here: lichess.org/api#tag/Users/operation/apiUser) - more details below. One can then compute the average/median/min/max CS for the games of those cheaters to re-engineer this threshold pretty accurately. I have done this, but out of respect for this platform, I am not going to explicitly state the value here and will instead refer to it as the "cheater threshold" CT. This means Irwin/Kaladin computes a cheater score CS and compares it to a cheater threshold CT (see the section "What's in the data" below). If CS > CT for a player, the player is assumed to have cheated (even if there are (allegedly) some moderation steps to confirm this; note: I do not believe this happens due to the sheer numbers, but Lichess at least claims they do it for every case). The key to this whole process is therefore the machine learning models used by Irwin/Kaladin to compute CS.
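For illustration, here is a minimal Python sketch of the threshold re-engineering described above. It assumes you already have per-game CS values from a locally trained model; `my_model_scores` and `estimate_threshold_stats` are hypothetical names of mine, not part of Irwin. The only Lichess-specific pieces are the documented user endpoint and its tosViolation field.

```python
# Hedged sketch only: summarise the cheater scores CS of players whom the
# public user API reports as flagged, to approximate the threshold CT.
# `my_model_scores` is a hypothetical {username: [CS per game]} mapping.
import statistics
import requests

def is_flagged(username: str) -> bool:
    # The user endpoint and the tosViolation field are documented in the Lichess API.
    r = requests.get(f"https://lichess.org/api/user/{username}")
    r.raise_for_status()
    return bool(r.json().get("tosViolation", False))

def estimate_threshold_stats(my_model_scores: dict) -> dict:
    flagged_cs = [cs for user, scores in my_model_scores.items()
                  if is_flagged(user) for cs in scores]
    return {
        "min": min(flagged_cs),
        "median": statistics.median(flagged_cs),
        "mean": statistics.mean(flagged_cs),
        "max": max(flagged_cs),
    }
```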

Mini-Intro to Machine Learning (ML) for Cheater Detection (on Lichess)
---------------------------------------------------------
ML has different types of problems, and the one relevant here is called "classification": in a nutshell, you are trying to assign data sets to classes. Cheater detection is binary classification, as it basically just has to determine "cheater vs non-cheater", or more precisely "game where cheating by player X occurred vs game where player X did not cheat". Given a data set (e.g., a set of moves representing game(s), times between moves & other "telemetry"...), ML models will (typically) return a probability from [0;1] that the data set belongs to a game with cheating or not, i.e., the cheater score CS for a single game. Please note the term "probability" here: the math underpinning ML is based on probabilities, and even a probability of 0.99 does not equate to certainty.
Application logic then determines how high this probability has to be for the player to be considered a cheater, and Irwin/Kaladin also considers several games to make that determination, i.e., a single game with a high cheater score is insufficient to get banned. While it is also possible to reverse-engineer how many games Lichess considers, I am not going to reveal that value here to protect the platform.
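To make this concrete, here is a tiny sketch of the kind of decision logic described above. This is my own illustration, not Lichess code; both the aggregation rule and the `min_games` value are placeholders, since I am deliberately not revealing the real numbers.

```python
# Illustration only - not Irwin/Kaladin code. cheater_threshold (CT) and
# min_games are placeholders, not the values Lichess actually uses.
from typing import List

def is_considered_cheater(game_scores: List[float],
                          cheater_threshold: float,
                          min_games: int = 5) -> bool:
    """Flag a player if enough of their games score above the threshold CT."""
    suspicious_games = [cs for cs in game_scores if cs > cheater_threshold]
    return len(suspicious_games) >= min_games
```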

How does the ML know how to identify cheater games? When games get played on Lichess, the platform records the moves and additional telemetry/metadata such as move times (and other data). Let's summarize all this data as vectors X = [X1, X2, ...] where each element Xi is one game with all data recorded. When cheaters get banned, Lichess marks the accounts with a flag "tosViolation" (see API definition: lichess.org/api#tag/Users/operation/apiUser). For the ML model, each vector Xi is then associated with a label Yi stating whether the game was played by a cheater (tosViolation for cheating) or not* (no tosViolation for cheating). The underlying math function of the model is then fed tuples (Xi, Yi), and the optimizer (in this case the so-called Adam optimizer) changes the parameters of the model in such a way that the resulting function is optimized to "guess"/infer the labels Yi when some Xi is passed in. This also means that the data from the cheaters detected by Irwin is directly fed into the next iteration of the model.
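As a toy illustration of this training step (not Irwin's actual code - the feature shape, model class and sizes below are placeholders of mine), this is roughly what "feeding tuples (Xi, Yi) to an Adam-optimized model" looks like:

```python
# Toy sketch: fit a small neural classifier on (Xi, Yi) tuples with the Adam
# optimizer. All shapes and hyperparameters are illustrative placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(1000, 120)       # Xi: flattened per-game features (moves, move times, ...) - made-up shape
y = np.random.randint(0, 2, 1000)   # Yi: 1 = account later flagged with tosViolation, 0 = not

clf = MLPClassifier(hidden_layer_sizes=(64,), solver="adam", max_iter=300)
clf.fit(X, y)

cheater_scores = clf.predict_proba(X[:5])[:, 1]  # CS values in [0;1] for five example games
print(cheater_scores)
```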

This is already a fundamental flaw of the approach: the model is trained with labels that are largely based on its own decisions from the past (there may be some manual labeling by moderators, but given the number of games, it is simply not possible for humans to check a large share of these decisions, let alone for there to be independent decision making that monitors Irwin's decisions). As a result, the model will increasingly repeat the same errors it made in the past. To make a simple analogy: if you trained a model to distinguish between pictures of cats & non-cats, but labeled some birds as cats, the model will also determine some birds to be cats, i.e., make errors. If you then feed the results of the previous version of the model into the training of the next (with some additional birds identified as cats), this effect increases over time, i.e., you will have more and more birds identified as cats by the model. This is the same trap that my employer Amazon, along with many others, fell into with CVs (www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G). This could be a contributing factor in why Lichess (and other sites) report detecting "more and more cheaters" every month (aside from there simply being more cheaters). In the end, any cheat detection using supervised ML needs to rely on labeled data, and the only real way to obtain it is to get "confessions". While there may be a (small) portion of banned players that make such a confession, I would guess this is a minority (I do not have data on this). Confessions from players who cheated but did not get banned obviously do not happen. Hence, the ground-truth data is biased and incomplete. Lichess would have to provide more data on "banned players vs banned players that confessed", as well as other metrics, to actually refute this position.
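To show why this feedback loop matters, here is a small toy simulation with entirely made-up rates (my own construction, not based on Lichess data) of what happens when each model generation is trained on the previous generation's output:

```python
# Toy simulation of label feedback, with assumed numbers: 1% real cheaters,
# and each model generation adds a small new labelling error on top of the
# labels it inherited from the previous generation.
import numpy as np

rng = np.random.default_rng(42)
n_players = 100_000
true_cheater = rng.random(n_players) < 0.01               # assumed 1% actual cheaters

labels = true_cheater | (rng.random(n_players) < 0.002)   # initial labels incl. some mistakes

for generation in range(5):
    # "Model" that reproduces its training labels and adds a small new error rate
    predictions = labels | (rng.random(n_players) < 0.002)
    labels = predictions                                   # next generation trains on its own output
    false_positive_rate = np.mean(predictions & ~true_cheater)
    print(f"generation {generation}: false-positive rate {false_positive_rate:.3%}")
```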

*) This is a simplification, as it would actually be "White cheated" vs "Black cheated". This is handled by creating two data sets for each game - one from White's perspective and one from Black's - so that a data set Xi basically means "game from the perspective of the player with Black/White".

Model Deep Dive
---------------
The model used by Irwin/Kaladin is based on a neural network. More specifically, it is a Convolutional Neural Network (typically used for image recognition tasks) combined with a "Long Short-Term Memory" (LSTM) network (often used in Natural Language Processing). I am not going to argue with this choice or structure here, even though it seems - to be diplomatic - odd. Apparently, someone with a very basic understanding of neural networks at some point determined that the model was overfitting (i.e., the model was basically memorizing the cheater moves/properties instead of properly generalizing). The result was that someone added a so-called "Dropout" layer to the code (see line 59 in github.com/clarkerubber/irwin/blob/master/modules/irwin/BasicGameModel.py). This is pretty bad: there is a lot of research on this, and just dropping some Dropout somewhere into the model is arbitrary and likely does not improve it. Here is one of the many papers on this: http://mipal.snu.ac.kr/images/1/16/Dropout_ACCV2016.pdf (not written by me). The general structure of the model is therefore dubious.
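To give non-experts an idea of what such an architecture looks like, here is a heavily simplified Keras sketch of this model family (a convolutional layer feeding an LSTM, with a Dropout layer in between). The layer sizes, dropout rate and input shape are placeholders of mine, not Irwin's actual values.

```python
# Simplified illustration of a CNN + Dropout + LSTM classifier as described
# above. Not Irwin's real architecture - all sizes and rates are made up.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(32, kernel_size=3, activation="relu",
                  input_shape=(60, 10)),   # e.g. 60 moves x 10 features per move (assumed shape)
    layers.Dropout(0.3),                   # the kind of Dropout layer criticized above
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"), # output: cheater score CS in [0;1]
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```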
Summary: The structure of the model and the way it is built are not even remotely "state of the art". As shown below, this is likely a contributing factor in why the results are nowhere near the accuracy required for such systems (for comparison, FIDE requires 99.998% accuracy for statistical models). This machine learning model is orders of magnitude away from that standard (see below).

What's in the data (and how to get cheater statistics from Lichess)
-------------------------------------------------------------------
Let's start with how to get the data on cheating: the games played on Lichess are available for download every month as PGNs, and they include move times. That's simple. Now, the flags for cheaters aren't in these games. Here is what I did to get them: I upload the files into an Amazon S3 bucket and extract the games, which I then feed into a queue (SQS). The entries in the queue are processed by separate Lambda functions that use the player API (see link above) to retrieve each player's profile data (a minimal sketch of such a Lambda handler follows the list below). If you do this in the middle of the month, even the cheaters from the last few games of the previous month are already banned (there are very few cases that fall outside this rule, i.e., where a ban came weeks or months after the last played game). Additionally, you can augment the data with engine evaluations, i.e., add the computer eval for every move (getting Stockfish to run in a Lambda function was actually the most technically challenging part - but I digress...). This way, you can obtain training data to build your ML model, as you basically have the tuples (Xi, Yi) needed for training, with the exception of browser telemetry (see below):
- the Xi tuples are the moves of the game (contained in the downloadable files) along with some telemetry data like move times (and evaluations)
- the Yi labels are the tosViolation flag that you can get from the player API using the web-scraping approach with Lambda functions described above.
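As referenced above, here is a minimal sketch of such a Lambda handler. The bucket name, the one-username-per-message convention and the output format are assumptions of mine; only the standard AWS Lambda/SQS event shape and the documented Lichess user endpoint are taken as given.

```python
# Hedged sketch of the labelling step: read usernames from SQS messages, call
# the public player API, and store the tosViolation flag. "my-lichess-labels"
# is a placeholder bucket name, not the author's actual setup.
import json
import boto3
import requests

s3 = boto3.client("s3")
RESULT_BUCKET = "my-lichess-labels"  # placeholder

def handler(event, context):
    for record in event["Records"]:                 # one SQS message per player (assumed)
        username = record["body"].strip()
        resp = requests.get(f"https://lichess.org/api/user/{username}")
        if resp.status_code != 200:
            continue                                # account unavailable or rate-limited; skip in this sketch
        profile = resp.json()
        label = {
            "username": username,
            "tosViolation": profile.get("tosViolation", False),
        }
        s3.put_object(Bucket=RESULT_BUCKET,
                      Key=f"labels/{username}.json",
                      Body=json.dumps(label))
```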
I will not reveal how much data you need to get an accurate model, how to feed it into the model, or any of the data engineering (e.g., augmenting moves with the actual moves suggested by Stockfish), as this would likely compromise Lichess. But I was able to locally build my own model to use with Irwin/Kaladin and verify its functionality by running it on a couple of months of data to detect the same "cheaters" that were marked on Lichess, i.e., I made sure that my model behaved similarly to the one used on Lichess by comparing who was banned with whom "my model" identified as players with "cheater score" > "cheater threshold". Obviously, the accuracy of the model depends on the cheater threshold CT you actually use to mark the cheaters, so I computed the accuracy for different cheater threshold values CT; this way you can determine the CT value quite accurately (I tested this against data that wasn't used during model training, as is recommended for machine learning).
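A compact sketch of that threshold sweep, under my own naming (one aggregated CS per player and the set of accounts Lichess actually banned); the real CT value and game counts stay unstated:

```python
# Illustration of sweeping candidate CT values and measuring agreement with
# the observed bans on held-out data. Data structures are my own placeholders.
import numpy as np

def agreement(per_player_cs: dict, banned: set, ct: float) -> float:
    """Fraction of players where 'CS > ct' matches whether Lichess banned them."""
    hits = sum((cs > ct) == (user in banned) for user, cs in per_player_cs.items())
    return hits / len(per_player_cs)

def best_threshold(per_player_cs: dict, banned: set) -> float:
    candidates = np.linspace(0.5, 0.999, 100)
    return float(max(candidates, key=lambda ct: agreement(per_player_cs, banned, ct)))
```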

The last iteration of my model had a 99.1% overlap, i.e., out of 100 players banned by Lichess (in August 2022), the reverse-engineered model would also ban the same 99 players and not ban any additional ones. Thus, I would argue the behavior of the model I reverse-engineered is largely similar to the one Lichess is using, particularly since the vast majority of games do not involve cheaters (less than 1% of games have cheaters).
Ideally, if this model worked perfectly and you drew the returned results as a histogram, you would get a shape with two clusters: the vast majority of non-cheating players would have a cheater score of 0, and the few that cheat would have CS = 1 (or 100). This is NOT the case for Irwin/Kaladin, which is partially due to the flaws mentioned above (re-training on model output, probabilities...). The rather obvious cheaters are detected ("obvious" meaning people with 75+ games won in a row or 50 moves like Stockfish), so that seems good. But there are a lot of players/games that get flagged with surprisingly low CS scores, i.e., where the model does not actually provide a clear/conclusive answer. When you then manually dive into players that have lower CS scores but still got flagged (and there are a lot of them), you find cases where the win percentage is not even extraordinarily high (some even below 50%), the accuracy from game analysis is not abnormal, and the move times are all over the place (not consistent, i.e., indicating a human who thinks at various points in the game rather than a player who plays all moves within a couple of seconds).
Applying a legal standard here, I would argue that these are (at least partially) false positives, even if the CT value is set conservatively very close to 1; given the data on these cases, it would be extremely hard to prove cheating in a way that would meet legal standards. I consider these to be false positives. Using this approach, the result is that at least about 1-2% of the bans are false positives. For this kind of ML system, which can have detrimental effects on human beings, such an error rate is generally unacceptable. Given the number of total bans per year, you can basically state that there are thousands of false-positive bans every year on Lichess.
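For readers who want to reproduce the histogram check mentioned above, this is the kind of plot I mean (again with my own hypothetical data structures: one CS per player plus the set of banned accounts):

```python
# Sketch: plot the distribution of cheater scores, split by whether the
# account was actually banned. A well-separated model would show two clusters
# near 0 and 1. Data structures are placeholders of mine.
import matplotlib.pyplot as plt

def plot_cs_histogram(per_player_cs: dict, banned: set):
    banned_cs = [cs for user, cs in per_player_cs.items() if user in banned]
    clean_cs = [cs for user, cs in per_player_cs.items() if user not in banned]
    plt.hist(clean_cs, bins=50, alpha=0.6, label="not banned")
    plt.hist(banned_cs, bins=50, alpha=0.6, label="banned (tosViolation)")
    plt.xlabel("cheater score CS")
    plt.ylabel("number of players")
    plt.legend()
    plt.show()
```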

Browser Telemetry
-----------------
There are videos about Lichess using browser telemetry as part of their cheating detection mechanism (source: www.youtube.com/watch?v=LZgyVadkgmI&t=940s). Browser telemetry is data obtained during the game that, for example, records when people switch to another tab or move focus away from the browser window. The underlying allegation is that people do this "toggling" of tabs/windows to enter the position into an engine and get the recommended move back. The way this works is that the browser typically has an API where websites executing JavaScript, like Lichess, can register for events (such as users toggling away from a window / the window losing focus). First and foremost, I would assume that any user concerned about their privacy has these APIs restricted or blocked, preventing the mechanism from producing usable data at all (for Firefox users, there is even a separate extension that does this). Non-obvious cheaters will know this and prevent it. Second, when chess-playing friends as well as my wife reviewed this work, they told me that professional players & chess streamers often move their mouse "off the board" after every move - particularly in tight games - to avoid any kind of "mouse slip". For instance, I was shown a YouTube video of Hikaru Nakamura where he did exactly that and commented during a game on why he does it.

Hence, using this kind of data seems dubious from a semantic perspective. But more importantly, given the results above, the browser telemetry does not seem to have a huge influence on model performance: I did not have this data, and my model still behaves in a similar way. Using it seems to be some sort of "gut-feeling voodoo" that makes people feel better about the ML model because some intricate data is used that sounds like magic. It's not.

What about Chess.com?
---------------------
Lichess publishes its code and game data. When it comes to cheating, that's it. Chess.com publishes numbers on cheaters (it reads more like bragging: "look how many cheaters we catch"), but absolutely nothing on their method. Are they better? No. First and foremost, Chess.com does not publish anything on their anti-cheating mechanisms, and this lack of transparency makes them obviously even worse than Lichess. For all that is known, getting banned on Chess.com could depend on a random number generator. Lichess is a lot better here thanks to its open-source policy, but still leaves a lot to be desired. In general, both sites make allegations pointing to their technical cheating detection mechanisms, but they do not provide any proof or details. The 72-page report on cheating recently released by Chess.com also gives only very vague descriptions of the inputs to their mechanism, but nothing more. They claim they will go to court (www.chess.com/article/view/online-chess-cheating), but let's face it, nobody will even get a lawyer, even if they have a $100 diamond membership. They will just make a new account. Like on Lichess.

A staggering number in the above article on Chess.com is that 77% of account closures happen without a human in the loop, i.e., automated. Given that Lichess is the second-largest website for chess according to publicly available data and the number of games played, the corresponding number is probably in a similar range for Lichess - potentially even worse, given that Lichess is a non-profit that likely has a tighter budget than Chess.com with their membership options. That's downright frightening given the flaws of these methods.

Summary
-------
Lichess uses an ML-based approach to detect cheating, and the code for this mechanism is publicly available. Based on the data Lichess publishes about the games on the platform and the publicly available APIs, one can reverse-engineer this approach; the accuracy of what the model produces is subpar at best, with a false-positive rate somewhere between 1 and 2% every month. This rate is due to fundamental flaws in the machine learning procedure used by Lichess, but also to design flaws in the model code. Given that actual humans are affected and such bans can have real-life implications for those flagged by the system, such a false-positive rate is unacceptable by any published ML standard.
> Given that Lichess is the second largest website for chess according to publicly available data and the number of games played, the corresponding number is probably in a similar range for Lichess, potentially even worse given that Lichess is a non-profit that likely has a tighter budget than Chess.com with their membership options.

I think you're wrong with that already.
@Cedur216 said in #2:
> I think you're wrong with that already.

You think? I provided a source and reasoning for why I would argue that Lichess is in the same range of automated account closures. Do you have a source for your thinking, along with adequate reasoning, or are you just disagreeing because you don't like it?

The fact that Lichess does not provide such numbers is a lack of transparency at best and a telltale sign of what is really going on at worst. Given the scale of this platform, my statement stands that the vast majority of closures happen without a human in the loop looking at the details. Chess.com has the same approach.
how exactly did you link a source to something that's not publicly disclosed?
you said it's a personal guess yourself and used the words "probably" and "potentially".

given that you can always appeal and trigger a human review in case you are one of the false positives that you think are occurring, you cut lichess very little slack while making pretty steep assumptions yourself.

edit: i think it's fine to use an exaggeration or two if you are indeed concerned about systematic flaws. one wonders why you chose this method of delivery tho. in a private email / discord session you could've shared ALL your methods and pointed towards the lichess methods that you don't want to discuss publicly. isn't that what you want to do to make use of your insights - point out improvement recommendations, etc. - rather than talk about cats and birds?
@IrwinCaladinResearch

www.youtube.com/watch?v=bmIFdrUVHXw

go to 31:00 and forward

at 32:30 there's the spot where Chris says that "only the dumbest and most obvious cheaters are caught automatically, the low hanging fruit if you will, and then from there the case is moved on to human beings". So you don't need to argue about automated cases.

Most bans come from reports, and every report is reviewed by humans. It doesn't just take machine learning or knowledge of statistics; it also takes an understanding of chess.

also talking about chesscom, there was the (former) NSF president confessing that he used assistance from a stronger player at chesscom. at that time chesscom was already highly suspicious of him but didn't ban him, because the chance of a false positive was 0.1%, at which rate they'd false-ban a lot of high-profile players every month and have serious issues. Besides, the likelihood of a false positive is not the only thing; the likelihood of a false negative matters too.
You don't play chess and barely know how the pieces move, yet you win your first 3 games, crushing a 2100 in bullet?

That's ''amazing''
I never thought that Irwin tried to model a cheater. I just assumed the attempt was to create a model of non-cheating behaviour relative to a person's strength. Like estimating the probability of a 1500 player finding, in six different games, a best move that requires at least 6-ply calculation. Obviously supervised learning is kinda hard if you don't validate samples for outliers.
Hands up who hasn't got a clue what's going on here beyond the fact it's attempting to show that cheat detection is flawed... I think?

You sound like a pretty clever dude tbh but my question is this:-

Are you the guy I complain to when my next day Amazon delivery gets lost?
@ninehundredsixty said in #6:
> You don't play chess, barely know how the pieces move, yet you win your first 3 games, crushing 2100 in bullet ?
>
> That's ''amazing''

Natural born chess talent.

Buuuuut, his 1900 opponent leaves the game in a winning position and the 2100 blunders in the opening on move 5.
So I would take the "glass is half full" approach, @IrwinCaladinResearch - the Lichess stuff is all open source. If you can convince the folks contributing that you have a better solution and you're willing to share it, I think that would be great. Please do - better cheat detection helps us all! So thank you for your thoughts and research on this!

Model drift is going to be a problem for sure if it's "eating its own classifications", but I would hope the human-labelled data is heavily used for training purposes, which might help.
