Analysis of Lichess' cheating detection with Machine Learning (ML) - a mis-use of ML & doesn't work

I don't want to sidetrack things as I find the discussion interesting, but players need to be far more careful of their conduct whilst playing - especially those who have never played OTB - rather than worrying about whether an opponent is cheating.

IrwinCaladinResearch edited

#63

Thanks for the various responses and I would ask everyone replying here to do 3 things:
1. This is about data and providing data for what you believe is right. If you just say "Lichess knows what they are doing", that is your opinion, but has no actual value for this discussion. The same is true for "Lichess cheating detection doesn't work".
2. There is no point in coming on this thread to discuss your individual "cheating story".
3. Insulting others because they do not share your position is also not helpful.

I will try to summarize the situation as I see it using some of the statements made by Lichess moderators in this thread (see #15):

STATEMENT 1:
@izzie26 said in #15:
> Alright, let's talk about your post. Unsurprisingly, we strongly disagree with your central claims that our systems are "fundamentally flawed" and that "it is very likely that [Lichess] punishes a lot of non-cheating players as well" - especially if the latter refers to decisions taken about accounts.

To be direct: This is a mischaracterization of what I am writing. My reverse-engineered model that has an overlap of 99.1% with your bans in August 2022 indicates that about 1-2% of these bans happen with a cheater score that is rather low. This is consistent with other months, e.g., July or September were in the same range. Out of respect for this platform, I did not publish the actual deviation and the exact value of the cheating threshold, but Lichess is obviously free to do that. A manual look into some of these relatively low cases at least indicate that these are likely false positives and the argument that the data is inconclusive with regards to their cheating.

To use a legal approach here: Lichess is the accuser in these cases and would need to prove without a reasonable doubt that each banned player is actually a cheater.

STATEMENT 2:
@izzie26 said in #15:
> More generally, we think your claims are rather strongly stated given the limited details that you provide to support them, and even those details need to be probed further. For example, just from what you've posted, we'd have concerns about the inferences you have drawn from the available data, the assumptions and logic behind your 'false positive' estimate, and whether your analysis has fully accounted for our ML systems' primary role to inform a multifactorial decision process.

I have outlined how I came to the conclusion while trying to avoid providing data points that could undermine your efforts to prevent cheating. Let me try again: If CT is your cheating threshold (see OP), then you can use a very small epsilon e to define an interval just above CT, i.e., [CT; CT+e], and then check the players that fall into this interval. This shows that many of these players do no exhibit a behavior or have stats like the ones that have higher cheater scores. Your statement that about a "multifactorial decision process" is without any substance, so it is hard to argue with this. Without a documentation of this process, I will stand by my point that there are 1-2% of the bans per month like these where the data is inconclusive.

I also looked at the players in the interval [CT-e; CT+e] and when you cluster the data, with the cheaters and non-cheaters from this interval, you actually do not get any combination of features/attributes that indicate any kind of correlation with the label. That means for the players with cheater scores close to CT, this is rather arbitrary.

Publishing your process will increase transparency but not reveal anything that is of value for cheaters as this happens AFTER the cheater has been detected by the models.

> In addition, your characterisation of the feedback loops in our models is completely off the mark, because we have taken proactive measures to avoid exactly what you describe. You really ought to give us a bit more credit! We know we're dealing with applied ML here, where "ground truth" data never truly reflects actual ground truth, and perfect labels are the exception, not the norm.

Again: Happy to be educated how you do that as one (key) input for your cheating moderation is the output of these ML models. Revealing how you avoid feedback loops and prevent output of the model to be (part of) training data used for the next iteration of the model will not help cheaters, so feel free to provide substance and I will retract my point. Doing this in a transparent way will actually help you.

@Molurus said in #31:
> I firmly believe that methods of catching cheaters and statistics of catching cheaters should never be discussed publicly. Anything like that is just counter productive and doesn't help the process and could effectively just help cheaters cheat.
>
> In the end.. it's up to individual users if they feel comfortable with a server and how it fights cheating. If you don't trust Lichess' methods, go somewhere else.

This is actually a sad statement. Statistics on cheaters that have been caught do not help cheaters. When Lichess bans a player, they are making a public statement about a real human being. How would you feel if someone just made such a statement about you publicly? You naturally and rightfully would ask for proof. While Lichess claims they have such proof, they seem to provide little to no such proof, even in private conversations. Furthermore, the mechanisms how they obtain what they call proof is highly dubious as my opening post and work of the last 9 months shows.

A mechanism that accuses people of cheating cannot have 1-2% error rates.

odoaker2015

#64

@IrwinCaladinResearch
"A mechanism that accuses people of cheating cannot have 1-2% error rates."

What error rate do you think it should be?

Zorro3000

#65

While for different reasons - maybe I'm too touchy ?? - I don't like LiChess, and while I'm in no way a professional in law, I would find it devastating for Lichess if they would discuss ANY of the inner working of the cheating topic in public: Cheating is not only about the casual "guy taking an opponents pawn from the board or moving a white fielded bishop to a black square when the latter is not watching", but sadly also about a few criminals: When money is involved the treshhold to that group is crossed. Those things could easily land in front of a court - and then anything one party said before will be used against it.

Whitedancingrockstar

#66

Interesting read, and some interesting comments in the thread (with some other ones less so).
I am not in any way technical enough to comment on most of what this topic covers, but one thing struck me as odd.

As #13 already wrote in their last paragraph, it is possible that some of your false positives were in fact detected as cheaters due to other signals than Irwin/Kaladin.
I have some understanding of how cheat detection works on Lichess (moderators can confirm this), and can, without going into details, safely say that Lichess uses many different types of methods to catch cheaters, apart from Irwin & Kaladin.
Many of these systems work well on their own (naturally even better together) and do not necessarily need input from all other systems in place to determine whether somebody has cheated. This is likely what is happening with many of your detected false positives.

Lichess detection has its flaws, but I remain skeptical on whether your approach truly addresses them.

FM thijscom

#67

To follow up on WDR's points, I think it's important to elaborate a bit more on the data, and the 99.1% model accuracy and false positive rate you claim.

From what I understand, you trained your model on the public games database only. Is that correct?

This means that your model does include game data and move times, but no information about telemetrics, or other mod info such as IP/print matches with past cheat accounts. Moreover, not all games are engine-analyzed via the fishnet-framework, and sometimes mods run extra (local) game analysis software to check for other suspicious factors. So as an outsider it is very hard to check what is truly a false positive, since ban decisions are sometimes made on data we as outsiders do not possess.

Anyway, I wholeheartedly agree that the lack of transparency is an issue (with people like Molurus suggesting we should not discuss this in public), and that it is problematic that we do not have any true labels to train models on. I believe we need a shift towards a more open collaboration to make real progress.

glbert

edited

#68

@Whitedancingrockstar said in #66:
> Many of these systems work well on their own (naturally even better together) and do not necessarily need input from all other systems in place to determine whether somebody has cheated. This is likely what is happening with many of your detected false positives.

it's much simpler. op's numbers are pulled out of their ... hat. it's well known that lichess tos marks include not only engine users, but also sandbaggers, boosters, account sharers and so on. but somehow their supposedly reverse engineered system can replicate 99.1 percent of tos marks? yeah, right.

odoaker2015

edited

#69

@thijscom said in #67:
> To follow up on WDR's points, I think it's important to elaborate a bit more on the data, and the 99.1% model accuracy and false positive rate you claim.
>
> From what I understand, you trained your model on the public games database only. Is that correct?
>
> This means that your model does include game data and move times, but no information about telemetrics, or other mod info such as IP/print matches with past cheat accounts. Moreover, not all games are engine-analyzed via the fishnet-framework, and sometimes mods run extra (local) game analysis software to check for other suspicious factors. So as an outsider it is very hard to check what is truly a false positive, since ban decisions are sometimes made on data we as outsiders do not possess.
>
> Anyway, I wholeheartedly agree that the lack of transparency is an issue (with people like Molurus suggesting we should not discuss this in public), and that it is problematic that we do not have any true labels to train models on. I believe we need a shift towards a more open collaboration to make real progress.

An outsider does not possess and/or has access to any data that can establish who is or is not a false positive. That's one of the criticisms.

dreadpresence

#70

Some good points are being made, with the only remedy being releasing the more details of your process. Please consider doing so. You’ve brought up transparency as an issue with lichess, but I think a little more transparency from you would be appreciated as well. I think the 99% banrate is highly doubtful for the same reasons others have highlighted, and I’m curious to see how you achieved such a result.

This topic has been archived and can no longer be replied to.