A graph of tournament pairings on Lichess for the last 24 hours

Our recent server issues....

A quick update on what happened yesterday

We would like to apologise for the disruption to the Agadmator CakeDeFi tournament, which occurred on 21st December 2021. We know how frustrating the disruption was for the players, and how disappointing it was for the wider chess community.

Despite our best efforts to persevere, the scale of the event revealed a technical limitation that will require further optimisation in the longer term. We are grateful that this issue was exposed so that we can address it, but regret that it came to light through such a public failure. We have had larger events in the past, but with fewer players actively playing at the same time.

These issues can and do happen to any website, of any scale or size. We are grateful to our entire team for their fast actions and professionalism during the tournament. We’re also grateful to the entire chess community, and especially those who share our vision and support us. You’re all awesome.

It's not over

We’re currently assessing the work we need to do, and when we can have it in place. Once we know more, we will give a further update, but we have confirmed we can resume the tournament with the last unaffected standings and corresponding remaining tournament time in the near future. This is of course dependent on agadmator’s wishes, and the scheduling of top level events occurring over the board.

Love from,
The Lichess Team

Frequently Asked Questions:

Shouldn’t you have known what your infrastructure can support and put limits in place?

We’ve historically had tournaments with 30,000+ players in them with no issues. Whilst we had some concerns about the initial pairing, we were cautiously optimistic that after that the tournament would continue without issue.

The issue which caused the tournament to fail was completely new, and unknown to us before the tournament. Had we foreseen the weak point, we certainly would have shored it up, or put reasonable limits in place. Even simulating events at this scale is not easy. Unfortunately, despite how talented our dev team and sysadmins are, they’re not precogs (not just yet).

Shouldn’t you have restricted entry in the tournament to xyz rating restriction / group of players?

The aim was to have an inclusive open tournament. We did not want to limit the enjoyment people could get from participating to just a few. We had no serious reason to think the tournament could not succeed as an open event.

Is it unfair to have two tournaments, separated by a period of time before resuming?

If it’s good enough for the Candidates Tournament 2020/21, it’s good enough for us. In all seriousness, we believe this is the best course of action currently available to us.

Have you thought of splitting the prize in half, half for this event and half for a second event?

We understand the event organisers are considering various approaches. However, the current plan is to resume the tournament in the near future with the last unaffected standings, and with the corresponding remaining time left for the tournament.

Is there anything I can do to help?

Lichess is run by the chess community, for the chess community. There are various ways to contribute directly or support our work.

Will more money help fix the problem?

Being purely funded by donations, all gifts and contributions are valuable and welcome. However, money won’t fix this particular problem. Websites of all sizes run into problems from time to time; you may remember when multi-billion dollar companies were struggling with their infrastructure not so long ago. That said, if you happen to be a developer who’s great with Scala, and you want to hang out with other great Scala developers, feel free to visit us on our Discord server.

Can you elaborate on the technical issue?

Long after we got past the rough start, we encountered the following main issue.
There are many events affecting a tournament state, such as creating new pairings, applying results of finished games, updating the leaderboard, accepting new players, etc.

These events must be processed one at a time, for a given tournament, to avoid race conditions. So the events are put in a queue. Players push to the queue by joining and playing, and Lichess pulls from the queue by computing pairings and the leaderboard.
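
To make the idea concrete, here is a minimal sketch of a per-tournament queue with a single consumer. It is written in Python for illustration and is not the Lichess code, which is in Scala. The names TournamentSequencer, submit, and run_demo are hypothetical. Several producer threads push events, and one worker applies them one at a time, so the shared state needs no further locking.

```python
import queue
import threading


class TournamentSequencer:
    """Hypothetical stand-in for one tournament's queue: one worker, one event at a time."""

    def __init__(self):
        self._events = queue.Queue()
        self.errors = []
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, event):
        """Called by producers (players joining, games finishing) from any thread."""
        self._events.put(event)

    def _run(self):
        while True:
            event = self._events.get()
            try:
                event()  # mutates tournament state; no other event runs concurrently
            except Exception as exc:
                self.errors.append(exc)  # keep the worker alive if one event fails
            finally:
                self._events.task_done()

    def wait_until_idle(self):
        """Block until every submitted event has been processed."""
        self._events.join()


def run_demo(producers=4, events_each=1000):
    state = {"results_applied": 0}
    sequencer = TournamentSequencer()

    def apply_result():
        # Read-modify-write: safe here only because events are serialized.
        state["results_applied"] += 1

    def producer():
        for _ in range(events_each):
            sequencer.submit(apply_result)

    threads = [threading.Thread(target=producer) for _ in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    sequencer.wait_until_idle()
    return state["results_applied"]
```

Calling run_demo() pushes 4,000 events from four threads and returns the number applied. The trade-off is visible in the structure: consistency comes from having exactly one consumer, so the consumer's speed caps the throughput of the whole tournament.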

During this tournament, for the first time, players were inputting events faster than Lichess could process them. The queue grew larger, causing delays in pairings. Eventually there was no way of keeping up with the queue, and we had to cancel the tournament.
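
The arithmetic behind a growing queue can be sketched in a few lines of Python. The rates below are made-up numbers for illustration, not measurements from the tournament, and the function names are hypothetical.

```python
def backlog_after(arrivals_per_sec, processed_per_sec, seconds):
    """Events still waiting after `seconds`, with constant arrival and processing rates."""
    backlog = 0
    for _ in range(seconds):
        backlog += arrivals_per_sec
        backlog -= min(backlog, processed_per_sec)  # cannot process more than is queued
    return backlog


def estimated_wait_seconds(backlog, processed_per_sec):
    """Rough time a newly queued event waits behind the existing backlog."""
    return backlog / processed_per_sec


# Illustrative rates only: 120 events/s arriving, 100 events/s processed.
overloaded = backlog_after(120, 100, 600)   # 12000 events waiting after 10 minutes
wait = estimated_wait_seconds(overloaded, 100)  # 120.0 seconds behind

# With arrivals below capacity, the queue never builds up.
healthy = backlog_after(80, 100, 600)       # 0
```

Once arrivals exceed capacity, the backlog grows steadily for as long as the overload lasts, and the delay felt by players grows with it. Even a modest shortfall therefore becomes unmanageable over the course of a long event.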

The queue is defined in TournamentApi.scala (permalink, note that we already made various changes on the master branch). The actions going through the queue are wrapped in a Sequencing block in that file.

Lichess itself was not overloaded, and the rest of the site was working perfectly and without delay. The problem is that the queue of a single tournament cannot be parallelized, because it must ensure that events are processed one by one for consistency. Therefore, even though Lichess had resources to spare, it could not keep up with handling the queue events one by one. If the same number of players had been spread over multiple tournaments, there would have been multiple queues in parallel, and everything would have worked smoothly. We’re now looking at optimizations that will make the queue empty faster, and possibly allow some events to be processed outside the queue, to parallelize them.
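
The difference between one queue and several can be illustrated with a small tick-based simulation in Python. This is a sketch under simplifying assumptions: the event rate and per-queue capacity are invented numbers, events are spread evenly across tournaments, and each queue is assumed to drain at full speed regardless of how many others exist. The function name remaining_backlog is hypothetical.

```python
from collections import deque


def remaining_backlog(events_per_tick, capacity_per_queue, tournament_ids, ticks):
    """Return the number of unprocessed events per tournament after `ticks` ticks."""
    queues = {tid: deque() for tid in tournament_ids}
    n = len(tournament_ids)
    for tick in range(ticks):
        # Producers: spread this tick's events evenly over the tournaments.
        for i in range(events_per_tick):
            queues[tournament_ids[i % n]].append((tick, i))
        # Consumers: each tournament drains its own queue, one event at a time,
        # independently of the others.
        for q in queues.values():
            for _ in range(min(capacity_per_queue, len(q))):
                q.popleft()
    return {tid: len(q) for tid, q in queues.items()}


# Same total load, one tournament versus two (illustrative numbers only).
single = remaining_backlog(120, 100, ["arena"], 10)
split = remaining_backlog(120, 100, ["arena-a", "arena-b"], 10)
```

Here single ends with 200 events still waiting in its one queue, while split ends with both queues empty, because each queue only has to absorb half of the load.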

Lichess is a charity and entirely free/libre open source software.
All operating costs, development, and content are funded solely by user donations.