Post-mortem of 2021-01-16 and 2021-02-01 outages

All times are UTC, in 24-hour format.

On the 16th of January 2021, at 05:13, monitoring detected a very high number of errors and triggered an alert.
The on-call admin noticed the MongoDB primary and some secondaries had crashed, and restarted them all in succession. When the primary server crashed, a secondary noticed and became the primary. That new primary immediately crashed too. Then another secondary noticed and became primary, and it crashed as well. The database cluster finally recovered when the admin restarted the servers one by one. The lichess app recovered by itself, and normal service resumed at 05:23 according to monitoring and players.

We filed a bug with MongoDB (the database software we use) and hoped it would be fixed before the next freak coincidence led to this crash. We were wrong: the conditions for the crash are reproducible. The MongoDB team managed to reproduce the crash and pointed us in the direction of the likely culprit: an undocumented setting that increases the max depth of BSON documents, which we rely on to store studies. We made a mental note to fix this later and went on our merry way.
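
To illustrate why document depth matters here, below is a minimal sketch that measures how deeply a document nests. It is not Lichess code: the `nesting_depth` and `build_chain` helpers and the move/children layout are hypothetical, chosen only to show how a tree stored as nested documents grows deeper with every move. The value 100 is the default nesting limit that MongoDB documents for BSON documents.

```python
# Hypothetical illustration, not the actual Lichess study storage.
MAX_DEPTH = 100  # MongoDB's documented default nesting limit for BSON documents


def nesting_depth(value):
    """Return the nesting depth: scalars are 0, each dict or list level adds 1."""
    # Iterative traversal, so a very deep document cannot hit Python's recursion limit.
    deepest = 0
    stack = [(value, 0)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            deepest = max(deepest, depth)
            continue
        deepest = max(deepest, depth + 1)
        stack.extend((child, depth + 1) for child in children)
    return deepest


def build_chain(moves):
    """Build a chain where each move nests the next move inside its children list."""
    node = None
    for move in reversed(moves):
        node = {"move": move, "children": [node] if node is not None else []}
    return node if node is not None else {}


if __name__ == "__main__":
    line = build_chain(["m"] * 60)  # a single 60-move line
    depth = nesting_depth(line)
    print(depth, "levels; over the default limit:", depth > MAX_DEPTH)
```

In this assumed layout each move adds two levels (one document plus one list), so a long line of moves passes the default limit quickly. That is the kind of data that needs either a raised depth limit or a flatter storage scheme.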

However, the same incident happened on the 1st of February, this time with several rounds of "let's all crash one by one" between 19:22 and 20:14, until we disabled studies. The lead dev then immediately started work on an improved (i.e. non-crashy) way to store studies, which was completed 1.5 days later. Right after the release of the new study code and the migration of the old studies to the new storage, we noticed a bug in the migration and had no choice but to re-run it, thereby erasing any changes made to studies during the 3-hour window it took us to notice the migration bug.

The issue on MongoDB's tracking tool: jira.mongodb.org/browse/SERVER-53857

We are very sorry about the impact this has had on all our users and players.
Thanks for the transparency and staying on top of this issue :)
@nojock you better be no joke because pretending to be nojoke is no joke.

Yay Lichess is fixed.
Wow I'm amazed to see this level of accountability from Lichess. Another reason why it is the best chess site in the world 👍👍👍
Sounds like a journal entry in a Resident Evil video game.
I wish I could write like that.
I guess this kind of god-like control over what to do about things going wrong is one of the things that attract people to computers or machines in general.
I love how the conversation went from bugs in lichess to saving the world.

This topic has been archived and can no longer be replied to.