
Post-Mortem of our Longest Downtime

I'm no longer able to play lichess on my mobile browser.
Every time I start a new game, the board won't load.

Could the crash be the reason?
Sounds like this problem wouldn't have happened if manta was running in the cloud. In that case, a simple VM migration would have been done quickly and transparently for you. Why is this not the case? Surely lichess can afford one cloud server.
Thanks for your quick actions and for the explanation. You're the best. Stuff like this happens...
Not using a proper cloud provider (GCP, AWS, or even Azure) is not serious and is a bad use of donated money.
We hosted the website of a big German car manufacturer on OVH, and the only advice I can give you is:
run, as fast as possible, there are better alternatives out there.
It was not only the burned-down datacenter in France, but that was certainly one of the reasons we finally left.
@TwoBishopsOnePawn said in #26:
> Not using a proper cloud provider (GCP, AWS, or even Azure) is not serious and is a bad use of donated money.

IMHO, using GCP, AWS, or Azure is exactly what you described as a "bad use of donated money".
These providers are extremely expensive, especially for larger sites with existing sysadmin teams.
@andreas83 said in #28:
> IMHO, using GCP, AWS, or Azure is exactly what you described as a "bad use of donated money".
> These providers are extremely expensive, especially for larger sites with existing sysadmin teams.
Agreed. People who recommend these providers have only a superficial understanding of how things are run.
Another thing I was wondering about, though I'm not sure whether it played a role for thibault and his team as well: OVH is a European/French company. Considering lichess' general attitude and values, I believe it was definitely a factor in choosing OVH over the likes of Google, Amazon, etc.
Thank you so much for all your effort to create and support this amazing website.
Special thanks for the report; it helps.

One question, out of curiosity, just so we're on the same page. The very first step in setting up a fail-safe solution is to duplicate/mirror the infrastructure across independent data centers. At the end of the day, this data center could have burnt to the ground.

Is the lack of such a mirror a matter of budget and available support staff, or is there another reason this wasn't done in the first place?

This is not a critique; I am just trying to learn the reasoning and priorities here. Of course, I agree that site support and the development of new features have a higher priority than 100% availability.