Post-Mortem of our Longest Downtime

To the people who say it is disrespectful to the user: I'd rather play on lichess and have occasional outages, which didn't happen a lot in the first place. It's free and open source, plus people invest their time voluntarily, apart from their private and professional lives, to make everything work. Not to mention that they are working behind the scenes to make everything more stable.

If you play on chess.com and you pay for a membership, then of course I would expect them to be reliable at all times. I play on lichess as I don't want to support bad commercial companies, and this includes some of the biggest cloud providers.
If lichess were a website where availability at all times was absolutely critical, for example one that human or animal life or important business transactions depended on, then I would understand that OVH wouldn't be the right choice as a cloud provider. But as much as I love chess, this outage didn't bother me at all, and it wouldn't have even if I had been in the midst of a chess game. It's a chess game after all.
Nobody is doubting that other providers are more advanced from a technical point of view, but that's not what is important to me. You can still go play on chess.com if you don't like it. People who appreciate this project will stay.

Physical infra failures do happen. The blog doesn't give more details on whether there was a single point of failure, or whether resiliency or fault tolerance could be introduced for the failed component, at least in theory. That would be a good learning experience.

Sorry if this has already been suggested... admittedly, I have not read through the entire thread. Have we considered moving to a cloud platform like AWS? I'm a cloud solutions, technical, and integration architect, and I build stuff in AWS all day, every day. The reliability is well-known and published.
And you can also architect and build for high availability and fault tolerance.

I'm in Ireland (GMT+1) and I'd be happy to help / discuss if anyone @ lichess is interested in talking it through.

@mphillips256 said in #83:

Sorry if this has already been suggested... admittedly, I have not read through the entire thread. Have we considered moving to a cloud platform like AWS?

I'm also a cloud guy... I moved my startup's entire infra from physical to AWS exactly because we had too many hardware and networking failures and couldn't justify staffing up to support our (small) footprint.

However, moving to the cloud wasn't just "drag and drop." We had to rewrite swaths of code to get our application using cloud technology efficiently enough to justify the move economically. For example, most of our application now runs on spot instances and can handle interruptions. It was actually rather painful at the time (but ultimately worth it).

That said, AWS is just too expensive for many firms, especially ones without steady income streams. Maybe there are AWS credits for non-profits?

@noahlz said in #84:

That said, AWS is just too expensive for many firms, especially ones without steady income streams. Maybe there are AWS credits for non-profits?

Many comments mention "budget"... the world's largest corporations didn't become profitable by being charitable.

Lichess does publish https://lichess.org/costs ; Chess.com annual revenue is a different order of magnitude.

@dboing said in #78:

Which cloud now? Is this not just displacing the risk. Maybe that cloud service could be a contract promise to keep that VM going when their hardware gets broke, they would have the redundancies for it. This is just about software

I prefer Microsoft Azure. Amazon AWS is good too. There are dozens of others. But forget the "cloud" for a moment. Just stand up a hypervisor in your closet and run this on a VM. The days of hardware dependence are behind us. You should never say, "well my hardware broke, and we had to wait for them to replace it..." What? Just migrate or restart the VM on different hardware. VMware has a technology called High Availability that will do that for you. The longest outage you'd experience is the time it takes to reboot the VM (about 10 seconds). Getting back to the cloud, that just gives you more convenience. You're paying someone else to maintain your data center infrastructure. You don't have to do that; you can maintain your servers on-prem. But if you're in the business of, say, writing chess server software, and not managing data centers, then it makes sense to outsource that to the pros.

I am glad that Lichess is and will remain fully free.

To all those who advocate for AWS, Azure or other "Cloud" solutions, those are very expensive because they are distributed real physical servers, in the sense that instead of a single server in a single data centre, you are paying for your website to be mirrored real-time, hosted and synced, across multiple real, physical data centres on real physical hardware at each point. The heat generated by these huge data centres and their use of local fresh-water supply for additional cooling leaves a huge environmental footprint.

Cloud (read properly as massively re-mirrored, redistributed onto multiple physical hardware data centres, with complex syncing mechanisms) solutions are very expensive for a reason, and not without a significant environmental impact for those who are concerned about such matters.

@robertl30 said in #86:

You don't have to do that, you can maintain your servers on prem.

brb, I'll file an issue to run Lichess.org from my basement. I'm sure my home internet (which only went down twice in the last week) will be fast enough!

@gpa150chess said in #81:

I'd rather play on lichess and have occasional outages which didn't happen a lot in the first place

Before making such a decision, you should at least ask for the prices of both servers. This is a purely emotional decision on your part, and it's not based on logic. Lichess is a realtime application and it needs high-availability servers. Maybe you enjoy it when your game is interrupted and you lose, but others don't.

Of course I know it's a realtime application; that was not my argument here. I never said that it is enjoyable for other people, but honestly, this is a chess server. No lives depend on it. Your precious rating points don't matter in the grand scheme of things. Just get your priorities straight and take chess less seriously.
They are working on it behind the scenes. Also, just look at how many outages lichess has had up until now. You are pretending like it happens on a regular basis.
I would be able to understand you if this were a constant problem, but it isn't. If you want a 100% reliable website, go to chess.com. You are not forced to play here.
It's not like we are dealing with a banking application that handles a bunch of transactions each day, things that actually matter, not some random chess games on the internet. But that's your opinion. You can have different values and not appreciate open source, or you can appreciate the efforts and trust lichess to work on the issues they have. I'm sure they will come up with good solutions for any problems that may arise.
