Lichess on scala3 - help needed
Update: Friday 09/12
Excellent news: all is good and lichess is performing great with scala3.
The fix we applied on Tuesday worked. It's been 3 days now and performance is stable.
Here's a new snapshot of the relevant JVM monitoring, where you can see that the CPU usage is excellent, and the JIT pools are being put in use.
It was all about the code cache
The JIT (Just-In-Time) compiler, a.k.a. the HotSpot compiler, is how the JVM profiles and optimizes our code at runtime. It was running out of memory to store the optimized code - the code cache. All we needed to do was tune a couple of values. The difficulty was knowing about them, since the default values work for most programs.
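The code cache can actually be watched from inside the JVM, via the standard java.lang.management API. Here's a minimal Scala sketch; it is illustrative rather than lila's actual monitoring, and the "CodeHeap" pool names assume a HotSpot JVM with the segmented code cache (the default since JDK 9):

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

object CodeCachePools {
  def main(args: Array[String]): Unit = {
    // On HotSpot with the segmented code cache, pool names look like
    // "CodeHeap 'profiled nmethods'" (an assumption about the JVM in use)
    val pools = ManagementFactory.getMemoryPoolMXBeans.asScala
      .filter(_.getName.startsWith("CodeHeap"))
    pools.foreach { pool =>
      val usage  = pool.getUsage
      val usedMb = usage.getUsed / (1024 * 1024)
      val maxMb  = if (usage.getMax < 0) -1L else usage.getMax / (1024 * 1024)
      println(s"${pool.getName}: $usedMb MiB used, max $maxMb MiB")
    }
  }
}
```

If a pool like 'non-profiled nmethods' sits at its maximum, the JIT can no longer store optimized code, which is consistent with the symptoms described above.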
Community to the rescue
Fortunately, immediately after this post was published, we received help from many talented people who quickly figured out what the problem was, and how to solve it.
It has been a very humbling and wholesome experience, and I'm forever grateful to everyone who came to our help.
- The JVM is complex and has a lot of relatively obscure settings. When the program behaves in a weirdly suboptimal way, JVM tuning should be considered. Our current prod config:
-XX:ProfiledCodeHeapSize=500m -XX:NonProfiledCodeHeapSize=500m -XX:NonNMethodCodeHeapSize=24m -XX:ReservedCodeCacheSize=1024m
- Ask for help, and help will come. I wish I had done it sooner and spared myself some very stressful days. How could I forget that lichess is a community project? That is our strength.
- Scala3 is awesome and I'll publish a blog post about it soon.
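As an aside, the effect of such flags can be read back from a running JVM. A sketch using HotSpot's diagnostic MXBean (a HotSpot-only API; the flag names are the ones from our config and are assumed to exist on JDK 17):

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

object CheckCodeCacheFlags {
  def main(args: Array[String]): Unit = {
    val diag = ManagementFactory.getPlatformMXBean(classOf[HotSpotDiagnosticMXBean])
    Seq("ReservedCodeCacheSize", "ProfiledCodeHeapSize",
        "NonProfiledCodeHeapSize", "NonNMethodCodeHeapSize")
      .foreach { name =>
        val opt = diag.getVMOption(name)
        // the origin tells you whether the value came from the command line
        // or is a JVM default
        println(s"$name = ${opt.getValue} bytes (origin: ${opt.getOrigin})")
      }
  }
}
```

This is handy as a deploy-time sanity check that the tuning actually took effect.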
End of story, thank you all :)
Update: Wednesday 07/12
It's 7am UTC and so far, so good. CPU usage is nominal (maybe even slightly better than in lila2 times).
Let's not claim victory too quickly, as the issue can take more time to fully express itself. Here's a fresh snapshot of the most relevant monitoring data.
In 12 hours from now, we'll be past today's peak time. If by then nothing wrong has happened, we will know for sure that the JVM tuning fixed lila3. Then we'll keep watching, to make sure that longer runtimes are also possible.
1pm UTC: Things are looking very promising.
4pm UTC: Everything still great. Looks like we did it :)
Update: Tuesday 06/12
What an overwhelming community response! Thank you so much to everyone who took the time to analyse the situation and give precious advice, on Discord, Hacker News, and by email. It feels great not to face this problem alone.
lila3 and scala3 back in production
Based on the advice we received, we decided to deploy the lila3 branch again this morning!
With new JVM tuning arguments:
-XX:+PrintCodeCache -XX:NonProfiledCodeHeapSize=512m -XX:NonNMethodCodeHeapSize=32m -XX:ReservedCodeCacheSize=2g -XX:InitialCodeCacheSize=512m -XX:CodeCacheExpansionSize=2m
These tune the JVM code cache and give it more space for Just-In-Time optimizations. Indeed, running out of JIT space would be a good explanation for the behaviours we're seeing. I therefore have good hope that today's deploy will provide us with valuable insights. I'll be monitoring production and keeping you all informed.
And if it doesn't work, we have more routes to explore, thanks to you all.
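One lightweight signal to watch between deploys is cumulative JIT compilation time, exposed by the standard CompilationMXBean. A sketch (illustrative, not lila's actual tooling):

```scala
import java.lang.management.ManagementFactory

object JitCompilationWatch {
  def main(args: Array[String]): Unit = {
    val jit = ManagementFactory.getCompilationMXBean
    if (jit != null && jit.isCompilationTimeMonitoringSupported) {
      val before = jit.getTotalCompilationTime
      Thread.sleep(500) // sample interval; in production this would run periodically
      val after = jit.getTotalCompilationTime
      println(s"${jit.getName}: +${after - before} ms of JIT compilation in 500 ms")
    }
  }
}
```

A warmed-up JVM should compile less and less over time; a code cache that keeps flushing and recompiling can show up as compilation time that never settles.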
New monitoring and data
- New non-G1 pools panel to better see code cache pools, on the JVM thread/pool dashboard.
- So far -XX:+PrintCodeCache is not outputting anything; we'll consider trying
- New output of jstat -gc -h 10 $(pgrep java) 1000 from today, ~9:30AM. I'll post another one at peak time (~4pm)
- Production JVM flags (output of -XX:+PrintFlagsFinal with tuning arguments set)
- We can't publish memory dumps as that would not be very GDPR compliant :)
Initial blog post, Monday 05/12
Since its recent rewrite to scala3, Lichess is using way more CPU than usual, but the problem takes a while to kick in and has peculiar patterns.
- 30-days monitoring for context
- peak time monitoring for granularity (4pm-7pm)
- Thread dumps at peak time Raw: 1 2 3 4 5
- JVM GC stats dumped every 500ms, at peak time
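For reference, thread dumps like the ones linked above can also be captured in-process with the standard ThreadMXBean (just one possible way; the linked dumps may have been taken with external tooling):

```scala
import java.lang.management.ManagementFactory

object ThreadDump {
  def main(args: Array[String]): Unit = {
    // dumpAllThreads(lockedMonitors = true, lockedSynchronizers = true)
    val infos = ManagementFactory.getThreadMXBean.dumpAllThreads(true, true)
    // ThreadInfo.toString includes the thread name, state and a truncated stack
    infos.take(3).foreach(info => println(info))
  }
}
```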
The core of Lichess is a monolith written in scala.
It serves thousands of pages per second, plays thousands of chess moves per second,
while doing a lot of other things. It's a big program running on a big server with lots of CPU and memory.
Until now, lila was written in scala 2.13.10, and it was running well and performing well. We'll call it lila2.
About a month ago, I started migrating it to scala 3.2.1 - we'll call it lila3. It was fun and I'll write about it... after these problems are solved.
On 2022-11-22 at 7AM UTC, I deployed lila3 to production, and it worked and performed quite well.
During the next days, things went smoothly as I deployed a new version every morning.
But when I let it run for more than 24h, the CPU usage started increasing dramatically. It became obvious that lila3 could not handle more than 48h of uptime.
In comparison, lila2 could easily stay up for 2 consecutive weeks.
lila2 CPU usage, each spike is 24h. Vertical red lines are deploys/restarts.
lila3 CPU usage, each spike is 24h. CPU usage increases on the second day of runtime.
A closer look at CPU usage
The CPU usage is not regular, even during the worst times. Instead, it stays normal for a minute or two, then goes crazy for a minute or two, then back to normal.
Close-up look. Around 20% is normal CPU usage. We see it spike up to 80% for extended periods - that is abnormal.
Memory pressure / GC problem?
I don't think the garbage collector is to blame. During the worst times, when the JVM almost maxed out 90 CPUs, the GC was only using 3s of CPU time per minute.
Allocations and memory usage also look rather normal.
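For context, the GC CPU time figure above is the kind of number the standard GarbageCollectorMXBeans can give you. A sketch of such a measurement (illustrative; not necessarily how our dashboards compute it):

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

object GcTimeSampler {
  // Cumulative GC time across all collectors; getCollectionTime returns -1
  // when the collector doesn't report it, so clamp to 0
  def gcMillis: Long =
    ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(b => math.max(b.getCollectionTime, 0L))
      .sum

  def main(args: Array[String]): Unit = {
    val before = gcMillis
    Thread.sleep(2000) // sample window; the figure quoted above used a one-minute window
    println(s"GC time over the window: ${gcMillis - before} ms")
  }
}
```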
What about exceptions or log messages?
None that are relevant. We have extensive logging in lila, and no new errors are popping up. No timeouts either.
I had to revert to the scala2 version of Lichess. This is very unfortunate, as the gap between lila2 and lila3 is huge, and every day we spend with lila2 in production makes it bigger.
At this point I don't know how to diagnose and fix the problem with lila3.
If you know about large JVM deployments and have an idea about what could be causing this, please let me know.
You can use the Lichess programming discord channel or our email address.
Please focus on the problem at hand and its quick resolution. I'm not looking for long-term architecture discussions right now, we'll do these after production is healthy again.
I will update this post with more data and info, as people ask for them.
For monitoring data, thread dumps and GC logs, see the top of this post.
lila2 and lila3 are compiled with java 17, and the prod server runs the same JVM:
# java -version
openjdk version "17" 2021-09-14
OpenJDK Runtime Environment (build 17+35-2724)
OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
# uname -a
Linux manta 5.4.0-107-generic #121-Ubuntu SMP Thu Mar 24 16:04:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
JVM args are the same as before:
-Xms30g -Xmx30g -XX:+UseG1GC
lila2 runs scala 2.13.10, lila3 runs scala 3.2.1.
SBT dependencies can be found in the lila github repo