Lichess on scala3 - help needed

I reviewed the data summaries you've posted at the top of this blog and can only point out one obvious thing I noted.

There is a strong and consistent 24-hour cycle in the baseline scala-2 performance data sets (you've referred to them as 'context'). This makes sense because most games are played during the same period of each 24-hour cycle.

Since resources are going out of control on the same 24-hour cycle, and presumably at the time of peak resource demand, one's 'spidey sense' would suspect the overall system has not been properly configured for handling peak load.

Is the server 'thrashing', for instance: has it run out of physical RAM and is it now just dead in the water, jamming the buses with page swaps?

If this is complete nonsense, please ignore, but I thought it might be worth a mention ...
@boilingFrog said in #11:
> I reviewed the data summaries you've posted at the top of this blog and can only point out one obvious thing I noted.
>
> There is a strong and consistent 24-hour cycle in the baseline scala-2 performance data sets (you've referred to them as 'context'). This makes sense because most games are played during the same period of each 24-hour cycle.
>
> Since resources are going out of control on the same 24-hour cycle, and presumably at the time of peak resource demand, one's 'spidey sense' would suspect the overall system has not been properly configured for handling peak load.
>
> Is the server 'thrashing', for instance: has it run out of physical RAM and is it now just dead in the water, jamming the buses with page swaps?
>
> If this is complete nonsense, please ignore, but I thought it might be worth a mention ...

When a server swaps, you see a huge reduction in application CPU usage. That is not the case here: we see an increase (the graphs are probably process CPU). I doubt this is system/OS related.

This is most probably a JVM or application issue. The JVM is a nice piece of engineering, but it has a million knobs, so its behavior can be weird at times. It would be nice to see graphs for the different memory areas inside the Java heap.
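
One possible way to get per-area numbers, in case it helps: the standard `java.lang.management` API exposes each memory pool (eden, survivor, old gen, metaspace and so on; the exact names depend on the garbage collector in use). This is only a minimal sketch, and the class name `MemoryPoolSnapshot` and the output format are made up for illustration. It is not how Lichess collects its metrics.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one formatted line per JVM memory pool.
class MemoryPoolSnapshot {

    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // getUsage() returns null for a pool that is no longer valid
            }
            String kind = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            // getMax() is -1 when the pool has no defined upper limit
            lines.add(String.format("%-8s %-35s used=%d committed=%d max=%d",
                    kind, pool.getName(),
                    usage.getUsed(), usage.getCommitted(), usage.getMax()));
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Sampling this every few seconds and graphing `used` per pool would show whether one area (for example the old generation) fills up at the same time the CPU climbs.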
It would help if you could provide thread dumps taken during normal operation; that way you'd be able to tease apart what differs between "normal" and "misbehaving".

If you grep your thread dumps for "parking to wait for", you'll notice that there are a lot of runnable parked threads. Not necessarily a bad sign, but suspicious if that's not present in "normal" thread dumps. It might be a sign there's some fork-join pool abuse going on. Early on, problems like this used to happen, e.g. github.com/scala/bug/issues/8955. However, it can also simply be a result of CPU starvation (i.e. a consequence, not a cause).
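
To make the "normal" vs "misbehaving" comparison a bit more systematic than eyeballing grep output, something like the following could count parked threads grouped by what they are waiting on. It is a rough sketch that assumes the dumps are in the usual `jstack` text format, where such lines look like `- parking to wait for  <0x...> (a some.Class)`. The class name `ParkedThreadCounter` is invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts "parking to wait for" lines per wait target.
class ParkedThreadCounter {

    // Captures the class name inside "(a ...)" at the end of the line.
    private static final Pattern PARKED =
            Pattern.compile("parking to wait for\\s+<[^>]+>\\s+\\(a ([^)]+)\\)");

    static Map<String, Integer> countByWaitTarget(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = PARKED.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a saved thread dump
        String dump = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        countByWaitTarget(dump)
                .forEach((target, n) -> System.out.println(n + "\t" + target));
    }
}
```

Running it on one dump from a quiet period and one from a CPU spike, then comparing the counts for `java.util.concurrent.ForkJoinPool`, would show whether the parked fork-join workers are new or were always there.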

Detailed profiling, or even just a latency breakdown by request type, could also help here (e.g. is it a specific part of the app that got slower, or is it uniform across the board?).
A few things come to mind:

(a) is the hardware and OS the same between the scala2 and scala3 versions compared (if this was mentioned in the original post, I didn't see it),

(b) similar problems

www.google.com/search?client=firefox-b-1-d&q=scala3+cpu
e.g., -> youtrack.jetbrains.com/issue/SCL-20457/Scala-3-project-using-too-much-CPU-in-2022.2

(c) the advice of profiling, or snapshotting to see what the thread(s) are doing when the CPU is going wild, sounds like good advice to me,

(d) I have seen this kind of thing when a system runs out of a scarce resource. In the case I have in mind, it was a very slow file handle leak that only occurred in an unusual case.
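
For the file handle case in (d), the JVM can report its own open descriptor count on Unix-like systems through `com.sun.management.UnixOperatingSystemMXBean`. A small sketch follows; the class name `FdUsage` is invented here, and the bean is not available on every platform or JVM, hence the `instanceof` check.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: reports this JVM's open and maximum file descriptors.
class FdUsage {

    // Returns {open, max}, or null where the JVM does not expose the counts
    // (for example on Windows).
    static long[] fileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] fds = fileDescriptors();
        if (fds == null) {
            System.out.println("file descriptor counts not available on this JVM");
        } else {
            System.out.println("open=" + fds[0] + " max=" + fds[1]);
        }
    }
}
```

Logging the open count periodically would reveal a slow leak as a steadily rising line long before the limit is hit.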
To troubleshoot: deploy to a test server and keep adding 'human player bots' until the issue is induced?
Can't we just reduce CPU usage by preventing further upgrades? Wouldn't a total reconstruction be a bad idea? Idk.
@randompalZ said in #17:
> I think the only effective way to tackle this is to get CPU profiling data, in that way we can see where the CPU is wasting time on. Something like www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java could help

The easily forgotten part here is that you also need subject matter experts in the worst cases, where a) the answer isn't obvious or b) the wrong answer is the obvious one. But yes, how to profile correctly is being considered, especially since there's no reason yet to believe this is a worst case.
Hope you can fix it soon
Also didn't understand a thing lol XD (no offense)