Lag spike at game start. Unable to move and game is aborted after time limit is reached.

@for_cryingout_loud said in

> I would love to check the server load
> and see if that's a factor, but I don't have console access.
> But you are right, it can be anything under the sun, but the devs can't search under the sun because they will get burnt.
> So how can we narrow down what the problem is, or what causes the disconnects, even further?

Let that SUN be alone.
But start wider and narrow problems down if needed.

How long have the servers been on OVH hosting now? Several (6 or more) months ago there were no such problems, or they were not so prominent.
So what else has changed? The Scala version itself, but besides that?

I gave those times because my time zone is EET (UTC+2), and server playability, at least for me, is better at night and before 12 noon.

Now, between my last posts and your answer, I played 2 tournaments, and the second one was barely playable; I mean short reconnects affected nearly all games. Let's say this narrows the period down to roughly 12-22, which should make monitoring and catching bottlenecks more successful. It looks like some kind of overload.

So access to hosted system monitoring is crucial, and server operators should have some kind of service mechanism: persons to contact who can perform such a task or provide access if needed. There is no need to guess; just send such an inquiry to the hosting provider.

Maybe you or the Lichess owner can temporarily ask the hosting provider for more resources for testing purposes.
Such things are hard to solve without server-side inspection and monitoring.

> Maybe set up a bot that requests stuff from the API and see if that has DC issues, or message the devs on Discord and we start finding ways to narrow it down.

That bot thing can help too, especially if it counts disconnects for the bot client. But I would start from overall system performance stats. But bots can't play 1:0 tournaments, can they?

Wish you good luck on solving.


" It looks like some kind of overload."
But these are not peak times. At peak times it would be most noticeable, which is not true judging by the games gathered from users when they had the issue (I can't provide said games because the link limit would kill me).

"Let that SUN be alone.
But start wider and narrow problems down if needed.
"
No, I will abuse the sun until it gives me answers.
Wider takes longer; narrow will be faster. We should just only narrow if we are 100% sure based on facts.

"How long are servers now on OVH hosting? At least several (6 or more) months ago there was no such problems or it was not so prominent.
So what else has been changed? Scala version itself, but besides that?"

It's not easy to link it to one change, as users have complained for years (search "lag" or "disconnect" on the forum and scroll down).
It's even harder to say if it's gotten worse because of new updates, since no one has been measuring it in any way besides feeling.

But maybe we can ask for load stats from before and after that update and other ones, to see if we can spot overload.
Maybe we can also see if the server system can run at full throttle with no problems in terms of heat, power, etc.

"So access to hosted system monitoring is crucial and server operators should have some kind of service mechanism - persons to contact who provide such task or access if needed. There is no need "to guess" but create such inquiry to hosting providers."

Lichess should be able to see it from the console, so just get Lichess to ask and see.
But what data exactly are we looking for?
avg cpu usage
temps
requests for games
requests for everything
disk usage % and read and write
mem usage and mem free and memcached
Maybe we can ask them to save CPU and other values over time, and then we can plot them and see if any meaningful trend arises.
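A rough sketch of what "save values over time and look for a trend" could mean in practice, assuming the values arrive as (timestamp, value) pairs. The function name and the sample numbers are made up for illustration; real data would have to come from whoever has access to the servers.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_average(samples):
    """Group (ISO timestamp, value) samples by hour of day and average each group."""
    buckets = defaultdict(list)
    for stamp, value in samples:
        buckets[datetime.fromisoformat(stamp).hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical CPU-usage samples in percent, NOT real Lichess data.
samples = [
    ("2022-12-01T09:00:00", 40.0),
    ("2022-12-01T09:30:00", 50.0),
    ("2022-12-01T13:00:00", 90.0),
    ("2022-12-01T13:30:00", 80.0),
]
print(hourly_average(samples))
```

If the hours with the most reported disconnects also show the highest averages, that would point towards load; if not, load alone is probably not the explanation.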

"That bot thing can help too, specially if this counts disconnets for bot client. But i would start from overall system perfomance stats. But bot's can't play 1:0 tournaments, can they?"
Bots sadly can't play tournaments according to this (https://lichess.org/api#tag/Bot), but
tournament games only differ from normal games in pairing and points.
If we just have bot A play bot B,
we should still find the issue.
Just make both bots run ping to see whether the whole connection drops or just the server's responses to our requests, so we can be certain of it.
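The ping idea could look roughly like this: probe an unrelated control host and the game server at the same moment, then compare. This is only a sketch; it uses a plain TCP connect instead of ICMP ping (which usually needs extra privileges), and the function names and the choice of control host are my own assumptions.

```python
import socket


def tcp_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_drop(control_ok, server_ok):
    """Interpret one pair of probe results taken at the same moment."""
    if control_ok and server_ok:
        return "both reachable"
    if control_ok:
        return "server or route to server"
    if server_ok:
        return "control host problem"
    return "local connection down"


# Example use during a test game (hosts are placeholders):
# classify_drop(tcp_reachable("example.org"), tcp_reachable("lichess.org"))
```

A bot would run this every few seconds and log the result with a timestamp, so a disconnect in the game can be matched against what the probes saw.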

"Wish you good luck on solving."
You can't run away, I am roping you into this.

Oh boy, reading this thread... OK, it sounds like there's a reproducible issue capable of further analysis although it sounds complicated.

Unfortunately I don't know whether this issue is server-side or client-side, although ironically this morning a lishogi.org user reported a similar issue on Lishogi, that every 10 seconds or so their socket connection drops. So I'm thinking any of:
a) Somehow the same change recently deployed both to Lichess and Lishogi (unlikely since I don't think Lishogi updates)
b) There is some client-side issue
c) There is a latent server-side issue that suddenly started happening on both sites (possibly triggered by something client-side)


@for_cryingout_loud said in #12:

> "But start wider and narrow problems down if needed."
> No, I will abuse the sun until it gives me answers.
> Wider takes longer; narrow will be faster. We should just only narrow if we are 100% sure based on facts.

To analyze such issues you have to consider the whole picture, i.e. think wider.
Then narrow down the potential bottlenecks and, via monitoring, first quickly exclude some areas: the factors that are irrelevant.
You can of course use the "abuse the SUN" method if you wish; the general public and I will wait until the Sun answers you.

> "How long are servers now on OVH hosting? At least several (6 or more) months ago there was no such problems or it was not so prominent.
> So what else has been changed? Scala version itself, but besides that?"
>
> It's not easy to link it to one change, as users have complained for years (search "lag" or "disconnect" on the forum and scroll down).
> It's even harder to say if it's gotten worse because of new updates, since no one has been measuring it in any way besides feeling.
>
> But maybe we can ask for load stats from before and after that update and other ones, to see if we can spot overload.
> Maybe we can also see if the server system can run at full throttle with no problems in terms of heat, power, etc.

Who said it must be easy? If nobody is able to say whether the performance of the servers has gotten worse or not, it means someone should implement measuring standards to be able to say such things in the future. Interesting, who might that be?

> "So access to hosted system monitoring is crucial and server operators should have some kind of service mechanism - persons to contact who provide such task or access if needed. There is no need "to guess" but create such inquiry to hosting providers."
>
> Lichess should be able to see it from the console, so just get Lichess to ask and see.
> But what data exactly are we looking for?
> avg CPU usage
> temps
> requests for games
> requests for everything
> disk usage % and read and write
> mem usage and mem free and memcached
> Maybe we can ask them to save CPU and other values over time, and then we can plot them and see if any meaningful trend arises.

All of them, right from the very beginning, and what's worse, this has to be done in a timely manner. And not only the parameters you listed; there are more.
But let every person do the task he is skilled at.

  1. OVH staff can report the overall load/performance of the hosted servers, manage their load balancing, etc., and provide performance graphs.
  2. A network specialist from OVH should be able to answer questions related to networks, firewalls, redirections, network equipment, their loads, queue times, and possibly even attacks, if any are happening on the systems.
  3. A system/database architect can look for and measure the overall load between the frontend and backends and their interactions.
    Database measurements, optimisations, etc. Database access times, response times, etc.

You want facts - generate them by analyzing and measuring.

At this point I would ask whether those problems affect only bullet or ultrabullet, or how prominent the "lag/throttle" issues are in slower tournaments. Are they on the same virtual machine/cluster/network, etc.? What is common, what is different?
Differentiate your findings to narrow things down, and then start the detailed search.

What I can tell just by "feeling": at peak times there are around 90000 clients on the server and some 45000-46000 games in progress, and that has not changed much.

About lag and the forums: there is a need to differentiate real lag problems from load problems. And you know, this too can only be done by measuring.

> "That bot thing can help too, especially if it counts disconnects for the bot client. But I would start from overall system performance stats. But bots can't play 1:0 tournaments, can they?"
> Bots sadly can't play tournaments according to this (lichess.org/api#tag/Bot), but
> tournament games only differ from normal games in pairing and points.
> If we just have bot A play bot B,
> we should still find the issue.
> Just make both bots run ping to see whether the whole connection drops or just the server's responses to our requests, so we can be certain of it.

Incorrect. You may find something, or you may not.
You will possibly find bot A vs bot B issues if you are lucky enough to run them on the affected part of the system.
To be more sure, you have to run the bot challenge on the same server that the bullet system runs on (because that one is probably the most complained about), and better make plenty of those bots, I mean some hundreds, to mimic a normal tournament startup, and make them move during the game more like normal humans do, with a random move delay (I mean no opening-book play with 30 moves in the first 0.001 s).
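For the random move delay, something as simple as the following would already avoid 30 instant opening moves. It is only a sketch: the function name and the 0.2-3.0 second range are my guesses, not measured human behaviour, and for 1+0 games the delays would have to be scaled so they fit into the clock.

```python
import random


def human_like_delays(n_moves, min_s=0.2, max_s=3.0, seed=None):
    """Return one random think time in seconds per move, uniform in [min_s, max_s]."""
    rng = random.Random(seed)  # a fixed seed makes a test run repeatable
    return [rng.uniform(min_s, max_s) for _ in range(n_moves)]


# Think times for the first 30 moves of one hypothetical bot.
delays = human_like_delays(30, seed=1)
print(f"total think time: {sum(delays):.1f} s")
```

Each bot would sleep for the next delay before sending its move, and giving every bot a different seed keeps them from moving in lockstep.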

But this bot test is more meaningful if the overall server performance analysis has already been done.
And who restricts making a bot-only test arena for performance testing purposes?

It depends on the human resources you have, how wide or narrow one starts.

> "Wish you good luck on solving."
> You can't run away, I am roping you into this.

Don't worry, I just bought a good custom M390 steel knife against roping.


@Toadofsky said in #13:

> Oh boy, reading this thread... OK, it sounds like there's a reproducible issue capable of further analysis although it sounds complicated.
>
> Unfortunately I don't know whether this issue is server-side or client-side, although ironically this morning a lishogi.org user reported a similar issue on Lishogi, that every 10 seconds or so their socket connection drops. So I'm thinking any of:
> a) Somehow the same change recently deployed both to Lichess and Lishogi (unlikely since I don't think Lishogi updates)
> b) There is some client-side issue
> c) There is a latent server-side issue that suddenly started happening on both sites (possibly triggered by something client-side)

Good points. And solving it means measuring, analyzing and testing... Difficult but doable.
If so, maybe one can first find out which side drops the socket, server or client.
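One possible starting point, as far as I understand it: when a WebSocket closes, the browser reports a close code and a "was clean" flag, and the codes are standardized in RFC 6455. The sketch below only interprets values that someone has already logged; the function name is made up, and whether the Lichess client exposes these values anywhere is an assumption.

```python
# Close codes defined in RFC 6455, section 7.4.1.
NORMAL_CLOSURE = 1000
GOING_AWAY = 1001
ABNORMAL_CLOSURE = 1006  # never sent on the wire; reported locally when no close frame arrived
INTERNAL_ERROR = 1011


def interpret_close(code, was_clean):
    """Give a rough reading of a logged WebSocket close code and wasClean flag."""
    if code == ABNORMAL_CLOSURE or not was_clean:
        return "no close frame seen: connection lost or cut abruptly, side unknown"
    if code == GOING_AWAY:
        return "peer announced it is going away"
    if code == INTERNAL_ERROR:
        return "server reported an internal error"
    if code == NORMAL_CLOSURE:
        return "normal closure"
    return "other close code"


print(interpret_close(1006, False))
```

Note that code 1006 by itself does not say which side dropped the socket; it only says nobody sent a proper close frame, so it has to be combined with network-level measurements.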


"who restricts making just bot test arena for perfomance testing purposes?"
Lichess does not allow bots to do that; you would have to ask someone like Toadofsky
to have a test like that.

"You possibly find bot a vs bot b issues, if you are lucky enough to run them on affected part of the system.
To be more sure - you have to make a bot challenge on same server as bullet (because this one is probably most complained about) system runs, and better make plenty of those bots - I mean some hundreds, to mimic normal tournament startup, and make them move in game progress more like normal humans do - with random move delay (I mean no opening book play with 30 moves in first 0.001 s)."
"Are they on same virtual machine/cluster/network and etc. What is common, what is different.
Differentiate your findings to narrow things down and then start details search."

https://docs.google.com/spreadsheets/d/1Si3PMUJGR9KrpE5lngSkHLJKJkb0ZuI4/preview (where we can find a list of the servers, to narrow down which ones could have load issues)
Here are all the servers. We can easily eliminate some from the list, but the question is:
how can we force a game to be played on server X, or do we just chance it?

"better make plenty of those bots- i mean some hundreds"
I agree. I was using just bot A and bot B only because I wanted to define exactly what the bots should do before we figure out how to run 100 of them.

"Good points provided. And solving - measuring, analyzing and testing... Difficult but doable.
If so, maybe first can one search - which side drops socket, server or client."
But let's start by figuring out which side drops the socket.

Are we going to use a bunch of bots playing games with both server and client being monitored for drops or how?

EDIT: "Dont worry, i just bought good custom M390 steel knife against roping."
OK then, I guess we are moving on to steel chains. There is no escaping this. We are solving lag on Lichess together.

Do you have any client-side logging or messaging available? Something built into the client, more informative than move times, etc., to see what happens before a disconnect or socket drop?

I see there is a lot of data and different kinds of logging available when looking at the page with the Mozilla developer tools. But what exactly should I look for?


OK, Toadofsky's post made me think about what else I can see from the client side.

Overall, today is a bad day for diagnosis, as this "lagging" happened very rarely compared to the 3 previous days.
Only some tournaments were affected, from 9 to 15 EET (UTC+2).

It also turns out that recording a whole session with the Mozilla developer tools performance analyzer requires too much memory (or overflows memory) to get the result of one tournament saved as a profile.

But I was able to catch several interesting findings while running Lichess in the special "focus mode" with the developer tools network analyzer turned on.
At first I hoped that I had found the culprit, but I'm not so sure now.

Findings so far: quite often there is blocking on the webpage loading request, for a reason I cannot tell; once it took 14.98 s and that tournament game timed out.
Getting a socket is sometimes quite slow, often 300 ms, but it also reached 7-14 seconds in the worst cases, and it looks like it does not depend on which of the sockets is free (socket 1 to 5).

But there are some TLS session setup problems, and those match the "disconnect feature" better, as they are quite long, from 5-7 s, followed by the message "reconnecting in 3500 ms".
One game I lost exactly while this TLS setup timed out, with no answer to the request and no reason given.
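To repeat this kind of measurement outside the browser, the TCP connect time and the TLS handshake time can be taken separately. The following is a minimal sketch with the Python standard library; the function names and the 1-second threshold are my own choices, and a single sample proves nothing, so it would have to run repeatedly over the affected hours.

```python
import socket
import ssl
import time


def time_connect_and_tls(host, port=443, timeout=10.0):
    """Measure TCP connect time and TLS handshake time to host:port, in seconds."""
    context = ssl.create_default_context()
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        connected = time.perf_counter()
        # wrap_socket performs the TLS handshake before it returns
        with context.wrap_socket(raw, server_hostname=host):
            secured = time.perf_counter()
    return {"tcp_connect_s": connected - start, "tls_handshake_s": secured - connected}


def slow_phases(timings, threshold_s=1.0):
    """Return the names of the phases that took longer than the threshold."""
    return [name for name, seconds in timings.items() if seconds > threshold_s]


# Example with made-up timings in the range reported above:
print(slow_phases({"tcp_connect_s": 0.3, "tls_handshake_s": 5.2}))
```

Calling `time_connect_and_tls("lichess.org")` every few seconds and logging the result with a timestamp would show whether the slow TLS setups also appear without a browser involved.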

All of this happened with Firefox, with only one tab open (maybe I should try with plenty of tabs open too, to compare).

For fact-hungry persons I can provide screenshots of those timeouts too.


@klelik @kudzu12

One of you was asking about this in discord, I'm not sure which. I don't want to repaste the whole thing here but:

https://discord.com/channels/280713822073913354/693123651272179793/1048638731013337158

We should keep in touch about this issue.


@schlawg said in #19:

> @klelik @kudzu12
>
> One of you was asking about this in discord, I'm not sure which. I don't want to repaste the whole thing here but:
>
> discord.com/channels/280713822073913354/693123651272179793/1048638731013337158
>
> We should keep in touch about this issue.

No, it was me asking, and thanks for responding.
Hopefully we get the tools soon; that makes helping you guys a lot easier.


This topic has been archived and can no longer be replied to.