Connection backlog + active connections passed 64512, but there are only 64512 unprivileged port numbers, so the socket server ran out of ephemeral ports to connect to the backend.
We expected to hit the limit soon, but not that soon, so at least a solution was in progress. We rushed to deploy it immediately, which fixed the issue.
What was the solution?
@zaneanderman For a given backend address and port, the tuple (source addr, source port) is required to be unique, so adding a second source address ("moar servers") doubles the limit to roughly 128k (2 × 64512 = 129024).
Does that mean you doubled the number of servers, or did you only have one to start?
Doubled from 1 to 2 socket front servers.
From what I understand, the tuples would have been something like (1, 1..64512) but are now (1..2, 1..64512). Is that correct?
Correct. Instead of literally 1/2, it's the (private) IPv4 address of the front server.
Edit: And instead of 1..64512 it's 1024..65535.
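For anyone who wants to check the arithmetic, here is a rough sketch of the limit discussed above. It assumes all connections go to a single backend address and port, and that every unprivileged port (1024..65535) is usable as a source port. The addresses 10.0.0.1 and 10.0.0.2 and the function names are made up for illustration; they are not the real front servers or any actual API.

```python
from typing import Iterable

# Unprivileged source ports, per the thread: 1024..65535 inclusive.
PORT_MIN, PORT_MAX = 1024, 65535


def ports_per_source() -> int:
    """Number of usable source ports on one source address."""
    return PORT_MAX - PORT_MIN + 1


def max_connections(source_addrs: Iterable[str]) -> int:
    """Upper bound on concurrent connections to ONE backend (addr, port).

    Each (source addr, source port) pair can be used only once for a
    given destination, so the bound scales with the number of distinct
    source addresses.
    """
    return len(set(source_addrs)) * ports_per_source()


# Hypothetical private addresses standing in for the front servers.
one_front = max_connections(["10.0.0.1"])
two_fronts = max_connections(["10.0.0.1", "10.0.0.2"])
print(one_front, two_fronts)  # 64512 129024
```

In practice the usable number is often lower than this bound, since the OS may be configured with a narrower ephemeral port range than the full 1024..65535.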