Handling lots of socket connections

I’m doing some research for a potential client. They need a TCP/IP server that handles ‘lots’ of concurrent connections. Their own in-house server code currently fails at 4000-6000 connections, and I’m putting together a demo for them of how The Server Framework supports 30000 connections on 1GB of RAM before running into non-paged pool restrictions… Whilst doing this I ran into an ‘interesting’ feature of WSAAccept() (or, perhaps, simply of an LSP that’s installed on my machine…).

To test for the maximum number of concurrent connections that my servers can support I wrote a little tool; in fact, I wrote it in response to an MSDN article on .NET sockets. You run the tool, specify the IP address, the port and the number of connections to attempt, and away it goes. It stops when it fails to connect and reports the number of connections that it managed to create. Right now it’s only testing ‘inactive’ connections, i.e. just the ability to connect to an idle server. The next step is to have each of the connections pump data to the server so that I can test connections to a fully loaded server… Anyway, I digress.
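The core loop of such a tool can be sketched as below. This is only an illustration, not the actual tool: `CountSuccessfulConnections` and the `connectOnce` callable are hypothetical names, and the real socket work (creating a socket, connecting it and keeping it open) is assumed to live inside whatever `connectOnce` does.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical sketch: keeps opening connections until either the
// requested number has been reached or one attempt fails, and reports
// how many succeeded. connectOnce is assumed to open one connection,
// hold it open, and return true on success.
inline std::size_t CountSuccessfulConnections(
   std::size_t attempts,
   const std::function<bool()> &connectOnce)
{
   assert(connectOnce && "connectOnce must be callable");

   std::size_t connected = 0;

   // Stop at the first failure, as the tool described above does.
   while (connected < attempts && connectOnce())
   {
      ++connected;
   }

   return connected;
}
```

Passing the connect step in as a callable keeps the counting logic separate from the socket code, so the same loop could later drive connections that also pump data.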

I ran the tool against the latest build of my simple Echo Server. The tool managed ~31200 connections before the box on which the server was running exhausted the non-paged pool and all socket operations on the box began to fail with WSAENOBUFS, the standard error for running out of non-paged pool (“An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.”). [Correction: WSAENOBUFS can also mean that the limit on the number of locked pages has been reached… This should be avoidable by posting zero byte reads; more investigation is required] Since I was expecting this I told the tool to close the connections. Later, for some reason or another, I ran the tool again and it failed to connect to the server at all. The WSAENOBUFS failure in WSAAccept() had left the server unable to accept new connections. Oops. A bug.

I paused the server (which causes the listening socket to be closed but all existing connections to stay active, i.e. the server doesn’t accept any new connections but can continue to service existing connections) and then resumed it. Everything was fine again. The service accepted connections as expected.

I looked at the accept code which is something like this:

while (!m_shutdownEvent.Wait(0) && m_acceptConnectionsEvent.Wait(0))
{
   CAddress address;
   
   SOCKET acceptedSocket = ::WSAAccept(
      m_listeningSocket, 
      address.GetStorage(), 
      address.GetSize(), 
      0, 
      0);
   
   if (acceptedSocket != INVALID_SOCKET)
   {
      CSmartStreamSocketEx socket = AllocateSocket(acceptedSocket);
                  
      socket->Accepted();
   
      OnConnectionEstablished(socket.Get(), address);
   }
   else if (m_acceptConnectionsEvent.Wait(0))
   {
      OnError(_T("CStreamSocketServer::Run() - WSAAccept:") + GetLastErrorMessage(::WSAGetLastError()));
   }
}

A call to WSAAccept() was failing with WSAENOBUFS and the error was being reported correctly through the OnError() handler. The code then looped around and called WSAAccept() again. I would have expected that call to fail as well, sending the code into a horrible busy loop for as long as sockets couldn’t be created (another bug!), but it didn’t; it just hung inside the call to WSAAccept() and never returned…

Exhausting non-paged pool is a very bad thing to do. It means that drivers can start failing when they attempt to allocate memory and the system as a whole could become unstable (if there are any drivers present that weren’t expecting, or can’t handle, that situation). It should probably be a feature of the framework to limit the number of connections that it will attempt to accept so that the server can refuse incoming connections without hitting the limit first. This would allow the administrators of the server to configure it to attempt to use up to 90% of the machine’s non-paged pool rather than having it crash into the buffers and then attempt to recover. Of course, since non-paged pool is a machine-wide resource, the server also needs to be able to recover from failing in the way that it’s failing at the moment, in case other applications have used up enough non-paged pool to cause it to fail in this way… In other words, fix this bug before worrying about making the framework prevent the bug.

The solution that I came up with simply pauses the server when WSAAccept() fails with WSAENOBUFS and attempts to resume accepting connections each time an existing connection is released. It’s not foolproof but my current test case now passes. I can run the client connection test tool until the server runs out of non-paged pool and the server continues to operate once the low-memory situation has passed.
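The pause-and-resume decision can be sketched as a small state holder. This is not the framework’s actual code: `AcceptGate`, its method names and `AcceptGateWalkThrough` are hypothetical, and the sketch only tracks state, leaving the closing and reopening of the listening socket to the caller. The WSAENOBUFS value (10055) is repeated as a constant so that the sketch compiles without the Windows headers.

```cpp
#include <atomic>
#include <cassert>

// Value of WSAENOBUFS from the Winsock headers, repeated here so the
// sketch does not need them.
constexpr int kWsaENoBufs = 10055;

// Hypothetical helper: records whether the server should be accepting.
// An atomic flag is used because the accept loop and the code that
// releases connections may run on different threads.
class AcceptGate
{
public:
   bool IsAccepting() const { return m_accepting.load(); }

   // Call with the value of WSAGetLastError() after WSAAccept() fails.
   // Returns true if the caller should pause accepting now.
   bool OnAcceptFailed(int lastError)
   {
      if (lastError != kWsaENoBufs)
      {
         return false;   // some other error; not handled by this sketch
      }
      return m_accepting.exchange(false);   // true only on first pause
   }

   // Call each time an existing connection is released.
   // Returns true if the caller should try to resume accepting.
   bool OnConnectionReleased()
   {
      return !m_accepting.exchange(true);   // true only if it was paused
   }

private:
   std::atomic<bool> m_accepting{true};
};

// Walk-through of the intended call sequence.
inline bool AcceptGateWalkThrough()
{
   AcceptGate gate;
   const bool paused = gate.OnAcceptFailed(kWsaENoBufs);
   const bool pausedTwice = gate.OnAcceptFailed(kWsaENoBufs);
   const bool resumed = gate.OnConnectionReleased();
   assert(paused && !pausedTwice && resumed);
   return paused && !pausedTwice && resumed && gate.IsAccepting();
}
```

Resuming on every release is optimistic: if the pool is still exhausted the next accept simply fails and the gate pauses again, which matches the “not foolproof” caveat above.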

Once the code is in place to automatically pause and restart connection acceptance it’s trivial to add a maximum connection limit to the server so that it can be configured to actively prevent exhaustion of the non-paged pool.
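A connection limit of this kind might look something like the sketch below. The class and method names are hypothetical rather than part of the framework, and the sketch is single-threaded; a real server would need to protect the counter with a lock or use atomics. It assumes the caller pauses and resumes acceptance based on the values returned.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a configurable connection limit.
class ConnectionLimiter
{
public:
   explicit ConnectionLimiter(std::size_t maxConnections)
      : m_max(maxConnections), m_active(0)
   {
      assert(maxConnections > 0);
   }

   // Call after each successful accept.
   // Returns false once the limit has been reached, which tells the
   // caller to pause accepting new connections.
   bool OnConnectionAccepted()
   {
      ++m_active;
      return m_active < m_max;
   }

   // Call when an existing connection is released.
   // Returns true if the server was at its limit, which tells the
   // caller that it can resume accepting.
   bool OnConnectionReleased()
   {
      assert(m_active > 0);
      const bool wasAtLimit = (m_active >= m_max);
      --m_active;
      return wasAtLimit;
   }

   std::size_t ActiveConnections() const { return m_active; }

private:
   const std::size_t m_max;
   std::size_t m_active;
};
```

With a limit of 30000, say, the server would stop listening after the 30000th accept and start again as soon as one connection goes away, well before the non-paged pool is exhausted.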

With that fixed I decided to take a look at the AcceptEx() version of the server code. Unfortunately the situation appears to be worse when using AcceptEx(). The calls to AcceptEx() don’t fail; the failures seem to occur when the accept completes (which figures) and they’re reported via GetQueuedCompletionStatus(), which fails with ERROR_NO_SYSTEM_RESOURCES (“Insufficient system resources exist to complete the requested service.”). At this point there seems to be a ‘phantom’ accept queued. This accept will partially complete (the client thinks it has completed but the completion never happens on the server). Again this may be due to a buggy LSP rather than Winsock itself. What’s particularly annoying about this situation is that although I could put in a similar “fix” to prevent accept starvation by issuing new accepts when the problem has gone, a) it’s harder to work out when the problem has gone and b) I’ll still be left with X clients that think they’ve connected but for whom the server will never get a completed accept… Time to test on a clean virtual machine so that I can remove “buggy LSP” from the list of potential issues…