More on Windows Networking resource limit server failures

My VoIP client has been stress testing the UDP version of The Server Framework and they had what they thought was a deadlock. It wasn’t a deadlock; it was more of a lazy server… What seems to have happened is that they had been thrashing the server with lots of concurrent datagrams and pushed the machine until it stopped receiving packets because it hit the machine’s networking resource limits (probably either non-paged pool exhaustion or the I/O page lock limit). They then let the server rest for a bit and resumed the test. The server ignored every packet sent to it… Bad server…

They sent me stack traces of all of the threads in the system, produced with the Sysinternals Process Explorer tool. These clearly showed that everything looked good: all of the I/O threads were sitting waiting for IOCP events and all of the business logic threads were also idle. Given that I’ve recently been testing the TCP side of the framework in exactly this kind of situation I had a pretty good idea of what the problem would be. Running these tests on the UDP and AcceptEx versions of The Server Framework was on the list of outstanding work items but other things had become more important…

I took a look at the code and had a good idea what the problem would be. The datagram server works like this: when we start ‘accepting connections’ we post a series of overlapped RecvFrom requests; the number of requests to post is our ‘listen backlog’. When each RecvFrom completes, the first thing we do is check that it completed without error and, if so, we post a new RecvFrom and then handle the data that has arrived. By doing this we guarantee that we always have outstanding read requests and can always accept new datagrams. Of course, when we’re hitting the networking resource limits of the machine some of these RecvFrom requests start to complete with errors, and when that happens we don’t post new requests, so eventually the server gets to a point where the resource problem has gone but we have no reads pending…
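To make that concrete, here’s a minimal sketch of the pattern, assuming a plain Winsock/IOCP setup rather than the framework’s actual classes; the names (PerReadContext, PostRecvFrom and so on) are mine, and the IOCP loop, WSAStartup and most error handling are omitted:

```cpp
#include <winsock2.h>
#include <vector>

struct PerReadContext
{
   WSAOVERLAPPED overlapped;     // first member, so a completion's OVERLAPPED* maps back to the context
   WSABUF wsabuf;
   char buffer[1024];
   sockaddr_in from;
   int fromLen;
};

// Post a single overlapped RecvFrom; returns false if the post fails
// (for example, due to non-paged pool or I/O page lock limits).
bool PostRecvFrom(SOCKET s, PerReadContext *pContext)
{
   ZeroMemory(&pContext->overlapped, sizeof(pContext->overlapped));
   pContext->wsabuf.buf = pContext->buffer;
   pContext->wsabuf.len = sizeof(pContext->buffer);
   pContext->fromLen = sizeof(pContext->from);

   DWORD flags = 0;

   if (SOCKET_ERROR == WSARecvFrom(
          s, &pContext->wsabuf, 1, 0, &flags,
          reinterpret_cast<sockaddr *>(&pContext->from), &pContext->fromLen,
          &pContext->overlapped, 0))
   {
      if (WSA_IO_PENDING != WSAGetLastError())
      {
         return false;           // post failed; no read is pending for this context
      }
   }

   return true;
}

// "Start accepting": post 'listen backlog' reads up front. The contexts must
// stay at fixed addresses while reads are pending, so we size the vector once.
void StartReceiving(SOCKET s, std::vector<PerReadContext> &contexts, size_t listenBacklog)
{
   contexts.resize(listenBacklog);

   for (size_t i = 0; i < listenBacklog; ++i)
   {
      PostRecvFrom(s, &contexts[i]);
   }
}

// Called when a completion is dequeued from the IOCP.
void OnRecvFromCompleted(SOCKET s, PerReadContext *pContext, DWORD bytes, DWORD error)
{
   if (error == 0)
   {
      PostRecvFrom(s, pContext);   // keep the backlog of pending reads topped up

      // ... hand pContext->buffer / bytes to the business logic here ...
   }
}
```

The important detail is in OnRecvFromCompleted: a replacement read is only posted when the previous one completed successfully, which is exactly what leaves the server with nothing pending once reads start failing.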

I tested the server in a VMware machine and forced an “I/O page lock limit” situation (note that this is easier than forcing a “non-paged pool exhaustion” situation as you simply have to post socket operations that use humongous buffers…). The reads (and writes) started failing as expected and the server was eventually left with no pending reads and became unresponsive even after the resource problem went away… The obvious fix at this point is also the wrong one. Changing the logic to post a new RecvFrom even if the previous one completed with an error would just prolong the resource failure and force the server into lots of pointless busy work as it posts requests that are almost bound to fail and then posts more of the same.
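For what it’s worth, forcing the page lock failure in a test is roughly this (a sketch, not my actual test code; the buffer size and number of posts needed vary by machine, and the failure typically shows up as WSAENOBUFS). Each pending overlapped read locks its buffer in memory for the duration of the request, so posting enough reads with huge buffers eventually makes further posts fail:

```cpp
#include <winsock2.h>

// Assumes 's' is a connected (or, for UDP, bound) socket. The buffers and
// overlapped structures are deliberately never freed here; they have to stay
// alive while the reads are pending, which is the whole point of the test.
void ExhaustIoPageLockLimit(SOCKET s)
{
   const DWORD bufferSize = 10 * 1024 * 1024;   // deliberately humongous

   for (;;)
   {
      WSAOVERLAPPED *pOverlapped = new WSAOVERLAPPED();
      ZeroMemory(pOverlapped, sizeof(*pOverlapped));

      WSABUF buf;
      buf.buf = new char[bufferSize];
      buf.len = bufferSize;

      DWORD flags = 0;

      if (SOCKET_ERROR == WSARecv(s, &buf, 1, 0, &flags, pOverlapped, 0) &&
          WSA_IO_PENDING != WSAGetLastError())
      {
         // We've hit a resource limit; other sockets on the machine will now
         // start to see their overlapped operations fail too.
         break;
      }
   }
}
```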

One thing that I decided whilst doing these kinds of tests for the TCP side of the framework was that it’s a bad idea to let your server get into this state at all. The resource limits are machine-wide and if your server is chewing up all the resources then other parts of the machine might start to fail. This is not something you want production systems to do. I figured that the server will be refusing connections at this point anyway, so it’s better to be able to configure the maximum number of connections that you’re willing to let it attempt to handle and have it refuse connections once that limit is reached. You can then do some calculations and tests to work out what that limit should be given your hardware configuration and then you’re pretty much safe. The server will be overloaded but it won’t be endangering the machine. So, connection limits are good. The TCP server already had them, and the code that recovers from resource failures is the same code that recovers from hitting our internal limits. So the first thing I did was add configurable connection limits to the UDP server. Once I had the server accepting a set maximum number of connections I started working on getting it to recover gracefully when this limit was reached.
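Something along these lines is all that’s needed; the class name and shape here are illustrative rather than the framework’s actual code:

```cpp
#include <windows.h>

class ConnectionLimiter
{
   public :

      explicit ConnectionLimiter(long maxConnections)
         : m_maxConnections(maxConnections), m_activeConnections(0)
      {
      }

      // Returns true if we're allowed to take on another connection.
      bool TryAddConnection()
      {
         if (::InterlockedIncrement(&m_activeConnections) > m_maxConnections)
         {
            ::InterlockedDecrement(&m_activeConnections);

            return false;        // at the limit; refuse the connection
         }

         return true;
      }

      // Called when a connection is finally released.
      void RemoveConnection()
      {
         ::InterlockedDecrement(&m_activeConnections);
      }

   private :

      const long m_maxConnections;

      volatile long m_activeConnections;
};
```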

The first step on the road to recovery is knowing that you have a problem. The existing design failed here as it had no way to know that it had fewer pending receives than it expected. When the receives failed due to low resources the framework was blissfully unaware. To fix this particular problem I added a counter to keep track of the number of pending receives. When a new receive is posted the counter goes up, when one completes the counter goes down, and if a receive can’t be posted then the counter doesn’t go up (well, actually it goes up and then comes down very quickly…). I added a way for the server base class to notify the user’s derived class of this counter change and updated my “UDP echo server service with performance counters” example to publish a performance counter for the number of pending receives. When the server is operating normally the counter reports a pretty constant value: our “listen backlog”. By setting a max connection limit and then using up all of the connections you can see the counter go down as we decide not to post new receives; eventually the counter hits zero and the server stops accepting new clients. This is the state that the low resource issue puts the server into.
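The counter handling is roughly this (again a sketch with my own names, not the framework’s code):

```cpp
#include <windows.h>

class DatagramServer
{
   public :

      DatagramServer() : m_pendingReceives(0) {}

      virtual ~DatagramServer() {}

   protected :

      // Overridden by the user's derived class; my example publishes the
      // value as a performance counter.
      virtual void OnPendingReceiveCountChanged(long /*pendingReceives*/) {}

      // Called just before a RecvFrom is posted.
      void ReceivePostAttempted()
      {
         OnPendingReceiveCountChanged(::InterlockedIncrement(&m_pendingReceives));
      }

      // Called if the post failed; the increment above is undone, so a failed
      // post shows up as the counter going up and straight back down again.
      void ReceivePostFailed()
      {
         OnPendingReceiveCountChanged(::InterlockedDecrement(&m_pendingReceives));
      }

      // Called when a pending RecvFrom completes, with or without error.
      void ReceiveCompleted()
      {
         OnPendingReceiveCountChanged(::InterlockedDecrement(&m_pendingReceives));
      }

      long GetPendingReceiveCount() const
      {
         return m_pendingReceives;
      }

   private :

      volatile long m_pendingReceives;
};
```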

Once I could visualise the problem via the performance counter I ran the resource limit tests again in the VMware box. Eventually the counter dropped to zero and the server became inert. Now it was time to fix the problem.

Since the server was now aware of the number of pending receives that it had, and since it was also aware of the number that it should have, it could tell when there was a problem: if the actual number is less than the expected number we need to issue more reads. The question then is simply when we should attempt to top up the pool of pending receives. Right now I have an answer that works in most situations but it’s not 100% foolproof. The lifecycle of a socket within the server is this: obtain a “socket” object from the allocator (this can pool “socket” objects for later reuse but doesn’t have to; the “socket” object is a simple wrapper around the real Winsock socket), attach a real Winsock socket to the wrapper object, issue a read, wait for completion, pass the data to The Server Framework for processing by user code, allow write operations, wait for the reference count of the socket to drop to zero and then pass the socket back to the allocator. All sockets eventually go back to the allocator and they pass through the server on the way. Right now I use this as the point at which the server checks that it has enough pending receives. The idea is that a socket connection has just completed, and when that happens resources are released, so there’s a good chance we can create a new socket and post a new read request. We check the pending receive count against what we expect and if it’s low we post a new receive.
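In sketch form, and ignoring the locking that the real code needs, the check looks something like this; PostNewRecvFrom() and the two globals are stand-ins for the framework’s own machinery:

```cpp
// Assumed to exist elsewhere in this sketch: PostNewRecvFrom() attempts a
// single overlapped RecvFrom, updates the pending receive count as shown
// earlier and returns false if the post fails.
bool PostNewRecvFrom();

extern volatile long g_pendingReceives;   // maintained as shown in the previous sketch
extern const long g_listenBacklog;        // the number of receives we want outstanding

// Called when a socket is finally released back to the allocator. Its
// resources have just been freed, so this is a good moment to try to restore
// the backlog of pending receives; if the post still fails we simply give up
// and let the next socket release try again.
void OnSocketReleasedToAllocator()
{
   while (g_pendingReceives < g_listenBacklog)
   {
      if (!PostNewRecvFrom())
      {
         break;      // resources are still low; don't busy-loop on failed posts
      }
   }
}
```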

This works well, but if the server doesn’t have any active connections at the time then it has nothing to drive the creation of new sockets and it will stay dead. In practice I’m not convinced that this problem needs addressing. For a server to get into that state you need a pretty contrived example and you need something else to be using up all of the networking resources on the machine…

Once I had this working for the datagram server I looked at the AcceptEx TCP/IP server and added similar functionality to that. There was one additional wrinkle with the AcceptEx server and that was the actual listen backlog on the listening socket… Although the AcceptEx server uses asynchronous accepts to accept new connections, and although we have our own concept of a “listen backlog” (which is actually the number of pending accepts), the TCP/IP stack also has a real listen backlog on the listening socket. The code that I added to track pending accepts (in much the same way that the UDP server tracks the number of pending reads) worked fine, but the server allowed additional client connections to be established even though we had no accepts pending. These connections appear to be connected from the client’s perspective but aren’t seen by the server until a new AcceptEx is posted. To prevent this happening I would need to close the listening socket.

The “standard” TCP/IP server simply closes the listening socket when it encounters a resource failure or reaches the configurable max connection limit. When the listening socket is closed it refuses connections. We drive the attempt to begin listening again off of the release of existing socket connections, in much the same way as we do for the UDP server. The AcceptEx server is different in that we may hit the limit when posting an AcceptEx request, but we can’t simply close the listening socket at that point because we have a number of pending accepts that would all be cancelled if we closed the listening socket before they complete. The solution was pretty simple, it just took me a while to get there: the ideal time to close the listening socket is when the number of pending accepts hits zero. For that to happen we’ve either hit the max connection limit and stopped posting new accepts or we’ve hit a resource limit and become unable to post new accepts… Either way, when the number of pending accepts hits zero we can close the listening socket without aborting any pending accepts. Once again we drive the attempt to begin listening again from the release of existing sockets.
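Sketched out, with illustrative names and ignoring the synchronisation details the real code needs, the AcceptEx handling looks something like this:

```cpp
#include <windows.h>

// Assumed to exist elsewhere; these mirror the structure of the earlier sketches.
bool AtConnectionLimit();
bool ResourcesAvailable();
bool PostNewAcceptEx();                      // posts one AcceptEx; returns false on failure
void CloseListeningSocket();                 // stops the stack queueing connections we can't service
void ReopenListeningSocketAndPostAccepts();

extern volatile long g_pendingAccepts;

// Called when an AcceptEx completes (successfully or not).
void OnAcceptCompleted()
{
   const long pending = ::InterlockedDecrement(&g_pendingAccepts);

   if (!AtConnectionLimit() && ResourcesAvailable() && PostNewAcceptEx())
   {
      ::InterlockedIncrement(&g_pendingAccepts);
   }
   else if (pending == 0)
   {
      // No pending accepts are left to abort, so it's now safe to close the
      // listening socket rather than let the stack accept connections that
      // will never be serviced.
      CloseListeningSocket();
   }
}

// Called when an existing connection is finally released; this is what drives
// the attempt to begin listening again.
void OnConnectionReleased()
{
   if (!AtConnectionLimit())
   {
      ReopenListeningSocketAndPostAccepts();
   }
}
```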

All three styles of server now recover gracefully from low resource situations. All three can have the number of connections limited if required and, as usual, the code required for the AcceptEx TCP server is a cross between what’s needed for the standard TCP server and the UDP server. There’s a bit of refactoring required now, as the process has exposed some code patterns that are shared between the server types. Once that’s done I need to look at expanding the concept of ‘connection limiting’ into a class so that I can share the limit across multiple servers in a single process; given that we often have a single process that listens on multiple ports, it would be good to be able to set a single process-wide limit and have the servers share it sensibly. Still, that’s work for another day.