Data distribution servers

Many of the servers that are built with The Server Framework are designed for high connection count, low data flow situations, and so that's where the focus of the framework's development and testing has been. As I've shown in the past, we can easily deal with 70,000 concurrent connections under various real world traffic flow patterns, and we have various test tools that can stress these kinds of servers. Just as importantly, it's easy to place a limit on the number of connections that a server will accept so as to protect the server (and other processes on the same machine) from hitting resource limits (such as non-paged pool exhaustion). In general I'm pretty happy with how The Server Framework works for these kinds of servers.

Unfortunately there's a whole area that I haven't previously paid a great deal of attention to (mainly because I didn't have many clients with these kinds of problems); by default The Server Framework doesn't handle high data flow, low connection count servers especially well. That's a bit of an exaggeration, but you do have to hack at The Server Framework code a little to make the simple changes that are required to get decent throughput.

The problem is that, as of version 5.2.1, The Server Framework doesn't provide an easy way to set a connection's TCP receive window, and although it's easy enough to hack a change into the framework it should be something that's more accessible. Of course, the reason for this is that my test servers don't need or use this…

I currently have several clients that are building high data flow servers. They're writing market data distribution servers, automatic execution engines and other trading related systems. One of them has some performance questions, and to be able to answer those questions I first need a similar demo server architecture that I can experiment on. Step one was adjusting the test client so that it could send lots and lots of data on a connection very quickly. The test client can do 'lots of connections' and 'x data every y ms' but couldn't easily do 'as much data as possible in the shortest time'.

One of the problems with a simplistic IO Completion Port design for sending data is that you can keep pushing data into the TCP stack long after the stack has stopped being able to send it to the peer. Unless you keep track of how many of your writes are outstanding you can easily chew up all of the resources on your machine by queuing vast amounts of data for transmission. A connection filter based solution to this problem has been planned for the 5.3 release for some time, but other work has got in the way. So, to understand the problems, I wrote a send-side flow control system into the test harness directly, with the intention of moving it out into a connection filter once I had it working correctly.

Doing this kind of flow control involves the following: keeping track of the number of overlapped writes that you issue, keeping track of the number that complete, stopping issuing writes when you have more than X outstanding, and restarting the data flow when the number of outstanding writes drops below a certain limit. This keeps the amount of data that is currently waiting to be sent by the TCP stack to a manageable level and helps to restrict your use of non-paged pool. If you don't do this (and just issue writes until you have no more data to send) then you can use an arbitrary amount of non-paged pool, and since that's a finite resource this kind of behaviour can cause unexpected server failures.
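Something like the following is the general shape of the idea; it's a minimal sketch rather than the framework's eventual connection filter, and the class name, watermark values and locking strategy are all illustrative assumptions.

```cpp
// Sketch of per-connection send flow control based on counting
// outstanding overlapped writes. Names and limits are illustrative.
#include <mutex>

class SendFlowControl
{
public:
   SendFlowControl(size_t highWater, size_t lowWater)
      : m_pendingWrites(0), m_highWater(highWater), m_lowWater(lowWater), m_paused(false)
   {
   }

   // Call just before issuing an overlapped write. Returns false if the
   // caller should queue the data locally rather than send it now.
   bool CanSend()
   {
      std::lock_guard<std::mutex> lock(m_lock);

      if (m_paused || m_pendingWrites >= m_highWater)
      {
         m_paused = true;
         return false;
      }

      ++m_pendingWrites;
      return true;
   }

   // Call from the IO completion handler for each completed write.
   // Returns true when the caller should restart the data flow.
   bool OnWriteCompleted()
   {
      std::lock_guard<std::mutex> lock(m_lock);

      --m_pendingWrites;

      if (m_paused && m_pendingWrites <= m_lowWater)
      {
         m_paused = false;
         return true;
      }

      return false;
   }

private:
   std::mutex m_lock;
   size_t m_pendingWrites;
   const size_t m_highWater;
   const size_t m_lowWater;
   bool m_paused;
};
```

The high and low watermarks give you a bit of hysteresis so that the sender doesn't thrash on and off around a single limit.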

So, the first thing I did was run the existing test harness with some settings that caused it to send as much data as it could and I watched it chew up resources until it crashed. What I also saw, with dismay, was that it completely failed to stress the echo server that it was testing. Looking at network utilisation showed that the test harness, although sending data as fast as a tight loop could, was actually squirting very small amounts of data onto the wire in intermittent bursts. Not what I was after at all; but not too surprising when you think about how TCP throttles data flow…

Step two was to prevent the test harness from crashing by adding flow control based on the number of outstanding writes. Of course, this didn't change the amount of data flowing, but it did make the test more robust; it now reliably ran to completion even if I increased the number of connections.

Step three was to fire up Wireshark and take a look at what was actually happening on the wire, though I had a fairly good idea of what I would see. The network trace clearly showed the two TCP stacks regulating the data flow using the TCP receive window: lots of TCP ZeroWindow segments as the TCP stack on the server told the TCP stack on the client to stop sending because its buffers were full.

Adjusting the server to use a larger receive window on its connections was fairly straightforward: I added a new callback to the framework that is called straight after each socket is created and which is passed the socket. This can now be implemented in a client's server code to adjust the socket in any way they want prior to connection establishment. I'm actually going to add specific send and receive buffer setting code, as it's likely to be used reasonably regularly, but the generic callback can be used to set anything else that users might want and that isn't currently exposed by the framework.
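The body of such a callback is nothing more exotic than a setsockopt() call; the callback name below is hypothetical, it just shows the kind of thing you'd do with the socket before the connection is established.

```cpp
// Illustrative only: set a larger receive buffer (and hence a larger
// advertised TCP receive window) on a newly created socket.
#include <winsock2.h>

void OnSocketCreated(SOCKET s)
{
   const int receiveBufferSize = 256 * 1024;   // e.g. 256KB

   if (0 != ::setsockopt(
      s,
      SOL_SOCKET,
      SO_RCVBUF,
      reinterpret_cast<const char *>(&receiveBufferSize),
      sizeof(receiveBufferSize)))
   {
      // log WSAGetLastError() and deal with the failure as appropriate
   }
}
```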

Once the server and client could be set up with decent sized TCP receive windows and the client could be configured to send data at a controlled maximum rate, it was possible to start doing some experiments. It seems that you don't need a great many pending writes to keep the network busy; just keep pushing data in until the write completions slow right down and then switch to only pushing more data in as the pending writes complete. You then tend to use around 110% of your TCP receive window's worth of non-paged pool and you get a network connection that's full of data.

Next on the list is writing a more focused server and clients. I need a data feed which pushes lots of data into the server and several clients that can subscribe to that data. I expect the auction server work that I did a while back will come in handy here; basically we'll send data out to many clients without copying the buffer that it's in for each client. Once I've got that working I can start to look at performance issues… though right now things look pretty good…
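To be clear about what I mean by not copying the buffer: the rough idea, sketched below with hypothetical names rather than the auction server's actual code, is that a single reference-counted buffer is handed to every subscriber's send path and only released once every pending send that uses it has completed.

```cpp
// Rough sketch of zero-copy distribution: one shared, reference-counted
// buffer is passed to every subscriber rather than copied per client.
#include <memory>
#include <vector>

struct Buffer
{
   std::vector<unsigned char> data;
};

using BufferPtr = std::shared_ptr<const Buffer>;

class Subscriber
{
public:
   // The connection takes its own reference; the buffer stays alive
   // until every pending send that refers to it has completed.
   virtual void Send(const BufferPtr &buffer) = 0;

   virtual ~Subscriber() = default;
};

void Distribute(const std::vector<Subscriber *> &subscribers, const BufferPtr &buffer)
{
   for (Subscriber *subscriber : subscribers)
   {
      subscriber->Send(buffer);   // shares the buffer, no per-client copy
   }
}
```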