AFD datagram performance

I’ve been playing around with the low-level access to the Windows networking stack that is provided by \Device\Afd. This gives you a ‘readiness’ interface rather than the ‘completion’ interface used by traditional IOCP designs. I’ve been meaning to do some comparative performance tests, much in the same way that I did for my investigations of the RIO API, but I’ve been too busy.
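For anyone who hasn’t seen the ‘readiness’ side before, here’s a rough sketch of what arming a single-socket poll looks like. The \Device\Afd interface is undocumented, so the IOCTL code, event flags, structure layouts and the PollForReadability() wrapper below are assumptions based on open source pollers such as wepoll, not The Server Framework’s actual code.

```cpp
// Sketch only: \Device\Afd is undocumented. The IOCTL code, event flags and
// structure layouts below are assumptions taken from open source pollers such
// as wepoll, and the socket handle passed in should be the base provider
// handle (SIO_BASE_HANDLE) rather than a layered provider's handle.
#include <winsock2.h>
#include <windows.h>
#include <winternl.h>      // NtDeviceIoControlFile, IO_STATUS_BLOCK
#include <cstdint>

#pragma comment(lib, "ntdll.lib")
#pragma comment(lib, "ws2_32.lib")

static const ULONG IOCTL_AFD_POLL       = 0x00012024;   // assumed value
static const ULONG AFD_POLL_RECEIVE     = 0x0001;
static const ULONG AFD_POLL_SEND        = 0x0004;
static const ULONG AFD_POLL_ABORT       = 0x0010;
static const ULONG AFD_POLL_LOCAL_CLOSE = 0x0020;

struct AFD_POLL_HANDLE_INFO
{
   HANDLE Handle;
   ULONG Events;
   NTSTATUS Status;
};

struct AFD_POLL_INFO
{
   LARGE_INTEGER Timeout;
   ULONG NumberOfHandles;
   ULONG Exclusive;
   AFD_POLL_HANDLE_INFO Handles[1];
};

// Ask AFD to tell us when 'baseSocket' is readable. 'afdDevice' is a handle to
// \Device\Afd that has been associated with an I/O completion port, so the
// readiness notification arrives as a normal completion packet whose
// lpOverlapped is the 'ioStatus' pointer that we pass as the APC context.
bool PollForReadability(
   HANDLE afdDevice,
   SOCKET baseSocket,
   AFD_POLL_INFO &pollInfo,
   IO_STATUS_BLOCK &ioStatus)
{
   pollInfo.Timeout.QuadPart = INT64_MAX;      // no timeout; wait for readiness
   pollInfo.NumberOfHandles = 1;
   pollInfo.Exclusive = FALSE;
   pollInfo.Handles[0].Handle = reinterpret_cast<HANDLE>(baseSocket);
   pollInfo.Handles[0].Events = AFD_POLL_RECEIVE | AFD_POLL_ABORT | AFD_POLL_LOCAL_CLOSE;
   pollInfo.Handles[0].Status = 0;

   const NTSTATUS status = ::NtDeviceIoControlFile(
      afdDevice, nullptr, nullptr, &ioStatus, &ioStatus,
      IOCTL_AFD_POLL,
      &pollInfo, sizeof(pollInfo),
      &pollInfo, sizeof(pollInfo));

   // STATUS_PENDING (0x103) means the completion port will be signalled when
   // the socket becomes ready; STATUS_SUCCESS means that it's ready right now.
   return status == 0 || status == 0x103;
}
```

In this scheme the poll completion is dequeued from the completion port like any other overlapped operation, which is what lets the readiness notifications share an existing pool of I/O threads.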

Instead, what has happened is that the AFD code has made its way into the version of The Server Framework that I’m using with my Online Game Company client. Their main networking protocol uses UDP and we decided to see if the AFD approach would be beneficial for performance. There are several reasons why it should be:

  1. we use a single ‘well known socket’ for all communication, splitting out the separate connections based on protocol level information.
  2. the data flow works best if the datagrams are processed in the order that they’re sent.
  3. the main area where improved performance is noticed is when the server is under heavy load and we’re continually processing inbound and outbound datagrams.

Points 1 and 2 mean that we get very little value from scaling out an IOCP solution across multiple I/O threads. In fact, point 2 often means that scaling out hurts us, as datagrams that are processed out of sequence cause queueing in the individual connections.

Points 2 and 3 mean that running a single-threaded, tight loop that reads data whilst it’s available can give better results. However, the servers expose several different endpoints and a thread per endpoint isn’t ideal, so a readiness notification that allocates the endpoint to an I/O thread, followed by a tight loop that reads data until none is left, is a good fit.
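As a sketch of that ‘notify then drain’ approach, assuming a non-blocking socket and hypothetical ProcessDatagram(), RequestReadinessNotification() and HandleSocketError() helpers, the per-endpoint loop might look something like this:

```cpp
// Sketch of the 'readiness notification then tight read loop' idea; the
// helper functions are hypothetical and not The Server Framework's API.
#include <winsock2.h>

#pragma comment(lib, "ws2_32.lib")

void ProcessDatagram(const char *data, int length, const sockaddr_storage &from);
void RequestReadinessNotification(SOCKET s);
void HandleSocketError(SOCKET s, int error);

// Called on the I/O thread that the endpoint was handed to when its readiness
// notification completed. The socket is non-blocking, so we read until the
// stack has nothing more for us and only then re-arm the readiness poll.
void DrainDatagrams(SOCKET s)
{
   char buffer[65536];

   for (;;)
   {
      sockaddr_storage from {};
      int fromLen = static_cast<int>(sizeof(from));

      const int bytes = ::recvfrom(
         s, buffer, static_cast<int>(sizeof(buffer)), 0,
         reinterpret_cast<sockaddr *>(&from), &fromLen);

      if (bytes >= 0)
      {
         // One user to kernel transition per datagram, no completion to
         // dequeue, and datagrams are handled in the order they arrived.
         ProcessDatagram(buffer, bytes, from);
         continue;
      }

      if (::WSAGetLastError() == WSAEWOULDBLOCK)
      {
         // Nothing left to read; ask for another readiness notification and
         // give the thread back to the pool.
         RequestReadinessNotification(s);
         return;
      }

      HandleSocketError(s, ::WSAGetLastError());
      return;
   }
}
```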

The AFD design means that we reduce both context switches and user to kernel transitions. With IOCP we can’t use “skip completion port on success” processing because it causes us problems with point 2: it can allow datagrams to “jump the queue”. So all reads result in two kernel transitions, one for the read and one for the completion. Likewise, we can’t easily get value from batching the completion processing with GetQueuedCompletionStatusEx() as this makes it more likely that we’ll process datagrams out of sequence under heavy load. Finally, we’re not interested in the send completions, so we send synchronously; this is more efficient, as it halves the number of kernel transitions, but it can cause issues if we are near the limit of the link capacity.
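For reference, the two IOCP-side optimisations being ruled out there are the documented SetFileCompletionNotificationModes() and GetQueuedCompletionStatusEx() calls; this sketch just shows their shape, with comments on why neither suits this particular workload.

```cpp
// The two IOCP optimisations discussed above; both are standard Win32 calls,
// shown here only to illustrate why they don't fit this workload.
#include <winsock2.h>
#include <windows.h>

#pragma comment(lib, "ws2_32.lib")

void ConfigureSkipCompletionPortOnSuccess(SOCKET s)
{
   // With this mode set, a read that completes synchronously doesn't post a
   // completion packet, so the issuing thread processes that datagram
   // immediately - letting it 'jump the queue' ahead of earlier datagrams
   // whose completions are still sitting in the completion port.
   ::SetFileCompletionNotificationModes(
      reinterpret_cast<HANDLE>(s),
      FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);
}

void ProcessABatchOfCompletions(HANDLE iocp)
{
   // Dequeuing completions in batches reduces the per-completion overhead,
   // but it hands several datagrams to one I/O thread at a time, which makes
   // out of sequence processing more likely when multiple threads are running.
   OVERLAPPED_ENTRY entries[64];
   ULONG numRemoved = 0;

   if (::GetQueuedCompletionStatusEx(
         iocp, entries, ARRAYSIZE(entries), &numRemoved, INFINITE, FALSE))
   {
      for (ULONG i = 0; i != numRemoved; ++i)
      {
         // entries[i].lpOverlapped identifies the read that completed.
      }
   }
}
```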

With the AFD design we amortize the cost of the single readiness completion across the subsequent reads that return data, so it becomes more efficient the more heavily loaded the server is. Likewise, sends are efficient right up to the point where we have filled the buffers and need to slow down anyway.

I don’t have general purpose numbers, but the client’s tests show reduced CPU usage, reduced round trip time, and the ability to support considerably more concurrent connections. Also, their tests on the “more complex” server architectures, NUMA and ‘stupidly high numbers of CPUs on non-NUMA architecture’, seem to show even more gains for AFD.

The AFD code was easy to slot into The Server Framework as the 8.0 branch, which is the fully cross platform version, separates all socket objects into two halves so that the back-end can be replaced with platform-specific code. This has the added advantage that the back-end can also be replaced with different implementations on each platform.
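Purely to illustrate the shape of that split, and with invented names rather than the framework’s actual classes, the idea is roughly this:

```cpp
// Invented names, purely to illustrate the front-end/back-end split described
// above; this is not The Server Framework's actual interface.
#include <cstddef>

// Platform-neutral interface that the rest of the framework codes against.
class IDatagramSocketBackEnd
{
   public:
      virtual bool RequestReadinessNotification() = 0;
      virtual int Read(void *buffer, size_t length) = 0;
      virtual int Write(const void *buffer, size_t length) = 0;

   protected:
      ~IDatagramSocketBackEnd() = default;
};

// One back-end per mechanism; the platform-neutral half neither knows nor
// cares which one it has been given.
class AFDDatagramBackEnd   : public IDatagramSocketBackEnd { /* ... */ };
class IOCPDatagramBackEnd  : public IDatagramSocketBackEnd { /* ... */ };
class EPollDatagramBackEnd : public IDatagramSocketBackEnd { /* ... */ };   // non-Windows build
```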

Given the success of this AFD back-end we’re now looking at a RIO version, just to see how that works out.

More on AFD