A multi-connection AFD-based echo server

Last time I looked at a way of using \Device\Afd to perform individual socket polling for readiness. This differed from the previous approach to using \Device\Afd, which batched up the sockets and issued a single poll for multiple sockets.

The individual socket polling approach appeals to me as it would appear to scale more easily, and putting together an echo server that supports multiple connections is now much easier. It doesn’t map as well to the way other operating systems do things, though, so if matching those designs is your primary goal, then you’re probably better off continuing with the ‘set of sockets’ approach.


Full source can be found here on GitHub.

This article refers to the socket_without_device_afd code.

This isn’t production code; error handling is simply “panic and run away”.

This code is licensed with the MIT license.

Comparing the code in socket_without_device_afd/echo_client and the previous approach in socket/echo_client, it’s fairly obvious that the new approach reduces the complexity somewhat. There’s less code, though most of the differences are in the tcp_socket class, which operates slightly differently.

The polling we were doing before this worked with a single handle to \Device\Afd which was associated with an I/O Completion Port. We only used the ‘per operation’ data, that is, the pointer to the ‘overlapped’ structure that we provided when we made the polling call and which is returned with the completion. This worked OK in our tests, but there were some mistakes being made.

Firstly, when polling we had to specify a flag, Exclusive, in the AFD_POLL_INFO structure. We were setting this to FALSE, which allowed us to poll multiple times for a single handle, with each poll treated separately. Unfortunately, we were polling using the same ‘per operation’ data multiple times, expecting this to result in one poll being set up. Instead, multiple polls were being set up and all of them could return when the required conditions were met. The correct approach for the design we had was to set the Exclusive flag to TRUE. With this change we get a single response to a poll, but issuing another poll on a given handle whilst a poll is already active results in the first poll completing as cancelled before the new poll is registered.
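To make the difference between the two settings concrete, here is a toy, in-memory model of the behaviour described above. It is not the AFD driver and it makes no real system calls; the class poll_model and its member names are hypothetical and exist only to show which completions we would expect to see in each case.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of the poll behaviour described in the text. This is NOT the
// AFD driver or any real API; every name here is hypothetical.
class poll_model {
public:
    // Issue a poll on the (single) modelled handle.
    // With exclusive == true, any poll that is already pending is completed
    // as "cancelled" before the new poll is registered.
    // With exclusive == false, the new poll is simply added to the others.
    void issue_poll(bool exclusive) {
        if (exclusive) {
            for (std::size_t i = 0; i < pending_; ++i) {
                completions_.push_back("cancelled");
            }
            pending_ = 0;
        }
        ++pending_;
    }

    // The polled-for condition is met: every pending poll completes.
    void signal() {
        for (std::size_t i = 0; i < pending_; ++i) {
            completions_.push_back("signalled");
        }
        pending_ = 0;
    }

    // Completions in the order in which they would be delivered.
    const std::vector<std::string>& completions() const { return completions_; }

private:
    std::size_t pending_ = 0;
    std::vector<std::string> completions_;
};
```

Issuing two non-exclusive polls and then signalling yields two ‘signalled’ completions, which is the mistake described above. Issuing two exclusive polls yields one ‘cancelled’ completion for the first poll followed by one ‘signalled’ completion for the second.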

Next, it makes more sense, when dealing with per-socket polling, to associate the socket with the IOCP using the ‘completion key’ parameter. This is an opaque value that is associated with the handle when you link the handle to the IOCP and which is then returned from each completion for that handle. This is ‘per device’ data and needn’t be convertible to an OVERLAPPED for correct operation. Using this approach, we can associate the afd_events interface on the tcp_socket class with the socket handle when we link it to the IOCP and then call into this interface to handle the completions when they are returned from GetQueuedCompletionStatus().
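The completion key round trip can be sketched in portable C++ as follows. This is a simplified stand-in rather than the code from the repository: the shape of afd_events, the handle_events() signature, the completion struct and the recording_connection class are all assumptions made for illustration. On Windows the key would be the ULONG_PTR passed to CreateIoCompletionPort() when the handle is associated with the port, and it would come back from GetQueuedCompletionStatus().

```cpp
#include <cstdint>
#include <queue>

// Hypothetical stand-in for the afd_events interface mentioned in the text;
// the real interface in the repository may look different.
struct afd_events {
    virtual void handle_events(std::uint32_t events) = 0;
    virtual ~afd_events() = default;
};

// Stand-in for one dequeued completion: the opaque per-handle key plus
// whatever events the poll reported.
struct completion {
    std::uintptr_t key;
    std::uint32_t events;
};

// At association time: turn the interface pointer into an opaque key.
inline std::uintptr_t to_completion_key(afd_events* handler) {
    return reinterpret_cast<std::uintptr_t>(handler);
}

// In the completion loop: turn the key back into the interface and call it.
// No OVERLAPPED pointer is needed to find the right object.
inline void dispatch(const completion& c) {
    reinterpret_cast<afd_events*>(c.key)->handle_events(c.events);
}

// Example handler that just records what it was told.
struct recording_connection : afd_events {
    std::uint32_t last_events = 0;
    int calls = 0;

    void handle_events(std::uint32_t events) override {
        last_events = events;
        ++calls;
    }
};
```

The design point is that the key identifies the socket object directly, so the completion loop needs no lookup table and no knowledge of the set of connections.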

Then there’s the question of whether it’s more efficient, from a system call perspective, to poll for each socket independently or to group sockets together and poll using a handle to \Device\Afd.

The independent polling approach requires a system call to poll and a system call to retrieve the poll completion. This results in two calls per poll whether or not we receive any events. Theoretically, we could issue a poll and then decide to poll for more, or fewer, event types whilst that poll is active; we would then have to deal with the cancellation completion for the first poll before dealing with any events returned from the second. In practice, we are likely to set up a poll once for a socket and only change the poll as a result of processing a completion, before we set up the next poll. The ‘cost’ of cancelling an incomplete poll is therefore unlikely to be incurred.

With polling for a set of sockets using a single handle we, at first, appear to have an advantage in working in batches of sockets. We have one system call to set up a poll for many sockets and one system call to retrieve completions for, potentially, many sockets. In practice, we would need to cancel polls whenever we changed the polling requirements for any socket in the set, which would, likely, be every time we get a completion for any socket in the set. This results in a poll per socket operation, which is the same as for the independent polling.

For completions, we can attempt to ameliorate the system call cost of retrieving completions by using GetQueuedCompletionStatusEx() to retrieve a variable number of completions with each call. Finally, if we enable FILE_SKIP_COMPLETION_PORT_ON_SUCCESS processing for the socket handle we can avoid an unnecessary system call when a poll completes immediately because the events it asks for are already pending.
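The arithmetic behind that comparison can be written down as a very rough cost model. The function system_calls() below is hypothetical and deliberately optimistic: it assumes one call to issue each poll, and that every retrieval call (for example a GetQueuedCompletionStatusEx() call) comes back with a full batch, which a real server will not always achieve.

```cpp
#include <cstdint>

// Best-case system call count for `polls` independent socket polls when up to
// `batch` completions are retrieved per call. `batch` must be at least 1.
// A rough model for reasoning only, not a measurement.
constexpr std::uint64_t system_calls(std::uint64_t polls, std::uint64_t batch) {
    // One call to issue each poll, plus ceil(polls / batch) retrieval calls.
    return polls + (polls + batch - 1) / batch;
}
```

With a batch size of 1 this gives the ‘two calls per poll’ figure from above, and larger batches only reduce the retrieval half of the cost.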

My gut feeling is that independent polling is no less efficient than polling for a set of handles together and is considerably easier to code for; at least for someone with a background in using ‘normal’ completion-style IOCP socket designs.

In our independent polling echo server we have a listening socket which is similar in design to the one in the previous example and an echo_server_connection which is a simple class wrapper around a tcp_socket which provides state for each socket connection. Each time the listening socket accepts a new connection it creates an instance of the echo_server_connection object and links its socket to the IOCP that it is using. We then issue a poll on the socket for events and wait for a completion. Each connection can easily be associated with the same IOCP and polled independently with no need for any code to manage the set of all the connections as a group.

Once we have a connection, we wait for a readability notification and then read data until we’ve either read all there is to read or we have filled our per connection buffer. We then echo this data back to the client by writing as much as we can. If we can’t write everything, we wait for the socket to become writeable and continue to read until our buffer is full.
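The per-connection logic can be sketched as a small state machine. This is a simplified model and not the echo_server_connection class from the repository: fake_socket is a hypothetical in-memory stand-in for a non-blocking socket, the 8 byte buffer is chosen only to keep the example small, and, as a simplifying assumption, the sketch does not read any more data while part of the buffer is still waiting to be written.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical in-memory stand-in for a non-blocking socket.
struct fake_socket {
    std::string incoming;         // bytes waiting to be read
    std::string outgoing;         // bytes that have been sent to the peer
    std::size_t send_window = 0;  // bytes a write accepts before it "would block"

    // Returns the number of bytes read; 0 means nothing to read right now.
    std::size_t read(char* dst, std::size_t max) {
        const std::size_t n = std::min(max, incoming.size());
        incoming.copy(dst, n);
        incoming.erase(0, n);
        return n;
    }

    // Returns the number of bytes accepted, which may be fewer than len.
    std::size_t write(const char* src, std::size_t len) {
        const std::size_t n = std::min(len, send_window);
        outgoing.append(src, n);
        send_window -= n;
        return n;
    }
};

// Which readiness event the connection wants to poll for next.
enum class wants { readable, writable };

class echo_connection {
public:
    explicit echo_connection(fake_socket& s) : socket_(s) {}

    // Readable: read until there is nothing left or the buffer is full,
    // then echo back as much as the socket will take.
    wants on_readable() {
        while (used_ < sizeof(buffer_)) {
            const std::size_t n = socket_.read(buffer_ + used_, sizeof(buffer_) - used_);
            if (n == 0) break;
            used_ += n;
        }
        return flush();
    }

    // Writable: continue the echo from where the last write stopped.
    wants on_writable() { return flush(); }

private:
    wants flush() {
        sent_ += socket_.write(buffer_ + sent_, used_ - sent_);
        if (sent_ < used_) return wants::writable;  // partial write, wait
        used_ = sent_ = 0;                          // buffer fully echoed
        return wants::readable;
    }

    fake_socket& socket_;
    char buffer_[8];
    std::size_t used_ = 0;  // bytes currently held in buffer_
    std::size_t sent_ = 0;  // bytes of buffer_ already echoed
};
```

Each handler returns the event that the next poll should ask for, which mirrors the ‘one poll and one completion per group of events’ pattern.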

This all works well, with one poll and one completion per group of events that we handle.


Wrapping up

The main question now is “is there any point?” If we’re not using a design that maps easily to the way other operating systems do this, then is there really any advantage in using readiness polling rather than completion handling? I think there may be. For one thing, dealing with TCP flow control using a standard completion-based design can be complex, as can be seen here. There’s also a possible performance gain for datagram sockets. We’re rarely interested in send completions and would usually want to pull all data off the wire in a tight loop, so a read readiness notification is likely more useful to us than a series of read completions, even if we retrieve them as a batch using GetQueuedCompletionStatusEx(). I won’t know any of this for sure until I can compare performance, but that can come next…

