Adventures with \Device\Afd

2023-04-14

I’ve been playing around with Rust recently and, whilst investigating asynchronous programming in Rust, I was looking at Tokio, an async runtime. From there I started looking at Mio, the cross-platform, low-level I/O code that Tokio uses.

For Windows platforms Mio uses wepoll, which is an implementation of the Linux epoll API for Windows and is based on the code that libuv uses for Node.js. This is NOT your standard high-performance Windows networking code, where overlapped operations are issued directly on sockets and completed via I/O completion ports; instead it uses the ‘sparsely documented’ \Device\Afd interface that lies below the Winsock2 layer.

I’ve played with libuv before and it was never to my liking, but after some random wandering I came across a very good and recent description of why wepoll works in the way it does and why the \Device\Afd interface might be worth looking at for async work.

TL;DR

I’ve started to boil the \Device\Afd code down to the bare essentials so that I can understand what’s going on without having to continually ignore other people’s coding styles and the APIs that they have built upon it.

The code is here on GitHub and I expect that this will develop into a series of articles that explore how best to take advantage of this approach.

But first you really should go and read “notgull”’s piece on \Device\Afd.

Overview

There’s a ‘sparsely documented’ interface to the Windows networking stack that is accessed via the \Device\Afd device driver’s ioctl interface. It can provide a “socket readiness” polling interface that promises to be as efficient as the normal completion-based interface. Whilst it’s possible to build readiness polling using the normal completion-based API, it’s harder and some things simply aren’t possible.

When working on the cross-platform code for The Server Framework I ended up with a hybrid approach that used the epoll API to drive a completion-based system above it, which was compatible with the ‘standard’ IOCP system on Windows. Being able to do either would be useful and, if nothing else, an interesting diversion.

Note

Please bear in mind that this stuff is new to me (even though the approach has been around for years). I’m still finding my way, the docs are very sparse and I’m working from other people’s understanding of the API, so much of this is based on assumption and intuition. Please feel free to tell me when I get things wrong!

Cruft-free programming

I think many programmers are happiest working with their own code, or at least, code that conforms to their own idea of ‘good’. Due to the variety of requirements, languages and platforms it’s a wonder any code is ever reused. I often find it far easier to understand code if I pull apart some reference material and rebuild it in “the one true way”; of course, that’s unlikely to meet your definition of “the one true way”.

With the code here I am looking to understand how best to use the \Device\Afd approach to the Windows networking stack, and so I have taken a look at wepoll and the Rust polling code, pulled out the stuff I’m interested in and rebuilt it without all of the added complexity. For me, at least, this makes the resulting code that actually does the work easier to reason about and understand. From there I can build my own abstractions above it so that other people can complain about my version too…

I had a quick look for more authoritative documentation or examples and didn’t find much that I felt like working from. I’m sure there are more authoritative sources out there but I’m not bothering to look for them just yet as the two sources that I am relying on are widely used and therefore reasonably trustworthy; also they’re both clearly Open Source and so free for me to explore and understand.

Opening the Afd device

Both of my sources use NtCreateFile() to open the Afd device. There’s some talk on Reddit that suggests that we might be able to use CreateFile() but I haven’t yet explored that route. To access NtCreateFile() we need to pull in winternl.h and link with ntdll.dll, and we need to work with the UNICODE_STRING and OBJECT_ATTRIBUTES types. These types are all slightly more complex than you might be used to when working with the Windows API, but they are easy to wrap up with something a little more user friendly if you want to. For now, and for simplicity of the code, I’m using them directly and avoiding the macros that wepoll uses.
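To illustrate the kind of "user friendly" wrapper that could sit over UNICODE_STRING, here is a minimal sketch. It is not taken from wepoll or from the code for this article. So that it compiles without the Windows SDK it uses a stand-in struct, CountedStringView, that mirrors the three fields of UNICODE_STRING, and char16_t in place of wchar_t; CountedName and its members are hypothetical names. The point it shows is that both length fields are byte counts and that the buffer has to outlive the structure that points at it.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in that mirrors the three fields of UNICODE_STRING from winternl.h.
// It exists only so the sketch builds without the Windows SDK; real code
// would use UNICODE_STRING itself.
struct CountedStringView
{
   std::uint16_t Length;         // size in bytes, terminator not included
   std::uint16_t MaximumLength;  // capacity of Buffer in bytes
   char16_t *Buffer;             // UTF-16 text
};

// Hypothetical wrapper: owns the text and builds the counted view over it.
class CountedName
{
public:
   explicit CountedName(std::u16string name)
      : m_name(std::move(name)),
        m_view{ ByteLength(m_name), ByteLength(m_name), &m_name[0] }
   {
   }

   // The view points into m_name, so a copy would point at the wrong buffer.
   CountedName(const CountedName &) = delete;
   CountedName &operator=(const CountedName &) = delete;

   const CountedStringView &View() const { return m_view; }

private:
   // Lengths are 16-bit byte counts, so reject names that will not fit.
   static std::uint16_t ByteLength(const std::u16string &name)
   {
      const std::size_t bytes = name.size() * sizeof(char16_t);

      if (bytes > std::numeric_limits<std::uint16_t>::max())
      {
         throw std::length_error("name too long for a 16-bit byte count");
      }

      return static_cast<std::uint16_t>(bytes);
   }

   std::u16string m_name;     // declared first so it is initialised first
   CountedStringView m_view;
};
```

Constructing a CountedName from u"\\Device\\Afd\\explore" gives a view whose Length is the character count multiplied by two, which is the same calculation that the article's code performs by hand with wcslen() and sizeof(wchar_t).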

Code

Full source can be found here on GitHub.

This isn’t production code; error handling is simply “panic and run away”.

This code is licensed with the MIT license.

Opening the device is as simple as this:

// Arbitrary name in the Afd namespace
static LPCWSTR deviceName = L"\\Device\\Afd\\explore";

const USHORT lengthInBytes = static_cast<USHORT>(wcslen(deviceName) * sizeof(wchar_t));

static const UNICODE_STRING deviceNameUString {
   lengthInBytes,
   lengthInBytes,
   const_cast<LPWSTR>(deviceName)
};

static OBJECT_ATTRIBUTES attributes = {
   sizeof(OBJECT_ATTRIBUTES),
   nullptr,
   const_cast<UNICODE_STRING *>(&deviceNameUString),
   0,
   nullptr,
   nullptr
};

HANDLE hAFD;

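IO_STATUS_BLOCK createStatusBlock {};

NTSTATUS status = NtCreateFile(
   &hAFD,
   SYNCHRONIZE,
   &attributes,
   &createStatusBlock,
   nullptr,                              // AllocationSize
   0,                                    // FileAttributes
   FILE_SHARE_READ | FILE_SHARE_WRITE,   // ShareAccess
   FILE_OPEN,                            // CreateDisposition
   0,                                    // CreateOptions
   nullptr,                              // EaBuffer
   0);                                   // EaLength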

if (status == 0)
{

Associating with an IOCP

We then associate the Afd handle with an I/O completion port that we can poll for results to our polling requests.

// Create an IOCP for notifications...

HANDLE hIOCP = CreateIOCP();

// Associate the AFD handle with the IOCP...

if (nullptr == CreateIoCompletionPort(hAFD, hIOCP, 0, 0))
{
   ErrorExit("CreateIoCompletionPort");
}

if (!SetFileCompletionNotificationModes(hAFD, FILE_SKIP_SET_EVENT_ON_HANDLE))
{
   ErrorExit("SetFileCompletionNotificationModes");
}

I would expect that we can also use FILE_SKIP_COMPLETION_PORT_ON_SUCCESS here, but I’ve not tested it yet.

In the real world the code above would be done once to set the system up. The next pieces of code represent what happens when we have a new socket that we want to poll.

Polling for events on a socket

wepoll suggests that the following events are available to us.

      constexpr ULONG events = 
         AFD_POLL_RECEIVE |                  // readable
         AFD_POLL_RECEIVE_EXPEDITED |        // out of band
         AFD_POLL_SEND |                     // writable
         AFD_POLL_DISCONNECT |               // client close
         AFD_POLL_ABORT |                    // closed
         AFD_POLL_LOCAL_CLOSE |              // ?
         AFD_POLL_ACCEPT |                   // connection accepted on listening
         AFD_POLL_CONNECT_FAIL;              // outbound connection failed

We register an interest in an event on a socket like this:

// This is information about what we are interested in for the supplied socket.
// We're polling for one socket, we are interested in the specified events
// The other stuff is copied from wepoll - needs more investigation

AFD_POLL_INFO pollInfoIn {};

pollInfoIn.Exclusive = FALSE;
pollInfoIn.NumberOfHandles = 1;
pollInfoIn.Timeout.QuadPart = INT64_MAX;
pollInfoIn.Handles[0].Handle = reinterpret_cast<HANDLE>(GetBaseSocket(s));
pollInfoIn.Handles[0].Status = 0;
pollInfoIn.Handles[0].Events = events;

// To make it clear that the inbound and outbound poll structures can be different
// we use a different one...

// As we'll see below, the status block and the outbound poll info need to stay
// valid until the event completes...

AFD_POLL_INFO pollInfoOut {};

IO_STATUS_BLOCK pollStatusBlock {};

// kick off the poll

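status = NtDeviceIoControlFile(
   hAFD,
   nullptr,                // Event: not used, completions arrive via the IOCP
   nullptr,                // ApcRoutine: not used
   &pollStatusBlock,       // ApcContext: comes back as lpOverlapped from GetQueuedCompletionStatus()
   &pollStatusBlock,       // IoStatusBlock
   IOCTL_AFD_POLL,
   &pollInfoIn,
   sizeof(pollInfoIn),
   &pollInfoOut,
   sizeof(pollInfoOut));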

if (status == 0)
{
   // It's unlikely to complete straight away here as we haven't done anything with
   // the socket, but I expect that once the socket is connected we could get immediate
   // completions and we could, possibly, set FILE_SKIP_COMPLETION_PORT_ON_SUCCESS for the
   // AFD association...

   cout << "success" << endl;
}
else if (status == STATUS_PENDING)
{

For people who have some experience of IOCP, things are starting to look a little familiar; here we have an “operation” which specifies two pieces of “per operation data”. The first is the status block. We pass its address twice: once as the ApcContext argument and once as the IoStatusBlock argument. The ApcContext value is what is returned to us, as the lpOverlapped value, in our call to GetQueuedCompletionStatus() when the poll completes or, presumably, is cancelled. The second is the pollInfoOut structure. This isn’t explicitly returned to us when the poll completes, and so in real code we will likely include both the status block and the info structure in a larger structure and then navigate from the status block to our larger structure, in much the same way that we’re used to navigating from an OVERLAPPED to an extended overlapped structure with normal IOCP designs.
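The "larger structure" idea can be sketched as follows. This is an illustration under stated assumptions rather than the code from the repository: StatusBlock, PollHandleInfo and PollInfo are stand-ins for IO_STATUS_BLOCK and AFD_POLL_INFO so that the sketch builds without the Windows SDK (the field names follow those used in the article's code, the layout is assumed), and PollOperation, pSocketContext and OperationFromStatusBlock are hypothetical names. What matters is that the status block is the first member, so a pointer to it is also a pointer to the enclosing operation.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Stand-in for IO_STATUS_BLOCK (assumed shape, for illustration only).
struct StatusBlock
{
   std::intptr_t Status;
   std::uintptr_t Information;
};

// Stand-in for one entry of AFD_POLL_INFO::Handles.
struct PollHandleInfo
{
   void *Handle;
   std::uint32_t Events;
   std::int32_t Status;
};

// Stand-in for AFD_POLL_INFO with room for a single handle.
struct PollInfo
{
   std::int64_t Timeout;
   std::uint32_t NumberOfHandles;
   std::uint32_t Exclusive;
   PollHandleInfo Handles[1];
};

// Hypothetical per-poll record: everything that must stay valid until
// the poll completes lives together in one allocation.
struct PollOperation
{
   StatusBlock statusBlock;   // must remain the first member
   PollInfo pollInfoOut;      // filled in when the poll completes
   void *pSocketContext;      // hypothetical pointer to per-socket state
};

static_assert(std::is_standard_layout<PollOperation>::value,
   "the cast below relies on PollOperation being standard layout");
static_assert(offsetof(PollOperation, statusBlock) == 0,
   "the status block must be the first member");

// Navigate from the status block pointer handed back by the completion
// port to the operation that contains it.
inline PollOperation *OperationFromStatusBlock(StatusBlock *pStatusBlock)
{
   return reinterpret_cast<PollOperation *>(pStatusBlock);
}
```

With real types the pointer that comes out of GetQueuedCompletionStatus() would first be cast to IO_STATUS_BLOCK * and then passed to a function like this one. The two static_assert lines are there so that reordering the members breaks the build rather than the cast.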

Receiving event notifications

The simplest event we can generate here is a connection failure event. We do that by trying to connect our socket to a port that isn’t listening and, eventually, we will get a completion from a call to GetQueuedCompletionStatus() on our IOCP.

int result = connect(s, (struct sockaddr*) &addr, sizeof addr);

if (result == SOCKET_ERROR)
{
   const DWORD lastError = WSAGetLastError();

   if (lastError == WSAEWOULDBLOCK)
   {
      cout << "connect would block" << endl;

      DWORD numberOfBytes = 0;
      ULONG_PTR completionKey = 0;
      OVERLAPPED *pOverlapped = nullptr;

      if (!::GetQueuedCompletionStatus(hIOCP, &numberOfBytes, &completionKey, &pOverlapped, INFINITE))
      {
         ErrorExit("GetQueuedCompletionStatus");
      }

      cout << "got completion" << endl;

      IO_STATUS_BLOCK *pStatus = reinterpret_cast<IO_STATUS_BLOCK *>(pOverlapped);

At this point we have the status block that we used when we started the poll and the pollInfoOut structure has been updated to hold details of the poll results. In this example the pollInfoOut.Handles[0].Events member now holds just AFD_POLL_CONNECT_FAIL.
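When experimenting it helps to print the returned Events mask as names rather than as a number. The following is a small sketch of how that might be done. The numeric values are the ones I believe wepoll defines for these flags and should be treated as an assumption to check against whichever header you are using; the afd_sketch namespace, EventName and DescribeEvents are hypothetical names, and plain integers are used so that the sketch builds without the Windows SDK.

```cpp
#include <cstdint>
#include <string>

namespace afd_sketch
{
   struct EventName
   {
      std::uint32_t bit;
      const char *name;
   };

   // Assumed values, believed to match wepoll's definitions; verify them
   // against your own header before relying on them.
   constexpr EventName kEventNames[] = {
      { 0x0001, "AFD_POLL_RECEIVE" },
      { 0x0002, "AFD_POLL_RECEIVE_EXPEDITED" },
      { 0x0004, "AFD_POLL_SEND" },
      { 0x0008, "AFD_POLL_DISCONNECT" },
      { 0x0010, "AFD_POLL_ABORT" },
      { 0x0020, "AFD_POLL_LOCAL_CLOSE" },
      { 0x0080, "AFD_POLL_ACCEPT" },
      { 0x0100, "AFD_POLL_CONNECT_FAIL" },
   };

   // Build a '|' separated list of the flag names set in 'events'.
   // Any bits that are not in the table are reported once as "unknown".
   inline std::string DescribeEvents(std::uint32_t events)
   {
      std::string result;

      for (const EventName &entry : kEventNames)
      {
         if (events & entry.bit)
         {
            if (!result.empty())
            {
               result += '|';
            }
            result += entry.name;
            events &= ~entry.bit;   // clear the bit so leftovers can be spotted
         }
      }

      if (events != 0)
      {
         if (!result.empty())
         {
            result += '|';
         }
         result += "unknown";
      }

      return result;
   }
}
```

In the connection failure example above, passing pollInfoOut.Handles[0].Events to afd_sketch::DescribeEvents() would be expected to produce the single name AFD_POLL_CONNECT_FAIL, assuming the values in the table match the header in use.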

Wrapping up

This simple example does just what it needs to demonstrate how the \Device\Afd means of accessing the Windows networking stack works. In later articles we’ll build something a little more useful and, eventually, we can start to measure performance and compare with other methods of network I/O on Windows.

Code is here