The design implications of FILE_SKIP_COMPLETION_PORT_ON_SUCCESS and the Vista overlapped I/O change

The latest release of The Server Framework gives you the option of building with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS enabled for various parts of the framework. You can enable it for stream and datagram sockets independently and also for the async file reader and writer classes. This option potentially improves performance by allowing overlapped I/O that completes straight away to be handled by the thread that issued the I/O request rather than by a thread that is servicing the I/O Completion Port that the socket or file handle concerned is associated with. This smooths thread scheduling by allowing the thread that is currently active to deal with the completion rather than forcing a context switch to an I/O thread.
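For reference, the flag itself is set on a per-handle basis with the SetFileCompletionNotificationModes() API. A minimal sketch follows; the helper name is mine rather than part of the framework's API, and note that the documentation warns that, for sockets, the mode is only safe to use if all of the installed layered service providers are IFS LSPs.

    #include <winsock2.h>
    #include <windows.h>

    // Illustrative helper, not part of The Server Framework's API: enable
    // "skip completion port on success" on a socket that is already
    // associated with an I/O Completion Port. Requires _WIN32_WINNT to be
    // 0x0600 (Vista) or later at compile time.
    bool EnableSkipCompletionPortOnSuccess(SOCKET s)
    {
       // With this mode set, an overlapped operation that succeeds
       // immediately does NOT queue a packet to the completion port; the
       // issuing thread must process the completion itself.
       return ::SetFileCompletionNotificationModes(
          reinterpret_cast<HANDLE>(s),
          FILE_SKIP_COMPLETION_PORT_ON_SUCCESS) != FALSE;
    }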

The next release of the framework will, finally, include code which will allow us to dispatch I/O requests from any thread rather than just from I/O threads. As I mentioned a while ago, Vista has relaxed the requirements on thread lifetime with regard to pending overlapped I/O. Prior to Vista the thread that issued an overlapped I/O request had to live until the I/O request completed; if the thread terminated first then the outstanding I/O request was cancelled. To work around this the server framework marshals all I/O requests to the I/O thread pool so that users of the framework don't have to worry about the thread lifetime issue (it's a pity that .Net didn't do something similar with its abstraction over overlapped I/O; see here for details of the kinds of problems that you can get without the marshalling).

Being able to dispatch an I/O request from any thread AND deal with the completion on that same thread has some implications for the design of servers that have been built on the server framework, but it offers the chance to reduce context switching considerably in some situations.

At present, with the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS changes disabled, the following sequence of events occurs when a read is issued and there is already data present. Let us assume that we have an arbitrary thread, thread 1, which is a non I/O pool thread that you have created. This thread calls Read() on a socket and the read request is marshalled to the I/O pool using the same I/O Completion Port that the socket is associated with for I/O completions. The request is handled by thread 2, one of the threads processing completions on that port, which retrieves the marshalled read request and actually calls WSARecv() on the socket. At this point thread 1 could exit and the I/O request would still complete and be handled correctly, even on pre-Vista systems. There is adequate data present in the TCP stack to fulfil the read request, so the call to WSARecv() returns SUCCESS and a completion packet is queued in the I/O Completion Port (see here for details). At this point thread 2 is finished; it loops back around to the I/O Completion Port and queries it for another request. It may get to handle the read completion, or another thread servicing the same I/O Completion Port may have already retrieved it; for the sake of argument we'll say that thread 3 gets to process the completion. The read has completed and the data is passed up through the server framework's filtering API and out into your callback handler. You process the data, realise that you need more data before you can work with it, and so issue a new Read() with the same buffer. Since thread 3 is an I/O pool thread there's no need to marshal the read request into the I/O pool, so it calls WSARecv() itself and we wait for more data to arrive and for the read to complete. Reading data that was already present in the stack required between 2 and 3 context switches to complete.
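For context, the shape of an I/O pool thread here is the familiar completion port loop; a simplified sketch, with illustrative names rather than the framework's actual classes:

    #include <windows.h>

    // Hypothetical handler for a dequeued completion.
    void HandleCompletion(ULONG_PTR key, OVERLAPPED *pOverlapped, DWORD bytes, BOOL ok);

    // Simplified I/O pool thread loop: with the skip-on-success mode
    // disabled, EVERY completion - even for an operation that succeeded
    // immediately - arrives here, so handling it costs a trip through the
    // port and, potentially, a context switch to whichever pool thread
    // dequeues it.
    void IoPoolThreadLoop(HANDLE iocp)
    {
       for (;;)
       {
          DWORD numberOfBytes = 0;
          ULONG_PTR completionKey = 0;
          OVERLAPPED *pOverlapped = 0;

          const BOOL ok = ::GetQueuedCompletionStatus(
             iocp, &numberOfBytes, &completionKey, &pOverlapped, INFINITE);

          if (!ok && !pOverlapped)
          {
             break;    // the port was closed or the wait failed
          }

          // Dispatch to the handler for this socket/operation; in the
          // scenario above this is where thread 3 picks up the read that
          // thread 2's WSARecv() queued.
          HandleCompletion(completionKey, pOverlapped, numberOfBytes, ok);
       }
    }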

Using 6.2 with the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS changes enabled would lead to this sequence of events. Everything up to the point where thread 2's call to WSARecv() completed with SUCCESS would be the same, but since FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is enabled we no longer get a completion queued and we must deal with the successful completion ourselves. This removes the potential context switch and the trip through the I/O Completion Port. The new read is now issued on thread 2 which, like thread 3, is an I/O pool thread and so can issue the next WSARecv() directly.
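In code the difference amounts to checking the return value of WSARecv() and handling a zero return inline; something like this, with hypothetical handler names:

    #include <winsock2.h>

    // Hypothetical handlers, not framework API.
    void HandleReadCompletion(SOCKET s, WSAOVERLAPPED *pOverlapped, DWORD bytes);
    void HandleReadError(SOCKET s, int lastError);

    // Issue a read on a socket that has FILE_SKIP_COMPLETION_PORT_ON_SUCCESS
    // enabled. A zero return means the read completed immediately and no
    // completion packet was queued, so we process it here, on the issuing
    // thread.
    void IssueRead(SOCKET s, WSABUF *pBuffer, WSAOVERLAPPED *pOverlapped)
    {
       DWORD bytes = 0;
       DWORD flags = 0;

       if (::WSARecv(s, pBuffer, 1, &bytes, &flags, pOverlapped, 0) == 0)
       {
          // Completed straight away; deal with it ourselves.
          HandleReadCompletion(s, pOverlapped, bytes);
       }
       else if (::WSAGetLastError() != WSA_IO_PENDING)
       {
          // A real failure; WSA_IO_PENDING simply means the completion
          // will arrive via the I/O Completion Port as usual.
          HandleReadError(s, ::WSAGetLastError());
       }
    }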

Using 6.3, with changes that take the Vista overlapped I/O changes into account, and running on Vista or later, there's no need for thread 1 to marshal the initial Read() through the I/O Completion Port to an I/O thread. The changes mean that if thread 1 calls WSARecv(), the call returns SOCKET_ERROR with WSAGetLastError() returning WSA_IO_PENDING, and thread 1 then terminates, the outstanding I/O will still complete normally and will NOT be cancelled as it would be on a pre-Vista OS. In our scenario the WSARecv() call issued by thread 1 completes with SUCCESS and, since FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is enabled, we can deal with the completion on thread 1. Then, since we're on Vista or later, we can issue the next WSARecv() from thread 1 as well. This means that there's no unnecessary context switching for successful I/O operations.
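The dispatch decision itself then becomes a simple runtime check; an illustrative sketch, not the framework's actual code, using the period-appropriate version check (Vista is Windows major version 6):

    #include <windows.h>

    // Hypothetical pieces for illustration only.
    class Connection;
    void IssueReadOn(Connection &connection);
    void MarshalToIoPool(Connection &connection);

    bool IsVistaOrLater()
    {
       OSVERSIONINFO osvi = { sizeof(OSVERSIONINFO) };
       return ::GetVersionEx(&osvi) && osvi.dwMajorVersion >= 6;
    }

    // On Vista or later an overlapped request can safely be issued from
    // ANY thread because the OS no longer cancels it when the issuing
    // thread exits; on earlier systems the request must still be
    // marshalled to a long-lived I/O pool thread.
    void DispatchRead(Connection &connection)
    {
       if (IsVistaOrLater())
       {
          IssueReadOn(connection);         // issue directly on this thread
       }
       else
       {
          MarshalToIoPool(connection);     // pre-Vista: thread lifetime matters
       }
    }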

There are, however, some issues as soon as you enable FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, let alone if you enable the Vista I/O changes that are likely to make it into 6.3.

The first is that recursion is now a real possibility. You call Read(), which eventually calls OnReadCompleted(), which calls Read(), which eventually calls OnReadCompleted(), etc. This was never possible before since you were guaranteed that WSARecv() would return to you and that the completion would arrive via another trip through the I/O Completion Port. Now, whilst there's data available to fulfil a read request your reads can complete straight away on the same thread that issued the WSARecv(), and this can lead to recursion that was never possible before. This hasn't proven to be an issue with any of the testing that I've done with 6.2, but it's one of the reasons that the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS support can be completely disabled in 6.2. It's only likely to be an issue on servers that do all of their work on their I/O threads; that is, it won't bite you (yet) if you use a "business logic" thread pool, as the Read() will always be marshalled to an I/O thread and that marshalling will prevent the recursion.
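One way to keep the stack bounded if you do work on the I/O threads is to turn the recursion into iteration: rather than re-issuing the read from inside the completion handler, loop for as long as reads keep completing inline. A sketch of the idea, assuming that OnReadCompleted() does not itself issue reads:

    #include <winsock2.h>

    // Hypothetical data handler, not framework API.
    void OnReadCompleted(SOCKET s, WSAOVERLAPPED *pOverlapped, DWORD bytes);

    // Avoids unbounded recursion when reads complete inline. Instead of
    // Read() -> OnReadCompleted() -> Read() -> ... we loop: issue the next
    // read and, if it completes straight away, handle it here with a flat
    // stack. We return as soon as a read goes asynchronous (the completion
    // will then arrive via the I/O Completion Port) or fails.
    void ReadLoop(SOCKET s, WSABUF *pBuffer, WSAOVERLAPPED *pOverlapped)
    {
       for (;;)
       {
          DWORD bytes = 0;
          DWORD flags = 0;

          if (::WSARecv(s, pBuffer, 1, &bytes, &flags, pOverlapped, 0) != 0)
          {
             return;   // pending or failed; either way, nothing more to do here
          }

          OnReadCompleted(s, pOverlapped, bytes);
       }
    }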

The second is that the order of completion processing when multiple reads are pending, or when you issue a read before you have finished processing the current read, is likely to be shaken up rather more than normal; using a single thread to service the I/O Completion Port will no longer ensure that you're safe from completion re-ordering. Consider a loaded server which has 5 read completions for a socket queued in the I/O Completion Port. The first of these is processed and issues a new Read(); there is data available, so the call completes straight away and now you're processing the data from read number 6 ahead of the data from reads 2 through 5. Whilst this situation has always been possible once you issue multiple reads with multiple threads servicing your I/O Completion Port, it's now also possible when you have but a single thread processing the IOCP.
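If ordering matters to you then the usual fix is to stamp each read with a per-connection sequence number and only deliver completions in sequence; a minimal sketch of the idea (none of this is the framework's actual API):

    #include <winsock2.h>
    #include <map>

    // Hypothetical representation of a completed read.
    struct ReadCompletion
    {
       WSAOVERLAPPED *pOverlapped;
       DWORD bytes;
    };

    // Per-connection re-sequencing of read completions. Each read is
    // stamped with the sequence number it was issued with; completions
    // that arrive out of order are parked until the missing ones turn up.
    // Callers must serialise access, e.g. with a per-connection lock.
    class CompletionSequencer
    {
    public:
       CompletionSequencer() : m_next(0) {}

       // Returns true if this completion is the next expected one and can
       // be processed immediately; otherwise it is parked for later.
       bool CanProcess(unsigned long sequence, const ReadCompletion &completion)
       {
          if (sequence == m_next)
          {
             ++m_next;
             return true;
          }
          m_parked[sequence] = completion;
          return false;
       }

       // After processing a completion, drain any parked completions that
       // are now in sequence.
       bool GetNext(ReadCompletion &completion)
       {
          std::map<unsigned long, ReadCompletion>::iterator it = m_parked.find(m_next);
          if (it == m_parked.end())
          {
             return false;
          }
          completion = it->second;
          m_parked.erase(it);
          ++m_next;
          return true;
       }

    private:
       unsigned long m_next;
       std::map<unsigned long, ReadCompletion> m_parked;
    };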

If the proposed Vista I/O changes from 6.3 are enabled then you lose control of which threads process I/O completions. This may, or may not, be important to you. Suppose you currently process all of your I/O on the I/O pool threads and the work that you do requires COM to be available and the threads concerned to be part of a particular apartment; or suppose you wish to limit the number of concurrent messages that can be processed regardless of the number of threads issuing I/O requests. You can no longer do either of those things, as any thread that issues an I/O request could potentially also handle the completion.
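If you do need completions to be handled on particular threads then you can buy the old behaviour back by re-posting completions yourself; for example, using a second completion port as nothing more than a thread-safe hand-off queue:

    #include <windows.h>

    // Hand a completion off to a dedicated pool - threads that have, say,
    // joined the right COM apartment - rather than processing it on
    // whichever thread happened to issue the I/O. The second completion
    // port is simply being used as a thread-safe queue.
    void HandOffCompletion(HANDLE workerPort, ULONG_PTR key, DWORD bytes, OVERLAPPED *pOverlapped)
    {
       // This deliberately re-introduces the context switch: you trade the
       // performance win back for control over which threads run your
       // completion handling.
       ::PostQueuedCompletionStatus(workerPort, bytes, key, pOverlapped);
    }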

There are probably more issues that I haven't come across yet, but hopefully I'll have identified and dealt with them all so that 6.3 can finally include support for the Vista I/O changes.