When are asynchronous file writes not asynchronous...

For some time I’ve had a class which implements asynchronous file writing. It’s the basis for a logging class that some of my clients use. The class builds on several of the classes in The Server Framework and uses I/O Completion Ports and a thread pool to manage the overlapped I/O. Due to the way that I structure my servers you can often share the I/O pool that you use for your overlapped log writes with the pool that deals with the socket I/O.

Because thread exit cancelled pending overlapped I/O in versions of Windows prior to Vista, I had put in some tricks so that the user of the async file writer class didn’t need to know or care whether the write they’d issued had actually completed before the thread that issued it terminated. All of the actual file writes were executed on a thread in the I/O pool; they were manually ‘marshalled’ to the I/O pool using the same I/O Completion Port that dealt with the I/O completions. This worked well. The thread that was doing the actual write operation simply passed the data to the I/O pool, which was a very quick operation, and the I/O pool then issued an async write and dealt with the completion. Of course you needed to wait for all the asynchronous operations to complete before you could shut down your program, but this was rarely a problem; the actual work got done much faster and the I/O happened in the background.
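
A minimal sketch of the marshalling idea is shown below; the WriteRequest structure, the completion key and the thread function are hypothetical names for illustration, not The Server Framework’s actual classes:

```cpp
#include <windows.h>

// Hypothetical request block; the real framework classes differ.
struct WriteRequest
{
   OVERLAPPED overlapped;   // offset fields assumed to be set up by the caller
   const char *pData;       // data to be written
   DWORD length;
};

// Completion key used to distinguish marshalled write requests from
// genuine I/O completions arriving on the same port.
static const ULONG_PTR MARSHALLED_WRITE = 1;

// Called on ANY thread: hand the request to the I/O pool. This is cheap
// and returns immediately; no file I/O happens on the calling thread.
bool QueueWrite(HANDLE iocp, WriteRequest *pRequest)
{
   return ::PostQueuedCompletionStatus(
      iocp, 0, MARSHALLED_WRITE, &pRequest->overlapped) != FALSE;
}

// Run on each I/O pool thread: for marshalled requests, issue the real
// overlapped write from this thread so that the lifetime of the original
// calling thread no longer matters (on pre-Vista Windows).
void IoPoolThread(HANDLE iocp, HANDLE file)
{
   DWORD bytes;
   ULONG_PTR key;
   OVERLAPPED *pOverlapped;

   while (::GetQueuedCompletionStatus(iocp, &bytes, &key, &pOverlapped, INFINITE))
   {
      if (key == MARSHALLED_WRITE)
      {
         WriteRequest *pRequest =
            CONTAINING_RECORD(pOverlapped, WriteRequest, overlapped);

         // The completion of this write arrives on the same port,
         // but with the completion key the file was associated with.
         ::WriteFile(file, pRequest->pData, pRequest->length, 0, &pRequest->overlapped);
      }
      else
      {
         // A real write completion; release the request, handle errors, etc.
      }
   }
}
```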

When I discovered the subtle thread exit related I/O change in Vista recently I decided to see if I could take advantage of it in my file writer class. The code already optimised the case where writes are issued from a thread in the I/O pool (see here for details), so I figured the easiest way to test the new Vista change was simply to add another test for when it was safe to issue the write from the calling thread. I was surprised to see that when we DID issue the write from the calling thread, all of the benefits of using overlapped I/O vanished.

I have a couple of performance tests that I run with my overlapped file writer. They build all of the required objects, start a timer, issue a number of writes, note down the time taken, then wait for all the writes to complete and note down the final time. Please ignore the actual figures that follow; it’s the differences between them that matter. For example, it might take 900ms to make the write calls and 36000ms to finish writing the data. My expectation for the change was that the time taken to make the write calls might go up fractionally (on the basis that calling WriteFile() was probably more time consuming than calling PostQueuedCompletionStatus()) but that the overall time taken should go down slightly (on the basis that we always ended up calling WriteFile() eventually and with the change in place we would never have to call PostQueuedCompletionStatus()).
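
For what it’s worth, the shape of the test is roughly this; the writer interface and its names here are hypothetical stand-ins, not my actual test code:

```cpp
#include <windows.h>
#include <stdio.h>

// Hypothetical interface standing in for the real writer class.
struct IAsyncFileWriter
{
   virtual void Write(const char *pData, DWORD length) = 0;
   virtual void WaitForAllWritesToComplete() = 0;
};

// Measure the two figures: time to issue the writes and time for
// all of them to actually complete.
void TimeWrites(IAsyncFileWriter &writer, const char *pData, DWORD length, size_t numWrites)
{
   const DWORD start = ::GetTickCount();

   for (size_t i = 0; i < numWrites; ++i)
   {
      writer.Write(pData, length);       // should return almost immediately
   }

   const DWORD issued = ::GetTickCount();

   writer.WaitForAllWritesToComplete();  // block until all the I/O is done

   printf("issue: %lums, complete: %lums\n",
      static_cast<unsigned long>(issued - start),
      static_cast<unsigned long>(::GetTickCount() - start));
}
```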

Unfortunately the change actually resulted in the two figures becoming the same, i.e. if it took 36000ms for the writes to complete it also took 36000ms to issue them. This implied that the actual writes were NOT in fact being issued in an asynchronous manner after all. I looked into this in the Richter book and did some googling and it seems that the answer to the question “When are asynchronous file writes not asynchronous…” is, unfortunately, “Most of the time”.

It seems that there are two main obstacles to actually writing to a file asynchronously. The first is that, for security reasons, Windows extends files in a synchronous manner. The idea is this: when a file is extended, the space between the place where the file used to end and the place where it now ends is first filled with zeros so that you can’t read the random rubbish that used to be there. This stops malicious code from ‘stealing’ data from disk sectors that are no longer in use (see here for an example). You can attempt to work around this (as Richter’s FileCopy example does) by setting the end of the file to a large value before you start and then trimming the file to the correct size when you finish. The idea is that if you do this the operating system will allow the writes to be asynchronous, as you’re not extending the file with the writes… However, the Richter example doesn’t work for me (on Windows Vista x64): simply calling SetFilePointer() and then SetEndOfFile() (as Richter’s example does) doesn’t seem to be enough. You also need to call SetFileValidData(), and to be able to call that you need SE_MANAGE_VOLUME_NAME enabled on your token before you open the file that you want to extend. You probably don’t want to grant this privilege without a lot of thought, as it makes it possible for malicious code to peek at the contents of deleted files.

In fact, to get the ‘extend the file first’ trick to work I had to do the following (there’s a sketch of the code after the list):

  • Enable the SE_MANAGE_VOLUME_NAME privilege.

  • Open the file.

  • Seek to the desired file length with SetFilePointerEx().

  • Set the end of file with SetEndOfFile().

  • Set the end of the valid data within the file with SetFileValidData().

  • Close the file.

  • Open the file to write.

  • Issue the writes.
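
Here’s a sketch of those steps in code, with error handling reduced to the bare minimum; the function name is mine and a real implementation needs rather more care:

```cpp
#include <windows.h>

HANDLE OpenPreExtendedFile(const wchar_t *pFileName, LONGLONG fileSize)
{
   // 1. Enable SE_MANAGE_VOLUME_NAME on the process token. This only
   // works if the account actually holds the privilege.
   HANDLE token;
   if (!::OpenProcessToken(::GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token))
      return INVALID_HANDLE_VALUE;

   TOKEN_PRIVILEGES tp = {};
   tp.PrivilegeCount = 1;
   tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

   if (!::LookupPrivilegeValueW(0, L"SeManageVolumePrivilege", &tp.Privileges[0].Luid) ||
       !::AdjustTokenPrivileges(token, FALSE, &tp, 0, 0, 0) ||
       ::GetLastError() != ERROR_SUCCESS)
   {
      ::CloseHandle(token);
      return INVALID_HANDLE_VALUE;
   }

   ::CloseHandle(token);

   // 2-5. Open the file, set its length and its valid data length.
   HANDLE file = ::CreateFileW(pFileName, GENERIC_WRITE, 0, 0, CREATE_ALWAYS,
      FILE_ATTRIBUTE_NORMAL, 0);

   if (file == INVALID_HANDLE_VALUE)
      return INVALID_HANDLE_VALUE;

   LARGE_INTEGER size;
   size.QuadPart = fileSize;

   if (!::SetFilePointerEx(file, size, 0, FILE_BEGIN) ||
       !::SetEndOfFile(file) ||
       !::SetFileValidData(file, fileSize))
   {
      ::CloseHandle(file);
      return INVALID_HANDLE_VALUE;
   }

   // 6-7. Close and reopen for the overlapped writes; trim to the
   // actual data length with SetEndOfFile() when you're done.
   ::CloseHandle(file);

   return ::CreateFileW(pFileName, GENERIC_WRITE, 0, 0, OPEN_EXISTING,
      FILE_FLAG_OVERLAPPED, 0);
}
```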

Extending the file before writing does cause the write operations to occur asynchronously but the performance improvement isn’t that great. Write calls might take 32000ms to complete and the actual writes would then complete in something like 36000ms… The point is not the improvement itself but the fact that if you don’t take the steps shown above then the file writes are actually synchronous anyway!

So far I’ve only been using FILE_FLAG_OVERLAPPED. If you add FILE_FLAG_NO_BUFFERING then you are telling the cache manager not to use its buffers (there’s more to it than that, go look it up). There are plenty of restrictions: you MUST write in multiples of the volume’s sector size and the buffers that you use must be suitably aligned, so it isn’t always possible or practical to use this flag. It had the largest effect on file write performance of anything I tried, but it only really works if your I/O is operating asynchronously, so you pretty much have to set the end of file (as detailed above) first… Setting the end of file and using FILE_FLAG_NO_BUFFERING might bring the figures down to 1100ms and 12000ms.
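
A sketch of what that might look like; I satisfy the alignment requirement here with VirtualAlloc()’s page alignment, and in real code you’d query the actual sector size (GetDiskFreeSpace() reports it) rather than assuming one:

```cpp
#include <windows.h>

// Open for unbuffered overlapped writes. Add FILE_FLAG_WRITE_THROUGH
// as well if the data must be on disk before the completion fires
// (see below for what that costs).
HANDLE OpenUnbuffered(const wchar_t *pFileName)
{
   return ::CreateFileW(pFileName, GENERIC_WRITE, 0, 0, OPEN_EXISTING,
      FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, 0);
}

// Buffers must be sector aligned and writes issued in whole multiples
// of the sector size; VirtualAlloc() returns page-aligned memory,
// which satisfies any common sector size.
void *AllocateAlignedBuffer(SIZE_T size)
{
   return ::VirtualAlloc(0, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}
```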

Where you’ll see some really big differences is if you also specify FILE_FLAG_WRITE_THROUGH. This tells the file system that the write isn’t complete until the data is on the disk. In my tests this increased the time until the writes completed by almost 10 times if you weren’t pre-extending the file and by around 3 times if you were. As before, you got the best asynchronous performance if you pre-extended the file and used FILE_FLAG_NO_BUFFERING. The point is not to suggest that you use FILE_FLAG_WRITE_THROUGH unless you really need to, but that if you do need it then using the two techniques above will dramatically improve your performance.

What I’ve found interesting is that my original design has consistently performed better than most of my attempts at improving things. Issuing the write operations on an I/O thread, by marshalling the required data using the I/O Completion Port, is ALWAYS faster from the caller’s point of view than doing the write operation on the calling thread. For the logging system that’s the most important thing; for other file writing code it’s still pretty important. The best way to increase the speed at which the writes actually complete is to pre-extend the file AND use FILE_FLAG_NO_BUFFERING, but these changes are pretty specialist and I don’t think I have many situations where I can make use of the extra speed. The other thing is that, even with all of these improvements, the original design can still outperform the new design if Windows happens to feel like it. It seems that if you catch the cache manager at the right time you can get significantly better performance (2400ms vs. 36000ms) without doing ANYTHING special; but, of course, you can’t rely on that. Sometimes you might get 2400ms and sometimes you might get 36000ms; it all depends on what else is going on.

Now that the code is written I think I’ll keep it in, but the default settings will be as before. It was an interesting journey.