IOCP performance tweaks and repeatable perf logging

2010-05-16

Whilst profiling my new low contention buffer allocator in a ‘real world’ situation recently I spent quite a lot of time trying to work out what I was looking for and when it should happen. Since the allocator improves performance by reducing lock contention, it only shines when there IS lock contention. I added some more monitoring to the existing buffer allocator so that it can indicate when it’s going to block during an allocation or release operation: it calls TryEnterCriticalSection() and, if the try fails, sends a monitoring notification and then calls EnterCriticalSection(). This gave me a nice jiggly line in perfmon to show the contention. Next I added the context switches/sec counters for the threads in the application under test and, of course, the three processor usage counters (user, kernel and total) for the process. With these added to the thread activity counters, the various queue length counters and the events/sec counters, I could see what the server was doing and where it was doing it.
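The try-then-block monitoring can be sketched roughly as follows. This is only an illustration, not the framework's code: it uses a portable std::mutex as a stand-in for the CRITICAL_SECTION, and ContentionMonitor and MonitoredLock are hypothetical names made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the allocator's monitoring interface; a real
// implementation would feed a perfmon counter rather than a plain atomic.
struct ContentionMonitor
{
    std::atomic<unsigned long> contended{0};

    void OnLockContention() { contended.fetch_add(1, std::memory_order_relaxed); }
};

// Scoped lock that tells the monitor whenever the lock could not be taken
// immediately, i.e. whenever the calling thread is about to block.
// Note: std::mutex is not recursive (a CRITICAL_SECTION is), and the standard
// allows try_lock() to fail spuriously, which would be counted as contention.
class MonitoredLock
{
public:
    MonitoredLock(std::mutex &mutex, ContentionMonitor &monitor) : m_mutex(mutex)
    {
        if (!m_mutex.try_lock())          // plays the role of TryEnterCriticalSection()
        {
            monitor.OnLockContention();   // the monitoring notification
            m_mutex.lock();               // plays the role of EnterCriticalSection()
        }
    }

    ~MonitoredLock() { m_mutex.unlock(); }

    MonitoredLock(const MonitoredLock &) = delete;
    MonitoredLock &operator=(const MonitoredLock &) = delete;

private:
    std::mutex &m_mutex;
};
```

An allocate or release operation would create a MonitoredLock at the top of its critical region. The uncontended path costs one extra failed-or-successful try, and the counter only moves when a thread would otherwise have blocked silently.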

I still had a hard time getting the results that I wanted though ;) I had to push the server much harder than I had been for the new allocator to shine.

I set up a test client on my network using the server test program that I spoke about here. This let me set up a test which created 5000 connections and pushed 250 messages of 1024 bytes on each connection, sending the next message as soon as the echo from the previous one arrived. This gave me 90% utilisation of the 1Gb link into the machine running the server under test and middling CPU usage on the client machine. It’s important not to push client CPU or network utilisation too high, as you then start introducing ‘strangeness’ and variability that tend to make each run different. Likewise there’s little point in stressing the machine under test to 100% CPU, as you then find that various other things on the box are using CPU differently on each test run…
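To get a feel for the size of that test, the arithmetic can be written out as below. This is a back-of-envelope sketch only: it assumes the 1Gb link means 10^9 bits per second and treats the whole 90% as message payload, ignoring TCP/IP and Ethernet overhead, so the time it gives is a rough lower bound for one direction of the echo. The function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Test shape taken from the description above.
constexpr std::uint64_t kConnections = 5000;
constexpr std::uint64_t kMessagesPerConnection = 250;
constexpr std::uint64_t kMessageBytes = 1024;

// Total messages sent by the client across all connections.
constexpr std::uint64_t TotalMessages()
{
    return kConnections * kMessagesPerConnection;
}

// Total payload bytes sent in one direction (the echo doubles the traffic,
// but the link is full duplex so each direction carries this much).
constexpr std::uint64_t TotalPayloadBytes()
{
    return TotalMessages() * kMessageBytes;
}

// Rough lower bound on the run time, in seconds, at a given link speed and
// utilisation. Assumption: utilisation is all payload, with no protocol overhead.
constexpr double MinSecondsOnLink(double linkBitsPerSecond, double utilisation)
{
    return static_cast<double>(TotalPayloadBytes()) * 8.0 /
           (linkBitsPerSecond * utilisation);
}
```

That works out at 1,250,000 messages and 1,280,000,000 payload bytes each way, so MinSecondsOnLink(1e9, 0.9) comes to a little over 11 seconds.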

The server design that I was using was a hybrid of one of my standard example servers. It has a business logic thread pool to deal with the actual work and an I/O pool to deal with I/O; it parses complete messages on the I/O threads and then dispatches them to the business logic threads via an IOCP. Initially it was hard to generate the required contention on the buffer allocator, but increasing the number of threads in each pool to 4 (this is an 8 core box) and adding some spurious buffer allocation and release calls to the work that both pools do seemed to help. In a more real world scenario the buffer allocator might be used to provide fixed size memory allocation for purposes other than data transmission and/or more data copying might be occurring (the client for which I’m doing this work has a server that stresses the allocator a lot because it uses it extensively for memory required by its reliable UDP implementation).

The results look good; as long as the server is configured appropriately the new allocator is considerably better. Once I had that test working I decided to add a few more counters to the server. I’d recently added the “IO events per second” counter to the I/O Completion Port that is used for the socket I/O, but I wanted another counter that would let me look at the number of I/O events that I was posting to that port to marshal the reads and writes to the I/O pool. I added the counter, looked at the results of the test, and then adjusted the code to incorporate the changes that I’ve been thinking about to take advantage of the IOCP threading changes in Vista and later. With the new counter I could clearly see the effect and the performance change; it’s well worth doing and it’ll be in 6.3.
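A counter like the one for posted events can be sketched as a running total that is bumped on every post and turned into a rate when sampled. This is a simplified illustration with a hypothetical EventRateCounter class; a real perfmon counter would normally expose just the raw total and let perfmon work out the per-second figure.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical events/sec counter: any thread may call Increment(), a single
// monitoring thread calls Sample() at a regular interval.
class EventRateCounter
{
public:
    // Called once per event, for example each time a read or write request
    // is posted to the completion port.
    void Increment() { m_total.fetch_add(1, std::memory_order_relaxed); }

    // Returns the events per second since the previous call to Sample(),
    // given the time that has elapsed since then.
    double Sample(double elapsedSeconds)
    {
        const std::uint64_t total = m_total.load(std::memory_order_relaxed);
        const std::uint64_t delta = total - m_lastSampled;
        m_lastSampled = total;
        return elapsedSeconds > 0.0 ? static_cast<double>(delta) / elapsedSeconds : 0.0;
    }

private:
    std::atomic<std::uint64_t> m_total{0};
    std::uint64_t m_lastSampled = 0;   // only touched by the sampling thread
};
```

Keeping one of these for completions taken off the port and another for events posted to it makes it easy to see how much of the port's traffic is the server marshalling work to itself.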

I then did some tests with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS enabled, and the new counters clearly showed the effect and also the performance change; again, pretty considerable.
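For anyone who hasn't used the flag: it is set per handle with SetFileCompletionNotificationModes() (Vista and later), and once it is set an overlapped call that completes immediately no longer queues a completion packet, so the caller has to process the result there and then. The sketch below shows that decision. NextStep, ClassifyOverlappedResult(), EnableSkipOnSuccess() and IssueRead() are hypothetical names for illustration and this is not the framework's code; the Windows-only part is guarded so that the decision logic stands on its own.

```cpp
#include <cassert>

// What the caller must do after issuing an overlapped operation on a handle
// that has FILE_SKIP_COMPLETION_PORT_ON_SUCCESS set.
enum class NextStep { HandleInline, WaitForCompletionPacket, Fail };

// Numeric value of ERROR_IO_PENDING (WSA_IO_PENDING is defined as the same value).
constexpr unsigned long kErrorIoPending = 997;

// callSucceeded is true when the API reported immediate success.
constexpr NextStep ClassifyOverlappedResult(bool callSucceeded, unsigned long lastError)
{
    if (callSucceeded)
    {
        return NextStep::HandleInline;             // no packet will be queued to the IOCP
    }
    if (lastError == kErrorIoPending)
    {
        return NextStep::WaitForCompletionPacket;  // completes via the IOCP as usual
    }
    return NextStep::Fail;                         // the operation was never started
}

#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>

// Turn the flag on for a socket that is already associated with an IOCP.
bool EnableSkipOnSuccess(SOCKET s)
{
    return ::SetFileCompletionNotificationModes(
        reinterpret_cast<HANDLE>(s),
        FILE_SKIP_COMPLETION_PORT_ON_SUCCESS) != FALSE;
}

// Issue an overlapped read and report how its completion must be handled.
// On HandleInline the byte count can be fetched with WSAGetOverlappedResult().
NextStep IssueRead(SOCKET s, WSABUF *buffer, WSAOVERLAPPED *overlapped)
{
    DWORD flags = 0;
    const int result = ::WSARecv(s, buffer, 1, nullptr, &flags, overlapped, nullptr);
    const unsigned long error =
        (result == 0) ? 0UL : static_cast<unsigned long>(::WSAGetLastError());
    return ClassifyOverlappedResult(result == 0, error);
}
#endif
```

The saving comes from the HandleInline path: the issuing thread carries on with the data it already has instead of a packet being queued and another thread picking it up.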

All of this got me thinking that these kinds of performance tests should be made available to users and potential users of the framework, and also that they should be easy to repeat so that they can be run as part of each release to see how things are improving (and to spot if performance regresses). This led me to discover logman, the command line interface for setting up perfmon logs, and relog, a way to selectively upload these traces to a database.

So far things are looking good; I can set up repeatable performance tests which start perf counter traces before they begin and then squirt the results into the database for analysis… More, and the results of these tests, later…