Unexpected causes of poor datagram send performance

I’m still working on my investigation of the Windows Registered I/O network extensions, RIO, which I started back in October when they became available with the Windows 8 Developer Preview. I’ve improved my test system a little since I started and now have a point-to-point 10 Gigabit network between my test machines, using two Intel 10 Gigabit AT2 cards wired back to back.

My test system isn’t symmetrical; that is, I have a much more powerful machine on one end of the link than on the other. This makes it somewhat difficult to really push the powerful machine (which is currently the only one with Windows Server 2012 RC installed on it). I’m planning to install Windows Server 2012 onto the lower-spec machine so that I can push both RIO and the standard network APIs harder: more data coming in from the network, less CPU to process it. Before I installed the new OS I decided to see how hard I could push the 10 Gigabit link from the more powerful machine. Initially the answer was: no harder than from the lower-powered machine…

The strange thing about the datagram send performance on the powerful machine was that the CPU wasn’t anywhere near maxed out and that memory usage was creeping up slowly as the test programs ran; not the memory usage of the simple, synchronous UDP traffic generator programs, but the overall system memory usage. When I drilled down into this some more I found that the memory usage was all in non-paged pool… The test ran 10 copies of the single-threaded, synchronous UDP traffic generator, used around 80% CPU and generated around 6Gbps of network traffic. The remote machine was running nothing at all to receive the datagrams but was using over 50% of its CPU just to process interrupts, and it was more sluggish than you’d expect for only 50% CPU usage.
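The traffic generator itself is nothing clever; a minimal sketch of the kind of single-threaded, synchronous sender involved looks something like this (the target address, port and datagram size below are placeholders rather than the values used in the tests):

```cpp
// Minimal sketch of a single-threaded, synchronous UDP traffic generator.
// The target address, port and datagram size are illustrative placeholders.

#include <winsock2.h>
#include <ws2tcpip.h>

#include <cstdio>

#pragma comment(lib, "ws2_32.lib")

int main()
{
   WSADATA wsaData;

   if (0 != WSAStartup(MAKEWORD(2, 2), &wsaData))
   {
      return 1;
   }

   SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

   if (INVALID_SOCKET == s)
   {
      return 1;
   }

   sockaddr_in addr = {};

   addr.sin_family = AF_INET;
   addr.sin_port = htons(4444);                          // placeholder port

   inet_pton(AF_INET, "192.168.0.2", &addr.sin_addr);    // placeholder target

   char buffer[1024] = {};                               // placeholder datagram size

   // Blocking send loop; each sendto() completes synchronously before the
   // next datagram goes out, so the generator only runs as fast as the
   // stack and the NIC allow.

   for (;;)
   {
      if (SOCKET_ERROR == sendto(
         s,
         buffer,
         sizeof(buffer),
         0,
         reinterpret_cast<sockaddr *>(&addr),
         sizeof(addr)))
      {
         printf("sendto failed: %d\n", WSAGetLastError());
         break;
      }
   }

   closesocket(s);

   WSACleanup();

   return 0;
}
```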

I checked the NIC driver settings on the remote machine and noticed that interrupt moderation was off. Interrupt moderation allows the driver to coalesce interrupts so that the NIC generates an interrupt only every X datagrams rather than one for every datagram. This increases latency but reduces the number of interrupts that the host’s CPU needs to deal with. I turned this on and set the moderation rate to “adaptive”, which allows the driver to decide how best to moderate the interrupt rate. Unfortunately this didn’t help matters very much. The remote machine was fractionally more responsive but was still using 50% CPU for interrupts. I set the interrupt moderation rate to “Extreme” and CPU usage dropped to 35-40%, network utilisation on both machines dropped to 4.3Gbps and non-paged pool usage on the sending machine started to rise faster than before…
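For anyone who wants to log the interrupt load on a machine during a run, rather than just watching it, something along these lines will sample the “% Interrupt Time” performance counter via PDH; the counter path and the one-second interval are simply illustrative:

```cpp
// Sketch: sample the processor's "% Interrupt Time" counter via the PDH API.
// The counter path and one-second sample interval are illustrative choices.

#include <windows.h>
#include <pdh.h>

#include <cstdio>

#pragma comment(lib, "pdh.lib")

int main()
{
   PDH_HQUERY query = nullptr;

   PDH_HCOUNTER counter = nullptr;

   if (ERROR_SUCCESS != PdhOpenQuery(nullptr, 0, &query))
   {
      return 1;
   }

   // Use the English counter path so the sketch also works on localised systems.

   if (ERROR_SUCCESS != PdhAddEnglishCounter(
      query,
      TEXT("\\Processor(_Total)\\% Interrupt Time"),
      0,
      &counter))
   {
      return 1;
   }

   PdhCollectQueryData(query);      // the first sample just primes the counter

   for (;;)
   {
      Sleep(1000);

      PdhCollectQueryData(query);

      PDH_FMT_COUNTERVALUE value;

      if (ERROR_SUCCESS == PdhGetFormattedCounterValue(
         counter,
         PDH_FMT_DOUBLE,
         nullptr,
         &value))
      {
         printf("%% Interrupt Time: %.1f\n", value.doubleValue);
      }
   }
}
```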

Whilst I could understand why the remote machine’s network utilisation might be reduced by the change in settings, I didn’t understand why the sender should suddenly send less data simply because the remote machine’s settings had changed. I looked through the NIC driver settings again and noticed the “Flow Control” setting. The help for this says: “A link partner can become overloaded if incoming frames arrive faster than the device can process them, and this results in the frames being discarded until the overload condition passes. The flow control mechanism overcomes this problem and eliminates the risk of lost frames. If a potential overload situation occurs, the device generates a flow control frame, which forces the transmitting link partner to immediately stop transmitting and wait a random amount of time before trying to retransmit.” Both machines had this enabled, which explained why the reduction in processing on the remote machine caused the sender to slow down as well; I had been expecting the sender simply to send at whatever rate it could and for the excess datagrams to be discarded.

Turning off the flow control setting on one side of the connection removed the “lock step” link between the sender’s network utilisation and the receiver’s. Datagrams were being lost, of course, but at least I could now see what the sender was capable of. I now had a sender pushing 7-8.5Gbps using 70-80% CPU, but it still used steadily more non-paged pool, though at a much lower rate than before.
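The non-paged pool growth is easy enough to watch with the usual tools, but if you want to log it over the course of a run, a sketch like the following, using GetPerformanceInfo(), is one way to sample the system-wide figure (the one-second interval is, again, illustrative):

```cpp
// Sketch: log system-wide non-paged pool usage with GetPerformanceInfo()
// whilst the traffic generators run. The one-second interval is illustrative.

#include <windows.h>
#include <psapi.h>

#include <cstdio>

#pragma comment(lib, "psapi.lib")

int main()
{
   for (;;)
   {
      PERFORMANCE_INFORMATION pi = {};

      pi.cb = sizeof(pi);

      if (GetPerformanceInfo(&pi, sizeof(pi)))
      {
         // KernelNonpaged is reported in pages, so convert to bytes and then MB.

         const SIZE_T nonPagedBytes = pi.KernelNonpaged * pi.PageSize;

         printf(
            "Non-paged pool: %llu MB\n",
            static_cast<unsigned long long>(nonPagedBytes / (1024 * 1024)));
      }

      Sleep(1000);
   }
}
```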

After some trial and error with the NIC driver settings on the sending machine I finally managed to remove the non-paged pool usage issue. Turning off UDP checksum generation on the NIC increased network utilisation fractionally and removed the non-paged pool “leak” entirely. I expect that the amount of traffic generated was too much for the NIC to handle whilst also doing checksum generation, and that this caused a lag in the completion of the send requests within the driver, which in turn led to the non-paged pool usage. Since the host computer had CPU to spare it made sense to move the checksum generation off the NIC and back onto the host.

None of this was especially intuitive to me. It helped, somewhat, that someone on Stack Overflow was having a similar problem with identical hardware, but mostly it was trial and error.

Of course, turning off flow control doesn’t improve the number of datagrams that get to the destination, but it does remove the chance that the sender will use an uncontrolled amount of non-paged pool. The lack of NIC flow control may impact switches or routers (if I weren’t directly connected to the other NIC) but, even if it did, I’d think that the non-paged pool issue on the sender was more important as it affects the stability of that box. It’s well known that flow control, if left unaccounted for, can be an enemy to scalability with asynchronous networking APIs; I was surprised to see that it can also be a problem with synchronous APIs, when the processor in the NIC is overloaded or when the NIC provides flow control but doesn’t cause a synchronous send to block whilst that flow control is in operation.