Practical Testing: 21 - Looking at Performance and finding a leak

Back in 2004, I wrote a series of articles called “Practical Testing” where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since the original articles there have been several bug fixes and redesigns all of which have been supported by the original unit tests and many of which have led to the development of more tests.

The tests were written for the pretty scalable timer queue implementation in my high performance, scalable, server framework. The timer queue has been present in the framework ever since I first had to associate timeouts with overlapped I/O requests back on NT 4.0. Back then the Windows Timer Queue didn’t exist and so I rolled my own. As you’ll see from the previous entries in this series, the class has gone through some changes over time. The queue works well for dealing with all manner of varied timeouts and is pretty general purpose in nature; I use it for setting timeouts to roll log files to a new name in my rotating log file class, I use it for all manner of per connection based time situations and it scales well; there’s no problem having 30,000 connections all with a 2 minute inactivity timeout set, etc.

The problem with it is that it could perform better. The general purpose nature of the queue means that it needs to be flexible; it can handle timeouts from 1ms to 49.7 days in a reasonably time and space efficient manner. I use STL containers to index the timers and that works pretty well. As a general purpose solution it’s pretty good. The problem is that it’s not good enough, at least it’s not fast enough for situations where I’m using the timers to implement retransmission timers for reliable UDP protocols. In these situations I have lots of timers (one per connection, and the aim is to support lots of concurrent connections) for generally short periods of time, 50ms to 30seconds. Since the timers are used for retransmission and data flow pacing they tend to expire (rather than being reset and rarely expiring as is the case with inactivity timers), they also tend to set a new timer when the current one expires. I’ve found that the general purpose nature of the timer queue and the use of STL means that there’s a lot of contention going on for the timer system and that contention affects performance.

The thread safe version of the timer queue protects the internal data structures with a lock and this lock needs to be acquired to do anything to the queue. So we lock when we create a new timer, we lock when we set a timer, we lock when the timer thread is processing an expired timer, etc. Obviously we try and hold this lock for as short a time possible and obviously we try and access the queue from as few threads as possible but eventually the contention starts to bite as connections reset their timers and timers expire almost constantly to drive rate limited flow queues and retransmission, etc. Although in this particular scenario the timer expiry code is quick (we lock, remove the expired timer and the callback simply pushes the user data into a queue of work items that are processed by another thread, we then release the lock) the code still causes the lock to be acquired and released for each timer that expires. In the reliable UDP scenario we have lots of timers set for exactly the same time and so the thread that processes the expired timers spends a lot of time acquiring and releasing the lock on the queue. Since the expiry of a timer often causes a new timer to be set we then have other threads processing the work items and setting new timers which also requires that we lock the queue. So all threads that access the queue are in contention for the lock; there’s not much that we can do about that apart from reducing the amount of time spent holding the lock.

Unfortunately it’s worse than that. Since we’re using the STL to implement our timer queue and since setting, expiring and cancelling a timer all result in adjustments being made to our central STL map object the threads that use our timer queue are also in contention with all other threads in the system that use the same memory heap as our timer queue uses. Each operation results in dynamic memory allocation and/or release.

My super secret game company client need their ENet implementation to run fast and to support 1000s of concurrent connections. In profiling their system under load the timer queue is one of the hot spots and the locks that it uses are showing some of the most contention in the system. Because of this I need to take a look at what we’re doing and potentially move from a general purpose solution to something a little more specific.

My first thoughts were to add some explicit monitoring to the timer queue, performance counters are your friends, and so a monitoring interface was born and the threaded callback timer queue was adjusted so that it could give an idea of the contention for its lock by using TryEnterCriticalSection() so that we could track contention (and follow such failures to acquire with a plain ol’ EnterCriticalSection(). The code looks something like this and whilst the code changes will, no doubt, change the way the contention bites it gives some indication of the contention being experienced.

void CThreadedCallbackTimerQueue::SetTimer(
   Timer &timer,
   const Milliseconds timeout,
   const UserData userData)
   ICriticalSection::PotentialOwner lock(m_criticalSection);
   if (!lock.TryEnter())
   ICriticalSection::Owner lock(m_criticalSection);
   m_spTimerQueue->SetTimer(timer, timeout, userData);

The next step was to add some performance tests to the timer queue’s test suite. Simple things such as measuring the time taken to set timers, cancel them, create them and expire them; obviously we repeat the operation a large number of times and repeat the whole test several times and take the average result. This gives us some figures for the current implementation, the performance tests for general performance improvement in the algorithm and the contention figures for some indication of whether the performance improvements are actually helping in the real-world usage scenario.

Whilst writing the monitoring test I decided that there was a leak in the timer queue for “one shot” timers that were active when the timer queue was destroyed. I added a new monitoring function to track when the internal timer data was actually deleted as this isn’t quite the same as when DestroyTimer() is called and anyway DestroyTimer() is never called for “one shot” timers. The resulting trace from the mock monitor proved that the leak that I had thought I had found didn’t exist. Since the monitor, with the new deletion monitoring functionality could prove that all timer data was cleaned up correctly I decided to add the monitor to all tests and added a simple call to check that the number of calls to OnTimerCreated() equalled the number of calls to OnTimerDeleted(). This addition to the tests located a memory leak in the code that dealt with destroying a timer handle during timeout processing. I was calling DeleteAfterTimeout() in CCallbackTimerQueueBase::DestroyTimer() rather than calling SetDeleteAfterTimeout(). What’s interesting to me is that this leak wasn’t located by any of the tests even when running the tests under a leak checking tool such as BoundsChecker. The reason that BoundsChecker failed to report on it is that it couldn’t handle the concept of casting dynamically allocated memory to an opaque handle and storing the handle in a map for later clean up; I guess I can’t blame it really… Since it would always complain that the data that is allocated in CCallbackTimerQueueBase::CreateTimer() was leaked even though it was cleaned up correctly later on I added an entry to the BoundsChecker suppression file; although most of these complaints are invalid, this one WAS valid and the data WAS being leaked. Once again I’m reminded that it doesn’t matter how many tests you have, how much coverage you have and even what tools you use the weak link is always the human being in the process… I’ve changed how the handle map works so that it stores the typed data rather than the opaque handle. This means that BoundsChecker CAN follow the ownership correctly and can now report on the real leak and stay quiet about the things that it thought we leaks but that weren’t. I then fixed the actual bug… In summary, don’t suppress the warning, understand it and change the code to fix it.

The code is here and new rules apply. A fair while has passed since the previous episode in this series of articles. My build environment, and some of the support code has changed a fair bit since then. The code will build with VS.Net 2002, VS.Net 2003, VS 2005, VS 2008 and VS 2010. The code builds as x86 or x64 with VS 2005, 2008 and 2010. Win32Tools is the workspace that you want and Win32ToolsTest is the project that you should set as active. The code will build with either the standard STL that comes with Visual Studio or with a version of STL Port. The code uses precompiled headers the right way so that you can build with precompiled headers for speed or build without them to ensure minimal code coupling. The various options are all controlled from the “Admin” project; edit Config.h and TargetWindowsVersion.h to change things… By default the project is set to build for Windows 7; this will mean that the code WILL NOT RUN on operating systems earlier than Windows Vista as it will try and use GetTickCount64() which isn’t available. To fix this you need to edit the Admin\TargetWindowsVersion.h file and change the values used; see Admin\TargetWindowsVersion_WIN2K.h and Admin\TargetWindowsVersion_WINXP.h for details. Since I’m looking at the performance of the code I’ve adjusted the VS2008 solutions to build without checked iterators in release mode, by default the STL in VS2008 builds with checked iterators enabled in release mode as well as in debug mode and this adversely affects performance; see here for more details. Although the code builds with earlier versions of Visual Studio I’m not actively using these versions and so there may be standard VS STL related performance tweaks that could be applied. I’m only focussing on VS2008 and VS2010 with my testing. This code is in the Public Domain.