Practical Testing: 22 - Performance: Some you win...

The previous article in the “Practical Testing” series set things up so that we could measure the performance of the code under test, with the intention of improving performance for a specific set of use case scenarios. This time around I’ll make a few changes which I hope will improve performance, and measure their effects with the performance tests that I added last time.

One of the things that struck me about the code when I was looking at it during my profiling runs for my client is that when using the CThreadedCallbackTimerQueue in “do not hold your lock whilst dispatching timers” mode, the lock is acquired and released for each timer that is dispatched. This is less than optimal: the timer dispatch code is designed to dispatch all of the timers that are currently due, so it could, theoretically, extract all of them when CCallbackTimerQueueBase::BeginTimeoutHandling() is called and process all of them when CCallbackTimerQueueBase::HandleTimeout() is called. Instead it extracts them one at a time. The reasons for this are that a) the Begin/End style of timeout handling was a late addition to the design, and b) it has always worked well enough in the past, so why change it?
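To make the locking pattern concrete, here’s a minimal sketch of the shape of the current dispatch loop. The names (SimpleTimerQueue and friends) and the use of std::mutex are illustrative stand-ins rather than the real CThreadedCallbackTimerQueue interface; the point is simply that the lock is taken and released once per due timer rather than once per batch.

```cpp
#include <mutex>
#include <queue>

// Hypothetical, simplified shape of the current dispatch pattern; the real
// CThreadedCallbackTimerQueue interface differs, but the locking pattern is
// the same: one lock/unlock cycle per due timer rather than one per batch.
class SimpleTimerQueue
{
   public :

      typedef void (*TimerCallback)();

      void SetTimer(TimerCallback callback)
      {
         std::lock_guard<std::mutex> guard(m_lock);

         m_dueTimers.push(callback);
      }

      void HandleTimeouts()
      {
         for (;;)
         {
            TimerCallback callback = 0;

            {
               // The lock is acquired once per timer...
               std::lock_guard<std::mutex> guard(m_lock);

               if (m_dueTimers.empty())
               {
                  break;
               }

               callback = m_dueTimers.front();   // ...to extract ONE due timer

               m_dueTimers.pop();
            }

            callback();   // dispatch without the lock held
         }
      }

   private :

      std::mutex m_lock;

      std::queue<TimerCallback> m_dueTimers;
};
```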

At the moment I use a std::multimap for storing timers. This indexes the timers based on absolute timeout, and the use of a std::multimap allows me to have multiple timers set for the same time. It also allows me to retain an iterator to the newly inserted timer so that timer cancellation is an O(1) operation. Insertion and timeout expiry processing are O(log n) due to the balanced binary tree lookups involved.

To improve timeout processing speed I’d like to be able to do a single lookup that returns all of the timers with a given timeout, rather than one lookup per timer. Theoretically I could instead adjust the interface and expose a GetNextTimeout() function which could use the current timer’s iterator as a hint to locate the next one, avoiding subsequent lookups, but that would still leave us with the fine grained locking issue whereby callers only process a single timer with each Begin/Handle/End sequence. Since std::multimap doesn’t support removal of all of the values at a given key whilst still leaving them in some form of container that we can manipulate as one, I need to switch from using a std::multimap to using a std::map where the values are a std::set or a std::deque of the timers at that timeout value; the sketch below shows the shape of the change.

The std::deque implementation is fractionally more complex: with the std::set we can use the insertion iterator to locate our node for cancellation, whereas with a std::deque we need to use the offset at which we inserted the timer and then set that element to null when we cancel the timer. Knowing when we can delete the std::deque, should all of the timers for that timeout be cancelled, also requires a book-keeping counter. The complexity of the std::deque code demanded a couple of new tests…
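To illustrate, here’s a minimal sketch of the shape of the container change, assuming a hypothetical TimerData type; the real code stores rather more per timer, but the container layout is the point:

```cpp
#include <map>
#include <set>

struct TimerData;   // hypothetical per-timer state; details don't matter here

typedef unsigned long long AbsoluteTimeout;

// Before: one multimap entry per timer; expiry processing performs one
// O(log n) lookup per due timer.
typedef std::multimap<AbsoluteTimeout, TimerData *> TimerQueueBefore;

// After: one map entry per distinct timeout value; a single O(log n) lookup
// at expiry yields ALL of the timers due at that time, which can then be
// extracted and processed as one.
typedef std::set<TimerData *> TimersAtTimeout;

typedef std::map<AbsoluteTimeout, TimersAtTimeout> TimerQueueAfter;
```

Note that setting a timer in the new structure can now touch the allocator twice: once for the map node, when this is the first timer at a given timeout, and once for the set (or deque) node. That extra allocator traffic matters later when we look at contention.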

The code for the std::set based implementation can be found here.

The code for the std::deque based implementation can be found here.

The same rules as last time apply.

As expected, the time taken to handle the timeouts drops, but the time taken to set and cancel a timer increases. We’re doing more work when setting timers because we have to allocate space in the new structure as well as in the map that we were using before. Cancellation likewise takes longer in some circumstances due to the corresponding deallocation.

The increased performance for handling timeouts may well be worth the performance degradation in setting and cancelling timers in some situations. What these tests don’t show, though, is the effect of cross-thread lock contention; these changes have made that slightly worse.

Let’s use a new notation to discuss the contention issues; I’ll call it “Big C notation” ;). For a given operation that accesses a shared lock we have a worst case contention value of C. In a piece of code where a single thread performs an operation involving a single lock, but where no other threads can ever perform the operation, the operation can be classified as C(0): no contention. If we have a single shared lock then operations involving the lock are C(n), where n is the number of threads that can perform the operation. For operations that access multiple locks, the overall contention value can be represented either as the highest value of C across the locks involved, or as the sum of the contention; so an operation involving two locks which are C(2) and C(5) could be either C(5) or C(2)+C(5). I think I prefer the latter as it shows the number of times that the contention factor can bite: an operation that is C(2)+3C(5)+C(1) is obviously more contention prone, and representing it as simply C(5) would be somewhat misleading…

In our timer queue the queue operations themselves are C(n1), where n1 is the number of threads that access the queue. Queue operations that involve dynamic memory allocation or release are also C(n2), where n2 is the number of threads that access the heap used for allocation. Thus any operation on the queue that involves dynamic memory allocation degrades from C(n1) to C(n1)+C(n2), where n2 is always assumed to be at least equal to n1 but is probably higher. The latest changes to add the std::set or std::deque make this worse by converting the contention value to C(n1)+2C(n2), as we now need to access the STL’s allocator twice…
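To see where the two allocator hits come from, consider setting a timer in the new structure; again the types are the hypothetical ones from the earlier sketch, repeated so that the fragment stands alone:

```cpp
#include <map>
#include <set>

struct TimerData;   // hypothetical per-timer state

typedef std::set<TimerData *> TimersAtTimeout;

typedef std::map<unsigned long long, TimersAtTimeout> TimerQueue;

void SetTimer(
   TimerQueue &queue,
   const unsigned long long timeout,
   TimerData *pTimer)
{
   // First potential allocator hit: operator[] allocates a new map node
   // (holding an empty set) if this is the first timer at this timeout.
   TimersAtTimeout &timers = queue[timeout];

   // Second allocator hit: inserting into the set allocates a set node.
   timers.insert(pTimer);
}
```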

From this it’s quite clear that to reduce contention we should use a private heap that is only used by a specific instance of a timer queue; this would mean that n2 is always the same as n1. Whilst providing a private heap and custom allocators for the STL types would be a worthwhile exercise, I can’t help thinking that, at this point, there’s a better way to solve the actual problem that I’m facing with regard to the specific use case of the timer queue.
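For completeness, here’s a minimal sketch of what such a private-heap allocator might look like on Win32, using ::HeapCreate() and ::HeapAlloc(). This is a C++11-style minimal allocator and is illustrative only; a production version would need to manage the heap’s lifetime and think harder about failure handling and allocator propagation.

```cpp
#define NOMINMAX
#include <windows.h>

#include <new>

// A minimal STL allocator backed by a private Win32 heap so that only the
// threads which use a given container contend on that heap's internal lock.
// Illustrative only; heap lifetime and failure handling are glossed over.
template <typename T>
class PrivateHeapAllocator
{
   public :

      typedef T value_type;

      explicit PrivateHeapAllocator(HANDLE heap) : m_heap(heap) {}

      template <typename U>
      PrivateHeapAllocator(const PrivateHeapAllocator<U> &rhs) : m_heap(rhs.m_heap) {}

      T *allocate(const size_t n)
      {
         void *p = ::HeapAlloc(m_heap, 0, n * sizeof(T));

         if (!p) throw std::bad_alloc();

         return static_cast<T *>(p);
      }

      void deallocate(T *p, size_t /*n*/)
      {
         ::HeapFree(m_heap, 0, p);
      }

      HANDLE m_heap;
};

template <typename T, typename U>
bool operator==(const PrivateHeapAllocator<T> &lhs, const PrivateHeapAllocator<U> &rhs)
{
   return lhs.m_heap == rhs.m_heap;
}

template <typename T, typename U>
bool operator!=(const PrivateHeapAllocator<T> &lhs, const PrivateHeapAllocator<U> &rhs)
{
   return !(lhs == rhs);
}
```

A timer queue instance would then create its own heap with ::HeapCreate(0, 0, 0) and hand the allocator to its containers. Note that a pre-C++11 STL would also need rebind and the other traditional allocator members.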

There really should be no need to do any dynamic memory allocation at all, apart from to create and destroy the timer handle itself, and even that could be avoided if we take a similar approach to the Win32 Critical Section API. Using the STL is convenient, but the fact that the containers are non-invasive (that is, you don’t have to change a type or have it derive from a specific type to be able to place it into a container) means that book-keeping data needs to be allocated outside of the type that’s being stored, to provide, for example, the space for the pointers that connect nodes in a std::map together. I’ve already implemented an invasive container in the form of CNodeList, which is used for buffer and socket allocation tracking. The advantage of an invasive container is that the object being stored can include space for the data that is needed to store it in the container. This is especially beneficial when you’re inserting and removing the same object time and again, as the book-keeping data is only ever allocated and released once, as part of the object that you wish to store. I’m actually quite surprised that the STL didn’t include invasive containers, as the non-invasive containers could have been implemented in terms of the invasive ones.
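As a minimal sketch of the invasive idea (the names are illustrative, not the real CNodeList interface): the stored type derives from the node type and so carries its own link pointer, which means insertion and removal never allocate.

```cpp
// A minimal invasive (intrusive) list sketch, illustrative only and not the
// real CNodeList interface. The stored type derives from Node and so carries
// its own link pointer: insertion and removal never allocate memory.
class NodeList
{
   public :

      class Node
      {
         public :

            Node() : m_pNext(0) {}

         private :

            friend class NodeList;

            Node *m_pNext;
      };

      NodeList() : m_pHead(0) {}

      void PushNode(Node *pNode)
      {
         pNode->m_pNext = m_pHead;

         m_pHead = pNode;
      }

      Node *PopNode()
      {
         Node *pNode = m_pHead;

         if (pNode)
         {
            m_pHead = pNode->m_pNext;

            pNode->m_pNext = 0;
         }

         return pNode;
      }

   private :

      Node *m_pHead;
};

// An object that can live in a NodeList without any per-insertion allocation.
class TimerHandle : public NodeList::Node
{
   // the timer's own state would live here...
};
```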

Anyway, in an ideal world, with a custom invasive container, the timer handle could include links for the nodes of the tree that it’s stored in for searching, and links to the other timers at this timeout (the equivalent of the std::set or std::deque in the latest timer queue). Since the timer handle itself would contain space for these links there would be no need for allocation or deallocation when setting, cancelling or expiring timers. This would remove all of the potential allocation contention and leave us with operations which are C(n), where n is the number of threads that can access the timer queue… Insertion and timeout handling would still be O(log n) due to the balanced binary tree lookups required, but a) the potential contention would be lower, and b) the constant factor hidden in the O would be much smaller, as in general all you’re doing once you find the place in the tree is adjusting the linking pointers.
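Here’s a sketch of what such a timer handle’s layout might look like, assuming a red-black style balanced tree; all of the names are hypothetical. The handle carries both the tree links and the links for the list of timers sharing its timeout, so no queue operation needs to allocate:

```cpp
// Hypothetical layout for an invasive timer handle; all names illustrative.
// One handle per distinct timeout value sits in the balanced tree; further
// timers at the same timeout hang off it in an intrusive doubly-linked list,
// so no queue operation ever needs to allocate or release memory.
struct TimerHandle
{
   // Links for the balanced search tree, keyed on absolute timeout.
   TimerHandle *pParent;
   TimerHandle *pLeft;
   TimerHandle *pRight;
   bool isRed;                          // node colour, for a red-black tree

   // Links for the list of timers at the SAME timeout; the equivalent of
   // the std::set or std::deque in the latest timer queue.
   TimerHandle *pNextAtSameTime;
   TimerHandle *pPrevAtSameTime;

   // The timer's own state.
   unsigned long long absoluteTimeout;
   void (*pOnTimer)(void *pContext);    // hypothetical callback signature
   void *pContext;
};
```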

This change would mean replacing the STL containers with a custom balanced search tree implementing a multi-map where you can remove all items at a specific key with a single operation and where you’re left with a list of timers at that key. That’s a fairly major change but the performance improvements are likely to be pretty good. However, there’s a simpler data structure that might work even better for my specific use case scenario. We’ll look at that next time.