Practical Testing: 34 - Potential reentrant locking deadlock
Back in 2004, I wrote a series of articles called “Practical Testing” where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since the original articles there have been several bug fixes and redesigns all of which have been supported by the original unit tests and many of which have led to the development of more tests.
In the previous update in July 2014 I added custom intrusive containers to replace the stl containers and reduce memory allocation contention and improve performance. Since then the code has been in regular use by clients and users of The Server Framework and has stayed pretty stable.
Recently a client has upgraded some rather old code to a new version of The Server Framework and triggered a bug that has lain dormant in the timer queue since release 6.6 of The Server Framework when I added some new locking classes. Most clients haven’t triggered this bug because they don’t use the code in this particular way.
So, in this instalment I detail the bug and present some tests which reveal it. The fix is pretty simple, we simply have to change the lock type used by the threaded timer queue so that it allows reentrant locking.
A non-reentrant lock
The CLockableObject
class is a replacement for my CCriticalSection
both provide locking functionality but the CLockableObject
uses a SRWL
(Slim Reader Writer Lock) for exclusive locking which works out to be faster than using a CRITICAL_SECTION
. The downside of the SRWL
is that it’s not a reentrant lock and if you try and obtain it whilst you’re already holding it you will deadlock. The functionality that we require for the fully featured CLockableObject
class includes the ability to “TryEnter” and as this wasn’t available until after Windows Vista we have some conditional compilation that degrades the implementation of a CLockableObject
back to using a CRITICAL_SECTION
if the required functionality isn’t available. This means that you can switch to using the new lock type sooner and reap the rewards when you build for the correct version of Windows… The downside of this flexibility is that you may find that your new lock is used in a reentrant fashion when it shouldn’t and it doesn’t cause problems because it’s using the reentrant CRITICAL_SECTION
implementation rather than the full featured SRWL
version…
My new lock type is supposed to be non reentrant but sometimes it isn’t. Code that works when it isn’t will fail when it is.
Since many users of the framework don’t have as much unit testing as they should I added some functionality to the CLockableObject
so that it could warn of attempts at reentrant use. This means you can start using it and replacing some of your CCriticalSections
with it with the reentrancy checks enabled and you’ll get warnings at run-time if the lock you’ve replaced WAS used in a reentrant fashion after all…
The bug
Early on in this series of articles I pointed out that the threading of the timer queue was orthogonal to the timer management. To that end the threading was split into a derived class, CThreadedCallbackTimerQueue
this is the class that is thread safe and contains the lock in question. Originally this class used to hold its lock during timer dispatch but this was a source of lock inversions and so changes were applied to allow you to dispatch without holding the lock. This complicated the dispatch somewhat and reduced performance a little and so the two options were available so that you could decide what was most important to you. The problem is that when the lock is held during timer dispatch it’s not possible to manipulate a timer during timer handling without reentrant locking of the queue’s lock. There are no tests for this. With a reentrant lock there is no problem. With a non-reentrant lock you get a deadlock.
Adding tests for this is fairly easy. We simply need to perform an action on the timer queue whilst dispatching timers. For a test which uses a queue that holds the lock during timer dispatch we’ll get a failure and for the test that doesn’t we’ll get the results that we expect. Adjusting the mocks so that we can specify that a timer calls SetTimer()
from within OnTimer()
enables us to write such a test.
I’ve chosen to write the test that should fail so that it can currently pass with the broken functionality as long as we have the debug option enabled that allows us to track reentrant lock use. This debug option results in an exception being thrown when we attempt the reentrant lock acquisition and I’ve adjusted the test to notice this. This proves we have a problem.
Source code
The code is here on GitHub and new rules apply. A fair while has passed since the previous episode in this series of articles. My build environment, and some of the support code has changed a fair bit since then. The code will build with Visual Studio 2010, 2012, 2013 and 2015. Win32Tools is the solution that you want and Win32ToolsTest is the project that you should set as active. The code will build with either the standard STL that comes with Visual Studio or with a version of STL Port. The code uses precompiled headers the right way so that you can build with precompiled headers for speed or build without them to ensure minimal code coupling. The various options are all controlled from the “Admin” project; edit Config.h
and TargetWindowsVersion.h
to change things… By default the project is set to build for Windows 7; this will mean that the code WILL NOT RUN on operating systems earlier than Windows Vista as it will try and use GetTickCount64()
which isn’t available. To fix this you need to edit the Admin\TargetWindowsVersion.h
file and change the values used; see Admin\TargetWindowsVersion_WINXP.h
for details. This code is in the Public Domain.
Fixing the bug
Fixing this bug is pretty simple. I just need to change the lock type used so that it supports reentrant use. Once that’s done the test needs to be updated so that the test that uses the locking dispatch queue is broadly similar to the test that uses the non-locking dispatch queue. The code is here on GitHub.