Practical Testing: 18 - Removing the potential to deadlock
Back in 2004, I wrote a series of articles called “Practical Testing” where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since then there have been various changes and fixes and redesigns all of which were made considerably easier due to the original tests.
There has always been a potential for deadlock when using the timer queue, which is unfortunate but something that you can avoid if you know the rules. The problem is that if you have a lock in your code which you hold when calling into the timer queue and you also try and acquire your lock when the timer queue calls into you in OnTimer()
and you’re using the CThreadedCallbackTimerQueue
then you will deadlock as the threaded timer queue has its own lock which it acquires when you call into the queue and which it holds when it calls into you via OnTimer()
. As I pointed out in Part 12, the threading is orthogonal, so it’s only the CThreadedCallbackTimerQueue
that has this problem, but this is probably the class you’re most likely to use!
So, this time around we’ll adjust the CThreadedCallbackTimerQueue
so that it doesn’t have to hold its lock when it calls back into you and thus we’ll remove the potential for deadlock and make the code much easier to use correctly. In addition we’ll make it easier to use either the new CCallbackTimerQueueEx
or the original CCallbackTimerQueue
with CThreadedCallbackTimerQueue
.
The first problem that we’ll address is being able to use either implementation of the timer queue with the threaded timer queue. Right now the problem is that we’re tied to the concrete object of CCallbackTimerQueue
. We can break that tie in the usual way by slipping in an interface. I don’t want to extend IQueueTimers
as that’s the interface that clients of the timer queue use and it’s not appropriate for a client of the queue to be able to do the things that we want the CThreadedCallbackTimerQueue
to be able to do. So, IManagerTimerQueue
will extend IQueueTimers
and add the support that we need to be able to compose timer queues from other queues. Initially that support is for GetNextTimeout()
and HandleTimeouts()
. Once we add this interface and adjust the two existing queues to implement it and the threaded queue to work in terms of it it’s easy to add some configuration flags to the threaded queue’s constructor so that we can select the queue that we want to use when we create the threaded queue.
The lock holding problem is slightly harder to address. The problem exists because of this code in Run()
in the threaded timer queue:
while (!m_shutdown)
{
const Milliseconds timeout = GetNextTimeout();
if (timeout == 0)
{
CCriticalSection::Owner lock(m_criticalSection);
m_timerQueue.HandleTimeouts();
}
else
{
m_stateChangeEvent.Wait(timeout);
}
}
To maintain the thread safety of the data structures inside the queue we lock our critical section when we call HandleTimeouts()
. We also lock around all other calls into the queue implementation. Inside the implementation, in HandleTimeouts()
, we do this:
void CCallbackTimerQueue::HandleTimeouts()
{
while (0 == GetNextTimeout())
{
TimerQueue::iterator it = m_queue.begin();
TimerData *pData = it->second;
m_queue.erase(it);
Handle handle = reinterpret_cast<handle>(pData);
MarkHandleUnset(handle);
pData->OnTimer();
if (pData->IsOneShotTimer())
{
DestroyTimer(handle);
}
}
}
which means that our lock is held when OnTimer()
is called.
The first step in solving this problem is to break the timer processing into several functions and separate the pieces that require that they’re locked to maintain thread safety from the pieces that do not require locking. Logically we need the queue to expose something like this; BeginTimeoutHandling()
, HandleTimeout()
and EndTimeoutHandling()
where HandleTimeout()
doesn’t need a lock to be held to maintain thread safety.
Our new interface looks a bit like this:
class IManageTimerQueue : public IQueueTimers
{
public :
virtual Milliseconds GetNextTimeout() = 0;
virtual void HandleTimeouts() = 0;
typedef ULONG_PTR * TimeoutHandle;
static TimeoutHandle InvalidTimeoutHandleValue;
virtual TimeoutHandle BeginTimeoutHandling() = 0;
virtual void HandleTimeout(
TimeoutHandle &handle) = 0;
virtual void EndTimeoutHandling(
TimeoutHandle &handle) = 0;
virtual ~IManageTimerQueue() {}
};
TimeoutHandle
is a new type of handle to a timeout which is only used for processing timeouts that have happened.
With the new interface we can write the timeout handling in our threaded queue like this:
while (!m_shutdown)
{
const Milliseconds timeout = GetNextTimeout();
if (timeout == 0)
{
IManageTimerQueue::TimeoutHandle handle = IManageTimerQueue::InvalidTimeoutHandleValue;
{
CCriticalSection::Owner lock(m_criticalSection);
handle = m_pTimerQueue->BeginTimeoutHandling();
}
if (handle != IManageTimerQueue::InvalidTimeoutHandleValue)
{
m_pTimerQueue->HandleTimeout(handle);
CCriticalSection::Owner lock(m_criticalSection);
m_pTimerQueue->EndTimeoutHandling(handle);
}
}
else
{
m_stateChangeEvent.Wait(timeout);
}
}
We hold our lock to obtain a handle to a timeout that needs to be dispatched, unlock, dispatch the timeout and then acquire our lock again to complete the dispatch. Now we just have to make the implementations operate correctly in terms of the new interface.
Note that the queue must be able to operate correctly when any of the timer queue manipulation functions are called whilst a timer is being dispatched. So, for example, if we have called BeginTimeoutHandling()
and then call SetTimer()
or CancelTimer()
for the timer that is currently being processed by BeginTimeoutHandling()
then things should work as expected. Luckily, due to the fact that we have separated out the threading from the logic of the queue we can write tests for these situations pretty easily. With the tests in place, and failing due to lack of implementation, we can begin to implement the changes.
At first glance, HandleTimeouts()
just needs to be broken into three pieces; BeginTimeoutHandling()
needs to do this
if (0 == GetNextTimeout())
{
TimerQueue::iterator it = m_queue.begin();
TimerData *pData = it->second;
m_queue.erase(it);
Handle handle = reinterpret_cast<handle>(pData);
MarkHandleUnset(handle);</handle>
HandleTimeout()
needs to do this:
pData->OnTimer();
and EndTimeoutHandling()
needs to do this:
if (pData->IsOneShotTimer())
{
DestroyTimer(handle);
}
Unfortunately things aren’t quite that simple… The tests that we wrote earlier for making sure that timer dispatch can be safely interrupted by other calls show that SetTimer()
could cause the timer to be updated whilst it is being dispatched and, if the user data or timer callback is changed in the call to SetTimer
then the timer that is being dispatched might be the wrong one! Even worse, DeleteTimer()
may destroy the timer that we’re in the middle of dispatching!
To work around these issues we need to adjust the data that we store for each timer. By separating the data that is used during dispatch from the data that is used when the timer is set and queued we can allow a call to SetTimer()
to update a timer and requeue it whilst it’s being dispatched. We then need to check to see if a timer is being dispatched when DeleteTimer()
is called for it and if so we should simply note that the timer has been deleted and delete it once the dispatch is complete. See the code for the detail.
Since our new CThreadedCallbackTimerQueue
works in terms of IManagerTimerQueue
we can add a constructor to allow you to plug your own implementation of IManagerTimerQueue
in if you like, and once we’ve done that we can test the CThreadedCallbackTimerQueue
more easily because we can give it a mock of IManagerTimerQueue
.
The duplication in the code still bothers me, so I expect the next instalment will deal with that!
Code is here and the new rules apply. I’ve changed the way the test harness operates recently, it now runs all tests and logs the failures and skipped tests, rather than stopping as soon as there’s a single failure. By default the project is set to build for Windows Vista; this will mean that the code WILL NOT RUN on earlier operating systems as it will try and use GetTickCount64()
which isn’t available. To fix this you need to edit the Admin\TargetWindowsVersion.h
file and change the values used; see Admin\TargetWindowsVersion_WIN2K.h
and Admin\TargetWindowsVersion_WINXP.h
for details.