Practical Testing: 31 - A bug in DestroyTimer.

Back in 2004, I wrote a series of articles called “Practical Testing” where I took a piece of complicated multi-threaded code and wrote tests for it. I then rebuild the code from scratch in a test driven development style to show how writing your tests before your code changes how you design your code. Since the original articles there have been several bug fixes and redesigns all of which have been supported by the original unit tests and many of which have led to the development of more tests.

Whilst doing some development on a new server design I managed to expose a bug which has been present in the timer queue code for some time and which is also present in the timer wheel implementation. The bug allows memory within the timer queue to be deleted twice if you happen to call DestroyTimer() on a timer from within OnTimer() for that same timer. The problem occurs infrequently as it requires the memory for the timer data that has just been deleted to be set with a specific bit pattern that will then cause the timer queue to think that the data needs to be deleted again. It also requires that you’re calling HandleTimeouts() to handle timeouts whilst allowing the timer queue to hold its internal lock; this is something most of my servers don’t do. Anyway, circumstances conspired to make this bug visible and so here I am to fix it.

Of course, the first thing to do is write a test that exposes the bug. The test is fairly simple to build but it requires that we adjust the mock timer that we use so that it can be told to delete the timer during timeout handling. The new test looks something like this:

template <class Q, class T, class P>
void TCallbackTimerQueueTestBase<Q, T, P>::TestDestroyTimerDuringOnTimerInHandleTimeouts()
{
   JetByteTools::Win32::Mock::CMockTimerQueueMonitor monitor;

   P tickProvider;

   tickProvider.logTickCount = false;

   {
      Q timerQueue(monitor, tickProvider);

      CheckConstructionResults(tickProvider);

      CLoggingCallbackTimer timer;

      const Milliseconds timeout = 100;

      const IQueueTimers::UserData userData = 1;

      IQueueTimers::Handle handle = CreateAndSetTimer(
         tickProvider,
         timerQueue,
         timer,
         timeout,
         userData);

      timer.DestroyTimerInOnTimer(timerQueue, handle);

      const Milliseconds expectedTimeout = CalculateExpectedTimeout(timeout);

      THROW_ON_FAILURE_EX(expectedTimeout == timerQueue.GetNextTimeout());

      tickProvider.CheckResult(_T("|GetTickCount|"));

      tickProvider.SetTickCount(expectedTimeout);

      timerQueue.HandleTimeouts();

      tickProvider.CheckResult(_T("|GetTickCount|"));

      timer.CheckResult(_T("|OnTimer: 1|TimerDestroyed|"));

      tickProvider.CheckNoResults();

      THROW_ON_NO_EXCEPTION_EX_1(timerQueue.DestroyTimer, handle);
   }

   THROW_ON_FAILURE_EX(true == monitor.NoTimersAreActive());
}

And the change to the mock looks something like this:

void CLoggingCallbackTimer::OnTimer(
   UserData userData)
{
   if (logMessage)
   {
      if (logUserData)
      {
         LogMessage(_T("OnTimer: ") + ToString(userData));
      }
      else
      {
         LogMessage(_T("OnTimer"));
      }
   }

   if (m_pTimerQueue)
   {
      m_pTimerQueue->DestroyTimer(m_handle);

      LogMessage(_T("TimerDestroyed"));
   }

   ::InterlockedIncrement(&m_numTimerEvents);

   m_timerEvent.Set();
}

The result is that the timer is destroyed during timeout handling and the test demonstrates the failure in the code.

Unfortunately, with the latest build of the code the test does NOT demonstrate the problem. Unfortunately the PTMalloc implementation that we’re currently using doesn’t allow you to set it to fill deleted memory with an unlikely bit pattern. The default allocator with the Visual Studio C runtime does allow this in debug builds and this helps to force the bug into view. Adding the new test to the code that was presented in part 29 causes the bug to manifest and for Visual Studio to pop up a debug assertion message when the memory is double deleted.

The potential for this problem to occur when using the BeginTimeoutHandling(), EndTimeoutHandling() style of “lock free” timeout handling was already identified and fixed back when I first added the “lock free” timeout handling in part 18. The fix is pretty similar. There’s already a flag in the internal timer data structure that’s used to delay destruction until after the timer has finished being processed but it doesn’t get set when using HandleTimeouts(). The fixes are fairly simple for both the timer wheel and the timer queue.

void CCallbackTimerQueueBase::TimerData::OnTimer()
{
   m_processingTimeout = true;

   OnTimer(m_active);

   m_processingTimeout = false;
}

CCallbackTimerWheel::TimerData *CCallbackTimerWheel::TimerData::OnTimer()
{
   m_ppPrevious = 0;

   m_processingTimeout = true;

   OnTimer(m_active);

   m_processingTimeout = false;

   return m_pNext;
}

The code from part 29 with these fixes applied can be found here.

The code that uses PTMalloc from part 30 with these fixes applied can be found here.

Please note that the previous rules apply.