Practical Testing: 36 - Timeout handle wrap

Page content

Previously on “Practical Testing”… I’ve just fixed a new bug in the timer queue and in doing so I updated the code used in the blog posts to the latest version that ships with The Server Framework. This code included two fixes that had been fixed some time ago and which I hadn’t documented here. They also lacked unit tests… Last time, I wrote tests for, and fixed, the first bug. This time I fix the final bug.

This bug is in the “lock-free” timeout handling code and it would cause the threaded version of the timer queue to do a small amount of unnecessary work but otherwise work correctly. The problem is that we use a ULONG_PTR value as an opaque handle when processing timeouts using the “lock-free” BeginTimeoutHandling(), HandleTimeout(), EndTimeoutHandling() sequence and this opaque handle has a sentinel value of 0 that is used to indicate that there are no timeouts to handle. The code that we use to generate the timeout handle in BeginTimeoutHandling() is a simple ::InterlockedIncrement() and, if enough timeout handles are generated, this will wrap to 0 and return the sentinel value when it should be returning a non-sentinel value.

Making the problem visible

Initially when I fixed this issue I did so without adding a test as I felt that it was too complex to add some method for the test code to manipulate the timeout handle generation to ensure the bad value was generated and running a test which called BeginTimeoutHandling() enough times to cause the bug to manifest would be too slow. The fix is easy, seeing the problem is less so.

Whilst looking at the code again and thinking about how I could justify not needing a test here I realised that it was trivial to make this issue very visible. Rather than starting the timeout handle value as the address of the object (don’t ask, we’ll refactor it later on) instead we set the initial value to -1. The first time we handle timeouts we will increment the value to zero and the bug will be exposed.

Fixing it

Fixing this is trivial, rather than simply calling InterlockedIncrement() we call it and check that we don’t have our sentinel value. If we do, we call it again… This code does the job:

IManageTimerQueue::TimeoutHandle CCallbackTimerQueueBase::GetNextTimeoutHandle()
   TimeoutHandle timeoutHandle = ::InterlockedIncrement(reinterpret_cast<long*>(&m_nextTimeoutHandle));

   if (timeoutHandle == InvalidTimeoutHandleValue)
      // wraps to zero? this is possible, avoid it...

      timeoutHandle = ::InterlockedIncrement(reinterpret_cast<long*>(&m_nextTimeoutHandle));

   return timeoutHandle;

The code is here on GitHub and new rules apply.

Does it really need to be this complex?

Whilst the changes expose the issue in a way that can be tested and then fix the problem I’m not especially happy with them. The code has obviously evolved from something that returned an opaque representation of a pointer to something that just returns a number… Looking back at the blog post where this concept was introduced, and looking at the code, I can see that originally we returned the timeouts themselves via the BeginTimeoutHandling() call and so it was essential to pass in a valid handle (pointer to timeouts) to the other functions as they dealt with the timeouts and tidied them up. When the code was changed to use intrusive containers in episode 32 the timeout handle changed and was no longer an opaque pointer to timeouts but instead became a meaningless value that just had to match in the calls to HandleTimeout() and EndTimeoutHandling(). It’s likely a better design to simply return a boolean from BeginTimeoutHandling() and remove the entire concept of a timeout handle. The timer wheel implementation still uses the ‘return an opaque pointer to timers’ version of the code but that can easily be changed to operate in the same way as the timer queues.

The only framework code that uses the BeginTimeoutHandling() code is the threaded timer queue. Some esoteric client code may use this API directly but it will be easy fix up. The code is an obvious, compile breaking change, which is good. Unfortunately the test code is riddled with timeout handle related code…

The code after the refactoring, with 2 tests removed as they are no longer necessary, is here on GitHub.