Practical Testing: 14 - Bitten by the handle reuse problem

A long time ago, on Practical Testing: I mentioned that there was a small chance that the way that ResetTimer() operated could mean that you could pass in a handle that you no longer owned and affect someone else’s timer by mistake; I said it was unlikely to cause problems… In this posting we fix this problem and in the process show how TDD builds up a set of tests that can support you during redesign…

My mistake was to assume that the class would be used in a particular way and when used in that way it was true that it was unlikely that a handle would be inadvertantly reused. However, the usage restriction wasn’t coded so there was always a chance that someone would use the class in a slightly different way and stumble across the problem. Recently I’ve had to fix an intermittant race condition bug in a client’s code and the bug was caused by me using the timer class in the wrong way, not realising why it was wrong and stumbling on the handle reuse bug. The client code was easy to fix but I decided that the class should be changed so that this kind of problem isn’t possible.

The original use of the callback timer class was to provide a light-weight way to add multiple timers to async socket connections. The typical usage pattern was that we’d use it for read timeouts and disconnect the client or warn them if they were inactive for more than an acceptable amount of time. When the client connects we set a timer for X and then each time a read completes we reset the timer for X. If the timer ever expires then we issue the warning, or disconnect them, or whatever. The point being that the usual usage pattern was set a timer and then continually reset it. We never expect the timer to go off so the handle for the timer remains valid for a long period of time.

The usage pattern that caused me a problem recently is pretty much the opposite. We wrote an auction server for a client and one of the requirements was a high performance logging system that didn’t slow the server down but could log lots of information during the sales. The logging system ended up being based on async file IO which gave us good performance and as a final performance improvement we adjusted how we timestamped the messages so that the timestamps were only generated every second. If several lines were logged in the same second then only one timestamp was written to the log file. To achieve this we used the timer queue in a completely different way. We’d set a timer and wait for it to expire and then, if we had logged anything after our last timestamp, we’d log a timestamp and reset the timer. If we hadn’t logged anything then we’d simply make a note that we needed to log a timestamp with the next log message. This usage leads to timers being set and allowed to expire often.

Our problem was that we were using the ‘set and expire’ pattern with the set and reset pattern and every so often our log system would reuse a sockets timer handle. A socket would time out and the handle would become free, the log would allocate a new timer and use the old handle. The socket code had a bug in it whereby when the timer expired we never set the socket’s timer handle to an invalid value, we just left it. When the socket’s timer expires we terminated the connection and the code that deals with connection termination routinely cancels and timers related to the socket that’s being terminated… The log reused the socket’s expired handle and then the socket cancelled its timer using that handle which led to some confusion and the bug…The immediate fix for our client’s problem was simply to set the socket’s timer handle to the invalid value when the timer expired, this meant that we didn’t subsequently try to cancel a timer we no longer owned and the bug was fixed…

After some thought I decided that the problem was due to the fact that we were trying to be too clever. The timer queue provides an interface that allows you to set a timer, reset a timer that may or may not have expired and cancel a timer that may or may not have expired. The creation of timers is done behind the scenes to ‘make it easier’ for the user to set and reset timers. Unfortunately it makes it harder for the user to reason about the lifetime of the timer handler - unless they communicate timer expiry to the timer setting code. Making the timer queue interface fractionally more complicated for users would remove the problem and make users responsible for the lifetime of their timers; it also had the nice side effect of removing memory allocation churn when we’re using the ‘set and expire’ pattern by allowing the user of the timer to allocate a timer once and reuse it each time they set the timer rather than allocating a timer and having it deallocated each time it expired.

We changed the IQueueTimers interface to look like this:

class IQueueTimers
{
   public :
 
      typedef ULONG_PTR UserData;
 
      class Timer;
 
      typedef ULONG_PTR Handle;
 
      static Handle InvalidHandleValue;
 
      class TimerData;
 
      virtual Handle CreateTimer() = 0;
 
      virtual bool SetTimer(
         const Handle &handle, 
         Timer &timer,
         const DWORD timeoutMillis,
         const UserData userData) = 0;
 
      virtual bool CancelTimer(
         const Handle &handle) = 0;
 
      virtual bool DestroyTimer(
         Handle &handle) = 0;
 
   protected :
 
      ~IQueueTimers() {}
};

Adding an explicit call to create and destroy timers and removing the distinction between setting and resetting a timer. The change means that a client can create a timer and then set or cancel it as often as it likes and then call destroy when it’s done with it. The lifetime of the timer is now under the control of the client and it’s harder to use a handle after it’s no longer valid as the client knows the handle is valid until DestroyTimer() is called and that call sets the reference to the handle that is supplied to an invalid value once the timer is destroyed.

The change to the interface is reasonably minor and fixing up the call sites in client code isn’t that complicated however the changes required inside the timer queue class are quite substantial. Luckily we have a suite of tests that enforce the expected functionality. First we can change the interface, then we can change the tests to work with the new interface and finally we can run the tests, observe the failures and adjust the timer queue code so that the tests pass. During this time we may add additional tests to allow us to develop the new functionality test first. Once we’re happy that the code passes the tests we can review the code and refactor where necessary… Of course, these changes will also ripple through to our threaded timer queue and its tests.

Often at this point people will complain that they have to change twice as much code as they would if they didn’t have tests… That’s true, but I usually find that having the tests means that I can change the code faster and that I can know when I’m done. Without tests you can hack away at the code that you need to change but it can be harder to be sure that you haven’t unexpectedly changed the semantics of the methods…

The downside of our changes is that it’s now more complicated to create timers that are used for one off events; set a timer, wait for it to expire, do something. It can be particularly cumbersome if you’re creating lots of these one off timers, in the previous design you would have simply ignored the handle that was returned from set timer and the timer would clean itself up when it expired, now you need to track the handle and explicitly destroy the timer when you’re done with it. This usage pattern can be managed by adding a variation on the SetTimer() call.

      virtual void SetTimer(
         Timer &timer,
         const DWORD timeoutMillis,
         const UserData userData) = 0;

You never get given a handle to the timer so you can’t manipulate the timer once you set it; this prevents us opening the door to the handle lifetime bug again. We add a flag to the internal timer data class which determines if the timer is automatically destroyed when the timer expires; timer’s set using the new method set this flag to true.

Of course we need to add some more tests for this function…

Code is here but new rules apply: The code will build with VC6, VS.Net 2002 and VS.Net 2003. The code will build with either the standard STL that comes with Visual Studio or with a version of STL Port. The code uses precompiled headers the right way so that you can build with precompiled headers for speed or build without them to ensure minimal code coupling. The code can also compile on VC6 with or without the platform SDK being installed. The various options are all controlled from the “Admin” project; edit Admin.h to change things…

For ease of comparing the changes between the code for this article and the code for the previous article you can use this version of the code from part 13 which is in the same style as the code for this part.