Time is an illusion, CLOCK_REALTIME, doubly so...
A few days ago one of my build machines started to have unexpected test failures during regular integration builds of one of my client’s codebases. My other build machines ran the tests fine and the failures were intermittent and spread over a worryingly large array of tests.
Don’t use CLOCK_REALTIME to measure wait or delay times as it can be changed and the
time reported can move unexpectedly; absolute timeouts do not play nicely with this clock.
POSIX allows for spurious wakes from timed operations, so, expect to manage your own timeouts.
The common thread to the failures was that waits for tests to do something were timing out and so the test either proceeded too soon or timed out or both. The Windows test runs, of the same code, ran fine and everything been running fine for several years before this. That is, some of the tests and code under test hadn’t changed in a long, long time.
The tests were being run in a Windows Subsystem for Linux shell on an AMD Ryzen machine,
neither of which facts turned out to be anything to do with the problem. I narrowed the
failures down to the POSIX specific code that provided an abstraction layer over POSIX
threading primitives that mirrored Win32 ‘handle wait’ semantics. This is based on the
Neosmart pevents code from GitHub.
The underlying problem of all of my test failures was that calls to WaitForMultipleObjects()
or other Win32 Event waits were returning with timeouts too soon. This code eventually
gets down to the pevents WaitForMultipleEvents(), and UnlockedWaitForEvent(), and both
of those eventually ended up at a call to pthread_cond_timedwait().
I haven’t done much work with pthreads as the pevents code just worked: until now. My
initial investigations into the API suggested that I could be experiencing
‘spurious wakeups’ or that the default
condition variable usage of CLOCK_REALTIME might be a problem as it can be changed and
thus allow time to move backwards… In the end it turned out to be both…
My first thoughts were that it must be the usage of CLOCK_REALTIME. The
man page is quite clear
that “When its time is changed, timers for a relative interval are unaffected, but timers
for an absolute point in time are affected.” and, after all, pthread_cond_timedwait()
works in terms of absolute timeouts…
I adjusted the code use one of the other clocks, CLOCK_MONOTONIC looked like a good bet.
This involved adjusting how the condition variables were created so that I could set the
clock type using a pthread_condattr_t structure during creation. This made things worse
and more unpredictable…
Time was now moving forwards predictably but calls to pthread_cond_timedwait() were
returning too soon anyway. I added plenty of old-school printf debugging and, amidst the
noise, I could see that the waits were returning early. That will be the
‘spurious wakeups’ that POSIX warned me
about. The pevents code didn’t allow for early returns from timed waits, but allowed for
them for waits that waited on the events to change. I expect the reason that the code didn’t
deal with early returns from timed waits is that it doesn’t appear to happen when using
CLOCK_REALTIME. It was easy enough to fix up the pevents code to deal with early returns
from timed waits, especially since I was already sampling the time for my debugging code.
The switch of clock types to CLOCK_MONOTONIC and the addition of the spurious wake handing
code fixed my test failures and worked on all of my build machines, but I still wanted to get
to the bottom of what was causing one machine to fail.
Much debugging occurred. Switching between the two clock types and conditionally enabling and
disabling the timed spurious wakeup handling code. Wondering about processor architecture and
WSL versions, I also tried CLOCK_MONOTONIC_RAW and this resulted in even more spurious wakeups.
Eventually I noticed that the box that was failing has a system time that was around 2 minutes ahead of my other boxes. I’ve no idea why, all the Windows Time settings were the same on all machines. However, seeing the time difference I realised what was going on…
On the failing box with the time in the future, the Windows NTP system was trying, valiantly, to
correct the time drift. It was doing this by adjusting the system clock backwards in small steps.
These adjustments were causing the original pevents implementation, that used CLOCK_REALTIME,
to fail as the absolute timeouts for the condition variables were being randomised by the underlying
clock’s movement into the past. Manually synchronising the broken machine’s clock fixed the problem.
The unfixed code ran fine and all the tests passed.
This is one of those “I’m glad this wasn’t a production issue” bugs. It also validated my approach to bug fixing, which is, if you find a bug fix it now, drop EVERYTHING else and fix it. If I had simply noted the test failures down and created an issue to track it and then gone back to the other work that I was doing, the chances are that the box would have fixed its time and the problem would have vanished. By immediately diving into the problem and staying with it until it was fixed I had a reliable way of reproducing the issue that would have vanished once the system time was properly synchronised.
The end result was that I have changed my version of pevents to use CLOCK_MONOTONIC and handle
the resulting spurious wakeups. The code works on all of my test machines even when their clocks are
wrong and ‘wall time’ is moving unpredictably for them.
Now I need to fork the pevents code and fix the issue there…