One of my long-term clients has hundreds of cloud machines running instances of their server, each server maintains thousands of reliable UDP connections using a custom protocol that we’ve developed over the years. When things go wrong it’s often hard to work out why. Even though we have reasonable unit test coverage of the code that runs the UDP protocol, it’s hard to build tests that cover every possible scenario.
I recently redesigned the internals of the inbound queuing, message reassembly and dispatch code so that it was more cache-friendly and worked better on NUMA hardware. During this work, we had some bugs that showed up during integration testing and these were hard to debug. Capturing network traces has never really been that viable when the problem only happens under load with network conditions that cause packet delay, loss and retransmission, so getting visibility on the potential causes of the bugs wasn’t easy. Eventually I decided to add a journal to the code, and this has proven to be immensely valuable.
The journal is a per-connection, in-memory byte buffer that is used to record the operations that occur on the code in question. The idea is that we record the minimal amount of information required to be able to replay the stream of operations that occur on the connection. So, for example, we may store the fact that we had a 200 byte message arrive with a sequence number of 5, and then the first fragment of a fragmented message with a sequence of 2 etc., we can store this information in very limited space if we don’t care about the actual message payloads. The result is that we store the ‘shape’ of the operations and all the key data, and we then fabricate the stuff that’s not important as we replay the journal. If the connection fills the current journal buffer, we allocate a new one and chain it to the previous one. Then, if certain events occur, we write the journal to disk in a simple binary format that can be easily read by our tools.
Over time, I’ve built several tools to process the journals. The first and most valuable was the replay test. This is based on our unit tests, it puts the objects that we need together with some mock objects and then reads the journal and fabricates the message data and pushes the resulting operations through the code under test. It allows for repeatedly running the problematic message stream through the code that handles it and, coupled with conditional breakpoints that can be set to trigger only on particular message sequence numbers, makes for a reasonably controlled and sane debugging environment.
Other tools that we’ve written can produce a text dump of the message flow, analyse the time between retransmissions and show duplicate messages and missing message and the time that it takes the clients to retransmit and recover. A summary tool can pull key information from a bunch of journal files so that we can if only a particular version of the client, on a particular machine type, is causing the problems.
This was all so useful that adding a corresponding outbound journal was the first step I took when I was about to do a similar redesign for the outbound data flow.
It took a while to get value out of this system as initially every time I replayed a journal there was key data missing, so, incrementally, I added stuff and tuned the system.
The journal is light-weight enough that, although it can be both turned off and compiled out, it’s generally left enabled. The system can be set to only persist the journal files when there’s a problem or if certain metrics are out of bounds and so that reduces the amount of data that needs to be examined.
I’m writing this to remind myself that this is something that should always be added as soon as I need to spend more than a day or so debugging this kind of thing. Especially since we’ve just had a new, strange, crash and as well as a mini-dump, which doesn’t help much on its own, in this case, we also have a journal that was produced. Running the journal through the replay system causes a dump to be generated in the same place, and so I immediately knew I had a reproducible bug and hacking on the replay tool for an hour or so was all it took to locate the problem code and step through how the problem occurs even though the actual problem happens a long time prior to the crash occurring.
As Jerry Pournelle would say, Recommended.