Debugging yourself...

Recently I spent quite a long time debugging a heap corruption bug. When I eventually found the issue, it was such a classic buffer overrun bug that it was almost laughable how long it had taken to track down. It was just SO obvious that I spotted it the moment I looked at the code. The problem was that it took me so long to actually look at that piece of code: it wasn't in my part of the system, and the part of the system where the buffer overrun existed was never considered to be the location of the problem.

A long time ago I wrote a server system for a games company. The server is written in C++, runs on Windows, and hosts the Microsoft CLR which, in turn, runs the managed code for the actual game. The native C++ does the networking with various protocols and, even after all these years, is still predominantly my domain. The C# code is written by the game company's developers. We now also run on Linux using .NET Core, but the distribution of responsibility is still the same: I do the native stuff; they do the managed stuff. Whenever we have memory issues, it's the native code in the spotlight. This is fair enough; it's more likely that the C++ code is at fault as it's "impossible" for the memory corruption to be happening in managed code… This time it was the managed code causing the buffer overrun and the heap corruption.

Whilst it's "impossible" for managed code to cause memory corruption, it isn't if you're using unsafe blocks of managed code for performance. The issue that was finally located and fixed was that some code was doing a UTF-8 string conversion and telling the conversion code that the target buffer was big enough when it wasn't. Depending on how big the memory corruption was, we got various effects. Small corruptions weren't detected until later operations on the heap encountered the corrupted data. Large corruptions caused the expected access violations as they tampered with unallocated memory areas.
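The shape of the bug is easy to show in native code. The real bug was in C# unsafe code, but the class of mistake is the same: a conversion routine is told that the destination buffer is larger than it really is. This is a minimal, hypothetical sketch, not the actual code:

```cpp
// Hypothetical sketch of the class of bug described above; the real bug was
// in C# 'unsafe' code, but the shape is identical: the caller lies about the
// size of the destination buffer.
#include <windows.h>

void ConvertToUtf8(
   const wchar_t *wide)
{
   char *utf8 = new char[16];          // the buffer is really 16 bytes...

   ::WideCharToMultiByte(
      CP_UTF8, 0,
      wide, -1,                        // convert the whole, null terminated, string
      utf8, 64,                        // ...but we claim there's room for 64
      nullptr, nullptr);

   // For long enough inputs the conversion writes past the end of the
   // allocation and corrupts the heap; small overruns may not be noticed
   // until much later, larger ones fault straight away.

   delete [] utf8;
}
```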

The managed layer reported the issue by telling the native hosting layer that it had caused a "rude process exit", and then the process shut down. Unfortunately, by the time the native code was told of the problem, it was too late for us to work out what had gone wrong, as the callstack from managed into native code wasn't useful to us.

Eventually, I spotted patterns in the dumps we were generating. WinDbg showed a corrupted managed heap, and the "last good" object often looked the same. Due to the unique size of the managed object that was corrupted, there were only a few likely targets in the managed system and, once I started looking at the code, it soon became clear what the issue was.

Our reporting of this issue was made more complicated by the fact that the Windows Error Reporting system would often step in and handle the "bigger" heap corruptions, though some of the smaller ones would also result in WER reports. The problem was very intermittent, relied on things that "shouldn't happen" happening in the managed layer, and generated WER reports that were often misunderstood. Add to this the fact that the system runs on hundreds of server boxes and provides a cloud platform for lots of different games, and it was quite a complicated issue to track down. And it's probably not the only bug causing very rare crashes for us.

As a post-mortem step, I decided to look into finding future issues faster. The native layer already has lots of "range checked" versions of the obvious buffer overrun functions: checked memory copy functions and the like. The managed layer is out of my control and, apart from stipulating code reviews and careful tracking of all unsafe blocks, it's hard for me to suggest improvements there. My approach to debugging this was to take a step back from the system and write a custom debugger using the Windows Debug API.
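As an aside, the kind of range-checked helper mentioned above looks something like the following. This is purely illustrative; the names and the error handling are mine, not the server's actual code:

```cpp
// Illustrative sketch of a range-checked memory copy; the real server's
// helpers will differ, this just shows the idea of making the destination's
// capacity an explicit part of every copy.
#include <cstring>
#include <stdexcept>

void CheckedMemcpy(
   void *dest,
   const size_t destSpace,             // how much room the destination really has
   const void *src,
   const size_t bytesToCopy)
{
   if (bytesToCopy > destSpace)
   {
      throw std::runtime_error("CheckedMemcpy: destination buffer too small");
   }

   memcpy(dest, src, bytesToCopy);
}
```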

I've done a lot of work with the Windows Debug API in the past; my time shifting and lock inversion tools ran as custom debuggers. I have a library of code that makes using the API "easier" (for me). Putting together a custom debugger that ran our server system was pretty easy as most of the boilerplate code already existed in other projects that I've worked on over the years. Once this custom debugger was running, I had much greater visibility of the server process; I could see thread starts and stops and all SEH exceptions, even ones that are handled before they reach the native code.
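The core of such a debugger is just the standard Windows debug event loop. This is a stripped-down sketch with no error handling, not the actual code:

```cpp
// Stripped-down sketch of a Windows Debug API event loop; real code needs
// proper error handling and deals with many more event types.
#include <windows.h>
#include <iostream>
#include <string>

void RunUnderDebugger(
   std::wstring commandLine)           // copied; CreateProcessW may modify it
{
   STARTUPINFOW si = { sizeof(si) };
   PROCESS_INFORMATION pi = {};

   // Launch the server with ourselves attached as its debugger.
   if (!::CreateProcessW(
      nullptr, &commandLine[0],
      nullptr, nullptr, FALSE,
      DEBUG_ONLY_THIS_PROCESS,
      nullptr, nullptr,
      &si, &pi))
   {
      return;
   }

   bool processExited = false;

   while (!processExited)
   {
      DEBUG_EVENT debugEvent = {};

      if (!::WaitForDebugEvent(&debugEvent, INFINITE))
      {
         break;
      }

      DWORD continueStatus = DBG_CONTINUE;

      switch (debugEvent.dwDebugEventCode)
      {
         case CREATE_THREAD_DEBUG_EVENT :
         case EXIT_THREAD_DEBUG_EVENT :
            // Thread start/stop visibility comes for free.
            break;

         case EXCEPTION_DEBUG_EVENT :
            // Every SEH exception arrives here first ("first chance"),
            // even those that the debuggee goes on to handle itself.
            if (debugEvent.u.Exception.ExceptionRecord.ExceptionCode !=
                EXCEPTION_BREAKPOINT)
            {
               std::cout << "First chance exception: " << std::hex
                  << debugEvent.u.Exception.ExceptionRecord.ExceptionCode
                  << std::endl;

               // Pass it back so the debuggee's own handlers still run.
               continueStatus = DBG_EXCEPTION_NOT_HANDLED;
            }
            break;

         case EXIT_PROCESS_DEBUG_EVENT :
            processExited = true;
            break;
      }

      ::ContinueDebugEvent(
         debugEvent.dwProcessId,
         debugEvent.dwThreadId,
         continueStatus);
   }

   ::CloseHandle(pi.hThread);
   ::CloseHandle(pi.hProcess);
}
```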

Armed with the now easy-to-reproduce managed buffer overrun bug and the custom debugger, I set about spotting the access violation that the managed corruption could cause. The custom debugger could then dump a call stack, and we could see what was going on… Unfortunately, dumping a mixed-mode, managed and native, callstack is quite complex and I don't currently have code to hand that does it. This left me with a pretty useless native stack without symbols. My recent work on out-of-process mini dump generation meant that I understood enough about dump creation (again) to recognise that the exception and stack walking code had all the information that I needed to generate a dump rather than a callstack. So that's what we do.
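Given an exception record and a thread context captured by the debugger when it saw the exception (more on that below), writing a dump of the debuggee from the debugger process is reasonably straightforward with MiniDumpWriteDump(). The wrapper below is a hypothetical sketch; the function and parameter names are mine:

```cpp
// Sketch: write a minidump of the debuggee from the debugger process, with a
// previously captured exception embedded so the dump opens "at" the fault.
// Function and parameter names are illustrative, not the actual code.
#include <windows.h>
#include <dbghelp.h>

#pragma comment(lib, "dbghelp.lib")

bool WriteDumpForException(
   HANDLE hProcess,                    // debuggee process handle
   const DWORD processId,
   const DWORD threadId,               // thread that raised the exception
   EXCEPTION_RECORD &record,           // captured at the debug event
   CONTEXT &context,                   // captured via GetThreadContext()
   const wchar_t *dumpFileName)
{
   HANDLE hFile = ::CreateFileW(
      dumpFileName, GENERIC_WRITE, 0, nullptr,
      CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);

   if (hFile == INVALID_HANDLE_VALUE)
   {
      return false;
   }

   // Build an EXCEPTION_POINTERS in *our* address space; ClientPointers is
   // FALSE to tell MiniDumpWriteDump the pointers are ours, not the target's.
   EXCEPTION_POINTERS pointers = { &record, &context };

   MINIDUMP_EXCEPTION_INFORMATION exceptionInfo = {};

   exceptionInfo.ThreadId = threadId;
   exceptionInfo.ExceptionPointers = &pointers;
   exceptionInfo.ClientPointers = FALSE;

   const BOOL ok = ::MiniDumpWriteDump(
      hProcess,
      processId,
      hFile,
      MiniDumpWithFullMemory,          // big, but most useful for heap issues
      &exceptionInfo,
      nullptr,
      nullptr);

   ::CloseHandle(hFile);

   return ok == TRUE;
}
```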

So, I've ended up with a system where, if we see a "first chance" access violation exception in a managed stack which doesn't result in an exception that reaches the native layer, but which does result in the managed layer killing the process at a later point, we generate a crash dump of the process at the point where the exception originally occurred. We do this by grabbing the context of the thread that generated the exception when the custom debugger is told of the exception. We keep this information around and, if we then see that the process is dealing with the "rude process exit" managed callback, we generate a dump of the process as it was when the access violation occurred. This works nicely and gives me a dump that is correctly positioned in the managed code, on the line that is broken.
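Pulled together, the capture-and-defer logic inside the debug event loop looks roughly like this. Again, this is a hedged sketch: the names are hypothetical, the state handling is simplified to a single saved event, and WriteDumpForException() is the hypothetical helper from the previous sketch:

```cpp
// Sketch of the capture-and-defer logic; names are hypothetical and state
// handling is simplified to a single saved exception.
#include <windows.h>

// From the previous sketch.
bool WriteDumpForException(
   HANDLE hProcess, DWORD processId, DWORD threadId,
   EXCEPTION_RECORD &record, CONTEXT &context,
   const wchar_t *dumpFileName);

struct SavedException
{
   bool valid = false;
   DWORD threadId = 0;
   EXCEPTION_RECORD record = {};
   CONTEXT context = {};
};

SavedException g_saved;

// Called from the EXCEPTION_DEBUG_EVENT case of the debug loop.
void OnFirstChanceException(
   const DEBUG_EVENT &debugEvent)
{
   const EXCEPTION_RECORD &record =
      debugEvent.u.Exception.ExceptionRecord;

   if (record.ExceptionCode == EXCEPTION_ACCESS_VIOLATION)
   {
      // Capture the faulting thread's context right now; by the time the
      // managed layer decides to kill the process this will be long gone.
      HANDLE hThread = ::OpenThread(
         THREAD_GET_CONTEXT, FALSE, debugEvent.dwThreadId);

      if (hThread)
      {
         g_saved.context.ContextFlags = CONTEXT_FULL;

         if (::GetThreadContext(hThread, &g_saved.context))
         {
            g_saved.threadId = debugEvent.dwThreadId;
            g_saved.record = record;
            g_saved.valid = true;
         }

         ::CloseHandle(hThread);
      }
   }
}

// Called when we later detect the managed layer's "rude process exit" path.
void OnRudeProcessExit(
   HANDLE hProcess,
   const DWORD processId)
{
   if (g_saved.valid)
   {
      // Write the dump as if it had been taken at the original access
      // violation, using the hypothetical helper sketched earlier.
      WriteDumpForException(
         hProcess, processId, g_saved.threadId,
         g_saved.record, g_saved.context,
         L"managed-overrun.dmp");
   }
}
```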

The next step, should we choose to take it, is to incorporate this custom debugger into the Windows service design that we use for the real server. This would allow the server's service to start as a debugger, which then launches the real server, and would allow us to generate reliable crash dumps for these situations…