I was visiting a client on Friday to help them integrate a server we’d written for them, using The Server Framework, with some of their existing systems. We had just got everything working nicely when they mentioned, in passing, that they’d occasionally had problems with our service hanging during startup but they couldn’t reproduce it. We talked through the problem for a while and then I had an idea: I configured the service incorrectly, so that it would fail during startup and shut down. It didn’t; it hung.

It was nice that the problem was now reproducible, but we decided not to try to debug it on their production kit; instead, I looked into it when I got back to the office. The only snag was that the problem didn’t occur in my debugger on my development machine… I tried adjusting the processor affinity; I hadn’t checked how many processors were in the production box, but perhaps it was a threading issue… It wasn’t. I ran the release build and it hung, reliably, every time.

I stuck some logging in and eventually narrowed down the problem to an object that was being destroyed during the stack unwinding caused by the exception that was being handled… The object in question was part of our standard Windows Service framework…

We have some code that makes writing services easy for us; it’s based on the stuff in Richter’s book, but adds some things that make it easier for us to debug and test our services. All of our TCP/IP servers have black box test harnesses that are built with our C# Socket Server test harness. The idea is that we can thrash-test the servers and simulate all of the horrible nasty fragmentation issues that never normally happen in development. The key part of these tests is that they run as part of the release build. The build happens; the test starts the server, runs and then shuts the server down. The problem we had with Richter’s code was that it wasn’t easy to run a service outside of the SCM and still be able to shut it down; we didn’t want to manipulate the SCM programmatically for our tests and, to be honest, it’s useful to be able to run the service like a normal process in the debugger. To this end, we have added a command line switch to run the service in “debug mode”, as a normal process. So that we could still shut it down cleanly, and pause/resume it, we also added a thread to monitor a couple of events and post service controls as if they had come from the SCM.
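
To give a feel for the shape of the debug-mode idea, here is a minimal sketch. It is not our service code: it uses portable standard C++ rather than Win32 events, and the `/debug` switch name, `Control`, `isDebugMode` and `DebugControlMonitor` are all names made up for illustration. A background thread waits for signalled controls and passes them to the same handler that SCM controls would reach.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical subset of the controls the SCM can send to a service.
enum class Control { Stop, Pause, Continue };

// Returns true if the (assumed) "/debug" switch is on the command line.
bool isDebugMode(const std::vector<std::string>& args) {
    for (const auto& arg : args) {
        if (arg == "/debug") return true;
    }
    return false;
}

// Hypothetical monitor: delivers signalled controls to the service's
// control handler from a background thread, as if the SCM had sent them.
class DebugControlMonitor {
public:
    using Handler = std::function<void(Control)>;

    explicit DebugControlMonitor(Handler handler)
        : handler_(std::move(handler)), thread_([this] { run(); }) {}

    ~DebugControlMonitor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        thread_.join();  // the thread is always started in the constructor
    }

    // Stands in for setting one of the monitored events.
    void signal(Control control) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(control);
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return done_ || !pending_.empty(); });
            if (pending_.empty()) return;  // shutting down, nothing left to deliver
            const Control control = pending_.front();
            pending_.pop();
            lock.unlock();
            handler_(control);  // call the handler without holding the lock
            lock.lock();
        }
    }

    Handler handler_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Control> pending_;
    bool done_ = false;
    std::thread thread_;  // declared last so the other members exist before it runs
};
```

In a test harness the process would be launched with the debug switch, and the harness would signal `Control::Stop` once the test run had finished; any controls still queued when the monitor is destroyed are delivered before its thread exits.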

Our bug was that we only start the event monitoring thread when the service is started in debug mode but we always wait for it to shut down in the monitor’s destructor. The wait for the unstarted thread caused an exception and, well, exceptions during stack unwinding due to exceptions are bad news. The bug was easy to locate and simple to fix; at least for this client.
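
The shape of the fix is easy to show in a small sketch. This is a portable analogue using `std::thread`, not the Win32 thread wrapper our service code actually uses, and `EventMonitor` and `startupFailsCleanly` are hypothetical names. The point is simply that the destructor only waits for the thread if it was started; with `std::thread`, calling `join()` on a thread that was never started throws `std::system_error`, and an exception escaping a destructor during stack unwinding ends in `std::terminate`.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for the debug-mode event monitor.
class EventMonitor {
public:
    explicit EventMonitor(bool debugMode) {
        // The monitoring thread is only needed, and only started, in debug mode.
        if (debugMode) {
            thread_ = std::thread([this] { run(); });
        }
    }

    ~EventMonitor() {
        stop_.store(true);
        // The fix: only wait for the thread if it was actually started.
        // An unconditional join() here would throw when debugMode was false.
        if (thread_.joinable()) {
            thread_.join();
        }
    }

    bool started() const { return thread_.joinable(); }

private:
    void run() {
        // Placeholder for waiting on the shutdown/pause/resume events.
        while (!stop_.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> stop_{false};
    std::thread thread_;
};

// Simulates a start-up failure: the monitor is destroyed during the stack
// unwinding caused by the exception, as in the misconfigured service.
bool startupFailsCleanly(bool debugMode) {
    try {
        EventMonitor monitor(debugMode);
        throw std::runtime_error("bad configuration");
    } catch (const std::runtime_error&) {
        return true;  // reached only if the destructor neither threw nor hung
    }
}
```

Calling `startupFailsCleanly(false)` exercises the path that used to go wrong for us: the thread was never started, the exception unwinds the stack, and the destructor has nothing to wait for.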

Unfortunately, our service code had been growing in the projects that used it and hadn’t yet been harvested into a library. The thinking was that each time we use it, it’s different, and so we don’t understand the problem well enough to hoist the code into a library… Hmm. After I’d applied the same fix to the third server I decided that we needed to hoist the code out of the servers and into the service library. We were breaking the number one rule: Don’t Repeat Yourself.

So the service code was harvested into a library. To do this we had to factor out the commonality from the servers that used the service code and make it configurable enough that we could share the common stuff and configure it appropriately. It wasn’t actually that hard; with several implementations to consider we could easily spot the similarities, and we deliberately decided not to extend the code to cover currently unneeded functionality. Rather than attempt to refactor the client code, we did what we usually do when we want to explore functionality in such a way that we can easily reuse it: we wrote a simple server example that uses it. And so the echo server service example was born. This takes our simple echo server and makes it a service using the newly refactored service code. It’s easy to see how to use the service code in a real server, as the echo server is very simple and doesn’t get in the way of demonstrating the technique we’re exploring. Almost all of our new server functionality makes its way into a standalone example program like this; we have very simple SSL echo servers, telnet, HTTP, Kerberos (SSPI), multiple listening servers, Bluetooth using the AcceptEx stuff, etc. Of course, real-life production servers are rarely this simple, but we have a working example to start from and that helps a lot.

Once we had the service code harvested into the library and an example server written, we spread the love through the rest of the servers that used this particular functionality and, finally, we had one place to fix rather than many… I think we harvested this code at about the right time. It was slightly over-ripe, but the last time we looked at doing the work we only had two uses of the code and we were tempted to add lots of functionality that we expected we’d need soon. That made the harvesting more work and we got caught up in analysis paralysis. This time around we’d done things the same way three times, and the echo server example made that four, so we just harvested that way of doing things. We can do the hard, really flexible, kitchen sink version when we actually need it.