SetServiceStatus framework bug

One of my clients has been reporting an intermittent issue when deploying new releases of their game server. The server runs as a Windows service on a great many cloud machines and, just sometimes, the service has problems during start-up after the code is upgraded on a machine where it had otherwise been running fine.

I’ve been adding debug code to our service start-up code to try to work out what’s going on, and today we had our first hit. The first call to SetServiceStatus() failed. We really don’t expect that, so the service shut down.

TL;DR

I made a mistake 23 years ago but that isn’t the cause of my client’s current problems.

SetServiceStatus() never fails. Well, in 23 years I’ve never seen it fail and never had a client complain that it has failed. The function returns a BOOL, so it can fail, and my code checks for and reports any failures but, well, this has never happened. The failures that the docs suggest are possible are ERROR_INVALID_DATA, if we pass it something incorrect, and ERROR_INVALID_HANDLE, if the handle we use is wrong. Up until recently neither of those was likely to be possible.

I looked through the docs again today, desperately clutching at straws, and noticed something that looks new, or at least new to me.

Do not register to accept controls while the status is SERVICE_START_PENDING or the service can crash.

After initialization is completed, accept the SERVICE_CONTROL_STOP code.

Looking at the example service main code, it clearly shows that while the service is in the SERVICE_START_PENDING state they zero out the dwControlsAccepted field. I was pretty sure the Richter code that I based mine on back in 2001 never mentioned this, but looking back at Programming Server Side Applications for Windows 2000, the code there doesn’t set the controls during service initialisation, so it looks like it’s a mistake I made. Anyway, it’s easy to fix and so I’ve done that. This and other fixes to this code will be available in the new release of The Server Framework.
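The rule from the docs can be sketched as a small helper that decides what goes into dwControlsAccepted for a given state. This is a minimal, portable illustration rather than the code from The Server Framework: the namespace, the constant names and the ControlsAcceptedFor() function are hypothetical, and the numeric values are redefined locally (mirroring winsvc.h) so that the sketch builds without the Windows SDK.

```cpp
#include <cstdint>

// Hypothetical sketch; names are illustrative and not part of any real API.
namespace status_sketch {

// Local copies of the winsvc.h values, so that this compiles on any platform.
constexpr std::uint32_t kStartPending   = 0x00000002; // SERVICE_START_PENDING
constexpr std::uint32_t kRunning        = 0x00000004; // SERVICE_RUNNING

constexpr std::uint32_t kAcceptStop     = 0x00000001; // SERVICE_ACCEPT_STOP
constexpr std::uint32_t kAcceptShutdown = 0x00000004; // SERVICE_ACCEPT_SHUTDOWN

// Returns the value to place in dwControlsAccepted for the given state.
// While the service is still starting we accept no controls at all;
// otherwise we pass through whatever the service was configured with.
constexpr std::uint32_t ControlsAcceptedFor(
   std::uint32_t currentState,
   std::uint32_t configuredControls)
{
   return currentState == kStartPending ? 0u : configuredControls;
}

} // namespace status_sketch
```

In a real service the result would be assigned to the dwControlsAccepted member of the SERVICE_STATUS structure just before each call to SetServiceStatus(), so the start-pending case cannot be forgotten at an individual call site.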

With this changed, the service started on the specific machine that the client was having problems with, but then failed as soon as it was running and the code set the accepted controls to the value that we’d configured. The problem here turned out to be that, to aid debugging of the deployment issue, I’d enabled “all controls” so that the service would log everything that could have been affecting it. Unfortunately, “everything” from the current Windows SDK includes control flags that the OS we’re running on doesn’t support.
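One way to guard against this kind of failure is to mask the configured controls against the set that the target OS is known to support, and to report anything that was dropped. The sketch below is an assumption about how such a check could look, not the check that is being added to the framework: FilterControls(), the FilteredControls struct and the constant names are hypothetical, the flag values are local copies of the winsvc.h ones, and how the supported mask is worked out for a given Windows version is left to the caller.

```cpp
#include <cstdint>

// Hypothetical sketch; names are illustrative and not part of any real API.
namespace controls_sketch {

// Local copies of the winsvc.h values, so that this compiles on any platform.
constexpr std::uint32_t kAcceptStop          = 0x00000001; // SERVICE_ACCEPT_STOP
constexpr std::uint32_t kAcceptPauseContinue = 0x00000002; // SERVICE_ACCEPT_PAUSE_CONTINUE
constexpr std::uint32_t kAcceptShutdown      = 0x00000004; // SERVICE_ACCEPT_SHUTDOWN
constexpr std::uint32_t kAcceptTriggerEvent  = 0x00000400; // SERVICE_ACCEPT_TRIGGEREVENT

struct FilteredControls
{
   std::uint32_t accepted; // flags that are safe to pass in dwControlsAccepted
   std::uint32_t rejected; // flags that were requested but are not supported
};

// Splits the requested flags into those present in the supported mask and
// those that are not. The supported mask is supplied by the caller; this
// sketch makes no claim about which flags a particular OS version supports.
constexpr FilteredControls FilterControls(
   std::uint32_t requested,
   std::uint32_t supported)
{
   return FilteredControls{ requested & supported, requested & ~supported };
}

} // namespace controls_sketch
```

A caller would log the rejected member if it is non-zero and pass only the accepted member on, so that a debugging configuration such as “all controls” degrades to “all controls this machine understands” rather than making the status update fail.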

I’m in the process of adding some code that checks for this kind of behaviour and prevents it, but for now my client and I will need to wait for another chance to work out what’s actually going wrong… My guess is that it’s a race condition in their deployment scripts, which shut down the old version of the service, remove it, install the new one and start that. This may be coupled with their service failure actions causing the SCM to restart the ‘old’ instance of the service whilst the deployment scripts are changing things.