May your software scale, and other curses...

I’m in the process of upgrading another client to v6.0. These guys write multi-user game systems and have a fairly complex CLR hosting custom application server. Anyway, I was hoping that we’d have a few easy performance wins from the changes that have gone into v6.0 and these gaming guys are possibly even more hung up on performance than my banking and trading clients.

The good news is that the changes in how we convert numbers to strings has drastically improved performance in their debug builds where they have lots of logging going on. The bad news is that performing better when generating log lines from multiple threads that are then written to disk asynchronously has put more strain on the next stage in that particular pipeline. The problem now is very similar to the problem that you have if you don’t bother to use any kind of write completion flow control on your socket connections; you keep pushing overlapped, asynchronous writes into the system and each one of these takes up some resources, if you’re pushing items into the system faster than the system is processing them then you’re using system resources in an uncontrolled manner. Given that the resource in question is ’non paged pool memory’ and that it’s a finite resource this is a Bad Thing…

The problem is that, in the debug build, we have multiple threads generating log messages and firing them off into a queue to be written asynchronously by the file writer class (see here for details). Since the log messages are being produced faster than the file writer and the operating system is able to complete the asynchronous writes the system as a whole starts to eat ’non paged pool’. This can be seen from the NP Pool counter in task manager and from the fact that a ’tail’ of the log file that is being written begins to show that the timestamps on the latest log lines begin to lag behind real time. Once the load on the server eases the log system catches up but during the log message oversupply phase the server can use an unlimited amount of non paged pool and that’s a Bad Thing (though on Vista and Windows Server 2008 it’s less of a bad thing than it used to be (see here for details)…).

A solution to this would be to limit the number of writes that the asynchronous file writer can have pending at any one time. This can either be done before the write is dispatched to the queue or after it is removed from the queue but before the actual write is issued; or both… The problem of applying the limit before the message is dispatched to the queue is that you then effectively remove the asynchronous nature of the log system when the number of pending writes reaches a certain limit. This prevents the number of pending messages from growing out of control by throttling the log message producers… If, instead, you apply the limit to the actual write operations then you have simply moved the resource usage from system controlled ’non paged pool’ to the application memory used to manage the queue… Given that I currently have but a single client who has the potential to generate this problem, and given it takes a particular style of stress testing to get it to happen I think I’ll opt for the pending limit being applied before queue dispatch. This will provide users for which this is a problem with a way of removing the problem at the expense of performance and should anyone then have a performance issue we’ll address that at the time…

As I said over on the Joel On Software forum a while back; “Just remember that getting to the point where you CAN write something that is massively scalable and performs well is only the start of your problems…”