Be careful what you ask for...

As I mentioned in May last year, I was having trouble with a client’s system where a UDP DDOS was causing Windows Server machines to use all available non-paged pool and then blue screen.

The issue could be reproduced with, what I thought at the time was, a minimal test program that simply created a UDP socket, bound it to a port and then didn’t do anything with it. If something sent a stream of datagrams to that port then non-paged pool usage would grow in an uncontrolled manner until the box became unusable and eventually died.

The client’s DDOS issue was resolved upstream from them and we promptly forgot about this strange behaviour. Recently, though, we’ve been doing some load testing and we’ve run into it again, so once I’d worked out what was going on I revisited the tests to make sure the minimal code could still be used to reproduce the problem.

Unfortunately the test program wasn’t quite minimal enough. As well as creating and binding the socket, it also set the socket’s recv buffer size to ‘max int’. Before the Windows Update that first caused the problem this didn’t seem to matter, possibly because the value was being ignored or because Windows stopped buffering when non-paged pool usage got too high. After the Windows Update that brought the issue to my attention, Windows does just what you ask of it: it buffers the data and doesn’t care that it’s using all your non-paged pool whilst it’s doing that… Or, perhaps, non-paged pool wasn’t used by this buffered data before…
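To make the shape of that test program concrete, here is a minimal sketch in Python. It is not the original test code; the function name `open_udp_listener` and its parameters are made up for illustration. It creates a UDP socket, asks for a recv buffer of ‘max int’ by default, binds, and then never reads from the socket.

```python
import socket

INT_MAX = 2**31 - 1  # the 'max int' value the test program asked for


def open_udp_listener(port, recv_buffer_bytes=INT_MAX, host=""):
    """Create a UDP socket, request a recv buffer size, bind, and never read.

    Hypothetical helper for illustration. Returns the socket and the
    buffer size the OS reports back, which may differ from the request.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Ask the OS to buffer up to recv_buffer_bytes of unread payload.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_buffer_bytes)
        sock.bind((host, port))
        # Read the value back to see what the OS says it granted.
        granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    except OSError:
        sock.close()
        raise
    return sock, granted
```

Calling `open_udp_listener(5000)` and then pointing a datagram generator at port 5000 is roughly the scenario described above. Other operating systems may clamp or reject such a large request, so the value read back is worth checking.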

Removing the ‘set recv buffer size to max’ calls seems to fix the problem. Setting the value to something large, but not too large, and then watching the non-paged pool and Microsoft Winsock BSP Dropped Datagrams perfmon counters clearly shows what’s happening. Data is buffered until the buffer space is used up and then datagrams are dropped. The buffered data uses non-paged pool in addition to whatever memory is used for the data itself.

Deciding on what size buffer you need is complicated by the fact that the socket recv buffer size is in bytes and is used up by the payload bytes in the datagrams, whereas the non-paged pool usage appears to be “per datagram”. If the sender is sending large datagrams then the recv buffer space is used up sooner and datagrams begin to be dropped with less non-paged pool memory being used. With very small datagrams many more of them fit in the buffer, so the non-paged pool usage becomes dangerous far sooner.
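A rough back-of-the-envelope model makes the effect visible. This is only a sketch: the per-datagram non-paged pool cost is not something I have measured, so `per_datagram_overhead` below is a placeholder you would need to replace with a figure observed on your own system, and the function name is made up for illustration.

```python
def estimate_pool_usage(recv_buffer_bytes, payload_bytes, per_datagram_overhead):
    """Estimate how many datagrams fit in the recv buffer and the
    non-paged pool they might pin, assuming a fixed cost per datagram.

    per_datagram_overhead is a hypothetical placeholder, not a
    documented Windows figure.
    """
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    # The buffer is consumed by payload bytes, so this many datagrams fit.
    datagrams = recv_buffer_bytes // payload_bytes
    # Pool usage scales with the datagram count, not the byte count.
    return datagrams, datagrams * per_datagram_overhead


ONE_MIB = 1024 * 1024
ASSUMED_OVERHEAD = 256  # placeholder value for illustration only

large = estimate_pool_usage(ONE_MIB, 1024, ASSUMED_OVERHEAD)
small = estimate_pool_usage(ONE_MIB, 16, ASSUMED_OVERHEAD)
print(large)  # (1024, 262144)
print(small)  # (65536, 16777216)
```

With the same 1 MiB buffer, 16-byte datagrams queue up 64 times as many entries as 1024-byte datagrams, and under this model the pool usage grows by the same factor.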

Working out a trade-off that allows for normal usage patterns but protects against a DDOS with small amounts of data may be complicated and may require auto-tuning based on the dropped datagrams and non-paged pool perf counters…
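One possible shape for that auto-tuning, purely as a sketch: sample the two counters periodically and feed them into a function that decides the next recv buffer size. Everything here is an assumption for illustration, including the function name, the thresholds, the doubling and halving steps, and the idea that the caller has already turned the counters into a “datagrams dropped since last sample” count and a “fraction of non-paged pool in use” figure.

```python
def next_recv_buffer_size(current_bytes, dropped_since_last, pool_used_fraction,
                          min_bytes=64 * 1024, max_bytes=8 * 1024 * 1024,
                          pool_high_water=0.75):
    """Suggest the next recv buffer size from two sampled readings.

    All limits and thresholds are hypothetical defaults, not tuned values.
    """
    if pool_used_fraction >= pool_high_water:
        # Protecting the machine wins: shrink even if datagrams are dropping.
        proposed = current_bytes // 2
    elif dropped_since_last > 0:
        # Pool is healthy and we are dropping data: allow more buffering.
        proposed = current_bytes * 2
    else:
        proposed = current_bytes
    # Never move outside the configured bounds.
    return max(min_bytes, min(max_bytes, proposed))
```

The important design choice is the ordering of the checks: pool pressure is tested first, so a flood of tiny datagrams causes the buffer to shrink and datagrams to be dropped rather than the buffer growing to accommodate the attack.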