Lock free or many locks?
I’ve been working on some performance tuning for a client of mine recently and, whilst much of the tuning has been specific to the details of their particular server design, I always eventually get to the point where there are improvements that can be made to the framework itself. It’s nice to have a “real world” performance tuning scenario, as tuning example servers can be pointless when the example doesn’t match a real world workload. I’d already upgraded my client to 6.2 (which is part of the reason that 6.2 has been delayed a little), dealt with some ’low hanging fruit’ and a few situations where the usage model that I originally designed for didn’t match the actual usage model, and we’d achieved some fairly serious performance gains, but my profiler still showed very high contention around a few key locks.
There are two main areas where contention is likely to be a problem with The Server Framework; one is the buffer allocator’s buffer pool and the other is the socket allocator’s socket pool. Both of these pools reduce memory allocation and deallocation, but both introduce a lock which is contended for by many of the threads in the server. The lock isn’t held for very long, but it is acquired and released often, and by many threads.
The buffer allocator is the worst culprit, as almost any thread in the system can request a buffer when performing a socket read or write. In the past I’ve looked at switching the buffer allocator to a lock free algorithm, and we’ve had a lock free buffer allocator, built on the Microsoft SList API, in the framework for a long while now. But the problem with the SList API, and with the majority of lock free algorithms, is that they deal in terms of a simple stack or queue; you can push entries on and pop them off (the SList API gives you a LIFO stack), but you can’t remove an arbitrary entry from the middle of the list.
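For reference, using the SList API as a simple lock free free list looks roughly like this. It’s just a minimal sketch, assuming fixed size buffers that embed the required SLIST_ENTRY as their first member; the BufferNode and LockFreeFreeList names are mine, not the framework’s:

```cpp
#include <windows.h>

// Illustrative only: a fixed size buffer that embeds the SLIST_ENTRY that the
// interlocked singly linked list API needs. Entries (and the list header) must
// be aligned to MEMORY_ALLOCATION_ALIGNMENT, so heap allocate nodes with
// _aligned_malloc() or similar.
struct __declspec(align(MEMORY_ALLOCATION_ALIGNMENT)) BufferNode
{
   SLIST_ENTRY entry;   // first member, so we can cast a popped entry back
   char data[4096];
};

class __declspec(align(MEMORY_ALLOCATION_ALIGNMENT)) LockFreeFreeList
{
   public :

      LockFreeFreeList()
      {
         ::InitializeSListHead(&m_head);
      }

      void Push(BufferNode *pNode)
      {
         ::InterlockedPushEntrySList(&m_head, &pNode->entry);
      }

      BufferNode *Pop()
      {
         // Returns the most recently pushed node (LIFO), or null when empty.
         return reinterpret_cast<BufferNode *>(::InterlockedPopEntrySList(&m_head));
      }

   private :

      SLIST_HEADER m_head;
};
```

Push and pop are the only operations on offer; there’s no way to remove a particular node from the middle of the list, which is exactly what an active list needs to do.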
The pooling allocator design that I use in the server framework uses two lists to maintain its state. There’s a free list which holds the pooled objects and a doubly linked list which is used to maintain an ‘active’ list. Both lists actually use the same implementation, but the free list could be a simple stack or queue whereas the active list can’t be. I use a custom, intrusive, doubly linked list implementation for both the active and free lists, as nodes in the active list need to be removable from anywhere in the list and not just pushed or popped at the head of the list.
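By contrast with the SList sketch above, an intrusive doubly linked list lets a node unlink itself from anywhere in O(1). A minimal sketch of the idea might look like this (the names and details here are mine rather than the framework’s actual implementation):

```cpp
// Illustrative intrusive doubly linked list node; pooled objects derive from
// (or embed) this so that they can be removed from anywhere in the active
// list without a search.
class ListNode
{
   public :

      ListNode() : m_pNext(nullptr), m_pPrev(nullptr) {}

      // Insert this node at the head of the list rooted at pHead.
      void PushFront(ListNode *&pHead)
      {
         m_pPrev = nullptr;
         m_pNext = pHead;

         if (pHead)
         {
            pHead->m_pPrev = this;
         }

         pHead = this;
      }

      // Unlink this node from wherever it currently sits in the list.
      void Remove(ListNode *&pHead)
      {
         if (m_pPrev)
         {
            m_pPrev->m_pNext = m_pNext;
         }
         else if (pHead == this)
         {
            pHead = m_pNext;
         }

         if (m_pNext)
         {
            m_pNext->m_pPrev = m_pPrev;
         }

         m_pNext = m_pPrev = nullptr;
      }

   private :

      ListNode *m_pNext;
      ListNode *m_pPrev;
};
```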
The active list for the buffer allocator isn’t strictly required. It allows us to force a clean up when the allocator is destroyed, which is sometimes nice to have when cleaning up a complex server during shutdown. At least it used to be nice to have; most of the servers that I build these days don’t need it, as there have been numerous improvements in the framework which make forcible clean-up unnecessary - the main one being that my thread pools now call user code to clean up any items that are still inside their queues when they’re shut down. The active list in the socket allocator IS required, as a server needs to be able to walk the list of active sockets and abort connections during shutdown. Since this active list cannot be removed, we can’t use the SList solution for socket allocators.
When I built a lock free buffer allocator using SList I removed the active list from it. The performance improvement of the lock free buffer allocator over the locking allocator was pretty good under test, but didn’t really show up when used in example servers. Since the SList approach wasn’t suitable for replacing the socket allocator as well, and since no clients were complaining about the contention, I decided to wait and see if anyone used the lock free buffer allocator and, if so, what their experiences were…
Every so often the contention issue comes up during pre-release testing, and in the past I have toyed with the idea of having per-thread allocators. The idea is that each thread stores its own allocator in TLS and, as long as the number of allocations and deallocations on each thread is roughly equal over time, the pooling works. Unfortunately allocation and deallocation are rarely balanced per thread, and so some threads are always allocating new buffers whilst others build up huge, mostly unused, pools. This time around I decided to try another approach. Rather than aiming for a lock free allocator to remove the contention I opted for a design where the allocator would usually have a lock available for the thread that wanted to use it…
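The per-thread idea is trivial to express; something like the following sketch (hypothetical names, fixed size buffers, nothing to do with the framework’s actual code). The problem isn’t the code, it’s the behaviour: nothing ever rebalances a pool on a thread that frees more than it allocates.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-thread buffer pool for fixed size buffers; each thread gets
// its own instance via thread_local, so no locking is needed, but buffers
// freed on thread A never become available to thread B.
class ThreadLocalBufferPool
{
   public :

      static const std::size_t BufferSize = 4096;

      void *Allocate()
      {
         if (!m_free.empty())
         {
            void *pBuffer = m_free.back();
            m_free.pop_back();
            return pBuffer;
         }

         return ::operator new(BufferSize);
      }

      void Release(void *pBuffer)
      {
         // Grows without bound if this thread frees more than it allocates...
         m_free.push_back(pBuffer);
      }

   private :

      std::vector<void *> m_free;
};

// One pool per thread; no contention, but no rebalancing between threads either.
thread_local ThreadLocalBufferPool tlsBufferPool;
```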
The idea is that rather than having a single pool of objects I have multiple pools and an index to the next pool that should be used. As allocations occur we step through the pools and allocate from each in turn. Since multiple threads may be allocating at the same time, the pool that we try to use may already be in use; if it is we simply step on to the next pool and try that. If we loop through all of the available pools and find that all of them are in use then we block on the pool that we originally tried. This approach reduces contention and increases throughput quite nicely. It’s not as performant as the lock free version of the allocator when there are fewer threads than cores, but it becomes more performant when there are more threads than cores and more allocator pools than threads…
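Here’s a rough sketch of the approach, assuming fixed size buffers and std::mutex try-locks rather than whatever pooling and locking primitives the framework actually uses; the class and member names are purely illustrative:

```cpp
#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative "low contention" pooling allocator: several independently
// locked pools, with allocations (and deallocations) spread across them.
class LowContentionBufferAllocator
{
   public :

      LowContentionBufferAllocator(std::size_t numPools, std::size_t bufferSize)
         :  m_pools(numPools),
            m_bufferSize(bufferSize),
            m_nextPool(0)
      {
      }

      void *Allocate()
      {
         // Pick a starting pool and step through the others if it's busy.
         const std::size_t first = NextIndex();

         for (std::size_t i = 0; i < m_pools.size(); ++i)
         {
            Pool &pool = m_pools[(first + i) % m_pools.size()];

            std::unique_lock<std::mutex> lock(pool.mutex, std::try_to_lock);

            if (lock.owns_lock())
            {
               return AllocateFrom(pool);
            }
         }

         // Everything was busy; block on the pool that we originally tried.
         std::lock_guard<std::mutex> lock(m_pools[first].mutex);

         return AllocateFrom(m_pools[first]);
      }

      void Release(void *pBuffer)
      {
         // With no 'active list' requirement we can release to any pool,
         // spreading deallocations in the same way as allocations.
         const std::size_t first = NextIndex();

         for (std::size_t i = 0; i < m_pools.size(); ++i)
         {
            Pool &pool = m_pools[(first + i) % m_pools.size()];

            std::unique_lock<std::mutex> lock(pool.mutex, std::try_to_lock);

            if (lock.owns_lock())
            {
               pool.freeList.push_back(pBuffer);
               return;
            }
         }

         std::lock_guard<std::mutex> lock(m_pools[first].mutex);

         m_pools[first].freeList.push_back(pBuffer);
      }

   private :

      struct Pool
      {
         std::mutex mutex;
         std::vector<void *> freeList;
      };

      std::size_t NextIndex()
      {
         return m_nextPool.fetch_add(1, std::memory_order_relaxed) % m_pools.size();
      }

      void *AllocateFrom(Pool &pool)
      {
         if (!pool.freeList.empty())
         {
            void *pBuffer = pool.freeList.back();
            pool.freeList.pop_back();
            return pBuffer;
         }

         return ::operator new(m_bufferSize);
      }

      std::vector<Pool> m_pools;
      const std::size_t m_bufferSize;
      std::atomic<std::size_t> m_nextPool;
};
```

The round-robin index simply spreads the load; correctness doesn’t depend on it, since any pool can satisfy any request.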
The design is somewhat resource heavy - we have X pools, each with its own lock - but it works considerably better than the pool per thread approach. The pools are still shared between threads, but since there are more of them and any of them can be used there is less contention. If we remove the requirement for an active list then performance increases even more, as we can also spread deallocations across the pools in the same way, rather than requiring each deallocation to go back to the pool from which the object was allocated.
So far I’ve only implemented a ’low contention’ allocator as a buffer allocator, and whilst it’s possible to tune it so that it outperforms the lock free allocator, it does need manual tuning depending on the number of threads that are going to be accessing it. However, since this is the only option for implementing a socket allocator with an “active list”, I expect I’ll do that and run some tests soon.
I’ve yet to integrate these changes with my client’s code, but it’s something that I can try next time they need a bit of a performance boost, or when 6.3 is released…