In the post ‘Scaling Up with Task Parallel Library’ we reviewed the way in which the TPL (Task Parallel Library) library facilitates the process of adding concurrency to applications. One of TPL’s features that I thought worth a separate discussion is that it enables developers to manually tune the amount of threads (concurrency level) that will be used to serve a given set of tasks.
Manual Concurrency Control?
Relaying on TPL to determine the concurrently levels is probably the best idea for most applications. As for now, TPL’s default configuration is to set the concurrency levels to one thread per processor while in cases of low CPU utilization it may take the liberty to add threads as appropriate. This scheme is often a good solution since having two or more threads running on the same processor result in time slicing overhead which can drastically harm performance.
The primary overheads of time slicing are 1) saving register state of a thread when suspending it, and restoring the state when resuming it, 2) restoring a thread's cache state, 3) and thrashing virtual memory.
However, the thread per processor scheme may result in low CPU utilization since a running thread may be interrupted (e.g by a mouse event) or blocked (e.g while waiting for I/O operation to complete) for a while, where while blocking the matching processor doesn’t do any actual work.
Since most of the tasks in the application do block some of the time (some for long periods), it’s safe to say that appropriate concurrency levels tuning leads to improvements in performance. The only problem is that it can be really hard to calculate the “appropriate” number of threads which will yield optimal performances, while miss calculating can lead to degradation in performance and unexpected behavior, one should take into account that high concurrency levels can result in excessive context switching or resource contention and that low concurrency levels can result in low CPU utilization.
Putting All the Cores To Work - All the Time - With Minimum Contentions
The following figure (from MSDN) shows throughput (number of completed work items per second, while each work item has 100ms of execution time with 10ms of CPU usage and 90ms of waiting) as a function of concurrency level (number of threads) running on a dual-core 2GHz machine.
As you can see in the figure above, throughput peaks at concurrency level of 20 threads and degrades when concurrency level exceeds 25 threads while the degradation is mainly due to context switching. Clearly we can’t conclude that 20 is the right amount of threads for a dual-core machine, as in reality most work items don't wait 90% of the time. However, it’s pretty clear that we should consider the percentage of time that tasks will block in order to correctly tune the concurrency level – if high throughput is indeed our primary goal.
Since the TPL default policy is to use one thread per processor, we can conclude that TPL initially assumes that the workload of a task is ~100% working and 0% waiting, and if the initial assumption fails and the task enters a waiting state (i.e. starts blocking) - TPL with take the liberty to add threads as appropriate.
Calculating the ‘Appropriate’ Amount of Threads
The following simplified formula calculates the “ideal” number of threads according to the number of CPUs on the machine and the percentage of time that the tasks will block.
NumThreads = NumCPUs / (1 – BlockingTimeInPercentage)
Although we can’t really calculate the actual blocking time, this formula can give us a clue about the number of threads that will produce the best performance.
Calculating Maximum Speedup
Another way of addressing the problem is by calculating the maximum number of tasks that can run in parallel before beginning to see degradation in performance rather than a continued speedup. Amdahl's Law suggests the following equation (S is the percentage of the work that cannot be parallelized):
SppedUp = 1/(S + ((1 – S)/NumCPUs))
Concurrency Control in TPL (C# .NET Source Code)
In the code snippets bellow the ‘idea threads per processor’ is set to 2 yielding concurrency level of 4 threads for dual-core machine. This means that the TPL will ideally use 4 threads to parallel the 3 tasks that were assigned with the scheduler/parallel-options. However, TPL will exceed 4 threads if some of the tasks are blocking in order to keep the CPU utilization high.