Concurrency-managed workqueues
Of the mechanisms listed above, the most commonly used by far is workqueues. A workqueue makes it easy for code to set aside work to be done in process context at a future time, but workqueues are not without their problems. There is a shared workqueue that all can use, but one long-running task can create indefinite delays for others, so few developers take advantage of it. Instead, the kernel has filled with subsystem-specific workqueues, each of which contributes to the surfeit of kernel threads running on contemporary systems. Workqueue threads contend with each other for the CPU, causing more context switches than are really necessary. It's discouragingly easy to create deadlocks with workqueues when one task depends on work done by another. All told, workqueues - despite a couple of major rewrites already - are in need of a bit of a face lift.
Tejun Heo has provided that face lift in the form of his concurrency managed workqueues patch. This 19-part series massively reworks the workqueue code, addressing the shortcomings of the current workqueue subsystem. This effort is clearly aimed at replacing the other thread pool implementations in the kernel too, though that work is left for a later date.
Current workqueues have dedicated threads associated with them - a single thread in some cases, one thread per CPU in others. The new workqueues do away with that; there are no threads dedicated to any specific workqueue. Instead, there is a global pool of threads attached to each CPU in the system. When a work item is enqueued, it will be passed to one of the global threads at the right time (as deemed by the workqueue code). One interesting implication of this change is that tasks submitted to the same workqueue on the same CPU may now execute concurrently - something which does not happen with current workqueues.
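For reference, work items are created and submitted through the long-standing workqueue interface; under the new scheme, the same calls would feed the shared per-CPU pool rather than a dedicated thread. A minimal sketch of that interface (the handler name is invented for illustration):

    #include <linux/workqueue.h>

    /* process-context callback; it is allowed to sleep */
    static void defer_fn(struct work_struct *work)
    {
            /* ... the deferred work itself ... */
    }

    static DECLARE_WORK(defer_work, defer_fn);

    static void kick_deferred_work(void)
    {
            /* hand the item to the default queue for later execution */
            schedule_work(&defer_work);
    }

The submission interface itself is not the target of the rework; what changes is which thread eventually runs defer_fn().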
One of the key features of the new code is its ability to manage concurrency in general. The simplest approach would be to run every workqueue task concurrently as soon as it is submitted, but doing things that way would yield poor results; those tasks would simply contend with each other, causing more context switches, worse cache behavior, and generally worse performance. What's really needed is a way to run exactly one workqueue task at a time (avoiding contention), but to switch immediately to another if that task blocks for any reason (avoiding processor idle time). Doing this job correctly requires that the workqueue manager become a sort of special-purpose scheduler.
As it happens, that's just how Tejun has implemented it. The workqueue patch adds a new scheduler class which behaves very much like the normal fair scheduler class. The workqueue class adds a couple of hooks which call back into the workqueue code whenever a task running under that class transitions between the blocked and runnable states. When the first workqueue task is submitted, a thread running under the workqueue scheduler class is created to execute it. As long as that task continues to run, other tasks will wait. But as soon as the running task blocks on some resource, the scheduler will notify the workqueue code and another thread will be created to run the next task. The workqueue manager will create as many threads as needed (up to a limit) to keep the CPU busy, but it tries to only have one task actually running at any given time.
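In rough outline, the mechanism might look like the sketch below. This is illustrative only, with hypothetical names (worker_pool, wake_or_create_worker), not code from the patch: each per-CPU pool counts its runnable workers, and more concurrency is added only when that count falls to zero while work is still queued.

    #include <linux/atomic.h>
    #include <linux/list.h>

    struct worker_pool {
            atomic_t         nr_running;  /* workers currently runnable */
            struct list_head pending;     /* work items awaiting a worker */
    };

    /* scheduler-class hook: a worker just blocked on some resource */
    static void wq_worker_sleeping(struct worker_pool *pool)
    {
            /* last runnable worker gone, work still queued: bring in another */
            if (atomic_dec_and_test(&pool->nr_running) &&
                !list_empty(&pool->pending))
                    wake_or_create_worker(pool);    /* hypothetical helper */
    }

    /* scheduler-class hook: a blocked worker became runnable again */
    static void wq_worker_waking_up(struct worker_pool *pool)
    {
            atomic_inc(&pool->nr_running);
    }

That is the "special-purpose scheduler" behavior in miniature: concurrency is added only at the moment the running worker stops making progress.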
Also new with Tejun's patch is the concept of "rescuer" threads. In a tightly resource-constrained system, it may become impossible to create new worker threads. But any existing threads may be waiting for the results of other tasks which have not yet been executed. In that situation, everything will stop cold. To deal with this problem, some special "rescuer" threads are kept around. If attempts to create new workers fail for a period of time, the rescuers will be summoned to execute tasks and, hopefully, clear the logjam.
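Conceptually, the trigger is simple; the following is a hypothetical sketch (CREATE_TIMEOUT, first_create_failure, and the helper functions are invented names), not the patch's logic:

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    #define CREATE_TIMEOUT  (HZ / 10)   /* hypothetical failure-tolerance window */

    /* hypothetical: called when the pool wants another worker */
    static void pool_add_worker(struct worker_pool *pool)
    {
            if (create_worker(pool))    /* kthread creation can fail... */
                    return;             /* ...under memory pressure */

            /* creation has been failing for a while: summon the rescuer */
            if (time_after(jiffies, pool->first_create_failure + CREATE_TIMEOUT))
                    wake_up_process(pool->rescuer);
    }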
The handling of CPU hotplugging is interesting. If a CPU is being taken offline, the system needs to move all work off that CPU as quickly as possible. To that end, the workqueue manager responds to a hot-unplug notification by creating a special "trustee" manager on a CPU which is sticking around. That trustee takes over responsibility for the workqueue running on the doomed CPU, executing tasks until they are all gone and the workqueue can be shut down. Meanwhile, the CPU can go offline without waiting for the workqueue to drain.
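In sketch form, using the CPU-hotplug notifier interface of that era (start_trustee is an invented name standing in for the trustee machinery):

    #include <linux/cpu.h>
    #include <linux/notifier.h>

    /* hypothetical: invoked from the CPU-hotplug notifier chain */
    static int wq_cpu_callback(struct notifier_block *nb,
                               unsigned long action, void *hcpu)
    {
            long cpu = (long)hcpu;

            if (action == CPU_DOWN_PREPARE)
                    start_trustee(cpu); /* drain cpu's queue from elsewhere */
            return NOTIFY_OK;
    }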
These patches were generally welcomed, but there were some concerns expressed. The biggest complaint related to the special-purpose scheduling class. The hooks were described as (1) not really scheduler-related, and (2) potentially interesting beyond the workqueue code. For example, Linus suggested that this kind of hook could be used to implement the big kernel lock semantics, releasing the lock when a process sleeps and reacquiring it on wakeup. The scheduler class will probably go away in the next version of the patch; what remains to be seen is what will replace it.
One idea which was suggested was to use the preemption notifier hooks which are already in the kernel. These notifiers would have to become mandatory, and some new callbacks would be required. Another possibility would be to give in to the inevitable future in which perf events take over the entire kernel. Tracepoints are designed to provide callbacks at specific points in the kernel; some already exist for most of the interesting scheduler events. Using them in this context would mostly be a matter of streamlining the perf events mechanism to handle this task efficiently.
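Those preemption notifiers already exist, having been added for KVM, though they are compiled in only when CONFIG_PREEMPT_NOTIFIERS is set; that is why they would have to become mandatory for this use. A sketch of how a worker might use them (struct worker and its notifier field are invented here); note that sched_out fires on any preemption, not just on blocking, which is why new callbacks would be needed:

    #include <linux/preempt.h>
    #include <linux/sched.h>

    static void worker_in(struct preempt_notifier *n, int cpu)
    {
            /* our worker is back on the CPU */
    }

    static void worker_out(struct preempt_notifier *n, struct task_struct *next)
    {
            /* scheduled out -- but this could be preemption, not blocking */
    }

    static struct preempt_ops worker_preempt_ops = {
            .sched_in  = worker_in,
            .sched_out = worker_out,
    };

    /* at worker-thread setup time ('struct worker' is hypothetical): */
    static void worker_register_notifier(struct worker *worker)
    {
            preempt_notifier_init(&worker->notifier, &worker_preempt_ops);
            preempt_notifier_register(&worker->notifier);
    }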
Andrew Morton was concerned that the new code would take away the ability for a specific workqueue user to modify its worker tasks - changing their priority, say, or having them run under a different UID. It turns out that, so far, only a couple of workqueues have been modified in this way. The workqueue used by stop_machine() puts its worker threads into the realtime scheduling class, allowing them to monopolize the processors when needed; Tejun simply replaced that workqueue with a set of dedicated kernel threads. The ACPI code had bound a workqueue thread to CPU 0 because some operations corrupt the system if run anywhere else; that case is easily handled with the existing schedule_work_on() function. So it seems that, for now at least, there is no need for non-default worker threads.
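In other words, the ACPI requirement can be met with a single call to an interface that already exists; something like the following (the handler is invented for illustration):

    #include <linux/workqueue.h>

    static void acpi_deferred_fn(struct work_struct *work)
    {
            /* firmware-facing work that must run on the boot processor */
    }

    static DECLARE_WORK(acpi_deferred, acpi_deferred_fn);

    static void acpi_defer(void)
    {
            schedule_work_on(0, &acpi_deferred);  /* pin execution to CPU 0 */
    }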
One remaining issue is that some subsystems use single-threaded workqueues as a sort of synchronization mechanism; they expect tasks to complete in the same order they were submitted. Global thread pools change that behavior; Tejun has not yet said how he will solve that problem.
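The pattern at risk looks something like this (names invented): a subsystem creates a single-threaded workqueue, then quietly relies on the fact that a single thread can only run one item at a time, in submission order. With a shared pool, the two items below could run concurrently, breaking the implicit ordering.

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *commit_wq;

    static int commit_init(void)
    {
            /* one thread => items run one at a time, in queueing order */
            commit_wq = create_singlethread_workqueue("commit");
            return commit_wq ? 0 : -ENOMEM;
    }

    static void submit_steps(struct work_struct *prepare,
                             struct work_struct *commit)
    {
            queue_work(commit_wq, prepare); /* assumed to finish first... */
            queue_work(commit_wq, commit);  /* ...before this one starts */
    }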
It almost certainly will be solved, along with the other concerns. David Howells, the creator of the slow work subsystem, thinks that the new workqueues could be a good replacement. In summary, this change looks likely to be accepted, perhaps as early as 2.6.33. Then we might finally have a single thread pool in the kernel.
Index entries for this article
Kernel: Kernel threads
Kernel: Workqueues
Comments

Concurrency-managed workqueues
Posted Oct 8, 2009 14:16 UTC (Thu) by nix (subscriber, #2304)

"The ACPI code had bound a workqueue thread to CPU 0 because some operations corrupt the system if run anywhere else"

Is this just BIOSes being their usual malevolently incompetent selves, or is there a rational reason for this requirement? Because my first impression when reading this was 'WTF WTF WTF'...

Concurrency-managed workqueues
Posted Oct 9, 2009 10:02 UTC (Fri) by nix (subscriber, #2304)

(ACPI triggering SMI. What a nice way to take a kernel-controls-all VM executor and throw you into the undefined-behaviour swamp again. Sigh.)

"Simply Malevolently Incompetent"?
Posted Oct 19, 2009 0:13 UTC (Mon) by vonbrand (guest, #4458)

Concurrency-managed workqueues
Posted Oct 8, 2009 23:50 UTC (Thu) by giraffedata (guest, #1954)

I don't understand why there has to be a pool of threads. Why not just make a new thread for each task and let the CPU scheduler do its job?

Grand Central Dispatch comparison/likeness?
Posted Oct 11, 2009 20:28 UTC (Sun) by Zenith (guest, #24899)

If yes/no, in what ways do they differ? I seem to recall having read a great deal of appraisal for the approach Apple has taken, so I am hoping that the Linux implementation has copied the best ideas.

Grand Central Dispatch comparison/likeness?
Posted Oct 20, 2009 9:10 UTC (Tue) by njs (guest, #40338)

(Though sure, "blocks" *are* neat.)

Cooperative multitasker?
Posted Oct 15, 2009 13:22 UTC (Thu) by forthy (guest, #1525)

The specification "only one thread running, switching when blocking" looks more like what's required here is a cooperative multitasker, not a "new scheduler class". A cooperative multitasker is much simpler than a scheduler class. And now the Linux kernel developers plug a cooperative multitasker on top of their already quite complicated scheduler... Duh. No wonder there's accumulated bloat in the kernel.

Concurrency-managed workqueues
Posted Aug 18, 2010 23:35 UTC (Wed) by Lennie (subscriber, #49641)

"In summary, this change looks likely to be accepted, perhaps as early as 2.6.33. Then we might finally have a single thread pool in the kernel."

It was finally accepted in 2.6.36 (August 2010): http://lwn.net/Articles/399052/

As usual it takes a bit more time. ;-)