Concurrency-managed workqueues
Of the mechanisms listed above, the most commonly used by far is workqueues. A workqueue makes it easy for code to set aside work to be done in process context at a future time, but workqueues are not without their problems. There is a shared workqueue that all can use, but one long-running task can create indefinite delays for others, so few developers take advantage of it. Instead, the kernel has filled with subsystem-specific workqueues, each of which contributes to the surfeit of kernel threads running on contemporary systems. Workqueue threads contend with each other for the CPU, causing more context switches than are really necessary. It's discouragingly easy to create deadlocks with workqueues when one task depends on work done by another. All told, workqueues - despite a couple of major rewrites already - are in need of a bit of a face lift.
Tejun Heo has provided that face lift in the form of his concurrency managed workqueues patch. This 19-part series massively reworks the workqueue code, addressing the shortcomings of the current workqueue subsystem. This effort is clearly aimed at replacing the other thread pool implementations in the kernel too, though that work is left for a later date.
Current workqueues have dedicated threads associated with them - a single thread in some cases, one thread per CPU in others. The new workqueues do away with that; there are no threads dedicated to any specific workqueue. Instead, there is a global pool of threads attached to each CPU in the system. When a work item is enqueued, it will be passed to one of the global threads at the right time (as deemed by the workqueue code). One interesting implication of this change is that tasks submitted to the same workqueue on the same CPU may now execute concurrently - something which does not happen with current workqueues.
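For reference, work items are created and submitted through the long-standing workqueue interface; under the new scheme, the same calls would feed the shared per-CPU pool rather than a dedicated thread. A minimal sketch of that interface (the handler name is invented for illustration):

    #include <linux/workqueue.h>

    /* process-context callback; it is allowed to sleep */
    static void defer_fn(struct work_struct *work)
    {
            /* ... the deferred work itself ... */
    }

    static DECLARE_WORK(defer_work, defer_fn);

    static void kick_deferred_work(void)
    {
            /* hand the item to the default queue for later execution */
            schedule_work(&defer_work);
    }

The submission interface itself is not the target of the rework; what changes is which thread eventually runs defer_fn().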
One of the key features of the new code is its ability to manage concurrency in general. The simplest approach would be to run every workqueue task concurrently as soon as it is submitted, but doing things that way would yield poor results; those tasks would simply contend with each other, causing more context switches, worse cache behavior, and generally worse performance. What's really needed is a way to run exactly one workqueue task at a time (avoiding contention), but to switch immediately to another if that task blocks for any reason (avoiding processor idle time). Doing this job correctly requires that the workqueue manager become a sort of special-purpose scheduler.
As it happens, that's just how Tejun has implemented it. The workqueue patch adds a new scheduler class which behaves very much like the normal fair scheduler class. The workqueue class adds a couple of hooks which call back into the workqueue code whenever a task running under that class transitions between the blocked and runnable states. When the first workqueue task is submitted, a thread running under the workqueue scheduler class is created to execute it. As long as that task continues to run, other tasks will wait. But as soon as the running task blocks on some resource, the scheduler will notify the workqueue code and another thread will be created to run the next task. The workqueue manager will create as many threads as needed (up to a limit) to keep the CPU busy, but it tries to only have one task actually running at any given time.
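In rough outline, the mechanism might look like the sketch below. This is illustrative only, with hypothetical names (worker_pool, wake_or_create_worker), not code from the patch: each per-CPU pool counts its runnable workers, and more concurrency is added only when that count falls to zero while work is still queued.

    #include <linux/atomic.h>
    #include <linux/list.h>

    struct worker_pool {
            atomic_t         nr_running;  /* workers currently runnable */
            struct list_head pending;     /* work items awaiting a worker */
    };

    /* scheduler-class hook: a worker just blocked on some resource */
    static void wq_worker_sleeping(struct worker_pool *pool)
    {
            /* last runnable worker gone, work still queued: bring in another */
            if (atomic_dec_and_test(&pool->nr_running) &&
                !list_empty(&pool->pending))
                    wake_or_create_worker(pool);    /* hypothetical helper */
    }

    /* scheduler-class hook: a blocked worker became runnable again */
    static void wq_worker_waking_up(struct worker_pool *pool)
    {
            atomic_inc(&pool->nr_running);
    }

That is the "special-purpose scheduler" behavior in miniature: concurrency is added only at the moment the running worker stops making progress.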
Also new with Tejun's patch is the concept of "rescuer" threads. In a tightly resource-constrained system, it may become impossible to create new worker threads. But any existing threads may be waiting for the results of other tasks which have not yet been executed. In that situation, everything will stop cold. To deal with this problem, some special "rescuer" threads are kept around. If attempts to create new workers fail for a period of time, the rescuers will be summoned to execute tasks and, hopefully, clear the logjam.
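Conceptually, the trigger is simple; the following is a hypothetical sketch (CREATE_TIMEOUT, first_create_failure, and the helper functions are invented names), not the patch's logic:

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    #define CREATE_TIMEOUT  (HZ / 10)   /* hypothetical failure-tolerance window */

    /* hypothetical: called when the pool wants another worker */
    static void pool_add_worker(struct worker_pool *pool)
    {
            if (create_worker(pool))    /* kthread creation can fail... */
                    return;             /* ...under memory pressure */

            /* creation has been failing for a while: summon the rescuer */
            if (time_after(jiffies, pool->first_create_failure + CREATE_TIMEOUT))
                    wake_up_process(pool->rescuer);
    }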
The handling of CPU hotplugging is interesting. If a CPU is being taken offline, the system needs to move all work off that CPU as quickly as possible. To that end, the workqueue manager responds to a hot-unplug notification by creating a special "trustee" manager on a CPU which is sticking around. That trustee takes over responsibility for the workqueue running on the doomed CPU, executing tasks until they are all gone and the workqueue can be shut down. Meanwhile, the CPU can go offline without waiting for the workqueue to drain.
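In sketch form, using the CPU-hotplug notifier interface of that era (start_trustee is an invented name standing in for the trustee machinery):

    #include <linux/cpu.h>
    #include <linux/notifier.h>

    /* hypothetical: invoked from the CPU-hotplug notifier chain */
    static int wq_cpu_callback(struct notifier_block *nb,
                               unsigned long action, void *hcpu)
    {
            long cpu = (long)hcpu;

            if (action == CPU_DOWN_PREPARE)
                    start_trustee(cpu); /* drain cpu's queue from elsewhere */
            return NOTIFY_OK;
    }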
These patches were generally welcomed, but there were some concerns expressed. The biggest complaint related to the special-purpose scheduling class. The hooks were described as (1) not really scheduler-related, and (2) potentially interesting beyond the workqueue code. For example, Linus suggested that this kind of hook could be used to implement the big kernel lock semantics, releasing the lock when a process sleeps and reacquiring it on wakeup. The scheduler class will probably go away in the next version of the patch; what remains to be seen is what will replace it.
One idea which was suggested was to use the preemption notifier hooks which are already in the kernel. These notifiers would have to become mandatory, and some new callbacks would be required. Another possibility would be to give in to the inevitable future in which perf events take over the entire kernel. Tracepoints are designed to provide callbacks at specific points in the kernel; some already exist for most of the interesting scheduler events. Using them in this context would mostly be a matter of streamlining the perf events mechanism to handle this task efficiently.
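Those preemption notifiers already exist, having been added for KVM, though they are compiled in only when CONFIG_PREEMPT_NOTIFIERS is set; that is why they would have to become mandatory for this use. A sketch of how a worker might use them (struct worker and its notifier field are invented here); note that sched_out fires on any preemption, not just on blocking, which is why new callbacks would be needed:

    #include <linux/preempt.h>
    #include <linux/sched.h>

    static void worker_in(struct preempt_notifier *n, int cpu)
    {
            /* our worker is back on the CPU */
    }

    static void worker_out(struct preempt_notifier *n, struct task_struct *next)
    {
            /* scheduled out -- but this could be preemption, not blocking */
    }

    static struct preempt_ops worker_preempt_ops = {
            .sched_in  = worker_in,
            .sched_out = worker_out,
    };

    /* at worker-thread setup time ('struct worker' is hypothetical): */
    static void worker_register_notifier(struct worker *worker)
    {
            preempt_notifier_init(&worker->notifier, &worker_preempt_ops);
            preempt_notifier_register(&worker->notifier);
    }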
Andrew Morton was concerned that the new code would take away the ability for a specific workqueue user to modify its worker tasks - changing their priority, say, or having them run under a different UID. It turns out that, so far, only a couple of workqueues have been modified in this way. The workqueue used by stop_machine() puts its worker threads into the realtime scheduling class, allowing them to monopolize the processors when needed; Tejun simply replaced that workqueue with a set of dedicated kernel threads. The ACPI code had bound a workqueue thread to CPU 0 because some operations corrupt the system if run anywhere else; that case is easily handled with the existing schedule_work_on() function. So it seems that, for now at least, there is no need for non-default worker threads.
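In other words, the ACPI requirement can be met with a single call to an interface that already exists; something like the following (the handler is invented for illustration):

    #include <linux/workqueue.h>

    static void acpi_deferred_fn(struct work_struct *work)
    {
            /* firmware-facing work that must run on the boot processor */
    }

    static DECLARE_WORK(acpi_deferred, acpi_deferred_fn);

    static void acpi_defer(void)
    {
            schedule_work_on(0, &acpi_deferred);  /* pin execution to CPU 0 */
    }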
One remaining issue is that some subsystems use single-threaded workqueues as a sort of synchronization mechanism; they expect tasks to complete in the same order they were submitted. Global thread pools change that behavior; Tejun has not yet said how he will solve that problem.
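The pattern at risk looks something like this (names invented): a subsystem creates a single-threaded workqueue, then quietly relies on the fact that a single thread can only run one item at a time, in submission order. With a shared pool, the two items below could run concurrently, breaking the implicit ordering.

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *commit_wq;

    static int commit_init(void)
    {
            /* one thread => items run one at a time, in queueing order */
            commit_wq = create_singlethread_workqueue("commit");
            return commit_wq ? 0 : -ENOMEM;
    }

    static void submit_steps(struct work_struct *prepare,
                             struct work_struct *commit)
    {
            queue_work(commit_wq, prepare); /* assumed to finish first... */
            queue_work(commit_wq, commit);  /* ...before this one starts */
    }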
It almost certainly will be solved, along with the other concerns. David Howells, the creator of the slow work subsystem, thinks that the new workqueues could be a good replacement. In summary, this change looks likely to be accepted, perhaps as early as 2.6.33. Then we might finally have a single thread pool in the kernel.
Index entries for this article
Kernel: Kernel threads
Kernel: Workqueues
Comments

Concurrency-managed workqueues
Posted Oct 8, 2009 14:16 UTC (Thu) by nix (subscriber, #2304)

"The ACPI code had bound a workqueue thread to CPU 0 because some operations corrupt the system if run anywhere else"

Is this just BIOSes being their usual malevolently incompetent selves, or is there a rational reason for this requirement? Because my first impression when reading this was 'WTF WTF WTF'...

Concurrency-managed workqueues
Posted Oct 9, 2009 10:02 UTC (Fri) by nix (subscriber, #2304)

(ACPI triggering SMI. What a nice way to take a kernel-controls-all VM executor and throw you into the undefined-behaviour swamp again. Sigh.)

"Simply Malevolently Incompetent"?
Posted Oct 19, 2009 0:13 UTC (Mon) by vonbrand (guest, #4458)

Concurrency-managed workqueues
Posted Oct 8, 2009 23:50 UTC (Thu) by giraffedata (guest, #1954)

I don't understand why there has to be a pool of threads. Why not just make a new thread for each task and let the CPU scheduler do its job?

Grand Central Dispatch comparison/likeness?
Posted Oct 11, 2009 20:28 UTC (Sun) by Zenith (guest, #24899)

If yes/no, in what ways do they differ? I seem to recall having read a great deal of appraisal for the approach Apple has taken, so I am hoping that the Linux implementation has copied the best ideas.

Grand Central Dispatch comparison/likeness?
Posted Oct 20, 2009 9:10 UTC (Tue) by njs (guest, #40338)

(Though sure, "blocks" *are* neat.)

Cooperative multitasker?
Posted Oct 15, 2009 13:22 UTC (Thu) by forthy (guest, #1525)

The specification "only one thread running, switching when blocking" looks more like what's required here is a cooperative multitasker, not a "new scheduler class". A cooperative multitasker is much simpler than a scheduler class. And now the Linux kernel developers plug a cooperative multitasker on top of their already quite complicated scheduler... Duh. No wonder there's accumulated bloat in the kernel.

Concurrency-managed workqueues
Posted Aug 18, 2010 23:35 UTC (Wed) by Lennie (subscriber, #49641)

"In summary, this change looks likely to be accepted, perhaps as early as 2.6.33. Then we might finally have a single thread pool in the kernel."

It was finally accepted in 2.6.36 (August 2010): http://lwn.net/Articles/399052/

As usual it takes a bit more time. ;-)