Hierarchical group I/O scheduling

By Jonathan Corbet
February 15, 2011

There has recently been much attention paid to the group CPU scheduling feature built into the Linux kernel. Using group scheduling, it is possible to ensure that some groups of processes get a fair share of the CPU without being crowded out by a rather larger number of CPU-intensive processes in a different group. Linux has supported this feature for some years, but it has languished in relative obscurity; it is only with recent efforts to make group scheduling "just work" that it has started to come into wider use. As it happens, the kernel has a very similar feature for managing access to block I/O devices which is also, arguably, underused. In this case, though, I/O group scheduling is not as completely implemented as CPU scheduling, but some ongoing work may change that situation.

The "completely fair queueing" (CFQ) I/O scheduler tries to divide the available bandwidth on any given device fairly between the processes which are contending for that device. "Bandwidth" is measured not in the number of bytes transferred, but the amount of time that each process gets to submit requests to the queue; in this way, the code tries to penalize [Group hierarchy] processes which create seek-heavy I/O patterns. (There is also a mode based solely on the number of I/O operations submitted, but your editor suspects it sees relatively little use). The CFQ scheduler also supports group scheduling, but in an incomplete way.

Imagine the group hierarchy shown on the right; here we have three control groups (plus the default root group), and four processes running within those groups. If every process were contending fully for the available I/O bandwidth, and they all had the same I/O priority, one would expect that bandwidth to be split equally between P0, Group1, and Group2; thus P0 should get twice as much I/O bandwidth as either P1 or P3. If more processes were to be added to the root, they should be able to take I/O bandwidth at the expense of the processes in the other control groups. Similarly, the creation of new control groups underneath Group1 should not affect anybody outside of that branch of the hierarchy. In current kernels, though, that is not how things work.

With the current implementation of CFQ group scheduling, the above hierarchy is transformed into something that looks like this:

The CFQ group scheduler currently treats all groups - including the root group - as being equal, at the same level in the hierarchy. Every group is a top-level group. This level of grouping will be adequate for a number of situations, but there will be other users who want the full hierarchical model. That is why control groups were made to be hierarchical in the first place, after all.

The hierarchical CFQ group scheduling patch set from Gui Jianfeng aims to make that feature available. These patches introduce a new cfq_entity structure which is used for the scheduling of both processes and groups; it is clearly modeled after the sched_entity structure used in the CPU scheduling code. With this in place, the I/O scheduler can just give bandwidth to the top-level cfq_entity which has run up the least "vdisktime" so far; if that entity happens to be a group, the scheduling code drops down a level and repeats the process. Sooner or later, the entity which is scheduled for I/O will be an actual process, and the scheduler can start dispatching I/O requests.

This patch set is on its fourth revision; the previous iterations have led to significant changes. It appears that there are a few things to fix up still, but this work seems to be getting closer to being ready.

One thing is worth bearing in mind: there are two I/O bandwidth controllers in contemporary Linux kernels: the proportional bandwidth controller (built into the CFQ scheduler) and the throttling controller built into the block layer. The group scheduling changes only apply to the proportional bandwidth controller. Arguably there is less need for full group scheduling with the throttling controller, which puts absolute caps on the bandwidth available to specific processes.

Controlling I/O bandwidth has a lot of applications; providing some isolation between customers on a shared hosting service is an obvious example. But this feature may yet prove to have value on the desktop as well; many interactivity problems come down to contention for I/O bandwidth. Anybody who has tried to start an office suite while simultaneously copying a video image on the same drive understands how bad it can be. If the group I/O scheduling feature can be made to "just work" like the group CPU scheduling, we may have made another step toward a truly responsive Linux desktop.

Index entries for this article

Kernel Block layer/I/O scheduling

Index entries for this article
Kernel	Block layer/I/O scheduling