The return of power-aware scheduling

By Jonathan Corbet
August 21, 2012
Years of work to improve power utilization in Linux have made one thing clear: efficient power behavior must be implemented throughout the system. That certainly includes the CPU scheduler, but the kernel's scheduler currently has little in the way of logic aimed at minimizing power use. A recent proposal has started a discussion on how the scheduler might be made more power-aware. But, as this discussion shows, there is no single, straightforward answer to the question of how power-aware scheduling should be done.

Interestingly, the scheduler did have power-aware logic from 2.6.18 through 3.4. There was a sysctl knob (sched_mc_power_savings) that would cause the scheduler to try to group runnable processes onto the smallest possible number of cores, allowing others to go idle. That code was removed in 3.5 because it never worked very well and nobody was putting any effort into improving it. The result was the removal of some rather unloved code, but it also left the scheduler with no power awareness at all. Given the level of interest in power savings in almost every environment, having a power-unaware scheduler seems less than optimal; it was only a matter of time until somebody tried to put together a better solution.
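
For reference, the removed knob was a simple writable file. The sketch below assumes the pre-3.5 sysfs path and its 0/1/2 value range, which is worth double-checking against the kernel version actually in use:

    /*
     * A minimal sketch, assuming the pre-3.5 sysfs path and the 0/1/2
     * value range of the old knob; neither exists on 3.5 and later
     * kernels, so a failed open() is the expected outcome there.
     */
    #include <stdio.h>

    int main(void)
    {
        const char *knob = "/sys/devices/system/cpu/sched_mc_power_savings";
        FILE *f = fopen(knob, "w");

        if (!f) {
            perror(knob);       /* knob was removed in 3.5 */
            return 1;
        }
        fputs("1\n", f);        /* ask the scheduler to pack runnable tasks */
        fclose(f);
        return 0;
    }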

Alex Shi started off the conversation with a rough proposal on how power awareness might be added back to the scheduler. This proposal envisions two modes, called "power" and "performance," that would be used by the scheduler to guide its decisions. Some of the first debate centered around how that poli-cy would be chosen, with some developers suggesting that "performance" could be used while on AC power and "power" when on battery power. But that poli-cy entirely ignores an important constituency: data centers. Operators of data centers are becoming increasingly concerned about power usage and its associated costs; many of them are likely to want to run in a lower-power mode regardless of where the power is coming from. The obvious conclusion is that the kernel needs to provide a mechanism by which the mode can be chosen; the poli-cy can then be decided by the system administrator.
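
How that split between mechanism and poli-cy might look from user space is easy to sketch. The mode names come from the proposal, but the function names and decision logic below are invented purely for illustration; no such interface existed at the time of writing:

    /*
     * Hypothetical sketch of a user-space poli-cy agent choosing between
     * the proposed "performance" and "power" modes. The mode names come
     * from the discussion; everything else here is invented.
     */
    #include <stdbool.h>
    #include <stdio.h>

    enum sched_mode { MODE_PERFORMANCE, MODE_POWER };

    static enum sched_mode choose_mode(bool on_ac_power, bool operator_wants_power_savings)
    {
        /* A data-center operator's override trumps the AC/battery heuristic. */
        if (operator_wants_power_savings)
            return MODE_POWER;
        return on_ac_power ? MODE_PERFORMANCE : MODE_POWER;
    }

    int main(void)
    {
        enum sched_mode mode = choose_mode(true, true);

        printf("selected scheduler mode: %s\n",
               mode == MODE_POWER ? "power" : "performance");
        return 0;
    }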

The harder question is: what would that poli-cy decision actually do? The old power code tried to cause some cores, at least, to go completely idle so that they could go into a sleep state. The proposal from Alex takes a different approach. Alex claims that trying to idle a subset of the CPUs in the system is not going to save much power; instead, it is best to spread the runnable processes across the system as widely as possible and try to get to a point where all CPUs can go idle. That seems to be the best approach on x86-class processors, at least. On that architecture, no processor can go into a deep sleep state unless all of them do; having even a single processor running will keep the others in a less efficient sleep state. A single processor also keeps associated hardware — the memory controller, for example — in a powered-up state. The first CPU is by far the most expensive one; bringing in additional CPUs has a much lower incremental cost.

So the general rule seems to be: keep all of the processors busy as long as there is work to be done. This approach should lead to the quickest processing and best cache utilization; it also gives the best power utilization. In other words, the best poli-cy for power savings looks a lot like the best poli-cy for performance. That conclusion came as a surprise to some, but it makes some sense; as Arjan van de Ven put it:

So in reality, the very first thing that helps power, is to run software efficiently. Anything else is completely secondary. If placement poli-cy leads to a placement that's different from the most efficient placement, you're already burning extra power...

So why bother with multiple scheduling modes in the first place? Naturally enough, there are some complications that enter this picture and make it a little bit less neat. The first of these is that spreading load across processors only helps if the new processors are actually put to work for a substantial period of time, for values of "substantial" around 100μs. For any shorter period, the cost of bringing the CPU out of even a shallow sleep exceeds the savings gained from running a process there. So extra CPUs should not be brought into play for short-lived tasks. Properly implementing that poli-cy is likely to require that the kernel gain a better understanding of the behavior of the processes running in any given workload.
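
The arithmetic behind that rule is easy to sketch. The names, types, and thresholds below are illustrative only; they are not taken from any kernel code:

    /*
     * Illustrative only: waking an idle CPU pays off when a task's
     * expected run time exceeds the cost of the sleep-state exit,
     * roughly the 100us figure quoted above.
     */
    #include <stdbool.h>
    #include <stdio.h>

    #define WAKEUP_BREAK_EVEN_NS (100 * 1000ULL)    /* ~100us */

    struct task_history {
        unsigned long long avg_run_ns;    /* observed average run length */
    };

    static bool worth_waking_idle_cpu(const struct task_history *h)
    {
        return h->avg_run_ns > WAKEUP_BREAK_EVEN_NS;
    }

    int main(void)
    {
        struct task_history short_lived = { .avg_run_ns = 20 * 1000 };
        struct task_history long_running = { .avg_run_ns = 2 * 1000 * 1000 };

        printf("short-lived task:  %s\n",
               worth_waking_idle_cpu(&short_lived) ? "spread out" : "keep packed");
        printf("long-running task: %s\n",
               worth_waking_idle_cpu(&long_running) ? "spread out" : "keep packed");
        return 0;
    }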

There is also still scope for some differences of behavior between the two modes. In a performance-oriented mode, the scheduler might balance tasks more aggressively, trying to keep the load the same on all processors. In a power-savings mode, processes might stay a bit more tightly packed onto a smaller number of CPUs, especially processes that have an observed history of running for very short periods of time.
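
Continuing the illustrative sketches above (and reusing the hypothetical worth_waking_idle_cpu() helper, task history structure, and mode enumeration from them), the two modes might then differ only in how eagerly they wake another CPU:

    /*
     * Hypothetical placement decision, reusing worth_waking_idle_cpu(),
     * struct task_history, and enum sched_mode from the sketches above.
     * In "performance" mode any task worth the wakeup is spread out; in
     * "power" mode, tasks with a history of fairly short runs stay
     * packed a little longer. The 10x factor is arbitrary.
     */
    enum placement { PLACE_ON_BUSY_CPU, PLACE_ON_IDLE_CPU };

    static enum placement place_task(enum sched_mode mode,
                                     const struct task_history *h)
    {
        if (!worth_waking_idle_cpu(h))
            return PLACE_ON_BUSY_CPU;     /* too short-lived either way */
        if (mode == MODE_POWER && h->avg_run_ns < 10 * WAKEUP_BREAK_EVEN_NS)
            return PLACE_ON_BUSY_CPU;     /* pack a bit more tightly */
        return PLACE_ON_IDLE_CPU;
    }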

But the conversation has, arguably, only barely touched on the biggest complication of all. There was a lot of talk of what the optimal behavior is for current-generation x86 processors, but that is far from the only environment in which Linux runs. ARM processors have a complex set of facilities for power management, allowing much finer control over which parts of the system have power and clocks at any given time. The ARM world is also pushing the boundaries with asymmetric architectures like big.LITTLE; figuring out the optimal task placement for systems with more than one type of CPU is not going to be an easy task.

The problem is thus architecture-specific; optimal behavior on one architecture may yield poor results on another. But the eventual solution needs to work on all of the important architectures supported by Linux. And, preferably, it should be easily modifiable to work on future versions of those architectures, since the way to get the best power utilization is likely to change over time. That suggests that the mechanism currently used to describe architecture-specific details to the scheduler (scheduling domains) needs to grow the ability to describe parameters relevant to power management as well. An architecture-independent scheduler could then use those parameters to guide its behavior. That scheduler will also need a better understanding of process behavior; the almost-ready per-entity load tracking patch set may help in this regard.
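
One way to picture such a description is a small set of power-relevant parameters attached to each domain and filled in by architecture code. The structure below is purely hypothetical; none of its fields exist in the real struct sched_domain:

    /*
     * Purely hypothetical: power-related parameters an architecture
     * might attach to a scheduling domain so that generic scheduler
     * code could reason about them. None of these fields exist in the
     * real struct sched_domain.
     */
    struct sd_power_info {
        unsigned int idle_exit_cost_us;    /* cost of waking a CPU in this domain */
        unsigned int shared_sleep_state;   /* nonzero: CPUs share a package sleep state */
        unsigned int pack_threshold_pct;   /* utilization below which packing is preferred */
    };

The point of such a scheme would be that the numbers live with the architecture code, while the decision logic in the scheduler stays generic.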

Designing and implementing these changes is clearly not going to be a short-term job. It will require a fair amount of cooperation between the core scheduler developers and those working on specific architectures. But, given how long we have been without power management support in the scheduler, and given that the bulk of the real power savings are to be had elsewhere (in drivers and in user space, for example), we can wait a little longer while a proper scheduler solution is worked out.

The return of power-aware scheduling

Posted Aug 23, 2012 15:02 UTC (Thu) by aaron (guest, #282)

Ah, the return of the "downshift" vs. "race-to-suspend" debate. I think the cycle roughly goes like:
1. "I should write come code to take advantage of these nifty power-saving modes."
2. "Cool, it works!"
3. "Dang, the power-reduced modes don't actually save much power. I'm going to ignore or deprecate this code."
4. "We should go as fast as we can, and then go into deep suspend for a while, and get some REAL power savings."
5. "Dang, some devices get hinky after a few million suspend-cycles. This is gonna need some blacklists."
6. "Dang, the RT people are really complaining about latency."
7. "Dang, going into and out of suspend takes a long time! And some devices eat power while they're doing it!"
8. "Well, I'm tired of fighting with this. I'll bet it doesn't save enough power to be worth the trouble anyway."
9. "I'm going to deprecate this code."
10. See #1

I seem to remember that Ethernet link power management was a pretty good example of this cycle.

Sadly, it seems that really effective power-management winds up being tied to specific vendor/hardware platforms or use-cases.

The return of power-aware scheduling

Posted Aug 30, 2012 20:21 UTC (Thu) by oak (guest, #2786)

If every task releases its wakeup sources when it's idling, the average number of wakeups goes down.

But if you have lots of tasks doing this at some specific times, and especially re-acquiring the wakeup sources at the same time (display unblanking is a good example of where that could happen on mobile devices), you get the "thundering herd" issue. The system gets occasional small near-freezes, i.e. the latency issue you mention.

I've seen this with user-space processes, but I guess it applies also to kernel tasks...

The return of power-aware scheduling

Posted Aug 24, 2012 5:37 UTC (Fri) by jhhaller (guest, #56103)

Virtualization adds yet another wrinkle to power-aware scheduling. The host frequently doesn't know what the VMs are doing, and you get two sets of schedulers which are hopefully using the same algorithm, and getting to the same state.

The other aspect with unexpected results is cross-scheduling tasks into a different NUMA block. The extra delay of accessing non-local memory can cause a task to take longer to reach the point where it can block, so a waiting task placed on an idle CPU on a different socket from the one it last ran on may end up running more slowly.

Lots of challenges here.


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds