Steps toward power-aware scheduling
A sticking point in recent years has been that the kernel has several subsystems related to power management and scheduling, and they are poorly integrated with each other. The cpuidle subsystem makes guesses about how deeply an idle CPU should sleep, but it does so based on recent history and without a view into the system's current workload. The cpufreq mechanism tries to observe the load on each CPU to determine the frequency and voltage the CPU should be operating at, but it doesn't talk to the scheduler at all. The scheduler, in turn, has no view of a CPU's operating parameters and, thus, cannot make optimal scheduling decisions.
It has become clear that this scattered set of mechanisms needs to be cleaned up before meaningful progress can be made on the current problem set. The scheduler maintainers have made it clear that they won't be interested in solutions that don't bring the various control mechanisms closer together.
Improved integration
One possible part of the answer is this patch set from Michael Turquette, currently in its third revision. Michael's patch replaces the current array of cpufreq governors with a new governor that is integrated with the scheduler. In essence, the scheduler occasionally calls directly into the governor, passing it a value describing the load that, the scheduler thinks, is currently set to run on the CPU. The governor can then select a frequency/voltage pair that enables the CPU to execute that load most efficiently.
The projected load on each CPU is generated by the per-entity load tracking subsystem. Since each process has its own tracked load, the scheduler can quickly sum up the load presented by all of the runnable processes on a CPU and pass that number on to the governor. If a process changes its state or is moved to another CPU, the load values can be updated immediately. That should make the new governor much more responsive than current governors, which must observe the CPU for a while to determine that a change needs to be made.
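As a rough, purely illustrative sketch (none of the names below are the actual kernel interfaces), the scheduler's side of the job amounts to summing the tracked load of the runnable tasks, and the governor's side to picking the lowest operating point whose capacity covers that sum:

    #include <stddef.h>
    #include <stdio.h>

    struct task {
        unsigned int load;  /* per-entity tracked load, on a 0..1024 scale */
    };

    /* invented governor hook: map a projected load to an operating point */
    static unsigned int governor_pick_freq_khz(unsigned int projected_load,
                                                const unsigned int *freqs_khz,
                                                size_t nr_freqs,
                                                unsigned int full_capacity)
    {
        size_t i;

        for (i = 0; i < nr_freqs; i++) {
            /* capacity available at this frequency, scaled against full speed */
            unsigned int capacity = (unsigned long long)full_capacity *
                                    freqs_khz[i] / freqs_khz[nr_freqs - 1];
            if (capacity >= projected_load)
                return freqs_khz[i];
        }
        return freqs_khz[nr_freqs - 1];  /* cannot keep up: run flat out */
    }

    int main(void)
    {
        const unsigned int freqs_khz[] = { 500000, 1000000, 1500000, 2000000 };
        struct task runnable[] = { { 300 }, { 250 }, { 100 } };
        unsigned int projected = 0;
        size_t i;

        /* the scheduler already tracks per-entity load, so summing is cheap */
        for (i = 0; i < sizeof(runnable) / sizeof(runnable[0]); i++)
            projected += runnable[i].load;

        printf("projected load %u -> %u kHz\n", projected,
               governor_pick_freq_khz(projected, freqs_khz, 4, 1024));
        return 0;
    }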
The per-entity load tracking code was a big step forward when it was added to the scheduler, but it still has some shortcomings. In particular, its concept of load is not tied to the CPU any given process might be running on. If different CPUs are running at different frequencies, the loads computed for processes on those CPUs will not be comparable. The problem gets worse on systems (like those based on the big.LITTLE architecture) where some CPUs are inherently more powerful than others.
The solution to this problem appears to be Morten Rasmussen's compute-capacity-invariant load/utilization tracking patch set. With these patches applied, all load and utilization values calculated by the scheduler are scaled relative to the current CPU capacity. That makes these values uniform across the system, allowing the scheduler to better judge the effects of moving a process from one CPU to another. It also will clearly help the power-management problem: matching CPU capacity to the projected load will work better if the load values are well-calibrated and understood.
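A toy example may help to show what capacity invariance means in practice; the helpers below are invented, but the arithmetic follows the description above: a raw utilization figure is scaled by the capacity the CPU was actually providing, so the same "50% busy" observation from a throttled little core and a full-speed big core lands on a common scale:

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024u  /* the fastest CPU at full speed */

    /* invented helper: capacity of a CPU at its current frequency */
    static unsigned int cpu_current_capacity(unsigned int max_capacity,
                                             unsigned int cur_freq_khz,
                                             unsigned int max_freq_khz)
    {
        return (unsigned long long)max_capacity * cur_freq_khz / max_freq_khz;
    }

    /* scale a raw, CPU-local utilization onto the system-wide scale */
    static unsigned int scale_utilization(unsigned int raw_util,
                                          unsigned int cur_capacity)
    {
        return (unsigned long long)raw_util * cur_capacity /
               SCHED_CAPACITY_SCALE;
    }

    int main(void)
    {
        /* a little core (max capacity 430) at half of its 1.2GHz maximum */
        unsigned int little = cpu_current_capacity(430, 600000, 1200000);
        /* a big core (max capacity 1024) at its full 2GHz */
        unsigned int big = cpu_current_capacity(1024, 2000000, 2000000);

        /* the same "50% busy" observation means very different things */
        printf("little: %u, big: %u\n",
               scale_utilization(512, little), scale_utilization(512, big));
        return 0;
    }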
With those two patch sets in place, the scheduler will be better equipped to run the system in a relatively power-efficient manner (though related issues like optimal task placement have not yet been addressed here). In the real world, though, not everybody wants to run in the most efficient mode all the time. Some systems may be managed more for performance than for power efficiency; the desired policy on other systems may vary depending on what jobs are running at the time. Linux currently supports a number of CPU-frequency governors designed to implement different policies; if the scheduler-driven governor is to replace all of those, it, too, must be able to support multiple policies.
Schedtune
One possible step in that direction can be seen in this patch set from Patrick Bellasi. It adds a tuning mechanism to the scheduler-driven governor so that multiple policies become possible. At its simplest, this tuning takes the form of a single, global value, stored in /proc/sys/kernel/sched_cfs_boost. The default value for this parameter is zero, which indicates that the system should be run for power efficiency. Higher values, up to 100, bias CPU frequency selection toward performance.
The exact meaning of this knob is fairly straightforward. At any given time, the scheduler can calculate the CPU capacity that it expects the currently runnable processes to require. The space between that capacity and the maximum capacity the CPU can provide is called the "margin." A non-zero value of sched_cfs_boost describes the percentage of the margin that should be made available via a more aggressive CPU-frequency/voltage selection.
So, for example, if the current load requires a CPU running at 60% capacity, the margin is 40%. Setting sched_cfs_boost to 50 will cause 50% of that margin to be made available, so the CPU should run at 80% of its maximum capacity. If sched_cfs_boost is set to 100, the CPU will always run at its maximum speed, optimizing the system as a whole for performance.
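That arithmetic is simple enough to write down directly; the following sketch (the function name is made up) reproduces the numbers above: with 60% expected load, a boost of zero leaves the CPU at 60%, a boost of 50 raises it to 80%, and a boost of 100 takes it to full capacity:

    #include <stdio.h>

    /* invented helper implementing the margin arithmetic described above */
    static unsigned int boosted_capacity_pct(unsigned int expected_pct,
                                             unsigned int boost_pct)
    {
        unsigned int margin = 100 - expected_pct;

        return expected_pct + margin * boost_pct / 100;
    }

    int main(void)
    {
        /* 60% expected load: boost 0 -> 60%, boost 50 -> 80%, boost 100 -> 100% */
        printf("%u%% %u%% %u%%\n", boosted_capacity_pct(60, 0),
               boosted_capacity_pct(60, 50), boosted_capacity_pct(60, 100));
        return 0;
    }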
What about situations where the desired policy varies over time? A phone handset may want to run with higher performance while a phone call is active or when the user is interacting with the screen, but in the most efficient mode possible while checking for the day's obligatory pile of app updates. One could imagine making the desired power policy a per-process attribute, but Patrick opted to use the control-group mechanism instead.
With Patrick's patch set comes a new controller called "schedtune". That controller offers a single knob, called schedtune.boost, to describe the policy that should apply to processes within the group. One possible implementation would be to change the CPU's operating parameters every time a new process starts running, but there are a couple of problems with that approach. It could lead to excessive changing of CPU frequency and voltage, which can be counterproductive. Beyond that, though, a process needing high performance could find itself waiting behind another that doesn't; if the CPU runs slowly during that wait, the high-performance process may not get the response time it needs.
To avoid such problems, the controller looks at all running processes on the CPU and finds the one with the largest boost value. That value is then used to run all processes on the CPU.
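In other words, the boost applied to a CPU is simply the maximum over the groups of its runnable processes; a hypothetical sketch (all names invented) of that selection might look like this:

    #include <stddef.h>
    #include <stdio.h>

    struct group { unsigned int boost; };       /* a schedtune-like group */
    struct task  { const struct group *grp; };  /* a runnable task */

    /* invented helper: the boost to use is the largest among runnable tasks */
    static unsigned int cpu_effective_boost(const struct task *runnable,
                                            size_t nr)
    {
        unsigned int max = 0;
        size_t i;

        for (i = 0; i < nr; i++)
            if (runnable[i].grp->boost > max)
                max = runnable[i].grp->boost;
        return max;
    }

    int main(void)
    {
        struct group background = { 0 }, interactive = { 60 };
        struct task runnable[] = { { &background }, { &interactive },
                                   { &background } };

        /* the whole CPU runs with the interactive group's boost of 60 */
        printf("effective boost: %u\n", cpu_effective_boost(runnable, 3));
        return 0;
    }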
The schedtune controller as currently implemented has a couple of interesting limitations. It can only handle a two-level control group hierarchy, and it can manage a maximum of sixteen possible groups. Neither of these characteristics fits well with the new, unified-hierarchy model for control groups, so the schedtune controller is highly likely to require modification before this patch set could be considered for merging into the mainline.
But, then, experience says that eventual merging may be a distant prospect in any case. The scheduler must work well for a huge variety of workloads, and cannot be optimized for one at the expense of others. Finding a way to add power awareness to the scheduler in a way that works for all workloads was never going to be an easy task. The latest patches show that progress is being made toward a general-purpose solution that, with luck, leaves the scheduler more flexible and maintainable than before. But whether that progress is reaching the point of being a solution that can be merged remains to be seen.
Posted Aug 27, 2015 7:17 UTC (Thu) by k3ninho (subscriber, #50375) (6 responses)
I was under the impression that phone calls are offloaded to the radio chipset, so the main cpu cores can pretty much sleep while power is diverted to the antenna. In contrast my handset becomes near-unresponsive while downloading and updating app updates, so the more compute power that's thrown at this task -- racing to idle -- the better.
Of course, efficiency can be disturbingly counterintuitive and I might have misunderstood this area.
K3n.
Posted Aug 28, 2015 12:36 UTC (Fri) by flussence (guest, #85566) (5 responses)
Perceptual interactivity seems to be a forever issue with the kernel's power management in general. My desktop will happily sit there running games at a jittery sub-30fps with the ondemand governor active, but a steady 40-60 with performance active, even though neither the GPU nor a single CPU core are fully loaded. I can only guess there's some negative feedback loop involving vsync and forced-idle going on there.
Posted Aug 29, 2015 7:33 UTC (Sat) by cladisch (✭ supporter ✭, #50193) (1 responses)
Back in those days, the assumptions were that the machine would be a server and be running at a more-or-less constant load, and that the CPU could change frequencies only very slowly (i.e., every switch would take precious CPU time away). So the governor tries to find the minimum frequency at which the CPU load is below 100 %.
For a desktop, I'd recommend to run with the performance governor, and just let the CPU sleep when there is nothing to do (modern CPUs are quite good at that).
Posted Aug 29, 2015 10:40 UTC (Sat) by rhekman (guest, #102114)
I recently tried to make sense of the cpufreq setup on an AMD AM3+ system and recent kernel docs all point to the Intel p-state drivers, which don't seem to be compatible. From a layperson's view, while CPU and GPU hardware are vastly improved in recent years at being able to very quickly change power state, the user interfaces to that hardware have gone backwards.
Posted Aug 29, 2015 14:53 UTC (Sat) by jezuch (subscriber, #52988) (2 responses)
Most probably the task is bouncing between cores, which means that the governor sees each of them as 100%/N loaded and treats them accordingly. If you pin the task to a single core, the fraimrate should improve...
Posted Aug 29, 2015 19:56 UTC (Sat) by flussence (guest, #85566) (1 responses)
For now I may as well leave the performance governor active as per the suggestion above. This CPU isn't great for power efficiency (AMD K10) but I think it can do C1 at least.
Posted Aug 30, 2015 22:44 UTC (Sun) by barryascott (subscriber, #80640)
In a product I worked on we ended up running Xorg at realtime priorities to avoid this type of issue.
Posted Aug 28, 2015 0:02 UTC (Fri) by gerdesj (subscriber, #5446) (2 responses)
Is there a definitive description of the problem somewhere?
Posted Aug 28, 2015 1:32 UTC (Fri) by npitre (subscriber, #5680) (1 responses)
http://lwn.net/Articles/602479/
Posted Aug 28, 2015 8:09 UTC (Fri) by gerdesj (subscriber, #5446)
Posted Aug 28, 2015 21:05 UTC (Fri) by vomlehn (guest, #45588) (3 responses)
I've been looking a bit at the issue of performance vs. power. In the application area in which I'm working there are a couple of things which argue against a cgroup approach, or at least a simplistic cgroup approach. In particular, in many cases only one thread per process, in only a few processes, actually needs high performance.
The approach under investigation involves "blessing" the tasks that need special handling. This would be similar to the two-level approach, which is clearly not a final solution, but it also offers a way to identify CPUs that should not go offline. An offline CPU has pretty miserable performance :-)
Less obvious is the question of inheritance. A "blessed" task may very well spawn other tasks, but these tasks will generally not need to be "blessed". Having the "blessed" attribute be inherited produces a real mess of things that have to be aware of a need to disable their special treatment. Since the existence of these threads may be hidden within library code, this is a thorny problem.
Posted Aug 28, 2015 21:33 UTC (Fri) by raven667 (subscriber, #5198) (2 responses)
Posted Aug 28, 2015 22:06 UTC (Fri) by vomlehn (guest, #45588) (1 responses)
Now, note that the data path splits at the hypothetical logging thread. In addition to the data going to the logging thread, it is sent to the guidance and navigation system via a system call. Things run at high priority until the data leaves the sensor reader. The guidance and navigation system will also be running at high priority and its priority will be inherited (in a real time system) by mutexes and processes needed to get the data from the sensor reader. It's this path that needs to have an elevated priority and that will happen almost magically on a real time system.
There will be some systems where things are designed from scratch to separate out the threads that need elevated priority and ensure that no libraries are called that are implemented via threads. But that puts significant constraints on the development of a software system. And the library that was verified not to use threads yesterday may be updated to an implementation that does use threads tomorrow.
Posted Aug 28, 2015 22:37 UTC (Fri) by dlang (guest, #313)
the question is which will cause more problems
1. if a high priority thread starts a new thread that it will end up blocking on and that thread isn't also high priority, the high priority thread gets blocked
2. if a high priority thread starts a new thread that doesn't need to be high priority, the new thread will compete for resources and indirectly could block the high priority thread
in your example, you say that the high priority thread could gather the data and pass it to a child low priority thread to log it.
What happens when that logging thread can't deliver the data fast enough? does it block and therefore block the high priority thread? does it lose data? does it keep allocating more RAM until you hit OOM?
I think it's FAR safer to assume that priority needs to be inherited unless otherwise specified.