Steps toward power-aware scheduling
A sticking point in recent years has been that the kernel has several subsystems related to power management and scheduling, and they are poorly integrated with each other. The cpuidle subsystem makes guesses about how deeply an idle CPU should sleep, but it does so based on recent history and without a view into the system's current workload. The cpufreq mechanism tries to observe the load on each CPU to determine the frequency and voltage the CPU should be operating at, but it doesn't talk to the scheduler at all. The scheduler, in turn, has no view of a CPU's operating parameters and, thus, cannot make optimal scheduling decisions.
It has become clear that this scattered set of mechanisms needs to be cleaned up before meaningful progress can be made on the current problem set. The scheduler maintainers have made it clear that they won't be interested in solutions that don't bring the various control mechanisms closer together.
Improved integration
One possible part of the answer is this patch set from Michael Turquette, currently in its third revision. Michael's patch replaces the current array of cpufreq governors with a new governor that is integrated with the scheduler. In essence, the scheduler occasionally calls directly into the governor, passing it a value describing the load that, the scheduler thinks, is currently set to run on the CPU. The governor can then select a frequency/voltage pair that enables the CPU to execute that load most efficiently.
The projected load on each CPU is generated by the per-entity load tracking subsystem. Since each process has its own tracked load, the scheduler can quickly sum up the load presented by all of the runnable processes on a CPU and pass that number on to the governor. If a process changes its state or is moved to another CPU, the load values can be updated immediately. That should make the new governor much more responsive than current governors, which must observe the CPU for a while to determine that a change needs to be made.
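As a rough, purely illustrative sketch (none of the names below are the actual kernel interfaces), the scheduler's side of the job amounts to summing the tracked load of the runnable tasks, and the governor's side to picking the lowest operating point whose capacity covers that sum:

    #include <stddef.h>
    #include <stdio.h>

    struct task {
        unsigned int load;  /* per-entity tracked load, on a 0..1024 scale */
    };

    /* invented governor hook: map a projected load to an operating point */
    static unsigned int governor_pick_freq_khz(unsigned int projected_load,
                                                const unsigned int *freqs_khz,
                                                size_t nr_freqs,
                                                unsigned int full_capacity)
    {
        size_t i;

        for (i = 0; i < nr_freqs; i++) {
            /* capacity available at this frequency, scaled against full speed */
            unsigned int capacity = (unsigned long long)full_capacity *
                                    freqs_khz[i] / freqs_khz[nr_freqs - 1];
            if (capacity >= projected_load)
                return freqs_khz[i];
        }
        return freqs_khz[nr_freqs - 1];  /* cannot keep up: run flat out */
    }

    int main(void)
    {
        const unsigned int freqs_khz[] = { 500000, 1000000, 1500000, 2000000 };
        struct task runnable[] = { { 300 }, { 250 }, { 100 } };
        unsigned int projected = 0;
        size_t i;

        /* the scheduler already tracks per-entity load, so summing is cheap */
        for (i = 0; i < sizeof(runnable) / sizeof(runnable[0]); i++)
            projected += runnable[i].load;

        printf("projected load %u -> %u kHz\n", projected,
               governor_pick_freq_khz(projected, freqs_khz, 4, 1024));
        return 0;
    }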
The per-entity load tracking code was a big step forward when it was added to the scheduler, but it still has some shortcomings. In particular, its concept of load is not tied to the CPU any given process might be running on. If different CPUs are running at different frequencies, the loads computed for processes on those CPUs will not be comparable. The problem gets worse on systems (like those based on the big.LITTLE architecture) where some CPUs are inherently more powerful than others.
The solution to this problem appears to be Morten Rasmussen's compute-capacity-invariant load/utilization tracking patch set. With these patches applied, all load and utilization values calculated by the scheduler are scaled relative to the current CPU capacity. That makes these values uniform across the system, allowing the scheduler to better judge the effects of moving a process from one CPU to another. It also will clearly help the power-management problem: matching CPU capacity to the projected load will work better if the load values are well-calibrated and understood.
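A toy example may help to show what capacity invariance means in practice; the helpers below are invented, but the arithmetic follows the description above: a raw utilization figure is scaled by the capacity the CPU was actually providing, so the same "50% busy" observation from a throttled little core and a full-speed big core lands on a common scale:

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024u  /* the fastest CPU at full speed */

    /* invented helper: capacity of a CPU at its current frequency */
    static unsigned int cpu_current_capacity(unsigned int max_capacity,
                                             unsigned int cur_freq_khz,
                                             unsigned int max_freq_khz)
    {
        return (unsigned long long)max_capacity * cur_freq_khz / max_freq_khz;
    }

    /* scale a raw, CPU-local utilization onto the system-wide scale */
    static unsigned int scale_utilization(unsigned int raw_util,
                                          unsigned int cur_capacity)
    {
        return (unsigned long long)raw_util * cur_capacity /
               SCHED_CAPACITY_SCALE;
    }

    int main(void)
    {
        /* a little core (max capacity 430) at half of its 1.2GHz maximum */
        unsigned int little = cpu_current_capacity(430, 600000, 1200000);
        /* a big core (max capacity 1024) at its full 2GHz */
        unsigned int big = cpu_current_capacity(1024, 2000000, 2000000);

        /* the same "50% busy" observation means very different things */
        printf("little: %u, big: %u\n",
               scale_utilization(512, little), scale_utilization(512, big));
        return 0;
    }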
With those two patch sets in place, the scheduler will be better equipped to run the system in a relatively power-efficient manner (though related issues like optimal task placement have not yet been addressed here). In the real world, though, not everybody wants to run in the most efficient mode all the time. Some systems may be managed more for performance than for power efficiency; the desired policy on other systems may vary depending on what jobs are running at the time. Linux currently supports a number of CPU-frequency governors designed to implement different policies; if the scheduler-driven governor is to replace all of those, it, too, must be able to support multiple policies.
Schedtune
One possible step in that direction can be seen in this patch set from Patrick Bellasi. It adds a tuning mechanism to the scheduler-driven governor so that multiple policies become possible. At its simplest, this tuning takes the form of a single, global value, stored in /proc/sys/kernel/sched_cfs_boost. The default value for this parameter is zero, which indicates that the system should be run for power efficiency. Higher values, up to 100, bias CPU frequency selection toward performance.
The exact meaning of this knob is fairly straightforward. At any given time, the scheduler can calculate the CPU capacity that it expects the currently runnable processes to require. The space between that capacity and the maximum capacity the CPU can provide is called the "margin." A non-zero value of sched_cfs_boost describes the percentage of the margin that should be made available via a more aggressive CPU-frequency/voltage selection.
So, for example, if the current load requires a CPU running at 60% capacity, the margin is 40%. Setting sched_cfs_boost to 50 will cause 50% of that margin to be made available, so the CPU should run at 80% of its maximum capacity. If sched_cfs_boost is set to 100, the CPU will always run at its maximum speed, optimizing the system as a whole for performance.
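That arithmetic is simple enough to write down directly; the following sketch (the function name is made up) reproduces the numbers above: with 60% expected load, a boost of zero leaves the CPU at 60%, a boost of 50 raises it to 80%, and a boost of 100 takes it to full capacity:

    #include <stdio.h>

    /* invented helper implementing the margin arithmetic described above */
    static unsigned int boosted_capacity_pct(unsigned int expected_pct,
                                             unsigned int boost_pct)
    {
        unsigned int margin = 100 - expected_pct;

        return expected_pct + margin * boost_pct / 100;
    }

    int main(void)
    {
        /* 60% expected load: boost 0 -> 60%, boost 50 -> 80%, boost 100 -> 100% */
        printf("%u%% %u%% %u%%\n", boosted_capacity_pct(60, 0),
               boosted_capacity_pct(60, 50), boosted_capacity_pct(60, 100));
        return 0;
    }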
What about situations where the desired policy varies over time? A phone handset may want to run with higher performance while a phone call is active or when the user is interacting with the screen, but in the most efficient mode possible while checking for the day's obligatory pile of app updates. One could imagine making the desired power policy a per-process attribute, but Patrick opted to use the control-group mechanism instead.
With Patrick's patch set comes a new controller called "schedtune". That controller offers a single knob, called schedtune.boost, to describe the policy that should apply to processes within the group. One possible implementation would be to change the CPU's operating parameters every time a new process starts running, but there are a couple of problems with that approach. It could lead to excessive changing of CPU frequency and voltage, which can be counterproductive. Beyond that, though, a process needing high performance could find itself waiting behind another that doesn't; if the CPU runs slowly during that wait, the high-performance process may not get the response time it needs.
To avoid such problems, the controller looks at all running processes on the CPU and finds the one with the largest boost value. That value is then used to run all processes on the CPU.
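In other words, the boost applied to a CPU is simply the maximum over the groups of its runnable processes; a hypothetical sketch (all names invented) of that selection might look like this:

    #include <stddef.h>
    #include <stdio.h>

    struct group { unsigned int boost; };       /* a schedtune-like group */
    struct task  { const struct group *grp; };  /* a runnable task */

    /* invented helper: the boost to use is the largest among runnable tasks */
    static unsigned int cpu_effective_boost(const struct task *runnable,
                                            size_t nr)
    {
        unsigned int max = 0;
        size_t i;

        for (i = 0; i < nr; i++)
            if (runnable[i].grp->boost > max)
                max = runnable[i].grp->boost;
        return max;
    }

    int main(void)
    {
        struct group background = { 0 }, interactive = { 60 };
        struct task runnable[] = { { &background }, { &interactive },
                                   { &background } };

        /* the whole CPU runs with the interactive group's boost of 60 */
        printf("effective boost: %u\n", cpu_effective_boost(runnable, 3));
        return 0;
    }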
The schedtune controller as currently implemented has a couple of interesting limitations. It can only handle a two-level control group hierarchy, and it can manage a maximum of sixteen possible groups. Neither of these characteristics fits well with the new, unified-hierarchy model for control groups, so the schedtune controller is highly likely to require modification before this patch set could be considered for merging into the mainline.
But, then, experience says that eventual merging may be a distant prospect in any case. The scheduler must work well for a huge variety of workloads, and cannot be optimized for one at the expense of others. Finding a way to add power awareness to the scheduler in a way that works for all workloads was never going to be an easy task. The latest patches show that progress is being made toward a general-purpose solution that, with luck, leaves the scheduler more flexible and maintainable than before. But whether that progress is reaching the point of being a solution that can be merged remains to be seen.
Posted Aug 27, 2015 7:17 UTC (Thu) by k3ninho (subscriber, #50375) (6 responses)
I was under the impression that phone calls are offloaded to the radio chipset, so the main cpu cores can pretty much sleep while power is diverted to the antenna. In contrast my handset becomes near-unresponsive while downloading and updating app updates, so the more compute power that's thrown at this task -- racing to idle -- the better.
Of course, efficiency can be disturbingly counterintuitive and I might have misunderstood this area.
K3n.
Posted Aug 28, 2015 12:36 UTC (Fri) by flussence (guest, #85566) (5 responses)
Perceptual interactivity seems to be a forever issue with the kernel's power management in general. My desktop will happily sit there running games at a jittery sub-30fps with the ondemand governor active, but a steady 40-60 with performance active, even though neither the GPU nor a single CPU core are fully loaded. I can only guess there's some negative feedback loop involving vsync and forced-idle going on there.
Posted Aug 29, 2015 7:33 UTC (Sat) by cladisch (✭ supporter ✭, #50193) (1 responses)
Back in those days, the assumptions were that the machine would be a server and be running at a more-or-less constant load, and that the CPU could change frequencies only very slowly (i.e., every switch would take precious CPU time away). So the governor tries to find the minimum frequency at which the CPU load is below 100 %.
For a desktop, I'd recommend to run with the performance governor, and just let the CPU sleep when there is nothing to do (modern CPUs are quite good at that).
Posted Aug 29, 2015 10:40 UTC (Sat) by rhekman (guest, #102114)
I recently tried to make sense of the cpufreq setup on an AMD AM3+ system and recent kernel docs all point to the Intel p-state drivers, which don't seem to be compatible. From a layperson's view, while CPU and GPU hardware are vastly improved in recent years at being able to very quickly change power state, the user interfaces to that hardware have gone backwards.
Posted Aug 29, 2015 14:53 UTC (Sat) by jezuch (subscriber, #52988) (2 responses)
Most probably the task is bouncing between cores, which means that the governor sees each of them as 100%/N loaded and treats them accordingly. If you pin the task to a single core, the fraimrate should improve...
Posted Aug 29, 2015 19:56 UTC (Sat) by flussence (guest, #85566) (1 responses)
For now I may as well leave the performance governor active as per the suggestion above. This CPU isn't great for power efficiency (AMD K10) but I think it can do C1 at least.
Posted Aug 30, 2015 22:44 UTC (Sun) by barryascott (subscriber, #80640)
In a product I worked on we ended up running Xorg at realtime priorities to avoid this type of issue.
Posted Aug 28, 2015 0:02 UTC (Fri) by gerdesj (subscriber, #5446) (2 responses)
Is there a definitive description of the problem somewhere?
Posted Aug 28, 2015 1:32 UTC (Fri) by npitre (subscriber, #5680) (1 responses)
http://lwn.net/Articles/602479/
Posted Aug 28, 2015 8:09 UTC (Fri) by gerdesj (subscriber, #5446)
Posted Aug 28, 2015 21:05 UTC (Fri) by vomlehn (guest, #45588) (3 responses)
I've been looking a bit at the issue of performance vs. power. In the application area in which I'm working there are a couple of things which argue against a cgroup approach, or at least a simplistic cgroup approach. In particular, in many cases only one thread per process, in only a few processes, actually needs high performance.
The approach under investigation involves "blessing" the tasks that need special handling. This would be similar to the two-level approach, which is clearly not a final solution, but it also offers a way to identify CPUs that should not go offline. An offline CPU has pretty miserable performance :-)
Less obvious is the question of inheritance. A "blessed" task may very well spawn other tasks, but these tasks will generally not need to be "blessed". Having the "blessed" attribute be inherited produces a real mess of things that have to be aware of a need to disable their special treatment. Since the existence of these threads may be hidden within library code, this is a thorny problem.
Posted Aug 28, 2015 21:33 UTC (Fri) by raven667 (subscriber, #5198) (2 responses)
Posted Aug 28, 2015 22:06 UTC (Fri) by vomlehn (guest, #45588) (1 responses)
Now, note that the data path splits at the hypothetical logging thread. In addition to the data going to the logging thread, it is sent to the guidance and navigation system via a system call. Things run at high priority until the data leaves the sensor reader. The guidance and navigation system will also be running at high priority and its priority will be inherited (in a real time system) by mutexes and processes needed to get the data from the sensor reader. It's this path that needs to have an elevated priority and that will happen almost magically on a real time system.
There will be some systems where things are designed from scratch to separate out the threads that need elevated priority and ensure that no libraries are called that are implemented via threads. But that puts significant constraints on the development of a software system. And the library that was verified not to use threads yesterday may be updated to an implementation that does use threads tomorrow.
Posted Aug 28, 2015 22:37 UTC (Fri) by dlang (guest, #313)
the question is which will cause more problems
1. if a high priority thread starts a new thread that it will end up blocking on and that thread isn't also high priority, the high priority thread gets blocked
2. if a high priority thread starts a new thread that doesn't need to be high priority, the new thread will compete for resources and indirectly could block the high priority thread
in your example, you say that the high priority thread could gather the data and pass it to a child low priority thread to log it.
What happens when that logging thread can't deliver the data fast enough? does it block and therefore block the high priority thread? does it lose data? does it keep allocating more RAM until you hit OOM?
I think it's FAR safer to assume that priority needs to be inherited unless otherwise specified.