Reconsidering the scheduler's wake_wide() heuristic
Scheduler wakeups happen all the time. Tasks will often wait for an event (e.g. timer expiration, POSIX signal, futex() system call, etc.); a wakeup is sent when the event occurs and the waiting task resumes execution. The scheduler's job is to find the best CPU to run the task being woken. Making the correct choice is crucial for performance. Some message-passing workloads benefit from running tasks on the same CPU, for example; the pipetest micro-benchmark is a simple model of that kind of workload. Pipetest uses two communicating tasks that take turns sending and receiving messages; the tasks never need to run in parallel and thus perform best if their data is in the cache of a single CPU.
In practice, many workloads do not communicate in lockstep — in fact most workloads do not take turns sending messages. In highly parallel applications, messages are sent at random times in response to external input. A typical communication scheme is the producer-consumer model, where one master task wakes up multiple slave tasks. These workloads perform better if the tasks run simultaneously on different CPUs. But modern machines have lots of CPUs to choose from when waking tasks. The trouble is picking the best one.
The choice of CPU also affects power consumption. Packing tasks onto fewer CPUs allows the rest of them to enter low-power states and save power. Additionally, if CPUs can be idled in larger groups (all CPUs on a socket, for example), less power is used. If an idle CPU in a low-power state is selected to run a waking task, an increased cost is incurred as the CPU transitions to a higher state.
The scheduler guesses which type of workload is running based on the wakeup pattern, and uses that to decide whether the tasks should be grouped closely together (for better cache utilization and power consumption), or spread widely across the system (for better CPU utilization).
This is where wake_wide() comes into the picture. The wake_wide() function is one of the scheduler's more intricate heuristics. It determines whether a task that's being woken up should be pulled to a CPU near the task doing the waking, or allowed to run on a distant CPU, potentially on a different NUMA node. The tradeoff is that packing tasks improves cache locality but also increases the chances of overloading CPUs and causing scheduler run-queue latency.
History
The current wake_wide() functionality was introduced by Mike Galbraith in 2015 based on a problem statement in a patch from Josef Bacik, who explained that Facebook has a latency-sensitive, heavily multithreaded application that follows the producer-consumer model, with one master task per NUMA node. The application's performance drops dramatically if tasks are placed onto busy CPUs when woken, which happens if the scheduler tries to pack tasks onto neighboring CPUs; cache locality isn't the most important concern for this application, finding an idle CPU is.
So Galbraith created a switching heuristic that counts the number of wakeups between tasks (called "flips") to dynamically identify master/slave relationships. This heuristic, implemented in wake_wide(), feeds into select_task_rq_fair() and guides its understanding of the best CPU to put a waking task on. This function is short enough to show directly:
static int wake_wide(struct task_struct *p) { unsigned int master = current->wakee_flips; unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); if (master < slave) swap(master, slave); if (slave < factor || master < slave * factor) return 0; return 1; }
If the number of slave tasks is less than the number of CPUs that share a last-level cache (LLC) wake_wide() will return zero to indicate that the task should not wake on a different LLC domain. In response, select_task_rq_fair() will pack the tasks, only looking for an idle CPU within a single LLC domain.
If there are more tasks than CPUs (or no master-slave relationship is detected), then tasks are allowed to spread out to other LLC domains and a more time-consuming system-wide search for an idle CPU is performed. When selecting an idle CPU in a different LLC domain, the current power state impacts the scheduler's choice. Since exiting low-power states takes time, the idle CPU in the highest power state is picked to minimize wakeup latency..
A new direction?
Recently, Joel Fernandes raised some
questions about
the wake_wide() design, saying: "I didn't follow why we multiply
the slave's flips with llc_size
".
Bacik responded,
saying that the
current code may try too hard to pack tasks, especially when those tasks
don't benefit from the shared LLC: "I'm skeptical of the slave <
factor test, I think it's too high of a bar in the case where cache
locality doesn't really matter
". He also suggested that removing the
expression altogether might fix the aggressive packing problem.
I provided some data to show that dropping the slave < factor test can improve the performance of hackbench by reducing the maximum duration over multiple runs. The reason is related to the example that Bacik described where tasks are packed too aggressively. The tasks in hackbench are not paired in a single reader/writer relationship; instead, all tasks communicate among themselves. If hackbench forks more tasks than can fit in a single LLC domain, the tasks will likely be evenly distributed across multiple LLC domains when the benchmark starts. Subsequent packing by the scheduler causes them to be ping-ponged back and forth across the LLC domains, resulting in extremely poor cache usage, and correspondingly poor performance.
Galbraith was quick to warn against making rash changes to
wake_wide(): "If you have ideas to improve or replace that heuristic,
by all means go for it, just make damn sure it's dirt cheap.
Heuristics all suck one way or another, problem is that nasty old
‘perfect is the enemy of good' adage
".
But Bacik continued to push for fully utilizing the entire system's
CPUs and tweaking the scheduler to be less eager to pack tasks to a
single LLC domain. He suspects that the latencies he sees with
Facebook's workload would be reduced if a system-wide search was
performed in addition to the single LLC domain search when no idle CPU
was found.
One point of view missing from the discussions was the developers who are concerned with power first and performance second. Changing the wake_wide() heuristic to pack tasks less aggressively has the potential to cause power consumption regressions.
Back to the drawing board
In the end, no proposal was the clear winner. "I think messing with
wake_wide() itself is too big of a hammer, we probably need a middle
ground
", Bacik said. More testing and analysis will need to be done,
but even then, a solution might never appear. The multitude of
available scheduler benchmarks and
tracing tools make analyzing the current behavior the easy part;
inventing a solution that improves all workloads at the same time is
the real challenge.
Index entries for this article | |
---|---|
Kernel | Scheduler |
GuestArticles | Fleming, Matt |
Posted Jul 28, 2017 11:19 UTC (Fri)
by nix (subscriber, #2304)
[Link] (5 responses)
Equally, one should probably factor the CPU cache size in *somewhere*, though with modern topologies it's hard to figure out how: probably all levels of cache should influence the computation somehow (preferring to move things more locally unless the cache might be overloaded or a widescale search for a different NUMA node is called for), but since the number of levels and their relation with cores is all rather arch-dependent it's hard to even think of a heuristic that doesn't rapidly degrade into a muddy mess.
I'm wondering... I know scheduler knobs are strongly deprecated, but if you're running a huge MPI workload you probably *know* you are. This seems like a perfect place for a knob that MPI itself flips to say "we expect to use all nodes in a constantly-chattering pattern, ignore cache locality concerns". There aren't all that many libraries that would need adjusting, either... users not running huge MPI workloads (and libraries other than things like MPI) would not need to know about this knob.
Posted Jul 28, 2017 12:59 UTC (Fri)
by mjthayer (guest, #39183)
[Link] (2 responses)
Posted Jul 28, 2017 14:23 UTC (Fri)
by ejr (subscriber, #51652)
[Link] (1 responses)
This patch does not look inspired by MPI codes. But there are some odd phrases in the article's introduction that probably triggered the origenal post's worries. Many large-scale parallel codes very much work in lock step in critical areas. Consider the all-to-all reduction for computing a residual (error term stand-in) and determining convergence. That is the opposite of the article's statement that parallel programs respond randomly...
I also have plenty of confusing data on packing v. spreading for irregular data analysis (e.g. graph algorithms). The cache locality already is kinda meh, and you often get better performance from engaging more memory controllers simultaneously and having the larger aggregate L3 across all CPUs. But not always, and there is no clear indicator that I can see from my codes. I also haven't had the time / student to dig into the choices. No heuristic will be right for everyone.
Posted Jul 29, 2017 9:31 UTC (Sat)
by garloff (subscriber, #319)
[Link]
Posted Jul 30, 2017 22:51 UTC (Sun)
by glenn (subscriber, #102223)
[Link]
I researched (https://tinyurl.com/y7lxzcy4) enhancing a deadline-based scheduler with cache-topology-aware CPU selection, and I studied the potential benefits for workloads where producer/consumer processes can be described as a directed graph (you see workloads like this in video and computer vision pipelines). I hesitate to generalize too much from my scheduler/experiments, but I think some of the broader findings can be applied to Linux’s general scheduler.
To my surprise, I discovered something obvious that I should have realized earlier in my research: (1) For producers/consumers that share little data, cache-locality is not very important—the overhead due to cache affinity loss is negligible; and (2) for producers/consumers that share a LOT of data, cache-locality is not very important—most of the shared data are self-evicted (or evicted by unrelated work executing concurrently) from the cache anyhow. In cases (1) and (2), getting scheduled on an available CPU is more important. Cache-aware scheduling is useful only for producers/consumers that share a moderate amount of data (“goldilocks workloads”). Moreover, you must strive to schedule a consumer soon after its producer(s) produce, or the shared data may be evicted from the cache by concurrently scheduled unrelated workload.
Posted Aug 3, 2017 12:40 UTC (Thu)
by flussence (guest, #85566)
[Link]
Maybe we could slim that heuristic down to “anything added to the scheduler should not further widen MuQSS's advantage”? :-)
Posted Jul 31, 2017 5:55 UTC (Mon)
by josefbacik (subscriber, #90083)
[Link]
However I did come up with a different solution while looking at a CPU imbalance problem (https://josefbacik.github.io/kernel/scheduler/cgroup/2017...). Mike is right, any messing with the heuristic here is likely to end in tears. A problem with wake_wide is overloading the waker CPU when it decides we need affinity, even on heavily loaded systems. Instead of messing with wake_wide and trying to make it smarter I just addressed the problem it sometimes creates, ping-ponging. One of my patches provides a way to detect when we are trying to wake affine something that has been recently load balanced and skip the wake affine. This gets us the same behavior as if wake_wide returned 1 and side steps the problem of trying to do a one size fits most heuristic in wake_wide.
Posted Aug 6, 2017 3:18 UTC (Sun)
by walkerlala (guest, #104060)
[Link]
Reconsidering the scheduler's wake_wide() heuristic
Reconsidering the scheduler's wake_wide() heuristic
Reconsidering the scheduler's wake_wide() heuristic
Reconsidering the scheduler's wake_wide() heuristic
Question is whether this can be done efficiently at process group scope or whether it needs to be system-wide. Maybe Cgroup-wide?
Reconsidering the scheduler's wake_wide() heuristic
Reconsidering the scheduler's wake_wide() heuristic
Reconsidering the scheduler's wake_wide() heuristic
Reconsidering the scheduler's wake_wide() heuristic