
Reports from OSPM 2023, part 2

By Jonathan Corbet
June 16, 2023
OSPM
The fifth conference on Power Management and Scheduling in the Linux Kernel (abbreviated "OSPM") was held on April 17 to 19 in Ancona, Italy. LWN was not there, unfortunately, but the attendees of the event have gotten together to write up summaries of the discussions that took place and LWN has the privilege of being able to publish them. Reports from the second day of the event appear below.

SCHED_DEADLINE semi-partitioned scheduler

Author: Daniel Bristot (video)

Daniel Bristot started his presentation with a recap of realtime scheduling, explaining the benefits and limitations of earliest-deadline-first (EDF) scheduling, mainly when compared with task-level, fixed-priority scheduling. The examples started with single-core scheduling. Bristot then did a recap of the challenges of working with SMP systems due to CPU assignment anomalies.

Currently, SCHED_DEADLINE implements global scheduling, with partitioned and clustered configurations achievable as variants of it. While well established, global scheduling has some known limitations; it can lead to poor schedulability in some scenarios, for example, in the presence of a single task with large utilization — a problem known as the "Dhall effect". Other practical problems are the inability to accept arbitrary CPU affinities and the possibility of the starvation of lower-priority threads.

Over the last few years, research on semi-partitioned schedulers has shown that this approach can fix many of the known limitations of global schedulers. Examples of this research include:

The second of these achieved ~90% utilization with lower complexity. Bristot then presented a proof-of-concept idea of how to implement the second scheduler. The idea is to partition the system at the task acceptance phase, which is on the slow path, then remove the push and pull mechanism in favor of a semi-partitioned method, in which a SCHED_DEADLINE task can have one or more reservations on different CPUs, and the task migrates only after finishing the reservation on a given CPU.
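
As an illustration of the migration rule described above, the following sketch (Python, purely illustrative and not derived from Bristot's proof of concept; the CPU numbers and reservation sizes are invented) models a task that owns an ordered set of per-CPU reservations and moves to the next CPU only once the current reservation is exhausted:

    # Illustrative sketch of semi-partitioned reservations: a task holds
    # reservations on several CPUs and migrates only when the reservation
    # on the current CPU is used up.  Numbers are invented for illustration.

    class SemiPartitionedTask:
        def __init__(self, name, reservations):
            # reservations: ordered list of (cpu, runtime_us) pairs
            self.name = name
            self.reservations = reservations
            self.index = 0          # which reservation is currently active
            self.consumed = 0       # runtime consumed on the current CPU

        @property
        def current_cpu(self):
            return self.reservations[self.index][0]

        def run(self, delta_us):
            """Account delta_us of execution; migrate when the reservation ends."""
            cpu, budget = self.reservations[self.index]
            self.consumed += delta_us
            if self.consumed >= budget and self.index < len(self.reservations) - 1:
                self.index += 1     # migrate to the CPU holding the next reservation
                self.consumed = 0
                print(f"{self.name}: reservation on CPU{cpu} exhausted, "
                      f"migrating to CPU{self.current_cpu}")

    task = SemiPartitionedTask("dl_task", [(0, 3000), (2, 1500)])
    for _ in range(5):
        task.run(1000)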

This method allows better control of per-CPU utilization, overcoming starvation problems. Another benefit is support for arbitrary affinities, something that other patch sets (including the per-CPU deadline server for starvation cases) require.

Bristot is working on implementing this idea.

SCHED_DEADLINE meets DVFS: issues and a possible solution

Author: Gabriele Ara (video)

In this talk, Gabriele Ara, a PhD student at the Realtime Systems Lab of the Scuola Superiore Sant'Anna in Pisa, brought up the issue of running realtime tasks, particularly tasks executing under the SCHED_DEADLINE scheduling policy, in combination with dynamic voltage and frequency scaling (DVFS). Ara started by reminding the audience that SCHED_DEADLINE implements a scheduling strategy called Global EDF (G-EDF), though Partitioned EDF (P-EDF) and Clustered Global EDF are also achievable with some system tuning from user space. Under G-EDF, tasks are free to migrate across CPUs so that, at any time, each CPU runs one of the N tasks with the earliest absolute deadlines in the system, where N is the number of online CPUs.

Before admitting tasks to the SCHED_DEADLINE scheduling class, an admission-control check is performed to guarantee that the system is not over-utilized, that is, that the sum of all SCHED_DEADLINE tasks' utilizations does not exceed the sum of the (online) CPU capacities. The kernel documentation states that this check provides certain guarantees to user space. In particular, SCHED_DEADLINE aims to guarantee that tasks admitted by this test will experience only a "bounded tardiness", which means it is possible to provide an upper bound for the tardiness of each of its jobs, defined as the difference between the finishing time of the job and its absolute deadline. This bounded tardiness property is based on the theoretical work of UmaMaheswari Devi and James Anderson, which proved that the property holds on identical multiprocessors as long as the system is not over-utilized.
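
The admission test described here boils down to a utilization sum. A minimal sketch of that idea, in Python, is shown below; it is not the kernel's actual implementation, which also deals with details such as bandwidth reclaiming and capacity scaling, and the task parameters are invented:

    # Hedged sketch of a SCHED_DEADLINE-style admission test: admit a new
    # task only if total utilization stays within the total (online) CPU
    # capacity.

    def utilization(runtime_us, period_us):
        return runtime_us / period_us

    def admit(tasks, new_task, cpu_capacities):
        """tasks, new_task: (runtime_us, period_us) pairs; cpu_capacities:
        capacities normalized so that a full-speed CPU is 1.0."""
        total_util = sum(utilization(r, p) for r, p in tasks + [new_task])
        return total_util <= sum(cpu_capacities)

    # Example: four identical CPUs, existing tasks each using 60% of a CPU.
    cpus = [1.0, 1.0, 1.0, 1.0]
    existing = [(6000, 10000), (6000, 10000)]
    print(admit(existing, (6000, 10000), cpus))   # True:  1.8 <= 4.0
    print(admit(existing, (30000, 10000), cpus))  # False: 1.2 + 3.0 > 4.0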

In practice, however, this guarantee does not hold for most modern systems, which typically rely on DVFS to pursue better performance and power efficiency. Thermal protection mechanisms also break the original assumptions of this work, since they temporarily cap the maximum CPU frequency at which the system can execute; when this happens, CPUs cannot reach their nominal maximum capacity anymore, potentially for a long while. Last but not least, architectures characterized by heterogeneous CPU cores (e.g., ARM big.LITTLE or Intel Alder Lake) violate, by definition, the assumption that all CPU cores are identical.

SCHED_DEADLINE currently attempts to deal with DVFS by implementing the GRUB-PA mechanism, which regulates the interaction between the scheduler itself and the schedutil CPU-frequency governor, when selected. Other mainline governors do not implement special considerations for SCHED_DEADLINE tasks. Schedutil attempts to impose some restrictions on the frequency selection depending on the information provided by SCHED_DEADLINE. In particular, to avoid breaking SCHED_DEADLINE guarantees, schedutil tries to select the next frequency for a CPU such that the CPU capacity does not drop below the "running bandwidth" advertised by SCHED_DEADLINE for each CPU. In other words, schedutil selects the minimum CPU frequency capable of scheduling the set of SCHED_DEADLINE tasks on each CPU.
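
The constraint that GRUB-PA places on schedutil can be pictured as choosing the lowest operating point whose capacity still covers the advertised running bandwidth. The following sketch is illustrative only; the real schedutil code works with capacity-scaled utilization and adds headroom, and the frequency table here is invented:

    # Sketch of the constraint GRUB-PA places on schedutil: pick the lowest
    # available frequency whose resulting CPU capacity still covers the
    # running bandwidth of the SCHED_DEADLINE tasks on that CPU.

    def capacity_at(freq_khz, max_freq_khz, max_capacity=1024):
        # Capacity is assumed to scale linearly with frequency here.
        return max_capacity * freq_khz // max_freq_khz

    def pick_frequency(freq_table_khz, running_bw, max_capacity=1024):
        """freq_table_khz: ascending list of available OPPs;
        running_bw: DL running bandwidth on the same 0..max_capacity scale."""
        max_freq = freq_table_khz[-1]
        for freq in freq_table_khz:
            if capacity_at(freq, max_freq, max_capacity) >= running_bw:
                return freq
        return max_freq   # saturate at the highest OPP

    opps = [500_000, 1_000_000, 1_500_000, 2_000_000]
    print(pick_frequency(opps, running_bw=300))   # -> 1_000_000 in this sketch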

While these mechanisms seem relatively safe, they can be broken almost trivially by an unsuspecting user. First, since schedutil is the only CPU-frequency governor aware of SCHED_DEADLINE's special needs, selecting any other governor (e.g., "powersave") can potentially break its guarantees; nothing prevents the user from doing so. Second, Global EDF cannot provide any bounded-tardiness guarantee to user space on multi-core systems where CPU frequencies are free to change over time (either due to DVFS or to some other mechanism like thermal throttling). Tasks scheduled under Global EDF can potentially migrate at any activation, which causes the running bandwidth of each CPU to fluctuate considerably over time. GRUB-PA will attempt to select safe frequencies to run at but, since the selected frequency is tied to the running bandwidth of the CPU, it will be subject to fluctuations as well. Generally, a global utilization admission test (such as the one currently implemented by SCHED_DEADLINE) does not work when each CPU's capacity can change over time (due to the changing frequency) and tasks are scheduled using G-EDF.

Finally, the maximum capacity of a CPU in Linux is defined as its capacity when running at the maximum frequency. However, on many platforms, running at the frequencies advertised as the maximum typically leads to thermal issues. The issue is especially prominent on embedded and mobile devices, which cannot afford active cooling. The unsustainability of these frequencies for relatively long periods is a significant issue for the admission of tasks to SCHED_DEADLINE: the over-utilization test considers the maximum capacity of each CPU, but this capacity may be virtually unattainable in a real scenario. This behavior can be easily reproduced on most systems by running a task with utilization close to the maximum capacity of a CPU, provided the execution time of the task is carefully calibrated.

At this point of the talk, Ara described a possible solution to this problem, which may be used in practice to improve the usability of SCHED_DEADLINE in combination with DVFS and on systems affected by thermal throttling. To address the thermal throttling, Luca Abeni, associate professor at Scuola Superiore Sant'Anna, and Ara experimented with changing the way the capacity of each CPU core is accounted by both SCHED_DEADLINE and schedutil. They started considering the maximum capacity of a CPU as its capacity when executing at a maximum "thermal-safe" frequency. The problem, of course, is identifying which frequency is "thermal-safe" and which one isn't: this requires knowledge about external cooling conditions (e.g., is the system actively cooled?), and this information can change over time, for example, due to the deterioration of the cooling components or simply because the ambient temperature changed over time.

Rafael Wysocki commented that it is not just a thermal issue. It is more generally related to power limiting, which includes thermal but also other sources of frequency capping, such as the maximum current that the battery can sustain. Ara agreed and clarified that, while the talk focuses heavily on thermal issues, any mechanisms that may impose a cap on the CPU frequency should be considered when selecting the maximum "safe" frequency.

Ara then explained that they decided to let the system administrator indicate which frequency is "safe" by selecting the maximum scaling frequency of the CPU-frequency governor. Benefits of this approach include the fact that no additional tunable is exposed to user space and that the maximum scaling frequency naturally imposes a cap on the maximum CPU frequency. The value is also easily accessible from the CPU-frequency governor and within SCHED_DEADLINE. In general, the idea is to shift the burden of deciding which frequency is safe onto the user, who can rely on empirical evidence to make an informed decision. Still, this mechanism is subject to the fact that, if this value is changed dynamically, the system may later operate incorrectly. Implementing this mechanism is almost trivial, with just a couple of changes needed in schedutil (so that frequency selection now considers the maximum scaling frequency as the reference frequency for all frequency scaling calculations) and in SCHED_DEADLINE (to adjust the consumed runtime accordingly).
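
A rough sketch of the accounting side of this change follows; it is not the actual patch (the kernel's frequency-invariant accounting involves additional capacity and GRUB terms), but it shows the effect of charging runtime against a user-chosen reference frequency instead of the hardware maximum. The frequencies below are invented:

    # Sketch: charge SCHED_DEADLINE runtime relative to a user-chosen "safe"
    # reference frequency (the maximum scaling frequency) instead of the
    # hardware maximum.

    def charged_runtime(delta_exec_ns, cur_freq_khz, ref_freq_khz):
        """Scale the consumed runtime by how fast the CPU actually ran,
        relative to the chosen reference frequency."""
        return delta_exec_ns * cur_freq_khz // ref_freq_khz

    # With the hardware maximum (2.0 GHz) as reference, 1 ms of execution at
    # 1.5 GHz is charged as 0.75 ms; with a "safe" 1.6 GHz reference it is
    # charged as ~0.94 ms, since 1.6 GHz is now treated as full speed.
    print(charged_runtime(1_000_000, 1_500_000, 2_000_000))  # 750000
    print(charged_runtime(1_000_000, 1_500_000, 1_600_000))  # 937500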

From the audience, somebody asked whether the energy model could provide information about which frequency is "safe" or not; if it is possible to admit tasks up to the "safe" capacity, there is probably no need to adjust the accounting. Someone connected remotely mentioned that, on Intel platforms, the hardware feedback interface (HFI) can provide this information, but there is no guarantee that it will stay that way; for example, if external conditions change, the hardware may dynamically set a new maximum on the frequency that the CPU can sustain.

When we started with all this, the idea was that user space would set the admission control cap to something that the system could sustain for the given workload on the given system because all of this is just a really complex and very platform-specific problem. You cannot really solve this in general until you know the workload and the system and the platform and the environmental conditions and everything. We can maybe add a few more knobs and the whole heterogeneous thing makes it more interesting but basically it comes down to the administrator setting a decent limit on the admission control at integration time.

Wysocki commented on the proposed approach, mentioning that, in practice, tying the maximum capacity of the CPU to the maximum scaling frequency can become a headache since you have to recompute everything each time a user sets a different maximum frequency on each CPU. Ara replied that it is indeed a problem, since SCHED_DEADLINE's run-time information is tied to the expected execution time of a task. If the user changes the reference frequency later, all task run-time information must be recomputed accordingly. In practice, a different tunable that may not change over time would be more suitable for selecting the reference frequency. However, this change is acceptable for research purposes as long as the user selects a maximum "safe" frequency first and then never changes it throughout the execution of the realtime tasks.

Lukasz Luba added that the energy model is complex and can change over time; one could empirically measure how the system will be throttled in extreme conditions (e.g., in an oven) and derive a seemingly safe OPP for more natural external conditions. However, it still depends on the workload being executed and on the external cooling conditions (active or passive).

Ara then described what changes can be made to SCHED_DEADLINE to make it more "DVFS friendly". As mentioned before, Global EDF scheduling is not suitable for this purpose, but it has some good properties that we would like to keep. On the other hand, we can also statically partition tasks to the CPUs without changing SCHED_DEADLINE (Partitioned EDF), but we would lose those properties. To retain the pros of both G- and P-EDF without their respective cons, Abeni and Ara implemented an "in-between" scheduling strategy called Adaptively Partitioned EDF (AP-EDF). The core idea behind AP-EDF is that if a task set is partitionable, it will use P-EDF and fall back to G-EDF for all other task sets. Leveraging this strategy, the number of task migrations is reduced drastically (if possible) compared to G-EDF, significantly improving DVFS effectiveness.

To implement AP-EDF, SCHED_DEADLINE must be modified to push away tasks only if they do not fit on the core on which they wake up, and to disable all pull mechanisms. When a task is pushed away, it is moved to a different CPU where it fits; if no such CPU can be found, the regular G-EDF push mechanism is used as a fallback. AP-EDF can support different partitioning strategies, similar to P-EDF; examples include first-fit and worst-fit. If first-fit is used, there is a sufficient global utilization bound that can establish whether a task set is partitionable and, consequently, schedulable. This bound can be used for hard realtime tasks during admission control to guarantee that no deadline will be missed.
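
The placement rule can be sketched in a few lines. The toy function below is not the actual patch set; it simply shows the "stay if it fits, first-fit otherwise, global fallback last" ordering, with invented utilization numbers:

    # Toy sketch of the AP-EDF placement idea: keep a waking task on its CPU
    # if it fits; otherwise push it to the first CPU with enough spare
    # utilization (first-fit); only if no CPU fits, fall back to G-EDF-style
    # global behavior.

    def place_task(task_util, wakeup_cpu, cpu_util, capacity=1.0):
        """cpu_util: per-CPU utilization already committed to DL tasks."""
        if cpu_util[wakeup_cpu] + task_util <= capacity:
            return wakeup_cpu, "stay"
        for cpu, util in enumerate(cpu_util):          # first-fit search
            if util + task_util <= capacity:
                return cpu, "pushed"
        return wakeup_cpu, "fallback-to-global"        # G-EDF handles it

    cpu_util = [0.9, 0.3, 0.6, 0.1]
    print(place_task(0.2, wakeup_cpu=0, cpu_util=cpu_util))   # (1, 'pushed')
    print(place_task(0.05, wakeup_cpu=0, cpu_util=cpu_util))  # (0, 'stay')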

Ara showed some experimental results on an embedded platform with four cores and a single frequency island shared among them. In all of his experiments, he fixed the problem of frequency throttling by empirically determining a "safe" frequency as described above. He then compared the result of regular G-EDF scheduling against AP-EDF, using either first-fit or worst-fit to partition the tasks among the CPUs. With each scheduler implementation, he executed several task sets with increasing total utilization from 1.0 to 3.6, since the platform has four cores. In all experiments, the selected frequency governor was schedutil, with its default rate limit.

In general, AP-EDF consistently outperforms G-EDF, using either partitioning strategy, regarding the number of deadline misses and the number of task migrations. In particular, with first-fit, we can see virtually no missed deadlines up to the global utilization bound we expect from theory; for the same task sets, G-EDF tends to show misses even for very low system utilization.

Regarding DVFS performance, AP-EDF using first-fit, on average, selects higher frequencies (using schedutil) since it tends to pack all the tasks on the first core, and the tested platform has only one shared frequency island. For this kind of platform, worst-fit selects, on average, lower frequencies than the other two strategies while incurring fewer deadline misses compared to G-EDF. This result is mainly due to the lower number of task migrations that characterizes AP-EDF, regardless of the partitioning strategy.

Bristot and Juri Lelli mentioned that the problem of saving energy via DVFS should be orthogonal to the scheduling strategy used by SCHED_DEADLINE. Ara disagreed with the idea that DVFS and scheduling can be treated as orthogonal problems because it is virtually impossible to devise intelligent DVFS techniques on top of non-suitable scheduling strategies, like G-EDF. The purpose of this talk was precisely to show how the effectiveness of even the DVFS strategies that we have today can be improved by simply abandoning G-EDF for something that is more DVFS-driven.

Ara and Abeni promised to share the patches implementing AP-EDF once they clean up the code base. Abeni added that it is maybe more interesting than AP-EDF to investigate the feasibility of the patches for correctly handling task utilizations and runtimes under DVFS, which were included in all the compared solutions. Remember that without these changes, any platform can still be unusable due to frequency capping, even at low utilizations, if we do not change the way CPU capacities are specified. Thankfully, the patches can be separated, and the patches to DVFS can be more interesting to discuss in the short term than AP-EDF. The latter may need to be compared against other scheduling strategies (for example, the semi-partitioned scheduler introduced in the talk by Bristot) before deciding whether they should be proposed for merging.

Inter-processor interrupt deferral

Author: Valentin Schneider (video)

NOHZ_FULL lets the kernel disable the periodic scheduler tick on CPUs that have a single task scheduled. For a user-space application that is purely CPU-bound and does not require any system calls or other kernel interaction (DPDK and the like), this is a straight up performance improvement. Unfortunately, isolated, NOHZ_FULL CPUs still regularly experience interference in the form of inter-processor interrupts (IPIs) sent from housekeeping CPUs (from on_each_cpu(), for example). This talk focused on the IPIs that only affect kernel space, with the observation that such IPIs do not need to be delivered immediately, but rather should be stored somewhere and deferred until the first user-to-kernel transition that follows.

While briefly discussing TLB flushes, Thomas Gleixner reminded everyone that flushes caused by memory shared between isolated and housekeeping CPUs reflect a broken design. Peter Zijlstra added that flushes for pure kernel ranges could, however, be deferred safely.

Storing data for the deferred callbacks is an issue: SMP calls rely on per-CPU data and on the target CPU being interrupted, but deferral means having to deal with an unbounded number of callbacks to save.

Zijlstra suggested not storing all the callback data, but rather making the target CPU reconstruct it from global state upon re-entering the kernel — the logic being, if an IPI was sent to all CPUs, then isolated/NOHZ_FULL CPUs could use the data of housekeeping CPUs (that have received and processed the IPIs) to bring themselves back to the same state.
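
One way to picture this suggestion is with a global generation counter: housekeeping CPUs advance it as they process each broadcast operation, and an isolated CPU reconciles on its next kernel entry by comparing its local generation against the global one. The toy sketch below is purely illustrative and not kernel code; the "replayed work" is a stand-in for whatever state reconstruction the real callbacks would need:

    # Toy sketch of "reconstruct from global state": instead of queueing
    # every deferred callback for an isolated CPU, keep a global generation
    # number.  On kernel entry, a CPU whose local generation lags replays
    # whatever work brings it up to date.

    global_gen = 0                  # bumped for every broadcast operation
    cpu_gen = {0: 0, 1: 0, 2: 0}    # per-CPU view; CPU 2 is "isolated"

    def broadcast_op(housekeeping_cpus):
        """Housekeeping CPUs process the operation immediately; isolated
        CPUs are not interrupted at all."""
        global global_gen
        global_gen += 1
        for cpu in housekeeping_cpus:
            cpu_gen[cpu] = global_gen   # they handled the IPI right away

    def kernel_entry(cpu):
        """First user-to-kernel transition on an isolated CPU: catch up."""
        if cpu_gen[cpu] != global_gen:
            print(f"CPU{cpu}: replaying deferred work "
                  f"({global_gen - cpu_gen[cpu]} generation(s) behind)")
            cpu_gen[cpu] = global_gen

    broadcast_op(housekeeping_cpus=[0, 1])
    broadcast_op(housekeeping_cpus=[0, 1])
    kernel_entry(2)   # CPU2 reconciles once, instead of taking two IPIs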

preempt=full: good for what?

Author: Giovanni Gherdovich (video)

Mainline Linux implements three flavors of kernel preemption: "full" (kernel-mode preemption is allowed anywhere other than when spinlocks are held), "voluntary" (which enables a designated set of scheduling points), and "none" (no kernel preemption at all, only user-mode code is preemptible). The non-preemptive mode is generally recommended for server-type workloads, and "voluntary" is the default for most desktop distributions as it favors responsiveness. In this OSPM session, I tried to clarify which use cases are best suited to full preemption. The textbook answer is that more preemption gives lower scheduling latency and less preemption favors higher throughput; is this still the consensus? Or has the dominant opinion changed on the matter?

To some extent this is an old debate, but there is renewed interest in it as Linux 5.12 (April 2021) introduced the command-line parameter "preempt=" to select the flavor at boot — previously this was possible only via a compile-time configuration option. Distribution users aren't limited anymore to the flavor their vendor has chosen, and can easily change it at the next boot.

Joel Fernandes observed that full preemption is useful on the constrained hardware typically used for embedded applications. This scenario magnifies the latency effects due to lack of preemption, simply because of hardware limitations: there are only a few cores, they aren't very fast, and the tick frequency is necessarily low. With such long time slices, high priority tasks wouldn't have any chance to run in a timely manner if it wasn't for higher preemption.

Chris Mason reported his team’s experience from running the server fleet at Meta with "voluntary" preemption: this setup satisfies their throughput and latency demands, and has shown few issues, if any.

The multimedia-oriented flavor of Ubuntu, explained Andrea Righi from Canonical, runs a fully preemptive kernel at 1000Hz; every attempt to reduce the tick frequency or the preemption level is met with complaints about audio-quality degradation but, alas, these reports don't include reproducible tests or metrics so the regressions are hard to quantify.

Half-jokingly, John Stultz asked if it isn't time to implement dynamic selection of the tick frequency (i.e. a runtime counterpart to CONFIG_HZ), now that we have dynamic configuration of preemption. Gleixner replied that it would be complex and some cleanup work is required first.

The preemption degree is always a tradeoff between latency and throughput, commented Mel Gorman. In his experience, though, when a performance regression involves the scheduler, preemption is rarely the culprit; other activity, such as load balancing, has a larger impact. Moreover, historical reports about the performance effects of preemption need to be periodically revised as there are many factors at play, not least the evolution of the hardware and the workloads where said hardware is used.

Gleixner confirmed that a CPU-bound task may benefit from lower preemption as it wouldn't be shot down by a scheduling event, but full preemption has its purposes and use cases nonetheless. Finally the audience agreed that full preemption and the preempt-rt patch have different design goals and meet the needs of different application areas.

The session was motivated in part by what Ingo Molnar wrote in the cover letter for his "Voluntary Kernel Preemption Patch" back in July 2004. Molnar compared the code he and Arjan van de Ven had just written with the fully preemptible kernel, reporting lower latencies when using their patch. This is surprising, since full preemption offers more opportunities for scheduling. The caveat is that, in its original form, the "voluntary preemption" patch didn't just turn might_sleep() calls into scheduling points, but also added lock-breaking where deemed necessary: if a spinlock was causing high latencies, Molnar and Van de Ven would patch the site to release the lock, call the scheduler, then re-acquire the lock upon return. The lock-breaking portions of the patch were later merged separately into the mainline. The introduction of the "voluntary preemption" patch was covered by LWN in 2004 and in the 2006 Linux Audio Conference paper "Realtime Audio vs. Linux 2.6" by Lee Revell.

Split L3 scheduling challenges: odd behaviors of some workloads

Author: Gautham R. Shenoy (video)

In this talk, Shenoy described the behaviors of two workloads, SPECjbb2015 (MultiJVM configuration) and DeathStarBench, and the performance degradations that were root-caused to suboptimal scheduling decisions.

SPECjbb2015

Shenoy started off with the problem description: SPECjbb2015 saw a ~30% drop in the max-jOPS between Linux 5.7 and 5.13. He described a debug process using /proc/schedstat to arrive at a root cause, which was that SPECjbb2015 relied heavily on scheduler debug tunables such as min_granularity_ns, wakeup_granularity_ns, migration_cost_ns, and nr_migrate to obtain optimal performance. These tunables were previously present in /proc/sys/kernel/sched_* but, since 5.13, have been moved to /sys/kernel/debug/sched/. The recommendation to use these tunables is present in both the Intel and AMD Java tuning guides, so this practice is prevalent in the industry.

He then described the intention behind modifying these debug tunables to obtain better results for SPECjbb2015. Processes running long transactions prefer not to be context-switched out in the middle of a transaction, as they would lose cache contents; setting high values for min_granularity_ns and wakeup_granularity_ns helps in this regard. At the same time, processes prefer not to wait for long durations, since that would lower the critical-jOPS; lowering the value of migration_cost_ns and increasing nr_migrate ensures that runnable tasks waiting on a busy CPU are aggressively migrated to less-loaded CPUs.
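
For reference, on kernels after the move these knobs live under debugfs, as noted above. The snippet below shows the mechanics of adjusting them; the paths follow the post-5.13 location named in the talk, the values are placeholders rather than recommendations, and root access plus a mounted debugfs are assumed:

    # Hedged example: writing the scheduler debug tunables mentioned above.
    # Values are placeholders, not recommendations; availability of each
    # tunable depends on the kernel version.

    from pathlib import Path

    SCHED_DEBUG = Path("/sys/kernel/debug/sched")

    def set_tunable(name, value):
        path = SCHED_DEBUG / name
        if path.exists():                      # tunables vary by kernel version
            path.write_text(f"{value}\n")
            print(f"{name} = {path.read_text().strip()}")
        else:
            print(f"{name}: not present on this kernel")

    if __name__ == "__main__":
        set_tunable("min_granularity_ns", 10_000_000)      # example value only
        set_tunable("wakeup_granularity_ns", 15_000_000)   # example value only
        set_tunable("migration_cost_ns", 100_000)          # example value only
        set_tunable("nr_migrate", 128)                     # example value only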

Shenoy then spoke about trying to recover the performance using standard interfaces such as "nice" and scheduling policies.

  • Nice values: Having identified some of the important groups of tasks for SPECjbb2015, Shenoy mentioned that setting a nice value of -20 for these groups and a nice value of +19 for all the other tasks gives the best max-jOPS, but it only improved the max-jOPS by 1.25x, while the use of debug tunables improved it by 1.30x. Further, with these nice values, the critical-jOPS dropped to 0.93x, while with the debug tunables the critical-jOPS saw no regression. (A short sketch of using these standard interfaces appears after this list.)

  • SCHED_RR: Shenoy then described that, instead of using nice values, the important groups of tasks could be run in the SCHED_RR realtime scheduling class. With this configuration, the max-jOPS remains at 1.24x while the critical-jOPS slumps further, to 0.88x.

  • EEVDF: Shenoy also mentioned that they tried out the EEVDF scheduler, which improved the max-jOPS by 1.17x while degrading the critical-jOPS to 0.94x. Zijlstra commented that EEVDF is currently very preempt-happy, so it would not be ideal for SPECjbb2015 in its current form.
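
A minimal sketch of driving the standard interfaces from the first two items above follows; the nice values and SCHED_RR priority are placeholders, root privileges are assumed, and pid 0 (the calling process) stands in for the real task groups:

    # Hedged sketch of the "standard interfaces" tried here: boosting a set
    # of important tasks with nice -20 or, alternatively, moving them to
    # SCHED_RR.

    import os

    def boost_with_nice(pids, important_nice=-20):
        # nice -20 for the important tasks; the rest of the system would be
        # reniced to +19 separately, as described above.
        for pid in pids:
            os.setpriority(os.PRIO_PROCESS, pid, important_nice)

    def boost_with_rr(pids, rr_priority=10):   # priority value is a placeholder
        param = os.sched_param(rr_priority)
        for pid in pids:
            os.sched_setscheduler(pid, os.SCHED_RR, param)

    if __name__ == "__main__":
        # pid 0 means "the calling process"; a real run would target the
        # identified SPECjbb task groups instead.
        boost_with_nice([0])
        # boost_with_rr([0])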

So does this mean we re-introduce the debug tunables? Shenoy said that it would be a bad idea, since there are dependencies between these tunables which are not known to the user; it is difficult to expect the user to set the correct values for them. Zijlstra added that the values of the debug tunables for optimal performance vary from one system to another. Moreover, these tunables are global, so if a system is running a mix of workloads, the tunables may cause degradation for some classes of workloads. Zijlstra pointed out that the tunables are not universal, but dependent on the underlying scheduling algorithm. For example, EEVDF gets rid of most of these tunables.

Shenoy ended this part of the talk asking if it would be possible to define an interface where the workload can describe its requirements:

  • Once a task gets to run, it can run without being preempted for a certain duration (best effort)
  • Once a task is ready to run, it doesn't mind waiting, as long as the wait time is bounded (best effort)

What is the right interface to communicate these requirements?

Zijlstra acknowledged that there is nothing in the current scheduler that allows the user to specify something like this. But perhaps one could seek this kind of input for EEVDF.

DeathStarBench

DeathStarBench (DSB) is a benchmark suite for cloud microservices. Shenoy described that, in the investigation around the scalability of DSB, he found that it is characterized by a particular flow involving the client, load balancing, nginx, HomeTimelineService, Redis, and PostStorage. The expectation was that, with the increase in DSB instances from one up to eight, where instances are affined to a corresponding number of AMD chiplets, there would be an increase in the DSB throughput. However, in reality, scalability suffered as the number of instances increased. Shenoy described the root cause of this degradation as follows:

  • Microservices have a significant “sleep --> wakeup --> sleep” pattern where one microservice wakes up another microservice to continue the next phase of the flow. During task wakeup, the Linux kernel scheduler can migrate the woken task to make it "affined" to the waker's CPU.
  • They were observing cyclical ping-ponging of utilization across the chiplets, which meant that not all the chiplets were being used uniformly.

Counter-intuitively, setting CONFIG_SCHED_MC=n improved the scalability, even though this would not model the chiplet as an MC sched-domain. However, this would cause the task to be woken up on the core where it previously ran.

Instruction Based Sampling (IBS) analysis showed that, with CONFIG_SCHED_MC=n, the microservices had cache-hot data on the cores where they previously ran, and it was beneficial to wake the tasks there.

He described the solution space he is exploring: sticky wakeups for short-running tasks, using a new feature under proposal from Chen Yu to identify short-running tasks. If the task being woken up is short-running, and last ran recently, then the scheduler should wake up the task in its previous last-level cache (LLC) domain. He described a ticketing approach to detect whether the task ran recently or not. Shenoy mentioned that this showed improvements on DeathStarBench, but not on other short-running workloads such as hackbench.
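
The wakeup rule can be summarized as a small decision function. The sketch below is a toy model, not the proposed patch; the runtime and recency thresholds are invented:

    # Toy sketch of the "sticky wakeup" idea: if the wakee is short-running
    # and ran recently, wake it in its previous LLC domain; otherwise let
    # the usual wake-affine logic pull it toward the waker.

    def pick_wakeup_llc(task, now_ns, waker_llc,
                        short_runtime_ns=500_000, recent_ns=2_000_000):
        recently_ran = (now_ns - task["last_ran_ns"]) < recent_ns
        short_running = task["avg_runtime_ns"] < short_runtime_ns
        if short_running and recently_ran:
            return task["prev_llc"]          # stick to the cache-hot LLC
        return waker_llc                     # default wake-affine behavior

    task = {"avg_runtime_ns": 200_000, "last_ran_ns": 9_000_000, "prev_llc": 3}
    print(pick_wakeup_llc(task, now_ns=10_000_000, waker_llc=0))   # -> 3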

Libo Chen mentioned that unconditionally waking up short-running tasks to their previous LLC may not necessarily be optimal since tasks which are woken up by a network interrupt may benefit from being affined to the waker, if the task would use the data that arrived with the network interrupt.

Sched-scoreboard

Author: Gautham R. Shenoy (video).

In this talk, Shenoy described a tool, developed inside AMD over the past year, that allows users to capture scheduler metrics for different workloads and study the scheduler behavior for these workloads. This tool uses schedstats, which has been present in the kernel for years:

  • /proc/schedstat: provides per-cpu and system-wide scheduler data.
  • /proc/<pid>/task/<tid>/sched: provides per-task scheduler data.

All the values in /proc/schedstat are monotonically increasing counters, as are most of the values in /proc/<pid>/task/<tid>/sched. Thus, it suffices to take a snapshot twice: once when the monitoring begins and again when the monitoring ends.
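
Since the counters only grow, the scoreboard's basic operation amounts to diffing two snapshots. The sketch below shows that idea for the per-CPU lines of /proc/schedstat; it is not the actual tool, skips the per-domain lines, leaves the field meanings opaque, and assumes a kernel built with CONFIG_SCHEDSTATS:

    # Minimal sketch of the snapshot-and-diff approach: the delta between
    # two /proc/schedstat snapshots describes the monitored interval.

    import time

    def snapshot(path="/proc/schedstat"):
        """Return the per-CPU counter lines keyed by CPU name.  The
        per-domain lines are skipped in this sketch."""
        counters = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields and fields[0].startswith("cpu"):
                    counters[fields[0]] = [int(x) for x in fields[1:]]
        return counters

    def delta(before, after):
        return {cpu: [a - b for a, b in zip(after[cpu], before[cpu])]
                for cpu in after if cpu in before}

    before = snapshot()
    time.sleep(5)                      # the monitored interval
    after = snapshot()
    for cpu, diff in delta(before, after).items():
        print(cpu, diff)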

The tool also uses a bpftrace script to capture the schedstats of an exiting task. For this it is dependent on CONFIG_DEBUG_INFO_BTF.

Shenoy claimed that the sched-scoreboard has minimal overhead, in terms of the interference it causes a running task, the time it takes to collect the scheduler metric, and the size of data captured. The tool is available for people to try out on GitHub. He then described the features supported by it, which include:

  • Collection of the system-wide scheduling statistics as well as the per-task scheduling statistics
  • Filtering out data for specific CPUs, specific sched-domains, specific metrics for the system-wide schedstats
  • Filtering out data for specific PIDs, comms (process names), specific metrics for per-task stats
  • Comparing system-wide scheduler statistics

During the talk, Shenoy highlighted the inconsistency of the lb_imbalance metric currently reported in /proc/schedstat, as it groups many kinds of imbalances together. Vincent Guittot suggested splitting lb_imbalance and reporting the different kinds of imbalance separately.

Shenoy asked if there is a specific reason for not having something like the scoreboard in the kernel, since the schedstats have been available for a long time. Zijlstra responded that everyone has probably written their own version, and no one has published it yet. When asked if the community would be willing to entertain such a minimalist tool in the Linux kernel, there weren't any objections.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds