Tracking pressure-stall information
A kernel with this patch set applied will have a new virtual directory called /proc/pressure containing three files. The first, cpu, describes the state of CPU utilization in the system. Reading it will produce a line like this:
some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
The avg numbers give the percentage of the time that runnable processes are delayed because the CPU is unavailable to them, accumulated over 10, 60, and 300 seconds. In a system with just one runnable process per CPU, the numbers will all be zero. If those numbers start to increase significantly, that means that processes are running more slowly than they otherwise would due to overloading of the CPUs. Administrators can use this information to determine whether the amount of delay due to CPU contention is within the bounds they can tolerate or whether something must be done to ensure that things run more quickly.
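As a rough illustration (not part of the patch set itself), a monitoring script could parse this file directly; the sketch below assumes the single "some" line format shown above and omits error handling:

    # Minimal sketch: parse a /proc/pressure file of key=value pairs.
    def read_pressure(path="/proc/pressure/cpu"):
        stalls = {}
        with open(path) as f:
            for line in f:
                kind, rest = line.split(None, 1)          # "some" or "full"
                fields = dict(item.split("=") for item in rest.split())
                stalls[kind] = {
                    "avg10": float(fields["avg10"]),
                    "avg60": float(fields["avg60"]),
                    "avg300": float(fields["avg300"]),
                    "total_us": int(fields["total"]),     # microseconds
                }
        return stalls

    if __name__ == "__main__":
        print(read_pressure())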
These delay numbers resemble the system load average, in that they both give a sense for how busy the system is. The load average is simply the number of processes waiting for the CPU (along with those in short-term I/O waits), though; it needs to be interpreted relative to the number of available CPUs to have meaning. The stall information, instead, tracks the actual amount of waiting time. It is also tracked over a much shorter time range than the load average.
The final number (total) is the total amount of time (in microseconds) during which processes were stalled. It is there to help with the detection of short-term latency spikes that wouldn't show up in the aggregated numbers. A system where a CPU is nearly always available but where occasional 10ms latency spikes are experienced may be entirely acceptable for some workloads, but not for others. For the latter group, the total count can be monitored to detect when those spikes are happening.
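One way to watch for such spikes is to sample total periodically and look at the difference between successive readings; in the sketch below, the one-second polling period and the 10ms-per-interval threshold are arbitrary illustrative values rather than anything specified by the patches:

    import time

    POLL_SECONDS = 1.0
    THRESHOLD_US = 10_000      # 10ms of stall per interval (arbitrary)

    def total_stall_us(path="/proc/pressure/cpu"):
        # Return the "total" field (microseconds) from the "some" line.
        with open(path) as f:
            for line in f:
                if line.startswith("some"):
                    return int(line.rsplit("total=", 1)[1])
        return 0

    last = total_stall_us()
    while True:
        time.sleep(POLL_SECONDS)
        now = total_stall_us()
        if now - last > THRESHOLD_US:
            print(f"CPU stall spike: {now - last}us in the last second")
        last = now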
The next file is /proc/pressure/memory; as might be expected, it provides information on the time that processes spend waiting due to memory pressure. Its output looks like:
some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
The some line is similar to the CPU information: it tracks the percentage of the time that at least one process could be running if it weren't waiting for memory resources. In particular, the time spent for swapping in, refaulting pages from the page cache, and performing direct reclaim is tracked in this way. It is, thus, a good indicator of when the system is thrashing due to a lack of memory.
The full line is a little different: it tracks the time that nobody is able to use the CPU for actual work due to memory pressure. If all processes are waiting for paging I/O, the CPU may look idle, but that's not because of a lack of work to do. If those processes are performing memory reclaim, the end result is nearly the same; the CPU is busy, but it's not doing the work that the computer is there to do. If the full numbers are much above zero, it's clear that the system lacks the memory it needs to support the current workload.
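As a simple illustration, a monitoring script could compare the full averages against whatever tolerance a given workload allows; the 10% threshold below is purely an example, not a recommended value:

    # Read the ten-second "full" average from /proc/pressure/memory.
    def memory_full_avg10(path="/proc/pressure/memory"):
        with open(path) as f:
            for line in f:
                if line.startswith("full"):
                    fields = dict(kv.split("=") for kv in line.split()[1:])
                    return float(fields["avg10"])
        return 0.0

    if memory_full_avg10() > 10.0:   # arbitrary example threshold
        print("memory pressure: the workload is stalling on reclaim/paging")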
Some care has been taken to distinguish paging due to thrashing from other sorts of paging. A process that is just starting up will experience a lot of page faults as its working set is brought in, but those are not really indicative of system load. For that reason, refaulted pages — those which were evicted due to memory pressure and subsequently brought back in — are used to calculate these metrics (see this article for a description of how refaults are tracked). Even then, though, there is a twist, in that a process may need different sets of pages during different phases of its execution. To try to detect the transition between different working sets, the patch set adds tracking of whether each page has made it to the active list (was used more than once, essentially) since it was faulted in. Only the pages that are actually used are counted when the stall times are calculated.
The final file is /proc/pressure/io, which tracks the time lost waiting for I/O. This number is likely to be more difficult to make good use of without some sense for what the baseline values should be. The block subsystem isn't able to track the amount of extra time spent waiting due to contention for the device, so the resulting numbers will not be directly related to that contention.
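One plausible approach is simply to record a baseline during a known-good period and flag later samples that are well above it; in this sketch both the sampling window and the "well above" test are arbitrary choices:

    import time

    def io_some_avg60(path="/proc/pressure/io"):
        # Return the one-minute "some" average from /proc/pressure/io.
        with open(path) as f:
            for line in f:
                if line.startswith("some"):
                    fields = dict(kv.split("=") for kv in line.split()[1:])
                    return float(fields["avg60"])
        return 0.0

    # Average a few samples taken a minute apart to form the baseline.
    samples = []
    for _ in range(5):
        samples.append(io_some_avg60())
        time.sleep(60)
    baseline = sum(samples) / len(samples)

    current = io_some_avg60()
    if current > 2 * baseline + 1.0:   # arbitrary "well above baseline" test
        print(f"I/O pressure {current:.2f}% vs. baseline {baseline:.2f}%")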
The files in /proc/pressure track the state of the system as a whole. In systems where control groups are in use, there will also be a set of files (cpu.pressure, memory.pressure, and io.pressure) associated with each group. They can be used to ensure that the resource limits for each group make sense; they should also make it easier to determine which processes are thrashing on a busy system.
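For example, a script could walk the control-group hierarchy and read each group's memory.pressure file, which uses the same format as /proc/pressure/memory; the /sys/fs/cgroup mount point assumed here is the conventional cgroup-v2 location:

    import os

    def cgroup_memory_pressure(root="/sys/fs/cgroup"):
        # Collect the memory.pressure contents for every group under root.
        results = {}
        for dirpath, dirnames, filenames in os.walk(root):
            if "memory.pressure" in filenames:
                with open(os.path.join(dirpath, "memory.pressure")) as f:
                    results[os.path.relpath(dirpath, root)] = f.read().strip()
        return results

    for group, pressure in cgroup_memory_pressure().items():
        print(group)
        print("  " + pressure.replace("\n", "\n  "))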
This functionality has apparently been used within Facebook for some time, and has helped considerably in the optimization of system resources and the diagnosis of problems. "We now log and graph pressure for the containers in our fleet and can trivially link latency spikes and throughput drops to shortages of specific resources after the fact, and fix the job config/scheduling", Weiner said.

There is also evidently interest from the Android world, where developers are looking for better ways of detecting out-of-memory situations before system performance is entirely lost. Linus Torvalds has indicated that the idea looks interesting to him. There are still some open questions on how the CPU data is accumulated (see this message for a long explanation), but one assumes that will be worked out before too long. So, in all likelihood, the pressure-stall patches will not be stalled for too long before making it into the mainline.
Posted Jul 15, 2018 6:16 UTC (Sun) by marcH (subscriber, #57642):

Sounds great but... was there no measurement similar to this already? Kolivas and others have been working on bounded latency for a decade or two. I very vaguely remember the lack of measurements/proofs was a point of contention, so maybe that answers my question?
Posted Nov 5, 2018 17:42 UTC (Mon) by Roger.Weihrauch (guest, #128426):

What will be the differences between using pressure-stall information and using zram/zswap? Will this feature only report that a system is low on resources, or will it also manage the system under reduced resources? I am not a very technical guy, but I have always been somewhat successful at keeping a system 'breathing' under high load and reduced memory using a specially configured combination of zswap and zram. Thanks to you all and your responses for clearing up my misunderstanding. Kind regards, Roger
Posted Nov 5, 2018 18:54 UTC (Mon) by farnz (subscriber, #17727):
This is orthogonal to zram/zswap.
zram/zswap/ordinary swap provide the kernel with a way to free up memory that's not been recently used in favour of things that are actively in use. Thus, the kernel can keep going when it's out of memory, by freeing up not recently used memory for more useful things.
Pressure Stall Information lets a userspace monitoring system determine whether the system is spending too much time on management overheads (e.g. putting data into zram and taking it back out again) and not enough time doing useful work. You can then act on this to do something clever - e.g. move a job around between servers in a private cloud, terminate optimistic background work (such as indexing a filesystem), alert the user that they need to fix it.
Posted Jun 24, 2019 2:29 UTC (Mon) by paul.tobias (guest, #107265):
The pressure-stall information is available in linux-stable since v4.20; it is enabled with the CONFIG_PSI option.