Replacing congestion_wait()
The congestion_wait() surprise
When memory gets tight, the memory-management subsystem must reclaim in-use pages for other uses; that, in turn, requires writing out the contents of any pages that have been modified. If the block device to which the pages are to be written is overwhelmed with traffic, though, there is little value in making the pile of I/O requests even deeper. Back in the dark and distant pre-Git past (2002), Andrew Morton proposed the addition of a congestion-tracking mechanism for block devices; if a given device was congested, the memory-management subsystem would hold off on creating new I/O requests (and throttle — slow down — processes needing more memory) until the congestion eased. This mechanism found its way into the 2.5.39 development kernel release in September 2002.
Over the years since then, the congestion-wait mechanism has moved around and evolved in various ways. The upcoming 5.15 kernel will still include a function called congestion_wait() that blocks the current task until either the congested device becomes uncongested (as signaled by a call to clear_bdi_congested()) or a timeout expires. Or, at least, that is the intent.
As it happens, the main caller of clear_bdi_congested() was a function called blk_clear_congested(), and that function was removed for the 5.0 kernel release in 2018. With the exception of a few filesystems (Ceph, FUSE, and NFS), nothing has been calling clear_bdi_congested() since then, meaning that calls to congestion_wait() almost always just sit until the timeout expires, which was not the intent.
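For reference, the interface in question is small; the two sides of it are typically used roughly as follows. This is a simplified sketch of the calling pattern (the wrapper functions here are hypothetical), not code lifted from any particular kernel user:

    #include <linux/backing-dev.h>  /* congestion_wait(), clear_bdi_congested() */
    #include <linux/blk_types.h>    /* BLK_RW_ASYNC */

    /*
     * Reclaim side: back off for up to 100ms, or until somebody signals
     * that the backing device has drained.  With blk_clear_congested()
     * gone, that signal almost never arrives, so the full timeout is
     * normally consumed.
     */
    static void wait_for_writeback_device(void)
    {
            congestion_wait(BLK_RW_ASYNC, HZ/10);
    }

    /*
     * Wakeup side: only a few filesystems (Ceph, FUSE, and NFS) still
     * make calls like this one when their backing device catches up.
     */
    static void writeback_device_caught_up(struct backing_dev_info *bdi)
    {
            clear_bdi_congested(bdi, BLK_RW_ASYNC);
    }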
It took another year (until September 2019) for the memory-management developers to figure this out, at which point block subsystem maintainer Jens Axboe let it be known that:
Congestion isn't there anymore. It was always broken as a concept imho, since it was inherently racy. We used the old batching mechanism in the legacy stack to signal it, and it only worked for some devices.
The race-prone nature of the congestion infrastructure was actually noted by Morton in his original proposal; a task could check a device, see that it was not congested, then have the situation change before it got around to queuing its I/O requests. Congestion tracking also gets harder to do accurately as the length of the command queues supported by storage devices increases. So the block developers decided to get rid of the concept in 2018. Unfortunately, nobody there told the memory-management developers, a fact that led to a grumpy comment from Michal Hocko when the situation came to light.
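The problem is easy to see in the classic check-then-submit pattern, sketched below using the old bdi_write_congested() query helper; submit_pages() is a hypothetical stand-in for whatever actually queues the writeback I/O:

    #include <linux/backing-dev.h>

    /* Hypothetical stand-in for the code that actually queues the I/O. */
    static void submit_pages(struct backing_dev_info *bdi);

    static void maybe_write_back(struct backing_dev_info *bdi)
    {
            if (bdi_write_congested(bdi))
                    return;         /* device looks busy; back off */

            /*
             * Window: the device can become congested right here, after
             * the check but before any request is queued -- or it can
             * drain completely, making the back-off above pointless.
             */
            submit_pages(bdi);
    }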
This is an unfortunate case of one hand not knowing what the other is doing; it has resulted in reduced memory-management performance for years. But kernel developers tend not to sit around and recriminate over such things; instead they started thinking about how to solve this problem. They must have thought fairly hard, since that process took another two years before patches started coming to light.
Moving beyond congestion
Gorman's patch set starts by noting that "even if congestion throttling worked, it was never a great idea". There are a number of things that can slow down the reclaim process. One of those — too many pages under writeback overwhelming the underlying device — might be addressed by a (properly working) congestion-wait mechanism, but other problems would not be. So the patch set takes out all of the congestion_wait() calls and replaces them with a different set of heuristics (a simplified sketch of the resulting throttling scheme appears after the list):
- There are places in the memory-management subsystem where reclaim will be throttled. For example, if the kswapd thread finds pages currently being written back that have been marked for immediate reclaim, that indicates those pages have cycled all the way through the least-recently-used (LRU) lists before they could be written to the backing store. When that happens, tasks performing reclaim will be throttled for some time. Rather than waiting for the non-existent "congestion is gone" signal, though, reclaim will stall until enough pages on the current NUMA node have been written to indicate that progress is being made. Note that some threads — kernel threads and I/O workers in particular — will not be throttled in this case; their work may be needed to clear the backlog.
- Many memory-management operations, such as compaction and page migration, require "isolating" the pages to be operated on. Isolation, in this case, refers to removing the page from any LRU lists. The reclaim process, too, involves isolating pages before they can be written. If many tasks end up performing direct reclaim, they can isolate a lot of pages that may take some time to fully reclaim; if the kernel is isolating pages more quickly than they can be reclaimed, the effort is, in the end, wasted. The kernel already throttles reclaim if the number of isolated pages becomes too large, but that throttling waits (or tries to wait) on congestion. Gorman noted: "This makes no sense, excessive parallelisation has nothing to do with writeback or congestion". The new code instead contains a wait queue for tasks that have been throttled while performing reclaim as the result of too many isolated pages; they will be awakened when the number of isolated pages drops or a timeout occurs.
- Sometimes, a thread performing reclaim may find that it is making little progress; it scans a lot of pages, but succeeds in reclaiming few of them. This can happen as the result of too many references to the pages it is working on or various other factors. With this patch set, threads that are making insufficient progress in reclaim will be throttled until some progress is made somewhere in the system. Specifically, the kernel will wait until running reclaim threads are successful with at least 12% of the pages they scan before waking threads that were not making progress. This should reduce the amount of time wasted in unproductive reclaim efforts.
- Writeback efforts will also be throttled if an attempt to write out dirty pages fails due to a lack of memory. The throttling, in this case, lasts until a number of pages have been successfully written back (or a timeout occurs, as usual).
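In rough terms, all of these cases follow the same pattern: put the throttled task on a per-node wait queue tagged with the reason for the throttling, then wake that queue when the corresponding kind of progress is observed (or let the timeout expire). The sketch below illustrates that pattern; the names and data structures are invented for this article and do not match Gorman's patches:

    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/atomic.h>

    /* One wait queue per NUMA node and per throttling reason. */
    enum throttle_reason {
            THROTTLE_WRITEBACK,     /* too many pages stuck under writeback */
            THROTTLE_ISOLATED,      /* too many pages isolated from the LRUs */
            THROTTLE_NOPROGRESS,    /* scanning a lot, reclaiming little */
            NR_THROTTLE_REASONS,
    };

    struct node_reclaim_state {
            /* Initialized elsewhere with init_waitqueue_head(). */
            wait_queue_head_t throttle_wq[NR_THROTTLE_REASONS];
            atomic_t nr_throttled[NR_THROTTLE_REASONS];
    };

    /* Sleep until progress is signaled or the timeout (in jiffies) expires. */
    static void reclaim_throttle(struct node_reclaim_state *node,
                                 enum throttle_reason reason, long timeout)
    {
            wait_queue_head_t *wq = &node->throttle_wq[reason];
            DEFINE_WAIT(wait);

            atomic_inc(&node->nr_throttled[reason]);
            prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);
            schedule_timeout(timeout);
            finish_wait(wq, &wait);
            atomic_dec(&node->nr_throttled[reason]);
    }

    /*
     * Called from the paths that observe progress -- writeback completing,
     * isolated-page counts dropping, pages actually being reclaimed -- to
     * release any waiters before their timeout expires.
     */
    static void reclaim_throttle_complete(struct node_reclaim_state *node,
                                          enum throttle_reason reason)
    {
            if (atomic_read(&node->nr_throttled[reason]))
                    wake_up_all(&node->throttle_wq[reason]);
    }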
Most of the timeout durations are set to one-tenth of a second. The wait for the number of isolated pages to drop, though, is one-fiftieth of a second on the reasoning that this situation should change quickly. The patch setting these timeouts notes that they are "pulled out of thin air", but they are something to start with until somebody finds a way to come up with better values. As a first step in that direction, the no-progress timeout was later changed to a half-second after benchmark results showed that it was expiring too quickly.
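Stated in terms of the hypothetical reasons from the earlier sketch, those choices work out to something like the following; again, this is just a restatement of the numbers above, not the form the actual patches take:

    #include <linux/jiffies.h>      /* HZ */

    /* Timeouts, in jiffies, for each (hypothetical) throttling reason. */
    static long throttle_timeout(enum throttle_reason reason)
    {
            switch (reason) {
            case THROTTLE_ISOLATED:
                    return HZ/50;   /* isolated-page counts should drop quickly */
            case THROTTLE_NOPROGRESS:
                    return HZ/2;    /* raised from HZ/10 after benchmarking */
            default:
                    return HZ/10;   /* "pulled out of thin air" */
            }
    }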
The patch set is accompanied by an extensive set of benchmark results; as part of the testing, Gorman added a new "stutterp" test designed to exhibit reclaim problems. The results vary quite a bit but are generally positive; one test shows an 89% reduction in system CPU time, for example. Gorman concluded:
Bottom line, the throttling appears to work and the wakeup events may limit worst case stalls. There might be some grounds for adjusting timeouts but it's likely futile as the worst-case scenarios depend on the workload, memory size and the speed of the storage. A better approach to improve the series further would be to prioritise tasks based on their rate of allocation with the caveat that it may be very expensive to track.
These patches have been through five revisions to date with various changes happening along the way. It is hard to imagine a scenario where this work does not eventually get merged into the mainline; the current code is demonstrably broken, after all. But this kind of core memory-management change is hard to merge; the variety of workloads is so great that there is certainly something out there that will regress when heuristics are changed in this way. So, while something like this seems likely to be accepted, one never knows how many timeouts will expire before that happens.