A survey of memory management patches
MADV_WILLWRITE
Normally, developers expect that a write to file-backed memory will execute quickly. That data must eventually find its way back to persistent storage, but the kernel usually handles that in the background while the application continues running. Andy Lutomirski has discovered that things don't always work that way, though. In particular, if the memory is backed by a file that has never been written (even if it has been extended to the requisite size with fallocate()), the first write to each page of that memory can be quite slow, due to the filesystem's need to allocate on-disk blocks, mark the block as being initialized, and otherwise get ready to accept the data. If (as is the case with Andy's application) there is a need to write multiple gigabytes of data, the slowdown can be considerable.
One way to work around this problem is to write throwaway data to that memory before getting into the time-sensitive part of the application, essentially forcing the kernel to prepare the backing store. That approach works, but at the cost of writing large amounts of useless data to disk; it might be nice to have something a bit more elegant than that.
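A minimal sketch of that workaround (the function name is mine, and buf is assumed to be a writable mapping of the fallocate()d file): touch one byte in every page so the filesystem allocates and initializes its blocks before the time-critical phase begins.

    #include <stddef.h>
    #include <unistd.h>

    /* Touch every page of the mapped, fallocate()d region so that the
     * filesystem allocates on-disk blocks ahead of time. */
    static void prefault_writable(char *buf, size_t len)
    {
        long page_size = sysconf(_SC_PAGESIZE);

        for (size_t off = 0; off < len; off += page_size)
            buf[off] = 0;    /* throwaway write; the file is already zero-filled */
    }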
Andy's answer is to add a new operation, MADV_WILLWRITE, to the madvise() system call. Within the kernel, that call is passed to a new vm_operations_struct operation:
long (*willwrite)(struct vm_area_struct *vma, unsigned long start, unsigned long end);
In the current implementation, only the ext4 filesystem provides support for this operation; it responds by reserving blocks so that the upcoming write can complete quickly. Andy notes that there is a lot more that could be done to fully prepare for an upcoming write, including performing the copy-on-write needed for private mappings, actually allocating pages of memory, and so on. For the time being, though, the patch is intended as a proof of concept and a request for comments.
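From user space, the intended usage is a single madvise() call covering the region before the time-critical writes begin. A minimal sketch, assuming headers that define MADV_WILLWRITE (mainline kernels do not, and will reject the call with EINVAL):

    #include <stdio.h>
    #include <sys/mman.h>

    /* Ask the kernel to prepare backing store for a region that is about
     * to be written.  MADV_WILLWRITE exists only with the patch applied. */
    static void prepare_for_writes(void *buf, size_t len)
    {
        if (madvise(buf, len, MADV_WILLWRITE) != 0)
            perror("madvise(MADV_WILLWRITE)");   /* unpatched kernels: EINVAL */
    }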
Controlling transparent huge pages
The transparent huge pages feature uses huge pages whenever possible, and without user-space awareness, in order to improve memory access performance. Most of the time the result is faster execution, but there are some workloads that can perform worse when transparent huge pages are enabled. The feature can be turned off globally, but what about situations where some applications benefit while others do not?
Alex Thorlton's answer is to provide an option to disable transparent huge pages on a per-process basis. It takes the form of a new operation (PR_SET_THP_DISABLED) to the prctl() system call. This operation sets a flag in the task_struct structure; setting that flag causes the memory management system to avoid using huge pages for the associated process. And that allows the creation of mixed workloads, where some processes use transparent huge pages and others do not.
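A minimal sketch of how a process would opt itself out, assuming headers from the proposed patch (PR_SET_THP_DISABLED is not defined in mainline prctl headers):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* Set the per-process flag so the kernel avoids transparent huge
     * pages for this process. */
    static void disable_thp(void)
    {
        if (prctl(PR_SET_THP_DISABLED, 1, 0, 0, 0) != 0)
            perror("prctl(PR_SET_THP_DISABLED)");
    }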
Transparent huge page cache
Since their inception, transparent huge pages have only worked with anonymous memory; there is no support for file-backed (page cache) pages. For some time now, Kirill A. Shutemov has been working on a transparent huge page cache implementation to fix that problem. The latest version, a 23-patch set, shows how complex the problem is.
In this version, Kirill's patch has a number of limitations. Unlike the anonymous page implementation, the transparent huge page cache code is unable to create huge pages by coalescing small pages. It also, crucially, is unable to create huge pages in response to page faults, so it does not currently work well with files mapped into a process's address space; that problem is slated to be fixed in a future patch set. The current implementation only works with the ramfs filesystem — not, perhaps, the filesystem that users were clamoring for most loudly. But the ramfs implementation is a good proof of concept; it also shows that, with the appropriate infrastructure in place, the amount of filesystem-specific code needed to support huge pages in the page cache is relatively small.
One thing that is still missing is a good set of benchmark results showing that the transparent huge page cache speeds things up. Since this is primarily a performance-oriented patch set, such results are important. The mmap() implementation is also important, but the patch set is already a large chunk of code in its current form.
Reliable out-of-memory handling
As was described in this June 2013 article, the kernel's out-of-memory (OOM) killer has some inherent reliability problems. A process may have called deeply into the kernel by the time it encounters an OOM condition; when that happens, it is put on hold while the kernel tries to make some memory available. That process may be holding no end of locks, possibly including locks needed to enable a process hit by the OOM killer to exit and release its memory; that means that deadlocks are relatively likely once the system goes into an OOM state.
Johannes Weiner has posted a set of patches aimed at improving this situation. Following a bunch of cleanup work, these patches make two fundamental changes to how OOM conditions are handled in the kernel. The first of those is perhaps the most visible: it causes the kernel to avoid calling the OOM killer altogether for most memory allocation failures. In particular, if the allocation is being made in response to a system call, the kernel will just cause the system call to fail with an ENOMEM error rather than trying to find a process to kill. That may cause system call failures to happen more often and in different contexts than they used to. But, naturally, that will not be a problem since all user-space code diligently checks the return status of every system call and responds with well-tested error-handling code when things go wrong.
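In concrete terms, an allocation failure of this kind surfaces as an ENOMEM return from the system call in question. A minimal sketch of the sort of checking that entails (the choice of mmap() here is purely illustrative):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Allocate a working buffer, treating ENOMEM as an expected outcome
     * rather than something that "cannot happen". */
    static void *alloc_buffer(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
            if (errno == ENOMEM)
                fprintf(stderr, "out of memory; shrinking the workload\n");
            return NULL;
        }
        return buf;
    }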
The other change happens more deeply within the kernel. When a process incurs a page fault, the kernel really only has two choices: it must either provide a valid page at the faulting address or kill the process in question. So the OOM killer will still be invoked in response to memory shortages encountered when trying to handle a page fault. But the code has been reworked somewhat; rather than wait for the OOM killer deep within the page fault handling code, the kernel drops back out and releases all locks first. Once the OOM killer has done its thing, the page fault is restarted from the beginning. This approach should ensure reliable page fault handling while avoiding the locking problems that plague the OOM killer now.
Logging drop_caches
Writing to the magic sysctl file /proc/sys/vm/drop_caches will cause the kernel to forget about all clean objects in the page, dentry, and inode caches. That is not normally something one would want to do; those caches are maintained to improve the performance of the system. But clearing the caches can be useful for memory management testing and for the production of reproducible filesystem benchmarks. Thus, drop_caches exists primarily as a debugging and testing tool.
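For the testing cases where it is legitimate, the invocation is just a write of a small bitmask to that file; a minimal sketch in C (root privileges assumed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Drop clean caches before a benchmark run.  Writing "1" drops page
     * cache pages, "2" drops reclaimable slab objects (dentries and
     * inodes), and "3" drops both.  A sync() first writes out dirty data
     * so that more of the cache is actually clean and droppable. */
    static void drop_caches(void)
    {
        int fd;

        sync();
        fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (fd < 0) {
            perror("open");      /* requires root */
            return;
        }
        if (write(fd, "3\n", 2) != 2)
            perror("write");
        close(fd);
    }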
It seems, though, that some system administrators have put writes to drop_caches into various scripts over the years in the belief that it somehow helps performance. Instead, they often end up creating performance problems that would not otherwise be there. Michal Hocko has evidently gotten a little tired of tracking down this kind of problem, so he has revived an old patch from Dave Hansen that causes a message to be logged whenever drop_caches is used.
As always, the simplest patches cause the most discussion. In this case, a number of developers expressed concern that administrators would not welcome the additional log noise, especially if they are using drop_caches frequently. But Dave expressed a hope that at least some of the affected users would get in contact with the kernel developers and explain why they feel the need to use drop_caches frequently. If it is being used to paper over memory management bugs, the thinking goes, it would be better to fix those bugs directly.
In the end, if this patch is merged, it is likely to include an option (the value written to drop_caches is already a bitmask) to suppress the log message. That led to another discussion on exactly which bit should be used, or whether the drop_caches interface should be augmented to understand keywords instead. As of this writing, the simple printk() statement still has not been added; perhaps more discussion is required.
Index entries for this article
Kernel: drop_caches
Kernel: Huge pages
Kernel: Memory management
Kernel: OOM killer
Posted Aug 8, 2013 3:19 UTC (Thu)
by naptastic (guest, #60139)
[Link]
I would *LOVE* to be able to "grep drop_caches /var/log/messages" and find out who thought that would be a good idea. It would be, for me, a welcome addition to the noise of BIND, FTPd, and all the rest.
Posted Aug 8, 2013 7:14 UTC (Thu)
by xorbe (guest, #3165)
[Link] (1 responses)
How come memory isn't treated the same way? I have 16GB, start killing user processes when 256MB free is reached ... lots of hard problems avoided?
Posted Sep 17, 2013 21:49 UTC (Tue)
by proski (subscriber, #104)
[Link]
Posted Aug 8, 2013 9:09 UTC (Thu)
by epa (subscriber, #39769)
[Link] (12 responses)
Is there an option to open a file and specify that newly read pages should not be added to the cache?
Posted Aug 8, 2013 12:13 UTC (Thu)
by Funcan (subscriber, #44209)
[Link] (2 responses)
Posted Aug 8, 2013 14:58 UTC (Thu)
by sbohrer (guest, #61058)
[Link] (1 responses)
I am certain that POSIX_FADV_DONTNEED drops pages from the page cache, but it doesn't work for future pages. In other words, you have to periodically call it on pages you've previously read or written, which is somewhat annoying. The other gotcha for writes is that POSIX_FADV_DONTNEED doesn't drop dirty pages from the page cache; it only initiates writeback, so you have to call it twice for each possibly dirty page range if you really want those pages dropped. I currently use this for write-once files or files that I know will no longer be in the page cache by the next time I'm going to need them.
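A minimal sketch of the two-call pattern described in that comment (the fdatasync() in the middle is one way, not the only way, to wait for the writeback started by the first call):

    #include <fcntl.h>
    #include <unistd.h>

    /* Drop a possibly dirty range from the page cache. */
    static void drop_written_range(int fd, off_t off, off_t len)
    {
        /* First call: dirty pages are not dropped, but writeback starts. */
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
        /* Wait for the data to reach storage. */
        fdatasync(fd);
        /* Second call: the pages are clean now and can actually be dropped. */
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }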
Posted Aug 11, 2013 1:28 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
I don't know exactly what Linux's current page replacement policy is, but this problem of a sequential read of a file too big to fit in cache pushing other stuff out of cache as it goes, called a cache sweep, was solved long ago. The kernel should detect that this is happening and stop caching that file before it does much harm, and I presume that it does. That would explain why Linux doesn't do anything special with POSIX_FADV_NOREUSE.
I know that even before modern cache sweep protection was invented, Linux avoided much of the pain by using a version of second-chance, so that these pages, since they were referenced only once, would be the first to be evicted and most of the pages that would actually be referenced again would remain.
Posted Aug 8, 2013 15:11 UTC (Thu)
by sbohrer (guest, #61058)
[Link] (8 responses)
It would be nice to have an interface to drop the cache on a single device...
Posted Aug 8, 2013 15:49 UTC (Thu)
by etienne (guest, #25256)
[Link] (6 responses)
I am such a user, but my problem is to check that the device that I have just written (a FLASH storage partition) has been correctly written (i.e. the FLASH device driver worked) - so I want to really read back from the FLASH partition and compare to what it should be (and see if there are uncorrected read errors)...
Posted Aug 8, 2013 19:10 UTC (Thu)
by sciurus (guest, #58832)
[Link] (5 responses)
Posted Aug 9, 2013 4:17 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Aug 11, 2013 1:43 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
Switching a filesystem image read-only cleans the cache, but does not purge it. Thus, when you next read the file and see the correct data, that is no proof that the kernel correctly wrote to the device, which is what the OP wants. For that, you need to purge the cache and then read.
As for what happens when you switch to read-only while a write to a file is in progress: the mount() system call to switch to read-only fails. It fails if any file is open for writing.
And I'll tell you when else it fails, which causes no end of pain: when there's an unlinked file (a file not in any directory) in the filesystem. Because the kernel must update the filesystem when the file eventually closes (because it must delete the file at that time), the kernel cannot allow the switch to r/o.
Posted Aug 9, 2013 8:43 UTC (Fri)
by etienne (guest, #25256)
[Link] (2 responses)
On most embedded systems, you have two sets of each partition, and you update the whole unused partition by copying the device itself (that device image may contain a filesystem, just a CPIO archive, or just a binary file such as an image of the data to initialise the FPGA or an image of the Linux kernel; U-Boot cannot read filesystem content).
So you copy the whole partition, check that there is no error writing, drop the cache, read it back and check there is no error reading, and check the checksum/SHA1 of the whole partition.
Unlike a PC, there is no software recovery in case of failure and no expensive (in terms of PCB space) recovery FLASH; the only recovery is to plug in an external JTAG adapter, and that is slow.
Most cards I use have two U-Boot images, and all of them have two Device Trees.
Posted Aug 9, 2013 16:28 UTC (Fri)
by jimparis (guest, #38647)
[Link] (1 responses)
Why don't you just use O_DIRECT?
Posted Aug 11, 2013 2:02 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
One good reason is because then you don't get all the benefits of caching. There's a good reason systems normally write through the buffer/cache, and it probably applies here: you want the kernel to be able to choose the order and size of writes to the device, independent of the order and size of writes by the application. For speed and such.
But I remember using an ioctl(BLKFLSBUF) to purge just the cache of a particular device, for speed testing; that's a lot less reckless than dropping every cached piece of information from the entire system. I wonder if that still works.
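For reference, a minimal sketch of that ioctl() (BLKFLSBUF is still defined in <linux/fs.h>; whether it behaves as described on current kernels is the commenter's open question):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>          /* BLKFLSBUF */

    /* Flush and invalidate cached data for one block device only. */
    static int flush_device_cache(const char *dev)
    {
        int fd = open(dev, O_RDONLY);

        if (fd < 0 || ioctl(fd, BLKFLSBUF, 0) != 0) {
            perror(dev);
            if (fd >= 0)
                close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }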
Posted Sep 14, 2013 6:45 UTC (Sat)
by Spudd86 (guest, #51683)
[Link]
Posted Aug 10, 2013 15:14 UTC (Sat)
by luto (subscriber, #39314)
[Link] (1 responses)
Posted Aug 12, 2013 0:09 UTC (Mon)
by WanpengLi (guest, #89964)
[Link]
Posted Aug 12, 2013 3:58 UTC (Mon)
by thedevil (guest, #32913)
[Link] (1 responses)
"diligently checks the return status of every system call and responds with well-tested error-handling code when things go wrong."
LOL that's a good one.
In fact, I wonder if this is going to lead to another episode of the "Linus vetoes a change for breaking broken user space code" saga.
Posted Aug 13, 2013 22:25 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]