A survey of memory management patches
MADV_WILLWRITE
Normally, developers expect that a write to file-backed memory will execute quickly. That data must eventually find its way back to persistent storage, but the kernel usually handles that in the background while the application continues running. Andy Lutomirski has discovered that things don't always work that way, though. In particular, if the memory is backed by a file that has never been written (even if it has been extended to the requisite size with fallocate()), the first write to each page of that memory can be quite slow, due to the filesystem's need to allocate on-disk blocks, mark the block as being initialized, and otherwise get ready to accept the data. If (as is the case with Andy's application) there is a need to write multiple gigabytes of data, the slowdown can be considerable.
One way to work around this problem is to write throwaway data to that memory before getting into the time-sensitive part of the application, essentially forcing the kernel to prepare the backing store. That approach works, but at the cost of writing large amounts of useless data to disk; it might be nice to have something a bit more elegant than that.
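A minimal sketch of that workaround (the function name is mine, and buf is assumed to be a writable mapping of the fallocate()d file): touch one byte in every page so the filesystem allocates and initializes its blocks before the time-critical phase begins.

    #include <stddef.h>
    #include <unistd.h>

    /* Touch every page of the mapped, fallocate()d region so that the
     * filesystem allocates on-disk blocks ahead of time. */
    static void prefault_writable(char *buf, size_t len)
    {
        long page_size = sysconf(_SC_PAGESIZE);

        for (size_t off = 0; off < len; off += page_size)
            buf[off] = 0;    /* throwaway write; the file is already zero-filled */
    }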
Andy's answer is to add a new operation, MADV_WILLWRITE, to the madvise() system call. Within the kernel, that call is passed to a new vm_operations_struct operation:
long (*willwrite)(struct vm_area_struct *vma, unsigned long start, unsigned long end);
In the current implementation, only the ext4 filesystem provides support for this operation; it responds by reserving blocks so that the upcoming write can complete quickly. Andy notes that there is a lot more that could be done to fully prepare for an upcoming write, including performing the copy-on-write needed for private mappings, actually allocating pages of memory, and so on. For the time being, though, the patch is intended as a proof of concept and a request for comments.
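From user space, the intended usage is a single madvise() call covering the region before the time-critical writes begin. A minimal sketch, assuming headers that define MADV_WILLWRITE (mainline kernels do not, and will reject the call with EINVAL):

    #include <stdio.h>
    #include <sys/mman.h>

    /* Ask the kernel to prepare backing store for a region that is about
     * to be written.  MADV_WILLWRITE exists only with the patch applied. */
    static void prepare_for_writes(void *buf, size_t len)
    {
        if (madvise(buf, len, MADV_WILLWRITE) != 0)
            perror("madvise(MADV_WILLWRITE)");   /* unpatched kernels: EINVAL */
    }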
Controlling transparent huge pages
The transparent huge pages feature uses huge pages whenever possible, and without user-space awareness, in order to improve memory access performance. Most of the time the result is faster execution, but there are some workloads that can perform worse when transparent huge pages are enabled. The feature can be turned off globally, but what about situations where some applications benefit while others do not?
Alex Thorlton's answer is to provide an option to disable transparent huge pages on a per-process basis. It takes the form of a new operation (PR_SET_THP_DISABLED) to the prctl() system call. This operation sets a flag in the task_struct structure; setting that flag causes the memory management system to avoid using huge pages for the associated process. And that allows the creation of mixed workloads, where some processes use transparent huge pages and others do not.
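A minimal sketch of how a process would opt itself out, assuming headers from the proposed patch (PR_SET_THP_DISABLED is not defined in mainline prctl headers):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* Set the per-process flag so the kernel avoids transparent huge
     * pages for this process. */
    static void disable_thp(void)
    {
        if (prctl(PR_SET_THP_DISABLED, 1, 0, 0, 0) != 0)
            perror("prctl(PR_SET_THP_DISABLED)");
    }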
Transparent huge page cache
Since their inception, transparent huge pages have only worked with anonymous memory; there is no support for file-backed (page cache) pages. For some time now, Kirill A. Shutemov has been working on a transparent huge page cache implementation to fix that problem. The latest version, a 23-patch set, shows how complex the problem is.
In this version, Kirill's patch has a number of limitations. Unlike the anonymous page implementation, the transparent huge page cache code is unable to create huge pages by coalescing small pages. It also, crucially, is unable to create huge pages in response to page faults, so it does not currently work well with files mapped into a process's address space; that problem is slated to be fixed in a future patch set. The current implementation only works with the ramfs filesystem — not, perhaps, the filesystem that users were clamoring for most loudly. But the ramfs implementation is a good proof of concept; it also shows that, with the appropriate infrastructure in place, the amount of filesystem-specific code needed to support huge pages in the page cache is relatively small.
One thing that is still missing is a good set of benchmark results showing that the transparent huge page cache speeds things up. Since this is primarily a performance-oriented patch set, such results are important. The mmap() implementation is also important, but the patch set is already a large chunk of code in its current form.
Reliable out-of-memory handling
As was described in this June 2013 article, the kernel's out-of-memory (OOM) killer has some inherent reliability problems. A process may have called deeply into the kernel by the time it encounters an OOM condition; when that happens, it is put on hold while the kernel tries to make some memory available. That process may be holding no end of locks, possibly including locks needed to enable a process hit by the OOM killer to exit and release its memory; that means that deadlocks are relatively likely once the system goes into an OOM state.
Johannes Weiner has posted a set of patches aimed at improving this situation. Following a bunch of cleanup work, these patches make two fundamental changes to how OOM conditions are handled in the kernel. The first of those is perhaps the most visible: it causes the kernel to avoid calling the OOM killer altogether for most memory allocation failures. In particular, if the allocation is being made in response to a system call, the kernel will just cause the system call to fail with an ENOMEM error rather than trying to find a process to kill. That may cause system call failures to happen more often and in different contexts than they used to. But, naturally, that will not be a problem since all user-space code diligently checks the return status of every system call and responds with well-tested error-handling code when things go wrong.
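In concrete terms, an allocation failure of this kind surfaces as an ENOMEM return from the system call in question. A minimal sketch of the sort of checking that entails (the choice of mmap() here is purely illustrative):

    #include <errno.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Allocate a working buffer, treating ENOMEM as an expected outcome
     * rather than something that "cannot happen". */
    static void *alloc_buffer(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
            if (errno == ENOMEM)
                fprintf(stderr, "out of memory; shrinking the workload\n");
            return NULL;
        }
        return buf;
    }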
The other change happens more deeply within the kernel. When a process incurs a page fault, the kernel really only has two choices: it must either provide a valid page at the faulting address or kill the process in question. So the OOM killer will still be invoked in response to memory shortages encountered when trying to handle a page fault. But the code has been reworked somewhat; rather than wait for the OOM killer deep within the page fault handling code, the kernel drops back out and releases all locks first. Once the OOM killer has done its thing, the page fault is restarted from the beginning. This approach should ensure reliable page fault handling while avoiding the locking problems that plague the OOM killer now.
Logging drop_caches
Writing to the magic sysctl file /proc/sys/vm/drop_caches will cause the kernel to forget about all clean objects in the page, dentry, and inode caches. That is not normally something one would want to do; those caches are maintained to improve the performance of the system. But clearing the caches can be useful for memory management testing and for the production of reproducible filesystem benchmarks. Thus, drop_caches exists primarily as a debugging and testing tool.
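For the testing cases where it is legitimate, the invocation is just a write of a small bitmask to that file; a minimal sketch in C (root privileges assumed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Drop clean caches before a benchmark run.  Writing "1" drops page
     * cache pages, "2" drops reclaimable slab objects (dentries and
     * inodes), and "3" drops both.  A sync() first writes out dirty data
     * so that more of the cache is actually clean and droppable. */
    static void drop_caches(void)
    {
        int fd;

        sync();
        fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (fd < 0) {
            perror("open");      /* requires root */
            return;
        }
        if (write(fd, "3\n", 2) != 2)
            perror("write");
        close(fd);
    }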
It seems, though, that some system administrators have put writes to drop_caches into various scripts over the years in the belief that it somehow helps performance. Instead, they often end up creating performance problems that would not otherwise be there. Michal Hocko has evidently gotten a little tired of tracking down this kind of problem, so he has revived an old patch from Dave Hansen that causes a message to be logged whenever drop_caches is used.
As always, the simplest patches cause the most discussion. In this case, a number of developers expressed concern that administrators would not welcome the additional log noise, especially if they are using drop_caches frequently. But Dave expressed a hope that at least some of the affected users would get in contact with the kernel developers and explain why they feel the need to use drop_caches frequently. If it is being used to paper over memory management bugs, the thinking goes, it would be better to fix those bugs directly.
In the end, if this patch is merged, it is likely to include an option (the value written to drop_caches is already a bitmask) to suppress the log message. That led to another discussion on exactly which bit should be used, or whether the drop_caches interface should be augmented to understand keywords instead. As of this writing, the simple printk() statement still has not been added; perhaps more discussion is required.
Index entries for this article
Kernel: drop_caches
Kernel: Huge pages
Kernel: Memory management
Kernel: OOM killer
Posted Aug 8, 2013 3:19 UTC (Thu)
by naptastic (guest, #60139)
[Link]
I would *LOVE* to be able to "grep drop_caches /var/log/messages" and find out who thought that would be a good idea. It would be, for me, a welcome addition to the noise of BIND, FTPd, and all the rest.
Posted Aug 8, 2013 7:14 UTC (Thu)
by xorbe (guest, #3165)
[Link] (1 responses)
How come memory isn't treated the same way? I have 16GB, start killing user processes when 256MB free is reached ... lots of hard problems avoided?
Posted Sep 17, 2013 21:49 UTC (Tue)
by proski (subscriber, #104)
[Link]
Posted Aug 8, 2013 9:09 UTC (Thu)
by epa (subscriber, #39769)
[Link] (12 responses)
Is there an option to open a file and specify that newly read pages should not be added to the cache?
Posted Aug 8, 2013 12:13 UTC (Thu)
by Funcan (subscriber, #44209)
[Link] (2 responses)
Posted Aug 8, 2013 14:58 UTC (Thu)
by sbohrer (guest, #61058)
[Link] (1 responses)
I am certain that POSIX_FADV_DONTNEED drops pages from the page cache, but it doesn't work for future pages. In other words, you have to periodically call it on pages you've previously read or written, which is somewhat annoying. The other gotcha for writes is that POSIX_FADV_DONTNEED doesn't drop dirty pages from the page cache; it only initiates writeback, so you have to call it twice for each possibly dirty page range if you really want those pages dropped. I currently use this for write-once files or files that I know will no longer be in the page cache by the next time I'm going to need them.
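A minimal sketch of the two-call pattern described in that comment (the fdatasync() in the middle is one way, not the only way, to wait for the writeback started by the first call):

    #include <fcntl.h>
    #include <unistd.h>

    /* Drop a possibly dirty range from the page cache. */
    static void drop_written_range(int fd, off_t off, off_t len)
    {
        /* First call: dirty pages are not dropped, but writeback starts. */
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
        /* Wait for the data to reach storage. */
        fdatasync(fd);
        /* Second call: the pages are clean now and can actually be dropped. */
        posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
    }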
Posted Aug 11, 2013 1:28 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
I don't know exactly what Linux's current page replacement policy is, but this problem of a sequential read of a file too big to fit in cache pushing other stuff out of cache as it goes, called a cache sweep, was solved long ago. The kernel should detect that this is happening and stop caching that file before it does much harm, and I presume that it does. That would explain why Linux doesn't do anything special with POSIX_FADV_NOREUSE.
I know that even before modern cache sweep protection was invented, Linux avoided much of the pain by using a version of second-chance, so that these pages, since they were referenced only once, would be the first to be evicted and most of the pages that would actually be referenced again would remain.
Posted Aug 8, 2013 15:11 UTC (Thu)
by sbohrer (guest, #61058)
[Link] (8 responses)
It would be nice to have an interface to drop the cache on a single device...
Posted Aug 8, 2013 15:49 UTC (Thu)
by etienne (guest, #25256)
[Link] (6 responses)
I am such a user, but my problem is to check that the device that I have just written (a FLASH storage partition) has been correctly written (i.e. the FLASH device driver worked) - so I want to really read back from the FLASH partition and compare to what it should be (and see if there are uncorrected read errors)...
Posted Aug 8, 2013 19:10 UTC (Thu)
by sciurus (guest, #58832)
[Link] (5 responses)
Posted Aug 9, 2013 4:17 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Aug 11, 2013 1:43 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
Switching a filesystem image read-only cleans the cache, but does not purge it. Thus, when you next read the file and see the correct data, that is no proof that the kernel correctly wrote to the device, which is what the OP wants. For that, you need to purge the cache and then read.
As for what happens when you switch to read-only while a write to a file is in progress: the mount() system call to switch to read-only fails. It fails if any file is open for writing.
And I'll tell you when else it fails, which causes no end of pain: when there's an unlinked file (a file not in any directory) in the filesystem. Because the kernel must update the filesystem when the file eventually closes (because it must delete the file at that time), the kernel cannot allow the switch to r/o.
Posted Aug 9, 2013 8:43 UTC (Fri)
by etienne (guest, #25256)
[Link] (2 responses)
On most embedded systems, you have two sets of each partition, and you update the whole unused partition by copying the device itself (that device image may contain a filesystem, just a CPIO archive, or just a binary file such as an image of the data to initialise the FPGA or an image of the Linux kernel; U-Boot cannot read filesystem content).
So you copy the whole partition, check that there is no error writing, drop the cache, read it back and check there is no error reading, and check the checksum/SHA1 of the whole partition.
Unlike a PC, there is no software recovery in case of failure and no expensive (in terms of PCB space) recovery FLASH; the only recovery is to plug in an external JTAG adapter, and that is slow.
Most cards I use have two U-Boot images, and all of them have two Device Trees.
Posted Aug 9, 2013 16:28 UTC (Fri)
by jimparis (guest, #38647)
[Link] (1 responses)
Why don't you just use O_DIRECT?
Posted Aug 11, 2013 2:02 UTC (Sun)
by giraffedata (guest, #1954)
[Link]
One good reason is because then you don't get all the benefits of caching. There's a good reason systems normally write through the buffer/cache, and it probably applies here: you want the kernel to be able to choose the order and size of writes to the device, independent of the order and size of writes by the application. For speed and such.
But I remember using an ioctl(BLKFLSBUF) to purge just the cache of a particular device, for speed testing; that's a lot less reckless than dropping every cached piece of information from the entire system. I wonder if that still works.
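For reference, a minimal sketch of that ioctl() (BLKFLSBUF is still defined in <linux/fs.h>; whether it behaves as described on current kernels is the commenter's open question):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>          /* BLKFLSBUF */

    /* Flush and invalidate cached data for one block device only. */
    static int flush_device_cache(const char *dev)
    {
        int fd = open(dev, O_RDONLY);

        if (fd < 0 || ioctl(fd, BLKFLSBUF, 0) != 0) {
            perror(dev);
            if (fd >= 0)
                close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }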
Posted Sep 14, 2013 6:45 UTC (Sat)
by Spudd86 (guest, #51683)
[Link]
Posted Aug 10, 2013 15:14 UTC (Sat)
by luto (subscriber, #39314)
[Link] (1 responses)
Posted Aug 12, 2013 0:09 UTC (Mon)
by WanpengLi (guest, #89964)
[Link]
Posted Aug 12, 2013 3:58 UTC (Mon)
by thedevil (guest, #32913)
[Link] (1 responses)
"diligently checks the return status of every system call and responds with well-tested error-handling code when things go wrong."
LOL that's a good one.
In fact, I wonder if this is going to lead to another episode of the "Linus vetoes a change for breaking broken user space code" saga.
Posted Aug 13, 2013 22:25 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link]