Improving page reclaim
The fundamental question, he said, was how to integrate these technologies into the Linux kernel. We have subsystems like DAX that can provide high-speed access to persistent-memory devices, but they require applications to be changed. If current kernels are run on such devices without using those special interfaces, swapping is no faster than it is with older, slower devices; there is just too much overhead in the memory-management layer and, in particular, in the manipulation of the least-recently-used (LRU) lists that track reclaimable pages in the system. The LRU, he said, is a fancy mechanism for finding the best eviction candidate at any given time, but, in this situation, might something else work better?
Christoph Lameter suggested that users who care about performance should just put their entire application into memory and be done with it. But Dave was not so easily deterred; he would like to find ways for existing applications to get better performance on persistent-memory devices without changes.
Andrea Arcangeli said that we should not be worrying about memory in 4KB units when we are dealing with devices that can hold 100GB or more. Swapping pages in 2MB units would, he said, go a long way toward solving the problem. Andi Kleen agreed up to a point, but he felt that 2MB was still far too small; in general, he said, we need to move toward managing memory in larger chunks or just do away with the LRU lists altogether.
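Andrea's point about granularity is easy to make concrete with a little arithmetic. The short user-space sketch below is purely illustrative (it was not presented at the session); it simply compares how many entries the LRU lists would need in order to track a 100GB device at 4KB and at 2MB granularity.

    /*
     * Illustrative only: the number of page-sized units the LRU lists
     * would have to track for a 100GB device at 4KB versus 2MB
     * granularity.
     */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long long device_bytes = 100ULL << 30;  /* 100GB */
        const unsigned long long small_page = 4ULL << 10;      /* 4KB */
        const unsigned long long huge_page = 2ULL << 20;       /* 2MB */

        printf("4KB pages to track: %llu\n", device_bytes / small_page);
        printf("2MB pages to track: %llu\n", device_bytes / huge_page);
        return 0;
    }

The difference (over 26 million list entries versus about 51,000) is the scale argument behind managing memory in larger chunks.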
Dave suggested that there are a number of opportunities to run the LRU lists in a more relaxed mode. One idea, he said, was to add a third LRU level for pages that are ready to be swapped out. (The kernel currently manages two levels of LRU lists, one for active pages and one for pages that seem to be inactive and should be considered for eviction.) Perhaps some sort of "scanaround" algorithm could be applied to that third level to batch up pages for writing out to the swap device. Johannes Weiner answered that he had tried something similar a few years ago; it didn't work well, he said, due to disk-seek issues, but it might work better on truly random-access devices.
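To make the structure of that idea a bit more concrete, here is a minimal user-space sketch of a three-level LRU with batched writeout. It is not Dave's proposal and it is not kernel code: the page and list structures, the lru_demote() and evict_batch() helpers, and the batch size are all invented for illustration, and the "writeout" is just a printf().

    /*
     * A toy model of a three-level LRU: pages age from an "active" list
     * to an "inactive" list, then to a "ready-to-evict" list whose
     * entries are written out in batches rather than one at a time.
     */
    #include <stdio.h>
    #include <stdlib.h>

    struct page {
        int id;
        struct page *next;
    };

    struct lru_list {
        struct page *head;
        int count;
    };

    static struct lru_list active, inactive, ready_to_evict;

    #define EVICT_BATCH 4   /* pages "written" to the swap device at once */

    static void lru_add(struct lru_list *lru, struct page *page)
    {
        page->next = lru->head;
        lru->head = page;
        lru->count++;
    }

    static struct page *lru_take(struct lru_list *lru)
    {
        struct page *page = lru->head;

        if (page) {
            lru->head = page->next;
            lru->count--;
        }
        return page;
    }

    /* Move one page down to the next, colder list. */
    static void lru_demote(struct lru_list *from, struct lru_list *to)
    {
        struct page *page = lru_take(from);

        if (page)
            lru_add(to, page);
    }

    /* Once enough pages have accumulated, write them out as one batch. */
    static void evict_batch(void)
    {
        if (ready_to_evict.count < EVICT_BATCH)
            return;

        printf("writing batch:");
        for (int i = 0; i < EVICT_BATCH; i++) {
            struct page *page = lru_take(&ready_to_evict);

            printf(" page %d", page->id);
            free(page);
        }
        printf("\n");
    }

    int main(void)
    {
        /* Populate the active list, then age everything downward. */
        for (int i = 0; i < 8; i++) {
            struct page *page = malloc(sizeof(*page));

            page->id = i;
            lru_add(&active, page);
        }
        while (active.count > 0)
            lru_demote(&active, &inactive);
        while (inactive.count > 0) {
            lru_demote(&inactive, &ready_to_evict);
            evict_batch();
        }
        return 0;
    }

The point of the structure is that reclaim decisions and the actual writeout are decoupled: pages trickle into the third list one at a time, but I/O is issued in device-friendly batches, which is the sort of behavior being described.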
Hugh Dickins expressed skepticism toward the entire idea, though. To him, it looks like an attempt to reduce memory-management overhead by adding even more complex algorithms to cluster things, which increases the complexity of the system rather than reducing it. Batching may help to speed things up, but each page still has to be dealt with individually in order to assemble the batches.
As things wound down, Dave said that he was going away with a couple of
interesting ideas to explore.
| Index entries for this article | |
| --- | --- |
| Kernel | Memory management/Nonvolatile memory |
| Conference | Storage, Filesystem, and Memory-Management Summit/2015 |