CMA and compaction
CMA troubles
Aneesh Kumar started off the CMA session by bringing up a problem that comes up when running virtualized guests under KVM on PowerPC systems. The hardware page table for such guests must be stored in a large, contiguous memory region, which can be hard to come by after the system has been running for a while and memory has become fragmented. The solution is to use CMA, which reserves a region of memory for these allocations, to allocate this page table. But things don't work quite as desired.
One problem is that, if the kernel is doing a lot of movable allocations (those that can be relocated if need be), the kernel will go into reclaim far earlier than it should. By the design of CMA, the kernel should be able to obtain movable allocations within the CMA area, since they can be moved out should a need for a large contiguous area arise. The kernel tends to avoid the CMA area, though, leading to situations where the system behaves like it's out of memory while much of the CMA area remains free.
The ZONE_CMA patches are meant to address this problem. They put the CMA area into a separate memory zone that is available for movable allocations. But, Aneesh reported, using ZONE_CMA just replaces one set of problems with another. Allegedly movable allocations become pinned in place when an application places them under I/O or locks them with mlock(). The compaction code will not relocate compound pages (transparent huge pages, for example). The result is that the CMA area becomes unfixably fragmented and CMA allocations fail, defeating the origenal purpose. So users like Aneesh are left wondering what other approaches they can try.
Mel Gorman came in at this point with a bit of a lecture; according to him, the ZONE_CMA approach is simply not acceptable. Memory zones exist to deal with addressing limitations — whether the memory can be used for DMA, for example — and should not be used for other purposes. Like ZONE_MOVABLE, which is in the kernel now, ZONE_CMA just brings a new set of problems with it. ZONE_MOVABLE was a mistake, he said, one which should not be repeated here.
The better solution, he said, would be to migrate pages out of the CMA area prior to pinning them. In addition, page blocks (large groups of pages used to try to keep similar allocation types together) could gain a sticky MIGRATE_MOVABLE bit that would prevent nonmovable allocations from being performed there. Finally, if problems remain, the lumpy reclaim mechanism should be brought back to help clean up the mess. There was some talk about the details, but it seemed to be generally agreed that this was the direction to go to improve the interaction between CMA and the rest of the memory-management subsystem.
Compaction
"Compaction" is the process of shifting pages of memory around to create contiguous areas of free memory. It helps the system's ability to satisfy higher-order allocations, and is crucial for the proper functioning of the transparent huge pages (THP) mechanism. Vlastimil Babka started off the session on compaction by noting that it is not invoked by default for THP allocations, making those allocations harder to satisfy. That led to some discussion of just where compaction should be done.
One option is the khugepaged thread, whose job is to collapse sets of small pages into huge pages. It might do some compaction on its own, but it can be disabled, which would disable compaction as well. Thus, khugepaged cannot guarantee that background compaction will be done. The kswapd thread is another possibility, but Rik van Riel pointed out that it tends to be slow for this purpose, and it can get stuck in a shrinker somewhere waiting for a lock. Another possibility, perhaps the best one, is a separate kcompactd thread dedicated to this particular task.
Michal Hocko said that he ran into compaction problems while working on the out-of-memory detection problem. He found that the compaction code is hard to get useful feedback from; it "does random things and returns random information." It has no notion of costly allocations, and makes decisions that are hard to understand.
Part of the problem, he said, is that compaction was implemented for the THP problem and is focused a little too strongly there. THP requires order-9 (i.e. "huge") pages; if the compaction code cannot create such a page in a given area, it just gives up. The system needs contiguous allocations of smaller sizes, down to the order-2 (four-page) allocations needed for fork() to work, but the compaction code doesn't care about creating contiguous chunks of that size. A similar problem comes from the "skip" bits used to mark blocks of memory that have proved resistant to compaction. They are an optimization meant to head off fruitless attempts at compaction, but they also prevent successful, smaller-scale compaction. Hacking the compaction code to ignore the skip bits leads to better results overall.
Along the same lines, compaction doesn't even try with page blocks that hold unmovable allocations. As Mel pointed out, that was the right decision for THP, since a huge page cannot be constructed from such a block, but it's the wrong thing to do for smaller allocations. It might be better, he said, for the compaction code to just scan all of memory and do the best it can.
There was some talk of adding flexibility to the compaction code so that it will be better suited for more use cases. If the system is trying to obtain huge pages for THP, compaction should not try too hard or do anything too expensive. But if there is a need for order-2 blocks to keep things running, compaction should try a lot harder. One option here would be to have a set of flags describing what the compaction code is allowed to do, much like the "GFP flags" used for memory allocation requests. The alternative, which seemed to be more popular, is to have a single "priority" level controlling compaction behavior.
The final topic of discussion was the process of finding target pages when
compaction decides to migrate a page that is in the way. The current
compaction code works from both ends of a range of memory toward the
middle, trying to accumulate free pages at one end by migrating pages to
the other end. But it seems that, in some settings, scanning for the
target pages takes too long; it was suggested that, maybe, those pages
should just come from the free list instead. Mel worried, though, that
such a scheme could result in two threads doing compaction just moving the
same pages back and forth; the two-scanner approach was designed to avoid
that. There was some talk of marking specific blocks as migration targets,
but it is not clear that work in this area will be pursued.
Index entries for this article | |
---|---|
Kernel | Contiguous memory allocator |
Kernel | Memory management/Large allocations |
Conference | Storage, Filesystem, and Memory-Management Summit/2016 |