
The next steps for the maple tree

By Jonathan Corbet
May 27, 2024

LSFMM+BPF
The maple tree data structure was added during the 6.1 development cycle; since then, it has taken its place at the core of the kernel's memory-management subsystem. Unsurprisingly, work on maple trees is not yet done. Maple-tree maintainer Liam Howlett ran a session in the memory-management track of the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit to discuss the current state of the maple tree and which features can be expected next.

Howlett has a backlog of requested features that seems likely to keep him busy for some time. Some of them are internal to the data structure itself:

  • There is a desire for a fast way to get a count of the number of null entries within a node.
  • "Dense nodes", which contain more pointers per node.
  • The removal of "big nodes", which are a special structure used when nodes are rebalanced or split. Among other things, removing them will help to improve support for singleton ranges — process IDs, for example.
  • Finally, he plans to implement index compression.

With regard to externally visible features:

  • The ability to search marks and tags is at the top of the to-do list. That would allow searching a tree for entries with, for example, the "dirty" bit set.
  • The ability to prune trees under memory pressure would help the system overall; it could be used with the cache that holds shadow page-table entries for evicted pages.
  • Filesystem users would benefit from 64-bit indices on 32-bit architectures.
  • A contiguous iterator that would iterate over a range only as long as there are no gaps.
  • "Big dense nodes" were described as a large list that could hold up to 4K singleton items.
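The contiguous iterator mentioned in the list can be pictured with a small user-space sketch. This is not kernel code: struct range_entry and contiguous_run() are hypothetical names, and a sorted array stands in for the tree. It only shows the stopping rule, which is to walk entries until the next range fails to begin right after the previous one.

```c
#include <stddef.h>

/* Toy stand-in for a stored range; not the kernel's maple-tree layout. */
struct range_entry {
    unsigned long first;   /* first index covered (inclusive) */
    unsigned long last;    /* last index covered (inclusive) */
};

/*
 * Return how many entries, starting at entries[0], form an unbroken run.
 * The array must be sorted by index and must not contain overlaps.
 */
size_t contiguous_run(const struct range_entry *entries, size_t n)
{
    size_t i;

    if (n == 0)
        return 0;

    for (i = 1; i < n; i++) {
        /* A gap exists when the next range does not start right after this one. */
        if (entries[i].first != entries[i - 1].last + 1)
            break;
    }
    return i;
}
```

With the ranges 0-9, 10-19, and 30-39, the walk stops after the second entry because indices 20-29 are empty.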

Overall, he said, he is trying to get maple trees to the point that they can match the features provided by XArrays. The maple tree should be able to do the same things with better performance; once the features are there, it should be possible to implement the XArray interface and switch users without anybody else having to even be aware of it.

Howlett said that the maple tree is getting more users, and he is seeing some common errors when code is converted over. It is possible to use external locks to serialize access to a maple tree, he said, and some users do it, but it is better avoided if possible. He cautioned that anybody using read-copy-update (RCU) read locks should be aware that the lock protects a maple-tree node from being freed, but not necessarily the data contained within that node.

Users of the generic storage API were encouraged to wrap it with a typed interface so that the compiler can catch mistakes. Developers converting from an XArray often are surprised when mas_next() fails to return the first entry. Its job is to get the next entry; to start at the beginning, mas_find() should be used instead.
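The difference between the two calls can be modeled in user space. The sketch below is a toy under stated assumptions: toy_next() and toy_find() are hypothetical helpers operating on a plain array of slots, not the kernel's mas_next() and mas_find(), and they capture only the one behavior described here, namely whether the entry at the starting position is returned.

```c
#include <stddef.h>

/*
 * Toy "next": always advance past the current position first, so an
 * entry sitting at *pos is never returned.
 */
const char *toy_next(const char *const *slots, size_t n, size_t *pos)
{
    size_t i;

    for (i = *pos + 1; i < n; i++) {
        if (slots[i]) {
            *pos = i;
            return slots[i];
        }
    }
    return NULL;
}

/*
 * Toy "find": begin the search at the current position itself, so an
 * entry in the very first slot is seen.
 */
const char *toy_find(const char *const *slots, size_t n, size_t *pos)
{
    size_t i;

    for (i = *pos; i < n; i++) {
        if (slots[i]) {
            *pos = i;
            return slots[i];
        }
    }
    return NULL;
}
```

Starting both helpers at position zero over slots holding an entry at index 0 and another at index 2, toy_next() skips straight to index 2 while toy_find() returns the entry at index 0.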

In general, he said, he is working toward the addition of a type-safe interface and moving away from void * pointers. Eventually there will be a DEFINE_MAPLE_TREE() macro that creates a tree handling objects of a given type.
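As a rough illustration of the idea, the following user-space sketch wraps a void * store with a macro-generated typed interface. Everything in it is an assumption made for the example: struct toy_store, DEFINE_TYPED_STORE(), and struct region are hypothetical, and the eventual DEFINE_MAPLE_TREE() macro may look quite different.

```c
#include <stddef.h>

#define TOY_SLOTS 16

/* Generic store: it accepts any pointer, so the compiler cannot catch a mix-up. */
struct toy_store {
    void *slot[TOY_SLOTS];
};

static int toy_store_set(struct toy_store *s, size_t index, void *entry)
{
    if (index >= TOY_SLOTS)
        return -1;
    s->slot[index] = entry;
    return 0;
}

static void *toy_store_get(const struct toy_store *s, size_t index)
{
    return index < TOY_SLOTS ? s->slot[index] : NULL;
}

/*
 * Hypothetical typed wrapper: generates a distinct struct plus set/get
 * helpers that only accept and return pointers to the given type.
 */
#define DEFINE_TYPED_STORE(name, type)                                \
    struct name { struct toy_store raw; };                            \
    static inline int name##_set(struct name *t, size_t i, type *e)   \
    { return toy_store_set(&t->raw, i, e); }                          \
    static inline type *name##_get(const struct name *t, size_t i)    \
    { return toy_store_get(&t->raw, i); }

/* Example element type, invented for this sketch. */
struct region {
    unsigned long start, end;
};

DEFINE_TYPED_STORE(region_store, struct region)
```

Passing, say, an int * to region_store_set() now draws an incompatible-pointer diagnostic from the compiler, where the raw toy_store_set() would have accepted it silently.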

As usage has grown, the maple tree structure has encountered a number of challenges. Tracking of virtual memory areas (VMAs) is one of those; he is trying to find ways to remove some of the complexity associated with special VMAs. One example is guard VMAs, which define a short range of no-access address space to catch overruns. If guard VMAs are in use, the total number of VMAs in the tree is doubled, which is expensive, but those guard VMAs are never really used. So Howlett is trying to find a way to mark guard regions directly in the maple tree and avoid allocating so many extra structures.
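For context, user space commonly creates such guard regions by mapping a slightly larger area and then using mprotect() to remove all access from the boundary pages. Since each differently protected range needs its own VMA, one logical mapping ends up as three. The sketch below shows that pattern on Linux; map_with_guards() is a hypothetical helper name, and error handling is kept minimal.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `pages` usable pages with one PROT_NONE guard page on each side.
 * Returns a pointer to the usable area, or NULL on failure.
 */
void *map_with_guards(size_t pages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t page_size, total;
    char *base;

    if (page <= 0 || pages == 0)
        return NULL;
    page_size = (size_t)page;
    total = (pages + 2) * page_size;

    /* One mapping covering the usable pages and both guard pages. */
    base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Removing access from the first and last page creates the guards. */
    if (mprotect(base, page_size, PROT_NONE) != 0 ||
        mprotect(base + total - page_size, page_size, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return base + page_size;
}
```

Any access that runs off either end of the returned area lands in a PROT_NONE page and faults, which is the overrun detection that guard VMAs provide.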

Maple trees should eventually implement upper and lower limits, he said; that would be useful, for example, to implement restrictions on mapping the page at virtual address zero. Currently a maple tree will show gaps in areas that are not actually available for allocation. There are also some challenges in representing the vDSO area.
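To make the idea of limits concrete, here is a user-space sketch of a gap search that honors a lower and an upper bound. It is a toy over a sorted array; struct used_range and find_gap() are hypothetical and do not reflect how the maple tree tracks gaps internally.

```c
#include <stddef.h>

/* An occupied range; the array must be sorted and non-overlapping. */
struct used_range {
    unsigned long first, last;   /* both inclusive */
};

/*
 * Find the lowest free gap of at least `size` indices that lies entirely
 * within [lower, upper]. On success, store its start in *out and return 0;
 * return -1 if no such gap exists.
 */
int find_gap(const struct used_range *used, size_t n, unsigned long size,
             unsigned long lower, unsigned long upper, unsigned long *out)
{
    unsigned long start = lower;
    size_t i;

    if (size == 0 || lower > upper)
        return -1;

    for (i = 0; i < n; i++) {
        if (used[i].last < start)
            continue;            /* range lies entirely below the cursor */
        if (used[i].first > start && used[i].first - start >= size)
            break;               /* the gap before this range is big enough */
        if (used[i].last >= upper)
            return -1;           /* nothing is left below the upper limit */
        start = used[i].last + 1;
    }

    /* The candidate gap must also fit below the upper limit. */
    if (upper - start < size - 1)
        return -1;

    *out = start;
    return 0;
}
```

With occupied ranges 0x1000-0x1fff and 0x3000-0x3fff, an unbounded search for 0x1000 free indices returns 0, which is the empty area at address zero. Setting the lower limit to 0x1000 moves the result to 0x2000.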

There were a few comments once Howlett finished. David Hildenbrand said that the kernel contains a lot of checks for gate VMAs, which are a special VMA used to represent the virtual-system-call page; it would be nice to find a way to represent them in the maple tree and remove those checks. Suren Baghdasaryan said that guard VMAs are one of the biggest allocation slabs on Android systems, so removing them would be a welcome optimization. The session wound down with a bit of discussion on the best way to identify guard VMAs within the kernel.

Index entries for this article
Kernel: Maple trees
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024



The next steps for the maple tree

Posted May 27, 2024 18:32 UTC (Mon) by jedix (subscriber, #116933) [Link] (1 responses)

For the guard areas, there are two issues:
1. guard VMAs are used to catch writes beyond a particular VMA
2. some VMAs that grow have unusable areas that are outside the range but are still unusable.

The first issue is primarily about the inflated memory use and counting towards the mm->map_count limit of a VMA.

The second issue is about how we track gaps. Since this area is considered fully usable from the tree's point of view, we end up in a situation that is both not empty and not usable; an empty area can be found to place allocations but the search must continue for a place that is actually usable. This is a lot like the upper and lower bound issue, but not entirely the same as the area can more easily shift (annoyingly, upper and lower bounds can also shift).

The next steps for the maple tree

Posted May 27, 2024 20:07 UTC (Mon) by jedix (subscriber, #116933) [Link]

"we end up in a situation that is both not empty and not usable" should read

"we end up in a situation that is both empty and not usable"

The next steps for the maple tree

Posted May 27, 2024 20:11 UTC (Mon) by vbabka (subscriber, #91706) [Link] (3 responses)

During the session I have proposed an alternative to guard VMAs (which AFAIU is a VMA with mprotect() set to PROT_NONE). What should be enough to catch the stray accesses before/after the VMA would be to have the respective entries in page tables (PTE) marked as such. But because we match protections on VMAs and PTEs, once userspace e.g. creates a larger VMA that includes the guard pages and then mprotect(PROT_NONE) the boundary pages, the kernel will split the VMA into 3.

Instead we could extend e.g. madvise() with two new modes (tentatively called MADV_POISON and MADV_REMEDY) that would only mark the PTEs as something that behaves like PROT_NONE, or undo this. Since there would be no actual page mapped (pfn) for them, it could be implemented as a new special swap type (as e.g. migration entries are) to avoid the hassle of distinguishing it from other PROT_NONE marked entries such as those used for NUMA balancing. The page fault handler would recognize this and cause a segfault. This would avoid the VMA splitting. Users could also do this to pages in the middle of a VMA, not just boundaries, e.g. in a userspace malloc implementation to catch a use-after-free. In that sense MADV_POISON would extend MADV_DONTNEED in that the PTE is zapped before setting the special swap type.

Has something like that been tried before? Any obvious gotchas? It could be also done with userfaultfd but that adds its own overhead to all the page faults.

The next steps for the maple tree

Posted May 28, 2024 2:31 UTC (Tue) by jedix (subscriber, #116933) [Link] (1 responses)

One potential issue is that some platforms (I believe MIPS?) can have PTEs outside of the areas mapped by the VMAs. This is why, when we munmap, we use the start of the next VMA and the end of the previous VMA as the area being unmapped (and passed through to free_pgtables() in unmap_region()). If we have special PTEs mapped in that area for other uses, we would need to work around free_pgtables() using the extended range beyond the VMAs to avoid including these new mappings when removing the next VMA. The arguments passed through to free_pgtables() are already a mess.

The next steps for the maple tree

Posted May 28, 2024 13:39 UTC (Tue) by jedix (subscriber, #116933) [Link]

I think this isn't a concern as the guard would be included within the vma guards.

Also, for reference it seems what I was thinking of was for arm:
https://lore.kernel.org/all/Pine.LNX.4.61.0504070210430.2...

The next steps for the maple tree

Posted May 28, 2024 11:23 UTC (Tue) by david.hildenbrand (subscriber, #108299) [Link]

As raised, Jann Horn played with that idea and I remember he had a prototype, see https://lore.kernel.org/lkml/CAG48ez2NrQjB5T5++uJSZ8-id5-...

MADV_GUARD might be a more appropriate name.

I have a related project on my back burner to optimize very sparse VMAs (raised in reply to Jann's mail), which I discussed in the bi-weekly MM meeting. While my initial idea was to similarly use PTE markers, I'm investigating using something like sparse bitmaps instead. Other things keep interrupting me, so I have not yet had time to do more work on that. But the focus is a bit different than having only a handful of guard pages (MADV_GUARD might be more appropriate for that).


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds