The next steps for the maple tree
Howlett has a backlog of requested features that seems likely to keep him busy for some time. Some of them are internal to the data structure itself:
- A fast way to count the null entries within a node.
- "Dense nodes", which contain more pointers per node.
- The removal of "big nodes", which are a special structure used when nodes are rebalanced or split. Among other things, removing them will help to improve support for singleton ranges (process IDs, for example).
- Index compression.
With regard to externally visible features:
- The ability to search marks and tags is at the top of the to-do list. That would allow searching a tree for entries with, for example, the "dirty" bit set.
- The ability to prune trees under memory pressure would help the system overall; it could be used with the cache that holds shadow page-table entries for evicted pages.
- Filesystem users would benefit from 64-bit indices on 32-bit architectures.
- A contiguous iterator that would iterate over a range only as long as there are no gaps.
- "Big dense nodes", which were described as a large list that could hold up to 4K singleton items.
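The contiguous iterator mentioned in the list above can be pictured with a toy model. What follows is a userspace sketch, not maple-tree code: `struct toy_range` and `toy_count_contiguous()` are hypothetical names, and the "tree" is assumed to be a sorted array of non-overlapping inclusive ranges. The walk visits entries only until it meets the first gap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: one stored entry covering [first, last]. */
struct toy_range {
    unsigned long first;
    unsigned long last;
};

/*
 * Walk a sorted, non-overlapping array of ranges, beginning with the
 * entry that contains `start`, and count the entries visited before a
 * gap (or `end`) is reached. Returns 0 if `start` itself lies in a gap.
 */
static size_t toy_count_contiguous(const struct toy_range *r, size_t n,
                                   unsigned long start, unsigned long end)
{
    size_t i = 0, visited = 0;
    unsigned long expect = start;

    assert(r != NULL || n == 0);

    /* Skip entries that end before the starting index. */
    while (i < n && r[i].last < start)
        i++;

    for (; i < n && r[i].first <= end; i++) {
        if (r[i].first > expect)
            break;              /* gap found: stop iterating */
        visited++;
        if (r[i].last >= end)
            break;              /* requested range fully covered */
        expect = r[i].last + 1;
    }
    return visited;
}
```

With ranges 0-9, 10-19, and 30-39, a walk starting at index 0 visits the first two entries and stops at the gap that begins at index 20.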
Overall, he said, he is trying to get maple trees to the point that they can match the features provided by XArrays. The maple tree should be able to do the same things with better performance; once the features are there, it should be possible to implement the XArray interface and switch users without anybody else having to even be aware of it.
Howlett said that the maple tree is getting more users, and he is seeing some common errors when code is converted over. It is possible to use external locks to serialize access to a maple tree, he said, and some users do it, but it is better avoided if possible. He cautioned that anybody using read-copy-update (RCU) read locks should be aware that the lock protects a maple-tree node from being freed, but not necessarily the data contained within that node.
Users of the generic storage API were encouraged to wrap it with a typed interface so that the compiler can catch mistakes. Developers converting from an XArray are often surprised when mas_next() fails to return the first entry. Its job is to get the next entry; to start at the beginning, mas_find() should be used instead.
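The difference between the two calls can be shown with a toy cursor. This is a deliberately simplified userspace sketch, not the kernel implementation: `struct toy_state`, `toy_find()`, and `toy_next()` are hypothetical names, and the only behavior modeled is the one described above, where "find" returns the entry at or after the current position while "next" returns the entry strictly after it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of a stored entry. */
struct toy_entry {
    unsigned long index;
    const char *value;
};

/* Hypothetical toy cursor; not the kernel's state structure. */
struct toy_state {
    const struct toy_entry *entries;  /* sorted by index */
    size_t count;
    unsigned long index;              /* current search position */
};

/* Return the first entry at or after the cursor ("find"). */
static const char *toy_find(struct toy_state *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->count; i++) {
        if (s->entries[i].index >= s->index) {
            s->index = s->entries[i].index;
            return s->entries[i].value;
        }
    }
    return NULL;
}

/* Return the first entry strictly after the cursor ("next"). */
static const char *toy_next(struct toy_state *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < s->count; i++) {
        if (s->entries[i].index > s->index) {
            s->index = s->entries[i].index;
            return s->entries[i].value;
        }
    }
    return NULL;
}
```

With entries at indices 0, 5, and 9 and a fresh cursor at index 0, `toy_next()` returns the entry at index 5 and never reports the one at index 0. Calling `toy_find()` first and `toy_next()` afterward visits all three.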
In general, he said, he is working toward the addition of a type-safe interface and moving away from void * pointers. Eventually there will be a DEFINE_MAPLE_TREE() macro that creates a tree handling objects of a given type.
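The general idea of a typed layer over void * storage can be sketched in plain C. Everything here is an assumption made for illustration: `struct toy_store`, its accessors, and `DEFINE_TOY_TYPED_STORE()` are hypothetical, and the sketch does not show what the planned DEFINE_MAPLE_TREE() macro will look like, since the talk did not describe its form.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* Hypothetical generic store: holds untyped pointers, like a void * API. */
struct toy_store {
    void *slot[TOY_SLOTS];
};

static int toy_store_set(struct toy_store *s, size_t index, void *entry)
{
    assert(s != NULL);
    if (index >= TOY_SLOTS)
        return -1;
    s->slot[index] = entry;
    return 0;
}

static void *toy_store_get(const struct toy_store *s, size_t index)
{
    assert(s != NULL);
    return index < TOY_SLOTS ? s->slot[index] : NULL;
}

/*
 * Generate a typed facade over the generic store. The generated
 * functions take and return `type *`, so passing a pointer to the wrong
 * type draws a compiler diagnostic instead of converting silently
 * through void *.
 */
#define DEFINE_TOY_TYPED_STORE(name, type)                              \
    struct name { struct toy_store raw; };                              \
    static inline int name##_set(struct name *t, size_t i, type *e)     \
    { return toy_store_set(&t->raw, i, e); }                            \
    static inline type *name##_get(const struct name *t, size_t i)      \
    { return toy_store_get(&t->raw, i); }

/* Example element type and the typed store generated for it. */
struct task { int pid; };

DEFINE_TOY_TYPED_STORE(task_store, struct task)
```

Callers then use `task_store_set()` and `task_store_get()` only, and the untyped accessors stay an implementation detail.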
As usage has grown, the maple tree structure has encountered a number of challenges. Tracking of virtual memory areas (VMAs) is one of those; he is trying to find ways to remove some of the complexity associated with special VMAs. One example is guard VMAs, which define a short range of no-access address space to catch overruns. If guard VMAs are in use, the total number of VMAs in the tree is doubled, which is expensive, but those guard VMAs are never really used. So Howlett is trying to find a way to mark guard regions directly in the maple tree and avoid allocating so many extra structures.
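For context, this is roughly how a userspace program can set up such a guard region today using standard POSIX calls. It is a minimal sketch; `map_with_guard()` is a hypothetical helper name. On Linux, giving part of a mapping a different protection leaves the guard range in a VMA of its own, which is the overhead described above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `pages` usable pages followed by one
 * inaccessible guard page. Returns the usable base address and stores
 * the total mapping length in *total_len, or returns NULL on failure.
 */
static void *map_with_guard(size_t pages, size_t *total_len)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t len;
    char *base;

    assert(total_len != NULL);
    if (psz <= 0 || pages == 0)
        return NULL;

    len = (pages + 1) * (size_t)psz;
    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Turn the final page into a guard: any access to it faults. */
    if (mprotect(base + pages * (size_t)psz, (size_t)psz, PROT_NONE) != 0) {
        munmap(base, len);
        return NULL;
    }

    *total_len = len;
    return base;
}
```

The caller releases the whole region, guard page included, with `munmap(base, total_len)`.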
Maple trees should eventually implement upper and lower limits, he said; that would be useful, for example, to implement restrictions on mapping the page at virtual address zero. Currently a maple tree will show gaps in areas that are not actually available for allocation. There are also some challenges in representing the vDSO area.
There were a few comments once Howlett finished. David Hildenbrand said that the kernel contains a lot of checks for gate VMAs, which are a special VMA used to represent the virtual-system-call page; it would be nice to find a way to represent them in the maple tree and remove those checks. Suren Baghdasaryan said that guard VMAs are one of the biggest allocation slabs on Android systems, so removing them would be a welcome optimization. The session wound down with a bit of discussion on the best way to identify guard VMAs within the kernel.
Index entries for this article | |
---|---|
Kernel | Maple trees |
Conference | Storage, Filesystem, Memory-Management and BPF Summit/2024 |
Posted May 27, 2024 18:32 UTC (Mon) by jedix (subscriber, #116933)
1. Guard VMAs are used to catch writes beyond a particular VMA.
2. Some VMAs that grow have areas that lie outside their range but are still unusable.
The first issue is primarily about the inflated memory use and about guard VMAs counting toward the mm->map_count limit.
The second issue is about how we track gaps. Since this area is considered fully usable from the tree's point of view, we end up in a situation that is both not empty and not usable; an empty area can be found to place allocations, but the search must continue for a place that is actually usable. This is a lot like the upper and lower bound issue, but not entirely the same, as the area can more easily shift (annoyingly, upper and lower bounds can also shift).
Posted May 27, 2024 20:07 UTC (Mon) by jedix (subscriber, #116933)
"we end up in a situation that is both empty and not usable"
Posted May 27, 2024 20:11 UTC (Mon) by vbabka (subscriber, #91706)
Instead we could extend e.g. madvise() with two new modes (tentatively called MADV_POISON and MADV_REMEDY) that would only mark the PTEs as something that behaves like PROT_NONE, or undo that. Since there would be no actual page (pfn) mapped for them, it could be implemented as a new special swap type (as migration entries are, for example) to avoid the hassle of distinguishing it from other PROT_NONE-marked entries such as those used for NUMA balancing. The page fault handler would recognize this and cause a segfault. This would avoid the VMA splitting. Users could also do this to pages in the middle of a VMA, not just at boundaries, e.g. in a userspace malloc implementation to catch a use-after-free. In that sense MADV_POISON would extend MADV_DONTNEED in that the PTE is zapped before setting the special swap type.
Has something like that been tried before? Any obvious gotchas? It could also be done with userfaultfd, but that adds its own overhead to all the page faults.
Posted May 28, 2024 2:31 UTC (Tue) by jedix (subscriber, #116933)
Posted May 28, 2024 13:39 UTC (Tue) by jedix (subscriber, #116933)
Also, for reference it seems what I was thinking of was for arm:
https://lore.kernel.org/all/Pine.LNX.4.61.0504070210430.2...
Posted May 28, 2024 11:23 UTC (Tue) by david.hildenbrand (subscriber, #108299)
MADV_GUARD might be a more appropriate name.
I have a related project on my back burner to optimize very sparse VMAs (raised in reply to Jann's mail), which I discussed in the bi-weekly MM meeting. While my initial idea was to similarly use PTE markers, I'm investigating using something like sparse bitmaps instead. Other things keep interrupting me, so I have not yet had time to do more work on that. But the focus is a bit different from having only a handful of guard pages (MADV_GUARD might be more appropriate for that).