The proper time to split struct page
The purpose of struct page is to allow the kernel to keep track of the status of each page — how it is being used, its position in a least-recently-used list, how many references to it exist, and more. The information needed varies considerably depending on how a given page is being used; a page of user-space anonymous memory is managed differently from, say, memory used for a kernel-space DMA buffer. Since page structures must be kept as small as possible — there are millions of them in a modern system, so every byte hurts — data must be stored as efficiently as possible. As a result, struct page is declared as a maze of nested unions, allowing the data fields for each usage type to be overlaid.
This all leads to a structure that is too big; about 1.6% of the memory in a system is used just to track that memory at the lowest level. Many uses do not require all of the space that struct page provides, but the size of the structure cannot vary and the extra memory is wasted. At the same time, struct page is too small, requiring constant efforts to shoehorn another bit into it. The structure itself is nearly incomprehensible to human minds, even after efforts have been made to clean up its definition. Which fields are available for a given memory type is not always clear. This structure also exposes a lot of internal memory-management details that would be better hidden within the memory-management subsystem, making many changes harder than they should be.
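The 1.6% figure follows from simple arithmetic. As a rough sketch, assuming 4KB base pages and a 64-byte struct page (typical of 64-bit x86 configurations, but not stated above), the overhead can be worked out as follows; the function and macro names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the article: 4 KiB pages, 64-byte struct page. */
#define ASSUMED_PAGE_SIZE        4096ULL
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL

/* Bytes of page structures needed to describe mem_bytes of RAM. */
static uint64_t memmap_bytes(uint64_t mem_bytes)
{
	/* Only whole pages are tracked in this simplified model. */
	assert(mem_bytes % ASSUMED_PAGE_SIZE == 0);
	return (mem_bytes / ASSUMED_PAGE_SIZE) * ASSUMED_STRUCT_PAGE_SIZE;
}

/* Tracking overhead in hundredths of a percent, rounded down. */
static uint64_t memmap_overhead_centipercent(void)
{
	return ASSUMED_STRUCT_PAGE_SIZE * 10000ULL / ASSUMED_PAGE_SIZE;
}
```

Under these assumptions, a system with 16GB of memory spends 256MB on page structures, or about 1.56% of the total, which rounds to the figure quoted above.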
One of the many goals of the current churn in that subsystem is to get rid of struct page in its current form. The system's memory map, which is currently an array of these structures, would be reduced to an array of pointers, each of which would point to a descriptor of a type suited to the current use of the page it represents. Those descriptors would be dynamically allocated and sized appropriately for the information they need to contain.
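A toy model of that end state might look like the following. Everything here is hypothetical: the toy_ names, the type tag stored at the start of each descriptor, and the use of calloc() are assumptions made for illustration, and nothing above says how the kernel will actually identify a descriptor's type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_PAGES 8

enum toy_desc_type { TOY_DESC_SLAB, TOY_DESC_PT };

/* Common header; every usage-specific descriptor starts with one. */
struct toy_desc {
	enum toy_desc_type type;
};

/* Descriptors are sized for what each usage needs to record. */
struct toy_slab_desc {
	struct toy_desc base;	/* must be the first member */
	unsigned int inuse;
	unsigned int objects;
};

struct toy_pt_desc {
	struct toy_desc base;	/* must be the first member */
	void *pt_mm;
};

/* The "memory map": one pointer per page instead of one full structure. */
static struct toy_desc *toy_memmap[TOY_NR_PAGES];

/* Allocate a descriptor of the given size and attach it to a page.
 * Any descriptor previously attached to pfn is not freed here. */
static struct toy_desc *toy_install(unsigned long pfn,
				    enum toy_desc_type type, size_t size)
{
	struct toy_desc *d;

	assert(pfn < TOY_NR_PAGES && size >= sizeof(*d));
	d = calloc(1, size);
	if (d) {
		d->type = type;
		toy_memmap[pfn] = d;
	}
	return d;
}

static struct toy_slab_desc *toy_make_slab(unsigned long pfn,
					   unsigned int objects)
{
	struct toy_slab_desc *s;

	s = (struct toy_slab_desc *)toy_install(pfn, TOY_DESC_SLAB, sizeof(*s));
	if (s)
		s->objects = objects;
	return s;
}

/* Return the slab descriptor for pfn, or NULL if the page is unused
 * or currently used for something other than slab memory. */
static struct toy_slab_desc *toy_pfn_to_slab(unsigned long pfn)
{
	struct toy_desc *d;

	assert(pfn < TOY_NR_PAGES);
	d = toy_memmap[pfn];
	if (!d || d->type != TOY_DESC_SLAB)
		return NULL;
	return (struct toy_slab_desc *)d;
}
```

The point of the sketch is that code asking for a slab descriptor can never be handed page-table state by accident, and that each descriptor type can grow or shrink without affecting the others.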
This is not a simple change to make; since this structure has been exposed to the entire kernel, there is code all over the place that deals with it directly. This includes a lot of device drivers. Changing all of that code will not be done in a day — or in a year, for that matter.
Thus, smaller steps need to be taken on the way toward this goal. One of those steps is for code to stop dealing with struct page directly and, instead, work with a usage-specific structure type. The 5.17 kernel saw the introduction of struct slab, which describes a page of memory managed by the slab allocator. This structure overlays struct page exactly and is carefully designed to avoid stepping on the fields of that structure that have other uses. This change doesn't change the fact that the information lives in the same page structure as before, but it makes the slab-specific parts explicit and hides the rest of struct page from the slab allocator.
The next step may be the struct ptdesc proposal from Vishal Moola. This structure describes the form that struct page takes when the memory it describes holds a page table:
    struct ptdesc {
        unsigned long __page_flags;

        union {
            struct rcu_head pt_rcu_head;
            struct list_head pt_list;
            struct {
                unsigned long _pt_pad_1;
                pgtable_t pmd_huge_pte;
            };
        };
        unsigned long _pt_s390_gaddr;

        union {
            struct mm_struct *pt_mm;
            atomic_t pt_frag_refcount;
        };

        union {
            unsigned long _pt_pad_2;
    #if ALLOC_SPLIT_PTLOCKS
            spinlock_t *ptl;
    #else
            spinlock_t ptl;
    #endif
        };
        unsigned int __page_type;
        atomic_t _refcount;
    #ifdef CONFIG_MEMCG
        unsigned long pt_memcg_data;
    #endif
    };
As can be seen, even after this use case has been separated from the rest of struct page, a number of unions remain. Many of them represent architecture-specific usages; pt_mm is used on x86 systems, for example, while pt_frag_refcount is needed on PowerPC and s390. But this structure is still much simpler, and it makes the page-table-specific usage clearer and more explicit.
This work is in its sixth revision, and most of the concerns that have been raised about it would appear to have been addressed. This time around, though, Hugh Dickins complained, saying: "I don't see the point of this patchset: to me it is just obfuscation of the present-day tight relationship between page table and struct page." He went on to say that, "in a kindly mood", he would describe the work as being ahead of its time, but would be willing to live with it if need be. David Hildenbrand added that he is "not a friend of these 'overlays'", adding that they only make sense once the descriptors can be dynamically allocated. Both developers seem to see this work as churning the memory-management code without providing any immediate benefit.
Matthew Wilcox answered that one reason to do this work now is to better document how each usage type manages the page structure:
By creating specific types for each user of struct page, we can see what's actually going on. Before the ptdesc conversion started, I could not have told you which bits in struct page were used by the s390 code. I knew they were playing some fun games with the refcount (it's even documented in the s390 code!) but I didn't know they were using ... whatever it is; page->private to point to the kvm private data?
There are, he said, assertions being added to ensure that the usage-specific structures continue to line up properly with struct page on each architecture; these can be seen in the form of the TABLE_MATCH() macros toward the end of this patch from Moola's series.
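The idea behind those assertions can be sketched in userspace C. The following is an illustration only: toy_page and toy_ptdesc are made-up stand-ins rather than the kernel's layouts, and TOY_MATCH() merely imitates the offset-comparison pattern; the real TABLE_MATCH() definition should be checked in the patch itself.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel's real layout. */
struct toy_page {
	unsigned long flags;
	void *lru_next;
	void *lru_prev;
	unsigned long private;
	int refcount;
};

/* Toy usage-specific view that must overlay toy_page exactly. */
struct toy_ptdesc {
	unsigned long __page_flags;
	void *pt_list_next;
	void *pt_list_prev;
	unsigned long pt_private;
	int _refcount;
};

/* Fail the build if a field in the overlay drifts away from the
 * field it is meant to shadow. */
#define TOY_MATCH(pg, pt) \
	static_assert(offsetof(struct toy_page, pg) == \
		      offsetof(struct toy_ptdesc, pt), \
		      "toy_page and toy_ptdesc field offsets differ")

TOY_MATCH(flags, __page_flags);
TOY_MATCH(lru_next, pt_list_next);
TOY_MATCH(lru_prev, pt_list_prev);
TOY_MATCH(private, pt_private);
TOY_MATCH(refcount, _refcount);
static_assert(sizeof(struct toy_ptdesc) <= sizeof(struct toy_page),
	      "overlay must not be larger than the structure it shadows");

/* Reinterpret a page as its page-table view. The kernel is built with
 * -fno-strict-aliasing; portable code that reads fields through this
 * pointer would need a union or memcpy() instead. */
static struct toy_ptdesc *toy_page_ptdesc(struct toy_page *page)
{
	return (struct toy_ptdesc *)page;
}
```

Because the checks run at compile time, a layout change that breaks the overlay on some architecture or configuration shows up as a build failure rather than as silent memory corruption.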
While there seems to be a consensus among the memory-management developers
regarding the replacement of struct page with dynamically
allocated, usage-specific descriptors, there apparently has not been a
conversation about the order in which those changes should take place. It
might be possible to do the dynamic allocation first, but that, too, would
be a lot of code churn without a huge immediate benefit. Both
transformations are needed to get to where the developers are trying to go.
This work has started by adding the new structure types first; chances are
it will continue that way for the duration (with, perhaps, zsmalloc
descriptors being the next step).
Index entries for this article | |
---|---|
Kernel | Memory management/Memory descriptors |
Kernel | Releases/6.6 |
Posted Jul 14, 2023 19:32 UTC (Fri)
by clugstj (subscriber, #4020)
[Link]
Posted Jul 15, 2023 15:56 UTC (Sat)
by josh (subscriber, #17465)
[Link] (2 responses)
A union doesn't seem like the right tool for that, given that only one branch of the union will ever be used on the current system. Why not just use an ifdef at that point?
Posted Jul 15, 2023 18:24 UTC (Sat)
by kazer (subscriber, #134462)
[Link] (1 responses)
Posted Jul 17, 2023 8:26 UTC (Mon)
by jengelh (subscriber, #33263)
[Link]
Posted Jul 18, 2023 5:59 UTC (Tue)
by marcH (subscriber, #57642)
[Link] (6 responses)
"We can solve any problem by introducing an extra level of indirection."
Posted Jul 18, 2023 14:49 UTC (Tue)
by willy (subscriber, #9762)
[Link] (5 responses)
I'd give a more useful response, but so much has been written about this already, I'm not inclined to give you a custom response to such a low-effort comment.
Posted Jul 18, 2023 15:37 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Jul 20, 2023 19:10 UTC (Thu)
by knotapun (guest, #166136)
[Link] (2 responses)
Posted Jul 21, 2023 2:50 UTC (Fri)
by willy (subscriber, #9762)
[Link] (1 responses)
unsigned long flags;
#define FOO (1<<0)
#define BAR (1<<1)
instead of
unsigned long foo_flag:1;
unsigned long bar_flag:1;
? Assuming that's your question ...
There's no way to atomically set a bitfield to a value. That is, if one process sets foo_flag at the same time another process sets bar_flag, both CPUs will do a read-modify-write and one write can get lost. Of course, this is true for "unsigned long flags" too, which is why we have set_bit() and friends.
We do use bitfields in some places, but probably could make more use of them; not every flags word needs to be accessed atomically.
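The lost-update problem described above is why shared flag words are changed with atomic read-modify-write operations. A userspace sketch using C11 atomics as a stand-in for the kernel's set_bit() and clear_bit() might look like this; the toy_ names are hypothetical, and the kernel's own helpers take a bit number rather than a mask:

```c
#include <assert.h>
#include <stdatomic.h>

#define FOO (1UL << 0)
#define BAR (1UL << 1)

struct toy_obj {
	atomic_ulong flags;
};

/* Atomically OR the mask into the flags word. Two threads setting
 * different flags at the same time cannot overwrite each other. */
static void toy_set_flag(struct toy_obj *obj, unsigned long mask)
{
	assert(mask != 0);
	atomic_fetch_or(&obj->flags, mask);
}

/* Atomically clear the bits in mask, leaving the others untouched. */
static void toy_clear_flag(struct toy_obj *obj, unsigned long mask)
{
	assert(mask != 0);
	atomic_fetch_and(&obj->flags, ~mask);
}

static int toy_test_flag(struct toy_obj *obj, unsigned long mask)
{
	return (atomic_load(&obj->flags) & mask) != 0;
}
```

A C bitfield offers no equivalent: the language provides no way to name a single bitfield member as the target of an atomic operation, so the compiler is free to emit a plain load, modify, and store of the containing word.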
Posted Jul 21, 2023 3:33 UTC (Fri)
by knotapun (guest, #166136)
[Link]
Thanks, C is still new to me!
Posted Jul 21, 2023 9:12 UTC (Fri)
by marcH (subscriber, #57642)
[Link]
On the other hand, it wasn't a question, just a perfectly neutral statement. As such it wasn't expecting any answer and certainly not an aggressive and somewhat cryptic one (I honestly don't know what you've imagined from my comment; please do *not* elaborate on that).
I enjoy LWN comments most of the time because they can be both relaxed/low bar while being incredibly knowledgeable and valuable from time to time. Basically what social media should have been.
Posted Jul 18, 2023 7:34 UTC (Tue)
by taladar (subscriber, #68407)
[Link] (3 responses)
I also think the excuse of having to keep it the same size on all architectures seems rather weak considering how incomprehensible this makes even just the struct itself, never mind code that has to deal with this mess.
Posted Jul 18, 2023 13:16 UTC (Tue)
by corbet (editor, #1)
[Link] (2 responses)
When thinking about sizing, remember that there are other constraints, like making the structure fit neatly within a cache line. It is not like developers are deliberately creating gnarly data structures then having to come up with excuses for them.
Posted Jul 22, 2023 19:37 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (1 responses)
Posted Aug 10, 2023 7:17 UTC (Thu)
by daenzer (subscriber, #7050)
[Link]
I don't know if this approach would be feasible for struct page even in principle though, let alone in practice.
C code can hide data structures just fine, that's just not how Linux memory management evolved.
Module systems