The proper time to split struct page

By Jonathan Corbet
July 14, 2023

The page structure sits at the core of the kernel's memory-management subsystem; one such structure exists for every page of installed RAM. This structure is increasingly seen as a problem, though, and phasing it out is one of the many side projects associated with the folio conversion. One step in that direction is currently meeting some pushback from memory-management developers, who think that some of these changes are coming too soon.

The purpose of struct page is to allow the kernel to keep track of the status of each page — how it is being used, its position in a least-recently-used list, how many references to it exist, and more. The information needed varies considerably depending on how a given page is being used; a page of user-space anonymous memory is managed differently from, say, memory used for a kernel-space DMA buffer. Since page structures must be kept as small as possible — there are millions of them in a modern system, so every byte hurts — data must be stored as efficiently as possible. As a result, struct page is declared as a maze of nested unions, allowing the data fields for each usage type to be overlaid.
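To illustrate the overlay technique, here is a minimal sketch using a made-up `struct toy_page`; the structure, its fields, and the "flat" comparison type are all hypothetical and far simpler than the real struct page. It shows only how a union lets two usage types share the same bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct page. */
struct toy_page {
	unsigned long flags;		/* shared by every usage type */
	union {
		struct {		/* view used for a page on an LRU list */
			struct toy_page *lru_next;
			struct toy_page *lru_prev;
			unsigned long index;
		} cache;
		struct {		/* view used for a DMA buffer page */
			unsigned long dma_addr;
			unsigned long dma_len;
		} dma;
	};
	int refcount;			/* also shared by every usage type */
};

/* The same fields laid out one after another, with no overlay. */
struct toy_page_flat {
	unsigned long flags;
	struct toy_page *lru_next;
	struct toy_page *lru_prev;
	unsigned long index;
	unsigned long dma_addr;
	unsigned long dma_len;
	int refcount;
};

/* Both views start at the same offset: they occupy the same storage. */
static_assert(offsetof(struct toy_page, cache) ==
	      offsetof(struct toy_page, dma),
	      "usage-specific views must overlay each other");

/* Bytes saved per page by overlaying the views instead of listing them. */
size_t toy_page_bytes_saved(void)
{
	return sizeof(struct toy_page_flat) - sizeof(struct toy_page);
}
```

The cost of the saving is that nothing in the type itself says which view is valid for a given page; that knowledge has to live elsewhere.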

This all leads to a structure that is too big; about 1.6% of the memory in a system is used just to track that memory at the lowest level. Many uses do not require all of the space that struct page provides, but the size of the structure cannot vary and the extra memory is wasted. At the same time, struct page is too small, requiring constant efforts to shoehorn another bit into it. The structure itself is nearly incomprehensible to human minds, even after efforts have been made to clean up its definition. Which fields are available for a given memory type is not always clear. This structure also exposes a lot of internal memory-management details that would be better hidden within the memory-management subsystem, making many changes harder than they should be.
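The 1.6% figure can be checked with simple arithmetic. The sketch below assumes 4096-byte pages and a 64-byte page structure, which is the common x86-64 configuration; neither number is stated in the text above, and other architectures and configurations differ. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions, not taken from the article: typical x86-64 values. */
#define ASSUMED_PAGE_SIZE		4096u
#define ASSUMED_STRUCT_PAGE_SIZE	64u

static_assert(ASSUMED_PAGE_SIZE % ASSUMED_STRUCT_PAGE_SIZE == 0,
	      "page size should be a multiple of the structure size");

/* Bytes of page structures needed to track ram_bytes of memory. */
uint64_t memmap_bytes(uint64_t ram_bytes)
{
	return ram_bytes / ASSUMED_PAGE_SIZE * ASSUMED_STRUCT_PAGE_SIZE;
}

/* Overhead in hundredths of a percent, rounded down. */
unsigned int memmap_overhead_basis_points(void)
{
	return ASSUMED_STRUCT_PAGE_SIZE * 10000u / ASSUMED_PAGE_SIZE;
}
```

Under those assumptions the ratio is 64/4096, or 1.5625%, so a system with 16GiB of RAM spends 256MiB on page structures.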

One of the many goals of the current churn in that subsystem is to get rid of struct page in its current form. The system's memory map, which is currently an array of these structures, would be reduced to an array of pointers, each of which would point to a descriptor of a type suited to the current use of the page it represents. Those descriptors would be dynamically allocated and sized appropriately for the information they need to contain.
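As a rough sketch of what such a memory map could look like, consider the user-space toy below. Every name in it (`toy_memmap`, `toy_desc`, `toy_slab`, and so on) is hypothetical, and how the kernel will actually record the descriptor type is not described here; this version simply puts a type tag in a common header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum toy_desc_type { TOY_DESC_SLAB, TOY_DESC_PTDESC };

/* Common header placed first in every descriptor type. */
struct toy_desc {
	enum toy_desc_type type;
};

/* Descriptors are sized for what each usage actually needs. */
struct toy_slab {
	struct toy_desc desc;
	unsigned int objects;
	unsigned int inuse;
};

struct toy_ptdesc {
	struct toy_desc desc;
	void *mm;
};

#define TOY_NR_PAGES 8

/* The memory map: one pointer per page; NULL means "no descriptor". */
struct toy_desc *toy_memmap[TOY_NR_PAGES];

/* Allocate a slab descriptor and attach it to page frame pfn. */
struct toy_slab *toy_alloc_slab(size_t pfn, unsigned int objects)
{
	struct toy_slab *slab;

	if (pfn >= TOY_NR_PAGES || toy_memmap[pfn])
		return NULL;
	slab = calloc(1, sizeof(*slab));
	if (!slab)
		return NULL;
	slab->desc.type = TOY_DESC_SLAB;
	slab->objects = objects;
	toy_memmap[pfn] = &slab->desc;
	return slab;
}

/* Return the slab descriptor for pfn, or NULL if it is not a slab page. */
struct toy_slab *toy_pfn_to_slab(size_t pfn)
{
	struct toy_desc *desc = pfn < TOY_NR_PAGES ? toy_memmap[pfn] : NULL;

	if (!desc || desc->type != TOY_DESC_SLAB)
		return NULL;
	/* Valid because desc is the first member of struct toy_slab. */
	return (struct toy_slab *)desc;
}

/* Detach and free whatever descriptor is attached to pfn. */
void toy_free_desc(size_t pfn)
{
	if (pfn < TOY_NR_PAGES) {
		free(toy_memmap[pfn]);
		toy_memmap[pfn] = NULL;
	}
}
```

The per-page cost in the map itself drops to one pointer, while each descriptor carries only the fields its usage type needs.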

This is not a simple change to make; since this structure has been exposed to the entire kernel, there is code all over the place that deals with it directly. This includes a lot of device drivers. Changing all of that code will not be done in a day — or in a year, for that matter.

Thus, smaller steps need to be taken on the way toward this goal. One of those steps is for code to stop dealing with struct page directly and, instead, work with a usage-specific structure type. The 5.17 kernel saw the introduction of struct slab, which describes a page of memory managed by the slab allocator. This structure overlays struct page exactly and is carefully designed to avoid stepping on the fields of that structure that have other uses. The conversion doesn't alter the fact that the information lives in the same page structure as before, but it makes the slab-specific parts explicit and hides the rest of struct page from the slab allocator.
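The overlay idea can be sketched with two toy structures; `struct toy_page`, `struct toy_slab`, and `toy_page_slab()` are hypothetical names, and the field lists are invented rather than copied from the kernel. The point is only that the usage-specific type is a second view of the same memory:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic structure, standing in for struct page. */
struct toy_page {
	unsigned long flags;
	void *field1;		/* meaning depends on how the page is used */
	void *field2;
	unsigned long field3;
	unsigned int refcount;
};

/* Hypothetical slab view: same layout, slab-specific names. */
struct toy_slab {
	unsigned long __page_flags;	/* owned by generic code; hands off */
	void *slab_cache;
	void *freelist;
	unsigned long counters;
	unsigned int __page_refcount;	/* owned by generic code; hands off */
};

static_assert(sizeof(struct toy_slab) == sizeof(struct toy_page),
	      "the overlay must be exactly as large as the page structure");

/*
 * Reinterpret a page structure as a slab descriptor. No data moves; only
 * the type changes. Accessing one struct type through a pointer to another
 * relies on the compiler not applying strict-aliasing rules (the kernel is
 * built with -fno-strict-aliasing).
 */
struct toy_slab *toy_page_slab(struct toy_page *page)
{
	return (struct toy_slab *)page;
}
```

Code that only ever sees `struct toy_slab` cannot accidentally reach for fields that belong to some other usage type, even though the storage is unchanged.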

The next step may be the struct ptdesc proposal from Vishal Moola. This structure describes the form that struct page takes when the memory it describes holds a page table:

    struct ptdesc {
    	unsigned long __page_flags;
    
    	union {
    	    struct rcu_head pt_rcu_head;
    	    struct list_head pt_list;
    	    struct {
    		unsigned long _pt_pad_1;
    		pgtable_t pmd_huge_pte;
    	    };
    	};
    	unsigned long _pt_s390_gaddr;
    
    	union {
    	    struct mm_struct *pt_mm;
    	    atomic_t pt_frag_refcount;
    	};
    
    	union {
    	    unsigned long _pt_pad_2;
    #if ALLOC_SPLIT_PTLOCKS
    	    spinlock_t *ptl;
    #else
    	    spinlock_t ptl;
    #endif
    	};
    	unsigned int __page_type;
    	atomic_t _refcount;
    #ifdef CONFIG_MEMCG
    	unsigned long pt_memcg_data;
    #endif
    };

As can be seen, even after this use case has been separated from the rest of struct page, a number of unions remain. Many of them represent architecture-specific usages; pt_mm is used on x86 systems, for example, while pt_frag_refcount is needed on PowerPC and s390. But this structure is still much simpler, and it makes the page-table-specific usage clearer and more explicit.

This work is in its sixth revision, and most of the concerns that have been raised about it would appear to have been addressed. This time around, though, Hugh Dickins complained, saying: "I don't see the point of this patchset: to me it is just obfuscation of the present-day tight relationship between page table and struct page." He went on to say that, "in a kindly mood", he would describe the work as being ahead of its time, but would be willing to live with it if need be. David Hildenbrand added that he is "not a friend of these 'overlays'", adding that they only make sense once the descriptors can be dynamically allocated. Both developers seem to see this work as churning the memory-management code without providing any immediate benefit.

Matthew Wilcox answered that one reason to do this work now is to better document how each usage type manages the page structure:

By creating specific types for each user of struct page, we can see what's actually going on. Before the ptdesc conversion started, I could not have told you which bits in struct page were used by the s390 code. I knew they were playing some fun games with the refcount (it's even documented in the s390 code!) but I didn't know they were using ... whatever it is; page->private to point to the kvm private data?

There are, he said, assertions being added to ensure that the usage-specific structures continue to line up properly with struct page on each architecture; these can be seen in the form of the TABLE_MATCH() macros toward the end of this patch from Moola's series.
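A minimal sketch of what such a layout assertion can look like follows. It is modeled loosely on the idea behind TABLE_MATCH() but uses invented toy structures and a hypothetical `TOY_TABLE_MATCH()` macro, so it should not be read as the code from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic structure, standing in for struct page. */
struct toy_page {
	unsigned long flags;
	void *lru_next;
	void *lru_prev;
	void *mapping;
	unsigned int page_type;
	unsigned int refcount;
};

/* Hypothetical page-table view of the same memory. */
struct toy_ptdesc {
	unsigned long __page_flags;
	void *pt_list_next;
	void *pt_list_prev;
	void *pt_mm;
	unsigned int __page_type;
	unsigned int _refcount;
};

/* Fail the build if a field pair ever stops sharing an offset. */
#define TOY_TABLE_MATCH(pg, pt)						\
	static_assert(offsetof(struct toy_page, pg) ==			\
		      offsetof(struct toy_ptdesc, pt),			\
		      #pg " and " #pt " must share an offset")

TOY_TABLE_MATCH(flags, __page_flags);
TOY_TABLE_MATCH(mapping, pt_mm);
TOY_TABLE_MATCH(page_type, __page_type);
TOY_TABLE_MATCH(refcount, _refcount);

static_assert(sizeof(struct toy_ptdesc) <= sizeof(struct toy_page),
	      "the overlay must fit inside the page structure");
```

Because the checks run at compile time, a change to either structure that breaks the correspondence is caught on every architecture that builds the code, rather than showing up later as memory corruption.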

While there seems to be a consensus among the memory-management developers regarding the replacement of struct page with dynamically allocated, usage-specific descriptors, there apparently has not been a conversation about the order in which those changes should take place. It might be possible to do the dynamic allocation first, but that, too, would be a lot of code churn without a huge immediate benefit. Both transformations are needed to get to where the developers are trying to go. This work has started by adding the new structure types first; chances are it will continue that way for the duration (with, perhaps, zsmalloc descriptors being the next step).

Index entries for this article

Kernel Memory management/Memory descriptors

Kernel Releases/6.6


The proper time to split struct page

Posted Jul 14, 2023 19:32 UTC (Fri) by clugstj (subscriber, #4020) [Link]

Struct "page" is truly frightening! This seems like a very useful first step toward wrangling it into something less scary (and more memory efficient).

The proper time to split struct page

Posted Jul 15, 2023 15:56 UTC (Sat) by josh (subscriber, #17465) [Link] (2 responses)

> Many of them represent architecture-specific usages; pt_mm is used on x86 systems, for example, while pt_frag_refcount is needed on PowerPC and s390.

A union doesn't seem like the right tool for that, given that only one branch of the union will ever be used on the current system. Why not just use an ifdef at that point?

The proper time to split struct page

Posted Jul 15, 2023 18:24 UTC (Sat) by kazer (subscriber, #134462) [Link] (1 responses)

Keeping the size the same across configurations helps in tracking down bugs in the code, particularly as (according to the article) it has been accessed directly before, without wrappers or abstractions of its layout.

The proper time to split struct page

Posted Jul 17, 2023 8:26 UTC (Mon) by jengelh (subscriber, #33263) [Link]

Isn't it about "keeping the layout the same with struct page" rather than "keeping the size the same across architectures"?

The proper time to split struct page

Posted Jul 18, 2023 5:59 UTC (Tue) by marcH (subscriber, #57642) [Link] (6 responses)

> The system's memory map, which is currently an array of these structures, would be reduced to an array of pointers, each of which would point to a descriptor of a type suited to the current use of the page it represents

"We can solve any problem by introducing an extra level of indirection."

The proper time to split struct page

Posted Jul 18, 2023 14:49 UTC (Tue) by willy (subscriber, #9762) [Link] (5 responses)

Tell me you don't understand the problem without telling me you don't understand the problem.

I'd give a more useful response, but so much has been written about this already, I'm not inclined to give you a custom response to such a low-effort comment.

The proper time to split struct page

Posted Jul 18, 2023 15:37 UTC (Tue) by Wol (subscriber, #4433) [Link]

I think marc is being facetious. He left out "except for the problem of too many layers of indirection".

Cheers,
Wol

The proper time to split struct page

Posted Jul 20, 2023 19:10 UTC (Thu) by knotapun (guest, #166136) [Link] (2 responses)

Hi, thanks for all your work with Folio! I'm quite new to reading kernel source code, and especially memory management code. My reading has led me to a nagging question, why are so many of the structures described in terms of bit-offsets as opposed to C-std bit-fields? I haven't seen an example where the first would lead to more, or better functionality than the second, and I think this means I'm missing something.

The proper time to split struct page

Posted Jul 21, 2023 2:50 UTC (Fri) by willy (subscriber, #9762) [Link] (1 responses)

You're asking why we use things like:

    unsigned long flags;

    #define FOO (1<<0)
    #define BAR (1<<1)

instead of

    unsigned long foo_flag:1;
    unsigned long bar_flag:1;

? Assuming that's your question ...

There's no way to atomically set a bitfield to a value. That is, if one process sets foo_flag at the same time another process sets bar_flag, both CPUs will do a read-modify-write and one write can get lost. Of course, this is true for "unsigned long flags" too, which is why we have set_bit() and friends.

We do use bitfields in some places, but probably could make more use of them; not every flags word needs to be accessed atomically.

The proper time to split struct page

Posted Jul 21, 2023 3:33 UTC (Fri) by knotapun (guest, #166136) [Link]

Ah! I had not ventured into "bitops.h", but that explains a lot. I guess it's probably a whole lot easier to be cross platform if you can force the assembly.

Thanks, C is still new to me!

The proper time to split struct page

Posted Jul 21, 2023 9:12 UTC (Fri) by marcH (subscriber, #57642) [Link]

My comment was certainly not competing for the "LWN comment of the year", no doubt about that.

On the other hand, it wasn't a question, just a perfectly neutral statement. As such it wasn't expecting any answer and certainly not an aggressive and somewhat cryptic one (I honestly don't know what you've imagined from my comment; please do *not* elaborate on that).

I enjoy LWN comments most of the time because they can be both relaxed/low bar while being incredibly knowledgeable and valuable from time to time. Basically what social media should have been.

The proper time to split struct page

Posted Jul 18, 2023 7:34 UTC (Tue) by taladar (subscriber, #68407) [Link] (3 responses)

To me it is completely insane that a data type definition like this that has lots of internal details of a subsystem is completely exposed to the entire codebase of millions of lines. Would that happen in any other language that had a proper module system instead of the C textual includes?

I also think the excuse of having to keep it the same size on all architectures seems rather weak considering how incomprehensible this makes even just the struct itself, never mind code that has to deal with this mess.

Module systems

Posted Jul 18, 2023 13:16 UTC (Tue) by corbet (editor, #1) [Link] (2 responses)

C code can hide data structures just fine; that's just not how Linux memory management evolved.

When thinking about sizing, remember that there are other constraints, like making the structure fit neatly within a cache line. It is not like developers are deliberately creating gnarly data structures then having to come up with excuses for them.

Module systems

Posted Jul 22, 2023 19:37 UTC (Sat) by atnot (subscriber, #124910) [Link] (1 responses)

There is one important difference in that in C, having knowledge or visibility of a struct implies also having visibility of all of its fields (at least, without involved type punning). This does kind of predictably lead to structures that can't be broken apart getting all sorts of secret uses of their fields that make them very hard to evolve and refactor centrally.

Module systems

Posted Aug 10, 2023 7:17 UTC (Thu) by daenzer (subscriber, #7050) [Link]

If a struct is passed around by pointer only, its definition can be hidden everywhere except for a single .c file, which implements accessor functions for other .c files. This does not require any type punning.

I don't know if this approach would be feasible for struct page even in principle though, let alone in practice.
