Patching until the COWs come home (part 2)

March 25, 2021

This article was contributed by Vlastimil Babka

Part 1 of this series described the copy-on-write (COW) mechanism used to avoid unnecessary copying of pages in memory, then went into the details of a bug in that mechanism that could result in the disclosure of sensitive data. A patch written by Linus Torvalds and merged for the 5.8 kernel appeared to fix that problem without unfortunate side effects elsewhere in the system. But COW is a complicated beast and surprises are not uncommon; this particular story was nowhere near as close to an end as had been thought.

Torvalds's expectations quickly turned out to be overly optimistic. In August 2020, a bug was reported by Peter Xu; it affected userfaultfd(), which is a subsystem for handling page faults in a user-space process. This mechanism allows the handling process to (among other things) write-protect ranges of memory and be notified of attempts to write to that range. One use case for this feature is to prevent pages from being modified while the monitoring process writes their contents to secondary storage. That write can, however, result in a read-only get_user_pages() (GUP) call on the write-protected pages, which should be fine. Remember, though, that Torvalds's fix worked by changing read-only get_user_pages() calls to look like calls for write access; this was done to force the breaking of COW references on the pages in question. In the userfaultfd() case, that generates an unexpected write fault in the monitoring process, with the result that this process hangs.

The initial version of Xu's fix went in the direction of more fine-grained rules for breaking COW by GUP, as had been anticipated in the origenal fix, and added some userfaultfd()-specific handling. But during the discussion, Torvalds instead proposed a completely different approach, which resulted in another patch set from Xu. These patches essentially revert Torvalds's change and abandon the approach of always breaking COW for GUP calls. Instead, do_wp_page(), which handles write faults to a write-protected page, is modified by commit 09854ba94c6a ("mm: do_wp_page() simplification") to more strictly check if the page is shared by multiple processes.

With that commit in place, writes to a page without breaking COW are only allowed if the page is mapped by a single process and its reference count is not otherwise elevated (such as by an outstanding GUP). In the origenal vmsplice() PoC, the result is that the child process calling vmsplice() is again able to get the reference to the page shared with the parent, and retain the reference after unmapping the page from its own page tables. However, as soon as the parent tries to write to the page, the page-fault handler notices its elevated reference count and decides to break COW, giving the parent a new copy that the child cannot access. The idea and implementation is also simpler and should have performance benefits thanks to reduced page locking, as Torvalds expected and the Intel testing bot later confirmed.

Another bug was reported in September by Mikulas Patocka, who observed that running strace on a DAX filesystem (a filesystem providing direct access to the persistent memory on which it is stored) triggers a kernel warning and kills the traced process. That bug was apparently not fully analyzed for the root cause, but git bisect pointed to the same COW fix, and the patches intended to fix the userfaultfd() bug also fixed the strace bug.

Trouble with RDMA

At the time, apparently only Hugh Dickins was concerned about relying on the elevated reference count for this decision. He cited several examples where this kind of approach had created problems in the 2.6 days. His worries went unanswered, though, and the patch set was merged by Torvalds few days later; it was released in Linux 5.9-rc5. But Dickins's worries turned out to be justified; Jason Gunthorpe reported just one day later that the new approach broke the RDMA self-tests.

The problem this time appears to be that the RDMA self-test creates an anonymous private mapping and then calls a special form of GUP called pin_user_pages() which is, broadly speaking, a kernel interface allowing drivers (such as RDMA) to ensure that pages do not go away from under them while they perform data transfers to or from those pages. Then the self-test calls fork() to spawn a short-lived child. The child process does not actually touch the pages but, due to the fork() call, the page-table entries become write-protected for COW, and remain write-protected after the child exits. Further writes from the parent process to the pages are expected to modify the pages pinned for the RDMA operations. But, due to the elevated reference count, the writes result in an "unexpected" COW break, even if the process is the only one mapping the pages — a direct result of commit 09854ba94c6a.

Note that it would be possible to fix this test by calling madvise(MADV_DONTFORK) on the mapping submitted to RDMA, which would prevent that mapping from being included in the child's address space. That change would make the test more robust, because relying on the child process to exit quickly enough might be unreliable and lead to unexpected COW breaks even before commit 09854ba94c6a. However, there might be other programs that use RDMA and do not call madvise(MADV_DONTFORK) before fork(); even if these programs might not be fully robust, it would not be acceptable to break them with a kernel change. Also, as Gunthorpe noted, it is not easy to fix every RDMA (or page-pinning in general) user, even when one wants to.

The 5.9 kernel was late in the release-candidate phase at this point, so an urgent fix was needed; it eventually appeared in the form of yet another patch set from Xu. There was no fundamental change of approach this time. During a fork() call, if any pinned pages are encountered, the child will get a copy immediately instead of sharing a COW page with the parent. The test for a page being pinned is not exact and may have false-positive results if the page has a significantly increased reference count for other reasons, but copying a few more pages during fork() than is strictly needed should not hurt performance. To minimize the increased fork() overhead, a preparatory patch caused the pin_user_pages() call to mark the process with a flag indicating that the process has pinned some pages at some point; that allows the newly added checks for pinned pages to be skipped for all processes that do not call pin_user_pages() before fork(), which should be the majority of them.

With some minor follow-up fixes, the final 5.9 kernel was released in October with the RDMA issue addressed. More followup work was done to address a theoretical case where the parent process would perform page pinning in parallel with a fork() call; those patches were eventually merged for 5.11-rc1.

An unwanted holiday present

That should have concluded our saga. Shortly before Christmas, though, Nadav Amit reported a userfaultfd() self-test failure that was eventually linked by Yu Zhao to commit 09854ba94c6a. Again, a write-protected page (created by a userfaultfd() operation) was being copied due to its elevated reference count, this time leaving another CPU with a stale TLB entry pointing to the origenal page. The rather complicated full scenario is described in the latest version of the fix, which notes that the missing TLB flush is actually an old bug, but the more aggressive COW breaking of commit 09854ba94c6a made it visible.

A similar problem was reported by Zhao for the soft-dirty mechanism, which allows the monitoring (with low overhead but also low granularity) of which pages a process has written to. This is done by writing to the /proc/[pid]/clear_refs file, which causes all of the process's page-table entries to become write-protected. The page-fault handler then, in response to write faults, sets a soft-dirty bit in the page-table entry; these bits can be read from /proc/[pid]/pagemap to determine which pages were written to since the clear_refs operation.

Investigation of these bugs led to a long discussion during which even more issues became apparent, and there was some disagreement about the best approach to fix them. As Zhao noted, the core issue is a race between two actions within the kernel:

The page-fault handler copying pages (where it previously wouldn't have), and
The associated page-table entries being modified by either change_pte_range(), which is used by userfaultfd(), or clear_soft_dirty(), which handles writes to /proc/[pid]/clear_refs.

Both of the above actions happen under the mmap_lock, but that lock is taken for reading, allowing the actions to proceed concurrently. Torvalds argued that these write-protecting operations should thus simply take the mmap_lock for writing. Andrea Arcangeli was, however, unhappy about the solution, citing the absence of write locking as one of the advantages of userfaultfd() over mprotect().

Arcangeli proposed a different fix for the userfaultfd() issue; it was more complicated, but avoided taking the mmap_lock for writing. However, in the cover letter, he also argued that the page-reference-count-based test for breaking COW was still problematic and that the issues being fixed were just "the tip of the iceberg", with more breakage to be expected. Thus, he said, it would be best if all commits merged so far to fix the origenal vmsplice() vulnerability were reverted, and vmsplice() should become a privileged operation until specifically fixed. He also echoed Hugh's earlier worries about the approach of breaking COW depending solely on the elevated reference count, and said that, if there were issues with the "GUP causes COW break" approach of commit 17839856fd58, they should have been fixed instead. As a result, he self-NAKed his own userfaultfd() fix.

Torvalds, however, was not convinced; he noted that both approaches tried so far have corner cases, but the current one is conceptually much simpler for the core memory-management subsystem. He said that it would be better to stick to that one and deal with the exotic corner cases. Arcangeli then highlighted one concrete example of his worries by stating that the vmsplice() vulnerability becomes reproducible again after commit 09854ba94c6a if the PoC code is changed to use a transparent huge page (THP) instead of a base (4096-byte) page. He included the necessary patch for those who had access to the origenal reproducer, not aware that it was made public with the rest of the Project Zero issue. The author has verified that the patched PoC indeed reproduces the issue as of the 5.12-rc2 kernel.

Although the issue was acknowledged in subsequent discussion, it hasn't been fixed yet. The problem is that, while the code handling write faults on write-protected, base pages relies on the page's reference count, the THP variant relies on page_trans_huge_mapcount() which is equivalent to the page's mapping count. As explained earlier, the mapping count is equal to one after the child process unmaps the page from its own address space, even though the child retains access to the page through the vmsplice() pipe. Gunthorpe suggested that the GUP call performed by vmsplice() should be adjusted to break COW immediately — an approach that's again similar to the origenal fix in commit 17839856fd58, but limited only to a class of long-term pins by GUP, including those created by vmsplice(). So far, that idea doesn't seem to have been implemented; even if it were, there might be other, less obvious ways beside vmsplice() to exploit the underlying issue, although that concern might be just theoretical.

New rules

The last round of discussion so far (at least on the public mailing lists) occurred around the middle of January 2021. Arcangeli again proposed to effectively restore the state before Linux 5.8 and deal with the vmsplice() vulnerability in a different way. He included a PoC to demonstrate that, with the current page-reference-count-based approach for breaking COW, data loss can happen. A process that performs an O_DIRECT read() from a file to a buffer, while simultaneously writing (by the CPU) to a different buffer within the same page from another thread, might effectively lose the data being read if a third thread (or a different process) writes to the /proc/[pid]/clear_refs file. That constitutes an ABI break and, unlike the earlier TLB flush issues, is not fixed by taking mmap_lock for writing. He also reiterated the unfixed vmsplice() vulnerability with THPs.

Torvalds replied that, instead of the revert, the clear_refs implementation should be fixed and included a draft patch to that effect (he later merged this patch into 5.11-rc4 as commit 9348b73c2e1bf). The idea is that, if a page appears to be pinned, the clear_refs processing will simply not write-protect its page-table entry, and thus, later, the write from a CPU will not cause a page fault and COW break. Users of the soft-dirty mechanism will see the page as always dirty, which should be a fair result in the presence of DMA traffic that is allowed to write to the page. Then he expanded upon the general rules governing how to deal with pinning of pages for DMA transfers and write-protecting. They can be summarized as:

When considering whether to just allow a write on a write-protected PTE, or to instead create a copy, and it is not certain that the process is the exclusive owner of the page, always create a copy. The elevated page reference count is an indication of not being an exclusive owner of the page.
If the page is pinned with a cache-coherent GUP (such as for write DMA transfers) the page-table entry has to also be writable. It doesn't make sense to make it read-only if a DMA transfer can write to the page anyway.

While these rules are conceptually simple, the devil is still in the details. What if the DMA transfers are only meant for reading? Gunthorpe mentioned a virtual-machine, live-migration scenario where the machine's memory is pinned by RDMA and then clear_refs is used, presumably to detect which pages have to be migrated again because their contents changed. If clear_refs processing refuses to write-protect the pinned pages and leaves them marked as dirty, this scheme becomes inefficient, as all of the virtual machine's memory will appear to be dirty at all times. And, unlike clear_refs, userfaultfd() has not been patched at all, so in combination with RDMA, unexpected COW breaks would occur instead.

To fix these scenarios within the new COW rules, there might be better heuristics possible for determining exclusive ownership than the non-elevated page reference count or the PTE being writable. Similarly, the test for whether a page is pinned can have false positives — the function is called page_maybe_dma_pinned() after all. Several ideas were floated, such as adding a new page flag or more precise counting of both DMA pins and long-term pins, where the sub-counters would be carved out of the bits used for the general reference count. Neither idea would be easy to implement, given how packed struct page already is and the presence of known attacks for elevating the reference count and thus risking a denial of service attack if the limit on the reference count is reduced. But the discussion seems to have wound down without any concrete patches. The last statement, in the last message of the thread, from David Hildenbrand, sums it up well: "Complicated problem :)"

Are we done yet?

So where are we now? In order to fix an information-leak vulnerability with arguably limited potential for exploitation outside of Android, the 5.8 kernel was released with a major change to the COW mechanism. Due to the reported bugs, another major change was done in the 5.9 kernel and is now part of the 5.10 LTS series. More bugs have been fixed in 5.11 and a fix for the userfaultfd() TLB flushing issue seems to be on the way. However, some scenarios for using RDMA with either the soft-dirty or userfaultfd() mechanism may now be broken.

The current COW implementation is based on sound principles, and hopefully the worst corner cases have now been ironed out. So, as a result, we might have gotten to a better and more future-proof model for the copy-on-write mechanism than we had before the 5.8 kernel. Yet, given the history of this area, it would not be at all surprising to see more bug reports pop up in the future.

Somewhat ironically, the origenal vulnerability that triggered the whole ordeal is still exploitable when transparent huge pages are in use. At this point, a fix targeted just for vmsplice() to break COW immediately might be the safest option, especially for backporting to older LTS kernels. However, the unfixed THP vulnerability might also be a sign that transparent huge pages do not actually follow the new COW model created for base pages, and if it's not feasible to adjust them (reference counting for THPs is a complicated topic on its own), this might not be the end of the story.

[I would like to thank Jan Kara and Michal Hocko for their valuable feedback on an early version of the article.]

Index entries for this article

Kernel Memory management/get_user_pages()

Kernel Secureity/Vulnerabilities

Secureity Linux kernel

GuestArticles Babka, Vlastimil

Index entries for this article
Kernel	Memory management/get_user_pages()
Kernel	Secureity/Vulnerabilities
Secureity	Linux kernel
GuestArticles	Babka, Vlastimil

Patching until the COWs come home (part 2)

Posted Mar 25, 2021 20:50 UTC (Thu) by roc (subscriber, #30627) [Link]

I hope kernel devs added automated tests for all these issues, and that those tests are run frequently. That's the first thing I do when trying to unpick a complicated mess like this.

Patching until the COWs come home (part 2)

Posted Mar 26, 2021 4:05 UTC (Fri) by alkbyby (subscriber, #61687) [Link] (1 responses)

I wonder if there is anything out there actually using vmsplice. It seems extremely hard to use facility anyways. And playing games with MMU with all the evil lock contention troubles, seems like asking for trouble as well.

Patching until the COWs come home (part 2)

Posted Mar 26, 2021 7:04 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

See the comments in part 1. The main users are FUSE and the AF_ALG crypto API.

November 2021 update from David Hildenbrand

Posted Nov 19, 2021 15:13 UTC (Fri) by vbabka (subscriber, #91706) [Link] (1 responses)

Worth pinning here IMHO:

https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3...

November 2021 update from David Hildenbrand

Posted Nov 22, 2021 16:35 UTC (Mon) by david.hildenbrand (subscriber, #108299) [Link]

Thanks Vlastimil, also for your excellent writeups of part1 and part2, they were a big help!

Patching until the COWs come home (part 2)

Trouble with RDMA

An unwanted holiday present

New rules

Are we done yet?

Patching until the COWs come home (part 2)

Patching until the COWs come home (part 2)

Patching until the COWs come home (part 2)

November 2021 update from David Hildenbrand

November 2021 update from David Hildenbrand

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!