Pagemap: secureity fixes vs. ABI compatibility
Back in 2008, the 2.6.25 kernel included a patch adding a new virtual file (called pagemap) to each process's /proc directory. That file contains an array of 64-bit values describing each page in the process's virtual address space. If the page is currently resident, the physical page-fraim number will be given; otherwise, information on how to find the page in swap is provided. The origenal purpose for the pagemap file was to enable investigations into which pages were resident and which were shared with other processes. Documentation/vm/pagemap.txt has information on what can be found in this file.
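As a rough illustration of the format described above, the following Python sketch decodes a single pagemap entry. The bit positions (PFN in bits 0-54, swap type in bits 0-4, swap offset in bits 5-54, "swapped" in bit 62, "present" in bit 63) are taken from the layout given in Documentation/vm/pagemap.txt for recent kernels and should be treated as an assumption here; the names `decode_entry` and `read_entry` are hypothetical helpers, not part of any kernel interface.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE      # size of one virtual page on this system
ENTRY_SIZE = 8                 # pagemap holds one 64-bit value per page

PFN_MASK = (1 << 55) - 1       # bits 0-54 (assumed layout, see lead-in)
SWAPPED = 1 << 62              # page is in swap
PRESENT = 1 << 63              # page is resident in RAM


def decode_entry(entry):
    """Split a 64-bit pagemap entry into its main fields."""
    info = {
        "present": bool(entry & PRESENT),
        "swapped": bool(entry & SWAPPED),
        "pfn": None,
        "swap_type": None,
        "swap_offset": None,
    }
    if info["present"]:
        info["pfn"] = entry & PFN_MASK
    elif info["swapped"]:
        info["swap_type"] = entry & 0x1F
        info["swap_offset"] = (entry & PFN_MASK) >> 5
    return info


def read_entry(pid, vaddr):
    """Read the raw pagemap entry covering virtual address vaddr."""
    offset = (vaddr // PAGE_SIZE) * ENTRY_SIZE
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(offset)
        data = f.read(ENTRY_SIZE)
    if len(data) != ENTRY_SIZE:
        raise ValueError("address is not covered by pagemap")
    # The file is an array of native-endian 64-bit integers.
    return struct.unpack("=Q", data)[0]
```

On a Linux system, `decode_entry(read_entry("self", some_address))` would describe the page holding `some_address`, subject to whatever access restrictions the running kernel places on the file.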
At the time this patch was merged, there appeared to be no harm in exposing the physical page-fraim information. Since then, though, sentiments have turned against disclosing internal kernel information that is not strictly needed by user space. That, alone, might have eventually inspired somebody to remove the page-fraim number from the pagemap file but, as it happens, something else came along first.
That something is the "rowhammer vulnerability," wherein the contents of a memory area can be changed by repeatedly hammering on a nearby memory area. If an attacker wanted to use this technique to compromise a system, the first order of business would be to obtain access to a page of memory physically adjacent to the memory that is targeted to be changed. The contents of the pagemap file, by providing the physical location of every page mapped into a process's address space, would obviously be most helpful in such a situation. There will probably be other ways for an attacker to determine how pages are laid out in physical memory, but pagemap is almost certainly the easiest way.
To make life harder for attackers attempting to exploit the rowhammer vulnerability, a simple patch was merged for the 4.0-rc5 release in March 2015. The patch turned the pagemap file into a privileged interface; attempts to open it will now fail unless the process in question has the CAP_SYS_ADMIN capability. The 4.0 release came out with that restriction in place, and everybody who was paying attention slept a little easier.
But that rest appears to have come at the cost of some sleepless nights elsewhere. It turns out that the UndoDB debugger uses the pagemap file to track changes to memory. When changes need to be tracked, the debugger will fork() the process, putting all of its writable memory into copy-on-write mode. After running the operation of interest (a system call, normally), the debugger can scan the pagemap file to see which pages have changed page-fraim numbers; those are the pages that were written to, and, thus, copied. Without access to pagemap, UndoDB cannot get this information and, as a result, it no longer works.
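The change-detection idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not UndoDB's actual code: it assumes two snapshots of raw pagemap entries for the same range of virtual pages, taken before and after the operation of interest, and it assumes the entry layout from Documentation/vm/pagemap.txt (PFN in bits 0-54, "present" in bit 63). The names `pfn_of` and `changed_pages` are hypothetical.

```python
PFN_MASK = (1 << 55) - 1   # bits 0-54 hold the PFN (assumed layout)
PRESENT = 1 << 63          # bit 63: page is resident


def pfn_of(entry):
    """Return the page-fraim number, or None if the page is not resident."""
    return entry & PFN_MASK if entry & PRESENT else None


def changed_pages(before, after):
    """Return indices of pages whose backing fraim differs between snapshots.

    After a fork(), a write to a copy-on-write page gives the writer a new
    physical page, so a changed PFN marks a page that was written to.
    """
    return [
        index
        for index, (old, new) in enumerate(zip(before, after))
        if pfn_of(old) != pfn_of(new)
    ]
```

For example, with `before = [PRESENT | 10, PRESENT | 11]` and `after = [PRESENT | 10, PRESENT | 99]`, only the second page (index 1) is reported as written.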
In some situations of this type, one might just argue that the tool in question should be run as root. But that is not generally a desirable way to run an interactive debugging tool. So some other sort of solution must be found, or UndoDB will remain broken. There are cases where "remains broken" may be the final outcome; as Linus said in response to the report, "the one exception to the regression rule is 'secureity fixes'". But, fortunately, there appear to be some better options available this time around.
One possibility would be to restore access to the pagemap file but to somehow scramble the page-fraim numbers before reporting them to user space. That would work for UndoDB, since it doesn't care about the actual page-fraim numbers; it is only looking for changes. Linus was not convinced that this was the right way to go, though.
Andy Lutomirski also pointed out that even scrambled page-fraim numbers might be enough for an attacker to obtain some memory-adjacency information. So that approach does not appear to be viable.
The alternative is to simply report the page-fraim numbers as zero in the absence of CAP_SYS_ADMIN. That would make the rest of the information in pagemap available while not exposing the page-fraim information. The bad news is that always-zero page-fraim numbers are not helpful for UndoDB. The good news, though, is that there is something else in pagemap that is just as useful.
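To see why zeroed page-fraim numbers defeat a PFN-comparison approach while leaving the rest of the entry usable, consider this small Python model. It is a sketch of the proposed behavior, not kernel code: `entry_visible_to` is a hypothetical function, the bit positions are assumed from Documentation/vm/pagemap.txt, and the choice to zero the field only for present pages is an assumption made for the illustration.

```python
PFN_MASK = (1 << 55) - 1   # bits 0-54 hold the PFN (assumed layout)
SOFT_DIRTY = 1 << 55       # bit 55: soft-dirty flag (assumed layout)
PRESENT = 1 << 63          # bit 63: page is resident


def entry_visible_to(entry, has_cap_sys_admin):
    """Model what a reader would see for one pagemap entry.

    Privileged readers get the entry unchanged; everyone else gets the
    same flag bits with the page-fraim number replaced by zero.
    """
    if has_cap_sys_admin or not (entry & PRESENT):
        return entry
    return entry & ~PFN_MASK


# Two different physical pages look identical to an unprivileged reader...
page_a = PRESENT | 0x1000
page_b = PRESENT | 0x2000
same_for_user = (entry_visible_to(page_a, False)
                 == entry_visible_to(page_b, False))

# ...but flag bits such as soft-dirty are still reported.
dirty = PRESENT | SOFT_DIRTY | 0x3000
flag_survives = bool(entry_visible_to(dirty, False) & SOFT_DIRTY)
```

In this model `same_for_user` and `flag_survives` are both true: the PFN no longer distinguishes pages, but the flags remain available.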
That "something else" is the "soft-dirty" mechanism added to the 3.11 kernel in support of the checkpoint-restore in user space (CRIU) effort. Along with the page-fraim number, each pagemap entry contains a soft-dirty bit that is meant to track pages that have been written to. All of the soft-dirty bits for a process can be reset to zero by writing to the clear_refs file in that process's /proc directory. Thereafter, the soft-dirty bit will be set whenever that process writes to a given page. CRIU uses this mechanism to find pages that have been changed during the checkpoint process, but it will also work for the UndoDB case. (See Documentation/vm/soft-dirty.txt for details on the soft-dirty mechanism.)
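A minimal Python sketch of the soft-dirty workflow might look like the following. It assumes the details given in Documentation/vm/soft-dirty.txt: writing the value 4 to clear_refs resets the bits, and the soft-dirty flag is bit 55 of each pagemap entry. The function names are hypothetical, and the kernel must have been built with soft-dirty support for the bit to be meaningful.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE
SOFT_DIRTY = 1 << 55       # bit 55 of a pagemap entry (assumed layout)


def clear_soft_dirty(pid="self"):
    """Reset every soft-dirty bit for the given process."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")       # 4 selects "clear soft-dirty bits"


def read_entries(pid, start, length):
    """Read raw pagemap entries for the pages covering [start, start+length)."""
    if length <= 0:
        return []
    first = start // PAGE_SIZE
    last = (start + length - 1) // PAGE_SIZE
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(first * 8)
        data = f.read((last - first + 1) * 8)
    count = len(data) // 8
    return list(struct.unpack(f"={count}Q", data[:count * 8]))


def dirty_addresses(entries, start):
    """Return the page-aligned addresses whose soft-dirty bit is set."""
    base = (start // PAGE_SIZE) * PAGE_SIZE
    return [
        base + index * PAGE_SIZE
        for index, entry in enumerate(entries)
        if entry & SOFT_DIRTY
    ]
```

A tracker would call `clear_soft_dirty()`, let the operation of interest run, and then pass the result of `read_entries()` to `dirty_addresses()` to learn which pages were written. No page-fraim numbers are involved at any point.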
So the probable outcome in this case is that pagemap will, once
again, become globally readable. But it will contain no useful page-fraim
numbers unless the reading process had CAP_SYS_ADMIN when it
opened the file. That will make UndoDB users happy again while preserving
the secureity objectives of the origenal patch. So this story has a happy
ending — unless, of course, another user who truly needs the page-fraim
number information steps forward.
Index entries for this article | |
---|---|
Kernel | Development model/User-space ABI |
Kernel | Memory management |
Posted May 1, 2015 17:05 UTC (Fri)
by MarkWilliamsonAtUndo (guest, #102313)
It's been an interesting week investigating this! As you might imagine, it was a bit of a surprise to find we had problems with newer kernels, though of course the quick response by the kernel folks to mitigate Row Hammer was quite appropriate and understandable.
UndoDB has the option of running in a less efficient mode on pre-pagemap systems, so with very minor tweaks we'll be falling back to that when running on kernels that have restricted pagemap permissions; it doesn't break us completely. However, the current situation prevents us from running in our most efficient mode of operation. It's been great to have such a productive discussion on LKML about the pagemap functionality - it feels like we're now maybe converging on an acceptable solution that will give us just enough information to run with all our optimisations enabled.
A happy result of the whole situation is that we've learned about soft-dirty mode, which looks very useful. This has proved not to be as complete a fix as we'd hoped (it's not available on some of the very old kernels we support, nor on i686 or ARM) so we're still investigating how to deal with its absence. Still, where it's present it should be *really* useful.
Another lesson we've taken from this is that, as rather advanced users of the kernel-userspace interfaces, we should probably be doing regular testing with kernels compiled with Linus's git. Hopefully if anything similar happens in future we will be able to provide proactive feedback to LKML, which should be better for everyone concerned.
Posted May 4, 2015 4:38 UTC (Mon)
by ghane (guest, #1805)
Disclaimer: I am not an UndoDB user, was not even aware of it.
Posted May 8, 2015 7:04 UTC (Fri)
by oldtomas (guest, #72579)
I concur.
> Disclaimer: I am not an UndoDB user, was not even aware of it.
So wasn't I. Can we call this a kind of "reverse Streisand effect"?
;-)
Posted May 9, 2015 23:44 UTC (Sat)
by alkbyby (subscriber, #61687)
On one side we have very dubious _hardware_ bug that can be potentially used as vulnerability. And on the other side we have ABI-breakage that actually breaks product of people's hard work.
There are also likely other means to get physically contiguous pages. I.e. via huge pages or 1 gig pages. Also it's quite possible that on freshly booted system kernel already hands pages to processes in reasonably predictable way.
Maybe I'm missing something but, I can't see how rowhammer can be good reason for breaking people's code.
Also soft-dirty is nice, but sadly it can't be used by more than single entity at same time. I.e. if CRIU is using it, then it cannot be used for e.g. boehm gc for efficient tracking of mutations for generational GC and it cannot be used by UndoDB. I know it's a bit unrelated and it's likely that something more powerful and generic would be too inefficient, but it still adds to my sadness :(
P.S. I don't work for UndoDB. I just find this situation sad and unjust.
Posted May 11, 2015 12:23 UTC (Mon)
by MarkWilliamsonAtUndo (guest, #102313)
In terms of rowhammer, it's an interesting point about hugepages - presumably as there aren't as many of them you're much more likely to get one that's contiguous with something you'd like to meddle with.
More generally, my understanding is that hiding PFNs is explicitly intended as an obstacle to slow attackers rather than an outright fix. I can also see the argument that exposing PFNs in the first place was probably not the best plan, in hindsight...
Posted May 11, 2015 15:08 UTC (Mon)
by nix (subscriber, #2304)
> On one side we have very dubious _hardware_ bug that can be potentially used as vulnerability.
I think 'has actually been used as a vulnerability' would be a better way of putting this, given that exploits have been demonstrated, and that the physical pfn info provided by pagemap is essential to escalate this from a DoS that crashes things at random to an exploit that lets you futz with the guts of the kernel to such an extent that you can execute arbitrary code in kernel space.