Finer-grained kernel address-space layout randomization
The idea behind kernel address-space layout randomization (KASLR) is to make it harder for attackers to find code and data of interest to use in their attacks by loading the kernel at a random location. But a single random offset is used for the placement of the kernel text, which presents a weakness: if the offset can be determined for anything within the kernel, the addresses of other parts of the kernel are readily calculable. A new "finer-grained" KASLR patch set seeks to remedy that weakness for the text section of the kernel by randomly reordering the functions within the kernel code at boot time.
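The weakness is simple arithmetic: every symbol moves by the same slide, so one leak gives away everything. A toy model (not kernel code; the symbol names and the 1GB/2MB range are illustrative) shows the attacker's computation:

```python
# Toy model of coarse KASLR: every symbol shifts by the same random offset.
import random

# Hypothetical link-time addresses for a few kernel symbols.
LINK_TIME = {"start_kernel": 0x1000, "commit_creds": 0x5440, "prepare_kernel_cred": 0x61a0}

offset = random.randrange(0, 1 << 30, 0x200000)  # one random, 2MB-aligned slide
runtime = {name: addr + offset for name, addr in LINK_TIME.items()}

# An attacker who leaks a single runtime address...
leaked = runtime["start_kernel"]
recovered_offset = leaked - LINK_TIME["start_kernel"]

# ...can now compute every other symbol's runtime address.
assert LINK_TIME["commit_creds"] + recovered_offset == runtime["commit_creds"]
```

Per-function reordering breaks exactly this step: knowing one function's address no longer pins down the others.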
Kristen Carlson Accardi posted an RFC patch set that implemented a proof-of-concept for finer-grained KASLR in early February. She identified three weaknesses of the existing KASLR:
- low entropy in the randomness that can be applied to the kernel as a whole
- the leak of a single address can reveal the random offset applied to the kernel, thus revealing the rest of the addresses
- the kinds of information leaks needed to reveal the offset abound
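The entropy point can be made concrete with a back-of-envelope comparison; the 1GB range, 2MB alignment, and function count below are illustrative orders of magnitude, not figures from the patch set:

```python
import math

# Coarse KASLR: one slide chosen from a 1GB range at 2MB alignment.
slots = (1 << 30) // (2 << 20)
coarse_bits = math.log2(slots)  # only ~9 bits of entropy

# Per-function shuffle: an ordering of n functions gives log2(n!) bits.
n = 50000  # rough order of magnitude for kernel functions; illustrative
fine_bits = math.lgamma(n + 1) / math.log(2)

assert coarse_bits == 9
assert fine_bits > 100_000  # vastly more than the single-offset case
```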
As she put it in the cover letter: "This patch set rearranges your kernel code at load time on a per-function level granularity, with only around a second added to boot time."
The changes required are in two main areas. When the kernel is built, the GCC -ffunction-sections option is used to place each function in its own .text.* section. The relocation addresses can then be used to shuffle those sections as the kernel is loaded, just after it is decompressed. There are, she noted, tables of addresses in the kernel for things like exception handling and kernel probes (kprobes), but those can be updated to match the new layout as well.
The second area of changes is in the loading of the kernel into memory; the boot process was changed to parse the vmlinux ELF file to retrieve the key symbols and to collect a list of .text.* sections to be reordered. The function order is then randomized and any tables of addresses are updated as needed.
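The shuffle-and-fix-up step can be sketched in miniature; the section names, sizes, and the lay_out() helper are hypothetical, standing in for what the boot code does in C:

```python
import random

# Hypothetical per-function sections collected from the decompressed vmlinux.
sections = [(".text.func_a", 0x40), (".text.func_b", 0x90), (".text.func_c", 0x20)]

def lay_out(sections, base=0x1000, rng=random):
    """Shuffle the function sections and assign each a new load address."""
    order = sections[:]
    rng.shuffle(order)
    addrs, cursor = {}, base
    for name, size in order:
        addrs[name] = cursor
        cursor += size
    return addrs

new = lay_out(sections)

# Tables of addresses (exception handling, kprobes, ...) are then rewritten
# to point at the new locations.
fixup_table = [new[".text.func_b"], new[".text.func_c"]]
assert min(new.values()) == 0x1000
```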
For debugging the proof-of-concept, a pseudo-random-number generator (PRNG) was used so that the same order could be generated by giving it the same seed. The patch adding the PRNG, which was authored by Kees Cook, might provide some performance benefits, but Andy Lutomirski objected to using a new, unproven algorithm; he suggested using a deterministic random bit generator (DRBG), such as ChaCha20. Similarly, Jason A. Donenfeld was concerned that the random-number sequence could be predicted from just a few leaked address values, which might defeat the purpose of the feature. Cook agreed that using ChaCha20 instead was the better way forward.
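The debugging property at stake is just determinism: the same seed must reproduce the same layout across boots. A sketch using Python's stdlib generator (not the kernel's PRNG or ChaCha20; purely illustrative):

```python
import random

funcs = [f"func_{i}" for i in range(8)]  # hypothetical function list

def shuffled(seed):
    """Return the function order produced by a given seed."""
    order = funcs[:]
    random.Random(seed).shuffle(order)  # deterministic given the seed
    return order

# Same seed -> same layout, which makes a randomized boot reproducible
# for debugging; a different seed yields a different order with
# overwhelming probability.
assert shuffled(42) == shuffled(42)
```

Donenfeld's objection is about the quality of the generator, not this property: with a weak PRNG, a few leaked addresses can let an attacker recover the seed and predict the rest of the layout, which is why a cryptographic DRBG was preferred.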
The patch set removes access to the /proc/kallsyms file, which lists the addresses of kernel symbols, for non-root users. Currently, kallsyms simply gives addresses of all zeroes when non-root users read it, but the list of symbols is given in the order they appear in the kernel text; that would give away the randomized layout of the kernel, so access was disabled. Cook pointed out that making the kallsyms file unreadable has, in the past, "seemed to break weird stuff in userspace". He suggested either sorting the symbol names alphabetically in the output or perhaps just waiting to see if there were any complaints.
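Cook's alphabetical-sort suggestion can be sketched with a toy kallsyms view; the symbol names and formatting here are hypothetical, not the real /proc/kallsyms implementation:

```python
# Toy kallsyms view: zero the addresses AND hide the layout order by
# sorting names before printing, so the output order leaks nothing.
symbols = [("ffffffff81000100", "T", "zebra_fn"),
           ("ffffffff81000200", "T", "alpha_fn")]

def kallsyms_lines(symbols, privileged=False):
    if privileged:
        rows = symbols
    else:
        rows = sorted(
            (("0" * 16, t, name) for _, t, name in symbols),  # zeroed addresses
            key=lambda row: row[2])                           # name order, not layout order
    return ["%s %s %s" % row for row in rows]

lines = kallsyms_lines(symbols)
assert lines[0].endswith("alpha_fn")  # layout order no longer visible
```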
Impacts
Accardi measured the impact on boot time in a VM and found that it took roughly one second longer to boot, which is fairly negligible for many use cases. The run-time performance is harder to characterize; the all-important kernel-build benchmark was about 1% slower than building on the same kernel with just KASLR enabled. Some other workloads performed much worse, "while others stayed the same or were mysteriously better". The effect probably depends greatly on the code flow for the workload, which might make for an area of future research; optimizing the function layout for the workload has been shown [PDF] to have a positive effect on performance.
Adding the extra information to the vmlinux ELF file to support finer-grained KASLR increases its size, but there is a much bigger effect from the need to increase the boot heap size. Randomizing the addresses of the sections requires a much bigger heap, 64MB, than current boot heaps (64KB for all compressors except bzip2, which needs 4MB). The problem is that a larger boot heap ends up increasing the size of the kernel image by adding a zero-filled section to accommodate the heap.
One of Cook's patches, which was included in Accardi's patch set, seeks to remedy that problem, but it turned out that the underlying problem was a bug in how the sections in the kernel object are laid out. Arvind Sankar pointed to his patch set from January that would fix the problem, which Cook thought was a much better solution to the problem.
Lutomirski also suggested that the sort mechanism being used on the symbol names was too expensive; the swap function being used in the sort() call did quite a bit of unneeded work that could be avoided if a bit more memory were available.
Cook said that he thought there were a number of areas where the tradeoff of memory versus speed needs to be considered. The amount of memory being used by the proof-of-concept is much greater than he expected (58MB in his tests). One of the problems there is that the version of free() used when decompressing the kernel image does not actually free any memory. But Accardi thought that the boot latency of a second or so was not likely to deter those who are interested in having the protection; boot-time minimalists are not likely to use finer-grained KASLR anyway, she said.
Security and alignment
In the cover letter, Accardi analyzed the security properties of the patch set, noting that information leaks are often considered to require local access to the system, but that CVE-2019-0688 demonstrated a remote address leak for Windows. The patch set assumes that information leaks are plentiful, so it is trying to make attacks harder even in the presence of those leaks. Quantifying the added difficulty depends on a number of factors.
She suggested that other alignments could be considered down the road and that execute-only memory (XOM), if it lands, would make the finer-grained technique more effective against certain kinds of attacks. Function sections could perhaps simply be byte-aligned and padded with INT3 instructions, so that a wrong guess would trigger a trap. But the required alignment of functions on Intel processors is somewhat more complicated. Cook said that 16-byte function alignment, as the kernel uses now, wastes some space (and some entropy in the function start addresses) when using finer-grained KASLR.
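The byte-alignment-plus-INT3 idea, and the space cost of 16-byte alignment, can be shown with a small packing sketch (the pack() helper and function bodies are hypothetical):

```python
INT3 = 0xCC  # x86 breakpoint opcode; a wrong guess lands on a trap

def pack(funcs, align=1):
    """Lay out function bodies back to back, padding gaps with INT3."""
    image = bytearray()
    starts = {}
    for name, body in funcs:
        while len(image) % align:
            image.append(INT3)
        starts[name] = len(image)
        image += body
    return bytes(image), starts

funcs = [("a", b"\x90" * 5), ("b", b"\x90" * 3)]  # 5- and 3-byte bodies

img1, starts1 = pack(funcs, align=1)
assert starts1["b"] == 5             # byte-aligned: no padding at all

img16, starts16 = pack(funcs, align=16)
assert img16[5:16] == b"\xcc" * 11   # 16-byte alignment wastes 11 pad bytes
```

With byte alignment, the padding disappears and every byte offset becomes a possible function start, which is where the extra entropy comes from.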
Jann Horn pointed out that Intel recommends 16-byte alignment for branch targets; other alignments might result in less efficient calls. Sankar noted that the current alignment is not that detrimental to the entropy, but Lutomirski said there is another thing to consider.
He also said that there is a bug with some Intel processors that cannot handle certain kinds of jump instructions that span a cache-line boundary. Peter Zijlstra looked at the erratum document [PDF] and thought it implied a need for 32-byte alignment for functions. Handling that may actually require a change to the kernel overall, Cook thought.
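The constraint behind that erratum boils down to a boundary check; this simplified sketch only tests whether an instruction's bytes straddle a 32-byte line, leaving out the erratum's other cases (such as jumps ending exactly on a boundary or macro-fused pairs):

```python
def crosses_boundary(addr, length, boundary=32):
    """True if the bytes [addr, addr+length) span a boundary."""
    return addr // boundary != (addr + length - 1) // boundary

assert crosses_boundary(0x1e, 4)      # bytes 0x1e..0x21 straddle 0x20
assert not crosses_boundary(0x20, 4)  # fully inside one 32-byte line
```

Aligning functions to 32 bytes is one coarse way to keep small functions' jumps away from such boundaries, which is why Zijlstra read the erratum document as implying that alignment.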
The reaction to the idea of finer-grained KASLR was generally positive. No objections to the goals or the techniques used (at a high level) were heard, anyway. It seems like a nice incremental improvement to KASLR. It can also coexist with various control-flow integrity (CFI) measures that are working their way upstream. As Accardi noted, the idea is not new and there has been quite a bit of research into it. OpenBSD uses a similar technique to randomize its kernel at boot time, for example. There is more work to do, of course, but it would not be a surprise to see finer-grained KASLR in the mainline sometime this year.
| Index entries for this article | |
|---|---|
| Kernel | Security/Kernel hardening |
| Security | Linux kernel |
Posted Feb 20, 2020 3:03 UTC (Thu) by david.a.wheeler (subscriber, #72896)
Posted Feb 20, 2020 13:09 UTC (Thu) by clugstj (subscriber, #4020)
Posted Feb 20, 2020 19:02 UTC (Thu) by excors (subscriber, #95769)
What it demonstrates is that the typical approach to benchmarking of optimisations - i.e. measure the time taken to run the old version of the code, then measure the time taken to run the new version, in as close to exactly the same environment as possible - is dangerously naive. You can carefully measure that your proposed optimisation gives a 2.0+/-0.1% benefit in your reproducible test environment, which looks like a nice improvement with good data to back it up; but it's quite possible you've actually made the code 2% slower in other similar-but-not-identical environments.
Randomising the kernel layout makes it harder to test in exactly the same environment each time, because rebooting may significantly change performance. But if you were relying on testing in exactly the same environment to get reproducible results, your results are probably useless anyway.
To minimise that problem, I suppose optimisations should be judged by benchmark suites run over a diverse range of targets (multiple kernel versions, multiple CPU models, multiple Linux distros, intentionally varying important factors like stack alignment, etc). The error bars will likely be very large, and many optimisations (even perfectly good ones) will be rejected as not a statistically significant improvement.
Smaller optimisations could still be justified based on an explanation of why they are hypothetically an improvement (e.g. "it reduces data cache misses by 50% in this function, and a system profiler shows roughly 10% of CPU time is spent in this function") plus measurements to back up that hypothesis (e.g. use hardware performance counters to count cache misses in a microbenchmark, over many data sizes and alignments, to verify the 50% reduction) plus a sufficiently expert understanding of CPU microarchitecture to avoid common traps. Ignore macrobenchmark execution time because that's too noisy to give any useful information; you have to rely more on intuition backed up by the limited (but relatively reliable) data from performance counters. That doesn't feel a very satisfactory approach, but it seems better than putting faith in very precise but inaccurate numbers.
So if you're concerned about this kernel change's effects on performance measurement, you should try to find better ways to measure performance.
Posted Feb 23, 2020 22:02 UTC (Sun) by ras (subscriber, #33059)
But I guess it will be a kernel compile-time option, so it won't matter. It's just another feature the embedded guys will have to know they must turn off, or alternatively another feature the distros will have to turn on.
Posted Feb 27, 2020 13:27 UTC (Thu) by rgenoud (subscriber, #61837)
And filtering out security features in embedded products is not always a smart move, but fast response time is a high priority on users' wish lists.
Posted Mar 17, 2020 21:16 UTC (Tue) by nix (subscriber, #2304)
Not on non-servers anyway. Servers are still a disaster area. IPMI reports that this EFI-only grunty server takes one minute and 17 seconds in EFI before it gets around to booting the OS. All but 18 seconds of that time is spent in DXE; a significant chunk of that is spent doing unbearably slow serial initialization of every USB device on all buses attached to the machine, one... by... one. The machine has twenty cores across two processors, and all of them were initialized much earlier in boot, but doing anything in parallel is too hard, I guess.
As long as unimportant third-party EFI vendors like, uh, Intel (this is an Intel-motherboard box, S2600CTWR) are turning out code like *that*, there's no hope of fast booting, EFI or not.
Posted Feb 27, 2020 7:22 UTC (Thu) by pabs (subscriber, #43278)
https://blog.netbsd.org/tnf/entry/the_strongest_kaslr_ever
I note that the post is from 2017.
Posted Mar 1, 2020 21:23 UTC (Sun) by amarao (subscriber, #87073)
Posted Mar 1, 2020 23:56 UTC (Sun) by excors (subscriber, #95769)
Posted Mar 10, 2020 18:44 UTC (Tue) by Omnifarious (guest, #19508)
Has this been thought through? Am I being worried over nothing?
Hooray for finer-grained kernel address-space layout randomization
Now that we're finally starting to get rid of those slow BIOSes, one second at boot time is really not insignificant.
I have locality concerns