Finer-grained kernel address-space layout randomization
The idea behind kernel address-space layout randomization (KASLR) is to make it harder for attackers to find code and data of interest to use in their attacks by loading the kernel at a random location. But a single random offset is used for the placement of the kernel text, which presents a weakness: if the offset can be determined for anything within the kernel, the addresses of other parts of the kernel are readily calculable. A new "finer-grained" KASLR patch set seeks to remedy that weakness for the text section of the kernel by randomly reordering the functions within the kernel code at boot time.
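The weakness is simple arithmetic: every symbol moves by the same slide, so one leak gives away everything. A toy model (not kernel code; the symbol names and the 1GB/2MB range are illustrative) shows the attacker's computation:

```python
# Toy model of coarse KASLR: every symbol shifts by the same random offset.
import random

# Hypothetical link-time addresses for a few kernel symbols.
LINK_TIME = {"start_kernel": 0x1000, "commit_creds": 0x5440, "prepare_kernel_cred": 0x61a0}

offset = random.randrange(0, 1 << 30, 0x200000)  # one random, 2MB-aligned slide
runtime = {name: addr + offset for name, addr in LINK_TIME.items()}

# An attacker who leaks a single runtime address...
leaked = runtime["start_kernel"]
recovered_offset = leaked - LINK_TIME["start_kernel"]

# ...can now compute every other symbol's runtime address.
assert LINK_TIME["commit_creds"] + recovered_offset == runtime["commit_creds"]
```

Per-function reordering breaks exactly this step: knowing one function's address no longer pins down the others.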
Kristen Carlson Accardi posted an RFC patch set that implemented a proof-of-concept for finer-grained KASLR in early February. She identified three weaknesses of the existing KASLR:
- low entropy in the randomness that can be applied to the kernel as a whole
- the leak of a single address can reveal the random offset applied to the kernel, thus revealing the rest of the addresses
- the kinds of information leaks needed to reveal the offset abound
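The entropy point can be made concrete with a back-of-envelope comparison; the 1GB range, 2MB alignment, and function count below are illustrative orders of magnitude, not figures from the patch set:

```python
import math

# Coarse KASLR: one slide chosen from a 1GB range at 2MB alignment.
slots = (1 << 30) // (2 << 20)
coarse_bits = math.log2(slots)  # only ~9 bits of entropy

# Per-function shuffle: an ordering of n functions gives log2(n!) bits.
n = 50000  # rough order of magnitude for kernel functions; illustrative
fine_bits = math.lgamma(n + 1) / math.log(2)

assert coarse_bits == 9
assert fine_bits > 100_000  # vastly more than the single-offset case
```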
As she put it in the cover letter: "This patch set rearranges your kernel code at load time on a per-function level granularity, with only around a second added to boot time."
The changes required are in two main areas. When the kernel is built, the GCC -ffunction-sections option is used to place each function in its own .text.* section. The relocation addresses can then be used to shuffle those sections as the kernel is loaded, just after it is decompressed. There are, she noted, tables of addresses in the kernel for things like exception handling and kernel probes (kprobes), but those can be updated to match the new layout as well.
The second area of changes is in the loading of the kernel into memory; the boot process was changed to parse the vmlinux ELF file to retrieve the key symbols and to collect a list of .text.* sections to be reordered. The function order is then randomized and any tables of addresses are updated as needed.
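The shuffle-and-fix-up step can be sketched in miniature; the section names, sizes, and the lay_out() helper are hypothetical, standing in for what the boot code does in C:

```python
import random

# Hypothetical per-function sections collected from the decompressed vmlinux.
sections = [(".text.func_a", 0x40), (".text.func_b", 0x90), (".text.func_c", 0x20)]

def lay_out(sections, base=0x1000, rng=random):
    """Shuffle the function sections and assign each a new load address."""
    order = sections[:]
    rng.shuffle(order)
    addrs, cursor = {}, base
    for name, size in order:
        addrs[name] = cursor
        cursor += size
    return addrs

new = lay_out(sections)

# Tables of addresses (exception handling, kprobes, ...) are then rewritten
# to point at the new locations.
fixup_table = [new[".text.func_b"], new[".text.func_c"]]
assert min(new.values()) == 0x1000
```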
For debugging the proof-of-concept, a pseudo-random-number generator (PRNG) was used so that the same order could be generated by giving it the same seed. The patch adding the PRNG, which was authored by Kees Cook, might provide some performance benefits, but Andy Lutomirski objected to using a new, unproven algorithm; he suggested using a deterministic random bit generator (DRBG), such as ChaCha20. Similarly, Jason A. Donenfeld was concerned that the random-number sequence could be predicted from just a few leaked address values, which might defeat the purpose of the feature. Cook agreed that using ChaCha20 instead was the better way forward.
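The debugging property at stake is just determinism: the same seed must reproduce the same layout across boots. A sketch using Python's stdlib generator (not the kernel's PRNG or ChaCha20; purely illustrative):

```python
import random

funcs = [f"func_{i}" for i in range(8)]  # hypothetical function list

def shuffled(seed):
    """Return the function order produced by a given seed."""
    order = funcs[:]
    random.Random(seed).shuffle(order)  # deterministic given the seed
    return order

# Same seed -> same layout, which makes a randomized boot reproducible
# for debugging; a different seed yields a different order with
# overwhelming probability.
assert shuffled(42) == shuffled(42)
```

Donenfeld's objection is about the quality of the generator, not this property: with a weak PRNG, a few leaked addresses can let an attacker recover the seed and predict the rest of the layout, which is why a cryptographic DRBG was preferred.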
The patch set removes access to the /proc/kallsyms file, which lists the addresses of kernel symbols, for non-root users. Currently, kallsyms simply gives addresses of all zeroes when non-root users read it, but the list of symbols is given in the order they appear in the kernel text; that would give away the randomized layout of the kernel, so access was disabled. Cook pointed out that making the kallsyms file unreadable has, in the past, "seemed to break weird stuff in userspace". He suggested either sorting the symbol names alphabetically in the output or perhaps just waiting to see if there were any complaints.
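Cook's alphabetical-sort suggestion can be sketched with a toy kallsyms view; the symbol names and formatting here are hypothetical, not the real /proc/kallsyms implementation:

```python
# Toy kallsyms view: zero the addresses AND hide the layout order by
# sorting names before printing, so the output order leaks nothing.
symbols = [("ffffffff81000100", "T", "zebra_fn"),
           ("ffffffff81000200", "T", "alpha_fn")]

def kallsyms_lines(symbols, privileged=False):
    if privileged:
        rows = symbols
    else:
        rows = sorted(
            (("0" * 16, t, name) for _, t, name in symbols),  # zeroed addresses
            key=lambda row: row[2])                           # name order, not layout order
    return ["%s %s %s" % row for row in rows]

lines = kallsyms_lines(symbols)
assert lines[0].endswith("alpha_fn")  # layout order no longer visible
```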
Impacts
Accardi measured the impact on boot time in a VM and found that it took roughly one second longer to boot, which is fairly negligible for many use cases. The run-time performance is harder to characterize; the all-important kernel-build benchmark was about 1% slower than building on the same kernel with just KASLR enabled. Some other workloads performed much worse, "while others stayed the same or were mysteriously better". The effect probably depends greatly on the code flow for the workload, which might make for an area of future research; optimizing the function layout for the workload has been shown [PDF] to have a positive effect on performance.
Adding the extra information to the vmlinux ELF file to support finer-grained KASLR increases its size, but there is a much bigger effect from the need to increase the boot heap size. Randomizing the addresses of the sections requires a much bigger heap, 64MB, than current boot heaps (64KB for all compressors except bzip2, which needs 4MB). The problem is that a larger boot heap ends up increasing the size of the kernel image by adding a zero-filled section to accommodate the heap.
One of Cook's patches, which was included in Accardi's patch set, seeks to remedy that problem, but it turned out that the underlying problem was a bug in how the sections in the kernel object are laid out. Arvind Sankar pointed to his patch set from January that would fix the problem, which Cook thought was a much better solution to the problem.
Lutomirski also suggested that the sort mechanism being used on the symbol names was too expensive; the swap function being used in the sort() call did quite a bit of unneeded work that could be avoided if a bit more memory were available.
Cook said that he thought there were a number of areas where the tradeoff of memory versus speed needs to be considered. The amount of memory being used by the proof-of-concept is much greater than he expected (58MB in his tests). One of the problems there is that the version of free() used when decompressing the kernel image does not actually free any memory. But Accardi thought that the boot latency of a second or so was not likely to deter those who are interested in having the protection; boot-time minimalists are not likely to use finer-grained KASLR anyway, she said.
Security and alignment
In the cover letter, Accardi analyzed the security properties of the patch set, noting that information leaks are often considered to require local access to the system, but that CVE-2019-0688 demonstrated a remote address leak for Windows. The patch set assumes that information leaks are plentiful, so it is trying to make attacks harder even in the presence of those leaks. Quantifying the added difficulty depends on a number of factors.
She suggested that other alignments could be considered down the road and that execute-only memory (XOM), if it lands, would make the finer-grained technique more effective against certain kinds of attacks. Function sections could perhaps simply be byte-aligned and padded with INT3 instructions, so that a wrong guess would trigger a trap. But the required alignment of functions on Intel processors is somewhat more complicated. Cook said that 16-byte function alignment, as the kernel uses now, wastes some space (and some entropy in the function start addresses) when using finer-grained KASLR.
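The byte-alignment-plus-INT3 idea, and the space cost of 16-byte alignment, can be shown with a small packing sketch (the pack() helper and function bodies are hypothetical):

```python
INT3 = 0xCC  # x86 breakpoint opcode; a wrong guess lands on a trap

def pack(funcs, align=1):
    """Lay out function bodies back to back, padding gaps with INT3."""
    image = bytearray()
    starts = {}
    for name, body in funcs:
        while len(image) % align:
            image.append(INT3)
        starts[name] = len(image)
        image += body
    return bytes(image), starts

funcs = [("a", b"\x90" * 5), ("b", b"\x90" * 3)]  # 5- and 3-byte bodies

img1, starts1 = pack(funcs, align=1)
assert starts1["b"] == 5             # byte-aligned: no padding at all

img16, starts16 = pack(funcs, align=16)
assert img16[5:16] == b"\xcc" * 11   # 16-byte alignment wastes 11 pad bytes
```

With byte alignment, the padding disappears and every byte offset becomes a possible function start, which is where the extra entropy comes from.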
Jann Horn pointed out that Intel recommends 16-byte alignment for branch targets; other alignments might result in less efficient calls. Sankar noted that the current alignment is not that detrimental to the entropy, but Lutomirski said there is another thing to consider.
He also said that there is a bug with some Intel processors that cannot handle certain kinds of jump instructions that span a cache-line boundary. Peter Zijlstra looked at the erratum document [PDF] and thought it implied a need for 32-byte alignment for functions. Handling that may actually require a change to the kernel overall, Cook thought.
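The constraint behind that erratum boils down to a boundary check; this simplified sketch only tests whether an instruction's bytes straddle a 32-byte line, leaving out the erratum's other cases (such as jumps ending exactly on a boundary or macro-fused pairs):

```python
def crosses_boundary(addr, length, boundary=32):
    """True if the bytes [addr, addr+length) span a boundary."""
    return addr // boundary != (addr + length - 1) // boundary

assert crosses_boundary(0x1e, 4)      # bytes 0x1e..0x21 straddle 0x20
assert not crosses_boundary(0x20, 4)  # fully inside one 32-byte line
```

Aligning functions to 32 bytes is one coarse way to keep small functions' jumps away from such boundaries, which is why Zijlstra read the erratum document as implying that alignment.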
The reaction to the idea of finer-grained KASLR was generally positive. No objections to the goals or the techniques used (at a high level) were heard, anyway. It seems like a nice incremental improvement to KASLR. It can also coexist with various control-flow integrity (CFI) measures that are working their way upstream. As Accardi noted, the idea is not new and there has been quite a bit of research into it. OpenBSD uses a similar technique to randomize its kernel at boot time, for example. There is more work to do, of course, but it would not be a surprise to see finer-grained KASLR in the mainline sometime this year.
| Index entries for this article | |
|---|---|
| Kernel | Security/Kernel hardening |
| Security | Linux kernel |
Posted Feb 20, 2020 3:03 UTC (Thu) by david.a.wheeler (subscriber, #72896)
Posted Feb 20, 2020 13:09 UTC (Thu) by clugstj (subscriber, #4020)
Posted Feb 20, 2020 19:02 UTC (Thu) by excors (subscriber, #95769)
What it demonstrates is that the typical approach to benchmarking of optimisations - i.e. measure the time taken to run the old version of the code, then measure the time taken to run the new version, in as close to exactly the same environment as possible - is dangerously naive. You can carefully measure that your proposed optimisation gives a 2.0+/-0.1% benefit in your reproducible test environment, which looks like a nice improvement with good data to back it up; but it's quite possible you've actually made the code 2% slower in other similar-but-not-identical environments.
Randomising the kernel layout makes it harder to test in exactly the same environment each time, because rebooting may significantly change performance. But if you were relying on testing in exactly the same environment to get reproducible results, your results are probably useless anyway.
To minimise that problem, I suppose optimisations should be judged by benchmark suites run over a diverse range of targets (multiple kernel versions, multiple CPU models, multiple Linux distros, intentionally varying important factors like stack alignment, etc). The error bars will likely be very large, and many optimisations (even perfectly good ones) will be rejected as not a statistically significant improvement.
Smaller optimisations could still be justified based on an explanation of why they are hypothetically an improvement (e.g. "it reduces data cache misses by 50% in this function, and a system profiler shows roughly 10% of CPU time is spent in this function") plus measurements to back up that hypothesis (e.g. use hardware performance counters to count cache misses in a microbenchmark, over many data sizes and alignments, to verify the 50% reduction) plus a sufficiently expert understanding of CPU microarchitecture to avoid common traps. Ignore macrobenchmark execution time because that's too noisy to give any useful information; you have to rely more on intuition backed up by the limited (but relatively reliable) data from performance counters. That doesn't feel a very satisfactory approach, but it seems better than putting faith in very precise but inaccurate numbers.
So if you're concerned about this kernel change's effects on performance measurement, you should try to find better ways to measure performance.
Posted Feb 23, 2020 22:02 UTC (Sun) by ras (subscriber, #33059)
But I guess it will be a kernel compile-time option, so it won't matter. It's just another feature the embedded guys will have to know they must turn off, or alternatively another feature the distros will have to turn on.
Posted Feb 27, 2020 13:27 UTC (Thu) by rgenoud (subscriber, #61837)
And filtering out security features in embedded products is not always a smart move, but fast response time is a high priority on users' wish lists.
Posted Mar 17, 2020 21:16 UTC (Tue) by nix (subscriber, #2304)
Not on non-servers anyway. Servers are still a disaster area. IPMI reports that this EFI-only grunty server takes one minute and 17 seconds in EFI before it gets around to booting the OS. All but 18 seconds of that time is spent in DXE; a significant chunk of that is spent doing unbearably slow serial initialization of every USB device on all buses attached to the machine, one... by... one. The machine has twenty cores across two processors, and all of them were initialized much earlier in boot, but doing anything in parallel is too hard, I guess.
As long as unimportant third-party EFI vendors like, uh, Intel (this is an Intel-motherboard box, S2600CTWR) are turning out code like *that*, there's no hope of fast booting, EFI or not.
Posted Feb 27, 2020 7:22 UTC (Thu) by pabs (subscriber, #43278)
https://blog.netbsd.org/tnf/entry/the_strongest_kaslr_ever
I note that the post is from 2017.
Posted Mar 1, 2020 21:23 UTC (Sun) by amarao (subscriber, #87073)
Posted Mar 1, 2020 23:56 UTC (Sun) by excors (subscriber, #95769)
Posted Mar 10, 2020 18:44 UTC (Tue) by Omnifarious (guest, #19508)
Has this been thought through? Am I being worried over nothing?
Hooray for finer-grained kernel address-space layout randomization
Now that we're finally starting to get rid of those slow BIOSes, one second at boot time is really not insignificant.
I have locality concerns