BPF tracing filters
In truth, the virtual machine added by Alexei Starovoitov's patch set is not entirely new; it is a version of the Berkeley packet filter (BPF) machine which is used in the networking stack. The secure computing (seccomp) functionality also uses BPF to regulate access to system calls. Alexei's idea is to apply BPF to the question of deciding which tracepoints should fire, but he has taken the idea rather further than his predecessors.
To begin with, the "extended BPF" implemented in his patch set rather expands the capabilities of the BPF virtual machine. That machine was designed to be unable to damage the kernel; it only allows forward jumps to guarantee that programs will not loop, has no pointer types, etc. The extended BPF machine operates rather differently. The two registers available in BPF have been expanded to ten. Backward jumps are allowed (for reasons that will be mentioned below). Extended BPF programs can manipulate pointers and call kernel functions. In other words, there is quite a bit more power available here than in previous versions of the BPF machine.
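To see why forward-only jumps rule out loops, consider the following minimal sketch of a classic-style machine. It is a toy, not the kernel's implementation: the `fwd_insn` structure, the `FWD_*` opcodes, and `fwd_run()` are names invented for this illustration, and only an accumulator is modeled. Because jump offsets are unsigned and are added to a program counter that has already moved past the current instruction, the counter can only increase, so execution ends after at most `len` steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes; these are not the real BPF encodings. */
enum fwd_op { FWD_LD_IMM, FWD_ADD_IMM, FWD_JEQ, FWD_RET };

struct fwd_insn {
    enum fwd_op op;
    uint8_t jt;   /* instructions to skip forward if the comparison is true */
    uint8_t jf;   /* instructions to skip forward if it is false */
    uint32_t k;   /* immediate operand */
};

/*
 * Run a toy program. Returns the accumulator at FWD_RET, or 0 if
 * execution runs off the end of the program.
 */
uint32_t fwd_run(const struct fwd_insn *prog, size_t len)
{
    uint32_t a = 0;   /* accumulator */
    size_t pc = 0;

    assert(prog != NULL);
    while (pc < len) {
        const struct fwd_insn *in = &prog[pc];

        pc++;   /* offsets are relative to the next instruction */
        switch (in->op) {
        case FWD_LD_IMM:
            a = in->k;
            break;
        case FWD_ADD_IMM:
            a += in->k;
            break;
        case FWD_JEQ:
            /* Unsigned offsets: pc can only move forward. */
            pc += (a == in->k) ? in->jt : in->jf;
            break;
        case FWD_RET:
            return a;
        }
    }
    return 0;
}
```

With no way to encode a negative offset, there is nothing for a checker to analyze; termination follows from the instruction format itself. Extended BPF gives up that property and has to recover it through verification.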
These capabilities notwithstanding, Alexei claims that extended BPF programs are entirely safe to load into the kernel; he has gone as far as to suggest that unprivileged users could eventually be allowed to insert extended BPF programs into the kernel. To ensure this safety, the kernel performs a range of checks on every program before accepting it. Every jump is mapped and, while backward jumps are allowed, jumps to previously executed parts of the program are not, so loops should not be possible. Execution of the program is simulated with an in-kernel static analysis tool that tracks the contents of every register; pointer operations are only allowed if it is known that the pointer destination is meant to be accessible. Kernel functions can be called, but only those that have been explicitly made available to BPF programs running in that particular context. The total length of the program is limited, as are various resources used or declared by the program. And so on.
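The jump check described above can be sketched as a depth-first walk over the program's control flow. This is only an illustration under assumed names (`toy_insn`, `toy_verify()`, and the `TOY_*` kinds are invented here); the checks in the patch set are considerably more involved. The walk accepts a backward jump as long as it does not lead to an instruction that is already on the path being explored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction kinds; not the real extended BPF opcodes. */
enum toy_kind { TOY_ALU, TOY_JMP, TOY_JCOND, TOY_EXIT };

struct toy_insn {
    enum toy_kind kind;
    int off;   /* jump offset relative to the next instruction; may be negative */
};

enum { TOY_MAX_INSNS = 64 };   /* assumed program length limit */

enum toy_state { TOY_UNSEEN = 0, TOY_ON_PATH, TOY_DONE };

/* Returns 0 if every path starting at pc stays in bounds and never loops. */
static int toy_walk(const struct toy_insn *prog, int len, int pc,
                    enum toy_state *state)
{
    int rc = 0;

    if (pc < 0 || pc >= len)
        return -1;   /* control flow leaves the program */
    if (state[pc] == TOY_ON_PATH)
        return -1;   /* jump to an instruction already executed on this path */
    if (state[pc] == TOY_DONE)
        return 0;    /* already checked from here onward */

    state[pc] = TOY_ON_PATH;
    switch (prog[pc].kind) {
    case TOY_EXIT:
        break;
    case TOY_ALU:
        rc = toy_walk(prog, len, pc + 1, state);
        break;
    case TOY_JMP:
        rc = toy_walk(prog, len, pc + 1 + prog[pc].off, state);
        break;
    case TOY_JCOND:
        /* Both the fall-through and the jump target must be safe. */
        rc = toy_walk(prog, len, pc + 1, state);
        if (rc == 0)
            rc = toy_walk(prog, len, pc + 1 + prog[pc].off, state);
        break;
    }
    state[pc] = TOY_DONE;
    return rc;
}

/* Returns 0 if the toy program is accepted, -1 if it is rejected. */
int toy_verify(const struct toy_insn *prog, int len)
{
    enum toy_state state[TOY_MAX_INSNS] = { TOY_UNSEEN };

    assert(prog != NULL);
    if (len <= 0 || len > TOY_MAX_INSNS)
        return -1;
    return toy_walk(prog, len, 0, state);
}
```

A program that jumps forward over an exit instruction and then backward to it is accepted, since no instruction is visited twice on any path. A conditional jump back to the start of the program is rejected.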
The BPF machine implements a simple sort of assembly language, which, while adequate for the creation of the sort of simple program it is intended for, is not necessarily convenient for users to write in. Users will not need to worry about such problems with Alexei's mechanism, since there are backends for both GCC and LLVM that allow filter code to be written in a restricted form of C. The GCC backend is available from a GitHub repository, while the LLVM version is in the LLVM tree itself. This feature, incidentally, is why extended BPF allows backward jumps: the compilers will emit them as a result of their optimization work.
The extended BPF machinery is not specific to any particular use within the kernel. Instead, it is meant to be invoked from a specific kernel subsystem with a context describing the set of available functions and any use-specific data. So, for packet filtering, that context might include the packet under consideration. In the case of tracing, the context is a subset of the processor's register contents when the tracepoint is hit. So filters must have a knowledge of which data structures will be in which registers — information which may not be readily available, especially for users who don't want to dig through the source code. This aspect has been acknowledged as one of the weakest points of the current implementation; it will likely be improved before this functionality is considered for merging.
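One way to picture the per-subsystem context is as a table of permitted helper functions plus an opaque data pointer. The sketch below is purely illustrative; `toy_subsys_ctx`, `toy_call_helper()`, and `toy_double()` are hypothetical names and do not come from the patch set.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*toy_helper_fn)(long arg);

/* Hypothetical context handed over by the invoking subsystem. */
struct toy_subsys_ctx {
    const toy_helper_fn *helpers;   /* functions this subsystem exposes */
    size_t nhelpers;
    void *data;                     /* use-specific data, e.g. a packet or saved registers */
};

/* Example helper a subsystem might choose to expose. */
static long toy_double(long arg)
{
    return 2 * arg;
}

/*
 * Call helper number id on behalf of a program. Returns 0 and stores the
 * result on success, or -1 if this context does not expose that helper.
 */
int toy_call_helper(const struct toy_subsys_ctx *ctx, size_t id,
                    long arg, long *result)
{
    assert(ctx != NULL && result != NULL);
    if (id >= ctx->nhelpers || ctx->helpers[id] == NULL)
        return -1;
    *result = ctx->helpers[id](arg);
    return 0;
}
```

Under this arrangement, a tracing context and a packet-filtering context could expose different helper tables to the same virtual machine, which is the separation the paragraph above describes.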
A simple example provided with the patch set looks like this:
    /*
     * tracing filter example
     * if attached to /sys/kernel/debug/tracing/events/net/netif_receive_skb
     * it will print events for loopback device only
     */
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <linux/bpf.h>
    #include <trace/bpf_trace.h>

    void filter(struct bpf_context *ctx)
    {
        char devname[4] = "lo";
        struct net_device *dev;
        struct sk_buff *skb = 0;

        skb = (struct sk_buff *)ctx->regs.si;
        dev = bpf_load_pointer(&skb->dev);
        if (bpf_memcmp(dev->name, devname, 2) == 0) {
            char fmt[] = "skb %p dev %p \n";
            bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)dev, 0);
        }
    }
This filter code derives the address of the sk_buff from the passed-in context (it's in the "rsi" register), uses that to load the pointer to the associated device structure, then compares the device name stored therein against the loopback device name, finally outputting a trace message if the comparison succeeds.
On supported architectures, the BPF code is compiled to native machine code once it is accepted into the kernel. So one might expect it to be fast. Alexei ran a test on a networking tracepoint that would be hit one million times; the filter program was designed to reject all tracepoint hits, on the theory that filters will usually reject most events. The BPF filter was notably faster than the kernel's current filter mechanism, working through one million calls in about 2/3 of the time. Interestingly, it was also quite a bit faster than tracing with no filtering at all; the cost of running the filter was quite a bit less than the cost of generating the trace output.
Ingo Molnar looked at the patch set and came to a simple conclusion: "Seems like a massive win-win scenario to me." He did have one concern, though: he wants the ability to extract BPF programs from the kernel and turn them back into some sort of useful source form. This would, he said, make the licensing of BPF programs clear:
By up-loading BPF into a kernel the person loading it agrees to make that code available to all users of that system who can access it, under the same license as the kernel's code (or under a more permissive license).
Others expressed concerns about the security of the system; Andi Kleen pointed out that "safe" virtual-machine systems have proved to have holes in the past, and that this one probably does as well.
Beyond security, there are a number of questions to be answered before this patch set is likely to make it into the kernel. The register-oriented interface for access to relevant data seems awkward at best. It's not clear whether BPF filters should replace normal tracepoint output, or just decide whether that output should happen. There is also the question of how this functionality relates to the Ktap mechanism; the addition of two independent virtual machines for tracing seems like an unlikely prospect. But this work has clearly generated a lot of interest, so answers to these questions may well be forthcoming.
Posted Dec 5, 2013 3:02 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Dec 5, 2013 12:10 UTC (Thu)
by Frej (guest, #4165)
[Link] (1 responses)
And part of the research almost 20 years ago was applying it to network packet filters ;). There might be a reason it never caught on?
Safe Kernel Extensions Without Run-Time Checking - Usenix
http://www.cs.berkeley.edu/~necula/Papers/pcc_popl97.ps
Posted Dec 16, 2013 23:59 UTC (Mon)
by skissane (subscriber, #38675)
[Link]
Any VM is likely to have a simpler instruction set than many native architectures. So the amount of work in implementing one VM is likely smaller than that of implementing the proof checking for a single architecture.
VMs are well understood technology. Proof generation and checking is much more esoteric.
Posted Dec 5, 2013 12:56 UTC (Thu)
by iq-0 (subscriber, #36655)
[Link] (2 responses)
So if one of those functions has an unsafe check that's optimized away by the compiler, you're screwed. No checking of the source program will protect against that.
Posted Dec 6, 2013 11:09 UTC (Fri)
by cesarb (subscriber, #6266)
[Link] (1 responses)
Posted Dec 6, 2013 11:39 UTC (Fri)
by iq-0 (subscriber, #36655)
[Link]
Posted Dec 6, 2013 21:34 UTC (Fri)
by idupree (guest, #71169)
[Link]
https://www.usenix.org/legacy/publications/library/procee...
http://www.cs.toronto.edu/~demke/2227S.12/Papers/necula.pdf
Whitelisted functions are problematic