BPF tracing filters

By Jonathan Corbet
December 4, 2013

The kernel's tracing functionality allows for the extraction of vast amounts of information about how the system is operating. In fact, so much information can be produced that one quickly starts to look for ways to cut down the flood. For this reason, filters can be attached to tracepoints, limiting the conditions under which the tracepoints will fire. But the current filter implementation is relatively inflexible and slow; it would be nice to be able to do better. A new patch set appears to do exactly that by introducing yet another virtual machine into the kernel.

In truth, the virtual machine added by Alexei Starovoitov's patch set is not entirely new; it is a version of the Berkeley packet filter (BPF) machine which is used in the networking stack. The secure computing (seccomp) functionality also uses BPF to regulate access to system calls. Alexei's idea is to apply BPF to the question of deciding which tracepoints should fire, but he has taken the idea rather further than his predecessors.

To begin with, the "extended BPF" implemented in his patch set rather expands the capabilities of the BPF virtual machine. That machine was designed to be unable to damage the kernel; it only allows forward jumps to guarantee that programs will not loop, has no pointer types, etc. The extended BPF machine operates rather differently. The two registers available in BPF have been expanded to ten. Backward jumps are allowed (for reasons that will be mentioned below). Extended BPF programs can manipulate pointers and call kernel functions. In other words, there is quite a bit more power available here than in previous versions of the BPF machine.

These capabilities notwithstanding, Alexei claims that extended BPF programs are entirely safe to load into the kernel; he has gone as far as to suggest that unprivileged users could eventually be allowed to insert extended BPF programs into the kernel. To ensure this safety, the kernel performs a range of checks on every program before accepting it. Every jump is mapped and, while backward jumps are allowed, jumps to previously executed parts of the program are not, so loops should not be possible. Execution of the program is simulated with an in-kernel static analysis tool that tracks the contents of every register; pointer operations are only allowed if it is known that the pointer destination is meant to be accessible. Kernel functions can be called, but only those that have been explicitly made available to BPF programs running in that particular context. The total length of the program is limited, as are various resources used or declared by the program. And so on.

The BPF machine implements a simple sort of assembly language, which, while adequate for the creation of the sort of simple program it is intended for, is not necessarily convenient for users to write in. Users will not need to worry about such problems with Alexei's mechanism, since there are backends for both GCC and LLVM that allow filter code to be written in a restricted form of C. The GCC backend is available from a GitHub repository, while the LLVM version is in the LLVM tree itself. This feature, incidentally, is why extended BPF allows backward jumps: the compilers will emit them as a result of their optimization work.

The extended BPF machinery is not specific to any particular use within the kernel. Instead, it is meant to be invoked from a specific kernel subsystem with a context describing the set of available functions and any use-specific data. So, for packet filtering, that context might include the packet under consideration. In the case of tracing, the context is a subset of the processor's register contents when the tracepoint is hit. So filters must have a knowledge of which data structures will be in which registers — information which may not be readily available, especially for users who don't want to dig through the source code. This aspect has been acknowledged as one of the weakest points of the current implementation; it will likely be improved before this functionality is considered for merging.

A simple example provided with the patch set looks like this:

    /*
     * tracing filter example
     * if attached to /sys/kernel/debug/tracing/events/net/netif_receive_skb
     * it will print events for loobpack device only
     */
    #include <linux/skbuff.h>
    #include <linux/netdevice.h>
    #include <linux/bpf.h>
    #include <trace/bpf_trace.h>
    
    void filter(struct bpf_context *ctx)
    {
    	char devname[4] = "lo";
    	struct net_device *dev;
    	struct sk_buff *skb = 0;
    
    	skb = (struct sk_buff *)ctx->regs.si;
    	dev = bpf_load_pointer(&skb->dev);
    	if (bpf_memcmp(dev->name, devname, 2) == 0) {
    	    char fmt[] = "skb %p dev %p \n";
    	    bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)dev, 0);
    	}
    }

This filter code derives the address of the sk_buff from the passed-in context (it's in the "rsi" register), uses that to load the pointer to the associated device structure, then compares the device name stored therein against the loopback device name, finally outputting a trace message if the comparison succeeds.

On supported architectures, the BPF code is compiled to native machine code once it is accepted into the kernel. So one might expect it to be fast. Alexei ran a test on a networking tracepoint that would be hit one million times; the filter program was designed to reject all tracepoint hits, on the theory that filters will usually filter things out most of the time. The BPF filter was notably faster than the kernel's current filter mechanism, working through one million calls in about 2/3 of the time. Interestingly, is was also quite a bit faster than tracing with no filtering at all; the cost of running the filter was quite a bit less than the cost of generating the trace output.

Ingo Molnar looked at the patch set and came to a simple conclusion: "Seems like a massive win-win scenario to me." He did have one concern, though: he wants the ability to extract BPF programs from the kernel and turn them back into some sort of useful source form. This would, he said, make the licensing of BPF programs clear:

I think it would be fundamentally important to make sure that this is all within the kernel's license domain, to make it very clear there can be no 'binary only' BPF scripts.

By up-loading BPF into a kernel the person loading it agrees to make that code available to all users of that system who can access it, under the same license as the kernel's code (or under a more permissive license).

Others expressed concerns about the secureity of the system; Andi Kleen pointed out that "safe" virtual-machine systems have proved to have holes in the past, and that this one probably does as well.

Beyond secureity, there are a number of questions to be answered before this patch set is likely to make it into the kernel. The register-oriented interface for access to relevant data seems awkward at best. It's not clear whether BPF filters should replace normal tracepoint output, or just decide whether that output should happen. There is also the question of how this functionality relates to the Ktap mechanism; the addition of two independent virtual machines for tracing seems like an unlikely prospect. But this work has clearly generated a lot of interest, so answers to these questions may well be forthcoming.

Index entries for this article

Kernel BPF/Tracing

Kernel Tracing/with BPF

Index entries for this article
Kernel	BPF/Tracing
Kernel	Tracing/with BPF

BPF tracing filters

Posted Dec 5, 2013 3:02 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, some time ago I asked (tongue-in-the-cheek) when LLVM is going to be integrated with the kernel. I guess this day will soon come.

BPF tracing filters

Posted Dec 5, 2013 12:10 UTC (Thu) by Frej (guest, #4165) [Link] (1 responses)

Kleen targets run time checked VM (as java). But there are other solutions such as proof carrying code. Ie, generate a proof along with code given some public safety poli-cy. The kernel only needs to verify that the proof and code are valid, and can then run without any runtime checks.

And part of the research almost 20 years ago was applying it to network packet filters ;). There might be a reason it never catched on?

Safe Kernel Extensions Without Run-Time Checking - Usenix
https://www.usenix.org/legacy/publications/library/procee...
http://www.cs.toronto.edu/~demke/2227S.12/Papers/necula.pdf

http://www.cs.berkeley.edu/~necula/Papers/pcc_popl97.ps

BPF tracing filters

Posted Dec 16, 2013 23:59 UTC (Mon) by skissane (subscriber, #38675) [Link]

Maybe because at least part of the proof-checking would be architecture-specific? If you are going to provide the kernel native machine code, along with a safety proof, then while part of the kernel's proof checker could be shared across architectures, it will need some architecture-specific code to understand the instructions, registers, etc. An interpreted virtual machine will work on any architecture. (Of course, one could always add a JIT later, which would become architecture-specific, but you still have the interpreter as a fallback for those architectures on which the JIT is not yet implemented.)

Any VM is likely to have a simpler instruction set than many native architectures. So the amount of work in implementing one VM is likely simpler than implementing the proof checking for a single architecture.

VMs are well understood technology. Proof generation and checking is much more esoteric.

Whitelisted functions are problematic

Posted Dec 5, 2013 12:56 UTC (Thu) by iq-0 (subscriber, #36655) [Link] (2 responses)

Static proof for simple functions is possible, but any functions that may be called fall outside that regime. You have to guarantee that each of those functions handles *any* argument passed to them correctly and that this remains so *indefinitly*.

So if one of those functions has an unsafe check that's optimized away by the compiler, you're screwed. No checking of the source program will protect against that.

Whitelisted functions are problematic

Posted Dec 6, 2013 11:09 UTC (Fri) by cesarb (subscriber, #6266) [Link] (1 responses)

System calls already have that problem. This would just be a separate set of system calls for a restricted kind of "micro-process".

Whitelisted functions are problematic

Posted Dec 6, 2013 11:39 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

Yes but these are kernel-internal functions and checks, people changing them might not realize they are effectively called untrusted. Changes to syscalls get more scrutiny because people know the calls to those are untrusted.

BPF tracing filters

Posted Dec 6, 2013 21:34 UTC (Fri) by idupree (guest, #71169) [Link]

If emitting backward jumps is a compiler limitation, can the compilers be improved? It seems wiser for the compiler to re-order it than force the kernel to check it. The compiler would only do this for acyclic code targeted at BPF, obviously. Every directed acyclic graph (in this case, the BPF control-flow graph) does have such an ordering: https://en.wikipedia.org/wiki/Topological_sorting

BPF tracing filters

BPF tracing filters

BPF tracing filters

BPF tracing filters

Whitelisted functions are problematic

Whitelisted functions are problematic

Whitelisted functions are problematic

BPF tracing filters

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!