Checkpoint and restore for seccomp filters
The checkpoint/restore (C/R) feature of the kernel has had a long and twisty path—with several rewrites from scratch along the way—before finally getting merged in the form of C/R in user space (CRIU). There are still some kernel resources that cannot be handled by CRIU, though, including secure computing (seccomp) filters. A patch set from Tycho Andersen seeks to change that.
Seccomp filters are written using the Berkeley Packet Filter (BPF) language, which targets an in-kernel virtual machine—and which has increasingly outgrown its name. In any case, for seccomp filters to be saved and restored, there needs to be a way to provide information about them, and about the BPF programs attached to them, to user space. Some of that depends on figuring out how to determine "equality" between BPF programs (i.e. which programs have been inherited or copied from another).
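For readers who have not seen one, a classic BPF seccomp filter is just a short array of fixed-size instructions. The sketch below assembles a four-instruction filter as raw bytes. The opcode and return-action values are the ones defined in the kernel's UAPI headers, but the helper names (`insn`, `deny_syscall`) are made up for illustration, and a real filter would also check the `arch` field of `struct seccomp_data` before trusting the syscall number.

```python
import struct

# Classic BPF opcodes (values from <linux/filter.h> / <linux/bpf_common.h>).
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K

# Seccomp return actions (values from <linux/seccomp.h>).
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000

EPERM = 1
SECCOMP_DATA_NR_OFFSET = 0  # offsetof(struct seccomp_data, nr)


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def deny_syscall(nr):
    """Build a filter that fails syscall `nr` with EPERM and allows the rest."""
    return b"".join([
        insn(BPF_LD_W_ABS, 0, 0, SECCOMP_DATA_NR_OFFSET),  # A = syscall number
        insn(BPF_JEQ_K, 0, 1, nr),                         # if A != nr, skip next
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM),  # matched: return EPERM
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),          # otherwise: allow
    ])


prog = deny_syscall(39)  # 39 is getpid on x86-64; purely illustrative
print(len(prog) // 8, "instructions,", len(prog), "bytes")
```

The bytes produced here are what a process would hand to the kernel in a `struct sock_fprog`; it is the kernel's internal eBPF translation of such a program that the patches under discussion aim to expose.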
As with other in-kernel BPF users, seccomp filters are converted to extended BPF (eBPF) internally. But there are plans to allow filters written in eBPF (rather than "classic" BPF) directly, which means that the eBPF verifier needs to be aware of a new type of program: BPF_PROG_TYPE_SECCOMP. So the first patch adds that type. Right now, seccomp filter programs cannot use some of the more advanced features of eBPF (notably maps), but that may eventually change. Meanwhile, BPF_PROG_TYPE_SECCOMP can be used to restrict the types of BPF programs that can be dumped, since the patches do not add support for maps, at least yet.
Next, the patches add a mechanism to actually dump an eBPF program, along with a bit of metadata (the GPL status of the program). The bpf() system call, using the new BPF_PROG_DUMP command, is used to accomplish that. The caller must provide a file descriptor for the program and a buffer of sufficient length to hold the instructions. The buffer is filled in; the number of (fixed-width) instructions and the GPL status of the program are returned as well.
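Because eBPF instructions are fixed-width (each `struct bpf_insn` is eight bytes), sizing the buffer and walking the result are both simple arithmetic. The sketch below decodes such a buffer in user space. It assumes the dump is a plain array of `struct bpf_insn` on a little-endian machine, which the patch description suggests but which is an assumption here; `decode` and `Insn` are illustrative names, not part of any proposed API.

```python
import struct
from collections import namedtuple

Insn = namedtuple("Insn", "code dst src off imm")
INSN_SIZE = 8  # sizeof(struct bpf_insn)


def decode(buf):
    """Split a dumped buffer into fixed-width eBPF instructions."""
    if len(buf) % INSN_SIZE:
        raise ValueError("buffer is not a whole number of instructions")
    out = []
    for code, regs, off, imm in struct.iter_unpack("<BBhi", buf):
        # Little-endian bit-field layout: dst_reg is the low nibble,
        # src_reg the high nibble of the second byte.
        out.append(Insn(code, regs & 0x0F, regs >> 4, off, imm))
    return out


# Two hand-assembled instructions: "r0 = 0x7fff0000; exit".
sample = (struct.pack("<BBhi", 0xB7, 0x00, 0, 0x7FFF0000)
          + struct.pack("<BBhi", 0x95, 0x00, 0, 0))
insns = decode(sample)
for i in insns:
    print(hex(i.code), i.dst, i.src, i.off, hex(i.imm))
```

A caller that knows the instruction count can thus allocate `count * 8` bytes up front, and a checkpoint tool can store the buffer verbatim alongside the GPL flag.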
But how does one get a file descriptor for a BPF program? A sufficiently privileged process (i.e. one with CAP_SYS_ADMIN) can call ptrace() on a target process using the PTRACE_SECCOMP_GET_FILTER_FD command to get a file descriptor for the first seccomp filter attached. Subsequent filters can be accessed with PTRACE_SECCOMP_NEXT_FILTER and each can be dumped with bpf(BPF_PROG_DUMP, ...).
That just leaves one last piece of the puzzle: reading a seccomp filter program out of a file descriptor and attaching it to processes, which is needed for the restore operation. For that, Andersen has extended the seccomp() system call with a new operation: SECCOMP_MODE_FILTER_EBPF. In the patch set, that operation only handles a single new command, SECCOMP_EBPF_ADD_FD, but others will likely be added when seccomp filters can use more eBPF features. SECCOMP_EBPF_ADD_FD will add the filter associated with the file descriptor to the process's filter list.
It is all something of a complicated dance, but is similar in some ways to other CRIU save and restore dances. One unresolved issue is how to represent the hierarchy of filter programs, which can be inherited over fork() and clone() and shared in other ways. If that hierarchy is to be restored to the same state it had when a set of processes was checkpointed, there needs to be some way to determine which programs are the same and are thus likely to have been inherited. Restoring the hierarchy is needed so that changes to filter programs properly propagate throughout the hierarchy tree.
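To see why the filters form a tree rather than a set of independent lists, it helps to model how they are attached. In the kernel, each attached filter points back at the one installed before it, and a forked child starts out referencing the same chain as its parent. The toy model below mirrors that arrangement; the `Filter` and `Task` classes and the `shared_filters` helper are illustrative stand-ins, not kernel code.

```python
class Filter:
    """One attached filter; `prev` points at the filter installed before it."""
    def __init__(self, name, prev=None):
        self.name = name
        self.prev = prev


class Task:
    def __init__(self, head=None):
        self.head = head  # most recently installed filter, or None

    def install(self, name):
        # New filters are pushed on top; the older chain is left untouched.
        self.head = Filter(name, self.head)

    def fork(self):
        # The child starts out sharing the parent's chain, not a copy of it.
        return Task(self.head)

    def chain(self):
        """Return this task's filters, newest first."""
        out, f = [], self.head
        while f is not None:
            out.append(f)
            f = f.prev
        return out


def shared_filters(a, b):
    """Filters that are the same object in both tasks' chains."""
    ids = {id(f) for f in a.chain()}
    return [f for f in b.chain() if id(f) in ids]


parent = Task()
parent.install("F1")
child = parent.fork()
child.install("F2")
parent.install("F3")

print([f.name for f in parent.chain()])                 # newest first
print([f.name for f in child.chain()])
print([f.name for f in shared_filters(parent, child)])
```

After the fork, both chains end in the very same F1 object. A checkpoint that dumped the two chains independently would see two byte-identical programs and could not tell, from the instructions alone, whether they were one shared filter or two separately installed copies.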
The kernel clearly knows which programs come from where as they get attached, but that information is not stored anywhere. An earlier version of the patch set added a "program ID" to the metadata that was dumped with the program, but there were complaints that it was leaking a kernel address into user space, which is a security hole. There was discussion about ways to either obfuscate the address or to maintain a simple counter for the ID, but Andersen dropped the ID from the second (and current) version of the patches.
Normally, kcmp() is used to determine if two kernel objects are the "same"; BPF maintainer Alexei Starovoitov believes that the KCMP_FILE comparison can be taught to do the right thing for BPF program file descriptors. But Andy Lutomirski would like to see the hierarchy be more explicit:
Representing the hierarchy was further discussed by Andersen and Lutomirski in a thread on the second version of the patch set. Lutomirski's use case has requirements that are beyond those needed for C/R. It seems that tracking the parent of a seccomp filter and using kcmp() to determine equality may be sufficient for both needs, though.
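On the restore side, an equality test plus knowledge of each filter's predecessor would be enough to rebuild the tree. The sketch below merges per-task chains into a single tree using a caller-supplied `same(a, b)` predicate, which stands in for a kcmp()-style comparison. Everything in it is hypothetical: the handle representation, the `build_tree` name, and the assumption that each task's chain can be listed oldest filter first.

```python
def build_tree(chains, same):
    """Merge per-task filter chains (oldest filter first) into one tree.

    `same(a, b)` is a stand-in for a kcmp()-style equality check.
    Returns (parents, heads): parents[i] is the index of node i's parent
    (None for a root), and heads[task] is the node index of that task's
    newest filter.
    """
    nodes, parents, heads = [], [], {}
    for task, chain in chains.items():
        cur = None
        for handle in chain:
            # Reuse an existing node only if it hangs off the same parent
            # and the equality check says it is the same kernel object.
            match = next((i for i, h in enumerate(nodes)
                          if parents[i] == cur and same(h, handle)), None)
            if match is None:
                nodes.append(handle)
                parents.append(cur)
                match = len(nodes) - 1
            cur = match
        heads[task] = cur
    return parents, heads


# Handles are (fd, tag) pairs; equal tags model "same kernel object".
chains = {
    "parent": [(3, "A")],
    "child": [(4, "A"), (5, "B")],
    "other": [(6, "C")],
}
parents, heads = build_tree(chains, lambda a, b: a[1] == b[1])
print(parents, heads)
```

With the tree in hand, a restorer could install each node's program once and fork at the points where chains diverge, so that shared filters end up shared again rather than duplicated.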
Another round or two of patches would seem likely before this feature is ready to be merged; for one thing, man page changes are needed for bpf() and seccomp(). As seccomp filters get used by more programs, a way to save and restore them will certainly be needed. Further refinements, for more complicated eBPF programs with maps, for example, can be expected too. While this patch set is targeted at seccomp filters, the more general eBPF dumping problem will eventually need to be addressed as well.
Changes to filter programs
Posted Oct 5, 2015 13:57 UTC (Mon) by robbe (guest, #16131)
> Restoring the hierarchy is needed so that changes to filter
> programs properly propagate throughout the hierarchy tree.
Wait! Does that mean that I can do:
<parent> install filter F1
<parent> fork()
<child> install filter F2 ... I am now restricted by F1+F2
<parent> replace F1 by F1'
and have child be restricted by F1'+F2?
Would that be an intentional feature and for what purpose? This spooky action at a distance leaves somewhat of a bad taste. But I guess it's not worse than a parent ptrace()ing its child...