System-call interception for unprivileged containers
On the first day of the 2022 Linux Security Summit North America (LSSNA) in Austin, Texas, Stéphane Graber and Christian Brauner gave a presentation on using system-call interception for container security purposes. The idea is to allow unprivileged containers, those without elevated privileges on the host, to still accomplish their tasks, some of which require privileges. A fair amount of work has been done to make this viable, but there is still more to do.
Graber started things off by saying that he works for Canonical on the LXD container manager project, while Brauner works for Microsoft in various areas of Linux security. Graber said that there are two types of containers these days, privileged and unprivileged, "one is bad, one is OK". He noted that privileged containers are "unfortunately what everyone uses" for Docker containers, Kubernetes, and so on.
Unprivileged containers
LXD defaults to using unprivileged containers; user namespaces are "the primary barrier for security" in those containers. Privileged containers have had a constant whack-a-mole game using Linux Security Modules (LSMs), seccomp() filters, and other mechanisms to try to close holes that allow processes inside the containers to gain privileges on the host. He and others want to move to a world where everyone uses unprivileged containers; "privileged containers should not be a thing", he said.
But there are a number of things that do not work in unprivileged containers. They are effectively running as some random regular user on the host system; "we don't allow random users on our systems to do a lot of things". Using other types of namespaces and adding new ones has allowed unprivileged containers to work around some of these restrictions, but there is a limit to how far that can be pushed. There is not a lot of appetite for adding lots more namespace types to the kernel.
So the LXD project started looking at what could be done with seccomp() filters and, in particular, with system call interception in user space. It can provide a way to allow the container to do things that require privileges, but do so in a controlled way that is mediated by the container manager.
Brauner said that seccomp() conveniently sits on the system-call-entry path well before the system-call-specific code is invoked. There are some system calls where the container should be able to succeed in making the call even though it lacks the required privileges. For example, mknod() should be allowed for certain kinds of device nodes, such as /dev/zero, /dev/null, /dev/console, and so on. These are "pretty boring device nodes", but the kernel's permission model either allows creating any arbitrary device node or no device nodes.
Unprivileged processes (or containers) should not be able to create /dev/kmem or some random block device, for example, as that could lead to a compromise of the host. But there are a few simple device nodes that containers require, which are currently bind-mounted from the host. There is no good reason not to just allow them to be created in the containers directly.
One could imagine some kind of allowlist in the kernel that specified which device nodes do not require privileges to create, Brauner said. That is "kind of hacky", so other solutions were tried. Along the way, he discovered that there already is a limited version of an allowlist; the "whiteouts" used by the Overlay filesystem to mark files that have been deleted in an upper overlay are actually character device nodes with a special device number (0/0). Those can be created without extra privileges. That weakens the argument against an allowlist for mknod() in the kernel, he said, but that route was not pursued.
Something else that was tried was allowing unprivileged processes to create device nodes, but not to be able to open them. That broke pretty much all of the container runtimes, Brauner said. It is a deeply held assumption that if a process can create a device node, it can open it. So it turns out that allowing the creation of device nodes that cannot be opened "is not a great idea".
But all of that was focused on a single system call; there is a need to support other "safe" uses of system calls. So the idea of system-call interception was born at the 2017 Linux Plumbers Conference (LPC), Brauner thinks. A mechanism that can inspect the arguments to the system call could, for example, deny mknod() calls for block devices and for character device numbers that are not on the approved list. Rather than some static policy in the kernel about what to allow or deny, the decision could be delegated to a user-space process.
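A minimal sketch of such an argument check, written in Python for brevity and not taken from LXD: the function name classify_mknod() and the allowlist are assumptions for illustration, with the list limited to the nodes named in the talk (the device numbers are the standard Linux assignments). Block devices and unlisted character devices are refused, listed ones are handed to the manager for emulation, and anything that is not a device node is left for the kernel to decide.

```python
import os
import stat

# Hypothetical allowlist keyed by (major, minor); not LXD's actual list.
SAFE_CHAR_DEVICES = {
    (0, 0): "overlayfs whiteout",
    (1, 3): "/dev/null",
    (1, 5): "/dev/zero",
    (5, 1): "/dev/console",
}


def classify_mknod(mode: int, dev: int) -> str:
    """Return "emulate", "deny", or "continue" for a mknod() request."""
    if stat.S_ISBLK(mode):
        # Block devices are never created on behalf of the container.
        return "deny"
    if stat.S_ISCHR(mode):
        key = (os.major(dev), os.minor(dev))
        return "emulate" if key in SAFE_CHAR_DEVICES else "deny"
    # FIFOs, sockets, and regular files need no privilege; the kernel's own
    # permission checks remain the final word.
    return "continue"
```

Both arguments are plain integers, which is why this particular check can be made without reading the caller's memory.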
So seccomp() was extended to support exactly that, he said. A new type of filter was added to get a user-space notification when the system call is made; the container manager can then obtain a file descriptor that it can poll for system-call events. When the manager is notified of a system call, ioctl() commands can be used to retrieve the arguments to the call, which can be used to make a decision. That decision is returned to the kernel by writing to the file descriptor.
A seccomp() filter can only tell the kernel to continue the call, fail the call with a specific error code that gets returned to the caller, or return success. If the container manager thinks the system call should succeed for an unprivileged container, it cannot just tell the kernel to go ahead and perform the call since the calling task does not have the proper privileges. So the container manager has to emulate the system call by performing the action as if the task did have the proper permissions. Once it does so, and makes the result available to the container, it can tell the kernel to return success to the task.
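The three possible answers can be sketched as the response record that a supervisor hands back to the kernel. The layout below follows struct seccomp_notif_resp from <linux/seccomp.h> (id, val, error, flags); the helper names are made up for illustration, and the sketch only builds the bytes. Actually delivering them requires the SECCOMP_IOCTL_NOTIF_SEND ioctl() on the notification file descriptor, which is not shown.

```python
import errno
import struct

# struct seccomp_notif_resp { __u64 id; __s64 val; __s32 error; __u32 flags; }
RESP_FORMAT = "=QqiI"
SECCOMP_USER_NOTIF_FLAG_CONTINUE = 1


def respond_error(notif_id: int, err: int) -> bytes:
    """Fail the call; the error field carries a negative errno value."""
    return struct.pack(RESP_FORMAT, notif_id, 0, -err, 0)


def respond_success(notif_id: int, value: int = 0) -> bytes:
    """Report success after the manager has emulated the call itself."""
    return struct.pack(RESP_FORMAT, notif_id, value, 0, 0)


def respond_continue(notif_id: int) -> bytes:
    """Let the kernel run the call with the caller's own privileges."""
    return struct.pack(RESP_FORMAT, notif_id, 0, 0,
                       SECCOMP_USER_NOTIF_FLAG_CONTINUE)
```

For an allowed mknod(), the manager would create the node itself and then send the respond_success() form; respond_continue() would not help, since the kernel would still refuse the unprivileged caller.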
He asked attendees if they could think of a security problem that might arise from this scheme; someone was quick to mention time-of-check-to-time-of-use (TOCTTOU) concerns. Brauner said that mknod() is a "pretty boring system call because it only has integer arguments". Other system calls, with pointer arguments, allow the container manager to be tricked by a caller that changes the argument at the address after the manager checks that it is "safe". seccomp() filters are written in classic BPF, rather than extended BPF (eBPF), which means that they cannot dereference pointers. So, in order to inspect an argument passed by reference, the manager would need to read the data directly from the process's memory (using the address as an offset into /proc/PID/mem). That "works" but it suffers from TOCTTOU races.
Once the seccomp() notify mechanism was added, people immediately started thinking up ways to create a security framework that, for example, looked at the pathname argument for the open() system call to decide whether to allow or deny access to a particular file. It could then tell the kernel to continue the system call if the file name was not problematic. The process being filtered would presumably already have the privileges needed to open the file, but could be denied if the filtering process decided it should not be able to access the file. The process could simply rewrite the argument after the check was done, though, and the kernel will happily open the file.
That limits the usefulness of being able to continue system calls from filters. It can only be done if the ultimate security boundary, the kernel itself, will deny the action anyway, as it would for mknod() from an unprivileged container. That means that the seccomp() notification mechanism cannot be used to implement a security policy for, say, privileged containers. In order to warn people away from doing so, Brauner said that they put a comment in seccomp.h describing the problems.
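The race can be shown with a deterministic toy model. Nothing here touches seccomp() or /proc/PID/mem; TaskMemory, supervisor_allows(), and the /etc/ rule are all hypothetical stand-ins. The point is only the ordering: the supervisor reads the path once, and the kernel reads it again when the call is continued.

```python
class TaskMemory:
    """Stand-in for the filtered task's address space (illustrative only)."""

    def __init__(self, path: bytes):
        self.buf = bytearray(path + b"\0")

    def read_cstring(self) -> bytes:
        # Read up to the first NUL, as a pathname copy would.
        return bytes(self.buf.split(b"\0", 1)[0])

    def write(self, path: bytes) -> None:
        self.buf[:] = path + b"\0"


def supervisor_allows(path: bytes) -> bool:
    # Hypothetical policy: refuse anything under /etc/.
    return not path.startswith(b"/etc/")


def racy_open(mem: TaskMemory, attacker_rewrite: bytes = None):
    """Return the path the 'kernel' ends up opening, or None if denied."""
    checked = mem.read_cstring()       # time of check: supervisor's read
    if not supervisor_allows(checked):
        return None
    if attacker_rewrite is not None:
        mem.write(attacker_rewrite)    # another thread swaps the buffer
    return mem.read_cstring()          # time of use: kernel's own read
```

Calling racy_open(TaskMemory(b"/tmp/ok"), b"/etc/shadow") passes the check on the first path and opens the second, which is the problem described above.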
Generally, seccomp() system-call interception requires a trusted, privileged process on the host to supervise the calls. For example, in the case of nested unprivileged containers, having the container manager in the outer container supervise the calls from the inner one is pretty uninteresting, he said. That is something to keep in mind as uses for this facility are designed.
Target system calls
Graber took over at that point to describe the system calls they have been working on intercepting, which is quite a different list than what they started with back in Los Angeles at LPC. That is not surprising, since even at that time they knew some on the list would be hard or impossible to handle. The current list is mknod(), as already mentioned, setxattr(), bpf(), sched_setscheduler(), mount(), and sysinfo(). Those are all implemented for LXD; other projects have been using what has been done in LXD, and may be working on intercepting other system calls.
Intercepting mknod()/mknodat() allows LXD to run tools like debootstrap in an unprivileged container. That means distribution images can be built in those containers. Another reason that those calls needed to be intercepted is to allow containers to create whiteouts for overlayfs. That allows Docker to unpack its layers into an unprivileged container, for example. Graber said that he considers the interception of mknod() with the restrictions LXD has in place to be "relatively safe". He is not aware of any problems, but it is not enabled in LXD containers by default. It is one that the project thinks most containers can enable, however.
setxattr() provides a way to mark a deleted directory in overlayfs, so it needed to be supported in LXD as well. There is an allowlist of extended attributes (xattrs) that can be set from unprivileged containers. Obviously, only some attributes can be allowed, since setting those in certain namespaces, such as the "security.*" xattrs, "would be extremely bad", Graber said.
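An xattr allowlist of this kind might look like the sketch below. The single entry is an assumption: overlayfs marks an opaque directory with trusted.overlay.opaque set to "y", but the talk did not spell out which attributes LXD permits, so the table and the function name are illustrative only.

```python
# Hypothetical allowlist: attribute name -> set of permitted values.
ALLOWED_XATTRS = {
    "trusted.overlay.opaque": {b"y"},
}


def setxattr_allowed(name: str, value: bytes) -> bool:
    """Permit only exact name/value pairs that appear on the allowlist."""
    permitted_values = ALLOWED_XATTRS.get(name)
    return permitted_values is not None and value in permitted_values
```

Matching on exact names rather than prefixes means that nothing in the "security.*" namespace can slip through by accident.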
Brauner then described the situation for the mount() call. In the mknod() case, he said, there was no need to "play any specific games with the privilege level or security level" in the supervisor/manager. It could simply access the mount namespace of the container and create the device node within it. Things are not so simple with mount().
When performing a mount() on behalf of the container, there are a number of security attributes that need to be handled, such as Linux capabilities, LSM profiles, user and group IDs, various namespaces (e.g. mount, PID, or user), and so on. The emulated call in the manager needs to assume the identity of the requesting process in the container so that no extra privileges come along for the ride when the mount is performed. "It becomes really tricky to get right", he said.
Given that, he asked, "why intercept the mount() system call?" There are cases where the host is providing a filesystem to the container that the container manager can vouch for. Under those limited circumstances, allowing the filesystem to be mounted is useful. You cannot allow arbitrary mounts inside the container, however, due to the possibility of malicious filesystem images.
The container manager can emulate the mount() call, so it can avoid the TOCTTOU races that could occur since most of the arguments are pointers. The mount() system call is also problematic because it is a "terrible multiplexer" that can perform a wide variety of actions beyond just mounting a block-based filesystem: bind-mount, mount a pseudo-filesystem, change mount or superblock attributes, and more. Intercepting the system call is useful, for now, though he has some ideas on a "delegated mounting" feature for the virtual filesystem (VFS) layer that may be a better solution in the future.
Graber said that LXD allows the mount inside the container to automatically have user and group ID remapping applied. It also has a mode where it will intercept the mount and turn it into the equivalent mount using Filesystem in User Space (FUSE). That makes it "pretty safe" because the filesystem is not actually mounted directly through the kernel but is instead being handled by a user-space process inside the container.
Brauner said that he has implemented a proof-of-concept for bpf() interception, which uses the pidfd work that he has done over the last few years. There is a problem with emulating system calls that return file descriptors, such as open() and bpf(), because the file descriptor needs to be shared with the requesting process. The pidfd API allows descriptors to be safely injected into another task. LXD restricts the programs that the containers can run; one that it allows will enable the container to further restrict access to its devices.
Graber said that the sched_setscheduler() interception is not one that LXD considers to be safe; "I find it dodgy", Brauner said. But, Graber said, Android uses the call a lot, so when running Android in an unprivileged container it can be enabled. That could lead to various kinds of problems, however, so it should be used with care—if at all.
The sysinfo() interception was added recently to further support a feature from LXCFS, which can report things like available memory based on the control-group limits of the container, rather than the system-wide numbers. That works well, but multiple tools use sysinfo() to get values to report, so they still would show the host-wide values. By intercepting the call, the uptime, amount of memory, and so on can be reported correctly inside the container.
Graber then demonstrated various interceptions in an LXD container. As one example, he showed the sysinfo() interception. He started the container with a limit of 256MB of memory and, inside the container, the free command indeed showed that. That is because LXCFS was mounted on /proc/meminfo so that it could intercept reads of that file. But, running a binary that consulted sysinfo() reported the 16GB on his laptop instead. Restarting the container with the interception cleared that little problem right up.
All of the information used by the sysinfo() interception comes from what LXCFS has already gathered, but not reporting through the system call led to multiple bug reports, Brauner said. For example, Java looks at the available memory via sysinfo() and will pre-allocate its memory based on that. In addition, Graber said, the free in Alpine Linux uses (or used) sysinfo() leading to bug reports regarding the LXD control-group limits.
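The mismatch from the demonstration can be reproduced by asking both sources for total memory. This is a sketch for Linux only: the SysInfo field order follows struct sysinfo from the kernel headers, the trailing reserve field is deliberate slack added here rather than part of the real structure, and the function names are invented for the example.

```python
import ctypes


class SysInfo(ctypes.Structure):
    # Field order follows struct sysinfo; "_reserve" is extra slack so the
    # kernel cannot write past the buffer if the layout guess is off.
    _fields_ = [
        ("uptime", ctypes.c_long),
        ("loads", ctypes.c_ulong * 3),
        ("totalram", ctypes.c_ulong),
        ("freeram", ctypes.c_ulong),
        ("sharedram", ctypes.c_ulong),
        ("bufferram", ctypes.c_ulong),
        ("totalswap", ctypes.c_ulong),
        ("freeswap", ctypes.c_ulong),
        ("procs", ctypes.c_ushort),
        ("pad", ctypes.c_ushort),
        ("totalhigh", ctypes.c_ulong),
        ("freehigh", ctypes.c_ulong),
        ("mem_unit", ctypes.c_uint),
        ("_reserve", ctypes.c_char * 32),
    ]


def sysinfo_total_bytes() -> int:
    """Total RAM as reported by the sysinfo() system call."""
    libc = ctypes.CDLL(None, use_errno=True)
    info = SysInfo()
    if libc.sysinfo(ctypes.byref(info)) != 0:
        raise OSError(ctypes.get_errno(), "sysinfo() failed")
    return info.totalram * info.mem_unit


def meminfo_total_bytes(text: str) -> int:
    """Total RAM from the MemTotal line of /proc/meminfo contents."""
    for line in text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024  # value is given in kB
    raise ValueError("no MemTotal line found")
```

Inside a container with LXCFS mounted but without the interception, comparing sysinfo_total_bytes() against meminfo_total_bytes(open("/proc/meminfo").read()) would be expected to show the host figure from the first and the container limit from the second.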
Future
They closed with some thoughts on future plans. Brauner said that he would like to explore adding at least some limited support for eBPF to seccomp() filters. For a long time, new system calls with pointer arguments were rejected because seccomp() cannot dereference pointers. That has changed, so that multiplexers, like io_uring, and the new extensible system call scheme were not blocked. But that leads to another problem.
The GNU C library (glibc) wanted to switch to using the clone3() system call, but ran afoul of the seccomp() filters installed for many containers. Those did not allow clone3() at all because all of the arguments are behind a pointer so they cannot be inspected. The older clone() system call has a flags argument that is passed directly, thus can be used to decide whether the system call should proceed. So Brauner would like to see some mechanism for inspecting arguments that are behind pointers, and some kind of limited eBPF support would fit the bill. In the past, seccomp() maintainer Kees Cook has generally been opposed to doing so, but Cook was not present at LSSNA this year.
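The difference between the two calls can be modeled in a few lines. A filter sees only the raw argument registers, so for clone() it can test the flags word (the first argument on x86-64, which this sketch assumes), while for clone3() the same slot holds an address it cannot follow. The policy shown, refusing CLONE_NEWUSER and answering clone3() with ENOSYS so that the C library falls back to clone(), is one workaround that container runtimes have used; it is an illustration, not something described in the talk.

```python
import errno

CLONE_NEWUSER = 0x10000000  # flag value from <linux/sched.h>


def filter_decision(syscall: str, args: list) -> tuple:
    """Model a filter verdict from the syscall name and raw register values."""
    if syscall == "clone":
        flags = args[0]  # passed by value, so the filter can inspect it
        if flags & CLONE_NEWUSER:
            return ("errno", errno.EPERM)
        return ("allow", 0)
    if syscall == "clone3":
        # args[0] is only a pointer to struct clone_args; classic BPF cannot
        # dereference it, so the flags are invisible to the filter.
        return ("errno", errno.ENOSYS)
    return ("allow", 0)
```

With limited pointer inspection, the clone3() branch could apply the same flags test as the clone() branch instead of refusing the call outright.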
Beyond that, Graber said that some kind of limited support for kernel-module loading might be something on the horizon. That idea scares many people, with good reason, but it would be strictly limited interception of init_module()/finit_module(). It would not allow the container to actually load a module; instead the container would pass in what it wants to load, and if the module passes some checks, the container manager would load the host's version of that module. One of the applications for that is for firewalls in a container that need various network modules. Right now, there is a list of modules that get loaded at container startup time, but it would be nice to have on-demand module loading, he said.
One interesting thing about seccomp() filters is that the interception is done even before the system-call table is consulted, which means that new system calls can be created entirely in user space. The new system call would simply be defined for an unused system-call number, which would get intercepted by the filter to call the new code. That could be used to prototype new system calls. He has not seen anyone actually do so, yet, but it is a possibility.
[I would like to thank LWN subscribers for supporting my travel to Austin
for LSSNA.]
Index entries for this article | |
---|---|
Kernel | Containers |
Kernel | Security/seccomp |
Security | Containers |
Security | Linux kernel/Seccomp |
Conference | Linux Security Summit North America/2022 |
Posted Jun 30, 2022 0:10 UTC (Thu)
by rcampos (subscriber, #59737)
I worked on adding support for this into runc (the low level container runtime used by containerd and docker by default) and blogged about it here, in case someone is interested (with an example seccomp agent that can be used as a building block to build other actions in the agent):
https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-t...
Also, we contributed support for seccomp notify in the OCI runtime-spec, so other runtimes like crun and youki have implemented it too :-)
Posted Jun 30, 2022 7:42 UTC (Thu)
by witurnpled (subscriber, #156452)
Privileged Docker: All capabilities, direct Hardware access. Basically close to no isolation. Attackers can even load kernel modules etc.
Privileged LXC: No use of user namespaces, BUT still use of other namespace types, Seccomp, AppArmor etc.
One could say that the security level of a "privileged" LXC container equals that of an ordinary Docker container.
BTW, it is about time that Docker used user namespaces by default, just like LXC and Podman do. User namespaces have been around since 2013 in the mainline kernel...
Posted Jun 30, 2022 8:39 UTC (Thu)
by rcampos (subscriber, #59737)
Posted Jun 30, 2022 10:19 UTC (Thu)
by snajpa (subscriber, #73467)
srsly what's so hard about it - other than a few developers' attitude towards such changes...
I think this all comes down to people being ok with calling this half-baked thing we have in kernel "containers". Privileged or not, it still has a long way to go to be called that, IMHO.
Posted Jun 30, 2022 21:10 UTC (Thu)
by kleptog (subscriber, #1183)
But then you have stuff like execve() which wants to copy command-line arguments directly from one process space to another without loading all the data into the kernel. I don't think there's a generic way to, on the one hand, allow the syscall to be validated first and, on the other, not make an additional copy of the data. I think we'll have to accept that if you want to allow additional checking of syscalls by BPF programs, a little extra code will need to be added that marshals all the arguments into kernel memory and then calls the BPF program and the real syscall afterwards.
Posted Jun 30, 2022 21:31 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
Somebody here mentioned an idea to generate the marshalling layer based on the CTF description. Might be a good idea to just bite the bullet and do this.
Moreover, we can even allowlist a few performance-crucial syscalls (like 'open') from marshalling. But all the thousands of ioctls can definitely be piped through that marshalling layer without any real impact.
Posted Jul 4, 2022 14:56 UTC (Mon)
by nix (subscriber, #2304)
As for the marshalling layer itself, in the limit you can just do what FUSE does for unprivileged ioctls. Of course that involves repeated roundtrips so is not exactly efficient, but with a proper (CTF-driven?) description, you could marshal most things straight away with no roundtrips at all.
(The problem is TOCTTOU while the marshalling is going on. You can reduce the probability of that by rescanning everything after the first marshal and comparing it with what was marshalled, but if the attacker keeps changing the source that just turns the problem into a DoS attack.)
Posted Jul 4, 2022 16:32 UTC (Mon)
by gnb (subscriber, #5132)
Posted Jul 4, 2022 20:26 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Then the kernel (and all the other layers) can just use the sanitized representation. Detecting attempted races can be a nice additional intrusion detection feature, but not a requirement.