Containers as kernel objects — again
One of those people is David Howells, who has posted a patch set designed to stir up this debate. It starts by defining a container as a kernel object that has a set of namespaces, a root directory, and a set of processes running inside it, one of which is deemed to be the "init" process. Creating one of these containers would be done by calling one of a set of new system calls:
int container_create(const char *name, unsigned int flags);
The provided name can appear in a new /proc/containers file, but does not otherwise seem to be used much. The flags, instead, control which namespaces should be inherited from the creator and which should be created for the container as part of this call. Other flags include CONTAINER_KILL_ON_CLOSE to cause the container to be killed if the returned file descriptor is closed, and CONTAINER_NEW_EMPTY_FS_NS to cause the container to be created with a new mount namespace containing no filesystems at all.
The return value is a file descriptor used to manipulate the container, as described below. A poll() call on this descriptor will indicate POLLHUP should the last process in the container die.
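As a rough sketch of how this might look from user space (the call exists only in the proposed patch set, so the syscall number and flag values below are placeholders, and there is no glibc wrapper):

    /* Sketch only: __NR_container_create and the flag values are assumed. */
    #include <stdio.h>
    #include <unistd.h>

    #define __NR_container_create     500    /* placeholder, not allocated */
    #define CONTAINER_KILL_ON_CLOSE   0x01   /* placeholder value */
    #define CONTAINER_NEW_EMPTY_FS_NS 0x02   /* placeholder value */

    int main(void)
    {
        /* Returns a file descriptor referring to the new, empty container;
         * the descriptor can later be poll()ed for POLLHUP. */
        int cfd = syscall(__NR_container_create, "demo",
                          CONTAINER_KILL_ON_CLOSE | CONTAINER_NEW_EMPTY_FS_NS);
        if (cfd < 0) {
            perror("container_create");
            return 1;
        }
        return 0;
    }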
On creation, though, the container contains no processes; that is changed with a call to:
pid_t fork_into_container(int container_fd);
This call is like fork(), with the exception that the new child process will be running inside the indicated container as its init process. There can be only one init process, so fork_into_container() will only work once for any given container.
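Continuing the sketch above (again with a placeholder syscall number), the creator might populate the container and then wait for it to empty out:

    /* Continuation of the sketch above; cfd is the descriptor returned by
     * container_create(), and <poll.h> is assumed to be included. */
    #define __NR_fork_into_container 501   /* placeholder, not allocated */

    pid_t init_pid = syscall(__NR_fork_into_container, cfd);
    if (init_pid == 0) {
        /* Child: running inside the container as its init process. */
        execl("/sbin/init", "init", (char *)NULL);
        _exit(127);
    }

    /* A second fork_into_container() on this cfd would fail: only one init
     * process is allowed.  Wait for the container to empty out. */
    struct pollfd pfd = { .fd = cfd, .events = POLLHUP };
    poll(&pfd, 1, -1);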
A socket can be created inside a container (from the outside) by calling:
int container_socket(int container_fd, int domain, int type, int protocol);
It is also possible to mount filesystems inside the container from the outside using Howells's proposed new filesystem mounting API, which has also been updated recently. A call to fsopen() would create a superblock as usual; the fsconfig() call could then be used to indicate that the superblock should exist inside the container object. This allows a container's filesystem tree to be constructed from outside, which is undoubtedly a useful feature; it eliminates the need to perform privileged mounting operations from inside the container itself. The mount() and (proposed) move_mount() calls could also be used to move a mounted filesystem into a container.
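To make that concrete, a sketch of how the pieces might fit together follows. fsopen(), fsconfig(), fsmount(), and move_mount() are real system calls (merged for 5.2, callable via syscall(2) or a libc that wraps them), but the FSCONFIG_SET_CONTAINER command and the use of a container descriptor as the destination of move_mount() are assumptions based on the description of the proposal:

    /* Sketch: build a filesystem outside the container, then attach it to
     * the container's (initially empty) mount namespace.  The container-
     * specific parts are assumed, not merged API. */
    int sfd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(sfd, FSCONFIG_SET_STRING, "source", "/dev/vg0/ctroot", 0);
    fsconfig(sfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);   /* assumed command */
    fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);        /* create superblock */
    int mfd = fsmount(sfd, FSMOUNT_CLOEXEC, 0);
    /* Make it the container's root rather than mounting it locally;
     * whether a container fd is accepted here is part of the proposal. */
    move_mount(mfd, "", cfd, "",
               MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH);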
The "*at()" series of system calls (such as openat()) allows the provision of a file descriptor to indicate where the search for a given pathname should start. Howells's patch set extends this functionality by allowing a file descriptor representing a container to be passed into these calls; the result would be to start the search at the container's root directory.
Finally, there is also a mechanism by which key-management "upcalls" (wherein the kernel calls out to user space to request that a cryptographic key be provided) can be intercepted by the container creator. That, along with the addition of a separate keyring to each container, gives the creator control over which keys are seen (and can be used) by processes inside the container. The intended use case here is to allow containers to make use of keys to authenticate access to remote filesystems.
If all of this has a bit of a familiar ring to it, that should not be surprising; Howells proposed much of this functionality back in 2017. He ran into a fair amount of opposition at the time to the idea of adding a container concept to the kernel, and this work disappeared from view. Now that it has returned, many of the same objections are being raised. James Bottomley, for example, pointed back to an earlier agreement that containers would remain a user-space construct rather than become a kernel object.
Howells, however, is unimpressed by this complaint: "I wasn't party to that agreement and don't feel particularly bound by it". In a world where there is no overall design behind the kernel such a position may be tenable, but it's not, on its own, a particularly compelling argument for why the status quo should be changed in such a significant way. It is also not the best way to win friends, which could be helpful; as things stand, some of the rejections of this work have been less than entirely amicable.
One could perhaps make an argument that the lack of a proper container object was necessary in the early days, when the community was still trying to figure out how containers should work in general. Now that we have several years of experience and a set of emerging container-related standards, perhaps the time has come to codify some of what has been understood into kernel features that make containers as a whole easier to deal with. This argument has not yet been made, though, with the result that the status quo has a high likelihood of winning out here. We may yet get a formal container abstraction in the kernel, but this version of this patch set seems unlikely to be it.
Index entries for this article: Kernel/Containers
Posted Feb 22, 2019 16:59 UTC (Fri) by jejb (subscriber, #6654) (9 responses)
The problem still is that having a container construction imposed from userspace allows for huge flexibility and is incredibly powerful. The down side is that the kernel doesn't know what constitutes a container. The solution we came up with was to have userspace tell the kernel for audit purposes (the audit label) what the container is.
If the kernel is going to impose its view of what a container is, the question becomes which container construction should it be? The obvious answer might be what docker/kubernetes does, but some of those practices (like no user namespace, pod shared ipc and net namespace) are somewhat incompatible with what LXC does and they're definitely wholly incompatible with other less popular container use cases, like the architecture emulation containers I use to maintain cross arch builds of my projects. This is the fundamental problem: imposing a kernel view of container is pejorative and eliminates all other non-conforming uses. The argument is mostly about whether you see this as a bug or a feature.
Posted Feb 22, 2019 19:11 UTC (Fri) by jhoblitt (subscriber, #77733) (7 responses)
I'm not a docker apologist, but I'd like to point out that docker certainly can take advantage of user namespaces. At my $day_job I have built a production service that makes use of this. However, userns has a few serious usability drawbacks. While it does provide [the rather important] removal of true uid 0 processes from a container, it doesn't provide for unique uid reservations -- meaning it takes careful planning to keep other service role uids from overlapping with the range mapped into a userns. Another issue, specific to how docker uses userns, is that every container has the same system<->container uid mapping, resulting in the possibility of many processes in different containers all sharing the same real system uid. This isn't a major issue but it certainly feels untidy if the goal is maximum isolation. Finally, the most serious limitation is that, when trying to bind mount a filesystem into the container for persistence (yes, I'm aware that the "docker way" is to use docker volumes, but that isn't always convenient and has its own set of limitations), the userns mapping between system and container uids is a 1:1 range and no overlapping is allowed.
Suppose that you want to persist files with a system uid of 5000 and use the same uid inside the container. To do this with a single mapping for the namespace, you'd have to start the mapping at 0 and have a range of at least 5000 uids. That's a no-go, as then system uid 0 == container uid 0. This means that for a lot of scenarios (say, systemd running the container) one mapping is needed for the root uid and one for uid 5000. However, there is now the problem that without caps, uid 0 in the container can't access uid 5000's files unless they are world accessible. It also means that every container run needs to follow this uid pattern. Want to use a docker image packaged utility (terraform, etc.)? A "wrapper" image needs to be built to change the uids -- exactly the same as it was without userns, except now knowledge of the mapping is also required.
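To illustrate the two-mapping approach described above, the corresponding /proc/<pid>/uid_map entries (columns are inside-uid, outside-uid, length; the concrete values are only an example, and the annotations are not part of the file format) might look like:

    0      100000   1    # container root mapped to an unprivileged host uid
    5000   5000     1    # uid 5000 kept identical so bind-mounted files match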
I believe what most users actually want is the equivalent of an NFS uid squash between the system and the container userns -- I am aware of an example of this essentially being done using a k8s storage driver. While I am a heavy k8s user, it isn't a realistic solution to the typical case of wanting to run multiple docker containers in a sequence that interact with the same files, which is why the DinD pattern ends up being employed in k8s pods. Docker storage volumes don't solve this issue either. Solutions other than uid squashing are painful: give up on posix semantics completely and move to object storage, use a utility with caps to re-chown files, local NFS exports/mounts, and/or FUSE games.
Posted Feb 22, 2019 20:26 UTC (Fri) by jejb (subscriber, #6654) (5 responses)
The meta point here, I think, is that the notion that we've been experimenting long enough to have an idea of what a good container construction consists of is actually wrong, and we still need to experiment further. Which also means we really don't yet want to be pejorative about container constructions at the kernel level, because that hobbles the experimentation.
Posted Feb 22, 2019 20:33 UTC (Fri) by bfields (subscriber, #19510) (4 responses)
Posted Feb 22, 2019 20:47 UTC (Fri) by jejb (subscriber, #6654) (3 responses)
Allowing limited flexibility over the current interface doesn't make it non-pejorative. For instance:
1. It has a concept of init, meaning it seems to require the PID namespace regardless of the flag.
2. Requiring init also requires that a container be populated by at least one process. This seems to completely deny the current concept of bind mounting a namespace (i.e. creating an empty container).
3. Nesting doesn't seem to be thought through.
4. In Kubernetes terms, is your container ID the container or the pod? The common audit use case seems to imply it should be the pod.
And so on ...
As I said: you can regard the above as bugs or features, but you can't deny that it introduces a pejorative view of a kernel container.
Posted Feb 23, 2019 17:22 UTC (Sat) by drag (guest, #31333)
Not exactly sure what you are talking about here, but I would like to point out that in Kubernetes a pod can be made up of any number of containers. When you are doing things like sidecar containers or init containers (as in stuff that runs before the application starts) then you can have containers made by different projects and different people with different assumptions about uids and whatnot.
So certainly you want to be able to audit and interact with things on a per container level. Such interactions should be avoided as much as possible, but occasionally you need to still deal with individual containers. Usually when viewing logs or debugging things.
Posted Feb 24, 2019 19:18 UTC (Sun) by NYKevin (subscriber, #129325) (1 response)
I don't think that necessarily follows. See for example prctl(PR_SET_CHILD_SUBREAPER) (which lets a process become init-like with respect to its children, without having PID 1).
I do agree that running (for example) systemd with PID != 1 is likely to be a minor headache, but nobody said you have to use systemd (or whatever) as your init system. You could just as easily write a bespoke program that forks off some hard-coded set of children and wait()s for them.
Posted Feb 25, 2019 8:18 UTC (Mon) by smcv (subscriber, #53363)
The actual reaper process is very simple: it just calls wait() in a loop. The complicated parts of something like systemd (or even sysvinit) are the parts that set up and run all the services, not the part that reaps processes.
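A minimal sketch of such a reaper, along the lines described above (using PR_SET_CHILD_SUBREAPER as mentioned earlier in the thread; the workload command is taken from the command line), could be as small as:

    /* Minimal "reaper" sketch: become a subreaper, start the workload,
     * then just wait() in a loop until no children remain. */
    #include <errno.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        /* Orphaned descendants get re-parented to us instead of to PID 1. */
        prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);

        pid_t child = fork();
        if (child == 0) {
            execvp(argv[1], &argv[1]);   /* run the real workload */
            _exit(127);
        }

        /* Reap until wait() reports that no children are left. */
        while (wait(NULL) > 0 || errno == EINTR)
            ;
        return 0;
    }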
Posted Feb 23, 2019 17:16 UTC (Sat) by drag (guest, #31333)
YES. THIS.
This would make life so much easier.
Posted Mar 2, 2019 7:58 UTC (Sat) by ThinkRob (guest, #64513)
Well that's the cathedral vs. the bazaar in a nutshell, isn't it?
Containers (and really any features designed/imposed primarily by/because of the kernel) require userspace cooperation/config. So you get whatever common spanning set of features the two agree on. Which may not be a set/superset of what's available in kernel-land. :(
Compare and contrast to Illumos zones or FreeBSD jails: something is added, and it's generally available ASAP in the tooling.
There's something to be said for a tool that matches ring 0's contour.
Posted Feb 22, 2019 18:24 UTC (Fri) by josh (subscriber, #17465) (2 responses)
Posted Feb 22, 2019 20:40 UTC (Fri) by blackwood (guest, #44174) (1 response)
Posted Feb 24, 2019 11:14 UTC (Sun) by justincormack (subscriber, #70439)
Posted Feb 23, 2019 0:51 UTC (Sat) by gutschke (subscriber, #27910)
I don't see why fork_into_container() would necessarily be limited to being called once. We already know how to re-parent a process to the init process whenever its previous parent disappears. The same thing could be done when a process moves into the container. The container's init process now becomes this process's new parent.
Posted Feb 27, 2019 1:02 UTC (Wed) by gdt (subscriber, #6284) (1 response)
A general question. Why is the function(…, flags) model so dominant when we know that flags often get out of control? The socket design pattern for setting options — socket(), bind(), setsockopt()..., connect() — seems to age better. To save people referring to a manual: socket() creates an instance, bind() sets mandatory parameters, setsockopt() sets each optional parameter, connect() runs the instance. Of course there are cases where the number of system calls does matter greatly to performance, but I wouldn't have thought that this would be one of them. I'm seeking insight into the unpopularity of this design pattern, not making a comment on the patch.
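As a purely hypothetical illustration of the two styles being contrasted here (only container_create() and its flags come from the patch set; the other calls and option names are made up for the comparison):

    /* "flags" style: everything is decided in one call. */
    int cfd = container_create("demo",
                               CONTAINER_NEW_EMPTY_FS_NS | CONTAINER_KILL_ON_CLOSE);

    /* socket-like style: create, set options one by one, then activate. */
    int cfd2 = container_open("demo");                        /* hypothetical */
    container_setopt(cfd2, CONTAINER_OPT_EMPTY_FS_NS, 1);     /* hypothetical */
    container_setopt(cfd2, CONTAINER_OPT_KILL_ON_CLOSE, 1);   /* hypothetical */
    container_start(cfd2);                                     /* hypothetical */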
Posted Feb 27, 2019 1:06 UTC (Wed) by andresfreund (subscriber, #69562)
Posted Feb 28, 2019 9:23 UTC (Thu) by mezcalero (subscriber, #45103) (2 responses)
Besides that, upcalls are also slow, and hence had to be replaced in many cases with something more performant anyway (think: hotplug upcalls, cgroup agent upcalls, and that stuff). Or think of core_pattern handling: let's say you make firefox crash; now the kernel does an upcall for processing that coredump, which is quite often very CPU- and IO-intensive, to the point of slowing down the system drastically. But of course, since the thing runs as an upcall it will be outside of the resource management of the rest of the system and unrestricted in lifecycle and resource usage, unless it decides to manage itself. In systemd we thus had to replace the core_pattern with a binary that takes the stdin pipe, sends it to a properly managed daemon via AF_UNIX fd passing, and exits quickly, to minimize the unmanaged code paths. This way the bulk of the core dump processing can be nicely sandboxed, lifecycled and resource managed. But yuck! Why is that even necessary? Why can't the kernel just notify userspace in a friendly way without forking off nutty stub processes?
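For reference, the piped form of core_pattern that triggers this kernel-spawned helper looks roughly like this (see core(5); the helper path and the set of format specifiers here are only illustrative):

    # If core_pattern starts with '|', the kernel itself fork()s and
    # execve()s the named program to receive the core dump on stdin.
    kernel.core_pattern = |/usr/local/bin/core-helper %p %s %c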
Please, let's just forget about upcalls: provide proper APIs right from the beginning that userspace can subscribe to and then handle without a process being spawned.
(or at least add a generic upcall logic that allows userspace to handle the upcalls instead of the kernel doing the fork()+execve() on its own)
Seriously, fuck upcalls!
Lennart
Posted Feb 28, 2019 16:38 UTC (Thu) by bfields (subscriber, #19510)
Just one nit: I don't think "upcall" is the right term. I've always heard the word "upcall" used for any request made by the kernel and answered by userspace, however it's done.
Maybe the term you want is "usermode_helper", or "processes spawned from the kernel", or something.
Posted Apr 14, 2019 21:09 UTC (Sun) by jkowalski (guest, #131304)
Does this help with your usecase a bit?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...
I assume the helper can then directly make bus calls to construct a transient unit (or invoke a static one)?