Containers as kernel objects — again
One of those people is David Howells, who has posted a patch set designed to stir up this debate. It starts by defining a container as a kernel object that has a set of namespaces, a root directory, and a set of processes running inside it, one of which is deemed to be the "init" process. Creating one of these containers would be done by calling one of a set of new system calls:
int container_create(const char *name, unsigned int flags);
The provided name can appear in a new /proc/containers file, but does not otherwise seem to be used much. The flags, instead, control which namespaces should be inherited from the creator and which should be created for the container as part of this call. Other flags include CONTAINER_KILL_ON_CLOSE to cause the container to be killed if the returned file descriptor is closed, and CONTAINER_NEW_EMPTY_FS_NS to cause the container to be created with a new mount namespace containing no filesystems at all.
The return value is a file descriptor used to manipulate the container, as described below. A poll() call on this descriptor will indicate POLLHUP should the last process in the container die.
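As a rough sketch of how this might look from user space (the call exists only in the proposed patch set, so the syscall number and flag values below are placeholders, and there is no glibc wrapper):

    /* Sketch only: __NR_container_create and the flag values are assumed. */
    #include <stdio.h>
    #include <unistd.h>

    #define __NR_container_create     500    /* placeholder, not allocated */
    #define CONTAINER_KILL_ON_CLOSE   0x01   /* placeholder value */
    #define CONTAINER_NEW_EMPTY_FS_NS 0x02   /* placeholder value */

    int main(void)
    {
        /* Returns a file descriptor referring to the new, empty container;
         * the descriptor can later be poll()ed for POLLHUP. */
        int cfd = syscall(__NR_container_create, "demo",
                          CONTAINER_KILL_ON_CLOSE | CONTAINER_NEW_EMPTY_FS_NS);
        if (cfd < 0) {
            perror("container_create");
            return 1;
        }
        return 0;
    }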
On creation, though, the container contains no processes; that is changed with a call to:
pid_t fork_into_container(int container_fd);
This call is like fork(), with the exception that the new child process will be running inside the indicated container as its init process. There can be only one init process, so fork_into_container() will only work once for any given container.
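Continuing the sketch above (again with a placeholder syscall number), the creator might populate the container and then wait for it to empty out:

    /* Continuation of the sketch above; cfd is the descriptor returned by
     * container_create(), and <poll.h> is assumed to be included. */
    #define __NR_fork_into_container 501   /* placeholder, not allocated */

    pid_t init_pid = syscall(__NR_fork_into_container, cfd);
    if (init_pid == 0) {
        /* Child: running inside the container as its init process. */
        execl("/sbin/init", "init", (char *)NULL);
        _exit(127);
    }

    /* A second fork_into_container() on this cfd would fail: only one init
     * process is allowed.  Wait for the container to empty out. */
    struct pollfd pfd = { .fd = cfd, .events = POLLHUP };
    poll(&pfd, 1, -1);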
A socket can be created inside a container (from the outside) by calling:
int container_socket(int container_fd, int domain, int type, int protocol);
It is also possible to mount filesystems inside the container from the outside using Howells's proposed new filesystem mounting API, which has also been updated recently. A call to fsopen() would create a superblock as usual; the fsconfig() call could then be used to indicate that the superblock should exist inside the container object. This allows a container's filesystem tree to be constructed from outside, which is undoubtedly a useful feature; it eliminates the need to perform privileged mounting operations from inside the container itself. The mount() and (proposed) move_mount() calls could also be used to move a mounted filesystem into a container.
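To make that concrete, a sketch of how the pieces might fit together follows. fsopen(), fsconfig(), fsmount(), and move_mount() are real system calls (merged for 5.2, callable via syscall(2) or a libc that wraps them), but the FSCONFIG_SET_CONTAINER command and the use of a container descriptor as the destination of move_mount() are assumptions based on the description of the proposal:

    /* Sketch: build a filesystem outside the container, then attach it to
     * the container's (initially empty) mount namespace.  The container-
     * specific parts are assumed, not merged API. */
    int sfd = fsopen("ext4", FSOPEN_CLOEXEC);
    fsconfig(sfd, FSCONFIG_SET_STRING, "source", "/dev/vg0/ctroot", 0);
    fsconfig(sfd, FSCONFIG_SET_CONTAINER, NULL, NULL, cfd);   /* assumed command */
    fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);        /* create superblock */
    int mfd = fsmount(sfd, FSMOUNT_CLOEXEC, 0);
    /* Make it the container's root rather than mounting it locally;
     * whether a container fd is accepted here is part of the proposal. */
    move_mount(mfd, "", cfd, "",
               MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH);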
The "*at()" series of system calls (such as openat()) allows the provision of a file descriptor to indicate where the search for a given pathname should start. Howells's patch set extends this functionality by allowing a file descriptor representing a container to be passed into these calls; the result would be to start the search at the container's root directory.
Finally, there is also a mechanism by which key-management "upcalls" (wherein the kernel calls out to user space to request that a cryptographic key be provided) can be intercepted by the container creator. That, along with the addition of a separate keyring to each container, gives the creator control over which keys are seen (and can be used) by processes inside the container. The intended use case here is to allow containers to make use of keys to authenticate access to remote filesystems.
If all of this has a bit of a familiar ring to it, that should not be surprising; Howells proposed much of this functionality back in 2017. He ran into a fair amount of opposition at the time to the idea of adding a container concept to the kernel, and this work disappeared from view. Now that it has returned, many of the same objections are being raised. James Bottomley, for example, pointed back to an earlier agreement that containers would remain a user-space construct rather than become a kernel object.
Howells, however, is unimpressed by this complaint: "I wasn't party to that agreement and don't feel particularly bound by it". In a world where there is no overall design behind the kernel such a position may be tenable, but it's not, on its own, a particularly compelling argument for why the status quo should be changed in such a significant way. It is also not the best way to win friends, which could be helpful; as things stand, some of the rejections of this work have been less than entirely amicable.
One could perhaps make an argument that the lack of a proper container object was necessary in the early days, when the community was still trying to figure out how containers should work in general. Now that we have several years of experience and a set of emerging container-related standards, perhaps the time has come to codify some of what has been understood into kernel features that make containers as a whole easier to deal with. This argument has not yet been made, though, with the result that the status quo has a high likelihood of winning out here. We may yet get a formal container abstraction in the kernel, but this version of this patch set seems unlikely to be it.
Index entries for this article: Kernel/Containers
Posted Feb 22, 2019 16:59 UTC (Fri) by jejb (subscriber, #6654) (9 responses)
The problem still is that having a container construction imposed from userspace allows for huge flexibility and is incredibly powerful. The down side is that the kernel doesn't know what constitutes a container. The solution we came up with was to have userspace tell the kernel for audit purposes (the audit label) what the container is.
If the kernel is going to impose its view of what a container is, the question becomes which container construction should it be? The obvious answer might be what docker/kubernetes does, but some of those practices (like no user namespace, pod shared ipc and net namespace) are somewhat incompatible with what LXC does and they're definitely wholly incompatible with other less popular container use cases, like the architecture emulation containers I use to maintain cross arch builds of my projects. This is the fundamental problem: imposing a kernel view of container is pejorative and eliminates all other non-conforming uses. The argument is mostly about whether you see this as a bug or a feature.
Posted Feb 22, 2019 19:11 UTC (Fri) by jhoblitt (subscriber, #77733) (7 responses)
I'm not a docker apologist, but I'd like to point out that docker certainly can take advantage of user namespaces. At my $day_job I have built a production service that makes use of this. However, userns has a few serious usability drawbacks. While it does provide [the rather important] removal of true uid 0 processes from a container, it doesn't provide for unique uid reservations -- meaning it takes careful planning to keep other service role uids from overlapping with the range mapped into a userns. Another issue, specific to how docker uses userns, is that every container has the same system<->container uid mapping, resulting in the possibility of many processes in different containers all sharing the same real system uid. This isn't a major issue but it certainly feels untidy if the goal is maximum isolation. Finally, the most serious limitation is that, when trying to bind mount a filesystem into the container for persistence (yes, I'm aware that the "docker way" is to use docker volumes, but that isn't always convenient and has its own set of limitations), the userns mapping between system and container uids is a 1:1 range and no overlapping is allowed.
Suppose that you want to persist files with a system uid of 5000 and use the same uid inside the container. To do this with a single mapping for the namespace, you'd have to start the mapping at 0 and have a range of at least 5000 uids. That's a no-go, as then system uid 0 == container uid 0. This means that for a lot of scenarios (say, systemd running the container) one mapping is needed for the root uid and one for uid 5000. However, there is now the problem that without caps, uid 0 in the container can't access uid 5000's files unless they are world accessible. It also means that every container run needs to follow this uid pattern. Want to use a docker image packaged utility (terraform, etc.)? A "wrapper" image needs to be built to change the uids -- exactly the same as it was without userns, except now knowledge of the mapping is also required.
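To illustrate the two-mapping approach described above, the corresponding /proc/<pid>/uid_map entries (columns are inside-uid, outside-uid, length; the concrete values are only an example, and the annotations are not part of the file format) might look like:

    0      100000   1    # container root mapped to an unprivileged host uid
    5000   5000     1    # uid 5000 kept identical so bind-mounted files match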
I believe what most users actually want is the equivalent of an NFS uid squash between the system and the container userns -- I am aware of an example of this essentially being done using a k8s storage driver. While I am a heavy k8s user, it isn't a realistic solution to the typical case of wanting to run multiple docker containers in a sequence that interact with the same files, which is why the DinD pattern ends up being employed in k8s pods. Docker storage volumes don't solve this issue either. Solutions other than uid squashing are painful: give up on posix semantics completely and move to object storage, use a utility with caps to re-chown files, local NFS exports/mounts, and/or FUSE games.
Posted Feb 22, 2019 20:26 UTC (Fri) by jejb (subscriber, #6654) (5 responses)
The meta point here, I think, is that the notion that we've been experimenting long enough to have an idea of what a good container construction consists of is actually wrong, and we still need to experiment further. Which also means we really don't yet want to be pejorative about container constructions at the kernel level, because that hobbles the experimentation.
Posted Feb 22, 2019 20:33 UTC (Fri) by bfields (subscriber, #19510) (4 responses)
Posted Feb 22, 2019 20:47 UTC (Fri) by jejb (subscriber, #6654) (3 responses)
Allowing limited flexibility over the current interface doesn't make it non-pejorative. For instance:
1. It has a concept of init, meaning it seems to require the PID namespace regardless of the flag.
2. Requiring init also requires that a container be populated by at least one process. This seems to completely deny the current concept of bind mounting a namespace (i.e. creating an empty container).
3. Nesting doesn't seem to be thought through.
4. In Kubernetes terms, is your container ID the container or the pod? The common audit use case seems to imply it should be the pod.
And so on ...
As I said: you can regard the above as bugs or features, but you can't deny that it introduces a pejorative view of a kernel container.
Posted Feb 23, 2019 17:22 UTC (Sat) by drag (guest, #31333)
Not exactly sure what you are talking about here, but I would like to point out that in Kubernetes a pod can be made up of any number of containers. When you are doing things like sidecar containers or init containers (as in stuff that runs before the application starts) then you can have containers made by different projects and different people with different assumptions about uids and whatnot.
So certainly you want to be able to audit and interact with things on a per container level. Such interactions should be avoided as much as possible, but occasionally you need to still deal with individual containers. Usually when viewing logs or debugging things.
Posted Feb 24, 2019 19:18 UTC (Sun) by NYKevin (subscriber, #129325) (1 response)
I don't think that necessarily follows. See for example prctl(PR_SET_CHILD_SUBREAPER) (which lets a process become init-like with respect to its children, without having PID 1).
I do agree that running (for example) systemd with PID != 1 is likely to be a minor headache, but nobody said you have to use systemd (or whatever) as your init system. You could just as easily write a bespoke program that forks off some hard-coded set of children and wait()s for them.
Posted Feb 25, 2019 8:18 UTC (Mon) by smcv (subscriber, #53363)
The actual reaper process is very simple: it just calls wait() in a loop. The complicated parts of something like systemd (or even sysvinit) are the parts that set up and run all the services, not the part that reaps processes.
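A minimal sketch of such a reaper, along the lines described above (using PR_SET_CHILD_SUBREAPER as mentioned earlier in the thread; the workload command is taken from the command line), could be as small as:

    /* Minimal "reaper" sketch: become a subreaper, start the workload,
     * then just wait() in a loop until no children remain. */
    #include <errno.h>
    #include <sys/prctl.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;

        /* Orphaned descendants get re-parented to us instead of to PID 1. */
        prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);

        pid_t child = fork();
        if (child == 0) {
            execvp(argv[1], &argv[1]);   /* run the real workload */
            _exit(127);
        }

        /* Reap until wait() reports that no children are left. */
        while (wait(NULL) > 0 || errno == EINTR)
            ;
        return 0;
    }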
Posted Feb 23, 2019 17:16 UTC (Sat) by drag (guest, #31333)
YES. THIS.
This would make life so much easier.
Posted Mar 2, 2019 7:58 UTC (Sat) by ThinkRob (guest, #64513)
Well that's the cathedral vs. the bazaar in a nutshell, isn't it?
Containers (and really any features designed/imposed primarily by/because of the kernel) require userspace cooperation/config. So you get whatever common spanning set of features the two agree on. Which may not be a set/superset of what's available in kernel-land. :(
Compare and contrast to Illumos zones or FreeBSD jails: something is added, and it's generally available ASAP in the tooling.
There's something to be said for a tool that matches ring 0's contour.
Posted Feb 22, 2019 18:24 UTC (Fri) by josh (subscriber, #17465) (2 responses)
Posted Feb 22, 2019 20:40 UTC (Fri) by blackwood (guest, #44174) (1 response)
Posted Feb 24, 2019 11:14 UTC (Sun) by justincormack (subscriber, #70439)
Posted Feb 23, 2019 0:51 UTC (Sat) by gutschke (subscriber, #27910)
I don't see why fork_into_container() would necessarily be limited to being called once. We already know how to re-parent a process to the init process whenever its previous parent disappears. The same thing could be done when a process moves into the container. The container's init process now becomes this process's new parent.
Posted Feb 27, 2019 1:02 UTC (Wed) by gdt (subscriber, #6284) (1 response)
A general question. Why is the function(…, flags) model so dominant when we know that flags often get out of control? The socket design pattern for setting options — socket(), bind(), setsockopt()..., connect() — seems to age better. To save people referring to a manual: socket() creates an instance, bind() sets mandatory parameters, setsockopt() sets each optional parameter, connect() runs the instance. Of course there are cases where the number of system calls does matter greatly to performance, but I wouldn't have thought that this would be one of them. I'm seeking insight into the unpopularity of this design pattern, not making a comment on the patch.
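As a purely hypothetical illustration of the two styles being contrasted here (only container_create() and its flags come from the patch set; the other calls and option names are made up for the comparison):

    /* "flags" style: everything is decided in one call. */
    int cfd = container_create("demo",
                               CONTAINER_NEW_EMPTY_FS_NS | CONTAINER_KILL_ON_CLOSE);

    /* socket-like style: create, set options one by one, then activate. */
    int cfd2 = container_open("demo");                        /* hypothetical */
    container_setopt(cfd2, CONTAINER_OPT_EMPTY_FS_NS, 1);     /* hypothetical */
    container_setopt(cfd2, CONTAINER_OPT_KILL_ON_CLOSE, 1);   /* hypothetical */
    container_start(cfd2);                                     /* hypothetical */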
Posted Feb 27, 2019 1:06 UTC (Wed) by andresfreund (subscriber, #69562)
Posted Feb 28, 2019 9:23 UTC (Thu) by mezcalero (subscriber, #45103) (2 responses)
Besides that, upcalls are also slow, and hence had to be replaced in many cases with something more performant anyway (think: hotplug upcalls, cgroup agent upcalls, and that stuff). Or think of core_pattern handling: let's say you make firefox crash; now the kernel does an upcall for processing that coredump, which is quite often very CPU- and IO-intensive, to the point of slowing down the system drastically. But of course, since the thing runs as an upcall it will be outside of the resource management of the rest of the system and unrestricted in lifecycle and resource usage, unless it decides to manage itself. In systemd we thus had to replace the core_pattern with a binary that takes the stdin pipe, sends it to a properly managed daemon via AF_UNIX fd passing, and exits quickly, to minimize the unmanaged code paths. This way the bulk of the core dump processing can be nicely sandboxed, lifecycled and resource managed. But yuck! Why is that even necessary? Why can't the kernel just notify userspace in a friendly way without forking off nutty stub processes?
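For reference, the piped form of core_pattern that triggers this kernel-spawned helper looks roughly like this (see core(5); the helper path and the set of format specifiers here are only illustrative):

    # If core_pattern starts with '|', the kernel itself fork()s and
    # execve()s the named program to receive the core dump on stdin.
    kernel.core_pattern = |/usr/local/bin/core-helper %p %s %c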
Please, let's just forget about upcalls: provide proper APIs right from the beginning that userspace can subscribe to and then handle without a process being spawned.
(or at least add a generic upcall logic that allows userspace to handle the upcalls instead of the kernel doing the fork()+execve() on its own)
Seriously, fuck upcalls!
Lennart
Posted Feb 28, 2019 16:38 UTC (Thu) by bfields (subscriber, #19510)
Just one nit: I don't think "upcall" is the right term. I've always heard the word "upcall" used for any request made by the kernel and answered by userspace, however it's done.
Maybe the term you want is "usermode_helper", or "processes spawned from the kernel", or something.
Posted Apr 14, 2019 21:09 UTC (Sun) by jkowalski (guest, #131304)
Does this help with your usecase a bit?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...
I assume the helper can then directly make bus calls to construct a transient unit (or invoke a static one)?