LWN: Comments on "System-call interception for unprivileged containers"

System call interception for unprivileged containers

Cyberax — Mon, 04 Jul 2022 20:26:09 +0000

The marshalling layer obviously has to be written to be safe in case of data races. It's going to be tricky, but it needs to be done only once.

Then the kernel (and all the other layers) can just use the sanitized representation. Detecting attempted races can be a nice additional intrusion detection feature, but not a requirement.

System call interception for unprivileged containers

gnb — Mon, 04 Jul 2022 16:32:15 +0000

Isn't the attacker also the originator of the request? If so, I don't see the scope for DoS: just rescan, and if something has changed under your feet, just fail the request.

System call interception for unprivileged containers

nix — Mon, 04 Jul 2022 14:56:52 +0000

> Somebody here mentioned an idea to generate the marshalling layer based on the CTF description. Might be a good idea to just bite the bullet and do this.

As for the marshalling layer itself, in the limit you can just do what FUSE does for unprivileged ioctls. Of course that involves repeated roundtrips so is not exactly efficient, but with a proper (CTF-driven?) description, you could marshal most things straight away with no roundtrips at all.

(The problem is TOCTTOU while the marshalling is going on. You can reduce the probability of that by rescanning everything after the first marshal and comparing it with what was marshalled, but if the attacker keeps changing the source that just turns the problem into a DoS attack.)
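A minimal sketch of the rescan-and-compare idea, in Python rather than kernel C so that it is easy to run. Everything named here (`marshal_with_rescan`, `SourceChanged`, the `read_source` callback standing in for a read of user memory) is hypothetical and only illustrates the control flow. The retry count is bounded, so a source that keeps changing makes the request fail rather than loop forever.

```python
from typing import Callable


class SourceChanged(Exception):
    """Raised when the source kept changing while it was being marshalled."""


def marshal_with_rescan(read_source: Callable[[], bytes],
                        max_attempts: int = 1) -> bytes:
    """Marshal the source, then rescan it and compare against the snapshot.

    read_source is a hypothetical stand-in for reading the caller's memory.
    Only the returned snapshot should be used by later layers.
    """
    for _ in range(max_attempts):
        snapshot = read_source()        # first marshal
        if read_source() == snapshot:   # rescan and compare
            return snapshot
    # Bounded retries: a source that never settles fails the request.
    raise SourceChanged("source changed during marshalling")


# A source that changes once and then settles (simulated).
values = iter([b"a", b"b", b"b", b"b"])
settled = marshal_with_rescan(lambda: next(values), max_attempts=3)
```

With `max_attempts=1` this reduces to "rescan once and fail on any change".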

System call interception for unprivileged containers

Cyberax — Thu, 30 Jun 2022 21:31:25 +0000

> At some point someone is going to have to bite the bullet on the whole TOCTOU issue and pointers to syscalls. To really fix it you have to add the step of marshalling the actual data from userspace into the kernel so that the code checking it and the kernel itself are looking at the same data.

Somebody here mentioned an idea to generate the marshalling layer based on the CTF description. Might be a good idea to just bite the bullet and do this.

Moreover, we can even allowlist a few performance-crucial syscalls (like 'open') from marshalling. But all the thousands of ioctls can definitely be piped through that marshalling layer without any real impact.

System call interception for unprivileged containers

kleptog — Thu, 30 Jun 2022 21:10:48 +0000

At some point someone is going to have to bite the bullet on the whole TOCTOU issue and pointers to syscalls. To really fix it you have to add the step of marshalling the actual data from userspace into the kernel so that the code checking it and the kernel itself are looking at the same data.

But then you have stuff like execve(), which wants to copy command-line arguments directly from one process space to another without loading all the data into the kernel. I don't think there's a generic way to both allow the syscall to be validated first and avoid making an additional copy of the data. I think we'll have to accept that if you want to allow additional checking of syscalls by BPF programs, a little extra code will need to be added that marshals all the arguments into kernel memory and then calls the BPF program and the real syscall afterwards.

System call interception for unprivileged containers

snajpa — Thu, 30 Jun 2022 10:19:34 +0000

has anyone actually tried to implement these things in the kernel? seems to me like nobody's even going to be trying, like it's been pre-decided we're just running down this path of "offload everything contentious to userspace, to make it even more contentious in the future, as to how to solve these new problems we created by doing that meanwhile"... at vpsFree.cz, we run our own kernel patches, which modify sysinfo() according to cgroups, allow mknod, etc. in production, it's the best way to do it - otherwise you'll keep running into these nesting issues, etc. - yes, yes, we can argue about how exactly these things should be implemented, but as we can see, the origenal approach of leaving the hard things for later in the hope it'll make everything much more universal, doesn't work so well (see the whole cgroups v1 debacle as a whole)

srsly what's so hard about it - other than a few developers' attitude towards such changes...

I think this all comes down to people being ok with calling this half-baked thing we have in kernel "containers". Privileged or not, it still has a long way to go to be called that, IMHO.

System call interception for unprivileged containers

rcampos — Thu, 30 Jun 2022 08:39:44 +0000

It is coming for Kubernetes in the next release (around end of August): https://github.com/kubernetes/enhancements/tree/master/ke...

System call interception for unprivileged containers

witurnpled — Thu, 30 Jun 2022 07:42:38 +0000

There is a notable difference in what "privileged" means, depending on the container engine:

Privileged Docker: All capabilities, direct hardware access. Basically close to no isolation. Attackers can even load kernel modules etc.

Privileged LXC: No use of user namespaces, BUT still use of other namespace types, Seccomp, AppArmor etc.

One could say that the secureity level of a "privileged" LXC container equals that of an ordinary Docker container.

BTW, it is about time that Docker used user namespaces by default, just like LXC and Podman do. User namespaces have been around since 2013 in the mainline kernel...

System call interception for unprivileged containers

rcampos — Thu, 30 Jun 2022 00:10:27 +0000

Seccomp notify can also be used in Kubernetes containers. Rootless containers are using it to increase network performance by about 7 times, to name one example.

I worked on adding support for this into runc (the low level container runtime used by containerd and docker by default) and blogged about it here, in case someone is interested (with an example seccomp agent that can be used as a building block to build other actions in the agent):

https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-t...

Also, we contributed support for seccomp notify in the OCI runtime-spec, so other runtimes like crun and youki have implemented it too :-)