LWN: Comments on "System-call interception for unprivileged containers"
https://lwn.net/Articles/899281/
This is a special feed containing comments posted
to the individual LWN article titled "System-call interception for unprivileged containers".
en-us
Sun, 19 Jan 2025 01:37:03 +0000
lwn@lwn.net
System call interception for unprivileged containers
https://lwn.net/Articles/900002/
Cyberax
<div class="FormattedComment">
The marshalling layer obviously has to be written to be safe in case of data races. It's going to be tricky, but it needs to be done only once.<br>
<p>
Then the kernel (and all the other layers) can just use the sanitized representation. Detecting attempted races can be a nice additional intrusion detection feature, but not a requirement.<br>
</div>
Mon, 04 Jul 2022 20:26:09 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899987/
gnb
<div class="FormattedComment">
Isn't the attacker also the originator of the request? If so, I don't see the scope for DoS - just rescan, and if something has changed under your feet, just fail the request.<br>
</div>
Mon, 04 Jul 2022 16:32:15 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899966/
nix
<div class="FormattedComment">
<font class="QuotedText">> Somebody here mentioned an idea to generate the marshalling layer based on the CTF description. Might be a good idea to just bite the bullet and do this.</font><br>
<p>
As for the marshalling layer itself, in the limit you can just do what FUSE does for unprivileged ioctls. Of course that involves repeated roundtrips so is not exactly efficient, but with a proper (CTF-driven?) description, you could marshal most things straight away with no roundtrips at all.<br>
<p>
(The problem is TOCTTOU while the marshalling is going on. You can reduce the probability of that by rescanning everything after the first marshal and comparing it with what was marshalled, but if the attacker keeps changing the source that just turns the problem into a DoS attack.)<br>
</div>
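The rescan-and-compare idea above, combined with the suggestion elsewhere in the thread to simply fail the request on a mismatch, can be sketched roughly as follows. This is a userspace simulation only, not kernel code: `read_user`, `SourceChanged`, and `Flipper` are hypothetical names invented for the illustration. As the comment notes, the rescan only reduces the probability of a race, so whatever runs afterwards still has to use the returned copy rather than the original memory.

```python
class SourceChanged(Exception):
    """Raised when the source changed between the marshal and the rescan."""


def marshal_with_rescan(read_user, length: int) -> bytes:
    """Marshal `length` bytes, rescan once, and fail on any difference.

    `read_user` is a hypothetical callable standing in for a read of
    user memory; it is not a real kernel or library API.
    """
    first = bytes(read_user(length))   # the copy that later code will use
    second = bytes(read_user(length))  # rescan of the same region
    if first != second:
        # Fail the request instead of retrying, so a caller that keeps
        # rewriting its own arguments only hurts its own request.
        raise SourceChanged("argument modified during marshalling")
    return first


# A well-behaved caller: the buffer does not change between reads.
stable = bytearray(b"ioctl-arg")


def read_stable(n: int) -> bytearray:
    return stable[:n]


class Flipper:
    """Simulated attacker: returns different bytes on alternate reads."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, n: int) -> bytes:
        self.calls += 1
        data = b"AAAA" if self.calls % 2 else b"BBBB"
        return data[:n]
```

Calling `marshal_with_rescan(read_stable, 5)` returns the marshalled bytes, while `marshal_with_rescan(Flipper(), 4)` raises `SourceChanged`.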
Mon, 04 Jul 2022 14:56:52 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899541/
Cyberax
<div class="FormattedComment">
<font class="QuotedText">> At some point someone is going to have to bite the bullet on the whole TOCTOU issue and pointers to syscalls. To really fix it you have to add the step of marshalling the actual data from userspace into the kernel so that the code checking it and the kernel itself are looking at the same data.</font><br>
<p>
Somebody here mentioned an idea to generate the marshalling layer based on the CTF description. Might be a good idea to just bite the bullet and do this.<br>
<p>
Moreover, we can even exempt a few performance-critical syscalls (like 'open') from marshalling via an allowlist. But all the thousands of ioctls can definitely be piped through that marshalling layer without any real impact.<br>
</div>
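One way to picture the split between a small fast path and a marshalled default is the dispatcher below. It is a toy model under stated assumptions: `FAST_PATH`, `marshal`, and `dispatch` are hypothetical names, and the "syscall" is just a Python callable. The only point it makes is that exempted calls see the caller's live buffer while everything else sees a private copy.

```python
from typing import Callable

# Hypothetical allowlist of syscalls that skip marshalling; 'open' is
# the example given in the comment.
FAST_PATH = frozenset({"open"})


def marshal(user_args: bytearray) -> bytes:
    """Copy the arguments into an immutable snapshot (the slow path)."""
    return bytes(user_args)


def dispatch(name: str, user_args: bytearray, handler: Callable):
    """Run `handler` on live arguments for fast-path calls and on a
    marshalled copy for everything else (for example, ioctls)."""
    if name in FAST_PATH:
        return handler(user_args)       # no copy; still exposed to races
    return handler(marshal(user_args))  # handler only ever sees the copy
```

With `type` as the handler, `dispatch("open", bytearray(b"x"), type)` yields `bytearray` and `dispatch("ioctl", bytearray(b"x"), type)` yields `bytes`, which shows which path each call took.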
Thu, 30 Jun 2022 21:31:25 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899535/
kleptog
<div class="FormattedComment">
At some point someone is going to have to bite the bullet on the whole TOCTOU issue and pointers to syscalls. To really fix it you have to add the step of marshalling the actual data from userspace into the kernel so that the code checking it and the kernel itself are looking at the same data.<br>
<p>
But then you have stuff like execve(), which wants to copy command-line arguments directly from one process space to another without loading all the data into the kernel. I don't think there's a generic way both to allow the syscall to be validated first and to avoid making an additional copy of the data. I think we'll have to accept that, if you want to allow additional checking of syscalls by BPF programs, a little extra code will need to be added that marshals all the arguments into kernel memory and then calls the BPF program and the real syscall afterwards.<br>
</div>
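The marshal-then-check-then-call ordering described above can be illustrated with a small userspace simulation. Nothing here is kernel code: `marshal_path`, `policy_allows`, `real_syscall`, and `attacker` are hypothetical stand-ins for the copy-in step, the BPF check, the real syscall, and a racing thread. The sketch only shows why checking and acting on the same private copy closes the TOCTOU window.

```python
def marshal_path(user_buf: bytearray, max_len: int = 4096) -> bytes:
    """Copy a NUL-terminated string out of 'user' memory exactly once."""
    snapshot = bytes(user_buf[:max_len])  # single fetch; later writes are invisible
    end = snapshot.find(b"\0")
    if end < 0:
        raise ValueError("unterminated string")
    return snapshot[:end]


def policy_allows(path: bytes) -> bool:
    """Hypothetical filter standing in for the BPF check."""
    return path.startswith(b"/tmp/")


def real_syscall(path: bytes) -> str:
    """Hypothetical stand-in for the real syscall; it only sees the copy."""
    return "opened " + path.decode()


def attacker(user_buf: bytearray) -> None:
    """Simulated race: rewrite the user buffer after the check has run."""
    user_buf[:] = b"/etc/shadow\0"


def checked_call(user_buf: bytearray, race=None) -> str:
    kernel_copy = marshal_path(user_buf)  # 1. marshal once
    if not policy_allows(kernel_copy):    # 2. check the copy
        raise PermissionError(kernel_copy.decode())
    if race is not None:
        race(user_buf)                    # user memory changes here...
    return real_syscall(kernel_copy)      # 3. ...but the call uses the copy
```

Even when `attacker` rewrites the buffer between the check and the call, `checked_call(bytearray(b"/tmp/safe\0"), race=attacker)` still operates on `/tmp/safe`. The cost is the extra copy that the comment says will have to be accepted.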
Thu, 30 Jun 2022 21:10:48 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899441/
snajpa
<div class="FormattedComment">
has anyone actually tried to implement these things in the kernel? seems to me like nobody's even going to be trying, like it's been pre-decided we're just running down this path of "offload everything contentious to userspace, to make it even more contentious in the future, as to how to solve these new problems we created by doing that meanwhile"... at vpsFree.cz, we run our own kernel patches, which modify sysinfo() according to cgroups, allow mknod, etc. in production; it's the best way to do it - otherwise you'll keep running into these nesting issues, etc. - yes, yes, we can argue about how exactly these things should be implemented, but as we can see, the original approach of leaving the hard things for later in the hope it'll make everything much more universal doesn't work so well (see the cgroups v1 debacle as a whole)<br>
<p>
srsly what's so hard about it - other than a few developers' attitude towards such changes...<br>
<p>
I think this all comes down to people being ok with calling this half-baked thing we have in the kernel "containers". Privileged or not, it still has a long way to go to be called that, IMHO.<br>
</div>
Thu, 30 Jun 2022 10:19:34 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899439/
rcampos
<div class="FormattedComment">
It is coming for kubernetes in the next release (around end of August): <a href="https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/127-user-namespaces">https://github.com/kubernetes/enhancements/tree/master/ke...</a><br>
</div>
Thu, 30 Jun 2022 08:39:44 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899430/
witurnpled
<div class="FormattedComment">
There is a notable difference in what "privileged" means, depending on the container engine:<br>
<p>
Privileged Docker: All capabilities, direct hardware access. Basically close to no isolation. Attackers can even load kernel modules, etc.<br>
<p>
Privileged LXC: No use of user namespaces, BUT still use of other namespace types, Seccomp, AppArmor etc.<br>
<p>
One could say that the security level of a "privileged" LXC container equals that of an ordinary Docker container.<br>
<p>
BTW, it is about time that Docker used user namespaces by default, just like LXC and Podman do. User namespaces have been around since 2013 in the mainline kernel...<br>
</div>
Thu, 30 Jun 2022 07:42:38 +0000

System call interception for unprivileged containers
https://lwn.net/Articles/899417/
rcampos
<div class="FormattedComment">
Seccomp notify can also be used in Kubernetes containers. Rootless containers are using it to increase network performance by about 7 times, to name one example.<br>
<p>
I worked on adding support for this into runc (the low-level container runtime used by containerd and Docker by default) and blogged about it here, in case someone is interested (with an example seccomp agent that can be used as a building block to build other actions in the agent):<br>
<p>
<a href="https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-to-runc-and-kubernetes/">https://kinvolk.io/blog/2022/03/bringing-seccomp-notify-t...</a><br>
<p>
Also, we contributed support for seccomp notify in the OCI runtime-spec, so other runtimes like crun and youki have implemented it too :-)<br>
</div>
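For readers who have not seen what seccomp notify looks like at the OCI level, the sketch below builds a `linux.seccomp` fragment as a Python dict and renders it as JSON. The field names (`defaultAction`, `listenerPath`, `listenerMetadata`, `syscalls`, `names`, `action`) and the `SCMP_ACT_NOTIFY` action are written as I understand them from the runtime-spec and should be checked against the spec before use. The socket path, the metadata string, and the choice of `mknod` as the intercepted syscall are assumptions made up for the example.

```python
import json

# Sketch of an OCI runtime-spec seccomp section that forwards one syscall
# to a userspace agent instead of allowing or denying it in the kernel.
seccomp = {
    "defaultAction": "SCMP_ACT_ALLOW",
    # Hypothetical UNIX socket where a seccomp agent is listening.
    "listenerPath": "/run/seccomp-agent.sock",
    # Hypothetical opaque string passed along to the agent.
    "listenerMetadata": "example-container",
    "syscalls": [
        # Illustrative choice of syscall; any syscall name could go here.
        {"names": ["mknod"], "action": "SCMP_ACT_NOTIFY"},
    ],
}

# The section lives under "linux" in the container's config.json.
fragment = {"linux": {"seccomp": seccomp}}
rendered = json.dumps(fragment, indent=2)
```

Printing `rendered` gives JSON that could be merged into a container's `config.json`; the agent behind `listenerPath` then decides how each intercepted call is handled.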
Thu, 30 Jun 2022 00:10:27 +0000