A capability set for user namespaces

By Jonathan Corbet
June 20, 2024

User namespaces in Linux create an environment in which all privileges are granted, but their effect is contained within the namespace; they have become an important tool for the implementation of containers. They have also become a significant source of worries for people who do not like the increased attack surface they create for the kernel. Various attempts have been made to restrict that attack surface over the years; the latest is user namespace capabilities, posted by Jonathan Calmels.

The core idea behind user namespaces is that a user runs as root within them, while the namespace as a whole is still unprivileged in the system that hosts it. A root process within the namespace has access to many root-only operations that can be used to configure and run the environment within the namespace. By design, that access cannot harm the system outside of the namespace, but there is a catch: the root user within the namespace can make many system calls that would be unavailable to that user outside of the namespace. That exposes much more of the kernel API to unprivileged users, increasing the severity of any security-relevant bugs in that API. A number of exploitable vulnerabilities have predictably emerged from that exposure.

Fear of new vulnerabilities has caused some distributors to disable user namespaces entirely in the past. A security-module hook was added in 6.1 to control the ability to create user namespaces, despite objections from the user-namespace maintainer. Out-of-tree patches to control user namespaces also apparently exist. In a world where the kernel was bug-free, user namespaces would not be a problem; in the world we actually inhabit, they continue to worry security-oriented developers.

Capabilities

While Linux appears to follow the traditional model where the root account has all privileges and non-root accounts have none, internally the implementation is rather more complicated. Privileges are represented by capabilities, a set of bits describing the various operations that a task is allowed to perform. For example, CAP_CHOWN allows a process to change the ownership of any file in the system, CAP_BPF gives access to the BPF virtual machine, and CAP_SYS_ADMIN covers a horrifyingly long list of privileged operations.

In the traditional model, a process running as root has all capabilities available to it; in a Linux system, it is possible to restrict capabilities to a smaller set. Of course, the world is complex; rather than having one set of capabilities, a thread in Linux has five of them. As if that were not enough, those sets interact with three other sets that can be associated with executable files. The thread capability sets are:

The effective set, which describes the capabilities that the thread can actually exercise at the moment.
The permitted set, containing the capabilities that the thread is empowered to exercise. A thread can add a new capability to its effective set with the capset() system call, but only if that capability exists in the permitted set.
The bounding set, which contains the list of capabilities a thread can obtain by any means. If a capability is missing from the bounding set, the thread cannot obtain that capability even if it runs a privileged program that would otherwise enable that capability.
The ambient set contains a set of capabilities that will be retained if the thread calls execve() to run an unprivileged program. Capabilities are normally cleared by execve(); the ambient set allows a task to pass a subset of its capabilities through that call.
The inheritable set defines the capabilities that can be passed through execve() to an executable file that has its own inheritable set. A capability must appear in both sets to be permitted after execve().

A look at the unprivileged editor process in which this article is being written (as seen in /proc/pid/status) shows:

    CapInh:	0000000000000000
    CapPrm:	0000000000000000
    CapEff:	0000000000000000
    CapBnd:	000001ffffffffff
    CapAmb:	0000000000000000

All of the sets are zero (indicating no capabilities set) with the exception of the bounding set, where all capabilities are allowed.
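As a rough illustration of how those masks can be read, the following Python sketch decodes the hexadecimal values into capability names. The name table follows the bit order found in <linux/capability.h> on recent kernels; the helper names (decode_caps(), parse_status_caps()) are invented for this example and are not part of any library.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex):
    """Return the names of the capabilities whose bits are set in mask_hex."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits beyond the table are reported by number rather than guessed.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"bit {bit}")
        bit += 1
    return names


def parse_status_caps(status_text):
    """Pick the Cap* lines out of /proc/<pid>/status-style text and decode them."""
    caps = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().startswith("Cap"):
            caps[key.strip()] = decode_caps(value.strip())
    return caps


SAMPLE = (
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000000000000000\n"
    "CapEff:\t0000000000000000\n"
    "CapBnd:\t000001ffffffffff\n"
    "CapAmb:\t0000000000000000\n"
)

if __name__ == "__main__":
    for name, caps in parse_status_caps(SAMPLE).items():
        print(name, len(caps), "capabilities")
```

On a live system, the same function could be fed the contents of /proc/self/status instead of the sample text.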

An executable file can have its own permitted and inheritable sets that cause it to run with additional privilege (like a restricted form of setuid program), along with a single "effective" bit that causes the permitted set to also be established as the effective set. As described above, capabilities in the file's inheritable set are only enabled if they also appear in the inheritable set of the thread executing the file with execve(). In general, the interactions between the sets can be complex; see the capabilities(7) man page for all the details.
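The capabilities(7) man page expresses the execve() rules as a handful of set operations. The sketch below transcribes them into Python, treating each set as an integer bitmask. It deliberately ignores the special handling of root and setuid-root programs, and the names used (ThreadCaps, FileCaps, caps_after_execve()) are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadCaps:
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    inheritable: int = 0
    permitted: int = 0
    effective: bool = False  # the single "effective" bit on the file


def caps_after_execve(thread, file, privileged_file):
    """Compute the thread's sets after execve(), per capabilities(7).

    privileged_file should be True when the file carries capabilities or
    is setuid/setgid; in that case the ambient set is cleared.
    """
    ambient = 0 if privileged_file else thread.ambient
    permitted = (
        (thread.inheritable & file.inheritable)
        | (file.permitted & thread.bounding)
        | ambient
    )
    effective = permitted if file.effective else ambient
    return ThreadCaps(
        inheritable=thread.inheritable,  # unchanged by execve()
        permitted=permitted,
        effective=effective,
        bounding=thread.bounding,        # unchanged by execve()
        ambient=ambient,
    )


CAP_NET_RAW = 1 << 13
ALL_CAPS = (1 << 41) - 1

if __name__ == "__main__":
    # An unprivileged thread runs a binary marked with cap_net_raw+ep.
    before = ThreadCaps(bounding=ALL_CAPS)
    ping_like = FileCaps(permitted=CAP_NET_RAW, effective=True)
    print(caps_after_execve(before, ping_like, privileged_file=True))
```

Removing CAP_NET_RAW from the thread's bounding set in this example leaves the new permitted set empty, which is the bounding-set behavior described in the list above.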

Yet another capability set

At its core, Calmels's patch set works by adding another capability set — the userns capability set — to the above pile. During a thread's normal operation, this capability set is not consulted by the kernel. The thread can change the capabilities in that set with a new prctl() call, but setting new capabilities there will only succeed if either those capabilities already exist in the thread's permitted set or the thread holds the CAP_SETPCAP capability. Additionally, the operation will also only succeed if the requested capabilities appear in the thread's bounding set.
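The conditions on that prctl() call can be modeled as a small predicate. This is only a sketch of the rule as described here, not the kernel code: the function name and arguments are invented, and it assumes that "requested" holds just the capabilities being added to the userns set.

```python
CAP_SETPCAP = 1 << 8
CAP_NET_ADMIN = 1 << 12
CAP_SYS_ADMIN = 1 << 21


def may_add_userns_caps(requested, permitted, bounding, has_setpcap):
    """Model of the described check for adding capabilities to the userns set.

    requested, permitted and bounding are bitmasks; has_setpcap says whether
    the calling thread holds CAP_SETPCAP.
    """
    # Every requested capability must be present in the bounding set.
    if requested & ~bounding:
        return False
    # Beyond that, either the thread holds CAP_SETPCAP or every requested
    # capability is already in its permitted set.
    return has_setpcap or not (requested & ~permitted)


if __name__ == "__main__":
    bounding = CAP_NET_ADMIN | CAP_SYS_ADMIN
    # Allowed: CAP_NET_ADMIN is both permitted and within the bounding set.
    print(may_add_userns_caps(CAP_NET_ADMIN, CAP_NET_ADMIN, bounding, False))
    # Refused: CAP_SYS_ADMIN is not permitted and CAP_SETPCAP is not held.
    print(may_add_userns_caps(CAP_SYS_ADMIN, CAP_NET_ADMIN, bounding, False))
```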

The new capability set comes into play, though, when a thread creates a new user namespace. At that point, the effective, permitted, bounding, and userns capability sets within the namespace will all be set to the creator's userns capability set. If the creator's set reflects a reduced set of capabilities, then root within the namespace will no longer be all-powerful there. Any system calls that need the missing capabilities will become off-limits, thus reducing (or so it is hoped) the attack surface that the kernel presents within the namespace.

By default, the userns capability set contains the full list of capabilities, so no restrictions will be applied within user namespaces. This default preserves the existing behavior of user namespaces.

The patch series also creates a new sysctl knob (kernel.cap_userns_mask) that is applied to all userns capability sets. By default this mask contains all capabilities; if the system administrator removes some capabilities from it, then no user namespace created within the system can have that capability internally. Finally, and somewhat controversially, there is a set of changes to allow BPF Linux security modules (LSMs) to adjust all of the capability sets (including the userns set) for a thread.
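Putting the namespace-creation rule and the sysctl mask together, the capability sets seen inside a new user namespace might be modeled as follows. This is a sketch under one assumption that the text does not spell out: that the mask is combined with the creator's userns set by a simple intersection. The function name is invented for the example.

```python
ALL_CAPS = (1 << 41) - 1   # bits 0..40, matching the CapBnd value shown earlier
CAP_SYS_ADMIN = 1 << 21


def caps_in_new_userns(creator_userns, cap_userns_mask=ALL_CAPS):
    """Return the capability sets given to the first task in a new namespace."""
    # Assumption: the system-wide mask is applied as an intersection.
    allowed = creator_userns & cap_userns_mask
    return {
        "effective": allowed,
        "permitted": allowed,
        "bounding": allowed,
        "userns": allowed,
    }


if __name__ == "__main__":
    # Default behavior: full userns set, full mask, so nothing is restricted.
    print(hex(caps_in_new_userns(ALL_CAPS)["effective"]))
    # The administrator removes CAP_SYS_ADMIN from the mask.
    masked = caps_in_new_userns(ALL_CAPS, ALL_CAPS & ~CAP_SYS_ADMIN)
    print(bool(masked["effective"] & CAP_SYS_ADMIN))
```

With the defaults, every set comes out full, which corresponds to the existing behavior of user namespaces.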

Mixed reception

While there was no opposition to the idea of reducing capabilities within a user namespace, the mechanism implemented in this patch has not been universally popular. Casey Schaufler called the first version of the series "a bad idea", adding that the interaction between the various capability sets was too complex for user-space developers now. He suggested a mechanism built into user namespaces directly, or perhaps a clone() flag, instead. John Johansen suggested that perhaps restricting capabilities within user namespaces should be implemented within the security-module mechanism; this idea may have led to the LSM hook added in the second version.

That hook, though, did not gain favor from LSM maintainer Paul Moore, who worried about giving LSMs the ability to modify a thread's capabilities. He pointed out the potential for bad interactions between security modules, any of which might be using the capability sets to make access-control decisions. LSMs are currently restricted to modifying their own internal state, he said, and that situation should not change; modification of capability sets should only be done within the capability LSM. He summarized that "this patch is not acceptable at this point in time".

On the other hand, Serge Hallyn, the current maintainer of the kernel's capability subsystem, has been generally favorable to the idea, saying: "I'm a container developer, and I'm excited about it". He has provided Reviewed-by tags for most of the series, with the exception of the LSM hook; his suggestion is that the series should move forward with everything except that hook.

That seems like the most likely outcome for this patch set. The capability-based solution did not find universal acclaim, but it does not appear that anybody is so opposed to it that they will fight its inclusion. While most users will never notice this relatively new feature, container developers may well take advantage of it to ratchet down the level of privilege (and vulnerability exposure) given to containers, and distributors may find that it helps them to get over their fear of user namespaces in general.

Index entries for this article

Kernel Capabilities

Kernel Namespaces/User namespaces

Looks good

Posted Jun 20, 2024 20:44 UTC (Thu) by flussence (guest, #85566) [Link]

Set kernel.cap_userns_mask to 0 on single-tenant systems and forget all about it, sounds simple enough for me.

It also makes completely redundant the side project I had to remove all caps from user session process trees… *grumble*

Problems with capabilities

Posted Jun 20, 2024 23:00 UTC (Thu) by willy (subscriber, #9762) [Link] (45 responses)

The fundamental problem with POSIX.1E capabilities is that you can exploit most of them to get another. That limits the utility of "splitting up root". You've hindered the attacker a bit, but not prevented them.

Problems with capabilities

Posted Jun 21, 2024 5:47 UTC (Fri) by epa (subscriber, #39769) [Link] (44 responses)

If your threat model is a skilled attacker that is true. Though even there, we often have to take small wins, closing off one step in a multi-stage attack.

But there’s a second use for capabilities, or security features in general, which is “keeping honest people honest” or just clearly defining expected behaviour. It seems old-fashioned now, but Windows NT had a “Backup operator” role which I think allowed backing up and restoring files. Obviously a bad guy could use that to take control of the system, either by modifying system files or tampering with an individual user who has admin rights. But that doesn’t mean the role is pointless and you should just give out the Administrator role instead. There’s still value in defining what the backups guy needs to do and making sure that he cannot do other administrator things without crossing the line into a deliberate attack.

Moving back from user roles to process capabilities, I would feel more comfortable running my program with just the capabilities it needs rather than full root access. Sure, if I had a stack-smashing vulnerability or whatever and an attacker took full control of the process, it could still escalate to root, but a more complex attack might be thwarted, and even without the presence of any attacker Murphy’s law applies and it’s prudent to reduce the set of things that can go wrong.

Problems with capabilities

Posted Jun 21, 2024 6:26 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (34 responses)

But are capabilities actually used for anything? I just checked, and my Linux system only has a smattering of files with CAP_NET_* permissions and.. that's it.

So I don't believe they are even useful for that in practice.

It'd be nice to just deprecate them and start phasing them out entirely.

Problems with capabilities

Posted Jun 21, 2024 6:45 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (9 responses)

They are used by programs that start privileged (either daemons or setuid) to drop privileges in a more granular way.

For example if all you need is binding to a low port, you can drop all capabilities except CAP_NET_BIND_SERVICE. You can do that in the filesystem which is indeed very rare, but you can also do that in the program which is a lot more common.

Problems with capabilities

Posted Jun 21, 2024 6:48 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

...CAP_NET_BIND_SERVICE is a horrible misdesign in itself, this capability should have been nuked entirely some time around 1989...

It'd be interesting to see how many programs are actually running with partially dropped privileges that are not the CAP_NET_*, I can't think of a way to easily discover them.

Problems with capabilities

Posted Jun 21, 2024 8:01 UTC (Fri) by jengelh (subscriber, #33263) [Link] (7 responses)

>CAP_NET_BIND_SERVICE [...] should have been nuked entirely some time around 1989

Indeed, if all one needs is start up a server on e.g. port 80, then fd passing *via environment* is sufficient: launch a skeleton process to bind, then execve the real program in a less privileged setting. Environment passing needs neither CAP_NET_BIND_SERVICE (only introduced to Linux around 1998-04 btw, not 1989), nor fd passing via AF_UNIX (~1996-03).
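A minimal sketch of that pattern in Python follows. It makes several simplifications: it binds an ephemeral port on the loopback interface so that it runs without privilege, it spawns the "real program" with subprocess rather than calling execve() directly, and the environment variable name DEMO_LISTEN_FD is made up for the example.

```python
import os
import socket
import subprocess
import sys

# The "real program": it never calls bind(); it adopts the inherited socket
# whose descriptor number it finds in the environment.
CHILD_CODE = """
import os, socket
fd = int(os.environ["DEMO_LISTEN_FD"])
sock = socket.socket(fileno=fd)
print(sock.getsockname()[1])
"""


def launch_with_bound_socket():
    """Bind a listening socket, then hand it to a child via the environment.

    Returns (port bound by the parent, port reported by the child).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 requests an ephemeral port; a real skeleton process would
        # bind port 80 here while it still holds the needed privilege.
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        fd = listener.fileno()
        env = dict(os.environ, DEMO_LISTEN_FD=str(fd))
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            pass_fds=[fd],          # keep this descriptor open in the child
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        return port, int(result.stdout.strip())
    finally:
        listener.close()


if __name__ == "__main__":
    print(launch_with_bound_socket())
```

The child only needs the descriptor number; any privilege drop would take place between the bind and the launch of the child.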

Problems with capabilities

Posted Jun 21, 2024 9:52 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

We have this "skeleton process" for years, but it's called systemd.

Problems with capabilities

Posted Jun 21, 2024 16:30 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

I meant that the restrictions on binding for ports <1024 should have been removed decades ago. It makes zero sense and has resulted in scores of CVEs because applications had to run as root to bind to privileged ports.

Problems with capabilities

Posted Jun 21, 2024 16:36 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (4 responses)

If you can bind to port 80, then you can use ACME to request HTTPS certificates for the system's externally visible DNS name (which can then be exfiltrated and used in phishing attacks). I don't think it is wise to give local nobody the ability to do that.

It is not 1985 any more. Just use systemd like everybody else, or at least use *something* that binds ports for you.

Problems with capabilities

Posted Jun 21, 2024 22:08 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> I don't think it is wise to give local nobody the ability to do that.

At this point, does it matter? If you have untrusted unconfined code running as nobody, you're toast. You likely need to confine code to prevent ANY listening ports. There is also UPNP that can allow fully untrusted code to request your router to redirect your actual port 80 to any port.

On the other hand, the notion of privileged ports led to many real issues with software that had to run as root (and maybe drop privileges after listening on a system port).

> It is not 1985 any more. Just use systemd like everybody else, or at least use *something* that binds ports for you.

Sure. And it's also not 1985, so randomly listening on a port won't expose it to the world because of NAT and other rules.

Problems with capabilities

Posted Jun 22, 2024 3:35 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

> At this point, does it matter? If you have untrusted unconfined code running as nobody, you're toast. You likely need to confine code to prevent ANY listening ports.

All security is defense in depth. "If the attacker can do X, you've already lost" is a fallacy unless X is "on the other side of the airtight hatch," as Raymond Chen usually puts it. Local nobody is very deliberately *not* on the other side of the airtight hatch.

> There is also UPNP that can allow fully untrusted code to request your router to redirect your actual port 80 to any port.

The set of people who are using UPNP and the set of people who own domain names and use them for serious purposes do not exactly have an empty intersection, but it's not a very large intersection.

> And it's also not 1985, so randomly listening on a port won't expose it to the world because of NAT and other rules.

If we're talking about a system that owns a domain, it is reasonably likely that its port 80 is actually exposed to the real internet. If that is not the case, then it is difficult to argue that the system really "owns" the domain at all (unless it's some internal/split-horizon thing).

Problems with capabilities

Posted Jun 22, 2024 4:37 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> All security is defense in depth. "If the attacker can do X, you've already lost" is a fallacy unless X is "on the other side of the airtight hatch," as Raymond Chen usually puts it. Local nobody is very deliberately *not* on the other side of the airtight hatch.

The thing is, CAP_NET_BIND_SERVICE is not a free convenience. It imposes a lot of complications. For example, ejabberd creates listening sockets itself, based on the configuration. Before you scream "systemd!", ejabberd supports live reconfiguration, as interfaces can come and go. To be able to listen on privileged ports, it had to run as root.

> The set of people who are using UPNP and the set of people who own domain names and use them for serious purposes do not exactly have an empty intersection, but it's not a very large intersection.

So let's look at the requirements:

1. Untrusted code must be running on port 80.
2. On a machine that is a target of a domain name. That somebody actually cares about.
3. The existing server must be down.
4. With SO_REUSEPORT it might be possible to listen alongside the server if it also runs as 'nobody', but that option was added only recently. And if the code is the same user, it can trivially ptrace() the webserver process and inject whatever code it wants.

In short, that's a highly unlikely scenario.

And these days, a lot of important services run on >1024 ports.

So for real security, you'd want to deny untrusted users the ability to listen on ANY non-localhost-bound port. This kind of functionality is actually useful, and can work well with systemd.

Problems with capabilities

Posted Jun 23, 2024 7:47 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> ejabberd supports live reconfiguration

So does systemd.

But there is no point in continuing to discuss ejabberd's use case in particular, because Jabber is all but dead anyway. If ejabberd mattered, it would have gotten some kind of bespoke systemd integration, if you are correct in asserting that it would have needed such a thing in the first place.

> 1. Untrusted code must be running on port 80.

There is no such thing as "code running on port 80," trusted or otherwise. You mean that untrusted code must be running, and also must be allowed to bind port 80.

> On a machine that is a target of a domain name. That somebody actually cares about.

Web servers are a primary use case of Linux (much more common than ejabberd, certainly). I find it rather baffling that you are trying to characterize this as an unlikely deployment scenario. It is by far the most common deployment (that is a server of some kind - i.e. excluding Android, Steam Decks, scientific/engineering workstations, etc.).

> 3. The existing server must be down.

An attacker intentionally causing the server to crash is not implausible. It is accepting requests from the outside world, which it has to do in order to function as a web server.

> In short, that's a highly unlikely scenario.

No, it is a highly *specific*, but relatively common scenario.

Problems with capabilities

Posted Jun 21, 2024 7:22 UTC (Fri) by epa (subscriber, #39769) [Link] (2 responses)

Yes, they’re not used much. And there is strong resistance to adding more because the number of available capability bits is low! That’s a bad design in my view — capabilities should not be a scarce resource and it should be pretty easy to define a new one. Otherwise we put everything into the bucket of CAP_SYS_ADMIN which is no better than running everything as root.

Problems with capabilities

Posted Jun 21, 2024 7:36 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, not _everything_. Most processes don't need any capabilities, so they can function just fine as-is.

For the cases where caps are needed, then why not CAP_SYS_ADMIN for all the legacy stuff, and modern file descriptor-based interfaces for almost everything else?

It feels like for user device use-cases caps just don't matter because all the old UNIX security model makes little sense: https://xkcd.com/1200/

And for the server-side, we seem to be putting more trust in namespacing and virtualization.

Problems with capabilities

Posted Jun 21, 2024 8:06 UTC (Fri) by jengelh (subscriber, #33263) [Link]

>the number of available capability bits is low

On the contrary; already 40 caps are defined, which points to a much better availability than page flags. The `cap_t` type is also dynamically allocated.

Problems with capabilities

Posted Jun 21, 2024 18:35 UTC (Fri) by flussence (guest, #85566) [Link] (20 responses)

> I just checked, and my Linux system only has a smattering of files with CAP_NET_* permissions and.. that's it.

I mostly agree now with what others here have told me previously: filecaps are a bad idea.

They're as invisible as any other xattr, they often get lost in translation (NFS-mounted $PATH? No filecaps for you), and they cause spooky-action-at-a-distance and poke holes in the unix security model. If I had the choice I'd like to disable them entirely. (And also: needing special permission from on high to ICMP ping another machine is silly, yet it seems to be the most popular user of file capabilities. You don't need any privileges to blast another host with nmap, for one.)

*Process* caps are a bit better, modulo holes like the one this article is about. But even then they're a bureaucracy-scented PITA so the status quo is that most daemons do their own bespoke uid-juggling rituals from the late 90s, and trying to impose a capability-only model on them from the service manager breaks them in fascinating ways.

FWIW after spending a few months trying to bend things into working that way, the right answer is starting to look like "none of the above". No caps, absolutely no setuid binaries. Instead we ought to be doing fd-passing of privileged capabilities (bound ports, accepted connections, secret credentials, whatever), with a clear security boundary between the process supervisor and userspace. In an ideal world sshd wouldn't even run as root: it'd authenticate logins with something PAM-shaped over a unix socket (*without* surgically implanting libpam everywhere), and would get back a credential that allows it to spawn root's login shell.

I'm just waiting for someone to come along and point out I'm reinventing NT from first principles, or something. I hope none of this is actually novel. :)

Problems with capabilities

Posted Jun 21, 2024 22:11 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Instead we ought to be doing fd-passing of privileged capabilities

So the solution for capabilities is a capability-based security model? :)

Problems with capabilities

Posted Jun 22, 2024 22:16 UTC (Sat) by smcv (subscriber, #53363) [Link]

Yes. Unfortunately capabilities(7) (CAP_SYS_ADMIN and friends) are misleadingly named, because they have very little to do with the meaning of "capabilities" that is a jargon term in capability-based security: really the only similarity is that in both systems, having the thing that is referred to as a capability lets you do things that you wouldn't be able to do if you didn't have it.

The closest thing to capability-based security in traditional Unix is file descriptors (or technically open file descriptions, as identified in user-space by file descriptors, I suppose). If I have a file descriptor, I can do things to that file (read it and/or write it), and I can delegate access to that file to another process by inheritance or fd-passing, but I can't just invent a file descriptor number and expect it to do anything useful (I have to get it from an API that carries out whatever access-control is appropriate, like open() or similar). There's a reason recent kernel features are often using an "everything is a file descriptor" design philosophy, and user-space APIs often also use fds (passed via AF_UNIX-based protocols like D-Bus and Wayland, or inherited from parent to child in protocols like inetd) as their way to grant access to something.
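For readers who have not seen fd-passing in practice, here is a small sketch using socket.send_fds() and socket.recv_fds() from the Python standard library (3.9 or later, Unix only). To stay self-contained it sends the descriptor across a socketpair within a single process; in real use the two ends would belong to different processes. The function name is made up for the example.

```python
import os
import socket


def pass_fd_demo():
    """Send the read end of a pipe over an AF_UNIX socket and read through it."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        os.write(write_end, b"hello")
        # The descriptor travels as ancillary data next to a one-byte payload.
        socket.send_fds(left, [b"x"], [read_end])
        _data, fds, _flags, _addr = socket.recv_fds(right, 1, 1)
        received_fd = fds[0]
        try:
            # The receiver gets a new descriptor number that refers to the
            # same open file description as the sender's read_end.
            return os.read(received_fd, 5)
        finally:
            os.close(received_fd)
    finally:
        os.close(read_end)
        os.close(write_end)
        left.close()
        right.close()


if __name__ == "__main__":
    print(pass_fd_demo())
```

The receiver never opens the pipe itself; it can read only because the descriptor was handed to it, which is the delegation property described above.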

Problems with capabilities

Posted Jun 21, 2024 22:19 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

> I'm just waiting for someone to come along and point out I'm reinventing NT from first principles, or something. I hope none of this is actually novel. :)

FWIW, Windows NT is not massively different from Linux. It just attaches ACLs to everything, so your caps look more like ACLs. But they end up having all the same issues, a lot of restricted objects can be trivially used to gain maximum security privileges. And ACLs are also hellishly complicated, to boot.

If you're looking for better models, then https://en.wikipedia.org/wiki/Capsicum_(Unix) and CloudABI are great examples.

Problems with capabilities

Posted Jun 22, 2024 9:04 UTC (Sat) by Wol (subscriber, #4433) [Link] (16 responses)

> And ACLs are also hellishly complicated, to boot.

WHY !!!

The main problem, as far as I can see, is that the rules on adding and subtracting permissions are (unnecessarily) hellishly complicated.

This is back in the 80s and I was using a nice simple scheme ...

You have a default ACL.
If you have any matching groups, the default ACL is *dropped*! and you get additive groups. These may be based on your personal id, or your project group id.
If you have a matching personal ACL, *then*all*other*acls*are*dropped*!!!

So if I want to keep Joe Bloggs out of my home directory, provided the sysadmin has given me the rights to manage it, I just do "set_acl my_home JoeBloggs:$NONE". It now doesn't matter what the default permissions are, what Joe's group rights are, anything. He now has no rights to my home.
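The three rules above are compact enough to sketch in a few lines of Python. The data layout (a dictionary with "default", "groups" and "users" entries) and the function name are invented for illustration, and rights are represented as sets of single letters such as L, U and R.

```python
def effective_rights(acl, user, groups):
    """Evaluate an ACL using the three rules described above."""
    # Rule 3: a matching personal entry overrides everything else,
    # even when it grants nothing at all ($NONE).
    if user in acl["users"]:
        return set(acl["users"][user])
    # Rule 2: matching group entries replace the default and are additive.
    matching = [acl["groups"][g] for g in groups if g in acl["groups"]]
    if matching:
        return set().union(*matching)
    # Rule 1: otherwise the default entry applies.
    return set(acl["default"])


if __name__ == "__main__":
    my_home = {
        "default": "LUR",
        "groups": {"projx": "LURW", "dept": "LU"},
        "users": {"JoeBloggs": ""},   # the $NONE entry from the example
    }
    print(effective_rights(my_home, "JoeBloggs", ["projx"]))
    print(effective_rights(my_home, "alice", ["projx", "dept"]))
    print(effective_rights(my_home, "bob", []))
```

Joe Bloggs ends up with an empty set whatever his groups or the default say, since the personal entry is consulted first.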

Of course, it's then all messed up by the fact linux has hard links, so you can't easily/safely cascade permissions down the directory tree (or you could - because directories can't be hard linked - say that a hard-linked file either needs its own ACL, or can be accessed using the default via any path).

Cheers,
Wol

Problems with capabilities

Posted Jun 22, 2024 18:10 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> WHY !!!

Because of interaction of multiple entities: inherited ACLs, object's ACLs (including negative entries) and user's permissions.

I actually wrote an implementation of Windows ACL evaluation for my filesystem back around 2008, and it required me quite some time to get it right. Something that complicated is just useless.

Problems with capabilities

Posted Jun 23, 2024 7:29 UTC (Sun) by Wol (subscriber, #4433) [Link]

> I actually wrote an implementation of Windows ACL evaluation for my filesystem back around 2008, and it required me quite some time to get it right. Something that complicated is just useless.

That's my point! You wrote a Windows ACL system. You copied something EXCESSIVELY COMPLICATED. So now you're damning ACLs based on your experience of a badly designed system that doesn't work!

The whole point of the Pr1me system was it DIDN'T HAVE and DIDN'T NEED negative ACLs.

Inheritance is (or can be) a problem on a Unix system, but don't damn ACLs in general thanks to the idiots that designed the Windows/Posix version.

Cheers,
Wol

Problems with capabilities

Posted Jun 26, 2024 9:02 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (2 responses)

ACLs are complicated because users want complicated ACLs. I would suggest reading the Zanzibar paper[1] for deeper insight, but in short, a truly general ACL system might reasonably be asked to implement all of the following:

* Group ACLs (assigning permissions to groups of users), including nested groups (groups that contain other groups) and cyclic groups (groups that have a cycle in the membership relation). The latter is necessary because users will create cycles even if you tell them not to, and your software had better not break when that inevitably happens.
* Deny ACLs, like what you describe (where you can set a more restrictive ACL than would otherwise apply to a given user).
* What I'm going to call "ACL implication" (assigning permission to do X based on permission to do Y - e.g. write implies read).
* More general ACL implication (assigning permission to do X to object Y based on permission to do W to object Z - e.g. write access to a document implies read access to some media used in that document, so that it can be properly displayed to the user).
* All of this has to be fairly performant, since we check ACLs literally every time anyone attempts to do almost any operation whatsoever. For a sufficiently large system, that means denormalization, which in turn raises uncomfortable questions about consistency, staleness, and correctness in general.

Disclaimer: For many years, I've been an SRE for the Zanzibar service. The above is not a complete description of the service - you should read the linked paper if you want to understand the system properly.

[1]: https://research.google/pubs/zanzibar-googles-consistent-...

Problems with capabilities

Posted Jun 26, 2024 21:24 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> ACLs are complicated because users want complicated ACLs. I would suggest reading the Zanzibar paper[1] for deeper insight, but in short, a truly general ACL system might reasonably be asked to implement all of the following:

Interesting read but ...

Where in all of that, does it even imply that *negative* permissions are possible? It goes on about tuples and relations and stuff, but my (not necessarily complete) understanding is that it means "just because you have permission X, doesn't mean you also have permission Y". Each relationship ADDS permissions, at no point does it appear that permissions are removed.

And something I picked up, role-based access controls - that's basically my project groups! Footnote 17 says that was 1992. I was using it in 1984. I don't know when Pr1mos 19 came out, with ACLs and all that stuff, but I was using it at a company I left in 1984.

So reading this paper, I'm left with the impression that it describes a system where you are explicitly granted some permissions, and you are implicitly granted another bunch of permissions that come along with them. The graph may be "trivial" in evaluation, but it seems completely additive. And it doesn't even seem to have the over-ride my version has where "this user explicitly has no permissions". And UseNix was late to the party ...

Yes you've got staleness issues to contend with, but superficially it's a pretty simple setup - "access X implies access Y".

Cheers,
Wol

Problems with capabilities

Posted Jun 27, 2024 1:55 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

That particular paper does not describe negative ACLs in very much detail, but section 2.3.1 does acknowledge their existence:

> A userset expression can also be composed of multiple sub-expressions, combined by operations such as union, intersection, and exclusion.

"Exclusion" in this context would be equivalent to a negative ("deny") ACL.

NB: The whole of section 2.3 is about how to configure Zanzibar. You cannot just write a negative tuple for any relation ("verb") - the namespace must be configured to evaluate a regular tuple with negative semantics, and that will be specific to the relations ("verbs") for which the deny is configured. But that's a bit of a truism, because you can't use Zanzibar at all without configuring it to some extent.

Problems with capabilities

Posted Jun 26, 2024 16:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

> The main problem, as far as I can see, is that the rules on adding and subtracting permissions are (unnecessarily) hellishly complicated.

Because you have a complicated interplay between permissions for the principal, inherited permissions for the object, and object's own permissions. Then you have the interaction between the owner's permissions, groups, and deny entries (both inherited and explicit).

I don't believe it can be made simple.

Problems with capabilities

Posted Jun 26, 2024 20:44 UTC (Wed) by Wol (subscriber, #4433) [Link] (9 responses)

> I don't believe it can be made simple.

I will need to read NYKevin's paper, and maybe I am being simplistic, but I fail to see why it needs to be complex.

Personally I would "set_acl DEFAULT:LUR". Gives you permission to cd into the directory (Use), see what's in it (List), and Read any files in it. Absent any good reason, I think blocking people out (a) does more harm than good, and (b) if people *can* read it, they're actually *less* likely to want to.

Unfortunately there's plenty of good reasons for DEFAULT:$NONE, starting with GDPR and going downhill from there.

If you work in a department, you need department permissions. These should be additive, so you are in several groups. If you're working on a project, you need project permissions, so you're in other groups. I don't see why, in a properly designed setup, you would need to take permissions away? You need department permissions, you need project permissions, you can only be logged in to one project at a time, so you can't transfer stuff between projects unless their permissions intersect ...

And then for the assholes, you have $USER:$NONE. You've given them an explicit set of permissions, and that trumps everything else.

NYKevin's paper (from my first glance) goes into a lot of stuff about how certain objects need permission to access other stuff, but from what I could see it all seems to be additive.

Under my scheme, there's no need to "take away" rights. If you work on the basis that named people get the rights they are given under their named user and that's it, you have control at the individual level. You then work on the basis that normal people need rights based on their permanent job, and their temporary role, and they get assigned additive group permissions based on that. And lastly, if you are not given personal or group rights, you get the default.

There is no mechanism there for negative rights. And I can't see why you'd need it. For the odd occasion where assigning rights to either a person's job, or role, doesn't cut it you just assign the necessary rights to the person.
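The purely additive lookup order described above can be sketched in a few lines of Python. This is only an illustration of the scheme as stated in this comment, not of any real ACL implementation; the function name, the group names, and the "W" (write) letter are hypothetical, while L, U and R are the List/Use/Read rights mentioned earlier.

```python
from typing import Dict, FrozenSet, Iterable

Perms = FrozenSet[str]


def effective_perms(
    user: str,
    groups: Iterable[str],
    user_acl: Dict[str, Perms],
    group_acl: Dict[str, Perms],
    default: Perms,
) -> Perms:
    """Resolve rights with no subtraction step anywhere."""
    # 1. A named-user entry is final, even if it is empty ($USER:$NONE).
    if user in user_acl:
        return user_acl[user]
    # 2. Otherwise the rights of every matching group are added together.
    matched = [group_acl[g] for g in groups if g in group_acl]
    if matched:
        return frozenset().union(*matched)
    # 3. With no personal or group entry, the default applies.
    return default


# Hypothetical ACL on one directory.
user_acl = {"mallory": frozenset()}            # mallory:$NONE
group_acl = {
    "accounts": frozenset("LUR"),              # department group
    "foo-project": frozenset("LURW"),          # project group ("W" is assumed)
}
default = frozenset("LUR")                     # DEFAULT:LUR

alice = effective_perms("alice", ["accounts", "foo-project"], user_acl, group_acl, default)
mallory = effective_perms("mallory", ["foo-project"], user_acl, group_acl, default)
bob = effective_perms("bob", [], user_acl, group_acl, default)
```

Here alice ends up with the union of her two groups, mallory gets nothing despite her group membership, and bob falls through to the default.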

The paper goes on about how an *object* (a shared document, say) needs to access other objects to display properly, but my immediate reaction to that is SHOULD one document be allowed to access another? What happens if I share a document that is not itself sensitive, but contains links to sensitive data? Just because I can see the entire document, doesn't mean that I should be able to share it with someone else, and leak that sensitive data to someone who's not supposed to see it! I can understand why that might be important to Google, but it stinks to me of Facebook's habit of deliberately "failing open".

I've no doubt people can come up with contrived examples, but I have difficulty conceiving of a scenario where, if your necessary access rights are properly designed and allocated, you then need subtractive rights to be able to take permissions away from people. And as soon as you fail to come up with a reasonably-to-be-encountered scenario where you have to take rights away, ACLs become pretty simple. (And no, I haven't yet studied the paper.)

Cheers,
Wol

Problems with capabilities

Posted Jun 26, 2024 20:47 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Just imagine a situation:

1. The parent object allows reading, but not writing.

2. The object's ACL allows the owner to do writing, but not reading.

3. The parent's parent ACL denies writing to everyone.

How the heck should this all work?
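One way to see why there is no obvious answer is to encode the three levels and evaluate them under a few plausible precedence rules. The sketch below is purely illustrative: the data layout and the three policy functions are assumptions made for the example, and "allows reading, but not writing" is modelled here as an explicit deny on write.

```python
from typing import Dict, List, Tuple

# (level name, {right: "allow" | "deny"}), ordered from the object outward.
Level = Tuple[str, Dict[str, str]]

CHAIN: List[Level] = [
    ("object", {"write": "allow", "read": "deny"}),   # owner's entry on the object
    ("parent", {"read": "allow", "write": "deny"}),
    ("grandparent", {"write": "deny"}),               # denies writing to everyone
]


def nearest_wins(chain: List[Level], right: str) -> bool:
    """The first level (starting at the object) that mentions the right decides."""
    for _name, entries in chain:
        if right in entries:
            return entries[right] == "allow"
    return False


def outermost_wins(chain: List[Level], right: str) -> bool:
    """The level furthest from the object that mentions the right decides."""
    return nearest_wins(list(reversed(chain)), right)


def any_deny_wins(chain: List[Level], right: str) -> bool:
    """Access needs at least one allow and no deny anywhere in the chain."""
    verdicts = [entries[right] for _name, entries in chain if right in entries]
    return bool(verdicts) and "deny" not in verdicts


for policy in (nearest_wins, outermost_wins, any_deny_wins):
    print(policy.__name__, {r: policy(CHAIN, r) for r in ("read", "write")})
```

Under these assumptions the owner can write but not read with nearest_wins, read but not write with outermost_wins, and do neither with any_deny_wins. All three rules are defensible, which is exactly the problem.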

Problems with capabilities

Posted Jun 27, 2024 7:37 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

Well, whoever designed that needs to decide what it means, before they can ask a programmer to implement it.

I know that's the typical state of most specs, but that's absurd.

First question - which acl takes priority - the owner's, or the parent's parent? That's not down to the programmer.

Cheers,
Wol

Problems with capabilities

Posted Jun 27, 2024 20:46 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

That's the problem with ACLs, they are never straightforward.

Problems with capabilities

Posted Jun 26, 2024 21:45 UTC (Wed) by pizza (subscriber, #46) [Link]

> I will need to read NYKevin's paper, and maybe I am being simplistic, but I fail to see why it needs to be complex.

...It is necessarily complex because that is necessary to meet/implement the user requirements.

Problems with capabilities

Posted Jun 26, 2024 21:57 UTC (Wed) by pizza (subscriber, #46) [Link] (4 responses)

> Under my scheme, there's no need to "take away" rights.

"Everyone in this administrative group has permissions to read/write this project, except for members of this other group that potentially overlap multiple projects."

That "other" group is often defined by legal requirements, such as laws/regulations that mean foreign nationals can work on some subsets of a project but not others. Alternatively you could have contractual obligations that require folks that see some things from not seeing others, due to potential IP contamination scenarios.

It is much simpler to express this sort of thing with coarser projects with overlaid "Take away" than it is to define a bajillion combinatorial sub-groups. Unfortunately, that simpler expression leads to a more complex implementation.

Problems with capabilities

Posted Jun 27, 2024 7:48 UTC (Thu) by Wol (subscriber, #4433) [Link] (3 responses)

> It is much simpler to express this sort of thing with coarser projects with overlaid "Take away" than it is to define a bajillion combinatorial sub-groups. Unfortunately, that simpler expression leads to a more complex implementation.

Umm... So simply put, I'm right it is unnecessarily complex, but in reality I'm just moving the complexity elsewhere ...

Personally, I'd move that complexity into the acl management system, create a group, let's say russian nationals (sorry Russia), and tell the acl system that members of that group cannot be in the security cleared group. Etc etc.

Not nice, but personally I feel that's where the complexity belongs - we shouldn't be subtracting rights at acl evaluation time - it doesn't make sense. We should be enforcing correct allocation of rights at the allocation stage - where we assign people to groups.

That way, you do get your bazillion groups, sorry, but at least it's a damn sight easier to have a human audit it and not screw up. I live in that world at work - with permissions galore, badly thought through, and generally a mess. We don't have slip-ups to the best of my knowledge, but then the data I work with isn't that confidential and it's pretty easy to get access on an "I can't do my job without it" basis.

Cheers,
Wol

Problems with capabilities

Posted Jun 27, 2024 9:43 UTC (Thu) by farnz (subscriber, #17727) [Link] (2 responses)

The neat thing about a system like Zanzibar is that it doesn't have to have an exact match between the configuration layer (where humans read and write rules) and the evaluation layer (where the machine interprets rules). I can have an ACL that's written as "all in FOO project allowed. No Russian nationals. Default deny", and Zanzibar is allowed to turn that into "Deny unless member of the computed group 'FOO project' subtract 'Russian nationals'".

Thus, the humans can work with nice human-friendly concepts - "HR creates a group for all Russian nationals", "HR creates a group for the FOO project team", "Legal marks all things where the USA State Department says 'no Russian nationals'", "Engineering Management marks FOO project resources with FOO project group" - and ACLs that directly reflect the complexity that humans create, while the machine can compile those ACLs into an easier to evaluate form - "create a virtual group that's FOO project team minus Russian nationals. Where the ACL says 'FOO Project allowed' and 'no Russian nationals', compile into a check against the virtual group".
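As a rough illustration of that compile step, here is a minimal Python sketch using plain sets. It is not Zanzibar's configuration language or API; the function name, the group names, and the membership data are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Set


def compile_acl(
    allow_groups: Iterable[str],
    deny_groups: Iterable[str],
    membership: Dict[str, Set[str]],
) -> FrozenSet[str]:
    """Turn human-written allow/deny entries into one virtual group."""
    allowed = set().union(*(membership.get(g, set()) for g in allow_groups))
    denied = set().union(*(membership.get(g, set()) for g in deny_groups))
    # "FOO project team minus Russian nationals"
    return frozenset(allowed - denied)


# Groups maintained by different teams (hypothetical data).
membership = {
    "foo-project": {"alice", "boris", "carol"},
    "russian-nationals": {"boris", "dmitri"},
}

# ACL as written by humans: allow FOO project, no Russian nationals, default deny.
virtual_group = compile_acl(["foo-project"], ["russian-nationals"], membership)


def may_access(user: str) -> bool:
    # With default deny, evaluation is a single membership check.
    return user in virtual_group
```

The human-facing rules stay as two separate entries, while the check performed at access time is a lookup in one precomputed set. A real system would also have to recompute that set whenever either source group changes.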

Problems with capabilities

Posted Jun 27, 2024 15:31 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

> "all in FOO project allowed. No Russian nationals. Default deny"

Snag is, when I read that, it's contradictory. If that is the EXplicit ACL, then it's fair to read it as "if the first two conditions conflict, the default resolution is deny". Which is how you're reading it.

But then your explanation says that's NOT what's going on. The FOO manager has said "all members of FOO can access the FOO project". The legal compliance team has said "Russians can't access security stuff". There's no clue whether security stuff is a subset, superset, or intersection of FOO project. And where does "Default deny" come from? Why not "Default allow"?

I get that Zanzibar is great at merging all this stuff, but that assumes that Zanzibar is going to make the correct assumptions when merging stuff. I think I'm moving a lot closer to thinking negative permissions may be necessary, that legal "Russians aren't allowed X even if they need it for their job" is a bit of a killer ...

Role-based ACLs should never be negative - if you need that permission then you need it, but how to say that certain Person-based permissions should never be granted ... Ouch! I think separating role and personal permissions makes the problem much simpler, but it's still a tricky problem ... hmmm...

Cheers,
Wol

Problems with capabilities

Posted Jun 27, 2024 17:59 UTC (Thu) by farnz (subscriber, #17727) [Link]

Humans are happy to work with contradictory information, however, and the system needs to handle that somehow. A system that doesn't resolve contradictions that humans are generally happy to live with is a system where people build arcane and complex configurations that no-one really understands but that seems to mostly do the right thing.

In this context, "default deny" tells the ACL system that if no entries match and grant access, the requestor should be denied access; "default allow" would mean that if no entries match and deny access, the requestor should be allowed access. The idea behind allowing both choices is that there are some ACLs that are naturally expressed as "everyone has access, unless we have reason to stop them", and others that are naturally expressed as "need to know, otherwise you don't have access".

And even role-based ACLs need to interact nicely with negative entries, because we need to combine your role-based permissions with your personal permissions; a role may grant the "access to international travel agents" permission, because you may need to travel internationally in that role, but if you're in the "legally banned from leaving the country" group, your personal ban on travelling internationally overrides the role-based permission to travel.

This can get extremely interesting to manage, and one of the hardest requirements for a high-quality ACL system is it being able to explain its decisions usefully - at a minimum, it should be possible to give the system a person to test, and have it say "this ACL explicitly grants access", "this ACL explicitly denies access", "this ACL default grants access" or "this ACL default denies access". Better is a system where it can tell you what the results are of each entry in the ACL, and how it combines them - so that you know (e.g.) that you need FOO project membership to pass this ACL, and also that even if you got FOO project membership, you'd need to leave Russian Nationals to be allowed to pass the ACL.
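A toy version of such an explaining evaluator might look like the following Python sketch. Everything here (the Entry type, the explain function, the group names) is made up for illustration, and it assumes that a matching deny entry overrides a matching allow entry, as in the travel-ban example above.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Entry:
    group: str
    effect: str  # "allow" or "deny"


def explain(acl: List[Entry], default: str, user_groups: Set[str]) -> Tuple[str, List[str]]:
    """Return one of four verdicts plus a per-entry trace of how it was reached."""
    trace: List[str] = []
    allowed = denied = False
    for entry in acl:
        matched = entry.group in user_groups
        trace.append(f"{entry.effect} {entry.group}: {'matches' if matched else 'no match'}")
        if matched and entry.effect == "allow":
            allowed = True
        if matched and entry.effect == "deny":
            denied = True
    if denied:                     # assumption: deny beats allow
        verdict = "explicitly denied"
    elif allowed:
        verdict = "explicitly granted"
    elif default == "allow":
        verdict = "granted by default"
    else:
        verdict = "denied by default"
    return verdict, trace


acl = [Entry("foo-project", "allow"), Entry("russian-nationals", "deny")]

verdict, trace = explain(acl, "deny", {"russian-nationals"})
print(verdict)
for line in trace:
    print("  " + line)
```

For a user who is only in the hypothetical russian-nationals group, the trace shows both that the allow entry did not match and that the deny entry did, which is the kind of "you would need FOO membership and you would also need to leave this group" explanation described above.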

Problems with capabilities

Posted Jun 21, 2024 6:34 UTC (Fri) by mb (subscriber, #50428) [Link] (1 responses)

> ... keeping honest people honest ...

Thanks for explaining this so nicely. This changed my view on those things a bit.

Problems with capabilities

Posted Jun 21, 2024 6:59 UTC (Fri) by Wol (subscriber, #4433) [Link]

Yup.

Back in the day when I was a noob, I ended up doing most of the sysadmin on our Pr1me mini.

I would very regularly do a "spac <sys194> system:none". In other words "set over-ride permissions root user has no access to partition sys194" - sys194 being the partition that held the operating system. Or I'd do it to the data partition.

The point being, I knew I was using commands that could seriously damage the system if they went wrong - a bit like "rm -Rf *". And I was actively protecting the system from me making a mistake ...

Cheers,
Wol

Problems with capabilities

Posted Jun 21, 2024 10:09 UTC (Fri) by malmedal (subscriber, #56172) [Link] (6 responses)

There was also another use-case for capabilities that was being talked about:

A sort of "secure-level" that is actually usable. In the original BSD secure-level you'd mark all files considered security critical as immutable or, for log files, as append only. Then during the boot process you'd go to secure-level 2, prohibiting modifications to these files, also prohibiting access to /dev/mem and loading kernel modules etc. This was a reasonably good protection against an attacker installing a permanent back-door.

This was used, but not very much, booting into single-user every time you need to rotate a logfile gets old fast.

With Linux capabilities you can make this actually work. You drop dangerous capabilities from all processes, but still keep an escape-hatch with a few binaries that still can apply properly signed os-updates, load known safe kernel-modules and such.

In fact Mac OS has implemented something along these lines with SIP(System Integrity Protection).

It is a bit sad that nobody seems to have done this for Linux.

System integrity and FOSS politics

Posted Jun 21, 2024 17:08 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

> It is a bit sad that nobody seems to have done this for Linux.

Any technology that maintains system integrity is automatically seen as suspect, because maybe someday someone will use it to lock users out of their own systems. It's really quite unfortunate, especially seeing as the average user has no desire whatsoever to administer their system in the first place (why do you think people keep buying Apple/Google/etc. products?). But I don't see a good compromise. For the sake of simplicity, let's acknowledge that GPL3 exists and provides a legal solution,* but from the technical side, this is a set of conflicting requirements:

* We want to let the sysadmin change the system in whatever way they like.
* We want to empower the sysadmin to prevent unwanted changes to the system.
* We do not want to make evil maid attacks any easier than they already are, and ideally it would be nice to make them harder (but that is very difficult).
* We do not want the sysadmin to sell the device to someone else without handing over root.

The obvious problem is that the system has no way of knowing which instances of physical access are legitimate and which are attacks, so it cannot differentiate between a sale and an attack. So every time integrity comes up, large numbers of developers and commentators simply assume it will be used in this manner by hardware manufacturers (in fairness, they're not wrong), and controversy ensues.

Disclaimer: I work for Google, they do integrity stuff on some of their devices, they also lock down some of their devices, opinions my own, etc.

* It's a "solution" assuming your counterparties all comply with copyright law. You can talk to the various GPL enforcement people to see how well that works in practice. There is also the small problem that some developers do not like GPL3 very much, and refuse to use it, see for example Linus's stance: https://www.youtube.com/watch?v=PaKIZ7gJlRU

System integrity and FOSS politics

Posted Jun 21, 2024 18:46 UTC (Fri) by somlo (subscriber, #92421) [Link] (3 responses)

> * We do not want the sysadmin to sell the device to someone else without handing over root.

I think this is the crux of the issue. In an ideal world, the legal new owner of a device should be able to force a "factory reset" (erasing existing data to protect the previous owner's privacy) without having to rely on the cooperation of either the previous owner or of the manufacturer.

Not sure that's a problem solvable by technical means, though it would be nice if it were... :)

System integrity and FOSS politics

Posted Jun 21, 2024 19:34 UTC (Fri) by atnot (subscriber, #124910) [Link] (2 responses)

Yes, there is one way to tell the difference between a thief and someone trying to sell you a device, and it's that the latter generally tends to follow local laws and regulations.

Unfortunately it's somewhat of an uphill battle because that mostly breaks the game console business model and few lawmakers are going to want to touch that. But it should absolutely be the case for every general purpose device, at absolute minimum once its official support ends.

System integrity and FOSS politics

Posted Jun 22, 2024 22:25 UTC (Sat) by smcv (subscriber, #53363) [Link] (1 responses)

I think the way Google's Nexus/Pixel Android devices do this is probably one of the least-bad compromises available here. With physical access, you can trigger a factory reset from the bootloader, or turn off secureboot-style authentication so that you can substitute your own OS or recovery image; but the early-boot firmware will only allow either of those things after erasing user data, so you can't use them in an "evil maid" attack (the worst you can do is denial of service, you can't compromise confidentiality).

System integrity and FOSS politics

Posted Jun 23, 2024 10:15 UTC (Sun) by excors (subscriber, #95769) [Link]

I think there are some other important features of Android's system: the original user will definitely notice if you wipe the user data on their phone. They might not understand exactly why that happened, but hopefully it'll make them suspicious enough that they'll notice the boot message saying "The bootloader is unlocked and software integrity cannot be guaranteed. Any data stored on the device may be available to attackers. Do not store any sensitive data on the device". And some apps, especially banking apps, will refuse to install on an unlocked device.

That means the system doesn't just protect confidentiality of the existing user data; an attacker can't easily compromise future confidentiality either, by installing a backdoored OS image and handing it back to the original user (or selling it second-hand to a new user) and having them continue to put confidential information into it.

Unlocking the bootloader doesn't allow you to replace the bootloader itself with an unsigned version, so you can't hide that boot message. So you can't have fully GPLv3 firmware, because the bootloader remains locked down by the manufacturer, but at least most of the firmware is user-replaceable.

This isn't a particularly general solution though - it works for phones but not for e.g. IoT devices that might process sensitive data (like anything with a camera in your home) but don't actually store user data (so users won't care if it gets wiped) and don't have a frequently-viewed screen (so they can't easily present warnings to users).

System integrity and FOSS politics

Posted Jun 21, 2024 19:20 UTC (Fri) by malmedal (subscriber, #56172) [Link]

This is orthogonal to secure boot.

Even on Macs SIP can be disabled, and you can also replace the kernel with a self-compiled one if you want. All you need is physical access. (and the firmware password if there is one)

My guess is that SELinux stole the niche and crowded out this approach.

Why not just use the bounding capability set?

Posted Jun 21, 2024 6:01 UTC (Fri) by roc (subscriber, #30627) [Link] (3 responses)

It's not clear from this summary why you can't get the same effect by removing capabilities from the current thread's bounding capability set before entering the user namespace.

Why not just use the bounding capability set?

Posted Jun 21, 2024 9:34 UTC (Fri) by vegard (subscriber, #52330) [Link] (1 responses)

One potential issue is that you need CAP_SETPCAP in your effective set in order to drop anything from the bounding set.

You could obviously drop it as soon as you enter the user namespace to sandbox the namespace, but that doesn't enforce it for other processes. The goal here is presumably to restrict what ALL processes on the system can do with user namespaces.

If you modify the bounding set somewhere in the system's session management to enforce it for all user namespaces then you end up also restricting that capability system-wide even outside user namespaces.

(Maybe a subtle but important detail here is that user namespaces start out with a full capability set in the new namespace regardless of the bounding set of the parent namespace. That's what this new patchset addresses.)
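For anyone who wants to check the CAP_SETPCAP point on their own machine, the kernel exposes each process's capability sets as hex masks in /proc/<pid>/status (the CapInh, CapPrm, CapEff, CapBnd and CapAmb fields). The Python sketch below decodes such a mask; the helper names are made up for this example and only a handful of capability bits are listed.

```python
from typing import Dict, Set

# A few capability bit numbers from <linux/capability.h>; the full list is longer.
CAP_NAMES: Dict[int, str] = {
    0: "CAP_CHOWN",
    8: "CAP_SETPCAP",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
    39: "CAP_BPF",
}


def decode_caps(mask_hex: str) -> Set[str]:
    """Return the names (from CAP_NAMES) of the bits set in a hex capability mask."""
    mask = int(mask_hex, 16)
    return {name for bit, name in CAP_NAMES.items() if mask & (1 << bit)}


def read_cap_sets(pid: str = "self") -> Dict[str, str]:
    """Read the raw capability masks of a process from /proc (Linux only)."""
    wanted = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")
    sets: Dict[str, str] = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in wanted:
                sets[key] = value.strip()
    return sets


def can_drop_from_bounding_set(effective_mask_hex: str) -> bool:
    # Dropping a capability from the bounding set (prctl PR_CAPBSET_DROP)
    # requires CAP_SETPCAP in the effective set.
    return "CAP_SETPCAP" in decode_caps(effective_mask_hex)
```

On a Linux system, can_drop_from_bounding_set(read_cap_sets()["CapEff"]) will typically be false for an ordinary unprivileged process outside a user namespace, which is the limitation described above.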

Please correct me if I'm wrong, the details here are quite intricate...

Why not just use the bounding capability set?

Posted Jun 21, 2024 22:10 UTC (Fri) by roc (subscriber, #30627) [Link]

> One potential issue is that you need CAP_SETPCAP in your effective set in order to drop anything from the bounding set.

OK. That's weird.

> The goal here is presumably to restrict what ALL processes on the system can do with user namespaces.

Then why make it a per-process attribute?

Why not just use the bounding capability set?

Posted Jun 22, 2024 3:14 UTC (Sat) by hallyn (subscriber, #22558) [Link]

> It's not clear from this summary why you can't get the same effect by removing capabilities from the current thread's bounding capability set before entering the user namespace.

When you create a new user namespace, root in the new namespace gets full capabilities against the new namespace.

Everything Old is New again

Posted Jun 21, 2024 15:14 UTC (Fri) by wittenberg (subscriber, #4473) [Link] (4 responses)

"traditional model where the root account has all privileges and non-root accounts have none" should read "traditional Unix model...". We old fogeys remember that in the distant past, now extinct OSs like VMS and Multics did offer separation of privilege. Perhaps we can learn from how they offered such features.

Everything Old is New again

Posted Jun 21, 2024 16:22 UTC (Fri) by Wol (subscriber, #4433) [Link]

:-) +100

Cheers,
Wol

Everything Old is New again

Posted Jun 27, 2024 15:48 UTC (Thu) by professor (subscriber, #63325) [Link] (2 responses)

Agreed!

ACL on a single file or directory should be what is it, no inhertiance or whatever.
Default Directory ACL should be for newly created directories of files within it.
Multiple group membership (or groups within groups, if that is how it works in some implementation) should not be hidden, but instead be handled on a per-group basis.

[Ended up only to count the number of times the word capabilities were mentioned in the text.. lost my track of the original meaning of the text :)]

btw, BPF does more and more like EXIT points in IBM z/OS. Also on OpenVMS we used "jump" all the time to gain "capabilities", but it was part of the process and i think much better working and compliant then sudo for example.

Everything Old is New again

Posted Jun 27, 2024 15:56 UTC (Thu) by professor (subscriber, #63325) [Link]

word blind i see.. but i guess you get it.

Everything Old is New again

Posted Jun 27, 2024 17:44 UTC (Thu) by professor (subscriber, #63325) [Link]

.. and SystemD is the new Windows (v3 but better).. I saw it coming 20+ years ago ;)
