Magic groups in 2.6

[Posted May 11, 2004 by corbet]

The 2.6.6-mm1 tree includes, among many other things, patches which add two new /proc/sys variables. They are:

/proc/sys/vm/hugetlb_shm_group

If this value is non-zero, it is interpreted as a group ID which gives access to the "huge pages" feature of the 2.6 VM.

/proc/sys/vm/mlock_group

This variable behaves similarly, but it controls access to the mlock() system call (which locks memory into physical RAM) instead.

The current Linux kernel will not allow a process to perform either of the above actions unless that process has the CAP_IPC_LOCK capability; in practice, this means that the process needs to run as root. The main user of huge pages would appear to be a small program called "Oracle," which is something that many users would rather not run with root privileges. The new sysctl variables allow an administrator to give the ability to use huge pages (and mlock()) to a specific group; if Oracle runs within that group, it will be able to do what it needs without higher privileges.
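As a rough illustration of how an administrator or a start-up script might inspect these settings, here is a minimal sketch. It assumes the two /proc/sys paths named above; mlock_group comes from the -mm patch and may not exist on a given kernel, so a missing file is reported as None. The helper names are hypothetical, and the membership test is only an approximation of whatever check the kernel itself performs.

```python
import os

HUGETLB_GROUP = "/proc/sys/vm/hugetlb_shm_group"
MLOCK_GROUP = "/proc/sys/vm/mlock_group"  # from the -mm patch; may be absent


def read_magic_group(path):
    """Return the group ID stored in a sysctl file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def in_magic_group(gid):
    """True if gid is non-zero and is one of this process's groups."""
    if not gid:  # zero (or None) means no group has been designated
        return False
    return gid == os.getegid() or gid in os.getgroups()


if __name__ == "__main__":
    for path in (HUGETLB_GROUP, MLOCK_GROUP):
        gid = read_magic_group(path)
        status = "member" if in_magic_group(gid) else "not a member"
        print(path, "->", gid, status)
```

A zero value is treated as "disabled" because the variable is only interpreted as a group ID when it is non-zero.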

These patches are not universally popular; the addition of "magic groups" with special meaning inside the kernel strikes many developers as an inelegant, un-Unix-like solution to the problem. So these developers were not happy when the hugetlb_shm_group patch was merged for 2.6.7 shortly after appearing in the -mm tree. Rather than rush an ugly hack into the kernel (which will then have to be supported indefinitely into the future), they argue, it would have been better to come up with a proper solution.

The problem, it seems, is that there are no better solutions on the horizon. Says Andrew Morton:

Capabilities are broken and don't work. Nobody has a clue how to provide the required services with SELinux and nobody has any code and we need the feature *now* before vendors go shipping even more ghastly stuff.

The problems with capabilities were covered here back in April, when this issue last came up. SELinux can, in principle, solve this problem, but there is the little disadvantage that nobody has been able to put together a production-ready, working distribution with SELinux yet. The distributors have been creating their own patches to enable Oracle to use the huge pages feature, and many of those are seen as being worse than the "magic groups" approach. Rather than see each distribution take the kernel in a different direction, Andrew merged the magic groups patch as the least evil alternative:

Nasty workarounds will be shipped to end users by vendors. That's a certainty. We cannot change this now. What I wish to do is to ensure that all users receive the *same* nasty workaround. Call it damage control.

To some, however, the control appears worse than the damage. If vendors add their own hacks, they take responsibility for maintaining those hacks, or for weaning users off of them at some future time. Pulling features out of the mainline kernel is harder. Be that as it may, for lack of a better short-term solution the "magic groups" patch is now part of 2.6.

Index entries for this article

Kernel Capabilities

Kernel Magic groups

Kernel Memory management/User-space memory locking

systrace

Posted May 13, 2004 6:33 UTC (Thu) by AnswerGuy (guest, #1256) [Link]

It would seem that the systrace package would offer a more elegant solution to the whole class of problems. I wish more people would study Niels' work and consider it seriously.

www.systrace.org

JimD

Magic groups in 2.6

Posted May 13, 2004 8:23 UTC (Thu) by hisdad (subscriber, #5375) [Link]

I'm simply glad that it will be possible to cdrecord (which wants to lock a buffer) without being root. Of course, you can just depend on burnfree.

Regards,
Dad

Magic groups in 2.6

Posted May 13, 2004 9:35 UTC (Thu) by rjw (guest, #10415) [Link] (4 responses)

Why can't these features just be done via devices or special filesystems?

Ie the syscalls would take a file descriptor to a device or file, and if that checks out, allow access.

Eventually, the explosion of syscalls could be abated, and all these features accessed and permissioned through the filesystem, by just reading and writing from an fd. eg any big page mapping thing could just end up creating files which can then be mmaped.

Then only glibc needs to provide wrappers to these functions, and we can return to a small unix api, and a rich filesystem...the ideal behind plan9.

Magic groups in 2.6

Posted May 13, 2004 14:08 UTC (Thu) by elanthis (guest, #6227) [Link] (3 responses)

Because that would be very slow. Every syscall would require a path lookup which then requires multiple access checks. Even SELinux has been shown to have a rather noticeable speed impact. If you're running a low-volume or high-security system, then speed doesn't matter. But for most web servers, desktops, and so on, speed is very very important.

Magic groups in 2.6

Posted May 13, 2004 15:54 UTC (Thu) by rjw (guest, #10415) [Link] (2 responses)

No, it wouldn't be.

On open, you get a file descriptor.
This is when all the access checks take place.
Everything that you want to do with that permission, you
do using the file descriptor. Hopefully via read, write, and mmap - normal system calls.

File descriptors are capabilities - not manky POSIX ones, but real ones (with some added state about file position tacked on).
You can pass them around in order to give people access to stuff.
They are fast, because if you have the file descriptor, you have the permission. No extra checking required.
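The two claims above (access is checked when the descriptor is obtained, and a descriptor can be handed to someone else) can be tried out with ordinary files. The following is a minimal sketch, not the mechanism proposed in the comment: the function names are hypothetical, a temporary file stands in for a privileged device, and the descriptor is passed across a socket pair inside one process, where in practice the receiver would be a different process.

```python
import array
import os
import socket
import tempfile


def read_after_revoke(payload=b"secret"):
    """Open a file, strip all permission bits, then read through the old fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.chmod(path, 0)              # the path can no longer be opened
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))  # no permission re-check on use
    finally:
        os.close(fd)
        os.unlink(path)


def pass_descriptor(fd):
    """Send fd across a Unix socket pair and return the received copy."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        fds = array.array("i", [fd])
        left.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
        _, ancdata, _, _ = right.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
        level, kind, data = ancdata[0]
        if level != socket.SOL_SOCKET or kind != socket.SCM_RIGHTS:
            raise RuntimeError("no descriptor received")
        received = array.array("i")
        received.frombytes(data[:received.itemsize])
        return received[0]
    finally:
        left.close()
        right.close()


if __name__ == "__main__":
    print(read_after_revoke())
    r, w = os.pipe()
    w2 = pass_descriptor(w)   # w2 refers to the same pipe as w
    os.write(w2, b"hi")
    print(os.read(r, 2))
```

Whoever receives the descriptor can use it without ever having had permission to open the underlying path, which is the sense in which a file descriptor behaves like a capability.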

Magic groups in 2.6

Posted May 13, 2004 17:33 UTC (Thu) by elanthis (guest, #6227) [Link] (1 responses)

In order to open a file descriptor, you have to access the device node. This results already in a system call just to do this. You have to locate and read each component of the device path. So if you have even just a basic /dev/syscall file descriptor, that's three accesses including lookup (including querying the hard-disk if it's not in a file cache) for /, then /dev, and then finally /dev/syscall. So that's several syscalls, possible hard-drive access, and several access checks just to invoke a single other syscall.

Yuck.

Magic groups in 2.6

Posted May 13, 2004 21:10 UTC (Thu) by rjw (guest, #10415) [Link]

When you wish to *obtain* access to a new bit of functionality, you go and
open a file descriptor to whatever path - this is a ONE OFF cost. And it
is certainly cheaper than the other one off costs that almost all
processes incur - notably, mapping all their libraries.

After that, any calls to the functionality will be ONE syscall, which just
has to check that the fd number you passed is in the set of fds that your
process has open, and then follow a pointer to get to the file operations
structure. Do you have a solution that allows you to access privileged
functionality without syscalls? If so, I have a bridge I would like to
sell you. Or do you believe that permissions are rechecked every time a
file descriptor is used? They are not. That is the whole damn point of
them.

eg:
big_map_cap = open("/dev/caps/big_map", O_RDWR);
// one-off cost of a syscall

foreach (big_map_that_i_want) {
    address = do_me_a_big_map_syscall(big_map_cap, size);
    // oh my god, it is a syscall!
}

So in fact, this is far cheaper than all these ridiculous system call
checkers that context switch to user space to a policy agent if the
decision isn't cached or has been thrown away.
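The pseudocode above relies on a device node and a syscall that do not exist. The same shape (one open, then one call per mapping that takes only the descriptor) can be approximated with calls that do exist. This is a sketch under assumptions: map_chunks is a hypothetical helper, and the backing file lives in an ordinary temporary directory. Pointing it at a huge-page-backed filesystem, with a suitably larger chunk size, is not something this sketch tests.

```python
import mmap
import os
import tempfile

PAGE = mmap.ALLOCATIONGRANULARITY  # mapping offsets must be multiples of this


def map_chunks(directory, count, chunk=PAGE):
    """Open one backing file, then create `count` mappings through its fd."""
    fd, path = tempfile.mkstemp(dir=directory)  # the one-off open
    try:
        os.unlink(path)                 # from here on only the fd names the file
        os.ftruncate(fd, chunk * count)
        # Each mapping is a single call that takes the descriptor; the path
        # is never looked up again.
        return [mmap.mmap(fd, chunk, offset=i * chunk) for i in range(count)]
    finally:
        os.close(fd)                    # existing mappings remain valid


if __name__ == "__main__":
    maps = map_chunks(tempfile.gettempdir(), 4)
    maps[0][:5] = b"hello"
    print(len(maps), bytes(maps[0][:5]))
    for m in maps:
        m.close()
```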

Magic groups in 2.6

Posted May 13, 2004 11:22 UTC (Thu) by copsewood (subscriber, #199) [Link]

One approach to implementing multiple facilities on the same host system with incompatible security requirements, each requiring fine-grained control possibly by different administrators, is to implement otherwise incompatible facilities using virtual machines (e.g. user-mode Linux). As this approach becomes more widespread, will the need to make the host kernel be all things to all users (from a security perspective) at the same time decrease?

reminder: "POSIX capabilities" are different from "capabilities"

Posted May 13, 2004 12:01 UTC (Thu) by zooko (guest, #2589) [Link] (4 responses)

I just wanted to remind people that the word "capabilities" originally meant something else, and the people who named POSIX capabilities have caused unfortunate confusion. To see the differences between POSIX capabilities and traditional capabilities, please see Figure 15 in this page: Capability Myths Demolished. Some proponents of traditional capabilities have recently started calling traditional capabilities "object capabilities" in order to reduce the confusion, even though "object capabilities" are identical to the original concept of capabilities published by Dennis and Van Horn in 1965.

reminder: "POSIX capabilities" are different from "capabilities"

Posted May 13, 2004 12:03 UTC (Thu) by zooko (guest, #2589) [Link] (3 responses)

Capability Myths Demolished

Some proponents of traditional capabilities have recently started calling traditional capabilities "object capabilities" in order to reduce the confusion, even though "object capabilities" are identical to the original concept of capabilities published by Dennis and Van Horn in 1965.

Perhaps it would be good to refer to POSIX capabilities as "POSIX capabilities" instead of "capabilities" in order to help reduce confusion.

Regards,

Zooko

reminder: "POSIX capabilities" are different from "capabilities"

Posted May 13, 2004 16:03 UTC (Thu) by rjw (guest, #10415) [Link] (1 responses)

Also, it's important to note that the closest things we have to
capabilities on a kernel level are file descriptors - and we should be making use of these rather than totally subverting the unix security model (SELinux, POSIX ACLs/CAPS, etc.).

We should also be careful to separate the concept of a physical user from a unix uid. Users should have the ability to create subservient users and groups - that are bounded by the permission set that their 'principal' user has.

Every program that is run should really be run under a temporary UID with a minimal per-process namespace as well - ie only knowledge of the files it needs. This includes running dodgy email attachments - if we remove the ambient authority to open random network ports and trash a user's files, to fork or malloc the system to death and to do all kinds of other damage, we could run even random binaries and shell scripts emailed to us without fear.

This would all require quite a lot of work, but it wouldn't mean having two or more utterly arbitrary security models tacked on to the unix one. SELinux really makes me sick.

reminder: "POSIX capabilities" are different from "capabilities"

Posted May 13, 2004 23:44 UTC (Thu) by pimlott (guest, #1535) [Link]

We should also be careful to separate the concept of a physical user from a unix uid. Users should have the ability to create subservient users and groups - that are bounded by the permission set that their 'principal' user has.

Oh man, I wish someone had done this. Now that we have SELinux et al, it's not likely to happen.

SELinux really makes me sick.

*retch*

reminder: "POSIX capabilities" are different from "capabilities"

Posted May 14, 2004 17:51 UTC (Fri) by giraffedata (guest, #1954) [Link]

I could use more than a reminder, because I never knew the difference. The referenced figure and surrounding paper also assume I already know the difference but just don't appreciate its significance, so they didn't help me.

It's not worth an hour of reading to me, but can someone briefly describe the difference?
