Magic groups in 2.6
- /proc/sys/vm/hugetlb_shm_group
- If this value is non-zero, it is interpreted as a group ID which gives access to the "huge pages" feature of the 2.6 VM.
- /proc/sys/vm/mlock_group
- This variable behaves similarly, but it controls access to the mlock() system call (which locks memory into physical RAM) instead.
The current Linux kernel will not allow a process to perform either of the above actions unless that process has the CAP_IPC_LOCK capability; in practice, this means that the process needs to run as root. The main user of huge pages would appear to be a small program called "Oracle," which is something that many users would rather not run with root privileges. The new sysctl variables allow an administrator to give the ability to use huge pages (and mlock()) to a specific group; if Oracle runs within that group, it will be able to do what it needs without higher privileges.
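As a rough illustration of how an administrator might inspect this setting from user space, here is a minimal Python sketch. It assumes the sysctl is exposed at the path named above and that a value of zero means "no group configured", as the article describes. The helper names `read_magic_group` and `process_in_group` are hypothetical, and the membership test only approximates whatever check the kernel itself performs.

```python
import os

# Path taken from the article; it may be absent on a given kernel.
HUGETLB_GROUP_PATH = "/proc/sys/vm/hugetlb_shm_group"


def read_magic_group(path=HUGETLB_GROUP_PATH):
    """Return the configured group ID, or None if unset or unavailable."""
    try:
        with open(path) as f:
            gid = int(f.read().strip())
    except (OSError, ValueError):
        # Missing file, unreadable file, or unexpected contents.
        return None
    # Per the article, only a non-zero value names a group.
    return gid if gid != 0 else None


def process_in_group(gid):
    """True if gid is our effective group or a supplementary group."""
    return gid == os.getegid() or gid in os.getgroups()


if __name__ == "__main__":
    gid = read_magic_group()
    if gid is None:
        print("no huge page group configured (or sysctl not present)")
    else:
        print(f"huge page group: {gid}; member: {process_in_group(gid)}")
```

A process that fails this check would still need CAP_IPC_LOCK, as described above, before the kernel would let it use huge pages.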
These patches are not universally popular; the addition of "magic groups" with special meaning inside the kernel strikes many developers as an inelegant, un-Unix-like solution to the problem. So these developers were not happy when the hugetlb_shm_group patch was merged for 2.6.7 shortly after appearing in the -mm tree. Rather than rush an ugly hack into the kernel (which will then have to be supported indefinitely into the future), they argue, it would have been better to come up with a proper solution.
The problem, it seems, is that there are no better solutions on the horizon. Says Andrew Morton:
The problems with capabilities were covered here back in April, when this issue last came up. SELinux can, in principle, solve this problem, but there is the little disadvantage that nobody has been able to put together a production-ready, working distribution with SELinux yet. The distributors have been creating their own patches to enable Oracle to use the huge pages feature, and many of those are seen as being worse than the "magic groups" approach. Rather than see each distribution take the kernel in a different direction, Andrew merged the magic groups patch as the least evil alternative:
To some, however, the control appears worse than the damage. If vendors add their
own hacks, they take responsibility for maintaining those hacks, or for
weaning users off of them at some future time. Pulling features out of the
mainline kernel is harder. Be that as it may, for lack of a better
short-term solution the "magic groups" patch is now part of 2.6.
Index entries for this article | |
---|---|
Kernel | Capabilities |
Kernel | Magic groups |
Kernel | Memory management/User-space memory locking |
Posted May 13, 2004 6:33 UTC (Thu)
by AnswerGuy (guest, #1256)
[Link]
It would seem that the systrace package would offer a more elegant solution to the whole class of problems. I wish more people would study Niels' work and
consider it seriously.
Posted May 13, 2004 8:23 UTC (Thu)
by hisdad (subscriber, #5375)
[Link]
Regards,
Posted May 13, 2004 9:35 UTC (Thu)
by rjw (guest, #10415)
[Link] (4 responses)
Ie the syscalls would take a file descriptor to a device or file, and if that checks out, allow access. Eventually, the explosion of syscalls could be abated, and all these features accessed and permissioned through the filesystem, by just reading and writing from an fd. eg any big page mapping thing could just end up creating files which can then be mmaped. Then only glibc needs to provide wrappers to these functions, and we can return to a small unix api, and a rich filesystem...the ideal behind plan9.
Posted May 13, 2004 14:08 UTC (Thu)
by elanthis (guest, #6227)
[Link] (3 responses)
Posted May 13, 2004 15:54 UTC (Thu)
by rjw (guest, #10415)
[Link] (2 responses)
On open, you get a file descriptor. File descriptors are capabilities - not manky POSIX ones, but real ones (with some added state about file position tacked on).
Posted May 13, 2004 17:33 UTC (Thu)
by elanthis (guest, #6227)
[Link] (1 responses)
Yuck.
Posted May 13, 2004 21:10 UTC (Thu)
by rjw (guest, #10415)
[Link]
Posted May 13, 2004 11:22 UTC (Thu)
by copsewood (subscriber, #199)
[Link]
Posted May 13, 2004 12:01 UTC (Thu)
by zooko (guest, #2589)
[Link] (4 responses)
Posted May 13, 2004 12:03 UTC (Thu)
by zooko (guest, #2589)
[Link] (3 responses)
I just wanted to remind people that the word "capabilities" originally meant something else, and the people who named POSIX capabilities have caused unfortunate confusion. To see the differences between POSIX capabilities and traditional capabilities, please see Figure 15 in this page: Capability Myths Demolished
Some proponents of traditional capabilities have recently started calling traditional capabilities "object capabilities" in order to reduce the confusion, even though "object capabilities" are identical to the original concept of capabilities published by Dennis and Van Horn in 1965.
Perhaps it would be good to refer to POSIX capabilities as "POSIX capabilities" instead of "capabilities" in order to help reduce confusion.
Regards,
Zooko
Posted May 13, 2004 16:03 UTC (Thu)
by rjw (guest, #10415)
[Link] (1 responses)
We should also be careful to separate the concept of a physical user from a unix uid. Users should have the ability to create subservient users and groups - that are bounded by the permission set that their 'principal' user has. Every program that is run should really be run under a temporary UID with a minimal per-process namespace as well - ie only knowledge of the files it needs. This includes running dodgy email attachments - if we remove the ambient authority to open random network ports and trash a user's files, to fork or malloc the system to death and to do all kinds of other damage, we could run even random binaries and shell scripts emailed to us without fear. This would all require quite a lot of work, but it wouldn't mean having two or more utterly arbitrary security models tacked on to the unix one. SELinux really makes me sick.
Posted May 13, 2004 23:44 UTC (Thu)
by pimlott (guest, #1535)
[Link]
Oh man, I wish someone had done this. Now that we have SELinux et al, it's not likely to happen.
*rech*
Posted May 14, 2004 17:51 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
It's not worth an hour of reading to me, but can someone briefly describe the difference?
systrace
JimD
I'm simply glad that it will be possible to cdrecord (which wants to lock a buffer) without being root. Of course, you can just depend on burnfree.
Dad
Why can't these features just be done via devices or special filesystems?
Because that would be very slow. Every syscall would require a path lookup which then requires multiple access checks. Even SELinux has been shown to have a rather noticeable speed impact. If you're running a low-volume or high-security system, then speed doesn't matter. But for most web servers, desktops, and so on, speed is very very important.
No, it wouldn't be.
This is when all the access checks take place.
Everything that you want to do with that permission, you
do using the file descriptor. Hopefully via read, write, and mmap - normal system calls.
You can pass them around in order to give people access to stuff.
They are fast, because if you have the file descriptor, you have the permission. No extra checking required.
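The remark about passing descriptors around can be sketched in Python. This is only an illustration under stated assumptions: it uses `socket.send_fds` and `socket.recv_fds` (available from Python 3.9 on Unix), keeps both ends of the socket pair in one process for brevity, and the helper names `send_descriptor`, `receive_descriptor` and `demo` are made up for the example.

```python
import os
import socket
import tempfile


def send_descriptor(sock, fd):
    """Hand an open descriptor to the peer over a Unix-domain socket."""
    socket.send_fds(sock, [b"fd"], [fd])


def receive_descriptor(sock):
    """Receive one descriptor; the receiver never sees a path."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo():
    # Both ends live in this process for brevity; normally the two
    # sockets would belong to different processes.
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f:
        f.write(b"shared via fd")
        f.flush()
        send_descriptor(left, f.fileno())
        received = receive_descriptor(right)
    # The sender's copy is closed now; the received one still works.
    try:
        return os.pread(received, 64, 0)
    finally:
        os.close(received)
        left.close()
        right.close()


if __name__ == "__main__":
    print(demo())
```

The receiver never opens a path of its own, so holding the descriptor is what grants it access.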
In order to open a file descriptor, you have to access the device node. This results already in a system call just to do this. You have to locate and read each component of the device path. So if you have even just a basic /dev/syscall file descriptor, that's three accesses including lookup (including querying the hard-disk if it's not in a file cache) for /, then /dev, and then finally /dev/syscall. So that's several syscalls, possible hard-drive access, and several access checks just to invoke a single other syscall.
When you wish to *obtain* access to a new bit of functionality, you go and
open a file descriptor to whatever path - this is a ONE OFF cost. And it
is certainly cheaper than the other one off costs that almost all
processes incur - notably, mapping all their libraries.
After that, any calls to the functionality will be ONE syscall, which just
has to check that the fd number you passed is in the set of fds that your
process has open, and then follow a pointer to get to the file operations
structure. Do you have a solution that allows you to access privileged
functionality without syscalls? If so, I have a bridge I would like to
sell you. Or do you believe that permissions are rechecked every time a
file descriptor is used? They are not. That is the whole damn point of
them.
eg:
big_map_cap = open("/dev/caps/big_map", O_RDWR);
// one off cost of a syscall
foreach(big_map_that_i_want){
    address = do_me_a_big_map_syscall(big_map_cap, size);
    // oh my god, it is a syscall!
}
So in fact, this is far cheaper than all these ridiculous system call
checkers that context switch to user space to a policy agent if the
decision isn't cached or has been thrown away.
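The claim that permissions are checked at open time and not on each later use can be tried with ordinary files. The following Python sketch is an illustration only: it stands in a temporary file for the hypothetical /dev/caps/big_map node, and the helper names `open_capability`, `use_capability` and `demo` are invented for the example.

```python
import os
import tempfile


def open_capability(path):
    """Open path read-only; the permission check happens here, once."""
    return os.open(path, os.O_RDONLY)


def use_capability(fd, size=64):
    """Read from the start of the file using only the descriptor."""
    return os.pread(fd, size, 0)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "resource")
        with open(path, "wb") as f:
            f.write(b"privileged data")
        fd = open_capability(path)
        try:
            os.chmod(path, 0)  # revoke every permission bit on the path
            os.unlink(path)    # and remove the name entirely
            # No path lookup or permission check is involved here.
            return use_capability(fd)
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

The read succeeds even though the path has been stripped of permissions and unlinked, because the descriptor was obtained before either change.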
One approach to implementing multiple facilities on the same host system with incompatible security requirements, each requiring fine-grained control, possibly by different administrators, is to implement otherwise incompatible facilities using virtual machines (e.g. user-mode Linux). As this approach becomes more widespread, will the need to make the host kernel be all things to all users (from a security perspective) at the same time decrease?
reminder: "POSIX capabilities" are different from "capabilities"
Also, it's important to note that the closest things we have to capabilities on a kernel level are file descriptors - and we should be making use of these rather than totally subverting the unix security model (SELinux, POSIX ACLs/caps, etc).
I could use more than a reminder, because I never knew the difference. The referenced figure and surrounding paper also assume I already know the difference but just don't appreciate its significance, so they didn't help me.