A "kill" button for control groups

By Jonathan Corbet
May 3, 2021

The kernel's control-group mechanism exists to partition processes and to provide resource guarantees (and limits) for each. Processes running within a properly configured control group are unable to deprive those running in a different group of their allocated resources (CPU time, memory, I/O bandwidth, etc.), and are equally protected from interference by others. With few exceptions, control groups are not used to take direct actions on processes; Christian Brauner's cgroup.kill patch set is meant to be one of those exceptions.

In current kernels, one way of acting on processes within a control group is through the "freezer", which can be used to suspend (or resume) all contained processes. Beyond that, though, there are few control-group knobs that will directly affect a process's state. Brauner's patch set adds another one in the form of a control file, called cgroup.kill, in each non-root group; it "does what it says on the tin". Writing "1" to that file will cause the immediate death of every process contained within the group (more correctly, it causes the immediate delivery of a SIGKILL signal to each, which has a similar effect). If the control group contains other groups, those, too, will be exterminated. Once the operation is complete, the group will normally be left in an entirely depopulated state.
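
As a rough illustration of what using this knob might look like from user space, here is a minimal Python sketch. The function name kill_cgroup is made up for this example, and the caller is assumed to pass the path of a cgroup v2 directory on a kernel that carries the patch; nothing here comes from the patch set itself.

```python
import os
from pathlib import Path


def kill_cgroup(cgroup_dir: str) -> None:
    """Write "1" to cgroup.kill in the given cgroup directory.

    kill_cgroup is a hypothetical helper, not part of any library.
    On a kernel with the patch applied, the write is expected to
    deliver SIGKILL to every process in the group and its descendants.
    """
    kill_file = Path(cgroup_dir) / "cgroup.kill"
    # Open write-only and without O_CREAT, so a missing cgroup.kill
    # (root group, older kernel) raises FileNotFoundError instead of
    # silently creating a regular file somewhere.
    fd = os.open(kill_file, os.O_WRONLY)
    try:
        os.write(fd, b"1")
    finally:
        os.close(fd)
```

Opening the file without O_CREAT is a deliberate choice in this sketch: if the control file is absent, the caller gets an error rather than a false sense of success.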

There are a couple of exceptions to this behavior, of course. The kill operation is defined to work on a process; if the process contains many threads, they will all suffer the same fate. But, if the control group in question is operating in the threaded mode, which allows the threads of a process to be split across multiple groups, that could lead to the untimely demise of threads that were not in the targeted group. So the kill operation will fail if attempted on groups running in the threaded mode.

Similarly, the kill operation will not take down kernel threads, as that could lead to any of a number of surprising results. Writing to the kill file in a group containing kernel threads is allowed, but the kernel threads themselves will survive the operation. In such cases, the group will not be empty at the end.
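
A caller that wants to know whether anything survived could, presumably, consult the group's cgroup.events file, which in cgroup v2 carries a "populated" key. The sketch below only parses the text of that file; the helper names are invented for this example, and whether surviving kernel threads show up there in exactly this way is an assumption rather than something the patch set states.

```python
def parse_cgroup_events(text: str) -> dict:
    """Turn "key value" lines, as found in cgroup.events, into a dict."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key:
            events[key] = value.strip()
    return events


def is_populated(events_text: str) -> bool:
    """True if the events text reports "populated 1"."""
    return parse_cgroup_events(events_text).get("populated") == "1"
```

After a kill, a caller might read cgroup.events and pass its contents to is_populated() to decide whether the group can be removed yet.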

Brauner cites a number of potential uses for this feature. One of the most obvious ones is container management; if a decision is made that a container should go away, the kill operation is a quick and straightforward way to make that happen. Systemd organizes services into control groups already; it could use this operation as an easier way to stop a service when need be. Similarly, user-space out-of-memory managers could use it as a quick way to make entire control groups go away if the need arises. The kill operation could also be an effective fork-bomb defense; when the kill operation is invoked, a flag is set on the group that prevents the creation of new processes, stopping a forking process in its tracks.

On the other hand, this feature could be thought of as equipping every control group in the system with a big red button with "do not push this" written on it. A stray write to the kill file has the potential to do a fair amount of damage to a running system. The obvious answer to such worries is "don't do that, then", but it is not hard to imagine some users wishing that there were some guard rails around this file.

The current patch works by sending a SIGKILL signal to every process within the target group. There is not any provision for sending a different signal; that, too, seems like something that may be wanted at some point. The kill() system call can send any signal; there may eventually be a case for allowing the cgroup.kill file to do the same.

There have not been a lot of comments on this patch series so far, perhaps partly because it has not been circulated beyond the cgroups mailing list. There is probably not much to complain about with regard to the implementation, which is fairly straightforward, so if there is an issue that could slow this work down, it will have to do with the design of the feature itself. There seem to be clear use cases for it, though, so a kill switch may indeed be a control-group feature in the near future.

Index entries for this article

Kernel Control groups


A "kill" button for control groups

Posted May 3, 2021 16:44 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (4 responses)

> On the other hand, this feature could be thought of as equipping every control group in the system with a big red button with "do not push this" written on it. A stray write to the kill file has the potential to do a fair amount of damage to a running system. The obvious answer to such worries is "don't do that, then", but it is not hard to imagine some users wishing that there were some guard rails around this file.

Linux has lots of "do not push this" buttons, of course. We've all heard the old saw about recursively rm'ing /, but you can also recursively chmod or chown it (which breaks all the setuid bits on the system), write garbage into the /dev/sdx devices, and so on... killall5(8) and kill(-1, SIGKILL) have both existed basically forever, and this seems significantly *less* dangerous than either of them.

On the other hand, perhaps having all these attractive nuisances was never a good design in the first place. You can cry about the sysadmin "taking personal responsibility" all you want, but the way to prevent outages is to add friction in front of dangerous operations, not to castigate fallible humans when something goes wrong. But I'm not sure what that should mean for this feature in particular.

A "kill" button for control groups

Posted May 3, 2021 16:50 UTC (Mon) by tbelaire (subscriber, #141140) [Link] (2 responses)

Just spitballing here, but if I'm deleting a repo on github, it makes me type the name of the repo in again, to make sure I have the right one.

You could replace "write 1" with "write the cgroup's name into the file" to bring down the group?

A "kill" button for control groups

Posted May 4, 2021 5:05 UTC (Tue) by riking (subscriber, #95706) [Link]

Or we could require that you write the number "9", which is the numeric value of SIGKILL. Other writes can be ignored for now but it's fairly obvious how to extend to other signals in the future.

A "kill" button for control groups

Posted May 8, 2021 9:39 UTC (Sat) by flussence (guest, #85566) [Link]

sysvinit's control socket won't react to anything that doesn't begin with a magic number (0x03091969). I imagine that's the date they came up with the idea.

A "kill" button for control groups

Posted May 3, 2021 17:29 UTC (Mon) by mtu (guest, #144375) [Link]

The file could be write-protected, requiring a chmod before using it as a kill switch. Sort of like those little caps over buttons that do dangerous things.

A "kill" button for control groups

Posted May 3, 2021 17:55 UTC (Mon) by brauner (subscriber, #109349) [Link]

By default any cgroup is owned by (host) root. So without delegating a cgroup only root can take down a cgroup tree. In addition, the root cgroup in which all kthreads and PID 1 live doesn't have a cgroup.kill file similar to how it doesn't have a cgroup.freeze file so it's impossible to take down the whole system (ignoring the fact that PID 1 can't be sent SIGKILL anyway).

So the interesting case is a delegated cgroup. This feature is only available in cgroup v2 and, in contrast to cgroup v1, cgroup v2 has a clear delegation mechanism. In order for a cgroup to be "delegated" to (read "owned by") an unprivileged process, the following files need to change ownership appropriately so that the unprivileged process can write to them:

cgroup.procs
cgroup.threads
cgroup.subtree_control
memory.oom.group

This is needed to correctly delegate control of a subtree to an unprivileged process. Do note that cgroup.freeze does not appear in this list (cat /sys/kernel/cgroup/delegate), and neither will cgroup.kill. So even if you delegate a cgroup to a process and want it to be able to manage subtrees, that doesn't mean you need to delegate cgroup.freeze or cgroup.kill too. You can leave those with unaltered ownership or even restrict them further. So even if your unprivileged process tried to freeze or kill the cgroup, either on purpose or by accident, it wouldn't be able to do so. By delegating a cgroup tree you're definitely _delegating resource management_, as that's required by the implementation, but imho you're also implying that delegation of _utility controllers_ such as freezer or kill is ok; you just don't need to actually do it.
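
To make the point concrete, here is a small Python sketch of how a delegating manager might work out which paths to chown. It takes the listing quoted above as input; the function name and the /sys/fs/cgroup/demo path are invented for illustration, and the actual chown calls are left out.

```python
def paths_to_chown(cgroup_dir: str, delegate_listing: str) -> list:
    """Return the cgroup directory plus the files named in the listing.

    delegate_listing is the text of /sys/kernel/cgroup/delegate,
    one file name per line. paths_to_chown is a hypothetical helper.
    """
    names = [ln.strip() for ln in delegate_listing.splitlines() if ln.strip()]
    return [cgroup_dir] + [f"{cgroup_dir}/{name}" for name in names]


# The listing as quoted in the comment above.
LISTING = """cgroup.procs
cgroup.threads
cgroup.subtree_control
memory.oom.group
"""

# "/sys/fs/cgroup/demo" is a made-up example path.
for path in paths_to_chown("/sys/fs/cgroup/demo", LISTING):
    print(path)
```

Since cgroup.kill and cgroup.freeze are not in the listing, they never end up in the result, so a manager following this approach would leave their ownership alone.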

A "kill" button for control groups

Posted May 3, 2021 21:39 UTC (Mon) by Sesse (subscriber, #53779) [Link] (2 responses)

I'm surprised this doesn't already exist. One of the things that systemd does _really_ well compared to sysvinit is that it's able to kill daemons reliably, with no fuss, and no cooperation of any kind (e.g. pid files). I always attributed this to its use of cgroups, assuming they had an easy way to just take down the entire cgroup.

A "kill" button for control groups

Posted May 3, 2021 22:06 UTC (Mon) by wahern (subscriber, #37304) [Link] (1 responses)

systemd just iterates over the list of PIDs in a cgroup and kills them individually. Contrary to the mythology it's neither completely deterministic nor race-free (i.e. subject to TOCTTOU), but nonetheless more robust than PID files, especially when dealing with forking, multi-process services.

For services which play nice--don't fork away from the supervisor, don't change process groups, etc--you can just kill the process group. But the people who understand the arcana of good process management, and the people who write software that people tend to install, seem to be mutually exclusive groups.

The ability to atomically kill all processes in a cgroup will finally bring systemd's actual behavior (mostly?) in line with the myth.
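
For readers who have not seen it, the iterate-and-kill approach looks roughly like the Python sketch below. This is not systemd's code; the function names and the max_rounds bound are invented, and the send parameter exists only so the loop can be exercised without signalling real processes.

```python
import os
import signal
from pathlib import Path


def read_pids(cgroup_dir: str) -> list:
    """Read the PIDs currently listed in cgroup.procs."""
    text = (Path(cgroup_dir) / "cgroup.procs").read_text()
    return [int(tok) for tok in text.split()]


def kill_by_iteration(cgroup_dir: str, max_rounds: int = 10, send=os.kill) -> bool:
    """Repeatedly SIGKILL every listed PID; True once the list is empty.

    The window between reading cgroup.procs and calling send() is the
    race: a process may fork in the meantime, or a PID may already
    have exited (and, in principle, been reused).
    """
    for _ in range(max_rounds):
        pids = read_pids(cgroup_dir)
        if not pids:
            return True
        for pid in pids:
            try:
                send(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited before we got to it
    return False  # gave up; something keeps repopulating the group
```

The need for a retry bound, and the lack of any clean answer for what to do when it is hit, is the weakness that a single in-kernel operation avoids.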

A "kill" button for control groups

Posted May 4, 2021 12:31 UTC (Tue) by hmh (subscriber, #3838) [Link]

So far so good... but I'd expect any real API for this to mimic signal grouping, and allow us to send *any* signal to the cgroup. Preferably with an option to do it "atomically" (if such a notion is possible, maybe relative to fork, clone and friends -- or to cgroup membership-changing?), and restricted to members of a cgroup that the calling process would have permissions to signal (plus the other restrictions that make sense: no kernel threads, etc).

Echoing 1 to a sysfs file to send SIGKILL looks like the kind of thing one throws around just to test the waters and see whether the whole idea has some traction, before doing the real thing...

A "kill" button for control groups

Posted May 3, 2021 22:06 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (6 responses)

This doesn't seem controversial to me. Just look at what it's replacing: the systemd-style while loop scraping PIDs out of /sys/fs/cgroup, dodging processes stuck in high-latency kernel syscalls while trying to win a game of whack-a-mole with high-priority fork-bombs on low-core-count systems with potentially unbounded run time and no clear exit condition...is an abomination familiar to anyone who has had to troubleshoot it, or worse, implement it.

It was possible to get a similar effect by freezing a cgroup before enumerating pids to kill it (i.e. first stop the fork-bombs from running, then kill the frozen processes), but the freezer cgroup has its own set of gotchas--we have to wait for the freeze to take effect to be sure we've captured every pid, and that waiting means we're back to "potentially unbounded run time" and distinguishing between new processes popping up and old processes that refuse to die, and with that extra complexity we are doing only slightly better than the abomination.

If the kernel implements this capability, it can lock out new processes from being created, while it terminates old ones. Userspace can do one write(2) syscall with running time proportional to the task list size, then forget anything in the cgroup existed (unless it chooses to wait until all killed processes blocked in syscalls exit, in which case cgroup2 has an API for that). Simple, elegant, and effective.

Of course, writing something to some file named "kill" will likely wipe out some processes. No lesser result should be expected by a human writing to a file on a control filesystem with such a name. Software blindly writing numbers into new cgroup files without knowing what the numbers mean is already asking for a world of trouble--it's best to make such software notorious, so it can be removed from the world.

But...maybe it would be better if writing, say, 0x4321fedc (LINUX_REBOOT_CMD_POWER_OFF, defined in sys/reboot.h) killed the cgroup, and other values didn't? Or the string "kill" or "-9" or really anything other than the first non-zero integer.

I also wonder why it only sends SIGKILL? People often want to send SIGTERM first, and since the kill file takes a numeric value anyway, it might as well be the signal ID.

A "kill" button for control groups

Posted May 3, 2021 22:14 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> If the kernel implements this capability, it can lock out new processes from being created, while it terminates old ones.

You can already do that with PID controllers.

A "kill" button for control groups

Posted May 3, 2021 23:19 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (4 responses)

systemd doesn't seem to know that. The pids controller was invented in 2015, systemd's cg_kill functions were last significantly updated in 2011. OTOH freezer cgroup was invented in 2008 and systemd didn't use that to derace cg_kill either.

It looks like there are some ways to escape from the pids controller which the kill button closes off: a process that is running fork() can evade some of the limits that are imposed after fork() starts and before it finishes, or escape by migrating to another cgroup. The kill-button patch leaves a note to smack that process with a SIGKILL just before fork() returns.

A "kill" button for control groups

Posted May 3, 2021 23:52 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Well, if you can escape to another cgroup then you can also avoid the kill controller. Normal processes don't have the rights for it.

Personally, I would prefer a reliable handle-based API for processes instead of trying to plug leaks in a dam with fingers.

A "kill" button for control groups

Posted May 4, 2021 22:07 UTC (Tue) by zblaxell (subscriber, #26385) [Link] (1 responses)

> if you can escape to another cgroup then you can also avoid the kill controller. Normal processes don't have the rights for it.

Rights can be delegated. That's one of the central features of cgroups: you don't need to be root to use it.

A process can move around within its delegation hierarchy and evade a (naive, non-looping) userspace terminator--that was part of what made looping (and possibly also recursive search) in userspace necessary. Processes can hold the controller FD's open so they can give themselves their rights back even if the control files are chmod-ed. Also probably a hundred other holes I haven't bothered to think about, and with this patch set, no longer have to.

A "kill" button for control groups

Posted May 4, 2021 22:46 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Realistically, systemd will kill processes faster than they can migrate within their subtree. It's a theoretical problem, but not a practical one.

A "kill" button for control groups

Posted May 4, 2021 16:50 UTC (Tue) by mezcalero (subscriber, #45103) [Link]

So, regarding use of freezer in systemd for killing: the cgroupsv1 freezer is really broken API-wise, i.e. you need to have a timed sleep loop to see when it is done. It's not usable for any code that strives to be reasonably clean. We never supported it in systemd for anything for this reason. I mean, there are limits to how much ugly low-level code we are willing to accept...

The cgroupsv2 freezer makes a ton more sense, and hence we expose it with high-level operations (systemctl freeze + systemctl thaw), but we don't use it to make killing race-free. We could do that, but it doesn't feel ideal to me, since freezing is slow, i.e. we need to initiate the freeze, then wait until the kernel tells us it is done (poll()), then enqueue the signal, and then unfreeze and wait again. And blocking syscalls can delay the freeze for a long time. Thus killing would become a "slow" operation in the worst case (at least that's my understanding), and that kinda sucks. After all, we want this as a clean-up operation that gets rid of broken stuff, i.e. SIGKILL is the unfriendly way to abort stuff, but if things are no longer abortive when we use the freezer, that defeats half the point.
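
The sequence described above (freeze, wait, signal, thaw) might be sketched in Python as follows. This is only an illustration of the steps, not anything systemd does: the function name, the timeout and the injectable send parameter are invented, and a plain sleep loop on cgroup.events stands in for the poll() a real implementation would use.

```python
import os
import signal
import time
from pathlib import Path


def freeze_kill_thaw(cgroup_dir: str, timeout: float = 5.0,
                     interval: float = 0.05, send=os.kill) -> bool:
    """Freeze a cgroup, SIGKILL its members, then thaw it.

    Returns False (after thawing again) if the freeze did not
    complete within the timeout. Hypothetical helper for illustration.
    """
    cg = Path(cgroup_dir)
    (cg / "cgroup.freeze").write_text("1")  # step 1: initiate the freeze

    # Step 2: wait for "frozen 1" in cgroup.events. Real code would
    # poll() on the file; a sleep loop keeps the sketch short.
    deadline = time.monotonic() + timeout
    while "frozen 1" not in (cg / "cgroup.events").read_text().splitlines():
        if time.monotonic() >= deadline:
            (cg / "cgroup.freeze").write_text("0")
            return False
        time.sleep(interval)

    # Step 3: the membership is now stable, so signal every member.
    for tok in (cg / "cgroup.procs").read_text().split():
        try:
            send(int(tok), signal.SIGKILL)
        except ProcessLookupError:
            pass

    (cg / "cgroup.freeze").write_text("0")  # step 4: thaw
    return True
```

Even in this simplified form the waiting step is visible, and that step is what makes the approach slow when a task is stuck in a blocking system call.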

I love Christian's work on this, since it fixes the race for us *and* is always a quick operation. We don't have to wait for anything "slow". (I mean, it internally also iterates through all processes, so it's not O(1), but that's not what I mean by "slow"...) It just enqueues the SIGKILL for each process in a race-free fashion, and that's all we need.

So, yeah, I am looking forward to Christian's work landing, and we'll happily make use of it in systemd once it does. It fixes a real problem for us.

Lennart

A "kill" button for control groups

Posted May 4, 2021 21:07 UTC (Tue) by barryascott (subscriber, #80640) [Link]

The kill file could be extended later if "9", the SIGKILL number, were written instead of "1".
This leaves the way clear to allow any signal that makes sense to be sent to all processes in the cgroup.

Barry
