Tracking resources and capabilities used
There are various types of limits and privileges that administrators can apply to processes or control groups (cgroups) in Linux, but it is sometimes difficult to determine what those values should be, except by trial and error. A patch set from Topi Miettinen aims to make that easier by tracking the resources and capabilities that processes use, in order to give users and administrators a starting point when setting those values. The idea is that the processes can be run under a normal load and the high-water values (as well as the capabilities used) will be recorded to provide a guide for future, more-restrictive deployments.
The 18-patch series is broken up into three groups: capabilities used (one patch), cgroup limits (three patches), and resource limits (14 patches). Capabilities used are reported in /proc/PID/status, while cgroup maximums are presented in files in the cgroup filesystem. Resource limits (i.e. rlimits), on the other hand, are reported in the /proc/PID/limits file. Those locations may change, since there are programs that parse the files in /proc and adding more information could alter the kernel's user-space interface.
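To make the interface concern concrete, here is a minimal sketch of the kind of user-space parser that already exists for /proc/PID/limits. It assumes the current layout of that file (a header line, then columns separated by runs of spaces, with the units column sometimes empty); the sample text is made up, and parse_limits() is a hypothetical helper, not part of any existing tool or of the patch set.

```python
import re

# Made-up sample in the current /proc/PID/limits layout.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
Max nice priority         0                    0
"""


def parse_limits(text):
    """Map each limit name to a (soft, hard, units) tuple of strings."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header line
        # Limit names contain single spaces, so split on two or more.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue  # blank or unexpected line
        name, soft, hard = fields[:3]
        units = fields[3] if len(fields) > 3 else ""
        limits[name] = (soft, hard, units)
    return limits


if __name__ == "__main__":
    for name, values in parse_limits(SAMPLE).items():
        print(name, values)
```

A parser like this reads columns by position, so an additional column in the file could be misread (as the units of a row that has none, for example). That is the sort of breakage the interface worry is about.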
As Miettinen says in the cover letter for the patches, much of the information can already be gleaned from various /proc files and using tools like ps, but those methods only give a value at one point in time. In order to be sure that transient spikes are also recorded, so they can be taken into account, the kernel needs to be involved; thus these patches.
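The difference between point-in-time sampling and a high-water mark can be shown with a small sketch. The usage numbers are invented for illustration, and both functions are hypothetical stand-ins: sampled_peak() plays the role of a tool polling /proc at intervals, while tracked_peak() plays the role of in-kernel tracking that sees every change.

```python
def sampled_peak(usage, interval):
    """Peak seen by a poller that only looks at every interval-th value."""
    return max(usage[::interval])


def tracked_peak(usage):
    """Peak seen by a tracker that is updated on every change."""
    peak = 0
    for value in usage:
        if value > peak:
            peak = value
    return peak


# Invented resource-usage series with one short spike (95).
usage = [10, 12, 11, 12, 95, 10, 11, 13, 12, 10]

if __name__ == "__main__":
    print("sampled:", sampled_peak(usage, 3))  # the poller misses the spike
    print("tracked:", tracked_peak(usage))
```

With these numbers the poller samples positions 0, 3, 6, and 9 and never sees the spike, while the tracker records it.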
But Konstantin Khlebnikov objected to the overall goal:
He also suggested that tracepoints could be used (perhaps in conjunction with SystemTap or other kernel tracing infrastructure), rather than adding high-water recording to the kernel.
But both Miettinen and Austin S. Hemmelgarn disagreed with that analysis. Miettinen noted that there are always risks when setting limits, but that the patches are just meant to help provide some guidance. Hemmelgarn essentially agreed:
Rlimits could be handled similarly, he said. Beyond that, though, there are different types of failure modes for processes that cannot get the resources they need (e.g. can't start a thread or process), which may not manifest as application errors or crashes. In addition, getting the information about the maximum usage from user space will be difficult or impossible, he said. In a follow-up post, he also noted that tracing can't supply any better answers for the upper bound of these values than internal kernel tracking can: "You can't get a perfectly reliable upper bound for any type of resource usage with just black box observations, period."
There were also comments on many of the individual patches. The capabilities-tracking patch simply adds a cap_used bit array to struct task_struct and sets the bit corresponding to a capability whenever that capability is checked (and passes the check). But as Andy Lutomirski pointed out, simply tracking the capabilities used by a process won't work well in the presence of ambient capabilities. If a process runs a program with ambient capabilities, which uses some capabilities beyond what the main process uses, those will be missed in the set of capabilities collected. He suggested tracking capabilities used for an entire process tree or cgroup.
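The bookkeeping and the gap Lutomirski described can be modeled in a few lines. This is a user-space sketch, not the patch's code: the Task class, its permitted mask, and tree_cap_used() are all hypothetical, and a child Task merely stands in for a program run with extra capabilities. Only the capability numbers are taken from the kernel's definitions.

```python
# Capability bit numbers as defined by Linux.
CAP_NET_BIND_SERVICE = 10
CAP_SYS_CHROOT = 18
CAP_SYS_ADMIN = 21


class Task:
    """Hypothetical stand-in for a task with a cap_used bit array."""

    def __init__(self, permitted, parent=None):
        self.permitted = permitted  # bitmask of capabilities the task holds
        self.cap_used = 0           # bits set only when a check passes
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def capable(self, cap):
        """Check a capability; record it as used if the check passes."""
        ok = bool(self.permitted & (1 << cap))
        if ok:
            self.cap_used |= 1 << cap
        return ok


def tree_cap_used(task):
    """Union of cap_used over a task and all of its descendants."""
    mask = task.cap_used
    for child in task.children:
        mask |= tree_cap_used(child)
    return mask


if __name__ == "__main__":
    held = (1 << CAP_NET_BIND_SERVICE) | (1 << CAP_SYS_CHROOT)
    parent = Task(held)
    child = Task(held, parent=parent)
    parent.capable(CAP_NET_BIND_SERVICE)
    child.capable(CAP_SYS_CHROOT)
    print(hex(parent.cap_used), hex(tree_cap_used(parent)))
```

Looking only at the parent's cap_used misses the capability exercised by the child; the per-tree union picks it up, which is the shape of the aggregation Lutomirski suggested.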
The cgroup patches track values for three specific controllers: the maximum PIDs used in a PID cgroup, maximum memory used in a memory cgroup, and the devices accessed in a device cgroup. The PID cgroup patch uses an atomic variable to track the highest number of PIDs that have been active in the cgroup at any point. It makes that number available in the pids.current_max file. Cgroup maintainer Tejun Heo didn't like the name (he suggested a high_watermark field in the pids.stats file) and was concerned that some of the atomic variable handling could lead to races.
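As a rough illustration of high-water tracking for a counter, the following sketch keeps a current count and the largest value it has reached. It is not the patch's code: PidCounter and its method names are hypothetical, and it uses a lock rather than an atomic variable so that the increment and the comparison against the maximum happen as one step. Which races Heo had in mind is not spelled out here, so the sketch only shows one way to keep the two values consistent.

```python
import threading


class PidCounter:
    """Hypothetical model of a PID counter with a high-water mark."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.current_max = 0  # highest value `current` has ever reached

    def charge(self, n=1):
        """Account for n new PIDs and update the high-water mark."""
        with self._lock:
            self.current += n
            if self.current > self.current_max:
                self.current_max = self.current

    def uncharge(self, n=1):
        """Release n PIDs; the high-water mark never goes down."""
        with self._lock:
            self.current -= n


if __name__ == "__main__":
    counter = PidCounter()
    counter.charge(3)
    counter.uncharge(2)
    counter.charge(1)
    print(counter.current, counter.current_max)
```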
The patch for the memory cgroup simply presents the existing watermark value in the memory.current_max file. But, as Johannes Weiner noted, that generally won't provide much useful information. The page cache is counted in that watermark and is not reduced in size unless there is memory pressure, "so in all but very few cases the high watermark you are introducing will be pegged to the configured limit".
The last of the cgroup patches keeps a list of devices that are accessed in a device cgroup. That list, which contains the device type (character or block), major and minor numbers, and access type (read, write, or mknod), can be read from the devices.accessed file.
The rlimit patches drew fewer comments in general (or, perhaps, the comments were outweighed by the sheer number of patches). There was some general confusion because Miettinen did not send a copy of the cover letter (or the first rlimit patch that added some infrastructure used by the rest) to everyone who got copies of the individual patches. In addition, the function name used to update the current maximum value, bump_rlimit(), was confusing to some, since it seems to imply that the actual rlimit is being increased (bumped).
There are individual patches to record (and sometimes report) the maximum use of different resources that are tied to rlimits. That includes the number of open files (RLIMIT_NOFILE), CPU usage (RLIMIT_CPU), file sizes created (RLIMIT_FSIZE), number of processes (RLIMIT_NPROC), and so on. There were some complaints about race conditions and using read-copy-update (RCU) incorrectly, along with some suggestions for better comments to make the intent of the code clearer. Aside from the final patch in the series, which Kees Cook pointed out was unneeded, the series as a whole got a fairly warm response.
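For comparison, this is what user space can already query today through Python's standard resource module: the configured soft and hard limits, plus one high-water mark the kernel already keeps (peak resident set size, reported as ru_maxrss). The helper names limits_snapshot() and fmt() are hypothetical. None of this reports the per-rlimit maximum usage that the patches would add.

```python
import resource

# The rlimits named above; all are available in Python's resource module on Linux.
NAMES = ["RLIMIT_NOFILE", "RLIMIT_CPU", "RLIMIT_FSIZE", "RLIMIT_NPROC"]


def limits_snapshot():
    """Return {name: (soft, hard)} for the current process."""
    return {name: resource.getrlimit(getattr(resource, name)) for name in NAMES}


def fmt(value):
    """Render a limit value the way /proc/PID/limits does for 'no limit'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, (soft, hard) in limits_snapshot().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
    # Peak resident set size so far (kilobytes on Linux).
    print("peak RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```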
There is clearly some work to be done, but maximum resource usage tracking seems like a feature that might make its way into the kernel in, say, 4.9 or 4.10 unless some major opposition appears. It will provide users with a way to gauge what their processes are doing so that limits and privileges can be tightened down appropriately. It certainly won't provide all the answers, but may give the starting point that Miettinen is seeking.
Tracking resources and capabilities used
Posted Jul 16, 2016 21:26 UTC (Sat) by geuder (guest, #62854)
When working with unprivileged containers and trying to minimize the capability set for container root, I once wrote a quick-and-dirty patch to my kernel to report failing capability checks.
Mostly this works well: if the capability test fails, the software fails to work as intended. However, I hit one exception somewhere related to mmap(). In every call there will be a test for some capability. Here, failure in this test does not mean that the whole syscall will fail; it just means that a different code path is taken, where both can still succeed. I did not need to investigate the details, because it turned out that my user space worked just fine, despite the missing capability and the failing test.
Well, I would need to search my old git branch and compare how it and the suggested patch handle this case. It might be quite tricky for the kernel to decide whether the caller really needs the capability in question or whether it will be happy with the other code path.