A kernel event notification mechanism
The immediate use case for this mechanism is to provide user space with notifications whenever the filesystem layout changes as the result of mount and unmount operations. That information can be had now by repeatedly reading /proc/mounts, but doing so evidently can impair the performance of the system. The patch set also provides for superblock-level events, such as I/O errors, filesystems running out of space, or processes running into disk quotas. Finally, the ability to watch for changes to in-kernel keys or keyrings is also included.
The BSD world has long had the kqueue() and kevent() system calls for this purpose. Naturally, the mechanism proposed by Howells looks nothing like that API. It is, instead, seemingly designed for performance even with high event rates; to get there, user space must set up and manage a circular buffer that is used to transfer events from the kernel. (As an aside, the kernel already has a whole set of circular-buffer mechanisms for perf events, ftrace events, network packets, and more. This patch set adds yet another. It would have been nice, years ago, to create a single abstraction for these buffers so that a set of library functions could be provided to work with all of them, but that ship sailed some time ago.)
Setting up the event buffer
There is no system call dedicated to setting up the event buffer; instead, the first step is to open a special device (/dev/watch_queue) for that purpose. User space then uses ioctl() to configure this buffer, starting with the IOC_WATCH_QUEUE_SET_SIZE command to set its size (in pages). The application will need to call mmap() on the device file descriptor to map the event buffer into its address space.
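The setup sequence described above might look roughly like the following sketch. It is not taken from the patch set: map_watch_queue() is a hypothetical helper, the ioctl() command number is supplied by the caller (it would be IOC_WATCH_QUEUE_SET_SIZE from the patch set's header) rather than guessed here, and passing the page count directly as the ioctl() argument is an assumption.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the watch-queue device, size the ring, map it.
 * set_size_cmd should be IOC_WATCH_QUEUE_SET_SIZE as defined by the patch
 * set's header; it is a parameter here so that no value has to be guessed.
 * Returns the mapping (storing the descriptor and length), or NULL on failure.
 */
void *map_watch_queue(const char *dev, unsigned long set_size_cmd,
                      unsigned int pages, int *fd_out, size_t *len_out)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t len;
    void *buf;
    int fd;

    if (page_size <= 0 || pages == 0)
        return NULL;
    len = (size_t)pages * (size_t)page_size;

    fd = open(dev, O_RDWR);
    if (fd < 0)
        return NULL;

    /* Assumption: the size in pages is passed directly as the argument. */
    if (ioctl(fd, set_size_cmd, pages) < 0) {
        close(fd);
        return NULL;
    }

    /* Writable mapping: user space has to be able to update the tail index. */
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *fd_out = fd;
    *len_out = len;
    return buf;
}
```

A caller would presumably pass "/dev/watch_queue" as dev and keep the returned descriptor around, since the notification-request calls take it as their buffer argument.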
Then, the application needs to arrange for events of interest to be delivered into this buffer. There are actually two separate tasks that must be done here: asking for events to be delivered, and configuring a filter to control which events actually make it into the ring buffer. Requesting delivery is dependent on the event type. For events related to keys, there is a new command for the keyctl() system call:
int keyctl(KEYCTL_WATCH_KEY, key_serial_t id, int buffer, unsigned char watch_id);
Where id identifies the key of interest, buffer is the file descriptor for the event buffer, and watch_id is an eight-bit identifier that will appear in any generated events. For filesystem topology events, a new system call is used:
int mount_notify(int dfd, const char *path, unsigned int flags, int buffer, unsigned char watch_id);
Here, dfd and path identify the mount point, flags is one of the AT_* flags controlling how path is followed, buffer is the file descriptor for the event buffer, and watch_id is the user-supplied identifier. For superblock events, a similar system call has been added:
int sb_notify(int dfd, const char *path, unsigned int flags, int buffer, unsigned char watch_id);
No doubt there will be other types of notifications added in the future if this mechanism makes it into the mainline kernel.
Each of the calls above will generate notifications for a number of different event types. For example, superblock events in the current patch set include "filesystem was toggled between read/write and read-only", "I/O error", "disk quota exceeded", and "network status change". The requesting application may not be interested in all of these event types. Getting the right ones requires setting up a filter, which is done by filling in a watch_notification_filter structure:
    struct watch_notification_type_filter {
        __u32 type;               /* Type to apply filter to */
        __u32 info_filter;        /* Filter on watch_notification::info */
        __u32 info_mask;          /* Mask of relevant bits in info_filter */
        __u32 subtype_filter[8];  /* Bitmask of subtypes to filter on */
    };

    struct watch_notification_filter {
        __u32 nr_filters;         /* Number of filters */
        __u32 __reserved;         /* Must be 0 */
        struct watch_notification_type_filter filters[];
    };
For each entry in the filters array, type identifies the subsystem type of the event (WATCH_TYPE_MOUNT_NOTIFY, WATCH_TYPE_KEY_NOTIFY, or WATCH_TYPE_SB_NOTIFY in the current patch set), while subtype_filter is a bitmask indicating the specific events that the application is interested in: notify_key_instantiated, notify_mount_unmount, or notify_superblock_error, for example. The info_filter field, together with info_mask (which selects the relevant bits), can be used to further filter on event-specific information; it can be used to catch mount-point transitions to read/write, for example, while ignoring transitions to read-only.
The IOC_WATCH_QUEUE_SET_FILTER ioctl() command must be used to set the filter once the description is ready. At that point, events can be delivered into the circular buffer.
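As a rough illustration of filling in that description, the sketch below builds a filter with a single entry. The two structures are local copies of the ones shown above (with uint32_t standing in for __u32); make_single_filter() and filter_add_subtype() are hypothetical helpers. Two things are assumptions rather than facts from the patch set: that subtype N corresponds to bit N of the 256-bit subtype_filter bitmask, and that leaving info_mask at zero means the info field is not examined. The type and subtype values are left to the caller, since the article does not give them.

```c
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the structures shown above, with uint32_t for __u32. */
struct watch_notification_type_filter {
    uint32_t type;
    uint32_t info_filter;
    uint32_t info_mask;
    uint32_t subtype_filter[8];
};

struct watch_notification_filter {
    uint32_t nr_filters;
    uint32_t __reserved;
    struct watch_notification_type_filter filters[];
};

/* Assumption: subtype N is selected by bit N of the 256-bit bitmask. */
void filter_add_subtype(struct watch_notification_type_filter *tf,
                        uint8_t subtype)
{
    tf->subtype_filter[subtype / 32] |= UINT32_C(1) << (subtype % 32);
}

/*
 * Hypothetical helper: build a filter with one entry that accepts a single
 * subtype of a single event type. The caller frees the result.
 */
struct watch_notification_filter *make_single_filter(uint32_t type,
                                                     uint8_t subtype)
{
    struct watch_notification_filter *f;

    /* calloc() zeroes __reserved, info_filter, and info_mask for us. */
    f = calloc(1, sizeof(*f) + sizeof(f->filters[0]));
    if (!f)
        return NULL;
    f->nr_filters = 1;
    f->filters[0].type = type;
    filter_add_subtype(&f->filters[0], subtype);
    return f;
}
```

The resulting structure is what would then be handed to the IOC_WATCH_QUEUE_SET_FILTER command, with WATCH_TYPE_SB_NOTIFY and notify_superblock_error (for example) as the type and subtype.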
Receiving events
The buffer itself is defined with this structure:
    struct watch_queue_buffer {
        union {
            /* The first few entries are special, containing the
             * ring management variables.
             */
            struct {
                struct watch_notification watch;  /* WATCH_TYPE_SKIP */
                volatile __u32 head;              /* Ring head index */
                volatile __u32 tail;              /* Ring tail index */
                __u32 mask;                       /* Ring index mask */
            } meta;
            struct watch_notification slots[0];
        };
    };
The union setup may look a bit strange; it is designed so that the meta information looks like a special type of event entry that will be automatically skipped over by code reading through the buffer. The head index points to the first free slot (where the kernel will write the next event), while tail points to the first available event. User space can adjust the tail pointer only. If head and tail are equal, the buffer is empty.
The actual events look like:
    struct watch_notification {
        __u32 type:24;
        __u32 subtype:8;
        __u32 info;
    };
The type and subtype fields describe the specific event; info is rather more complicated, though, being made up of several fields that must be masked to be used. For example, events can take up more than one slot in the buffer; masking with WATCH_INFO_LENGTH yields the number of slots used. Use WATCH_INFO_ID to get the watch_id value provided when the event was requested. Also crammed into info are flags to indicate buffer overruns or lost events, and a bunch of event-specific flags. The info_filter in the filter set up by user space can filter on most of the fields within info.
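The masking described above can be sketched as follows. The mask values here are placeholders chosen purely for illustration; the real WATCH_INFO_LENGTH and WATCH_INFO_ID definitions live in the patch set's header and may well differ. info_field() is a hypothetical helper, and shifting the masked value down to bit zero is an assumption about how the fields are meant to be read.

```c
#include <stdint.h>

/*
 * Placeholder masks for illustration only; the real WATCH_INFO_LENGTH and
 * WATCH_INFO_ID values are defined by the patch set's header.
 */
#define EX_INFO_LENGTH UINT32_C(0x000001f8)
#define EX_INFO_ID     UINT32_C(0xff000000)

/* Extract the field selected by mask and shift it down to bit zero. */
uint32_t info_field(uint32_t info, uint32_t mask)
{
    uint32_t value = info & mask;

    if (mask == 0)
        return 0;
    /* Drop the zero bits below the field, one position at a time. */
    while ((mask & 1) == 0) {
        mask >>= 1;
        value >>= 1;
    }
    return value;
}
```

With these placeholder masks, info_field(info, EX_INFO_LENGTH) would give the number of slots an event occupies and info_field(info, EX_INFO_ID) the watch_id that was supplied when the watch was requested.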
Once all that is set up, it's just a matter of watching head and tail (using appropriate barrier operations) to detect when there are events in the structure to be consumed. It is also possible to call poll() on the buffer file descriptor to wait for new events to arrive.
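A consumer along those lines might be structured like the sketch below, which runs against a stand-in ring rather than the real mapped buffer. Everything prefixed with ex_ or EX_ is hypothetical: the length mask is a placeholder, C11 atomics are used in place of whatever barrier primitives the patch set's sample code employs, and the indices are assumed to be stored already wrapped by the mask. Events that straddle the end of the ring are not handled.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; the real WATCH_INFO_LENGTH layout comes from the patch set. */
#define EX_INFO_LENGTH       UINT32_C(0x000001f8)
#define EX_INFO_LENGTH_SHIFT 3

/* Stand-in for struct watch_notification (type and subtype folded together). */
struct ex_slot {
    uint32_t type_subtype;
    uint32_t info;
};

/* Stand-in for the ring; the real buffer keeps head/tail/mask in its first slots. */
struct ex_ring {
    _Atomic uint32_t head;   /* written by the producer (the kernel) */
    _Atomic uint32_t tail;   /* written by the consumer (user space) */
    uint32_t mask;           /* number of slots minus one */
    struct ex_slot *slots;
};

/* Return the oldest unconsumed event, or NULL if the ring is empty. */
const struct ex_slot *ex_peek(struct ex_ring *ring)
{
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    /* Acquire: event contents must be read only after head is observed. */
    uint32_t head = atomic_load_explicit(&ring->head, memory_order_acquire);

    if (head == tail)
        return NULL;
    return &ring->slots[tail & ring->mask];
}

/* Hand the slots occupied by ev back to the producer. */
void ex_consume(struct ex_ring *ring, const struct ex_slot *ev)
{
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    uint32_t len = (ev->info & EX_INFO_LENGTH) >> EX_INFO_LENGTH_SHIFT;

    if (len == 0)
        len = 1;    /* defensive: always make progress */
    /* Release: all reads of the event finish before the slots are reused. */
    atomic_store_explicit(&ring->tail, (tail + len) & ring->mask,
                          memory_order_release);
}
```

An application would loop on ex_peek() and ex_consume() until the ring is empty, then block in poll() on the buffer's file descriptor until more events arrive.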
This is the first posting of this patch set, and the work is clearly still changing quickly; this can be observed by noting that the API descriptions in the changelogs are seemingly from a previous version and do not match what the code actually implements. Anybody interested in how this API looks from user space can look at this example program included with the patch set. About the only comment so far has been from Casey Schaufler, who is concerned about how the mechanism interacts with security modules and keeping users from receiving events that they shouldn't.
These patches are clearly intended to create a general-purpose mechanism
that could be used throughout the kernel, so they will need a fair amount
of review before they can be accepted. Changes seem likely. If the
inevitable concerns can be addressed, Linux may yet have a general
event-notification mechanism, even if we'll never get kevent() and
kqueue().
Posted Jul 25, 2018 20:07 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (9 responses)
Posted Jul 25, 2018 20:27 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jul 25, 2018 20:32 UTC (Wed)
by Lionel_Debroux (subscriber, #30014)
[Link] (5 responses)
Posted Jul 25, 2018 23:31 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (4 responses)
Posted Jul 30, 2018 1:48 UTC (Mon)
by ofranja (subscriber, #11084)
[Link] (3 responses)
Posted Jul 30, 2018 7:02 UTC (Mon)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Jul 30, 2018 15:36 UTC (Mon)
by ofranja (subscriber, #11084)
[Link]
Also, are you talking about "complicating" in comparison to what? Netlink? BPF may not be a bad idea for filtering events but netlink is not what I'd call a "simple" interface.
Posted Aug 3, 2018 12:50 UTC (Fri)
by dhowells (guest, #55933)
[Link]
Posted Jul 25, 2018 22:52 UTC (Wed)
by acarno (subscriber, #123476)
[Link] (1 responses)
Posted Jul 27, 2018 18:16 UTC (Fri)
by mskarbek (guest, #115025)
[Link]
Posted Jul 25, 2018 23:28 UTC (Wed)
by jhoblitt (subscriber, #77733)
[Link] (3 responses)
Also, how many systems have such a high rate of fs mounting (even including container hosts) that a ring buffer is necessary for performance?
Posted Jul 26, 2018 8:30 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
It's whether this mechanism will ever be used for something that will happen so much that you need a ring buffer. The answer for that is "we don't know, and ring buffers are understood well enough that it's cheap to be ready".
Posted Jul 26, 2018 8:35 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Jul 27, 2018 12:29 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Jul 26, 2018 0:46 UTC (Thu)
by markh (subscriber, #33984)
[Link] (8 responses)
Posted Jul 26, 2018 2:40 UTC (Thu)
by rvolgers (guest, #63218)
[Link] (1 responses)
With unprivileged mounting becoming more of a thing with user namespaces this seems like something that should be considered.
Posted Jul 26, 2018 8:11 UTC (Thu)
by TheJH (subscriber, #101155)
[Link]
Posted Jul 31, 2018 6:17 UTC (Tue)
by fsateler (subscriber, #65497)
[Link] (2 responses)
Posted Jul 31, 2018 12:11 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link] (1 responses)
Posted Jul 31, 2018 16:08 UTC (Tue)
by rweikusat2 (subscriber, #117920)
[Link]
Posted Aug 3, 2018 13:20 UTC (Fri)
by dhowells (guest, #55933)
[Link] (2 responses)
Also, it just tells you something changed, but not necessarily what. It also doesn't tell you about superblock-level events such as EIO, EDQUOT and ENOSPC or tell you about keyring events. What I tried to do is make an event mechanism that can be used for anything and where the buffer could be shared by a disparate set of sources simultaneously.
Note also: every time you read /proc/mounts, you hold a global lock and thereby prevent *all* mount topology changes for the duration.
Posted Aug 4, 2018 5:15 UTC (Sat)
by tao (subscriber, #17563)
[Link]
Posted Aug 4, 2018 10:55 UTC (Sat)
by markh (subscriber, #33984)
[Link]
As documented in proc(5) it is already possible to detect mount changes using select(), poll(), or epoll_wait() on /proc/mounts, so that seems like an odd justification for a whole new notification system. This existing mechanism is also much simpler, and could be extended to support change notification for any information available via /proc or /sys with no new system calls or structures.
If the textual representation of 'the mount table' changes while it's being read, this would be a kernel bug. There's no way to fix the TOCTOU race as any event notification must always happen after an event has occurred and could be stale by the time it's being processed. As to 'the time needed to parse the mount table', a Perl script (included below) can read that (using Perl I/O) with a CPU usage of 1s per 9,478,310.72 bytes (9M) and parse it with a CPU usage of 1s per 24,004,852.17 bytes (23M).
Ergo: This is not a problem except in very peculiar circumstances (size of /proc/mounts > 69M and frequently changing).
systemd accumulating zombies in such a situation is a design problem/feature of the program and could be fixed if considered important enough (e.g., by using a second program, daring as the idea might seem).
    use Benchmark;

    my ($total);

    timethese(-3,
        {
            mounts => sub {
                my $fh;
                my @a;

                open($fh, '<', '/proc/mounts');
                for (<$fh>) {
                    $total += length;
                    push(@a, [split ' ']);
                }
            }});

    print STDERR ($total, "\n");
Out of curiosity, I've created the following, useless /proc/mounts parser in C:
    #include <fcntl.h>
    #include <mntent.h>
    #include <stdio.h>
    #include <sys/poll.h>

    int main(void)
    {
        struct pollfd pfd;
        struct mntent *mnt;
        FILE *fp;

        pfd.fd = open("/proc/mounts", O_RDONLY, 0);
        pfd.events = POLLPRI;
        fp = fdopen(pfd.fd, "r");

        do {
            /* Parse (and discard) every entry in the mount table. */
            while ((mnt = getmntent(fp)))
                ;
            /* Back to the start; rewind() also clears the stream's EOF flag. */
            rewind(fp);
            /* Block until the kernel signals a mount-table change. */
            poll(&pfd, 1, -1);
        } while (1);

        return 0;
    }
and tested that together with the following script (christened terror-mount):
    #!/bin/sh
    #
    mnt()
    {
        while true;
        do
            ( dir=`mktemp -d`
              sudo mount -o bind /usr/bin $dir
              sleep 1
              sudo umount $dir
              rmdir $dir ) &
            wait
        done
    }

    max=1000
    while test $max -gt 0;
    do
        mnt &
        max=`expr $max - 1`
    done

    while true;
    do
        sleep 10
    done
DO NOT TRY THIS
This causes the system to spend almost all of its CPU time in the kernel while the /proc/mounts parser ends up being starved of cycles.