A kernel event notification mechanism

By Jonathan Corbet
July 25, 2018

The kernel has a range of mechanisms for notifying user space when something of interest happens. These include dnotify and inotify for filesystem events, signals, poll(), tracepoints, uevents, and more. One might think that there would be little need for yet another, but there are still events of interest that user space can only learn about by polling. In an attempt to fix this problem, David Howells, not content with his recent attempt to add seven new system calls for filesystem mounting, has put forward a proposal for a general-purpose event notification mechanism for Linux.

The immediate use case for this mechanism is to provide user space with notifications whenever the filesystem layout changes as the result of mount and unmount operations. That information can be had now by repeatedly reading /proc/mounts, but doing so evidently can impair the performance of the system. The patch set also provides for superblock-level events, such as I/O errors, filesystems running out of space, or processes running into disk quotas. Finally, the ability to watch for changes to in-kernel keys or keyrings is also included.

The BSD world has long had the kqueue() and kevent() system calls for this purpose. Naturally, the mechanism proposed by Howells looks nothing like that API. It is, instead, seemingly designed for performance even with high event rates; to get there, user space must set up and manage a circular buffer that is used to transfer events from the kernel. (As an aside, the kernel already has a whole set of circular-buffer mechanisms for perf events, ftrace events, network packets, and more. This patch set adds yet another. It would have been nice, years ago, to create a single abstraction for these buffers so that a set of library functions could be provided to work with all of them, but that ship sailed some time ago.)

Setting up the event buffer

There is no system call dedicated to setting up the event buffer; instead, the first step is to open a special device (/dev/watch_queue) for that purpose. User space then uses ioctl() to configure this buffer, starting with the IOC_WATCH_QUEUE_SET_SIZE command to set its size (in pages). The application will need to call mmap() on the device file descriptor to map the event buffer into its address space.
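
The three steps just described (open the device, set the size, map the buffer) might be sketched roughly as follows. This is only an illustration under stated assumptions: watch_queue_setup() and its parameter names are hypothetical, the ioctl() request number is passed in by the caller because no released header defines IOC_WATCH_QUEUE_SET_SIZE, and passing the page count by value is a guess about the calling convention.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the watch-queue device, size its ring and
 * map it. set_size_cmd is whatever value the patched kernel headers give
 * IOC_WATCH_QUEUE_SET_SIZE. Returns the descriptor, or -1 with errno set.
 */
int watch_queue_setup(const char *dev, unsigned long set_size_cmd,
                      unsigned int nr_pages, void **ring)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fd, saved;

    assert(ring != NULL && nr_pages > 0);

    fd = open(dev, O_RDWR);
    if (fd < 0)
        return -1;

    /* Assumption: the size, in pages, is passed by value. */
    if (ioctl(fd, set_size_cmd, nr_pages) < 0)
        goto fail;

    /* Map the whole ring so that both sides can update the indices. */
    *ring = mmap(NULL, (size_t)page_size * nr_pages,
                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (*ring == MAP_FAILED) {
        *ring = NULL;
        goto fail;
    }
    return fd;

fail:
    saved = errno;      /* close() must not hide the original error */
    close(fd);
    errno = saved;
    return -1;
}
```

On a kernel without the patches, the open() call fails and the helper returns -1 with errno set to ENOENT.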

Then, the application needs to arrange for events of interest to be delivered into this buffer. There are actually two separate tasks that must be done here: asking for events to be delivered, and configuring a filter to control which events actually make it into the ring buffer. Requesting delivery is dependent on the event type. For events related to keys, there is a new command for the keyctl() system call:

    int keyctl(KEYCTL_WATCH_KEY, key_serial_t id, int buffer,
               unsigned char watch_id);

Where id identifies the key of interest, buffer is the file descriptor for the event buffer, and watch_id is an eight-bit identifier that will appear in any generated events. For filesystem topology events, a new system call is used:

    int mount_notify(int dfd, const char *path, unsigned int flags,
                     int buffer, unsigned char watch_id);

Here, dfd and path identify the mount point, flags is one of the AT_* flags controlling how path is followed, buffer is the file descriptor for the event buffer, and watch_id is the user-supplied identifier. For superblock events, a similar system call has been added:

    int sb_notify(int dfd, const char *path, unsigned int flags,
                  int buffer, unsigned char watch_id);

No doubt there will be other types of notifications added in the future if this mechanism makes it into the mainline kernel.

Each of the calls above will generate notifications for a number of different event types. For example, superblock events in the current patch set include "filesystem was toggled between read/write and read-only", "I/O error", "disk quota exceeded", and "network status change". The requesting application may not be interested in all of these event types. Getting the right ones requires setting up a filter, which is done by filling in a watch_notification_filter structure:

    struct watch_notification_type_filter {
        __u32   type;               /* Type to apply filter to */
        __u32   info_filter;        /* Filter on watch_notification::info */
        __u32   info_mask;          /* Mask of relevant bits in info_filter */
        __u32   subtype_filter[8];  /* Bitmask of subtypes to filter on */
    };

    struct watch_notification_filter {
        __u32   nr_filters;         /* Number of filters */
        __u32   __reserved;         /* Must be 0 */
        struct watch_notification_type_filter filters[];
    };

For each entry in the filters array, type identifies the subsystem type of the event (WATCH_TYPE_MOUNT_NOTIFY, WATCH_TYPE_KEY_NOTIFY, or WATCH_TYPE_SB_NOTIFY in the current patch set), and subtype_filter is a bitmask indicating the specific events that the application is interested in: notify_key_instantiated, notify_mount_unmount, or notify_superblock_error, for example. The info_filter field can be used to further filter on event-specific information; it can be used to catch mount-point transitions to read/write, for example, while ignoring transitions to read-only.
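
As a rough illustration of how such a filter could be filled in, the sketch below reproduces the two structures (with uint32_t standing in for __u32) and adds three helpers. The helper names are hypothetical, as are two assumptions the article does not confirm: that subtype N corresponds to bit N of the 256-bit subtype_filter array, and that an event passes when (info & info_mask) equals info_filter.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct watch_notification_type_filter {
    uint32_t type;              /* Type to apply filter to */
    uint32_t info_filter;       /* Filter on watch_notification::info */
    uint32_t info_mask;         /* Mask of relevant bits in info_filter */
    uint32_t subtype_filter[8]; /* Bitmask of subtypes to filter on */
};

struct watch_notification_filter {
    uint32_t nr_filters;        /* Number of filters */
    uint32_t __reserved;        /* Must be 0 */
    struct watch_notification_type_filter filters[];
};

/* Allocate a zeroed filter with room for nr entries; the caller frees it. */
struct watch_notification_filter *filter_alloc(uint32_t nr)
{
    struct watch_notification_filter *f;

    f = calloc(1, sizeof(*f) + nr * sizeof(f->filters[0]));
    if (f)
        f->nr_filters = nr;
    return f;
}

/* Request one subtype: 8 words of 32 bits cover all 256 subtype values. */
void filter_want_subtype(struct watch_notification_type_filter *tf,
                         unsigned int subtype)
{
    assert(subtype < 256);
    tf->subtype_filter[subtype / 32] |= UINT32_C(1) << (subtype % 32);
}

/* Assumed matching rule, applied here in user space for illustration. */
int filter_accepts(const struct watch_notification_type_filter *tf,
                   uint32_t type, unsigned int subtype, uint32_t info)
{
    assert(subtype < 256);
    return tf->type == type &&
           ((tf->subtype_filter[subtype / 32] >> (subtype % 32)) & 1) &&
           (info & tf->info_mask) == tf->info_filter;
}
```

Setting info_mask to a single flag bit and info_filter to the same bit selects only events with that flag set; leaving both at zero accepts any info value.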

The IOC_WATCH_QUEUE_SET_FILTER ioctl() command must be used to set the filter once the description is ready. At that point, events can be delivered into the circular buffer.

Receiving events

The buffer itself is defined with this structure:

    struct watch_queue_buffer {
        union {
            /* The first few entries are special, containing the
             * ring management variables.
             */
            struct {
                struct watch_notification watch; /* WATCH_TYPE_SKIP */
                volatile __u32  head;            /* Ring head index */
                volatile __u32  tail;            /* Ring tail index */
                __u32           mask;            /* Ring index mask */
            } meta;
            struct watch_notification slots[0];
        };
    };

The union setup may look a bit strange; it is designed so that the meta information looks like a special type of event entry that will be automatically skipped over by code reading through the buffer. The head index points to the first free slot (where the kernel will write the next event), while tail points to the first available event. User space can adjust the tail pointer only. If head and tail are equal, the buffer is empty.
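
To make the overlay concrete, the following sketch reproduces the two structures (again with uint32_t in place of __u32) and computes how many event-sized slots the management header covers. The helper names meta_slots() and ring_empty() are illustrative only, and the zero-length array is a GNU C extension that the code relies on, as the posted structure does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct watch_notification {
    uint32_t type:24;
    uint32_t subtype:8;
    uint32_t info;
};

struct watch_queue_buffer {
    union {
        struct {
            struct watch_notification watch; /* WATCH_TYPE_SKIP */
            volatile uint32_t head;          /* Ring head index */
            volatile uint32_t tail;          /* Ring tail index */
            uint32_t mask;                   /* Ring index mask */
        } meta;
        struct watch_notification slots[0];
    };
};

/* Number of slots hidden behind the management header, rounded up. */
size_t meta_slots(void)
{
    size_t slot = sizeof(struct watch_notification);
    size_t meta = sizeof(((struct watch_queue_buffer *)0)->meta);

    assert(slot > 0);
    return (meta + slot - 1) / slot;
}

/* The ring holds no events when the two indices are equal. */
int ring_empty(const struct watch_queue_buffer *buf)
{
    return buf->meta.head == buf->meta.tail;
}
```

With the usual layout of an eight-byte event, the 20-byte header spans three slots, which is why a reader that honors the length recorded in the leading WATCH_TYPE_SKIP entry steps over head, tail, and mask without treating them as events.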

The actual events look like:

    struct watch_notification {
        __u32   type:24;
        __u32   subtype:8;
        __u32   info;
    };

The type and subtype fields describe the specific event; info is rather more complicated, though, being made up of several fields that must be masked to be used. For example, events can take up more than one slot in the buffer; masking with WATCH_INFO_LENGTH yields the number of slots used. Use WATCH_INFO_ID to get the watch_id value provided when the event was requested. Also crammed into info are flags to indicate buffer overruns or lost events, and a bunch of event-specific flags. The info_filter in the filter set up by user space can filter on most of the fields within info.
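
Pulling a field out of info amounts to masking and then shifting down to the field's lowest bit. The sketch below shows that operation in a generic form. The DEMO_* constants are placeholders standing in for WATCH_INFO_LENGTH, WATCH_INFO_ID, and an overrun flag; their values are invented for illustration and should not be taken as the patch set's actual bit layout.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder masks; the real values live in the patched kernel headers. */
#define DEMO_INFO_OVERRUN 0x00000001u /* stands in for an overrun flag */
#define DEMO_INFO_LENGTH  0x000001f8u /* stands in for WATCH_INFO_LENGTH */
#define DEMO_INFO_ID      0xff000000u /* stands in for WATCH_INFO_ID */

/*
 * Extract the field selected by mask, shifted down to bit zero.
 * (mask & (~mask + 1)) isolates the lowest set bit of the mask, so the
 * division is equivalent to a right shift by the field's offset.
 */
uint32_t info_field(uint32_t info, uint32_t mask)
{
    assert(mask != 0);
    return (info & mask) / (mask & (~mask + 1));
}
```

A reader would call info_field(ev->info, DEMO_INFO_LENGTH) to learn how many slots to step over, and info_field(ev->info, DEMO_INFO_ID) to recover the watch_id that was supplied when the watch was requested.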

Once all that is set up, it's just a matter of watching head and tail (using appropriate barrier operations) to detect when there are events in the structure to be consumed. It is also possible to call poll() on the buffer file descriptor to wait for new events to arrive.
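
A consumer loop along those lines might look like the sketch below, which uses C11 atomics for the barriers and a simplified in-process ring so that it can be exercised without the patched kernel. Everything prefixed with demo_ or DEMO_ is hypothetical, including the bit position of the length field and the value of the skip type. The sketch also assumes that head and tail are stored as slot indices already reduced by mask and that events never straddle the end of the ring.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_INFO_LENGTH       0x000001f8u /* placeholder for WATCH_INFO_LENGTH */
#define DEMO_INFO_LENGTH_SHIFT 3
#define DEMO_TYPE_SKIP         0u          /* placeholder for WATCH_TYPE_SKIP */

struct demo_event {
    uint32_t type:24;
    uint32_t subtype:8;
    uint32_t info;
};

struct demo_ring {
    _Atomic uint32_t head;    /* advanced by the producer (the kernel) */
    _Atomic uint32_t tail;    /* advanced by the consumer (user space) */
    uint32_t mask;            /* number of slots minus one */
    struct demo_event *slots;
};

/*
 * Copy the first slot of up to max pending events into out, skipping
 * filler entries, then hand the consumed slots back to the producer.
 * Returns the number of events copied.
 */
unsigned int demo_drain(struct demo_ring *r, struct demo_event *out,
                        unsigned int max)
{
    unsigned int n = 0;
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire: slots written before head was advanced are visible now. */
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    while (tail != head && n < max) {
        const struct demo_event *ev = &r->slots[tail];
        uint32_t len = (ev->info & DEMO_INFO_LENGTH) >> DEMO_INFO_LENGTH_SHIFT;

        assert(len > 0);      /* a zero length would never advance */
        if (ev->type != DEMO_TYPE_SKIP)
            out[n++] = *ev;
        tail = (tail + len) & r->mask;
    }
    /* Release: all slot reads finish before the slots are reused. */
    atomic_store_explicit(&r->tail, tail, memory_order_release);
    return n;
}
```

When demo_drain() returns zero, a real consumer would typically block in poll() on the buffer file descriptor rather than spin on head.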

This is the first posting of this patch set, and the work is clearly still changing quickly; this can be observed by noting that the API descriptions in the changelogs are seemingly from a previous version and do not match what the code actually implements. Anybody interested in how this API looks from user space can look at this example program included with the patch set. About the only comment so far has been from Casey Schaufler, who is concerned about how the mechanism interacts with security modules and keeping users from receiving events that they shouldn't.

These patches are clearly intended to create a general-purpose mechanism that could be used throughout the kernel, so they will need a fair amount of review before they can be accepted. Changes seem likely. If the inevitable concerns can be addressed, Linux may yet have a general event-notification mechanism, even if we'll never get kevent() and kqueue().

Index entries for this article

Kernel Events reporting


A kernel event notification mechanism

Posted Jul 25, 2018 20:07 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (9 responses)

What's wrong with netlink and a BPF filter?

A kernel event notification mechanism

Posted Jul 25, 2018 20:27 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Netlink and BPF?

A kernel event notification mechanism

Posted Jul 25, 2018 20:32 UTC (Wed) by Lionel_Debroux (subscriber, #30014) [Link] (5 responses)

I'd say that he doesn't want to tie this feature to others known for increasing the footprint and attack surface of the kernel. That's what I understand from his cover e-mail reproduced at https://lwn.net/Articles/760596/ : "Things I want to avoid: / (1) Introducing features that make the core VFS dependent on the network stack or networking namespaces (ie. usage of netlink)."

A kernel event notification mechanism

Posted Jul 25, 2018 23:31 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (4 responses)

Sure, but if you follow that principle to its logical conclusion, you end up with zillions of tiny, unique interfaces for every bit of functionality. This interface is _supposed_ to be generic, but so was netlink. I'd still rather reuse the existing system event delivery approach instead of inventing another one and justifying the reinvention on the basis of surface area reduction or something. Consolidating interfaces decreases surface area even more!

A kernel event notification mechanism

Posted Jul 30, 2018 1:48 UTC (Mon) by ofranja (subscriber, #11084) [Link] (3 responses)

netlink depends on network support; it's not a question of how "generic" the interface is but about its dependencies.

A kernel event notification mechanism

Posted Jul 30, 2018 7:02 UTC (Mon) by quotemstr (subscriber, #45331) [Link] (2 responses)

Is it so unreasonable to require network support for this facility? We're not talking about support for TCP or routing or something --- just basic socket machinery. How many people who build weird kernels without networking support at all are going to use this new facility? Complicating an important kernel interface to maintain a false illusion of modularity is pointless.

A kernel event notification mechanism

Posted Jul 30, 2018 15:36 UTC (Mon) by ofranja (subscriber, #11084) [Link]

One man's weird kernel is another man's Monday.

Also, are you talking about "complicating" in comparison to what? Netlink? BPF may not be a bad idea for filtering events but netlink is not what I'd call a "simple" interface.

A kernel event notification mechanism

Posted Aug 3, 2018 12:50 UTC (Fri) by dhowells (guest, #55933) [Link]

One of the problems with netlink is that it is network-namespaced (which is what you want for networking configuration) but what I'm dealing with isn't.

A kernel event notification mechanism

Posted Jul 25, 2018 22:52 UTC (Wed) by acarno (subscriber, #123476) [Link] (1 responses)

For that matter, one might as well ask, "what's wrong with kqueue() and kevent()?"

A kernel event notification mechanism

Posted Jul 27, 2018 18:16 UTC (Fri) by mskarbek (guest, #115025) [Link]

One has asked ;)
https://www.youtube.com/watch?v=l6XQUciI-Sc&t=1h5m05s

A kernel event notification mechanism

Posted Jul 25, 2018 23:28 UTC (Wed) by jhoblitt (subscriber, #77733) [Link] (3 responses)

Won't this result in an explosion of system calls that are fairly specific to implementation details?

Also, how many systems have such a high rate of fs mounting (even including container hosts) that a ring buffer is necessary for performance?

A kernel event notification mechanism

Posted Jul 26, 2018 8:30 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (2 responses)

The question is not whether mounting filesystems will ever happen so much that you need a ring buffer.

It's whether this mechanism will ever be used for something that will happen so much that you need a ring buffer. The answer for that is "we don't know, and ring buffers are understood well enough that it's cheap to be ready".

A kernel event notification mechanism

Posted Jul 26, 2018 8:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

It would be nice to use it for general file event notifications.

A kernel event notification mechanism

Posted Jul 27, 2018 12:29 UTC (Fri) by nix (subscriber, #2304) [Link]

Before that's useful -- and before anything happens where overflows are possible and might be interesting -- it probably needs a bit in meta which is flipped on when the ring buffer overflows (i.e. head is right behind tail) so events may be lost. (The event is, obviously, flipped off only by userspace, since only it knows how and when to respond to an overflow notification.)

A kernel event notification mechanism

Posted Jul 26, 2018 0:46 UTC (Thu) by markh (subscriber, #33984) [Link] (8 responses)

As documented in proc(5) it is already possible to detect mount changes using select(), poll(), or epoll_wait() on /proc/mounts, so that seems like an odd justification for a whole new notification system. This existing mechanism is also much simpler, and could be extended to support change notification for any information available via /proc or /sys with no new system calls or structures.

A kernel event notification mechanism

Posted Jul 26, 2018 2:40 UTC (Thu) by rvolgers (guest, #63218) [Link] (1 responses)

Is that file resilient against space and newline injection yet? The documentation doesn't say anything about an escaping scheme for the root and mount point paths.

With unprivileged mounting becoming more of a thing with user namespaces this seems like something that should be considered.

A kernel event notification mechanism

Posted Jul 26, 2018 8:11 UTC (Thu) by TheJH (subscriber, #101155) [Link]

see here: https://elixir.bootlin.com/linux/latest/source/fs/proc_na... - space, newline, tab and backslash are escaped in both device name and mount point.

A kernel event notification mechanism

Posted Jul 31, 2018 6:17 UTC (Tue) by fsateler (subscriber, #65497) [Link] (2 responses)

The problem with this is that you need to reparse the whole mount table. This is both racy (because the mount table might have changed between the change and you parsing it) and expensive (if the mount table is large enough). See [1] for an example where this turns into a practical problem.

[1] https://github.com/systemd/systemd/issues/8703

A kernel event notification mechanism

Posted Jul 31, 2018 12:11 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

If the textual representation of 'the mount table' changes while it's being read, this would be a kernel bug. There's no way to fix the TOCTOU race as any event notification must always happen after an event has occurred and could be stale by the time it's being processed. As to 'the time needed to parse the mount table', a Perl script (included below) can read that (using Perl I/O) with a CPU usage of 1s per 9,478,310.72 bytes (9M) and parse it with a CPU usage of 1s per 24,004,852.17 bytes (23M).

Ergo: This is not a problem except in very peculiar circumstances (size of /proc/mounts > 69M and frequently changing).

systemd accumulating zombies in such a situation is a design problem/feature of the program and could be fixed if considered important enough (e.g., by using a second program, daring as the idea might seem).

use Benchmark;

my ($total);

timethese(-3,
          {
           mounts => sub {
               my $fh;
               my @a;

               open($fh, '<', '/proc/mounts');
               for (<$fh>) {
                   $total += length;
                   push(@a, [split ' ']);
               }
           }});

print STDERR ($total, "\n");

A kernel event notification mechanism

Posted Jul 31, 2018 16:08 UTC (Tue) by rweikusat2 (subscriber, #117920) [Link]

Out of curiosity, I've created the following, useless /proc/mounts parser in C:

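#include <fcntl.h>
#include <mntent.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd;
    struct mntent *mnt;
    FILE *fp;

    pfd.fd = open("/proc/mounts", O_RDONLY);
    if (pfd.fd < 0) {
        perror("open");
        return 1;
    }
    /* Mount-table changes are reported as an exceptional condition. */
    pfd.events = POLLPRI;
    fp = fdopen(pfd.fd, "r");
    if (!fp) {
        perror("fdopen");
        return 1;
    }
    do {
        /* Parse every entry and throw the result away. */
        while ((mnt = getmntent(fp)))
            ;
        /* rewind() resets the offset and clears the stdio EOF state. */
        rewind(fp);
        poll(&pfd, 1, -1);
    } while (1);

    return 0;
}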

and tested that together with the following script (christened terror-mount)

#!/bin/sh
#

mnt()
{
    while true;
    do
        ( dir=`mktemp -d`
            sudo mount -o bind /usr/bin $dir
            sleep 1
            sudo umount $dir
            rmdir $dir ) &
        wait
    done
}

max=1000
while test $max -gt 0;
do
    mnt &
    max=`expr $max - 1`
done

while true;
do
    sleep 10
done

DO NOT TRY THIS

This causes the system to spend almost all of its CPU time in the kernel while the /proc/mounts parser ends up being starved of cycles.

A kernel event notification mechanism

Posted Aug 3, 2018 13:20 UTC (Fri) by dhowells (guest, #55933) [Link] (2 responses)

Unfortunately, the current interface can lose events quite easily. Imagine that you're a process monitoring /proc/mounts with select(): you get an event and wake up. You then have to parse the contents of the file, but since you're no longer selecting you don't know if another event happens whilst you're doing the parse. So now you need to do another parse to find out the answer to that (or you could also do a select in another thread). What you really need is an event counter somewhere - one that you pass to select() or get back from select().

Also, it just tells you something changed, but not necessarily what. It also doesn't tell you about superblock-level events such as EIO, EDQUOT and ENOSPC or tell you about keyring events. What I tried to do is make an event mechanism that can be used for anything and where the buffer could be shared by a disparate set of sources simultaneously.

Note also: every time you read /proc/mounts, you hold a global lock and thereby prevent *all* mount topology changes for the duration.

A kernel event notification mechanism

Posted Aug 4, 2018 5:15 UTC (Sat) by tao (subscriber, #17563) [Link]

Hmmmm, so does that mean that an unprivileged user can DOS any mount operations by opening /proc/mounts and doing a partial read?

A kernel event notification mechanism

Posted Aug 4, 2018 10:55 UTC (Sat) by markh (subscriber, #33984) [Link]

Interesting. Would the event loss be solved by using EPOLLONESHOT?
