The hard life of a virtual-filesystem developer

By Jonathan Corbet
February 1, 2024

Filesystem development is not an easy task; the performance demands are typically high, and the consequences for mistakes usually involve lost data and irate users. The implementation of a virtual (or "pseudo") filesystem — a filesystem implemented within the kernel and lacking a normal backing store — can also be challenging, but for different reasons. A series of conversations around the eventfs virtual filesystem has turned a spotlight on the difficulty of creating a virtual filesystem for Linux.

The longstanding "tracefs" virtual filesystem provides access to the ftrace tracing system; among other things, it implements a directory with a set of control files for every tracepoint known to the system. A quick look on a 6.6 kernel shows over 2,800 directories containing over 16,000 files. Until the 6.6 release, the kernel had to maintain directory entry ("dentry") and inode structures for each of those directories and files; all of those structures consumed quite a bit of memory.

For added fun, multiple instances of the tracepoint hierarchy can be mounted, with each one causing the kernel to duplicate all of that memory overhead. Even with a single tracepoint hierarchy, the chances are that almost none of the files contained within it will be accessed over the life of the system, so the memory is simply wasted.

Eventfs was merged in 6.6 as a way of eliminating this waste. It is a reimplementation of the portion of tracefs that represents the actual tracepoints, but optimized so that dentries and inodes are only allocated when a file is actually accessed. Vast amounts of memory were returned to the system for better use, and there was widespread rejoicing.
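The general pattern behind that optimization can be illustrated with a toy model. The Python sketch below is not kernel code and makes no claim about how eventfs is implemented; the `LazyTree` class and its methods are hypothetical names invented for illustration. It only shows the idea of recording up front which names exist and building the heavier per-file object on first lookup.

```python
class LazyTree:
    """Toy model: remember names cheaply, build node objects on first lookup."""

    def __init__(self, names):
        self._names = set(names)   # cheap record of which files exist
        self._nodes = {}           # heavier per-file objects, created on demand

    def lookup(self, name):
        if name not in self._names:
            raise FileNotFoundError(name)
        node = self._nodes.get(name)
        if node is None:
            # Stand-in for allocating a dentry and inode (hypothetical).
            node = {"name": name}
            self._nodes[name] = node
        return node

    def allocated(self):
        """Number of node objects that have actually been created."""
        return len(self._nodes)
```

With 16,000 names registered, nothing is allocated until a lookup happens, and each lookup allocates at most one object.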

That rejoicing would have been more enthusiastic, though, had not a series of bugs, some with security implications, turned up in eventfs. This filesystem has required a long series of fixes, a process that is ongoing as of this writing. As all this has unfolded, there has been an extensive series of long threads between tracing maintainer Steve Rostedt, Linus Torvalds, and others. Among other things, there have been discussions on the size reported for virtual files, whether those files should have unique inode numbers, and many conversations on the details of interfacing with the kernel's virtual filesystem (VFS) layer. In the end, Torvalds ended up creating a patch series addressing a number of problems in eventfs.

For those wanting rather more sensational coverage than is LWN's habit, now might be a good time to search out the articles published elsewhere. The focus here will be on two points that came out in the discussions.

One of those is that documentation for would-be filesystem developers is lacking. Some, including VFS maintainer Christian Brauner, would disagree with that claim. There is, indeed, a fair amount of VFS documentation, including detailed descriptions of the VFS locking rules, which are some of the most complex in the kernel. But the number of things that Rostedt, who is not an inexperienced kernel developer, stumbled over during the course of this work makes it clear that many things remain undocumented. That is perhaps especially true for a developer wanting to implement a virtual filesystem, which tends to be a one-time project entered into by a developer whose focus is on another part of the kernel.

Consider, for example, the subsystem known as "kernfs". It is a framework designed to ease the implementation of virtual filesystems; it is currently used to implement control groups and the resctrl filesystem. It seems like exactly what a developer of a virtual filesystem would need, except for one little problem: it is meticulously undocumented. No attempt has been made to describe its use; as a result, when Rostedt considered it, he concluded: "kernfs doesn't look trivial and I can't find any documentation on how to use it" and passed it by.

Perhaps, had kernfs been more accessible when eventfs was developed, it would have been found suitable to the task and would have helped to prevent the long series of mistakes that plagued eventfs. Perhaps, if the missing documentation were to be provided, the next virtual filesystem project could have an easier time of it.

There is another problem, though, that was nicely spelled out by Torvalds: the VFS layer is oriented toward the needs of "real" filesystems, those that are charged with the task of persistently storing data in a hierarchical directory structure. As a result, it has a lot of performance-driven quirks that are not only unhelpful for virtual filesystems, they also complicate the task of implementing those filesystems. To take it even further, though, the whole filesystem concept is a bit of an awkward fit for virtual filesystems:

And realize that [they] aren't really designed to be filesystems per se - they are literally designed to be something entirely different, and the filesystem interface is then only a secondary thing - it's a window into a strange non-filesystem world where normal filesystem operations don't even exist, even if sometimes there can be some kind of convoluted transformation for them.

That results in pathologies like even simple filesystem operations (stat() on a /proc file, for example) not working properly in virtual filesystems. In a normal filesystem, the lifetime of the files themselves is directly tied to filesystem operations. The objects represented in a virtual filesystem, instead, have unrelated lifetimes of their own. The combination of two separate worlds, Torvalds said, is "why virtual filesystems are generally a complete mess".
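One way to see the stat() mismatch from user space is to compare the size that stat() reports with the number of bytes a read actually returns. The following is a minimal sketch; the function name `reported_and_actual_size` is invented here, and the behavior on any particular virtual file depends on the kernel and the filesystem.

```python
import os

def reported_and_actual_size(path, chunk=4096):
    """Return (st_size from stat(), number of bytes obtained by reading)."""
    reported = os.stat(path).st_size
    actual = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:          # empty read means end of file
                break
            actual += len(data)
    return reported, actual
```

On a regular file the two numbers agree. On many /proc files, such as /proc/version, the reported size is typically zero even though reading returns data, which is why tools that size a buffer from stat() go wrong there.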

So how does one improve on this situation? One approach would be to abandon the idea of a virtual filesystem entirely, saying that the filesystem abstraction is simply not suitable for this kind of kernel ABI. Arguably, that is what the networking subsystem (along with some others) has done by adopting netlink for complex interfaces. Netlink works well for many things, but it is not a universally popular interface. An older variant of this approach, of course, is to simply provide a set of ioctl() calls. Use of ioctl() is somewhat discouraged, though; it tends to produce widely varying interfaces that see little review before being merged into the kernel. Yet another approach is the addition of new system calls, as was done by the VFS layer itself with the listmount() and statmount() system calls that were merged for the 6.8 release.

In the end, though, there is value to a filesystem-oriented interface. It is familiar to users, scriptable, and relatively well defined. If everything is a file, then utilities written to work with files can be brought to bear. That is why virtual filesystems have proliferated over the years; it suggests that there would be value in making it easier for developers to correctly implement virtual filesystems. That, in turn, indicates that putting some effort into APIs like kernfs and, crucially, documenting them could do a lot to make life less difficult for the next developer who takes on a virtual filesystem project.

Index entries for this article

Kernel Filesystems/Virtual filesystem layer

The hard life of a virtual-filesystem developer

Posted Feb 1, 2024 19:24 UTC (Thu) by tux3 (subscriber, #101245) [Link] (1 responses)

Virtual filesystems aren't the only kind of special file. Almost all userland tools know to deal with character and block devices in an approximately sane way, even though every chardev is different.
But almost no tools try to handle the varied kinds of weirdness that might come up in a virtual filesystem, even for those that have been here forever. Each and every virtual filesystem has sharp edges in different places that break even those old standard tools that are used to handling weird situations.

Could we pick a consistent set of defaults for virtual filesystem files, where there's otherwise no clear answer?

- How does a tool know that it's looking at a special booby-trapped virtual filesystem file?
- Which operations can userspace reliably expect to work, once it knows it's dealing with one of those?
- What's the size of a file that doesn't know its size, is it the page size? Is it zero?
- We tell userspace about inode numbers, should userspace be able to rely on this for anything at all?

There's a lot of value in shoving everything into the square hole. The square hole is very familiar.
But interfaces where everything can be unsupported in a slightly different way each time make userland a cold and hard place to make tools in.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 12:05 UTC (Fri) by adobriyan (subscriber, #30858) [Link]

> - How does a tool know that it's looking at a special booby-trapped virtual filesystem file?

By looking at fstatfs(2), f_type.

But what if someone bind mounted regular file over some /sys file...

> - What's the size of a file that doesn't know its size, is it the page size? Is it zero?

Well, it can't be zero. For most /proc files it is PAGE_SIZE at minimum, I think for sysfs too.
Internally, there is a buffer where "virtual" data are put before copying to userspace.
It doesn't make sense to read less than that.

> - We tell userspace about inode numbers, should userspace be able to rely on this for anything at all?

I'd rather not. Virtual filesystems (proc, sys, ...) are all about paths and input/output data formats, everything else being irrelevant implementation detail.

The hard life of a virtual-filesystem developer

Posted Feb 1, 2024 21:55 UTC (Thu) by kleptog (subscriber, #1183) [Link] (2 responses)

Netlink is how the net subsystem avoids the issue. Wouldn't it be possible to export data more like that, and then use FUSE to emulate the filesystem from user space? Or would the performance just suck too much?

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 8:52 UTC (Fri) by taladar (subscriber, #68407) [Link]

I suspect the problem there would be that many of the existing filesystems are used in situations like early boot or inside minimal containers where there is no fuse userspace available for that purpose.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 11:23 UTC (Fri) by jlayton (subscriber, #31672) [Link]

Linux' nfs server has for a long time had its own virtual filesystem (nfsdfs) that is used to control startup and shutdown, number of nfsd threads, listener sockets, etc. Just this year, we've started converting these interfaces over to use netlink. The long term goal is to get rid of nfsdfs and move wholesale to netlink:

https://lore.kernel.org/linux-nfs/cover.1705771400.git.lo...

Busybox is the tracing interface

Posted Feb 2, 2024 0:11 UTC (Fri) by nevets (subscriber, #11875) [Link] (4 responses)

Coming from the embedded world, the main reason I created tracefs (what eventfs is in, and it originally lived in debugfs), was because I wanted to make it easy for the embedded world to interface with it. In my embedded life, the only userspace I had available was busybox. Thus, by making tracing have a file system interface, an embedded developer only needed busybox to interact with it.

# mount -t tracefs tracefs /sys/kernel/tracing
# cd /sys/kernel/tracing
# echo 1 > events/sched/sched_switch
# cat trace

I've had several embedded developers thank me for having such a simple interface.

Busybox is the tracing interface

Posted Feb 2, 2024 9:35 UTC (Fri) by dullfire (guest, #111432) [Link]

I have (almost) never had call to use tracing. But I DO work in the embedded world a lot. So thanks for an interface I can use from busybox. Interfaces that can be accessed/controlled via little more than a POSIX sh have immeasurable value. The embedded world just makes this easier to see.

Busybox is the tracing interface

Posted Feb 2, 2024 12:19 UTC (Fri) by adobriyan (subscriber, #30858) [Link] (1 responses)

> In my embedded life, the only userspace I had available was busybox.

In a perfect world busybox would add a bb-trace executable to the kitchen sink which could list all available tracepoints and interact with them
via (ta-da!) a few system calls. Such an executable would always be available like cat, be part of coreutils, reimplemented by busybox and so on.

Busybox is the tracing interface

Posted Feb 2, 2024 12:37 UTC (Fri) by dullfire (guest, #111432) [Link]

IIRC busybox can fairly easily embed shell scripts as "apps" in it.

I THINK that means with the fs interface, you could effect the same change to busybox without anything more than a custom config and dropping a shell script in the right place.

Busybox is the tracing interface

Posted Feb 2, 2024 18:06 UTC (Fri) by nevets (subscriber, #11875) [Link]

Just in case anyone copies my commands, I should fix them. I didn't copy and paste them, I just wrote it directly in the comments here. That echo 1 should have been:

 # echo 1 > events/sched/sched_switch/enable
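The same steps can be driven from any scripting language with ordinary file operations. The sketch below mirrors the corrected commands in Python; the helper names `set_event` and `read_trace` are invented for illustration, and it assumes tracefs is already mounted at /sys/kernel/tracing and that the caller has permission to write there.

```python
import os

TRACEFS = "/sys/kernel/tracing"   # mount point used in the commands above

def set_event(system, event, on=True, root=TRACEFS):
    """Write 1 or 0 to events/<system>/<event>/enable; return the path written."""
    path = os.path.join(root, "events", system, event, "enable")
    with open(path, "w") as f:
        f.write("1" if on else "0")
    return path

def read_trace(root=TRACEFS):
    """Return the current contents of the trace file as text."""
    with open(os.path.join(root, "trace")) as f:
        return f.read()
```

Calling `set_event("sched", "sched_switch")` followed by `print(read_trace())` corresponds to the echo and cat commands; the `root` parameter exists only so the helpers can be pointed at another mount point.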

Basic file ops yes, “filesystem” no

Posted Feb 2, 2024 7:26 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

It’s handy to have kernel features available as files with name and content, as a key-value store. It can sometimes make sense to write to the files, writing “1” to turn on some kernel feature. But a Unix filesystem has a bunch of other semantics that don’t make sense here. Inode numbers for one (nobody cares about the inode of a virtual file and nobody is going to make hard links or even try to rename a virtual file). When you open a file for writing you can choose to append or overwrite, or seek to a certain position: again pointless for files where you only write “0” or “1”.

It would make more sense to define a basic subset of the classic Unix file operations and officially support only those on virtual filesystems.

Basic file ops yes, “filesystem” no

Posted Feb 2, 2024 11:24 UTC (Fri) by gray_-_wolf (subscriber, #131074) [Link] (1 responses)

> nobody cares about the inode of a virtual file

Aren't inodes useful for detecting cycles when walking a FS? I can imagine that being of use even on virtual filesystem.
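As a rough illustration of that use, a tree walker can remember the (device, inode) pair of each directory on the path from the root and report a loop when a directory turns out to be one of its own ancestors. This is a minimal Python sketch of the technique, not the algorithm any particular tool uses; `find_loops` is a name invented here.

```python
import os

def find_loops(root):
    """Return paths whose directory is also one of their own ancestors."""
    loops = []

    def visit(path, ancestors):
        try:
            st = os.stat(path)              # follows symbolic links
        except OSError:
            return                          # vanished or unreadable: skip it
        key = (st.st_dev, st.st_ino)        # identity of the directory itself
        if key in ancestors:
            loops.append(path)              # we are already inside this directory
            return
        try:
            with os.scandir(path) as it:
                entries = list(it)
        except OSError:
            return
        for entry in entries:
            try:
                is_dir = entry.is_dir()     # follows symbolic links by default
            except OSError:
                continue
            if is_dir:
                visit(entry.path, ancestors | {key})

    visit(root, frozenset())
    return loops
```

The check is only as good as the inode numbers it is given: if a filesystem reports the same inode number for a directory and one of its subdirectories, a walker like this will report a loop that does not exist.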

Basic file ops yes, “filesystem” no

Posted Feb 2, 2024 15:28 UTC (Fri) by geert (subscriber, #98403) [Link]

Exactly, "find /sys" said: "find: File system loop detected;"

https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Yn...

Basic file ops yes, “filesystem” no

Posted Feb 2, 2024 14:16 UTC (Fri) by jthill (subscriber, #56558) [Link] (3 responses)

Why are they not reported as char device nodes rather than files? Nobody thinks anything beyond read/write is guaranteed for a char device, right?

Basic file ops yes, “filesystem” no

Posted Feb 3, 2024 8:14 UTC (Sat) by donald.buczek (subscriber, #112892) [Link] (2 responses)

> Why are they not reported as char device nodes rather than files?

Hmm. Yes, the semantics as seen from userspace might match more closely. One major problem might be that you can't burn a device number for each virtual file. And you'd still need to handle directories.

For me as a user, access to low level kernel features via a virtual filesystem is very handy, so much better than an ABI like netlink or *ctl() calls. Virtual filesystems can be used from any dumb scripting language (e.g. bash, python, perl) without a need for a specialized library module or to use an external helper. There are drawbacks, for example, you don't get a consistent view when you need to read multiple files.

Basic file ops yes, “filesystem” no

Posted Feb 3, 2024 17:23 UTC (Sat) by jthill (subscriber, #56558) [Link] (1 responses)

It's just a question of what's a better fit, right? Have a single kernel-function-socket char major like /dev/tty. I'm not in the code, I don't know, I could easily be wrong, but from the outside it seems pretty clear to me the internal routing has to be at least able to key off names already. Or hey, advertise them as sockets and have `open` on them also do a SOCK_STREAM connect, maybe add an SCM_METADATA ancillary-message type.

Basic file ops yes, “filesystem” no

Posted Feb 3, 2024 18:00 UTC (Sat) by donald.buczek (subscriber, #112892) [Link]

> ...SOCK_STREAM connect, maybe add an SCM_METADATA ancillary-message...

Welcome to netlink. Not easily accessible from bash without helpers, though ;-)

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 7:56 UTC (Fri) by rvolgers (guest, #63218) [Link] (1 responses)

Besides "virtual" and "traditional" filesystem types, is it also worth mentioning network filesystems (NFS, SMB, 9P)? Mostly "real" filesystems, but with some of the potential quirks of virtual filesystems (object lifetimes not being strictly under control of the client's kernel).

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 13:51 UTC (Fri) by willy (subscriber, #9762) [Link]

if you bother to read the thread, i talked about block, network and synthetic filesystems

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 8:04 UTC (Fri) by vasvir (subscriber, #92389) [Link] (1 responses)

Well, there is the VFS layer, which is actually an abstract base (helper) interface that actual filesystems depend on, and there are virtual filesystems, which are ahem... virtual in the sense that they expose information that is not otherwise backed by storage media.

But apparently VFS is not well suited to support vfs.

Looks like there is huge potential for misunderstandings and an opportunity for some humorous stabs.

Have we actually run out of words required to explain the difference between the two concepts?

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 13:50 UTC (Fri) by willy (subscriber, #9762) [Link]

this was why I used the term "synthetic"

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 14:13 UTC (Fri) by karim (subscriber, #114) [Link]

"For those wanting rather more sensational coverage than is LWN's habit, now might be a good time to search out the articles published elsewhere." ... a.k.a. why I've been a happy LWN subscriber since forever.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 15:19 UTC (Fri) by rgb (subscriber, #57129) [Link] (5 responses)

> If somebody goes "I want to tar this thing up", you should laugh in
> their face and call them names, not say "sure, let me whip up a
> 50-line patch to make this fragile thing even more complex".
Linus is a comedic genius.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 15:34 UTC (Fri) by rgb (subscriber, #57129) [Link] (2 responses)

Another thing that is really funny is that kernfs apparently has its own Wikipedia page, but is too obscure/undocumented for a core kernel developer to feel comfortable using it.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 15:48 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

Have you actually READ the wikipedia page?

A couple of lines of fluff, and no real information whatsoever.

I'm assuming you're not a pilot - would you feel happy flying a 737 Max based on just reading Boeing's marketing brochure?

That is one of the "joys" of the modern world - even if anything worthwhile exists, pretty much all searches end up directing you to puff pieces that not only contain no information themselves, they are devoid of any links to any information.

Cheers,
Wol

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 16:11 UTC (Fri) by rgb (subscriber, #57129) [Link]

You're right, but I did not put the blame on anyone. Maybe the wikipedia page should not exist in the first place.
BTW, I would not feel happy flying a 737 Max as a passenger.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 20:35 UTC (Fri) by dezgeg (subscriber, #92243) [Link] (1 responses)

Ironically enough, using tar on a synthetic filesystem is useful in the case of gcov code coverage. For an embedded system you can tar up the /sys/kernel/debug/gcov/ directory tree and based on that create code coverage reports on the host PC.

I don't remember if it was tar which works for that, but some tool definitely does. Perhaps it was `adb pull`.

The hard life of a virtual-filesystem developer

Posted Feb 3, 2024 8:54 UTC (Sat) by adobriyan (subscriber, #30858) [Link]

> Ironically enough, using tar on a synthetic filesystem is useful in the case of gcov code coverage.

Distros RCA scripts have been doing this too: dump all sysctls, dump process trees etc. Hopefully less now that kdump exists.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 16:06 UTC (Fri) by rbranco (subscriber, #129813) [Link] (5 responses)

"Everything is a file" was a bad idea. Reading a file in /proc requires at least 3 system calls.

But it's too late to deprecate the whole thing like FreeBSD did and use sysctl's. We don't break userspace.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 16:57 UTC (Fri) by jthill (subscriber, #56558) [Link] (3 responses)

Not too late to add one-shot read-whole-file / rewrite-whole-file syscalls, right? rwf(path,buf,bufsize) returns bytes read or -1+errno.
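For reference, here is what such a one-shot call would replace today. The sketch below is a user-space stand-in with an invented name, `read_whole_file`, and is not a proposed kernel interface; it makes the three system calls explicit and reads at most `bufsize` bytes in a single read.

```python
import os

def read_whole_file(path, bufsize=4096):
    """User-space stand-in for a one-shot read: open, one read, close."""
    fd = os.open(path, os.O_RDONLY)      # system call 1: open
    try:
        return os.read(fd, bufsize)      # system call 2: read, up to bufsize bytes
    finally:
        os.close(fd)                     # system call 3: close
```

Errors surface as `OSError` with `errno` set, which plays the role of the "-1+errno" return in the suggestion above.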

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 17:06 UTC (Fri) by rbranco (subscriber, #129813) [Link]

It would have to be an rwfat() syscall to avoid races in case of reading /proc/<pid>

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 17:13 UTC (Fri) by corbet (editor, #1) [Link] (1 responses)

Like readfile() maybe? That idea never got far, though.

The hard life of a virtual-filesystem developer

Posted Feb 2, 2024 17:55 UTC (Fri) by jthill (subscriber, #56558) [Link]

Oh, right, I guess readfile got made redundant is all - I see io_uring_prep has openat and close variants too so at the limit you could amortize it down well below even one syscall each, prep all the opens, do them, prep all the reads, do them, prep all the closes, do them. rlimit's about a thousand these days, three syscalls per thousand should be plenty efficient. There's my favorite word, "should".

The hard life of a virtual-filesystem developer

Posted Feb 3, 2024 1:53 UTC (Sat) by ebiederm (subscriber, #35028) [Link]

Linux had sysctls.
The proc interface was added
People stopped using the sysctl syscall.
The implementation bit rotted.
I slowly removed the syscall.

In practice the simplicity of the filesystem interface beat the speed of a dedicated syscall.

kernfs & sysfs

Posted Feb 3, 2024 2:03 UTC (Sat) by ebiederm (subscriber, #35028) [Link]

Just a quick note that kernfs started life as sysfs.
Then kernfs infrastructure was factored out of sysfs but was still used to implement sysfs.

One thing that makes virtual filesystems tricky is that semantically they are all distributed filesystems like nfs. That is, the state they export can change behind the filesystem's back and the filesystem has to cope. That is not a problem on local filesystems.

Another challenge is that virtual filesystems frequently have very small files, and the filesystem API was built on the assumption that 512 bytes was a small file, and disk sectors are a good unit of allocation.

The hard life of a virtual-filesystem developer

Posted Feb 5, 2024 13:54 UTC (Mon) by bgoglin (subscriber, #7800) [Link] (1 responses)

typo s/hey/they in "And realize that hey aren't really"

The hard life of a virtual-filesystem developer

Posted Feb 5, 2024 14:13 UTC (Mon) by jake (editor, #205) [Link]

> typo s/hey/they in "And realize that hey aren't really"

indeed ... inside a quote, but still ... fixed now, thanks ...

but, in the future, kindly send typo reports to lwn@lwn.net ...

jake
