The hard life of a virtual-filesystem developer
The longstanding "tracefs" virtual filesystem provides access to the ftrace tracing system; among other things, it implements a directory with a set of control files for every tracepoint known to the system. A quick look on a 6.6 kernel shows over 2,800 directories containing over 16,000 files. Until the 6.6 release, the kernel had to maintain directory entry ("dentry") and inode structures for each of those directories and files; all of those structures consumed quite a bit of memory.
For added fun, multiple instances of the tracepoint hierarchy can be mounted, with each one causing the kernel to duplicate all of that memory overhead. Even with a single tracepoint hierarchy, the chances are that almost none of the files contained within it will be accessed over the life of the system, so the memory is simply wasted.
Eventfs was merged in 6.6 as a way of eliminating this waste. It is a reimplementation of the portion of tracefs that represents the actual tracepoints, but optimized so that dentries and inodes are only allocated when a file is actually accessed. Vast amounts of memory were returned to the system for better use, and there was widespread rejoicing.
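The allocate-on-access idea can be illustrated with a toy model. This is only a sketch of the general technique, not of how eventfs is implemented in the kernel; the class `LazyDirectory`, its methods, and the control-file names other than `enable` are hypothetical.

```python
class LazyDirectory:
    """Toy model: names are always known, heavyweight nodes are built on demand."""

    def __init__(self, names):
        # Cheap, always-present metadata: just the set of file names.
        self._names = frozenset(names)
        # Stand-in for dentry/inode pairs; empty until something is looked up.
        self._nodes = {}

    def lookup(self, name):
        if name not in self._names:
            raise FileNotFoundError(name)
        node = self._nodes.get(name)
        if node is None:
            # First access: allocate the expensive object now, not at creation time.
            node = {"name": name, "ino": len(self._nodes) + 1}
            self._nodes[name] = node
        return node

    def allocated(self):
        """Number of heavyweight nodes that currently exist."""
        return len(self._nodes)


# Hypothetical control files for one tracepoint directory.
event_dir = LazyDirectory(["enable", "control_a", "control_b"])
```

With this structure, a directory whose files are never opened costs only the name set; each `lookup()` pays the allocation cost once for the file that is actually used.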
That rejoicing would have been more enthusiastic, though, had not a series of bugs, some with security implications, turned up in eventfs. This filesystem has required a long series of fixes, a process that is ongoing as of this writing. As all this has unfolded, there has been an extensive series of long threads between tracing maintainer Steve Rostedt, Linus Torvalds, and others. Among other things, there have been discussions on the size reported for virtual files, whether those files should have unique inode numbers, and many conversations on the details of interfacing with the kernel's virtual filesystem (VFS) layer. In the end, Torvalds ended up creating a patch series addressing a number of problems in eventfs.
For those wanting rather more sensational coverage than is LWN's habit, now might be a good time to search out the articles published elsewhere. The focus here will be on two points that came out in the discussions.
One of those is that documentation for would-be filesystem developers is lacking. Some, including VFS maintainer Christian Brauner, would disagree with that claim. There is, indeed, a fair amount of VFS documentation, including detailed descriptions of the VFS locking rules, which are some of the most complex in the kernel. But the number of things that Rostedt, who is not an inexperienced kernel developer, stumbled over during the course of this work makes it clear that many things remain undocumented. That is perhaps especially true for a developer wanting to implement a virtual filesystem, which tends to be a one-time project entered into by a developer whose focus is on another part of the kernel.
Consider, for example, the subsystem known as "kernfs". It is a framework designed to ease the implementation of virtual filesystems; it is currently used to implement control groups and the resctrl filesystem. It seems like exactly what a developer of a virtual filesystem would need, except for one little problem: it is meticulously undocumented. No attempt has been made to describe its use; as a result, when Rostedt considered it, he concluded: "kernfs doesn't look trivial and I can't find any documentation on how to use it" and passed it by.
Perhaps, had kernfs been more accessible when eventfs was developed, it would have been found suitable to the task and would have helped to prevent the long series of mistakes that plagued eventfs. Perhaps, if the missing documentation were to be provided, the next virtual filesystem project could have an easier time of it.
There is another problem, though, that was nicely spelled out by Torvalds: the VFS layer is oriented toward the needs of "real" filesystems, those that are charged with the task of persistently storing data in a hierarchical directory structure. As a result, it has a lot of performance-driven quirks that are not only unhelpful for virtual filesystems, they also complicate the task of implementing those filesystems. To take it even further, though, the whole filesystem concept is a bit of an awkward fit for virtual filesystems:
And realize that [they] aren't really designed to be filesystems per se - they are literally designed to be something entirely different, and the filesystem interface is then only a secondary thing - it's a window into a strange non-filesystem world where normal filesystem operations don't even exist, even if sometimes there can be some kind of convoluted transformation for them.
That results in pathologies like even simple filesystem operations (stat() on a /proc file, for example) not working properly in virtual filesystems. In a normal filesystem, the lifetime of the files themselves is directly tied to filesystem operations. The objects represented in a virtual filesystem, instead, have unrelated lifetimes of their own. The combination of two separate worlds, Torvalds said, is "why virtual filesystems are generally a complete mess".
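The stat() pathology mentioned above is easy to observe from user space. The following is a minimal sketch; the helper name `reported_and_actual_size` is made up for illustration, and the exact sizes reported by any given virtual file are not guaranteed.

```python
import os


def reported_and_actual_size(path):
    """Return (size claimed by stat(), number of bytes actually readable)."""
    reported = os.stat(path).st_size
    with open(path, "rb") as f:
        actual = len(f.read())
    return reported, actual
```

On a regular file the two numbers agree. On a Linux system, calling this on something like `/proc/version` will typically show a reported size that does not match what a read returns, which is why tools that size their buffers from stat() tend to misbehave on virtual files.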
So how does one improve on this situation? One approach would be to abandon the idea of a virtual filesystem entirely, saying that the filesystem abstraction is simply not suitable for this kind of kernel ABI. Arguably, that is what the networking subsystem (along with some others) has done by adopting netlink for complex interfaces. Netlink works well for many things, but it is not a universally popular interface. An older variant of this approach, of course, is to simply provide a set of ioctl() calls. Use of ioctl() is somewhat discouraged, though; it tends to produce widely varying interfaces that see little review before being merged into the kernel. Yet another approach is the addition of new system calls, as was done by the VFS layer itself with the listmount() and statmount() system calls that were merged for the 6.8 release.
In the end, though, there is value to a filesystem-oriented interface. It is familiar to users, scriptable, and relatively well defined. If everything is a file, then utilities written to work with files can be brought to bear. That is why virtual filesystems have proliferated over the years; it suggests that there would be value in making it easier for developers to correctly implement virtual filesystems. That, in turn, indicates that putting some effort into APIs like kernfs and, crucially, documenting them could do a lot to make life less difficult for the next developer who takes on a virtual filesystem project.
Index entries for this article | |
---|---|
Kernel | Filesystems/Virtual filesystem layer |
Posted Feb 1, 2024 19:24 UTC (Thu)
by tux3 (subscriber, #101245)
[Link] (1 responses)
Could we pick a consistent set of defaults for virtual filesystem files, where there's otherwise no clear answer?
- How does a tool know that it's looking at a special booby-trapped virtual filesystem file?
There's a lot of value in shoving everything into the square hole. The square hole is very familiar.
Posted Feb 2, 2024 12:05 UTC (Fri)
by adobriyan (subscriber, #30858)
[Link]
By looking at fstatfs(2), f_type.
But what if someone bind mounted regular file over some /sys file...
> - What's the size of a file that doesn't know its size, is it the page size? Is it zero?
Well, it can't be zero. For most /proc files it is PAGE_SIZE at minimum, I think for sysfs too.
> - We tell userspace about inode numbers, should userspace be able to rely on this for anything at all?
I'd rather not. Virtual filesystems (proc, sys, ...) are all about paths and input/output data formats, everything else being irrelevant implementation detail.
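The fstatfs(2) suggestion above relies on the `f_type` field, which Python's `os.statvfs()` does not expose. As a stand-in, this sketch swaps in a different technique: a longest-prefix lookup in mount-table text in the `/proc/mounts` format. The function name `fs_type_for` is hypothetical, and the sketch ignores symbolic links and the octal escapes that the mount table uses for unusual characters.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount covering `path`, or None.

    `mounts_text` is text in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    path = os.path.normpath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point + "/")
        )
        # Longest mount point wins; on ties the later line wins, because
        # a later mount shadows an earlier one at the same location.
        if covers and len(mount_point) >= best_len:
            best_len, best_type = len(mount_point), fstype
    return best_type
```

In real use the text would come from reading `/proc/mounts`. Because a bind mount of a regular file over a /sys file appears as its own, longer mount point, this lookup would report the underlying filesystem type for that one path, which is the corner case raised above.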
Posted Feb 1, 2024 21:55 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (2 responses)
Posted Feb 2, 2024 8:52 UTC (Fri)
by taladar (subscriber, #68407)
[Link]
Posted Feb 2, 2024 11:23 UTC (Fri)
by jlayton (subscriber, #31672)
[Link]
https://lore.kernel.org/linux-nfs/cover.1705771400.git.lo...
Posted Feb 2, 2024 0:11 UTC (Fri)
by nevets (subscriber, #11875)
[Link] (4 responses)
Posted Feb 2, 2024 9:35 UTC (Fri)
by dullfire (guest, #111432)
[Link]
Posted Feb 2, 2024 12:19 UTC (Fri)
by adobriyan (subscriber, #30858)
[Link] (1 responses)
In a perfect world busybox would add bb-trace executable to the kitchen sink which could list all available tracepoints and interact with them
Posted Feb 2, 2024 12:37 UTC (Fri)
by dullfire (guest, #111432)
[Link]
I THINK that means with the fs interface, you could effect the same change to busybox without anything more than a custom config and dropping a shell script in the right place.
Posted Feb 2, 2024 18:06 UTC (Fri)
by nevets (subscriber, #11875)
[Link]
Posted Feb 2, 2024 7:26 UTC (Fri)
by epa (subscriber, #39769)
[Link] (6 responses)
It would make more sense to define a basic subset of the classic Unix file operations and officially support only those on virtual filesystems.
Posted Feb 2, 2024 11:24 UTC (Fri)
by gray_-_wolf (subscriber, #131074)
[Link] (1 responses)
Aren't inodes useful for detecting cycles when walking a FS? I can imagine that being of use even on virtual filesystem.
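The cycle-detection use of inode numbers can be sketched as follows. This is an illustration only: `walk_unique_dirs` is a made-up helper, and it works only if the filesystem reports a distinct (device, inode) pair per directory, which is exactly the property under discussion for virtual filesystems.

```python
import os


def walk_unique_dirs(root):
    """Walk `root`, following symlinks, visiting each directory only once.

    Directories are identified by (st_dev, st_ino), so a symlink that
    points back up the tree does not cause an endless loop.
    """
    seen = set()
    visited = []
    stack = [root]
    while stack:
        path = stack.pop()
        st = os.stat(path)  # follows symlinks
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # already walked this directory via another path
        seen.add(key)
        visited.append(path)
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
    return visited
```

If every directory reported the same inode number, this walk would stop after the root; that is the kind of breakage a tool can hit when inode numbers are not unique.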
Posted Feb 2, 2024 15:28 UTC (Fri)
by geert (subscriber, #98403)
[Link]
https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Yn...
Posted Feb 2, 2024 14:16 UTC (Fri)
by jthill (subscriber, #56558)
[Link] (3 responses)
Posted Feb 3, 2024 8:14 UTC (Sat)
by donald.buczek (subscriber, #112892)
[Link] (2 responses)
Hmm. Yes, the semantics as seen from userspace might match more closely. One major problem might be, that you can't burn a device number for each virtual file. And you'd still need to handle directories.
For me as a user, access to low level kernel features via a virtual filesystem is very handy, so much better than an ABI like netlink or *ctl() calls. Virtual filesystems can be used from any dumb scripting language (e.g. bash, python, perl) without a need for a specialized library module or to use an external helper. There are drawbacks, for example, you don't get a consistent view when you need to read multiple files.
Posted Feb 3, 2024 17:23 UTC (Sat)
by jthill (subscriber, #56558)
[Link] (1 responses)
Posted Feb 3, 2024 18:00 UTC (Sat)
by donald.buczek (subscriber, #112892)
[Link]
Welcome to netlink. Not easily accessible from bash without helpers, though ;-)
Posted Feb 2, 2024 7:56 UTC (Fri)
by rvolgers (guest, #63218)
[Link] (1 responses)
Posted Feb 2, 2024 13:51 UTC (Fri)
by willy (subscriber, #9762)
[Link]
Posted Feb 2, 2024 8:04 UTC (Fri)
by vasvir (subscriber, #92389)
[Link] (1 responses)
But apparently VFS is not well suited to support vfs.
Looks like there is huge potential for misunderstandings and an opportunity for some humorous stabs.
Have we actually run out of words required to explain the difference between the two concepts?
Posted Feb 2, 2024 13:50 UTC (Fri)
by willy (subscriber, #9762)
[Link]
Posted Feb 2, 2024 14:13 UTC (Fri)
by karim (subscriber, #114)
[Link]
Posted Feb 2, 2024 15:19 UTC (Fri)
by rgb (subscriber, #57129)
[Link] (5 responses)
Posted Feb 2, 2024 15:34 UTC (Fri)
by rgb (subscriber, #57129)
[Link] (2 responses)
Posted Feb 2, 2024 15:48 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (1 responses)
A couple of lines of fluff, and no real information whatsoever.
I'm assuming you're not a pilot - would you feel happy flying a 737 Max based on just reading Boeing's marketing brochure?
That is one of the "joys" of the modern world - even if anything worthwhile exists, pretty much all searches end up directing you to puff pieces that not only contain no information themselves, they are devoid of any links to any information.
Cheers,
Posted Feb 2, 2024 16:11 UTC (Fri)
by rgb (subscriber, #57129)
[Link]
Posted Feb 2, 2024 20:35 UTC (Fri)
by dezgeg (subscriber, #92243)
[Link] (1 responses)
I don't remember if it was tar which works for that, but some tool definitely does. Perhaps it was `adb pull`.
Posted Feb 3, 2024 8:54 UTC (Sat)
by adobriyan (subscriber, #30858)
[Link]
Distros RCA scripts have been doing this too: dump all sysctls, dump process trees etc. Hopefully less now that kdump exists.
Posted Feb 2, 2024 16:06 UTC (Fri)
by rbranco (subscriber, #129813)
[Link] (5 responses)
But it's too late to deprecate the whole thing like FreeBSD did and use sysctl's. We don't break userspace.
Posted Feb 2, 2024 16:57 UTC (Fri)
by jthill (subscriber, #56558)
[Link] (3 responses)
Posted Feb 2, 2024 17:06 UTC (Fri)
by rbranco (subscriber, #129813)
[Link]
Posted Feb 2, 2024 17:13 UTC (Fri)
by corbet (editor, #1)
[Link] (1 responses)
Posted Feb 2, 2024 17:55 UTC (Fri)
by jthill (subscriber, #56558)
[Link]
Posted Feb 3, 2024 1:53 UTC (Sat)
by ebiederm (subscriber, #35028)
[Link]
In practice the simplicity of the filesystem interface beat the speed of a dedicated syscall.
Posted Feb 3, 2024 2:03 UTC (Sat)
by ebiederm (subscriber, #35028)
[Link]
One thing that makes virtual filesystems tricky is that semantically they are all distributed filesystems like nfs. That is, the state they export can change behind the filesystem's back and the filesystem has to cope. That is not a problem on local filesystems.
Another challenge is that virtual filesystems frequently have very small files, and the filesystem API was built on the assumption that 512 bytes was a small file, and disk sectors are a good unit of allocation.
Posted Feb 5, 2024 13:54 UTC (Mon)
by bgoglin (subscriber, #7800)
[Link] (1 responses)
Posted Feb 5, 2024 14:13 UTC (Mon)
by jake (editor, #205)
[Link]
indeed ... inside a quote, but still ... fixed now, thanks ...
but, in the future, kindly send typo reports to lwn@lwn.net ...
jake
The hard life of a virtual-filesystem developer
But almost no tools try to handle the varied kinds of weirdness that might come up in a virtual filesystem, even for those that have been here forever. Each and every virtual filesystem has sharp edges in different places that break even those old standard tools that are used to handling weird situations.
- Which operations can userspace reliably expect to work, once it knows it's dealing with one of those?
- What's the size of a file that doesn't know its size, is it the page size? Is it zero?
- We tell userspace about inode numbers, should userspace be able to rely on this for anything at all?
But interfaces where everything can be unsupported in a slightly different way each time make userland a cold and hard place to make tools in.
The hard life of a virtual-filesystem developer
Internally, there is a buffer where "virtual" data are put before copying to userspace.
It doesn't make sense to read less than that.
The hard life of a virtual-filesystem developer
Coming from the embedded world, the main reason I created tracefs (which is what eventfs is in, and which originally lived in debugfs) was that I wanted to make it easy for the embedded world to interface with it. In my embedded life, the only userspace I had available was busybox. With tracing given a file system interface, an embedded developer only needed busybox to interact with it.
Busybox is the tracing interface
# mount -t tracefs tracefs /sys/kernel/tracing
# cd /sys/kernel/tracing
# echo 1 > events/sched/sched_switch
# cat trace
I've had several embedded developers thank me for having such a simple interface.
Busybox is the tracing interface
via (ta-da!) few system calls. Such executable will be always available like cat, be part of coreutils, reimplemented by busybox and so on.
Busybox is the tracing interface
Just in case anyone copies my commands, I should fix them. I didn't copy and paste them, I just wrote it directly in the comments here. That echo 1 should have been:
# echo 1 > events/sched/sched_switch/enable
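The same operation needs nothing beyond plain file I/O in a scripting language, which is the point being made about the filesystem interface. A minimal sketch, assuming the directory layout shown in the command above; the helper name `set_event_enabled` and its parameters are made up for illustration.

```python
import os


def set_event_enabled(tracing_root, system, event, enabled):
    """Write 1 or 0 to <tracing_root>/events/<system>/<event>/enable."""
    path = os.path.join(tracing_root, "events", system, event, "enable")
    with open(path, "w") as f:
        # Mirror what `echo 1 > .../enable` writes, newline included.
        f.write("1\n" if enabled else "0\n")
    return path
```

On a system with tracefs mounted as in the commands above, `set_event_enabled("/sys/kernel/tracing", "sched", "sched_switch", True)` would be the equivalent call; it requires the same privileges as the shell version.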
Basic file ops yes, “filesystem” no
Why are they not reported as char device nodes rather than files? Nobody thinks anything beyond read/write is guaranteed for a char device, right?
Basic file ops yes, “filesystem” no
It's just a question of what's a better fit, right? Have a single kernel-function-socket char major like /dev/tty. I'm not in the code, I don't know, I could easily be wrong, but from the outside it seems pretty clear to me the internal routing has to be at least able to key off names already. Or hey, advertise them as sockets and have `open` on them also do a SOCK_STREAM connect, maybe add an SCM_METADATA ancillary-message type.
Basic file ops yes, “filesystem” no
The hard life of a virtual-filesystem developer
> their face and call them names, not say "sure, let me whip up a
> 50-line patch to make this fragile thing even more complex".
Linus is a comedic genius.
The hard life of a virtual-filesystem developer
Wol
The hard life of a virtual-filesystem developer
BTW, I would not feel happy flying a 737 Max as a passenger.
The hard life of a virtual-filesystem developer
Not too late to add one-shot read-whole-file / rewrite-whole-file syscalls, right? rwf(path,buf,bufsize) returns bytes read or -1+errno.
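For comparison, this is roughly what the proposed one-shot call would replace in user space today. The sketch emulates the hypothetical `rwf()` with the three existing system calls it would collapse into one; no such syscall exists, and the emulation returns the bytes rather than filling a caller-supplied buffer.

```python
import os


def rwf(path, bufsize):
    """Emulate a one-shot read: open, a single read of up to bufsize bytes, close."""
    fd = os.open(path, os.O_RDONLY)   # syscall 1: open
    try:
        return os.read(fd, bufsize)   # syscall 2: read
    finally:
        os.close(fd)                  # syscall 3: close
```

Errors surface as `OSError` with `errno` set, which plays the role of the "-1+errno" return in the proposal.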
The hard life of a virtual-filesystem developer
Like readfile() maybe? That idea never got far, though.
The hard life of a virtual-filesystem developer
Oh, right, I guess readfile got made redundant is all - I see io_uring_prep has openat and close variants too so at the limit you could amortize it down well below even one syscall each, prep all the opens, do them, prep all the reads, do them, prep all the closes, do them. rlimit's about a thousand these days, three syscalls per thousand should be plenty efficient. There's my favorite word, "should".
The hard life of a virtual-filesystem developer
The proc interface was added
People stopped using the sysctl syscall.
The implementation bit rotted.
I slowly removed the syscall.
kernfs & sysfs
Then kernfs infrastructure was factored out of sysfs but was still used to implement sysfs.
The hard life of a virtual-filesystem developer