Asynchronous fsync()
The cost of fsync() is well known to filesystem developers, which is why there are efforts to provide cheaper alternatives. Ric Wheeler wanted to discuss the longstanding idea of adding an asynchronous version of fsync() in a filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). It turns out that what he wants may already be available via the new io_uring interface.
The idea of an asynchronous version of fsync() is kind of counter-intuitive, Wheeler said, but there are use cases in large-scale data migration. If you are copying a batch of thousands of files from one server to another, you need a way to know that those files are durable, but you do not need them to complete in any particular order. You could then confirm that the data had arrived before destroying the source copies.
It would be something like syncfs() but more fine-grained so that you can select which inodes to sync, Jan Kara suggested. Wheeler said that he is not sure what the API would look like, perhaps something like select(). But it would be fast and useful. The idea goes back to ReiserFS, where it was discovered that syncing files in reverse order was much faster than syncing them in the order written. Ceph, Gluster, and others just need to know that all the files made it to disk in whatever order is convenient for the filesystem.
Chris Mason said that io_uring should be able to provide what Wheeler is looking for. He said that Jens Axboe (author of the io_uring code) already implemented an asynchronous version of sync_file_range(), but he wasn't sure about fsync(). The io_uring interface allows arbitrary operations to be done in a kernel worker thread and, when they complete, notifies user space. It would provide an asynchronous I/O (AIO) version of fsync(), "but done properly".
There was some discussion of io_uring and how it could be applied to various use cases. Wheeler asked if it could be used to implement what Amir Goldstein was looking for in terms of a faster fsync(). Mason said that he did not think so, since io_uring is restricted to POSIX operations. Goldstein agreed, saying he needed something that would not interfere with other workloads sharing the filesystem.
Kara was concerned that an asynchronous fsync() as described would not really buy any performance gains, as it would effectively become a series of fsync() calls on the files of interest. But Trond Myklebust said there are user-space NFS and SMB servers that might benefit from not having to tie up a thread to handle the fsync() calls.
Wheeler said that if the new call just turns into a bunch of fsync() calls under the covers, it is not really going to help. Ted Ts'o said that maybe what Wheeler wants is an fsync2() that takes an array of file descriptors and returns when they have all been synced. If the filesystem has support for fsync2(), it can batch the operations. It would be easier for application developers to call a function with an array of file descriptors than to jump through the hoops needed to set up an io_uring, he said.
There is one obvious question, however: will all the files need fsync() or will some simply need fdatasync()? For a mix of operations, perhaps a flag needs to be associated with each descriptor. Kara raised the issue of file descriptors in different filesystems, though the VFS could multiplex the call to each filesystem. Wheeler wondered if it could simply be restricted to a single filesystem, but Kara said that the application may not know which filesystem the files belong to. Ts'o said it made sense to not restrict the new call to only handle files from one filesystem; it may be more of a pain for the VFS, but will be a much easier interface for application developers.
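A user-space emulation of the fsync2()-style call discussed above (an array of descriptors, each flagged for fsync() or fdatasync()) might look like the following. The function name and shape are hypothetical; a real implementation would live in the VFS so that filesystems could batch the journal commits, which this sequential loop cannot do:

```python
# Hypothetical fsync2(): sync a batch of descriptors, each optionally
# data-only, returning once all of them are durable. Sequential here;
# the point of a kernel version would be batching and parallelism.
import os
import tempfile

def fsync2(entries):
    """entries: iterable of (fd, data_only) pairs. Returns count synced."""
    count = 0
    for fd, data_only in entries:
        if data_only:
            os.fdatasync(fd)   # flush data blocks only; may skip mtime
        else:
            os.fsync(fd)       # flush data plus metadata
        count += 1
    return count

# Usage: sync two freshly written temp files in one call.
tmp = []
for _ in range(2):
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x")
    tmp.append((fd, path))
n = fsync2([(fd, True) for fd, _ in tmp])
for fd, path in tmp:
    os.close(fd)
    os.unlink(path)
```

The per-descriptor flag corresponds to the fsync()-versus-fdatasync() question raised above, and nothing in the interface restricts the descriptors to a single filesystem.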
Index entries for this article
Kernel: Asynchronous I/O
Conference: Storage, Filesystem, and Memory-Management Summit/2019
Posted May 22, 2019 8:00 UTC (Wed)
by quotemstr (subscriber, #45331)
Besides, an asynchronous fsync doesn't actually address the other benefit of the rename pattern: atomicity. That's just as important as durability. If you want to replace atomic rename, what you really want is a general transactional filesystem API.
Posted May 29, 2019 3:37 UTC (Wed)
by zblaxell (subscriber, #26385)
Does any filesystem currently do this? I haven't checked, other than to note that rename-as-atomic-replace seems to be about as slow as fsync() in practice, which suggests that filesystems in practice interpret the rename as implied fsync(). I occasionally have to move applications onto tmpfs due to this, or they're just too slow.
> if it does so, it provides exactly the guarantee user space programs want
That depends. Some programs want to ensure an atomic update happens eventually (e.g. web page hit counter, don't care if we lose a few updates during rare crashes, do care if the count is mangled, don't want to wait for IO). Other programs want that atomic update to happen before the rename() call returns (e.g. mail server, wants to know the message is stored on disk before telling the sender it was received, doesn't want rename() to return until the file is persistently updated), and most of the latter group want that atomic update to start immediately to reduce latency. There doesn't seem to be a way to select the user's choice of three distinct behaviors from just the rename() call.
Posted May 30, 2019 12:28 UTC (Thu)
by sourcejedi (guest, #45153)
(Latest I can find on writes: "There's no RWF_NOWAIT support for *write* in pwritev2. But it's technically possible to implement it"...)
For other buffered IO with io_uring, you get a convenient kernel thread pool which is managed automatically. This was already required by the above.
I don't know exactly why IOCB_CMD_FSYNC + io_submit() was considered useful vs. a simple thread pool example; you might be right about that.
"io_pgetevents & aio fsync V4" https://lore.kernel.org/lkml/20180502211448.18276-1-hch@l...
"Re: Triggering non-integrity writeback from userspace" https://lore.kernel.org/lkml/20151029221022.GB10656@dastard/
For io_uring "kernel side polling" IO, I believe it lets you avoid system call overheads altogether, while a kernel thread continuously polls for IO completions on your super-fast device. It has an option to set the affinity of the kernel thread.
Posted May 22, 2019 17:13 UTC (Wed)
by NYKevin (subscriber, #129325)
For "regular application development," what you (probably) want is something along the lines of "atomically replace the contents of file foo with data bar, but I don't care about ordering or durability." This is traditionally expressed by writing it to a temp file and doing a rename() over the existing file. Technically, you are required to fsync() or fdatasync() the temp file* to prevent the write from being ordered after the rename, but many applications skip this step. In practice it is desirable that the filesystem detect this case and handle it automatically. Then you don't use any sync primitives, don't get durability, do get ordering, and end up with reasonable performance. Best of all, this provides transparent support for apps that only care "a little bit" about doing things correctly, which is probably most of them. You'll still have pain from open(..., O_WRONLY|O_TRUNC) on an existing file, of course, but that's your own fault and you should fix your ways.
For databases, you definitely want durability in addition to sequencing, so that you know when you can mark the transaction as committed. However, databases have a lot of other, more complicated needs, and they tend to drive new filesystem features to an accordingly greater degree than "regular applications." I therefore am hesitant to make an across-the-board pronouncement about what they need, but in general, fsync() is a significant part of what they need.
The tricky part comes from the rare "in between" cases, where you might have multiple files and want to mutate them all transactionally (e.g. because you are writing a package manager and want to transactionally install a package). Surprisingly, NTFS actually has support for this, but MSDN** says it's a Bad Thing that you should avoid using. To the best of my knowledge, this is completely unsupported on (most mainstream) Linux filesystems.
* Or open with O_SYNC, but to a first approximation, nobody does that.
** https://docs.microsoft.com/en-us/windows/desktop/fileio/d...
Posted May 22, 2019 17:38 UTC (Wed)
by rweikusat2 (subscriber, #117920)
If that's really necessary for some application, it can be accomplished by collecting a set of related files in a directory and selecting one of several such directories by using a symlink with a well-known name (which can be updated atomically via rename).
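That scheme can be sketched like this: each version of the file set lives in its own directory, and a well-known symlink is repointed with an atomic rename(). All names here are illustrative:

```python
# Directory-plus-symlink versioning: publish a new set of files by
# atomically repointing a well-known symlink at a new directory.
import os
import tempfile

def publish(version_dir, link_name):
    """Atomically repoint link_name at version_dir."""
    tmp = link_name + ".tmp"      # temporary link beside the real one
    os.symlink(version_dir, tmp)
    os.replace(tmp, link_name)    # renaming over the old link is atomic

# Usage: flip "current" from version v1 to v2.
base = tempfile.mkdtemp()
for v in ("v1", "v2"):
    os.mkdir(os.path.join(base, v))
link = os.path.join(base, "current")
publish(os.path.join(base, "v1"), link)
publish(os.path.join(base, "v2"), link)
target = os.readlink(link)
```

Readers that resolve the symlink always see a complete, self-consistent set of files, which is the transactional property being asked for.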
Posted May 31, 2019 20:42 UTC (Fri)
by NYKevin (subscriber, #129325)
That's fine if you actually do it, but when I look in /etc on a typical Debian system, I don't see a nest of symlinks (except under /etc/alternatives, but that's doing an entirely different dance). So it seems APT doesn't actually do that, probably due to some combination of factors.
Posted May 22, 2019 17:32 UTC (Wed)
by tome (subscriber, #3171)
But it sacrifices nothing to provide an async alternative for those whom it benefits.
Posted May 31, 2019 22:14 UTC (Fri)
by Wol (subscriber, #4433)
It's definitely NOT a marketing win when said user discovers that removing the fsync results in faster performance by orders of magnitude! (Or the other way round - adding it makes the machine run as slow as treacle.)
For most users - including databases most of the time I would have thought - a simple write barrier is sufficient. That way you get guaranteed consistency - if the log isn't written you lose the transaction completely, while if it is written then the database write can be replayed. The barrier needs to be on a "by user" or "by application" basis, though. Not on an fd basis because quite often logs and data are written to different files and we don't want file b to start updating until file a has finished. Making it system-wide might not punish performance that much on a not too heavily loaded system.
If the caller can choose between an asynchronous "fire and forget" barrier, and a synchronous "wait until it completes" barrier, then all the better. Make it two synchronous barriers - a "manyana" version and an "asap" version (the latter basically telling the system to "flush it all NOW"), and then it allows the APPLICATION to decide what's important.
There's no point having an operating system that runs the computer according to the needs of the OS. Without applications there's no point in having the computer!
Cheers,
Wol
Posted May 22, 2019 22:39 UTC (Wed)
by dgc (subscriber, #6611)
The fsync2() API is essentially identical to the existing AIO_FSYNC/AIO_FDSYNC API, except it's synchronous and that is what applications want to avoid.
The only argument I've been presented with against AIO_FSYNC is that "the implementation is just a workqueue", which is largely nonsensical because it is filesystem-implementation independent but allows automatic kernel-side parallelisation of all the fsync operations issued. This allows the filesystem(s) to automatically optimise away unnecessary journal writes when completing concurrent fsync operations; XFS, ext4, etc. already do this when user applications run fsync() concurrently from lots of processes/threads.....
This simple implementation allows a simple "untar with aio fsync" workload (i.e. write many 4kB files and aio_fsync() in batches as we go, retiring completed fsync()s before we dispatch a new batch) on XFS to go from about 2000 files/s (synchronous write IO latency bound) to over 40,000 files/s (write IOPS bound on the back-end storage).
IOWs, we've already got efficient asynchronous fsync functionality in the kernel that does most of what is being asked for....
-Dave.
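The batched pipeline described above can be modeled in user space, with a thread pool standing in for kernel-side AIO completion handling. Only the flow is modeled here; the real win (merged journal commits across a batch) needs the filesystem's cooperation, and the function name is hypothetical:

```python
# Model of the "untar with aio fsync" flow: write a batch of small files,
# dispatch their fsyncs concurrently, and retire all completions before
# starting the next batch. Batching is what would let a filesystem merge
# journal commits; this sketch only demonstrates the dispatch pattern.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_batched(blobs, directory, batch=64):
    paths = []
    with ThreadPoolExecutor() as pool:
        for start in range(0, len(blobs), batch):
            fds = []
            for i, data in enumerate(blobs[start:start + batch], start):
                path = os.path.join(directory, "file%d" % i)
                fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
                os.write(fd, data)
                fds.append(fd)
                paths.append(path)
            # Dispatch the whole batch of fsyncs, wait for all completions.
            list(pool.map(os.fsync, fds))
            for fd in fds:
                os.close(fd)
    return paths

# Usage: ten small files written and synced in batches of four.
out = write_batched([b"x"] * 10, tempfile.mkdtemp(), batch=4)
```

Because each batch's fsyncs are in flight together rather than issued one at a time, the storage sees concurrent flush work, which is the effect the quoted numbers attribute to AIO_FSYNC.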