
Asynchronous fsync()

By Jake Edge
May 21, 2019
LSFMM

The cost of fsync() is well known to filesystem developers, which is why there are efforts to provide cheaper alternatives. Ric Wheeler wanted to discuss the longstanding idea of adding an asynchronous version of fsync() in a filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). It turns out that what he wants may already be available via the new io_uring interface.

The idea of an asynchronous version of fsync() is kind of counter-intuitive, Wheeler said. But there are use cases in large-scale data migration. If you are copying a batch of thousands of files from one server to another, you need a way to know that those files are now durable, but don't need to know that they were done in any particular order. You could find out that the data had arrived before destroying the source copies.

It would be something like syncfs() but more fine-grained so that you can select which inodes to sync, Jan Kara suggested. Wheeler said that he is not sure what the API would look like, perhaps something like select(). But it would be fast and useful. The idea goes back to ReiserFS, where it was discovered that syncing files in reverse order was much faster than syncing them in the order written. Ceph, Gluster, and others just need to know that all the files made it to disk in whatever order is convenient for the filesystem.

[Ric Wheeler]

Chris Mason said that io_uring should be able to provide what Wheeler is looking for. He said that Jens Axboe (author of the io_uring code) already implemented an asynchronous version of sync_file_range(), but he wasn't sure about fsync(). The io_uring interface allows arbitrary operations to be done in a kernel worker thread and, when they complete, notifies user space. It would provide an asynchronous I/O (AIO) version of fsync(), "but done properly".

There was some discussion of io_uring and how it could be applied to various use cases. Wheeler asked if it could be used to implement what Amir Goldstein was looking for in terms of a faster fsync(). Mason said that he did not think so, since io_uring is restricted to POSIX operations. Goldstein agreed, saying he needed something that would not interfere with other workloads sharing the filesystem.

Kara was concerned that an asynchronous fsync() as described would not really buy any performance gains, since it would effectively become a series of fsync() calls on the files of interest. But Trond Myklebust said there are user-space NFS and SMB servers that might benefit from not having to tie up a thread to handle the fsync() calls.

Wheeler said that if the new call just turns into a bunch of fsync() calls under the covers, it is not really going to help. Ted Ts'o said that maybe what Wheeler wants is an fsync2() that takes an array of file descriptors and returns when they have all been synced. If the filesystem has support for fsync2(), it can do batching on the operations. It would be easier for application developers to call a function with an array of file descriptors rather than jumping through the hoops needed to set up an io_uring, he said.

There is one obvious question, however: will all the files need fsync() or will some simply need fdatasync()? For a mix of operations, perhaps a flag needs to be associated with each descriptor. Kara raised the issue of file descriptors in different filesystems, though the VFS could multiplex the call to each filesystem. Wheeler wondered if it could simply be restricted to a single filesystem, but Kara said that the application may not know which filesystem the files belong to. Ts'o said it made sense to not restrict the new call to only handle files from one filesystem; it may be more of a pain for the VFS, but will be a much easier interface for application developers.
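No fsync2() exists, so the following is purely a hypothetical sketch of the interface discussed above: an array of descriptors with a per-file flag selecting fdatasync() semantics, emulated in user space as exactly the loop of individual calls that Kara predicted a naive implementation would degenerate into. All names here (fsync2_desc, FSYNC2_DATASYNC, fsync2_emulated) are invented for illustration.

```c
/* Hypothetical sketch only: models the proposed fsync2() interface with a
 * user-space emulation that just loops over plain fsync()/fdatasync() --
 * the behavior a real in-kernel implementation would improve on by
 * batching journal writes. */
#include <fcntl.h>     /* for the usage example */
#include <unistd.h>

#define FSYNC2_DATASYNC 0x1   /* hypothetical flag: fdatasync() semantics */

struct fsync2_desc {
    int fd;
    unsigned flags;
};

/* Returns 0 when every file is durable, or -1 with errno set. */
int fsync2_emulated(struct fsync2_desc *descs, unsigned nr)
{
    for (unsigned i = 0; i < nr; i++) {
        int r = (descs[i].flags & FSYNC2_DATASYNC)
                    ? fdatasync(descs[i].fd)
                    : fsync(descs[i].fd);
        if (r < 0)
            return -1;
    }
    return 0;
}
```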


Index entries for this article
Kernel: Asynchronous I/O
Conference: Storage, Filesystem, and Memory-Management Summit/2019



Asynchronous fsync()

Posted May 22, 2019 0:15 UTC (Wed) by JohnVonNeumann (guest, #131609) [Link] (1 responses)

Kernel noob here, does anyone have more info on why this is the case?

"The idea goes back to ReiserFS, where it was discovered that syncing files in reverse order was much faster than syncing them in the order written."

Asynchronous fsync()

Posted May 22, 2019 6:56 UTC (Wed) by viiru (subscriber, #53129) [Link]

I haven't actually looked into this, but I'd guess that the earlier written files are more likely to have been synced by writeback already (so nothing needs to be done for them). Writeback often provides better throughput than individually syncing files.

Asynchronous fsync()

Posted May 22, 2019 7:02 UTC (Wed) by viiru (subscriber, #53129) [Link] (2 responses)

This would be super useful for things that use the usual pattern of writing into a temporary file and renaming it on top of the original (rsync, dpkg, etc.). Many of these used to skip fsync() entirely, since on ext3 with data=ordered it was extremely slow and had no practical effect on data safety; then filesystems with delayed allocation (XFS, ext4) came along, making that unsafe, and the fsync() calls had to be added. This would allow, for example, dpkg to write all the temp files of a package, wait for them to be written as a group, and then rename them, instead of doing it one by one (which tends to be very slow, since there is no benefit from writeback).
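With today's API, the "wait for them as a group" step the comment wants can only be approximated by a loop of fsync() calls before any rename happens. A minimal POSIX sketch of that ordering (hypothetical helper, abbreviated error handling):

```c
/* Sketch: write a batch of temporary files, make the whole batch durable,
 * and only then rename each over its destination. The "group wait" in
 * step 2 is just a loop of fsync() calls with the current API. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int install_batch(const char *names[], const char *contents[], int n)
{
    int fds[16];
    char tmp[256];

    if (n > 16)
        return -1;
    for (int i = 0; i < n; i++) {            /* 1. write all temp files */
        snprintf(tmp, sizeof(tmp), "%s.tmp", names[i]);
        fds[i] = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[i] < 0)
            return -1;
        write(fds[i], contents[i], strlen(contents[i]));
    }
    for (int i = 0; i < n; i++)              /* 2. make the group durable */
        if (fsync(fds[i]) < 0)
            return -1;
    for (int i = 0; i < n; i++) {            /* 3. only now rename */
        snprintf(tmp, sizeof(tmp), "%s.tmp", names[i]);
        if (rename(tmp, names[i]) < 0)
            return -1;
        close(fds[i]);
    }
    return 0;
}
```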

Asynchronous fsync()

Posted May 22, 2019 8:00 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (1 responses)

There's no reason that the atomic rename pattern has to be slow. The system is free to interpret the rename as a write barrier, and if it does so, it provides exactly the guarantee user space programs want, and without having to add any new facilities.

Besides, an asynchronous fsync doesn't actually address the other benefit of the rename pattern: atomicity. That's just as important as durability. If you want to replace atomic rename, what you really want is a general transactional filesystem API.

Asynchronous fsync()

Posted May 29, 2019 3:37 UTC (Wed) by zblaxell (subscriber, #26385) [Link]

> The system is free to interpret the rename as a write barrier

Does any filesystem currently do this? I haven't checked, other than to note that rename-as-atomic-replace seems to be about as slow as fsync() in practice, which suggests that filesystems in practice interpret the rename as implied fsync(). I occasionally have to move applications onto tmpfs due to this, or they're just too slow.

> if it does so, it provides exactly the guarantee user space programs want

That depends. Some programs want to ensure an atomic update happens eventually (e.g. web page hit counter, don't care if we lose a few updates during rare crashes, do care if the count is mangled, don't want to wait for IO). Other programs want that atomic update to happen before the rename() call returns (e.g. mail server, wants to know the message is stored on disk before telling the sender it was received, doesn't want rename() to return until the file is persistently updated), and most of the latter group want that atomic update to start immediately to reduce latency. There doesn't seem to be a way to select the user's choice of three distinct behaviors from just the rename() call.

Why are people fascinated by kernel threads?

Posted May 22, 2019 7:58 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (1 responses)

On a few recent occasions, I've seen people propose interfaces that let userspace signal the kernel that it should do some work on a kernel thread somewhere. Why do people like this sort of interface? If you want to have a thread do some work, make a thread and do the work. Anything a kernel thread can do, a user thread can do too, and better: you can control the affinity, priority, and other characteristics of user threads much better than you can kernel threads. If you're tempted to make an "asynchronous" API that queues work for some kernel thread, you should instead just provide a blocking system call and let userspace decide the context in which to do whatever it is that you want to do.
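The comment's alternative is easy to sketch: a plain blocking fsync() issued from a user thread gives the same asynchrony, with the thread's affinity and priority fully under the application's control. A minimal pthreads version (names invented for illustration):

```c
/* Sketch: "asynchronous fsync" with no new kernel facility, just a user
 * thread making the ordinary blocking call. */
#include <fcntl.h>     /* for the usage example */
#include <pthread.h>
#include <unistd.h>

struct fsync_job {
    int fd;
    int result;          /* filled in when the sync completes */
};

static void *fsync_worker(void *arg)
{
    struct fsync_job *job = arg;
    job->result = fsync(job->fd);   /* blocking call, off the main thread */
    return NULL;
}

/* Fire off the sync; the caller joins the thread when it needs the answer. */
int fsync_async_start(struct fsync_job *job, pthread_t *tid)
{
    return pthread_create(tid, NULL, fsync_worker, job);
}
```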

Why are people fascinated by kernel threads?

Posted May 30, 2019 12:28 UTC (Thu) by sourcejedi (guest, #45153) [Link]

For io_uring buffered reads: because it is able to avoid the overhead of switching between threads, when the data is already in cache.

(Latest I can find on writes: "There's no RWF_NOWAIT support for *write* in pwritev2. But it's technically possible to implement it"...)

For other buffered IO with io_uring, you get a convenient kernel thread pool which is managed automatically. This was already required by the above.

I don't know exactly why IOCB_CMD_FSYNC + io_submit() was considered useful vs. a simple thread-pool example; you might be right about that.

"io_pgetevents & aio fsync V4" https://lore.kernel.org/lkml/20180502211448.18276-1-hch@l...

"Re: Triggering non-integrity writeback from userspace" https://lore.kernel.org/lkml/20151029221022.GB10656@dastard/

For io_uring "kernel side polling" IO, I believe it lets you avoid system-call overheads altogether, while a kernel thread continuously polls for IO completions on your super-fast device. It has an option to set the affinity of the kernel thread.

Asynchronous fsync()

Posted May 22, 2019 15:26 UTC (Wed) by josh (subscriber, #17465) [Link] (4 responses)

Do people actually want a sync, or just a barrier call for ordering purposes?

Asynchronous fsync()

Posted May 22, 2019 17:13 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (3 responses)

It depends who you are, what you are trying to do, and what sorts of guarantees you are looking to have.

For "regular application development," what you (probably) want is something along the lines of "atomically replace the contents of file foo with data bar, but I don't care about ordering or durability." This is traditionally expressed by writing it to a temp file and doing a rename() over the existing file. Technically, you are required to fsync() or fdatasync() the temp file* to prevent the write from being ordered after the rename, but many applications skip this step. In practice it is desirable that the filesystem detect this case and handle it automatically. Then you don't use any sync primitives, don't get durability, do get ordering, and end up with reasonable performance. Best of all, this provides transparent support for apps that only care "a little bit" about doing things correctly, which is probably most of them. You'll still have pain from open(..., O_WRONLY|O_TRUNC) on an existing file, of course, but that's your own fault and you should fix your ways.

For databases, you definitely want durability in addition to sequencing, so that you know when you can mark the transaction as committed. However, databases have a lot of other, more complicated needs, and they tend to drive new filesystem features to an accordingly greater degree than "regular applications." I therefore am hesitant to make an across-the-board pronouncement about what they need, but in general, fsync() is a significant part of what they need.

The tricky part comes from the rare "in between" cases, where you might have multiple files and want to mutate them all transactionally (e.g. because you are writing a package manager and want to transactionally install a package). Surprisingly, NTFS actually has support for this, but MSDN** says it's a Bad Thing that you should avoid using. To the best of my knowledge, this is completely unsupported on (most mainstream) Linux filesystems.

* Or open with O_SYNC, but to a first approximation, nobody does that.
** https://docs.microsoft.com/en-us/windows/desktop/fileio/d...

Asynchronous fsync()

Posted May 22, 2019 17:38 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (1 responses)

> The tricky part comes from the rare "in between" cases, where you might have multiple files and want to mutate
> them all transactionally (e.g. because you are writing a package manager and want to transactionally install a
> package). Surprisingly, NTFS actually has support for this, but MSDN** says it's a Bad Thing that you should avoid using.

If that's really necessary for some application, it can be accomplished by collecting a set of related files in a directory and selecting one of several such directories by using a symlink with a well-known name (which can be updated atomically via rename).
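A minimal sketch of that scheme: the only atomic step is the rename() of a freshly created symlink over the well-known name, so readers always see either the old directory or the complete new one (names invented for illustration, error handling abbreviated):

```c
/* Sketch: atomically repoint a well-known symlink at a new directory. */
#include <stdio.h>
#include <unistd.h>

int switch_current(const char *newdir, const char *wellknown)
{
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.new", wellknown);

    unlink(tmp);                       /* discard any stale temporary link */
    if (symlink(newdir, tmp) < 0)
        return -1;
    return rename(tmp, wellknown);     /* atomically repoint the name */
}
```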

Asynchronous fsync()

Posted May 31, 2019 20:42 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

That's fine if you actually do it, but when I look in /etc on a typical Debian system, I don't see a nest of symlinks (except under /etc/alternatives, but that's doing an entirely different dance). So it seems APT doesn't actually do that. Probably due to some combination of these factors:

  1. You end up with "extra" (real) directories in addition to the well known (symlink) directories.
  2. You can't rename the real directory afterwards because you would have to have an atomic rename-and-flip-the-symlink operation, which the filesystem does not support, and therefore...
  3. You have to come up with a new name for the directory for each new version, which is hardly difficult but nevertheless makes for ugly directory names (/etc/apt/real_etc/foo_version-1.2.3-debian4/...), or you have to alternate between two names (/etc/apt/real_etc/foo_version-A/...), which is even worse because it's not obvious which version is the "semantically correct" one (so debugging a broken symlink is harder than it should be).
  4. The joys of relative vs. absolute paths, PATH_MAX, etc.
  5. It is inherently impossible to guarantee that APT's database will be completely consistent with whatever is actually on the filesystem, without filesystem transactions.
  6. Inertia.

Asynchronous fsync()

Posted Feb 15, 2022 16:58 UTC (Tue) by MinMan (guest, #156882) [Link]

For our use case, we just need a bool fsyncq(int fd) that returns true when the file has finished synchronizing, or false if it is still writing from the buffer. Perhaps a qint64 fsyncq(int fd) that gives an estimate of the number of bytes remaining to sync, where zero means the file is completely synced. We have users who dump a large file to USB and then, of course, want to pull the stick out ASAP and get on with their day, but we don't have a way to pop a big notice in their face saying "YOUR FILE ISN'T FINISHED TRANSFERRING YET, DON'T YANK OUT THE STICK!" A nice progress bar wouldn't hurt either. It would be nice if the hardware had some blinking lights for them to interpret, but it doesn't.

Asynchronous fsync()

Posted May 22, 2019 16:04 UTC (Wed) by kiko (guest, #69905) [Link] (4 responses)

Haven't we spent decades teaching people that fsync() is the only thing that ensures writes are safely delivered to stable storage? And yes, while that's only mostly true (hard-drive caching, etc.), let's not ruin that simple marketing win because we wanted a convenience API.

Asynchronous fsync()

Posted May 22, 2019 17:32 UTC (Wed) by tome (subscriber, #3171) [Link] (2 responses)

> let's not ruin that

But it sacrifices nothing to provide an async alternative for those whom it benefits.

Asynchronous fsync()

Posted May 22, 2019 17:47 UTC (Wed) by wahern (subscriber, #37304) [Link] (1 responses)

Because additional complexity and features in the kernel have zero externalities?

Asynchronous fsync()

Posted May 23, 2019 4:30 UTC (Thu) by tome (subscriber, #3171) [Link]

Good point. I should have said that it doesn't diminish the goodness of classic fsync.

Asynchronous fsync()

Posted May 31, 2019 22:14 UTC (Fri) by Wol (subscriber, #4433) [Link]

> let's not ruin that simple marketing win

It's definitely NOT a marketing win when said user discovers that removing the fsync results in faster performance by orders of magnitude! (Or the other way round - adding it makes the machine run as slow as treacle.)

For most users, including databases most of the time I would have thought, a simple write barrier is sufficient. That way you get guaranteed consistency: if the log isn't written you lose the transaction completely, while if it is written then the database write can be replayed. The barrier needs to be on a "by user" or "by application" basis, though, not on an fd basis, because quite often logs and data are written to different files and we don't want file b to start updating until file a has finished. Making it system-wide might not punish performance that much on a not-too-heavily-loaded system.

If the caller can choose between an asynchronous "fire and forget" barrier and a synchronous "wait until it completes" barrier, then all the better. Make it two synchronous barriers, a "mañana" version and an "asap" version (the latter basically telling the system to "flush it all NOW"), and then it allows the APPLICATION to decide what's important.

There's no point having an operating system that runs the computer according to the needs of the OS. Without applications there's no point in having the computer!

Cheers,
Wol

Asynchronous fsync()

Posted May 22, 2019 22:39 UTC (Wed) by dgc (subscriber, #6611) [Link]

*cough* AIO_FSYNC *cough*

The fsync2() API is essentially identical to the existing AIO_FSYNC/AIO_FDSYNC API, except it's synchronous, and that is what applications want to avoid.

The only argument I've been presented with against AIO_FSYNC is that "the implementation is just a workqueue", which is largely nonsensical because it is independent of the filesystem implementation but allows automatic kernel-side parallelisation of all the fsync operations issued. This allows the filesystem(s) to then automatically optimise away unnecessary journal writes when completing concurrent fsync operations; XFS, ext4, etc. already do this when user applications run fsync() concurrently from lots of processes/threads.

This simple implementation allows a simple "untar with aio fsync" workload (i.e. "write many 4kB files and aio_fsync() them in batches as we go, retiring completed fsyncs before we dispatch a new batch") on XFS to go from about 2000 files/s (synchronous write IO latency bound) to over 40,000 files/s (write IOPS bound on the back-end storage).

IOWs, we've already got efficient asynchronous fsync functionality in the kernel that does most of what is being asked for....

-Dave.
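The AIO_FSYNC path Dave refers to can be exercised without any new interface. The sketch below drives it through the raw Linux AIO syscalls (no libaio needed) and assumes a kernel with IOCB_CMD_FSYNC support, which was merged around 4.18; error handling is abbreviated.

```c
/* Sketch: asynchronous fsync via Linux native AIO (IOCB_CMD_FSYNC),
 * using raw syscalls so no extra library is required. */
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long io_setup_sys(unsigned nr, aio_context_t *ctx) {
    return syscall(SYS_io_setup, nr, ctx);
}
static long io_submit_sys(aio_context_t ctx, long n, struct iocb **iocbs) {
    return syscall(SYS_io_submit, ctx, n, iocbs);
}
static long io_getevents_sys(aio_context_t ctx, long min, long max,
                             struct io_event *events, struct timespec *to) {
    return syscall(SYS_io_getevents, ctx, min, max, events, to);
}

/* Submit one fsync iocb for fd and wait for its completion.
 * Returns the operation's result (0 on success). */
int aio_fsync_demo(int fd)
{
    aio_context_t ctx = 0;
    if (io_setup_sys(8, &ctx) < 0)
        return -1;

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_FSYNC;   /* IOCB_CMD_FDSYNC for fdatasync */

    struct iocb *cbs[1] = { &cb };
    if (io_submit_sys(ctx, 1, cbs) != 1)  /* returns without waiting */
        return -1;

    struct io_event ev;
    if (io_getevents_sys(ctx, 1, 1, &ev, NULL) != 1)
        return -1;
    syscall(SYS_io_destroy, ctx);
    return (int)ev.res;                   /* 0 on success */
}
```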


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
