
Asynchronous buffered read operations

By Jake Edge
March 18, 2015
LSFMM 2015

A problem that Milosz Tanski has run into throughout his career is part of what brought him to the 2015 Linux Storage, Filesystem, and Memory Management Summit. Some reads can be satisfied immediately from the page cache, while others require an expensive I/O. Distinguishing between the two can lead to more efficient programs. He has implemented a new mode for read() that does so, though it requires adding a new system call.

The problem typically occurs in low-level network applications, Tanski said. Not every application can use sendfile(). For example, applications using TLS modify the data to encrypt it before sending it, which means they can't use sendfile(). So they must do their own copies but, depending on whether the data is in the page cache, some will be "slow", while others are "fast". Some programs that want to do asynchronous disk I/O often just use O_DIRECT and replicate the page cache concept in user space. That way they can track the contents of the cache to determine if an I/O can be satisfied quickly or not.

[Milosz Tanski]

The normal workaround for these problems is to use thread pools for the I/O, but that pattern "kinda sucks". The latency added due to synchronization between the threads is not insubstantial. It is also often the case that requests that could be satisfied quickly get stuck behind slower requests.

So, with the help of Christoph Hellwig, he has implemented preadv2(), which is like preadv() except that there is a new flags argument (which, as was pointed out by several attendees, really should have been added with preadv()). There is only one flag available in his patches: RWF_NONBLOCK (which could also have been called RWF_NOWAIT, he said). That flag will cause reads to succeed only if the data is already in the page cache, otherwise it will return EAGAIN.

Basically, that flag allows reads from the network loop to skip the queue if the data needed is already available in the page cache. It essentially provides a fast path with minimal changes to the user-space application. He has been using it with an internal application and it works well.

His patches drew one major comment, he said, which was about using functionality like that in fincore() to get a list of the pages of a file that are resident in the page cache. The problem with that is a race condition where a page that was present at the time of the check is no longer there when the read is performed, which puts that read back into the slow lane.

He has also tested the patches with Samba, where they reduce the latency significantly. For his internal application, which is a large, columnar data store using the Ceph filesystem, he got 23% lower response times. The average response times dropped by 200ms, he said.

There have been some objections to adding another system call, Tanski said. James Bottomley was not particularly concerned about that, since the new system call is just adding a flag argument that should have been there already. Hellwig added that it required a new system call just to get the flag in, which is not an unusual situation in recent times.

Hellwig has also implemented pwritev2() as part of the patch set to add a flag argument for the write() side. There are no write flags included in the patch, though some will be added as separate patches down the road. There are some potential user-space uses for flags for writes, including a "high priority" flag and a non-blocking flag that could be used for logging, Hellwig said.

No one in the room seemed opposed to the idea. It seems likely that the two new system calls could show up as early as the 4.1 kernel.

[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Index entries for this article:
Kernel: Asynchronous I/O
Conference: Storage, Filesystem, and Memory-Management Summit/2015



Asynchronous buffered read operations

Posted Mar 20, 2015 7:24 UTC (Fri) by mathieu_lacage (guest, #3967) [Link] (4 responses)

Are there plans to support something similar for select()? Namely, have this improved estimation of whether a read will block be used to determine read 'readiness' for select()?

Asynchronous buffered read operations

Posted Mar 20, 2015 14:44 UTC (Fri) by mtanski (guest, #56423) [Link] (3 responses)

There are currently no plans to support select(), poll(), or epoll(). I would say it's also unlikely, because I don't think those interfaces make sense for disk IO with a fd that points to a file.

It makes sense to have select() work for network sockets, pipes, signalfd, timerfd... because you're waiting on external progress (somebody sends you data, or somebody retrieves data and drains the buffer). The same is not quite true for files. It's unlikely that any progress getting data into the page cache will be made unless you explicitly trigger it yourself (read(), readahead()). To add to that, even doing an lseek() on a file does not trigger any kind of readahead activity; it just updates the position pointer.

The io_submit()/io_getevents() APIs that exist for direct IO are a much better fit for files.

Asynchronous buffered read operations

Posted Mar 20, 2015 15:05 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

The problem here is in multiplexing. It's extremely common to have a loop that reads data from a disk and puts it into a network socket.

And there's no real way to do it asynchronously in one thread - poll/epoll don't work for disk IO and aio doesn't work for sockets.

Asynchronous buffered read operations

Posted Mar 20, 2015 15:17 UTC (Fri) by mtanski (guest, #56423) [Link] (1 responses)

You have two options:

1. io_submit(), eventfd(), poll() the eventfd
Any requests you submit via io_submit can have an optional notification via eventfd. You have to set that in the struct iocb. The downside is that you're limited to O_DIRECT and it can still block in some cases.

2. Build yourself a threadpool for handling IO.
This is kind of what everybody who doesn't do O_DIRECT does today. It still has issues like added latency (synchronization, slow requests). This is what RWF_NONBLOCK addresses: it lets you try doing a "fast read" from your network thread, so you can skip the threadpool if the data is there (because of kernel readahead or because it's hot data).

Asynchronous buffered read operations

Posted Mar 20, 2015 15:38 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> 1. io_submit(), eventfd(), poll() the eventfd
You really have to do it from a separate thread.

> 2. Build yourself a threadpool for handling IO.
That's what everyone does since there's really no way around it. That's why it would be nice to have a unified asynchronous API.

And really, I don't see disk IO as all that different from network IO.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds








