Ringing in a new asynchronous I/O API
Setting up
Any AIO implementation must provide for the submission of operations and the collection of completion data at some future point in time. In io_uring, that is handled through two ring buffers used to implement a submission queue and a completion queue. The first step for an application is to set up this structure using a new system call:
int io_uring_setup(int entries, struct io_uring_params *params);
The entries parameter is used to size both the submission and completion queues. The params structure looks like this:
    struct io_uring_params {
        __u32 sq_entries;
        __u32 cq_entries;
        __u32 flags;
        __u16 resv[10];
        struct io_sqring_offsets sq_off;
        struct io_cqring_offsets cq_off;
    };
On entry, this structure (with the possible exception of flags, as described later) should simply be initialized to zero. On return from a successful call, the sq_entries and cq_entries fields will be set to the actual sizes of the submission and completion queues; the code is set up to allocate the requested number (entries) of submission entries, and twice that many completion entries.
The return value from io_uring_setup() is a file descriptor that can then be passed to mmap() to map the buffer into the process's address space. More specifically, three calls are needed to map the two ring buffers and an array of submission-queue entries; the information needed to do this mapping will be found in the sq_off and cq_off fields of the io_uring_params structure. In particular, the submission queue, which is a ring of integer array indices, is mapped with a call like:
    subqueue = mmap(0, params.sq_off.array + params.sq_entries*sizeof(__u32),
                    PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                    ring_fd, IORING_OFF_SQ_RING);
Where params is the io_uring_params structure, and ring_fd is the file descriptor returned from io_uring_setup(). The addition of params.sq_off.array to the length of the region accounts for the fact that the ring is not located right at the beginning. The actual array of submission-queue entries, instead, is mapped with:
    sqentries = mmap(0, params.sq_entries*sizeof(struct io_uring_sqe),
                     PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                     ring_fd, IORING_OFF_SQES);
This separation of the queue entries from the ring buffer is needed because I/O operations may well complete in an order different from the submission order. The completion queue is simpler, since the entries are not separated from the queue itself; the incantation required is similar:
    cqentries = mmap(0, params.cq_off.cqes + params.cq_entries*sizeof(struct io_uring_cqe),
                     PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                     ring_fd, IORING_OFF_CQ_RING);
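Putting the pieces together, the full setup sequence might look like the sketch below. This is not code from the patch set: the ring size of 128 is arbitrary, error handling is omitted, and, since no C-library wrapper exists yet, it assumes that __NR_io_uring_setup and the structure definitions from the patch set's user-space header are available.

    struct io_uring_params params;
    struct io_uring_sqe *sqes;
    void *sq_ring, *cq_ring;
    int ring_fd;

    memset(&params, 0, sizeof(params));
    ring_fd = syscall(__NR_io_uring_setup, 128, &params);

    /* Map the submission ring, the submission-entry array, and the
       completion ring, as described above. */
    sq_ring = mmap(0, params.sq_off.array + params.sq_entries*sizeof(__u32),
                   PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                   ring_fd, IORING_OFF_SQ_RING);
    sqes = mmap(0, params.sq_entries*sizeof(struct io_uring_sqe),
                PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                ring_fd, IORING_OFF_SQES);
    cq_ring = mmap(0, params.cq_off.cqes + params.cq_entries*sizeof(struct io_uring_cqe),
                   PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                   ring_fd, IORING_OFF_CQ_RING);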
It's perhaps worth noting at this point that Axboe is working on a user-space library that will hide much of the complexity of this interface from most users.
I/O submission
Once the io_uring structure has been set up, it can be used to perform asynchronous I/O. Submitting an I/O request involves filling in an io_uring_sqe structure, which looks like this (simplified a bit):
    struct io_uring_sqe {
        __u8	opcode;		/* type of operation for this sqe */
        __u8	flags;		/* IOSQE_ flags */
        __u16	ioprio;		/* ioprio for the request */
        __s32	fd;		/* file descriptor to do IO on */
        __u64	off;		/* offset into file */
        void	*addr;		/* buffer or iovecs */
        __u32	len;		/* buffer size or number of iovecs */
        union {
            __kernel_rwf_t	rw_flags;
            __u32		fsync_flags;
        };
        __u64	user_data;	/* data to be passed back at completion time */
        __u16	buf_index;	/* index into fixed buffers, if used */
    };
The opcode describes the operation to be performed; options include IORING_OP_READV, IORING_OP_WRITEV, IORING_OP_FSYNC, and a couple of others that we will return to. There are clearly a number of parameters that affect how the I/O is performed, but most of them are relatively straightforward: fd describes the file on which the I/O will be performed, for example, while addr and len describe a set of iovec structures pointing to the memory where the I/O is to take place.
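As an illustration, here is how a simple read request might be filled in. The names buffer, fd, and index are stand-ins: the application is responsible for picking a free entry in the mapped sqes array, and the iovec must remain valid until the operation completes.

    struct iovec iov = { .iov_base = buffer, .iov_len = 4096 };
    struct io_uring_sqe *sqe = &sqes[index];

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READV;
    sqe->fd = fd;           /* the file to read from */
    sqe->off = 0;           /* starting offset within the file */
    sqe->addr = &iov;       /* a single iovec... */
    sqe->len = 1;           /* ...so the count is one */
    sqe->user_data = 42;    /* passed back in the completion event */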
As mentioned above, the io_uring_sqe structures are kept in an array that is mapped into both user and kernel space. Actually submitting one of those structures requires placing its index into the submission queue, which is defined this way:
    struct io_uring {
        u32 head;
        u32 tail;
    };

    struct io_sq_ring {
        struct io_uring	r;
        u32		ring_mask;
        u32		ring_entries;
        u32		dropped;
        u32		flags;
        u32		array[];
    };
The head and tail values are used to manage entries in the ring; if the two values are equal, the ring is empty. User-space code adds an entry by putting its index into array[r.tail] and incrementing the tail pointer; only the kernel side should change r.head. Once one or more entries have been placed in the ring, they can be submitted with a call to:
int io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags);
Here, fd is the file descriptor associated with the ring, and to_submit is the number of entries in the ring that the kernel should submit at this time. The return value should be zero if all goes well.
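Submission thus comes down to publishing the entry's index and making the system call. A sketch (with sq pointing to the mapped io_sq_ring structure, and ignoring the possibility of a full ring) might read:

    unsigned tail = sq->r.tail;

    sq->array[tail & sq->ring_mask] = index;
    /* The array store must be visible before the new tail value is;
       hence the release semantics on the tail update. */
    __atomic_store_n(&sq->r.tail, tail + 1, __ATOMIC_RELEASE);

    syscall(__NR_io_uring_enter, ring_fd, 1, 0, 0);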
Completion events will find their way into the completion queue as operations are executed. If flags contains IORING_ENTER_GETEVENTS and min_complete is nonzero, io_uring_enter() will block until at least that many operations have completed. The actual results can be found in the completion structure:
    struct io_uring_cqe {
        __u64	user_data;	/* sqe->user_data submission passed back */
        __s32	res;		/* result code for this event */
        __u32	flags;
    };
Where user_data is a value passed from user space when the operation was submitted and res is the return code for the operation. The flags field will contain IOCQE_FLAG_CACHEHIT if the request could be satisfied without needing to perform I/O — an option that may yet have to be reconsidered given the current concern about using the page cache as a side channel.
These structures live in the completion queue, which looks similar to the submission queue:
    struct io_cq_ring {
        struct io_uring		r;
        u32			ring_mask;
        u32			ring_entries;
        u32			overflow;
        struct io_uring_cqe	cqes[];
    };
In this ring, the r.head index points to the first available completion event, while r.tail points to the last; user space should only change r.head.
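Draining the completion ring is then a simple loop; in this sketch, cq points to the mapped io_cq_ring, and handle_completion() stands in for whatever the application does with a result:

    unsigned head = cq->r.head;

    while (head != __atomic_load_n(&cq->r.tail, __ATOMIC_ACQUIRE)) {
        struct io_uring_cqe *cqe = &cq->cqes[head & cq->ring_mask];

        handle_completion(cqe->user_data, cqe->res);
        head++;
    }
    /* Publish the new head so the kernel can reuse the entries. */
    __atomic_store_n(&cq->r.head, head, __ATOMIC_RELEASE);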
The interface as described so far is enough to enable a user-space program to enqueue multiple I/O operations and to collect the results as those operations complete. The functionality is similar to what the current AIO interface provides, though the interface is quite different. Axboe claims that it is far more efficient, but no benchmark results have been included yet to back up that claim. Among other things, this interface can do asynchronous buffered I/O without a context switch in cases where the desired data is in the page cache; buffered I/O has always been a bit of a sore spot for Linux AIO.
Advanced features
There are, however, some more features worthy of note in this interface. One of those is the ability to map a program's I/O buffers into the kernel. This mapping normally happens with each I/O operation so that data can be copied into or out of the buffers; the buffers are unmapped when the operation completes. If the buffers will be used many times over the course of the program's execution, it is far more efficient to map them once and leave them in place. This mapping is done by filling in yet another structure describing the buffers to be mapped:
    struct io_uring_register_buffers {
        struct iovec	*iovecs;
        __u32		nr_iovecs;
    };
That structure is then passed to another new system call:
int io_uring_register(unsigned int fd, unsigned int opcode, void *arg);
In this case, the opcode should be IORING_REGISTER_BUFFERS. The buffers will remain mapped for as long as the initial file descriptor remains open, unless the program explicitly unmaps them with IORING_UNREGISTER_BUFFERS. Mapping buffers in this way is essentially locking memory into RAM, so the usual resource limit that applies to mlock() applies here as well. When performing I/O to premapped buffers, the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED operations should be used.
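A registration sketch, assuming two preexisting buffers (buf0 and buf1, 64KB each) and, once again, a raw syscall() invocation in place of a wrapper:

    struct iovec iovecs[2] = {
        { .iov_base = buf0, .iov_len = 65536 },
        { .iov_base = buf1, .iov_len = 65536 },
    };
    struct io_uring_register_buffers reg = {
        .iovecs    = iovecs,
        .nr_iovecs = 2,
    };

    syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_BUFFERS, &reg);

    /* A later fixed-buffer read then names its buffer by index: */
    sqe->opcode = IORING_OP_READ_FIXED;
    sqe->addr = buf0;
    sqe->buf_index = 0;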
There is also an IORING_REGISTER_FILES operation that can be used to optimize situations where many operations will be performed on the same file(s).
In many high-bandwidth settings, it can be more efficient for the application to poll for completion events rather than having the kernel collect them and wake the application up; that is the motivation behind the existing block-layer polling interface, for example. Polling is most efficient in situations where, by the time the application gets around to doing a poll, there is almost certainly at least one completion ready for it to consume. This polling mode can be enabled for io_uring by setting the IORING_SETUP_IOPOLL flag when calling io_uring_setup(). In such rings, an occasional call to io_uring_enter() (with the IORING_ENTER_GETEVENTS flag set) is mandatory to ensure that completion events actually make it into the completion queue.
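Enabling this mode only requires setting the flag before the setup call; the ring size here is, again, arbitrary:

    struct io_uring_params params;

    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_IOPOLL;	/* poll for completions rather than using IRQs */
    ring_fd = syscall(__NR_io_uring_setup, 128, &params);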
Finally, there is also a fully polled mode that (almost) eliminates the need to make any system calls at all. This mode is enabled by setting the IORING_SETUP_SQPOLL flag at ring setup time. A call to io_uring_enter() will kick off a kernel thread that will occasionally poll the submission queue and automatically submit any requests found there; receive-queue polling is also performed if it has been requested. As long as the application continues to submit I/O and consume the results, I/O will happen with no further system calls.
Eventually, though (after one second currently), the kernel will get bored if no new requests are submitted and the polling will stop. When that happens, the flags field in the submission queue structure will have the IORING_SQ_NEED_WAKEUP bit set. The application should check for this bit and, if it is set, make a new call to io_uring_enter() to start the mechanism up again.
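The check itself is cheap, since the flags field lives in the shared mapping. Something like the following would do; the IORING_ENTER_SQ_WAKEUP flag name is assumed here to be the one the patches provide for this purpose:

    /* After queuing new requests on an SQPOLL ring: */
    if (sq->flags & IORING_SQ_NEED_WAKEUP)
        syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                IORING_ENTER_SQ_WAKEUP);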
This patch set is in its third version as of this writing, though that is a bit deceptive, since there were (at least) ten revisions of the polled AIO patch set that preceded it. While it is possible that the interface is beginning to stabilize, it would not be surprising to see some significant changes yet. One review comment that has not yet been addressed is Matthew Wilcox's request that the name be changed to "something that looks a little less like io_urine". That could yet become the biggest remaining issue; as we all know, naming is always the hardest part in the end. But, once those details are worked out, the kernel may yet have an asynchronous I/O implementation that is not a constant source of complaints.
For the curious, Axboe has posted a complete example of a program that uses the io_uring interface.
Index entries for this article
Kernel | Asynchronous I/O
Kernel | io_uring
Posted Jan 16, 2019 5:45 UTC (Wed)
by unixbhaskar (guest, #44758)
[Link]
What about async metadata
Posted Jan 16, 2019 6:58 UTC (Wed)
by mokki (subscriber, #33200)
[Link] (12 responses)
Did anything ever come out of the various syslets/fibrils/sys_indirect ways of making most syscalls async?
There was lots of discussion in 2007, see: https://lwn.net/Articles/221913/ and https://lwn.net/Articles/259068/
Posted Jan 16, 2019 13:00 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (9 responses)
Posted Jan 16, 2019 14:22 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
Not just RAID - any device that can do multiqueue I/O will benefit from parallel I/O, such as an NVMe SSD (which can have 65,536 parallel queues to the device).
Posted Jan 17, 2019 1:03 UTC (Thu)
by dw (subscriber, #12017)
[Link] (7 responses)
Posted Jan 17, 2019 12:30 UTC (Thu)
by Sesse (subscriber, #53779)
[Link] (1 responses)
(zlib/deflate is still around pretty much only due to huge transition costs, and a fragmented market among the alternatives. Try something like zstd if you want to make a clean break.)
Posted Jan 17, 2019 12:33 UTC (Thu)
by dw (subscriber, #12017)
[Link]
Posted Jan 22, 2019 10:09 UTC (Tue)
by epa (subscriber, #39769)
[Link] (4 responses)
Posted Jan 22, 2019 11:35 UTC (Tue)
by dw (subscriber, #12017)
[Link] (2 responses)
For zipping, imagine something like a 100k item maildir of tiny 1.5kb messages. While the compression is still relatively expensive, a huge chunk of the operation will be wasted on ceremonial serialized filesystem round-trips (open/close/read/stat/getdents/etc). To avoid that I'm not sure there is any way around it except a whole bunch of threads keeping as many FS operations in flight (either doing the CPU bits or any IO bits for uncached data) to get even close to a genuinely busy computer.
Posted Jan 22, 2019 12:22 UTC (Tue)
by epa (subscriber, #39769)
[Link]
How about a generalized stat() that lets you open a directory and get info on all the files it contains? That would save a lot of time, and not just for parallel code. Network filesystems, for example.
Posted Jan 22, 2019 12:38 UTC (Tue)
by epa (subscriber, #39769)
[Link]
You could then sprinkle these calls all over your code -- including scripting languages -- and get a handy speedup without having to do any real programming.
Posted Feb 26, 2019 1:53 UTC (Tue)
by josh (subscriber, #17465)
[Link]
The readahead system call does that.
Posted Jan 16, 2019 15:27 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link]
Posted Jan 16, 2019 15:48 UTC (Wed)
by smurf (subscriber, #17840)
[Link]
Posted Jan 16, 2019 10:03 UTC (Wed)
by kitanatahu (guest, #44605)
[Link] (12 responses)
Posted Jan 16, 2019 18:07 UTC (Wed)
by axboe (subscriber, #904)
[Link] (11 responses)
Apart from that, I do think you're mixing up the polling with the io polling. One provides a way to signal when data is ready, the other skips IRQs in favor of busy polling for completion events.
Posted Jan 17, 2019 3:12 UTC (Thu)
by samroberts (subscriber, #46749)
[Link] (10 responses)
The point stands: io_uring should be easily usable with poll/select/epoll so that it can be integrated with existing event-loop-based code; networking code in particular is a heavy user of these calls. Specifically, this fd
> The return value from io_uring_setup() is a file descriptor that can then be passed to mmap() to map the buffer into the process's address space.
should be epoll()able.
Posted Jan 17, 2019 3:28 UTC (Thu)
by axboe (subscriber, #904)
[Link] (9 responses)
If the ring_fd should be pollable, in terms of epoll, absolutely. That would be trivial to add. It would NOT work for IORING_SETUP_IOPOLL for obvious reasons, as you can't sleep for those kinds of completions. But for "normal", IRQ-driven IO, adding epoll() support for the CQ side of the ring_fd is straightforward. On the SQ ring side, there's nothing to epoll for. The application knows if the ring is writeable (eg can hold new entries) without entering the kernel.
Outside of that, my IOCB_CMD_POLL reference has to do with this:
https://lwn.net/Articles/743714/
and adding IORING_OP_POLL for similar functionality on the io_uring side.
Posted Jan 17, 2019 16:26 UTC (Thu)
by axboe (subscriber, #904)
[Link] (8 responses)
I believe this caters to both of your needs.
Posted Jan 17, 2019 23:16 UTC (Thu)
by nix (subscriber, #2304)
[Link] (7 responses)
(yes, I help maintain one of those monsters, making heavy use of ioctl() passing intricate structures into and out of the kernel, and the massive use of ioctl() is one thing I at least am hoping to get rid of in the process of getting it ready for upstreaming.)
Posted Jan 17, 2019 23:34 UTC (Thu)
by axboe (subscriber, #904)
[Link] (6 responses)
With liburing, it should be _very_ easy for applications to use. If you go native, yes, you need to be a bit more careful, and it's more hairy. But even with the basic support liburing has now, you just do:

    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        io_uring_queue_init(queue_depth, &ring, 0);

        sqe = io_uring_get_sqe(&ring);
        sqe->opcode = IORING_OP_READV;
        sqe->fd = fd;
        [...]

        io_uring_submit(&ring);

        io_uring_wait_completion(&ring, &cqe);
    }

as a very basic example.
Posted Jan 18, 2019 4:21 UTC (Fri)
by axboe (subscriber, #904)
[Link] (1 responses)
Posted Jan 19, 2019 18:57 UTC (Sat)
by nix (subscriber, #2304)
[Link]
Posted Jan 19, 2019 19:00 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Jan 19, 2019 20:27 UTC (Sat)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
Posted Jan 20, 2019 10:20 UTC (Sun)
by nix (subscriber, #2304)
[Link]
Posted Jan 24, 2019 12:35 UTC (Thu)
by joib (subscriber, #8541)
[Link]
Physically-contiguous buffers
Posted Jan 16, 2019 16:14 UTC (Wed)
by abatters (✭ supporter ✭, #6932)
[Link] (3 responses)
Regarding IORING_REGISTER_BUFFERS, would it be possible to have the kernel use high-order allocations (if available) to get larger physically-contiguous buffers? If you are using the same buffers for DMA over and over again, then setting up a larger physically-contiguous buffer would reduce the number of scatterlist entries required for each I/O. For example, imagine you are submitting 1 MiB per I/O. With 4096-byte pages, that would take 256 scatterlist entries. But with high-order allocations, you might be able to do it with just a few scatterlist entries, saving a lot of overhead. This could be done either by having the kernel allocate the memory using high-order allocations and then letting userspace mmap() it, or by trying to compact the memory when it is submitted to IORING_REGISTER_BUFFERS.
Posted Jan 16, 2019 18:03 UTC (Wed)
by axboe (subscriber, #904)
[Link] (1 responses)
In general we have various pieces of low hanging fruit on the block layer side, which are readily apparent now that we have an efficient interface into the kernel. Work in progress! But I'd like to wrap up io_uring first.
Posted Jan 16, 2019 23:31 UTC (Wed)
by ms-tg (subscriber, #89231)
[Link]
Posted Jan 17, 2019 9:45 UTC (Thu)
by nilsmeyer (guest, #122604)
[Link]
Posted Jan 16, 2019 16:15 UTC (Wed)
by me@jasonclinton.com (guest, #52701)
[Link] (8 responses)
What's the procedure for a user-space library tightly coupled to a kernel API, like this one, getting into glibc (or any of the rest of the libcs)?
Posted Jan 16, 2019 16:19 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Jan 16, 2019 18:05 UTC (Wed)
by axboe (subscriber, #904)
[Link] (6 responses)
git://git.kernel.dk/liburing
though not a lot of items are in there yet. It does contain helpers to setup/teardown the ring, and submit/complete helpers for applications that don't want (or need) to muck with the ring itself. This will grow some more features, the intent is that most applications will _probably_ end up using that instead of handling all the details themselves.
Posted Jan 17, 2019 23:12 UTC (Thu)
by nix (subscriber, #2304)
[Link] (5 responses)
Posted Jan 17, 2019 23:30 UTC (Thu)
by axboe (subscriber, #904)
[Link]
Posted Jan 18, 2019 10:36 UTC (Fri)
by NAR (subscriber, #1313)
[Link] (3 responses)
Posted Jan 18, 2019 12:06 UTC (Fri)
by zdzichu (subscriber, #17118)
[Link] (2 responses)
Posted Jan 18, 2019 17:22 UTC (Fri)
by axboe (subscriber, #904)
[Link] (1 responses)
Posted Jan 24, 2019 16:09 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Jan 16, 2019 19:24 UTC (Wed)
by arjan (subscriber, #36785)
[Link] (1 responses)
hmm that makes 32/64 compat funky.. wonder if it really should just be a u64
Posted Jan 16, 2019 19:30 UTC (Wed)
by axboe (subscriber, #904)
[Link]
Posted Jan 16, 2019 20:09 UTC (Wed)
by HIGHGuY (subscriber, #62277)
[Link]
Slightly more complex operations could be useful in combination with primitives like the P2P PCIe transfers that are being worked on to avoid going through main memory altogether.
Posted Jan 20, 2019 13:22 UTC (Sun)
by zse (guest, #120483)
[Link] (1 responses)
I haven't found the complete list of opcodes that are proposed, so don't know if this is already in the works, but I'd think you'll also need synchronization primitives (e.g. a barrier so that all io ops before it need to complete before those after the barrier can start).
In general this proposal kind of reminds me of the command queues you have for graphics hardware (OpenGL/Vulkan). I'm wondering if there is potential for (partial) unification or at least mutual inspiration...
Posted Jan 24, 2019 12:37 UTC (Thu)
by joib (subscriber, #8541)
[Link]
How to register more files while using some registered files?
Posted May 11, 2019 10:35 UTC (Sat)
by hnakamur (guest, #123503)
[Link]
    io_uring_register(ring_fd, IORING_REGISTER_FILES, fds, nr_files);

But, what to do if you want to add some more file descriptors while using some of already registered file descriptors? Is unregistering and then registering the larger set, like this, the only way?

    io_uring_register(ring_fd, IORING_UNREGISTER_FILES);
    io_uring_register(ring_fd, IORING_REGISTER_FILES, fds2, nr_files2);
Posted May 17, 2019 19:30 UTC (Fri)
by crzbear (guest, #132097)
[Link]
is there any particular reason the kernel has to allocate those buffers? Couldn't they be passed from userspace in the setup call, and then the kernel maps those into its address space? While this might obviously lead to not properly aligned buffers, the kernel can check that and return with an error if needed. This would do away with the mmapping.
Guidance on using io_uring to support 60,000+ TCP connections with <1ms RTT
Posted Nov 9, 2023 2:30 UTC (Thu)
by Tushar (guest, #167911)
[Link]
I am working on building a new application which is required to support 60,000+ tcp connections on a single server (preferably a single POSIX thread) with <1ms RTT. I am considering using io_uring for this.
I have not found any data for similar application of io_uring for other applications. Do you have some benchmark that I might refer to to see if something like this may be possible.
Thanks for your time and help.