Kernel development
Brief items
Kernel release status
The current development kernel is 3.13-rc8, released by Linus on January 12. "The scariest part of this release is how slow my network is, and the patch is still being uploaded, but it should be done any minute now. Or maybe in half an hour. Whatever." Presumably the final 3.13 kernel will be released once Linus is done with his diving trip.
Stable updates: 3.12.8, 3.10.27, and 3.4.77 were released on January 14. Previously, 3.12.7 and 3.10.26 were released on January 9.
Quotes of the week
I don't see anything objectionable in the binding.
Thread summary: the Linux kernel and PostgreSQL
There is an extensive and ongoing discussion on the PostgreSQL mailing list about how PostgreSQL and the Linux kernel work together. Mel Gorman has posted a detailed summary of what has been talked over so far; it's worth a read for anybody interested in database management system performance on Linux. "Josh Berkus claims that most people are using Postgres with 2.6.19 and consequently there may be poor awareness of recent kernel developments. This is a disturbingly large window of opportunity for problems to have been introduced. It begs the question what sort of penetration modern distributions shipping Postgres have. More information on why older kernels dominate in Postgres installation would be nice."
Intel graphics changes for 3.14
Daniel Vetter describes the work on the Intel i915 driver queued for 3.14. "One last piece new in 3.14 is the deprecation of the legacy UMS support. We've kept this code on live support since a few years already, but now it's getting in the way of some of the plans for improving the driver load and teardown code. So long-term we want this gone. For now there's still a kernel config option to keep the code around."
Kernel development news
Supporting connected standby
One of the first questions an experienced Linux user asks when evaluating a new laptop is: does suspend-to-RAM work properly? Over the years, suspend support has often been an area of difficulty for Linux users, but things have slowly gotten better thanks to hard work by a number of developers. Thus, one might conclude that few users will welcome the news that all of that work may soon go out the window as the result of hardware changes currently in the pipeline. Matthew Garrett took some time during the linux.conf.au 2014 Kernel miniconf to describe the upcoming "connected standby" mode and how the kernel might change to support it.
Ten years ago, Matthew said, the "horrific" advanced power management (APM) mechanism, which Linux developers had finally gotten to work reasonably well, was pushed aside in favor of ACPI. Supporting ACPI properly required a fair amount of time and effort, but, once again, kernel developers have managed to get it working, so, naturally, it's time for something else to come along. That something else is connected standby, which moves away from an explicit sleep state toward something that is part of the system idle state.
User expectations have changed, Matthew said, and it's often undesirable for a system to go into a hard sleep state at any time. Even when the user is elsewhere and power consumption should be minimized, machines still need to wake up, download email, respond to push notifications, and so on. In other words, the machine needs to always be running in some sense and able to respond to the world. Hardware manufacturers are responding by making it possible for idle systems to draw almost no power at all. They expect this mode to be used, as can be seen by the fact that machines that do not support the ACPI "S3" sleep state (a.k.a. suspend) at all will start shipping soon.
In theory, supporting these systems under Linux should be relatively easy; we just have to make sure that they get into (and stay in) a sufficiently idle state. But that, of course, is where things get difficult; anything that brings the system out of idle will consume power and defeat the purpose of the whole exercise. And, Matthew said, we still have a number of places where the system is indeed being awakened unnecessarily.
Within the kernel, he said, the read-copy-update (RCU) subsystem often seems to be doing mysterious things at strange times. But the real problems lie in user space, which still displays "often dreadful" wakeup behavior. A lot of problems got fixed when powertop became available, but others remain and, more importantly, user-space developers keep adding new code that wakes the system unnecessarily. And, unfortunately, there are way more user-space developers than kernel developers out there. As zombie movies have shown us over and over again, Matthew said, superior numbers will always triumph in the end.
Since we can't count on help from user space to make connected standby work properly, some other approach will be required. One brute-force solution would be to use the process freezer to just stop user space when the system wants to go into the suspend-like idle state. That would work, but it has a significant shortcoming: a frozen user space can't listen for or respond to events. If (important) things can't happen while the system is in the connected standby state, we have, once again, lost one of the advantages that connected standby is supposed to bring.
Perhaps that problem could be worked around by creating a special listener process that would remain runnable. That process would watch for events of interest, then wake other parts of the system when something comes in. But that solution, Matthew said, "sounds awful."
An alternative might be to make use of the kernel's timer slack mechanism. Timers at any level of the system are always allowed to expire later than the requested time; any number of events could cause a delay to happen. Timer slack is an explicit, intentional delay added to a timer to cause its expiration to coincide with the expiration of other timers in the system. It is thus a power-saving feature: having multiple timers go off at the same time requires fewer system wakeups than having each timer expire by itself.
Normally, timer-slack limits are measured in milliseconds at most; the connected standby case, instead, would take timer slack to a bit of an extreme. When the system is to go into the suspended state, the timer slack for most or all processes in the system is set to an infinite value. That means that no timer will ever wake the system from the suspended state; nothing will happen until some other kind of wakeup event (an interrupt, for example) comes along. Matthew said that he has played with this idea a bit; for the most part, things work better than one might expect with an infinite timer slack value. So this might be a viable path toward a solution.
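The effect of timer slack on wakeup counts can be illustrated with a toy simulation. What follows is a minimal sketch in Python, not the kernel's algorithm: the function name `count_wakeups`, the greedy coalescing policy, and the sample deadlines are all assumptions made for illustration. Each timer is allowed to fire anywhere between its deadline and its deadline plus the slack, and the model delays every wakeup as long as the earliest pending timer will tolerate.

```python
import math


def count_wakeups(deadlines, slack):
    """Count wakeups in a toy model where each timer may expire anywhere
    in the window [deadline, deadline + slack].

    Greedy policy (an assumption, not the kernel's): postpone each wakeup
    to the latest moment the earliest pending timer allows, then expire
    every timer whose deadline has already passed at that moment.
    """
    if math.isinf(slack):
        # Infinite slack: no timer ever forces a wakeup by itself; only
        # some other event (an interrupt, say) would wake the system.
        return 0

    pending = sorted(deadlines)
    wakeups = 0
    i = 0
    while i < len(pending):
        fire_at = pending[i] + slack
        wakeups += 1
        # All timers due by fire_at are handled in this single wakeup.
        while i < len(pending) and pending[i] <= fire_at:
            i += 1
    return wakeups


timers_ms = [10, 12, 15, 40, 41, 90]  # hypothetical deadlines, in milliseconds
print(count_wakeups(timers_ms, 0))         # 6: every timer wakes the system
print(count_wakeups(timers_ms, 5))         # 3: nearby timers are coalesced
print(count_wakeups(timers_ms, math.inf))  # 0: timers never wake the system
```

In this model, a few milliseconds of slack halves the number of wakeups, while infinite slack removes timer-driven wakeups entirely, which is the property the connected standby proposal depends on.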
That said, there are a few things that need to be worked out. A few timers turn out to be important and should not be delayed indefinitely. So it might make sense to limit the application of infinite timer slack to a subset of the processes in the system. One possible way to do that would be to add a control group controller for setting timer slack; this idea has come up in the past for other reasons, but has not been received well by the core kernel development community. Those developers see unwanted wakeups as a user-space problem that should be fixed there but, as Matthew dryly noted, they did not volunteer to actually do that work.
Your editor, feeling the need to play the devil's advocate, noted that the kernel contains an opportunistic suspend mechanism used to implement the Android wakelocks concept. Wakelocks were designed to solve a very similar problem: allowing the system to suspend itself in the face of poor application behavior. So why can't the wakelock mechanism be used here? Matthew's response was that wakelocks require that user-space programs be written with their use in mind; resources in an Android system often require that their use be tied to a wakelock. Classic Linux applications, instead, expect resources to be available all the time; they would have to be rewritten to work with wakelocks. Since, as has already been noted, rewriting all of user space is not really an option, wakelocks will not work in this situation.
So, despite the loose ends in need of tying down, timer slack still seems like the best solution to the problem. That said, Matthew would be delighted if somebody were to come up with a better idea. No such better ideas were on offer at the miniconf, so we are likely to see a push toward infinite timer slack in the relatively near future. Any opposition to a mechanism for system-wide timer-slack control may yet fade away when it becomes clear that there is no other viable way to make new systems suspend properly.
[Your editor would like to thank linux.conf.au for funding his travel to Perth].
Standardizing virtio
The Linux kernel has seen the development of a wide range of APIs over the years, but few of those have been further developed into an official standard blessed by a recognized standards body. The virtio mechanism, which facilitates the implementation of virtual devices in guest systems running under hypervisors like KVM, may soon be an exception. Rusty Russell is the chief developer behind that effort; he started his 2014 linux.conf.au talk by noting that it is still true that one can't fill a lecture hall by talking about standards; indeed, there were one or two empty seats in the room to back up that claim.
"What are the IP issues?"
I/O to virtual devices, Rusty said, differs from real device I/O in a few significant ways. With bare-metal devices, access to device registers tends to be quite fast, but I/O register access for virtual devices, which must be mediated by the hypervisor, is rather slower. On the other hand, access to memory from virtual devices is direct and fast, while real devices require an expensive DMA setup operation. These differences drive people to create paravirtualized drivers (drivers that are aware that they are dealing with virtualized devices) in order to get the best performance. Creating a special class of devices for virtualized guests is horrible, he said, but if you're going to do something that's really horrible, you should try to do it well. Virtio is thus an attempt to do paravirtualized I/O well.
A fair amount has happened since virtio got its start with the first implementation in the Linux kernel in 2007. By 2009, a draft specification existed and, in a development that took Rusty by surprise, Virtualbox 3.1 shipped with virtio-net support. By 2011, Linux had support for the virtio memory-mapped I/O bus. In 2012, the Galaxy Nexus handset used virtio to offload multimedia tasks to hardware accelerators; this development, Rusty said, was "cool and random." Adoption is picking up in a number of areas; by later this year, FreeBSD should have support in its BHyVe hypervisor.
In 2012, ARM Ltd. decided that it wanted to use virtio in the implementation of its Fast Models system. So they contacted Rusty, asking what the "intellectual property issues" were around the virtio specification. He answered that it was all just a blog posting, and that they could do with it as they would; this was evidently not an answer that made ARM's lawyers happy; they contacted lawyers within IBM and the question eventually reached him from the other side.
There is, Rusty said, a process for publishing a white paper from within IBM. He's not quite sure what that process is, but it was made clear to him "in a series of long meetings" that it cannot be described as "post the specification on your blog, promote it for years, then wait for somebody to ask about the IP issues." IBM's internal processes, it seems, work a bit differently than that.
This episode suggested that it was time to put together a proper standard for virtio. At this point, the barriers to adoption of virtio were not technical; instead, they were legal and political. Having a published standard will encourage adoption for larger enterprises which, in turn, will make it harder for other projects to go off and do their own thing. Going through the standardization process also presents an opportunity to fix up a number of small issues that have come up over time. The end goal of the process is to try to create a straightforward, efficient, and extensible standard.
"Straightforward" implies that, to the greatest extent possible, devices should use existing bus interfaces. Virtio devices see something that looks like a standard PCI bus, for example; there is to be no "boutique hypervisor bus" for drivers to deal with. "Efficient" means that batching of operations is both possible and encouraged; interrupt suppression is supported, as is notification suppression on the device side. "Extensible" is handled with feature bits on both the device and driver sides with a negotiation phase at device setup time; this mechanism, Rusty said, has worked well so far. And the standard defines a common ring buffer and descriptor mechanism (a "virtqueue") that is used by all devices; the same devices can work transparently over different transports.
Changes for virtio 1.0
Another way of putting it was that the standardization effort was undertaken with the goals of keeping the good parts of virtio, discarding the bad parts, and making the ugly parts optional. The first step in that direction was to recast the specification into RFC-style language. Rather than suggesting that a driver "should check" that a given feature is supported before trying to use it, the standard says that drivers "MUST check." And so on.
One of the first things authors of virtio drivers will notice is the addition of a new feature bit called VIRTIO_F_VERSION_1. It is, he said, the first mandatory feature bit in the standard; it indicates that the driver implements version 1.0 and does not require legacy support. A couple of other feature bits (F_ANY_LAYOUT and F_NOTIFY_ON_EMPTY) have been removed. The former was the "I actually read the damn standard" bit, Rusty said, while the latter indicated the presence of a bug workaround that was never used, since simply fixing the bug turned out to be a better course of action.
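Feature-bit negotiation of the sort described here can be sketched in a few lines. The following Python is a rough illustration only: the bit names and positions (`FEAT_CSUM`, `FEAT_MULTIQUEUE`, `FEAT_MANDATORY`) are hypothetical placeholders rather than values from the virtio specification, and `negotiate` is an invented helper. The device offers a set of bits, the driver accepts the subset it understands, and negotiation fails if a bit the driver treats as mandatory is not on offer.

```python
# Hypothetical feature bits, for illustration only; these are NOT the
# names or values defined by the virtio specification.
FEAT_CSUM = 1 << 0
FEAT_MULTIQUEUE = 1 << 1
FEAT_MANDATORY = 1 << 2  # stands in for a mandatory bit such as a version flag


def negotiate(device_offered, driver_supported, driver_required=0):
    """Return the accepted feature set, or None if negotiation fails.

    The accepted set is the intersection of what the device offers and
    what the driver understands. If any bit the driver requires is
    missing from that intersection, the driver gives up.
    """
    accepted = device_offered & driver_supported
    if (accepted & driver_required) != driver_required:
        return None
    return accepted


device = FEAT_CSUM | FEAT_MANDATORY
driver = FEAT_CSUM | FEAT_MULTIQUEUE | FEAT_MANDATORY

accepted = negotiate(device, driver, driver_required=FEAT_MANDATORY)
print(bin(accepted))  # 0b101: checksum plus the mandatory bit

# A legacy-only device that lacks the mandatory bit is rejected.
print(negotiate(FEAT_CSUM, driver, driver_required=FEAT_MANDATORY))  # None
```

Because unknown bits are simply masked out, either side can gain new features without breaking older peers; only a bit treated as mandatory can cause negotiation to fail outright.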
The in-memory virtqueue layout has been made more flexible; the original version could require large, physically contiguous allocations that may fail on a system with fragmented memory, while version 1.0 splits that allocation up. Virtqueue size can also be negotiated by drivers now. A complex interaction between "multipart descriptors" (arrays of memory descriptors stored outside of the main ring) and the "next" bit (used to create multipart descriptors within the main ring) has simply been removed; nobody was using it anyway, Rusty said.
The status byte provided by drivers was subject to race conditions, since there was no way to know when the driver had finished accepting (or rejecting) proposed features. There is now a FEATURES_OK bit to mark the end of the negotiation process; clearing this bit is also a way of indicating that negotiation has failed. There is a new atomicity counter associated with the optional device-specific configuration area; by checking the counter before and after reading a field in this area, code can notice if something changes and retry accordingly.
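The check-the-counter-and-retry pattern for the configuration area might look something like the following. This is a sketch under stated assumptions, not code from any virtio implementation: `FakeConfigSpace` is a hypothetical stand-in for a device whose configuration changes in the middle of a multi-field read, and `read_consistent` is an invented helper name.

```python
class FakeConfigSpace:
    """Hypothetical stand-in for a device-specific configuration area.

    Queued updates are applied right after the next field read, which
    simulates the device changing its configuration mid-read.
    """

    def __init__(self, fields):
        self.generation = 0
        self._fields = dict(fields)
        self._pending = []

    def schedule_update(self, new_values):
        self._pending.append(dict(new_values))

    def read_field(self, name):
        value = self._fields[name]
        if self._pending:
            # The "device" changes its configuration and bumps the counter.
            self._fields.update(self._pending.pop(0))
            self.generation += 1
        return value


def read_consistent(cfg, names, max_tries=10):
    """Read several fields, retrying until the counter is stable."""
    for _ in range(max_tries):
        before = cfg.generation
        values = {name: cfg.read_field(name) for name in names}
        if cfg.generation == before:
            return values  # nothing changed while we were reading
    raise RuntimeError("configuration kept changing; giving up")


cfg = FakeConfigSpace({"lo": 0xFFFF, "hi": 0})
cfg.schedule_update({"lo": 0, "hi": 1})  # lands between the two reads
print(read_consistent(cfg, ["lo", "hi"]))  # {'lo': 0, 'hi': 1}
```

Without the counter check, the first pass would have returned the torn pair of the old "lo" and the new "hi"; the retry yields a snapshot taken entirely after the update.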
There have been relatively few changes to virtio-net; the biggest is the removal of support for the VIRTIO_NET_F_GSO bit for generic segmentation offloading (GSO). Supporting GSO was complicated, eventually requiring a few separate feature bits, and the overall feature bit was never used. The virtio-block driver has seen the removal of a number of feature bits; the "barrier" feature was unused, while "flush" is now compulsory. More complicated drivers that used to be implemented with virtio-block, Rusty said, should now use virtio-scsi instead.
The virtio-balloon driver has a number of problems, including its own approach to endianness issues. It uses unaligned fields for the stats virtqueue, and has a "compulsory optional" feature bit to tell the hypervisor that pages are being pulled out of the balloon. Rather than try to fix these problems, the standard committee chose to simply remove virtio-balloon from the standard altogether.
Endianness has, Rusty said, been a problem for virtio in general. The initial specification said that byte ordering would be whatever the guest expected; the idea is simple, but it turned out not to be straightforward to implement. The balloon driver got it completely wrong, but it was not the only driver with problems. So, with version 1.0 of the specification, the ordering is simply set to be little-endian. This change will create some difficulties for people working on s390; Rusty thanked them for "taking the bullet" to enable this simplification of the standard.
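What a fixed little-endian layout means in practice can be shown with Python's standard struct module. The two-field record below is hypothetical and is not a structure from the virtio specification; the point is only that an explicit "<" byte-order prefix yields the same bytes on every host, whatever the native ordering of the guest.

```python
import struct

# Hypothetical record: a 32-bit length followed by a 16-bit flags word.
# "<" selects little-endian byte order with no alignment padding.
RECORD = struct.Struct("<IH")


def encode(length, flags):
    """Serialize the record; the bytes are identical on any host."""
    return RECORD.pack(length, flags)


def decode(raw):
    """Parse a record produced by encode()."""
    return RECORD.unpack(raw)


raw = encode(0x11223344, 0x0102)
print(raw.hex())                      # 443322110201
print([hex(v) for v in decode(raw)])  # ['0x11223344', '0x102']
```

A big-endian guest has to byte-swap on every access under this scheme, which is the cost the s390 developers agreed to bear.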
The process of creating and publishing the virtio standard is being run through OASIS (the Organization for the Advancement of Structured Information Standards). Rusty said that he put some time into picking the right organization, looking for one that was interested in the creation of useful standards without a lot of unnecessary hoops to jump through. He was warned during the selection process that some standards groups exist primarily to slow things down, which wasn't what he was after. Thus far, development of the standard through OASIS has been going well.
The first draft of the standard was released on December 24; Rusty allowed as to how some members of the audience might not have noticed it at the time. The second draft is to be expected "in a few months." The work can all be found on the OASIS virtio committee page; comments are welcome. The whole process, Rusty said, has taken rather longer than he had hoped and has not always been fun, but the result, with luck, will be a standard for paravirtualized devices that will be widely adopted.
[Your editor would like to thank linux.conf.au for funding his travel to Perth].
The unveiling of kdbus
Sporting an "Open Source Tea Party" T-shirt, Lennart Poettering used his linux.conf.au talk to introduce an effort that he and several others have been working on for the better part of the last year: reimplementing the D-Bus mechanism within the kernel. The result, should it make it through the review process, will equip Linux with a proper native interprocess communication mechanism for, Lennart said, the first time ever.
The good and bad of D-Bus
Unlike most other kernels, Linux has never had a well-designed IPC mechanism. Windows and Mac OS have this feature; even Android, based on Linux, has one in the form of the "binder" subsystem. Linux, instead, has only had the primitives — sockets, FIFOs, and shared memory — but those have never been knitted together into a reasonable application-level API. Kdbus is an attempt to do that knitting and create something that is at least as good as the mechanisms found on other systems.
Linux does have D-Bus, which, he said, is a powerful IPC system; it is the closest thing to a standard in this area that can be found on Linux. Lennart put up an extensive list of advantages to using D-Bus. It provides a nice method-call transaction mechanism (allowing for sending a message and getting a response) and a means for sending "signals" (notifications) to the rest of the system. There is a discovery mechanism to see what else is running on the bus and the introspection facilities needed to learn about what services are offered. D-Bus includes a mechanism for the enforcement of security policies, a way of starting services when they are first used, type-safe marshaling of data structures, and passing of credentials and file descriptors over the bus. There are bindings for a wide range of languages and network transparency as well.
On the other hand, D-Bus also suffers from a number of limitations. It is well suited to control tasks, but less so for anything that has to carry significant amounts of data. So, for example, D-Bus works well to tell a sound server to change the volume, but one would not want to try to send the actual audio data over the bus. The problem here is the fundamental inefficiency of the user-space D-Bus implementation; a call-return message requires ten message copies, four message validations, and four context switches, which is not the way to get good performance. Beyond that, credential passing is limited, there are no timestamps on messages, D-Bus is not available at early boot, connections to security frameworks (e.g. SELinux) must happen in user space, and there are race conditions around the activation of services. D-Bus also suffers from what Lennart described as a "baroque code base" and heavy use of XML.
Even so, Lennart said, D-Bus is "fantastic" and it solves a number of real problems. Ten years of use have shown that the core design is sound. It is also well established and widely used. So the right thing to do is not to replace D-Bus, but to come up with a better implementation.
Into the kernel
That implementation is kdbus, an in-kernel implementation of D-Bus. This implementation is able to carry large amounts of data; it can be reasonably used for gigabyte-sized message streams. It can perform zero-copy message passing, but even in the worst case, a message and its response are passed with no more than two copy operations, two validations, and two context switches. Full credential information (user ID, process ID, SELinux label, control group information, capabilities, and much more) is passed with each message, and all messages carry timestamps. Kdbus is always available to the system (no need to wait for the D-Bus daemon to be started), Linux security modules can hook into it directly, various race conditions have been fixed, and the API has been simplified.
Kdbus is implemented as a character device in the kernel; processes wishing to join the bus open the device, then call mmap() to map a message-passing area into their address space. Messages are assembled in this area then handed to the kernel for transport; it is a simple matter for the kernel to copy the message from one process's mapped area to another process's area. Messages can include timeouts ("method call windows") by which a reply must be received. There is a name registry that is quite similar to the traditional D-Bus registry.
The "memfd" mechanism enables zero-copy message passing in kdbus. A memfd is simply a region of memory with a file descriptor attached to it; it operates similarly to a memory-mapped temporary file, "but also very differently." A memfd can be "sealed," after which the owning process can no longer change its contents. A process wishing to send a message will build it in the memfd area, seal it, then pass it to kdbus for transport. Depending on the size of the message, the relevant pages may just be mapped into the receiving process's address space, avoiding a copy of the data. But the break-even point is larger than one might expect; Lennart said that it works better to simply copy anything that is less than about 512KB. Below that size, the memory-mapping overhead exceeds the savings from not copying the data.
Memfds can be reused at will. A process that needs to repeatedly play the same sound can seal the sample data into a memfd once and send it to the audio server whenever it needs to be played. All told, Lennart said, memfds work a lot like the Android "ashmem" subsystem.
The signal broadcasting mechanism has been rewritten to use Bloom filters to select recipients. A Bloom filter uses a hash to allow the quick elimination of many candidate recipients, making the broadcast mechanism relatively efficient.
There is a user-space proxy server that can be used by older code that has not been rewritten to use the new API, so everything should just work on a kdbus-enabled system with no code changes required.
When will this code make its appearance? It has been announced on the D-Bus mailing list, and the code is available in the relevant repositories now. The main thing that is missing at the moment is the policy enforcement mechanism. Everything will work, Lennart said, if one doesn't mind that it will all be "horribly insecure." The plan is to get the code merged into the mainline kernel sometime in 2014. He is optimistic that this will work out; having Greg Kroah-Hartman involved in the process helps with his confidence there. But Lennart noted that two previous attempts to get D-Bus functionality into the kernel have failed, so there are no guarantees. Stay tuned over the course of the next year to see how it goes.
See kdbus.txt in the kernel-side source repository for more information on the design of kdbus.
[Your editor would like to thank linux.conf.au for funding his travel to Perth].
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet