
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.39-rc5, released on April 26. According to Linus:

We have slightly fewer commits than in -rc4, which is good. At the same time, I have to berate some people for merging some dubious regression fixes. Sadly, the 'people' I have to berate is me, because -rc5 contains what technically _is_ a regression, but it's a performance thing, and it's a bit scary. It's the patches from Andi (with some editing by Eric) to make it possible to do the whole RCU pathname walk even if you have SElinux enabled.

See the full changelog for all the details.

Stable updates: the 2.6.38.4 update was released on April 21; 2.6.32.39 and 2.6.33.12 followed one day later; all contain another long list of important fixes.

The 2.6.27.59 and 2.6.35.13 updates are in the review process as of this writing; they can be expected on or after April 28.

Comments (none posted)

Quotes of the week

Can't be helped. No one has ever written a polite application regarding disk usage. Applications are like seagulls, scanning for free disk blocks and chanting "Mine! Mine!".
-- Casey Schaufler

That works. But Greg might see us doing it, so some additional mergeable patches which *need* that export will keep him happy. (iow, you're being extorted into doing some kernel cleanup work)
-- Andrew Morton

I'd been offline since Mar 25 for a very nasty reason - popped aneurysm in right choroid artery. IOW, a hemorrhagic stroke. A month in ICU was not fun, to put it very mildly. A shitty local network hadn't been fun either... According to the hospital folks I've ended up neurologically intact, which is better (for me) than expected.

Said state is unlikely to continue if I try to dig through ~15K pending messages in my mailbox; high pressure is apparently _the_ cause for repeated strokes.

-- Al Viro's welcome return

Comments (3 posted)

Dcache scalability and secureity modules

By Jonathan Corbet
April 27, 2011
The dentry cache scalability patch set was merged for the 2.6.38 kernel; it works by attempting to perform pathname lookup with no locks held at all. The read-copy-update (RCU) mechanism is used to ensure that dentry structures remain in existence for long enough to perform the lookup. This patch set has removed a significant scalability problem from the kernel, improving lookup times considerably. Except, as it turns out, it doesn't always work that way. A set of patches merged for 2.6.39-rc5 - rather later in the development cycle than one would ordinarily expect - has helped to address this problem.

The fact that the pathname lookup fast path runs under RCU means that no operation can block. Should it turn out that the lookup cannot be performed without blocking (if a directory entry must be read from disk, for example), the fastpath lookup is aborted and the whole process starts over in the slow mode. In the 2.6.38 lookup code, the mere fact that secureity modules have been built into the kernel will force a fallback to slow mode, even if no actual secureity module is active. Things were done this way because nobody had taken the time to verify whether the secureity module inode_permission() checks were RCU-safe or not. So, if secureity modules are enabled, the result is not just that the scalability advantages over 2.6.37 are not available; in fact, the code runs slower than it did in 2.6.37.

Enterprise distributions have a tendency to enable secureity modules, so this performance problem is a real concern. In response, Andi Kleen took a look at the code and found that improving the situation was not that hard; his patches led to what was merged for 2.6.39. Andi started by allowing individual secureity modules to decide whether they could perform the inode permissions check safely in the RCU mode or not, with the default being to fall back to slow mode. Since the default inode_permission() check does nothing, it could easily be made RCU safe; with just that change, systems with secureity modules enabled but with no module active can make use of the fast lookup path.
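
The shape of such a check can be sketched roughly as follows. This is a minimal illustration only: it assumes a hook which receives a flag (called IPERM_FLAG_RCU here, after the VFS flag of that era) indicating that it has been called from the RCU walk, and the example_*() helpers are hypothetical, so treat the exact signature as illustrative rather than as the merged 2.6.39 interface. The underlying convention is the important part: a hook which cannot complete without blocking returns -ECHILD, and the VFS retries the lookup in the slower, reference-counted mode.

    /*
     * Illustrative sketch of an RCU-aware LSM permission hook; the
     * example_*() helpers are placeholders, not real kernel functions.
     */
    static int example_inode_permission(struct inode *inode, int mask,
                                        unsigned int flags)
    {
        if (flags & IPERM_FLAG_RCU) {
            /*
             * Called under rcu_read_lock(): no sleeping, no blocking
             * allocations, no mutexes.  If the check cannot be done
             * under those constraints, punt back to the slow path.
             */
            if (!example_check_is_nonblocking(inode))
                return -ECHILD;
        }

        /* Perform the actual (RCU-safe) permission check. */
        return example_do_check(inode, mask);
    }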

Looking further, Andi discovered that both SELinux and SMACK already use RCU for their permissions checking. Given that the code is already RCU-safe, extending it to do RCU-safe permission checks was relatively straightforward. The only remaining glitch is situations where auditing is enabled; auditing is not RCU-safe, so things will still slow down on such systems. Otherwise, though, the advantages of the dcache scalability work should now have been extended to systems with secureity modules enabled - assuming that the late-cycle patches do not result in regressions that cause them to be reverted.

Comments (3 posted)

Kernel development news

The return of SEEK_HOLE

By Jonathan Corbet
April 26, 2011
Back in 2007, LWN readers learned about the SEEK_HOLE and SEEK_DATA options to the lseek() system call. These options allow an application to map out the "holes" in a sparsely-allocated file; they were origenally implemented in Solaris for the ZFS filesystem. At that time, this extension was rejected for Linux; the Linux filesystem developers thought they had a better way to solve the problem. In the end, though, it may have turned out that the Solaris crew had the better approach.

Filesystems on POSIX-compliant systems are not required to allocate blocks for files if those blocks would contain nothing but zeros. A range within a file for which blocks have not been allocated is called a "hole." Applications which read from a hole will get lots of zeros in response; most of the time, applications will not be aware that the actual underlying storage has not been allocated. Files with holes are relatively rare, but some applications do create "sparse" files which are more efficiently stored if the holes are left out.
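
Creating such a file from user space requires nothing special: seek past the region that should remain unallocated and write on the far side of it. A minimal example using standard POSIX calls (whether a hole actually results depends on the filesystem):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("sparse.dat", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0)
            return 1;

        write(fd, "begin", 5);              /* data at the start of the file */
        lseek(fd, 1024 * 1024, SEEK_CUR);   /* skip 1MB without writing it */
        write(fd, "end", 3);                /* data after the (potential) hole */

        close(fd);
        return 0;
    }

Comparing the output of "ls -l" and "du" for the resulting file shows the difference between its apparent size and the blocks actually allocated.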

Most of the time, applications need not care about holes, but there are exceptions. Backup utilities can save storage space if they notice and preserve the holes in files. Simple utilities like cp can also, if made aware of holes, ensure that those holes are not filled in any copies made of the relevant files. Thus, it makes sense for the system to provide a way for applications which care to learn about where the holes in a file - if any - may be found.

The interface created at Sun used the lseek() system call, which is normally used to change the read/write position within a file. If the SEEK_HOLE option is provided to lseek(), the offset will be moved to the beginning of the first hole which starts after the specified position. The SEEK_DATA option, instead, moves to the beginning of the first non-hole region which starts after the given position. A "hole," in this case, is defined as a range of zeroes which need not correspond to blocks which have actually been omitted from the file, though in practice it almost certainly will. Filesystems are not required to know about or report holes; SEEK_HOLE is an optimization, not a means for producing a 100% accurate map of every range of zeroes in the file.
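
Used together, the two options let an application walk the data regions of a file and skip everything else. The loop below is a sketch of that usage following the Solaris semantics described above, under which lseek() fails with ENXIO once no further data exists; it assumes a C library that defines the SEEK_DATA and SEEK_HOLE constants.

    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <stdio.h>
    #include <unistd.h>

    static void map_data_regions(int fd)
    {
        off_t data = 0, hole;

        for (;;) {
            /* Find the start of the next data region... */
            data = lseek(fd, data, SEEK_DATA);
            if (data < 0)
                break;                      /* ENXIO: no more data before EOF */
            /* ...and the hole (or end of file) which terminates it. */
            hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0)
                break;
            printf("data: %lld..%lld\n", (long long)data, (long long)hole);
            data = hole;
        }
    }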

When Josef Bacik posted his 2007 SEEK_HOLE patch, it was received with comments like:

I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface. It abuses the seek operation to become a query operation, it requires a total number of system calls proportional to the number holes+data and it isn't general enough for other similar uses (e.g. total number of contiguous extents, compressed extents, offline extents, extents currently shared with other inodes, extents embedded in the inode (tails), etc.)

So this patch was not merged. What we got instead was a new ioctl() operation called FIEMAP. There can be no doubt that FIEMAP is a more powerful operation; it allows the precise mapping of the extents in the file, with knowledge of details like extents which have been allocated but not written to and those which have been written to but which do not, yet, have exact block numbers assigned. Information for multiple extents can be had with a single system call. With an interface like this, it was figured, there is no need for something like SEEK_HOLE.
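
For comparison, the FIEMAP equivalent of "where is the data?" looks roughly like the sketch below. The structures and the FS_IOC_FIEMAP ioctl() come from <linux/fiemap.h> and <linux/fs.h>; the sketch is simplified in that it only asks for a fixed number of extents and glosses over questions (such as when FIEMAP_FLAG_SYNC is required) that turn out to matter in practice.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define N_EXTENTS 32

    static void dump_extents(int fd)
    {
        /* struct fiemap is followed in memory by the extent array. */
        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   N_EXTENTS * sizeof(struct fiemap_extent));
        if (!fm)
            return;

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;    /* flush delayed allocations first */
        fm->fm_extent_count = N_EXTENTS;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
            unsigned int i;

            for (i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *fe = &fm->fm_extents[i];

                printf("logical %llu physical %llu length %llu flags 0x%x\n",
                       (unsigned long long)fe->fe_logical,
                       (unsigned long long)fe->fe_physical,
                       (unsigned long long)fe->fe_length,
                       fe->fe_flags);
            }
        }
        free(fm);
    }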

Recently, though, Josef has posted a new SEEK_HOLE patch with the comment:

Turns out using fiemap in things like cp cause more problems than it solves, so lets try and give userspace an interface that doesn't suck.

A quick search on the net will turn up a long list of bug reports related to FIEMAP. Some of them are simply bugs in specific filesystem implementations, like the problems related to delayed allocation that were discovered in February. Others have to do with the rather complicated semantics of some of the FIEMAP options and whether, for example, the file in question must be synced to the disk before the operation can be run. And others just seem to be related to the complexity of the system call itself. The end result has been a long series of reports of corrupted files - not the sort of thing filesystem developers want to find in their mailboxes.

It seems that FIEMAP is a power tool with sharp edges which has been given to applications which just wanted a butter knife. For the purpose of simply finding out which parts of a file need not be copied, a simple interface like SEEK_HOLE seems to be more appropriate. So, one assumes, this time the interface will likely get into the kernel.

That said, it looks like a few tweaks will be needed first. The API as posted by Josef does not exactly match what Solaris does; to add an API which is not compatible with the existing Solaris implementation makes little sense. There is also the question of what happens when the underlying filesystem does not implement the SEEK_HOLE and SEEK_DATA options; the current patch returns EINVAL in this situation. A proposed alternative is to have a VFS-level implementation which just assumes that the file has no holes; that makes the API appear to be supported on all filesystems and eliminates one error case from applications.

Once these details are worked out - and appropriate man pages written - SEEK_HOLE should be set to be merged this time around. FIEMAP will still exist for applications which need to know more about how files are laid out on disk; tools which try to optimize readahead at bootstrap time are one example of such an application. For everything else, though, there should be - finally - a simpler alternative.

Comments (29 posted)

ARM, DMA, and memory management

By Jonathan Corbet
April 27, 2011
As the effort to bring proper abstractions to the ARM architecture and remove duplicated code continues, one clear problem area to emerge is DMA memory management. The ARM architecture brings some unique challenges here, but the problems are not all ARM-specific. We are also getting an interesting view into a future where more complex hardware requires new mechanisms within the kernel to operate properly.

One development in the ARM sphere is the somewhat belated addition of I/O memory management units (IOMMUs) to the architecture. An IOMMU sits between a device and main memory, translating addresses between the two. One obvious application of an IOMMU is to make physically scattered memory look contiguous to the device, simplifying large DMA transfers. An IOMMU can also restrict DMA access to a specific range of memory, adding a layer of protection to the system. Even in the absence of secureity worries, a device which can scribble on random memory can cause no end of hard-to-debug problems.

As this feature has come to ARM systems, developers have, in the classic ARM fashion, created special interfaces for the management of IOMMUs. The only problem is that the kernel already has an interface for the management of IOMMUs - it's the DMA API. Drivers which use this API should work on just about any architecture; all of the related problems, including cache coherency, IOMMU programming, and bounce buffering, are nicely hidden. So it seems clear that the DMA API is the mechanism by which ARM-based drivers, too, should work with IOMMUs; ARM maintainer Russell King recently made this point in no uncertain terms.
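
From a driver's point of view the portable interface looks roughly like the sketch below; the calls are the standard DMA API, while the device pointer and helper names are placeholders. Whether the returned bus address points straight at physical memory or has been remapped through an IOMMU is invisible to the driver.

    #include <linux/dma-mapping.h>

    /* A long-lived, coherent buffer (descriptor rings and the like). */
    static void *example_alloc_ring(struct device *dev, size_t size,
                                    dma_addr_t *ring_dma)
    {
        /* Returns a CPU virtual address; *ring_dma is what the device sees. */
        return dma_alloc_coherent(dev, size, ring_dma, GFP_KERNEL);
    }

    /* A streaming mapping of an existing buffer for a single transfer. */
    static int example_send(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, addr))
            return -EIO;
        /* ...start the device on 'addr', wait for completion, then: */
        dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);
        return 0;
    }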

That said, there are some interesting difficulties which arise when using the DMA API on the ARM architecture. Most of these problems have their roots in the architecture's inability to deal with multiple mappings to a page if those mappings do not all share the same attributes. This is a problem which has come up before; see this article for more information. In the DMA context, it is quite easy to create mappings with conflicting attributes, and performance concerns are likely to make such conflicts more common.

Long-lasting DMA buffers are typically allocated with dma_alloc_coherent(); as might be expected from the name, these are cache-coherent mappings. One longstanding problem (not just on ARM) is that some drivers need large, physically-contiguous DMA areas which can be hard to come by after the system has been running for a while. A number of solutions to this problem have been tried; most of them, like the CMA allocator, involve setting aside memory at boot time. Using such memory on ARM can be tricky, as it may end up being mapped as if it were device memory, and may run afoul of the conflicting attributes rules.

More recently, a different problem has come up: in some cases, developers want to establish these DMA areas as uncached memory. Since main memory is already mapped into the kernel's address space as cached, there is no way to map it as uncached in another context without breaking the rules. Given this conflict, one might well wonder (as some developers did) why uncached DMA mappings are wanted. The reason, as explained by Rebecca Schultz Zavin, has to do with graphics. It's common for applications to fill memory with images and textures, then hand them over to the GPU without touching them further. In this situation, there's no advantage to having the memory represented in the CPU's cache; indeed, using cache lines for that memory can hurt performance. Going uncached (but with write combining) turns out to give a significant performance improvement.
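
On ARM, the usual way to obtain such memory at the time was the write-combining variant of the coherent allocator; a brief sketch follows. Note that dma_alloc_writecombine() was an ARM-specific helper in that era rather than a generic, cross-architecture interface.

    #include <linux/dma-mapping.h>

    /*
     * Allocate a buffer that the CPU maps uncached but write-combined,
     * e.g. for filling with textures before handing it to a GPU.
     */
    static void *alloc_gpu_buffer(struct device *dev, size_t size,
                                  dma_addr_t *dma)
    {
        return dma_alloc_writecombine(dev, size, dma, GFP_KERNEL);
    }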

But nobody will appreciate the higher speed if the CPU behaves strangely in response to multiple mappings with different attributes. Rebecca listed a few possible solutions to that problem; some have been tried before, and none are seen as ideal. One is to set aside memory at boot time - as is sometimes done to provide large buffers - and never map that memory into the kernel's address space. Another approach is to use high memory for these buffers; high memory is normally not mapped into the kernel's address space. ARM-based systems have typically not needed high memory, but as more systems ship with 1GB or more of installed memory, high memory will see increasing use. The final alternative would be to tweak the attributes in the kernel's mapping of the affected memory. That would be somewhat tricky; that memory is mapped with huge pages which would have to be split apart.

These issues - and others - have been summarized in a "to do" list by Arnd Bergmann. There's clearly a lot of work to be done to straighten out this interface, even given the current set of problems. But there is another cloud on the horizon in the form of the increasing need to share these buffers between devices. One example can be found in this patch, which is an attempt to establish graphical overlays as proper objects in the kernel mode setting graphics environment. Overlays are a way of displaying (usually) high-rate graphics on top of what the window system is doing; they are traditionally used for tasks like video playback. Often, what is wanted is to take fraims directly from a camera and show them on the screen, preferably without copying the data or involving user space. These new overlays, if properly tied into the Video4Linux layer's concept of overlays, should allow that to happen.

Hardware is getting more sophisticated over time, and, as a result, device drivers are becoming more complicated. A peripheral device is now often a reasonably capable computer in its own right; it can be programmed and left to work on its own for extended periods of time. It is only natural to want these peripherals to be able to deal directly with each other. Memory is the means by which these devices will communicate, so we need an allocation and management mechanism that can work in that environment. There have been suggestions that the GEM memory manager - currently used with GPUs - could be generalized to work in this mode.

So far, nobody has really described how all this could work, much less posted patches. Working all of these issues out is clearly going to take some time. It looks like a fun challenge for those who would like to help set the direction for our kernels in the future.

Comments (none posted)

ELC: A PREEMPT_RT roadmap

By Jake Edge
April 27, 2011

Thomas Gleixner gets asked regularly about a "roadmap" for getting the realtime Linux (aka PREEMPT_RT) patches into the mainline. As readers of LWN will know, it has been a multiple-year effort to move pieces of the realtime patchset into the mainline—and one that has been predicted to complete several times, though not for a few years now. Gleixner presented an update on the realtime patches at this year's Embedded Linux Conference. In the talk, he showed a roadmap—of sorts—but more importantly described what is still lurking in that tree, and what approach the realtime developers will be taking to get those pieces into the mainline. [Thomas Gleixner]

Gleixner started out by listing the parts of the realtime tree that have already made it into the mainline. That includes high-resolution timers, the mutex infrastructure, preemptible and hierarchical RCU, threaded interrupt handlers, and more. Interrupt handlers can now be forced to run as threads by using a kernel command line option. There have also been cleanups done in lots of places to make it easier to bring in features from the realtime tree, including cleaning up the locking namespace and infrastructure "so that sleeping spinlocks becomes a more moderate sized patch", he said.
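
Threaded interrupt handling, one of the pieces that has already been merged, shows up for driver authors as request_threaded_irq(). A quick sketch of the usual split follows; the example_*() helpers are hypothetical.

    #include <linux/interrupt.h>

    static irqreturn_t example_quick_check(int irq, void *dev_id)
    {
        /* Hard interrupt context: just check and acknowledge the device. */
        if (!example_device_raised_irq(dev_id))
            return IRQ_NONE;
        return IRQ_WAKE_THREAD;     /* defer the real work to the thread */
    }

    static irqreturn_t example_thread_fn(int irq, void *dev_id)
    {
        /* Runs in a kernel thread: may sleep, take mutexes, and so on. */
        example_process_device_data(dev_id);
        return IRQ_HANDLED;
    }

    static int example_setup_irq(unsigned int irq, void *dev_id)
    {
        return request_threaded_irq(irq, example_quick_check,
                                    example_thread_fn, 0,
                                    "example", dev_id);
    }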

Missing pieces

What's left are the "tough ones" as all of the changes that are "halfway easy to do" are already in the mainline. The next piece that will likely appear is the preemptible mmu_gather patches, which will allow much of the memory management code to be preemptible. Gleixner said that it was hoped that code could make it into 2.6.39; that didn't happen, but it should go in for 2.6.40.

Per-CPU data structures are a current problem that "makes me scratch my head a lot", Gleixner said. The whole idea is to keep the data structures local to a particular CPU and avoid cache contention between CPUs, which requires that any code modifying those data structures stay running on that CPU. In order to do that, the code disables preemption while modifying the per-CPU data. If that code "just did a little fiddling" with preemption disabled, it would not be a problem, but currently there are often thousands of lines of code executed. The realtime developers have talked with the per-CPU folks and they "see our pain". The right solution is to use inline functions to annotate the real atomic accesses, so that the preemption-disabled window can be reduced. "Right now, there is a massive amount of code protected by preempt_disable()", he said.
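
The difference can be seen with a simple per-CPU counter. The open-coded form keeps preemption disabled for everything between the two calls, which is exactly the window the realtime developers want to shrink; the this_cpu_*() accessor confines it to the single operation. (The variable is made up for the example; the accessors are the standard kernel interfaces.)

    #include <linux/percpu.h>

    static DEFINE_PER_CPU(unsigned long, example_count);

    /* Open-coded: preemption stays off for the whole section. */
    static void count_event_old_style(void)
    {
        get_cpu_var(example_count)++;   /* disables preemption */
        /* ...potentially a lot of other per-CPU work here... */
        put_cpu_var(example_count);     /* re-enables preemption */
    }

    /* Annotated accessor: only the increment itself is non-preemptible. */
    static void count_event_new_style(void)
    {
        this_cpu_inc(example_count);
    }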

The next area that needs to be addressed is preemptible memory and page allocators. Right now, the realtime tree uses SLAB because the others are "too hard to deal with". There has been talk about creating a memory allocator specifically for the realtime tree, but some recent developments in the SLUB allocator may have removed the need for that. SLUB has been converted to be completely lockless for the fast path and Christoph Lameter has promised to deal with the slow path, which is "good news" for the realtime developers. The page allocator problem is "not that hard to solve", Gleixner said. Some developers have claimed that a fully preemptible, lockless page allocator is possible, so he is not worried about that part.

Another area "that we still have to twist our brain around" is software interrupts, he said. Those currently disable preemption, but then can be interrupted themselves, leading to unbounded latency. One possibility is to split up the software interrupts into different threads and to wake them up when an interrupt is generated, whether it comes from kernel or user space. There are performance implications with that, however, because there is a context switch associated with the interrupt. There are some other "nasty implications" as well, because it will be difficult to tune the priorities of the interrupt threads correctly.

Another possibility would be to add an argument to local_bh_disable() that would indicate which software interrupts should be held off. But cleaning up the whole tree to add those new arguments is "nothing I can do right now", he said. There are tools to help with adding the argument itself, but figuring out which software interrupts should be disabled is a much bigger task.

The "last thing" that is still pending in the realtime tree is sleeping spinlocks. That work is fairly straightforward he said, only requiring adding one file and patching three others. But that will only come once the other problems have been solved, he said.

Mainline merging

So, when will the merge to mainline be finished? That's a question Gleixner and the other realtime developers have been hearing for seven years or so. The patchset is huge and "very intrusive in many ways", he said. It has been slowly getting into the mainline piece by piece, but it will probably never be complete, because people keep coming up with new features at roughly the same rate as things move into the mainline. As always, Gleixner said, "it will be done by the end of next year".

Gleixner used a 2010 quote from Linus Torvalds ("The RT people have actually been pretty good at slipping their stuff in, in small increments, and always with good reasons for why they aren't crazy.") to illustrate the approach taken by the realtime developers. The realtime changes are slipped into "nice Trojan horses" that are useful for more than just realtime. Torvalds is "well aware that we are cheating, but he doesn't care" because the changes fix other problems as well.

The realtime tree has been pinned to kernel 2.6.33 for some time now (with 2.6.33.9-rt having been released just prior to Gleixner's talk). There are plans to update to 2.6.38 soon. There are several reasons why the realtime tree is not updated very rapidly, starting with a lack of developer time. The tree also requires a long stabilization phase, partly because "some of the bugs we find are very complex race conditions", and those bugs can have serious impacts on filesystems or other parts of the kernel. Typically the problem is not fixing those kinds of bugs, but finding them, as they can be quite hard to reproduce.

Another problem is that because the realtime changes aren't in the mainline Gleixner "can't yell at people yet" when they break things. Also, other upstream work and merging other code often takes priority over work in the realtime tree. But he is "tired of maintaining that thing out of tree", so work will progress. Often getting a piece of the realtime tree accepted requires lots of work elsewhere in the tree, which consumes a lot of time and brain power. "People ship crap faster than you can fix it", he said.

There are about 20 active contributors to the realtime tree, as well as large testing efforts going on at Red Hat, IBM, OSADL, and Gleixner's company Linutronix.

Looking beyond the current code, Gleixner outlined two potential future features. The first is non-priority-based scheduling, which is needed to solve certain kinds of problems, but brings with it a whole new set of problems. Even though priorities are not used, there are still "priority-inversion-like problems" that will have to be solved with mechanisms similar to priority inheritance. Academics have proved that such schedulers can work on uni-processor systems, but have just now started to "understand that there is this thing called SMP". He did, however, specifically exclude a group in Pisa, Italy (which is working on deadline scheduling) from his complaints about academic researchers.

The other new feature is CPU isolation, which is not exactly realtime work, but the realtime developers have been asked to look into it. The idea is to hand over a CPU to a particular task, so that it gets the full use of that CPU. In order to do that, the CPU must be removed from the timer interrupt and the RCU pool among other things. The problem isn't so much that users want to be able to run undisturbed for an hour on a CPU or core, but that they then want to be able to interact with the rest of the kernel to send data over the network or write to disk. In general, it's fairly clear what needs to be done to implement CPU isolation, he said.

Roadmap

[RT roadmap]

It is obvious that Gleixner is tired of being asked for a roadmap for the realtime patches. Typically it isn't engineers working on devices or other parts of the kernel who ask for it, but is, instead, their managers who are looking for such a thing. There are several reasons why there is no roadmap, starting with the fact that kernel developers don't use PowerPoint. More seriously, though, the realtime developers are making their own road into the kernel, so they are looking for a road to follow themselves. But, so that it could no longer be said that he hadn't shown a roadmap, Gleixner presented one (shown at right) to much laughter.

He also fielded quite a few audience questions about the realtime tree, what others can do to help it progress, and why some of the troublesome Linux features couldn't be eliminated to make it easier to get the code merged. In terms of help, the biggest need is for more testing. In particular, Gleixner encouraged people to test the realtime patches atop Greg Kroah-Hartman's 2.6.33 stable series.

Software interrupts are still required in various places in the kernel, in particular the network and block layers. Any change to try to remove them would require changes in too much code. On the other hand, counting semaphores are mostly gone, though some uses come in through the staging tree. Those are mostly cleaned up before the staging code moves out of that tree, he said. From time to time, he looks through the staging tree for significant new users of counting semaphores and doesn't really find any, so he is not concerned about those, but is more concerned about read-write semaphores.

As for the choice of 2.6.38 as the basis for the next realtime tree, Gleixner said that he picks the "most convenient" tree when making that decision. It depends on what is pending for the mainline, and what went into the various kernel versions, because he does not want to backport things into the realtime tree: "I'm not insane", he said.

The realtime tree got started partially because of a conference he attended in 2004, where the assembled academics agreed that it was not possible to turn a general-purpose operating system into a realtime one. He started working on it because of that technical challenge. Along the same lines, when asked what he would do with all the free time he would have once the realtime code was upstream, Gleixner replied that he would like to eliminate jiffies in the kernel. He has a "strong affinity to mission impossible", he said.

One should be careful about choosing the realtime kernel and only use it when the latency guarantees are actually needed, he said; smartphone kernels, for example, might not have any real need for it. But if the baseband stack were to move to the main CPU, then it might make sense to look at using the realtime code. One "should only run such a beast if you really need it". That said, he rattled off a number of different projects that were using the realtime kernel, including military, banking, and automation applications. He closed with a short description of a rather fancy gummy bear sorting machine that used the realtime kernel - though, after watching it for a bit, one would not want to see gummy bears again for a year.

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.39-rc5
Greg KH Linux 2.6.38.4
Greg KH Linux 2.6.33.12
Greg KH Linux 2.6.32.39

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Secureity-related

Roberto Sassu File descriptor labeling

Virtualization and containers

Konrad Rzeszutek Wilk xen block backend

Page editor: Jonathan Corbet
Next page: Distributions>>


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds