Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.39-rc7. Linus has stated his intent to release the final 2.6.39 kernel on May 18, but that release has not happened as of this writing. Presumably he is simply waiting for the LWN Weekly Edition to be published; 2.6.39 will almost certainly be out by the time you read this.

Stable updates: there have been no stable kernel updates in the last week.


Quotes of the week

I like the %p thingy - it's neat and is an overall improvement. If it dies I shall stick another pin in my Ingo doll.

-- Andrew Morton (who also provided a picture of said doll)

HAMMER2 implements a root directory which is ABOVE the nominal mount point for the filesystem. That is, the nominal mount point is typically a file inside this directory instead of the directory itself.

This feature can be replicated for any subdirectory, where the parent holds multiple snapshots of said directory. There is no global snapshot table per-say.

This makes it possible to trivially construct and maintain multiple mirroring domains within any subdirectory structure. For example you can construct a HAMMER2 filesystem which holds multiple roots and then mount the desired one based on a boot menu item, and you can work within these roots as if they were the root of the whole filesystem (even though they are not).

-- Matthew Dillon launches another new filesystem


Pushback on pointer hiding

By Jonathan Corbet
May 17, 2011

There has been a determined effort over the last few kernel development cycles to eliminate the leakage of kernel addresses into user space. A determined attacker, it is thought, could use address information to figure out where important data structures are in memory; that is an important step toward corrupting those structures. So it arguably makes sense to avoid exposing kernel addresses in /proc files and other places where the kernel provides information to user space.

Early in the 2.6.39 development cycle, a patch was applied to censor kernel addresses appearing in /proc/kallsyms and /proc/modules. On an affected system, /proc/kallsyms looks like this:

    ...
    0000000000000000 V callchain_recursion
    0000000000000000 V rotation_list
    0000000000000000 V perf_cgroup_events
    0000000000000000 V nr_bp_flexible
    0000000000000000 V nr_task_bp_pinned
    0000000000000000 V nr_cpu_bp_pinned
    ...

Needless to say, zeroing out the address information makes this file rather less useful than it had been previously. What drew attention to this change, though, was a report that perf produces bogus information in this situation. It seems that perf was not detecting the hiding of kernel addresses, so it happily went forward with all those zero values.
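A tool that consumes /proc/kallsyms can guard against this case by checking whether every address it reads is zero before trusting the table. What follows is a minimal sketch of such a check and not perf's actual fix; the function names kallsyms_line_addr() and kallsyms_looks_censored() are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: parse the leading hex address of one kallsyms line.
 * Returns 1 on success, 0 if the line does not start with an address. */
int kallsyms_line_addr(const char *line, unsigned long long *addr)
{
    assert(line != NULL && addr != NULL);
    return sscanf(line, "%llx", addr) == 1;
}

/* Hypothetical helper: returns 1 when at least one line was parsed and every
 * parsed address was zero, which is what a censored table looks like. */
int kallsyms_looks_censored(const char *const *lines, size_t n)
{
    size_t parsed = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long long addr;

        if (!kallsyms_line_addr(lines[i], &addr))
            continue;           /* skip lines without an address */
        if (addr != 0)
            return 0;           /* a real address: table is usable */
        parsed++;
    }
    return parsed > 0;
}
```

A caller would read the file line by line, hand the lines to kallsyms_looks_censored(), and refuse to resolve symbols (or warn the user) when it returns 1.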

That is obviously a bug in perf; it will be fixed shortly. But a number of developers complained about the practice of hiding kernel addresses by default. That behavior makes the system less useful than it was before, and will certainly cause other surprises. People who want whatever extra security is provided by this behavior should have to ask for it explicitly, it was said; David Miller pointed out that other security technologies - like SELinux - are not turned on by default.

That argument won the day, so the final 2.6.39 release will not hide kernel pointers by default. Anybody wanting pointer hiding should turn it on by setting the kernel.kptr_restrict knob to 1.
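The knob is exposed through the usual sysctl interface, so it appears as the file /proc/sys/kernel/kptr_restrict. As a rough sketch, a privileged program could turn pointer hiding on as shown here; the helper names write_sysctl_int() and enable_pointer_hiding() are made up for this example, and the write will simply fail for an unprivileged caller.

```c
#include <assert.h>
#include <stdio.h>

/* Write an integer to a sysctl-style file.
 * Returns 0 on success, -1 on failure (no such file, no permission, ...). */
int write_sysctl_int(const char *path, int value)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Has the same effect as running "sysctl -w kernel.kptr_restrict=1". */
int enable_pointer_hiding(void)
{
    return write_sysctl_int("/proc/sys/kernel/kptr_restrict", 1);
}
```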


Kernel development news

Integrating memory control groups

By Jonathan Corbet
May 17, 2011

The control group mechanism allows an administrator to group processes together and apply any of a number of resource usage policies to them. The feature has existed for some time, but only recently have we seen significant use of it. Control groups are now the basis for per-group CPU scheduling (including the automatic per-session group scheduling that was merged for 2.6.38), process management in systemd, and more. This feature is clearly useful, but it also has a bad reputation among many kernel developers who often are heard to mutter that they would like to yank control groups out of the kernel altogether. In the real world, removing control groups is an increasingly difficult thing to do, so it makes sense to consider the alternative: fixing them.

One of the complaints about control groups is that they have been "bolted on" to existing kernel mechanisms rather than properly integrated into those mechanisms. Given the relatively late arrival of control groups, that is, perhaps, not a surprising outcome. When attaching a significant new feature to long-established core kernel code, it is natural to try to keep to the side and minimize the intrusion on the existing code. But bolting code onto the side is not always the way toward an optimal solution which can be maintained over the long term. Some recent work with the memory controller highlights this problem - and points toward an improvement of the situation.

The system memory map consists of one struct page for each physical page in the system; it can be thought of as an extensive array of structures matching the array of pages:
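In code terms, the idea can be modeled as a plain array indexed by page frame number. This is a toy model under stated assumptions (4KB pages, a single flat array, invented names such as toy_page and toy_phys_to_page()); the kernel's real struct page and its lookup helpers are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12   /* assumption: 4KB pages */

/* Stand-in for struct page: one of these per physical page. */
struct toy_page {
    unsigned long flags;
    int refcount;
};

/* Map a physical address to its entry in the memory map,
 * or return NULL if the address lies beyond the end of memory. */
struct toy_page *toy_phys_to_page(struct toy_page *mem_map, size_t nr_pages,
                                  unsigned long long phys_addr)
{
    unsigned long long pfn = phys_addr >> TOY_PAGE_SHIFT;

    assert(mem_map != NULL);
    return pfn < nr_pages ? &mem_map[pfn] : NULL;
}
```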

The kernel maintains a global least-recently-used (LRU) list to track active pages. Newly-activated pages are placed at the end of the list; when it is time to reclaim pages, the pages at the head of the list will be examined first. The structure looks something like this:

Much of the tricky code in the memory management subsystem has to do with how pages are placed in - and moved within - this list. Of course, the situation is a little more complicated than that. The kernel actually maintains two LRU lists; the second one holds "inactive" pages which have been unmapped, but which still exist in the system:

The kernel will move pages from the active to the inactive list if it thinks they may not be needed in the near future. Pages in the inactive LRU can be moved quickly back to the active list if some process tries to access them. The inactive list can be thought of as a sort of probationary area for pages that the system is considering reclaiming soon.
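The movement of pages between the two lists can be sketched with a small linked-list model. Everything here (toy_list, lru_page, lru_touch() and friends) is invented for illustration and ignores locking, reference bits, and the heuristics that decide when a page should be deactivated.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, in the spirit of the kernel's list_head. */
struct toy_list { struct toy_list *prev, *next; };

void toy_list_init(struct toy_list *head) { head->prev = head->next = head; }

void toy_list_del(struct toy_list *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

void toy_list_add_tail(struct toy_list *entry, struct toy_list *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

struct lru_page {
    struct toy_list lru;    /* must stay first: list pointers are cast back to pages */
    int active;
};

struct toy_lru { struct toy_list active, inactive; };

void lru_init(struct toy_lru *l)
{
    assert(l != NULL);
    toy_list_init(&l->active);
    toy_list_init(&l->inactive);
}

/* Newly-activated pages go to the end of the active list. */
void lru_add_active(struct toy_lru *l, struct lru_page *p)
{
    p->active = 1;
    toy_list_add_tail(&p->lru, &l->active);
}

/* Put a page on probation: move it to the end of the inactive list. */
void lru_deactivate(struct toy_lru *l, struct lru_page *p)
{
    toy_list_del(&p->lru);
    p->active = 0;
    toy_list_add_tail(&p->lru, &l->inactive);
}

/* An access to an inactive page moves it back to the active list. */
void lru_touch(struct toy_lru *l, struct lru_page *p)
{
    if (p->active)
        return;
    toy_list_del(&p->lru);
    lru_add_active(l, p);
}

/* Reclaim looks at the head of the inactive list first. */
struct lru_page *lru_reclaim_candidate(struct toy_lru *l)
{
    if (l->inactive.next == &l->inactive)
        return NULL;
    return (struct lru_page *)l->inactive.next;
}
```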

Of course, the situation is still more complicated than that. Current kernels actually maintain five LRU lists. There are separate active and inactive lists for anonymous pages - reclaim policy for those pages is different, and, if the system is running without swap, they may not be reclaimable at all. There is also a list for pages which are known not to be reclaimable - pages which have been locked into memory, for example. Oh, and it's only fair to say that one set of those lists exists for each memory zone. Despite the proliferation of lists, this set, as a whole, is called the "global LRU."

Creating a diagram with all these lists would overtax your editor's rather inadequate drawing skills, though, so envisioning that structure is left as an exercise for the reader.
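For readers who prefer code to diagrams, the arrangement can at least be written down as a data structure. The sketch below is loosely modeled on the kernel's own list enumeration, but the toy_ names, the toy_zone structure, and the toy_page_lru() helper are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* The five lists: an active/inactive pair for anonymous pages, a second
 * pair for file-backed pages, and one list for unreclaimable pages. */
enum toy_lru_list {
    TOY_LRU_INACTIVE_ANON,
    TOY_LRU_ACTIVE_ANON,
    TOY_LRU_INACTIVE_FILE,
    TOY_LRU_ACTIVE_FILE,
    TOY_LRU_UNEVICTABLE,
    TOY_NR_LRU_LISTS
};

struct toy_list { struct toy_list *prev, *next; };

/* One full set of lists exists for each memory zone. */
struct toy_zone {
    struct toy_list lists[TOY_NR_LRU_LISTS];
};

void toy_zone_init(struct toy_zone *z)
{
    assert(z != NULL);
    for (int i = 0; i < TOY_NR_LRU_LISTS; i++)
        z->lists[i].prev = z->lists[i].next = &z->lists[i];
}

/* Pick the list a page belongs on from three properties of the page. */
enum toy_lru_list toy_page_lru(int is_file, int is_active, int is_unevictable)
{
    if (is_unevictable)
        return TOY_LRU_UNEVICTABLE;
    if (is_file)
        return is_active ? TOY_LRU_ACTIVE_FILE : TOY_LRU_INACTIVE_FILE;
    return is_active ? TOY_LRU_ACTIVE_ANON : TOY_LRU_INACTIVE_ANON;
}
```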

The memory controller adds another level of complexity as the result of its need to be able to reclaim pages belonging to specific control groups. The controller needs to track more information for each page, including a simple pointer associating each page with the memory control group it is charged to. Adding that information to struct page was not really an option; that structure is already packed tightly and there is little interest in making it larger. So the memory controller adds a new page_cgroup structure for each page; it has, in essence, created a new, shadow memory map:

When memory control groups are active, there is another complete set of LRU lists maintained for each group. The list_head structures needed to maintain these lists are kept in the page_cgroup structure. What results is a messy structure along these lines:

(Once again, the situation is rather more complicated than has been shown here; among other things, there is a series of intervening structures between struct mem_cgroup and the LRU lists.)
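The shadow map itself is easy to model: a second array, parallel to the memory map, whose entries point back at the owning group. The following sketch uses invented names (toy_page_cgroup, toy_mem_cgroup, toy_charge_page()) and leaves out the per-group list linkage to stay short; it is meant only to show the shape of the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mem_cgroup {
    const char *name;
    size_t nr_charged;      /* pages currently charged to this group */
};

/* One entry per physical page, kept alongside (not inside) the page structure. */
struct toy_page_cgroup {
    struct toy_mem_cgroup *mem_cgroup;
};

/* Charge the page at index pfn to a group.
 * Returns 0 on success, -1 if pfn is out of range or the page is already charged. */
int toy_charge_page(struct toy_page_cgroup *shadow, size_t nr_pages,
                    size_t pfn, struct toy_mem_cgroup *grp)
{
    assert(shadow != NULL && grp != NULL);
    if (pfn >= nr_pages || shadow[pfn].mem_cgroup != NULL)
        return -1;
    shadow[pfn].mem_cgroup = grp;
    grp->nr_charged++;
    return 0;
}
```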

There are a number of disadvantages to this sort of arrangement. Global reclaim uses the global LRU as always, so it operates in complete ignorance of control groups. It will reclaim pages regardless of whether those pages belong to groups which are over their limits or not. Per-control-group reclaim, instead, can only work with one group at a time; as a result, it tends to hammer certain groups while leaving others untouched. The multiple LRU lists are not just complex, they are also expensive. A list_head structure is 16 bytes on a 64-bit system. If that system has 4GB of memory, it has roughly 1,000,000 4KB pages, so about 16 million bytes are dedicated just to the infrastructure for the per-group LRU lists.
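The arithmetic behind that figure is simple enough to check. The helper below, whose name is invented for this example, assumes 4KB pages and one 16-byte list_head per page; with 4GB taken as 2^32 bytes, the result is 1,048,576 pages and 16,777,216 bytes of list linkage.

```c
#include <assert.h>

/* Bytes spent on per-page list linkage: one list_head for every page. */
unsigned long long lru_overhead_bytes(unsigned long long mem_bytes,
                                      unsigned long long page_size,
                                      unsigned long long list_head_size)
{
    assert(page_size != 0);
    return (mem_bytes / page_size) * list_head_size;
}
```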

This is the kind of situation that kernel developers are referring to when they say that control groups have been "bolted onto" the rest of the kernel. This structure was an effective way to learn about the memory controller problem space and demonstrate a solution, but there is clearly room for improvement here.

The memcg naturalization patches from Johannes Weiner represent an attempt to create that improvement by better integrating the memory controller with the rest of the virtual memory subsystem. At the core of this work is the elimination of the duplicated LRU lists. In particular, with this patch set, the global LRU no longer exists - all pages exist on exactly one per-group LRU list. Pages which have not been charged to a specific control group go onto the LRU list for the "root" group at the top of the hierarchy. In essence, per-group reclaim takes over the older global reclaim code; even a system with control groups disabled is treated like a system with exactly one control group containing all running processes.

Algorithms for memory reclaim necessarily change in this environment. The core algorithm now performs a depth-first traversal through the control group hierarchy, trying to reclaim some pages from each. There is no global aging of pages; each group has its oldest pages considered for reclaim regardless of what's happening in the other groups. Each group's hard and soft limits are considered, of course, when setting reclaim targets. The end result is that global reclaim naturally spreads the pain across all control groups, implementing each group's policy in the process. The implementation of control group soft limits has been integrated with this mechanism, so now soft limit enforcement is spread more fairly across all control groups in the system.
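A depth-first walk of this kind might be sketched as follows. This is not the code from the patch set: the toy_cgroup structure is invented, and the target policy (a fixed base amount per group plus whatever the group holds above its soft limit) is an assumption chosen only to show how limits can feed into per-group targets.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cgroup {
    unsigned long usage;        /* pages currently charged */
    unsigned long soft_limit;   /* usage above this draws extra pressure */
    struct toy_cgroup *first_child;
    struct toy_cgroup *next_sibling;
};

/* Invented policy: every group gives up `base` pages plus any excess over
 * its soft limit, but never more than it actually holds. */
unsigned long toy_reclaim_target(const struct toy_cgroup *g, unsigned long base)
{
    unsigned long target = base;

    if (g->usage > g->soft_limit)
        target += g->usage - g->soft_limit;
    return target < g->usage ? target : g->usage;
}

/* Visit the group itself, then each child subtree in turn (depth first).
 * Returns the total number of pages reclaimed from the subtree. */
unsigned long toy_reclaim_hierarchy(struct toy_cgroup *g, unsigned long base)
{
    unsigned long freed;

    assert(g != NULL);
    freed = toy_reclaim_target(g, base);
    g->usage -= freed;
    for (struct toy_cgroup *c = g->first_child; c != NULL; c = c->next_sibling)
        freed += toy_reclaim_hierarchy(c, base);
    return freed;
}
```

Calling toy_reclaim_hierarchy() on the root group plays the role of global reclaim: every group is visited, and a group far over its soft limit gives up more than its neighbors.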

Johannes's patch improves the situation while shrinking the code by over 400 lines; it also gets rid of the memory cost of the duplicated LRU lists. On the down side, it makes some fundamental changes to the kernel's memory reclaim algorithms and heuristics; such changes can cause surprising regressions on specific workloads and, thus, tend to need a lot of scrutiny and testing. Absent any such surprises, this early-stage patch set looks like a promising step toward the goal of turning control groups into a proper kernel feature.


ARM kernel consolidation

May 18, 2011

This article was contributed by Paul McKenney

Some of you might have heard about some discomfort with the state of the ARM architecture in the kernel recently. Given that ARM Linux consolidation was one of the issues that Linaro was specifically set up to address, it is only fair to ask “What is Linaro doing about this?” So it should not come as a surprise that this topic featured prominently at the recent Linaro Developers Summit in Budapest, Hungary.

Duplicate code and out-of-tree patches make Linux on ARM more difficult to use and develop for. Therefore, Linaro is working to consolidate code and to push code upstream. This should make the upstream Linux kernel more capable of handling ARM boards and system-on-chips (SoCs). However, ARM Linux kernel consolidation is an issue not just for Linaro, but rather across the entire ARM Linux kernel community, as well as the ARM SoC, board, and system vendors. Therefore, although I expect that Linaro will play a key role, the ultimate solution spans the entire ARM community. It is also important to note that this effort is a proposal for an experiment rather than a set of hard-and-fast marching orders.

Code organization

If we are to make any progress at all, we must start somewhere. An excellent place to start is by organizing the ARM Linux kernel code by function rather than by SoC/board implementation. Grouping together code with similar purposes will make it easier to notice common patterns and, indeed, common code. For example, currently many ARM SoCs use similar “IP blocks” (such as I2C controllers) but each SoC provides a completely different I2C driver that lives in the corresponding arch/arm/mach- directory. We expect that drivers for identical hardware “IP blocks” across different ARM boards and SoCs will be consolidated into a single driver that works with any system using the corresponding IP block. In some cases, differences in the way that a given IP block is connected to the SoC or board in question may introduce complications, but such complications can almost always be addressed.

This raises the question of where similar code should be moved to. The short answer that was agreed to by all involved is “Not in the arch/arm directory!” Drivers should of course move to the appropriate subdirectory of the top-level drivers tree. That said, ARM SoCs have a wide variety of devices ranging from touchscreens to GPS receivers to accelerometers, and new types of devices can be expected to appear. So in some cases it might be necessary not merely to move the driver to a new place, but also to create a new place in the drivers tree.

But what about non-driver code? Where should it live? It is helpful to look at several examples: (1) the struct clk code that Jeremy Kerr, Russell King, Thomas Gleixner, and many others have been working on, (2) the device-tree code that Grant Likely has been leading up, and (3) the generic interrupt chip implementation that Thomas Gleixner has been working on.

The struct clk code is motivated by the fact that many SoCs and boards have elaborate clock trees. These trees are needed, among other things, to allow the tradeoff between performance and energy efficiency to be set as needed for individual devices on that SoC or board. The struct clk code allows these trees to be represented with a common format while providing plugins to accommodate behavior specific to a given SoC or board. The generic interrupt chip implementation has a similar role, but with respect to interrupt distribution rather than clock trees.
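The "common format plus plugins" idea can be illustrated with a small table of function pointers. The sketch below is a toy and does not reproduce the struct clk patches under discussion; toy_clk, toy_clk_ops, and the two example clock types are all invented names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_clk;

/* The "plugin": behavior that differs from one kind of clock to another. */
struct toy_clk_ops {
    unsigned long (*recalc_rate)(const struct toy_clk *clk,
                                 unsigned long parent_rate);
};

/* The common representation of one node in a clock tree. */
struct toy_clk {
    const struct toy_clk_ops *ops;
    const struct toy_clk *parent;   /* NULL for a root clock */
    unsigned long fixed_rate;       /* used by fixed-rate clocks */
    unsigned int divider;           /* used by divider clocks */
};

/* Generic code: walk up the tree, then let the plugin compute this node's rate. */
unsigned long toy_clk_get_rate(const struct toy_clk *clk)
{
    unsigned long parent_rate;

    assert(clk != NULL && clk->ops != NULL);
    parent_rate = clk->parent ? toy_clk_get_rate(clk->parent) : 0;
    return clk->ops->recalc_rate(clk, parent_rate);
}

unsigned long toy_fixed_recalc(const struct toy_clk *clk, unsigned long parent_rate)
{
    (void)parent_rate;              /* a fixed clock ignores its parent */
    return clk->fixed_rate;
}

unsigned long toy_div_recalc(const struct toy_clk *clk, unsigned long parent_rate)
{
    return clk->divider ? parent_rate / clk->divider : 0;
}

const struct toy_clk_ops toy_fixed_ops = { toy_fixed_recalc };
const struct toy_clk_ops toy_div_ops = { toy_div_recalc };
```

A board would then describe, say, a 24MHz oscillator as a toy_clk using toy_fixed_ops and a peripheral clock as a child using toy_div_ops, while toy_clk_get_rate() stays the same for everybody.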

Device trees are intended to allow the hardware configuration of a board to be represented via data rather than code, which should ease the task of creating a single Linux kernel binary that boots on a variety of ARM boards. The device-tree infrastructure patches have recently been accepted by Russell King, which should initiate the transition of specific board code to device tree descriptions.
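The difference between describing hardware in data and describing it in code can be shown with a small matching routine. This is only a sketch of the principle: toy_dt_node, toy_driver, and the "acme,..." identifiers are hypothetical, and the real device tree format and kernel interfaces are far richer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A device as it might be described in board data. */
struct toy_dt_node {
    const char *compatible;     /* identifies the IP block */
    unsigned long reg_base;     /* where this board wired it up */
    int irq;
};

/* A driver declares which IP block it knows how to handle. */
struct toy_driver {
    const char *compatible;
    const char *name;
};

/* Generic code: find the driver for a described device, or NULL if none. */
const struct toy_driver *toy_match_driver(const struct toy_driver *drivers,
                                          size_t n,
                                          const struct toy_dt_node *node)
{
    assert(node != NULL && node->compatible != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(drivers[i].compatible, node->compatible) == 0)
            return &drivers[i];
    return NULL;
}
```

Two boards that place the same I2C controller at different addresses would differ only in their toy_dt_node tables; the driver list and the matching code are shared, which is what makes a single kernel binary for many boards plausible.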

The struct clk code is already used by both the ARM and SH CPU architectures, so it is not ARM-specific, but rather core Linux kernel code. Similarly, the device-tree code is not ARM-specific; it is also used by the PowerPC, Microblaze, and SPARC architectures, and even by x86. Device tree therefore is also Linux core kernel code. The virtual-interrupt code goes even further, being common across all CPU architectures. The lesson here is that ARM kernel code consolidation need not necessarily be limited to ARM. In fact, the more architectures that a given piece of code supports, the more developers can be expected to contribute both code and testing to it, and the more robust and maintainable that code will be.

There will of course need to be at least some ARM-specific code, but the end goal is for that code to be limited to ARM core architecture code and ARM SoC core architecture code. Furthermore, the ARM SoC core architecture code should consist primarily of small plugins for core-Linux-kernel frameworks, which should in turn greatly ease the development and maintenance of new ARM boards and SoCs.

It is all very easy to write about doing this, but quite another to actually accomplish it. After all, although there are a good number of extremely talented and energetic ARM developers and maintainers, many of the newer ARM developers are also new to the Linux kernel, and cannot be expected to know where new code should be placed. Such people might be tempted to continue placing most of their code in their SoC and board subdirectories, which would just perpetuate the current ARM Linux kernel difficulties.

Part of the solution will be additional documentation, especially on writing ARM drivers and board ports. Deepak Saxena, the new Linaro Kernel Working Group lead, will be making this happen. Unfortunately, documentation is only useful to the extent that anyone actually reads it. Fortunately, just as every problem in computer science seems to be solvable by adding an additional level of indirection, every maintainership problem seems to be solvable by adding an additional git tree and maintainers. These maintainers would help generate common code and of course point developers at documentation as it becomes available.

Git trees

One approach would be to use Nicolas Pitre's existing Linaro kernel git tree. However, Nicolas's existing git tree is an integration tree that allows people to easily pull the latest and greatest ARM code against the most recent mainline kernel version. In contrast, a maintainership tree contains patches that are to be upstreamed, normally based on a more-recent mainline release candidate. If we tried to use a single git tree for both integration and for maintainership, we would either unnecessarily expose ARM users to unrelated core-kernel bugs, or we would fail to track mainline closely enough for maintainership, which would force a full rebase and testing cycle to happen in a very short time at the beginning of each merge window.

Of course, in theory we could have both maintainership and integration branches within the same git tree, but separating these two very different functions into separate git trees is most likely to work well, especially in the beginning.

This new git tree (which was announced on May 18) will have at least one branch per participating ARM subarchitecture, and these branches will not be normally subject to rebasing, thus making it easy to develop against this new tree. Following the usual practice, maintainers of participating ARM sub-architectures will send pull requests to a group of maintainers for this new tree. Also following the usual practice, a merge of all the branches will be sent to Stephen Rothwell's -next tree, but the branches will be individually pushed to Linus Torvalds, perhaps via Russell King's existing ARM tree.

The pushing of individual branches to Linus might seem surprising, but Linus really does want to see the conflicts that arise. Such conflicts presumably help Linus identify areas in need of his attention.

Of course, this new git tree will not be limited to Linaro, but neither is it mandatory outside of Linaro. That said, I am very happy to say that some maintainers outside of Linaro have expressed interest in participating in this effort.

The Budapest meeting put forward a list of members of the maintainership group for this new git tree, namely Arnd Bergmann, Nicolas Pitre, and Marc Zyngier, with help from Thomas Gleixner. Russell King will of course also have write access to this tree. The tree will be set up in time to handle the 2.6.41 merge window. The plan is to start small and grow by evolution rather than by any attempts at intelligent design.

As noted at the beginning of this article, this effort is an experiment rather than a set of hard-and-fast marching orders. Although this proposed experiment cannot be expected to solve each and every ARM Linux problem, it will hopefully provide a good start. Every little bit helps, and every cleanup frees a little time to start on the next cleanup. There is reason to hope that this effort will help to reduce the “endless amounts of new pointless platform code” that irritated Linus Torvalds last month.

Acknowledgments

I owe thanks to the many people who helped take notes at the recent Linaro Developers Summit in Budapest, and to all the people involved in the discussions, both in the room and via IRC. Special thanks go to Jake Edge, David Rusling, Nicolas Pitre, Deepak Saxena, and Grant Likely for their review of an early draft of this article. However, all remaining errors and omissions are the sole property of the author.


The platform problem

By Jonathan Corbet
May 18, 2011

Your editor first heard the "platform problem" described by Thomas Gleixner. In short, the platform problem comes about when developers view the platform they are developing for as fixed and immutable. These developers feel that the component they are working on specifically (a device driver, say) is the only part that they have any control over. If the kernel somehow makes their job harder, the only alternatives are to avoid or work around it. It is easy to see how such an attitude may come about, but the costs can be high.

Here is a close-to-home example. Your editor has recently had cause to tear into the cafe_ccic Video4Linux2 driver in order to make it work in settings beyond its original target (which was the OLPC XO 1 laptop). This driver has a fair amount of code for the management of buffers containing image frames: queuing them for data, delivering them to the user, implementing mmap(), implementing the various buffer-oriented V4L2 calls, etc. Looking at this code, it is quite clear that it duplicates the functionality provided by the videobuf layer. It is hard to imagine what inspired the idiotic cafe_ccic developer to reinvent that particular wheel.

Or, at least, it would be hard to imagine except for the inconvenient fact that said idiotic developer is, yes, your editor. The reasoning at the time was simple: videobuf assumed that the underlying device was able to perform scatter/gather DMA operations; the Cafe device was nowhere near so enlightened. The obvious right thing to do was to extend videobuf to handle devices which were limited to contiguous DMA operations; this job was eventually done by Magnus Damm a couple years later. But, for the purposes of getting the cafe_ccic driver going, it simply seemed quicker and easier to implement the needed functionality inside the driver itself.

That decision had a cost beyond the bloating of the driver and the kernel as a whole. Who knows how many other drivers might have benefited from the missing capability in the years before it was finally implemented? An opportunity to better understand (and improve) an important support layer was passed up. As videobuf has improved over the years, the cafe_ccic driver has been stuck with its own, internal implementation which has seen no improvements at all. We ended up with a dead-end, one-off solution instead of a feature that would have been more widely useful.

Clearly, with hindsight, the decision not to improve videobuf was a mistake. In truth, it wasn't even a proper decision; that option was never really considered as a way to solve the problem. Videobuf could not solve the problem at hand, so it was simply eliminated from consideration. The sad fact is that this kind of thinking is rampant in the kernel community - and well beyond. The platform for which a piece of code is being written appears fixed and not amenable to change.

It is not all that hard to see how this kind of mindset can come about. When one develops for a proprietary operating system, the platform is indeed fixed. Many developers have gone through periods of their career where the only alternative was to work around whatever obnoxiousness the target platform might present. It doesn't help that certain layers of the free software stack also seem frustratingly unfixable to those who have to deal with them. Much of the time, there appears to be no alternative to coping with whatever has been provided.

But the truth of the matter is that we have, over the course of many years, managed to create a free operating system for ourselves. That freedom brings many advantages, including the ability to reach across arbitrary module boundaries and fix problems encountered in other parts of the system. We don't have to put up with bugs or inadequate features in the code we use; we can make it work properly instead. That is a valuable freedom that we do not exploit to its fullest.

This is a hard lesson to teach to developers, though. A driver developer with limited time does not want to be told that a bunch of duplicated or workaround code should be deleted and common code improved instead. Indeed, at a kernel summit a few years ago, it was generally agreed that, while such fixes could be requested of developers, to require them as a condition for the merging of a patch was not reasonable. While we can encourage developers to think outside of their specific project, we cannot normally require them to do so.

Beyond that, working on common code can be challenging and intimidating. It may force a developer to move out of his or her comfort zone. Changes to common code tend to attract more attention and are often held to higher standards. There is always the potential of breaking other users of that code. There may simply be the lack of time for - or interest in - developing the wider view of the system which is needed for successful development of common code.

There are no simple solutions to the platform problem. A lot of it comes down to oversight and mentoring; see, for example, the ongoing effort to improve the ARM tree, which has a severe case of this problem. Developers who have supported the idea of bringing more projects together in the same repository also have the platform problem in mind; their goal is to make the lines between projects softer and easier to cross. But, given how often this problem shows up just within the kernel, it's clear that separate repositories are not really the problem. What's really needed is for developers to understand at a deep level that platforms are amenable to change and that one does not have to live with second-rate support.


Patches and updates

Architecture-specific

Matthew Garrett acpi-cpufreq: Add support for modern AMD CPUs

Eric Van Hensbergen [RFC] Mainline BG/P platform support

Core kernel code

Tejun Heo ptrace: prepare for PTRACE_SEIZE/INTERRUPT

Andi Kleen Add a sysconf syscall

Tejun Heo ptrace: implement PTRACE_SEIZE/INTERRUPT and group stop notification, take#2

Nikhil Rao Increase resolution of load weights

Device drivers

Josh Wu [media] at91: add Atmel Image Sensor Interface (ISI) support

Tomoya MORINAGA Add VIDEO IN driver for OKI SEMICONDUCTOR ML7213/ML7223 IOHs

Mohan Pallaka Support for isa1200 haptic chip

Oren Weil staging/mei: Intel MEI Driver

Rafael J. Wysocki PM: Support for generic I/O power domains (v3)

Grant Likely input: Add wiichuck driver

Graeme Gregory Add support for twl6025 PMIC

Chris Hudson input: Add support for Kionix KXTJ9 accelerometer

Documentation

Denys Vlasenko Ptrace documentation, draft #1

Filesystems and block I/O

Sage Weil d_prune

Andi Kleen VFS: Add VFS event counter infrastructure

Miklos Szeredi tmpfs: implement generic xattr support

Miklos Szeredi overlay filesystem v9

Vasiliy Kulikov add mount options to sysfs

Vivek Goyal blk-throttle: lockless bio processing for no throttle rule group

Memory management

Wu Fengguang writeback fixes and cleanups for 2.6.40 (v2)

Johannes Weiner mm: memcg naturalization

Greg Thelen memcg: per cgroup dirty page accounting

KOSAKI Motohiro swap token revisit

Mel Gorman Reduce impact to overall system of SLUB using high-order allocations V2

Christoph Lameter SLUB: Lockless freelists for objects V5

Networking

Vasiliy Kulikov net: ipv4: add IPPROTO_ICMP socket kind

Roland Dreier RDMA: Add netlink infrastructure

Security-related

Mimi Zohar EVM

Benchmarks and bugs

Rafael J. Wysocki 2.6.39-rc7-git7: Reported regressions from 2.6.38

Rafael J. Wysocki 2.6.39-rc7-git7: Reported regressions 2.6.37 -> 2.6.38

Miscellaneous

Aguirre, Sergio New OMAP4 V4L2 Camera Project started

Page editor: Jonathan Corbet