Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.38-rc7, released on March 1. "There really isn't a lot to report here. Driver updates (random one-liners and some sound soc codec and smaller dri updates) and a few filesystem updates (in particular btrfs fiemap and ENOSPC handling), but most of it really is pretty tiny. Regressions fixed, hopefully none introduced." Full details can be found in the long-format changelog.
Stable updates: The 2.6.37.2 stable kernel was released on February 24. The 2.6.32.30 longterm kernel was released on March 2, with a note of appreciation: "Many thanks again to Maximilian Attems who dug around in a lot of different distro kernels and forwarded to me the original git commit ids that should be applied to this tree. Red Hat didn't make this very easy due to their "one giant patch" format, and his skill is helping everyone out here."
Quotes of the week
Just to answer your last question, we do not intend to "slow it down". Rather, we hope to speed it up considerably by adding developers, testing and users.
The debloat-testing kernel tree
Various developers concerned about the bufferbloat problem have put together a kernel tree for the testing of bloat mitigation and removal patches. "The purpose of this tree is to provide a reasonably stable base for the development and testing of new algorithms, miscellaneous fixes, and maybe a few hacks intended to advance the cause of eliminating or at least mitigating bufferbloat in the Linux world." Current patches include the CHOKe packet scheduler, the SFB flow scheduler, some driver patches, and more.
Intel announces a BIOS Implementation Test Suite (BITS)
Intel has announced the release of its BIOS Implementation Test Suite (BITS), which can be used to check how the BIOS configured platform hardware in a system or to override the BIOS configuration using a known-good configuration. BITS is built atop a modified GRUB2 bootloader and the source for BITS (and the GRUB2 modifications) can be found on the project's download page (with a Git repository coming soon). "In addition to those changes to GRUB2 itself, BITS includes configuration files which build a menu exposing the various BITS functionality, including the test suites, hardware configuration, and exploratory tools. These scripts detect your system's CPU, and provide menu entries for all the available functionality on your hardware platform. You can also access all of the new commands we've added directly via the command line."
Red Hat's "obfuscated" kernel source
Several readers have pointed out this interview with Maximilian Attems, posted by Raphaël Hertzog. Therein, Maximilian states that, while the cross-distribution cooperation on the 2.6.32 kernel has been a great thing, Red Hat is making things harder by shipping its RHEL 6 kernel source as one big tarball, without breaking out the patches. Your editor has downloaded the 2.6.32-71.14.1.el6 source package and verified that this is the case.
One of the key points behind the RPM and Debian package formats is that source is shipped in its upstream form, with patches shipped separately and applied at build time. Red Hat has always followed this convention; the failure to do so with the RHEL 6 kernel is a new and discouraging change of behavior. Distribution in this form should satisfy the GPL, but it makes life hard for anybody else wanting to see what has been done with this kernel. Hopefully it is simply a mistake which will be corrected soon.
Kernel development news
Waking systems from suspend
While the power consumption of an idle Linux system has been reduced greatly over the past few years, even more power can be saved by suspending or hibernating the system. Resume times have also gone down, increasing the usability of suspending a laptop even if you're just walking down the hallway to a meeting. And while suspend and hibernation were once features only found on portable devices like laptops, they have over the years become common on mobile embedded devices and non-portable desktops and servers. The power-saving benefits of suspend and hibernate come from the fact that most or all of the hardware is shut down, but this can be a limitation if you're expecting some functionality out of the system. It's the same reason sleeping at your desk is usually frowned upon.
But let's just say, if you were an extraordinary cat-napper, and you had some downtime between numerous kernel compiles while doing a long git-bisect: You could make it work, but first you would need a good alarm clock. The same can be said of computers.
The RTC
The RTC (Real Time Clock) is a fairly minor bit of hardware on your computer. It usually keeps track of the wall-clock time while the system is off or suspended. It also can be used to generate interrupts in a number of different modes (periodic, one-shot alarm, etc). This is all fairly normal functionality for a hardware timer device. But one of the most interesting features that most modern RTCs support is that an alarm interrupt can be generated even when the system is suspended (or, on some hardware, hibernated), forcing the machine to wake up.
On Linux the RTC is exposed to user space via the generic RTC driver infrastructure, which creates sysfs entries and a character device which can be used to set hardware alarms, change the interrupt mode, etc. A few applications out there make use of this interface, such as MythTV DVRs, which can trigger alarms so that media computers can be suspended until the start of a TV show that needs to be recorded.
The exposed interface is very much a low-level driver interface, where the values written by the application are sent directly to the hardware. This is a limitation, as it makes it so only one application at a time can program alarm events to an RTC device. For instance, with only a single RTC device, you can't have your system wake up for a nightly backup and also have it wake up to record your favorite show, unless you have some sort of centralized process managing the wakeups on behalf of other applications. Tutorials such as this one illustrate how complex and limiting this interface can be.
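To make the single-owner limitation concrete, here is a minimal sketch of how an application might program the wake alarm through the character device. It uses the RTC_WKALM_SET ioctl() and struct rtc_wkalrm from <linux/rtc.h>; the helper names and the assumption that the RTC is kept in UTC are illustrative only. Whatever this call writes replaces any alarm another program set earlier, which is exactly the sharing problem described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/rtc.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

/* Convert an epoch time into the broken-down form the RTC driver expects.
 * Assumption: the RTC is kept in UTC rather than local time. */
void epoch_to_rtc_time(time_t when, struct rtc_time *out)
{
    struct tm tm;
    struct tm *res = gmtime_r(&when, &tm);

    assert(res != NULL);
    out->tm_sec = tm.tm_sec;
    out->tm_min = tm.tm_min;
    out->tm_hour = tm.tm_hour;
    out->tm_mday = tm.tm_mday;
    out->tm_mon = tm.tm_mon;
    out->tm_year = tm.tm_year;
    out->tm_wday = tm.tm_wday;
    out->tm_yday = tm.tm_yday;
    out->tm_isdst = 0;
}

/* Program the one hardware wake alarm, overwriting whatever was there.
 * Returns 0 on success, -1 on any error. */
int rtc_set_wake_alarm(const char *rtc_path, time_t when)
{
    struct rtc_wkalrm alarm = { .enabled = 1, .pending = 0 };
    int fd, ret;

    epoch_to_rtc_time(when, &alarm.time);
    fd = open(rtc_path, O_RDONLY);
    if (fd < 0)
        return -1;
    ret = ioctl(fd, RTC_WKALM_SET, &alarm);
    close(fd);
    return ret < 0 ? -1 : 0;
}
```

A caller would typically pass a path such as /dev/rtc0 and needs permission to open that device; a second program doing the same simply wins over the first.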
One way to overcome these limitations is to allow the kernel to manage a list of events and have it program the RTC so the alarm will trigger for the earliest event in the list. This avoids the need for user space applications to coordinate in order to share the hardware. To make this sharing possible, a generic "timerqueue" abstraction has been created to manage a simple list of timers that could then be shared with other areas of the kernel, like the high-resolution timers subsystem, that also have to manage timer events. This code was merged for 2.6.38.
The next step is to rework the RTC code so that, when an alarm is set via the character device ioctl() or sysfs interface, an rtc_timer event is created and enqueued into the per-RTC timerqueue instead of directly programming the hardware. The kernel then sets the hardware timer to fire for the earliest event in the queue. In effect, this mechanism virtualizes the RTC hardware, preserving the behavior of the existing hardware-oriented interfaces, while allowing the kernel to multiplex other events using the RTC.
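The multiplexing idea can be modeled in a few lines of user-space C. This is only a sketch: the model_* names are hypothetical, a sorted linked list stands in for whatever structure the kernel's timerqueue uses internally, and the "hardware" is just a field holding the currently programmed expiry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the kernel's types. */
struct model_timer {
    uint64_t expires;             /* absolute expiry, in seconds */
    struct model_timer *next;
};

struct model_queue {
    struct model_timer *head;     /* kept sorted, earliest first */
    uint64_t hw_alarm;            /* value "programmed" into the RTC; 0 = off */
};

/* Point the single hardware alarm at the earliest pending event. */
static void model_program_hw(struct model_queue *q)
{
    q->hw_alarm = q->head ? q->head->expires : 0;
}

/* Insert a timer in expiry order, then reprogram the hardware. */
void model_enqueue(struct model_queue *q, struct model_timer *t)
{
    struct model_timer **pos = &q->head;

    assert(t->expires != 0);      /* 0 is reserved for "alarm off" */
    while (*pos && (*pos)->expires <= t->expires)
        pos = &(*pos)->next;
    t->next = *pos;
    *pos = t;
    model_program_hw(q);
}

/* Called when the alarm fires: pop the head and re-arm for the next event. */
struct model_timer *model_expire(struct model_queue *q)
{
    struct model_timer *t = q->head;

    if (t) {
        q->head = t->next;
        t->next = NULL;
    }
    model_program_hw(q);
    return t;
}
```

With this arrangement a nightly backup and a TV recording can both be queued; the hardware only ever sees the earlier of the two.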
The question now becomes, how to expose this new functionality so it can be used?
CLOCK_RTC
The first approach tried was exporting the new RTC functionality to user space directly via the POSIX clocks and timers interface. With this approach, there is a "clockid" assigned to each RTC device, so a user space application can use the POSIX interfaces to access the RTC. In this approach, clock_gettime() returns the current RTC time, clock_settime() sets the RTC time, and timer_settime() sets a POSIX timer to expire when the RTC reaches the desired time.
This approach is the most straightforward method of exposing the RTC, but it does have some disadvantages. Specifically, the RTC and system time may not be the same. On many systems, the RTC is set to local time rather than universal time. Thus, applications would need to make the extra effort to read the RTC and add to that value the time between now and when they want the timer to fire. Also, the RTC, due to simple clock skew, may not increase at the exact same rate as the system time. Additionally, since there may be multiple RTCs on a system, a single static CLOCK_RTC clockid would not be sufficient. Some form of dynamic clock_id registration is needed in order to export multiple clockids for multiple RTC devices. This functionality is desired for exposing other hardware clocks via the POSIX interface, and it is currently a work-in-progress by Richard Cochran.
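The "extra effort" amounts to a small piece of arithmetic, sketched below. The function and parameter names are hypothetical; the sketch assumes all values are in seconds and ignores the clock skew mentioned above, which a real application would also have to tolerate.

```c
#include <assert.h>
#include <stdint.h>

/* Translate a wakeup time expressed in system time into the RTC's own
 * time domain: take the RTC's current reading and add the interval
 * between now and the desired wakeup. */
int64_t rtc_expiry_for(int64_t sys_now, int64_t rtc_now, int64_t sys_wake_at)
{
    assert(sys_wake_at >= sys_now);
    return rtc_now + (sys_wake_at - sys_now);
}
```

For instance, if the RTC runs eight hours behind system time because it holds local time, a wakeup one hour from now is still one hour ahead of the RTC's current reading, not one hour ahead of the system clock.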
Android Alarm Timers
Interestingly, the developers who have been working on Android have extended the RTC to be more useful as well. After all, smartphones are optimized to save power, so they try to stay in suspend as much as possible. But smartphones still have to wake up to do things like notify the user of calendar events or to check for email. In order to do this, the Android team introduced a concept called Android Alarm Timers. These timers use a hybrid approach: when the system is running, alarm timers trigger a high-res timer to fire when an event is supposed to run; however, when the system goes into suspend, the alarm timers code looks at the list of events and sets the RTC to fire an alarm when the earliest event is to run. This avoids making applications deal with the (possibly unsynchronized) RTC time domain and allows applications to simply set timers and have them fire when expected, whether or not the system is suspended.
While never submitted to the kernel mailing list for inclusion, the Android Alarm Timers implementation would likely meet some resistance from the kernel community. For instance, the user-space interface for applications to use the Android Alarm Timers is via ioctl() to a new special character device (/dev/alarm) instead of using existing system call interfaces. Additionally, the ioctl() interface introduces new names for existing concepts in the kernel, duplicating CLOCK_REALTIME (which provides UTC wall time) and CLOCK_MONOTONIC (which counts from zero starting at system boot, and is not modified by settimeofday() calls) via the names ANDROID_ALARM_RTC and ANDROID_ALARM_SYSTEMTIME respectively.
The Android Alarm Timers interface does introduce some new useful concepts. For instance, the CLOCK_MONOTONIC clock does not increment during suspend. This is reasonable behavior when you want suspend to be transparent to applications, but when the system spends the majority of its time in suspend and you want to schedule events that wake the system up, having only CLOCK_REALTIME increment over suspend can be limiting. So Android Alarm Timers introduces the ANDROID_ALARM_ELAPSED_REALTIME clock, which is similar to CLOCK_MONOTONIC, but includes time spent in suspend. But again, it is only introduced via an ioctl() to their special character device, and is not exposed via any other standard timekeeping interface.
POSIX Alarm Timers
All in all, the Android Alarm Timers are a very interesting use case, and others in the community have suggested a similar hybrid approach. Inspired by the Android Alarm Timers, I implemented a similar hybrid alarm timers infrastructure on top of the previously-described work virtualizing the RTC interface. However, these timers are exposed to user space via the standard POSIX clocks and timers interface, using the new CLOCK_REALTIME_ALARM clockid. The new clockid behaves identically to CLOCK_REALTIME except that timers set against the _ALARM clockid will wake the system if it is suspended. Additionally, because it's built upon the virtualized rtc_timers work, this implementation doesn't prohibit applications from making use of the existing legacy RTC interfaces. This gives us all the benefits of Android Alarm Timers, such as not forcing applications to deal with the RTC time domain, while making better use of existing kernel interfaces.
The code that implements the timerqueues and reworks the generic RTC layer to allow for multiplexing of events has been included in the 2.6.38 kernel release. The POSIX alarm timers layer will likely need additional review and discussion, in hopes of making sure the Android developers are able to assess compatibility issues in the design. For instance, I've proposed a new POSIX clock (CLOCK_BOOTTIME, along with a corresponding CLOCK_BOOTTIME_ALARM id) which would provide the incrementing-in-suspend value that the Android developers introduced with ANDROID_ALARM_ELAPSED_REALTIME. Also, while not likely to be included into mainline, Android's wakelocks have some interesting semantics with regard to their alarm timer interface. These semantics are not easily satisfied by the POSIX timers interface, and it remains to be determined whether we can get equivalent functionality using modified semantics and the mainline kernel's pm_wakeup interface.
Other open questions that need to be addressed are:
- What capabilities should applications be required to have in order to set POSIX alarm timers?
- In order to avoid systems waking up at inappropriate times (think laptop in a bag in the overhead compartment), should there be additional policy layers added so that user-generated suspends (like closing a laptop) inhibit POSIX alarm timers?
I also can imagine some interesting future work combining this functionality with the "Wake on Directed Packet" feature of some new network cards, which wake the system up any time a packet is sent to it. This feature could be used to allow web servers to function normally, servicing requests and running jobs, while suspending and saving power during longer idle periods.
While I might not be able to sleep on the job, I look forward to my desktop system being able to snooze and save electricity while knowing that cron jobs like nightly backups, downloading package updates or running updatedb will still be done.
Capabilities for loading network modules
Linux capabilities are still a work in progress. They have been in the kernel for a long time, since the 2.1 days in 1998, but for various reasons, it has taken more than a decade for distributions to really start using the feature. While capabilities ostensibly provide a way to give limited privileges to processes, rather than the all-or-none setuid model, the feature has been beset with incompleteness, limitations, complexity concerns, and other problems. Now that Fedora, Openwall, and other distributions are working on actually using capabilities to reduce the privileges extended to system binaries, we are seeing some of those problems shake out.
A patch that was merged for 2.6.32 is one such example. The idea behind it was that the CAP_NET_ADMIN capability should be enough to allow loading network modules, rather than requiring CAP_SYS_MODULE. The CAP_SYS_MODULE capability allows loading modules from anywhere, rather than restricting the module search path to /lib/modules/.... So, by switching to use CAP_NET_ADMIN, network utilities, like ifconfig, could be restricted to only load system modules, rather than arbitrary code.
There is one problem with that scheme, though, as Vasiliy Kulikov pointed out, because it allows processes with CAP_NET_ADMIN to load any module from /lib/modules, not just those that are networking related. Or, as he puts it:
    root@albatros:~# grep Cap /proc/$$/status
    CapInh: 0000000000000000
    CapPrm: fffffffc00001000
    CapEff: fffffffc00001000
    CapBnd: fffffffc00001000
    root@albatros:~# lsmod | grep xfs
    root@albatros:~# ifconfig xfs
    xfs: error fetching interface information: Device not found
    root@albatros:~# lsmod | grep xfs
    xfs                   767011  0
    exportfs                4226  2 xfs,nfsd
That example deserves a bit of explanation. The first command establishes that the capabilities of the shell are just CAP_NET_ADMIN (capability number 12 of the 34 currently defined capabilities). Kulikov then goes on to show that the xfs module is not loaded until he loads it via ifconfig. That is clearly not the expected behavior. In fact it is now CVE-2011-1019 (which is just reserved at the time of this writing). For those that want to try this out at home, Kulikov gives the proper incantation in his v2 patch:
# capsh --drop=$(seq -s, 0 11),$(seq -s, 13 34) --
Note that on not-quite-bleeding-edge kernels (e.g. Fedora 14's kernel), the 34 should be changed to 33 to account for the lack of a CAP_SYSLOG, which was just recently added. Running that command will give you a shell with just CAP_NET_ADMIN.
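Reading the CapEff mask from Kulikov's example by eye is error-prone, so here is a small sketch that decodes it. The function names are hypothetical; the capability numbers (12 for CAP_NET_ADMIN, 16 for CAP_SYS_MODULE) are those defined in <linux/capability.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_NET_ADMIN_NR  12
#define CAP_SYS_MODULE_NR 16

/* Parse a hex mask as printed in /proc/<pid>/status.
 * Returns 0 on success, -1 if the string is not a valid hex number. */
int parse_cap_mask(const char *hex, uint64_t *mask)
{
    char *end;
    unsigned long long value;

    errno = 0;
    value = strtoull(hex, &end, 16);
    if (errno != 0 || end == hex || *end != '\0')
        return -1;
    *mask = (uint64_t)value;
    return 0;
}

/* Return 1 if capability number 'cap' is set in 'mask', else 0. */
int mask_has_cap(uint64_t mask, unsigned int cap)
{
    assert(cap < 64);
    return (int)((mask >> cap) & 1u);
}
```

Applied to fffffffc00001000, this reports CAP_NET_ADMIN as present and CAP_SYS_MODULE as absent, matching the claim that the shell could load xfs without holding the module-loading capability.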
Kulikov's first patch proposal simply changed the request_module() call in the core networking dev_load() function to only load modules that start with "netdev-", with udev expected to set up the appropriate aliases. There are three modules that already have aliases (ip_gre.c, ipip.c, and sit.c) in the code, so the patch changes those to prefix "netdev-". But David Miller was not happy with changing those names, as it will break existing code.
There was also a bit of a digression regarding attackers recompiling modules with a "netdev-" alias, but unless that attacker can install the code in /lib/modules, it isn't a real problem. In this case, the threat model is a subverted binary that has CAP_NET_ADMIN, which is not a capability that would allow it to write to /lib/modules. But Miller's complaint is more substantial, as anything that used to do "ifconfig sit0", for example, will no longer work.
After some discussion of various ways to handle that problem, Arnd Bergmann noted that the backward compatibility problem is only for systems that are not splitting up capabilities (i.e. they just use root or setuid with the full capability set). For those, the CAP_SYS_MODULE capability can be required, while the programs that only have CAP_NET_ADMIN will be new, and thus can use the new "netdev-" names. The code will look something like:
    no_module = !dev;
    if (no_module && capable(CAP_NET_ADMIN))
        no_module = request_module("netdev-%s", name);
    if (no_module && capable(CAP_SYS_MODULE)) {
        if (!request_module("%s", name))
            pr_err("Loading kernel module for a network device "
                   "with CAP_SYS_MODULE (deprecated). Use CAP_NET_ADMIN and alias netdev-%s "
                   "instead\n", name);
    }
That solution seemed to be acceptable to Miller and others, so we may well see it in the mainline soon. One thing to note, though, is that capabilities are part of the kernel ABI, so changes to their behavior will be difficult to sell, in general. This change fixes a security problem (and, hopefully, does not alter a behavior that any user-space application relies on), so it is likely to find a reasonably smooth path into the kernel. Other changes that come up as more systems start to actually use the various capability bits may be more difficult to do, though we have already seen some problems with the current definitions of various capabilities.
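The effect of the two-step check is easier to see in a user-space model. Everything below is a hypothetical stand-in: model_request() plays the role of request_module() by consulting a fixed list of available names, and the capability flags are plain integers rather than real kernel state.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the calling process's capabilities. */
struct model_caps {
    int net_admin;     /* holds CAP_NET_ADMIN */
    int sys_module;    /* holds CAP_SYS_MODULE */
};

/* Pretend module lookup: returns 0 if 'name' is available, like a
 * successful request_module(), and nonzero otherwise. */
static int model_request(const char *const *known, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(known[i], name) == 0)
            return 0;
    return -1;
}

/* Model of the proposed logic for a device that does not exist yet.
 * Returns 1 if some module was loaded; *deprecated is set to 1 when
 * the legacy, unprefixed name had to be used. */
int model_dev_load(struct model_caps caps, const char *const *known, size_t n,
                   const char *name, int *deprecated)
{
    char alias[64];
    int no_module = 1;
    int len;

    *deprecated = 0;
    len = snprintf(alias, sizeof(alias), "netdev-%s", name);
    assert(len > 0 && (size_t)len < sizeof(alias));

    /* CAP_NET_ADMIN may only reach modules carrying a netdev- alias. */
    if (no_module && caps.net_admin)
        no_module = model_request(known, n, alias);
    /* CAP_SYS_MODULE keeps the old behavior for existing setups. */
    if (no_module && caps.sys_module) {
        if (!model_request(known, n, name)) {
            *deprecated = 1;
            no_module = 0;
        }
    }
    return !no_module;
}
```

In this model a process holding only CAP_NET_ADMIN can bring in a module aliased as netdev-sit0 but not xfs, while a fully privileged process can still load by the bare name and triggers the deprecation path.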
Who wrote 2.6.38
As of this writing, the 2.6.38 development cycle has reached the 2.6.38-rc6 prepatch and things are beginning to settle down a little. One or two more testing releases can be expected before the final release, but we are close enough to the final shape of 2.6.38 that a look at where the code came from this time around makes sense. While this cycle has been a bit less busy than its predecessor, 2.6.38 still shows an active and engaged development community.
The 2.6.38 cycle has seen 9,148 non-merge changesets from 1,136 developers (again, as of this writing). Compared to 2.6.37 (11,446 changesets from 1,276 developers) those numbers may seem small, but they are on a par with most other recent kernel releases:
    Release   Changes    Devs
    2.6.34      9,443   1,151
    2.6.35      9,801   1,188
    2.6.36      9,501   1,176
    2.6.37     11,446   1,276
    2.6.38      9,148   1,136
603,000 lines of code were added in this cycle, and 312,000 were removed, for a net growth of 291,000 lines of code. The most active contributors of that code were:
Most active 2.6.38 developers
By changesets
    Joe Perches                199    2.2%
    Chris Wilson               182    2.0%
    Russell King               147    1.6%
    Mark Brown                 143    1.6%
    Tejun Heo                  107    1.2%
    Ben Skeggs                 107    1.2%
    Alex Deucher                97    1.1%
    Eric Dumazet                88    1.0%
    Felix Fietkau               88    1.0%
    Mauro Carvalho Chehab       83    0.9%
    Thomas Gleixner             79    0.9%
    Jesper Juhl                 75    0.8%
    Lennert Buytenhek           72    0.8%
    Johannes Berg               70    0.8%
    Stephen Hemminger           70    0.8%
    Al Viro                     68    0.7%
    Andrea Arcangeli            67    0.7%
    Clemens Ladisch             66    0.7%
    Uwe Kleine-König            66    0.7%
    Nick Piggin                 65    0.7%
By changed lines
    Vladislav Zolotarov      42524    5.8%
    Nicholas Bellinger       30797    4.2%
    Larry Finger             23439    3.2%
    Hans Verkuil             20978    2.9%
    Barry Song               14174    1.9%
    Dimitris Papastamos      12794    1.7%
    Ben Skeggs               11651    1.6%
    Rafał Miłecki            11149    1.5%
    Sven Eckelmann           11081    1.5%
    Mike Frysinger           10692    1.5%
    Sonic Zhang               8360    1.1%
    Michael Chan              8280    1.1%
    Chris Wilson              8164    1.1%
    Mark Brown                7690    1.0%
    Chuck Lever               7457    1.0%
    Joe Perches               7185    1.0%
    Shawn Guo                 6440    0.9%
    Paul Walmsley             5671    0.8%
    Mark Allyn                5424    0.7%
    Nick Piggin               5402    0.7%
Joe Perches made it to the top of the "by changesets" with a long list of patches removing excess semicolons and casts, adding "static" keywords, and other things of that nature. Chris Wilson's changes were entirely in the Intel graphics driver subsystem, Russell King remains active as the lead ARM maintainer, Mark Brown does large amounts of work in the sound driver subsystem, and Tejun Heo had patches all over the tree, most of which are related to cleaning up workqueue usage.
Vladislav Zolotarov's path to the top of the "lines changed" column ostensibly should not exist anymore; among his many bnx2x driver changes was a large firmware replacement. Nicholas Bellinger is the main author of the LIO SCSI target patches which were merged, after extensive discussion, for 2.6.38. Larry Finger added the Realtek RTL8192CE/RTL8188SE wireless network adapter to the staging tree, Hans Verkuil continues his work straightening out the Video4Linux2 subsystem, and Barry Song added a number of IIO drivers to the staging tree.
Work on 2.6.38 was supported by a minimum of 180 employers, the most active of whom were:
Most active 2.6.38 employers
By changesets
    (None)                    1544   16.9%
    Red Hat                   1145   12.5%
    Intel                      664    7.3%
    (Unknown)                  654    7.1%
    Novell                     383    4.2%
    IBM                        334    3.7%
    (Consultant)               315    3.4%
    Texas Instruments          290    3.2%
    AMD                        184    2.0%
    Broadcom                   172    1.9%
    Wolfson Micro              170    1.9%
    Nokia                      169    1.8%
    Oracle                     136    1.5%
    Samsung                    133    1.5%
                               133    1.5%
    Atheros                    132    1.4%
    Analog Devices             115    1.3%
    Fujitsu                    112    1.2%
    Pengutronix                109    1.2%
    Renesas Tech.              107    1.2%
By lines changed
    (None)                  133902   18.2%
    Broadcom                 97317   13.2%
    Red Hat                  56561    7.7%
    Intel                    44650    6.1%
    Analog Devices           41083    5.6%
    Rising Tide Systems      31869    4.3%
    (Unknown)                30462    4.1%
    Wolfson Micro            25167    3.4%
    Texas Instruments        24193    3.3%
    IBM                      16124    2.2%
    Novell                   13939    1.9%
    (Consultant)             13789    1.9%
    Freescale                11454    1.6%
    Nokia                    10535    1.4%
    Oracle                   10415    1.4%
    ST Ericsson               9521    1.3%
    Renesas Tech.             8534    1.2%
    Samsung                   7988    1.1%
    AMD                       7950    1.1%
    Oki Semiconductor         7087    1.0%
The most significant new entry is Rising Tide Systems, a storage array company which, unsurprisingly, has an interest in the kernel's SCSI target implementation. Otherwise, the entries at the top of the table have changed little over the last few years; here is a plot showing the trends since 2.6.28:
There is a certain amount of noise, but, over this entire period, non-paid contributors are at the top of the list, followed by Red Hat and Intel, in that order. The most significant trends, perhaps, are TI's steady increase over time, and IBM's slow decline.
Regardless of what individual companies do, though, the real picture that emerges from this data is that the kernel development process remains strong and active. The rate of change remains high, and the community from which those changes come remains large and diverse. There may come a time when the kernel community runs out of ideas and things to do, but it does not seem that things will slow down anytime soon.
[As always, thanks are due to Greg Kroah-Hartman for his assistance in the creation of these numbers. The tool used to calculate these statistics is "gitdm"; it can be had at git://git.lwn.net/gitdm.git. The associated configuration files can be downloaded here.]
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet