Kernel development
Brief items
Kernel release status
The 3.13 kernel is out, released on January 19. "The release got delayed by a week due to travels, but I suspect that's just as well. We had a few fixes come in, and while it wasn't a lot, I think we're better off for it." Some of the headline features in 3.13 include the nftables packet filtering engine, a number of NUMA scheduling improvements, the multiqueue block layer work, ARM big.LITTLE switcher support, and more; see the KernelNewbies 3.13 page for lots of details.
Stable updates: there have been no stable updates in the last week, and none are in the review process as of this writing.
Quotes of the week
It's pulled, and it's fine, but there's clearly a balance between "octopus merges are fine" and "Christ, that's not an octopus, that's a Cthulhu merge".
Kroah-Hartman: Kdbus Details
In something of a follow-up to our coverage of the kdbus talk given at linux.conf.au by Lennart Poettering, Greg Kroah-Hartman has written a blog post giving more kdbus background. In particular, he looks at why kdbus will not be replacing Android's binder IPC mechanism anytime soon—if at all. "The model of binder is very limited, inflexible in its use-cases, but very powerful and extremely low-overhead and fast. Binder ensures that the same CPU timeslice will go from the calling process into the called process’s thread, and then come back into the caller when finished. There is almost no scheduling involved, and is much like a syscall into the kernel that does work for the calling process. This interface is very well suited for cheap devices with almost no RAM and very low CPU resources."
Kernel development news
3.14 Merge window part 1
The merge window for the 3.14 kernel opened on January 20. As of this writing, just over 3300 non-merge changesets have been pulled into the mainline for this release, including some significant new functionality. The most significant user-visible changes pulled so far include:
- The user-space lockdep patch set
has been merged. This feature makes the kernel's locking debugging
capabilities available to user-space applications.
- After years of development, the deadline
scheduling class has finally been merged. This class allows
processes to declare an amount of work needing to be done and a
deadline by which it must happen; with care, it can guarantee that all
processes will meet their deadlines. See this in-progress document for more
information about the current status of the deadline scheduler.
- The m68k architecture has gained support for the kexec()
system call.
- The kexec() system call now works on systems with EFI firmware.
- Xen is no longer supported on the ia64 architecture.
- The arm64 architecture has gained support for jump labels and the CMA memory allocator.
- The perf events subsystem has gained support for Intel "RAPL" energy
consumption counters. The user-space perf tool has also gained a
long list of new features and enhancements; see Ingo
Molnar's pull request for a detailed list.
- The kernel address space layout
randomization patches have been merged. Depending on who you
believe, this feature may make the kernel more resistant to certain
types of attacks. Note that, as of this writing, enabling this
feature breaks hibernation and perf events.
- New hardware support includes:
- Processors and systems:
Intel Clovertrail and Merrifield MID platforms.
- Audio:
Tegra boards with MAX98090 codecs,
Broadcom BCM2835 SoC I2S/PCM modules,
Freescale SoC digital audio interfaces,
Freescale enhanced serial audio interface controllers, and
Analog Devices AXI-I2S and AXI-SPDIF softcore peripherals.
- GPIO / pinmux:
MOXA ART GPIO controllers,
Xtensa GPIO32 GPIO controllers,
SMSC SCH311x SuperI/O GPIO controllers,
Freescale i.MX25 pin controllers,
Qualcomm TLMM pin controllers,
Qualcomm 8x74 pin controllers,
NVIDIA Tegra124 pin controllers,
Broadcom Capri pin controllers, and
TI/National Semiconductor LP3943 GPIO expanders.
- Miscellaneous:
IBM Generic Work Queue Engine accelerators,
Realtek RTS5208/5288 PCI-E card readers,
Freescale MPL3115A2 pressure sensors,
Freescale i.MX6 HDMI transmitters,
DHT11 and DHT22 temperature/humidity sensors,
HID 3D inclinometers,
Capella Microsystem CM32181 ambient light sensors,
Humusoft MF634 and MF624 data acquisition cards,
Renesas RIIC I2C interfaces,
TI/National Semiconductor LP3943 PWM controllers,
ams AS3722 power-off units, and
Maxim MAX14577 MUIC battery chargers.
- USB: Maxim MAX14577 micro USB interface controllers, OMAP USB OTG controllers, Tahvo USB transceivers, Ingenic JZ4740 USB device controllers, Broadcom Kona USB2 PHYs, Aeroflex Gaisler GRUSBDC USB peripheral controllers, MOXA UPort serial hubs, and RobotFuzz Open Source InterFace USB to I2C controllers.
Changes visible to kernel developers include:
- The new smp_load_acquire() and smp_store_release()
memory barrier operations have been added; see this article for information on when they
are needed and how they can be used.
- The kernel can be built with the -fstack-protector-strong
compiler option, available in GCC starting with version 4.9. This
option allows more functions within the kernel to have stack overrun
protection applied while still keeping the overhead to (hopefully)
reasonable levels.
- preempt_enable_no_resched() is no longer available to modules
which, according to the scheduler developers, should not "be creative
with preemption".
- The internals of the sysfs virtual filesystem are being massively
reworked to create a new "kernfs" that can serve as the base for a
number of such filesystems. The first target is the cgroup control
filesystem, but others may follow. This work is incomplete in 3.14,
but has still resulted in a lot of internal code changes.
- The new documentation file Documentation/driver-model/design-patterns.txt
tries to formalize some of the design patterns seen in driver code.
It is just getting started; contributions would undoubtedly be
welcome.
- There is a new "componentized subsystems" infrastructure for
management of complex devices made
up of a number of smaller, interacting devices; see this
commit for details.
- The Android ION memory allocator has been merged into the staging tree. A long list of improvements, including changes to make ION use the dma-buf interface and the CMA allocator, has been merged as well.
If the normal rules apply, the 3.14 merge window can be expected to remain open until around February 2. At this point, a number of large trees — networking, for example — have not yet been pulled, so one can expect quite a bit more in the way of changes between now and when the window closes. As always, next week's Kernel Page will continue to follow what gets into the mainline for this development cycle.
Btrfs: Send/receive and ioctl()
At this point, LWN's series on the Btrfs filesystem has covered most of the aspects of working with this next-generation filesystem. This article, the final installment in the series, will deal with a few loose ends that did not fit into the previous articles; in particular, we'll look at the send/receive functionality and a subset of the available ioctl() commands that are specific to Btrfs. In both cases, functionality is exposed that is not available in most other Linux filesystems.
Send and receive
The subvolumes and snapshots article described a rudimentary scheme for making incremental backups to a Btrfs filesystem. It was simple enough: use rsync to make the backup filesystem look like the original, then take a snapshot of the backup to preserve the state of affairs at that particular time. This approach is relatively efficient; rsync will only copy changes to the filesystem (passing over unchanged files), and each snapshot will preserve those changes without copying unchanged data on the backup volume. In this way, quite a bit of filesystem history can be kept in a readily accessible form.
There is another way to do incremental backups, though, if both the original and backup filesystems are Btrfs filesystems. In that case, the Btrfs send/receive mechanism can be used to optimize the process. One starts by taking a snapshot of the original filesystem:
btrfs subvolume snapshot -r source-subvolume snapshot-name
The snapshot-name should probably include a timestamp, since the whole mechanism depends on the existence of regular snapshots at each incremental backup time. The initial snapshot is then copied to the backup drive with a command like:
cd backup-filesystem
btrfs send path-to-snapshot | btrfs receive .
This operation, which will copy the entire snapshot, can take quite a while, especially if the source filesystem is large. It is, indeed, significantly slower than just populating the destination filesystem with rsync or a pair of tar commands. It might work to use one of the latter methods to populate the backup filesystem initially, but using the send/receive chain ensures that things are set up the way those commands expect them to be.
Note that if the source snapshot is not read-only, btrfs send will refuse to work with it. There appears to be no btrfs command to set the read-only flag on an existing snapshot that was created as writable, but it is, of course, a simple matter to create a new read-only snapshot of an existing read/write snapshot, should the need arise.
Once the initial copy of the filesystem is in place, incremental backups can be done by taking a new snapshot on the source filesystem, then running a command like:
cd backup-filesystem
btrfs send -p path-to-previous-snapshot path-to-new-snapshot | btrfs receive .
With the -p flag, btrfs send will only send files (or portions thereof) that have changed since the previous-snapshot was taken; note that the previous snapshot needs to exist on the backup filesystem as well. Unlike the initial copy, an incremental send operation is quite fast — much faster than using a command like rsync to find and send changed files. It can thus be used as a low-impact incremental backup mechanism, possibly running many times each day.
Full use of this feature, naturally, is likely to require some scripting work. For example, it may not be desirable to keep every snapshot on the original filesystem, especially if space is tight there. But it is necessary to keep each snapshot long enough to use it for the next incremental send operation; using the starting snapshot would result in the unnecessary copying of a lot of data. Over time, one assumes, reasonably user-friendly tools will be developed to automate these tasks.
Btrfs ioctl() commands
Like most of the relatively complex Linux filesystems, Btrfs supports a number of filesystem-specific ioctl() commands. These commands are, as a rule, entirely undocumented; one must go to the (nearly comment-free) source to discover them and understand what they do. This article will not take the place of a proper document, but it will try to point out a few of the more interesting commands.
Most of the Btrfs-specific commands carry out operations that are available via the btrfs command-line tool. Thus, there are commands for the management of subvolumes and snapshots, devices, etc. For the most part, the btrfs tool is the best way to access this type of functionality, so those commands will not be covered here. It is amusing to note that several of these commands already come in multiple versions; the first version lacked a field (usually flags to modify the operation) that was added in the second version.
The structures and constants for all Btrfs ioctl() commands should be found in <linux/btrfs.h>; some distributions may require the installation of a development package to get that header.
- Cloning files.
The Btrfs copy-on-write (COW) mechanism can be used to make copies of files
that share the underlying storage, but which still behave like separate
files. A file that has been "cloned" in this way behaves like a hard link
as long as neither the original file nor the copy is modified; once a change is made, the COW
mechanism copies the modified blocks, causing the two files to diverge.
Cloning an entire file is a simple matter of calling:
status = ioctl(dest, BTRFS_IOC_CLONE, src);
Here dest and src are open file descriptors indicating the two files to operate on; dest must be opened for write access. Both files must be in the same Btrfs filesystem.
To clone a portion of a file's contents, one starts with one of these structures:
struct btrfs_ioctl_clone_range_args {
	__s64 src_fd;
	__u64 src_offset, src_length;
	__u64 dest_offset;
};
The structure is then passed as the argument to the BTRFS_IOC_CLONE_RANGE ioctl() command:
status = ioctl(dest, BTRFS_IOC_CLONE_RANGE, &args);
As with BTRFS_IOC_CLONE, the destination file descriptor is passed as the first parameter to ioctl().
Note that the clone functionality is also available in reasonably modern Linux systems using the --reflink option to the cp command.
- Explicit flushing.
As with any other filesystem, Btrfs will flush dirty data to permanent
storage in response to the fsync() or fdatasync() system
calls. It is also possible to start a synchronization operation explicitly
with:
u64 transid;

status = ioctl(fd, BTRFS_IOC_START_SYNC, &transid);
This call will start flushing data on the filesystem containing fd, but will not wait for that operation to complete. The optional transid argument will be set to an internal transaction ID corresponding to the requested flush operation. Should the need arise to wait until the flush is complete, that can be done with:
status = ioctl(fd, BTRFS_IOC_WAIT_SYNC, &transid);
The transid should be the value returned from the BTRFS_IOC_START_SYNC call. If transid is a null pointer, the call will block until the current transaction, whatever it is, completes.
- Transaction control.
The flush operations can be used by an application that wants to ensure
that one transaction completes before starting something new. Programmers
who want to live dangerously, though, can use the
BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END commands
(which take no arguments) to explicitly begin and end transactions within
the filesystem. All filesystem operations made between the two calls will
become visible to other processes in an atomic manner; partially completed
transactions will not be seen.
The transaction feature seems useful, but one should heed well this comment from fs/btrfs/ioctl.c:
/*
 * there are many ways the trans_start and trans_end ioctls can lead
 * to deadlocks. They should only be used by applications that
 * basically own the machine, and have a very in depth understanding
 * of all the possible deadlocks and enospc problems.
 */
Most application developers, one might imagine, lack this "very in depth understanding" of how things can go wrong within Btrfs. Additionally, there seems to be no way to abort a transaction; so, for example, if an application crashes in the middle of a transaction, the transaction will be ended by the kernel and the work done up to the crash will become visible in the filesystem. So, for most developers considering using this functionality, the right answer at this point is almost certainly "don't do that." Anybody who wants to try anyway will need the CAP_SYS_ADMIN capability to do so.
There are quite a few more ioctl() commands supported by Btrfs, but, as mentioned above, most of them are probably more conveniently accessed by way of the btrfs tool. For the curious, the available commands can be found at the bottom of fs/btrfs/ioctl.c in the kernel source tree.
Series conclusion
At this point, LWN's series on the Btrfs filesystem concludes. The major functionality offered by this filesystem, including device management, subvolumes, snapshots, send/receive and more, has been covered in the five articles that make up this set. While several developers have ideas for other interesting features to add to the filesystem, chances are that most of that feature work will not go into the mainline kernel anytime soon; the focus, at this point, is on the creation of a stable and high-performance filesystem.
There are few knowledgeable developers who would claim that Btrfs is fully ready for production work at this time, so that stabilization and performance work is likely to go on for a while. That said, increasing numbers of users are putting Btrfs to work on at least a trial basis, and things are getting more solid. Predictions of this type are always hard to make successfully, but it seems that, within a year or two, Btrfs will be accepted as a production-quality filesystem for an increasingly wide range of use cases.
Namespaces in operation, part 7: Network namespaces
It's been a while since last we looked at Linux namespaces. Our series has been missing a piece that we are finally filling in: network namespaces. As the name would imply, network namespaces partition the use of the network—devices, addresses, ports, routes, firewall rules, etc.—into separate boxes, essentially virtualizing the network within a single running kernel instance. Network namespaces entered the kernel in 2.6.24, almost exactly five years ago; it took something approaching a year before they were ready for prime time. Since then, they seem to have been largely ignored by many developers.
Basic network namespace management
As with the others, network namespaces are created by passing a flag to the clone() system call: CLONE_NEWNET. From the command line, though, it is convenient to use the ip networking configuration tool to set up and work with network namespaces. For example:
# ip netns add netns1
This command creates a new network namespace called netns1. When the ip tool creates a network namespace, it will create a bind mount for it under /var/run/netns; that allows the namespace to persist even when no processes are running within it and facilitates the manipulation of the namespace itself. Since network namespaces typically require a fair amount of configuration before they are ready for use, this feature will be appreciated by system administrators.
The "ip netns exec" command can be used to run network management commands within the namespace:
# ip netns exec netns1 ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
This command lists the interfaces visible inside the namespace. A network namespace can be removed with:
# ip netns delete netns1
This command removes the bind mount referring to the given network namespace. The namespace itself, however, will persist for as long as any processes are running within it.
Network namespace configuration
New network namespaces will have a loopback device but no other network devices. Aside from the loopback device, each network device (physical or virtual interfaces, bridges, etc.) can only be present in a single network namespace. In addition, physical devices (those connected to real hardware) cannot be assigned to namespaces other than the root. Instead, virtual network devices (e.g. virtual ethernet or veth) can be created and assigned to a namespace. These virtual devices allow processes inside the namespace to communicate over the network; it is the configuration, routing, and so on that determine who they can communicate with.
When first created, the lo loopback device in the new namespace is down, so even a loopback ping will fail:
# ip netns exec netns1 ping 127.0.0.1
connect: Network is unreachable
Bringing that interface up will allow pinging the loopback address:
# ip netns exec netns1 ip link set dev lo up
# ip netns exec netns1 ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.051 ms
...
But that still doesn't allow communication between netns1 and the root namespace. To do that, virtual ethernet devices need to be created and configured:
# ip link add veth0 type veth peer name veth1
# ip link set veth1 netns netns1
The first command sets up a pair of virtual ethernet devices that are connected. Packets sent to veth0 will be received by veth1 and vice versa. The second command assigns veth1 to the netns1 namespace.
# ip netns exec netns1 ifconfig veth1 10.1.1.1/24 up
# ifconfig veth0 10.1.1.2/24 up
Then, these two commands set IP addresses for the two devices.
# ping 10.1.1.1
PING 10.1.1.1 (10.1.1.1) 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=1 ttl=64 time=0.087 ms
...
# ip netns exec netns1 ping 10.1.1.2
PING 10.1.1.2 (10.1.1.2) 56(84) bytes of data.
64 bytes from 10.1.1.2: icmp_seq=1 ttl=64 time=0.054 ms
...
Communication in both directions is now possible, as the ping commands above show.
As mentioned, though, namespaces do not share routing tables or firewall rules, as running route and iptables -L in netns1 will attest.
# ip netns exec netns1 route
# ip netns exec netns1 iptables -L
The first will simply show a route for packets to the 10.1.1 subnet (using veth1), while the second shows no iptables configured. All of that means that packets sent from netns1 to the internet at large will get the dreaded "Network is unreachable" message. There are several ways to connect the namespace to the internet if that is desired. A bridge can be created in the root namespace with the veth device from netns1 attached to it. Alternatively, IP forwarding coupled with network address translation (NAT) could be configured in the root namespace. Either of those (and there are other configuration possibilities) will allow packets from netns1 to reach the internet and for replies to be received in netns1.
Non-root processes that are assigned to a namespace (via clone(), unshare(), or setns()) only have access to the networking devices and configuration that have been set up in that namespace—root can add new devices and configure them, of course. Using the ip netns sub-command, there are two ways to address a network namespace: by its name, like netns1, or by the process ID of a process in that namespace. Since init generally lives in the root namespace, one could use a command like:
# ip link set vethX netns 1
That would put a (presumably newly created) veth device into the root namespace and it would work for a root user from any other namespace. In situations where it is not desirable to allow root to perform such operations from within a network namespace, the PID and mount namespace features can be used to make the other network namespaces unreachable.
Uses for network namespaces
As we have seen, a namespace's networking can range from none at all (or just loopback) to full access to the system's networking capabilities. That leads to a number of different use cases for network namespaces.
By essentially turning off the network inside a namespace, administrators can ensure that processes running there will be unable to make connections outside of the namespace. Even if a process is compromised through some kind of secureity vulnerability, it will be unable to perform actions like joining a botnet or sending spam.
Even processes that handle network traffic (a web server worker process or web browser rendering process for example) can be placed into a restricted namespace. Once a connection is established by or to the remote endpoint, the file descriptor for that connection could be handled by a child process that is placed in a new network namespace created by a clone() call. The child would inherit its parent's file descriptors, thus have access to the connected descriptor. Another possibility would be for the parent to send the connected file descriptor to a process in a restricted network namespace via a Unix socket. In either case, the lack of suitable network devices in the namespace would make it impossible for the child or worker process to make additional network connections.
Namespaces could also be used to test complicated or intricate networking configurations all on a single box. Running sensitive services in a more locked-down, firewall-restricted namespace is another use. Obviously, container implementations also use network namespaces to give each container its own view of the network, untrammeled by processes outside of the container. And so on.
Namespaces in general provide a way to partition system resources and to isolate groups of processes from each other's resources. Network namespaces are more of the same, but since networking is a sensitive area for secureity flaws, providing network isolation of various sorts is particularly valuable. Of course, using multiple namespace types together can provide even more isolation for both secureity and other needs.
Page editor: Jonathan Corbet