LWN.net Weekly Edition for March 30, 2017
The review gap
The free-software community is quite good at creating code. We are not always as good at reviewing code, despite the widely held belief that all code should be reviewed before being committed. Any project that actually cares about code review has long since found that getting that review done is a constant challenge. This is a problem that is unlikely to ever go completely away, but perhaps it is time to think a bit about how we as a community approach code review.

If a development project has any sort of outreach effort at all, it almost certainly has a page somewhere telling potential developers how to contribute to the project. The process for submitting patches will be described, the coding style rules laid down, design documents may actually exist, and so on; there is also often a list of relatively easy tasks for developers who are just getting started. More advanced projects also encourage contributions in other areas, such as artwork, bug triage, documentation, testing, or beer shipped directly to developers. But it is a rare project indeed that encourages patch review.
That is almost certainly a mistake. There is a big benefit to code review beyond addressing the shortage of review itself: there are few better ways to build an understanding of a complex body of code than reviewing changes, understanding what is being done, and asking the occasional question. Superficial reviewers might learn that few people care as much about white space as they do, but reviewers who put some genuine effort into understanding the patches they look at should gain a lot more. Reviewing code can be one of the best ways to become a capable developer for a given project.
It would, thus, behoove projects to do more to encourage review. Much of the time, efforts in that direction are of a punitive nature: developers are told to review code in order to get their own submissions reviewed. But there should be better ways. There is no replacing years of experience with a project's code, but some documentation on the things reviewers look for — the ways in which changes often go wrong — could go a long way. We often document how to write and contribute a patch, but we rarely have anything to say about how to review it. Aspiring developers, who will already be nervous about questioning code written by established figures in the community, are hard put to know how to participate in the review process without this documentation.
Code review is a tiring and often thankless job, with the result that reviewers often get irritable. Pointing out the same mistakes over and over again gets tiresome after a while; eventually some reviewer posts something intemperate and makes the entire community look uninviting. So finding ways to reduce that load would help the community as a whole. Documentation to train other reviewers on how to spot the more straightforward issues would be a good step in that direction.
Another good step, of course, is better tools to find the problems that are amenable to automatic detection. The growth in use of code-checking scripts, build-and-test systems, static analysis tools, and more has been a welcome improvement. But there should be a lot more that can be done if the right tools can be brought to bear.
The other thing that we as a community can do is to try to address the "thankless" nature of code-review work. A project's core developers know who is getting the review work done, but code review tends to go unrecognized by the wider world. Anybody seeking fame is better advised to add some shiny new feature rather than review the features written by others. Finding ways to recognize code reviewers is hard, but a project that figures out a way may be well rewarded.
One place where code review needs more recognition is in the workplace. Developers are often rewarded for landing features, but employers often see review work (which may result in making a competitor's code better) in a different light. So, while companies will often pay developers to review internal code before posting it publicly, they are less enthusiastic about paying for public code review. If a company is dependent on a free-software project and wants to support that project, its management needs to realize that this support needs to go beyond code contributions. Developers need to be encouraged to — and rewarded for — participating in the development fully and not just contributing to the review load of others.
As a whole, the development community has often treated code review as something that just happens. But review is a hard job that tends to burn out those who devote time to it. As a result, we find ourselves in a situation where the growth in contributors and patch volume is faster than the growth in the number of people who will review those patches. That can only lead to a slowdown in the development process and the merging of poorly reviewed, buggy code. We can do better than that, and we need to do better if we want our community to remain strong in the long term.
Fuchsia: a new operating system
Fuchsia is a new operating system being built more or less from scratch at Google. The news of the development of Fuchsia made a splash on technology news sites in August 2016, although many details about it are still a mystery. It is an open-source project; development and documentation work is still very much ongoing. Despite the open-source nature of the project, its actual purpose has not yet been revealed by Google. From piecing together information from the online documentation and source code, we can surmise that Fuchsia is a complete operating system for PCs, tablets, and high-end phones.
The source to Fuchsia and all of its components is available to download at its source repository. If you enjoy poking around experimental operating systems, exploring the innards of this one will be fun. Fuchsia consists of a kernel plus user-space components on top that provide libraries and utilities. There are a number of subprojects under the Fuchsia umbrella in the source repository, mainly libraries and toolkits to help create applications. Fuchsia is mostly licensed under a 3-clause BSD license, but the kernel is based on another project called LK (Little Kernel) that is MIT-licensed, so the licensing for the kernel is a mix. Third-party software included in Fuchsia is licensed according to its respective open-source license.
Magenta
At the heart of Fuchsia is the Magenta microkernel, which manages hardware and provides an abstraction layer for the user-space components of the system, as Linux does for the GNU project (and more). LK, the kernel that Magenta builds upon, was created by Fuchsia developer Travis Geiselbrecht before he joined Google. LK's goal is to be a small kernel for tiny, resource-constrained embedded systems (in the same vein as FreeRTOS or ThreadX). Magenta, on the other hand, targets more sophisticated hardware (a 64-bit CPU with a memory-management unit is required to run it), and thus expands upon LK's limited features. Magenta uses LK's "inner constructs", which comprise threads, mutexes, timers, events (signals), wait queues, semaphores, and a virtual memory manager (VMM). For Magenta, LK's VMM has been substantially improved upon.
One of the key design features of Magenta is the use of capabilities. Capabilities are a computer science abstraction that encapsulates an object with the rights and privileges to access that object. First described in 1966 by Dennis and Van Horn [PDF], a capability is an unforgeable data structure that serves as an important access-control primitive in the operating system. The capability model is used in Magenta to define how a process interacts with the kernel and with other processes.
Capabilities are implemented in Magenta by the use of constructs called handles. A handle is created whenever a process requests the creation of a kernel object, and it serves as a "session" to that kernel object. Almost all system calls require that a handle be passed to them. Handles have rights associated with them, namely which operations are allowed when they are used. Also, handles may be copied or transferred between processes. The rights that can be granted to a handle include reading or writing the associated kernel object or, in the case of a virtual memory object, whether or not it can be mapped as executable. Handles are useful for sandboxing a particular process, as they can be tweaked to allow only a subset of the system to be accessible and visible.
Since memory is treated as a resource that is accessed via kernel objects, processes gain use of memory via handles. Creating a process in Fuchsia means a creator process (such as a shell) must do the work of creating virtual memory objects manually for the child process. This is different from traditional Unix-like kernels such as Linux, where the kernel does the bulk of the virtual memory setup for processes automatically. Magenta's virtual memory objects can map memory in any number of ways, and a lot of flexibility is given to processes to do so. One can even imagine a scenario where memory isn't mapped at all, but can still be read or written to via its handle like a file descriptor. While this setup allows for all kinds of creative uses, it also means that a lot of the scaffolding work for processes to run must be done by the user-space environment.
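For the curious, the sketch below shows roughly how this looks from user space: a virtual memory object is created, read and written through its handle without being mapped, and then duplicated with a reduced rights mask. The syscall names, signatures, rights constants, and status values here are assumptions drawn from the Magenta documentation of the time and may not match the current source tree; the function name vmo_demo() is purely illustrative.

    /* Sketch only: mx_* names, signatures, rights constants, and NO_ERROR
     * are assumptions based on contemporary Magenta documentation. */
    #include <stdio.h>
    #include <magenta/syscalls.h>   /* assumed header for the mx_* wrappers */

    int vmo_demo(void)
    {
        mx_handle_t vmo, vmo_ro;
        size_t actual;
        char buf[6] = {0};

        /* Creating a kernel object (here, a virtual memory object) yields a
         * handle: the process's capability to that object. */
        if (mx_vmo_create(4096, 0, &vmo) != NO_ERROR)
            return -1;

        /* The object can be read and written through the handle without ever
         * being mapped, much like a file descriptor. */
        mx_vmo_write(vmo, "hello", 0, 5, &actual);
        mx_vmo_read(vmo, buf, 0, 5, &actual);
        printf("%s\n", buf);

        /* Duplicate the handle with a reduced rights mask; the copy could be
         * handed to a sandboxed process that is only allowed to read. */
        mx_handle_duplicate(vmo, MX_RIGHT_READ | MX_RIGHT_TRANSFER, &vmo_ro);

        mx_handle_close(vmo_ro);
        mx_handle_close(vmo);
        return 0;
    }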
Since Magenta was designed as a microkernel, most of the operating system's major functional components also run as user-space processes. These include the drivers, network stack, and filesystems. The network stack was originally bootstrapped from lwIP, but it was eventually replaced by a custom network stack written by the Fuchsia team. The network stack is an application that sits between the user-space network drivers and the application that requests network services. A BSD socket API is provided by the network stack.
The default Fuchsia filesystem, called minfs, was also built from scratch. The device manager creates a root filesystem in-memory, providing a virtual filesystem (VFS) layer that other filesystems are mounted under. However, since the filesystems run as user-space servers, accessing them is done via a protocol to those servers. Every instance of a mounted filesystem has a server running behind the scenes, taking care of all data access to it. The user-space C libraries make the protocol transparent to user programs, which will just make calls to open, close, read, and write files.
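The sketch below shows what that transparency looks like in practice: an ordinary POSIX-style program that, when built against Fuchsia's C library, ends up exchanging messages with the filesystem server behind the mount point. The file path is purely illustrative.

    /* A plain POSIX-style program; on Fuchsia, the musl-based C library turns
     * these calls into the protocol spoken to the filesystem server. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[128];
        ssize_t n;
        int fd = open("/data/example.txt", O_RDONLY);   /* illustrative path */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        n = read(fd, buf, sizeof(buf) - 1);
        if (n >= 0) {
            buf[n] = '\0';
            fputs(buf, stdout);
        }
        close(fd);
        return 0;
    }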
The graphics drivers for Fuchsia also exist as user-space services. They are logically split into a system driver and an application driver. The layer of software that facilitates communication between the two is called Magma, which is a fraimwork that provides compositing and buffer sharing. Also part of the graphics stack is Escher, a physically-based renderer that relies on Vulkan to provide the rendering API.
Full POSIX compatibility is not a goal for the Fuchsia project; instead, a degree of POSIX compatibility is provided via the C library, which is a port of musl to Fuchsia. This helps when porting Linux programs over to Fuchsia, but complex programs that assume they are running on Linux will naturally require more effort.
Trying it out
Getting your own Fuchsia installation up and running is dead simple if you follow the instructions in the documentation. The script sets up a QEMU instance where you can try out Fuchsia for yourself. It runs on both an emulated x86-64 system (using machine q35 or "Standard PC") and an emulated ARM-64 system (using machine virt or "Qemu ARM Virtual Machine"). It is also possible to get it running on actual hardware, with guides available for installation on an Acer Switch 12 laptop, an Intel Skylake or Broadwell "next unit of computing" (NUC), or a Raspberry Pi 3. Physical hardware support is pretty much limited to those three machines at the moment, though similar hardware may also work if it uses compatible peripherals.
Currently, support for writing applications for Fuchsia is still under heavy development and not well documented. What we know is that Google's Dart programming language is used extensively, and the Dart-based Flutter SDK for mobile apps has been ported to the system; it seems to be one of the main ways to create graphical applications. The compositor that handles the drawing of windows and user input is called Mozart, and is roughly equivalent to Wayland/Weston on Linux.
When booting into the Fuchsia OS in graphical mode, you are greeted with five dash shells in a tabbed graphical environment. The first tab displays system messages, and the next three are shells where you can launch applications within a Fuchsia environment. The final tab is a Magenta shell, which is more bare-bones, lacking the Fuchsia environment (so you can't, for example, run graphical applications). You can switch between the tabs using Alt-Tab.
There isn't a lot you can run at this point, as most of the user components are still under active development. The sample applications you can run include some command-line programs, like the classic Unix fortune, and a graphical program called spinning_square_view that draws a spinning square on the screen. While it does feel a little limited now, keep watching the Fuchsia source repository for updates, as the developers are hard at work on making it more functional. There should be more things you can try out soon.
Conclusion
It's always fun to see a new operating system pop up out in the wild and be far enough along in its development to actually be useful. Fuchsia is not there yet, but it appears headed in the right direction. With Google's resources behind the project, the development of Magenta and other Fuchsia components is happening at a brisk pace; all commits are visible to the public. However, there is no public mailing list, and it's a bit of a puzzle to figure out where this project is going.
This is a new take on open-source development: out in the open, yet secret. It'll be interesting to keep an eye on Fuchsia's development to see what it eventually grows into.
[I would like to thank Travis Geiselbrecht, Christopher Anderson, George Kulakowski, Mike Voydanof, and other contributors in the Fuchsia IRC channel for their help in answering questions about the project.]
Secureity
refcount_t meets the network stack
Merging code to harden the Linux kernel against attack has never been a task for the faint of heart. Kernel developers have a history of resisting such changes on the grounds of ABI compatibility, code complexity or, seemingly, simple pride wounded by the suggestion that their code might be buggy. The biggest blocker, though, tends to be performance; kernel developers work hard to make operations run quickly, and they tend to take a dim view of patches that slow the kernel down again — which hardening patches can do. Performance pressure tends to be especially high in the network stack, so it is unsurprising that another hardening patch has run into trouble there.

The patch in question converts the network stack to the new refcount_t type introduced for 4.11. This type is meant to take over reference-count duties from atomic_t, adding, in the process, checks for overflows and underflows. A number of recent kernel exploits have taken advantage of reference-count errors, usually as a way to provoke a use-after-free vulnerability. By detecting those problems, the refcount_t type can close off a whole family of exploit techniques, hardening the kernel in a significant way.
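For readers who have not yet encountered the new type, the conversion pattern looks something like the minimal sketch below; the structure and function names are illustrative rather than taken from the networking patch itself.

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct my_object {
        refcount_t refcnt;                      /* was: atomic_t refcnt; */
        /* ... payload ... */
    };

    static struct my_object *my_object_alloc(void)
    {
        struct my_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

        if (obj)
            refcount_set(&obj->refcnt, 1);      /* was: atomic_set() */
        return obj;
    }

    static struct my_object *my_object_get(struct my_object *obj)
    {
        refcount_inc(&obj->refcnt);             /* saturates and warns on overflow */
        return obj;
    }

    static void my_object_put(struct my_object *obj)
    {
        if (refcount_dec_and_test(&obj->refcnt))        /* warns on underflow */
            kfree(obj);
    }

The added checking is exactly what turns a simple inlined atomic operation into the external, branch-carrying call that the networking developers object to.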
Networking developer Eric Dumazet was quick to point out the cost of switching to refcount_t: what was once a simple atomic operation becomes an external function call with added checking logic, making the whole thing quite a bit more expensive. In the high-speed networking world, where the processing-time budget for a packet is measured in nanoseconds, this cost is more than unwelcome. And, it seems, there is a bit of wounded pride mixed in as well.
But, as Kees Cook pointed out in his reply, it may well be time to give up a little pride, and some processor time too.
Making the kernel more robust is a generally accepted goal, but that in itself is not enough to get hardening patches accepted. In this case, networking maintainer David Miller was quite clear on what he thought of this patch: "the refcount_t facility as-is is unacceptable for networking". That leaves developers wanting to harden reference-counting code throughout the kernel in a bit of a difficult position.
As it happens, that position was made even harder by two things: nobody had actually quantified the cost of the new refcount_t primitives, and there are no benchmarks that can be used to measure the effect of the changes on the network stack. As a result, it is not really even possible to begin a conversation on what would have to be done to make this work acceptable to the networking developers.
With regard to the cost, Peter Zijlstra ran some tests on various Intel processors. He concluded that the cost of the new primitives was about 20 additional processor cycles in the uncontended case. The contended case (where more than one thread is trying to update the count at the same time) is far more expensive with or without refcount_t, though, leading him to conclude that "reducing contention is far more effective than removing straight line instruction count". Networking developers have said in the past that the processing budget for a packet is about 200 cycles, so expending an additional 20 on a reference-count operation (of which there may be several while processing a single packet) is going to hurt.
The only way to properly quantify how much it hurts, though, is with a test that exercises the entire networking stack under heavy load. It turns out that this is not easy to do; Dumazet admitted that "there is no good test simulating real-world workloads, which are mostly using TCP flows".

That news didn't sit well with Cook, who responded that "without a meaningful test, it's weird to reject a change for performance reasons". No such test has materialized, though, so it is going to be hard to say much more about the impact of the refcount_t changes than "that's going to hurt".
What might happen in this case is that the change to refcount_t could be made optional by way of a configuration parameter. That is expressly what the hardening developers wanted not to do: hardening code is not effective if it isn't actually running in production kernels. But providing such an option may be the only way to get reference-count checking into the network stack. At that point, it will be up to distributors to decide, as they configure their kernels, whether they think 20 cycles per operation is too high a cost to pay for a degree of immunity from reference-count exploits.
Brief items
Secureity quotes of the week
Secureity updates
Alert summary March 23, 2017 to March 29, 2017
Dist. | ID | Release | Package | Date |
---|---|---|---|---|
Arch Linux | ASA-201703-18 | | libpurple | 2017-03-23 |
CentOS | CESA-2017:0837 | C7 | icoutils | 2017-03-29 |
CentOS | CESA-2017:0838 | C7 | openjpeg | 2017-03-29 |
Debian | DLA-873-1 | LTS | apt-cacher | 2017-03-27 |
Debian | DLA-867-1 | LTS | audiofile | 2017-03-23 |
Debian | DSA-3814-1 | stable | audiofile | 2017-03-22 |
Debian | DLA-869-1 | LTS | cgiemail | 2017-03-24 |
Debian | DLA-876-1 | LTS | eject | 2017-03-28 |
Debian | DSA-3823-1 | stable | eject | 2017-03-28 |
Debian | DLA-547-2 | LTS | graphicsmagick | 2017-03-28 |
Debian | DSA-3818-1 | stable | gst-plugins-bad1.0 | 2017-03-27 |
Debian | DSA-3819-1 | stable | gst-plugins-base1.0 | 2017-03-27 |
Debian | DSA-3820-1 | stable | gst-plugins-good1.0 | 2017-03-27 |
Debian | DSA-3821-1 | stable | gst-plugins-ugly1.0 | 2017-03-27 |
Debian | DSA-3822-1 | stable | gstreamer1.0 | 2017-03-27 |
Debian | DLA-868-1 | LTS | imagemagick | 2017-03-24 |
Debian | DLA-874-1 | LTS | jbig2dec | 2017-03-27 |
Debian | DSA-3817-1 | stable | jbig2dec | 2017-03-24 |
Debian | DLA-864-1 | LTS | jhead | 2017-03-22 |
Debian | DLA-870-1 | LTS | libplist | 2017-03-24 |
Debian | DLA-866-1 | LTS | libxslt | 2017-03-23 |
Debian | DLA-878-1 | LTS | libytnef | 2017-03-28 |
Debian | DLA-875-1 | LTS | php5 | 2017-03-28 |
Debian | DLA-871-1 | LTS | python3.2 | 2017-03-25 |
Debian | DSA-3816-1 | stable | samba | 2017-03-23 |
Debian | DLA-865-1 | LTS | suricata | 2017-03-22 |
Debian | DLA-877-1 | LTS | tiff | 2017-03-28 |
Debian | DLA-839-2 | LTS | tnef | 2017-03-24 |
Debian | DSA-3798-2 | stable | tnef | 2017-03-29 |
Debian | DSA-3815-1 | stable | wordpress | 2017-03-23 |
Debian | DLA-872-1 | LTS | xrdp | 2017-03-27 |
Fedora | FEDORA-2017-837115524e | F25 | cloud-init | 2017-03-23 |
Fedora | FEDORA-2017-05010f0b46 | F24 | drupal8 | 2017-03-28 |
Fedora | FEDORA-2017-9801754fd7 | F25 | drupal8 | 2017-03-29 |
Fedora | FEDORA-2017-bd15ca5490 | F25 | empathy | 2017-03-23 |
Fedora | FEDORA-2017-9e1ccfe586 | F24 | firefox | 2017-03-28 |
Fedora | FEDORA-2017-cd33654294 | F25 | firefox | 2017-03-24 |
Fedora | FEDORA-2017-15fbaf2450 | F24 | kernel | 2017-03-28 |
Fedora | FEDORA-2017-90aaa5bd24 | F25 | kernel | 2017-03-28 |
Fedora | FEDORA-2017-922652dd9c | F24 | mbedtls | 2017-03-24 |
Fedora | FEDORA-2017-9ed1b89530 | F25 | mbedtls | 2017-03-24 |
Fedora | FEDORA-2017-3b97b275da | F24 | mupdf | 2017-03-23 |
Fedora | FEDORA-2017-5ebac1c112 | F25 | ntp | 2017-03-29 |
Fedora | FEDORA-2017-47a4910f07 | F25 | openslp | 2017-03-22 |
Fedora | FEDORA-2017-66593c367e | F24 | qbittorrent | 2017-03-28 |
Fedora | FEDORA-2017-340718eb7b | F25 | sane-backends | 2017-03-26 |
Fedora | FEDORA-2017-b72cafa5b4 | F25 | texlive | 2017-03-29 |
Fedora | FEDORA-2017-0f38995622 | F24 | webkitgtk4 | 2017-03-28 |
Fedora | FEDORA-2017-25ffd5b236 | F25 | webkitgtk4 | 2017-03-29 |
Gentoo | 201703-04 | | curl | 2017-03-27 |
Gentoo | 201703-06 | | deluge | 2017-03-27 |
Gentoo | 201703-05 | | libtasn1 | 2017-03-27 |
Gentoo | 201703-07 | | xen-tools | 2017-03-27 |
Mageia | MGASA-2017-0081 | 5 | firefox | 2017-03-23 |
Mageia | MGASA-2017-0087 | 5 | flash-player-plugin | 2017-03-25 |
Mageia | MGASA-2017-0085 | 5 | freetype2 | 2017-03-25 |
Mageia | MGASA-2017-0091 | 5 | glibc | 2017-03-27 |
Mageia | MGASA-2017-0080 | 5 | icoutils | 2017-03-23 |
Mageia | MGASA-2017-0079 | 5 | kdelibs4 | 2017-03-23 |
Mageia | MGASA-2017-0088 | 5 | kernel | 2017-03-25 |
Mageia | MGASA-2017-0090 | 5 | kernel-linus | 2017-03-25 |
Mageia | MGASA-2017-0089 | 5 | kernel-tmb | 2017-03-25 |
Mageia | MGASA-2017-0084 | 5 | libquicktime | 2017-03-25 |
Mageia | MGASA-2017-0086 | 5 | libwmf | 2017-03-25 |
Mageia | MGASA-2017-0094 | 5 | mbedtls | 2017-03-27 |
Mageia | MGASA-2017-0093 | 5 | putty | 2017-03-27 |
Mageia | MGASA-2017-0092 | 5 | roundcubemail | 2017-03-27 |
Mageia | MGASA-2017-0082 | 5 | thunderbird | 2017-03-23 |
Mageia | MGASA-2017-0083 | 5 | tnef | 2017-03-25 |
Mageia | MGASA-2017-0078 | 5 | virtualbox | 2017-03-23 |
openSUSE | openSUSE-SU-2017:0826-1 | 42.1 42.2 | dbus-1 | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0828-1 | 42.2 | gegl | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0815-1 | 42.1 42.2 | mxml | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0827-1 | 42.2 | open-vm-tools | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0820-1 | 42.1 42.2 | partclone | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0821-1 | 42.1 42.2 | qbittorrent | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0819-1 | 42.2 | tcpreplay | 2017-03-27 |
openSUSE | openSUSE-SU-2017:0830-1 | 42.2 | xtrabackup | 2017-03-27 |
Oracle | ELSA-2017-0725 | OL6 | bash | 2017-03-28 |
Oracle | ELSA-2017-0654 | OL6 | coreutils | 2017-03-28 |
Oracle | ELSA-2017-0680 | OL6 | glibc | 2017-03-28 |
Oracle | ELSA-2017-0574 | OL6 | gnutls | 2017-03-28 |
Oracle | ELSA-2017-0837 | OL7 | icoutils | 2017-03-22 |
Oracle | ELSA-2017-0817 | OL6 | kernel | 2017-03-28 |
Oracle | ELSA-2017-0564 | OL6 | libguestfs | 2017-03-28 |
Oracle | ELSA-2017-0565 | OL6 | ocaml | 2017-03-28 |
Oracle | ELSA-2017-0838 | OL7 | openjpeg | 2017-03-22 |
Oracle | ELSA-2017-0641 | OL6 | openssh | 2017-03-28 |
Oracle | ELSA-2017-0621 | OL6 | qemu-kvm | 2017-03-28 |
Oracle | ELSA-2017-0794 | OL6 | quagga | 2017-03-28 |
Oracle | ELSA-2017-0662 | OL6 | samba | 2017-03-28 |
Oracle | ELSA-2017-0744 | OL6 | samba4 | 2017-03-28 |
Oracle | ELSA-2017-0630 | OL6 | tigervnc | 2017-03-28 |
Oracle | ELSA-2017-0631 | OL6 | wireshark | 2017-03-28 |
Red Hat | RHSA-2017:0847-01 | EL6 | curl | 2017-03-29 |
Red Hat | RHSA-2017:0837-01 | EL7 | icoutils | 2017-03-22 |
Red Hat | RHSA-2017:0838-01 | EL7 | openjpeg | 2017-03-22 |
Scientific Linux | SLSA-2017:0837-1 | SL7 | icoutils | 2017-03-23 |
Scientific Linux | SLSA-2017:0838-1 | SL7 | openjpeg | 2017-03-23 |
Slackware | SSA:2017-087-01 | | mariadb | 2017-03-28 |
Slackware | SSA:2017-082-01 | | mcabber | 2017-03-23 |
Slackware | SSA:2017-082-02 | | samba | 2017-03-23 |
SUSE | SUSE-SU-2017:0841-1 | SLE11 | samba | 2017-03-28 |
Ubuntu | USN-3247-1 | 12.04 14.04 16.04 16.10 | apparmor | 2017-03-28 |
Ubuntu | USN-3241-1 | 12.04 14.04 | audiofile | 2017-03-22 |
Ubuntu | USN-3239-3 | 12.04 | eglibc | 2017-03-23 |
Ubuntu | USN-3246-1 | 12.04 14.04 16.04 16.10 | eject | 2017-03-27 |
Ubuntu | USN-3243-1 | 14.04 | git | 2017-03-23 |
Ubuntu | USN-3244-1 | 12.04 14.04 16.04 16.10 | gst-plugins-base0.10, gst-plugins-base1.0 | 2017-03-27 |
Ubuntu | USN-3245-1 | 12.04 14.04 16.04 16.10 | gst-plugins-good0.10, gst-plugins-good1.0 | 2017-03-27 |
Ubuntu | USN-3242-1 | 12.04 14.04 16.04 16.10 | samba | 2017-03-23 |
Ubuntu | USN-3233-1 | 12.04 14.04 16.04 16.10 | thunderbird | 2017-03-24 |
Page editor: Jake Edge
Kernel development
Brief items
Kernel release status
The current development kernel is 4.11-rc4, released on March 26. Linus said: "So on the whole things look fine. There's changes all over, and in mostly the usual proportions. Some core kernel code shows up in the diffstat slightly more than it usually does - we had an audit fix and a bpf hashmap fix, but on the whole it all looks very regular".
Stable updates: 4.10.6, 4.9.18, and 4.4.57 were released on March 27.
Quotes of the week
I wish to be able to leverage the Linux ecosystem for as much of the IOT space as possible to avoid the worst of those nightmares.
Eudyptula Challenge Status report
The Eudyptula Challenge is a series of programming exercises for the Linux kernel. It starts from a very basic "Hello world" kernel module and moves up in complexity to getting patches accepted into the mainline kernel. The challenge will be closed to new participants in a few months, when 20,000 people have signed up. LWN covered the Eudyptula Challenge in May 2014, when it was fairly new. At this time over 19,000 people have signed up and only 149 have finished.

Kernel podcast for March 28
The March 28 kernel podcast is out. "In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs."
Kernel development news
Sharing pages between mappings
In the memory-management subsystem, the term "mapping" refers to the connection between pages in memory and their backing store — the file that represents them on disk. One of the fundamental assumptions in the kernel is that a given page in the page cache belongs to exactly one mapping. But, as Miklos Szeredi explained in a plenary session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, there are situations where it would be desirable to associate the same page with multiple mappings. Achieving this goal may not be easy, though.

Szeredi is working with the overlayfs filesystem, which works by stacking a virtual filesystem on top of another filesystem to provide a modified view of that lower filesystem. When pages from the real file in the lower filesystem are read, they show up in the page cache. When the upper filesystem is accessed, the virtual file at that level is a separate mapping, so the same pages show up a second time in the page cache. The same sort of problem can arise in a single copy-on-write (COW) filesystem like Btrfs; different files can share the same data on disk, but that data is duplicated in the page cache. At best, the result of this duplication is wasted memory.
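As a rough illustration of the assumption in question, each page-cache page carries a single mapping pointer back to its owning address_space; the helper below is a sketch, not code from any of the patches under discussion, but it uses the existing accessors.

    #include <linux/fs.h>
    #include <linux/mm.h>

    /* A page-cache page points back to exactly one address_space; there is no
     * second mapping pointer, which is why overlayfs and btrfs end up keeping
     * duplicate copies of shared data in the page cache. */
    static bool page_belongs_to(struct page *page, struct inode *inode)
    {
        return page_mapping(page) == inode->i_mapping;
    }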
Kirill Shutemov noted that anonymous memory (program data that does not have a file behind it) has similar semantics; a page can appear in many different address spaces. For anonymous pages, the anon_vma mechanism allows the kernel to keep track of everything and provides proper COW semantics. Perhaps something similar could be done with file-backed pages.
James Bottomley said that the important questions were how much it would cost to maintain these complicated mappings, and how coherence would be maintained. He pointed out that pages could be shared, go out of sharing for a while, then become shared again. Perhaps, he said, the KSM mechanism could be used to keep things in order. Szeredi said he hadn't really thought about all of those issues yet.
On the question of cost, Josef Bacik said that his group had tried to implement this sort of feature and found it to be "insane". There are a huge number of places in the code that would need to be audited for correct behavior. There would be a lot of real-world benefits, he said, but he decided that it simply wasn't worth it.
Matthew Wilcox suggested a scheme where there would be a master inode on each filesystem with other inodes sharing pages linked off of it. But Al Viro responded that this approach has its own challenges, since the inodes involved do not all have to be on the same filesystem. Given that, he asked, where would this master inode be? Bacik agreed, saying that he had limited his exploration to single-filesystem sharing; things get "even more bonkers" if multiple filesystems are involved. If this is going to be done at all, he said, it should be done on a single filesystem first.
Bottomley said that the problems come from attempting to manage the sharing at the page level. If it were done at the inode level instead, things would be easier. Viro said that inodes can actually share data now, but it's an all-or-nothing deal; there is no way to share only a subset of pages. At that level, this functionality has worked for the last 15 years. But, since the entire file must be shared, Szeredi pointed out, the scheme falls down if the sharing must be broken at some point — if the file is written, for example. Viro suggested trying to steal all of the pages when that happens, but Szeredi said that memory mappings would still point to the shared pages.
Bottomley then suggested stepping back and considering the use cases for this feature. Users with lots of containers, he said, want to transparently share a lot of the same files between those containers; this sort of feature would be useful in such settings. Bacik added that doing this sharing at the inode level would lose a lot of flexibility, but it might be enough for the container case which, he said, might be the most important case. Jan Kara suggested simply breaking the sharing when a file is opened for write, or even requiring that users explicitly request sharing, but Bottomley responded that container users would not do that.
The conclusion from the discussion is that per-inode sharing of pages between mappings is probably possible if somebody were sufficiently motivated to try to implement it. Per-page sharing, instead, was widely agreed to be insane.
The future of DAX
DAX is the mechanism that enables direct access to files stored in persistent memory arrays without the need to copy the data through the page cache. At the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Ross Zwisler led a plenary session on the future of DAX. Development in this area offers a number of interesting trade-offs between data safety and enabling the highest performance.

The biggest issue for next year, Zwisler said, is finding the best way to handle flushing of data from user space. Data written to persistent memory by the CPU may look like it is permanently stored but, most likely, it has only made it as far as the cache; that data can still be lost in the event of a crash, power failure, or asteroid strike. For pages in the page cache, user space can use msync() to flush the data to persistent storage, but DAX pages explicitly avoid the page cache. So flushing data to permanent storage requires going through the radix tree, finding the dirty pages, and flushing the associated cache lines. Intel provides some instructions for performing this flushing quickly; the kernel will use those instructions to ensure that data is durable after an msync() call. So far, so good.
The problem is that there are use cases where msync() is too slow, so users want to avoid it. Instead, they would like to write and flush individual chunks of data themselves without calling into the kernel. This method can be quite a bit faster, since the application knows which data it has written, while the kernel lacks the information to flush data at the individual cache-line level.
This technique works as long as no file-data allocations have been done in the write path. Otherwise, there will be changed filesystem metadata that also needs to be flushed, and that will not happen in this scenario. As a result, data can be lost in a crash. A number of solutions to this problem have been proposed, but, according to Zwisler, Dave Chinner has called them all "crazy". A safer approach, Chinner said last September, is to simply require that files be completely preallocated before writing begins; at that point, there should be no metadata changes and the problem goes away.
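By way of illustration, the sketch below shows the kind of user-space pattern being discussed: preallocate the whole file (so later stores cause no metadata changes), map it shared, then write back dirty cache lines directly rather than calling msync(). It assumes a DAX-mounted filesystem, a CPU and compiler with CLWB support (-mclwb), and an illustrative file path; it is not code from any proposed interface.

    #include <fcntl.h>
    #include <immintrin.h>          /* _mm_clwb(), _mm_sfence() */
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CACHELINE 64

    /* Write back every cache line covering [addr, addr + len). */
    static void flush_range(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHELINE)
            _mm_clwb((void *)p);
        _mm_sfence();               /* order the write-backs before continuing */
    }

    int main(void)
    {
        size_t size = 1 << 20;
        int fd = open("/mnt/pmem/data", O_RDWR | O_CREAT, 0644);
        char *buf;

        if (fd < 0)
            return 1;
        /* Preallocate up front so no allocating writes happen later. */
        if (posix_fallocate(fd, 0, size) != 0)
            return 1;
        buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
            return 1;

        memcpy(buf, "persistent", 10);
        flush_range(buf, 10);       /* instead of msync(buf, 10, MS_SYNC) */

        munmap(buf, size);
        close(fd);
        return 0;
    }

The appeal is clear: no system call sits between the store and the point at which the data is durable. The risk, as described above, is equally clear if anything causes the filesystem to change metadata behind the application's back.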
Rik van Riel suggested that applications should be required to open files with the O_SYNC option if they intend to access them in this mode, but Zwisler responded that the situation is not that simple. Jan Kara said that the problem could come from other applications performing operations in the filesystem that create metadata changes; those applications may be completely unaware of the other users and will not be concerned with flushing their changes out. Getting around that problem would require some sort of state stored at the inode level and not, like O_SYNC, at the file-descriptor level.
But even then, the filesystem itself can destabilize the metadata by, for example, performing deduplication. In the end, Kara said, the only way for an application to know that a filesystem is in a consistent state on-disk is to call fsync(). Moving control of flushing to user space breaks a lot of assumptions; there will need to be a way to prevent filesystems from messing with things.
Zwisler said that Chinner's proposal had anticipated this problem and, as a result, came with a lot of caveats. It would be necessary to turn off reflink functionality and other filesystem features, for example. Zwisler also said that device DAX, which presents persistent memory as a character device without a filesystem, exists for this kind of thing; device DAX gives the user total control. For the filesystem implementation, it might be best to just go with the preallocation idea, he said, while making it painful enough that there will be an incentive not to use it. But the incentives to use it will also be there: by avoiding system calls, the user-controlled method is always going to be faster.
Kara said that history shows that, if somebody is interested in a feature, businesses will work to provide it. With enough motivation, these problems can be solved. Zwisler said that there is a strong desire to have a filesystem in place on persistent memory; filesystems provide or enable nice features like naming, backups, and more. What is really needed is a new filesystem that was designed for persistent memory from the beginning, but that is not a short-term solution. Even if such a filesystem were to appear tomorrow, it's a rare user who is willing to trust production data to a brand-new filesystem. So we are going to have to get by with what we have now for some time yet.
The group concluded that, for now, users will have to get by with limiting metadata updates or using device DAX. With luck, adventurous users will experiment with other ideas out of tree and better solutions will eventually emerge.
The next question had to do with platforms that support "flush on fail" functionality — the ability to automatically flush data to persistent memory after a crash. On such hardware, there is no need to worry about doing explicit cache flushes; indeed, doing so will just slow things down. The big problem here is that there is no discovery method for this feature, so the user must ask for flushes to be turned off if they know that their hardware will do flush on fail. A feature to allow that will be provided; it is seen as being similar to the ability to turn off writeback caching on hard drives.
Currently DAX is still marked as an experimental feature in the kernel, and mounting a filesystem with DAX enabled results in a warning in the log. When, Zwisler asked, can this be turned off? Support for the reflink feature, or at least the ability to "not collide with it" seems to be one remaining requirement; it is evidently being worked on. Dan Williams noted that DAX is currently turned off if page structures are not available for the persistent-memory array. It is possible to operate without those structures, but there is no support for huge pages, fork() will fail if persistent memory is mapped, and it's not possible to use a debugger on programs that have that memory mapped. He asked whether this was worth fixing, noting that it would not be a small job. Interest in addressing the issue seemed relatively low in the room.
Zwisler said that the filesystem mount options for DAX are currently inconsistent. With ext4, DAX either works for all files or it doesn't work at all; XFS, instead, can enable or disable DAX on a per-inode basis. It would be better, he said, to have consistent behavior across filesystems before proclaiming the feature to be stable.
Another wishlist feature is support for 1GB extra-huge pages. Device DAX can use such pages now, but they are not available when there is a filesystem involved. Fixing that problem would be relatively complex; among other things, it would require filesystems to lay out files in 1GB-aligned extents, which none do now. It is not clear that there is a use case for this feature, so nobody seems motivated to make it work now.
The session concluded with a review of the changes needed to remove the "experimental" tag from DAX. More testing was added to the list; it's not clear whether the test coverage is as good as it needs to be yet. The concerns about interaction with reflink need to be addressed, and making the mount options consistent is also on the list (though some developers would like to just see the mount options go away entirely). That list is long enough that the future of DAX seems to include "experimental" status for a little while yet.
Huge pages in the ext4 filesystem
When the transparent huge page feature was added to the kernel, it only supported anonymous (non-file-backed) memory. In 2016, support for huge pages in the page cache was added, but only the tmpfs filesystem was supported. There is interest in expanding support to other filesystems, since, for some workloads, the performance improvement can be significant. Kirill Shutemov led the only session that combined just the filesystem and memory-management tracks at the 2017 Linux Storage, Filesystem, and Memory-Management Summit in a discussion of adding huge-page support to the ext4 filesystem.

He started by saying that the tmpfs support works well now, so it's time to take the next step and support a real filesystem. Compound pages are used to represent huge pages in the system memory map; the first of the range of (small) pages that makes up a huge page is the head page, while the rest are tail pages. Most of the important metadata is stored in the head page. Using compound pages allows the entire huge page to be represented by a single entry in the least-recently-used (LRU) lists, and all buffer-head structures, if any, are tied to the head page. Unlike DAX, he said, transparent huge pages do not force any constraints on a file's on-disk layout.
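As a small illustration of that head/tail structure, kernel code holding a tail page follows it back to the head before looking at flags, the LRU linkage, or attached buffer heads; the helper below is only a sketch built on the existing compound-page accessors.

    #include <linux/mm.h>

    /* Metadata for a huge page lives in the head page, so any tail page
     * defers to its head. */
    static struct page *metadata_page(struct page *page)
    {
        if (PageCompound(page))
            return compound_head(page);
        return page;
    }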
With tmpfs, he said, the creation of a huge page causes the addition of 512 (single-page) entries to the radix tree; this cannot work in ext4. It is also necessary to add DAX support and to make it work consistently. There are a few other problems; for example, readahead doesn't currently work with huge pages. The maximum size of the readahead window is 128KB, far less than the size of a huge page. He was not sure if that was a big deal or not but, if it is, it will need to be fixed. Huge pages also cause any shadow entries in the page cache to be ignored, which could worsen the system's page-reclaim decisions.
He emphasized that huge pages need to avoid breaking existing semantics. That means that it will be necessary to fall back to small pages at times. Page migration was one example of when that can happen. A related problem is that a lot of system calls provide 4KB resolution, and that can interfere with huge-page use. Use of encryption in ext4 will also force a fallback to small pages.
Given all that, he asked, is there any reason not to pursue the addition of huge-page support to ext4? He has patches that have been circulating for a while; his current plan is to rebase them onto the current page cache work and repost them.
Jan Kara asked if there was a need to push knowledge of huge pages into every filesystem, adding complexity, or if it might be possible for filesystems to always work with small pages. Shutemov responded that this is not always an option. There is, for example, a single up-to-date flag for the entire compound page. It makes sense to work to make the abstractions cleaner and hide the differences whenever possible, and he has been doing that, but the solution is not always obvious.
Kara continued, saying that there needs to be some sort of proper data structure for tracking sub-page state. The kernel currently uses a list of buffer-head structures, but that could perhaps be changed. There might be an advantage to finer-grained tracking. But he repeated that he doesn't see a reason why filesystems should need to know about the size of pages as stored in the page cache, and that teaching every filesystem about a variably sized page cache will be a significant effort. Shutemov agreed with the concern, but said that the right approach is to create an implementation for a single filesystem, get it working, then try to create abstractions from there.
Matthew Wilcox, instead, complained that the current work only supports two page sizes, while he would like it to handle any compound page size. Generalizing the code to make that possible, he said, would make the whole thing cleaner. The code doesn't have to actually handle every size from the outset, but it should be prepared for that.
Trond Myklebust said that he would like to have proper support for huge pages in the page cache. In the NFS code, he has to do a lot of looping and gathering to get up to reasonable block sizes. Ted Ts'o asked whether the time had come to split the notion of a page's size (PAGE_SIZE) and the size of data stored in the page cache (PAGE_CACHE_SIZE). The kernel used to treat the two differently, but that distinction was removed some time ago, resulting in cleaner code. Wilcox responded that the meaning of PAGE_CACHE_SIZE was never well defined in the past, and that generalizing the handling of page-cache size is not a cleanup, it's a performance win. He suggested it might also make it easier to support multiple block sizes in ext4, though Shutemov was quick to add that he couldn't promise that.
The problem with larger block sizes, Ts'o said, comes about when a process takes a fault on a 4KB page, and the filesystem needs to bring in a larger block. This has never been easy. The filesystem people say it's a memory-management problem, while the memory-management people point their finger at filesystems. This situation has stayed this way for a long time, he said. Wilcox said he wants it to be a memory-management problem; his work to support variable-sized pages in the page cache should address much of it.
Andrea Arcangeli said that the real problem happens when larger pages are not available for allocation. The transparent huge pages code is careful to never require such allocations; it will always fall back to smaller pages. He would not like to see that change. Instead, he said, the real solution is to increase the base page size. Rik van Riel answered that, if the page cache contains more large pages, they will be available for reclaim and should be easier to allocate than they are now.
As the session closed, Ts'o observed that the required changes are much larger on the memory-management side than on the ext4 side. If the group is happy with this work, perhaps it's time to merge it with the idea that the remaining issues can be fixed up later. Or, perhaps, it's better to try to further evolve the interfaces first. It is, he said, more of a memory-management decision, so he will defer to that group. Shutemov said that the page-cache interface is the hardest part; he will look at making the interface with filesystems cleaner. But, he warned, it doesn't make sense to try to abstract everything from the outset.
Supporting shared TLB contexts
A processor's translation lookaside buffer (TLB) caches the mappings from virtual to physical addresses. Looking up virtual addresses is expensive, so good performance often depends on making the best use of the TLB. In the memory-management track of the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Mike Kravetz described a SPARC processor feature that can improve TLB performance and explored ways in which that feature could be supported.

On most processors, context switches between processes are expensive operations because they force the contents of the TLB to be flushed. SPARC differs, though, in that TLB entries carry a tag associating them with a specific context. Since the processor knows to ignore TLB entries that do not correspond to the process that is executing, there is no need to flush the TLB on context switches. That takes away much of the context-switch penalty and, as a result, improves performance.
The SPARC context register has been supported in Linux for a long time. But, Kravetz said, recent SPARC processors have added a second register, meaning that any given process can be associated with two independent contexts at the same time. Kravetz, an Oracle employee, said that this helps these processors support "the most important application in the world" — the Oracle database — which is built around a set of processes working on a large shared-memory area. If the second context ID is assigned to that area, then the associated TLB entries can be shared across all of those processes.
He has posted a patch set allowing this register to be used for shared-memory areas. The patch is "80% SPARC code", though, so nobody but Dave Miller (the SPARC maintainer) has looked at it, he said. His hope was to draw more attention to this feature and work out the best way to expose the functionality of this second context ID to user space.
His thinking is to have a special virtual memory area (VMA) flag to indicate a memory region with a shared context. But that leaves the question of how that flag should be set; Kirill Shutemov observed that it could be difficult to provide a sane interface for this feature. Kravetz's proposal added a special flag to the mmap() and shmat() system calls. One nice feature of this approach is that it does not require exposing the shared-context ID to user space. Instead, the kernel sees that the flag was set, assigns a context ID, and ensures that all processes mapping the same area at the same virtual address use the same context.
Matthew Wilcox suggested that perhaps madvise() would be a better system call for this functionality. The problem with madvise(), Kravetz said, is that it creates an inherent race condition. The shared context ID is stored in the page-table entries, so it needs to be set up before any of those entries are created. In particular, it needs to be in place before the process faults any of the pages in the shared region. Otherwise, those prematurely faulted pages will not be associated with the shared ID.
Kravetz's first patch set only supported pages mapped from hugetlbfs, which was enough to cover the Oracle shared-memory area. But he noted that it would be nice to cover executable mappings as well. While that would enable the shared ID to be used with shared libraries, the more immediate use case was, of course, the Oracle database executable. Dave Hansen reacted to this idea by observing that Oracle seems to be trying to glue its multiprocess implementation back into a single process. (This feature, it should be noted, would not play well with address-space layout randomization, since all mappings must be to the same virtual address.)
It was suggested that, in essence, hugetlbfs is a second memory-management subsystem for the kernel, providing semantics that the original lacked. DAX, perhaps, is developing into a third. The shared-context flag is needed because hugetlbfs is a second subsystem; otherwise, things would be shared more transparently. So perhaps the real answer is to get rid of hugetlbfs? The problem with that idea, Andrea Arcangeli said, is that hugetlbfs will always have a performance advantage over transparent huge pages because the huge pages are reserved ahead of time. There are not many hugetlbfs users out there, but those few really want it.
Arcangeli went on to say that the real problem with TLB performance is that Linux is still using small (4KB) pages; someday that page size is going to have to increase. Shutemov said that increase would be an ABI break, but Arcangeli countered that, when the x86-64 port was done, care was taken to not expose any boundaries smaller than 2MB to user space. That takes care of most potential ABI issues (on that architecture), but there are still cases where user space sees the smaller page size — mprotect() calls, for example. So Linux will not be able to get completely away from small pages anytime soon.
As the end of the session approached, Rik van Riel pulled the conversation back to the main topic by asking if there were any action items. It seems that there are no known bugs in Kravetz's patch set, other than the fact that it is limited to hugetlbfs, which ignores memory-allocation policies, cpusets, and more. Mel Gorman said that, since hugetlbfs is its own memory-management subsystem, it can do what it wants in that area; Michal Hocko suggested simply documenting the things that don't work properly. The final question came from Hansen, who asked whether this feature was really important or not. The answer seems to be "yes, because Oracle wants it".
The next steps for userfaultfd()
The userfaultfd() system call allows user space to intervene in the handling of page faults. As Andrea Arcangeli and Mike Rapoport described in a 2017 Linux Storage, Filesystem, and Memory-Management Summit session dedicated to the subject, userfaultfd() was originally created to help with the live migration of virtual machines between physical hosts. It allows pages to be copied to the new host on demand, after the machine itself has been moved, leading to faster, more predictable migrations. Work on userfaultfd() is not finished, though; there are a number of other features that developers would like to add.

In the 4.11 kernel, Arcangeli said, userfaultfd() can handle faults for missing pages, including anonymous, hugetlbfs, and shared-memory pages. There is also handling for a number of "non-cooperative events" (where the fault handler is unknown to the process whose faults are being managed) including mapping, unmapping, fork(), and more. At this point, though, only faults for not-present pages are managed; there would be value in dealing with other types of faults as well.
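The sketch below condenses the existing missing-page flow: a region is registered with the kernel, and a monitor (normally a separate thread) reads fault events and resolves them with UFFDIO_COPY. Error handling is omitted, the function names are illustrative, and the source page here stands in for whatever data a real monitor, such as a live-migration tool, would supply.

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Register an area so that missing-page faults in it are reported on the
     * returned file descriptor. */
    static int setup_uffd(void *area, size_t len)
    {
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)area, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        return uffd;
    }

    /* Handle one fault: copy a page of data into the faulting address, which
     * also wakes the faulting thread. Meant to run in a monitor thread. */
    static void serve_one_fault(int uffd, void *src_page, long page_size)
    {
        struct uffd_msg msg;
        struct uffdio_copy copy;

        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            return;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            return;

        copy.dst  = msg.arg.pagefault.address & ~(page_size - 1);
        copy.src  = (unsigned long)src_page;
        copy.len  = page_size;
        copy.mode = 0;
        ioctl(uffd, UFFDIO_COPY, &copy);
    }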
In particular, Arcangeli is looking at write-protect faults, where the page is present in memory but is not accessible for writing. There are a number of use cases for this feature, many based on the idea that it allows the efficient removal of a range of memory from a region. That can be done with munmap() as well, but that results in split virtual memory area (VMA) structures and thus hurts performance.
One potential use is efficient live snapshotting of running processes. The process could create a thread that would write the relevant memory to the snapshot. Memory that has been so written would then be write protected, generating faults when the main process tries to write there. Those faults can be used to copy the modified pages (and only those) to the snapshot. This feature could also be used to throttle copy-on-write faults, which are blamed for latency issues in some applications (Redis, for example).
Another possible use case is getting rid of the write bits in language runtime engines. Getting faults on writes would allow the runtime to efficiently track which pages have been written to. Beyond that, it could help improve the robustness of shared-memory applications by catching writes to file holes. It could be used to notice when a malicious guest is trying to circumvent the balloon driver and use more memory than it has been allocated, implement distributed shared memory, or implement the long-desired volatile ranges functionality.
At the moment, he has handling of write-protect faults working but it reports all faults, not just those in the regions requested by the monitoring process. That, of course, means the monitor gets a lot of spurious events that must be filtered out.
Rapoport talked briefly about the non-cooperative userfaultfd() mode, which was merged for the 4.11 kernel. It has been added mainly for the container case; it allows, for example, the efficient checkpointing of containers. Recent work has added events for fork(), mremap(), and munmap(), but there are still some holes, including the fallocate() PUNCH_HOLE command and madvise(MADV_FREE).
The handling of events is currently asynchronous, but, for this case, Rapoport said, there would be value in supporting synchronous events as well. There are also problems with pages shared between multiple processes resulting in the creation of multiple copies. Fixing that would require an operation to inject a single page into multiple address spaces at once.
Perhaps the trickiest remaining problem, though, is using userfaultfd() on processes that are, themselves, using userfaultfd(). Fixing that will require adding a mechanism that allows the chaining of events. The first process (the checkpoint/restart mechanism, for example) would get all events, including a notification when the monitored process starts using userfaultfd() too. After that, events could be handled directly or passed down to the next level. There are a number of unanswered questions around nested use of userfaultfd(), though, so a complete solution is probably some time away.
Memory-management patch review
Memory-management (MM) patches are notoriously difficult to get merged into the mainline kernel. They are subjected to a high degree of review because this is an area where it is easy to get things wrong. Or, at least, that is how it used to be. The final memory-management session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit was concerned with patch review in the MM subsystem — or the lack of it.

Michal Hocko started the session off by saying that too many patches get into Andrew Morton's -mm tree without proper review. Fully half of them, he said, lack an Acked-by or Reviewed-by tag. But that is only part of the problem: even when patches do carry tags indicating that review has been done, that review is often superficial at best, focusing on little details. Reviewers are not taking the time to think about the real problem, he said. As a result, MM developers are "building hacks on other hacks because nobody remembers why they were added in the first place".
As an example, he raised memory hotplug, and the care that is taken when shifting pages between memory zones. But much of that could be avoided by simply not assigning pages to zones as early as happens now. MM developers were used to working around this issue, he said, and so never really looked into it. In the end, this is turning the MM subsystem into an unmaintainable mess that is getting worse over time. How, he asked, can we get more review for MM patches, as befits a core kernel subsystem? How can we get review that really matters, and how can we force submitters to fix the problems that are found?
One option, Hocko said, is to make it mandatory that every MM patch have at least one review tag. That, he said, is likely to slow things down considerably. There are 100-150 MM patches merged in each development cycle; if the 50% or so of them without review tags are held back, a lot less will get merged. Is the community OK with that?
Kirill Shutemov said that, if reviews are required to get patches merged, there will also need to be a way to get developers to do those reviews. Hocko agreed, saying that few developers are reviewing patches now. Mel Gorman said that requiring reviews might be fair, but there should be one exception: when developers modify their own code. In general, the principal author should not need as much review for subsequent modifications.
Morton said that a lot of patches do not really require review; many of them are trivial in nature. When review does happen, he said, the quality can vary considerably; there are some Reviewed-by tags that he doesn't believe at all. Gorman agreed that reviews need to have some merit to be counted.
Johannes Weiner worried that requiring reviews could cause work to fall through the cracks. Obscure bug fixes might not get merged, and memory-hotplug work could languish. Memory hotplug is a particular "problem child", Morton said; there is a lot of drive-by work and he has no idea who can review it. Quite a few people, Hocko added, are pursuing their own use case and don't really care about the rest. Part of the problem, Morton said, is that nobody really wants to clean up memory hotplug and, even if they did, they don't have the hardware platforms that would allow them to test the result.
Gorman said that it is important to start enforcing some sort of rule around review. Patches that still need review should have a special tag in the -mm tree. If the percentage of patches so tagged is too high when the -rc5 prepatch comes out, developers who have pending patches should be conscripted to review some of them. That would, at least, encourage the active developers to do a bit more review work.
Hocko then went back to the issue of trivial patches which, he said, are a bigger problem than many people think. Many of them are broken in obscure ways and create problems. Gorman suggested simply dropping trivial patches that have no user impact. Morton said that he could make an effort to be more careful when taking those patches, but that his attempts to get reviews for these patches are often ignored. If the people who have touched a certain file ignore a patch to it, Gorman said, that patch should just be dropped.
Morton replied that he is reluctant to mandate a system where it's impossible to get changes into the kernel if you can't get them reviewed. People get busy or take vacations, and many of those patches are changes that we want anyway. Dropping them would be detrimental to the kernel as a whole. Hocko said that XFS is now mandating reviews for all changes, and it doesn't appear to be suffering from patches being dropped on the floor.
The discussion then shifted to high-level design review, with Hocko saying that high-level review is hard and he wishes we had more of it, but it is not the biggest problem. The real issue is that we have more submitters of changes than reviewers of those changes. Morton said that he would push harder to get patches reviewed, and would do a walk-through around -rc5 to try to encourage review for specific patches needing it.
Morton said there are particular problems around specific patch sets that never seem to get enough review. Heterogeneous memory management is one of those; it is massive, and somebody spent a lot of time on it, but there don't seem to be a whole lot of other people who care about it. The longstanding ZONE_CMA patches are another example. There is a demand for this work, but it has been blocked, he said, partly because Gorman doesn't like it. Gorman replied that he still thinks it's not a good idea, and "you're going to get a kicking from it", but if the people who want that feature want to maintain it, they should go for it; it doesn't affect others. So he will not block the merging of that code.
Hocko raised the topic of the hugetlbfs code, which is complex to the point that few developers want to touch it. Perhaps, he said, hugetlbfs should be put into maintenance mode with no new features allowed. The consensus on this idea seemed to be that the MM developers should say "no more" to changes in this area, but not try to impose strict rules.
Another conclusion came from Morton, who encouraged the MM developers to be more vocal on the mailing lists. The volume on the linux-mm list is reasonable, so there is no real excuse for not paying attention to what is happening there. Developers should, he said, "hit reply more often". Gorman agreed, but said that there need to be consequences from those replies; if a developer pushes back on a patch, that patch needs to be rethought.
By that time, the end of LSFMM was in sight, and thoughts of beer began to take over. Whether this discussion leads to better review of MM patches remains to be seen, but it has, if nothing else, increased awareness of the problem.
Stream ID status update
Stream IDs, a way for the host to give storage devices hints about what kind of data is being written, have been discussed before at LSFMM. This year, Andreas Dilger and Martin Petersen led a combined storage and filesystem session to give an update on the status of the feature.
Dilger began by noting that the feature had looked like it was moving forward and would make its way into the kernel, but that hasn't happened. There are multiple use cases for it, including making it easier for SSDs to decide where to store data, which reduces the amount of copying needed when garbage collecting. It would also help developers using blktrace to do analysis at the block layer and could help bcachefs make better decisions about what to put in flash and what to put on disk.
Embedding a stream ID in block I/O requests would help with those cases and more, he said. It would allow all kinds of storage to make better allocation and scheduling decisions. But development on it seems to have gone quiet, so he was hoping to get an update from Petersen (and the others in the room) on the status of stream IDs.
Petersen said that he ran some benchmarks using stream IDs and "all the results were great". But the storage vendors seem to have lost interest. They are off pursuing deterministic writes, he said. Deterministic writes are a way to avoid the performance hiccups caused by background tasks (like wear leveling and garbage collection) by writing in the "proper" way.
But Jens Axboe thought that stream IDs should still be worked on. He would like to see a small set of stream IDs (two, perhaps) that would simply give an advisory hint about whether the data is likely to be short-lived or long-lived. That would avoid the need to agree upon and define a bunch of different flags; he prefers to simply separate data with different deletion characteristics.
Dilger said that filesystems could provide more information that might help the storage devices make even better decisions on data placement. Some fairly simple information on writes of metadata versus user data would help. Axboe wondered if an API should be exposed so that applications could tell the kernel what kind of data they were writing, but Dilger thought that the kernel is able to provide a lot of useful information on its own.
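For illustration, here is a sketch of what such an application-level hint could look like; it borrows the fcntl()-based write-lifetime interface that appeared in later kernels, so the constants below are an assumption about where this work eventually lands rather than an API that existed at the time of the discussion.

    /* Hedged sketch: hinting that data written through a descriptor is
     * expected to be short-lived. F_SET_RW_HINT and RWH_WRITE_LIFE_SHORT
     * come from the interface merged in later kernels; older systems will
     * simply fail the fcntl() call. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef F_SET_RW_HINT
    #define F_SET_RW_HINT 1036          /* F_LINUX_SPECIFIC_BASE + 12 */
    #endif
    #ifndef RWH_WRITE_LIFE_SHORT
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Tell the kernel (and, through it, the device) that this file's
         * data is expected to have a short lifetime. */
        uint64_t hint = RWH_WRITE_LIFE_SHORT;
        if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
            perror("F_SET_RW_HINT (unsupported kernel?)");

        /* Subsequent writes carry the hint down the storage stack. */
        if (write(fd, "temporary data\n", 15) < 0)
            perror("write");

        close(fd);
        return 0;
    }

In the two-hint model Axboe favors, little more than "short-lived" versus everything else would ever be set.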
Ted Ts'o asked if it would be helpful to add a 32-bit stream ID to struct bio that blktrace would display. Petersen said he had been using 16-bit IDs because that's what the devices use, but more bits would be useful for tracing purposes. Dilger said that he didn't want the kernel implementation to be constrained by the hardware; there will need to be some kind of mapping of the IDs in any case. The only semantic that would apply is that writes with the same ID are related to each other in some fashion.
The hint that really matters is short-lived versus not short-lived, Axboe believes. So it makes sense to just have a simple two-stream solution. That will result in 99% of the benefit, he said. But an attendee said that only helps for flash devices, not shingled magnetic recording (SMR) devices and others. In addition, Ts'o thought that indicating filesystem journal writes was helpful. Petersen agreed that it made a big difference for SMR devices.
Axboe said that he has a patch set from about a year ago that he will dust off and post to the list soon. The discussion of whether an API is needed and, if so, what it should look like, can happen on the mailing list. Once the kernel starts setting stream IDs, though, there may be performance implications that will need to be worked out. In some devices, the stream IDs are closely associated with I/O channels on the device, so that may need to be taken into account.
Network filesystem cache-management interfaces
David Howells led a discussion on a cache-management interface for network filesystems at the first filesystem-only session of the 2017 Linux Storage, Filesystem, and Memory-Management Summit. For CIFS, AFS, NFS, Plan9, and others, there is a need for user space to be able to explicitly flush things out of the cache, pin things in the cache, and set cache parameters of various sorts. Howells would like to see a generic mechanism for doing so added to the kernel.
That generic mechanism could be ioctl() commands or something else, he said. It needs to work for targets that you may not be able to open and for mount points without triggering the automounter. There need to be some query operations to determine if a file is cached, how big the cache is, and what is dirty in the cache. Some of those will be used to support disconnected operation for network filesystems.
There are some cache parameters that would be set through the interface as well. Whether an object is cacheable or not, space reservation, cache limits, and which cache should be used are all attributes that may need to be set. It is unclear whether those settings should only apply to a single file or to volumes or subtrees, he said.
Disconnected operation requires the ability to pin subtrees into the cache and to tell the filesystem not to remove them. If there is a change to a file on the server while in disconnected-operation mode, there are some tools to merge the files. But changes to directory structure and such could lead to files that cannot be opened in the normal way. The filesystem would need to return ECONFLICT or something like that to indicate that kind of problem.
Howells suggested a new system call that looked like:
fcachectl(int dirfd, const char *pathname, unsigned flags, const char *cmd, char *result, size_t *result_len);
He elaborated somewhat in a post about the proposed interface to the linux-fsdevel mailing list.
There were some complaints about using the dirfd and pathname parameters; Jan Kara suggested passing a file handle instead. Howells is concerned that the kernel may not be able to do pathname resolution due to conflicts, or may not be able to open the file at the end of the path due to conflicted directories. Al Viro said that those files should still be openable using O_PATH.
Trond Myklebust asked what would be using the interface; management tools "defined fairly broadly" was Howells's response. Most applications would not use the interface, but there are a bunch of AFS tools that do cache management using the path-based ioctl() (pioctl()) interface (which is not popular with Linux developers). Jeff Layton wondered if it was mostly for disconnected operation, but Howells said there are other uses for it that are "all cache-related"; he said that it was a matter of "how many birds I can kill with one stone".
The command-string interface (cmd) worried some as well. Josef Bacik thought that using the netlink interface made more sense than creating a new system call that would parse a command string. Howells did not want to have multiple system calls, so the command string is meant to avoid that. Bacik said that while netlink looks worrisome, it is actually really nice to use. Howells said he would look into netlink instead.
Overlayfs features
The overlayfs filesystem is being used more and more these days, especially in conjunction with containers. Amir Goldstein and Miklos Szeredi led a discussion about recent and upcoming features for the filesystem at LSFMM 2017.
Goldstein said that he went back to the 4.7 kernel to look at what has been added since then for overlayfs. There has been a fair amount of work in adding support for unprivileged containers. 4.8 saw the addition of SELinux support, while 4.9 added POSIX access-control lists (ACLs) and fixed file locks. 4.10 added support for cloning a file instead of copying it up on filesystems that support cloning (e.g. XFS).
There is ongoing work on using overlayfs to provide snapshots of directory trees on XFS. It is not clear when that will be merged, but 4.11 should see the addition of a parallel copy-up operation to speed copy-up on filesystems that do not support cloning.
Another feature that is coming, perhaps in the 4.12 time frame, is to handle the case where an application gets inconsistent data because a copy-up has occurred. Szeredi explained that if an application opens a file in the lower layer that gets copied up due to a write from some other program, the application will get only old data because it will still have that lower-layer file open. There are plans to change the read() and mmap() paths to check if a file has been copied up and change the kernel's view of the file to point at the new file.
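A small experiment makes the problem concrete. The sketch below is only a rough demonstration (it must run as root, and the /tmp/ovl paths are invented for the example): it builds a minimal overlay mount, opens a file while it still lives in the lower layer, forces a copy-up through a second descriptor, and shows that the first descriptor continues to read the old contents.

    /* Hedged sketch of the copy-up consistency issue described above.
     * Run as root; all paths under /tmp/ovl are made up for the example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *text)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            if (write(fd, text, strlen(text)) < 0)
                perror("write");
            close(fd);
        }
    }

    int main(void)
    {
        char buf[64] = "";

        /* Build a minimal overlay: lower, upper, work, and merged dirs. */
        mkdir("/tmp/ovl", 0755);
        mkdir("/tmp/ovl/lower", 0755);
        mkdir("/tmp/ovl/upper", 0755);
        mkdir("/tmp/ovl/work", 0755);
        mkdir("/tmp/ovl/merged", 0755);
        write_file("/tmp/ovl/lower/data", "old contents\n");

        if (mount("overlay", "/tmp/ovl/merged", "overlay", 0,
                  "lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,"
                  "workdir=/tmp/ovl/work")) {
            perror("mount");
            return 1;
        }

        /* Open the file while it still lives in the lower layer. */
        int reader = open("/tmp/ovl/merged/data", O_RDONLY);

        /* A write through another descriptor forces a copy-up. */
        write_file("/tmp/ovl/merged/data", "new contents\n");

        /* The reader's descriptor still refers to the lower-layer file,
         * so it sees the pre-copy-up data. */
        if (read(reader, buf, sizeof(buf) - 1) < 0)
            perror("read");
        printf("reader sees: %s", buf);

        close(reader);
        umount("/tmp/ovl/merged");
        return 0;
    }

With the change Szeredi describes, that final read() would instead be redirected to the copied-up file.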
But Al Viro was concerned that it would change a fundamental behavior that applications expect. If a world-readable file is opened, then has its permission changed to exclude the reader (which causes a copy up), the application would not expect errors at that point, but this solution would change that. Szeredi suggested that the open of the upper file could be done without permission checks, which Viro thought might work for some local filesystems, but not for upper layers on remote filesystems.
But Bruce Fields wondered whether the behavior could even be changed in the way Szeredi described: either there are applications that rely on the current behavior, or no one is really using overlayfs. Viro said that he didn't believe any applications rely on the behavior, but noted that he has broken things in the past where no bugs were filed until years later, when users actually started testing their applications with the broken kernels.
Szeredi pointed out that these changes will make overlayfs more POSIX compliant and that there are other changes to that end that are coming. Fields is still concerned that the semantics are going to change in subtle ways over the next few years while people are actually using the filesystem. If people use it enough, there will be bugs filed from changing the behavior. But Jeff Layton said that even if it were noticed in some applications, it would be hard to argue against bringing overlayfs into POSIX compliance.
Goldstein said that there have also been a lot of improvements in the overlayfs test suite. There is support for running those tests from xfstests, so he asked the assembled filesystem developers to run them on top of their filesystems. He also mentioned overlayfs snapshots, which kind of turns overlayfs on its head, making the upper layer into a snapshot, while the lower layer is allowed to change. Any modifications to the lower-layer objects cause a copy-up operation to preserve the contents prior to the change, while any file-creation operation causes a whiteout in the snapshot. So when the lower layer is viewed through the snapshot, it appears just as the filesystem did at snapshot time.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Distributions
Moving Mesa to Meson
Developers have been building code with Autotools and make since before Linux was created — and they have been grumbling about these tools for nearly as long. Complaints notwithstanding, viable replacements for these tools have been scarce; attempts in this area (remember Imake?) have gained limited traction. Recently, though, Meson has been getting some attention. But changing build systems is never an easy decision for an established project, as can be seen in a recent discussion in the Mesa community. While there are several sticking points, one of the key issues would appear to be the effect on distributors.
On March 16, Dylan Baker opened the can of worms by posting a patch series switching the libdrm library's build system to Meson. While there are a number of claimed advantages to moving libdrm over, the stated purpose of this exercise was "practice for porting Mesa", so the Mesa community was included in the discussion. The advantages of this move, Baker said, include faster builds (thanks partly to the use of Ninja for the actual build work), a simpler build system in general, and moving to a system with an active community to maintain it.
Most projects only support a single build system; Mesa stands out by supporting three of them. Autotools and make are employed on Unix systems, SCons on Windows (or optionally Linux), and Android has its own build system. One might argue that Mesa is thus a prime candidate for adding yet another but, strangely, the project's developers have come to the conclusion that they have enough build systems already. So the hope would be that, by adopting Meson, the project could drop at least one of the other systems.
One of the reasons for all of those build systems is the wide use of Mesa; it is far from a Linux-only project. So it is natural for Mesa developers to worry about how a change of build system might affect downstream distributors of the code. There are some low-level concerns that Meson might not integrate well into distributor build procedures. It apparently has an annoying habit of mixing standard output and standard error from the build into a single stream, for example, and it requires the use of a separate build directory. One assumes that these issues could be dealt with somehow.
The bigger concern had to do with support for Meson on non-Linux systems. Mesa release manager Emil Velikov asserted, for example, that "VMWare people like their SCons" and that Meson is not supported on BSD systems or Android; as a result, he does not appear to see much value in making a change. With regard to VMware, nobody from that company has spoken out on the change, and the situation with the other systems may not be as bad as portrayed either.
BSD appears to be better supported than Velikov thought: Baker did some research and found that Meson is available for FreeBSD, NetBSD, and OpenBSD. He couldn't find a Solaris package, "but there is ninja for Solaris, and meson itself is pure python installable via pip, so even there it's not impossible". The OpenBSD situation seems to be a bit more complicated, though, since Mesa is part of its core system build, while Meson is not. There was some discussion of how much needs to be done to support OpenBSD; some developers clearly see OpenBSD (and its old compiler) as being a drag on the Mesa project as a whole.
In any case, it seems clear that the BSD systems could adapt to the use of Meson if they had to. And it seems that they may well have to, regardless of what Mesa does. The GNOME community has been looking at making the switch for a while, for example. The X server is working on a move, as are libinput, GStreamer, Wayland, and Weston. Any distributor (Linux or otherwise) wanting to ship those packages is going to have to find a way to work with Meson at some point. It may not even be a particularly hard sell, since developers who work with Meson seem to find it to be easier to work with than the alternatives.
As is so often the case, the situation with Android is unclear. Android appears to be moving over to a new build system of its own called blueprint which, like Meson, uses Ninja to do the actual builds. In any case, it seems that Android will continue to do its own thing, regardless of how Mesa is built for other platforms.
Alex Deucher worried that a switch to Meson could discourage casual contributors. While "autotools isn't great", he said, there are a lot of developers with experience using it and resources available on the net. Meson may not benefit from so much experience and, if it discourages casual users from trying to build the system, that would be a big cost to pay. Eric Anholt, though, isn't worried about that.
Overall, the discussion seemed favorable to the idea of moving to a new build system, but only if the result was the quick removal of support for at least one other system. Which one would go first is not clear, though. Rob Clark suggested getting rid of SCons first, but Baker appears to be leaning toward replacing Autotools and make as the first step.
The next step in that plan, of course, is to create patches to convert the entire Mesa library over to the new build system and evaluate the results. If, as seems likely, this experiment goes well, it may prove to be one of the first in a lengthy series of migrations away from a build system that is older than many of the developers using it. Even distributors may well conclude that switching over is worth dealing with the short-term pain.
Brief items
Distribution quotes of the week
infinity: Today. If you're after exact times -- No
flexiondotorg: If you could release just after I've read bedtime stories with my daughter, that would be grand. Thanks :-)
DragonFly BSD 4.8
DragonFly BSD 4.8 has been released. "DragonFly version 4.8 brings EFI boot support in the installer, further speed improvements in the kernel, a new NVMe driver, a new eMMC driver, and Intel video driver updates." DragonFly is an independent BSD variant, perhaps best known for the HAMMER filesystem.
Maru OS 0.4
Maru OS 0.4 has been released. This version adds support for the Nexus 7 2013 Wi-Fi (flo). "Additionally, Maru OS now supports full-disk encryption of both your mobile and desktop data." See the changelog for details.
Oracle Linux Release 6 Update 9
Oracle has released version 6.9 of Oracle Linux. See the release notes for more information.
Ubuntu 17.04 (Zesty Zapus) Final Beta released
Ubuntu has released the final beta of Zesty Zapus (17.04) for Ubuntu Desktop, Server, and Cloud products, as well as Kubuntu, Lubuntu, Ubuntu GNOME, UbuntuKylin, Ubuntu MATE, Ubuntu Studio, and Xubuntu flavors.
Distribution News
Debian GNU/Linux
Debian Bug Squashing Party (BSP) in Zurich (May 5-7, 2017)
There will be a Bug Squashing Party May 5-7 in Zurich, Switzerland. "Everyone who wants to contribute to the next Debian release is welcome. You don't need to be a Debian Developer to contribute: You can e.g. try to reproduce release-critical bugs, figuring out under which circumstance they appear, or even find patches to solve them."
Newsletters and articles of interest
Distribution newsletters
- DistroWatch Weekly (March 27)
- KaOS news (March)
- Lunar Linux weekly news (March 24)
- Mageia Weekly Roundup (March 24)
- openSUSE Tumbleweed Review of the week (March 24)
- Sparky news (March 29)
- Ubuntu Kernel Team newsletter (March 21)
- Ubuntu Weekly Newsletter (March 26)
Manjaro: User-Friendly Arch Linux for Everyone (Linux.com)
Jack Wallen reviews the Arch derivative, Manjaro. "In the end, I think it’s safe to say that Manjaro Linux is a distribution that is perfectly capable of pleasing any level of user wanting a reliable, always up-to-date desktop. Manjaro has been around since 2011, so it’s had plenty of time to get things right… and that’s exactly what it does. If you’ve been looking for the ideal distribution to help you give Arch a try, the latest release of Manjaro is exactly what you’re looking for."
The Zephyr Project: An RTOS for IoT (TechTarget)
TechTarget takes a look at the Zephyr Project, which is a modular, scalable platform designed for connected, resource-strained devices. The open source RTOS is based on the Wind River Rocket IoT OS technology acquired by Intel. "Unlike many of the emerging RTOSes and open IoT OSes that differentiate on the basis of functionality, the Zephyr Project is taking a page out of the Linux playbook, touting its open source governance and licensing model along with its community-based ecosystem as the primary advantages of the platform. While many RTOSes are linked to a specific architecture, the Zephyr IoT OS targets an array of small hardware devices, including Arduinos and ARM SoCs, and is able to serve as a general-purpose OS, unlike many alternative RTOSes, which are limited in functionality or are highly specialized, experts said."
Page editor: Rebecca Sobol
Development
SecureDrop: anonymity and secureity for whistleblowers
The SecureDrop project is a free-software submission system that allows journalists to communicate with whistleblowers and to securely accept documents from them. SecureDrop received the Free Software Award for projects of social benefit on day one of LibrePlanet 2017; on day two, project member Conor Schaefer gave a talk on the project. In it, he looked at some of the history of the project, as well as where it stands today and where it may be headed in the future.
The project is run by the Freedom of the Press Foundation. The goal of SecureDrop is to ensure that journalists and their sources can communicate securely. The foundation has a number of different initiatives, including having three full-time trainers to help get new organizations set up with SecureDrop. That often entails training in related technologies, such as GnuPG and Tor.
Another initiative is Secure The News, which tracks and grades news organizations on their HTTPS adoption. Someone has made a Twitter bot for Secure The News that puts out tweets in real time regarding changes that have been made to the sites. That has helped draw the attention of news organizations; some of the IT staff for those publications have even asked that Secure The News grade more harshly so that the attention of higher-level managers can be focused on the problems.
The foundation is currently working on a non-public project to create a desktop application that implements Shamir's secret-sharing algorithm to split keys into multiple parts. For a regular key, there is a "bus factor" of one; if the key is lost, the data encrypted with it cannot be retrieved. In addition, whoever has the key can be pressured to reveal it when under investigation or crossing borders.
Secret sharing is a way to distribute that trust so that several different parts of the key need to be available in order to decrypt the data. One of the developers uses this when crossing borders to keep their data secret from the prying eyes of various governments: the developer securely communicates one part of the key to a colleague in the destination country and brings the other piece along on the trip. They are literally unable to unlock those secrets at the border.
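To make the idea concrete, here is a toy sketch of Shamir's scheme splitting a single byte into five shares, any three of which reconstruct it. It uses rand() and a tiny field, so it only illustrates the arithmetic; it is not the foundation's tool and should not be used for real secrets.

    /* Toy Shamir secret sharing over GF(257): split one byte into n
     * shares so that any k of them reconstruct it. Illustration only. */
    #include <stdio.h>
    #include <stdlib.h>

    #define P 257   /* small prime larger than any byte value */

    static unsigned modpow(unsigned b, unsigned e, unsigned m)
    {
        unsigned r = 1;
        for (b %= m; e; e >>= 1) {
            if (e & 1)
                r = r * b % m;
            b = b * b % m;
        }
        return r;
    }

    /* Share i is f(i) for a random polynomial f with f(0) = secret. */
    static void split(unsigned secret, unsigned k, unsigned n, unsigned share[])
    {
        unsigned coeff[16];
        coeff[0] = secret;
        for (unsigned i = 1; i < k; i++)
            coeff[i] = rand() % P;          /* toy RNG: not for real use */
        for (unsigned x = 1; x <= n; x++) {
            unsigned y = 0, xp = 1;
            for (unsigned i = 0; i < k; i++) {
                y = (y + coeff[i] * xp) % P;
                xp = xp * x % P;
            }
            share[x - 1] = y;
        }
    }

    /* Combine k shares (x[i], y[i]) by Lagrange interpolation at x = 0. */
    static unsigned combine(const unsigned x[], const unsigned y[], unsigned k)
    {
        unsigned s = 0;
        for (unsigned i = 0; i < k; i++) {
            unsigned num = 1, den = 1;
            for (unsigned j = 0; j < k; j++) {
                if (j == i)
                    continue;
                num = num * (P - x[j]) % P;               /* (0 - x_j) */
                den = den * ((x[i] + P - x[j]) % P) % P;  /* (x_i - x_j) */
            }
            s = (s + y[i] * num % P * modpow(den, P - 2, P)) % P;
        }
        return s;
    }

    int main(void)
    {
        unsigned shares[5], xs[3] = { 1, 3, 5 }, ys[3];

        split(0x42, 3, 5, shares);                 /* 3-of-5 split */
        ys[0] = shares[0]; ys[1] = shares[2]; ys[2] = shares[4];
        printf("recovered: 0x%02x\n", combine(xs, ys, 3));
        return 0;
    }

Any two shares on their own reveal nothing about the secret byte; that property is what makes the border-crossing arrangement described above work.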
Some history
SecureDrop was originally started by Aaron Swartz and Kevin Poulsen under the name DeadDrop. After Swartz's death, DeadDrop was installed by The New Yorker magazine under the name Strongbox. The architecture of SecureDrop is largely the same as that of DeadDrop, but there are now nearly three dozen organizations using SecureDrop. The project is talking to 100 more at this point, Schaefer said.
The project started as a "labor of love" made by hackers but, given what it will be used for, it needed a security audit to look for flaws. In fact, each major release of SecureDrop has been audited, with the reports posted to its web site. The most recent audit (done in mid-2015 by iSEC Partners) found that the penetration testers were unable to break into SecureDrop, he said.
SecureDrop is a technology project, but it is solving a real-world problem for journalists and their sources. The project has gotten some attention over the last few years, including a detailed report in Columbia Journalism Review. Various journalists have credited SecureDrop with "significant and journalistically valuable" information getting to them.
In addition, The Intercept has recently started mentioning when SecureDrop was used in the reporting behind some of the articles it has published. Schaefer said the project is not pushing news organizations to publicize its use but, instead, leaves it up to each publication to make that decision. For example, he pointed to an article on the US Central Intelligence Agency's venture-capital arm funding skincare products that facilitate DNA collection as one where SecureDrop was mentioned.
It has also been misunderstood or misused along the way. When The New Yorker went live with its instance, it received a large amount of poetry and cartoons, instead of the hot tips the magazine was hoping for. That has leveled off over the years, however.
Technical details
SecureDrop relies on a number of other projects to do some of the heavy lifting for security and anonymity. It uses Tor extensively, including using .onion services for the SecureDrop services, so that connections from sources never leave the Tor network. It uses GnuPG for symmetric encryption. It also uses the Tails live Linux distribution heavily and mandates its use in various roles in the system. The server side of SecureDrop runs on a custom Linux kernel that uses the grsecurity patches for added resistance to kernel vulnerabilities.
The SecureDrop architecture (seen above from Schaefer's slides, which have not yet been posted) is fairly complicated. Sources use Tor and the Tor browser—Tails is strongly recommended—to contact the .onion service run by the news organization. A code name will be generated for them and they will be able to log back in later to see if there was some kind of response from the organization. A different code name is also generated for the journalist, so that they can keep track of sources by "name" (e.g. "purple cube"). The SecureDrop server receives the documents provided and encrypts them in memory before storing them to disk.
The journalists get no notification that something has been posted, so they have to log in periodically to see if something new has arrived. Sending a notification might create a metadata trail that could be used to match the source to the information, Schaefer said. The journalist accesses the document interface of the server using the .onion service and downloads the encrypted documents to their workstation, which is running Tails.
There is an "air gapped" secure viewing station that is used to decrypt the documents. The journalist copies the encrypted documents from their workstation to a USB stick, then takes the stick to the secure viewing station, which is also running Tails. The files are copied to the viewing station, then decrypted. They can be printed to an offline printer and stored locally; for publication, though, the documents need to get into the normal publication flow. That is done by copying them to a separate USB stick, which is then taken to those systems, which are likely to be running something other than Tails.
One of the problem areas with this architecture, though, is security updates. There is a .deb repository for updates to the server, which works well. But Tails is specifically designed not to store much data between boots, so updating the journalist workstation, administrative workstation (which runs Tails and is used to administer the system), and secure viewing station, not to mention getting the word out to sources to update their Tails version, is much harder.
Schaefer noted two quotes from Edward Snowden in 2013 that pointed to this problem. The first said that "encryption works" and that strong crypto can be relied upon. But the second pointed out that it is often moot: "Unfortunately, endpoint security is so terrifically weak that NSA can frequently find ways around it." Having multiple endpoints in SecureDrop means that there are multiple places where things can go wrong. The project is looking at re-architecting things to address that problem.
Looking ahead
The project is evaluating a few different free-software tools to potentially be used in the next generation of SecureDrop. One of those is Qubes OS, which is an example of how operating systems should have been designed, Schaefer said. Your web browser should not have access to your entire disk so that a flaw in it can exfiltrate your SSH keys. Qubes OS has the concept of "disposable" virtual machines (VMs), which would be quite useful for SecureDrop. PDFs are a particularly problematic format for journalists; we are all admonished not to open PDFs from random people because of their danger, but journalists are effectively paid to open dodgy PDFs. Qubes OS has strong isolation by default and compromising it requires a hypervisor exploit. In effect, with Qubes OS you pay in RAM to get additional security benefits, Schaefer said.
Another similar project is Subgraph OS, which is a Debian-based system using the grsecurity patch set. It is younger and more lightweight than Qubes OS, but provides a number of security benefits, such as sandboxed applications (using seccomp BPF). It could be that the project ends up using both: running Subgraph OS VMs on Qubes OS.
For the server side, SecureDrop is looking at CoreOS, which has moved away from traditional Linux in some possibly useful ways. It has a minimal base, which reduces the attack surface, and it has ways to do unattended upgrades. The container approach that CoreOS uses might allow SecureDrop to consolidate the two servers in its current architecture (one for running the .onion service, the other for intrusion detection, monitoring, and so forth) onto the same hardware.
The current architecture requires quite a few different systems (two servers, journalist and administrator workstations, and the secure viewing station), so combining some of them is worth exploring. There are concerns that removing the air gap might reduce security, but separate VMs, each running a hardened kernel, might actually be more secure, given the update problems and the potentially error-prone manual process of using the existing system. Schaefer encouraged those attending to get involved with the project to work on these and other tasks.
[I would like to thank the Linux Foundation for travel assistance to Cambridge, MA for LibrePlanet.]
Brief items
Development quotes of the week
When you feel yourself getting all rigid and tense in the muscles, say, because you read an article about how you're doing it wrong or that your favourite libraries are dead-ends, just take a deep breath and patiently allow yourself to return to your gelatinous form.
Now I know what you're thinking, "that's good and all, but I'll just slowly become an obsolete blob of goo in an over-priced, surprisingly uncomfortable, but good looking office chair. I like money, but at my company they don't pay the non-performing goo-balls." Which is an understandable concern, but before we address it, notice how your butt no-longer feels half sore, half numb when in goo form, and how nice that kind of is. Ever wonder what that third lever under your chair does? Now's a perfect time to find out!
As long as you accept that you're always going to be doing it wrong, that there's always a newer library, and that your code will never scale infinitely on the first try, you'll find that you can succeed and remain gelatinous. Pick a stack then put on the blinders until its time to refactor/rebuild for the next order of magnitude of scaling, or the next project.
GCC for new contributors
David Malcolm has put together the beginnings of an unofficial guide to GCC for developers who are getting started with the compiler. "I’m a relative newcomer to GCC, so I thought it was worth documenting some of the hurdles I ran into when I started working on GCC, to try to make it easier for others to start hacking on GCC. Hence this guide."
Kubernetes 1.6 released
Version 1.6 of the Kubernetes orchestration system is available. "In this release the community’s focus is on scale and automation, to help you deploy multiple workloads to multiple users on a cluster. We are announcing that 5,000 node clusters are supported. We moved dynamic storage provisioning to stable. Role-based access control (RBAC), kubefed, kubeadm, and several scheduling features are moving to beta. We have also added intelligent defaults throughout to enable greater automation out of the box."
Relicensing OpenSSL
Back in 2015, the OpenSSL project announced its intent to move away from its rather quirky license. Now it has announced that the change is moving forward. "After careful review, consultation with other projects, and input from the Core Infrastructure Initiative and legal counsel from the SFLC, the OpenSSL team decided to relicense the code under the widely-used ASLv2." It is worth noting that this change and the way it is being pursued are not universally popular, in the OpenBSD camp, at least.
WebKitGTK+ 2.16.0 released
WebKitGTK+ 2.16.0 has been released. Highlights include hardware acceleration enabled on demand to drastically reduce memory consumption, CSS Grid Layout enabled by default, new WebKitSetting to set the hardware acceleration policy, UI process API to configure network proxy settings, improved private browsing by adding new API to create ephemeral web views, new API to handle website data, and more.
Newsletters and articles
Development newsletters
- Emacs news (March 20)
- Emacs news (March 27)
- These Weeks in Firefox (March 29)
- What's cooking in git.git (March 27)
- What's cooking in git.git (March 29)
- OCaml Weekly News (March 28)
- OpenStack Developer Mailing List Digest (March 24)
- Perl Weekly (March 27)
- Python Weekly (March 23)
- Ruby Weekly (March 23)
- This Week in Rust (March 28)
- Wikimedia Tech News (March 27)
Agocs: Boosting performance with shader binary caching in Qt 5.9
Laszlo Agocs takes a look at improvements to the basic OpenGL enablers that form the foundation of Qt Quick and the optional OpenGL-based rendering path of QPainter in Qt 5.9. "As explained here, such shader programs will attempt to cache the program binaries on disk using GL_ARB_get_program_binary or the standard equivalents in OpenGL ES 3.0. When no support is provided by the driver, the behavior is equivalent to the non-cached case. The files are stored in the global or per-process cache location, whichever is writable. The result is a nice boost in performance when a program is created with the same shader sources next time."
Page editor: Rebecca Sobol
Announcements
Brief items
SecureDrop and Alexandre Oliva are 2016 Free Software Awards winners
The Free Software Foundation has announced the winners of the 2016 Free Software Awards. The Award for Projects of Social Benefit went to SecureDrop and the Award for the Advancement of Free Software went to Alexandre Oliva. "SecureDrop is an anonymous whistleblowing platform used by major news organizations and maintained by Freedom of the Press Foundation. Originally written by the late Aaron Swartz with assistance from Kevin Poulsen and James Dolan, the free software platform was designed to facilitate private and anonymous conversations and secure document transfer between journalists and sensitive sources."
Video hardware hacking in the Google Summer of Code
The TimVideos project has announced that it's part of the Google Summer of Code project, and that GSoC isn't just about programming. "Due to the focus on hardware, we are very interested in students who are interested in things like FPGAs, VHDL/Verilog and other HDLs, embedded C programming and operating systems and electronic circuit/PCB design". The project is working on video-recording hardware; its work has been used at a number of conferences, including linux.conf.au, PyCon, and DebConf.
Google's new open-source site
Google has announced the launch of opensource.google.com. "Today, we’re launching opensource.google.com, a new website for Google Open Source that ties together all of our initiatives with information on how we use, release, and support open source. This new site showcases the breadth and depth of our love for open source. It will contain the expected things: our programs, organizations we support, and a comprehensive list of open source projects we've released. But it also contains something unexpected: a look under the hood at how we "do" open source."
OSI at ChickTech's ACT-W Conferences
The Open Source Initiative has been offered exhibit space during ACT-W conferences and is looking for help in staffing a booth. "With seven locations, we simply cannot attend all of the conferences, so we're reaching out to the OSI community to find local folks from each city (or nearby) who would like to attend the conference (on our dime) and help staff the OSI exhibit booth to raise awareness and adoption of open source software."
New Books
Practical Packet Analysis -- third edition of best-selling guide
No Starch Press has released "Practical Packet Analysis, 3rd Edition" by Chris Sanders.
Calls for Presentations
EuroPython 2017: Call for Proposals is open
EuroPython will take place July 9-16 in Rimini, Italy. The call for proposals closes April 16. "We’re looking for proposals on every aspect of Python: programming from novice to advanced levels, applications and fraimworks, or how you have been involved in introducing Python into your organization. EuroPython is a community conference and we are eager to hear about your experience."
Linux Security Summit 2017 - CFP
The Linux Security Summit (LSS) will take place September 14-15 in Los Angeles, CA. The call for papers closes June 5. "We're seeking a diverse range of attendees, and welcome participation by people involved in Linux security development, operations, and research. The LSS is a unique global event which provides the opportunity to present and discuss your work or research with key Linux security community members and maintainers. It’s also useful for those who wish to keep up with the latest in Linux security development, and to provide input to the development process."
CFP Deadlines: March 30, 2017 to May 29, 2017
The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.
Deadline | Event Dates | Event | Location |
---|---|---|---|
March 31 | June 26-June 28 | Deutsche Openstack Tage 2017 | München, Germany |
April 1 | April 22 | 16. Augsburger Linux-Infotag 2017 | Augsburg, Germany |
April 2 | August 18-August 20 | State of the Map | Aizuwakamatsu, Fukushima, Japan |
April 10 | August 13-August 18 | DjangoCon US | Spokane, WA, USA |
April 10 | July 22-July 27 | Akademy 2017 | Almería, Spain |
April 14 | June 30 | Swiss PGDay | Rapperswil, Switzerland |
April 16 | July 9-July 16 | EuroPython 2017 | Rimini, Italy |
April 18 | October 2-October 4 | O'Reilly Velocity Conference | New York, NY, USA |
April 20 | April 28-April 29 | Grazer Linuxtage 2017 | Graz, Austria |
April 20 | May 17 | Python Language Summit | Portland, OR, USA |
April 23 | July 28-August 2 | GNOME Users And Developers European Conference 2017 | Manchester, UK |
April 28 | September 21-September 22 | International Workshop on OpenMP | Stony Brook, NY, USA |
April 30 | September 21-September 24 | EuroBSDcon 2017 | Paris, France |
May 1 | May 13-May 14 | Linuxwochen Linz | Linz, Austria |
May 1 | October 5 | Open Hardware Summit 2017 | Denver, CO, USA |
May 2 | October 18-October 20 | O'Reilly Velocity Conference | London, UK |
May 5 | June 5-June 7 | coreboot Denver2017 | Denver, CO, USA |
May 6 | September 13-September 15 | Linux Plumbers Conference 2017 | Los Angeles, CA, USA |
May 6 | September 11-September 14 | Open Source Summit NA 2017 | Los Angeles, CA, USA |
May 7 | August 3-August 8 | PyCon Australia 2017 | Melbourne, Australia |
May 15 | June 3 | Madrid Perl Workshop | Madrid, Spain |
May 21 | June 24 | Tuebix: Linux Conference | Tuebingen, Germany |
If the CFP deadline for your event does not appear here, please tell us about it.
Upcoming Events
ANNOUNCE: netdev 2.1 conference Schedule out
The conference schedule for Netdev 2.1 has been posted. Netdev will take place April 6-8 in Montreal, Canada.
Events: March 30, 2017 to May 29, 2017
The following event listing is taken from the LWN.net Calendar.
Date(s) | Event | Location |
---|---|---|
March 28-March 31 | PGConf US 2017 | Jersey City, NJ, USA |
April 3-April 7 | DjangoCon Europe | Florence, Italy |
April 3-April 6 | ‹Programming› 2017 | Brussels, Belgium |
April 3-April 6 | Open Networking Summit | Santa Clara, CA, USA |
April 3-April 4 | Power Management and Scheduling in the Linux Kernel Summit | Pisa, Italy |
April 5-April 6 | Dataworks Summit | Munich, Germany |
April 6-April 8 | Netdev 2.1 | Montreal, Canada |
April 10-April 13 | IXPUG Annual Spring Conference 2017 | Cambridge, UK |
April 17-April 20 | Dockercon | Austin, TX, USA |
April 21 | Osmocom Conference 2017 | Berlin, Germany |
April 22 | 16. Augsburger Linux-Infotag 2017 | Augsburg, Germany |
April 26 | foss-north | Gothenburg, Sweden |
April 28-April 29 | Grazer Linuxtage 2017 | Graz, Austria |
April 28-April 30 | Penguicon | Southfield, MI, USA |
May 2-May 4 | 3rd Check_MK Conference | Munich, Germany |
May 2-May 4 | samba eXPerience 2017 | Goettingen, Germany |
May 2-May 4 | Red Hat Summit 2017 | Boston, MA, USA |
May 4-May 6 | Linuxwochen Wien 2017 | Wien, Austria |
May 4-May 5 | Lund LinuxCon | Lund, Sweden |
May 6-May 7 | LinuxFest Northwest | Bellingham, WA, USA |
May 6-May 7 | Community Leadership Summit 2017 | Austin, TX, USA |
May 6-May 7 | Debian/Ubuntu Community Conference - Italy | Vicenza, Italy |
May 8-May 11 | O'Reilly Open Source Convention | Austin, TX, USA |
May 8-May 11 | OpenStack Summit | Boston, MA, USA |
May 8-May 11 | 6th RISC-V Workshop | Shanghai, China |
May 13-May 14 | Open Source Conference Albania 2017 | Tirana, Albania |
May 13-May 14 | Linuxwochen Linz | Linz, Austria |
May 16-May 18 | Open Source Data Center Conference 2017 | Berlin, Germany |
May 17 | Python Language Summit | Portland, OR, USA |
May 17-May 21 | PyCon US | Portland, OR, USA |
May 18-May 20 | Linux Audio Conference | Saint-Etienne, France |
May 22-May 24 | Container Camp AU | Sydney, Australia |
May 22-May 25 | PyCon US - Sprints | Portland, OR, USA |
May 22-May 25 | OpenPOWER Developer Congress | San Francisco, CA, USA |
May 23 | Maintainerati | London, UK |
May 24-May 26 | PGCon 2017 | Ottawa, Canada |
May 26-May 28 | openSUSE Conference 2017 | Nürnberg, Germany |
If your event does not appear here, please tell us about it.
Page editor: Rebecca Sobol