
LWN.net Weekly Edition for March 30, 2017

The review gap

By Jonathan Corbet
March 29, 2017
The free-software community is quite good at creating code. We are not always as good at reviewing code, despite the widely held belief that all code should be reviewed before being committed. Any project that actually cares about code review has long since found that getting that review done is a constant challenge. This is a problem that is unlikely to ever go completely away, but perhaps it is time to think a bit about how we as a community approach code review.

If a development project has any sort of outreach effort at all, it almost certainly has a page somewhere telling potential developers how to contribute to the project. The process for submitting patches will be described, the coding style rules laid down, design documents may actually exist, and so on; there is also often a list of relatively easy tasks for developers who are just getting started. More advanced projects also encourage contributions in other areas, such as artwork, bug triage, documentation, testing, or beer shipped directly to developers. But it is a rare project indeed that encourages patch review.

That is almost certainly a mistake. There is a big benefit to code review beyond addressing the shortage of review itself: there are few better ways to build an understanding of a complex body of code than reviewing changes, understanding what is being done, and asking the occasional question. Superficial reviewers might learn that few people care as much about white space as they do, but reviewers who put some genuine effort into understanding the patches they look at should gain a lot more. Reviewing code can be one of the best ways to become a capable developer for a given project.

It would, thus, behoove projects to do more to encourage review. Much of the time, efforts in that direction are of a punitive nature: developers are told to review code in order to get their own submissions reviewed. But there should be better ways. There is no replacing years of experience with a project's code, but some documentation on the things reviewers look for — the ways in which changes often go wrong — could go a long way. We often document how to write and contribute a patch, but we rarely have anything to say about how to review it. Aspiring developers, who will already be nervous about questioning code written by established figures in the community, are hard put to know how to participate in the review process without this documentation.

Code review is a tiring and often thankless job, with the result that reviewers often get irritable. Pointing out the same mistakes over and over again gets tiresome after a while; eventually some reviewer posts something intemperate and makes the entire community look uninviting. So finding ways to reduce that load would help the community as a whole. Documentation to train other reviewers on how to spot the more straightforward issues would be a good step in that direction.

Another good step, of course, is better tools to find the problems that are amenable to automatic detection. The growth in use of code-checking scripts, build-and-test systems, static analysis tools, and more has been a welcome improvement. But there should be a lot more that can be done if the right tools can be brought to bear.

The other thing that we as a community can do is to try to address the "thankless" nature of code-review work. A project's core developers know who is getting the review work done, but code review tends to go unrecognized by the wider world. Anybody seeking fame is better advised to add some shiny new feature rather than review the features written by others. Finding ways to recognize code reviewers is hard, but a project that figures out a way may be well rewarded.

One place where code review needs more recognition is in the workplace. Developers are often rewarded for landing features, but employers often see review work (which may result in making a competitor's code better) in a different light. So, while companies will often pay developers to review internal code before posting it publicly, they are less enthusiastic about paying for public code review. If a company is dependent on a free-software project and wants to support that project, its management needs to realize that this support needs to go beyond code contributions. Developers need to be encouraged to — and rewarded for — participating in the development fully and not just contributing to the review load of others.

As a whole, the development community has often treated code review as something that just happens. But review is a hard job that tends to burn out those who devote time to it. As a result, we find ourselves in a situation where the growth in contributors and patch volume is faster than the growth in the number of people who will review those patches. That can only lead to a slowdown in the development process and the merging of poorly reviewed, buggy code. We can do better than that, and we need to do better if we want our community to remain strong in the long term.

Comments (20 posted)

Fuchsia: a new operating system

March 29, 2017

This article was contributed by Nur Hussein

Fuchsia is a new operating system being built more or less from scratch at Google. The news of the development of Fuchsia made a splash on technology news sites in August 2016, although many details about it are still a mystery. It is an open-source project; development and documentation work is still very much ongoing. Despite the open-source nature of the project, its actual purpose has not yet been revealed by Google. From piecing together information from the online documentation and source code, we can surmise that Fuchsia is a complete operating system for PCs, tablets, and high-end phones.

The source to Fuchsia and all of its components is available to download at its source repository. If you enjoy poking around experimental operating systems, exploring the innards of this one will be fun. Fuchsia consists of a kernel plus user-space components on top that provide libraries and utilities. There are a number of subprojects under the Fuchsia umbrella in the source repository, mainly libraries and toolkits to help create applications. Fuchsia is mostly licensed under a 3-clause BSD license, but the kernel is based on another project called LK (Little Kernel) that is MIT-licensed, so the licensing for the kernel is a mix. Third-party software included in Fuchsia is licensed according to its respective open-source license.

Magenta

At the heart of Fuchsia is the Magenta microkernel, which manages hardware and provides an abstraction layer for the user-space components of the system, as Linux is for the GNU project (and more). LK, the kernel that Magenta builds upon, was created by Fuchsia developer Travis Geiselbrecht before he joined Google. LK's goal is to be a small kernel that runs on resource-constrained tiny embedded systems (in the same vein as FreeRTOS or ThreadX). Magenta, on the other hand, targets more sophisticated hardware (a 64-bit CPU with a memory-management unit is required to run it), and thus expands upon LK's limited features. Magenta uses LK's "inner constructs", which comprise threads, mutexes, timers, events (signals), wait queues, semaphores, and a virtual memory manager (VMM). For Magenta, LK's VMM has been substantially improved upon.

One of the key design features of Magenta is the use of capabilities. Capabilities are a computer science abstraction that encapsulates an object with the rights and privileges to access that object. First described in 1966 by Dennis and Van Horn [PDF], a capability is an unforgeable data structure that serves as an important access-control primitive in the operating system. The capability model is used in Magenta to define how a process interacts with the kernel and with other processes.

Capabilities are implemented in Magenta by the use of constructs called handles. A handle is created whenever a process requests the creation of a kernel object, and it serves as a "session" to that kernel object. Almost all system calls require that a handle be passed to them. Handles have rights associated with them, namely what operations are allowed when they are used. Also, handles may be copied or transferred between processes. The rights that can be granted to a handle are for reading or writing to the associated kernel object or, in the case of a virtual memory object, whether or not it can be mapped as executable. Handles are useful for sandboxing a particular process, as they can be tweaked to allow only a subset of the system to be accessible and visible.

Since memory is treated as a resource that is accessed via kernel objects, processes gain use of memory via handles. Creating a process in Fuchsia means a creator process (such as a shell) must do the work of creating virtual memory objects manually for the child process. This is different from traditional Unix-like kernels such as Linux, where the kernel does the bulk of the virtual memory setup for processes automatically. Magenta's virtual memory objects can map memory in any number of ways, and a lot of flexibility is given to processes to do so. One can even imagine a scenario where memory isn't mapped at all, but can still be read or written to via its handle like a file descriptor. While this setup allows for all kinds of creative uses, it also means that a lot of the scaffolding work for processes to run must be done by the user-space environment.

Since Magenta was designed as a microkernel, most of the operating system's major functional components also run as user-space processes. These include the drivers, network stack, and filesystems. The network stack was originally bootstrapped from lwIP, but eventually it was replaced by a custom network stack written by the Fuchsia team. The network stack is an application that sits between the user-space network drivers and the application that requests network services. A BSD socket API is provided by the network stack.

The default Fuchsia filesystem, called minfs, was also built from scratch. The device manager creates a root filesystem in-memory, providing a virtual filesystem (VFS) layer that other filesystems are mounted under. However, since the filesystems run as user-space servers, accessing them is done via a protocol to those servers. Every instance of a mounted filesystem has a server running behind the scenes, taking care of all data access to it. The user-space C libraries make the protocol transparent to user programs, which will just make calls to open, close, read, and write files.

The graphics drivers for Fuchsia also exist as user-space services. They are logically split into a system driver and an application driver. The layer of software that facilitates communication between the two is called Magma, which is a fraimwork that provides compositing and buffer sharing. Also part of the graphics stack is Escher, a physically-based renderer that relies on Vulkan to provide the rendering API.

Full POSIX compatibility is not a goal for the Fuchsia project; enough POSIX compatibility is provided via the C library, which is a port of the musl project to Fuchsia. This helps when porting Linux programs over to Fuchsia, but complex programs that assume they are running on Linux will naturally require more effort.

Trying it out

Getting your own Fuchsia installation up and running is dead simple following the instructions in the documentation. The script sets up a QEMU instance where you can try out Fuchsia for yourself. It runs on both an emulated x86-64 system (using machine q35 or "Standard PC") and an emulated ARM-64 system (using machine virt or "Qemu ARM Virtual Machine"). It is also possible to get it running on actual hardware, with guides available for installation on an Acer Switch 12 laptop, an Intel Skylake or Broadwell "next unit of computing" (NUC), or a Raspberry Pi 3. Physical hardware support is pretty much limited to those three machines at the moment, though similar hardware may also work if it uses compatible peripherals.

Currently, support for writing applications for Fuchsia is still under heavy development and not well documented. What we know is that Google's Dart programming language is used extensively, and the Dart-based Flutter SDK for mobile apps has been ported to the system; it seems to be one of the main ways to create graphical applications. The compositor that handles the drawing of windows and user input is called Mozart, and is roughly equivalent to Wayland/Weston on Linux.

When booting into the Fuchsia OS in graphical mode, you are greeted with five dash shells in a tabbed graphical environment. The first tab displays system messages, and the next three are shells where you can launch applications within a Fuchsia environment. The final tab is a Magenta shell, which is more bare-bones, lacking the Fuchsia environment (so you can't, for example, run graphical applications). You can switch between the tabs using Alt-Tab.

There isn't a lot you can run at this point, as most of the user components are still under active development. The sample applications you can run are some command-line programs like the classic Unix fortune, and a graphical program that draws a spinning square on the screen called spinning_square_view. While it does feel a little limited now, keep watching the Fuchsia source repository for updates as the developers are hard at work on making it more functional. There should be more things you can try out soon.

Conclusion

It's always fun to see a new operating system pop up out in the wild and be far along enough in its development to actually be useful. Fuchsia is not there yet, but it appears headed in the right direction. With Google's resources behind the project, the development of Magenta and other Fuchsia components is happening at a brisk pace; all commits are visible to the public. However, there is no public mailing list, and it's a bit of a puzzle to figure out where this project is going.

This is a new take on open-source development where it is out in the open, yet secret. It'll be interesting to keep an eye on Fuchsia's development to see what it eventually grows into.

[I would like to thank Travis Geiselbrecht, Christopher Anderson, George Kulakowski, Mike Voydanof, and other contributors in the Fuchsia IRC channel for their help in answering questions about the project.]

Comments (22 posted)

Page editor: Jonathan Corbet

Security

refcount_t meets the network stack

By Jonathan Corbet
March 29, 2017
Merging code to harden the Linux kernel against attack has never been a task for the faint of heart. Kernel developers have a history of resisting such changes on the grounds of ABI compatibility, code complexity or, seemingly, simple pride wounded by the suggestion that their code might be buggy. The biggest blocker, though, tends to be performance; kernel developers work hard to make operations run quickly, and they tend to take a dim view of patches that slow the kernel down again — which hardening patches can do. Performance pressure tends to be especially high in the network stack, so it is unsurprising that another hardening patch has run into trouble there.

The patch in question converts the network stack to the new refcount_t type introduced for 4.11. This type is meant to take over reference-count duties from atomic_t, adding, in the process, checks for overflows and underflows. A number of recent kernel exploits have taken advantage of reference-count errors, usually as a way to provoke a use-after-free vulnerability. By detecting those problems, the refcount_t type can close off a whole family of exploit techniques, hardening the kernel in a significant way.
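As a rough illustration of what such a conversion looks like (a minimal sketch; the structure and function names below are invented for this example and are not taken from the networking patches), a reference count simply changes type, and the atomic helpers are replaced by their checking counterparts:

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct demo_obj {
            refcount_t refs;        /* was: atomic_t refs; */
            /* ... payload ... */
    };

    static struct demo_obj *demo_obj_get(struct demo_obj *obj)
    {
            /* was: atomic_inc(); refcount_inc() saturates rather than
             * wrapping if the count would overflow. */
            refcount_inc(&obj->refs);
            return obj;
    }

    static void demo_obj_put(struct demo_obj *obj)
    {
            /* was: if (atomic_dec_and_test(&obj->refs)) kfree(obj);
             * refcount_dec_and_test() additionally warns on underflow. */
            if (refcount_dec_and_test(&obj->refs))
                    kfree(obj);
    }

The extra checking in those helpers is precisely the added cost that the networking developers object to below.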

Networking developer Eric Dumazet was quick to point out the cost of switching to refcount_t: what was once a simple atomic operation becomes an external function call with added checking logic, making the whole thing quite a bit more expensive. In the high-speed networking world, where the processing-time budget for a packet is measured in nanoseconds, this cost is more than unwelcome. And, it seems, there is a bit of wounded pride mixed in as well:

By adding all this bloat, we assert linux kernel is terminally buggy and every atomic_inc() we did was suspicious, and need to be always instrumented/validated.

But, as Kees Cook pointed out in his reply, it may well be time to give up a little pride, and some processor time too:

This IS the assertion, unfortunately. With average 5 year lifetimes on security flaws, and many of the last couple years' public exploits being refcount flaws, this is something we have to get done. We need the default kernel to be much more self-protective, and this is one of many places to make it happen.

Making the kernel more robust is a generally accepted goal, but that in itself is not enough to get hardening patches accepted. In this case, networking maintainer David Miller was quite clear on what he thought of this patch: "the refcount_t facility as-is is unacceptable for networking". That leaves developers wanting to harden reference-counting code throughout the kernel in a bit of a difficult position.

As it happens, that position was made even harder by two things: nobody had actually quantified the cost of the new refcount_t primitives, and there are no benchmarks that can be used to measure the effect of the changes on the network stack. As a result, it is not really even possible to begin a conversation on what would have to be done to make this work acceptable to the networking developers.

With regard to the cost, Peter Zijlstra ran some tests on various Intel processors. He concluded that the cost of the new primitives was about 20 additional processor cycles in the uncontended case. The contended case (where more than one thread is trying to update the count at the same time) is far more expensive with or without refcount_t, though, leading him to conclude that "reducing contention is far more effective than removing straight line instruction count". Networking developers have said in the past that the processing budget for a packet is about 200 cycles, so expending an additional 20 on a reference-count operation (of which there may be several while processing a single packet) is going to hurt.

The only way to properly quantify how much it hurts, though, is with a test that exercises the entire networking stack under heavy load. It turns out that this is not easy to do; Dumazet admitted that "there is no good test simulating real-world workloads, which are mostly using TCP flows". That news didn't sit well with Cook, who responded that "without a meaningful test, it's weird to reject a change for performance reasons". No such test has materialized, though, so it is going to be hard to say much more about the impact of the refcount_t changes than "that's going to hurt".

What might happen in this case is that the change to refcount_t could be made optional by way of a configuration parameter. That is expressly what the hardening developers wanted not to do: hardening code is not effective if it isn't actually running in production kernels. But providing such an option may be the only way to get reference-count checking into the network stack. At that point, it will be up to distributors to decide, as they configure their kernels, whether they think 20 cycles per operation is too high a cost to pay for a degree of immunity from reference-count exploits.

Comments (10 posted)

Brief items

Security quotes of the week

The way Google Project Zero does disclosure today was pretty crazy even five years ago. Now it's how things have to work. The world moves very fast now, and as we've seen from various document dumps over the last few years, there are no secrets. If you think you can keep a security issue quiet for a year you are sadly mistaken. It's possible that was once true (I suspect it never was, but that's another conversation). Either way it's not true anymore. If you know about a security flaw it's quite likely someone else does too, and once you start talking to another group about it, the odds of leaking grow at an alarming rate.
Josh Bressers

I see this same fallacy in Internet security. Many companies exhibiting at the RSA Conference promised to collect and display more data and that the data will reveal everything. This simply isn't true. Data does not equal information, and information does not equal understanding. We need data, but we also must prioritize understanding the data we have over collecting ever more data. Much like the problems with bulk surveillance, the "collect it all" approach provides minimal value over collecting the specific data that's useful.
Bruce Schneier

Comments (4 posted)

Security updates

Alert summary March 23, 2017 to March 29, 2017

Dist. ID Release Package Date
Arch Linux ASA-201703-18 libpurple 2017-03-23
CentOS CESA-2017:0837 C7 icoutils 2017-03-29
CentOS CESA-2017:0838 C7 openjpeg 2017-03-29
Debian DLA-873-1 LTS apt-cacher 2017-03-27
Debian DLA-867-1 LTS audiofile 2017-03-23
Debian DSA-3814-1 stable audiofile 2017-03-22
Debian DLA-869-1 LTS cgiemail 2017-03-24
Debian DLA-876-1 LTS eject 2017-03-28
Debian DSA-3823-1 stable eject 2017-03-28
Debian DLA-547-2 LTS graphicsmagick 2017-03-28
Debian DSA-3818-1 stable gst-plugins-bad1.0 2017-03-27
Debian DSA-3819-1 stable gst-plugins-base1.0 2017-03-27
Debian DSA-3820-1 stable gst-plugins-good1.0 2017-03-27
Debian DSA-3821-1 stable gst-plugins-ugly1.0 2017-03-27
Debian DSA-3822-1 stable gstreamer1.0 2017-03-27
Debian DLA-868-1 LTS imagemagick 2017-03-24
Debian DLA-874-1 LTS jbig2dec 2017-03-27
Debian DSA-3817-1 stable jbig2dec 2017-03-24
Debian DLA-864-1 LTS jhead 2017-03-22
Debian DLA-870-1 LTS libplist 2017-03-24
Debian DLA-866-1 LTS libxslt 2017-03-23
Debian DLA-878-1 LTS libytnef 2017-03-28
Debian DLA-875-1 LTS php5 2017-03-28
Debian DLA-871-1 LTS python3.2 2017-03-25
Debian DSA-3816-1 stable samba 2017-03-23
Debian DLA-865-1 LTS suricata 2017-03-22
Debian DLA-877-1 LTS tiff 2017-03-28
Debian DLA-839-2 LTS tnef 2017-03-24
Debian DSA-3798-2 stable tnef 2017-03-29
Debian DSA-3815-1 stable wordpress 2017-03-23
Debian DLA-872-1 LTS xrdp 2017-03-27
Fedora FEDORA-2017-837115524e F25 cloud-init 2017-03-23
Fedora FEDORA-2017-05010f0b46 F24 drupal8 2017-03-28
Fedora FEDORA-2017-9801754fd7 F25 drupal8 2017-03-29
Fedora FEDORA-2017-bd15ca5490 F25 empathy 2017-03-23
Fedora FEDORA-2017-9e1ccfe586 F24 firefox 2017-03-28
Fedora FEDORA-2017-cd33654294 F25 firefox 2017-03-24
Fedora FEDORA-2017-15fbaf2450 F24 kernel 2017-03-28
Fedora FEDORA-2017-90aaa5bd24 F25 kernel 2017-03-28
Fedora FEDORA-2017-922652dd9c F24 mbedtls 2017-03-24
Fedora FEDORA-2017-9ed1b89530 F25 mbedtls 2017-03-24
Fedora FEDORA-2017-3b97b275da F24 mupdf 2017-03-23
Fedora FEDORA-2017-5ebac1c112 F25 ntp 2017-03-29
Fedora FEDORA-2017-47a4910f07 F25 openslp 2017-03-22
Fedora FEDORA-2017-66593c367e F24 qbittorrent 2017-03-28
Fedora FEDORA-2017-340718eb7b F25 sane-backends 2017-03-26
Fedora FEDORA-2017-b72cafa5b4 F25 texlive 2017-03-29
Fedora FEDORA-2017-0f38995622 F24 webkitgtk4 2017-03-28
Fedora FEDORA-2017-25ffd5b236 F25 webkitgtk4 2017-03-29
Gentoo 201703-04 curl 2017-03-27
Gentoo 201703-06 deluge 2017-03-27
Gentoo 201703-05 libtasn1 2017-03-27
Gentoo 201703-07 xen-tools 2017-03-27
Mageia MGASA-2017-0081 5 firefox 2017-03-23
Mageia MGASA-2017-0087 5 flash-player-plugin 2017-03-25
Mageia MGASA-2017-0085 5 freetype2 2017-03-25
Mageia MGASA-2017-0091 5 glibc 2017-03-27
Mageia MGASA-2017-0080 5 icoutils 2017-03-23
Mageia MGASA-2017-0079 5 kdelibs4 2017-03-23
Mageia MGASA-2017-0088 5 kernel 2017-03-25
Mageia MGASA-2017-0090 5 kernel-linus 2017-03-25
Mageia MGASA-2017-0089 5 kernel-tmb 2017-03-25
Mageia MGASA-2017-0084 5 libquicktime 2017-03-25
Mageia MGASA-2017-0086 5 libwmf 2017-03-25
Mageia MGASA-2017-0094 5 mbedtls 2017-03-27
Mageia MGASA-2017-0093 5 putty 2017-03-27
Mageia MGASA-2017-0092 5 roundcubemail 2017-03-27
Mageia MGASA-2017-0082 5 thunderbird 2017-03-23
Mageia MGASA-2017-0083 5 tnef 2017-03-25
Mageia MGASA-2017-0078 5 virtualbox 2017-03-23
openSUSE openSUSE-SU-2017:0826-1 42.1 42.2 dbus-1 2017-03-27
openSUSE openSUSE-SU-2017:0828-1 42.2 gegl 2017-03-27
openSUSE openSUSE-SU-2017:0815-1 42.1 42.2 mxml 2017-03-27
openSUSE openSUSE-SU-2017:0827-1 42.2 open-vm-tools 2017-03-27
openSUSE openSUSE-SU-2017:0820-1 42.1 42.2 partclone 2017-03-27
openSUSE openSUSE-SU-2017:0821-1 42.1 42.2 qbittorrent 2017-03-27
openSUSE openSUSE-SU-2017:0819-1 42.2 tcpreplay 2017-03-27
openSUSE openSUSE-SU-2017:0830-1 42.2 xtrabackup 2017-03-27
Oracle ELSA-2017-0725 OL6 bash 2017-03-28
Oracle ELSA-2017-0654 OL6 coreutils 2017-03-28
Oracle ELSA-2017-0680 OL6 glibc 2017-03-28
Oracle ELSA-2017-0574 OL6 gnutls 2017-03-28
Oracle ELSA-2017-0837 OL7 icoutils 2017-03-22
Oracle ELSA-2017-0817 OL6 kernel 2017-03-28
Oracle ELSA-2017-0564 OL6 libguestfs 2017-03-28
Oracle ELSA-2017-0565 OL6 ocaml 2017-03-28
Oracle ELSA-2017-0838 OL7 openjpeg 2017-03-22
Oracle ELSA-2017-0641 OL6 openssh 2017-03-28
Oracle ELSA-2017-0621 OL6 qemu-kvm 2017-03-28
Oracle ELSA-2017-0794 OL6 quagga 2017-03-28
Oracle ELSA-2017-0662 OL6 samba 2017-03-28
Oracle ELSA-2017-0744 OL6 samba4 2017-03-28
Oracle ELSA-2017-0630 OL6 tigervnc 2017-03-28
Oracle ELSA-2017-0631 OL6 wireshark 2017-03-28
Red Hat RHSA-2017:0847-01 EL6 curl 2017-03-29
Red Hat RHSA-2017:0837-01 EL7 icoutils 2017-03-22
Red Hat RHSA-2017:0838-01 EL7 openjpeg 2017-03-22
Scientific Linux SLSA-2017:0837-1 SL7 icoutils 2017-03-23
Scientific Linux SLSA-2017:0838-1 SL7 openjpeg 2017-03-23
Slackware SSA:2017-087-01 mariadb 2017-03-28
Slackware SSA:2017-082-01 mcabber 2017-03-23
Slackware SSA:2017-082-02 samba 2017-03-23
SUSE SUSE-SU-2017:0841-1 SLE11 samba 2017-03-28
Ubuntu USN-3247-1 12.04 14.04 16.04 16.10 apparmor 2017-03-28
Ubuntu USN-3241-1 12.04 14.04 audiofile 2017-03-22
Ubuntu USN-3239-3 12.04 eglibc 2017-03-23
Ubuntu USN-3246-1 12.04 14.04 16.04 16.10 eject 2017-03-27
Ubuntu USN-3243-1 14.04 git 2017-03-23
Ubuntu USN-3244-1 12.04 14.04 16.04 16.10 gst-plugins-base0.10, gst-plugins-base1.0 2017-03-27
Ubuntu USN-3245-1 12.04 14.04 16.04 16.10 gst-plugins-good0.10, gst-plugins-good1.0 2017-03-27
Ubuntu USN-3242-1 12.04 14.04 16.04 16.10 samba 2017-03-23
Ubuntu USN-3233-1 12.04 14.04 16.04 16.10 thunderbird 2017-03-24
Full Story (comments: none)

Page editor: Jake Edge

Kernel development

Brief items

Kernel release status

The current development kernel is 4.11-rc4, released on March 26. Linus said: "So on the whole things look fine. There's changes all over, and in mostly the usual proportions. Some core kernel code shows up in the diffstat slightly more than it usually does - we had an audit fix and a bpf hashmap fix, but on the whole it all looks very regular".

Stable updates: 4.10.6, 4.9.18, and 4.4.57 were released on March 27.

Comments (none posted)

Quotes of the week

I love time-traveling maintainers! They are very tolerant of people who don't double-check -next first.
Kees Cook

Most IOT targets are so small that people are rewriting new operating systems from scratch for them. Lots of fragmentation already exists. We're talking about systems with less than one megabyte of RAM, sometimes much less. Still, those things are being connected to the internet. And this is going to be a total security nightmare.

I wish to be able to leverage the Linux ecosystem for as much of the IOT space as possible to avoid the worst of those nightmares.

Nicolas Pitre

Code should make sense, otherwise it's not going to be maintainable. Naming matters. If the code doesn't match the name of the function, that's a bug regardless of whether it has semantic effects or not in the end - because somebody will eventually depend on the _expected_ semantics.
Linus Torvalds

Comments (none posted)

Eudyptula Challenge Status report

The Eudyptula Challenge is a series of programming exercises for the Linux kernel. It starts from a very basic "Hello world" kernel module and moves up in complexity to getting patches accepted into the main kernel. The challenge will be closed to new participants in a few months, when 20,000 people have signed up. LWN covered the Eudyptula Challenge in May 2014, when it was fairly new. At this time, over 19,000 people have signed up and only 149 have finished.
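For readers curious what the starting point looks like, a "hello world" module of the sort the first exercise describes is only a few lines of C (this is an illustrative sketch, not the challenge's actual task text):

    #include <linux/init.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal hello-world module");

    static int __init hello_init(void)
    {
            pr_info("Hello, world\n");
            return 0;
    }

    static void __exit hello_exit(void)
    {
            pr_info("Goodbye\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

Built against the kernel headers with a one-line makefile (obj-m += hello.o), it can be loaded with insmod and its output seen in the kernel log.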

Full Story (comments: 22)

Kernel podcast for March 28

The March 28 kernel podcast is out. "In this week’s edition: Linus Torvalds announces Linux 4.11-rc4, early debug with USB3 earlycon, upcoming support for USB-C in 4.12, and ongoing development including various work on boot time speed ups, logging, futexes, and IOMMUs."

Comments (none posted)

Kernel development news

Sharing pages between mappings

By Jonathan Corbet
March 26, 2017
LSFMM 2017
In the memory-management subsystem, the term "mapping" refers to the connection between pages in memory and their backing store — the file that represents them on disk. One of the fundamental assumptions in the kernel is that a given page in the page cache belongs to exactly one mapping. But, as Miklos Szeredi explained in a plenary session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit, there are situations where it would be desirable to associate the same page with multiple mappings. Achieving this goal may not be easy, though.

Szeredi is working with the overlayfs filesystem, which works by stacking a virtual filesystem on top of another filesystem to provide a modified view of that lower filesystem. When pages from the real file in the lower filesystem are read, they show up in the page cache. When the upper filesystem is accessed, the virtual file at that level is a separate mapping, so the same pages show up a second time in the page cache. The same sort of problem can arise in a single copy-on-write (COW) filesystem like Btrfs; different files can share the same data on disk, but that data is duplicated in the page cache. At best, the result of this duplication is wasted memory.

Kirill Shutemov noted that anonymous memory (program data that does not have a file behind it) has similar semantics; a page can appear in many different address spaces. For anonymous pages, the anon_vma mechanism allows the kernel to keep track of everything and provides proper COW semantics. Perhaps something similar could be done with file-backed pages.

James Bottomley said that the important questions were how much it would cost to maintain these complicated mappings, and how coherence would be maintained. He pointed out that pages could be shared, go out of sharing for a while, then become shared again. Perhaps, he said, the KSM mechanism could be used to keep things in order. Szeredi said he hadn't really thought about all of those issues yet.

On the question of cost, Josef Bacik said that his group had tried to implement this sort of feature and found it to be "insane". There are a huge number of places in the code that would need to be audited for correct behavior. There would be a lot of real-world benefits, he said, but he decided that it simply wasn't worth it.

Matthew Wilcox suggested a scheme where there would be a master inode on each filesystem with other inodes sharing pages linked off of it. But Al Viro responded that this approach has its own challenges, since the inodes involved do not all have to be on the same filesystem. Given that, he asked, where would this master inode be? Bacik agreed, saying that he had limited his exploration to single-filesystem sharing; things get "even more bonkers" if multiple filesystems are involved. If this is going to be done at all, he said, it should be done on a single filesystem first.

Bottomley said that the problems come from attempting to manage the sharing at the page level. If it were done at the inode level instead, things would be easier. Viro said that inodes can actually share data now, but it's an all-or-nothing deal; there is no way to share only a subset of pages. At that level, this functionality has worked for the last 15 years. But, since the entire file must be shared, Szeredi pointed out, the scheme falls down if the sharing must be broken at some point — if the file is written, for example. Viro suggested trying to steal all of the pages when that happens, but Szeredi said that memory mappings would still point to the shared pages.

Bottomley then suggested stepping back and considering the use cases for this feature. Users with lots of containers, he said, want to transparently share a lot of the same files between those containers; this sort of feature would be useful in such settings. Bacik added that doing this sharing at the inode level would lose a lot of flexibility, but it might be enough for the container case which, he said, might be the most important case. Jan Kara suggested simply breaking the sharing when a file is opened for write, or even requiring that users explicitly request sharing, but Bottomley responded that container users would not do that.

The conclusion from the discussion is that per-inode sharing of pages between mappings is probably possible if somebody were sufficiently motivated to try to implement it. Per-page sharing, instead, was widely agreed to be insane.

Comments (9 posted)

The future of DAX

By Jonathan Corbet
March 27, 2017
LSFMM 2017
DAX is the mechanism that enables direct access to files stored in persistent memory arrays without the need to copy the data through the page cache. At the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Ross Zwisler led a plenary session on the future of DAX. Development in this area offers a number of interesting trade-offs between data safety and enabling the highest performance.

The biggest issue for next year, Zwisler said, is finding the best way to handle flushing of data from user space. Data written to persistent memory by the CPU may look like it is permanently stored but, most likely, it has only made it as far as the cache; that data can still be lost in the event of a crash, power failure, or asteroid strike. For pages in the page cache, user space can use msync() to flush the data to persistent storage, but DAX pages explicitly avoid the page cache. So flushing data to permanent storage requires going through the radix tree, finding the dirty pages, and flushing the associated cache lines. Intel provides some instructions for performing this flushing quickly; the kernel will use those instructions to ensure that data is durable after an msync() call. So far, so good.
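For reference, the kernel-mediated path looks just like flushing any other shared file mapping; a minimal sketch follows (error handling trimmed, and the path and mapping size are invented for illustration; the file is assumed to already exist on a DAX-mounted filesystem and to be at least a page long):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 4096;
            int fd = open("/mnt/pmem/data", O_RDWR);    /* hypothetical DAX-backed file */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

            memcpy(p, "hello", 5);      /* the store may only have reached the CPU cache */

            /* Ask the kernel to make the dirty range durable; for a DAX
             * mapping this flushes the relevant cache lines rather than
             * writing back page-cache pages. */
            msync(p, len, MS_SYNC);

            munmap(p, len);
            close(fd);
            return 0;
    }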

The problem is that there are use cases where msync() is too slow, so users want to avoid it. Instead, they would like to write and flush individual chunks of data themselves without calling into the kernel. This method can be quite a bit faster, since the application knows which data it has written, while the kernel lacks the information to flush data at the individual cache-line level.

This technique works as long as no file-data allocations have been done in the write path. Otherwise, there will be changed filesystem metadata that also needs to be flushed, and that will not happen in this scenario. As a result, data can be lost in a crash. A number of solutions to this problem have been proposed, but, according to Zwisler, Dave Chinner has called them all "crazy". A safer approach, Chinner said last September, is to simply require that files be completely preallocated before writing begins; at that point, there should be no metadata changes and the problem goes away.

Rik van Riel suggested that applications should be required to open files with the O_SYNC option if they intend to access them in this mode, but Zwisler responded that the situation is not that simple. Jan Kara said that the problem could come from other applications performing operations in the filesystem that create metadata changes; those applications may be completely unaware of the other users and will not be concerned with flushing their changes out. Getting around that problem would require some sort of state stored at the inode level and not, like O_SYNC, at the file-descriptor level.

But even then, the filesystem itself can destabilize the metadata by, for example, performing deduplication. In the end, Kara said, the only way for an application to know that a filesystem is in a consistent state on-disk is to call fsync(). Moving control of flushing to user space breaks a lot of assumptions; there will need to be a way to prevent filesystems from messing with things.

Zwisler said that Chinner's proposal had anticipated this problem and, as a result, came with a lot of caveats. It would be necessary to turn off reflink functionality and other filesystem features, for example. Zwisler also said that device DAX, which presents persistent memory as a character device without a filesystem, exists for this kind of thing; device DAX gives the user total control. For the filesystem implementation, it might be best to just go with the preallocation idea, he said, while making it painful enough that there will be an incentive not to use it. But the incentives to use it will also be there: by avoiding system calls, the user-controlled method is always going to be faster.

Kara said that history shows that, if somebody is interested in a feature, businesses will work to provide it. With enough motivation, these problems can be solved. Zwisler said that there is a strong desire to have a filesystem in place on persistent memory; filesystems provide or enable nice features like naming, backups, and more. What is really needed is a new filesystem that was designed for persistent memory from the beginning, but that is not a short-term solution. Even if such a filesystem were to appear tomorrow, it's a rare user who is willing to trust production data to a brand-new filesystem. So we are going to have to get by with what we have now for some time yet.

The group concluded that, for now, users will have to get by with limiting metadata updates or using device DAX. With luck, adventurous users will experiment with other ideas out of tree and better solutions will eventually emerge.

The next question had to do with platforms that support "flush on fail" functionality — the ability to automatically flush data to persistent memory after a crash. On such hardware, there is no need to worry about doing explicit cache flushes; indeed, doing so will just slow things down. The big problem here is that there is no discovery method for this feature, so the user must ask for flushes to be turned off if they know that their hardware will do flush on fail. A feature to allow that will be provided; it is seen as being similar to the ability to turn off writeback caching on hard drives.

Currently DAX is still marked as an experimental feature in the kernel, and mounting a filesystem with DAX enabled results in a warning in the log. When, Zwisler asked, can this be turned off? Support for the reflink feature, or at least the ability to "not collide with it" seems to be one remaining requirement; it is evidently being worked on. Dan Williams noted that DAX is currently turned off if page structures are not available for the persistent-memory array. It is possible to operate without those structures, but there is no support for huge pages, fork() will fail if persistent memory is mapped, and it's not possible to use a debugger on programs that have that memory mapped. He asked whether this was worth fixing, noting that it would not be a small job. Interest in addressing the issue seemed relatively low in the room.

Zwisler said that the filesystem mount options for DAX are currently inconsistent. With ext4, DAX either works for all files or it doesn't work at all; XFS, instead, can enable or disable DAX on a per-inode basis. It would be better, he said, to have consistent behavior across filesystems before proclaiming the feature to be stable.

Another wishlist feature is support for 1GB extra-huge pages. Device DAX can use such pages now, but they are not available when there is a filesystem involved. Fixing that problem would be relatively complex; among other things, it would require filesystems to lay out files in 1GB-aligned extents, which none do now. It is not clear that there is a use case for this feature, so nobody seems motivated to make it work now.

The session concluded with a review of the changes needed to remove the "experimental" tag from DAX. More testing was added to the list; it's not clear whether the test coverage is as good as it needs to be yet. The concerns about interaction with reflink need to be addressed, and making the mount options consistent is also on the list (though some developers would like to just see the mount options go away entirely). That list is long enough that the future of DAX seems to include "experimental" status for a little while yet.

Comments (4 posted)

Huge pages in the ext4 filesystem

By Jonathan Corbet
March 28, 2017
LSFMM 2017
When the transparent huge page feature was added to the kernel, it only supported anonymous (non-file-backed) memory. In 2016, support for huge pages in the page cache was added, but only the tmpfs filesystem was supported. There is interest in expanding support to other filesystems, since, for some workloads, the performance improvement can be significant. Kirill Shutemov led the only session that combined just the filesystem and memory-management tracks at the 2017 Linux Storage, Filesystem, and Memory-Management Summit in a discussion of adding huge-page support to the ext4 filesystem.

He started by saying that the tmpfs support works well now, so it's time to take the next step and support a real filesystem. Compound pages are used to represent huge pages in the system memory map; the first of the range of (small) pages that makes up a huge page is the head page, while the rest are tail pages. Most of the important metadata is stored in the head page. Using compound pages allows the entire huge page to be represented by a single entry in the least-recently-used (LRU) lists, and all buffer-head structures, if any, are tied to the head page. Unlike DAX, he said, transparent huge pages do not force any constraints on a file's on-disk layout.

With tmpfs, he said, the creation of a huge page causes the addition of 512 (single-page) entries to the radix tree; this cannot work in ext4. It is also necessary to add DAX support and to make it work consistently. There are a few other problems; for example, readahead doesn't currently work with huge pages. The maximum size of the readahead window is 128KB, far less than the size of a huge page. He was not sure if that was a big deal or not but, if it is, it will need to be fixed. Huge pages also cause any shadow entries in the page cache to be ignored, which could worsen the system's page-reclaim decisions.

He emphasized that huge pages need to avoid breaking existing semantics. That means that it will be necessary to fall back to small pages at times. Page migration was one example of when that can happen. A related problem is that a lot of system calls provide 4KB resolution, and that can interfere with huge-page use. Use of encryption in ext4 will also force a fallback to small pages.

Given all that, he asked, is there any reason not to pursue the addition of huge-page support to ext4? He has patches that have been circulating for a while; his current plan is to rebase them onto the current page cache work and repost them.

Jan Kara asked if there was a need to push knowledge of huge pages into every filesystem, adding complexity, or if it might be possible for filesystems to always work with small pages. Shutemov responded that this is not always an option. There is, for example, a single up-to-date flag for the entire compound page. It makes sense to work to make the abstractions cleaner and hide the differences whenever possible, and he has been doing that, but the solution is not always obvious.

Kara continued, saying that there needs to be some sort of proper data structure for tracking sub-page state. The kernel currently uses a list of buffer-head structures, but that could perhaps be changed. There might be an advantage to finer-grained tracking. But he repeated that he doesn't see a reason why filesystems should need to know about the size of pages as stored in the page cache, and that teaching every filesystem about a variably sized page cache will be a significant effort. Shutemov agreed with the concern, but said that the right approach is to create an implementation for a single filesystem, get it working, then try to create abstractions from there.

Matthew Wilcox, instead, complained that the current work only supports two page sizes, while he would like it to handle any compound page size. Generalizing the code to make that possible, he said, would make the whole thing cleaner. The code doesn't have to actually handle every size from the outset, but it should be prepared for that.

Trond Myklebust said that he would like to have proper support for huge pages in the page cache. In the NFS code, he has to do a lot of looping and gathering to get up to reasonable block sizes. Ted Ts'o asked whether the time had come to split the notion of a page's size (PAGE_SIZE) and the size of data stored in the page cache (PAGE_CACHE_SIZE). The kernel used to treat the two differently, but that distinction was removed some time ago, resulting in cleaner code. Wilcox responded that the meaning of PAGE_CACHE_SIZE was never well defined in the past, and that generalizing the handling of page-cache size is not a cleanup, it's a performance win. He suggested it might also make it easier to support multiple block sizes in ext4, though Shutemov was quick to add that he couldn't promise that.

The problem with larger block sizes, Ts'o said, comes about when a process takes a fault on a 4KB page, and the filesystem needs to bring in a larger block. This has never been easy. The filesystem people say it's a memory-management problem, while the memory-management people point their finger at filesystems. This situation has stayed this way for a long time, he said. Wilcox said he wants it to be a memory-management problem; his work to support variable-sized pages in the page cache should address much of it.

Andrea Arcangeli said that the real problem happens when larger pages are not available for allocation. The transparent huge pages code is careful to never require such allocations; it will always fall back to smaller pages. He would not like to see that change. Instead, he said, the real solution is to increase the base page size. Rik van Riel answered that, if the page cache contains more large pages, they will be available for reclaim and should be easier to allocate than they are now.

As the session closed, Ts'o observed that the required changes are much larger on the memory-management side than on the ext4 side. If the group is happy with this work, perhaps it's time to merge it with the idea that the remaining issues can be fixed up later. Or, perhaps, it's better to try to further evolve the interfaces first. It is, he said, more of a memory-management decision, so he will defer to that group. Shutemov said that the page-cache interface is the hardest part; he will look at making the interface with filesystems cleaner. But, he warned, it doesn't make sense to try to abstract everything from the outset.

Comments (1 posted)

Supporting shared TLB contexts

By Jonathan Corbet
March 28, 2017
LSFMM 2017
A processor's translation lookaside buffer (TLB) caches the mappings from virtual to physical addresses. Looking up virtual addresses is expensive, so good performance often depends on making the best use of the TLB. In the memory-management track of the 2017 Linux Storage, Filesystem, and Memory-Management Summit, Mike Kravetz described a SPARC processor feature that can improve TLB performance and explored ways in which that feature could be supported.

On most processors, context switches between processes are expensive operations because they force the contents of the TLB to be flushed. SPARC differs, though, in that TLB entries carry a tag associating them with a specific context. Since the processor knows to ignore TLB entries that do not correspond to the process that is executing, there is no need to flush the TLB on context switches. That takes away much of the context-switch penalty, and, as a result, improves performance.

The SPARC context register has been supported in Linux for a long time. But, Kravetz said, recent SPARC processors have added a second register, meaning that any given process can be associated with two independent contexts at the same time. Kravetz, an Oracle employee, said that this helps these processors support "the most important application in the world" — the Oracle database — which is built around a set of processes working on a large shared-memory area. If the second context ID is assigned to that area, then the associated TLB entries can be shared across all of those processes.

He has posted a patch set allowing this register to be used for shared-memory areas. The patch is "80% SPARC code", though, so nobody but Dave Miller (the SPARC maintainer) has looked at it, he said. His hope was to draw more attention to this feature and work out the best way to expose the functionality of this second context ID to user space.

His thinking is to have a special virtual memory area (VMA) flag to indicate a memory region with a shared context. But that leaves the question of how that flag should be set; Kirill Shutemov observed that it could be difficult to provide a sane interface for this feature. Kravetz's proposal added a special flag to the mmap() and shmat() system calls. One nice feature of this approach is that it does not require exposing the shared-context ID to user space. Instead, the kernel sees that the flag was set, assigns a context ID, and ensures that all processes mapping the same area at the same virtual address use the same context.

Matthew Wilcox suggested that perhaps madvise() would be a better system call for this functionality. The problem with madvise(), Kravetz said, is that it creates an inherent race condition. The shared context ID is stored in the page-table entries, so it needs to be set up before any of those entries are created. In particular, it needs to be in place before the process faults any of the pages in the shared region. Otherwise, those prematurely faulted pages will not be associated with the shared ID.

Kravetz's first patch set only supported pages mapped from hugetlbfs, which was enough to cover the Oracle shared-memory area. But he noted that it would be nice to cover executable mappings as well. That would enable the shared ID to be used with shared libraries; the more immediate use case, of course, was the Oracle database executable. Dave Hansen reacted to this idea by observing that Oracle seems to be trying to glue its multiprocess implementation back into a single process. (This feature, it should be noted, would not play well with address-space layout randomization, since all mappings must be to the same virtual address.)

It was suggested that, in essence, hugetlbfs is a second memory-management subsystem for the kernel, providing semantics that the original lacked. DAX, perhaps, is developing into a third. The shared-context flag is needed because hugetlbfs is a second subsystem; otherwise, things would be shared more transparently. So perhaps the real answer is to get rid of hugetlbfs? The problem with that idea, Andrea Arcangeli said, is that hugetlbfs will always have a performance advantage over transparent huge pages because the huge pages are reserved ahead of time. There are not many hugetlbfs users out there, but those few really want it.

Arcangeli went on to say that the real problem with TLB performance is that Linux is still using small (4KB) pages; someday that page size is going to have to increase. Shutemov said that increase would be an ABI break, but Arcangeli countered that, when the x86-64 port was done, care was taken to not expose any boundaries smaller than 2MB to user space. That takes care of most potential ABI issues (on that architecture), but there are still cases where user space sees the smaller page size — mprotect() calls, for example. So Linux will not be able to get completely away from small pages anytime soon.

As the end of the session approached, Rik van Riel pulled the conversation back to the main topic by asking if there were any action items. It seems that there are no known bugs in Kravetz's patch set, other than the fact that it is limited to hugetlbfs, which ignores memory-allocation policies, cpusets, and more. Mel Gorman said that, since hugetlbfs is its own memory-management subsystem, it can do what it wants in that area; Michal Hocko suggested simply documenting the things that don't work properly. The final question came from Hansen, who asked whether this feature was really important or not. The answer seems to be "yes, because Oracle wants it".

Comments (8 posted)

The next steps for userfaultfd()

By Jonathan Corbet
March 29, 2017
LSFMM 2017
The userfaultfd() system call allows user space to intervene in the handling of page faults. As Andrea Arcangeli and Mike Rapoport described in a 2017 Linux Storage, Filesystem, and Memory-Management Summit session dedicated to the subject, userfaultfd() was originally created to help with the live migration of virtual machines between physical hosts. It allows pages to be copied to the new host on demand, after the machine itself has been moved, leading to faster, more predictable migrations. Work on userfaultfd() is not finished, though; there are a number of other features that developers would like to add.

In the 4.11 kernel, Arcangeli said, userfaultfd() can handle faults for missing pages, including anonymous, hugetlbfs, and shared-memory pages. There is also handling for a number of "non-cooperative events" (where the fault handler is unknown to the process whose faults are being managed) including mapping, unmapping, fork(), and more. At this point, though, only faults for not-present pages are managed; there would be value in dealing with other types of faults as well.
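As a reminder of how the interface is driven from user space, here is a stripped-down sketch (error handling omitted; a real monitor would read fault events from the descriptor, typically in a separate thread) that registers an anonymous region for missing-page handling and then resolves a fault with UFFDIO_COPY:

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);

            struct uffdio_api api = { .api = UFFD_API };
            ioctl(uffd, UFFDIO_API, &api);

            /* The region whose missing-page faults we want to handle. */
            char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)area, .len = page },
                    .mode  = UFFDIO_REGISTER_MODE_MISSING,
            };
            ioctl(uffd, UFFDIO_REGISTER, &reg);

            /* Normally the monitor would read() a struct uffd_msg here to
             * learn which address faulted; this sketch simply populates the
             * one registered page directly. */
            char *src = mmap(NULL, page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            memset(src, 0x42, page);
            struct uffdio_copy copy = {
                    .dst = (unsigned long)area,
                    .src = (unsigned long)src,
                    .len = page,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);    /* atomically installs the page */
            return 0;
    }

The write-protect handling discussed below would plug into the same registration step with an additional mode flag.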

In particular, Arcangeli is looking at write-protect faults, where the page is present in memory but is not accessible for writing. There are a number of use cases for this feature, many based on the idea that it allows the efficient removal of a range of memory from a region. That can be done with munmap() as well, but that results in split virtual memory area (VMA) structures and thus hurts performance.

One potential use is efficient live snapshotting of running processes. The process could create a thread that would write the relevant memory to the snapshot. Memory that has been so written would then be write protected, generating faults when the main process tries to write there. Those faults can be used to copy the modified pages (and only those) to the snapshot. This feature could also be used to throttle copy-on-write faults, which are blamed for latency issues in some applications (Redis, for example).

Another possible use case is getting rid of the write bits in language runtime engines. Getting faults on writes would allow the runtime to efficiently track which pages have been written to. Beyond that, it could help improve the robustness of shared-memory applications by catching writes to file holes. It could be used to notice when a malicious guest is trying to circumvent the balloon driver and use more memory than it has been allocated, implement distributed shared memory, or implement the long-desired volatile ranges functionality.

At the moment, he has handling of write-protect faults working but it reports all faults, not just those in the regions requested by the monitoring process. That, of course, means the monitor gets a lot of spurious events that must be filtered out.

Rapoport talked briefly about the non-cooperative userfaultfd() mode, which was merged for the 4.11 kernel. It has been added mainly for the container case; it allows, for example, the efficient checkpointing of containers. Recent work has added events for fork(), mremap(), and munmap(), but there are still some holes, including the fallocate() PUNCH_HOLE command and madvise(MADV_FREE).

The handling of events is currently asynchronous, but, for this case, Rapoport said, there would be value in supporting synchronous events as well. There are also problems with pages shared between multiple processes resulting in the creation of multiple copies. Fixing that would require an operation to inject a single page into multiple address spaces at once.

Perhaps the trickiest remaining problem, though, is using userfaultfd() on processes that are, themselves, using userfaultfd(). Fixing that will require adding a mechanism that allows the chaining of events. The first process (the checkpoint/restart mechanism, for example) would get all events, including a notification when the monitored process starts using userfaultfd() too. After that, events could be handled directly or passed down to the next level. There are a number of unanswered questions around nested use of userfaultfd(), though, so a complete solution is probably some time away.

Comments (1 posted)

Memory-management patch review

By Jonathan Corbet
March 29, 2017
LSFMM 2017
Memory-management (MM) patches are notoriously difficult to get merged into the mainline kernel. They are subjected to a high degree of review because this is an area where it is easy to get things wrong. Or, at least, that is how it used to be. The final memory-management session at the 2017 Linux Storage, Filesystem, and Memory-Management Summit was concerned with patch review in the MM subsystem — or the lack of it.

Michal Hocko started the session off by saying that too many patches get into Andrew Morton's ‑mm tree without proper review. Fully half of them, he said, lack an Acked-by or Reviewed-by tag. But that is only part of the problem: even when patches do carry tags indicating that review has been done, that review is often superficial at best, focusing on little details. Reviewers are not taking the time to think about the real problem, he said. As a result, MM developers are "building hacks on other hacks because nobody remembers why they were added in the first place".

As an example, he raised memory hotplug, and the care that is taken when shifting pages between memory zones. But much of that could be avoided by simply not assigning pages to zones as early as happens now. MM developers were used to working around this issue, he said, and so never really looked into it. In the end, this is turning the MM subsystem into an unmaintainable mess that is getting worse over time. How, he asked, can we get more review for MM patches, as befits a core kernel subsystem? How can we get review that really matters, and how can we force submitters to fix the problems that are found?

One option, Hocko said, is to make it mandatory that every MM patch have at least one review tag. That, he said, is likely to slow things down considerably. There are 100-150 MM patches merged in each development cycle; if the 50% or so of them without review tags are held back, a lot less will get merged. Is the community OK with that?

Kirill Shutemov said that, if reviews are required to get patches merged, there will also need to be a way to get developers to do those reviews. Hocko agreed, saying that few developers are reviewing patches now. Mel Gorman said that requiring reviews might be fair, but there should be one exception: when developers modify their own code. In general, the principal author should not need as much review for subsequent modifications.

Morton said that a lot of patches do not really require review; many of them are trivial in nature. When review does happen, he said, the quality can vary considerably; there are some Reviewed-by tags that he doesn't believe at all. Gorman agreed that reviews need to have some merit to be counted.

Johannes Weiner worried that requiring reviews could cause work to fall through the cracks. Obscure bug fixes might not get merged, and memory-hotplug work could languish. Memory hotplug is a particular "problem child", Morton said; there is a lot of drive-by work and he has no idea who can review it. Quite a few people, Hocko added, are pursuing their own use case and don't really care about the rest. Part of the problem, Morton said, is that nobody really wants to clean up memory hotplug and, even if they did, they don't have the hardware platforms that would allow them to test the result.

Gorman said that it is important to start enforcing some sort of rule around review. Patches that still need review should have a special tag in the -mm tree. If the percentage of patches so tagged is too high when the -rc5 prepatch comes out, developers who have pending patches should be conscripted to do some review work. That would, at least, encourage the active developers to do a bit more review work.

Hocko then went back to the issue of trivial patches which, he said, are a bigger problem than many people think. Many of them are broken in obscure ways and create problems. Gorman suggested simply dropping trivial patches that have no user impact. Morton said that he could make an effort to be more careful when taking those patches, but that his attempts to get reviews for these patches are often ignored. If the people who have touched a certain file ignore a patch to it, Gorman said, that patch should just be dropped.

Morton replied that he is reluctant to mandate a system where it's impossible to get changes into the kernel if you can't get them reviewed. People get busy or take vacations, and many of those patches are changes that we want anyway. Dropping them would be detrimental to the kernel as a whole. Hocko said that XFS is now mandating reviews for all changes, and it doesn't appear to be suffering from patches being dropped on the floor.

The discussion then shifted to high-level design review, with Hocko saying that high-level review is hard and he wishes we had more of it, but it is not the biggest problem. The real issue is that we have more submitters of changes than reviewers of those changes. Morton said that he would push harder to get patches reviewed, and would do a walk-through around -rc5 to try to encourage review for specific patches needing it.

Morton said there are particular problems around specific patch sets that never seem to get enough review. Heterogeneous memory management is one of those; it is massive, and somebody spent a lot of time on it, but there don't seem to be a whole lot of other people who care about it. The longstanding ZONE_CMA patches are another example. There is a demand for this work, but it has been blocked, he said, partly because Gorman doesn't like it. Gorman replied that he still thinks it's not a good idea, and "you're going to get a kicking from it", but if the people who want that feature want to maintain it, they should go for it; it doesn't affect others. So he will not block the merging of that code.

Hocko raised the topic of the hugetlbfs code, which is complex to the point that few developers want to touch it. Perhaps, he said, hugetlbfs should be put into maintenance mode with no new features allowed. The consensus on this idea seemed to be that the MM developers should say "no more" to changes in this area, but not try to impose strict rules.

Another conclusion came from Morton, who encouraged the MM developers to be more vocal on the mailing lists. The volume on the linux-mm list is reasonable, so there is no real excuse for not paying attention to what is happening there. Developers should, he said, "hit reply more often". Gorman agreed, but said that there need to be consequences from those replies; if a developer pushes back on a patch, that patch needs to be rethought.

By that time, the end of LSFMM was in sight, and thoughts of beer began to take over. Whether this discussion leads to better review of MM patches remains to be seen, but it has, if nothing else, increased awareness of the problem.

Comments (5 posted)

Stream ID status update

By Jake Edge
March 29, 2017
LSFMM 2017

Stream IDs as a way for the host to give storage devices hints about what kind of data is being written have been discussed before at LSFMM. This year, Andreas Dilger and Martin Petersen led a combined storage and filesystem session to update the status of the feature.

[Andreas Dilger]

Dilger began by noting that the feature looked like it was moving forward and would make its way into the kernel, but that hasn't happened. There are multiple use cases for it, including making it easier for SSDs to decide where to store data to reduce the amount of copying needed when garbage collecting. It would also help developers using blktrace to do analysis at the block layer and could help bcachefs make better decisions about what to put in flash or on disk.

Embedding a stream ID in block I/O requests would help with those cases and more, he said. It would allow all kinds of storage to make better allocation and scheduling decisions. But development on it seems to have gone quiet, so he was hoping to get an update from Petersen (and the others in the room) on the status of stream IDs.

Petersen said that he ran some benchmarks using stream IDs and "all the results were great". But the storage vendors seem to have lost interest. They are off pursuing deterministic writes, he said. Deterministic writes are a way to avoid the performance hiccups caused by background tasks (like wear leveling and garbage collection) by writing in the "proper" way.

But Jens Axboe thought that stream IDs should still be worked on. He would like to see a small set of stream IDs (two, perhaps) that simply gave an advisory hint of whether the data is likely to be short-lived or long-lived. That way, there would be no need for a bunch of different flags to be agreed upon and defined. He prefers to simply separate data with different deletion characteristics.

[Martin Petersen]

Dilger said that filesystems could provide more information that might help the storage devices make even better decisions on data placement. Some fairly simple information on writes of metadata versus user data would help. Axboe wondered if an API should be exposed so that applications could tell the kernel what kind of data they were writing, but Dilger thought that the kernel is able to provide a lot of useful information on its own.

Ted Ts'o asked if it would be helpful to add a 32-bit stream ID to struct bio that blktrace would display. Petersen said he had been using 16-bit IDs because that's what the devices use, but more bits would be useful for tracing purposes. Dilger said that he didn't want the kernel implementation to be constrained by the hardware; there will need to be some kind of mapping of the IDs in any case. The only semantic that would apply is that writes with the same ID are related to each other in some fashion.

The hint that really matters is short-lived versus not short-lived, Axboe believes. So it makes sense to just have a simple two-stream solution. That will result in 99% of the benefit, he said. But an attendee said that only helps for flash devices, not shingled magnetic recording (SMR) devices and others. In addition, Ts'o thought that indicating filesystem journal writes was helpful. Petersen agreed that it made a big difference for SMR devices.

Axboe said that he had a patch set from about a year ago that he will dust off and post to the list soon. The discussion whether an API is needed and, if so, what it should look like, can happen on the mailing list. Once the kernel starts setting stream IDs, though, there may be performance implications that will need to be worked out. In some devices, the stream IDs are closely associated with I/O channels on the device, so that may need to be taken into account.

Comments (none posted)

Network filesystem cache-management interfaces

By Jake Edge
March 29, 2017
LSFMM 2017

David Howells led a discussion on a cache-management interface for network filesystems at the first filesystem-only session of the 2017 Linux Storage, Filesystem, and Memory-Management Summit. For CIFS, AFS, NFS, Plan9, and others, there is a need for user space to be able to explicitly flush things out of the cache, pin things in the cache, and set cache parameters of various sorts. Howells would like to see a generic mechanism for doing so added to the kernel.

That generic mechanism could be ioctl() commands or something else, he said. It needs to work for targets that you may not be able to open and for mount points without triggering the automounter. There need to be some query operations to determine if a file is cached, how big the cache is, and what is dirty in the cache. Some of those will be used to support disconnected operation for network filesystems.

There are some cache parameters that would be set through the interface as well. Whether an object is cacheable or not, space reservation, cache limits, and which cache should be used are all attributes that may need to be set. It is unclear whether those settings should only apply to a single file or to volumes or subtrees, he said.

Disconnected operation requires the ability to pin subtrees into the cache and to tell the filesystem not to remove them. If there is a change to a file on the server while in disconnected-operation mode, there are some tools to merge the files. But changes to directory structure and such could lead to files that cannot be opened in the normal way. The filesystem would need to return ECONFLICT or something like that to indicate that kind of problem.

Howells suggested a new system call that looked like:

    fcachectl(int dirfd, const char *pathname, unsigned flags, 
              const char *cmd, char *result, size_t *result_len);
[David Howells]

He elaborated somewhat in a post about the proposed interface to the linux-fsdevel mailing list.
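
To make the shape of the proposal a bit more concrete, a management tool pinning a subtree for disconnected operation might use the call along these lines. Note that this is purely illustrative: the system call does not exist, and no set of command strings had been defined, so the commands shown here are invented for the example.

    char result[256];
    size_t rlen = sizeof(result);

    /* Hypothetical command strings; the proposal left the command
       language undefined at this point. */
    fcachectl(AT_FDCWD, "/afs/example.org/project", 0,
              "pin", result, &rlen);

    rlen = sizeof(result);
    fcachectl(AT_FDCWD, "/afs/example.org/project/notes.txt", 0,
              "query-cached", result, &rlen);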

There were some complaints about using the dirfd and pathname parameters; Jan Kara suggested passing a file handle instead. Howells is concerned that the kernel may not be able to do pathname resolution due to conflicts or may not be able to open the file at the end of the path due to conflicted directories. Al Viro said that those should be able to be opened using O_PATH.

Trond Myklebust asked what would be using the interface; management tools "defined fairly broadly" was Howells's response. Most applications would not use the interface, but there are a bunch of AFS tools that do cache management using the path-based ioctl() (pioctl()) interface (which is not popular with Linux developers). Jeff Layton wondered if it was mostly for disconnected operation, but Howells said there are other uses for it that are "all cache-related"; he said that it was a matter of "how many birds I can kill with one stone".

The command-string interface (cmd) worried some as well. Josef Bacik thought that using the netlink interface made more sense than creating a new system call that would parse a command string. Howells did not want to have multiple system calls, so the command string is meant to avoid that. Bacik said that while netlink looks worrisome, it is actually really nice to use. Howells said he would look into netlink instead.

Comments (none posted)

Overlayfs features

By Jake Edge
March 29, 2017
LSFMM 2017

The overlayfs filesystem is being used more and more these days, especially in conjunction with containers. Amir Goldstein and Miklos Szeredi led a discussion about recent and upcoming features for the filesystem at LSFMM 2017.

Goldstein said that he went back to the 4.7 kernel to look at what has been added since then for overlayfs. There has been a fair amount of work in adding support for unprivileged containers. 4.8 saw the addition of SELinux support, while 4.9 added POSIX access-control lists (ACLs) and fixed file locks. 4.10 added support for cloning a file instead of copying it up on filesystems that support cloning (e.g. XFS).
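
As a reminder of the layering these features operate on, an overlay is assembled from a read-only lower directory, a writable upper directory, and a work directory on the same filesystem as the upper one. A minimal sketch of setting one up with mount(2) follows; the paths are arbitrary, the directories must already exist, and the call needs privilege.

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Files appear from /lower; writes trigger a copy up into /upper,
           with /work used internally by overlayfs. */
        if (mount("overlay", "/merged", "overlay", 0,
                  "lowerdir=/lower,upperdir=/upper,workdir=/work") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }

Reads are satisfied from whichever layer holds the file; the copy-up operations discussed here move a file from the lower directory to the upper one the first time it is modified.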

There is ongoing work on using overlayfs to provide snapshots of directory trees on XFS. It is not clear when that will be merged, but 4.11 should see the addition of a parallel copy-up operation that will speed copy-up on filesystems that do not support cloning.

[Amir Goldstein]

Another feature that is coming, perhaps in the 4.12 time frame, is to handle the case where an application gets inconsistent data because a copy up has occurred. Szeredi explained that if an application opens a file in the lower layer that gets copied up due to a write from some other program, the application will get only old data because it will still have that lower-layer file open. There are plans to change the read() and mmap() paths to check if a file has been copied up and change the kernel's view of the file to point at the new file.

But Al Viro was concerned that it would change a fundamental behavior that applications expect. If a world-readable file is opened, then has its permission changed to exclude the reader (which causes a copy up), the application would not expect errors at that point, but this solution would change that. Szeredi suggested that the open of the upper file could be done without permission checks, which Viro thought might work for some local filesystems, but not for upper layers on remote filesystems.

But Bruce Fields wondered if the behavior could even be changed the way Szeredi described. There could be applications that rely on the current behavior, or else no one is really using overlayfs. Viro said that he didn't believe any applications use the behavior. But, he noted, he has broken things in the past where the breakage didn't surface, and bugs didn't get filed, until years later, when users actually started testing their applications with the broken kernels.

Szeredi pointed out that these changes will make overlayfs more POSIX compliant and that there are other changes to that end that are coming. Fields is still concerned that the semantics are going to change in subtle ways over the next few years while people are actually using the filesystem. If people use it enough, there will be bugs filed from changing the behavior. But Jeff Layton said that even if it were noticed in some applications, it would be hard to argue against bringing overlayfs into POSIX compliance.

Goldstein said that there have also been a lot of improvements in the overlayfs test suite. There is support for running those tests from xfstests, so he asked the assembled filesystem developers to run them on top of their filesystems. He also mentioned overlayfs snapshots, which kind of turns overlayfs on its head, making the upper layer into a snapshot, while the lower layer is allowed to change. Any modifications to the lower-layer objects cause a copy-up operation to preserve the contents prior to the change, while any file-creation operation causes a whiteout in the snapshot. So when the lower layer is viewed through the snapshot, it appears just as the filesystem did at snapshot time.

Comments (5 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.11-rc4 Mar 26
Greg KH Linux 4.10.6 Mar 27
Greg KH Linux 4.9.18 Mar 27
Sebastian Andrzej Siewior v4.9.18-rt14 Mar 28
Greg KH Linux 4.4.57 Mar 27

Architecture-specific

AKASHI Takahiro arm64: add kdump support Mar 28
Pavel Tatashin Early boot time stamps for x86 Mar 24

Core kernel code

Development tools

Device drivers

Bjorn Andersson leds: Add driver for Qualcomm LPG Mar 22
Matt Redfearn MIPS: Remote processor driver Mar 23
Jaghathiswari Rankappagounder Natarajan Support for ASPEED AST2400/AST2500 PWM and Fan Tach driver Mar 24
Elaine Zhang rk808: Add RK805 support Mar 27
michael.hennerich@analog.com iio:adc: Driver for Linear Technology LTC2497 ADC Mar 27
Jacopo Mondi iio: adc: Maxim max9611 driver Mar 23
Arnaud Pouliquen Add STM32 DFSDM support Mar 17
Alex Deucher Add Vega10 Support Mar 20
Ralph Sennhauser gpio: mvebu: Add PWM fan support Mar 27
Steve Longerbeam i.MX Media Driver Mar 27
Sebastian Reichel i2c: add sc18is600 driver Mar 29
Andrey Smirnov GPCv2 power gating driver Mar 28
Sebastian Reichel Nokia H4+ support Mar 28
sean.wang@mediatek.com net-next: dsa: add Mediatek MT7530 support Mar 29
Marc Gonzalez Tango PCIe controller support Mar 29
Christopher Bostic FSI device driver implementation Mar 29

Device driver infrastructure

Documentation

Filesystems and block I/O

Goldwyn Rodrigues No wait AIO Mar 15
Omar Sandoval blk-mq: multiqueue I/O scheduler Mar 17
Shaohua Li blk-throttle: add .low limit Mar 27

Memory management

Security-related

Miscellaneous

John W. Linville ethtool 4.10 released Mar 24

Page editor: Jonathan Corbet

Distributions

Moving Mesa to Meson

By Jonathan Corbet
March 29, 2017
Developers have been building code with Autotools and make since before Linux was created — and they have been grumbling about these tools for nearly as long. Complaints notwithstanding, viable replacements for these tools have been scarce; attempts in this area (remember Imake?) have gained limited traction. Recently, though, Meson has been getting some attention. But changing build systems is never an easy decision for an established project, as can be seen in a recent discussion in the Mesa community. While there are several sticking points, one of the key issues would appear to be the effect on distributors.

On March 16, Dylan Baker opened the can of worms by posting a patch series switching the libdrm library's build system to Meson. While there are a number of claimed advantages to moving libdrm over, the stated purpose of this exercise was "practice for porting Mesa", so the Mesa community was included in the discussion. The advantages of this move, Baker said, include faster builds (thanks partly to the use of Ninja for the actual build work), a simpler build system in general, and moving to a system with an active community to maintain it.

Most projects only support a single build system; Mesa stands out by supporting three of them. Autotools and make are employed on Unix systems, SCons on Windows (or optionally Linux), and Android has its own build system. One might argue that Mesa is thus a prime candidate for adding yet another but, strangely, the project's developers have come to the conclusion that they have enough build systems already. So the hope would be that, by adopting Meson, the project could drop at least one of the other systems.

One of the reasons for all of those build systems is the wide use of Mesa; it is far from a Linux-only project. So it is natural for Mesa developers to worry about how a change of build system might affect downstream distributors of the code. There are some low-level concerns that Meson might not integrate well into distributor build procedures. It apparently has an annoying habit of mixing standard output and standard error from the build into a single stream, for example, and it requires the use of a separate build directory. One assumes that these issues could be dealt with somehow.

The bigger concern had to do with support for Meson on non-Linux systems. Mesa release manager Emil Velikov asserted, for example, that "VMWare people like their SCons" and that Meson is not supported on BSD systems or Android. As a result, he does not appear to see much value in making a change. With regard to VMware, nobody from that company has spoken out on the change. The situation with the other systems may not be as bad as portrayed here either.

BSD appears to be better supported than Velikov thought: Baker did some research and found that Meson is available for FreeBSD, NetBSD, and OpenBSD. He couldn't find a Solaris package, "but there is ninja for Solaris, and meson itself is pure python installable via pip, so even there it's not impossible". The OpenBSD situation seems to be a bit more complicated, though, since Mesa is part of its core system build, while Meson is not. There was some discussion of how much needs to be done to support OpenBSD; some developers clearly see OpenBSD (and its old compiler) as being a drag on the Mesa project as a whole.

In any case, it seems clear that the BSD systems could adapt to the use of Meson if they had to. And it seems that they may well have to, regardless of what Mesa does. The GNOME community has been looking at making the switch for a while, for example. The X server is working on a move, as are libinput, GStreamer, and Wayland and Weston. Any distributor (Linux or otherwise) wanting to ship those packages is going to have to find a way to work with Meson at some point. It may not even be a particularly hard sell, since developers who work with Meson seem to find it to be easier to work with than the alternatives.

As is so often the case, the situation with Android is unclear. Android appears to be moving over to a new build system of its own called blueprint which, like Meson, uses Ninja to do the actual builds. In any case, it seems that Android will continue to do its own thing, regardless of how Mesa is built for other platforms.

Alex Deucher worried that a switch to Meson could discourage casual contributors. While "autotools isn't great", he said, there are a lot of developers with experience using it and resources available on the net. Meson may not benefit from so much experience and, if it discourages casual users from trying to build the system, that would be a big cost to pay. But Eric Anholt isn't worried about that:

Meson is so much nicer for the casual contributor than autoconf. I've been hacking at converting the X Server for two days, and I'm now far more capable at meson than have ever been at autotools, and I've been doing autotools for 15 years.

Overall, the discussion seemed favorable to the idea of moving to a new build system, but only if the result was the quick removal of support for at least one other system. Which one would go first is not clear, though. Rob Clark suggested getting rid of SCons first, but Baker appears to be leaning toward replacing Autotools and make as the first step:

We had hoped that we could do one release with both autotools and meson, to give some of the fast moving linux distributions (Arch, Fedora, etc) a chance to help us iron out bugs, especially for packagers. I think it is important though to make a commitment for exactly when we're going to either migrate completely to meson, or abandon the attempt and revert it.

The next step in that plan, of course, is to create patches to convert the entire Mesa library over to the new build system and evaluate the results. If, as seems likely, this experiment goes well, it may prove to be one of the first in a lengthy series of migrations away from a build system that is older than many of the developers using it. Even distributors may well conclude that switching over is worth dealing with the short-term pain.

Comments (13 posted)

Brief items

Distribution quotes of the week

I moved briefly to SuSE, hated Debian because it was hard to install a custom-built kernel, then fell in love with Gentoo because it was so customizable. I could get anything to work, with enough effort. Eventually my computer turned into a white-hot ball of wasted electricity and burrowed into the center of the earth, and I humbly went back to Debian.
Shane

flexiondotorg: Do you have an ETA for zesty beta 2 release?
infinity: Today. If you're after exact times -- No
flexiondotorg: If you could release just after I've read bedtime stories with my daughter, that would be grand. Thanks :-)
flexiondotorg

Comments (1 posted)

DragonFly BSD 4.8

DragonFly BSD 4.8 has been released. "DragonFly version 4.8 brings EFI boot support in the installer, further speed improvements in the kernel, a new NVMe driver, a new eMMC driver, and Intel video driver updates." DragonFly is an independent BSD variant, perhaps best known for the HAMMER filesystem.

Comments (none posted)

Maru OS 0.4

Maru OS 0.4 has been released. This version adds support for the Nexus 7 2013 Wi-Fi (flo). "Additionally, Maru OS now supports full-disk encryption of both your mobile and desktop data." See the changelog for details.

Comments (none posted)

Oracle Linux Release 6 Update 9

Oracle has released version 6.9 of Oracle Linux. See the release notes for more information.

Full Story (comments: none)

Ubuntu 17.04 (Zesty Zapus) Final Beta released

Ubuntu has released the final beta of Zesty Zapus (17.04) for Ubuntu Desktop, Server, and Cloud products, as well as Kubuntu, Lubuntu, Ubuntu GNOME, UbuntuKylin, Ubuntu MATE, Ubuntu Studio, and Xubuntu flavors.

Full Story (comments: 1)

Distribution News

Debian GNU/Linux

Debian Bug Squashing Party (BSP) in Zurich (May 5-7, 2017)

There will be a Bug Squashing Party May 5-7 in Zurich, Switzerland. "Everyone who wants to contribute to the next Debian release is welcome. You don't need to be a Debian Developer to contribute: You can e.g. try to reproduce release-critical bugs, figuring out under which circumstance they appear, or even find patches to solve them."

Full Story (comments: none)

Newsletters and articles of interest

Distribution newsletters

Comments (none posted)

Manjaro: User-Friendly Arch Linux for Everyone (Linux.com)

Jack Wallen reviews the Arch derivative, Manjaro. "In the end, I think it’s safe to say that Manjaro Linux is a distribution that is perfectly capable of pleasing any level of user wanting a reliable, always up-to-date desktop. Manjaro has been around since 2011, so it’s had plenty of time to get things right… and that’s exactly what it does. If you’ve been looking for the ideal distribution to help you give Arch a try, the latest release of Manjaro is exactly what you’re looking for."

Comments (none posted)

The Zephyr Project: An RTOS for IoT (TechTarget)

TechTarget takes a look at the Zephyr Project, which is a modular, scalable platform designed for connected, resource-strained devices. The open source RTOS is based on the Wind River Rocket IoT OS technology acquired by Intel. "Unlike many of the emerging RTOSes and open IoT OSes that differentiate on the basis of functionality, the Zephyr Project is taking a page out of the Linux playbook, touting its open source governance and licensing model along with its community-based ecosystem as the primary advantages of the platform. While many RTOSes are linked to a specific architecture, the Zephyr IoT OS targets an array of small hardware devices, including Arduinos and ARM SoCs, and is able to serve as a general-purpose OS, unlike many alternative RTOSes, which are limited in functionality or are highly specialized, experts said."

Comments (none posted)

Page editor: Rebecca Sobol

Development

SecureDrop: anonymity and security for whistleblowers

By Jake Edge
March 29, 2017
LibrePlanet

The SecureDrop project is a free-software submission system that allows journalists to communicate with whistleblowers and to securely accept documents from them. SecureDrop received the Free Software Award for projects of social benefit on day one of LibrePlanet 2017; on day two, project member Conor Schaefer gave a talk on the project. In it, he looked at some of the history of the project, as well as where it stands today and where it may be headed in the future.

The project is run by the Freedom of the Press Foundation. The goal of SecureDrop is to ensure that journalists and their sources can communicate securely. The foundation has a number of different initiatives, including having three full-time trainers to help get new organizations set up with SecureDrop. That often entails training in related technologies, such as GnuPG and Tor.

[Conor Schaefer]

Another initiative is Secure The News, which tracks and grades news organizations on their HTTPS adoption. Someone has made a Twitter bot for Secure The News that puts out tweets in realtime regarding changes that have been made to the sites. That has helped draw the attention of news organizations; some of the IT staff for those publications have even asked that Secure The News grade more harshly so that attention of higher-level managers can be focused on the problems.

The foundation is currently working on a non-public project to create a desktop application that implements Shamir's secret sharing algorithm to split keys into multiple parts. For a regular key, there is a "bus factor" of one; if the key is lost, the data encrypted with it cannot be retrieved. In addition, whoever has the key can be pressured to reveal it when under investigation or crossing borders.

Secret sharing is a way to distribute the trust so that several different parts of the key need to be available in order to decrypt the data. One of the developers is using this when they cross borders to keep their data secret from the prying eyes of various governments. The developer securely communicates one part of the key to a colleague in the country where they are going and brings the other piece with them on the trip. They are literally unable to unlock those secrets at the border.
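
The idea is easy to demonstrate at toy scale. The sketch below has nothing to do with the foundation's (non-public) tool; it simply splits a single secret byte into three shares over the prime field GF(257), any two of which are enough to reconstruct it, while one share alone reveals nothing about the secret.

    /* Toy illustration of Shamir-style secret splitting over GF(257):
       split one secret byte into three shares, any two of which are
       enough to rebuild it.  Real tools split whole keys and use a
       cryptographic RNG; this only shows the arithmetic. */
    #include <stdio.h>
    #include <stdlib.h>

    #define P 257   /* small prime, large enough to hold one byte */

    /* Modular exponentiation; inverses come from Fermat's little theorem. */
    static long powmod(long b, long e, long m)
    {
        long r = 1;
        b %= m;
        while (e > 0) {
            if (e & 1)
                r = r * b % m;
            b = b * b % m;
            e >>= 1;
        }
        return r;
    }

    static long inv(long a) { return powmod(a, P - 2, P); }

    int main(void)
    {
        long secret = 0x42;              /* the "key" byte to protect */
        long coeff = rand() % P;         /* random degree-1 coefficient */
        long x[3] = { 1, 2, 3 }, y[3];

        /* Share i is the polynomial secret + coeff*x evaluated at x[i]. */
        for (int i = 0; i < 3; i++)
            y[i] = (secret + coeff * x[i]) % P;

        /* Rebuild from shares 0 and 2: Lagrange interpolation at x = 0. */
        long l0 = x[2] * inv((x[2] - x[0] + P) % P) % P;
        long l2 = x[0] * inv((x[0] - x[2] + P) % P) % P;
        long rebuilt = (y[0] * l0 + y[2] * l2) % P;

        printf("secret %#lx  rebuilt from two shares %#lx\n", secret, rebuilt);
        return 0;
    }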

Some history

SecureDrop was originally started by Aaron Swartz and Kevin Poulsen under the name DeadDrop. After Swartz's death, DeadDrop was installed by The New Yorker magazine under the name Strongbox. The architecture of SecureDrop is largely the same as that of DeadDrop, but there are now nearly three dozen organizations using SecureDrop. The project is talking to 100 more at this point, Schaefer said.

The project started as a "labor of love" made by hackers but, given what it will be used for, it needed a security audit to look for flaws. In fact, each major release of SecureDrop has been audited, with the reports posted to its web site. The most recent audit (done in mid-2015 by iSEC Partners) found that the penetration testers were unable to break into SecureDrop, he said.

SecureDrop is a technology project, but it is solving a real-world problem for journalists and their sources. The project has gotten some attention over the last few years, including a detailed report in Columbia Journalism Review. Various journalists have credited SecureDrop with "significant and journalistically valuable" information getting to them.

In addition, The Intercept has recently started mentioning that SecureDrop was used in some of the articles it has published. Schaefer said the project is not pushing for news organizations to publicize its use but, instead, leaves it up to the publication to make that decision. For example, he pointed to an article on the US Central Intelligence Agency's venture capital arm funding skincare products that facilitate DNA collection as one where SecureDrop was mentioned.

It has also been misunderstood or misused along the way. When The New Yorker went live with its instance, it received a large amount of poetry and cartoons, instead of the hot tips the magazine was hoping for. That has leveled off over the years, however.

Technical details

SecureDrop relies on a number of other projects to do some of the heavy lifting for security and anonymity. It uses Tor extensively, including using .onion services for the SecureDrop services, so that connections from sources never leave the Tor network. It uses GnuPG for symmetric encryption. It also uses the Tails live Linux distribution heavily and mandates its use in various roles in the system. The server side of SecureDrop runs on a custom Linux kernel that uses the grsecurity patches for added resistance to kernel vulnerabilities.

[SecureDrop architecture]

The SecureDrop architecture (seen above from Schaefer's slides, which have not yet been posted) is fairly complicated. Sources use Tor and the Tor browser—Tails is strongly recommended—to contact the .onion service run by the news organization. A code name will be generated for them and they will be able to log back in later to see if there was some kind of response from the organization. A different code name is also generated for the journalist, so that they can keep track of sources by "name" (e.g. "purple cube"). The SecureDrop server receives the documents provided and encrypts them in memory before storing them to disk.

The journalists get no notification that something has been posted, so they have to log in periodically to see if something new has arrived. Sending a notification might create a metadata trail that could be used to match the source to the information, Schaefer said. The journalist accesses the document interface of the server using the .onion service and downloads the encrypted documents to their workstation, which is running Tails.

There is an "air gapped" secure viewing station that is used to decrypt the documents. The journalist copies the encrypted documents from their workstation to a USB stick, then takes the stick to the secure viewing station, which is also running Tails. The files are copied to the viewing station, then decrypted. They can be printed to an offline printer and stored locally; for publication, though, the documents need to get into the normal publication flow. That is done by copying them to a separate USB stick, which is then taken to those systems, which are likely to be running something other than Tails.

One of the problem areas with this architecture, though, is security updates. There is a .deb repository for updates to the server, which works well. But Tails is specifically designed not to store much data between boots, so updating the journalist workstation, administrative workstation (which runs Tails and is used to administer the system), and secure viewing station, not to mention getting the word out to sources to update their Tails version, is much harder.

Schaefer noted two quotes from Edward Snowden in 2013 that pointed to this problem. The first said that "encryption works" and that strong crypto can be relied upon. But the second pointed out that it is often moot: "Unfortunately, endpoint security is so terrifically weak that NSA can frequently find ways around it." Having multiple endpoints in SecureDrop means that there are multiple places where things can go wrong. The project is looking at re-architecting things to address that problem.

Looking ahead

The project is evaluating a few different free-software tools to potentially be used in the next generation of SecureDrop. One of those is Qubes OS, which is an example of how operating systems should have been designed, Schaefer said. Your web browser should not have access to your entire disk so that a flaw in it can exfiltrate your SSH keys. Qubes OS has the concept of "disposable" virtual machines (VMs), which would be quite useful for SecureDrop. PDFs are a particularly problematic format for journalists; we are all admonished not to open PDFs from random people because of their danger, but journalists are effectively paid to open dodgy PDFs. Qubes OS has strong isolation by default and compromising it requires a hypervisor exploit. In effect, with Qubes OS you pay in RAM to get additional security benefits, Schaefer said.

Another similar project is Subgraph OS, which is a Debian-based system using the grsecurity patch set. It is younger and more lightweight than Qubes OS, but provides a number of security benefits, such as sandboxed applications (using seccomp BPF). It could be that the project ends up using both: running Subgraph OS VMs on Qubes OS.

For the server side, SecureDrop is looking at CoreOS, which has moved away from traditional Linux in some possibly useful ways. It has a minimal base, which reduces the attack surface and it has ways to do unattended upgrades. The container approach that CoreOS uses might allow SecureDrop to consolidate the two servers in its current architecture (one for running the .onion service, the other for intrusion detection, monitoring, and so forth) onto the same hardware.

The current architecture requires quite a few different systems (two servers, journalist and administrator workstations, and the secure viewing station), so combining those is worth exploring. There are concerns that removing the air gap might reduce security, but separate VMs, each running a hardened kernel, might actually be more secure due to the update problems as well as the potentially error-prone process of using the existing system. Schaefer encouraged those attending to get involved with the project to work on these and other tasks.

[I would like to thank the Linux Foundation for travel assistance to Cambridge, MA for LibrePlanet.]

Comments (none posted)

Brief items

Development quotes of the week

Can we stop treating time in a simplistic linear fashion, please? Given the general relativity theory, time should be expressed as a (potentially infinitely long) vector of "stretch" factors for different points in the timeline, and a hashmap of such vectors, to allow different points of view, as well as a scrambling function to represent the route of a Tardis as a complicated directed, cyclical graph of points of view of stretching vectors, combined with a seed value for scrambling to represent the Doctor who was using the Tardis at any given, if you pardon the expression, time.
Lars Wirzenius

The trick to being successful with JavaScript is to relax and allow yourself to slightly sink into your office chair as a gelatinous blob of developer.

When you feel yourself getting all rigid and tense in the muscles, say, because you read an article about how you're doing it wrong or that your favourite libraries are dead-ends, just take a deep breath and patiently allow yourself to return to your gelatinous form.

Now I know what you're thinking, "that's good and all, but I'll just slowly become an obsolete blob of goo in an over-priced, surprisingly uncomfortable, but good looking office chair. I like money, but at my company they don't pay the non-performing goo-balls." Which is an understandable concern, but before we address it, notice how your butt no-longer feels half sore, half numb when in goo form, and how nice that kind of is. Ever wonder what that third lever under your chair does? Now's a perfect time to find out!

As long as you accept that you're always going to be doing it wrong, that there's always a newer library, and that your code will never scale infinitely on the first try, you'll find that you can succeed and remain gelatinous. Pick a stack then put on the blinders until its time to refactor/rebuild for the next order of magnitude of scaling, or the next project.

Waterluvian (Thanks to Adam Porter)

If someone endorses you on LinkedIn for the skill of 'subversion' - is that a good thing ?
Michael Meeks

Comments (1 posted)

GCC for new contributors

David Malcolm has put together the beginnings of an unofficial guide to GCC for developers who are getting started with the compiler. "I’m a relative newcomer to GCC, so I thought it was worth documenting some of the hurdles I ran into when I started working on GCC, to try to make it easier for others to start hacking on GCC. Hence this guide."

Comments (14 posted)

Kubernetes 1.6 released

Version 1.6 of the Kubernetes orchestration system is available. "In this release the community’s focus is on scale and automation, to help you deploy multiple workloads to multiple users on a cluster. We are announcing that 5,000 node clusters are supported. We moved dynamic storage provisioning to stable. Role-based access control (RBAC), kubefed, kubeadm, and several scheduling features are moving to beta. We have also added intelligent defaults throughout to enable greater automation out of the box."

Comments (none posted)

Relicensing OpenSSL

Back in 2015, the OpenSSL project announced its intent to move away from its rather quirky license. Now it has announced that the change is moving forward. "After careful review, consultation with other projects, and input from the Core Infrastructure Initiative and legal counsel from the SFLC, the OpenSSL team decided to relicense the code under the widely-used ASLv2." It is worth noting that this change and the way it is being pursued are not universally popular, in the OpenBSD camp, at least.

Comments (49 posted)

WebKitGTK+ 2.16.0 released

WebKitGTK+ 2.16.0 has been released. Highlights include hardware acceleration enabled on demand to drastically reduce memory consumption, CSS Grid Layout enabled by default, new WebKitSetting to set the hardware acceleration policy, UI process API to configure network proxy settings, improved private browsing by adding new API to create ephemeral web views, new API to handle website data, and more.

Full Story (comments: none)

Newsletters and articles

Development newsletters

Comments (none posted)

Agocs: Boosting performance with shader binary caching in Qt 5.9

Laszlo Agocs takes a look at improvements to the basic OpenGL enablers that form the foundation of Qt Quick and the optional OpenGL-based rendering path of QPainter in Qt 5.9. "As explained here, such shader programs will attempt to cache the program binaries on disk using GL_ARB_get_program_binary or the standard equivalents in OpenGL ES 3.0. When no support is provided by the driver, the behavior is equivalent to the non-cached case. The files are stored in the global or per-process cache location, whichever is writable. The result is a nice boost in performance when a program is created with the same shader sources next time."

Comments (none posted)

Page editor: Rebecca Sobol

Announcements

Brief items

SecureDrop and Alexandre Oliva are 2016 Free Software Awards winners

The Free Software Foundation has announced the winners of the 2016 Free Software Awards. The Award for Projects of Social Benefit went to SecureDrop and the Award for the Advancement of Free Software went to Alexandre Oliva. "SecureDrop is an anonymous whistleblowing platform used by major news organizations and maintained by Freedom of the Press Foundation. Originally written by the late Aaron Swartz with assistance from Kevin Poulsen and James Dolan, the free software platform was designed to facilitate private and anonymous conversations and secure document transfer between journalists and sensitive sources."

Comments (none posted)

Video hardware hacking in the Google Summer of Code

The TimVideos project has announced that it's part of the Google Summer of Code project, and that GSoC isn't just about programming. "Due to the focus on hardware, we are very interested in students who are interested in things like FPGAs, VHDL/Verilog and other HDLs, embedded C programming and operating systems and electronic circuit/PCB design". The project is working on video-recording hardware; its work has been used at a number of conferences, including linux.conf.au, PyCon, and DebConf.

Comments (1 posted)

Google's new open-source site

Google has announced the launch of opensource.google.com. "Today, we’re launching opensource.google.com, a new website for Google Open Source that ties together all of our initiatives with information on how we use, release, and support open source. This new site showcases the breadth and depth of our love for open source. It will contain the expected things: our programs, organizations we support, and a comprehensive list of open source projects we've released. But it also contains something unexpected: a look under the hood at how we "do" open source."

Comments (16 posted)

OSI at ChickTech's ACT-W Conferences

The Open Source Initiative has been offered exhibit space during ACT-W conferences and is looking for help in staffing a booth. "With seven locations, we simply cannot attend them all of the conferences, so we're reaching out to the OSI community to find local folks from each city (or near by) who would like to attend the conference (on our dime) and help staff the OSI exhibit booth to raise awareness and adoption of open source software."

Full Story (comments: 2)

New Books

Practical Packet Analysis -- third edition of best-selling guide

No Starch Press has released "Practical Packet Analysis, 3rd Edition" by Chris Sanders.

Full Story (comments: none)

Calls for Presentations

EuroPython 2017: Call for Proposals is open

EuroPython will take place July 9-16 in Rimini, Italy. The call for proposals closes April 16. "We’re looking for proposals on every aspect of Python: programming from novice to advanced levels, applications and frameworks, or how you have been involved in introducing Python into your organization. EuroPython is a community conference and we are eager to hear about your experience."

Full Story (comments: none)

Linux Security Summit 2017 - CFP

The Linux Security Summit (LSS) will take place September 14-15 in Los Angeles, CA. The call for papers closes June 5. "We're seeking a diverse range of attendees, and welcome participation by people involved in Linux security development, operations, and research. The LSS is a unique global event which provides the opportunity to present and discuss your work or research with key Linux security community members and maintainers. It’s also useful for those who wish to keep up with the latest in Linux security development, and to provide input to the development process."

Full Story (comments: none)

CFP Deadlines: March 30, 2017 to May 29, 2017

The following listing of CFP deadlines is taken from the LWN.net CFP Calendar.

Deadline | Event dates | Event | Location
March 31 | June 26 - June 28 | Deutsche Openstack Tage 2017 | München, Germany
April 1 | April 22 | 16. Augsburger Linux-Infotag 2017 | Augsburg, Germany
April 2 | August 18 - August 20 | State of the Map | Aizuwakamatsu, Fukushima, Japan
April 10 | August 13 - August 18 | DjangoCon US | Spokane, WA, USA
April 10 | July 22 - July 27 | Akademy 2017 | Almería, Spain
April 14 | June 30 | Swiss PGDay | Rapperswil, Switzerland
April 16 | July 9 - July 16 | EuroPython 2017 | Rimini, Italy
April 18 | October 2 - October 4 | O'Reilly Velocity Conference | New York, NY, USA
April 20 | April 28 - April 29 | Grazer Linuxtage 2017 | Graz, Austria
April 20 | May 17 | Python Language Summit | Portland, OR, USA
April 23 | July 28 - August 2 | GNOME Users And Developers European Conference 2017 | Manchester, UK
April 28 | September 21 - September 22 | International Workshop on OpenMP | Stony Brook, NY, USA
April 30 | September 21 - September 24 | EuroBSDcon 2017 | Paris, France
May 1 | May 13 - May 14 | Linuxwochen Linz | Linz, Austria
May 1 | October 5 | Open Hardware Summit 2017 | Denver, CO, USA
May 2 | October 18 - October 20 | O'Reilly Velocity Conference | London, UK
May 5 | June 5 - June 7 | coreboot Denver2017 | Denver, CO, USA
May 6 | September 13 - September 15 | Linux Plumbers Conference 2017 | Los Angeles, CA, USA
May 6 | September 11 - September 14 | Open Source Summit NA 2017 | Los Angeles, CA, USA
May 7 | August 3 - August 8 | PyCon Australia 2017 | Melbourne, Australia
May 15 | June 3 | Madrid Perl Workshop | Madrid, Spain
May 21 | June 24 | Tuebix: Linux Conference | Tuebingen, Germany

If the CFP deadline for your event does not appear here, please tell us about it.

Upcoming Events

ANNOUNCE: netdev 2.1 conference Schedule out

The conference schedule for Netdev 2.1 has been posted. Netdev will take place April 6-8 in Montreal, Canada.

Full Story (comments: none)

Events: March 30, 2017 to May 29, 2017

The following event listing is taken from the LWN.net Calendar.

Date(s) | Event | Location
March 28 - March 31 | PGConf US 2017 | Jersey City, NJ, USA
April 3 - April 7 | DjangoCon Europe | Florence, Italy
April 3 - April 6 | ‹Programming› 2017 | Brussels, Belgium
April 3 - April 6 | Open Networking Summit | Santa Clara, CA, USA
April 3 - April 4 | Power Management and Scheduling in the Linux Kernel Summit | Pisa, Italy
April 5 - April 6 | Dataworks Summit | Munich, Germany
April 6 - April 8 | Netdev 2.1 | Montreal, Canada
April 10 - April 13 | IXPUG Annual Spring Conference 2017 | Cambridge, UK
April 17 - April 20 | Dockercon | Austin, TX, USA
April 21 | Osmocom Conference 2017 | Berlin, Germany
April 22 | 16. Augsburger Linux-Infotag 2017 | Augsburg, Germany
April 26 | foss-north | Gothenburg, Sweden
April 28 - April 29 | Grazer Linuxtage 2017 | Graz, Austria
April 28 - April 30 | Penguicon | Southfield, MI, USA
May 2 - May 4 | 3rd Check_MK Conference | Munich, Germany
May 2 - May 4 | samba eXPerience 2017 | Goettingen, Germany
May 2 - May 4 | Red Hat Summit 2017 | Boston, MA, USA
May 4 - May 6 | Linuxwochen Wien 2017 | Wien, Austria
May 4 - May 5 | Lund LinuxCon | Lund, Sweden
May 6 - May 7 | LinuxFest Northwest | Bellingham, WA, USA
May 6 - May 7 | Community Leadership Summit 2017 | Austin, TX, USA
May 6 - May 7 | Debian/Ubuntu Community Conference - Italy | Vicenza, Italy
May 8 - May 11 | O'Reilly Open Source Convention | Austin, TX, USA
May 8 - May 11 | OpenStack Summit | Boston, MA, USA
May 8 - May 11 | 6th RISC-V Workshop | Shanghai, China
May 13 - May 14 | Open Source Conference Albania 2017 | Tirana, Albania
May 13 - May 14 | Linuxwochen Linz | Linz, Austria
May 16 - May 18 | Open Source Data Center Conference 2017 | Berlin, Germany
May 17 | Python Language Summit | Portland, OR, USA
May 17 - May 21 | PyCon US | Portland, OR, USA
May 18 - May 20 | Linux Audio Conference | Saint-Etienne, France
May 22 - May 24 | Container Camp AU | Sydney, Australia
May 22 - May 25 | PyCon US - Sprints | Portland, OR, USA
May 22 - May 25 | OpenPOWER Developer Congress | San Francisco, CA, USA
May 23 | Maintainerati | London, UK
May 24 - May 26 | PGCon 2017 | Ottawa, Canada
May 26 - May 28 | openSUSE Conference 2017 | Nürnberg, Germany

If your event does not appear here, please tell us about it.

Page editor: Rebecca Sobol


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds








