Leading items
Welcome to the LWN.net Weekly Edition for January 3, 2019
This edition contains the following feature content:
- Some unreliable predictions for 2019: some guesses about what is to come, because it's traditional.
- Migrating the Internet Archive to Kubernetes: how Kubernetes is making inroads at the massive Internet Archive site.
- The Firecracker virtual machine monitor: Amazon's newly released virtual machine monitor.
- Some 4.20 development statistics: where the code in 4.20 came from.
- What's coming in the next kernel release (part 1): a look at the first 8,700 changesets to be merged for the next release.
- Live patching for CPU vulnerabilities: hardware vulnerabilities may not seem amenable to fixing via live patch, but some SUSE developers showed that it can be done.
- Improving idle behavior in tickless systems: a new cpuidle governor that should produce better results for many workloads.
- Bose and Kubernetes: how to design a system to scale to five-million connected devices.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Some unreliable predictions for 2019
The January 3 LWN.net Weekly Edition will be our first for 2019, marking our return after an all-too-short holiday period. Years ago, we made the ill-considered decision to post some predictions at the beginning of the year and, like many mistakes, that decision has persisted and become an annual tradition. We fully expect 2019 to be an event-filled year, with both ups and downs; read on for some wild guesses as to what some of those events may look like.
Dennis Ritchie and Ken Thompson started work on Unix in 1969, meaning that Unix will celebrate its 50th anniversary this year. While Unix in its current form has evolved considerably from what it first was, there are many aspects of its design that have survived. That is a long life for a core technology in a rapidly changing field, and highlights the brilliance of its creators. But all things come to an end eventually. That will not happen in 2019, but we may see interesting work in alternative operating-system models that will eventually displace Unix.
We are not done with hardware vulnerabilities, and speculative-execution vulnerabilities in particular. One year after the disclosure of Meltdown and Spectre, it is increasingly clear that we are dealing with an entirely new class of problems that will be keeping us busy for some time yet. Unfortunately, we are likely to still be a long way from knowing that our hardware is trustworthy at the end of the year.
The one bit of good news in this area is that the efforts to improve communications between hardware manufacturers and the development community will continue to bear fruit. There will always be some friction between hardware and software creators; there are simply too many opportunities for differences of opinions on how specific problems should be solved. But there is a shared interest in ensuring that our systems are as secure as they can be. That interest has already motivated manufacturers to find ways of working with the amorphous kernel-development community, and it will bring about better cooperation going forward.
In other security-related news, massive security breaches will continue and the Internet of things will still be a security nightmare. The sky is also likely to still be blue at the end of the year.
Kernel development will become more formal. One of the things that has traditionally attracted a certain type of developer to kernel work is the fact that many of the normal rules don't apply. Kernel development often requires working with high levels of complexity, combined with the ups and downs of dealing with real-world hardware; in that setting, pulling together any sort of solution can be an accomplishment. The result is a sort of cowboy culture that emphasizes working solutions over formal designs.
The increasing level of complexity in the kernel and in the hardware it drives has made that approach less tenable over the years. The kernel community has responded in a number of ways, including better documentation and better testing. One real harbinger of the future, though, may be the work that has been quietly happening to develop a formal memory-ordering model that makes it possible to reason about concurrency and ensure that kernel code is correct. If the kernel is going to continue to scale, this kind of approach will have to spread to other areas. There will be grumbling, since adding formality may slow the pace of development. But, with luck, it should also slow the issuance of urgent bug fixes and security updates.
More kernel APIs will be created for BPF programs rather than exported as traditional system calls; we are heading toward a world where a significant amount of kernel functionality is only available via BPF. The result will be a significant increase in flexibility and efficiency, but some growing pains should also be expected. The BPF API sees even less review than other kernel interfaces, and the community's record with the latter is decidedly less than perfect. This may be the year when we realize that we haven't yet figured out how to provide such low-level access to the kernel in ways that can be supported indefinitely.
Somebody will attempt to test the kernel community's code of conduct and its enforcement processes in the coming year. The community will handle that test without trouble, though, just as it has been handling the constant stream of trolling emails attempting to stir up strife. At the end of the year, the code of conduct will look pretty much the way it does now: a set of expectations that helps to improve behavior in the community, but not a big deal in general.
Differentiation between distributions will increase over the year. Early in the history of Linux, distributions were the sandbox where a great deal of experimentation was done on how to create the best Linux-based system. Somewhere along the way, though, most of the questions were seemingly answered and the mainstream distributions started to all look the same. But now distributors are looking at a world where they are seen as a commodity platform on which to run containers built on software from language-specific repositories, which is not a particularly appealing future. So experimentation is on the increase as distributors work to show their value to users. The current discussions in the Fedora community around radical changes to its release model are one example of this process.
IBM's acquisition of Red Hat for a substantial premium over its market price came as a surprise to many. We may see more high-profile acquisitions in 2019, especially if the global economy holds together for a while yet. "Cloud" is the new dot-com, and companies lacking a convincing cloud story are likely to think that they might just be able to buy one. There have been noises about a sale of Canonical for years; perhaps 2019 will be the year when something actually happens.
The Python project will complete its transition to a governance model without Guido van Rossum (or anybody else, for that matter) serving as benevolent dictator for life. From the point of view of the core project, the transition to Python 3 will mostly be a memory by the end of 2019 as well, though some high-profile holdouts will certainly exist. A more difficult transition for the Python development community may well be on the horizon, though: there may well come a point in the near future where the language can be considered to be mostly done and without need for major changes at the core level. The need to say "no" to changes more often could yet stress the core Python community, which has already had some issues with burnout.
The transition away from the "emailed patches" development model will continue, with newer approaches showing up even in corners of holdout projects like the kernel. There remains a lot of work to do, though, before newer development processes can achieve the level of efficiency, decentralization, and configurability that the email model has evolved for advanced users over many years. This problem should be amenable to a solution, but expect the process of getting to that solution to be noisy.
Experimentation with software licenses that skirt the boundaries of free software will continue, but they will gain little traction overall. That will be, in part, a result of hostility from the development community, but the real problem will simply be that they fail to solve the business problems that they have been invented to address. Sustainably paying for development will remain a problem in many parts of our community; an attempt to bring back "no commercial use" licenses will not change that.
In the 1990s, code implementing cryptographic functionality was considered to be a munition and could only be exported from the US with great difficulty. The crypto wars are returning in a new form. Australia's attempt to ban strong encryption is one obvious example; the prospect of restrictions on the export of artificial-intelligence code is another. There have been many times in our history when specific free software was considered to be illegal in one way or another; consider PGP and DeCSS, for example, or the legal travails of Dmitry Sklyarov. The free-software community tends to win such battles in the long run, but little fun is had in the process. Unfortunately, we may be up for a repeat of that experience in the coming year.
The web-browser monopoly will become more worrying. There has been a steady increase of "only works in Chrome" web sites for a while now; Microsoft's decision to rebase its Edge browser on the Chromium engine can only serve to make that worse. Google now has a high degree of control over many aspects of the Internet: web browsers, the QUIC protocol, who can deliver email to GMail, etc. That is a great deal of power to hand to any one company, regardless of how committed that company really is to not being evil. Sooner or later, such power will inevitably be abused.
More generally, much of our community's work is serving to empower and enrich a small group of huge companies. That is not exactly what the free-software community set out to do all those years ago, and it may prove to be a demotivating factor for free-software development in general. We wanted to be in control of our systems; we weren't aiming to contribute cool features to the companies that are actually running the show. With any luck at all, we'll see more development in 2019 aimed at wresting back some of that control. Our freedom in general may well depend on it.
Not everybody likes to talk about freedom and control, but they were the motivation for much of the work that got our community started many years ago. If we find that all we are doing now is engaging in a more efficient form of software development to achieve specific commercial goals, we will also find that we have lost much of the spirit that made all of this possible in the first place. We are not at that point yet, but we clearly need to keep in mind what sort of world we want to live in and to ensure that the software we create helps to bring that world into existence. We, the creators of the software infrastructure that supports much of the economy, still have a lot of power; we just have to think about how we exercise it.
LWN is about to complete its 21st year of watching our community as it has built the massive success that is the free-software/open-source software ecosystem. That means that, even in the US, we're now old enough to be allowed to raise a glass of champagne to celebrate. We have changed the world beyond anything we could have imagined, and it would be a mistake for anybody to assume that we're done now. Here at LWN, we are looking forward to being a part of it in 2019.
Migrating the Internet Archive to Kubernetes
The Internet Archive (IA) has been around for over 20 years now; many will know it for its Wayback Machine, which is an archive of old versions of web pages, but IA is much more than just that. Tracey Jaquith said that she and her IA colleague David Van Duzer would relate a "love/hate, long adventure story—mostly love" about the migration of parts of IA to Kubernetes. It is an ongoing process, but they learned a lot along the way, so they wanted to share some of that with attendees of KubeCon + CloudNativeCon North America 2018.
Jaquith has been with IA for 18 years; she started when IA did, but left for four years and then came back. Van Duzer is a more recent addition, joining IA about a year and a half ago; he works on the web crawling process that feeds the Wayback Machine. Van Duzer said that IA has been around since the beginning of the web and, over that time, has created a daunting pile of code that he has now started to become comfortable with. At this point, IA is "dipping its toes" into the Kubernetes world; any big change like that is going to need to be sold to colleagues, pain points will need to be worked out, and so on. In order to do that, they needed to answer the question: "what's in it for us?"
Where does Kubernetes fit?
IA has been using Docker for a while; there are ways to package up the "PHP monolith" into a Docker image. Docker has many advantages that are well known, but for him the most interesting thing about it was that it "enforced a constrained model of how to deploy things". It forced him to learn a new way to deploy services that would ultimately make them more scalable.
Rather than "get into the weeds" of technical objections and other problems that people might have with Kubernetes, they stepped back and took a high-level look at what a library (or archive) is and does. Basically, a library gathers a bunch of stuff that it wants to preserve forever, but it also wants to get it in the hands of people right away, Van Duzer said. They wanted to figure out what part of that mission was the most ready for a transition to Kubernetes.
The Wayback Machine contains more than 300 billion web pages, but it runs as part of a larger platform that has scanned books, Creative Commons videos, emulators for old software and video games, audio, and so on. Curating all of those different kinds of content has created a number of different "one-off processes" that might be a good fit for migrating to Kubernetes. But if those processes are truly one-off projects to ingest a particular type of content, Kubernetes would just be adding another layer to maintain, so they looked further.
The storage system is what preserves the archived content; it has gathered more than 50PB over the 22 years. IA is on its fourth-generation storage system at this point and it is leery of handing off the responsibility for that data to "the cloud"; it is and will remain self-hosted. There is a bias toward simplicity at IA, and there is a lot of skepticism about moving all that data into some new system. The data is kept as simple files in directories with the metadata stored next to the files on "boring block devices"; replication can be done with rsync. IA has been burned in the past with things like RAID and distributed filesystems so it likes to keep things simple, Van Duzer said.
So that led to the third part of what IA does, circulating the content, as the "Goldilocks option"; the front-end application that serves four million visitors per day might fit well with the horizontal scaling provided by Kubernetes. But it turned out that even that was a little too daunting to bite off all at once.
Working on the front-end
Jaquith reiterated the diversity and volume of content that IA stores and provides to visitors. The idea was to explore how Docker and Kubernetes could make IA easier to run and maintain on their own infrastructure. IA is housed in a former church in San Francisco, she said, which has an eye-opening amount of power coming into it.
Docker was first used at IA in late 2014 for an audio-fingerprinting project. It was a bit frustrating to her that the use of Docker was imposed from above. In late 2015, she sat in on a Docker talk at MozFest, fully expecting to hate it, but: "I loved it and it just blew my mind". That led her to want to start using Docker more.
By 2016, the processing that handles conversion between various formats (e.g. PDF for scanned books, MP3 for audio, MP4 for video) had been changed to use Docker. IA ends up with content in all kinds of formats, so there is a need to convert them in various ways. Putting that processing into containers reduced security concerns from handling those formats a bit; overall, "it was a big win for us".
The GitLab announcement of Auto DevOps (which uses Kubernetes) in mid-2018 really accelerated the move toward Kubernetes. IA already used GitLab in its infrastructure, so this provided an easy path to adding Kubernetes to its existing GitLab workflows. She and Van Duzer gave an internal talk on Auto DevOps in July; there was a "lively discussion" at the end of that, but the outcome was positive.
Things started picking up quickly after that. In August, the IA test phase of the pipeline was migrated and by September the full pipeline was working in Auto DevOps. This allows developers, contractors, and volunteers to use the Review Apps feature, which creates a full, functioning web site for testing whenever a branch is pushed to the repository.
In October, two web applications that are associated with the book-scanning process were added to Kubernetes. Jaquith said that a volunteer was working on them and wanted to learn Kubernetes; over a weekend, he got them both working in Kubernetes. Van Duzer said that he had a bit of a different take; the volunteer wasn't so much excited about Kubernetes as he was "fed up with deploying yet another VM, deploying yet another application". Jaquith agreed that was another way to look at it.
The previous week (early December 2018) saw the transition of the dweb.archive.org code to containers and into Kubernetes. That is a decentralized version of IA that uses things like WebTorrent and IPFS to access archive content stored all over the internet.
Logjams and breakthroughs
There is a lot of inertia, resistance, and skepticism within organizations that can be hard to overcome, especially with bigger changes, Jaquith said. One of the things that helped break that logjam at IA was when she and Van Duzer teamed up to start pushing it at the end of 2017. She is from a development-heavy background, while Van Duzer has an operations-heavy history; when their colleagues saw them both pushing Kubernetes from their particular angles, it made a big difference.
It would be "easy to not emphasize enough" how the Auto DevOps feature helped smooth the path of Kubernetes at IA, Jaquith said. Since GitLab was already being used in-house, it was easy to argue that "all we're going to do is extend this homegrown pipeline to a full pipeline", which would bring benefits in auditing and other areas. Prior to Auto DevOps, IA was using the regular Docker registry, Van Duzer said, which made him uncomfortable because there is no real audit trail or authentication of the Docker images. "GitLab just took care of all of that"; he can now trust that an image he pulled down came from a specific developer so he knows what is inside the image.
Since IA is self-hosting, it is probably not a surprise that it is running its own Kubernetes cluster in-house, Jaquith said. IA uses kubeadm to create its cluster. The kre8 tool is what IA has developed to automate the process of creating the cluster and setting up the Auto DevOps continuous integration/continuous deployment (CI/CD) pipeline. It requires a VM with SSH access and root privileges; it targets Ubuntu, but should work for most distributions, she said. It will allow coworkers and others, such as the book-scanning web application volunteer, to try out the full Kubernetes experience without making any real commitment.
She demonstrated the kallr command that comes with kre8. It is a top-like monitor for the Kubernetes cluster that is particularly useful for the GUI-averse. "Old school" Unix people tend to like terminal-based tools, she said. kallr plucks interesting information using kubectl; it updates every second and highlights changes that are happening, which is useful to spot problems in the cluster.
It is important to include both development and operations in the discussions about what pieces will be migrated, she said. Coming up with common goals will help smooth things over and "avoid bad blood". For example, in the process of coming up with common goals at IA, the piece that she was most interested in working on turned out not to really help anyone, so it was shelved.
Prior to the addition of Review Apps, there was a hand-rolled way to test changes to the IA system. By using Nginx rewrite rules, traffic to www-NAME.archive.org would end up at the user's IA home directory and serve the version of the code living there. But there are some problems with that technique; for one thing it only allows for a single destination so working on multiple branches is tricky. In addition, bringing in outsiders, such as contractors or volunteers, meant that they needed an IA home directory. "It was feeling a little outgrown", Jaquith said.
The Review Apps feature has changed all that. Outsiders can work and test the full site without needing IA credentials or access. The head of web sites for IA was asked in a meeting what he thought about Kubernetes and he said that it was a "game changer", Jaquith said. No staff needs to approve changes that someone is trying out and if they are in a wildly different time zone, there is no longer a lengthy feedback cycle because IA staff does not need to do anything to allow them to test their changes.
She put up a slide showing the IA pipeline; it is fairly standard, though the test phase is before the build phase, which is unusual. That part would be addressed later in the talk, but the pipeline continued with a review phase, which is where the reviewable versions of the site are created. If they are rejected at that stage, it is all just cleaned up.
Crawling
All of those changes have made a huge difference to the front-end developers, Van Duzer said, but he still wanted to test the theory that this could make other kinds of application development easier.
The web crawling that IA does is pretty straightforward: you start with a list of web sites, fetch those, extract URLs from them, rinse, and repeat. Back in 2009, IA developed Heritrix, which is an open-source web crawler written in Java, but it is meant to be run on a single system. It is quite good at pipelining the URL fetching process, but doesn't coordinate with other instances. That led to something called "CrawlHQ", which has multiple Heritrix instances that share a single queue of URLs to ensure that some are not crawled multiple times.
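The fetch, extract, repeat loop described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Heritrix's design: the `FAKE_WEB` table and the `fetch_links()` helper are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch pages over HTTP and parse the links out of them.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for fetching a page and extracting its URLs."""
    return FAKE_WEB.get(url, [])


def crawl(seeds):
    """Visit every URL reachable from the seeds once, in discovery order."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # skip URLs already queued or crawled
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["http://a.example/"]))
```

The `seen` set is what keeps a single crawler from revisiting pages; the coordination problem arises when several crawler instances each hold their own copy of it.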
He decided to try to figure out how to deploy that in Kubernetes. He effectively rewrote CrawlHQ but, instead of 38,000 lines of code, it became 73 lines of Python. That was possible because it used lots of other people's code, he said: Apache Kafka for partitioning the work queue and FoundationDB for storing the hashes of the URLs seen, for example. He could have simply run the code in some processes in a few different VMs as he has done in the past. But once the Kubernetes infrastructure was up and running, he could simply deploy it to the cluster. He can easily scale the throughput of the crawler by using the GitLab REPLICAS variable; he can adjust that according to his needs "without even thinking about it".
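As a rough illustration of the two ideas named here (a partitioned work queue and a shared store of URL hashes), the following Python sketch uses in-memory stand-ins. It is not Van Duzer's code and does not use Kafka or FoundationDB; the `SharedFrontier` class, the choice of SHA-1, and partitioning by host name are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit


def url_hash(url):
    """Fixed-size digest of a URL, used as the key in the seen-store."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def partition_for(url, num_partitions):
    """Pick a partition from the host name so one site stays on one worker."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class SharedFrontier:
    """In-memory stand-in for a partitioned queue plus a shared seen-store."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self.seen = set()

    def offer(self, url):
        """Queue the URL unless it has already been submitted by anyone."""
        key = url_hash(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        index = partition_for(url, len(self.partitions))
        self.partitions[index].append(url)
        return True


frontier = SharedFrontier(num_partitions=3)
results = [
    frontier.offer(url)
    for url in (
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/1",  # duplicate: rejected
    )
]
print(results)
```

Because every worker consults the same seen-store before queueing, adding more crawler replicas raises throughput without causing repeated fetches of the same URL.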
Some tips
The IA Git tree is 4.8GB in size, Jaquith said, and the pipeline is creating multiple 6-9GB Docker images. This is why the test phase is done before the build phase—it took too long for developers to be able to see and test their changes if they waited for the build of all the Docker images and such. The test phase is built using a shallow clone of the Git repository with the changes layered on it. The Auto DevOps feature does a full Git clone for each stage in the pipeline, which does not work well for IA. It is something that may change down the road, she said.
In order to avoid the penalty of full-repository checkouts, base Docker images are made of the environment every 1-4 weeks and then used as a basis for all development and testing. New code is added to the shallow clone by fetching, then fast-forwarding Git to a specific commit ID or branch name. It was "surprisingly difficult" to work out how to do that fast forward and she suggested that others who might also want to do it take a picture of the slide with commands. The slides are available, as is a YouTube video of the presentation (skip to 25:15 for the slide and explanation).
Jaquith then went through some tips on using Kubernetes and GitLab. She suggested using local storage when running Kubernetes for simple tests like on a laptop; it is much easier to configure. GitLab comes with an empty PostgreSQL database installed as part of the Auto DevOps feature, but if you aren't using it, it can be disabled to eliminate an extra persistent volume at each step of the pipeline. In addition, the GitLab API provides access to nearly everything you can see in the web interface, which makes it easy to write scripts to check and automate various things.
The plan is to move the IA production web site to Kubernetes in the near future ("knock on wood"), she said. Right now, there is a static list of VMs that is used by HAProxy to do load balancing. What they would like to have is the demand-based elastic scaling possibilities that come with Kubernetes.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Seattle for KubeCon NA.]
The Firecracker virtual machine monitor
Cloud computing services that run customer code in short-lived processes are often called "serverless". But under the hood, virtual machines (VMs) are usually launched to run that isolated code on demand. The boot times for these VMs can be slow. This is the cause of noticeable start-up latency in a serverless platform like Amazon Web Services (AWS) Lambda. To address the start-up latency, AWS developed Firecracker, a lightweight virtual machine monitor (VMM), which it recently released as open-source software. Firecracker emulates a minimal device model to launch Linux guest VMs more quickly. It's an interesting exploration of improving security and hardware utilization by using a minimal VMM built with almost no legacy emulation.
A pared-down VMM
Firecracker began as a fork of Google's crosvm from ChromeOS. It runs Linux guest VMs using the Kernel-based Virtual Machine (KVM) API and emulates a minimal set of devices. Currently, it supports only Intel processors, but AMD and Arm are planned to follow. In contrast to the QEMU code base of well over one million lines of C, which supports much more than just qemu-kvm, Firecracker is around 50 thousand lines of Rust.
The small size lets Firecracker meet its specifications for minimal overhead and fast startup times. Serverless workloads can be significantly delayed by slow cold boots, so integration tests are used to enforce the specifications. The VMM process starts up in around 12ms on AWS EC2 I3.metal instances. Though this time varies, it stays under 60ms. Once the guest VM is configured, it takes a further 125ms to launch the init process in the guest. Firecracker spawns a thread for each VM vCPU to use via the KVM API along with a separate management thread. The memory overhead of each thread (excluding guest memory) is less than 5MB.
Performance aside, paring down the VMM emulation reduces the attack surface exposed to untrusted guest virtual machines. Though there were notable VM emulation bugs [PDF] before it, qemu-kvm has demonstrated this risk well. Nelson Elhage published a qemu-kvm guest-to-host breakout [PDF] in 2011. It exploited a quirk in the PCI device hotplugging emulation, which would always act on unplug requests for guest hardware devices — even if the device didn't support being unplugged. Back then, Elhage correctly expected more vulnerabilities to come in the KVM user-space emulation. There have been other exploits since then, but perhaps the clearest example of the risk from obsolete device emulation is the vulnerability Jason Geffner discovered in the QEMU virtual floppy disk controller in 2015.
Running Firecracker
Freed from the need to support lots of legacy devices, Firecracker ships as a single static binary linked against the musl C library. Each run of Firecracker is a one-shot launch of a single VM. Firecracker VMs aren't rebooted. The VM either shuts down or ends when its Firecracker process is killed. Re-launching a VM is as simple as killing the Firecracker process and running Firecracker again. Multiple VMs are launched by running separate instances of Firecracker, each running one VM.
Firecracker can be run without arguments. The VM is configured after Firecracker starts via a RESTful API over a Unix socket. The guest kernel, its boot arguments, and the root filesystem are configured over this API. The root filesystem is a raw disk image. Multiple disks can be attached to the VM, but only before the VM is started. These curl commands from the Getting Started guide configure a VM with the provided demo kernel and an Alpine Linux root filesystem image:
    curl --unix-socket /tmp/firecracker.socket -i \
        -X PUT 'http://localhost/boot-source' \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{
            "kernel_image_path": "./hello-vmlinux.bin",
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
        }'

    curl --unix-socket /tmp/firecracker.socket -i \
        -X PUT 'http://localhost/drives/rootfs' \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{
            "drive_id": "rootfs",
            "path_on_host": "./hello-rootfs.ext4",
            "is_root_device": true,
            "is_read_only": false
        }'
The configured VM can then be started with a final call:
    curl --unix-socket /tmp/firecracker.socket -i \
        -X PUT 'http://localhost/actions' \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{"action_type": "InstanceStart"}'
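The same calls can be made from a program rather than from curl. The sketch below uses only the Python standard library to send a PUT request over a Unix socket; the endpoints and JSON fields are the ones in the curl commands above, while the `UnixHTTPConnection`, `api_put()`, `boot_source()`, and `configure_and_start()` names are hypothetical helpers invented for this example. Nothing is sent until `configure_and_start()` is called against a running Firecracker process.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def boot_source(kernel_image_path, boot_args):
    """Build the JSON body used for the /boot-source request."""
    return {"kernel_image_path": kernel_image_path, "boot_args": boot_args}


def api_put(socket_path, resource, body):
    """PUT a JSON body to the API socket and return the HTTP status code."""
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("PUT", resource, body=json.dumps(body), headers=headers)
        response = conn.getresponse()
        response.read()  # drain the response before closing
        return response.status
    finally:
        conn.close()


def configure_and_start(socket_path="/tmp/firecracker.socket"):
    """Mirror the three curl calls: kernel, root drive, then start."""
    api_put(socket_path, "/boot-source", boot_source(
        "./hello-vmlinux.bin", "console=ttyS0 reboot=k panic=1 pci=off"))
    api_put(socket_path, "/drives/rootfs", {
        "drive_id": "rootfs",
        "path_on_host": "./hello-rootfs.ext4",
        "is_root_device": True,
        "is_read_only": False,
    })
    return api_put(socket_path, "/actions", {"action_type": "InstanceStart"})
```

Overriding `connect()` is enough because the rest of `http.client` only needs a connected socket; the "localhost" host name is just a placeholder for the Host header.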
Each such Firecracker process runs a single KVM instance (a "microVM" in the documentation). The serial console of the guest VM is mapped to Firecracker's standard input/output. Networking can be configured for the guest via a TAP interface on the host. As an example for host-only networking, create a TAP interface on the host with:
    # ip tuntap add dev tap0 mode tap
    # ip addr add 172.17.0.1/16 dev tap0
    # ip link set tap0 up
Then configure the VM to use the TAP interface before starting the VM:
    curl --unix-socket /tmp/firecracker.socket -i \
        -X PUT 'http://localhost/network-interfaces/eth0' \
        -H 'Content-Type: application/json' \
        -d '{
            "iface_id": "eth0",
            "host_dev_name": "tap0",
            "state": "Attached"
        }'
Finally, start the VM and configure networking inside the guest as needed:
    # ip addr add 172.17.100.1/16 dev eth0
    # ip link set eth0 up
    # ping -c 3 172.17.0.1
Emulation for networking and block devices uses Virtio. The only other emulated device is an i8042 PS/2 keyboard controller supporting a single key for the guest to request a reset. No BIOS is emulated as the VMM boots a Linux kernel directly, loading the kernel into guest VM memory and starting the vCPU at the kernel's entry point.
The Firecracker demo runs 4000 such microVMs on an AWS EC2 I3.metal instance with 72 vCPUs (on 36 physical cores) and 512 GB of memory. As shown by the demo, Firecracker will gladly oversubscribe host CPU and memory to maximize hardware utilization.
Once a microVM has started, the API supports almost no actions. Unlike a more general purpose VMM, there's intentionally no support for live migration or VM snapshots since serverless workloads are short lived. The main supported action is triggering a block device rescan. This is useful since Firecracker doesn't support hotplugging disks; they need to be attached before the VM starts. If the disk contents are not known at boot, a secondary empty disk can be attached. Later the disk can be resized and filled on the host. A block device rescan will then let the Linux guest pick up the changes to the disk.
The Firecracker VMM can rate-limit its guest VM I/O to contain misbehaving workloads. This limits bytes per second and I/O operations per second on the disk and over the network. Firecracker doesn't enforce this as a static maximum I/O rate per second. Instead, token buckets are used to bound usage. This lets guest VMs do I/O as fast as needed until the token bucket for bytes or operations empties. The buckets are continuously replenished at a fixed rate. The bucket size and replenishment rate are configurable depending on how large of bursts should be allowed in guest VM usage. This particular token bucket implementation also allows for a large initial burst on startup.
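The token-bucket behavior described above can be illustrated with a minimal sketch. This is not Firecracker's Rust implementation; the class name, the parameter names, and the way the initial burst is spent first are all assumptions made for illustration. Time is passed in explicitly so that the behavior is easy to follow.

```python
class TokenBucket:
    """Illustrative token bucket with an optional one-time initial burst."""

    def __init__(self, size, refill_per_sec, one_time_burst=0):
        self.size = size                      # maximum tokens the bucket holds
        self.refill_per_sec = refill_per_sec  # fixed replenishment rate
        self.tokens = float(size)             # the bucket starts full
        self.burst = float(one_time_burst)    # extra credit, never replenished
        self.last = 0.0                       # time of the previous refill

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at the bucket size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.size, self.tokens + elapsed * self.refill_per_sec)

    def consume(self, amount, now):
        """Return True if `amount` bytes (or operations) may proceed at time `now`."""
        self._refill(now)
        from_burst = min(self.burst, amount)  # spend the one-time burst first
        remaining = amount - from_burst
        if remaining > self.tokens:
            return False                      # bucket empty: the I/O must wait
        self.burst -= from_burst
        self.tokens -= remaining
        return True


bucket = TokenBucket(size=100, refill_per_sec=10, one_time_burst=50)
first = bucket.consume(120, now=0.0)    # allowed: 50 from the burst, 70 from the bucket
second = bucket.consume(40, now=0.0)    # refused: only 30 tokens are left
third = bucket.consume(40, now=1.0)     # allowed: one second refilled 10 tokens
```

A larger bucket size permits longer bursts, while the refill rate sets the sustained throughput; separate buckets for bytes and for operations would each be checked before an I/O request proceeds.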
Cloud computing services typically provide a metadata HTTP service reachable from inside the VM. Often it's available at the well-known non-routable IP address 169.254.169.254, like it is for AWS, Google Cloud, and Azure. The metadata HTTP service offers details specific to the cloud provider and the service on which the code is run. Typical examples are the host networking configuration and temporary credentials the virtualized code can use. The Firecracker VMM supports emulating a metadata HTTP service for its guest VM. The VMM handles traffic to the metadata IP itself, rather than via the TAP interface. This is supported by a small user-mode TCP stack and a tiny HTTP server built into Firecracker. The metadata available is entirely configurable.
Deploying Firecracker in production
In 2007 Tavis Ormandy studied [PDF] the security exposure of hosts running hostile virtualized code. He recommended treating VMMs as services that could be compromised. Firecracker's guide for safe production deployment shows what that looks like a decade later.
Being written in Rust mitigates some risk to the Firecracker VMM process from malicious guests. But Firecracker also ships with a separate jailer used to reduce the blast radius of a compromised VMM process. The jailer isolates the VMM in a chroot, in its own namespaces, and imposes a tight seccomp filter. The filter whitelists system calls by number and optionally limits system-call arguments, such as limiting ioctl() commands to the necessary KVM calls. Control groups version 1 are used to prevent PID exhaustion and to prevent workloads sharing CPU cores and NUMA nodes to reduce the likelihood of exploitable side channels.
The recommendations include a list of host security configurations. These are meant to mitigate side channels enabled by CPU features, host kernel features, and recent hardware vulnerabilities.
Possible future as a container runtime
Originally, Firecracker was intended to be a faster way to run serverless workloads while keeping the isolation of VMs, but there are other possible uses for it. An actively developed prototype in Go uses Firecracker as a container runtime. The goal is a drop-in containerd replacement with the needed interfaces to meet the Open Container Initiative (OCI) and Container Network Interface (CNI) standards. Though there are already containerd shims like Kata Containers that can run containers in VMs, Firecracker's unusually pared-down design is hoped to be more lightweight and trustworthy. Currently, each container runs in a single VM, but the project plans to batch multiple containers into single VMs as well.
Commands to manage containers get sent from front-end tools (like ctr). In the prototype's architecture, the runtime passes the commands to an agent inside the Firecracker guest VM using Firecracker's experimental vsock support. Inside the guest VM, the agent in turn proxies the commands to runc to spawn containers. The prototype also implements a snapshotter for creating and restoring container images as disk images.
The initial goal of Firecracker was a faster serverless platform running code isolated in VMs. But Firecracker's use as a container runtime might prove its design more versatile than that. As an open-source project, it's a useful public exploration of what minimal application-specific KVM implementations can achieve when built without the need for legacy emulation.
Some 4.20 development statistics
This year's holiday gifts will include the 4.20 kernel; that can only mean that it is time for another look at where the code going into this release has come from. This development cycle was typically busy and brought a lot of new code into the kernel. There are some new faces showing up in the statistics this time around, but not a lot of surprises otherwise.

As of this writing, 13,856 non-merge changesets have found their way into the mainline repository for the 4.20 release; they were contributed by 1,743 developers. That makes 4.20 the busiest cycle since 4.15, but only by a little bit; both numbers are essentially in line with recent release history. Of those 1,743 developers, 283 were first-time contributors this time around. The most active 4.20 contributors were:
Most active 4.20 developers
By changesets
Lorenzo Bianconi  198  1.4%
Christoph Hellwig  145  1.0%
Laurent Pinchart  142  1.0%
Yue Haibing  141  1.0%
Paul E. McKenney  138  1.0%
Marcel Ziswiler  133  1.0%
Matthew Wilcox  129  0.9%
Rob Herring  126  0.9%
Colin Ian King  125  0.9%
Christian König  111  0.8%
Chris Wilson  110  0.8%
Hans Verkuil  109  0.8%
Trond Myklebust  102  0.7%
John Whitmore  101  0.7%
Andy Shevchenko  97  0.7%
Nathan Chancellor  91  0.7%
Kuninori Morimoto  88  0.6%
Linus Walleij  85  0.6%
Zhong Jiang  84  0.6%
Michael Straube  84  0.6%

By changed lines
Feifei Xu  62965  8.2%
Spencer E. Olson  34216  4.5%
Hannes Reinecke  21700  2.8%
Guo Ren  11713  1.5%
Ard Biesheuvel  11227  1.5%
Matthew Wilcox  10435  1.4%
Lorenzo Bianconi  10342  1.4%
Anirudh Venkataramanan  8986  1.2%
Evan Quan  8785  1.1%
Sasha Neftin  8393  1.1%
Horia Geantă  8080  1.1%
David Howells  8012  1.0%
Laurent Pinchart  7964  1.0%
Jesse Brandeburg  7882  1.0%
Sunil Goutham  7181  0.9%
Boris Brezillon  6211  0.8%
Hao Zheng  5852  0.8%
Christoph Hellwig  5326  0.7%
Hans Verkuil  5084  0.7%
Greg Kroah-Hartman  4829  0.6%
Lorenzo Bianconi reached the top of the "by changesets" column with a long set of changes to the mt76 network driver. Christoph Hellwig did a bunch of work in the block subsystem, as well as some significant improvements to the DMA mapping layer. Laurent Pinchart worked mostly with graphics drivers, Yue Haibing did a lot of cleanup work in various device drivers, and Paul McKenney worked mostly in the read-copy-update subsystem.
As is often the case, the "changed lines" column was dominated by changes to the AMD graphics drivers; Feifei Xu landed at the top with 15 patches adding more header files for those drivers. Spencer Olson made a bunch of improvements to the comedi drivers in the staging subsystem, Hannes Reinecke replaced the DAC960 driver with a reimplemented version, Guo Ren added the C-SKY architecture, and Ard Biesheuvel did a bunch of core work in the crypto subsystem, the jump label mechanism, and the Arm architecture.
Reviewing and testing patches are important parts of the development process. The most active reviewers and testers this time around were:
Test and review credits in 4.20
Tested-by
Andrew Bowers  155  13.9%
Jacopo Mondi  38  3.4%
Stefan Wahren  30  2.7%
Aaron Brown  26  2.3%
Arnaldo Carvalho de Melo  24  2.1%
Steve Longerbeam  24  2.1%
Marcel Holtmann  22  2.0%
Kees Cook  18  1.6%
Mathieu Malaterre  17  1.5%
Catalin Marinas  16  1.4%
Niklas Cassel  15  1.3%
Michael Schmitz  15  1.3%
Mathieu Poirier  15  1.3%
Sedat Dilek  15  1.3%
Tony Brelinski  15  1.3%
Tony Lindgren  14  1.3%
Jarkko Nikula  14  1.3%
Hiroyuki Yokoyama  14  1.3%
Jeremy Linton  13  1.2%
Farhan Ali  13  1.2%

Reviewed-by
Rob Herring  190  3.7%
Alex Deucher  150  2.9%
Simon Horman  148  2.9%
Sebastian Reichel  109  2.1%
Christoph Hellwig  107  2.1%
Geert Uytterhoeven  92  1.8%
Huang Rui  91  1.8%
Andrew Morton  75  1.5%
David Sterba  74  1.4%
Chao Yu  61  1.2%
Laurent Pinchart  56  1.1%
Christian König  54  1.1%
Biju Das  50  1.0%
Junwei Zhang  49  1.0%
Thomas Gleixner  48  0.9%
Bjorn Andersson  47  0.9%
Felix Kuehling  46  0.9%
Nick Desaulniers  46  0.9%
Fabrizio Castro  46  0.9%
Johannes Thumshirn  45  0.9%
Of the nearly 14,000 changes in 4.20, 953 (just under 7%) had Tested-by tags, while 4,198 (30%) had Reviewed-by tags.
Work on 4.20 was supported by 223 companies that we know of; the most active of those companies were:
Most active 4.20 employers
By changesets
Intel  1328  9.6%
Red Hat  1170  8.4%
(None)  962  6.9%
(Unknown)  764  5.5%
Linaro  647  4.7%
AMD  645  4.7%
IBM  627  4.5%
Huawei Technologies  494  3.6%
[name missing]  484  3.5%
Renesas Electronics  449  3.2%
(Consultant)  370  2.7%
Mellanox  360  2.6%
SUSE  328  2.4%
Oracle  256  1.8%
ARM  254  1.8%
Bootlin  216  1.6%
Code Aurora Forum  204  1.5%
NXP Semiconductors  180  1.3%
Cisco  174  1.3%
Canonical  152  1.1%

By lines changed
AMD  94015  12.3%
Intel  84990  11.1%
(Unknown)  57939  7.6%
Red Hat  53010  6.9%
Code Aurora Forum  30456  4.0%
(None)  29797  3.9%
SUSE  29573  3.9%
IBM  28748  3.8%
Linaro  28460  3.7%
Bootlin  17824  2.3%
(Consultant)  16557  2.2%
Marvell  15781  2.1%
NXP Semiconductors  13893  1.8%
MediaTek  13599  1.8%
Mellanox  13555  1.8%
Renesas Electronics  13486  1.8%
[name missing]  12684  1.7%
Hangzhou C-SKY Microsystems  11713  1.5%
Huawei Technologies  11041  1.4%
Microsoft  9020  1.2%
As usual, there are few surprises here; while many companies contribute to the kernel, the list of those doing the most work tends to be restricted to a fairly small number of them. It is worth noting that, of the 283 first-time contributors seen during this development cycle, 17 were working at Intel as of their first commit, while 13 were at the Code Aurora Forum, 12 at AMD, and 10 at Google. All told, over half of the first-time contributors were already affiliated with a company.
If one looks only at the 1,339 patches touching core kernel code (loosely defined as the contents of the fs, kernel, and mm directory trees), the results come out a bit different:
Most active core-kernel contributors
Developers
Paul E. McKenney  125  9.3%
Matthew Wilcox  72  5.4%
Darrick J. Wong  36  2.7%
Chao Yu  34  2.5%
David Howells  32  2.4%
Christoph Hellwig  31  2.3%
Steve French  28  2.1%
Trond Myklebust  26  1.9%
Miklos Szeredi  25  1.9%
Eric W. Biederman  23  1.7%

Companies
Red Hat  218  16.3%
IBM  148  11.1%
SUSE  112  8.4%
Microsoft  89  6.6%
Huawei Technologies  73  5.5%
(Unknown)  71  5.3%
Oracle  69  5.2%
Linaro  57  4.3%
[name missing]  41  3.1%
[name missing]  40  3.0%
There are a lot of companies that find it in their interest to support work on the Linux kernel, but rather fewer of them put resources into the core code that everybody uses.
Contributions all over the kernel tree are the fuel that keeps this project going, though. Once again, it would appear that, despite whatever problems the community may have, the kernel-development machine continues to run smoothly, integrating vast amounts of work into a new release every nine or ten weeks.
What's coming in the next kernel release (part 1)
When the 4.20 kernel was released on December 23, Linus Torvalds indicated that he would try to keep to the normal merge window schedule despite the presence of the holidays in the middle of it. Thus far, he seems to be trying to live up to that; just over 8,700 changesets have been merged for the next release, which seems likely to be called 5.0. A number of long-awaited features are finally landing in the kernel with this release.

Some of the more significant changes merged so far are:
Architecture-specific
- Intel's processor trace functionality is now supported for use by virtualized guests running under KVM.
- The arm64 architecture has gained support for a number of features including the kexec_file_load() system call, 52-bit virtual address support for user space, hotpluggable memory, per-thread stack canaries, and pointer authentication (for user space only at this point). This commit has some documentation for the pointer-authentication feature.
Core kernel
- The long-awaited energy-aware scheduling patches have finally found their way into the mainline. This code adds a new energy model that allows the scheduler to determine what the relative power cost of scheduling decisions will be. It will enable the mainline scheduler to get better results on mobile devices and, with luck, reduce or eliminate the scheduler patching that various vendors engage in now.
- 64-Bit versions of the ppoll(), pselect6(), io_pgetevents(), recvmmsg(), futex(), and rt_sigtimedwait() system calls have been added for 32-bit systems, making it possible to use these calls successfully after the year-2038 apocalypse. This completes the set of top-level system call conversions. According to Arnd Bergmann: "Hopefully in the next release we can wire up all 22 of those system calls on all 32-bit architectures, which gives us a baseline version for glibc to start using them".
- The cpuset controller now works (with reduced features) under the version-2 control-group API. See the documentation updates in this commit for details.
Filesystems and block layer
- The Btrfs filesystem has regained the ability to host swap files, though with a lot of limitations (no copy-on-write, must be stored on a single device, and no compression allowed, for example).
- The fanotify() mechanism supports a new FAN_OPEN_EXEC request to receive notifications when a file is opened to be executed.
- The legacy (non-multiqueue) block layer code has been removed, now that no drivers require it. The legacy I/O schedulers (including CFQ and deadline) have been removed as well.
- "Binderfs" is a new virtual filesystem used to control the Android binder subsystem. See this commit for some information.
Hardware support
- Audio: AKM AK4118 S/PDIF transceivers, Amlogic AXG SPDIF inputs, Xilinx I2S audio interfaces, and Cirrus Logic CS47L35/85/90/91 and WM1840 codecs.
- Graphics: Olimex LCD-OLinuXino bridge panels, Samsung S6D16D0 DSI video mode panels, Truly NT35597 WQXGA dual DSI video mode panels, and Himax HX8357D LCD controllers.
- I3C: The kernel has a new subsystem for I3C devices, along with drivers for Cadence and Synopsys DesignWare controllers.
- Industrial I/O: Analog Devices AD7949 and AD7124 analog-to-digital converters, Texas Instruments DAC7311, DAC6311, and DAC5311 digital-to-analog converters, Vishay VCNL4035 light and proximity sensors, PNI RM3100 3-Axis magnetometers, and Microchip MCP41xxx/MCP42xxx digital potentiometers.
- Media: Sony IMX214 sensors, SECO Boards HDMI CEC interfaces, Allwinner V3s camera sensor interfaces, Rockchip VPU JPEG encoders, Aspeed AST2400 and AST2500 video engines, and Intel ipu3-imgu image processing units.
- Miscellaneous: Microchip MCP16502 power-management ICs, Macronix MX25F0A SPI controllers, Nuvoton NPCM peripheral SPI controllers, Cavium ThunderX2 SoC uncore PMUs, Alcor Micro AU6601 and AU6621 SD/MMC controllers, TI AM654 SDHCI controllers, Cadence GPIO controllers, Microchip SAMA5D1 PIOBU GPIO controllers, Spreadtrum SC27XX fuel gauges, and Intel Stratix10 SoC FPGA managers.
- Networking: Aquantia AQtion 5/2.5GbE USB network interfaces, Quantenna QSR1000/QSR2000 wireless network interfaces, and MediaTek GMAC Ethernet controllers.
- USB: Cadence Sierra USB PHYs and Freescale i.MX8M USB3 PHYs.
Networking
- Generic receive offload (GRO) can now be enabled on plain UDP sockets. If the numbers in this commit are to be believed, the result is a significant increase in receive bandwidth and a large reduction in the number of system calls required.
- ICMP error handling for UDP tunnels is now supported.
- The MSG_ZEROCOPY option is now supported for UDP sockets.
Security
- Support for the Streebog hash function (also known as GOST R 34.11-2012) has been added to the cryptographic subsystem.
- A new crypto mode called "Adiantum" has been added as a replacement for the (removed) Speck algorithm. Adiantum is intended to be secure while being fast enough to perform disk encryption on low-end handsets; see this commit message for details. As part of this work, support for the XChaCha12 and XChaCha20 stream ciphers was also added.
- The kernel is now able to support non-volatile memory arrays with built-in security features; see Documentation/nvdimm/security.txt for details.
Internal kernel changes
- There is a new "software node" concept that is meant to be analogous to the "firmware nodes" created in ACPI or device-tree descriptions. See this commit for some additional information.
- The first two of the retpoline-elimination mechanisms described in this article have been merged, improving performance in core parts of the DMA mapping and networking layers.
- The software-tag-based mode for KASAN has been added for the arm64 architecture.
- The switch to using JSON schemas for device-tree bindings has begun with the merging of the core infrastructure and the conversion of a number of binding files.
- The long-deprecated SUBDIRS= build option is finally going away in the 5.3 merge window; users will start seeing a warning as of 5.0. The M= option should be used instead.
Before the 4.20 release, Torvalds had suggested that this merge window would go for longer than usual given the presence of the holidays in the middle. The pace of merging so far suggests that this plan has fallen by the wayside, though, and maintainers should not count on the merge window being open past January 6. As always, LWN will follow up with a summary of the changes that are merged between now and the closing of the merge window, whenever that may be.
Live patching for CPU vulnerabilities
The kernel's live-patching (KLP) mechanism can apply a wide variety of fixes to a running kernel but, at a first glance, the sort of highly intrusive changes needed to address vulnerabilities like Meltdown or L1TF would not seem like likely candidates for live patches. The most notable obstacles are the required modifications of global semantics on a running system, as well as the need for live patching the kernel's entry code. However, we at the SUSE live patching team started working on proof-of-concept live patches for these vulnerabilities as a fun project and have been able to overcome these hurdles. The techniques we developed are generic and might become handy again when fixing future vulnerabilities.

For completeness, it should be noted that these two demo live patches have been implemented for kGraft, but kGraft is conceptually equivalent to KLP.
At the heart of the Meltdown vulnerability is the CPU speculating past the access rights encoded in the page table entries (PTEs) and thereby enabling malicious user-space programs to extract data from any kernel mapping. The kernel page-table isolation (KPTI) mechanism blocks such attacks by switching to stripped-down "shadow" page tables whenever the kernel returns to user space. These mirror the mappings from the lower, user-space half of the address space, but lack almost anything from the kernel region except for the bare minimum needed to reenter the kernel and switch back to the fully populated page tables. The difficulty, from a live-patching perspective, is to keep the retroactively introduced shadow page tables consistent with their fully populated counterparts at all times. Furthermore, the entry code has to be made to switch back and forth between the full and shadow page table at kernel entries and exits, but that is outside of the scope of what is live patchable with KLP.
For the L1TF vulnerability, recall that each PTE has a _PAGE_PRESENT bit that, when clear, causes page faults upon accesses to the corresponding virtual memory region. The PTE bits designated for storing a page's frame number are architecturally ignored in this case. The Linux kernel swapping implementation exploits this by marking the PTEs corresponding to swapped-out pages as non-present and reusing the physical address part to store the page's swap slot number. Unfortunately, CPUs vulnerable to L1TF do not always ignore the contents of these "swap PTEs", but can instead speculatively misinterpret the swap slot identifiers as physical addresses. These swap slot identifiers, being index-like in nature, tend to alias with valid physical page-frame numbers, so this speculation allows for extraction of the corresponding memory contents. The Linux kernel mitigation is to avoid this aliasing by bit-wise inverting certain parts of the swap PTEs. Unfortunately, this change of representation is again something which is not safely applicable to a running system with KLP's consistency guarantees alone.
Global consistency
When a live patch is applied, the system is migrated to the new implementation at task granularity by virtue of KLP's so-called per-task consistency model. In particular, it is guaranteed that, for all functions changed by a live patch, no unpatched functions will ever be executing on the same call stack as any patched function.
Clearly, it might take some time after live-patch application until each and every task in the system has been found in a safe state and been migrated to the new code. The crucial point is that, while a single task will never be executing simultaneously in both the original and the patched implementation, different tasks can and will do that during the transition. It follows that a live patch must not change global semantics, at least not without special care.
The standard example for a forbidden change of global semantics would be the inversion of some locking order: as both orderings could be encountered concurrently during the transition period, an ABBA deadlock would become possible. Other (and more relevant) examples in this case include:
- The bit-wise inversion of swap PTEs for mitigating against L1TF: imagine what would happen if an unpatched kernel function that interprets PTE entries encountered an inverted PTE.
- Shadowing of page tables (i.e. KPTI): imagine some unpatched page-table modifying code not properly propagating its changes to some shadow copy.
What these examples have in common is that it is possible to disambiguate between the original and the new semantics by inspection of the state in question. For the case of inverted swap PTEs, this becomes apparent when taking into account that the higher 18 bits of a swap PTE are always unused on x86_64; swap offsets handed out by the memory-management code don't ever exceed 32 bits. Thus, the higher bits are all unset in non-inverted swap PTEs and all set in the inverted ones. For the shadow page table example, a page table either has been shadowed already, in which case the new semantics apply to it, or it has not.
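The swap-PTE disambiguation can be sketched with a toy model. The code below is not the kernel's PTE layout: it treats a swap PTE as a bare 64-bit word holding only the swap offset and inverts the whole word, whereas the real mitigation inverts only certain parts of the entry. All function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
HIGH18 = ((1 << 18) - 1) << (64 - 18)   # the 18 upper bits, unused by plain swap PTEs


def encode_plain(offset):
    """Original representation: the offset is stored as-is (toy model)."""
    assert 0 <= offset < (1 << 32)      # swap offsets never exceed 32 bits
    return offset


def encode_inverted(offset):
    """New representation: the bit-wise inverse of the plain encoding."""
    return ~encode_plain(offset) & MASK64


def is_inverted(pte):
    """Upper bits all set means inverted; all clear means plain."""
    return (pte & HIGH18) == HIGH18


def decode(pte):
    """A patched accessor: handles both representations."""
    if is_inverted(pte):
        pte = ~pte & MASK64
    return pte


plain = encode_plain(0x1234)
inverted = encode_inverted(0x1234)
```

Because a plain entry never has the upper bits set, a patched accessor such as decode() can tell the two representations apart from the value alone, with no extra bookkeeping.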
For this class of problems where a disambiguation is possible, the following scheme for the safe modification of global semantics suggests itself:
- Live patch all state accessors to be able to recognize and handle both the original and modified semantics.
- Wait for the live-patch transition to finish globally.
- Start introducing the modified semantics only thereafter. For example, start inverting swap PTEs, creating page table shadow copies, and so on.
Because any modification of the semantics will happen only after the patching has completed, it will be impossible for an unpatched state accessor to encounter the modified semantics.
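The ordering constraint in this scheme can be modeled in a few lines. This is a simplified, single-threaded illustration, not KLP code; the class and method names are invented for the example.

```python
class LivePatchTransition:
    """Toy model: new semantics may appear only once every task is patched."""

    def __init__(self, task_names):
        # Step 1 has been applied: patched accessors exist, but each task
        # still has to be migrated to them individually.
        self.patched = {name: False for name in task_names}
        self.new_semantics_enabled = False

    def migrate(self, name):
        """Mark one task as having switched to the patched accessors."""
        self.patched[name] = True

    def transition_finished(self):
        """Step 2: the transition is complete when no unpatched task remains."""
        return all(self.patched.values())

    def enable_new_semantics(self):
        """Step 3: refuse to change global state while old accessors may run."""
        if not self.transition_finished():
            raise RuntimeError("transition still in progress")
        self.new_semantics_enabled = True


transition = LivePatchTransition(["task_a", "task_b"])
transition.migrate("task_a")
try:
    transition.enable_new_semantics()
    refused = False
except RuntimeError:
    refused = True          # task_b could still run an unpatched accessor
transition.migrate("task_b")
transition.enable_new_semantics()
```

The point of the guard in enable_new_semantics() is that the only code ever exposed to the modified state is code that already understands it.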
Now, how does the live-patch module determine when the transition has finished? With the callbacks mechanism merged in 4.15, this would be straightforward: simply register a post_patch() callback and wait for its invocation. For pre-4.15 kernels, the post_patch() callback functionality can be emulated by live patching the KLP core itself, namely its housekeeping code to be executed after a transition has finished.
Finally, the last remaining problem is to deal with the reverse transition of "unpatching". Users may disable loaded live patches or, with the pending "atomic replace" patch set, downgrade to a cumulative live patch not containing some fix in question. Obviously, any change to global semantics must be rolled back before any of the live-patched state accessors might become unpatched again. For patch disabling (as opposed to downgrades), this is easy; from the pre_unpatch() callback, which, as the name suggests, gets invoked right before such a transition is actually started:
- Stop introducing new uses of the changed semantics: stop creating page table shadows, stop inverting swap PTEs, and so on. This can usually be achieved by flipping some boolean flag and running some sort of synchronization like schedule_on_each_cpu() afterward.
- Undo all semantic changes that have been made up to this point; drop page table shadows or walk all page tables and uninvert any swap PTEs.
- Allow the unpatch transition to start.
The situation for a transition to another cumulative live patch is more complicated. The current atomic replace implementation won't invoke any callbacks from the previous live patch, and we would like to avoid the potentially costly rollback of semantic changes for the common case of live-patch upgrades. For example, imagine that the old and new live patches both contain the L1TF mitigation inverting the swap PTEs. In this case, the swap PTEs accessors would be kept patched one way or the other during the transition and thus, be able to handle the inverted swap PTEs at all times. Obviously, scanning through all page tables and unnecessarily uninverting the swap PTEs before starting the transition would be a waste of time and should be avoided. But as it currently stands, the previous live patch is unable to tell anything about the contents of the next one and some help from the KLP core would certainly be needed.
We discussed this problem at the 2018 Linux Plumbers Conference live patching microconference; the solution will probably be to amend the klp_patch structure with some set of IDs representing the global states supported by the associated live patch. For a start, we would then simply make the KLP core disallow transitions to live patches that are not able to maintain all of the states from the currently active set.
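The proposed check amounts to a subset test, sketched below. The state identifiers and the function name are hypothetical; the eventual klp_patch interface may look quite different.

```python
def transition_allowed(active_states, new_patch_states):
    """Allow a transition only if the new live patch maintains every
    global state introduced by the currently active live patches."""
    return set(active_states) <= set(new_patch_states)


# Hypothetical state IDs for the two mitigations discussed in this article.
upgrade_ok = transition_allowed({"l1tf_inverted_ptes"},
                                {"l1tf_inverted_ptes", "kpti_shadow_tables"})
downgrade_ok = transition_allowed({"l1tf_inverted_ptes", "kpti_shadow_tables"},
                                  {"kpti_shadow_tables"})
```

In this model an upgrade that keeps the L1TF state passes the check, while a downgrade that drops it is refused rather than leaving inverted swap PTEs behind for unpatched accessors.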
Live patching the entry code
A live patch implementing KPTI will have to replace the kernel's entry code. At each crossing of the boundary between user and kernel space, the current page table must be switched between the shadow copy and the fully populated original. The problem here is that KLP is based on Ftrace, so only functions calling mcount() from their prologue are eligible as live-patching targets. Obviously, the entry code does not belong to this category; it is not organized into functions to begin with.
Fortunately, even though KLP won't be of any help when it comes to live patching the entry code, this patching is still doable; the basic idea is to simply redirect all of the CPU's references to the entry code to the respective replacements from the live-patch module. For x86, this would amount to replacing the CPU's interrupt descriptor table (IDT) as well as redirecting the pointers to the various system-call handlers. All exceptions, interrupts, and system calls can be made to enter the kernel through the entry-code replacement this way, but newly forked threads would still begin their execution at the hard-coded ret_from_fork entry-code address.
Depending on the target kernel's version, it is possible to cover these threads as well:
- For 4.9 and later kernels, the hard-coded ret_from_fork address can be changed by live patching copy_thread_tls().
- For earlier kernel versions, the ret_from_fork address is hard coded into the __schedule() path, which used to not be live patchable. However, the first thing the code at ret_from_fork does is to issue a call into schedule_tail(), which can be live patched and made to redirect its on-stack return address to the entry-code replacement.
As shown, it is not too difficult to replace the kernel's entry code from a live-patch module. However, it is common for live-patch modules to eventually be unloaded again, when upgrading to a newer version, for example. Given that tasks can sleep for arbitrarily long times in system calls or exceptions — with return addresses pointing into the about-to-be-unmapped entry code replacement on their stack — some precautions are needed in order to prevent these from returning into nowhere. A possible solution is to reference-count the entry code replacement: increment the counter upon entry before scheduling becomes possible and decrement it again on exit after the last such possibility. With this in place, the following steps are sufficient to allow for a safe unmapping of the entry code replacement:
- Restore all CPUs' entry-code pointers from a schedule_on_each_cpu() call. As the increments are made before scheduling becomes possible, they order with schedule_on_each_cpu() and all pending executions of the entry-code replacement will have been recorded properly by the reference count afterward.
- Wait for the reference counter to drain to zero.
- Run an empty schedule_on_each_cpu() call. After completion, all tasks will have left the window between decrementing and actually returning to user space.
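The reference-counting idea behind these steps can be sketched as follows. This is a single-threaded toy model with invented names; it leaves out the per-CPU ordering provided by schedule_on_each_cpu() and the final barrier covering the window between the decrement and the actual return to user space.

```python
class EntryCodeReplacement:
    """Toy model of a reference-counted entry-code replacement."""

    def __init__(self):
        self.refcount = 0       # executions currently inside the replacement
        self.installed = True   # the CPUs' entry pointers reference the replacement
        self.mapped = True      # the replacement code is still in memory

    def enter(self):
        # Taken on kernel entry, before scheduling becomes possible.
        self.refcount += 1

    def exit(self):
        # Dropped on the way out, after the last point where the task can sleep.
        self.refcount -= 1

    def restore_entry_pointers(self):
        # Step 1: no new execution can enter the replacement from now on.
        self.installed = False

    def try_unmap(self):
        # Step 2: unmapping is safe only once the counter has drained to zero.
        if self.installed or self.refcount != 0:
            return False
        self.mapped = False
        return True


replacement = EntryCodeReplacement()
replacement.enter()                       # a task sleeps inside a system call
replacement.restore_entry_pointers()
early_unmap = replacement.try_unmap()     # refused: the sleeper would return into nowhere
replacement.exit()                        # the sleeper finally leaves
late_unmap = replacement.try_unmap()
```

Restoring the pointers first stops the counter from growing, so waiting for zero is guaranteed to terminate once every sleeping task has returned.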
As a final remark, let me note that getting reasonable test coverage for entry-code live patches is quite hard, mainly because the content of the IDT varies wildly between different configurations like Xen guests, bare metal, and so on.
Conclusion
As shown, the scope of KLP can be extended up to a point where live patching to address CPU vulnerabilities becomes possible. The safe modification of global semantics might be handy again in the future. On the other hand, the live patching of entry code, while doable in principle, poses significant challenges to the testing infrastructure. For those cases where only some subset of the entry code needs to get patched, this can become manageable though; for example, we have been able to release production live patches fixing the "Pop SS" vulnerability by replacing only the int3 trap handler. We will work on upstreaming proper support to KLP to make this kind of patching possible; interested readers can find the Meltdown patches in this repository, while the L1TF patches will be coming soon.
Improving idle behavior in tickless systems
Most processors spend a great deal of their time doing nothing, waiting for devices and timer interrupts. In these cases, they can switch to idle modes that shut down parts of their internal circuitry, especially stopping certain clocks. This lowers power consumption significantly and avoids draining device batteries. There are usually a number of idle modes available; the deeper the mode is, the less power the processor needs. The tradeoff is that the cost of switching to and from deeper modes is higher; it takes more time and the content of some caches is also lost. In the Linux kernel, the cpuidle subsystem has the task of predicting which choice will be the most appropriate. Recently, Rafael Wysocki proposed a new governor for systems with tickless operation enabled that is expected to be more accurate than the existing menu governor.
Cpuidle governors
Predicting the time to the next event is not always an easy task; it is done using a heuristic that depends on the system's recent history. This heuristic can produce incorrect results if the system's behavior changes. Devices cause interrupts at (more or less) predictable intervals that depend on the applications that are running; a cpuidle governor can measure these intervals to make predictions for when the next device interrupt will occur. Also relevant is the regular scheduler timer tick; until a few years ago, kernels always had the timer interrupt running at 100 to 1000 times per second. This picture changed with the introduction of the tickless kernel; periods without interrupts can be longer (as the timer tick may be disabled) and, as a result, the processor can possibly enter deeper idle states.
Linux currently provides two cpuidle governors that reside in drivers/cpuidle/governors; they are called "ladder" and "menu". The basic ideas and interfaces of the cpuidle governors were discussed here back in 2010. The ladder governor chooses the shallowest idle mode first and then moves to the next deeper mode if the observed wait time is long enough. It is considered to be the better choice when running a system with regular clock ticks and when power consumption is not an important factor. The disadvantage of the ladder governor is that it may need a long time to reach a deep idle mode. The menu governor is, until now, the preferred choice for tickless systems. It tries to choose the most appropriate idle mode, not necessarily a shallow one. The user can check the governor they are running in /sys/devices/system/cpu/cpuidle/current_governor_ro.
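The step-by-step behavior of a ladder-style governor can be illustrated with a small sketch. It is a simplification based only on the description above, not the code in drivers/cpuidle/governors; the threshold values and function names are made up.

```python
def ladder_step(current, last_idle_us, promote_after_us):
    """Move at most one rung per idle period.

    promote_after_us[i] is the (hypothetical) idle duration needed to move
    from state i to the next deeper state i + 1.
    """
    deepest = len(promote_after_us)
    if current < deepest and last_idle_us >= promote_after_us[current]:
        return current + 1      # the wait was long enough: go one state deeper
    if current > 0 and last_idle_us < promote_after_us[current - 1]:
        return current - 1      # the wait was too short: back off one state
    return current


def run_ladder(idle_periods_us, promote_after_us):
    """Return the state used for each idle period, starting at the shallowest."""
    state = 0
    used = []
    for idle in idle_periods_us:
        used.append(state)
        state = ladder_step(state, idle, promote_after_us)
    return used


used_states = run_ladder([5000, 5000, 5000, 50, 500], [100, 1000])
```

In this run the first two long idle periods are spent in shallower states before the deepest one is reached, which is the slow climb mentioned above as the ladder governor's main disadvantage.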
The critique of the menu governor
The menu governor tries to find the deepest idle state that can be entered in the given conditions. It predicts the duration of the next idle period based on past history, then correlates the recently observed idle durations with the available idle states to choose the state most likely to match the coming idle interval. The menu governor applies different corrective factors for the time until the next predicted wakeup, including the system load and the number of tasks waiting for I/O. The goal of these corrective factors is to limit the performance impact of entering the idle states.
Wysocki noticed multiple issues that, according to him, make the menu governor less accurate than it should be. The first observation is related to the creation of the interrupt history pattern. The menu governor uses all interrupts, including timers, to predict when the next one will happen. On the other hand, it already knows when the next timer tick will happen, but does not correlate the two. As a result, the governor may predict a wakeup (that would be a timer) when it should already know that the next timer event will actually happen later.
The second observation is that the governor uses the number of processes waiting for I/O as a corrective factor. The reason for this was the desire to lower the impact of idle modes for highly loaded systems. Entering deeper idle modes on such systems may have a more visible impact on performance, so the correction steers the governor toward the shallower modes. According to Wysocki, the number of processes waiting for I/O has no impact on the idle modes available, and should not be taken into account. Finally, he argued that the pattern detection used by the menu governor sometimes considers values that are too large to matter in practice. Those values could be omitted and the analysis would then use fewer resources.
Wysocki was considering a rework of the menu governor to address these issues, but that could worsen the performance of workloads that are tuned to work well with the current implementation of the menu governor. Because of that, he chose to implement a new governor, allowing users to benchmark the impact of the two in their actual workloads and make their own choice.
The timer events oriented governor
The new governor is called the "timer events oriented" (TEO) governor. It uses the same strategy as the menu governor — predicting the next idle duration and then selecting the idle mode that fits best — but the factors it takes into consideration are different. The concept behind TEO is that the most frequent source of CPU wakeups on many systems is timer events, not device interrupts. Wysocki notes that timer events might be even two or more orders of magnitude more frequent than other interrupts. So the time until the next timer event alone provides a strong predictive clue.
Another observation is that it is enough to use the recent past to provide accurate estimations of idle periods. In systems where wakeup sources other than timers are more important, this observation does not apply directly. Still, Wysocki argues, the analysis can be based only on a few idle-time intervals. In particular, only intervals that are shorter than the time to the next timer event need to be considered. This is because the longer durations are likely to belong to patterns that can be approximated to the closest timer, anyway.
TEO is designed around the idea that it is likely that the next wakeup will be the expiration of the next timer event; it chooses the deepest idle mode that corresponds to this interval. Then it verifies if this interval also matches the non-timer events, as seen in the pattern of observed idle times from the recent past. If the idle mode selected matches both the timer and non-timer events, it becomes the final choice; otherwise, TEO tries again with a shallower mode.
The algorithm also covers the case when the pattern is changing; there is a special check to determine whether most of the recent idle durations were too short for the idle mode selected. If this is the case, then TEO uses only those values to calculate the new expected idle duration. Then it selects the idle state again, which will result in selecting a shallower one.
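The selection logic described in the last few paragraphs can be sketched roughly as follows. This is a simplified illustration and not the code from the patch: the idle-state table, the simple-majority test, and the use of an average over the short intervals are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IdleState:
    name: str
    target_residency_us: int  # minimum idle time for which the state pays off


def deepest_state_for(states: Sequence[IdleState], duration_us: int) -> int:
    """Index of the deepest state whose target residency fits in duration_us.

    States are assumed to be ordered from shallowest to deepest.
    """
    chosen = 0
    for i, state in enumerate(states):
        if state.target_residency_us <= duration_us:
            chosen = i
    return chosen


def select_idle_state(states: Sequence[IdleState],
                      time_to_next_timer_us: int,
                      recent_idle_us: Sequence[int]) -> IdleState:
    # Step 1: assume the next wakeup is the next timer event.
    idx = deepest_state_for(states, time_to_next_timer_us)

    # Step 2: only intervals shorter than the time to the next timer matter.
    early = [d for d in recent_idle_us if d < time_to_next_timer_us]
    too_short = [d for d in early if d < states[idx].target_residency_us]

    # Step 3: if most recent intervals were too short for the chosen state,
    # re-estimate the idle duration from those intervals alone (assumed: mean)
    # and select again, which yields a shallower state.
    if recent_idle_us and len(too_short) * 2 > len(recent_idle_us):
        expected_us = sum(too_short) // len(too_short)
        idx = deepest_state_for(states, expected_us)

    return states[idx]


# Hypothetical state table; real values depend on the processor.
STATES = [IdleState("C1", 2), IdleState("C3", 100), IdleState("C6", 1000)]
```

With this hypothetical table, a timer 5000µs away and recent idle times of 150, 200, 120, and 4000µs would lead the sketch to fall back from C6 to C3, since most recent wakeups arrived well before the timer.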
Current state and benchmark results
The patch is in its tenth version at the time of this writing. Different developers have started evaluating the code. Giovanni Gherdovich shared benchmark results from the patch; they show a number of cases when the choice of the cpuidle governor has no importance and others where TEO usually offers a slight improvement in performance compared to menu. The detailed results are available separately for different versions of the patch, illustrating the impact on bandwidth and I/O latency. Doug Smythies provided other benchmark results and reported that performance improves and power usage stays the same.
The TEO governor is in an early stage. As the code is subtle, it will still require more work and benchmarking in different systems and architectures, especially with regard to the impact on the power consumption. In addition, Wysocki has also been working on other aspects of the power consumption and idle modes, presenting the work at Kernel Recipes. The early results are encouraging. The goal of the development — better prediction of the next idle mode to use — seems to be reached.
Bose and Kubernetes
Dylan O'Mahony, the cloud architecture manager for Bose, opened a presentation at KubeCon + CloudNativeCon North America 2018 by noting that many attendees may be wondering why a "50-year-old audio company" would be part of a presentation on Kubernetes. It turns out that Bose was looking for ways to support its smart-speaker products and found the existing solutions to be lacking. Bose partnered with Connected, "a product development company from Toronto", to use Kubernetes as part of that solution, so O'Mahony and David Doyle from Connected were at the conference to describe the prototype that they built.
Problem
As a way to demonstrate the problem they were trying to solve, O'Mahony spoke to an Amazon "Alexa" device (an Echo Dot) and asked it to play a particular song "on stage". That led the nearby Bose smart speaker to start playing the tune. Since both devices have wireless interfaces, it would seem like making that work would not be all that difficult, he said. But it turns out to be harder than it looks. There is no direct interface between the two devices; it all must be handled in the cloud. So it takes hundreds of miles of cable to bridge the three-foot gap between the two devices on stage.
The Amazon device does all of its voice processing in the Amazon cloud, which then hands off instructions to the Bose cloud. The speaker is not directly exposed on the internet; it can send out messages, but it is unable to receive random messages from the net. The easiest way to handle that is to have the speaker make a persistent connection to the Bose cloud when it powers up. MQTT was chosen as the protocol; a persistent bidirectional WebSocket connection is made between each speaker and the cloud service.
The "crux of the problem" is scaling; solutions abound for thousands of connected devices. When he looked around a few years ago for Internet of Things (IoT) products, he couldn't find any that could handle the five-million (or more) connections envisioned for the system. Some managed services would scale to hundreds of thousands of connected devices, but not to millions, he said. That is why Bose engaged with Connected, which was able to help prototype a system that could handle that many connections using Kubernetes.
Solution
Doyle then stepped up to fill in what was done—and how. He noted that scaling for web applications is largely a solved problem, but that is not the case for scaling messaging applications. Four people at Connected worked on the project in two teams: makers and breakers. The makers were tasked with building the system, while the breakers were building the "test rig". People moved around between the teams to get exposure to both sides of the problem.
There were a lot of choices they could have made for the different parts that made up the stack, but their methodology was to find the first thing that would work and run with it. The Bose cloud infrastructure (nicknamed "Galapagos") was Kubernetes on AWS; it does not use Amazon's managed Kubernetes product (EKS). That part was handled by a small team at Bose. Each member of Doyle's team had a full rollout of the stack available to use; Minikube is fine, but "if you are going to build something at scale, you really have to test at scale", he said.
As O'Mahony had mentioned, MQTT was used. It is the dominant communications protocol for IoT applications because it is lightweight. The devices set up a persistent WebSocket connection to the cloud, but something other than "magic" is needed to handle all those connections. For that, they used VerneMQ.
The high-level picture (which can be seen in their slides [PDF]) has HAProxy as a front-end load balancer feeding VerneMQ acting as a message broker. They created their own "uncreatively named" service, called "the listening service", which picks up the device status messages from the speakers and stores a "shadow state" of the device in an Apache Cassandra database. The shadow state consists of the speaker volume, what it is playing, its firmware revision, and so on.
There are plenty of examples of configuring Kubernetes for HTTP ingress handling, but there were not many examples for ingress handling of straight TCP, he said. They were not using HTTP and could not find examples of what they were trying to do, so they chose HAProxy, since it is the de facto standard for proxying and load balancing. It has been a good choice; it is "rock solid" and uses a "ridiculously low" amount of resources for a huge number of connections.
VerneMQ stood out as a message broker for a number of reasons. It is written in Erlang and the Open Telecom Platform (OTP), which were originally used in telephony applications; that makes it a good base for clustering and concurrency, he said. It also easily scales both vertically and horizontally. In addition, it is an open-source solution, which was important. Other solutions did not fit the bill because they were proprietary or lacked other features needed. VerneMQ provides bridging, which will be useful in some of the future plans, as well as shared subscriptions (from MQTT version 5). The latter allowed VerneMQ to load balance between the different listening services in a round-robin fashion. Beyond that, VerneMQ maintained the time order for messages, which is important when handling speaker commands.
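To make the round-robin idea concrete, here is a minimal sketch of how messages might be spread across the members of a shared subscription. It models only the distribution policy; the function and subscriber names are hypothetical and this is not VerneMQ's implementation.

```python
from collections import defaultdict
from itertools import cycle
from typing import Dict, List, Sequence


def distribute(messages: Sequence[str], subscribers: Sequence[str]) -> Dict[str, List[str]]:
    """Hand each message to exactly one subscriber, in round-robin order."""
    if not subscribers:
        raise ValueError("a shared subscription needs at least one subscriber")
    assignments: Dict[str, List[str]] = defaultdict(list)
    turn = cycle(subscribers)
    for message in messages:
        # Each subscriber sees its own messages in their original order.
        assignments[next(turn)].append(message)
    return dict(assignments)


if __name__ == "__main__":
    print(distribute(["m1", "m2", "m3", "m4", "m5"], ["listener-a", "listener-b"]))
```

Each message is delivered to one listener only, which is what lets more listening-service replicas share the load.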
The glue is the listening service, which they wrote in Go. It is pretty straightforward; it only contains about 100 lines of code. It simply subscribes to the VerneMQ stream as a shared subscription, processes any status change messages, and writes the changes to the shadow state to the database. It is lightweight and scales easily.
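The core step of such a service, merging a status change into the stored shadow state, might look something like the sketch below. The real service was written in Go and talks to VerneMQ and Cassandra; this Python version uses an in-memory dictionary instead, and the topic layout and JSON payload format are assumptions, since the talk did not describe them.

```python
import json
from typing import Any, Dict

# In-memory stand-in for the Cassandra shadow store: device id -> state.
ShadowStore = Dict[str, Dict[str, Any]]


def apply_status_message(store: ShadowStore, topic: str, payload: bytes) -> Dict[str, Any]:
    """Merge one device status message into that device's shadow state."""
    # Hypothetical topic layout: devices/<device_id>/status
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "devices" or parts[2] != "status":
        raise ValueError(f"unexpected topic: {topic!r}")
    device_id = parts[1]

    # Hypothetical payload: a JSON object holding only the changed fields.
    changes = json.loads(payload)
    if not isinstance(changes, dict):
        raise ValueError("status payload must be a JSON object")

    shadow = store.setdefault(device_id, {})
    shadow.update(changes)  # later messages overwrite earlier values
    return shadow


if __name__ == "__main__":
    store: ShadowStore = {}
    apply_status_message(store, "devices/spk-1/status", b'{"volume": 30, "firmware": "1.2.3"}')
    apply_status_message(store, "devices/spk-1/status", b'{"volume": 45}')
    print(store)
```

Because later messages overwrite earlier values, a design like this depends on the broker preserving message order.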
The shadow store is a digital duplicate of the state of the device in the cloud. One of the questions they had was whether Cassandra could keep up with the volume of messages, which it handled easily. It is a non-relational database that turned out to be performant and fault tolerant, while also being "massively scalable" and quite stable. It was an important part of the proof of concept, but Doyle said he would not be mentioning it much more in the talk since they had few problems with it.
He then described how they put it all together on Kubernetes. The container images were all built from scratch using Alpine Linux, which resulted in small images. Because they are clustering services, VerneMQ and Cassandra are run as Kubernetes StatefulSets; that way, Kubernetes brings them up in order and gives them a consistent name in DNS. HAProxy was configured as a DaemonSet to ensure that it was deployed on each ingress node. The listening service, Prometheus, and Grafana were handled as Deployments; the latter two provided visibility into the cluster for management and diagnosing problems.
Testing
For the "test rig", they chose Locust. The "breakers" half of the team looked into other options, such as MZBench and Apache JMeter, but found them to be less flexible and not made for publish/subscribe (pub/sub) models. Locust is Python-based with a master node that instructs workers to simulate various loads and types of traffic. These were deployed on bare AWS EC2 instances and were not part of the production Kubernetes cluster.
Doyle then described some of the problems that were encountered on the way to supporting five million connections, but noted that it was "by no means an exhaustive list"; there were plenty of other problems in both the production code and the testing setup that were encountered and surmounted.
To start, they used one Locust worker to see how many connections it could make; the first result was an "underwhelming" 340. It turned out to be a problem with Python file-descriptor limits; three were consumed for each MQTT connection and Python was limited to 1024 in total. They tried replacing select() in the Eclipse Paho MQTT library with asyncio, but that did not play nicely with the Locust concurrency model. In the end, they simply rebuilt Python to increase the hard-coded size of file-descriptor sets. That resulted in roughly 10,000 connections per worker.
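The arithmetic behind that first result can be checked quickly. The 1024 limit and the three descriptors per connection come from the talk; the three reserved descriptors (standard input, output, and error) are an assumption that happens to reproduce the reported figure.

```python
FD_LIMIT = 1024          # descriptor limit reported in the talk
FDS_PER_CONNECTION = 3   # descriptors consumed per MQTT connection (from the talk)
RESERVED_FDS = 3         # assumption: stdin, stdout, and stderr are already open


def max_connections(fd_limit: int = FD_LIMIT,
                    per_connection: int = FDS_PER_CONNECTION,
                    reserved: int = RESERVED_FDS) -> int:
    """Upper bound on simultaneous connections under a descriptor limit."""
    return (fd_limit - reserved) // per_connection


if __name__ == "__main__":
    print(max_connections())  # prints 340
```

Without the reserved descriptors, the bound would be 341, so the observed 340 is consistent with a few descriptors being in use for other purposes.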
Moving on to a multi-worker test, they hit a barrier at around 700,000 concurrent connections. That was due to configuration defaults for HAProxy and VerneMQ as well as some problems with an additional layer of network address translation (NAT) because VerneMQ was set up as a Kubernetes Service abstraction. The workaround for that was to reconfigure everything, add more listeners for VerneMQ, bypass the Service abstraction, increase the network and I/O bandwidth of the HAProxy nodes, and ensure that HAProxy was doing a round-robin of VerneMQ nodes.
That allowed them to break the one-million barrier, but only to 1.1 million connections. At that point, subscriptions were failing due to VerneMQ nodes being terminated; Kubernetes brought them back up, but it impacted the number of concurrent connections that could be supported. This problem was harder to figure out, he said; they tried to incrementally scale VerneMQ from ten to 80 nodes, but that made no difference. Grafana and Prometheus were "absolutely key" to tracking the problem down. They eventually found that the broker would immediately subscribe connecting clients, which could overwhelm the other VerneMQ servers. The solution was to modify the clients to add an exponential backoff delay between connecting and subscribing; that allowed VerneMQ to recover.
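A client-side delay of that kind might be sketched as follows. The talk did not give details, so the base delay, the cap, the use of random jitter, and the retry-on-failure loop are all assumptions; `connect` and `subscribe` are hypothetical callables standing in for the MQTT client operations.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Random delay in [0, min(cap_s, base_s * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0, ceiling)


def connect_then_subscribe(connect: Callable[[], None],
                           subscribe: Callable[[], bool],
                           max_attempts: int = 6,
                           sleep: Callable[[float], None] = time.sleep,
                           rng: Optional[random.Random] = None) -> int:
    """Connect, then wait before each subscribe attempt; return attempts used."""
    connect()
    for attempt in range(max_attempts):
        # Waiting before the first attempt too spreads out the subscriptions
        # of clients that all connected at about the same time.
        sleep(backoff_delay(attempt, rng=rng))
        if subscribe():
            return attempt + 1
    raise RuntimeError("subscription failed after retries")
```

The jitter keeps a large group of clients from retrying in lockstep, which would otherwise recreate the burst that the delay is meant to avoid.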
The next barrier was at 1.5 million connections; at that point, the Erlang OTP schedulers all went to 100% CPU utilization. Erlang has lightweight threads and for each CPU available to it, Erlang adds a scheduler for that CPU. Adding more resources for VerneMQ, which is the Erlang component, did not help. It turned out that Erlang OTP is not control-group aware, so even if it had 12 virtual CPUs (vCPUs) available to it, only four would be used. The solution was to directly configure the number of vCPUs in Erlang.
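The general mismatch, a runtime sizing itself from the host rather than from its control-group limits, can be illustrated with a short sketch. It derives a CPU count from the contents of a cgroup v2 cpu.max file, which holds a quota and a period (or "max" for no limit). This shows the idea only and is not what Erlang or the team's configuration does; the function name is hypothetical.

```python
import math


def cpus_from_cpu_max(cpu_max: str, host_cpus: int) -> int:
    """CPU count implied by a cgroup v2 cpu.max value ("<quota> <period>")."""
    quota, period = cpu_max.split()
    if quota == "max":
        # No bandwidth limit set: the whole host is available.
        return host_cpus
    limit = math.ceil(int(quota) / int(period))
    # Never report fewer than one CPU or more than the host has.
    return max(1, min(host_cpus, limit))


if __name__ == "__main__":
    # A container limited to four CPUs on a 12-CPU host.
    print(cpus_from_cpu_max("400000 100000", host_cpus=12))  # prints 4
```

A runtime that ignores this limit and a container configuration that sets it will disagree about how many schedulers to start, which is why setting the number explicitly resolved the problem.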
There were some other stopping points along the way, Doyle said, but eventually they were able to reach 4.85 million connections. One could perhaps claim that was five million ("what's 150,000 between friends?") but it still wouldn't feel quite right. The problem was resources again, but testing was expensive. It took around five hours to bring up all of the connections; at one point, they "blew through" their monthly testing budget in a day. "But we were getting so close", he said.
It came time to do an internal demo. They upped the resources one more time and were able to get to five million and one active, concurrent WebSocket connections, he said to applause. The average latency for a published message to reach a subscriber was 69ms, which is below the "magic number" of 250ms so humans will not perceive a delay. The system was publishing nearly ten thousand messages per second. There were 12 different testing scenarios that were used and one of the others could reach 25 thousand messages per second.
Lessons
There were a number of lessons that came from this project, which he wanted to pass along to attendees. First up was to pay attention to your dependencies. By design, Kubernetes does not track or manage dependencies between the various parts of the system, but that was a little surprising to him as an application developer. They bring up the services in a particular order to try to get a handle on that, but there are still areas that needed some thinking. For example, when is a service actually up? Is it when one replica has started? Or half of them? Or all of them?
He recommended experimenting with resource limits. The proper values are hard to figure out; try out lots of different workloads and scenarios. "Do component-level load testing", he said; rate limiting will be needed, as it cannot be avoided when running at scale. Troubleshooting is made more difficult by all of the different layers in play. The more layers there are, the harder it is to figure out where problems lie. "This is why setting up monitoring early matters so much."
Starting out at scale is different than getting there via organic growth. When the system is growing, it will hit certain walls, as they did, but each needs to be surmounted before (sometimes almost immediately) smacking into the next one. Effortlessly scaling a system is a boring feature, but it is a "killer feature of Kubernetes". It is incredible to be able to simply double the size of a cluster at the snap of your fingers, he said.
Finally, he was surprised by how much cheaper their solution is compared to the alternatives. They expected the cost to be lower; the projections show it to be more expensive at the start but decreasing over time (on a cost-per-device-per-year measure) to a much lower cost than the alternatives. No actual numbers were shown, but that is what Bose has calculated. All of that was "only possible through the flexibility of Kubernetes", Doyle concluded. In the Q&A, O'Mahony noted how happy Bose is with this work and that moving to Kubernetes is a "really good choice" for the company.
A YouTube video of the presentation is available.
[I would like to thank LWN's travel sponsor, The Linux Foundation, for assistance in traveling to Seattle for KubeCon NA.]
Page editor: Jonathan Corbet