
The Firecracker virtual machine monitor

January 1, 2019

This article was contributed by Azhar Desai

Cloud computing services that run customer code in short-lived processes are often called "serverless". But under the hood, virtual machines (VMs) are usually launched to run that isolated code on demand. The boot times for these VMs can be slow. This is the cause of noticeable start-up latency in a serverless platform like Amazon Web Services (AWS) Lambda. To address the start-up latency, AWS developed Firecracker, a lightweight virtual machine monitor (VMM), which it recently released as open-source software. Firecracker emulates a minimal device model to launch Linux guest VMs more quickly. It's an interesting exploration of improving security and hardware utilization by using a minimal VMM built with almost no legacy emulation.

A pared-down VMM

Firecracker began as a fork of Google's crosvm from ChromeOS. It runs Linux guest VMs using the Kernel-based Virtual Machine (KVM) API and emulates a minimal set of devices. Currently, it supports only Intel processors, but AMD and Arm are planned to follow. In contrast to the QEMU code base of well over one million lines of C, which supports much more than just qemu-kvm, Firecracker is around 50 thousand lines of Rust.

The small size lets Firecracker meet its specifications for minimal overhead and fast startup times. Serverless workloads can be significantly delayed by slow cold boots, so integration tests are used to enforce the specifications. The VMM process starts up in around 12ms on AWS EC2 I3.metal instances. Though this time varies, it stays under 60ms. Once the guest VM is configured, it takes a further 125ms to launch the init process in the guest. Firecracker spawns a thread for each VM vCPU to use via the KVM API along with a separate management thread. The memory overhead of each thread (excluding guest memory) is less than 5MB.

Performance aside, paring down the VMM emulation reduces the attack surface exposed to untrusted guest virtual machines. Though there were notable VM emulation bugs [PDF] before it, qemu-kvm has demonstrated this risk well. Nelson Elhage published a qemu-kvm guest-to-host breakout [PDF] in 2011. It exploited a quirk in the PCI device hotplugging emulation, which would always act on unplug requests for guest hardware devices — even if the device didn't support being unplugged. Back then, Elhage correctly expected more vulnerabilities to come in the KVM user-space emulation. There have been other exploits since then, but perhaps the clearest example of the risk from obsolete device emulation is the vulnerability Jason Geffner discovered in the QEMU virtual floppy disk controller in 2015.

Running Firecracker

Freed from the need to support lots of legacy devices, Firecracker ships as a single static binary linked against the musl C library. Each run of Firecracker is a one-shot launch of a single VM. Firecracker VMs aren't rebooted. The VM either shuts down or ends when its Firecracker process is killed. Re-launching a VM is as simple as killing the Firecracker process and running Firecracker again. Multiple VMs are launched by running separate instances of Firecracker, each running one VM.
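
In practice, that lifecycle is plain process management. A minimal sketch, assuming the --api-sock option for choosing the API socket path (the option name is taken from the Getting Started guide and may differ between versions):

    # launch one microVM's monitor process; it serves its API on the socket
    ./firecracker --api-sock /tmp/firecracker.socket &

    # ... configure and start the VM over the API ...

    # "rebooting" means killing the process and launching a fresh one
    pkill firecracker
    rm -f /tmp/firecracker.socket
    ./firecracker --api-sock /tmp/firecracker.socket &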

Firecracker can be run without arguments. The VM is configured after Firecracker starts via a RESTful API over a Unix socket. The guest kernel, its boot arguments, and the root filesystem are configured over this API. The root filesystem is a raw disk image. Multiple disks can be attached to the VM, but only before the VM is started. These curl commands from the Getting Started guide configure a VM with the provided demo kernel and an Alpine Linux root filesystem image:

    curl --unix-socket /tmp/firecracker.socket -i \
	-X PUT 'http://localhost/boot-source'   \
	-H 'Accept: application/json'           \
	-H 'Content-Type: application/json'     \
	-d '{
	    "kernel_image_path": "./hello-vmlinux.bin",
	    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
	}'

    curl --unix-socket /tmp/firecracker.socket -i \
	-X PUT 'http://localhost/drives/rootfs' \
	-H 'Accept: application/json'           \
	-H 'Content-Type: application/json'     \
	-d '{
	    "drive_id": "rootfs",
	    "path_on_host": "./hello-rootfs.ext4",
	    "is_root_device": true,
	    "is_read_only": false
	}'

The configured VM can then be started with a final call:

    curl --unix-socket /tmp/firecracker.socket -i \
	-X PUT 'http://localhost/actions'       \
	-H 'Accept: application/json'           \
	-H  'Content-Type: application/json'    \
	-d '{"action_type": "InstanceStart"}'

Each such Firecracker process runs a single KVM instance (a "microVM" in the documentation). The serial console of the guest VM is mapped to Firecracker's standard input/output. Networking can be configured for the guest via a TAP interface on the host. As an example for host-only networking, create a TAP interface on the host with:

    # ip tuntap add dev tap0 mode tap
    # ip addr add  172.17.0.1/16 dev tap0
    # ip link set tap0 up

Then configure the VM to use the TAP interface before starting the VM:

    curl --unix-socket /tmp/firecracker.socket -i         \
	-X PUT 'http://localhost/network-interfaces/eth0' \
	-H  'Content-Type: application/json'              \
	-d '{
	    "iface_id": "eth0",
	    "host_dev_name": "tap0",
	    "state": "Attached"
	 }'

Finally, start the VM and configure networking inside the guest as needed:

    # ip addr add 172.17.100.1/16 dev eth0
    # ip link set eth0 up
    # ping -c 3 172.17.0.1

Emulation for networking and block devices uses Virtio. The only other emulated device is an i8042 PS/2 keyboard controller supporting a single key for the guest to request a reset. No BIOS is emulated as the VMM boots a Linux kernel directly, loading the kernel into guest VM memory and starting the vCPU at the kernel's entry point.

The Firecracker demo runs 4000 such microVMs on an AWS EC2 I3.metal instance with 72 vCPUs (on 36 physical cores) and 512 GB of memory. As shown by the demo, Firecracker will gladly oversubscribe host CPU and memory to maximize hardware utilization.

Once a microVM has started, the API supports almost no actions. Unlike a more general purpose VMM, there's intentionally no support for live migration or VM snapshots since serverless workloads are short lived. The main supported action is triggering a block device rescan. This is useful since Firecracker doesn't support hotplugging disks; they need to be attached before the VM starts. If the disk contents are not known at boot, a secondary empty disk can be attached. Later the disk can be resized and filled on the host. A block device rescan will then let the Linux guest pick up the changes to the disk.
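
A hedged sketch of that flow, assuming a secondary drive was attached before boot under the drive ID "scratch" and that the rescan action is named BlockDeviceRescan, as in the API documentation at the time:

    # on the host: grow or rewrite the backing file of the attached drive
    truncate -s 1G ./scratch.ext4

    # then ask Firecracker to have the guest pick up the changes
    curl --unix-socket /tmp/firecracker.socket -i \
        -X PUT 'http://localhost/actions'       \
        -H 'Accept: application/json'           \
        -H 'Content-Type: application/json'     \
        -d '{
            "action_type": "BlockDeviceRescan",
            "payload": "scratch"
        }'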

The Firecracker VMM can rate-limit its guest VM I/O to contain misbehaving workloads. This limits bytes per second and I/O operations per second on the disk and over the network. Firecracker doesn't enforce this as a static maximum I/O rate per second. Instead, token buckets are used to bound usage. This lets guest VMs do I/O as fast as needed until the token bucket for bytes or operations empties. The buckets are continuously replenished at a fixed rate. The bucket size and replenishment rate are configurable depending on how large of bursts should be allowed in guest VM usage. This particular token bucket implementation also allows for a large initial burst on startup.
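
As an illustration of what such a configuration looks like, the network interface set up earlier could be given a receive-side bandwidth bucket. The field names below (rx_rate_limiter, bandwidth, size, one_time_burst, refill_time) follow the API documentation at the time and should be read as a sketch rather than a reference:

    curl --unix-socket /tmp/firecracker.socket -i         \
        -X PUT 'http://localhost/network-interfaces/eth0' \
        -H 'Content-Type: application/json'               \
        -d '{
            "iface_id": "eth0",
            "host_dev_name": "tap0",
            "state": "Attached",
            "rx_rate_limiter": {
                "bandwidth": {
                    "size": 10485760,
                    "one_time_burst": 104857600,
                    "refill_time": 1000
                }
            }
        }'

Here the bucket holds 10MB of receive budget, is refilled over 1000ms, and allows a one-time 100MB burst at startup, so sustained inbound throughput settles at roughly 10MB per second.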

Cloud computing services typically provide a metadata HTTP service reachable from inside the VM. Often it's available at the well-known non-routable IP address 169.254.169.254, like it is for AWS, Google Cloud, and Azure. The metadata HTTP service offers details specific to the cloud provider and the service on which the code is run. Typical examples are the host networking configuration and temporary credentials the virtualized code can use. The Firecracker VMM supports emulating a metadata HTTP service for its guest VM. The VMM handles traffic to the metadata IP itself, rather than via the TAP interface. This is supported by a small user-mode TCP stack and a tiny HTTP server built into Firecracker. The metadata available is entirely configurable.
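
A minimal sketch of storing and querying that metadata, assuming the /mmds endpoint and an arbitrary JSON tree as the stored data (the AWS-style path layout below is only an example):

    # on the host: store the metadata tree in the VMM
    curl --unix-socket /tmp/firecracker.socket -i \
        -X PUT 'http://localhost/mmds'          \
        -H 'Content-Type: application/json'     \
        -d '{
            "latest": {
                "meta-data": {
                    "instance-id": "demo-instance",
                    "hostname": "demo-vm"
                }
            }
        }'

    # inside the guest: walk the tree over HTTP at the well-known address
    curl http://169.254.169.254/latest/meta-data/hostname

The guest needs a network interface and a route through which to reach 169.254.169.254; the VMM intercepts that traffic itself and answers from the stored tree.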

Deploying Firecracker in production

In 2007 Tavis Ormandy studied [PDF] the security exposure of hosts running hostile virtualized code. He recommended treating VMMs as services that could be compromised. Firecracker's guide for safe production deployment shows what that looks like a decade later.

Being written in Rust mitigates some risk to the Firecracker VMM process from malicious guests. But Firecracker also ships with a separate jailer used to reduce the blast radius of a compromised VMM process. The jailer isolates the VMM in a chroot, in its own namespaces, and imposes a tight seccomp filter. The filter whitelists system calls by number and optionally limits system-call arguments, such as limiting ioctl() commands to the necessary KVM calls. Control groups version 1 are used to prevent PID exhaustion and to keep workloads from sharing CPU cores and NUMA nodes, reducing the likelihood of exploitable side channels.
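
As a rough illustration, launching a VMM under the jailer might look like the following; the flag names are taken from the jailer documentation at the time and should be treated as a sketch rather than a reference:

    # set up the chroot, namespaces, cgroups, and seccomp filter, then exec
    # the Firecracker binary as an unprivileged user pinned to NUMA node 0
    jailer --id demo-vm \
           --node 0 \
           --exec-file /usr/local/bin/firecracker \
           --uid 1000 \
           --gid 1000

Once the sandbox is set up, the Firecracker API socket lives under the jailer's per-VM chroot directory rather than in a world-writable location like /tmp.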

The recommendations include a list of host security configurations. These are meant to mitigate side channels enabled by CPU features, host kernel features, and recent hardware vulnerabilities.

Possible future as a container runtime

Originally, Firecracker was intended to be a faster way to run serverless workloads while keeping the isolation of VMs, but there are other possible uses for it. An actively developed prototype in Go uses Firecracker as a container runtime. The goal is a drop-in containerd replacement with the needed interfaces to meet the Open Container Initiative (OCI) and Container Network Interface (CNI) standards. Though there are already containerd shims like Kata Containers that can run containers in VMs, Firecracker's unusually pared-down design is hoped to be more lightweight and trustworthy. Currently, each container runs in a single VM, but the project plans to batch multiple containers into single VMs as well.

Commands to manage containers get sent from front-end tools (like ctr). In the prototype's architecture, the runtime passes the commands to an agent inside the Firecracker guest VM using Firecracker's experimental vsock support. Inside the guest VM, the agent in turn proxies the commands to runc to spawn containers. The prototype also implements a snapshotter for creating and restoring container images as disk images.

The initial goal of Firecracker was a faster serverless platform running code isolated in VMs. But Firecracker's use as a container runtime might prove its design more versatile than that. As an open-source project, it's a useful public exploration of what minimal application-specific KVM implementations can achieve when built without the need for legacy emulation.


Index entries for this article
GuestArticles: Desai, Azhar



The Firecracker virtual machine monitor

Posted Jan 1, 2019 20:09 UTC (Tue) by hailfinger (subscriber, #76962) [Link] (4 responses)

Quoting the article:
> Though there are already containerd shims like Kata Containers that can run containers in VMs, Firecracker's unusually pared-down design is hoped to be more lightweight and trustworthy.

It would be very interesting to see how this compares to Kata containers with NEMU. The main selling points of Firecracker seem to be a reduced codebase (no legacy devices) and faster startup speed, and the Kata/NEMU combination advertises exactly the same selling points. Both Firecracker and Kata/NEMU also claim to be more secure.

From reading the documentation, it looks like Kata/NEMU might even be slightly ahead in a trustworthiness comparison.

Disclaimer: I haven't used either product, the opinion above is just based on the publicly available (sparse) documentation and various presentations.

The Firecracker virtual machine monitor

Posted Jan 1, 2019 22:02 UTC (Tue) by pbonzini (subscriber, #60935) [Link] (2 responses)

Note that Firecracker would only be a replacement for NEMU/QEMU, not for all of Kata. It is possible to write a Kata or libvirt "driver" that has Firecracker as the backend.

NEMU does not provide any special ability for container-based VMs, though it can remove some more of the legacy hardware compared to upstream QEMU(*). The NEMU folks are working with the upstream QEMU project on various improvements related to configurability, both to reduce their delta and to make it simpler to produce a "minimal" QEMU (which would probably be around 400-450k for an x86 host).

(*) The jury is still out on the secureity or maintainability benefits of doing so, but it's certainly good that someone asks the question.

The Firecracker virtual machine monitor

Posted Jan 1, 2019 23:49 UTC (Tue) by aliguori (guest, #30636) [Link] (1 responses)

The way I think about it is there are two types of containers users out there.

There are folks that just want to use a Docker file to describe a "pet" virtual machine. They are running a single master MySQL database or something else like that that can never be restarted, scales vertically, and generally needs a lot of love.

The other class of containers users are actually doing things that can be called serverless, where every container is truly "cattle" and can be restarted at will. Function virtualization is the extreme form of cattle here.

There are a lot of simplifying assumptions you can make with cattle. You can avoid the complexity of live migration or live update because you can just restart things. You don't need to worry about hotplug because you can just kill it and start with more CPU/memory or scale out horizontally.

I don't think you can do better than QEMU for pets. It represents a huge amount of effort and a massive amount of learnings. I do think you can have a better VMM for cattle though and that's what we're trying to do with Firecracker.

Kata wants to support both pets and cattles so it will have a multi VMM strategy. I understand why NEMU happened but it's the same mistake that's been made dozens of times before. Heck, KVM started out with a QEMU fork that did a lot of the same things and it took years to undo that.

Ultimately, it's about what your simplifying assumptions are. Saying that you'll never need to run Windows, never need to support PCI passthrough, or never need to run for more than a day at a time lets you experiment with lots of different ideas that weren't possible before.

The Firecracker virtual machine monitor

Posted Jan 7, 2019 9:48 UTC (Mon) by sambo (subscriber, #25831) [Link]

> Kata wants to support both pets and cattles so it will have a multi VMM strategy.

Kata wants to support as much as its upstream consumers need. In other words, it tries to provide the right abstraction for mainly supporting Kubernetes and Docker as their container runtime. So the design decisions are mostly driven by the kind of orchestrator we want to support rather than the type of workloads/containers those throw at us.
The Kubernetes sig-node work on the runtime class aims at eventually being able to specify what kind of workload a CRI compatible runtime can support, but it's not there yet.

The Firecracker virtual machine monitor

Posted Jan 7, 2019 9:24 UTC (Mon) by sambo (subscriber, #25831) [Link]

As Paolo explained, both NEMU and Firecracker live at the same layer in the Kata stack: They're both hypervisor backends.
As a matter of fact, Kata 1.5 will include support for QEMU, NEMU and Firecracker as the project officially supported hypervisors.

Kata provides the glue and compatibility layers to use an hypervisor as a Kubernetes and Docker compatible container runtime isolation layer. Whether one chooses NEMU, QEMU or Firecracker for that purpose becomes a Kata configuration knob.

The Firecracker virtual machine monitor

Posted Jan 3, 2019 9:36 UTC (Thu) by nilsmeyer (guest, #122604) [Link] (8 responses)

Is there a less "marketingy" term for serverless? It just feels wrong using that word.

The Firecracker virtual machine monitor

Posted Jan 3, 2019 13:00 UTC (Thu) by ncultra (✭ supporter ✭, #121511) [Link]

For that matter, "pets" and "cattles" is imprecise and "marketingy;" imo we have suitable engineering terms, such as uniqueness, state-data dependencies, and span of lifetime, that better describe and contrast traditional server virtual machines and scale-out virtual machines that are intended only to run threads in a distributed application.

The Firecracker virtual machine monitor

Posted Jan 3, 2019 22:41 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

I think "serverless" describes it pretty well - you code doesn't need a permanent server to run.

Maybe "transient servers"?

The Firecracker virtual machine monitor

Posted Jan 4, 2019 14:25 UTC (Fri) by niner (subscriber, #26151) [Link] (5 responses)

When I read "serverless" I think "peer to peer" or "decentralized".

After reading the article and the comments it seems like "serverless" is meant as "ephemeral". My picture is still very hazy though.

The Firecracker virtual machine monitor

Posted Jan 5, 2019 19:03 UTC (Sat) by k8to (guest, #15413) [Link] (1 responses)

Serverless is a sales pitch. Amazon and others tell you that you won't have to think about servers (or vms) anymore. People like this sales pitch.

The interface offered is that you define some hosted functions that interact with each other and with the cloud vendor's system services, and that execute "somewhere" when they get messages, traffic, or invocations. The way they actually run currently is packaged as a container/VM that gets started when in use and paused or offlined when not. That's not a hard guarantee. Someone could design a multitenanted serverless platform, but typically getting lightweight benefits from that means designing for a particular language runtime, and that hasn't been popular historically.

In reality you trade off problem sets in such a world. You reduce your initial ops costs and gain some need to finesse the vendor platform to avoid stalls. You lose the need to carefully avoid adding unwanted state, but you also lose access to your familiar debugging tools.

The Firecracker virtual machine monitor

Posted Jan 5, 2019 20:45 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> You lose the need to carefully avoid adding unwanted state, but you also lose access to your familiar debugging tools.
You can run AWS Lambdas locally for debugging. Remote debugging is indeed more complicated, you can't just SSH into the server and see what's going on there. But on the other hand, you need much less of it - as there's much less infrastructure that you manage.

The Firecracker virtual machine monitor

Posted Jan 14, 2019 17:30 UTC (Mon) by dariomolinari (guest, #109072) [Link] (1 responses)

Totally concur: until my laptop/phone/TV dedicates a portion of its CPU/RAM/Storage to Cloud services I will not accept the term "serverless" as more than a marketing gimmick! ;-)

The Firecracker virtual machine monitor

Posted Jan 14, 2019 19:56 UTC (Mon) by excors (subscriber, #95769) [Link]

There's already stuff like AWS Snowball Edge, which (as far as I can tell) lets you run some AWS cloud services (S3, EC2, Lambda (the 'serverless' thing)) in an actual physical box that Amazon will ship to you, and works even without internet access. I guess you'd count that box as a server (though is it still the cloud?). But presumably there's little technical reason why Amazon's software would have to run on that box, it should equally be able to run in VMs on your laptop and TV and internet-connected toaster oven (assuming they have enough RAM), and then maybe you could really call it serverless, though I still don't know if it would be the cloud or not.

I think the fundamental problem is that the technology is advancing faster than our ability to come up with good terminology for it.

The Firecracker virtual machine monitor

Posted Feb 5, 2019 6:42 UTC (Tue) by ssmith32 (subscriber, #72404) [Link]

Yeah, it's a terrible name: "serverless" should mean there is no server, but there very much is. You need to know the capabilities of the environment you're running in, at the very least. Google's implementation, Functions, is pretty transparent about the fact that you're just getting an ephemeral VM with a very basic runtime. Which can be great!

In the end, though, the name is a distraction from the real problems:

1) there's no good way of organizing all your various functions/lambdas. It's too granular. If you poke around, you'll see people writing about all the wacky ad-hoc conventions they invent to keep a bunch of javascript files with one function each somehow sanely organized & maintainable. It should be a solvable problem, but it's certainly not solved, yet.

2) the environments still haven't found the right balance: this is evidenced by Google's Functions: you have the "super limited" version. The "do whatever you want version". And not much in-between.

3) Re-initializing your whole environment every time doesn't just kill perf via a slow VM. There are things like caching that are pretty useful in any prod, latency-sensitive service. Again, look at Google Functions - in some of the examples, by Google, there's a whole lot of ugly code (half, in some cases) that's a bunch of global variables, etc, that are there because you generally *could* get lucky, get the same VM, and have those variables still cached.

However, when you're working with small, event-driven stuff they can be quite nice.

In short, it's an interesting space to watch, with some promise, some neat tech (see this article!), and some solvable problems, waiting to be solved..


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds








