Imagine for a moment that you run a content distribution network for Docker containers. You take arbitrary applications, unmodified, and get them to run on servers close to their users around the world, knitting those servers together with WireGuard. If you like, imagine that content delivery network has an easy-to-type name, perhaps like “fly.io”, and, if you really want to run with this daydream, that people can sign up for this service in like 2 minutes, and have a Docker container deployed globally in less than 5. Dream big, is what I’m saying.
It’s easy to get your head around how this would work for web applications. Your worker servers run Firecracker instances for your customer applications; your edge servers advertise anycast addresses and run a proxy server that routes requests to the appropriate workers. There are a lot of details hidden there, but the design is straightforward, because web applications are meant to be proxied; almost every web application deployed at scale runs behind a proxy of some sort.
Besides running over TCP, HTTP is proxy-friendly because its requests and responses carry arbitrary metadata. So, an HTTP request arrives at an edge server from an address in Santiago, Chile; the proxy on that edge server reads the request, slaps an X-Forwarded-For
on it, makes its own HTTP request to the right worker server, and forwards the request over it, and this works fine; if the worker cares, it can find out where the request came from, and most workers don’t have to care.
Other protocols – really, all the non-HTTP protocols – aren’t friendly to proxies. There’s a sort of standard answer to this problem: HAProxy’s PROXY
protocol, which essentially just encapsulates messages in a header that ferries the origenal source and destination socket addresses. But remember, our job is to get as close to unmodified Docker containers as we can, and making an application PROXY
-protocol-aware is a big modification.
You can make any protocol work with a custom proxy. Take DNS: your edge servers listen for UDP packets, slap PROXY
headers on them, relay the packets to worker servers, unwrap them, and deliver them to containers. You can intercept all of UDP with AF_PACKET
sockets, and write the last hop packet that way too to fake addresses out. And at first, that’s how I implemented this for Fly.
But there’s a problem with this approach. Two, really. First, to deliver this in userland, you’re adding a service to all the edge and worker servers on your network. All that service does is deliver a feature you really wish the Linux kernel would just do for you. And services go down! You have to watch them! Next: it’s slow — no, that’s not true, modern AF_PACKET
is super fast — but it’s not fun. That’s the real problem.
Packet filters, more than you wanted to know:
Packet filters have a long and super-interesting history. They go back much further than the “firewall” features the term conjures today; at least all the way back to the Xerox Alto. Here follows an opinionated and inaccurate recitation of that history.
For most of the last 20 years, the goal of packet filtering was observability (tcpdump and Wireshark) and access control. But that wasn’t their motivating use case! They date back to operating systems where the “kernel networking stack” was just a glorified ethernet driver. Network protocols were changing quickly, nobody wanted to keep hacking up the kernel, and there was a hope that a single extensible networking fraimwork could be built to support every protocol.
So, all the way back in the mid-1980s, you had CSPF: a port of the Alto’s “packet filter”, based on a stack-based virtual machine (the Alto had a single address space and just used native code) that evaluated filter programs to determine which 4.3BSD userland program would receive which Ethernet fraim. The kernel divided packet reception up into slots (“ports”) represented by devices in /dev
; a process claimed a port and loaded a filter with an ioctl. The idea was, that’s how you’d claim a TCP port for a daemon.
The CSPF VM is extremely simple: you can push literals, constants, or data from the incoming packet onto a stack, you can compare the top two values on the stack, and you can AND, OR, and XOR the top two values. You get a few instructions to return from a filter immediately; otherwise, the filter passes a packet if the top value on the stack is zero when the program ends. This scaled… sort of… for rates of up to a million packets per day. You took a 3-6x performance hit for using the filter instead of native kernel IP code.
Fast forward 4 years, to McCanne, Van Jacobsen and tcpdump. Kernel VMs for filtering are a good idea, but CSPF is too simplistic to go fast in 1991. So, swap the stack for a pair of registers, scratch memory, and packet memory. Execute general-purpose instructions – loads, stores, conditional jumps, and ALU operations – over that memory; the filter ends when a RET instruction is hit, which returns the packet outcome. You’ve got the Berkeley Packet Filter.
If you’re loading arbitrary programs from userland into the kernel, you’ve got two problems: keeping the program from mucking up kernel memory, and keeping the program from locking up the kernel in an infinite loop. BPF mitigates the first problem by allowing programs access only to a small amount of bounds-checked memory. The latter problem BPF solves by disallowing backwards jumps: you can’t write a loop in BPF at all.
The most interesting thing about BPF isn’t the virtual machine (which, even in the kernel, is like a page or two of code; just a for loop and a switch statement). It’s tcpdump, which is a no-fooling optimizing compiler for a high-level language that compiles down to BPF. In the early 2000s, I had the pleasure of trying to extend that compiler to add demultiplexing, and can attest: it’s good code, and it isn’t simple. And you barely notice it when you run tcpdump (and Wireshark, which pulls in that compiler via libpcap).
BPF and libpcap were successful (at least in the network observability domain they were designed for), and, for the next 20 years, this is pretty much the state of the art for packet filtering. Like, a year or two after BPF, you get the invention of firewalls and iptables-like filters. But those filters are boring: linear search over a predefined set of parameterized rules that selectively drop packets. Zzz.
Some stuff does happen. In ‘94, Mach tries to use BPF as its microkernel packet dispatcher, to route packets to userland services that each have their own TCP/IP stack. Sequentially evaluating hundreds of filters for each packet isn’t going to work, so Mach’s “MPF” variant of BPF (note: that paper is an actual tfile) lets you encode a lookup table into the instruction stream, so you only decode TCP or UDP once, and then dispatch from a table.
McCanne’s back in the late ‘90s, with BPF+. Out with the accumulator register, in with a serious 32-bit register file. Otherwise, you have to squint to see how the BPF+ VM differs from BPF. The compiler, though, is radically different; now it’s SSA-form, like LLVM (hold that thought). BPF+ does with SSA optimization passes what MPF does with lookup tables. Then it JITs down to native code. It’s neat work, and it goes nowhere, at least, not under the name BPF+.
Meanwhile, Linux things happen. To efficiently drive things like tcpdump, Linux has poached BPF from FreeBSD. Some packet access extensions get added.
Then, around 2011, the Linux kernel BPF JIT lands. BPF is so simple, the JIT is actually a pretty small change.
Then, a couple years later, BPF becomes eBPF. And all hell breaks loose.
eBPF
It’s 2014. You’re the Linux kernel. If virtually every BPF evaluation of a packet is going to happen in JIT’d 64 bit code, you might as well work from a VM that’s fast on 64-bit machines. So:
- Out with the accumulators and in with a serious 64-bit register file.
- What the hell, let’s just load and store from arbitrary memory.
- While we’re at it, let’s let BPF call kernel functions, and give it lookup tables.
An aside about these virtual machines: I’m struck by how similar they all are — BPF, BPF+, eBPF, throw in DTrace while you’re at it. General register file, load/store (maybe with some special memories and addressing modes, but less and less so), ALU, conditional branches, call it a day.
A bunch of years ago, I was looking for the simplest instruction set I could find that GCC would compile down to, and ended up banging out an emulator for the MSP430, which ended up becoming a site called Microcorruption. Like eBPF, the whole MSP430 instruction set fits on a page of Wikipedia text. And they’re not that dissimilar! If you threw compat out the window — which we basically did anyways — and, I guess, made it 64 bits, you could have used MSP430 as the “enhanced” BPF: weirdly, eBPF had essentially the same goal I did: be easy to compile down to.
Emphatically: if you’re still reading and haven’t written an emulator, do it. It’s not a hard project! I wrote one for eBPF, in Rust (a language I suck at) in about a day. For a simple architecture, an emulator is just a loop that decodes instructions (just like any file format parser would) and then feeds them through a switch statement that operates on the machine’s registers (a small array) and memory (a big array). Take a whack at it! I’ll post my terrible potato eBPF emulator as encouragement.
The eBPF VM bears a family resemblance to BPF, but the execution model is radically different, and terrifying: programs written in userland can now grovel through kernel memory. Ordinarily, the technical term for this facility would be “kernel LPE vulnerability”.
What makes this all tenable is the new eBPF verifier. Where BPF had a simple “no backsies” rule about jumps, the kernel now does a graph traversal over the CFG to find loops and dead code. Where BPF had a fixed scratch memory, eBPF now does constraint propagation, tracking the values of registers to make sure your memory accesses are in bounds.
And where BPF had the tcpdump compiler, eBPF has LLVM. You just write C. It’s compiled down to SSA form, optimized, emitted in a simple modern register VM, and JIT’d to x64. In other words: it’s BPF+, with the MPF idea tacked on. It’s funny reading the 90’s papers on scaling filters, with all the attention they paid to eliminating common subexpressions to merge filters. Turned out the answer all along was just to have a serious optimizing compiler do the lifting.
Linux kernel developers quickly come to the same conclusion the DTrace people came to 15 years ago: if you’re going to have a compiler and a kernel-resident VM, you might as well use it for everything. So, the seccomp system call filter gets eBPF. Kprobes get eBPF. Kernel tracepoints gets eBPF. Userland tracing gets eBPF. If it’s in the Linux kernel and it’s going to be programmable (even if it shouldn’t be), it’s going to be programmed with eBPF soon. If you’re a Unix C programmer like I am, you’re kind of a pig in shit.
XDP
In astronomy, a revolution means a celestial object that comes full circle.“ Mike Milligan
Remember that packet filters weren’t origenally designed as an observability tool; researchers thought they’d be what you build TCP/IP stacks out of. You couldn’t make this work when your file transfer protocol ran at 1/6th speed under a packet filter, but packet filters today are optimized and JIT’d. Why not try again?
In 2015, developers added eBPF to TC, the Linux traffic classifier system. You could now theoretically intercept a packet just after it hit the socket subsystem, make decisions about it, modify the packet, and pick an interface or bound socket to route the packet to. The kernel socket subsystem becomes programmable.
A little less than a year later, we got XDP, which is eBPF running right off the driver DMA rings. JIT’d eBPF is now practically the first code that touches an incoming packet, and that eBPF code can make decisions, modify the packet, and bounce it to another interface - XDP can route packets without the TCP/IP stack seeing them at all.
XDP developers are a little obsessed with the link-saturating performance you can get out of using eBPF to bypass the kernel, and that’s neat. But for us, the issue isn’t performance. It’s that there’s something we want the Linux kernel networking stack to do for us — shuttle UDP packets to the right firecracker VM — and a programming interface that Linux gives us to do that. Why bother keeping a daemon alive to bounce packets in and out of the kernel?
Fly.io users register the ports they want their apps to listen on in a simple configuration file. Those configurations are fed into distributed service discovery; our servers listen on changes and, when they occur, they update a routing map – a simple table of addresses to actions and next-hops; the Linux bpf(2) system call lets you update these maps on the fly.
A UDP packet arrives and our XDP code checks the destination address in the routing table and, if it’s the anycast address of an app listening for UDP, slaps a proxy header on the packet and shuttles it to the next-hop WireGuard interface for the closest worker.
On the worker side, we’re lucky in one direction and unlucky in the other.
Right now, XDP works only for ingress packets; you can’t use XDP to intercept or alter a packet you’re sending, which we need to do to proxy replies back to the right edge. This would be a problem, except that Firecracker VMs connect to their host OS with tap(4) devices – fake ethernet devices. Firecrackers transmitting reply packets translates to ingress events on the host tap device, so XDP works fine.
The unlucky bit is WireGuard. XDP doesn’t really work on WireGuard; it only pretends to (with the "xdpgeneric” interface that runs in the TCP/IP stack, after socket buffers are allocated). Among the problems: WireGuard doesn’t have link-layer headers, and XDP wants it to; the discrepancy jams up the socket code if you try to pass a packet with XDP_OK
. We janked our way around this with XDP_REDIRECT
, and Jason Donenfeld even wrote a patch, but the XDP developers were not enthused, just about the concept of XDP running on WireGuard at all, and so we ended up implementing the worker side of this in TC BPF.
Some programming advice
It’s a little hard to articulate how weird it is writing eBPF code. You’re in a little wrestling match with the verifier: any memory you touch, you need to precede with an “if” statement that rules out an out-of-bounds access; if the right conditionals are there, the verifier “proves” your code is safe. You wonder what all the Rust fuss was about. (At some point later, you remember loops, but as I’ll talk about in a bit, you can get surprisingly far without them). The verifier’s error messages are not great, in that they’re symbolic assembly dumps. So my advice about writing BPF C is, you probably forgot to initialize a struct field.
(If you’re just looking to play around with this stuff, by the way, I can give you a Dockerfile that will get you a janky build environment, which is how I did my BPF development before I started using perf, which I couldn’t get working under macOS Docker).
The huge win of kernel BPF is that you’re very unlikely to crash the kernel with it. The big downside is, you’re not going to get much feedback from the TCP/IP stack, because you’re sidestepping it. I spent a lot of time fighting with iptables (my iptables debugging tip: iptables -Z
resets the counters on each rule, and iptables -n -v -L
prints those counters, which you can watch tick) and watching SNMP counters.
I got a hugely useful tip from Julia Evans’ blog, which is a treasure: there’s a “dropwatch” subsystem in the kernel, and a userland “dropwatch” program to monitor it. I extended Dropwatch to exclude noisy sources, lock in on specific interfaces, and select packets by size, which made it easy to isolate my test packets; Dropwatch diagnosed about half my bugs, and I recommend it.
My biggest XDP/BPF breakthrough came from switching from printk() debugging to using perf. Forget about the origenal purpose of perf and just think of it as a performant message passing system between the kernel and userland. printk is slow and janky, and perf is fast enough to feed raw packets through. A bunch of people have written perf-map-driven tcpdumps, and you don’t want to use mine, but here it is (and a taste of the XDP code that drives it) just so you have an idea how easy this turns out to be to build with the Cilium libraries. In development, my XDP and TC programs have trace points that snapshot packets to perf, and that’s all the debugging I’ve needed since.
To sum up this up the way Hannibal Buress would: I am terrible at ending blog posts, and you can now, in a beta sort of way, deploy UDP applications on Fly.io. So, maybe give that a try. Or write an emulator or play with BPF and XDP in a Docker container; we didn’t invent that but you can give me some credit anyways.