Better linked-list traversal in BPF

By Jonathan Corbet
March 8, 2024

Before loading a BPF program, the kernel must verify that the program is safe to run; among other things, that verification includes ensuring that the program will terminate within a bounded time. That requirement has long made writing loops in BPF a challenging task. The situation has improved over the years for some types of loops, but others — including linked-list traversal — are still awkward in BPF programs. A new set of BPF primitives aims to make life easier for this use case through the installation of what can be seen as a sort of circuit breaker.

Even relatively simple loops can be hard for the verifier to handle. To the human eye, a loop like this looks safe:

    for (i = 1; i < 10; i++)
    	do_something(i);

It can be hard, though, for the verifier (which is dealing with lower-level code for the BPF virtual machine) to know that nothing will reset the value of the iteration variable in a loop, though; without that assurance, it cannot verify that the loop will terminate as expected. Over the years, a number of helpers have been added to make this kind of iteration easier; they include the bpf_loop() function and generic iterators. This sort of bounded iteration is now relatively easy to do in BPF programs.

If one is iterating through a linked list, though, there is no loop variable that can bound the number of times the loop will run. There is no way for the verifier to know about the properties of a list that a program would like to traverse. If the list is circular, traversal could go forever. That prospect makes the verifier grumpy, forcing developers to engage in workarounds that make them even grumpier. When Alexei Starovoitov recently proposed a solution to this problem, he provided an example of the code needed (in current kernels) to go through a list stored in a BPF arena:

    for (struct bpf_iter_num ___it __attribute__((aligned(8),
                                                  cleanup(bpf_iter_num_destroy))),
		* ___tmp = (bpf_iter_num_new(&___it, 0, (1000000)),
                    	pos = list_entry_safe((head)->first,
                                              typeof(*(pos)), member),
	                (void)bpf_iter_num_destroy,
		     	(void *)0);
	bpf_iter_num_next(&___it) && pos &&
            ({ ___tmp = (void *)pos->member.next; 1; });
        pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))

Briefly, this construct creates a new generic iterator (the bpf_iter_num_new() call) set for a maximum of 1,000,000 iterations. The bpf_iter_num_next() call increments that iterator and forces an exit from the loop if it goes too high. The iterator is never expected to reach anything close to the maximum value; it exists only to reassure the verifier that something will force the loop to end at some point.

One might fairly conclude that this code is not pleasant to write — and even less pleasant to try to understand. But, as Starovoitov put it: "Unfortunately every 'for' in normal C code needs an equivalent monster macro". He initially proposed a solution (a function called bpf_can_loop()), but the shape of that solution changed fairly quickly.

As of the v6 patch set, the first step is to create a bit of infrastructure in the form of a new BPF instruction called may_goto. This instruction has some interesting semantics. If the kernel sees a may_goto instruction in a code block, it will automatically reserve space for an iteration count on the stack. Each execution of may_goto increments that count and compares it to a kernel-defined maximum; if that maximum is exceeded, a goto will be executed to a point just far enough ahead to insert another goto.

This instruction is used to create a macro called cond_break that turns into BPF code like this:

    		 may_goto l_break;
   		 goto l_continue;
    l_break: 	 break;
    l_continue:  ;

In words: the macro normally uses may_goto to cause (by way of a bit of a goto dance) a break to be executed when the loop count is exceeded. This macro could, in turn, be used in this sort of loop:

    for (ptr = first_item; ptr; ptr = ptr->next)
    {
        do_something_with(ptr);
	cond_break;
    }

The presence of cond_break (which uses may_goto) in the loop causes stack space to be set aside for an iteration count; the maximum is set to BPF_MAX_LOOPS, which is defined as 8*1024*1024 in current kernels. Each execution of cond_break checks the iteration count and forces an exit from the loop if the maximum is exceeded.

Should that forced exit ever happen, chances are good that something is going wrong. Either some sort of out-of-control loop has been created, or the list to process is too long and the traversal will not be completed as expected. But, again, in real programs, exceeding the loop count is not expected to ever happen. It exists only as a sort of circuit breaker to reassure the verifier that the loop is safe to run. Or, as Starovoitov put it:

In other words "cond_break" is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. So [a] bpf author can assume that cond_break is a nop if their program is well formed.

The promise of the BPF verifier — that it would be able to guarantee that BPF programs cannot harm the kernel — was always going to be hard to achieve without imposing significant limitations on developers. Much of the work on BPF over the years has been aimed at lifting some of those limitations, which have only become more onerous as the complexity of BPF programs has increased. As awkward as the new features may seem, they are less so than what came before.

Still, there is room for improvement. Starovoitov said that relying on loop counts was not the best approach, and that "the actual limit of BPF_MAX_LOOPS is a random number"; he suggested that the kernel may eventually implement a watchdog timer to simply interrupt programs that run for too long. That might remove some of the awkwardness, but would have some interesting implications; BPF programs are not written with the idea that they could be interrupted at an arbitrary point. Addressing that could take a while; in the meantime, there is cond_break. There do not seem to be objections to the changes, and the patch set has been merged into the bpf-next repository, so cond_break seems likely to show up in the mainline during the 6.9 merge window.

Index entries for this article

Kernel BPF/Loops

Kernel Releases/6.9

Index entries for this article
Kernel	BPF/Loops
Kernel	Releases/6.9

Better linked-list traversal in BPF

Posted Mar 8, 2024 15:48 UTC (Fri) by atnot (subscriber, #124910) [Link] (10 responses)

It's interesting how this difficultly of analyzing linked list traversal in BPF mirrors the way CPUs struggle analyzing it. Which is a big reason why they are so disastrously slow and really shouldn't be used in any remotely performance sensitive context today. I guess the lesson is if BPF doesn't like your code, your cpu probably won't either :)

Better linked-list traversal in BPF

Posted Mar 8, 2024 16:25 UTC (Fri) by pj (subscriber, #4506) [Link] (9 responses)

Indeed, it's my understanding that it's more efficient for traversal to allocate an array of node records and then make a linked list based on array indices. This may also shrink the memory footprint because the array index (aka 'pointer') may be smaller (u16 or even u8) than a machine pointer.

Better linked-list traversal in BPF

Posted Mar 8, 2024 16:52 UTC (Fri) by atnot (subscriber, #124910) [Link]

If you're creating a dependency between the next address to access and the contents of the previous address, that's going to be bad regardless of whether you use indexes or pointers. BPF wouldn't like that either. You really just want a count and some data; that's trivial to analyze.

Allocating stuff together does make a big difference and you should definitely do it. But at that point you may as well just use the length and capacity values you have to store anyway to do a proper dynamic array. Then just swap elements in place or, if you must, have a separate table of indexes sorted by some other value, like in a hashtable. Or just use a well implemented tree or hash table outright. But really most of the time you don't need any of that.

Better linked-list traversal in BPF

Posted Mar 8, 2024 16:54 UTC (Fri) by willy (subscriber, #9762) [Link] (7 responses)

Demonstration code:

https://www.infradead.org/~willy/linux/scan.c

That doesn't actually do linked-list-by-index-instead-of-pointer. Someone who's interested can make that modification. But it's about 10x as fast to walk an array of random pointers as it is to walk a list of random pointers, depending on your CPU.

Better linked-list traversal in BPF

Posted Mar 8, 2024 17:48 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

Perhaps a useful optimization would be to define a magic pointer value which means “next one in memory”. You could even use null for that if you have some other way to signal end of list. Then your code would be like

if (curr->next == MAGIC) ++curr;
else curr = curr->next;

If the linked list is laid out mostly continuously in memory most of the time (perhaps not completely contiguous all of the time, because then you’d be using an array or vector instead) then the branch predictor should be able to learn the first case is more common and the code can run almost as fast as an ordinary loop over an array.

Better linked-list traversal in BPF

Posted Mar 8, 2024 18:14 UTC (Fri) by mb (subscriber, #50428) [Link] (2 responses)

You don't need magic for that:

if (curr->next == curr + 1) ++curr;
else curr = curr->next;

But I don't get why that would help.
You can just say curr = curr->next, because you need to load the next pointer anyway.

I wouldn't be surprised if the compiler would optimize it into that.

Better linked-list traversal in BPF

Posted Mar 9, 2024 6:36 UTC (Sat) by epa (subscriber, #39769) [Link] (1 responses)

You need to load the next pointer but the processor can learn that it’s usually the following address and load it speculatively. Whereas if you simply dereference it the processor might not be able to look so far ahead. That’s my thinking.

If the flag to say “just increment” were included in a part of the data structure you have to load anyway, then it might save some memory access too.

Better linked-list traversal in BPF

Posted Mar 9, 2024 18:08 UTC (Sat) by Baughn (subscriber, #124425) [Link]

Correct, so it would be faster on average, and a lot of the time the next location will be loaded whether or not it’s useful, regardless of list design — cache lines can only be loaded as a whole.

This isn’t a new approach, by the way. LISP famously uses lists a lot, and LISP machines needed to process them quickly. Even with memory latency at the time being much lower relative to CPU speeds, laying them out contiguously in memory was a standard trick. One way to do that is to use a copying garbage collector, which if coded correctly will automatically defragment the lists when they’re copied.

Better linked-list traversal in BPF

Posted Mar 8, 2024 18:59 UTC (Fri) by atnot (subscriber, #124910) [Link] (2 responses)

There's hash tables that do this. But I think in reality that's kind of just otherthinking it. Very few C programs use linked lists for good reasons and want or need any of the very few positive properties of them. It's mostly just used because it's the first thing you learn in school and perhaps more importantly the only datastructure that's easy to implement and use in C without macro hell. That and not nearly enough people tell programmers it's a bad idea.*

* this goes beyond just performance implications but architectural decisions too. One of the "advantages" of linked lists is that list items rarely move once they get created. This inevitably leads to things like structures getting bloated with information really only needed by one user and people stashing long-lived pointers to objects on lists somewhere without really worrying about how long they'll stay valid for, which ossifies into the familiar pointer hell codebase where you're scared of changing anything because something might still be holding a pointer to it and 5% of your CPU time is refcounting. But I digress.

Better linked-list traversal in BPF

Posted Mar 9, 2024 15:51 UTC (Sat) by adobriyan (subscriber, #30858) [Link] (1 responses)

How would you structure (pardon the repetition) data structures?

Userspace have a luxury of kernel doing equivalent of vmalloc for their std::vector's, but kernel have to deal with fragmentation.

Better linked-list traversal in BPF

Posted Mar 9, 2024 20:24 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

There's no real reason the kernel can't allocate its internal data in a contiguous virtual RAM. It's just that right now the kernel has one simple 1-to-1 mapping.

Better linked-list traversal in BPF

Posted Mar 8, 2024 18:01 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

So now they rediscovered runtime metering at backbranches, as WASM did years ago.

Better linked-list traversal in BPF

Posted Mar 8, 2024 22:42 UTC (Fri) by roc (subscriber, #30627) [Link]

Yes, and

> he suggested that the kernel may eventually implement a watchdog timer to simply interrupt programs that run for too long.

It's a long and winding path towards BPF being just a bad version of WebAssembly. (Bad because, among other reasons, it will be encrusted with all these different attempts to avoid just using a watchdog timer, which will have to be supported indefinitely.)

Better linked-list traversal in BPF

Posted Mar 8, 2024 22:47 UTC (Fri) by cytochrome (subscriber, #58718) [Link] (4 responses)

I was wondering when someone would bring up WASM.

Better linked-list traversal in BPF

Posted Mar 8, 2024 23:01 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Me too. It should have been done about 5 years ago.

Better linked-list traversal in BPF

Posted Mar 9, 2024 0:02 UTC (Sat) by cytochrome (subscriber, #58718) [Link] (1 responses)

Close your eyes, and tap your heels together three times, and think to yourself, 'There's no place like WASM…'

Better linked-list traversal in BPF

Posted Mar 9, 2024 0:04 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

OK, I did this. Now my kernel transformed into NetBSD with Lua modules.

Please send help.

Better linked-list traversal in BPF

Posted Mar 10, 2024 11:12 UTC (Sun) by ballombe (subscriber, #9523) [Link]

What ? You are bringing WASM in about every other kernel posts

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

Better linked-list traversal in BPF

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier! Saves Data!