Better linked-list traversal in BPF
Even relatively simple loops can be hard for the verifier to handle. To the human eye, a loop like this looks safe:
for (i = 1; i < 10; i++) do_something(i);
It can be hard, though, for the verifier (which is dealing with lower-level code for the BPF virtual machine) to know that nothing will reset the value of the iteration variable within the loop; without that assurance, it cannot verify that the loop will terminate as expected. Over the years, a number of helpers have been added to make this kind of iteration easier; they include the bpf_loop() function and generic iterators. This sort of bounded iteration is now relatively easy to do in BPF programs.
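The generic-iterator interface can be modeled in ordinary C. The sketch below mimics the kernel's bpf_iter_num_new(), bpf_iter_num_next(), and bpf_iter_num_destroy() helpers in userspace; the names mirror the real helpers, but these mock implementations are purely illustrative (the real ones live in the kernel and are tracked by the verifier):

```c
#include <stddef.h>

/* Userspace mock of the BPF numeric-iterator helpers; the real helpers
 * are kernel code whose state the verifier tracks. */
struct bpf_iter_num { int cur, end; };

static void bpf_iter_num_new(struct bpf_iter_num *it, int start, int end)
{
	it->cur = start;
	it->end = end;
}

/* Return a pointer to the next value, or NULL once the bound is hit. */
static int *bpf_iter_num_next(struct bpf_iter_num *it)
{
	static int v;	/* demo-only storage; not thread-safe */

	if (it->cur >= it->end)
		return NULL;
	v = it->cur++;
	return &v;
}

static void bpf_iter_num_destroy(struct bpf_iter_num *it)
{
	(void)it;	/* nothing to release in this mock */
}

/* Sum the values 1..9, the way a bounded BPF for-loop would. */
static int sum_bounded(void)
{
	struct bpf_iter_num it;
	int sum = 0, *v;

	bpf_iter_num_new(&it, 1, 10);
	while ((v = bpf_iter_num_next(&it)))
		sum += *v;
	bpf_iter_num_destroy(&it);
	return sum;
}
```

Because bpf_iter_num_next() can only return a bounded number of non-NULL results, the verifier can convince itself that any loop driven by it terminates.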
If one is iterating through a linked list, though, there is no loop variable that can bound the number of times the loop will run. There is no way for the verifier to know about the properties of a list that a program would like to traverse. If the list is circular, traversal could go forever. That prospect makes the verifier grumpy, forcing developers to engage in workarounds that make them even grumpier. When Alexei Starovoitov recently proposed a solution to this problem, he provided an example of the code needed (in current kernels) to go through a list stored in a BPF arena:
    for (struct bpf_iter_num ___it __attribute__((aligned(8),
                 cleanup(bpf_iter_num_destroy))),
             * ___tmp = (bpf_iter_num_new(&___it, 0, (1000000)),
                 pos = list_entry_safe((head)->first, typeof(*(pos)), member),
                 (void)bpf_iter_num_destroy, (void *)0);
         bpf_iter_num_next(&___it) && pos &&
             ({ ___tmp = (void *)pos->member.next; 1; });
         pos = list_entry_safe((void __arena *)___tmp, typeof(*(pos)), member))
Briefly, this construct creates a new generic iterator (the bpf_iter_num_new() call) set for a maximum of 1,000,000 iterations. The bpf_iter_num_next() call increments that iterator and forces an exit from the loop if it goes too high. The iterator is never expected to reach anything close to the maximum value; it exists only to reassure the verifier that something will force the loop to end at some point.
One might fairly conclude that this code is not pleasant to write, and even less pleasant to try to understand. But, as Starovoitov put it: "Unfortunately every 'for' in normal C code needs an equivalent monster macro". He initially proposed a solution (a function called bpf_can_loop()), but the shape of that solution changed fairly quickly.
As of the v6 patch set, the first step is to create a bit of infrastructure in the form of a new BPF instruction called may_goto. This instruction has some interesting semantics. If the kernel sees a may_goto instruction in a code block, it will automatically reserve space for an iteration count on the stack. Each execution of may_goto increments that count and compares it to a kernel-defined maximum; if that maximum is exceeded, a goto will be executed to a point just far enough ahead to insert another goto.
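The effect can be modeled in plain C. In this sketch (the function name and the DEMO_MAX_LOOPS limit are invented for illustration), the reserved stack slot is an explicit counter and the forward jump is an ordinary goto:

```c
/* Userspace model of may_goto semantics: a counter lives on the stack,
 * every execution of the check bumps it, and exceeding the limit forces
 * a forward jump out of the loop. DEMO_MAX_LOOPS stands in for the
 * kernel-defined maximum. */
#define DEMO_MAX_LOOPS 1000

static int loop_with_may_goto(void)
{
	unsigned long count = 0;	/* the slot the kernel reserves */
	int iterations = 0;

	for (;;) {
		if (++count > DEMO_MAX_LOOPS)	/* the may_goto check */
			goto out;		/* jump just past the loop */
		iterations++;			/* a body that never exits */
	}
out:
	return iterations;
}
```

Even though the loop body itself never terminates, the hidden counter guarantees an upper bound on the number of iterations, which is all the verifier needs.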
This instruction is used to create a macro called cond_break that turns into BPF code like this:
	may_goto l_break;
	goto l_continue;
    l_break:
	break;
    l_continue: ;
In words: the macro normally uses may_goto to cause (by way of a bit of a goto dance) a break to be executed when the loop count is exceeded. This macro could, in turn, be used in this sort of loop:
    for (ptr = first_item; ptr; ptr = ptr->next) {
	do_something_with(ptr);
	cond_break;
    }
The presence of cond_break (which uses may_goto) in the loop causes stack space to be set aside for an iteration count; the maximum is set to BPF_MAX_LOOPS, which is defined as 8*1024*1024 in current kernels. Each execution of cond_break checks the iteration count and forces an exit from the loop if the maximum is exceeded.
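The circuit-breaker behavior can be demonstrated with a userspace model; here cond_break is a simplified stand-in macro (the real macro expands to the may_goto sequence, with the counter living in verifier-reserved stack space), and the list is deliberately circular:

```c
#include <stddef.h>

/* Simplified stand-in for cond_break: bump an explicit counter and
 * break out of the enclosing loop once the cap is exceeded. */
#define DEMO_MAX_LOOPS (8 * 1024 * 1024)
#define cond_break if (++__loop_count > DEMO_MAX_LOOPS) break

struct node { struct node *next; };

/* Walk a (deliberately circular) list; return how many nodes were
 * visited before the circuit breaker tripped. */
static long walk(struct node *first)
{
	unsigned long __loop_count = 0;
	long visited = 0;

	for (struct node *ptr = first; ptr; ptr = ptr->next) {
		visited++;
		cond_break;
	}
	return visited;
}
```

On a well-formed (finite, acyclic) list the counter never reaches the cap and cond_break behaves as a no-op, which is exactly the contract described below.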
Should that forced exit ever happen, chances are good that something is going wrong. Either some sort of out-of-control loop has been created, or the list to process is too long and the traversal will not be completed as expected. But, again, in real programs, exceeding the loop count is not expected to ever happen. It exists only as a sort of circuit breaker to reassure the verifier that the loop is safe to run. Or, as Starovoitov put it:
In other words "cond_break" is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. So [a] bpf author can assume that cond_break is a nop if their program is well formed.
The promise of the BPF verifier — that it would be able to guarantee that BPF programs cannot harm the kernel — was always going to be hard to achieve without imposing significant limitations on developers. Much of the work on BPF over the years has been aimed at lifting some of those limitations, which have only become more onerous as the complexity of BPF programs has increased. As awkward as the new features may seem, they are less so than what came before.
Still, there is room for improvement. Starovoitov said that relying on loop counts was not the best approach, and that "the actual limit of BPF_MAX_LOOPS is a random number"; he suggested that the kernel may eventually implement a watchdog timer to simply interrupt programs that run for too long. That might remove some of the awkwardness, but would have some interesting implications; BPF programs are not written with the idea that they could be interrupted at an arbitrary point. Addressing that could take a while; in the meantime, there is cond_break. There do not seem to be objections to the changes, and the patch set has been merged into the bpf-next repository, so cond_break seems likely to show up in the mainline during the 6.9 merge window.
Index entries for this article:
    Kernel: BPF/Loops
    Kernel: Releases/6.9
Posted Mar 8, 2024 16:52 UTC (Fri)
by atnot (subscriber, #124910)
Allocating stuff together does make a big difference and you should definitely do it. But at that point you may as well just use the length and capacity values you have to store anyway to do a proper dynamic array. Then just swap elements in place or, if you must, have a separate table of indexes sorted by some other value, like in a hashtable. Or just use a well implemented tree or hash table outright. But really most of the time you don't need any of that.
Posted Mar 8, 2024 16:54 UTC (Fri)
by willy (subscriber, #9762)
https://www.infradead.org/~willy/linux/scan.c
That doesn't actually do linked-list-by-index-instead-of-pointer. Someone who's interested can make that modification. But it's about 10x as fast to walk an array of random pointers as it is to walk a list of random pointers, depending on your CPU.
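The difference comes down to dependent loads. The two traversals below (a hypothetical sketch, not the linked scan.c) compute the same sum, but the list walk cannot begin loading a node until the previous node's next pointer has arrived from memory, while the array walk derives every address from the index:

```c
#include <stddef.h>

struct obj { struct obj *next; long val; };

/* List walk: each load of ->next must complete before the following
 * node's address is known, so the loads serialize on memory latency. */
static long sum_list(struct obj *head)
{
	long sum = 0;

	for (struct obj *o = head; o; o = o->next)
		sum += o->val;
	return sum;
}

/* Array-of-pointers walk: objs[i] is computed from the index, so the
 * CPU can issue many of the loads in parallel. */
static long sum_array(struct obj **objs, size_t n)
{
	long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += objs[i]->val;
	return sum;
}
```

With random pointers, the second loop keeps several cache misses in flight at once, which is where the large speedup comes from.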
Posted Mar 8, 2024 17:48 UTC (Fri)
by epa (subscriber, #39769)
if (curr->next == MAGIC)
    ++curr;
else
    curr = curr->next;
If the linked list is laid out mostly contiguously in memory most of the time (perhaps not completely contiguously all of the time, because then you’d be using an array or vector instead), then the branch predictor should be able to learn that the first case is more common, and the code can run almost as fast as an ordinary loop over an array.
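Made concrete (using a pointer comparison rather than a MAGIC sentinel, and with invented names): nodes that sit next to each other in memory are advanced past with a simple increment, and the highly predictable branch keeps the pipeline full:

```c
struct pnode { struct pnode *next; int val; };

/* Walk a list whose nodes are mostly adjacent in memory: when the
 * successor immediately follows the current node, step by increment;
 * the branch predictor quickly learns that this is the common case. */
static int walk_sum(struct pnode *curr)
{
	int sum = 0;

	while (curr) {
		sum += curr->val;
		if (curr->next == curr + 1)
			++curr;			/* common, contiguous case */
		else
			curr = curr->next;	/* occasional stray node */
	}
	return sum;
}
```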
Posted Mar 8, 2024 18:14 UTC (Fri)
by mb (subscriber, #50428)
if (curr->next == curr + 1)
    ++curr;
else
    curr = curr->next;

You can just say curr = curr->next, because you need to load the next pointer anyway. But I don't get why that would help. I wouldn't be surprised if the compiler would optimize it into that.
Posted Mar 9, 2024 6:36 UTC (Sat)
by epa (subscriber, #39769)
If the flag to say “just increment” were included in a part of the data structure you have to load anyway, then it might save some memory access too.
Posted Mar 9, 2024 18:08 UTC (Sat)
by Baughn (subscriber, #124425)
This isn’t a new approach, by the way. LISP famously uses lists a lot, and LISP machines needed to process them quickly. Even with memory latency at the time being much lower relative to CPU speeds, laying them out contiguously in memory was a standard trick. One way to do that is to use a copying garbage collector, which if coded correctly will automatically defragment the lists when they’re copied.
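The defragmentation step can be sketched as a simple copying pass (invented names; a real copying collector also has to forward pointers held elsewhere): the chain is copied, in order, into a fresh contiguous block, after which every cell's successor is its physical neighbor:

```c
#include <stddef.h>

struct cell { struct cell *next; int val; };

/* Copy a list into a contiguous arena, preserving order, so that each
 * cell's successor becomes its neighbor, as a copying GC would arrange.
 * The arena must have room for every cell on the list. */
static struct cell *compact(const struct cell *head, struct cell *arena)
{
	struct cell *dst = arena;

	for (const struct cell *src = head; src; src = src->next) {
		dst->val = src->val;
		dst->next = src->next ? dst + 1 : NULL;
		dst++;
	}
	return arena;
}
```

After such a pass, the "mostly contiguous" layout discussed above falls out automatically.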
Posted Mar 8, 2024 18:59 UTC (Fri)
by atnot (subscriber, #124910)
* this goes beyond just performance implications but architectural decisions too. One of the "advantages" of linked lists is that list items rarely move once they get created. This inevitably leads to things like structures getting bloated with information really only needed by one user and people stashing long-lived pointers to objects on lists somewhere without really worrying about how long they'll stay valid for, which ossifies into the familiar pointer hell codebase where you're scared of changing anything because something might still be holding a pointer to it and 5% of your CPU time is refcounting. But I digress.
Posted Mar 9, 2024 15:51 UTC (Sat)
by adobriyan (subscriber, #30858)
Userspace has the luxury of the kernel doing the equivalent of vmalloc for its std::vector's, but the kernel has to deal with fragmentation.
Posted Mar 8, 2024 22:42 UTC (Fri)
by roc (subscriber, #30627)
> he suggested that the kernel may eventually implement a watchdog timer to simply interrupt programs that run for too long.
It's a long and winding path towards BPF being just a bad version of WebAssembly. (Bad because, among other reasons, it will be encrusted with all these different attempts to avoid just using a watchdog timer, which will have to be supported indefinitely.)
Posted Mar 9, 2024 0:04 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
Please send help.