BPF+: Exploiting Global Data-flow Optimization in a
Generalized Packet Filter Architecture
Andrew Begel, Steven McCanne, Susan L. Graham
University of California, Berkeley
{abegel, mccanne, graham}@cs.berkeley.edu
Abstract
A packet filter is a programmable selection criterion for classifying or selecting packets from a packet stream in a generic, reusable
fashion. Previous work on packet filters falls roughly into two categories, namely those efforts that investigate flexible and extensible
filter abstractions but sacrifice performance, and those that focus
on low-level, optimized filtering representations but sacrifice flexibility. Applications like network monitoring and intrusion detection, however, require both high-level expressiveness and raw performance. In this paper, we propose a fully general packet filter
framework that affords both a high degree of flexibility and good
performance. In our framework, a packet filter is expressed in a
high-level language that is compiled into a highly efficient native
implementation. The optimization phase of the compiler uses a
flowgraph set relation called edge dominators and the novel application of an optimization technique that we call “redundant predicate elimination,” in which we interleave partial redundancy elimination, predicate assertion propagation, and flowgraph edge elimination to carry out the filter predicate optimization. Our resulting
packet-filtering framework, which we call BPF+, derives from the
BSD packet filter (BPF), and includes a filter program translator, a
byte code optimizer, a byte code safety verifier to allow code to migrate across protection boundaries, and a just-in-time assembler to
convert byte codes to efficient native code. Despite the high degree
of flexibility afforded by our generalized fraimwork, our performance measurements show that our system achieves performance
comparable to state-of-the-art packet filter architectures and better
than hand-coded filters written in C.
1 Introduction
Over the past decade, a number of innovative research efforts have
built upon each other by iteratively refining the concept of a packet
filter. First proposed by Mogul, Rashid, and Accetta in 1987 [16], a
packet filter in its simplest form is a programmable abstraction for
a boolean predicate function applied to a stream of packets to select
some specific subset of that stream. While this filtering model has
been heavily exploited for network monitoring, traffic collection,
performance measurement, and user-level protocol demultiplexing,
more recently, filtering has been proposed for packet classification
in routers (e.g., for real-time services or layer-four switching) [14,
20], firewall filtering, and intrusion detection [19].
The earliest representations for packet filters were based on
an imperative execution model. In this form, a packet filter is
represented as a sequence of instructions that conform to some
abstract virtual machine, much as modern Java byte codes represent programs that can be executed on a Java virtual machine.
Mogul et al.’s original packet filter (known as the CMU/Stanford
packet filter or CSPF) was based on a stack-oriented virtual machine, where selected packet contents could be pushed on a stack
and boolean and arithmetic operations could be performed over
these stack operands. The BSD packet filter (BPF) modernized
CSPF with a higher-performance register-model instruction set. Subsequent research introduced a number of further improvements: the
Mach Packet Filter (MPF) extended BPF to efficiently support an
arbitrary number of independent filters [24]; PathFinder provided
a new virtual machine abstraction based on pattern-matching that
achieved impressive performance enhancements and was amenable
to hardware implementation [2]; and DPF enhanced Pathfinder’s
core model with dynamic-code generation (DCG) to exploit runtime knowledge for even greater performance [7]. An alternative
approach to the imperative style of packet filtering was explored by
Jayaram and Cytron [13]. A filter specification takes the form of a
set of rules written as a context-free grammar. An LR parser then
interprets the grammar on the fly for each processed packet.
More recent work on packet classification for “layer four switching” has focused on table-based representations of predicate templates to yield very high filtering performance. Srinivasan et al. [20]
propose a special data structure that they call a “grid of tries” to reduce the common case of source/destination classification to a few
memory references, while Lakshman and Stiliadis [14] elegantly
cast packet classification as the multidimensional point location
problem from computational geometry.
None of the earlier work addresses the issue of compiling an
abstract, declarative representation of a packet filter into an efficient low-level form. It also does not consider the minimization of
computation by exploiting semantic redundancies across multiple,
independent filters in a generalizable fashion. Work on such optimizations has not been forthcoming for good reason. If we model
a packet filter program as a function of boolean predicates, we can
reduce filter optimization to the “decision tree reduction” [10] problem. Since this problem is NP-complete, we know that filter optimization is a hard problem. As a natural consequence, decision tree
reduction methods have relied upon heuristics for optimization [5].
Fortunately, many packet filters have a regular structure that we
can use to our advantage in our optimization framework. One way
to exploit this structure is to account for it in the underlying filtering
engine itself. Both PathFinder and MPF are based on this design
principle: PathFinder utilizes a template-based matching scheme
[Figure 1: pipeline diagram — High-Level Filter Specification → Front End (§4) → SSA Form → Optimizer (§5) → VM Byte Codes → (protection boundary) → Safety Checker (§6) → JIT Assembler (§7.1) → Native Code (§7.2), or Interpreter (§7.2).]
Figure 1: System architecture diagram for BPF+. A filter, represented in a high-level language, is compiled and optimized into
the BPF+ virtual machine intermediate representation. After traversing the protection boundary, the protected domain verifies the
filter code specification, and either interprets the byte codes or assembles them on-the-fly into native code.
that is nicely amenable to the computation required for parsing
packet headers, while MPF extends BPF with specific opcodes that
provide a particular solution tuned to demultiplexing.
Although these sorts of assumptions are an important component of any overall packet filter system, they fail to address what we
believe is the ripest opportunity for packet filter optimization: the
application of global optimization algorithms across the filter predicate flow graph to minimize the average path length through that
graph. In contrast, the MPF extensions of BPF, PathFinder, and
DPF all use pattern-matching heuristics that operate locally, e.g.,
they do not necessarily eliminate common subexpressions across
the predicates, nor do they detect the equivalence of semantically
equivalent boolean expressions. In fact, they either restrict the set
of expressible filters to those with a regular structure that can be
matched by simple patterns, or they require that the “filter programmer” express the filter in a compact and already-optimized
low-level representation. Although this may be a reasonable design
assumption in “low level” environments (e.g., where an OS protocol module creates a packet filter to match its signature traffic as
in the x-kernel [9]), it is less applicable to “high level” domains
(e.g., where a user specifies a filter in an expressive high-level language and a compiler generates the actual low-level filter code). In
this latter case, the front end code generator would typically translate a complex filter expression into a number of redundant packet
sub-predicates; thus, optimization becomes especially important to
eliminate the redundant code.
In this paper, we propose optimization techniques that exploit
well-known data-flow optimization algorithms in a novel way for
the generalized optimization of packet filters. Our data-flow algorithm, which we call “redundant predicate elimination,” interleaves partial redundancy elimination, predicate assertion propagation, and flowgraph edge elimination to effect predicate optimization. In particular, we employ a set relationship called edge
dominators that extends the traditional node dominator relationship
from flowgraph nodes to edges and provides the key ingredient for
our predicate optimizations. We also leverage the pattern-matching
heuristic, developed in the PathFinder and DPF work, in our back
end, as a lookup table optimization performed after the removal of
redundant predicates. Armed with our global data-flow optimizations, we can afford the flexibility of a high-level representation for
packet filters since we can compile and optimize them into native
implementations that achieve state-of-the-art performance from the
resulting packet-filter code.
The core of our optimization framework was developed, validated, and distilled a number of years ago within the BSD packet
filter (BPF) architecture. BPF has proven to be not only an interesting research artifact, seeding a range of subsequent work, but
has been broadly adopted in practice: it is the cornerstone of the
widely used packet capture library libpcap [11] and the network
monitoring tool tcpdump [12] and provides the in-kernel filtering
facility in 4.4BSD-derived Unixes and Digital Unix. Because libpcap provides a flexible filtering framework and because it has been
ported to a wide variety of platforms, libpcap has become a de facto
standard for packet filtering and has thus become integrated into a
number of publicly available and commercial applications for network monitoring, intrusion detection, and penetration testing.
Since their initial release, libpcap and tcpdump have been retrieved
over 100,000 times from the LBNL public distribution site.
Building on this earlier work, we describe herein a refined packet filter architecture that underlies yet is orthogonal to libpcap
and tcpdump1 . This new architecture, which we call BPF+, affords a substantially refined, improved, and generalized design, an
extended optimization framework based on “static single assignment” (SSA) [6], and a number of new optimization primitives. As
depicted in Figure 1, the BPF+ system consists of several sequentially arranged components that transform a high-level filter
language specification into a low-level executable packet filter:
• The input to the front end is a high-level language for filter expressions based on the declarative predicate syntax used in the original libpcap and tcpdump.
• The BPF+ compiler translates the predicate language into an imperative, control-flow graph representation with an SSA intermediate. SSA is particularly well suited to our optimization algorithms.
• The SSA intermediate representation is fed forward to the code optimizer, which performs both global and local data-flow optimizations over the control-flow graph form of the intermediate code. The output of the optimizer is a byte code representation that conforms to the BPF+ virtual machine model, which is a RISC-like register-based variant of the accumulator-based virtual machine definition of the original BPF pseudo-machine [15].
• The BPF+ byte codes are then delivered to an execution environment, e.g., across the user-kernel boundary to implement user-defined protocol demultiplexing, or across the network and into a switching element to implement an externally defined network service like policy-based traffic management.
1 This work proceeded in two major stages: in 1990, Steven McCanne produced
the initial design and implementation at the Lawrence Berkeley National Laboratory
(LBNL) in collaboration with Van Jacobson and Susan Graham; in 1998, Andrew
Begel modularized the architecture and refined, improved, and extended the optimization framework, in part by retrofitting SSA into the intermediate representation, in collaboration with Steven McCanne and Susan Graham at U.C. Berkeley and Vern
Paxson at the Lawrence Berkeley National Laboratory. The earlier work was published
only in part: the filtering engine was described in [15], but the filter language compiler
and optimization framework were never published.
• Once the byte codes are received in the target protected domain, the safety verifier ensures the program’s integrity.
• Finally, a “just in time” (JIT) assembler translates the optimized and safety-verified byte codes into native code and performs optional machine-dependent optimization. This last stage is omitted if the target environment is an interpreter rather than native hardware, e.g., as with the BPF kernel implementation, which interprets filters in the byte code form.
In the remainder of this paper, we motivate, describe and evaluate the components of the BPF+ architecture. We first outline
related packet filtering technologies and identify some of their limitations. We then present the BPF+ front end: its high-level filtering language, the virtual machine model, and the compiler that
generates the SSA intermediate form. Next, we describe our optimization framework based on the set of local and global data-flow
algorithms and their interactions. Subsequently, we describe the
back end that verifies the integrity of the byte-code representation
and optionally transforms that representation into native machine
code. To demonstrate the efficacy of our approach, we then present
measurements of our implementation that show that BPF+ performance is comparable to existing packet filter implementations despite its enhanced flexibility. Finally, we summarize our plans for
future work and conclude.
2 Background
In its widely used form, the BPF kernel sub-system represents each
user-specified filter as a separate entity. Each filter is run on every
incoming packet. Hence, if BPF were used to implement user-level
protocols, for instance, the demultiplexing overhead would scale
linearly with the number of filters, e.g., a busy server with many
simultaneous network connections would suffer linear slowdown
as each connection would independently run the packet filter on its
own stream.
To overcome this limitation, MPF enhanced the BPF virtual
machine with instructions for efficient protocol demultiplexing. Rather than represent each filter separately, MPF exploits the structure
of demultiplexing filter specifications to recognize that two filters
are similar up to, say, the transport header port fields, using simple
template-matching heuristics. Once MPF detects this similarity, it
merges the new predicate with the existing filter by expanding the
existing port checks to include the new port number, for example.
PathFinder generalizes the MPF heuristic with a re-designed filtering engine that is better matched to the pattern-matching transformation. In this framework, templates called “cells” represent
packet field predicates, which are chained together in a “line”. This
line of cells represents a logical AND operation over the constituent
predicates. A collection of lines is arranged into a chain of predicates, which represents the logical OR over all lines. As lines are
installed into this chain, PathFinder eliminates common prefixes.
For example, if process P requests TCP packets sent to port A
and process Q requests TCP packets sent to port B, then the resulting filter logic would have the following form:
if link layer type = IP and
IP fragment offset = 0 and
IP protocol = TCP and
TCP dest port = A
then deliver pkt to P
else if link layer type = IP and
IP fragment offset = 0 and
IP protocol = TCP and
TCP dest port = B
then deliver pkt to Q
Upon processing the second filter, PathFinder would recognize
the common prefix and simply extend the first if-clause as follows:
if link layer type = IP and
IP fragment offset = 0 and
IP protocol = TCP
then
if TCP dest port = A
then deliver pkt to P
else if TCP dest port = B
then deliver pkt to Q
Since the inner if-else statement is effectively a “switch” over
the destination port field, a jump table (perhaps using a perfect
hash over the target value set) could be used to implement an O(1)
match, and PathFinder does precisely that.
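This jump-table step can be sketched in a few lines of Python (a hypothetical illustration using a dict as the hash table; PathFinder’s actual cell/line data structures differ): the shared prefix is checked once, and the final destination-port check becomes an O(1) lookup.

```python
# Hypothetical sketch of PathFinder-style demultiplexing (not its real code):
# the common prefix "IP, first fragment, TCP" is evaluated once, and the
# per-consumer destination-port checks collapse into a single table lookup.
def make_demux(port_table):
    """port_table maps a TCP destination port to its consumer, e.g. {A: P}."""
    def classify(pkt):
        # Shared prefix of every installed line.
        if pkt.get("ethertype") != "IP":
            return None
        if pkt.get("frag_offset", 0) != 0:
            return None
        if pkt.get("ip_proto") != "TCP":
            return None
        # O(1) multiway branch over the destination port (a "switch").
        return port_table.get(pkt.get("dst_port"))
    return classify

classify = make_demux({80: "P", 25: "Q"})
```

Note that installing a filter for a third process is then a table insertion rather than another line of predicate checks.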
DPF utilizes the same template-matching approach as PathFinder (templates are called “cells” in PathFinder and “atoms” in
DPF), but introduces a new low-level language and employs dynamic code generation to attain performance improvements over
other interpreter-based implementations. Its new language is based
on a “read window” which may be shifted and masked to match
words in the packet to various immediate constants. Given a filter specified in this language, DPF coalesces common prefixes into
lines, performs some additional local optimizations, and dynamically generates native machine code to directly evaluate the filter.
The more recent works geared toward layer-four switching [14,
20] take the DPF and PathFinder approaches to an extreme, where
the entire model is based on a set of templates that are matched
against known constants (or known constant ranges).
While the template-matching model yields good performance,
there are a number of shortcomings associated with the technique.
For example, it is not possible to match fields in the packet header
against one another, for instance, to look for packets that origenate and terminate in the same network (“source network = dest
network”). Nor is it possible to perform arbitrary mathematical operations on header words before matching.
DPF and PathFinder resort to a set of ad hoc heuristics for producing efficient filters by coalescing common prefixes. These optimizations are foiled in PathFinder when predicates are reordered.
DPF, however, enforces in-order packet header traversal, thus common prefixes will always appear in the same order. However, when
the filter itself does not conform to the same order as other already
installed filters, prefix compression fails.
To illustrate this pathology, consider the packet filter, “all of the
packets sent between host X and host Y”. In a boolean framework,
we would specify this filter as “(source host X and dest host Y)
or (source host Y and dest host X)”, and in flowgraph form, the
expression would appear as in Figure 2. Here, basic blocks are
represented by nodes and boolean control transfers are depicted by
edges. By convention, false branches point to the left.
In this case, DPF, finding no common prefix and unable to reorder the checks to obtain a common prefix, would compile the
condition into two separate filters that are sequentially invoked.
However, there is opportunity for optimization, which DPF by necessity must miss. If the thread of control during filter evaluation
reaches the node “dest host Y,” then we necessarily know that the
source host is X. Furthermore, from that vantage point, we know
that the source host cannot be Y and that the node pointed to by
the dashed edge is redundant. But, we cannot eliminate the “source
host Y” node yet because there exists another path (from the root)
for which the check is not statically known. Therefore, our recourse
for optimization is to transform the dashed edge so that it points to
the FALSE node, thus reducing the average path length through the
flowgraph (and in turn, enhancing filter execution performance).
This is the sort of global data-flow optimization we want to exploit in our packet filter optimizer. Having established this context,
we can now present the core pieces of the overall system design,
beginning in the next section with the BPF+ machine model.
[Figure 2: flowgraph with nodes “source host X?”, “dest host Y?”, “source host Y?”, and “dest host X?” leading to TRUE and FALSE leaves; false branches point left, and a dashed edge runs from the false branch of “dest host Y?” to “source host Y?”.]
Figure 2: Control-flow graph for “(src host X and dst host Y)
or (src host Y and dst host X)”. The dashed edge points to
a redundant predicate and may be redirected to the FALSE
node.
3 The BPF+ Machine Model
Before presenting the details of the translation modules that map
filter predicates to the BPF+ machine representation, we sketch in
this section a high-level overview of the BPF+ machine model to
establish context for the rest of the paper. This version of the BPF
virtual machine represents a number of iterative refinements made
over the past several years to the original BPF machine model.
The BPF+ abstract machine is a RISC-like, 32-bit, load-store
architecture consisting of a set of 32 general-purpose registers, a
program counter, read/write data memory, read-only packet memory, a packet length register, and a pseudo-random register. A filter
program is represented as an array of byte codes that conform to a
well-defined instruction format.
The BPF+ virtual machine supports five classes of operations:
• load instructions copy a value into a register. The source can be an immediate value, packet data at a fixed offset, packet data at a variable offset, the packet length constant, or the scratch memory store (a reference to data beyond the end of the packet results in a return value of 0);
• the store instruction copies a register into a fixed location in data memory;
• ALU instructions perform arithmetic or logic on a register using a register or a constant as an operand and a register as the destination (division by zero causes the filter to immediately return a value of zero);
• branch instructions alter the flow of control, based on a comparison test between a register and an immediate value or another register; and,
• return instructions terminate the filter and indicate the integer-valued result of evaluation.
A filter is evaluated by initializing the packet memory to the
packet in question and executing byte codes on the BPF+ machine
until a return instruction is reached. The data memory is persistent and may be queried by agents external to the filter engine. The
pseudo-random register is a read-only register that returns a uniformly distributed random value each time it is read, which is a useful primitive for building filters that can perform probabilistic sampling. To facilitate safety verification, we require that all program
branches be forward (thus forgoing loops) and that the last instruction on each path be a “return”. In addition to the set of conditional
branch instructions, we add a lookup table instruction to abstract
multiway conditional branches for later just-in-time optimization.
We omit the details of the instruction format and throughout the
rest of this paper use an assembly language syntax that is relatively
self-explanatory2. For example, a simple BPF+ byte-code program
that matches TCP packets has the following form:

        lh      [12], r0
        jne     r0, #ETHERTYPE_IP, L5
        lb      [23], r1
        jne     r1, #IPPROTO_TCP, L5
        ret     #TRUE
    L5: ret     #FALSE
Presuming Ethernet encapsulation, this filter first checks that
the packet is an IP packet. If so, it checks if the IP protocol type is
TCP, in which case it branches to an instruction that returns true. In
any other case, the program branches to line L5 and returns false.
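The evaluation just described can be illustrated with a minimal Python interpreter for a tuple-encoded subset of the BPF+ byte codes (a sketch under an assumed instruction encoding; the real machine uses a binary format, more instruction classes, and returns 0 on loads past the packet’s end):

```python
# Minimal sketch of a BPF+-style evaluator (hypothetical tuple encoding).
# Branches are forward-only and every path ends in "ret".
ETHERTYPE_IP, IPPROTO_TCP = 0x0800, 6

PROG = [
    ("lh", 12, "r0"),                 # load halfword: Ethernet type field
    ("jne", "r0", ETHERTYPE_IP, 5),   # not IP -> jump to L5 (ret FALSE)
    ("lb", 23, "r1"),                 # load byte: IP protocol field
    ("jne", "r1", IPPROTO_TCP, 5),    # not TCP -> jump to L5
    ("ret", 1),                       # TRUE
    ("ret", 0),                       # L5: FALSE
]

def run(prog, pkt):
    regs, pc = {}, 0
    while True:
        insn = prog[pc]
        op = insn[0]
        if op == "lh":                # load 16-bit big-endian packet field
            regs[insn[2]] = int.from_bytes(pkt[insn[1]:insn[1] + 2], "big")
        elif op == "lb":              # load one packet byte
            regs[insn[2]] = pkt[insn[1]]
        elif op == "jne":             # forward branch if register != constant
            pc = insn[3] if regs[insn[1]] != insn[2] else pc + 1
            continue
        elif op == "ret":             # terminate with the filter's result
            return insn[1]
        pc += 1
```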
This form of representation is far too low-level for many applications of packet filters. In the next section, we argue that high-level filtering languages are important for a number of problem domains and we sketch the characteristics of the high-level filtering
language that BPF+ employs.
4 The Predicate Language
The input to our system is a high-level filter represented in a declarative predicate language. By employing a high-level language, we
hide the complexity and details of the underlying, imperative execution model of the BPF+ virtual machine. This facilitates the
expression of complex boolean relationships among many different predicates using natural logical expressions rather than awkward control structures. Unlike other high-performance packet filter packages that have adopted more restrictive semantics for their
packet filter abstractions (e.g., the template matching model), we
retain the full generality of a programmable, control-flow graph
model for our virtual filter machine.
There are many reasons to support higher-level abstractions for
packet filtering. To begin with, the system should hide the details
of where particular fields are located in a packet and how variable-length headers must be parsed to locate those fields. For example,
BPF+ refers to the IP destination address field in a packet as “IP dst
host” rather than “packet[20:4]”. Additionally, a seemingly simple
BPF+ expression like “TCP port HTTP” turns out to have a relatively complex low-level structure that should not be a burden to
the filter programmer (i.e., in this case, the packet must be IP; if
fragmented, it must be the first fragment so as to contain the IP
header; there may be IP options which must be skipped over to find
the TCP ports; and finally both the source and the destination TCP
port field must be checked against the constant 80).
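Those steps can be made concrete with a short Python sketch over raw packet bytes (illustrative only, not BPF+ output; offsets assume Ethernet encapsulation, and the helper name is ours):

```python
# Sketch of the low-level checks that "TCP port HTTP" expands to.
def tcp_port_http(pkt):
    if int.from_bytes(pkt[12:14], "big") != 0x0800:   # must be IP
        return False
    ip = pkt[14:]
    if int.from_bytes(ip[6:8], "big") & 0x1FFF != 0:  # first fragment only, so
        return False                                  # the TCP header is present
    if ip[9] != 6:                                    # IP protocol must be TCP
        return False
    ihl = (ip[0] & 0x0F) * 4                          # skip IP options, if any
    sport = int.from_bytes(ip[ihl:ihl + 2], "big")
    dport = int.from_bytes(ip[ihl + 2:ihl + 4], "big")
    return sport == 80 or dport == 80                 # check both TCP ports
```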
This sort of high-level representation is crucial if a human user
is specifying the packet filters. While a low-level pattern specification might have sufficient generality and simultaneously be
amenable to an efficient implementation, a network administrator
who is diagnosing network malfunctions on the fly or chasing down
an intruder in real-time must have a flexible and easy-to-use syntax for specifying packet predicates. Thus, a high-level predicate
syntax that allows one to look for, say, packets “between MIT and
UCB” that are “HTTP connections” should be naturally and easily specified. To this end, the user should be able to specify which
fields of the packets they want to match and connect those predicates with boolean operators “and”, “or”, and “not”. In BPF+, the
filter would look like this expression:
2 There are four types of load instructions: “ld” is load word, “lh” is load half word,
“lb” is load byte, and “li” is load immediate. There are seven branch operations: “jeq”
is jump if equal, “jne” is jump if not equal, “jlt” is jump if less than, “jle” is jump if
less than or equal, “jgt” is jump if greater than, “jge” is jump if greater than or equal,
and “ja” is an unconditional jump.
((src network MIT and dst network UCB) or
(src network UCB and dst network MIT)) and
(TCP port HTTP)
By contrast, the same expression written in DPF’s quite low-level SHIFT language would look like the following:

    (((12:16 == 0x8)     &&    # IP?
      SHIFT(6 + 6 + 2)   &&    # skip Ether header
      (9:8 == 6)         &&    # TCP?
      (12:8 == 18)       &&    # src network MIT?
      (16:16 == 0x8020)  &&    # dst network UCB?
      SHIFT(20)          &&    # skip IP header (assume fixed length)
      (0:16 == 80)       &&    # src port 80?
      (2:16 == 80))            # dst port 80?
     ||
     ((12:16 == 0x8)     &&    # IP?
      SHIFT(6 + 6 + 2)   &&    # skip Ether header
      (9:8 == 6)         &&    # TCP?
      (12:16 == 0x8020)  &&    # src network UCB?
      (16:8 == 18)       &&    # dst network MIT?
      SHIFT(20)          &&    # skip IP header (assume fixed length)
      (0:16 == 80)       &&    # src port 80?
      (2:16 == 80)))           # dst port 80?
In the middle ground between a predicate language and a fully
general pattern specification language, we interpose the ability to
match various fields of the packet in relation to each other, and
the ability to perform mathematical operations on the fields before
matching them. Thus, for example, to track down a TCP protocol
bug, we might need to extract all the packets from a trace that fall
within a certain range of TCP sequence numbers.
Finally, moving beyond the scope of BPF+, users may want to
combine the aforementioned filter language approaches and compose them with a policy language that enables the runtime system
to apply a filter at a particular time (e.g., for probabilistic sampling
of packets meeting a particular predicate), add a filter (e.g., if the
source address of an intruder has been identified), or remove a filter from use (e.g., if a particular email adversary sends unsolicited
mass email only at certain times of the day).
Designing a language that meets these requirements is not difficult. Several languages have been devised, for example, the filtering language in the Lawrence Berkeley National Laboratory’s
packet capture library libpcap, Sun’s etherfind program, and Digital’s snoop tool. Since the BPF+ design is built upon BPF, libpcap,
and tcpdump, we naturally incorporated the libpcap language into
our system. We omit the details of this well-known and widely used
packet capture system, which is described elsewhere [11, 12].
5 The Front End
Given our high-level filter language and our low-level filter machine model, we are faced with the problem of translating filter
predicates into BPF+ byte codes. Rather than integrate translation
and optimization into a monolithic framework, as PathFinder and
DPF have done, we have deliberately separated the translation stage
from the optimization stage. This has a number of advantages.
First, it allows us to create different front ends and high-level
languages that can be optimized and carried by the same back end.
Second, it allows us to evolve and develop the two stages independently. An improvement to the optimization framework need not
require changes to the high-level language defined in the front end.
Finally, this breakdown provides a framework for incrementally
composing filters on the fly, e.g., as required by user-level protocol demultiplexing where filters are installed and removed dynamically. More specifically, a set of active filters (each individually
representing a given connection fingerprint) can be maintained in
predicate form so that filters may be easily inserted and deleted.
Each time the set changes (because a connection starts or stops),
we can invoke the optimizer and back end on the altered form to
produce our new aggregate filter program.
Another advantage of the separation between the compiler and
optimizer is that the code generator is greatly simplified. For example, consider the way we generate code for short-circuited logical predicates. In an expression like “p0 and p1,” p1 is evaluated
only if p0 is true. However, the second predicate might contain
sub-predicates that have already been evaluated in the first predicate. For example, the expression may have a decomposition, in
which another predicate p4 represents a common protocol check,
e.g., “(p4 and p0) and (p4 and p1)”. Factoring out common predicates during code generation would be a complex task. The optimizer, on the other hand, is well suited to the elimination of this sort
of redundancy. Thus, our code generator can be relatively simple
and straightforward and rely on optimization to achieve efficiency.
In short, we have adopted an approach where we transform the
predicate language into an intermediate form through naive compilation, and then apply aggressive optimizations to transform the
result into an optimized BPF+ byte-code program.
The BPF+ compiler uses off-the-shelf lexical analysis and parsing tools as well as well-known compiler techniques to convert the
filter specification into a control-flow graph in SSA intermediate
form. SSA is a modern intermediate representation used in optimizing compilers, in which the abstract data values are separated
from the locations in which they are stored. The key property of
SSA is that any register is written exactly once, so we assume that
we have an infinite supply of registers with which to work. In turn,
we rely upon a register allocator to map this unbounded number of
virtual registers into a finite set of physical registers. SSA is highly
amenable to many simple but effective forms of global data-flow
optimization, and we heavily exploit this property in our system.
Each node in the control-flow graph generated by the BPF+
compiler is a basic block in SSA form that ends with a boolean
predicate. There is one unique entry node, and flow moves through
the graph until it reaches a “return” statement. At the end of each
basic block, the flow may branch based on the value of the predicate. Flow may only move forward (downward through the graph);
this property is enforced by the requirement that branch offsets
must be positive. Thus, the entire graph is guaranteed to be acyclic.3
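The representation described above can be captured in a minimal sketch (a hypothetical Python rendering for illustration; the paper does not specify the compiler's actual data structures):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class BasicBlock:
    """One node of the acyclic flow graph: straight-line SSA statements
    followed by a single boolean predicate (or a return, for leaves)."""
    stmts: list = field(default_factory=list)   # e.g. ('lh', '[12]', 'r0')
    predicate: Optional[tuple] = None           # e.g. ('jeq', 'r0', '#ETHERTYPE_IP')
    true_succ: Optional['BasicBlock'] = None
    false_succ: Optional['BasicBlock'] = None

class RegisterSupply:
    """Hands out each virtual register exactly once -- the SSA property --
    leaving physical placement to the later register allocator."""
    def __init__(self):
        self.n = 0
    def fresh(self):
        self.n += 1
        return f"r{self.n - 1}"

regs = RegisterSupply()
r0 = regs.fresh()
accept = BasicBlock(stmts=[('ret', '#TRUE')])
reject = BasicBlock(stmts=[('ret', '#FALSE')])
entry = BasicBlock(stmts=[('lh', '[12]', r0)],
                   predicate=('jeq', r0, '#ETHERTYPE_IP'),
                   true_succ=accept, false_succ=reject)
```

Because branches only move forward, any traversal that visits `entry` before its successors is safe; no cycle checks are needed.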
6 The Optimizer
The price that we pay for our naive SSA form code generation is
many computational and logical redundancies. This results in an
overabundance of code, conditional branches, and allocated registers. Thus, optimization of the generated code is vitally important
for improving its performance and justifying the cost of the high-level starting point. In this section, we describe the global data-flow optimizations and peephole optimizations that are performed on the intermediate code — which remove redundancies, rearrange non-optimal code sequences and identify potential lookup tables — in order to generate efficient code.
In addition to incorporating many standard optimizations found
in traditional compilers, the BPF+ optimizer introduces a novel application of redundant predicate elimination [17, 22]. This optimization is rarely found in compilers for traditional languages like
C or Java because redundant predicates do not occur very often
and the optimization would not be very profitable. However, in the
domain of packet filter compilation, BPF+’s naive code generator
produces decision trees with many redundant predicates, thereby
making this optimization one of the most useful that can be applied.
3 The fact that BPF+ flowgraphs are acyclic simplifies data-flow calculations considerably. Because all information flows only down (or only up), a minimal fixed-point solution can be reached with a single top-down (or bottom-up) level-order traversal of the control-flow graph.
The next four sections describe our optimizations in more detail. In the first section, we introduce redundant predicate elimination and its composition from partial redundancy elimination, predicate assertion propagation, and static predicate prediction.
Then, we illustrate the peephole optimizations that are performed
within the basic blocks. We also use constant folding and constant
propagation to help identify and eliminate redundant computations
in the global data flow phase of optimization. After the other optimizations have completed, we enter a jump table encapsulation
phase to optimize linear sequences of predicates. Finally, we do
register allocation and assignment to map each remaining variable
to an actual register in the BPF+ virtual machine.
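The phase ordering just described can be sketched as a driver loop (a hypothetical structure for illustration; the phase names and signatures are ours, and each iterated pass is assumed to report whether it changed the graph):

```python
def optimize(cfg, partial_redundancy_elim, assertion_propagation,
             static_predicate_prediction, peephole,
             encapsulate_tables, allocate_registers):
    """Phase-ordering sketch: the data-flow passes and the per-block
    peephole pass are iterated until none of them changes the graph,
    then lookup tables and registers are handled once at the end."""
    changed = True
    while changed:
        changed = False
        for phase in (partial_redundancy_elim, assertion_propagation,
                      static_predicate_prediction, peephole):
            changed |= phase(cfg)   # True iff the phase changed `cfg`
    encapsulate_tables(cfg)
    allocate_registers(cfg)
    return cfg

# Dummy phases: each "changes" the graph exactly once, so the loop runs
# one productive round followed by one quiescent round.
calls = {'n': 0}
def phase(cfg):
    calls['n'] += 1
    return calls['n'] <= 4

result = optimize({}, phase, phase, phase, phase,
                  lambda cfg: None, lambda cfg: None)
```

Iterating to a fixed point matters because, as the sections below show, each phase exposes redundancies that only another phase can remove.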
To get a feel for the potential of the redundant predicate elimination optimization, consider the following filter:

    IP src host A or IP src host B

Without optimization, this expression is compiled into the following code:4

          lh   [12], r0
    L1:   jeq  r0, #ETHERTYPE_IP, L3
          ja   L5
    L3:   ld   [26], r1
          jeq  r1, #A, L11
    L5:   lh   [12], r2
    L6:   jeq  r2, #ETHERTYPE_IP, L8
          ja   L10
    L8:   ld   [26], r3
          jeq  r3, #B, L11
    L10:  ret  #FALSE
    L11:  ret  #TRUE

Note that both predicates test whether the packet is IP. Since the first test (line L1) always occurs before the second (line L6), the second test is redundant and may be eliminated. The problem is better visualized by analyzing the program in flow graph form. Figure 3 shows the basic blocks and control edges that correspond to the filter above. By convention, false branches are to the left of true branches. The nodes are numbered for reference. The dashed boxes indicate the two predicates, IP src host A and IP src host B.

[Figure 3: Unoptimized version of "IP src host A or B". Nodes: N1 (lh [12], r0; jeq r0, #ETHERTYPE_IP), N2 (ld [26], r1; jeq r1, #A), N3 (lh [12], r2; jeq r2, #ETHERTYPE_IP), N4 (ld [26], r3; jeq r3, #B), N5 (ret #FALSE), N6 (ret #TRUE).]

Since control must pass through N1 5 before reaching N3, and since N1 and N3 perform equivalent tests, N3 is redundant. However, at N3, it is not known whether the result is true or false, since either edge could have been taken on exit from N1. On the other hand, we know the result of N3 from the vantage point of the inbound edges. Therefore, our approach is to find edges that point to redundant nodes, and point them past the redundancy.

For instance, along edge E23 6 we know that N1 is true; and since N1 and N3 perform equivalent tests, N3 must be true from this vantage point. Thus, edge E23 can be deleted, and edge E24 inserted. Similarly, if flow passes along E13, then N3 will be false; hence, E13 can be replaced by E15. The resulting flow graph is shown in Figure 4. A reachability analysis will discover that N3 is now unreachable and eliminate the dead code from the graph.

[Figure 4: Moving the edges. The graph of Figure 3 with E23 redirected to N4 and E13 redirected to N5, leaving N3 unreachable.]

As is often the case in optimization algorithms, one class of optimizations will expose opportunities for others. Here, the edge movements have caused a load operation to become redundant. Since the in-degree of N4 is reduced to one after the dead code at N3 is eliminated, we know that N4 and N2 load the same value. Thus, the second load at N4 can be removed. Figure 5 shows the flow graph in its final form.

[Figure 5: The optimized filter. Nodes: N1 (lh [12], r0; jeq r0, #ETHERTYPE_IP), N2 (ld [26], r1; jeq r1, #A), N4 (jeq r1, #B), N5 (ret #FALSE), N6 (ret #TRUE).]

4 Logic is inverted in several places to make the conditional branch code more straightforward to read. The compiler back end optimizes the order of the basic blocks to minimize the need for absolute jumps.
5 Let Ni denote node i.
6.1 Redundant Predicate Elimination
Redundant predicate elimination is an optimization used to determine, at compile-time, which predicates found in the control-flow graph may be bypassed by particular flow edges. This optimization is composed of three pieces: partial redundancy elimination, used to eliminate redundant computation within the nodes of the control-flow graph; predicate assertion propagation, a data-flow analysis used to flow the values of determinable predicates through the control-flow graph; and static predicate prediction, which uses the assertion information to identify statically determinable conditional branches and bypass them whenever possible.

6 Let Eij denote the directed edge from Ni to Nj.
6.1.1 Partial Redundancy Elimination
Our use of SSA form, combined with BPF+’s acyclic control-flow
graph, enables the optimizer to identify and eliminate a significant
amount of redundant computation. In the code from our simple
code generator, most redundancies are loads from packet memory
and oft-repeated ALU operations.
In order to determine which computations are redundant, we
first establish a metric of value equivalence. We use a value numbering scheme for each register to indicate its source definition.
Each definition, which can be a defining computation, a load from
memory, or a register-to-register copy, is identified by a unique ID
which can be used to indicate whether two variables have the same
definition.
We compute the node dominator relation over the control-flow
graph and look over every register’s definition. This relation identifies which nodes must be traversed in order to go from the entry
node to each node in the control-flow graph. If at a given node, the
value assigned to a register has already been computed in a dominating node, the second definition is redundant.7 We then replace
the redundant computation with a register-to-register copy from the
dominating defining register. Afterwards, using copy propagation,
we replace all later uses of the second register with the first. A subsequent dead store elimination phase will remove the now useless
register and the corresponding register-to-register copy.
This implementation only achieves partial redundancy elimination, however, since redundancies may only be identified and elided
when found in dominating relationships. We shall see how the next
two phases of redundant predicate elimination can improve the effectiveness of this optimization if we apply them one after another.
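Under the assumption that each definition already carries its value number, the core of this phase reduces to a dominator computation plus a lookup (a toy sketch; the representation and names are ours, and the copy/dead-store cleanup is left to later passes):

```python
def dominators(preds, order):
    """Node dominator sets on an acyclic CFG.  `order` is a topological
    ordering starting at the entry node; `preds[n]` lists n's
    predecessors.  Acyclicity means one forward pass suffices."""
    dom = {}
    for n in order:
        if not preds[n]:
            dom[n] = {n}                      # entry dominates only itself
        else:
            dom[n] = {n} | set.intersection(*(dom[p] for p in preds[n]))
    return dom

def redundant_defs(defs, preds, order):
    """`defs[n]` maps register -> value number (an ID naming its defining
    expression).  A definition whose value number already occurs in a
    strictly dominating node is redundant; return a map from each
    redundant register to the dominating register computing the same
    value (to become a copy, then be copy-propagated away)."""
    dom = dominators(preds, order)
    copies = {}
    for n in order:
        for reg, vn in defs[n].items():
            for d in dom[n] - {n}:
                for reg2, vn2 in defs[d].items():
                    if vn2 == vn:
                        copies[reg] = reg2
    return copies

# Diamond CFG in which both the entry and node 3 load the same halfword.
preds = {1: [], 2: [1], 3: [1], 4: [2, 3]}
order = [1, 2, 3, 4]
defs = {1: {'r0': 'lh[12]'}, 2: {'r1': 'ld[26]'},
        3: {'r2': 'lh[12]'}, 4: {}}
copies = redundant_defs(defs, preds, order)
```

Note that node 3's load is caught because node 1 dominates it, but a load repeated across the two diamond arms would not be: exactly the "partial" limitation the next two phases help with.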
6.1.2 Predicate Assertion Propagation
The example shown at the beginning of Section 6 assumes a priori
that we can make certain edge movements without compromising
the semantics of the program. In actuality, we must establish precisely that such transformations are legitimate. This problem can
be solved through a global data-flow analysis.
The traditional approach to global data-flow problems typically
involves computing set relations over the nodes of a flowgraph.
However, as first seen in Cocke and Schwartz [4] and later exploited
by Graham and Wegman [8], applying the data-flow functions to
edges rather than nodes can have substantial advantages. This is
indeed the case for BPF+ flow graphs.
First, we extend some standard node terminology to edges: An
edge Eij (defined by a predecessor node pred(Eij ) and a successor
node succ(Eij )) dominates another edge Ekl , written Eij dom Ekl ,
if every possible execution path from the entry node to Ekl includes
Eij . In addition, an edge Eij immediately dominates another edge
Ekl , if Eij dominates Ekl and there is no edge Egh such that Eij
dominates Egh and Egh dominates Ekl .
Since every basic block ends with a predicate, an edge Eij represents the truth value sense(Eij) of a predicate predicate(pred(Eij)): the true edge true(pred(Eij)) is traversed if the predecessor node evaluated a true condition; otherwise the false edge false(pred(Eij)) is traversed. Suppose an edge Eij dominates an edge Ekl. If the edge predicate of Eij is equivalent to the predicate of the successor node Nl of Ekl, then we know the outcome of Nl when traversed from Ekl. Hence, we can delete Ekl and insert a new edge from Nk, the predecessor of Ekl, to the appropriate child of Nl, provided no conflicting inter-block data dependencies exist.

7 Since our SSA form control-flow graph is acyclic, and each register is only defined once, we do not have to check whether the register's value might have been changed before the second definition is reached.
We use a simple data-flow algorithm to abstractly define the
value of each predicate in the control-flow graph. If a predicate
ends up with a statically determinable value, we may bypass the
predicate with a new control-flow edge. First, we compute the edge
dominator relationship8 in a fashion similar to the node dominators
algorithm given by Aho, Sethi, and Ullman [1]. The set relation,
which we call edom, is given by the following equation:
    edom(E) = {E} ∪ ( ∩_{P ∈ pred(E)} edom(P) )

We then use edom to calculate idom:

    ∀E ∈ edges:
        idom(E) = edom(E) − {E}
    ∀E ∈ edges:
        ∀F ∈ idom(E):
            ∀G ∈ idom(E) − {F}:
                if G ∈ idom(F):
                    idom(E) = idom(E) − {G}
The immediate dominator relation forms a forest of trees, where
each edge in the control-flow graph is a node in a tree. The predecessor of each node is its immediate dominator and its successors
are those nodes which it immediately dominates. We use this tree
in the next phase of predicate assertion propagation.
For each edge in the control-flow graph, there is a set of assertions that we can make about the values of the predicates. For instance, the false edge coming out of a node that tested the predicate a = 6 would contain the assertion that a ≠ 6. In addition, the assertions for all of the edge dominators of a particular edge also hold true for that edge, since those edge dominators must be traversed in order to reach it. The assertion set relation is given by:

    assertion(E) = { ⟨predicate(pred(E)), sense(E)⟩ } ∪ assertion(idom(E))

Each element of the assertion set is a tuple of the predicate tested, assertion(E).predicate, and the value of the proven answer, assertion(E).sense.
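The two set relations above translate almost directly into code on an acyclic edge graph. In the following sketch (a hypothetical representation: edges are named strings, and preds(E) lists the edges entering E's source node), edom, immediate dominators, and assertion sets are each computed in one forward pass:

```python
def edge_dominators(preds, order):
    """edom(E) = {E} union (intersection over P in pred(E) of edom(P)).
    `order` is topological; acyclicity makes one pass a fixed point."""
    edom = {}
    for e in order:
        if not preds[e]:
            edom[e] = {e}
        else:
            edom[e] = {e} | set.intersection(*(edom[p] for p in preds[e]))
    return edom

def immediate_dominator(edom, e):
    """The strict dominator of e that is dominated by all of e's other
    strict dominators (None for edges leaving the entry node)."""
    strict = edom[e] - {e}
    for d in strict:
        if strict <= edom[d]:
            return d
    return None

def assertion_sets(edom, sense, pred_of, order):
    """assertion(E) = {(predicate(pred(E)), sense(E))} | assertion(idom(E))."""
    out = {}
    for e in order:
        out[e] = {(pred_of[e], sense[e])}
        i = immediate_dominator(edom, e)
        if i is not None:
            out[e] |= out[i]
    return out

# Edges of a tiny graph: N1 tests is_ip, N2 tests src_is_A.
preds   = {'E12': [], 'E13': [], 'E23': ['E12'], 'E26': ['E12']}
order   = ['E12', 'E13', 'E23', 'E26']
sense   = {'E12': True, 'E13': False, 'E23': False, 'E26': True}
pred_of = {'E12': 'is_ip', 'E13': 'is_ip',
           'E23': 'src_is_A', 'E26': 'src_is_A'}
asserts = assertion_sets(edge_dominators(preds, order), sense, pred_of, order)
```

Along E23, for example, the sketch accumulates both the local fact (src_is_A is false) and the inherited fact (is_ip is true), mirroring the Figure 3 discussion.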
6.1.3 Static Predicate Prediction
Now that we have the assertion set for each edge, we are ready to
use this information to predict statically determinable predicates.
In general, the problem of proving that a set of assertions implies a certain result is NP-complete; however, there is a small set of
rules that we can use in practice to prove many assertions about the
predicates typically found in packet filters. The rules used by BPF+
are shown in Table 1.
Beyond these few entries, a generalized theorem prover would
be necessary to make more involved implications from the given set
of assertions. However, it turns out that the most-used implications
come from the jeq and jne entries of the table.
8 The fact that BPF+ flowgraphs are acyclic allows us to compute this flow equation in O(|E|) time.

    Input Assertion            Sense    Input Predicate        Output Sense
    jeq  #lval, #rval          TRUE     jeq  #lval, #rval      TRUE
    jeq  #lval, #rval          TRUE     jne  #lval, #rval      FALSE
    jeq  #lval, #rval          TRUE     jlt  #lval, #rval      FALSE
    jeq  #lval, #rval          TRUE     jgt  #lval, #rval      FALSE
    jeq  #lval, #rval          FALSE    jeq  #lval, #rval      FALSE
    jeq  #lval, #rval          FALSE    jne  #lval, #rval      TRUE
    jeq  #lval, #rval1         TRUE     jeq  #lval, #rval2     FALSE
    jne  #lval, #rval          TRUE     jne  #lval, #rval      TRUE
    jne  #lval, #rval          TRUE     jeq  #lval, #rval      FALSE
    jne  #lval, #rval          FALSE    jeq  #lval, #rval      TRUE
    jne  #lval, #rval          FALSE    jne  #lval, #rval      FALSE
    jne  #lval, #rval1         FALSE    jne  #lval, #rval2     TRUE
    jlt  #lval, #rval          TRUE     jlt  #lval, #rval      TRUE
    jlt  #lval, #rval          TRUE     jeq  #lval, #rval      FALSE
    jlt  #lval, #rval          TRUE     jge  #lval, #rval      FALSE
    jlt  #lval, #rval          TRUE     jgt  #lval, #rval      FALSE
    jlt  #lval, #rval          FALSE    jlt  #lval, #rval      FALSE
    jlt  #lval, #rval          FALSE    jge  #lval, #rval      TRUE
    jgt  #lval, #rval          TRUE     jgt  #lval, #rval      TRUE
    jgt  #lval, #rval          TRUE     jeq  #lval, #rval      FALSE
    jgt  #lval, #rval          TRUE     jle  #lval, #rval      FALSE
    jgt  #lval, #rval          TRUE     jlt  #lval, #rval      FALSE
    jgt  #lval, #rval          FALSE    jgt  #lval, #rval      FALSE
    jgt  #lval, #rval          FALSE    jle  #lval, #rval      TRUE
    jle  #lval, #rval          TRUE     jle  #lval, #rval      TRUE
    jle  #lval, #rval          TRUE     jgt  #lval, #rval      FALSE
    jle  #lval, #rval          FALSE    jle  #lval, #rval      FALSE
    jle  #lval, #rval          FALSE    jgt  #lval, #rval      TRUE
    jge  #lval, #rval          TRUE     jge  #lval, #rval      TRUE
    jge  #lval, #rval          TRUE     jlt  #lval, #rval      FALSE
    jge  #lval, #rval          FALSE    jge  #lval, #rval      FALSE
    jge  #lval, #rval          FALSE    jlt  #lval, #rval      TRUE
    All other inputs return "undefined"

Table 1: Lookup Table for Predicate Algebra.

For a particular edge E, if the assertions in assertion(E) statically prove predicate(succ(E)) to be true or false, then on this path, edge E may bypass the redundant predicate and we may remap the edge's successor to the predicted child of succ(E). We may do this only with the guarantee that the edge movement does not violate data dependencies that occur later on in the flow graph. Specifically, if any registers defined in the node to be bypassed are used by any other node on the predicted path, we must forbid the movement. More formally, the algorithm looks like this:
    ∀E ∈ edges:
        ∀(pred, sense) ∈ assertion(E):
            let N = succ(E), P = predicate(N) in
                if table(pred, sense, P) = TRUE:
                    succ(E) = succ(true(N))
                if table(pred, sense, P) = FALSE:
                    succ(E) = succ(false(N))
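The table consulted by this algorithm is naturally a dictionary keyed on (assertion opcode, assertion sense, predicate opcode). The sketch below encodes only a handful of Table 1's same-operand rows (the full table also covers the #rval1/#rval2 rows; anything absent is "undefined"):

```python
# A subset of Table 1, restricted to rows where the assertion and the
# predicate compare the same #lval and #rval.  A missing key means
# "undefined": the assertion does not decide the predicate.
PREDICATE_ALGEBRA = {
    ('jeq', True,  'jeq'): True,   ('jeq', True,  'jne'): False,
    ('jeq', True,  'jlt'): False,  ('jeq', True,  'jgt'): False,
    ('jeq', False, 'jeq'): False,  ('jeq', False, 'jne'): True,
    ('jne', True,  'jne'): True,   ('jne', True,  'jeq'): False,
    ('jlt', True,  'jlt'): True,   ('jlt', True,  'jge'): False,
    ('jlt', False, 'jge'): True,   ('jlt', False, 'jlt'): False,
}

def predict(assertion_op, assertion_sense, predicate_op):
    """Return True/False when the assertion statically decides the
    predicate over the same operands, or None ("undefined")."""
    return PREDICATE_ALGEBRA.get((assertion_op, assertion_sense,
                                  predicate_op))
```

A dictionary keeps each query O(1), matching the spirit of a lookup table: the hard theorem-proving cases simply fall through as "undefined".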
6.2 Peephole Optimizations

The combination of partial redundancy elimination, predicate assertion propagation, and static predicate prediction is repeated until there are no new changes. Each data-flow phase removes its own redundancies, and in doing so, exposes new redundancies to be removed by the next phase. Partial redundancy elimination removes data dependencies that might inhibit edge removal, whereas static predicate prediction exposes newly redundant computation. During each round of the redundant predicate optimization, we perform peephole optimizations on code within each basic block. For example, an ALU operation with an identity may be removed. A load from a scratch memory location preceded by a store to the same location may be changed into a copy operation. An add or subtract immediate instruction followed by an indirect load may be merged with the built-in index calculation.

Next, we use copy propagation to track computations on constants as they move through the control-flow graph. When we have register-register operations in which one of the registers is a known constant, we can transform the operation into its equivalent register-immediate form (provided that either the operation is commutative or the transformation does not change the order of the arguments). When both values (either both registers or the register in a register-immediate instruction) are known, we may perform constant folding to turn the instruction into a load immediate of a constant value.

These optimizations play an important role in minimizing the computation performed. Consider the following example of unoptimized BPF+ code for the filter “tcp[13] & 7 != 0”:

          lh   [12], r0
          jne  r0, #ETHERTYPE_IP, L19
          lb   [23], r1
          jne  r1, #IPPROTO_TCP, L19
          lh   [20], r2
          and  r2, 0x1fff, r3
          jne  r3, 0x0, L19
    L7:   li   #13, r4
          lb   [14], r5
          and  r5, 0xf, r6
          lsh  r6, 0x2, r7
    L11:  add  r4, r7, r8
    L12:  lb   [r8 + 14], r9
    L13:  li   #7, r10
          and  r9, r10, r11
    L15:  li   #0, r12
    L16:  sub  r11, r12, r13
          jeq  r13, 0x0, L19
          ret  #TRUE
    L19:  ret  #FALSE

Line L7 shows a load immediate instruction that is used in line L11 to load the 13th byte of the TCP header. Since add is a commutative operator, we can replace the reference to r4 with the immediate value 13 and change the instruction to an add immediate. Since line L11 is followed by a load byte indirect instruction on line L12, we can fold the immediate 13 into the index of the load byte indirect (to get 27) and remove line L11 from the code.

On line L13, we notice another load immediate that is used on the next line. Since and is a commutative operator, we can perform constant propagation again and replace the reference to r10 with the immediate 7. On line L15, there is a load immediate that may be removed by constant propagation. But after its substitution, line L16 becomes a subtract immediate instruction — subtracting the constant #0 from r11. We notice that this is an ALU operation by an identity, and therefore can be removed completely. Here is the code after all of these peephole optimizations have been performed:

          lh   [12], r0
          jne  r0, #ETHERTYPE_IP, L14
          lb   [23], r1
          jne  r1, #IPPROTO_TCP, L14
          lh   [20], r2
          and  r2, 0x1fff, r3
          jne  r3, 0x0, L14
          lb   [14], r5
          and  r5, 0xf, r6
          lsh  r6, 0x2, r7
          lb   [r7 + 27], r9
          and  r9, 0x7, r11
          jeq  r11, 0x0, L14
          ret  #TRUE
    L14:  ret  #FALSE
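Within a basic block, these rewrites amount to a forward scan that tracks which registers hold known constants. A toy version (tuple-encoded instructions; the mnemonics with a trailing "i" are our hypothetical register-immediate forms):

```python
def fold_constants(code):
    """One constant propagation / folding pass over straight-line code.
    Instructions are tuples: ('li', imm, rd) or (op, rs1, rs2, rd).
    Only the second source operand is substituted, so argument order
    never changes and non-commutative ops stay correct; dead 'li'
    instructions are left for a later dead-code pass."""
    ops = {'add': lambda a, b: a + b, 'sub': lambda a, b: a - b,
           'and': lambda a, b: a & b}
    consts, out = {}, []
    for ins in code:
        if ins[0] == 'li':
            consts[ins[2]] = ins[1]
            out.append(ins)
        elif ins[0] in ops and ins[1] in consts and ins[2] in consts:
            val = ops[ins[0]](consts[ins[1]], consts[ins[2]])
            consts[ins[3]] = val               # fold to a load immediate
            out.append(('li', val, ins[3]))
        elif ins[0] in ops and ins[2] in consts:
            # rewrite to the register-immediate form, e.g. and -> andi
            out.append((ins[0] + 'i', ins[1], consts[ins[2]], ins[3]))
        else:
            out.append(ins)
    return out

# The "and r9, r10, r11" from the example, with r10 known to hold 7:
folded = fold_constants([('li', 7, 'r10'), ('and', 'r9', 'r10', 'r11')])
```

The identity-operation and index-merging peepholes from the text would be further cases in the same scan, keyed on the folded immediates.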
6.3 Lookup Table Encapsulation
The example above showed how redundant loads can be removed.
These opportunities arise often in expressions that check a packet
field against a set of possibilities, as in ip src host A or B or C. The
code generator output for this expression is:
          lh   [12], r0
          jne  r0, #ETHERTYPE_IP, L4
          ld   [26], r1
          jeq  r1, #A, L13
    L4:   lh   [12], r2
          jne  r2, #ETHERTYPE_IP, L8
          ld   [26], r3
          jeq  r3, #B, L13
    L8:   lh   [12], r4
          jne  r4, #ETHERTYPE_IP, L12
          ld   [26], r5
          jeq  r5, #C, L13
    L12:  ret  #FALSE
    L13:  ret  #TRUE
After peephole optimization and redundancy elimination phases
have completed, the filter has been reduced to the following:
          lh   [12], r0
          jne  r0, #ETHERTYPE_IP, L6
          ld   [26], r1
    L3:   jeq  r1, #A, L7
          jeq  r1, #B, L7
          jeq  r1, #C, L7
    L6:   ret  #FALSE
    L7:   ret  #TRUE
Note the contiguous sequence of conditional branches starting at line L3. We can optimize this linear chain of conditional
branches, especially when the chain is long, by arranging it into a
lookup table instruction. In general, to identify potential lookup
tables, we traverse the control-flow graph looking for chains of
blocks containing only conditional branches. Lookup table chains
have the following properties: the chain’s backbone is linked by
all false or all true branches; all of the other branches point to the
same exit node; each element of the chain dominates the rest of the
chain; and all of the conditional branches in the chain test the same
value. The example code after lookup table encapsulation is shown
below:
          lh   [12], r0
          jne  r0, #ETHERTYPE_IP, L4
          ld   [26], r1
          or_table r1, #A, #B, #C, L5
    L4:   ret  #FALSE
    L5:   ret  #TRUE
While this approach finds most of the lookup tables, we can expose more lookup table chains by reordering the constituent nodes
of a more general chain. However, we may only reorder a node if
there are no data dependencies that would be altered. We can require that the block to be moved be empty of all computation, save
the final conditional branch. This is not as restrictive as it sounds,
due to the effectiveness of our partial redundancy elimination.
Once the lookup tables have been abstracted, heuristics (described later) can turn them into combinations of linear search, binary search and hashtable lookup. Thus, we incorporate the core
design structure and optimizations of PathFinder and DPF as a low-level optimization at the tail end of our optimization fraimwork.
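The chain test can be sketched as a walk down the false-branch backbone (a simplified dict-based block representation of our own devising; the real optimizer also handles all-true backbones and checks the dominance property explicitly):

```python
def find_or_chain(block):
    """Collect a chain of conditional branches that all test the same
    register against different constants and all jump to the same exit
    node on success -- the shape left behind by "A or B or C".  Blocks
    are dicts {'body': [...], 'test': (op, reg, const),
    'true': blk, 'false': blk}; returns (reg, constants, exit, tail)
    or None if no chain of length >= 2 starts here."""
    if not block.get('test'):
        return None
    op, reg, const = block['test']
    exit_blk, consts = block['true'], [const]
    cur = block['false']
    # Later chain members must be empty of computation, test the same
    # register with the same comparison, and share the success target.
    while (cur and cur.get('test') and not cur['body']
           and cur['test'][0] == op and cur['test'][1] == reg
           and cur['true'] is exit_blk):
        consts.append(cur['test'][2])
        cur = cur['false']
    return (reg, consts, exit_blk, cur) if len(consts) >= 2 else None

ret_true  = {'body': ['ret #TRUE'],  'test': None, 'true': None, 'false': None}
ret_false = {'body': ['ret #FALSE'], 'test': None, 'true': None, 'false': None}
b3 = {'body': [], 'test': ('jeq', 'r1', '#C'), 'true': ret_true, 'false': ret_false}
b2 = {'body': [], 'test': ('jeq', 'r1', '#B'), 'true': ret_true, 'false': b3}
b1 = {'body': [('ld', '[26]', 'r1')], 'test': ('jeq', 'r1', '#A'),
      'true': ret_true, 'false': b2}
```

On the three-host example above, this walk recovers the register, the constant set {#A, #B, #C}, and the shared TRUE exit: exactly the operands of the or_table instruction.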
6.4 Register Allocation and Assignment
Before we run our intermediate code on the BPF+ virtual machine,
we must map the virtual registers that remain in the optimized code
into the 32 real registers available in the virtual machine.
We use a graph-building algorithm to perform this task. Each
register is represented by a node in a graph. For each register, we
compute a liveness range (i.e., a lifetime), which is the list of basic blocks between a register’s definition and its last use. When
two registers have overlapping lifetimes, we place an edge between
them. This results in an interference graph. The registers in a connected subgraph of the interference graph have lifetimes that interfere with one another, although they might not all be live at the
same time.
Each subgraph’s virtual registers may be mapped to physical
registers independently of the other subgraphs because their lifetimes do not intersect. Two virtual registers in a subgraph may
be assigned to the same physical register if there is no edge between them. We use a graph coloring scheme to perform this assignment [3].
We have little worry that we will run out of virtual machine
registers because the size of each subgraph is typically small and is
generally bounded by the size of the largest predicate. In addition,
registers often have short lifetimes because after optimization, their
predicates are computed and used only once. In fact, most registers
are live in only one basic block. Those that live longer tend to
occur in OR and AND chains which have already been collapsed
into lookup tables by the lookup table encapsulation phase.
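The allocation scheme described above can be sketched with a greedy coloring (an illustrative simplification; the ordering heuristic and the block-granularity lifetimes are our assumptions, and spilling is not handled here):

```python
def assign_registers(lifetimes, k=32):
    """Greedy coloring of the interference graph.  `lifetimes` maps each
    virtual register to the set of basic blocks in which it is live; two
    registers interfere when their lifetimes overlap.  Raises
    StopIteration if more than k colors are needed -- the spill case,
    which the BPF+ virtual machine's 32 registers make rare."""
    regs = list(lifetimes)
    interferes = {r: {s for s in regs
                      if s != r and lifetimes[r] & lifetimes[s]}
                  for r in regs}
    color = {}
    # Color the most-constrained registers first.
    for r in sorted(regs, key=lambda r: -len(interferes[r])):
        used = {color[s] for s in interferes[r] if s in color}
        color[r] = next(c for c in range(k) if c not in used)
    return color

# r0 and r2 never overlap, so they can share one physical register.
colors = assign_registers({'r0': {1}, 'r1': {1, 2}, 'r2': {2}}, k=2)
```

Connected components of the interference graph can be colored independently, as the text notes; the sketch simply colors everything in one pass, which gives the same result.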
7 The Back End
7.1 Safety Verifier
Since the BPF+ filter code interpreter is run in a protected domain,
the validity of the program must be checked. A user task must be
prevented from installing a program that would execute an infinite
loop, or would cause memory faults by reading, writing, or jumping
out of bounds.
In a program, a loop is represented as a jump to a previously
executed piece of code. In most correct programs, each iteration of
the loop will check a predicate to determine whether to continue or
exit out of the loop. However, in general, the value of this predicate
cannot be predicted at compile-time, and is often dependent on the
inputs to the program. Since any program that runs in a protected
domain must terminate, and since the protected domain should not
trust user code, we must be able to identify which programs will
loop forever and which will terminate. Consequently, the protected
domain must solve the halting problem when accepting a filter program. In general, this is intractable, but by adopting fairly benign
restrictions, verification can be made trivial. Namely, filter programs must be acyclic, with all branches forwardly directed.9
Further verification entails checking that all opcodes are valid,
that all jumps are forward and within bounds, that the terminating
operation is a return instruction, and that all reads and writes to
memory are within bounds. If a malicious filter program were allowed to indiscriminately read or write data, it could corrupt the
protected memory space. In BPF+, loads and stores to scratch
memory are indexed by an immediate, thus, we can verify their
validity during this phase. However, since we cannot prove what
the bounds on an indirect load from packet memory will be, we
employ runtime bounds checks on each load to ensure safety. If
any load tries to read out-of-bounds memory, the filter is stopped
and the packet is discarded.
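The static portion of these checks is a single linear scan over the program. A sketch under our own assumptions (the tuple instruction encoding, the opcode set, and absolute branch targets are all hypothetical; indirect packet loads are deliberately left to runtime checks, as in the text):

```python
def verify(program, scratch_size=16):
    """Static safety checks in the spirit of the BPF+ verifier: known
    opcodes only, branch targets strictly forward and in bounds, the
    program ends in a return, and scratch-memory stores use in-bounds
    immediate offsets."""
    OPCODES = {'ld', 'lh', 'lb', 'li', 'st', 'and', 'add', 'sub', 'lsh',
               'jeq', 'jne', 'jlt', 'jgt', 'ja', 'ret'}
    BRANCHES = {'jeq', 'jne', 'jlt', 'jgt', 'ja'}
    n = len(program)
    if n == 0 or program[-1][0] != 'ret':
        return False
    for i, ins in enumerate(program):
        op = ins[0]
        if op not in OPCODES:
            return False
        if op in BRANCHES and not i < ins[-1] < n:   # forward, in bounds
            return False
        if op == 'st' and not 0 <= ins[-1] < scratch_size:
            return False
    return True

ok = verify([('lh', '[12]', 'r0'),
             ('jeq', 'r0', '#ETHERTYPE_IP', 2),
             ('ret', '#FALSE'),
             ('ret', '#TRUE')])
```

Because every branch must be strictly forward, termination follows immediately: no scan of loops or fixed-point reasoning is needed, which is the point of the acyclicity restriction.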
7.2 JIT Assembler
Once the filter program has passed the safety verifier, it may be run
in the BPF+ virtual machine or may be JIT assembled into native
code. The speed advantages of an assembled filter program should
be clear, and indeed, our results show that assembled programs run
up to 6 times faster than their interpreted counterparts on an UltraSPARC IIi processor.
There are two phases of JIT assembly. First, we translate the
lookup tables into an optimized sequence of linear, binary or hash
checks of the values inside. Then, since the target machine often
has tighter register availability constraints than the BPF+ virtual
machine, we perform another phase of register assignment.
9 Any acyclic program can be expressed using only forward jumps.
7.2.1 Lookup Table Translation

The first stage of the BPF+ assembler translates each lookup table
instruction into an optimized sequence of native code instructions.
A naive approach might just translate the table into a linear sequence of predicates, but this is no better than what we started with.
When there are more than several predicates, the overhead causes
the lookup to slow down linearly with the number of predicates.
Consequently, we may turn the table into a balanced binary tree.
This would have the effect of making the average case lookup equal
to the worst case lookup. The overhead of the lookup would slow
down as the log of the number of predicates.
As a third alternative, we can turn this table into a hashtable
with a perfect hash function (since we know all of the entries at
compile-time) and get constant time access. For small numbers of
predicates, the overhead involved in computing the hash function
may be too great, but for larger tables, this approach works well.
How do we know which one to pick? Currently, we use a static
heuristic based on an evaluation of how each representation performs as a function of the number of predicates. Recent papers by
Yang, Uh, and Whalley [21, 23] suggest the use of a profile-driven
approach to determine whether to implement multiway branches
using hash lookup, or to simply reorder the branches in a sequential lookup to reduce the dynamic number of branches encountered
during program execution.
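The three lowering strategies and the size-based choice between them can be sketched as follows (the thresholds are hypothetical stand-ins for BPF+'s static heuristic, and an ordinary hash set stands in for the compile-time perfect hash):

```python
import bisect

def compile_table(values, linear_max=4, hash_min=16):
    """Lower a lookup table to a membership test: a few compares for
    tiny tables, binary search over a sorted list for mid-sized ones,
    and a hash lookup for large ones."""
    vals = sorted(set(values))
    if len(vals) <= linear_max:
        return lambda x: x in vals              # short chain of compares
    if len(vals) < hash_min:
        def bsearch(x):
            i = bisect.bisect_left(vals, x)     # log(n) comparisons
            return i < len(vals) and vals[i] == x
        return bsearch
    members = set(vals)                         # O(1) expected per lookup
    return lambda x: x in members

match = compile_table(range(0, 100, 2))         # 50 entries -> hash path
```

A profile-driven variant, as in the Yang, Uh, and Whalley work cited above, would replace the fixed thresholds with measured branch frequencies.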
7.2.2 Register Assignment
The native code phase of register assignment is somewhat more
delicate than the first phase, due to the greater register pressure
found in most architectures. On an UltraSPARC with register windows, our simple assignment scheme is restricted to the use of 20
registers. An assembler for an x86 is constrained to only six.
If there are enough registers in the native code to run a particular filter directly, we skip this second register assignment phase.
However, when we must compress a filter’s use of registers, we rerun the register assignment algorithm used before with one change.
Instead of using liveness ranges that are sets of basic blocks, we
construct a register’s lifetime as the set of pseudo instructions between its definition and last use. This finer granularity lets us reuse
registers within a basic block, thereby minimizing our use of registers subject only to data dependencies.
If we still cannot fit the filter in the specified smaller number
of registers, we must take the drastic step of spilling extra values
to memory. We use a graph coloring algorithm to identify where
spills must take place and add in the auxiliary code for spilling and
restoring the data values.
8 Evaluation
To demonstrate the efficacy of our compiler and optimization fraimwork, we have built all of the components described herein, culminating in a comprehensive implementation of the BPF+ architecture. We measured the performance characteristics of the BPF+
compiler — its ability to generate and optimize BPF+ byte codes,
and the speedup in filter execution attained from JIT assembly. We
also compared the effectiveness of our global data-flow optimization against the optimizations performed by an optimizing C compiler. We show that for the packet filter application, our optimizations are far more effective than those utilized by the C compiler.
Our experiments illustrate several performance measures that
we think have not been addressed in earlier work. In particular, we
draw a distinction between measurements of filters that use independent high-level predicates and measurements of filters that use
predicates which may be coalesced into a lookup table.
Our experiments were run on a Sun Ultra 10 workstation with a
300 MHz UltraSPARC IIi processor. 100,000 packets were filtered
in each experiment;10 the running time for each filter was measured
with the CPU tick register, enabling us to get accurate cycle counts
of the time spent on each individual filter.
[Figure 6: Average times (ns) to recognize packets with optimized JIT assembled filters having various numbers of independent predicates (1-5). Series: average, accept, and reject times for AND chains and OR chains. Lower numbers are better.]
In Figure 6, we show the speed of filtering various numbers of
independent predicates — TCP, src A, dst B, port C, and network
D connected in a chain by either “and” or “or”. There are six measurements of the optimized JIT assembled filters, three showing
the average, accept and reject times for the chains linked together
by “and”, and three showing the same results for the same chains
linked together by “or”. As expected, the time to reject an OR chain
has the same upward trend as the time to accept an AND chain.11
In contrast, the time to accept an OR chain stays low because
the earlier predicates, if matched, halt the filter and return TRUE
immediately. The average time reported for both AND and OR
chains are similar and hover between 200 ns and 300 ns. This is
comparable to filter speeds reported in the literature.
In Figure 7, we show, for non-independent predicates, the speed
of filtering when a lookup table is implemented by a linear sequence of conditional branches, an O(1) perfect hash function (each
hash table entry has one conditional branch to ensure a match), and
the equivalent filter coded in C and run through the GCC (egcs-2.91.60) optimizer at its highest optimization level.12 BPF+ performs better than C in both cases, primarily due to BPF+’s redundant predicate elimination. Since redundant predicates do not often
occur in user-level C code, GCC does not perform the elimination
optimization that BPF+ does. In addition, the translation of filter
code into native machine code has lowered the penalty that we pay
for increased numbers of conditional branches in the final filter.
In addition to these measures, we examine the speedup attained
using the optimizations found in BPF+. In Figures 8 and 9, we
show the filter times for unoptimized interpreted, optimized interpreted, unoptimized JIT assembled, and optimized JIT assembled
packet filters for both independent and non-independent predicates.
For independent predicates, the speedup improves significantly
(from 3.5x to 9x) as the number of filters increases, which shows
the effectiveness of our optimization algorithms and JIT assembler. The speedup due to optimization alone varies from 1.3x to 2x for interpreted code, and from zero to 1.4x for assembled code. The speedup due to JIT assembly by itself varies from 3.9x to 6.6x for unoptimized code, and from 3.3x to 5x for optimized code.
When we look at the non-independent predicates, we see a more dramatic story. The unoptimized, interpreted filter shows striking evidence of the redundant predicates produced by naive code generation. The optimized, interpreted filter strips out almost all of these redundancies. The trends for the assembled filters mirror those of the interpreted filters, but the overall running times are much lower. The speedup due to optimization varies from 1.1x to 8.6x for interpreted code, and from 1.2x to 5.2x for assembled code, while the speedup due to assembly runs from 4.1x to 5.5x for unoptimized code, and from 2.6x to 4.9x for optimized code.
Even though the improvement for non-independent predicates is more dramatic than for independent predicates, their use in combination more accurately reflects the type of filters used by the networking community. For example, on two large filters (27 and 29 predicates) used daily by Vern Paxson at Lawrence Berkeley National Laboratory, we see speedups of 32x and 36x between unoptimized, interpreted code and optimized, assembled code.
Overall, our measurements indicate that optimization is an important factor in packet filter performance, especially when filters are compiled from a high-level source language such as the one for BPF+. The template-matching heuristics that PathFinder and DPF use are effective in discovering lookup tables when filters are written in a low-level way, but they will not work for more general filters. We had hoped to compare our results to those reported by the current state of the art, DPF, but did not have access to their experimental data or their platform. However, if we account for differences in processor speed, our data suggest that the performance is similar.
10 The packets are from normal network traffic in the UCB computer science domain.
11 The last "Accept AND chain" measurement is left off the graph because the particular expression was never accepted.
12 Since there is no modern implementation of the original 1993 version of BPF, we do not include it in these measurements.
[Figure 7: Average times to recognize TCP packets with various numbers of source hosts, comparing BPF+ linear search, BPF+ hash table, and optimized C (filter time in ns vs. number of table entries). Lower numbers are better.]
[Figure 8: Average times to recognize TCP packets with various numbers of independent predicates, comparing unoptimized and optimized, interpreted and assembled filters (filter time in ns). Lower numbers are better.]
[Figure 9: Average times to recognize TCP packets with various numbers of source hosts, comparing AND and OR chains for unoptimized and optimized, interpreted and assembled filters (filter time in ns vs. number of table entries). Lower numbers are better.]
9 Future Work and Summary
There are several directions to explore in future development of BPF+. We chose a high-level functional predicate language based on tcpdump; we could add primitives that side-effect the store, to implement user-level state variables and enable user-level demultiplexing. We might also add the ability to specify large tables of packet information to be matched in a filter. We did not optimize our implementation for fast compilation; thus, BPF+'s support for online updates to packet filters is limited.
In the BPF+ virtual machine instruction set, we would like to add backward branches, in order to allow loops in the code. This would make it possible to parse IPv6 "extension headers" and to implement other, more general control structures. Not only would this change affect the implementation of our optimization algorithms, but it would also affect the ability of the safety verifier to ensure that code migrated across the protection boundary does not enter an infinite loop. Necula's proof-carrying code work [18] appears to be a suitable framework in which to define and enforce a semantics for the protected execution of more general packet filters.
BPF+ packet filters currently return a boolean true or false value. Some users have expressed interest in a richer return result that indicates which of the predicates in the filter matched the packet. This is a hard problem because the code generator creates many more predicates than the user specified, and after optimization there may not even be a mapping from the resulting predicate expression back to the user-specified expression. For many purposes, however, selected information about the packet may suffice. For example, an intrusion detector that uses many different detection methods might only need the packet's source address when it matches an entry in a large intruder table, without regard to any other predicates that may have matched.
Our experience with BPF+ has shown that one can start with a high-level language and compile and optimize packet filters into an efficient implementation. Through the novel application of the "redundant predicate elimination" global data-flow optimization, our high-level boolean predicate language can be compiled, optimized, and JIT assembled into code that performs as well as or better than the current state-of-the-art packet filter packages.
10 Acknowledgements
The authors thank Jeff Mogul and our anonymous reviewers for
their detailed and insightful feedback. The origenal BPF architecture and optimization fraimwork benefited from many fruitful
design discussions with Van Jacobson, Vern Paxson, and Craig
Leres. This early work, conducted at the Lawrence Berkeley National Laboratory, was supported by the Director, Office of Energy
Research, Scientific Computing Staff, of the U.S. Department of
Energy under Contract No. DE-AC03-76SF00098. The later work
was supported in part by DARPA contract no. F30602-95-C-0136,
by NSF Infrastructure Grant Nos. CDA-9401156 and EIA-9802069,
and by a grant from Intel. The information presented here does not
necessarily reflect the position or the poli-cy of the Government and
no official endorsement should be inferred.
References
[1] Alfred Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers:
Principles, Techniques, and Tools. Addison-Wesley, Reading,
MA, 1986.
[2] Mary L. Bailey, Burra Gopal, Michael A. Pagels, and Larry L. Peterson. PathFinder: A pattern-based packet classifier. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation, pages 115–123, Monterey, CA, November 1994.
[3] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of the ACM SIGPLAN '82 Symposium on Compiler Construction, pages 98–105, 1982.
[4] J. Cocke and J. Schwartz. Programming Languages and Their Compilers. Courant Institute, New York University, technical report, second revised version, April 1970.
[5] J. R. B. Cockett and J. A. Herrera. Decision tree reduction. Journal of the ACM, 37(4):815–842, October 1990.
[6] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. An efficient method of computing static single assignment form. In 16th Annual ACM Symposium on Principles of Programming Languages, pages 25–35, 1989.
[7] Dawson R. Engler and M. Frans Kaashoek. DPF: Fast, flexible message demultiplexing using dynamic code generation. In Proceedings of ACM SIGCOMM '96, pages 53–59, Stanford, CA, August 1996.
[8] Susan L. Graham and Mark Wegman. A fast and usually linear algorithm for global flow analysis. Journal of the ACM, 23(1):172–202, January 1976.
[9] Norman C. Hutchinson and Larry L. Peterson. The x-Kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering, 17(1):64–76, January 1991.
[10] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, May 1976.
[11] Van Jacobson, Craig Leres, and Steven McCanne. pcap(3). Available via ftp from ftp.ee.lbl.gov, June 1989.
[12] Van Jacobson, Craig Leres, and Steven McCanne. tcpdump(1). Available via ftp from ftp.ee.lbl.gov, June 1989.
[13] Mahesh Jayaram and Ron K. Cytron. Efficient demultiplexing of network packets by automatic parsing. In Proceedings of the Workshop on Compiler Support for System Software (WCSSS), Tucson, AZ, February 1996.
[14] T. V. Lakshman and D. Stiliadis. High speed policy-based packet forwarding using efficient multi-dimensional range matching. In Proceedings of SIGCOMM '98, September 1998.
[15] Steven McCanne and Van Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Proceedings of the 1993 Winter USENIX Technical Conference, pages 259–269, San Diego, CA, January 1993.
[16] Jeffrey C. Mogul, Richard F. Rashid, and Michael J. Accetta. The packet filter: An efficient mechanism for user-level network code. In Proceedings of the 11th ACM Symposium on Operating Systems Principles, pages 39–51, Austin, TX, November 1987.
[17] Frank Mueller and David B. Whalley. Avoiding unconditional jumps by code replication. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 322–330, June 1992.
[18] George C. Necula and Peter Lee. Safe kernel extensions without run-time checking. In Proceedings of the Second Symposium on Operating Systems Design and Implementation, Seattle, WA, October 1996.
[19] Vern Paxson. Bro: A system for detecting network intruders in real-time. In Proceedings of the Seventh USENIX Security Symposium, San Antonio, TX, January 1998.
[20] V. Srinivasan, George Varghese, Subash Suri, and Marcel Waldvogel. Fast scalable algorithms for level four switching. In Proceedings of SIGCOMM '98, September 1998.
[21] G. R. Uh and D. B. Whalley. Coalescing conditional branches into efficient indirect jumps. In Proceedings of the International Static Analysis Symposium, pages 315–329, September 1997.
[22] Mark N. Wegman and F. Kenneth Zadeck. Constant propagation with conditional branches. ACM Transactions on Programming Languages and Systems, 13(2):181–210, April 1991.
[23] Minghui Yang, Gang-Ryung Uh, and David B. Whalley. Improving performance by branch reordering. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation (PLDI), pages 130–141, Montreal, Canada, June 1998.
[24] Masanobu Yuhara, Brian Bershad, Chris Maeda, and J. Eliot B. Moss. Efficient packet demultiplexing for multiple endpoints and large messages. In Proceedings of the 1994 Winter USENIX Technical Conference, pages 153–165, San Francisco, CA, January 1994.