Table 1: Example of a Markov logic network. Free variables are implicitly universally quantified.
English First-Order Logic Weight
Most people don’t smoke. ¬Smokes(x) 1.4
Most people don’t have cancer. ¬Cancer(x) 2.3
Most people aren’t friends. ¬Friends(x, y) 4.6
Smoking causes cancer. Smokes(x) ⇒ Cancer(x) 1.5
Friends have similar smoking habits. Smokes(x) ∧ Friends(x, y) ⇒ Smokes(y) 1.1
field can have arbitrary factors. As long as P(X = x) > 0 for all x, the distribution can be equivalently represented as a log-linear model: P(X = x) = (1/Z) exp(∑_i w_i g_i(x)), where the features g_i(x) are arbitrary functions of (a subset of) the state.

Graphical models can be represented as factor graphs (Kschischang, Frey, & Loeliger 2001). A factor graph is a bipartite graph with a node for each variable and factor in the model. (For convenience, we will consider one factor f_i(x) = exp(w_i g_i(x)) per feature g_i(x), i.e., we will not aggregate features over the same variables into a single factor.) Variables and the factors they appear in are connected by undirected edges.

The main inference task in graphical models is to compute the conditional probability of some variables (the query) given the values of some others (the evidence), by summing out the remaining variables. This problem is #P-complete, but becomes tractable if the graph is a tree. In this case, the marginal probabilities of the query variables can be computed in polynomial time by belief propagation, which consists of passing messages from variable nodes to the corresponding factor nodes and vice-versa. The message from a variable x to a factor f is

  μ_x→f(x) = ∏_{h ∈ nb(x)\{f}} μ_h→x(x)    (1)

where nb(x) is the set of factors x appears in. The message from a factor to a variable is

  μ_f→x(x) = ∑_{∼{x}} ( f(x) ∏_{y ∈ nb(f)\{x}} μ_y→f(y) )    (2)

where nb(f) are the arguments of f, and the sum is over all of these except x. The messages from leaf variables are initialized to 1, and a pass from the leaves to the root and back to the leaves suffices. The (unnormalized) marginal of each variable x is then given by ∏_{h ∈ nb(x)} μ_h→x(x). Evidence is incorporated by setting f(x) = 0 for states x that are incompatible with it. This algorithm can still be applied when the graph has loops, repeating the message-passing until convergence. Although this loopy belief propagation has no guarantees of convergence or of giving the correct result, in practice it often does, and can be much more efficient than other methods. Different schedules may be used for message-passing. Here we assume flooding, the most widely used and generally best-performing method, in which messages are passed from each variable to each corresponding factor and back at each step (after initializing all variable messages to 1). Belief propagation can also be used for exact inference in arbitrary graphs, by combining nodes until a tree is obtained, but this suffers from the same combinatorial explosion as variable elimination.
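As a concrete illustration of equations (1) and (2) under the flooding schedule, the following is a minimal Python sketch of loopy BP over a factor graph with binary variables. The data structures and the function name are our own illustrative choices; they are not taken from the paper or from Alchemy.

    from itertools import product

    def loopy_bp(variables, factors, iters=100):
        """Flooding-schedule loopy BP for binary variables.
        factors: {name: (scope, table)}, where scope is a tuple of variable names and
        table[assignment] is the factor value for that joint assignment."""
        nb = {v: [f for f, (scope, _) in factors.items() if v in scope] for v in variables}
        msg_vf = {(v, f): [1.0, 1.0] for f, (scope, _) in factors.items() for v in scope}
        msg_fv = {(f, v): [1.0, 1.0] for f, (scope, _) in factors.items() for v in scope}

        for _ in range(iters):
            # Equation (1): variable-to-factor messages.
            for (v, f) in msg_vf:
                m = [1.0, 1.0]
                for h in nb[v]:
                    if h != f:
                        m = [m[0] * msg_fv[(h, v)][0], m[1] * msg_fv[(h, v)][1]]
                s = (m[0] + m[1]) or 1.0
                msg_vf[(v, f)] = [m[0] / s, m[1] / s]   # normalize for numerical stability
            # Equation (2): factor-to-variable messages, summing out the other arguments.
            for (f, v) in msg_fv:
                scope, table = factors[f]
                m = [0.0, 0.0]
                for assignment in product((0, 1), repeat=len(scope)):
                    value = table[assignment]
                    for y, y_val in zip(scope, assignment):
                        if y != v:
                            value *= msg_vf[(y, f)][y_val]
                    m[assignment[scope.index(v)]] += value
                msg_fv[(f, v)] = m

        # Unnormalized marginal of each variable: product of incoming factor messages.
        marginals = {}
        for v in variables:
            m = [1.0, 1.0]
            for f in nb[v]:
                m = [m[0] * msg_fv[(f, v)][0], m[1] * msg_fv[(f, v)][1]]
            marginals[v] = m
        return marginals

For an MLN, the factor table for a ground clause with weight w_i would contain exp(w_i) for assignments that satisfy the clause and 1 otherwise, following f_i(x) = exp(w_i g_i(x)).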
Markov Logic

First-order probabilistic languages combine graphical models with elements of first-order logic, by defining template features that apply to whole classes of objects at once. A simple and powerful such language is Markov logic (Richardson & Domingos 2006). A Markov logic network (MLN) is a set of weighted first-order clauses.¹ Together with a set of constants representing objects in the domain of interest, it defines a Markov network with one node per ground atom and one feature per ground clause. The weight of a feature is the weight of the first-order clause that originated it. The probability of a state x in such a network is given by P(x) = (1/Z) exp(∑_i w_i g_i(x)) = (1/Z) ∏_i f_i(x), where w_i is the weight of the ith clause, g_i = 1 if the ith clause is true, and g_i = 0 otherwise. Table 1 shows an example of a simple MLN representing a standard social network model. In a domain with two objects Anna and Bob, ground atoms will include Smokes(Anna), Cancer(Bob), Friends(Anna, Bob), etc. States of the world where more smokers have cancer, and more pairs of friends have similar smoking habits, are more probable.

¹ In this paper we assume function-free clauses and Herbrand interpretations.

Inference in Markov logic can be carried out by creating the ground network and applying belief propagation to it, but this can be extremely inefficient because the size of the ground network is O(d^c), where d is the number of objects in the domain and c is the highest clause arity. In the next section we introduce a better, lifted algorithm for inference. Although we focus on Markov logic for simplicity, the algorithm is easily generalized to other representations. Alternatively, they can be translated to Markov logic and the algorithm applied directly (Richardson & Domingos 2006).
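To make the grounding process and the probability formula concrete, here is a small Python sketch that grounds the Table 1 MLN over two constants and evaluates the unnormalized probability exp(∑_i w_i g_i(x)) of one world. The clause encoding is our own illustrative format, not Alchemy's input syntax.

    import math
    from itertools import product

    constants = ["Anna", "Bob"]

    # Each clause: (weight, list of literals); a literal is (sign, predicate, variables).
    clauses = [
        (1.4, [(False, "Smokes", ("x",))]),                                  # ¬Smokes(x)
        (2.3, [(False, "Cancer", ("x",))]),                                  # ¬Cancer(x)
        (4.6, [(False, "Friends", ("x", "y"))]),                             # ¬Friends(x,y)
        (1.5, [(False, "Smokes", ("x",)), (True, "Cancer", ("x",))]),        # Smokes(x) ⇒ Cancer(x)
        (1.1, [(False, "Smokes", ("x",)), (False, "Friends", ("x", "y")),
               (True, "Smokes", ("y",))]),                                   # Smokes(x) ∧ Friends(x,y) ⇒ Smokes(y)
    ]

    def groundings(clause_vars):
        """All substitutions of constants for the clause's free variables."""
        vs = sorted(clause_vars)
        for combo in product(constants, repeat=len(vs)):
            yield dict(zip(vs, combo))

    def unnormalized_prob(world):
        """world maps ground atoms like ("Smokes", ("Anna",)) to True/False."""
        total = 0.0
        for weight, literals in clauses:
            clause_vars = {v for _, _, vs in literals for v in vs}
            for theta in groundings(clause_vars):
                # g_i = 1 iff some literal of the ground clause is satisfied.
                satisfied = any(
                    world[(pred, tuple(theta[v] for v in vs))] == sign
                    for sign, pred, vs in literals
                )
                total += weight * satisfied
        return math.exp(total)

    # Example world: everyone smokes, only Anna has cancer, Anna and Bob are mutual friends.
    world = {}
    for c in constants:
        world[("Smokes", (c,))] = True
        world[("Cancer", (c,))] = (c == "Anna")
    for a, b in product(constants, repeat=2):
        world[("Friends", (a, b))] = {a, b} == {"Anna", "Bob"}
    print(unnormalized_prob(world))

Even with only two constants, the ¬Friends(x, y) clause already has d² = 4 groundings; this is the O(d^c) growth of the ground network that motivates the lifted algorithm of the next section.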
Lifted Belief Propagation

We begin with some necessary definitions. These assume the existence of an MLN M, set of constants C, and evidence database E (set of ground literals). For simplicity, our definitions and explanation of the algorithm will assume that each predicate appears at most once in any given MLN clause. We will then describe how to handle multiple occurrences of a predicate in a clause.

Definition 1 A supernode is a set of groundings of a predicate that all send and receive the same messages at each step of belief propagation, given M, C and E. The supernodes of a predicate form a partition of its groundings.
A superfeature is a set of groundings of a clause that all send and receive the same messages at each step of belief propagation, given M, C and E. The superfeatures of a clause form a partition of its groundings.

Definition 2 A lifted network is a factor graph composed of supernodes and superfeatures. The factor corresponding to a superfeature g(x) is exp(wg(x)), where w is the weight of the corresponding first-order clause. A supernode and a superfeature have an edge between them iff some ground atom in the supernode appears in some ground clause in the superfeature. Each edge has a positive integer weight. A minimal lifted network is a lifted network with the smallest possible number of supernodes and superfeatures.

The first step of lifted BP is to construct the minimal lifted network. The size of this network is O(nm), where n is the number of supernodes and m the number of superfeatures. In the best case, the lifted network has the same size as the MLN; in the worst case, as the ground Markov network.

The second and final step in lifted BP is to apply standard BP to the lifted network, with two changes:

1. The message from supernode x to superfeature f becomes μ_f→x(x)^(n(f,x)−1) ∏_{h ∈ nb(x)\{f}} μ_h→x(x)^n(h,x), where n(h, x) is the weight of the edge between h and x.

2. The (unnormalized) marginal of each supernode (and therefore of each ground atom in it) is given by ∏_{h ∈ nb(x)} μ_h→x(x)^n(h,x).

The weight of an edge is the number of identical messages that would be sent from the ground clauses in the superfeature to each ground atom in the supernode if BP was carried out on the ground network. The n(f, x) − 1 exponent reflects the fact that a variable's message to a factor excludes the factor's message to the variable.
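A minimal Python sketch of these two modifications follows, assuming each message is stored as a list with one entry per truth value. The names (msg_fx, nb_x, n) are ours and purely illustrative; they are not part of the paper or of Alchemy.

    def supernode_to_superfeature(x, f, nb_x, msg_fx, n):
        """Modification 1: mu_x->f = mu_f->x^(n(f,x)-1) * prod over other h of mu_h->x^n(h,x)."""
        out = []
        for v in range(len(msg_fx[(f, x)])):           # one entry per truth value of x
            m = msg_fx[(f, x)][v] ** (n[(f, x)] - 1)   # f's own message counted n(f,x)-1 times
            for h in nb_x[x]:
                if h != f:
                    m *= msg_fx[(h, x)][v] ** n[(h, x)]
            out.append(m)
        return out

    def supernode_marginal(x, nb_x, msg_fx, n):
        """Modification 2: unnormalized marginal = prod over h of mu_h->x^n(h,x)."""
        num_values = len(msg_fx[(nb_x[x][0], x)])
        out = []
        for v in range(num_values):
            m = 1.0
            for h in nb_x[x]:
                m *= msg_fx[(h, x)][v] ** n[(h, x)]
            out.append(m)
        return out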
The lifted network is constructed by (essentially) simulating BP and keeping track of which ground atoms and clauses send the same messages. Initially, the groundings of each predicate fall into three groups: known true, known false and unknown. (One or two of these may be empty.) Each such group constitutes an initial supernode. All groundings of a clause whose atoms have the same combination of truth values (true, false or unknown) now send the same messages to the ground atoms in them. In turn, all ground atoms that receive the same number of messages from the superfeatures they appear in send the same messages, and constitute a new supernode. As the effect of the evidence propagates through the network, finer and finer supernodes and superfeatures are created.

If a clause involves predicates R1, . . . , Rk, and N = (N1, . . . , Nk) is a corresponding tuple of supernodes, the groundings of the clause generated by N are found by joining N1, . . . , Nk (i.e., by forming the Cartesian product of the relations N1, . . . , Nk, and selecting the tuples in which the corresponding arguments agree with each other, and with any corresponding constants in the first-order clause). Conversely, the groundings of predicate Ri connected to elements of a superfeature F are obtained by projecting F onto the arguments it shares with Ri. Lifted network construction thus proceeds by alternating between two steps:

1. Form superfeatures by doing joins of their supernodes.

2. Form supernodes by projecting superfeatures down to their predicates, and merging atoms with the same projection counts.

Pseudo-code for the algorithm is shown in Table 2. The projection counts at convergence are the weights associated with the corresponding edges.

Table 2: Lifted network construction.

function LNC(M, C, E)
inputs: M, a Markov logic network
        C, a set of constants
        E, a set of ground literals
output: L, a lifted network
for each predicate P
  for each truth value t in {true, false, unknown}
    form a supernode containing all groundings of P with truth value t
repeat
  for each clause C involving predicates P1, . . . , Pk
    for each tuple of supernodes (N1, . . . , Nk), where Ni is a Pi supernode
      form a superfeature F by joining N1, . . . , Nk
  for each predicate P
    for each superfeature F it appears in
      S(P, F) ← projection of the tuples in F down to the variables in P
      for each tuple s in S(P, F)
        T(s, F) ← number of F's tuples that were projected into s
    S(P) ← ∪_F S(P, F)
    form a new supernode from each set of tuples in S(P) with the same T(s, F) counts for all F
until convergence
add all current supernodes and superfeatures to L
for each supernode N and superfeature F in L
  add to L an edge between N and F with weight T(s, F)
return L
To handle clauses with multiple occurrences of a predicate, we keep a tuple of edge weights, one for each occurrence of the predicate in the clause. A message is passed for each occurrence of the predicate, with the corresponding edge weight. Similarly, when projecting superfeatures into supernodes, a separate count is maintained for each occurrence, and only tuples with the same counts for all occurrences are merged.

Theorem 1 Given an MLN M, set of constants C and set of ground literals E, there exists a unique minimal lifted network L∗, and algorithm LNC(M, C, E) returns it. Belief propagation applied to L∗ produces the same results as belief propagation applied to the ground Markov network generated by M and C.

Proof. We prove each part in turn.
The uniqueness of L∗ is proved by contradiction. Suppose there are two minimal lifted networks L1 and L2. Then there exists a ground atom a that is in supernode N1 in L1 and in supernode N2 in L2, and N1 ≠ N2; or similarly for some superfeature c. Then, by Definition 1, all nodes in N1 send the same messages as a and so do all nodes in N2, and therefore N1 = N2, resulting in a contradiction. A similar argument applies to c. Therefore there is a unique minimal lifted network L∗.

We now show that LNC returns L∗ in two subparts:

1. The network Li obtained by LNC at any iteration i is no finer than L∗ in the sense that, if two ground atoms are in different supernodes in Li, they are in different supernodes in L∗, and similarly for ground clauses.

2. LNC converges in a finite number of iterations to a network L where all ground atoms (ground clauses) in a supernode (superfeature) receive the same messages during ground BP.

The claim follows immediately from these two statements, since if L is no finer than L∗ and no coarser, it must be L∗.

For subpart 1, it is easy to see that if it is satisfied by the atoms at the ith iteration, then it is also satisfied by the clauses at the ith iteration. Now, we will prove subpart 1 by induction. Clearly, it is true at the start of the first iteration. Suppose that a supernode N splits into N1 and N2 at the ith iteration. Let a1 ∈ N1 and a2 ∈ N2. Then there must be a superfeature F in the ith iteration such that T(a1, F) ≠ T(a2, F). Since Li is no finer than L∗, there exist superfeatures Fj in L∗ such that F = ∪_j Fj. Since T(a1, F) ≠ T(a2, F), ∃j T(a1, Fj) ≠ T(a2, Fj), and therefore a1 and a2 are in different supernodes in L∗. Hence Li+1 is no finer than L∗, and by induction this is true at every iteration.

We prove subpart 2 as follows. In the first iteration each supernode either remains unchanged or splits into finer supernodes, because each initial supernode is as large as possible. In any iteration, if each supernode remains unchanged or splits into finer supernodes, each superfeature also remains unchanged or splits into finer superfeatures, because splitting a supernode that is joined into a superfeature necessarily causes the superfeature to be split as well. Similarly, if each superfeature remains unchanged or splits into finer superfeatures, each supernode also remains unchanged or splits into finer supernodes, because (a) if two nodes are in different supernodes they must have different counts from at least one superfeature, and (b) if two nodes have different counts from a superfeature, they must have different counts from at least one of the finer superfeatures that it splits into, and therefore must be assigned to different supernodes. Therefore, throughout the algorithm supernodes and superfeatures can only remain unchanged or split into finer ones. Because there is a maximum possible number of supernodes and superfeatures, this also implies that the algorithm converges in a finite number of iterations. Further, no splits occur iff all atoms in each supernode have the same counts as in the previous iteration, which implies they receive the same messages at every iteration, and so do all clauses in each corresponding superfeature.

The proof that BP applied to L gives the same results as BP applied to the ground network follows from Definitions 1 and 2, the previous parts of the theorem, modifications 1 and 2 to the BP algorithm, and the fact that the number of identical messages sent from the ground atoms in a superfeature to each ground atom in a supernode is the cardinality of the projection of the superfeature onto the supernode. □

Clauses involving evidence atoms can be simplified (false literals and clauses containing true literals can be deleted). As a result, duplicate clauses may appear, and the corresponding superfeatures can be merged. This will typically result in duplicate instances of tuples. Each tuple in the merged superfeature is assigned a weight ∑_i m_i w_i, where m_i is the number of duplicate tuples resulting from the ith superfeature and w_i is the corresponding weight. During the creation of supernodes, T(s, F) is now the number of F tuples projecting into s multiplied by the corresponding weight. This can greatly reduce the size of the lifted network. When no evidence is present, our algorithm reduces to the one proposed by Jaimovich et al. (2007).
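A tiny Python sketch of the clause simplification just described (it does not cover the subsequent merging of duplicate superfeatures); the literal encoding is our own, purely illustrative.

    def simplify(ground_clause, evidence):
        """ground_clause: list of (sign, atom) literals, sign=True for a positive literal.
        Returns the simplified clause, or None if some literal is satisfied by evidence."""
        remaining = []
        for sign, atom in ground_clause:
            if atom in evidence:
                if evidence[atom] == sign:
                    return None          # a true literal: the whole clause can be deleted
                # a false literal: drop it and keep going
            else:
                remaining.append((sign, atom))
        return remaining

    # Example: Smokes(Anna) ⇒ Cancer(Anna) with evidence Smokes(Anna)=True keeps only Cancer(Anna).
    clause = [(False, ("Smokes", ("Anna",))), (True, ("Cancer", ("Anna",)))]
    print(simplify(clause, {("Smokes", ("Anna",)): True}))   # -> [(True, ('Cancer', ('Anna',)))]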
An important question remains: how to represent supernodes and superfeatures. Although this does not affect the space or time cost of belief propagation (where each supernode and superfeature is represented by a single symbol), it can greatly affect the cost of constructing the lifted network. The simplest option is to represent each supernode or superfeature extensionally as a set of tuples (i.e., a relation), in which case joins and projections reduce to standard database operations. However, in this case the cost of constructing the lifted network is similar to the cost of constructing the full ground network, and can easily become the bottleneck. A better option is to use a more compact intensional representation, as done by Poole (2003) and Braz et al. (2005; 2006).²

² Superfeatures are related, but not identical, to the parfactors of Poole and Braz et al. One important difference is that superfeatures correspond to factors in the original graph, while parfactors correspond to factors created during variable elimination. Superfeatures are thus exponentially more compact.

A ground atom can be viewed as a first-order atom with all variables constrained to be equal to constants, and similarly for ground clauses. (For example, R(A, B) is R(x, y) with x = A and y = B.) We represent supernodes by sets of (α, γ) pairs, where α is a first-order atom and γ is a set of constraints, and similarly for superfeatures. Constraints are of the form x = y or x ≠ y, where x is an argument of the atom and y is either a constant or another argument. For example, (S(v, w, x, y, z), {w = x, y = A, z ≠ B, z ≠ C}) compactly represents all groundings of S(v, w, x, y, z) compatible with the constraints. Notice that variables may be left unconstrained, and that infinite sets of atoms can be finitely represented in this way.
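Here is a small Python sketch of this (α, γ) representation, using the example from the text; the class and method names are ours, purely illustrative.

    class ConstrainedAtom:
        """An (alpha, gamma) pair: a first-order atom plus equality/inequality constraints."""
        def __init__(self, pred, variables, eq=None, neq=None):
            self.pred = pred
            self.variables = list(variables)   # argument variables, in order
            self.eq = dict(eq or {})           # e.g. {"w": "x", "y": "A"}: var = var-or-constant
            self.neq = dict(neq or {})         # e.g. {"z": {"B", "C"}}: var != each listed term

        def covers(self, args, constants):
            """True iff the ground atom pred(args) satisfies all constraints."""
            binding = dict(zip(self.variables, args))
            for v, t in self.eq.items():
                target = binding[t] if t in binding else (t if t in constants else None)
                if target is None or binding[v] != target:
                    return False
            for v, terms in self.neq.items():
                for t in terms:
                    if binding[v] == (binding[t] if t in binding else t):
                        return False
            return True

    # (S(v, w, x, y, z), {w = x, y = A, z != B, z != C}) from the text:
    s = ConstrainedAtom("S", ["v", "w", "x", "y", "z"],
                        eq={"w": "x", "y": "A"}, neq={"z": {"B", "C"}})
    consts = {"A", "B", "C", "D", "E"}
    print(s.covers(("D", "E", "E", "A", "D"), consts))   # True
    print(s.covers(("D", "E", "E", "A", "B"), consts))   # False: violates z != B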
Let the default value of a predicate R be its most frequent value given the evidence (true, false or unknown). Let S_{R,i} be the set of constants that appear as the ith argument of R only in groundings with the default value. Supernodes not involving any members of S_{R,i} for any argument i are represented extensionally (i.e., with pairs (α, γ) where γ contains a constraint of the form x = A, where A is a constant, for each argument x).
Initially, supernodes involving members of S_{R,i} are represented using (α, γ) pairs containing constraints of the form x ≠ A for each A ∈ C \ S_{R,i}.³ When two or more supernodes are joined to form a superfeature F, if the kth argument of F's clause is the i(j)th argument of its jth literal, S_k = ∩_j S_{r(j),i}, where r(j) is the predicate symbol in the jth literal. F is now represented analogously to the supernodes, according to whether or not it involves elements of S_k. If F is represented intensionally, each (α, γ) pair is divided into one pair for each possible combination of equality/inequality constraints among the clause's arguments, which are added to γ. When forming a supernode from superfeatures, the constraints in each (α, γ) pair in the supernode are the union of (a) the corresponding constraints in the superfeatures on the variables included in the supernode, and (b) the constraints induced by the excluded variables on the included ones. This process is analogous to the shattering process of Braz et al. (2005).

³ In practice, variables are typed, and C is replaced by the domain of the argument; and the set of constraints is only stored once, and pointed to as needed.

In general, finding the most compact representation for supernodes and superfeatures is an intractable problem. Investigating it further is a direction for future work.
Experiments

We compared the performance of lifted BP with the ground version on three domains. All the domains are loopy (i.e., the graphs have cycles), and the algorithms of Poole (2003) and Braz et al. (2005; 2006) run out of memory, rendering them inapplicable. We implemented lifted BP as an extension of the open-source Alchemy system (Kok et al. 2007). Since our algorithm is guaranteed to produce the same results as the ground version, we do not report solution quality. Diagnosing the convergence of BP is a difficult problem; we ran it for 1000 steps for both algorithms in all experiments. BP did not always converge. Either way, it was marginally less accurate than Gibbs sampling. The experiments were run on a cluster of nodes, each node having 3.46 GB of RAM and two processors running at 3 GHz.

Entity Resolution

Entity resolution is the problem of determining which observations (e.g., records in a database) correspond to the same objects. This problem is of crucial importance to many large scientific projects, businesses, and government agencies, and has received increasing attention in the AI community in recent years. We used the version of McCallum's Cora database available on the Alchemy website (Kok et al. 2007). The inference task was to de-duplicate citations, authors and venues (i.e., to determine which pairs of citations refer to the same underlying paper, and similarly for author fields and venue fields). We used the MLN (formulas and weights) used by Singla and Domingos (2005) in their experiments. This contains 46 first-order clauses stating regularities such as: if two fields have high TF-IDF similarity, they are (probably) the same; if two records are the same, their fields are the same, and vice-versa; etc.
Link Prediction

Link prediction is an important problem with many applications: social network analysis, law enforcement, bibliometrics, identifying metabolic networks in cells, etc. We experimented on the link prediction task of Richardson and Domingos (2006), using the UW-CSE database and MLN publicly available from the Alchemy website (Kok et al. 2007). The database contains a total of 2678 groundings of predicates like: Student(person), Professor(person), AdvisedBy(person1, person2), TaughtBy(course, person, quarter), Publication(paper, person), etc. The MLN includes 94 formulas stating regularities like: each student has at most one advisor; if a student is an author of a paper, so is her advisor; etc. The task is to predict who is whose advisor, i.e., the AdvisedBy(x, y) predicate, from information about paper authorships, classes taught, etc. The database is divided into five areas (AI, graphics, etc.); we trained weights on the smallest using Alchemy's default discriminative learning algorithm, ran inference on all five, and averaged the results.

Social Networks

We also experimented with the example "Friends & Smokers" MLN in Table 1. The goal here was to examine how the relative performance of lifted BP and ground BP varies with the number of objects in the domain and the fraction of objects we have evidence about. We varied the number of people from 250 to 2500 in increments of 250, and the fraction of known people KF from 0 to 1. A KF of r means that we know for a randomly chosen r fraction of all people (a) whether they smoke or not and (b) who 10 of their friends are (other friendship relations are still assumed to be unknown). Cancer(x) is unknown for all x. The people with known information were randomly chosen. The whole domain was divided into a set of friendship clusters of size 50 each. For each known person, we randomly chose each friend with equal probability of being inside or outside their friendship cluster. All unknown atoms were queried.
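The evidence-generation procedure described above can be summarized in a few lines of Python. This is our reading of the setup (in particular, the 50/50 smoking status for known people is an assumption), not the authors' actual experiment script.

    import random

    def generate_evidence(n_people=1000, kf=0.1, cluster_size=50, friends_per_person=10, seed=0):
        rng = random.Random(seed)
        people = list(range(n_people))
        cluster = {p: p // cluster_size for p in people}           # friendship clusters of 50
        known = set(rng.sample(people, int(kf * n_people)))        # randomly chosen known people
        evidence = {}
        for p in known:
            evidence[("Smokes", (p,))] = rng.random() < 0.5        # assumed 50/50 observed status
            inside = [q for q in people if cluster[q] == cluster[p] and q != p]
            outside = [q for q in people if cluster[q] != cluster[p]]
            for _ in range(friends_per_person):
                pool = inside if rng.random() < 0.5 else outside   # in/out of cluster, equal odds
                evidence[("Friends", (p, rng.choice(pool)))] = True
        return evidence   # Cancer(x) and all other atoms remain unknown and are queried

    ev = generate_evidence()
    print(len(ev), "evidence atoms")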
Results

Results on all domains are summarized in Table 3. The Friends & Smokers results are for 1000 people and KF = 0.1; the Cora results are for 500 records. All results for Cora and Friends & Smokers are averages over five random splits.⁴ LNC with intensional representation is comparable in time and memory with the extensional version on Cora and UW-CSE, but much more efficient on Friends & Smokers. All the results shown are for the intensional representation. LNC is slower than grounding the full network, but BP is much faster on the lifted network, resulting in better times in all domains (by two orders of magnitude on Friends & Smokers). The number of (super) features created is much smaller for lifted BP than for ground BP (by four orders of magnitude on Friends & Smokers). Memory (not reported here) is comparable on Cora and UW-CSE, and much lower for LNC on Friends & Smokers. Figure 1 shows how network size varies with the number of people in the Friends & Smokers domain.

⁴ For Cora, we made sure that each actual cluster was either completely inside or outside each split.
Table 3: Time and memory cost of ground and lifted BP. Times are in seconds; each cell gives Ground / Lifted.

Domain              Construction        BP                  Total               No. of (Super) Features
Cora                263.1 / 1173.3      12368.4 / 3997.7    12631.6 / 5171.1    2078629 / 295468
UW-CSE              6.9 / 22.1          1015.8 / 602.5      1022.8 / 624.7      217665 / 86459
Friends & Smokers   38.8 / 89.7         10702.2 / 4.4       10741.0 / 94.2      1900905 / 58
[Figure 1: No. of (super) features as a function of the number of people in the Friends & Smokers domain; the plot itself is not reproduced here.]

Acknowledgments

This research was funded by DARPA contracts NBCH-D030010/02-000225, FA8750-07-D-0185, and HR0011-07-C-