
Bayesian networks – exercises

Collected by: Jiřı́ Kléma, klema@labe.felk.cvut.cz

Fall 2015/2016

Note: The exercises 3b-e, 10 and 13 were not covered this term.

Goals: The text provides a pool of exercises to be solved during AE4M33RZN tutorials on
graphical probabilistic models. The exercises illustrate the topics of conditional independence,
learning and inference in Bayesian networks. The same material with worked solutions
will be provided after the last Bayesian network tutorial.

1 Independence and conditional independence


Exercise 1. Formally prove which (conditional) independence relationships are encoded by
serial (linear) connection of three random variables.

Only the relationship between A and C needs to be studied (the variables connected by an edge are clearly
dependent); let us consider A ⊥⊥ C|∅ and A ⊥⊥ C|B:

A ⊥⊥ C|∅ ⇔ Pr(A, C) = Pr(A)Pr(C) ⇔ Pr(A|C) = Pr(A) ∧ Pr(C|A) = Pr(C)
A ⊥⊥ C|B ⇔ Pr(A, C|B) = Pr(A|B)Pr(C|B) ⇔ Pr(A|B, C) = Pr(A|B) ∧ Pr(C|A, B) = Pr(C|B)

From the BN definition it follows: Pr(A, B, C) = Pr(A)Pr(B|A)Pr(C|B)

To decide on conditional independence between A and C, P r(A, C|B) can be expressed and factorized.
It follows from both conditional independence and BN definition:

Pr(A, C|B) = Pr(A, B, C)/Pr(B) = [Pr(A)Pr(B|A)/Pr(B)] Pr(C|B) = Pr(A|B)Pr(C|B)

Pr(A, C|B) = Pr(A|B)Pr(C|B) holds in the linear connection and A ⊥⊥ C|B also holds.

Note 1: An alternative way to prove the same is to express P r(C|A, B) or P r(A|B, C):

Pr(C|A, B) = Pr(A, B, C)/Pr(A, B) = Pr(A)Pr(B|A)Pr(C|B) / [Pr(A)Pr(B|A)] = Pr(C|B), or

Pr(A|B, C) = Pr(A, B, C)/Pr(B, C) = Pr(A)Pr(B|A)Pr(C|B) / Σ_A Pr(A)Pr(B|A)Pr(C|B) =
= Pr(A)Pr(B|A)Pr(C|B) / [Pr(C|B) Σ_A Pr(A)Pr(B|A)] = Pr(A)Pr(B|A)/Pr(B) = Pr(A|B)

Note 2: An even simpler way to prove the same is to apply both the general and the BN-specific
definition of the joint probability:

P r(A)P r(B|A)P r(C|A, B) = P r(A, B, C) = P r(A)P r(B|A)P r(C|B) ⇒ P r(C|A, B) = P r(C|B)

To decide on independence of A from C, Pr(A, C) needs to be expressed. Let us marginalize the BN
definition:

Pr(A, C) = Σ_B Pr(A, B, C) = Σ_B Pr(A)Pr(B|A)Pr(C|B) = Pr(A) Σ_B Pr(C|B)Pr(B|A) =
= Pr(A) Σ_B Pr(C|A, B)Pr(B|A) = Pr(A) Σ_B Pr(B, C|A) = Pr(A)Pr(C|A)

(the conditional independence expression proved earlier was used, it holds P r(C|B) = P r(C|A, B)).
The independence expression P r(A, C) = P r(A)P r(C) does not follow from the linear connection and
the relationship A ⊥⊥ C|∅ does not hold in general.

Conclusion: The d-separation pattern known for the linear connection was proved from the BN definition.
A linear connection transmits information when the intermediate node is not given and blocks it
otherwise. In other words, its terminal nodes are dependent; however, once the middle node is known
for sure, the dependence vanishes.
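
The argument can also be checked numerically. The following sketch (not part of the original exercise; the CPT values are made up for illustration) builds a chain A → B → C, computes the joint by the BN definition and verifies that Pr(A, C|B) = Pr(A|B)Pr(C|B) while Pr(A, C) ≠ Pr(A)Pr(C):

```python
# A minimal numeric check of the serial-connection argument, with illustrative CPTs.
import itertools

Pr_A = {True: 0.4, False: 0.6}                      # Pr(A)
Pr_B_given_A = {True: {True: 0.8, False: 0.2},      # Pr(B|A)
                False: {True: 0.5, False: 0.5}}
Pr_C_given_B = {True: {True: 0.2, False: 0.8},      # Pr(C|B)
                False: {True: 0.3, False: 0.7}}

def joint(a, b, c):
    # BN definition for the chain: Pr(A,B,C) = Pr(A) Pr(B|A) Pr(C|B)
    return Pr_A[a] * Pr_B_given_A[a][b] * Pr_C_given_B[b][c]

def marg(**fixed):
    # Sum the joint over all variables not fixed by the caller.
    total = 0.0
    for a, b, c in itertools.product([True, False], repeat=3):
        val = {'a': a, 'b': b, 'c': c}
        if all(val[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return total

# Conditional independence A ⊥⊥ C | B holds:
lhs = marg(a=True, c=True, b=True) / marg(b=True)
rhs = (marg(a=True, b=True) / marg(b=True)) * (marg(c=True, b=True) / marg(b=True))
print(abs(lhs - rhs) < 1e-12)                                            # True

# Marginal independence A ⊥⊥ C does not hold in general:
print(abs(marg(a=True, c=True) - marg(a=True) * marg(c=True)) < 1e-12)   # False
```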

Exercise 2. Having the network/graph shown in figure below, decide on the validity of
following statements:

a) P1, P5 ⊥⊥ P6 | P8,

b) P2 >> P6 | ∅,

c) P1 ⊥⊥ P2 | P8,

d) P1 ⊥⊥ P2, P5 | P4,

e) Markov equivalence class that contains the shown graph contains exactly three directed
graphs.

Solution:
a) FALSE, the path through P3, P4 and P7 is open; neither the nodes P1 and P6 nor P5 and P6
are d-separated,
b) FALSE, the path is blocked, namely at the node P7,
c) FALSE, the unobserved linear node P3 is open, the converging node P4 is open due to P8, hence the path is open,
d) FALSE, information flows through the unobserved linear node P3,
e) TRUE, the direction of P1 → P3 can be reversed (second graph), and then P3 → P5 can also be reversed
(third graph).

Exercise 3. Let us have an arbitrary set of (conditional) independence relationships among
N variables that is associated with a joint probability distribution.

a) Can we always find a directed acyclic graph that perfectly maps this set (perfectly maps
= preserves all the (conditional) independence relationships, it neither removes nor
adds any)?

b) Can we always find an undirected graph that perfectly maps this set?

c) Can directed acyclic models represent the conditional independence relationships of
all possible undirected models?

d) Can undirected models represent the conditional independence relationships of all
possible directed acyclic models?

e) Can we always find a directed acyclic model or an undirected model?

Solution:
a) No, we cannot. An example is {A ⊥⊥ C|B ∪ D, B ⊥⊥ D|A ∪ C} in a four-variable problem. This pair
of conditional independence relationships leads to a cyclic graph or to a converging connection which
introduces additional independence relationships. In practice, if the perfect map does not exist, we
rather search for a graph that encodes only valid (conditional) independence relationships and is
minimal in the sense that removal of any of its edges would introduce an invalid (conditional)
independence relationship.
b) No, we cannot. An example is {A ⊥⊥ C|∅} in a three-variable problem (the complementary set of
dependence relationships is {A >> B|∅, A >> B|C, B >> C|∅, B >> C|A, A >> C|B}). It follows that A
and B must be directly connected, as there is no other way to meet both A >> B|∅ and A >> B|C.
The same holds for B and C. Knowing A ⊥⊥ C|∅, there can be no edge between A and C.
Consequently, A ⊥⊥ C|B necessarily holds, which contradicts the given set of independence
relationships (the graph encodes an independence relationship that does not hold).

c) No, they cannot. An example is the set of relationships from a) that can be encoded in the form of an
undirected graph (see the left figure below), but not as a directed graph. The best directed graph
encodes only one of the CI relationships (see the mid-left figure below), namely {B ⊥⊥ D|A ∪ C}.
d) There are also directed graphs whose independence relationships cannot be captured by undirected
models. Any directed graph with a converging connection makes an example, see the graph on the
right in the figure below which encodes the set of relationships from b). In the space of undirected
graphs it has to be represented as the complete graph (no independence assumptions). Neither of
the two discussed graph classes is strictly more expressive than the other.

Figures: the undirected model from a) and its imperfect directed counterpart; the directed model from b) and its imperfect undirected counterpart.

e) No, we cannot. Although the space of arbitrary CI relationship sets is remarkably restricted by the
condition that an associated joint probability distribution exists (e.g., the set {A ⊥⊥ B, B >> A}
violates the trivial axiom of symmetry, so there is no corresponding joint probability distribution),
there are still sets of relationships that are meaningful (have a joint probability counterpart)
but cannot be represented by any graph.

Venn diagram illustrating graph expressivity. F stands for free CI relationship sets, P stands for CI
relationship sets with an associated probability distribution, D stands for distributions with the perfect
directed map and U stands for distributions with the perfect undirected map.

2 Inference
Exercise 4. Given the network below, calculate the marginal and conditional probabilities
Pr(¬p3), Pr(p2|¬p3), Pr(p1|p2, ¬p3) and Pr(p1|¬p3, p4). Apply the method of inference
by enumeration.

Inference by enumeration sums the joint probabilities of atomic events. They are calculated from the
network model: P r(P1 , . . . , Pn ) = P r(P1 |parents(P1 )) × · · · × P r(Pn |parents(Pn )). The method does
not take advantage of conditional independence to further simplify inference. It is a routine and easily
formalized algorithm, but computationally expensive. Its complexity is exponential in the number of
variables.

Pr(¬p3) = Σ_{P1,P2,P4} Pr(P1, P2, ¬p3, P4) = Σ_{P1,P2,P4} Pr(P1)Pr(P2|P1)Pr(¬p3|P2)Pr(P4|P2) =
= Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(¬p4|p2) +
+ Pr(p1)Pr(¬p2|p1)Pr(¬p3|¬p2)Pr(p4|¬p2) + Pr(p1)Pr(¬p2|p1)Pr(¬p3|¬p2)Pr(¬p4|¬p2) +
+ Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(¬p4|p2) +
+ Pr(¬p1)Pr(¬p2|¬p1)Pr(¬p3|¬p2)Pr(p4|¬p2) + Pr(¬p1)Pr(¬p2|¬p1)Pr(¬p3|¬p2)Pr(¬p4|¬p2) =
= .4 × .8 × .8 × .8 + .4 × .8 × .8 × .2 + .4 × .2 × .7 × .5 + .4 × .2 × .7 × .5 +
+ .6 × .5 × .8 × .8 + .6 × .5 × .8 × .2 + .6 × .5 × .7 × .5 + .6 × .5 × .7 × .5 =
= .2048 + .0512 + .028 + .028 + .192 + .048 + .105 + .105 = .762

Pr(p2|¬p3) = Pr(p2, ¬p3)/Pr(¬p3) = .496/.762 = .6509

Pr(p2, ¬p3) = Σ_{P1,P4} Pr(P1, p2, ¬p3, P4) = Σ_{P1,P4} Pr(P1)Pr(p2|P1)Pr(¬p3|p2)Pr(P4|p2) =
= Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(¬p4|p2) +
+ Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(¬p4|p2) =
= .4 × .8 × .8 × .8 + .4 × .8 × .8 × .2 + .6 × .5 × .8 × .8 + .6 × .5 × .8 × .2 =
= .2048 + .0512 + .192 + .048 = .496

Pr(p1|p2, ¬p3) = Pr(p1, p2, ¬p3)/Pr(p2, ¬p3) = .256/.496 = .5161

Pr(p1, p2, ¬p3) = Σ_{P4} Pr(p1, p2, ¬p3, P4) = Σ_{P4} Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(P4|p2) =
= Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(¬p4|p2) =
= .4 × .8 × .8 × .8 + .4 × .8 × .8 × .2 = .2048 + .0512 = .256

Pr(p2, ¬p3) = Pr(p1, p2, ¬p3) + Pr(¬p1, p2, ¬p3) = .256 + .24 = .496

Pr(¬p1, p2, ¬p3) = Σ_{P4} Pr(¬p1, p2, ¬p3, P4) = Σ_{P4} Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(P4|p2) =
= Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(¬p4|p2) =
= .6 × .5 × .8 × .8 + .6 × .5 × .8 × .2 = .192 + .048 = .24

Pr(p1|¬p3, p4) = Pr(p1, ¬p3, p4)/Pr(¬p3, p4) = .2328/.5298 = .4394

Pr(p1, ¬p3, p4) = Σ_{P2} Pr(p1, P2, ¬p3, p4) = Σ_{P2} Pr(p1)Pr(P2|p1)Pr(¬p3|P2)Pr(p4|P2) =
= Pr(p1)Pr(p2|p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(p1)Pr(¬p2|p1)Pr(¬p3|¬p2)Pr(p4|¬p2) =
= .4 × .8 × .8 × .8 + .4 × .2 × .7 × .5 = .2048 + .028 = .2328

Pr(¬p3, p4) = Pr(p1, ¬p3, p4) + Pr(¬p1, ¬p3, p4) = .2328 + .297 = .5298

Pr(¬p1, ¬p3, p4) = Σ_{P2} Pr(¬p1, P2, ¬p3, p4) = Σ_{P2} Pr(¬p1)Pr(P2|¬p1)Pr(¬p3|P2)Pr(p4|P2) =
= Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2)Pr(p4|p2) + Pr(¬p1)Pr(¬p2|¬p1)Pr(¬p3|¬p2)Pr(p4|¬p2) =
= .6 × .5 × .8 × .8 + .6 × .5 × .7 × .5 = .192 + .105 = .297

Conclusion: Pr(¬p3) = 0.762, Pr(p2|¬p3) = 0.6509, Pr(p1|p2, ¬p3) = 0.5161, Pr(p1|¬p3, p4) = 0.4394.
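
The whole enumeration can be mirrored in a few lines of code. The sketch below assumes the CPT values read off the arithmetic above (Pr(p1) = .4, Pr(p2|p1) = .8, Pr(p2|¬p1) = .5, Pr(p3|p2) = .2, Pr(p3|¬p2) = .3, Pr(p4|p2) = .8, Pr(p4|¬p2) = .5):

```python
# Inference by enumeration for the four-variable network of Exercise 4.
import itertools

Pr_P1 = {True: 0.4, False: 0.6}
Pr_P2 = {True: {True: 0.8, False: 0.2}, False: {True: 0.5, False: 0.5}}   # Pr(P2|P1)
Pr_P3 = {True: {True: 0.2, False: 0.8}, False: {True: 0.3, False: 0.7}}   # Pr(P3|P2)
Pr_P4 = {True: {True: 0.8, False: 0.2}, False: {True: 0.5, False: 0.5}}   # Pr(P4|P2)

def joint(p1, p2, p3, p4):
    # Joint probability of one atomic event from the network factorization.
    return Pr_P1[p1] * Pr_P2[p1][p2] * Pr_P3[p2][p3] * Pr_P4[p2][p4]

def prob(**fixed):
    # Enumerate and sum all atomic events consistent with the fixed assignments.
    names = ['p1', 'p2', 'p3', 'p4']
    total = 0.0
    for vals in itertools.product([True, False], repeat=4):
        assignment = dict(zip(names, vals))
        if all(assignment[k] == v for k, v in fixed.items()):
            total += joint(*vals)
    return total

print(prob(p3=False))                                                # 0.762
print(prob(p2=True, p3=False) / prob(p3=False))                      # 0.6509...
print(prob(p1=True, p2=True, p3=False) / prob(p2=True, p3=False))    # 0.5161...
print(prob(p1=True, p3=False, p4=True) / prob(p3=False, p4=True))    # 0.4394...
```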

Exercise 5. For the same network calculate the same marginal and conditional probabilities
again. Employ the properties of directed graphical model to manually simplify inference
by enumeration carried out in the previous exercise.

When calculating Pr(¬p3) (and analogically Pr(p2|¬p3)), P4 is a leaf that is neither a query nor an evidence variable.
It can be eliminated without changing the target probabilities.

Pr(¬p3) = Σ_{P1,P2} Pr(P1, P2, ¬p3) = Σ_{P1,P2} Pr(P1)Pr(P2|P1)Pr(¬p3|P2) =
= Pr(p1)Pr(p2|p1)Pr(¬p3|p2) + Pr(p1)Pr(¬p2|p1)Pr(¬p3|¬p2) +
+ Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2) + Pr(¬p1)Pr(¬p2|¬p1)Pr(¬p3|¬p2) =
= .4 × .8 × .8 + .4 × .2 × .7 + .6 × .5 × .8 + .6 × .5 × .7 =
= .256 + .056 + .24 + .21 = .762

The same result is reached by rearranging the following expression:

Pr(¬p3) = Σ_{P1,P2,P4} Pr(P1, P2, ¬p3, P4) = Σ_{P1,P2,P4} Pr(P1)Pr(P2|P1)Pr(¬p3|P2)Pr(P4|P2) =
= Σ_{P1,P2} Pr(P1)Pr(P2|P1)Pr(¬p3|P2) Σ_{P4} Pr(P4|P2) = Σ_{P1,P2} Pr(P1)Pr(P2|P1)Pr(¬p3|P2) × 1

Analogically, Pr(p2, ¬p3) and Pr(p2|¬p3) can be calculated:

Pr(p2, ¬p3) = Σ_{P1} Pr(P1, p2, ¬p3) = Σ_{P1} Pr(P1)Pr(p2|P1)Pr(¬p3|p2) =
= Pr(p1)Pr(p2|p1)Pr(¬p3|p2) + Pr(¬p1)Pr(p2|¬p1)Pr(¬p3|p2) =
= .4 × .8 × .8 + .6 × .5 × .8 = .256 + .24 = .496

The computation of Pr(p1|p2, ¬p3) may take advantage of P1 ⊥⊥ P3|P2 – P2 is a linear node
between P1 and P3; when P2 is given, the path is blocked and the nodes P1 and P3 are d-separated.
Pr(p1|p2, ¬p3) simplifies to Pr(p1|p2), which is easier to compute (both P3 and P4 become unqueried
and unobserved graph leaves; alternatively, the expression could also be simplified by eliminating the
tail probability that equals one):

Pr(p1|p2) = Pr(p1, p2)/Pr(p2) = .32/.62 = .5161
Pr(p1, p2) = Pr(p1)Pr(p2|p1) = .4 × .8 = .32
Pr(p2) = Pr(p1, p2) + Pr(¬p1, p2) = .32 + .3 = .62
Pr(¬p1, p2) = Pr(¬p1)Pr(p2|¬p1) = .6 × .5 = .3

P r(p1 |¬p3 , p4 ) calculation cannot be simplified.

Conclusion: Consistent use of the properties of graphical models and precomputation of repetitive
calculations greatly simplifies and accelerates inference.

Exercise 6. For the same network calculate P r(¬p3 ) and P r(p2 |¬p3 ) again. Apply the
method of variable elimination.

Variable elimination gradually simplifies the original network by removing hidden variables (those that
are neither query nor evidence variables). The hidden variables are summed out. The target network is a single
node representing the joint probability Pr(Q, e). Eventually, this probability is used to answer the query
Pr(Q|e) = Pr(Q, e) / Σ_Q Pr(Q, e).

The first two steps are the same for both the probabilities: (i) P4 can simply be removed and (ii) P1 is
summed out. Then, (iii) P2 gets summed out to obtain P r(¬p3 ) while (iv) the particular value ¬p3 is
taken to obtain P r(p2 |¬p3 ). See the figure below.

The elimination process is carried out by factors. The step (i) is trivial, the step (ii) corresponds to:

f_P̄1(P2) = Σ_{P1} Pr(P1, P2) = Σ_{P1} Pr(P1)Pr(P2|P1)

f_P̄1(p2) = .4 × .8 + .6 × .5 = .62, f_P̄1(¬p2) = .4 × .2 + .6 × .5 = .38

The step (iii) consists in:

f_P̄1,P̄2(P3) = Σ_{P2} f_P̄1(P2)Pr(P3|P2)

f_P̄1,P̄2(p3) = .62 × .2 + .38 × .3 = .238, f_P̄1,P̄2(¬p3) = .62 × .8 + .38 × .7 = .762

The step (iv) consists in:

f_P̄1,¬p3(P2) = f_P̄1(P2)Pr(¬p3|P2)

f_P̄1,¬p3(p2) = .62 × .8 = .496, f_P̄1,¬p3(¬p2) = .38 × .7 = .266

Eventually, the target probabilities can be computed:

Pr(¬p3) = f_P̄1,P̄2(¬p3) = .762

Pr(p2|¬p3) = f_P̄1,¬p3(p2) / (f_P̄1,¬p3(p2) + f_P̄1,¬p3(¬p2)) = .496 / (.496 + .266) = .6509

Conclusion: Variable elimination makes a building block for other exact and approximate inference
algorithms. In general DAGs it is NP-hard; nevertheless, it is often much more efficient with a proper
elimination order (finding the best one is difficult, but heuristics exist). In our example, the enumeration
approach takes 47 operations, the simplified method 17, while the variable elimination method needs only 16
operations.
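
The three elimination steps can be written down as explicit factor operations. The sketch below mirrors steps (ii)–(iv) above (step (i), dropping P4, is implicit in that P4 never appears):

```python
# Factor-based variable elimination for the Exercise 4 network.
Pr_P1 = {True: 0.4, False: 0.6}
Pr_P2 = {(True, True): 0.8, (True, False): 0.2,       # Pr(P2|P1), keyed by (P1, P2)
         (False, True): 0.5, (False, False): 0.5}
Pr_P3 = {(True, True): 0.2, (True, False): 0.8,       # Pr(P3|P2), keyed by (P2, P3)
         (False, True): 0.3, (False, False): 0.7}

# Step (ii): sum out P1 -> factor f(P2)
f_P2 = {p2: sum(Pr_P1[p1] * Pr_P2[(p1, p2)] for p1 in (True, False))
        for p2 in (True, False)}                       # {True: 0.62, False: 0.38}

# Step (iii): sum out P2 -> factor f(P3), i.e. the marginal Pr(P3)
f_P3 = {p3: sum(f_P2[p2] * Pr_P3[(p2, p3)] for p2 in (True, False))
        for p3 in (True, False)}
print(f_P3[False])                                     # Pr(¬p3) = 0.762

# Step (iv): keep P2, instantiate the evidence ¬p3 -> unnormalized factor over P2
g = {p2: f_P2[p2] * Pr_P3[(p2, False)] for p2 in (True, False)}
print(g[True] / (g[True] + g[False]))                  # Pr(p2|¬p3) = 0.6509...
```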

Exercise 7. Analyze the complexity of inference by enumeration and variable elimination
on a chain of binary variables.

The given network factorizes the joint probability as follows:

Pr(P1, . . . , Pn) = Pr(P1)Pr(P2|P1) . . . Pr(Pn|Pn−1)

The inference by enumeration works with up to 2^n atomic events. To get the probability of each
event, n − 1 multiplications must be carried out. To obtain Pr(pn), we need to enumerate and sum
2^(n−1) atomic events, which makes (n − 1)2^(n−1) multiplications and 2^(n−1) − 1 additions. The inference is
apparently O(n2^n).

The inference by variable elimination deals with a trivial variable ordering P1 ≺ P2 ≺ · · · ≺ Pn. In
each step i = 1, . . . , n − 1, the factor for Pi and Pi+1 is computed and Pi is marginalized out:

Pr(Pi+1) = Σ_{Pi} Pr(Pi)Pr(Pi+1|Pi)

Each such step costs 4 multiplications and 2 additions, and there are n − 1 steps. Consequently, the inference
is O(n). Pr(pn) (and other marginal and conditional probabilities, which are even easier to obtain) can be
computed in linear time.

The linear chain is a graph whose largest clique does not grow with n and remains of size 2. That is why
the variable elimination procedure is extremely efficient.
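
A short sketch of the linear-time elimination on a binary chain (the CPTs are generated randomly, purely for illustration); each step folds Pi into a message over Pi+1, so the whole marginal costs O(n):

```python
# Linear-time marginalization on a binary chain P1 -> P2 -> ... -> Pn.
import random

n = 20
prior = {True: 0.3, False: 0.7}                               # Pr(P1), made up
cpts = []                                                     # Pr(P_{i+1} | P_i), random
for _ in range(n - 1):
    p, q = random.random(), random.random()
    cpts.append({True: {True: p, False: 1 - p},
                 False: {True: q, False: 1 - q}})

message = dict(prior)
for cpt in cpts:                                              # n - 1 steps, O(n) in total
    message = {v: sum(message[u] * cpt[u][v] for u in (True, False))
               for v in (True, False)}

print(message[True], message[True] + message[False])          # Pr(p_n); distribution sums to 1
```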

Exercise 8. For the network from Exercise 4 calculate the conditional probability Pr(p1|p2, ¬p3)
again. Apply an approximate sampling method. Discuss the pros and cons of rejection sampling,
likelihood weighting and Gibbs sampling. The table shown below gives an output of a uniform
random number generator on the interval (0,1); use the table to generate samples.

r1 r2 r3 r4 r5 r6 r7 r8 r9 r10
0.2551 0.5060 0.6991 0.8909 0.9593 0.5472 0.1386 0.1493 0.2575 0.8407
r11 r12 r13 r14 r15 r16 r17 r18 r19 r20
0.0827 0.9060 0.7612 0.1423 0.5888 0.6330 0.5030 0.8003 0.0155 0.6917

Let us start with rejection sampling. The variables must be topologically sorted first (current notation
meets the definition of topological ordering, P1 < P2 < P3 < P4 ). The individual samples will be
generated as follows:

• s1: Pr(p1) > r1 → p1,
• s1: Pr(p2|p1) > r2 → p2,
• s1: Pr(p3|p2) < r3 → ¬p3,
• s1: P4 is irrelevant for the given problem,
• s2: Pr(p1) < r4 → ¬p1,
• s2: Pr(p2|¬p1) < r5 → ¬p2, violates evidence, STOP.
• s3: Pr(p1) < r6 → ¬p1,
• s3: Pr(p2|¬p1) > r7 → p2,
• s3: Pr(p3|p2) > r8 → p3, violates evidence, STOP.
• ...

Using 20 random numbers we obtain the 8 samples shown in the table below. The samples s2, s3, s4, s5 and
s7 that contradict the evidence will be rejected. The rest of the samples allows us to estimate the target probability:

     s1  s2  s3  s4  s5  s6  s7  s8
P1   T   F   F   T   T   F   F   F
P2   T   F   T   F   F   T   F   T
P3   F   ?   T   ?   ?   F   ?   F
P4   ?   ?   ?   ?   ?   ?   ?   ?

Pr(p1|p2, ¬p3) ≈ N(p1, p2, ¬p3) / N(p2, ¬p3) = 1/3 = 0.33

Likelihood weighting does not reject any sample; it weights the generated samples instead. The
sample weight equals the likelihood of the evidence given the sampled values. The order of variables and the
way of their generation will be kept the same as before; however, the evidence variables will be kept fixed
(that is why the random numbers will be matched with different probabilities):

• s1: Pr(p1) > r1 → p1,
• w1: Pr(p2|p1)Pr(¬p3|p2) = .8 × .8 = 0.64,
• s2: Pr(p1) < r2 → ¬p1,
• w2: Pr(p2|¬p1)Pr(¬p3|p2) = .5 × .8 = 0.4,
• s3: Pr(p1) < r3 → ¬p1,
• w3: Pr(p2|¬p1)Pr(¬p3|p2) = .5 × .8 = 0.4,
• ...

Using the first 6 random numbers we obtain the 6 samples shown in the table below. The target probability is
estimated as the ratio of the weights of samples meeting the query condition to the total sum of weights:

      p^1  p^2  p^3  p^4  p^5  p^6
P1    T    F    F    F    F    F
P2    T    T    T    T    T    T
P3    F    F    F    F    F    F
P4    ?    ?    ?    ?    ?    ?
w^i   .64  .40  .40  .40  .40  .40

Pr(p1|p2, ¬p3) ≈ Σ_i w^i δ(p^i_1, p1) / Σ_i w^i = .64/2.64 = 0.24

Conclusion: Both sampling methods are consistent and converge to the target probability value .5161;
the number of samples must, however, be much larger than here. Rejection sampling suffers from a
large portion of generated but further unused samples (see s2–s5 and s7). Their proportion grows for
unlikely evidence on variables with high topological indices; Pr(p1|¬p3, p4) makes an example. For larger networks
it becomes inefficient. Likelihood weighting delivers smoother estimates; nevertheless, it suffers from
frequently insignificant sample weights under the conditions mentioned above.

Gibbs sampling removes the drawbacks of rejection sampling and likelihood weighting. On the other
hand, in order to be able to generate samples, the probabilities Pr(Pi|MB(Pi)), where MB stands
for the Markov blanket of the node Pi, must be computed. The blanket covers all of Pi's parents, children and
their parents. The computation is done for all relevant hidden (unevidenced and unqueried) variables. In
the given task, Pr(P1|P2) must be computed to represent the MB of P1. The other MB probabilities
Pr(P2|P1, P3, P4), Pr(P3|P2) and Pr(P4|P2) are not actually needed (P2 and P3 are evidence and
thus fixed, P4 is irrelevant); Pr(P3|P2) and Pr(P4|P2) are directly available anyway. However, finding
Pr(P1|P2) itself de facto solves the problem Pr(p1|p2, ¬p3). It follows that Gibbs sampling is advantageous
for larger networks where it holds ∀i = 1 . . . n: |MB(Pi)| ≪ n.

Conclusion: Gibbs sampling makes sense in large networks with ∀Pi: |MB(Pi)| ≪ n, where n stands
for the number of variables in the network.
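
For comparison, both sampling schemes can be simulated with a pseudo-random generator instead of the fixed table above; the sketch below is only an illustration of the two estimators on the Exercise 4 CPTs, not of the exact hand computation:

```python
# Rejection sampling and likelihood weighting for Pr(p1 | p2, ¬p3).
import random

Pr_P1 = 0.4
Pr_P2 = {True: 0.8, False: 0.5}        # Pr(p2 | P1)
Pr_P3 = {True: 0.2, False: 0.3}        # Pr(p3 | P2)

def rejection(n):
    accepted = hits = 0
    for _ in range(n):
        p1 = random.random() < Pr_P1
        p2 = random.random() < Pr_P2[p1]
        p3 = random.random() < Pr_P3[p2]
        if p2 and not p3:              # keep only samples consistent with the evidence
            accepted += 1
            hits += p1
    return hits / accepted

def likelihood_weighting(n):
    num = den = 0.0
    for _ in range(n):
        p1 = random.random() < Pr_P1          # only the non-evidence variable is sampled
        w = Pr_P2[p1] * (1 - Pr_P3[True])     # weight = Pr(p2|P1) Pr(¬p3|p2)
        num += w * p1
        den += w
    return num / den

print(rejection(100000), likelihood_weighting(100000))   # both converge to about 0.5161
```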

Exercise 9. Let us have three tram lines – 6, 22 and 24 – regularly coming to the stop in
front of the faculty building. Line 22 operates more frequently than line 24, and 24 goes more
often than line 6 (the ratio is 5:3:2 and it is kept during all the hours of operation). Line 6
uses a single-car setting in 9 out of 10 cases during the daytime; in the evening it always
has a single car. Line 22 has one car rarely and only in the evenings (1 out of 10 tramcars).
Line 24 can be short at any time; however, it takes a long setting with 2 cars in 8 out of 10
cases. Albertov is reachable by line 24; lines 6 and 22 are headed in the direction of IP
Pavlova. The direction changes only when a tram goes to the depot (let 24 have its depot in
the direction of IP Pavlova, while 6 and 22 have their depots in the direction of Albertov). Every
tenth tram goes to the depot, evenly throughout the operation. The evening regime is from
6pm to midnight, the daytime regime is from 6am to 6pm.

a) Draw a correct, efficient and causal Bayesian network.

b) Annotate the network with the conditional probability tables.

c) It is evening. A short tram is approaching the stop. What is the probability it will go
to Albertov?

d) There is a tram 22 standing at the stop. How many cars does it have?

Ad a) and b)

Which conditional independence relationships truly hold?

• E ⊥⊥ N|∅ – if the tram length is not known, then the tram number has nothing to do with time.
• L ⊥⊥ D|N – if the tram number is known, then the tram length and its direction become independent.
• E ⊥⊥ D|N – if the tram number is known, then time does not change the tram direction.
• E ⊥⊥ D|∅ – if the tram length is not known, then time and the tram direction are independent.

Ad c) We enumerate Pr(D = albertov|E = evng, L = short); the path from E to D is open (E is
connected via the evidenced converging node L, L connects to the unevidenced diverging node N; it holds
D >> E|L, D >> L|∅). The enumeration can be simplified by reordering the variables and eliminating the
sum over D in the denominator:

Pr(D = albertov|E = evng, L = short) = Pr(D = albertov, E = evng, L = short) / Pr(E = evng, L = short) =
= Σ_N Pr(E = evng, N, L = short, D = albertov) / Σ_{N,D} Pr(E = evng, N, L = short, D) =
= Pr(E = evng) Σ_N Pr(N)Pr(L = short|E = evng, N)Pr(D = albertov|N) / [Pr(E = evng) Σ_N Pr(N)Pr(L = short|E = evng, N) Σ_D Pr(D|N)] =
= Σ_N Pr(N)Pr(L = short|E = evng, N)Pr(D = albertov|N) / Σ_N Pr(N)Pr(L = short|E = evng, N) =
= (1/5 × 1 × 1/10 + 1/2 × 1/10 × 1/10 + 3/10 × 1/5 × 9/10) / (1/5 × 1 + 1/2 × 1/10 + 3/10 × 1/5) = .2548

Ad d) In order to get Pr(L = long|N = l22), the information available in the nodes E and L suffices;
we only sum out the variable E:

Pr(L = long|N = l22) = Σ_E Pr(L = long, E|N = l22) =
= Σ_E Pr(E|N = l22) × Pr(L = long|E, N = l22) =
= Σ_E Pr(E) × Pr(L = long|E, N = l22) =
= 1/3 × 9/10 + 2/3 × 1 = .9667

Alternatively, we may simply follow the definition of conditional probability and benefit from two facts:
1) the node D is irrelevant as it is an unobserved leaf (it is also d-separated from L by the observed diverging
node N), 2) Pr(N = l22) cancels out of the expression; even if it did not, it is available directly in
the network and would not have to be computed. Then:

Pr(L = long|N = l22) = Σ_E Pr(L = long, E, N = l22) / Pr(N = l22) =
= Pr(N = l22) Σ_E Pr(E) × Pr(L = long|E, N = l22) / Pr(N = l22) =
= Σ_E Pr(E) × Pr(L = long|E, N = l22) =
= 1/3 × 9/10 + 2/3 × 1 = .9667
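
Both answers can be recomputed by direct enumeration; the sketch below uses the CPT values read off the solution above (Pr(E = evening) = 1/3, Pr(N) = 0.2/0.5/0.3 for lines 6/22/24, Pr(L = short|N, E) and Pr(D = Albertov|N) as in the fractions above):

```python
# Enumeration for answers c) and d) of the tram exercise.
Pr_E = {'evening': 1/3, 'day': 2/3}
Pr_N = {'6': 0.2, '22': 0.5, '24': 0.3}
Pr_short = {('6', 'day'): 0.9, ('6', 'evening'): 1.0,
            ('22', 'day'): 0.0, ('22', 'evening'): 0.1,
            ('24', 'day'): 0.2, ('24', 'evening'): 0.2}     # Pr(L = short | N, E)
Pr_albertov = {'6': 0.1, '22': 0.1, '24': 0.9}              # Pr(D = Albertov | N)

# c) Pr(D = Albertov | E = evening, L = short)
num = sum(Pr_N[n] * Pr_short[(n, 'evening')] * Pr_albertov[n] for n in Pr_N)
den = sum(Pr_N[n] * Pr_short[(n, 'evening')] for n in Pr_N)
print(num / den)                                            # 0.2548...

# d) Pr(L = long | N = 22)
print(sum(Pr_E[e] * (1 - Pr_short[('22', e)]) for e in Pr_E))   # 0.9667...
```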

Exercise 10. Trace the algorithm of belief propagation in the network below knowing
that e = {p2 }. Show the individual steps, be as detailed as possible. Explain in which way
the unevidenced converging node P3 blocks the path between nodes P1 and P2 .

Obviously, the posterior probabilities are Pr*(P1) = Pr(P1|p2) = Pr(P1), Pr*(p2) = 1 and
Pr*(P3) = Pr(P3|p2) = Σ_{P1} Pr(P1)Pr(P3|p2, P1) (Pr*(p3) = 0.52). The same values must be reached by
belief propagation when message passing stops and the node probabilities are computed as follows:
Pr*(Pi) = αi × π(Pi) × λ(Pi).

Belief propagation starts with the following list of initialization steps:

1. Unobserved root P1 sets its compound causal parameter π(P1): π(p1) = 0.4, π(¬p1) = 0.6.
2. Observed root P2 sets its compound causal parameter π(P2): π(p2) = 1, π(¬p2) = 0.
3. Unobserved leaf P3 sets its compound diagnostic parameter λ(P3): λ(p3) = 1, λ(¬p3) = 1.

Then, iteration steps are carried out:

1. P1 knows its compound π and misses one λ from its children only, it can send π_P3(P1) to P3:
π_P3(P1) = α1 π(P1) → α1 = 1, π_P3(p1) = π(p1) = 0.4, π_P3(¬p1) = π(¬p1) = 0.6
2. P2 knows its compound π and misses one λ from its children only, it can send π_P3(P2) to P3:
π_P3(P2) = α2 π(P2) → α2 = 1, π_P3(p2) = π(p2) = 1, π_P3(¬p2) = π(¬p2) = 0
3. P3 received all π messages from its parents, it can compute its compound π(P3):
π(P3) = Σ_{P1,P2} Pr(P3|P1, P2) Π_{j=1,2} π_P3(Pj)
π(p3) = .1 × .4 × 1 + .8 × .6 × 1 = 0.52
π(¬p3) = .9 × .4 × 1 + .2 × .6 × 1 = 0.48 = 1 − π(p3)
4. P3 knows its compound λ and misses no π from its parents, it can send λ_P3(P1) and λ_P3(P2):
λ_P3(Pj) = Σ_{P3} λ(P3) Σ_{Pk, k≠j} Pr(P3|P1, P2) Π_{k≠j} π_P3(Pk)
λ_P3(p1) = Σ_{P3} λ(P3) Σ_{P2} Pr(P3|p1, P2) π_P3(P2) = 1 × 1 × (.1 + .9) = 1
λ_P3(p2) = Σ_{P3} λ(P3) Σ_{P1} Pr(P3|P1, p2) π_P3(P1) = 1 × (.4 × (.1 + .9) + .6 × (.8 + .2)) = 1
5. P3 knows both its compound parameters, it can compute its posterior Pr*(P3):
Pr*(P3) = α3 π(P3) λ(P3)
Pr*(p3) = α3 × .52 × 1 = .52 α3, Pr*(¬p3) = α3 × .48 × 1 = .48 α3,
Pr*(p3) + Pr*(¬p3) = 1 → α3 = 1, Pr*(p3) = .52, Pr*(¬p3) = .48
6. P1 received all of its λ messages, the compound λ(P1) can be computed:
λ(P1) = Π_j λ_Pj(P1) = λ_P3(P1)
λ(p1) = λ_P3(p1) = 1, λ(¬p1) = λ_P3(¬p1) = 1
7. P1 knows both its compound parameters, it can compute its posterior Pr*(P1):
Pr*(P1) = α4 π(P1) λ(P1)
Pr*(p1) = α4 × .4 × 1 = .4 α4, Pr*(¬p1) = α4 × .6 × 1 = .6 α4,
Pr*(p1) + Pr*(¬p1) = 1 → α4 = 1, Pr*(p1) = .4, Pr*(¬p1) = .6

Conclusion: Belief propagation reaches the correct posterior probabilities. The blocking effect of P3
manifests itself in Step 4. Since P3 is an unevidenced leaf, λ(P3) has a uniform distribution (i.e., λ(p3) =
λ(¬p3) = 1 as P3 is a binary variable). It is easy to show that arbitrary normalized causal messages
coming to P3 cannot change this distribution (it holds Σ_{Pj} π_P3(Pj) = 1). The reason is that it always
holds Σ_{P3} Pr(P3|P1, P2) = 1. Step 4 can be skipped by putting λ_P3(Pj) = 1 automatically without
waiting for the causal parameters.
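
A compact numeric check of the message passing above; note that the ¬p2 column of Pr(p3|P1, P2) is not given in the exercise, so a placeholder value is used (it is multiplied by π(¬p2) = 0 and therefore does not influence the result):

```python
# π/λ messages for the v-structure P1 -> P3 <- P2 with evidence e = {p2}.
pi_P1 = {True: 0.4, False: 0.6}        # causal message from the unobserved root P1
pi_P2 = {True: 1.0, False: 0.0}        # causal message from the observed root P2
Pr_P3 = {(True, True): 0.1, (False, True): 0.8,     # Pr(p3 | P1, P2), from Step 3
         (True, False): 0.5, (False, False): 0.5}   # placeholder values for ¬p2
lam_P3 = {True: 1.0, False: 1.0}       # unobserved leaf: uniform diagnostic parameter

# Compound causal parameter of P3 (Step 3); λ(P3) is uniform, so this is Pr*(p3).
pi_p3 = sum(Pr_P3[(p1, p2)] * pi_P1[p1] * pi_P2[p2]
            for p1 in (True, False) for p2 in (True, False))
print(pi_p3)                           # 0.52

# Step 4: the diagnostic message λ_P3(P1) sums over P3 and P2, weighted by λ(P3)
# and π_P3(P2); it comes out uniform, so Pr*(P1) stays equal to Pr(P1).
lam_to_P1 = {p1: sum(lam_P3[p3] * (Pr_P3[(p1, p2)] if p3 else 1 - Pr_P3[(p1, p2)]) * pi_P2[p2]
                     for p3 in (True, False) for p2 in (True, False))
             for p1 in (True, False)}
print(lam_to_P1)                       # {True: 1.0, False: 1.0}
```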

3 (Conditional) independence tests, best network structure

Exercise 11. Let us consider the frequency table shown below. Decide on the independence
relationships between A and B.

c ¬c
b ¬b b ¬b
a 14 8 25 56
¬a 54 25 7 11

The relationships of independence (A ⊥⊥ B|∅) and conditional independence (A ⊥⊥ B|C) represent
two possibilities under consideration. We will present three different approaches to their analysis. The
first one is based directly on the definition of independence and it is illustrative only. The other two
approaches represent practically applicable methods.

Approach 1: Simple comparison of (conditional) probabilities.

Independence is equivalent to the following formulae:

A ⊥⊥ B|∅ ⇔ Pr(A, B) = Pr(A)Pr(B) ⇔ Pr(A|B) = Pr(A) ∧ Pr(B|A) = Pr(B)

The above-mentioned probabilities can be estimated from data by maximum likelihood estimation (MLE):

Pr(a|b) = 39/100 = 0.39, Pr(a|¬b) = 64/100 = 0.64, Pr(a) = 103/200 = 0.515
Pr(b|a) = 39/103 = 0.38, Pr(b|¬a) = 61/97 = 0.63, Pr(b) = 100/200 = 0.5

Conditional independence is equivalent to the following formulae:

A ⊥⊥ B|C ⇔ Pr(A, B|C) = Pr(A|C)Pr(B|C) ⇔ Pr(A|B, C) = Pr(A|C) ∧ Pr(B|A, C) = Pr(B|C)

Again, MLE can be applied:

Pr(a|b, c) = 14/68 = 0.21, Pr(a|¬b, c) = 8/33 = 0.24, Pr(a|c) = 22/101 = 0.22
Pr(a|b, ¬c) = 25/32 = 0.78, Pr(a|¬b, ¬c) = 56/67 = 0.84, Pr(a|¬c) = 81/99 = 0.82
Pr(b|a, c) = 14/22 = 0.64, Pr(b|¬a, c) = 54/79 = 0.68, Pr(b|c) = 68/101 = 0.67
Pr(b|a, ¬c) = 25/81 = 0.31, Pr(b|¬a, ¬c) = 7/18 = 0.39, Pr(b|¬c) = 32/99 = 0.32

In this particular case it is easy to see that the independence relationship is unlikely: the independence
equalities do not hold. On the contrary, conditional independence seems to hold, as the definition equalities
are roughly met. However, it is obvious that we need a more rigorous tool to make clear decisions. Two
of them will be demonstrated.

Approach 2: Statistical hypothesis testing.

Pearson’s χ2 independence test represents one of the most common options for independence testing.
A⊥ ⊥ B|∅ is checked by application of the test on a contingency (frequency) table counting A and B
co-occurrences (the left table):

O_AB    b     ¬b    sum        E_AB    b      ¬b     sum
a       39    64    103        a       51.5   51.5   103
¬a      61    36    97         ¬a      48.5   48.5   97
sum     100   100   200        sum     100    100    200

The null hypothesis is independence of A and B. The test works with the frequencies expected under
the null hypothesis (the table on the right):

E_AB = (N_A × N_B) / N → E_a¬b = (N_a × N_¬b) / N = (103 × 100) / 200 = 51.5

The test compares these expected frequencies to the observed ones. The test statistic is:

χ² = Σ_{A,B} (O_AB − E_AB)² / E_AB = 12.51 ≫ χ²(α = 0.05, df = 1) = 3.84

The null hypothesis is rejected in favor of the alternative hypothesis that A and B are actually dependent
when the test statistic is larger than its tabular value. In our case, we took the tabular value for the
common significance level α = 0.05; df is derived from the size of the contingency table (df = (r − 1)(c − 1),
where df stands for degrees of freedom, and r and c stand for the number of rows and columns in the
contingency table). Under the assumption that the null hypothesis holds, a frequency table with the
observed or higher deviation from the expected counts can occur only with the negligible probability p =
0.0004 ≪ α. Variables A and B are dependent.

The A ⊥⊥ B|C hypothesis can be tested analogically¹. The χ² test statistic will be computed separately for
the contingency tables corresponding to c and ¬c. The total value equals the sum of both partial
statistics and has two degrees of freedom.

O_AB         c                  ¬c
        b    ¬b    sum     b    ¬b    sum
a       14   8     22      25   56    81
¬a      54   25    79      7    11    18
sum     68   33    101     32   67    99

E_AB          c                   ¬c
        b     ¬b    sum     b     ¬b    sum
a       14.8  7.2   22      26.2  54.8  81
¬a      53.2  25.8  79      5.8   12.2  18
sum     68    33    101     32    67    99

The null hypothesis is conditional independence, the alternative hypothesis is the full/saturated model
with all parameters. The test statistic is:
χ² = Σ_{A,B|C} (O_AB|C − E_AB|C)² / E_AB|C = 0.175 + 0.435 = 0.61 ≪ χ²(α = 0.05, df = 2) = 5.99

The null hypothesis cannot be rejected in favor of the alternative hypothesis based on the saturated
model at the significance level α = 0.05. A frequency table with the given or higher deviation from
the expected values is likely to be observed when dealing with the conditional independence model –
p = 0.74 ≫ α. Variables A and B are conditionally independent given C. Variable C explains
the dependence between A and B.
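
Both tests can be reproduced with scipy; the sketch below feeds the observed counts from the tables above to Pearson's χ² test and sums the two per-stratum statistics for the conditional case:

```python
# Pearson's chi-square tests for A ⊥⊥ B|∅ and A ⊥⊥ B|C.
import numpy as np
from scipy.stats import chi2_contingency, chi2

# Marginal table of A x B (summed over C):
AB = np.array([[39, 64],
               [61, 36]])
stat, p, dof, expected = chi2_contingency(AB, correction=False)
print(stat, p)                  # ~12.51, p ~ 0.0004 -> A and B are dependent

# Conditional test: sum the two per-stratum statistics (c and ¬c), df = 2.
AB_c = np.array([[14, 8], [54, 25]])
AB_notc = np.array([[25, 56], [7, 11]])
stat_c, _, _, _ = chi2_contingency(AB_c, correction=False)
stat_notc, _, _, _ = chi2_contingency(AB_notc, correction=False)
total = stat_c + stat_notc
print(total, chi2.sf(total, df=2))   # ~0.61, p ~ 0.74 -> cannot reject A ⊥⊥ B | C
```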

Approach 3: Model scoring.

Let us evaluate the null (A and B independent) and the alternative (A and B dependent) models of two
variables, see the figure below.
¹ In practice, Pearson's χ² independence test is not used to test conditional independence because of its low
power. It can be replaced for example by the likelihood-ratio test. This test compares the likelihood of the
null model (AC, BC) with the likelihood of the alternative model (AC, BC, AB). The null model assumes no
interaction between A and B; it concerns only the A–C and B–C interactions. The
alternative model assumes a potential relationship between A and B as well.

Figure: the null model (left) and the alternative model (right).

BIC (and the Bayesian criterion) will be calculated for both models. The structure with the higher score will be
taken. At the same time, we will use the likelihood values enumerated in terms of BIC to perform the
likelihood-ratio statistical test.

ln L_null = (39 + 64) ln(103/200 × 100/200) + (61 + 36) ln(97/200 × 100/200) = −277.2
ln L_alt = 39 ln(103/200 × 39/103) + 64 ln(103/200 × 64/103) + 61 ln(97/200 × 61/97) + 36 ln(97/200 × 36/97) = −270.8
BIC(null) = −(K/2) ln M + ln L_null = −(2/2) ln 200 − 277.2 = −282.5
BIC(alt) = −(K/2) ln M + ln L_alt = −(3/2) ln 200 − 270.8 = −278.8
BIC(null) < BIC(alt) ⇔ the alternative model is more likely, the null hypothesis A ⊥⊥ B|∅ does not hold.

The Bayesian score is more difficult to compute; the evaluation was carried out in Matlab BNT (the function
score_dags): ln Pr(D|null) = −282.9 < ln Pr(D|alt) = −279.8 ⇔ the alternative model is more likely,
the null hypothesis A ⊥⊥ B|∅ does not hold.

The likelihood-ratio statistical test:

D = −2(ln L_null − ln L_alt) = −2(−277.2 + 270.8) = 12.8

The D statistic follows the χ² distribution with 3 − 2 = 1 degree of freedom. Under the null hypothesis,
such a value has p = 0.0003 and we can reject it.

Conclusion: Variables A and B are dependent.
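
The two-variable BIC comparison can be replicated without BNT; the sketch below writes out the log-likelihoods of the independent and saturated models explicitly (numpy only, counts as above):

```python
# BIC comparison of the independent (null) and saturated (alternative) models.
import numpy as np

counts = np.array([[39, 64],     # rows: a, ¬a; columns: b, ¬b
                   [61, 36]])
M = counts.sum()                 # 200 samples

p_a = counts.sum(axis=1) / M     # marginal Pr(A)
p_b = counts.sum(axis=0) / M     # marginal Pr(B)

ll_null = (counts * np.log(np.outer(p_a, p_b))).sum()   # independence model
ll_alt = (counts * np.log(counts / M)).sum()            # saturated model

bic_null = -2 / 2 * np.log(M) + ll_null    # K = 2 free parameters
bic_alt = -3 / 2 * np.log(M) + ll_alt      # K = 3 free parameters
print(ll_null, ll_alt)          # ~ -277.2, -270.8
print(bic_null, bic_alt)        # ~ -282.5, -278.8: the dependent model wins

D = -2 * (ll_null - ll_alt)     # likelihood-ratio statistic, df = 1
print(D)                        # ~ 12.8
```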

Analogically, we will compare the null (A and B conditionally independent) and the alternative (A and
B conditionally dependent) models of three variables, see the figure below.

Figure: the null model (left) and the alternative model (right).

We will compare their scores, the structure with a higher score wins.

(ln L_null and ln L_alt were computed in Matlab BNT, the function log_lik_complete):
BIC(null) = −(K/2) ln M + ln L_null = −(5/2) ln 200 − 365.1 = −377.9
BIC(alt) = −(K/2) ln M + ln L_alt = −(7/2) ln 200 − 364.3 = −382.9
BIC(null) > BIC(alt) ⇔ the null model has a higher score, the hypothesis A ⊥⊥ B|C holds.

Bayesian score (carried out in Matlab BNT, the function score_dags): ln Pr(D|null) = −379.4 >
ln Pr(D|alt) = −385.5 ⇔ the alternative model has a lower posterior probability, the model assuming
A ⊥⊥ B|C will be used.

The likelihood-ratio statistical test:

D = −2(ln L_null − ln L_alt) = −2(−365.1 + 364.3) = 1.6

The D statistic follows the χ² distribution with 7 − 5 = 2 degrees of freedom. Assuming that the null hypothesis is
true, the probability of observing a D value of at least 1.6 is p = 0.45. As p > α, the null hypothesis
cannot be rejected.

Conclusion: Variables A and B are conditionally independent given C.


Exercise 12. Let us consider the network structure shown in the figure below. Our goal is to
calculate the maximum likelihood (ML), maximum a posteriori (MAP) and Bayesian estimates
of the parameter θ = Pr(b|a). Four samples are available (see the table). We also know that
the prior distribution of Pr(b|a) is Beta(3,3).
A B
T T
F F
T T
F F

MLE of Pr(b|a): θ̂ = arg max_θ L_B(θ : D) = N(a, b)/N(a) = 2/2 = 1

MLE sets Pr(b|a) to maximize the probability of the observations. It finds the maximum of the function
Pr(b|a)² (1 − Pr(b|a))⁰ shown in the left graph.

The MAP estimate maximizes the posterior probability of the parameter; it takes the prior distribution into
consideration as well: Beta(α, β) = θ^(α−1) (1 − θ)^(β−1) / B(α, β), where B plays the role of a normalization constant.
The prior distribution Beta(3,3) is shown in the middle graph. Pr(b|a) is expected to be around 0.5; nevertheless,
the assumption is not strong (its strength corresponds to a prior observation of four samples with
positive A, two of them having positive B as well).

MAP of Pr(b|a): θ̂ = arg max_θ Pr(θ|D) = (N(a, b) + α − 1)/(N(a) + α + β − 2) = 2/3

The posterior distribution is proportional to Pr(b|a)⁴ (1 − Pr(b|a))²; the estimated value was shifted
towards the prior. See the graph to the right.

Similarly to MAP, the Bayesian estimate deals with the posterior distribution Pr(Pr(b|a)|D). Unlike MAP,
it takes its expected value. Bayesian estimation of Pr(b|a):
θ̂ = E{θ|D} = (N(a, b) + α)/(N(a) + α + β) = 5/8
The expected value can be interpreted as the center of gravity of the posterior distribution shown in the
graph to the right.
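
The three estimates follow directly from the Beta–binomial conjugacy; a short sketch with the counts and the Beta(3,3) prior of the exercise (scipy is used only for the posterior mean):

```python
# ML, MAP and Bayesian estimates of θ = Pr(b|a).
from scipy.stats import beta

N_a, N_ab = 2, 2            # samples with A = T, and with A = T, B = T
alpha, b_ = 3, 3            # Beta(3, 3) prior on θ

mle = N_ab / N_a                                         # 1.0
map_est = (N_ab + alpha - 1) / (N_a + alpha + b_ - 2)    # 2/3
posterior = beta(alpha + N_ab, b_ + (N_a - N_ab))        # Beta(5, 3) posterior
print(mle, map_est, posterior.mean())                    # 1.0, 0.667, 0.625 = 5/8
```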

4 Dynamic Bayesian networks
Exercise 13. A patient has a disease N . Physicians measure the value of a parameter P
to see the disease development. The parameter can take one of the following values {low,
medium, high}. The value of P is a result of patient’s unobservable condition/state S. S can
be {good, poor}. The state changes between two consecutive days in one fifth of cases. If the
patient is in good condition, the value for P is rather low (having 10 sample measurements,
5 of them are low, 3 medium and 2 high), while if the patient is in poor condition, the value
is rather high (having 10 measurements, 3 are low, 3 medium and 4 high). On arrival at
the hospital on day 0, the patient's condition was unknown, i.e., Pr(S0 = good) = 0.5.

a) Draw the transition and sensor model of the dynamic Bayesian network modeling the
domain under consideration,

b) calculate probability that the patient is in good condition on day 2 given low P values
on days 1 and 2,

c) can you determine the most likely patient state sequence on days 0, 1 and 2 without
any additional computations? Justify your answer.

ad a) The transition model describes the causality between consecutive states, the sensor model describes
the relationship between the current state and the current evidence. See both models in the figure below:

ad b) Pr(s2|P1 = low, P2 = low) will be enumerated (the notation is: s good state, ¬s poor state). It
is a typical filtering task:

Pr(S1|P1 = low) = α1 Pr(P1 = low|S1) Σ_{S0 ∈ {s0,¬s0}} Pr(S1|S0)Pr(S0)

Pr(s1|P1 = low) = α1 × 0.5 × 0.5 = 0.625
Pr(¬s1|P1 = low) = α1 × 0.3 × 0.5 = 0.375

Pr(S2|P1 = low, P2 = low) = α2 Pr(P2 = low|S2) Σ_{S1 ∈ {s1,¬s1}} Pr(S2|S1)Pr(S1|P1 = low)

Pr(s2|P1 = low, P2 = low) = α2 × 0.5 × (0.8 × 0.625 + 0.2 × 0.375) = α2 × 0.2875 = 0.6928
Pr(¬s2|P1 = low, P2 = low) = α2 × 0.3 × (0.2 × 0.625 + 0.8 × 0.375) = α2 × 0.1275 = 0.3072

The same task can be posed as a classical inference task:

Pr(s2|P1 = low, P2 = low) = Pr(s2, P1 = low, P2 = low) / Pr(P1 = low, P2 = low) =
= Σ_{S0,S1} Pr(S0, S1, s2, P1 = low, P2 = low) / Σ_{S0,S1,S2} Pr(S0, S1, S2, P1 = low, P2 = low) =
= [Pr(s0)Pr(s1|s0)Pr(s2|s1)Pr(P1 = low|s1)Pr(P2 = low|s2) + . . .] / [Pr(s0)Pr(s1|s0)Pr(s2|s1)Pr(P1 = low|s1)Pr(P2 = low|s2) + . . .] = . . .
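
The filtering recursion of part b) is easy to run as code; the sketch below uses the transition and sensor numbers stated in the exercise (state persists with probability 0.8, Pr(low|good) = 0.5, Pr(low|poor) = 0.3):

```python
# Forward filtering for two consecutive "low" observations.
transition = {True: {True: 0.8, False: 0.2},   # Pr(S_t | S_{t-1}), True = good state
              False: {True: 0.2, False: 0.8}}
Pr_low = {True: 0.5, False: 0.3}               # Pr(P = low | S)

belief = {True: 0.5, False: 0.5}               # Pr(S0)
for _ in range(2):                             # two low observations (days 1 and 2)
    predicted = {s: sum(belief[s0] * transition[s0][s] for s0 in (True, False))
                 for s in (True, False)}       # prediction step
    unnorm = {s: Pr_low[s] * predicted[s] for s in (True, False)}   # update step
    z = sum(unnorm.values())
    belief = {s: unnorm[s] / z for s in (True, False)}

print(belief[True])                            # ~ 0.6928
```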

ad c) No, we cannot. The most likely explanation task Pr(S1:2|P1:2) is distinct from filtering
and smoothing. The states interact; moreover, on day 1 filtering computes Pr(s1|P1 = low) instead of
Pr(s1|P1 = low, P2 = low). The Viterbi algorithm (a dynamic programming algorithm used in HMMs) needs
to be applied.

