Introduction to Belief Networks

David Barber
University College London

These slides accompany the book Bayesian Reasoning and Machine Learning. The book and demos can be downloaded from www.cs.ucl.ac.uk/staff/D.Barber/brml. Feedback and corrections are also available on the site. Feel free to adapt these slides for your own purposes, but please include a link to the above website.
Belief Networks (Bayesian Networks)

A belief network is a directed acyclic graph in which each node has an associated conditional probability of the node given its parents.

The joint distribution is obtained by taking the product of the conditional probabilities:

p(A, B, C, D, E) = p(A)p(B)p(C|A, B)p(D|C)p(E|B, C)

[Figure: DAG with nodes A, B, C, D, E and edges A → C, B → C, C → D, B → E, C → E; the conditional p(E|B, C) labels the link into E]
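As a concrete illustration, here is a minimal numpy sketch that builds this joint as a product of conditional probability tables. The CPT values and variable names are invented for illustration; only the factorisation comes from the slide.

```python
import numpy as np

# Hypothetical CPTs for binary variables (values invented), indexed so that
# e.g. pC_AB[c, a, b] = p(C = c | A = a, B = b)
pA = np.array([0.4, 0.6])                       # p(A)
pB = np.array([0.7, 0.3])                       # p(B)
pC1_AB = np.array([[0.1, 0.5], [0.4, 0.9]])     # p(C=1 | A=a, B=b)
pC_AB = np.stack([1 - pC1_AB, pC1_AB])          # [c, a, b]
pD_C = np.array([[0.9, 0.2], [0.1, 0.8]])       # pD_C[d, c] = p(D=d | C=c)
pE1_BC = np.array([[0.3, 0.6], [0.2, 0.8]])     # p(E=1 | B=b, C=c)
pE_BC = np.stack([1 - pE1_BC, pE1_BC])          # [e, b, c]

# Joint p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|C) p(E|B,C) as a 5-dim array
joint = np.einsum('a,b,cab,dc,ebc->abcde', pA, pB, pC_AB, pD_C, pE_BC)
assert np.isclose(joint.sum(), 1.0)             # normalised by construction
```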
Example – Part I
Sally’s burglar Alarm is sounding. Has she been Burgled, or was the alarm
triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering
Without loss of generality, we can write
p(A, R, E, B) = p(A|R, E, B)p(R, E, B)
= p(A|R, E, B)p(R|E, B)p(E, B)
= p(A|R, E, B)p(R|E, B)p(E|B)p(B)

Assumptions:
The alarm is not directly influenced by any report on the radio,
p(A|R, E, B) = p(A|E, B)
The radio broadcast is not directly influenced by the burglar variable,
p(R|E, B) = p(R|E)
Burglaries don’t directly ‘cause’ earthquakes, p(E|B) = p(E)
Therefore
p(A, R, E, B) = p(A|E, B)p(R|E)p(E)p(B)
Example – Part II: Specifying the Tables

[Figure: DAG with nodes B, E, A, R and edges B → A, E → A, E → R]

p(A = 1|B, E):

Burglar   Earthquake   p(A = 1|B, E)
1         1            0.9999
1         0            0.99
0         1            0.99
0         0            0.0001

p(R = 1|E):

Earthquake   p(R = 1|E)
1            1
0            0
The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables
and graphical structure fully specify the distribution.
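These tables translate directly into code; a minimal numpy sketch (the array layout and names are my choice):

```python
import numpy as np

pB = np.array([0.99, 0.01])          # p(B), states [0, 1]
pE = np.array([1 - 1e-6, 1e-6])      # p(E)
# pA1[b, e] = p(A=1 | B=b, E=e), from the table above
pA1 = np.array([[0.0001, 0.99],
                [0.99, 0.9999]])
pA_BE = np.stack([1 - pA1, pA1])     # pA_BE[a, b, e] = p(A=a | B=b, E=e)
# pR_E[r, e] = p(R=r | E=e): the radio reports an earthquake exactly when E=1
pR_E = np.array([[1.0, 0.0],
                 [0.0, 1.0]])

# Joint p(B, E, A, R) = p(A|B,E) p(R|E) p(E) p(B)
joint = np.einsum('b,e,abe,re->bear', pB, pE, pA_BE, pR_E)
assert np.isclose(joint.sum(), 1.0)
```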
Example – Part III: Inference
Initial Evidence: The alarm is sounding
p(B = 1|A = 1) = \frac{\sum_{E,R} p(B = 1, E, A = 1, R)}{\sum_{B,E,R} p(B, E, A = 1, R)}
             = \frac{\sum_{E,R} p(A = 1|B = 1, E) p(B = 1) p(E) p(R|E)}{\sum_{B,E,R} p(A = 1|B, E) p(B) p(E) p(R|E)} ≈ 0.99

Additional Evidence: The radio broadcasts an earthquake warning:

A similar calculation gives p(B = 1|A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she’s been burgled. However,
this probability drops dramatically when she hears that there has been an
earthquake.

The earthquake ‘explains away’, to an extent, the fact that the alarm is ringing.
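A minimal sketch of both calculations, reusing the tables from the previous slides:

```python
import numpy as np

pB, pE = np.array([0.99, 0.01]), np.array([1 - 1e-6, 1e-6])
pA1 = np.array([[0.0001, 0.99], [0.99, 0.9999]])      # p(A=1|B=b,E=e)
pA_BE = np.stack([1 - pA1, pA1])                      # [a, b, e]
pR_E = np.array([[1.0, 0.0], [0.0, 1.0]])             # [r, e]

joint = np.einsum('b,e,abe,re->bear', pB, pE, pA_BE, pR_E)  # p(B,E,A,R)

# p(B=1 | A=1): clamp A=1, sum out E and R, normalise over B
pB_A1 = joint[:, :, 1, :].sum(axis=(1, 2))
print(pB_A1[1] / pB_A1.sum())        # ~0.99

# p(B=1 | A=1, R=1): clamp A=1 and R=1, sum out E, normalise over B
pB_A1R1 = joint[:, :, 1, 1].sum(axis=1)
print(pB_A1R1[1] / pB_A1R1.sum())    # ~0.01
```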
Uncertain Evidence

In soft/uncertain evidence the variable is in more than one state, with the strength
of our belief about each state being given by probabilities. For example, if y has
the states dom(y) = {red, blue, green} the vector (0.6, 0.1, 0.3) could represent the
probabilities of the respective states.

hard evidence
We are certain that a variable is in a particular state. In this case, all the probability mass is in one of the vector components, for example (0, 0, 1).

inference
Inference with soft-evidence can be achieved using Bayes’ rule. Writing the soft
evidence as ỹ, we have
p(x|ỹ) = \sum_y p(x|y) p(y|ỹ)

where p(y = i|ỹ) represents the probability that y is in state i under the
soft-evidence.
Jeffrey’s rule

For variables x, y, and p_1(x, y), how do we form a joint distribution given
soft-evidence ỹ?

Form the conditional

We first define

p_1(x|y) = \frac{p_1(x, y)}{\sum_x p_1(x, y)}

Define the joint


The soft evidence p(y|ỹ) then defines a new joint distribution

p_2(x, y|ỹ) = p_1(x|y) p(y|ỹ)

One can therefore view soft evidence as defining a new joint distribution. We use a
dashed circle to represent a variable in an uncertain state.
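A small sketch of Jeffrey's rule, with an invented joint p_1(x, y) and the soft-evidence vector (0.6, 0.1, 0.3) from the earlier example:

```python
import numpy as np

# A hypothetical joint p_1(x, y) over 2 x-states and 3 y-states
p1_xy = np.array([[0.10, 0.20, 0.05],
                  [0.30, 0.25, 0.10]])

p1_x_given_y = p1_xy / p1_xy.sum(axis=0, keepdims=True)   # p_1(x|y)

py_tilde = np.array([0.6, 0.1, 0.3])                      # soft evidence p(y|ỹ)

p2_xy = p1_x_given_y * py_tilde                           # p_2(x, y|ỹ) = p_1(x|y) p(y|ỹ)
px_given_ytilde = p2_xy.sum(axis=1)                       # p(x|ỹ) = Σ_y p_1(x|y) p(y|ỹ)
assert np.isclose(px_given_ytilde.sum(), 1.0)
```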
Uncertain evidence example

[Figure: the burglar DAG with B → A, E → A, E → R; the node A is drawn with a dashed circle to indicate its uncertain state]

Revisiting the earthquake scenario, we think we hear the burglar alarm sounding,
but are not sure, specifically p(A = tr) = 0.7. For this binary variable case we
represent this soft-evidence for the states (tr, fa) as à = (0.7, 0.3). What is the
probability of a burglary under this soft-evidence?
p(B = tr|Ã) = \sum_A p(B = tr|A) p(A|Ã) = p(B = tr|A = tr) × 0.7 + p(B = tr|A = fa) × 0.3 ≈ 0.6930

This value is lower than 0.99, the probability of being burgled when we are sure we
heard the alarm. The probabilities p(B = tr|A = tr) and p(B = tr|A = fa) are
calculated using Bayes’ rule from the original distribution, as before.
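A sketch of this calculation (tables as before; p(B = tr|A) is obtained from the joint by Bayes' rule):

```python
import numpy as np

pB, pE = np.array([0.99, 0.01]), np.array([1 - 1e-6, 1e-6])
pA1 = np.array([[0.0001, 0.99], [0.99, 0.9999]])            # p(A=1|B=b,E=e)
pA_BE = np.stack([1 - pA1, pA1])                            # [a, b, e]

pBA = np.einsum('b,e,abe->ba', pB, pE, pA_BE)               # p(B, A), R summed out
pB_given_A = pBA / pBA.sum(axis=0, keepdims=True)           # p(B|A), via Bayes' rule

pA_soft = np.array([0.3, 0.7])                              # p(A|Ã) for states (fa, tr)
print(pB_given_A[1] @ pA_soft)                              # ≈ 0.6930
```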
Unreliable evidence (likelihood evidence)

Under potentially confusing reports, you decide to replace the influence of the
radio variable with your own model. You decide that you want the radio evidence
to influence the inference 80% towards being an earthquake and 20% to not being
an earthquake.

p(R|E) → p̃(R|E) = \begin{cases} 0.8 & E = \text{tr} \\ 0.2 & E = \text{fa} \end{cases}

[Figure: the burglar DAG with B → A, E → A, E → R]

This then gives a distribution, with R in an arbitrary fixed state,

p(B, E, A, R) = p(A|B, E) p(B) p(E) p̃(R|E)

This can then be used to perform inference.
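A sketch of inference with this virtual evidence (tables as before; the slides do not quote the resulting posterior, so the code simply prints it):

```python
import numpy as np

pB, pE = np.array([0.99, 0.01]), np.array([1 - 1e-6, 1e-6])
pA1 = np.array([[0.0001, 0.99], [0.99, 0.9999]])  # p(A=1|B=b,E=e)

# Virtual (likelihood) evidence: p̃(R|E) = 0.8 for E=tr, 0.2 for E=fa,
# with R clamped to an arbitrary fixed state
pR_tilde = np.array([0.2, 0.8])                   # indexed by e = (fa, tr)

# Unnormalised p(B, E=·, A=1, R̃) summed over E; normalise over B
w = np.einsum('b,e,be,e->b', pB, pE, pA1, pR_tilde)
print(w[1] / w.sum())                             # posterior p(B=tr | A=1, R̃)
```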


Examples of Belief Networks in Machine Learning

Prediction (discriminative)
p(class|input)

Prediction (generative)
p(class|input) ∝ p(input|class)p(class)

Time-series
Markov chains, Hidden Markov Models.
Unsupervised learning
p(data) = \sum_{latent} p(data|latent) p(latent).

And many more


Personally I find the framework very useful for understanding and rationalising the
many different approaches in machine learning and related areas.
Independence ⊥⊥ in Belief Networks – Part I
All belief networks with three nodes and two links:

A ⊥⊥ B | C in (a), (b), (c);   A ⊤⊤ B | C in (d)

[Figure: (a) C → A, C → B; (b) A → C → B; (c) B → C → A; (d) A → C ← B]

In (a), (b) and (c), A, B are conditionally independent given C.


(a) p(A, B|C) = \frac{p(A, B, C)}{p(C)} = \frac{p(A|C)p(B|C)p(C)}{p(C)} = p(A|C)p(B|C)

(b) p(A, B|C) = \frac{p(A)p(C|A)p(B|C)}{p(C)} = \frac{p(A, C)p(B|C)}{p(C)} = p(A|C)p(B|C)

(c) p(A, B|C) = \frac{p(A|C)p(C|B)p(B)}{p(C)} = \frac{p(A|C)p(B, C)}{p(C)} = p(A|C)p(B|C)

In (d) the variables A, B are conditionally dependent given C,


p(A, B|C) ∝ p(C|A, B)p(A)p(B).
Independence ⊥⊥ in Belief Networks – Part II

A ⊤⊤ B in (a), (b), (c);   A ⊥⊥ B in (d)

[Figure: the same four graphs: (a) C → A, C → B; (b) A → C → B; (c) B → C → A; (d) A → C ← B]

In (a), (b) and (c), the variables A, B are marginally dependent.

In (d) the variables A, B are marginally independent.


p(A, B) = \sum_C p(A, B, C) = \sum_C p(A)p(B)p(C|A, B) = p(A)p(B)
Collider

A collider is a node with two incoming arrows along a chosen path.


Summary of the two previous slides:

[Figure: A → C ← B] If C has more than one incoming link on the path, then A ⊥⊥ B and A ⊤⊤ B | C. In this case C is called a collider.

[Figure: A → C → B] If C has at most one incoming link on the path, then A ⊥⊥ B | C and A ⊤⊤ B. In this case C is called a non-collider.
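A quick numerical check of this, using the burglar network from earlier: B and E are marginally independent, but dependent given the collider A.

```python
import numpy as np

pB, pE = np.array([0.99, 0.01]), np.array([1 - 1e-6, 1e-6])
pA1 = np.array([[0.0001, 0.99], [0.99, 0.9999]])      # p(A=1|B=b,E=e)
pA_BE = np.stack([1 - pA1, pA1])                      # [a, b, e]

joint = np.einsum('b,e,abe->bea', pB, pE, pA_BE)      # p(B, E, A)

# Marginally, B and E are independent: p(B, E) = p(B)p(E)
pBE = joint.sum(axis=2)
print(np.allclose(pBE, np.outer(pB, pE)))             # True

# Conditioned on the collider A, they become dependent
pBE_A1 = joint[:, :, 1] / joint[:, :, 1].sum()
pB_A1, pE_A1 = pBE_A1.sum(axis=1), pBE_A1.sum(axis=0)
print(np.allclose(pBE_A1, np.outer(pB_A1, pE_A1)))    # False: explaining away
```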
The ‘connection’-graph
All paths in the connection graph need to be blocked to obtain ⊥⊥:

[Figure: four examples of a graph on A, B, C, D shown alongside its connection graph. The first illustrates A ⊥⊥ D | B, C — a non-collider in the conditioning set blocks a path. The second illustrates A ⊥⊥ D | B — a collider outside the conditioning set blocks a path. Further examples include B ⊥⊥ C | A.]
General Rule for Independence in Belief Networks
Given three sets of nodes X , Y, C, if all paths from any element of X to any
element of Y are blocked by C, then X and Y are conditionally independent
given C.

A path P is blocked by C if at least one of the following conditions is satisfied:

1. there is a collider in the path P such that neither the collider nor any
of its descendants is in the conditioning set C.

2. there is a non-collider in the path P that is in the conditioning set C.

d-connected/separated
We use the phrase ‘d-connected’ if there is a path from X to Y in the ‘connection’ graph – otherwise the variable sets are ‘d-separated’. Note that d-separation implies X ⊥⊥ Y | C, but d-connection does not necessarily imply conditional dependence.
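If networkx is available, d-separation can be checked directly; this sketch assumes a version that provides nx.d_separated (2.8–3.2) or its renamed form nx.is_d_separator (3.3+):

```python
import networkx as nx

# Burglar network: B -> A <- E -> R
g = nx.DiGraph([("B", "A"), ("E", "A"), ("E", "R")])

# d_separated was renamed is_d_separator in networkx 3.3
dsep = getattr(nx, "is_d_separator", None) or nx.d_separated

print(dsep(g, {"B"}, {"E"}, set()))    # True:  the collider A blocks the path B-A-E
print(dsep(g, {"B"}, {"E"}, {"A"}))    # False: conditioning on the collider d-connects B and E
print(dsep(g, {"B"}, {"R"}, {"E"}))    # True:  the non-collider E blocks the path B-A-E-R
```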
Markov Equivalence
skeleton
Formed from a graph by removing the arrows

immorality
An immorality in a DAG is a configuration of three nodes, A,B,C such that C is a
child of both A and B, with A and B not directly connected.

Markov equivalence
Two graphs represent the same set of independence assumptions if and only if they
have the same skeleton and the same set of immoralities.
[Figure: two DAGs on nodes a, b, c, d, e with the same skeleton – are they Markov equivalent?]
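A sketch of a Markov-equivalence test built directly from these two definitions (function names are mine; only basic networkx operations are used):

```python
import networkx as nx
from itertools import combinations

def skeleton(g: nx.DiGraph):
    """Node set plus undirected edge set of g."""
    return set(g.nodes), {frozenset(e) for e in g.edges}

def immoralities(g: nx.DiGraph):
    """All configurations a -> c <- b with a and b not directly connected."""
    out = set()
    for c in g:
        for a, b in combinations(sorted(g.predecessors(c)), 2):
            if not (g.has_edge(a, b) or g.has_edge(b, a)):
                out.add((a, b, c))
    return out

def markov_equivalent(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    """Same skeleton and same immoralities <=> same independence assumptions."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain = nx.DiGraph([("a", "c"), ("c", "b")])       # a -> c -> b
fork = nx.DiGraph([("c", "a"), ("c", "b")])        # a <- c -> b
collider = nx.DiGraph([("a", "c"), ("b", "c")])    # a -> c <- b
print(markov_equivalent(chain, fork))              # True
print(markov_equivalent(chain, collider))          # False: (a, b, c) is an immorality
```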
Limitations of expressibility

[Figure: DAG with edges t_1 → y_1, t_2 → y_2, h → y_1, h → y_2]

p(t_1, t_2, y_1, y_2, h) = p(t_1) p(t_2) p(h) p(y_1|t_1, h) p(y_2|t_2, h)

t_1 ⊥⊥ {t_2, y_2},   t_2 ⊥⊥ {t_1, y_1}

Marginalising over the latent variable h,

p(t_1, t_2, y_1, y_2) = p(t_1) p(t_2) \sum_h p(h) p(y_1|t_1, h) p(y_2|t_2, h)

[Figure: the marginal model on t_1, t_2, y_1, y_2]

It still holds that t_1 ⊥⊥ {t_2, y_2} and t_2 ⊥⊥ {t_1, y_1}.

No belief network on t_1, t_2, y_1, y_2 alone can represent all the conditional independence statements contained in p(t_1, t_2, y_1, y_2). Sometimes we can extend the representation, for example by adding a bidirectional link, but the result is then no longer a belief network.
Causality

Males            Recovered   Not Recovered   Rec. Rate
Given Drug       18          12              60%
Not Given Drug   7           3               70%

Females          Recovered   Not Recovered   Rec. Rate
Given Drug       2           8               20%
Not Given Drug   9           21              30%

Combined         Recovered   Not Recovered   Rec. Rate
Given Drug       20          20              50%
Not Given Drug   16          24              40%

Simpson’s paradox
For the males, it’s best not to give the drug. For the females, it’s also best not to
give the drug. However, for the combined data, it’s best to give the drug!
Resolving the paradox
We can write the distribution as
[Figure: DAG with edges G → D, G → R, D → R]

observational calculation

p(G, D, R) = p(R|G, D)p(D|G)p(G)

Our observational calculation computed p(R|G, D) and p(R|D) using the above
distribution.
Sampling from the distribution
The above formula suggests that we would first choose a gender (the term p(G)), then decide whether or not to give the drug (the term p(D|G)).
Resolving the paradox
interventional calculation
We must use a distribution that is consistent with an interventional experiment. In
this case, the term p(D|G) should play no role. That is, we need to consider a
modified distribution (conditioned on the drug)
[Figure: modified DAG with edges G → R and D → R; the influence G → D is removed]

p̃(G, R|D) = p(R|G, D) p(G)

p(R||D) = \sum_G p̃(G, R|D) = \sum_G p(R|G, D) p(G)

This gives the non-paradoxical result:

p(recovery|drug) = 0.6 × 0.5 + 0.2 × 0.5 = 0.4


p(recovery|no drug) = 0.7 × 0.5 + 0.3 × 0.5 = 0.5

The moral of the story is that you have to make the distribution match the
experimental conditions, otherwise apparent paradoxes may arise.
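Both calculations can be reproduced from the count tables; a minimal numpy sketch (the array layout is my choice):

```python
import numpy as np

# counts[g, d, r]: g=0 male, 1 female; d=0 no drug, 1 drug; r=0 not recovered, 1 recovered
counts = np.array([[[3, 7], [12, 18]],        # males: not given (7/10), given (18/30)
                   [[21, 9], [8, 2]]])        # females: not given (9/30), given (2/10)

p = counts / counts.sum()                     # empirical joint p(G, D, R)

# Observational: p(R=1 | D) mixes the genders according to p(G|D)
pDR = p.sum(axis=0)
print(pDR[:, 1] / pDR.sum(axis=1))            # [0.4, 0.5]: the drug looks better

# Interventional: p(R=1 || D) = Σ_G p(R=1 | G, D) p(G)
pR1_GD = p[:, :, 1] / p.sum(axis=2)
pG = p.sum(axis=(1, 2))
print(pG @ pR1_GD)                            # [0.5, 0.4]: the drug is worse
```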
