
Chapter 4

Bayesian Networks
Outline
1. Introduction
2. Probability Primer
3. Bayesian networks
4. Naïve Bayes Classification

Introduction
• Suppose you are trying to determine if a patient has inhalational anthrax.
• You observe the following symptoms:
  – The patient has a cough
  – The patient has a fever
  – The patient has difficulty breathing


Introduction
• You would like to determine how likely it is that the patient is infected with inhalational anthrax, given that the patient has a cough, a fever, and difficulty breathing.
• We are not 100% certain that the patient has anthrax just because of these symptoms.
• We are dealing with uncertainty!
Introduction
• Now suppose you order an x-ray and observe that the patient has a wide mediastinum.
• Your belief that the patient is infected with inhalational anthrax is now much higher.
Introduction
• In the previous slides, what you observed affected your belief that the patient is infected with anthrax.
• This is called reasoning with uncertainty.
• Wouldn't it be nice if we had some methodology for reasoning with uncertainty?
Bayesian Networks
Figure: a network with HasAnthrax as the parent of HasCough, HasFever, HasDifficultyBreathing, and HasWideMediastinum.

• Bayesian networks have made significant contributions to AI in the last 10 years.
• They are used in many applications, e.g., spam filtering, speech recognition, robotics, diagnostic systems, and even syndromic surveillance.
Outline
1. Introduction
2. Probability Primer
3. Bayesian networks
4. Naïve Bayes Classification

Probability Primer: Random Variables

• A random variable is the basic element of probability.
• A random variable refers to an event, and there is some degree of uncertainty as to the outcome of that event.
• For example, the random variable A could be the event of getting heads on a coin flip.
Boolean Random Variables
• Boolean variables are the simplest type of random variables.
• They take the values true or false.
• Think of the event as occurring or not occurring.
• Examples (let A be a Boolean random variable):
  – A = Getting heads on a coin flip
  – A = It will rain today
Probabilities
• We will write P(A = true) to mean the probability that A = true.
• What is probability?
• It is the relative frequency with which an outcome would be obtained if the process were repeated a large number of times under similar conditions.
• Figure: a pie chart divided into P(A = true) and P(A = false); the red and blue areas sum to 1.
• There is also the Bayesian definition, which says probability is your degree of belief in an outcome.
The Joint Probability Distribution
• The probability of A = true and B = true is written as: P(A = true, B = true)
• Joint probabilities can be over any number of variables, e.g. P(A = true, B = true, C = true)
• For each combination of variables, we need to say how probable that combination is
• The probabilities of these combinations need to sum to 1

A      B      C      P(A,B,C)
false  false  false  0.1
false  false  true   0.2
false  true   false  0.05
false  true   true   0.05
true   false  false  0.3
true   false  true   0.1
true   true   false  0.05
true   true   true   0.15
                     (sums to 1)
The Joint Probability Distribution
• Once you have the joint probability distribution, you can calculate any probability involving A, B, and C.

A      B      C      P(A,B,C)
false  false  false  0.1
false  false  true   0.2
false  true   false  0.05
false  true   true   0.05
true   false  false  0.3
true   false  true   0.1
true   true   false  0.05
true   true   true   0.15

Examples of things you can compute:
• P(A = true) = sum of P(A,B,C) over the rows with A = true
• P(A = true, B = true | C = true) = P(A = true, B = true, C = true) / P(C = true)
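A minimal sketch of these two computations in Python, storing the joint table from the slide as a dictionary (the variable names are illustrative, not from the slides):

```python
# Joint distribution P(A, B, C), keyed by (A, B, C).
joint = {
    (False, False, False): 0.10, (False, False, True): 0.20,
    (False, True,  False): 0.05, (False, True,  True): 0.05,
    (True,  False, False): 0.30, (True,  False, True): 0.10,
    (True,  True,  False): 0.05, (True,  True,  True): 0.15,
}

# P(A = true): sum the rows where A is true.
p_a = sum(p for (a, b, c), p in joint.items() if a)

# P(A = true, B = true | C = true) = P(A, B, C all true) / P(C = true)
p_abc = joint[(True, True, True)]
p_c = sum(p for (a, b, c), p in joint.items() if c)

print(p_a)          # 0.6
print(p_abc / p_c)  # 0.15 / 0.5 = 0.3
```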
Conditional Probability
• Probability of A given B, or probability of A conditioned on B
• Written as: P(A = true | B = true)
• Out of all the outcomes in which B is true, how many also have A equal to true?
• Figure: a Venn diagram with regions P(A = true) and P(B = true) overlapping in P(A∩B).
• The formula for conditional probability is:
  – P(A|B) = P(A and B) / P(B) = P(A∩B) / P(B)
• and, symmetrically:
  – P(B|A) = P(A∩B) / P(A)
Conditional Probability - Example
• In a group of 100 sports car buyers, 40 bought alarm systems, 30 purchased
bucket seats, and 20 purchased an alarm system and bucket seats.

• If a car buyer chosen at random bought an alarm system, what is the probability
they also bought bucket seats?

• Solution:
  – Determine the probability of an alarm system: P(A) = 40%, or 0.4
  – Determine the probability of both an alarm system and bucket seats, P(A∩B). This is the intersection of A and B: both happening together. P(A∩B) = 20%, or 0.2
  – Calculate the probability of buying bucket seats given that the buyer bought an alarm system, P(B|A):

    P(B|A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5
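The same calculation can be checked with a few lines of Python, working directly from the counts in the example (variable names are illustrative):

```python
# 100 buyers: 40 bought alarm systems, 20 bought both alarms and bucket seats.
n_total, n_alarm, n_both = 100, 40, 20

p_alarm = n_alarm / n_total           # P(A) = 0.4
p_both = n_both / n_total             # P(A and B) = 0.2
p_seats_given_alarm = p_both / p_alarm

print(p_seats_given_alarm)            # 0.5
```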
The Problem with the Joint Distribution
• Lots of entries in the table to fill up!
• For k Boolean random variables, you need a table of size 2^k.
• How do we use fewer numbers?
  – We need the concept of independence.

A      B      C      P(A,B,C)
false  false  false  0.1
false  false  true   0.2
false  true   false  0.05
false  true   true   0.05
true   false  false  0.3
true   false  true   0.1
true   true   false  0.05
true   true   true   0.15
Independence
• Variables A and B are independent if any of the
following hold:
– P(A,B) = P(A) P(B)

– P(A | B) = P(A)

– P(B | A) = P(B)

This says that knowing the outcome of A does not tell me anything new about the outcome of B.
Independence
How is independence useful?

• Suppose you have n coin flips and you want to calculate the joint distribution P(C1, …, Cn).
• If the coin flips are not independent, you need 2^n values in the table.
• If the coin flips are independent, then

  P(C1, …, Cn) = P(C1) × P(C2) × … × P(Cn)

  Each P(Ci) table has 2 entries, and there are n of them, for a total of 2n values.
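A small sketch of why this matters, using hypothetical per-flip probabilities (not from the slides):

```python
import math

# P(Ci = heads) for n = 4 independent (possibly biased) coin flips.
p_heads = [0.5, 0.6, 0.7, 0.5]
sequence = [True, False, True, True]   # the outcome whose probability we want

# Joint probability = product of the per-flip probabilities.
p_joint = math.prod(p if flip else 1 - p for p, flip in zip(p_heads, sequence))
print(p_joint)                          # 0.5 * 0.4 * 0.7 * 0.5 = 0.07

# Storage cost: full joint table vs. independent marginals.
n = len(p_heads)
print(2 ** n, 2 * n)                    # 16 entries vs. 8 numbers
```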
Conditional Independence
• Variables A and B are conditionally independent
given C if any of the following hold:
– P(A, B | C) = P(A | C) P(B | C)

– P(A | B, C) = P(A | C)

– P(B | A, C) = P(B | C)

• Knowing C tells me everything about B.
• I don't gain anything by knowing A (either because A doesn't influence B or because knowing C provides all the information knowing A would give).
Bayes' Theorem
• The Rev. Thomas Bayes (1702–1761)
• Represent uncertainty by probabilities
• Prior – old belief
• Posterior – new belief
• Bayes' theorem turns the old belief into the new belief once evidence E is observed:

  P(H | E) = P(E | H) P(H) / P(E)
Example of Bayes theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20

• If a patient has stiff neck, what’s the probability he/she


has meningitis?
  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
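As a quick sanity check, the same posterior can be computed in a few lines of Python:

```python
# Meningitis example: posterior P(M | S) from Bayes' theorem.
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002
```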
Example 2

Example 3

Possible Applications
• Spam Classification
– Given an email, predict whether it is spam or not

• Medical Diagnosis
– Given a list of symptoms, predict whether a
patient has disease X or not

• Weather
– Based on temperature, humidity, etc… predict if it
will rain tomorrow
Bayesian classifiers
Approach:
• Compute the posterior probability p( Ci | x1, x2, … , xn ) for each
value of Ci using Bayes theorem:
  P(Ci | x1, x2, …, xn) = P(x1, x2, …, xn | Ci) P(Ci) / P(x1, x2, …, xn)

• Choose the value of Ci that maximizes P(Ci | x1, x2, …, xn)
• This is equivalent to choosing the value of Ci that maximizes P(x1, x2, …, xn | Ci) P(Ci)
  (we can ignore the denominator, since it is the same for every Ci)
• Easy to estimate the priors P(Ci) from data. (How?)
• The real challenge: how to estimate P(x1, x2, …, xn | Ci)?
Bayesian classifiers - Estimations
• How to estimate p( x1, x2, … , xn | Ci )?

• In the general case, where the attributes xj have dependencies, this requires estimating the full joint distribution P(x1, x2, …, xn) for each class Ci.
• There is almost never enough data to confidently make such estimates.
Outline
1. Introduction
2. Probability Primer
3. Bayesian networks
4. Naïve Bayes Classification

A Bayesian Network
A Bayesian network (also called a Bayesian belief network) is made up of:

1. A Directed Acyclic Graph (DAG)

2. A set of tables, one for each node in the graph

Figure: a DAG with nodes A, B, C, D, where A → B, B → C, and B → D.

A      P(A)
false  0.6
true   0.4

A      B      P(B|A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

B      D      P(D|B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95
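One possible way to encode this network in Python is a dictionary of parents plus one conditional probability table (CPT) per node; the structure and numbers below are taken from this slide, while the helper names are illustrative:

```python
# Structure: A -> B, B -> C, B -> D.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}

# Each CPT maps a tuple of parent values to P(node = true | parents).
cpt = {
    "A": {(): 0.4},                        # P(A = true)
    "B": {(False,): 0.99, (True,): 0.3},   # P(B = true | A)
    "C": {(False,): 0.6,  (True,): 0.1},   # P(C = true | B)
    "D": {(False,): 0.98, (True,): 0.95},  # P(D = true | B)
}

def prob(node, value, parent_values):
    """P(node = value | parents = parent_values)."""
    p_true = cpt[node][tuple(parent_values)]
    return p_true if value else 1.0 - p_true
```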
Bayesian Networks
• In a Bayesian network the states are linked by probabilities, so:
  – If A then B; if B then C; if C then D
• Not only that, but the network can be updated when an event A happens, propagating the new probabilities by using the updated probability of B to recalculate the probability of C, and so on.
A Directed Acyclic Graph
• Each node in the graph is a random variable.
• A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g. A is a parent of B.
• Informally, an arrow from node X to node Y means X has a direct influence on Y.

Figure: the same DAG with nodes A, B, C, D (A → B, B → C, B → D).
A Set of Tables for Each Node
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.

The parameters are the probabilities in these conditional probability tables (CPTs).

A      P(A)
false  0.6
true   0.4

A      B      P(B|A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

B      D      P(D|B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95
A Set of Tables for Each Node
Conditional probability distribution for C given B:

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

For a given combination of values of the parents (B in this example), the entries for P(C=true | B) and P(C=false | B) must add up to 1, e.g. P(C=true | B=false) + P(C=false | B=false) = 1.

• If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) probabilities (but only 2^k need to be stored).
Using a Bayesian Network Example
• Using the network in the example, suppose you want to calculate:

  P(A = true, B = true, C = true, D = true)
  = P(A = true) × P(B = true | A = true) × P(C = true | B = true) × P(D = true | B = true)
  = (0.4) × (0.3) × (0.1) × (0.95)
  = 0.0114

• The factorization comes from the graph structure; the numbers come from the conditional probability tables.
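Continuing the earlier sketch (with the illustrative `parents` and `prob` helpers), the same product can be computed for any full assignment of the variables:

```python
def joint(assignment):
    """P(assignment) = product over nodes of P(node | its parents)."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = [assignment[par] for par in parents[node]]
        p *= prob(node, value, parent_values)
    return p

print(joint({"A": True, "B": True, "C": True, "D": True}))
# 0.4 * 0.3 * 0.1 * 0.95 = 0.0114
```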
Bayesian Networks

Two important properties:

1. Encodes the conditional independence relationships between the variables in the graph structure

2. Is a compact representation of the joint probability distribution over the variables
Example
• I'm at work, neighbor John calls to say my alarm is ringing, but
neighbor Mary doesn't call. Sometimes it's set off by minor
earthquakes. Is there a burglar?

• Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

• Network topology reflects "causal" knowledge:


– A burglar can set the alarm off

– An earthquake can set the alarm off

– The alarm can cause Mary to call

– The alarm can cause John to call


Example contd.

Conditional Independence
• The Markov condition:
– Given its parents (P1, P2), a node (X) is conditionally
independent of its non-descendants (ND1, ND2)

Figure: node X with parents P1 and P2, non-descendants ND1 and ND2, and children C1 and C2.
The Bad News
• Exact inference is feasible in small to medium-sized networks.
• Exact inference in large networks takes a very long time.
• We resort to approximate inference techniques, which are much faster and give pretty good results.
Advantages and Disadvantages of
Bayesian Networks
• Bayesian approach:
  – Advantages:
    • Reflects an expert's knowledge
    • Beliefs keep being updated as new data items arrive
  – Disadvantage:
    • Arbitrary (more subjective)
Advantages and Disadvantages of
Bayesian Networks
• Classical probability:
  – Advantage:
    • Objective and unbiased
  – Disadvantage:
    • It takes a long time to measure the object's physical characteristics
Outline
1. Introduction
2. Probability Primer
3. Bayesian networks
4. Naïve Bayes Classification

Naïve Bayes Classification
• Problem statement:
– Given features X1,X2,…,Xn
– Predict a label Y

Based on Bayes Theorem
• The classifier is based on Bayes' theorem, which
describes the probability of a hypothesis given the
evidence, and it's formulated as:
  P(h|d) = P(d|h) × P(h) / P(d)

Where:
• P(h|d) is the probability of hypothesis h given the data d,
• P(d|h) is the probability of data d given the hypothesis h,
• P(h) is the prior probability of hypothesis h,
• P(d) is the prior probability of the data.
Naive Independence Assumption
• The "naive" assumption in Naive Bayes is that
features are conditionally independent given
the class label.
• Mathematically, it's represented as:
P(x1, x2,…, xn∣y) = P(x1 ∣y) × P(x2 ∣y) × …× P(xn∣y)
where x1 , x2 , …. xn are features, and y is the
class label.

Classification
• To classify a new instance, the classifier
calculates the posterior probability of each
class given the features, and then it selects
the class with the highest posterior
probability.
  argmax_y P(y | x1, x2, …, xn)

• This is done using Bayes' theorem and the naive independence assumption.
Modeling Bayes Classifier -
Parameters
• For the Bayes classifier, we need to "learn" two functions:
  – the likelihood, and
  – the prior
• How many parameters are required to specify the prior?
• How many parameters are required to specify the likelihood?
Modeling Bayes Classifier -
Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is
that there are usually way too many parameters:

– We’ll run out of space

– We’ll run out of time

– And we'll need tons of training data (which is usually not available)
The Naïve Bayes Model
• The Naïve Bayes Assumption: Assume that all
features are independent given the class label Y

• Equationally speaking:

  P(X1, X2, …, Xn | Y) = P(X1 | Y) × P(X2 | Y) × … × P(Xn | Y)
Why is this useful?
• Dependent variables:
  – # of parameters for modeling P(X1, …, Xn | Y):
    ▪ 2(2^n − 1)
• Independent variables:
  – # of parameters for modeling P(X1 | Y), …, P(Xn | Y):
    ▪ 2n
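A tiny sketch of how quickly these two parameter counts diverge (for Boolean features and a Boolean label):

```python
# Free parameters needed for n Boolean features and a Boolean label Y.
for n in (5, 10, 20):
    full_joint = 2 * (2 ** n - 1)   # modeling P(X1, ..., Xn | Y) directly
    naive_bayes = 2 * n             # modeling each P(Xi | Y) separately
    print(n, full_joint, naive_bayes)
# 5   62       10
# 10  2046     20
# 20  2097150  40
```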
Example
• Example: Play Tennis

Another Example - contd
The weather data, with counts and probabilities:

            outlook            temperature          humidity              windy               play
            yes   no           yes   no             yes   no              yes   no            yes    no
sunny       2     3     hot    2     2      high    3     4      false    6     2             9      5
overcast    4     0     mild   4     2      normal  6     1      true     3     3
rainy       3     2     cool   3     1

sunny       2/9   3/5   hot    2/9   2/5    high    3/9   4/5    false    6/9   2/5           9/14   5/14
overcast    4/9   0/5   mild   4/9   2/5    normal  6/9   1/5    true     3/9   3/5
rainy       3/9   2/5   cool   3/9   1/5
Another Example - contd
Classifying a new day:

outlook  temperature  humidity  windy  play
sunny    cool         high      true   ?

• Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
• Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
• Therefore, the prediction is No
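A minimal Python sketch of this calculation, hard-coding the probabilities read off the weather table above (the dictionary layout is illustrative):

```python
# Per-class conditional probabilities for the new day (sunny, cool, high, windy=true),
# plus the class priors P(play = yes) = 9/14 and P(play = no) = 5/14.
likelihoods = {
    "yes": [2/9, 3/9, 3/9, 3/9],   # outlook=sunny, temp=cool, humidity=high, windy=true
    "no":  [3/5, 1/5, 4/5, 3/5],
}
priors = {"yes": 9/14, "no": 5/14}

scores = {}
for label, probs in likelihoods.items():
    score = priors[label]
    for p in probs:
        score *= p
    scores[label] = score

print(scores)                       # {'yes': ~0.0053, 'no': ~0.0206}
print(max(scores, key=scores.get))  # 'no'
```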


The Naive Bayes Classifier for Data
Sets with Numerical Attribute Values

• One common practice for handling numerical attribute values is to assume a normal distribution for each numerical attribute.
The numeric weather data with summary statistics
            outlook         temperature           humidity               windy              play
            yes   no        yes      no           yes      no            yes   no           yes    no
sunny       2     3         83       85           86       85     false  6     2            9      5
overcast    4     0         70       80           96       90     true   3     3
rainy       3     2         68       65           80       70
                            64       72           65       95
                            69       71           70       91
                            75                    80
                            75                    70
                            72                    90
                            81                    75

sunny       2/9   3/5   mean     73     74.6   mean     79.1   86.2   false  6/9   2/5      9/14   5/14
overcast    4/9   0/5   std dev  6.2    7.9    std dev  10.2   9.7    true   3/9   3/5
rainy       3/9   2/5
• Let x1, x2, …, xn be the values of a numerical attribute in the training data set.

  μ = (1/n) · Σᵢ xᵢ

  σ = √( (1/(n−1)) · Σᵢ (xᵢ − μ)² )

  f(w) = (1 / (√(2π) · σ)) · e^( −(w − μ)² / (2σ²) )
• For example:

  f(temperature = 66 | yes) = (1 / (√(2π) · 6.2)) · e^( −(66 − 73)² / (2 · 6.2²) ) = 0.0340

• Likelihood of yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
• Likelihood of no = 3/5 × 0.0291 × 0.038 × 3/5 × 5/14 = 0.000136
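A small sketch of this Gaussian step in Python, assuming (consistently with the densities 0.0340 and 0.0221 above) that the new day has outlook=sunny, temperature=66, humidity=90, and windy=true:

```python
import math

def gaussian(w, mu, sigma):
    """Normal density f(w) used for numeric attributes."""
    return math.exp(-(w - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Density of temperature = 66 under the 'yes' class (mean 73, std dev 6.2).
print(round(gaussian(66, 73, 6.2), 4))   # 0.034

# Unnormalized likelihood of 'yes':
# P(sunny|yes) * f(temp|yes) * f(humidity|yes) * P(windy=true|yes) * P(yes)
p_yes = (2/9) * gaussian(66, 73, 6.2) * gaussian(90, 79.1, 10.2) * (3/9) * (9/14)
print(p_yes)                             # roughly 0.000036
```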
Outputting Probabilities
• What’s nice about Naïve Bayes (and generative
models in general) is that it returns probabilities

– These probabilities can tell us how confident the algorithm is
– So… don't throw away those probabilities!
Naïve Bayes Assumption
• Recall the Naïve Bayes assumption:
– that all features are independent given the class label Y

• Actually, the Naïve Bayes assumption is almost never true.
• Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold.
Why Naïve Bayes is naive
• It makes an assumption that is virtually impossible to
see in real-life data:

• The conditional probability is calculated as the pure product of the individual probabilities of the components.
• This implies the absolute independence of features, a condition probably never met in real life.
Naïve Bayes Problems
• Problems
– Actual Features may overlap

– Actual Features may not be independent

• Example: Size and weight of tiger

– Use a joint distribution estimation (P(X|Y), P(Y)) to solve a conditional


problem (P(Y|X= x))

• Can we discriminatively train?


– Logistic regression

– Regularization

– Gradient ascent
Conclusions on Naïve Bayes
• Naïve Bayes is based on the independence assumption:
  – Training is very easy and fast; it just requires considering each attribute in each class separately
  – Testing is straightforward; just look up tables or calculate conditional probabilities with normal distributions
  – Naïve Bayes is often a good choice if you don't have much training data!
Conclusions on Naïve Bayes
• A popular generative model
  – Performance is competitive with most classifiers, even when the independence assumption is violated
  – Many successful applications, e.g., spam mail filtering
  – A good candidate for a base learner in ensemble learning
  – Naïve Bayes can do more than classification.
