Chapter 4 Bayesian Networks
Bayesian Networks
Outline
1. Introduction
2. Probability Primer
3. Bayesian networks
4. Naïve Bayes Classification
Introduction
• Suppose you are trying to
determine if a patient has
inhalational anthrax.
Probability Primer: Random Variables
• What is probability?
The Joint Probability Distribution
A      B      C      P(A,B,C)
false  false  false  0.1
false  false  true   0.2
false  true   false  0.05
false  true   true   0.05
true   false  false  0.3
true   false  true   0.1
true   true   false  0.05
true   true   true   0.15

• The probability of A = true and B = true
  – Written as: P(A = true, B = true)
• Joint probabilities can be between any number of variables,
  e.g. P(A = true, B = true, C = true)
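For instance (a worked check, not on the original slide), a joint probability over a subset of the variables is obtained by summing out the remaining ones:

P(A = true, B = true) = P(true, true, false) + P(true, true, true) = 0.05 + 0.15 = 0.20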
• If a car buyer chosen at random bought an alarm system, what is the probability
they also bought bucket seats?
• Solution:
– Determine probability of alarm systems P(A) = 40%, or 0.4
– Figure out probability of both alarm systems and bucket seats P(A∩B) - This is the
intersection of A and B: both happening together. P(A∩B) = 20% or 0.2.
– Calculate the probability of buying bucket seats given that (s)he bought the alarm
system, P(B|A)
– P(B | A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5, so there is a 50% chance that a buyer of an
alarm system also bought bucket seats
Independence
How is independence useful?
– If A and B are conditionally independent given C, then:
– P(A | B, C) = P(A | C)
– P(B | A, C) = P(B | C)
– Once C is known, learning B does not change our belief about A (and vice versa)
Example of Bayes theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
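The question and its computation are not shown in the surviving slide text; applying Bayes' theorem to the numbers given, the probability that a patient with a stiff neck has meningitis is:

P(meningitis | stiff neck) = P(stiff neck | meningitis) × P(meningitis) / P(stiff neck)
= (0.5 × 1/50,000) / (1/20) = 0.0002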
Possible Applications
• Spam Classification
– Given an email, predict whether it is spam or not
• Medical Diagnosis
– Given a list of symptoms, predict whether a
patient has disease X or not
• Weather
– Based on temperature, humidity, etc… predict if it
will rain tomorrow
Bayesian classifiers
Approach:
• Compute the posterior probability p( Ci | x1, x2, … , xn ) for each
value of Ci using Bayes theorem:
p(Ci | x1, x2, …, xn) = p(x1, x2, …, xn | Ci) × p(Ci) / p(x1, x2, …, xn)
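• Note that the denominator p(x1, x2, …, xn) is the same for every class Ci, so it suffices to compare the numerators p(x1, x2, …, xn | Ci) × p(Ci) and choose the class that maximizes this product.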
A Bayesian Network
A Bayesian network (also called a Bayesian belief network) is made up of:
– A directed acyclic graph (DAG) whose nodes represent random variables, and
– A conditional probability table for each node, giving its distribution conditioned on its parents
Bayesian Networks
• In a Bayesian Network the states are linked by
probabilities, so:
– If A then B; if B then C; if C then D
Using a Bayesian Network Example
• Using the network in the example, suppose
you want to calculate:
P(A = true, B = true, C = true, D = true)
– The way this joint probability factorizes comes from the graph structure; the individual numbers plugged into the factors come from the conditional probability tables.
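The factored form itself does not survive in the slide text. As a sketch, assuming the chain structure described earlier (A → B, B → C, C → D), the chain rule for Bayesian networks gives:

P(A = true, B = true, C = true, D = true)
= P(A = true) × P(B = true | A = true) × P(C = true | B = true) × P(D = true | C = true)

with each factor read off the corresponding node's conditional probability table.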
Conditional Independence
• The Markov condition:
– Given its parents (P1, P2), a node (X) is conditionally
independent of its non-descendants (ND1, ND2)
[Figure: a node X with parents P1 and P2, non-descendants ND1 and ND2, and children C1 and C2]
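Written as an equation (not stated explicitly on the slide), the Markov condition for the node X in the figure is:

P(X | P1, P2, ND1, ND2) = P(X | P1, P2)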
The Bad News
• Exact inference is feasible only in small to medium-sized networks; in large networks exact inference becomes computationally intractable, so approximate inference techniques have to be used instead
Advantages and Disadvantages of
Bayesian Networks
• Bayesian Approach:
– Advantages
• Reflects an expert’s knowledge
– Disadvantage:
• Arbitrary (More subjective)
Advantages and Disadvantages of
Bayesian Networks
• Classical Probability:
– Advantage:
• Objective and unbiased
– Disadvantage:
• It takes a long time to measure the object’s
physical characteristics
Outline
1. Introduction
2. Probability Primer
3. Bayesian networks
4. Naïve Bayes Classification
Naïve Bayes Classification
• Problem statement:
– Given features X1,X2,…,Xn
– Predict a label Y
Based on Bayes Theorem
• The classifier is based on Bayes' theorem, which
describes the probability of a hypothesis given the
evidence, and it's formulated as:
P(h | d) = P(d | h) × P(h) / P(d)
Where:
• P(h | d) is the probability of hypothesis h given the data d,
• P(d | h) is the probability of data d given the hypothesis h,
• P(h) is the prior probability of hypothesis h,
• P(d) is the prior probability of the data.
Naive Independence Assumption
• The "naive" assumption in Naive Bayes is that
features are conditionally independent given
the class label.
• Mathematically, it's represented as:
P(x1, x2,…, xn∣y) = P(x1 ∣y) × P(x2 ∣y) × …× P(xn∣y)
where x1 , x2 , …. xn are features, and y is the
class label.
Classification
• To classify a new instance, the classifier
calculates the posterior probability of each
class given the features, and then it selects
the class with the highest posterior
probability.
argmax_y P(y | x1, x2, …, xn)
• This is done using Bayes' theorem and the
naive independence assumption.
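As an illustration (not part of the original slides), here is a minimal sketch of this decision rule in Python, assuming the priors P(y) and the per-feature conditional probabilities P(xi | y) have already been estimated from training counts; the names below are hypothetical:

```python
# priors[y]            = P(y)
# likelihoods[y][i][v] = P(X_i = v | y), estimated from training counts

def naive_bayes_predict(x, priors, likelihoods):
    """Return the class y that maximizes P(y) * prod_i P(x_i | y)."""
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(x):
            # Naive assumption: features are conditionally independent given y
            score *= likelihoods[y][i].get(value, 0.0)
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```

The denominator P(x1, …, xn) is omitted because it is the same for every class and does not change the argmax.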
Modeling Bayes Classifier -
Parameters
• For the Bayes classifier, we need to “learn” two
functions,
– the likelihood and
– the prior
The Naïve Bayes Model
• The Naïve Bayes Assumption: Assume that all
features are independent given the class label Y
• Equationally speaking: P(X1, …, Xn | Y) = P(X1 | Y) × P(X2 | Y) × … × P(Xn | Y)
Why is this useful?
• Dependent Variables
– # of parameters for modeling P(X1,…,Xn|Y):
▪ 2(2^n − 1)
• Independent Variables:
– # of parameters for modeling P(X1|Y),…,P(Xn|Y)
▪ 2n
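As a quick comparison (added here for concreteness): with n = 10 binary features and a binary class label, the full joint model needs 2(2^10 − 1) = 2,046 parameters, while the Naïve Bayes model needs only 2 × 10 = 20.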
Example
• Example: Play Tennis
Another Example - contd
The weather data, with counts and probabilities

outlook             temperature         humidity            windy             play
          yes  no            yes  no             yes  no           yes  no    yes   no
sunny      2    3    hot      2    2    high      3    4    false   6    2     9     5
overcast   4    0    mild     4    2    normal    6    1    true    3    3
rainy      3    2    cool     3    1
sunny     2/9  3/5   hot     2/9  2/5   high     3/9  4/5   false  6/9  2/5   9/14  5/14
overcast  4/9  0/5   mild    4/9  2/5   normal   6/9  1/5   true   3/9  3/5
rainy     3/9  2/5   cool    3/9  1/5
Another Example - contd
Classifying a new day
outlook temperature humidity windy play
sunny cool high true ?
• Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
• Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
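Normalizing the two likelihoods (a step not shown in the surviving slide text) gives P(yes) = 0.0053 / (0.0053 + 0.0206) ≈ 0.205 and P(no) ≈ 0.795, so the new day is classified as play = no. A quick Python check of the slide's numbers:

```python
# Reproduce the likelihoods for the new day (sunny, cool, high humidity, windy = true)
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
print(f"likelihood yes = {p_yes:.4f}, likelihood no = {p_no:.4f}")
print(f"P(yes) = {p_yes / (p_yes + p_no):.3f}")   # ≈ 0.205 -> predict play = no
```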
The numeric weather data with summary statistics
outlook             temperature            humidity               windy            play
          yes  no           yes    no              yes    no            yes  no    yes   no
sunny      2    3            83    85               86    85     false   6    2     9     5
overcast   4    0            70    80               96    90     true    3    3
rainy      3    2            68    65               80    70
                             64    72               65    95
                             69    71               70    91
                             75                     80
                             75                     70
                             72                     90
                             81                     75
sunny     2/9  3/5   mean    73    74.6    mean    79.1   86.2   false  6/9  2/5   9/14  5/14
overcast  4/9  0/5   std dev 6.2    7.9    std dev 10.2    9.7   true   3/9  3/5
rainy     3/9  2/5
• Let x1, x2, …, xn be the values of a numerical
attribute in the training data set.
μ = (1/n) · Σ_{i=1..n} x_i                         (sample mean)

σ² = (1/(n−1)) · Σ_{i=1..n} (x_i − μ)²             (sample variance)

f(w) = (1 / (√(2π) · σ)) · e^(−(w − μ)² / (2σ²))   (Gaussian density)
• For example:
  f(temperature = 66 | Yes) = (1 / (√(2π) · 6.2)) · e^(−(66 − 73)² / (2 · 6.2²)) = 0.0340
• Likelihood of Yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
• Likelihood of No = 3/5 × 0.0291 × 0.038 × 3/5 × 5/14 = 0.000136
Outputting Probabilities
• What’s nice about Naïve Bayes (and generative
models in general) is that it returns probabilities
Naïve Bayes Assumption
• Recall the Naïve Bayes assumption:
– that all features are independent given the class label Y
Why Naïve Bayes is naive
• It makes an assumption that is virtually impossible to see in real-life data:
  – that all features are conditionally independent of one another given the class label
Naïve Bayes Problems
• Problems
– Actual Features may overlap
– Regularization
– Gradient ascent
Conclusions on Naïve Bayes
• Naïve Bayes is based on the independence assumption
  – Training is very easy and fast: it just requires considering each attribute in each class separately