L23: Bayesian Learning and Naïve Bayes


Bayesian Learning

Total probability of an event A over a partition {B1, B2}:

P(A) = Σi P(A, Bi) = P(A, B1) + P(A, B2)
     = P(B1) P(A|B1) + P(B2) P(A|B2)
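As a quick numeric illustration, a short Python check of this identity (the probabilities below are made-up values, not from the slides):

# Hypothetical partition B1, B2 and conditionals P(A|Bi) (illustrative values only)
p_b = {"B1": 0.6, "B2": 0.4}            # P(B1) + P(B2) = 1
p_a_given_b = {"B1": 0.9, "B2": 0.2}    # P(A|B1), P(A|B2)

# Law of total probability: P(A) = sum_i P(Bi) * P(A|Bi)
p_a = sum(p_b[b] * p_a_given_b[b] for b in p_b)
print(p_a)  # 0.6*0.9 + 0.4*0.2 = 0.62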
Bayes Classifiers
Credit rating prediction:
Training data: joint observations of the feature x and the outcome c.
A classifier is a mapping from observed values of x to predicted values of c.

x     # bad   # good
x=0   42      15
x=1   338     287
x=2   3       5

• Predict the more likely outcome for each possible observation.
• Probability of outcome c given an observation x:

x     P(bad|x)   P(good|x)
x=0   .7368      .2632
x=1   .5408      .4592
x=2   .3750      .6250

The likelihood P(x|c), how likely we are to see 'x' among users with good (or bad) credit, is estimated on the next slide.
Bayes Classifiers
x     P(bad|x)   P(good|x)
x=0   .7368      .2632
x=1   .5408      .4592
x=2   .3750      .6250

x     # bad   # good   P(x|c=bad)   P(x|c=good)
x=0   42      15       42/383       15/307
x=1   338     287      338/383      287/307
x=2   3       5        3/383        5/307
P(c)                   383/690      307/690

Bayes rule:  p(c|x) = p(x|c) p(c) / p(x)

Evidence:    P(x=0) = P(x=0|c=bad) P(c=bad) + P(x=0|c=good) P(c=good)
                    = (42/383)(383/690) + (15/307)(307/690) = 57/690

Posterior:   P(c=bad|x=0) = (42/383)(383/690) / (57/690) = 0.7368
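A minimal Python sketch of the same calculation, building the whole posterior table from the counts above (variable names are just for illustration):

# Counts of (feature value, class) from the training data
counts = {
    0: {"bad": 42,  "good": 15},
    1: {"bad": 338, "good": 287},
    2: {"bad": 3,   "good": 5},
}

n_bad  = sum(c["bad"]  for c in counts.values())   # 383
n_good = sum(c["good"] for c in counts.values())   # 307
n      = n_bad + n_good                             # 690

for x, c in counts.items():
    # Likelihoods P(x|c) and priors P(c)
    like_bad,  like_good  = c["bad"] / n_bad, c["good"] / n_good
    prior_bad, prior_good = n_bad / n, n_good / n
    # Evidence P(x), then posteriors P(c|x) via Bayes rule
    p_x = like_bad * prior_bad + like_good * prior_good
    print(x, like_bad * prior_bad / p_x, like_good * prior_good / p_x)
# x=0: .7368 / .2632, x=1: .5408 / .4592, x=2: .3750 / .6250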
Bayesian Learning
(Based on Chapter 6 of Mitchell, T., Machine Learning)

• Bayesian Decision Theory came long before Decision Tree
Learning and Neural Networks. It was studied in the field of
Statistical Theory and, more specifically, in the field of
Pattern Recognition.
• Bayesian Decision Theory is at the basis of important
learning schemes such as the Naïve Bayes Classifier,
Learning Bayesian Belief Networks and the EM Algorithm.
• Bayesian Decision Theory is also useful as it provides a
framework within which many non-Bayesian classifiers can
be studied (see [Mitchell, Sections 6.3, 6.4, 6.5, 6.6]).
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
• Prior probability of h, P(h): it reflects any background
knowledge we have about the chance that h is a
correct hypothesis (before having observed the data).
• Prior probability of D, P(D): it reflects the probability
that training data D will be observed given no
knowledge about which hypothesis h holds.
• Conditional Probability of observation D, P(D|h): it
denotes the probability of observing data D given
some world in which hypothesis h holds.
Bayes Theorem
• Posterior probability of h, P(h|D): it represents the
probability that h holds given the observed training
data D. It reflects our confidence that h holds after
we have seen the training data D and it is the
quantity that Machine Learning researchers are
interested in.
• Bayes Theorem allows us to compute P(h|D):

P(h|D)=P(D|h)P(h)/P(D)
Maximum A Posteriori (MAP)
Hypothesis and Maximum Likelihood
• Goal: To find the most probable hypothesis h from a set of
candidate hypotheses H given the observed data D.
• MAP Hypothesis, hMAP = argmax h∈H P(h|D)
                       = argmax h∈H P(D|h)P(h)/P(D)
                       = argmax h∈H P(D|h)P(h)
• If every hypothesis in H is equally probable a priori, we only
need to consider the likelihood of the data D given h, P(D|h).
Then, hMAP becomes the Maximum Likelihood,
hML = argmax h∈H P(D|h)
This can be applied to any set H of mutually exclusive propositions whose
probabilities sum to 1.
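A small Python sketch of the two rules over a hypothetical hypothesis space; the priors and likelihoods are made-up numbers, only to illustrate the argmax:

# Hypothetical hypothesis space H with priors P(h) and likelihoods P(D|h)
prior      = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.1, "h2": 0.5, "h3": 0.9}

# MAP: maximise P(D|h) P(h); P(D) is the same for every h, so it can be dropped
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])

# ML: with a uniform prior, only the likelihood matters
h_ml = max(likelihood, key=lambda h: likelihood[h])

print(h_map, h_ml)  # h2 (largest product 0.2*0.5 = 0.10), h3 (largest likelihood)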
Bayes’ Rule
Understanding Bayes' rule:

p(h|d) = P(d|h) P(h) / P(d)        d = data, h = hypothesis

Proof: just rearrange
p(h|d) P(d) = P(d|h) P(h)
P(d, h) = P(d, h)
(the same joint probability on both sides)

Who is who in Bayes' rule

P(h): prior belief (probability of hypothesis h before seeing any data)
P(d|h): likelihood (probability of the data if the hypothesis h is true)
P(d) = Σh P(d|h) P(h): data evidence (marginal probability of the data)
P(h|d): posterior (probability of hypothesis h after having seen the data d)
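The same roles in a short Python sketch; the numbers are illustrative, and the point is that the evidence P(d) is just the normaliser obtained by summing P(d|h)P(h) over all hypotheses:

# Prior P(h) and likelihood P(d|h) for a toy set of hypotheses (illustrative values)
prior      = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.2, "h2": 0.6, "h3": 0.1}

# Evidence: P(d) = sum_h P(d|h) P(h)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Posterior: P(h|d) = P(d|h) P(h) / P(d)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(evidence, posterior)  # the posteriors sum to 1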


Bayesian classifiers

Unseen sample X = <rain, hot, high, false>

For each class, compute the posterior given X's attribute values:
P(p|a1,a2,..,an) = a
P(n|a1,a2,..,an) = b

using Bayes theorem:  P(h|D) = P(D|h)P(h)/P(D)

If a > b, then the class of X is p, otherwise n.
Naïve Bayes Classifier
• Let each instance x of a training set D be described by a conjunction
of n attribute values <a1,a2,..,an> and let f(x), the target function, be
such that f(x) ∈ V, a finite set.
• Bayesian Approach:
vMAP = argmax vj∈V P(vj|a1,a2,..,an)
     = argmax vj∈V [P(a1,a2,..,an|vj) P(vj) / P(a1,a2,..,an)]
     = argmax vj∈V [P(a1,a2,..,an|vj) P(vj)]
• Naïve Bayesian Approach: We assume that the attribute values are
conditionally independent given the class, so that P(a1,a2,..,an|vj) = Πi P(ai|vj)
Naïve Bayes Classifier:
vNB = argmax vj∈V P(vj) Πi P(ai|vj)
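A minimal Python sketch of this decision rule, assuming the priors P(vj) and conditionals P(ai|vj) have already been estimated; the attribute names and probability values below are placeholders, not taken from the slides:

from math import prod

# Estimated priors P(vj) and per-attribute conditionals P(ai|vj) (placeholder values)
prior = {"yes": 0.6, "no": 0.4}
cond = {
    "yes": {"a1": {"x": 0.7, "y": 0.3}, "a2": {"u": 0.2, "v": 0.8}},
    "no":  {"a1": {"x": 0.4, "y": 0.6}, "a2": {"u": 0.5, "v": 0.5}},
}

def classify(instance):
    # v_NB = argmax_vj P(vj) * prod_i P(ai|vj)
    return max(prior, key=lambda v: prior[v] * prod(cond[v][a][val] for a, val in instance.items()))

print(classify({"a1": "x", "a2": "v"}))  # "yes": 0.6*0.7*0.8 = 0.336 vs "no": 0.4*0.4*0.5 = 0.08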

Naïve Bayesian Classification
• If the i-th attribute is categorical:
P(ai|vj) is estimated as the relative frequency of samples
having value ai for the i-th attribute in class vj
• If the i-th attribute is continuous:
P(ai|vj) is estimated through a Gaussian density function (sketched below)
• Computationally easy in both cases
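For the continuous case, one common reading of this is to estimate a per-class mean and variance and plug them into the Gaussian density; a minimal Python sketch under that assumption (the attribute values are made up):

from math import exp, pi, sqrt

def gaussian_pdf(x, mean, var):
    # Gaussian density used as the estimate of P(ai|vj) for a continuous attribute
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Values of a continuous attribute observed within one class (illustrative)
values = [20.1, 22.4, 19.8, 21.0, 23.3]
mean = sum(values) / len(values)
var  = sum((v - mean) ** 2 for v in values) / (len(values) - 1)  # sample variance

print(gaussian_pdf(21.5, mean, var))  # density of a new value under this class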
Play-tennis example: estimating P(xi|C)
Training data:

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

P(p) = 9/14    P(n) = 5/14

outlook
P(sunny|p) = 2/9      P(sunny|n) = 3/5
P(overcast|p) = 4/9   P(overcast|n) = 0
P(rain|p) = 3/9       P(rain|n) = 2/5

temperature
P(hot|p) = 2/9        P(hot|n) = 2/5
P(mild|p) = 4/9       P(mild|n) = 2/5
P(cool|p) = 3/9       P(cool|n) = 1/5

humidity
P(high|p) = 3/9       P(high|n) = 4/5
P(normal|p) = 6/9     P(normal|n) = 2/5

windy
P(true|p) = 3/9       P(true|n) = 3/5
P(false|p) = 6/9      P(false|n) = 2/5
Play-tennis example: classifying X
• An unseen sample X = <rain, hot, high, false>

• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286

• Sample X is classified in class n (don’t play)
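A short Python check of this arithmetic, using the conditional probabilities estimated on the previous slide:

from math import prod

# P(xi|p) and P(xi|n) for X = <rain, hot, high, false>, read off the previous slide
p_x_given_p = [3/9, 2/9, 3/9, 6/9]   # rain, hot, high, false given class p
p_x_given_n = [2/5, 2/5, 4/5, 2/5]   # rain, hot, high, false given class n

score_p = prod(p_x_given_p) * 9/14   # P(X|p) * P(p) ≈ 0.010582
score_n = prod(p_x_given_n) * 5/14   # P(X|n) * P(n) ≈ 0.018286

print("n" if score_n > score_p else "p")  # "n": don't play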


The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes (variables)
are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, which combine Bayesian reasoning with causal
relationships between attributes
