
Bayes Classification

By
SATHISHKUMAR G
(sathishsak111@gmail.com)
Classification
 Uncertainty & Probability
 Bayes' rule
 Choosing Hypotheses - Maximum a posteriori
 Maximum Likelihood - Bayes concept learning
 Maximum Likelihood of real-valued functions
 Bayes optimal Classifier
 Joint distributions
 Naive Bayes Classifier
Uncertainty
 Our main tool is probability theory, which
assigns to each sentence a numerical degree of
belief between 0 and 1

 It provides a way of summarizing uncertainty
Variables
 Boolean random variables: cavity might be true or false
 Discrete random variables: weather might be sunny, rainy,
cloudy, snow
 P(Weather=sunny)
 P(Weather=rainy)
 P(Weather=cloudy)
 P(Weather=snow)
 Continuous random variables: the temperature has continuous
values
Where do probabilities come
from?
 Frequentist:
 From experiments: from any finite sample we can estimate the true
fraction and also calculate how accurate our estimate is likely to be
 Subjectivist:
 The agent's degree of belief
 Objectivist:
 Probabilities are part of the true nature of the universe, e.g. that a
coin comes up heads with probability 0.5 is a property of the coin itself
 Before the evidence is obtained: prior probability
 P(a), the prior probability that the proposition a is true
 P(cavity)=0.1

 After the evidence is obtained: posterior probability
 P(a|b), the probability of a given that all we know is b
 P(cavity|toothache)=0.8
Axioms of Probability
(Kolmogorov's axioms,
first published in German in 1933)

 All probabilities are between 0 and 1. For any
proposition a: 0 ≤ P(a) ≤ 1

 P(true)=1, P(false)=0

 The probability of a disjunction is given by
P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

 Product rule
P(a ∧ b) = P(a | b) P(b)
P(a ∧ b) = P(b | a) P(a)
Theorem of total probability
If events A1, ..., An are mutually
exclusive with ∑i P(Ai) = 1, then

P(B) = ∑i P(B | Ai) P(Ai)
Bayes's rule
 (Reverend Thomas Bayes, 1702-1761)
• He set down his findings on probability in
"Essay Towards Solving a Problem in the
Doctrine of Chances" (1763), published
posthumously in the Philosophical
Transactions of the Royal Society of London

P(b|a) = P(a|b) P(b) / P(a)
Diagnosis
 What is the probability of meningitis in a patient with a stiff
neck?

 A doctor knows that the disease meningitis causes the patient to have
a stiff neck 50% of the time -> P(s|m) = 0.5

 Prior Probabilities:
• That the patient has meningitis is 1/50,000 -> P(m) = 0.00002
• That the patient has a stiff neck is 1/20 -> P(s) = 0.05

P(m|s) = P(s|m) P(m) / P(s)

P(m|s) = (0.5 × 0.00002) / 0.05 = 0.0002
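As a quick check, the same calculation can be reproduced in a few lines of Python (a minimal sketch; the variable names are illustrative, not from the slides):

# Bayes' rule for the meningitis example: P(m|s) = P(s|m) P(m) / P(s)
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002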
Normalization

1 = P(y | x) + P(¬y | x)

P(y | x) = P(x | y) P(y) / P(x)
P(¬y | x) = P(x | ¬y) P(¬y) / P(x)

P(Y | X) = α × P(X | Y) P(Y)

α ⟨P(y | x), P(¬y | x)⟩

α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

 P(h) = prior probability of hypothesis h
 P(D) = prior probability of training data D
 P(h|D) = probability of h given D
 P(D|h) = probability of D given h
Choosing Hypotheses
 Generally we want the most probable
hypothesis given the training data

 Maximum a posteriori hypothesis hMAP:

hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)

 If we assume P(hi)=P(hj) for all hi and hj, then
we can further simplify and choose the

 Maximum likelihood (ML) hypothesis:

hML = argmax h∈H P(D|h)
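As a quick illustration of the difference, here is a minimal sketch in Python; the two hypotheses and their priors/likelihoods below are made-up numbers, not from the slides:

# MAP vs. ML over a tiny, hypothetical hypothesis space
hypotheses = {
    "h1": {"prior": 0.8, "likelihood": 0.2},   # P(h1), P(D|h1)
    "h2": {"prior": 0.2, "likelihood": 0.5},   # P(h2), P(D|h2)
}

h_ml  = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

print(h_ml)    # h2: highest P(D|h)
print(h_map)   # h1: the prior outweighs h2's likelihood advantage (0.16 > 0.10)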


Example
 Does the patient have cancer or not?

A patient takes a lab test and the result comes
back positive. The test returns a correct
positive result (+) in only 98% of the cases in
which the disease is actually present, and a
correct negative result (-) in only 97% of the
cases in which the disease is not present.
Furthermore, 0.008 of the entire population have
this cancer.

Suppose a positive result (+) is returned...

P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

⇒ hMAP = ¬cancer

Normalization
0.0078 / (0.0078 + 0.0298) = 0.20745
0.0298 / (0.0078 + 0.0298) = 0.79255
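The same arithmetic as a minimal Python sketch (variable names are illustrative):

# Cancer test example: posteriors for cancer / no-cancer given a positive test
p_cancer = 0.008
p_pos_given_cancer = 0.98      # sensitivity
p_pos_given_no_cancer = 0.03   # 1 - specificity (0.97 correct negatives)

unnorm_cancer = p_pos_given_cancer * p_cancer              # ≈ 0.0078
unnorm_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)  # ≈ 0.0298

alpha = 1 / (unnorm_cancer + unnorm_no_cancer)
print(alpha * unnorm_cancer)     # ≈ 0.21
print(alpha * unnorm_no_cancer)  # ≈ 0.79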

 The result of Bayesian inference depends


strongly on the prior probabilities, which
must be available in order to apply the
method
Brute-Force
Bayes Concept Learning
 For each hypothesis h in H, calculate the
posterior probability P(h|D)

 Output the hypothesis hMAP with the highest
posterior probability

 Given no prior knowledge that one
hypothesis is more likely than another,
what values should we specify for P(h)?

 What choice shall we make for P(D|h)?

 Choose P(h) to be the uniform distribution
 P(h) = 1/|H| for all h in H

 P(D|h) = 1 if h is consistent with D
 P(D|h) = 0 otherwise
P(D)

P(D) = ∑ hi∈H P(D|hi) P(hi)

P(D) = ∑ hi∈VSH,D 1 · (1/|H|) + ∑ hi∉VSH,D 0 · (1/|H|)

P(D) = |VSH,D| / |H|

 The version space VSH,D is the subset of hypotheses from H
that are consistent with the training examples in D

P(h|D) = 0                                          if h is inconsistent with D

P(h|D) = (1 · 1/|H|) / (|VSH,D| / |H|) = 1/|VSH,D|  if h is consistent with D
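A minimal sketch of this brute-force procedure, assuming a tiny hypothesis space of threshold functions (the hypotheses, data, and helper name posterior are all illustrative):

# Brute-force Bayes concept learning with a uniform prior P(h) = 1/|H|:
# consistent hypotheses get posterior 1/|VS|, all others get 0.
def posterior(hypotheses, data):
    prior = 1 / len(hypotheses)
    consistent = [all(h(x) == label for x, label in data) for h in hypotheses]
    p_data = sum(consistent) * prior               # P(D) = |VS| / |H|
    return [prior / p_data if c else 0.0 for c in consistent]

# Hypotheses: "x >= t" for a few thresholds t
hypotheses = [lambda x, t=t: x >= t for t in (1, 2, 3, 4)]
data = [(2.5, True), (1.5, False)]                 # only t = 2 is consistent
print(posterior(hypotheses, data))                 # [0.0, 1.0, 0.0, 0.0]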

Maximum Likelihood of a real-
valued function
 The ML hypothesis maximizes the likelihood p(D|h) of the observed
(noisy) training values
 Maximize the natural log of this instead...
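To make this concrete: under the usual assumption that the observed target values are the true function plus zero-mean Gaussian noise, maximizing the log-likelihood is the same as minimizing the sum of squared errors. The sketch below checks this numerically for a one-parameter hypothesis space; all names and numbers are illustrative:

import numpy as np

def log_likelihood(w, xs, ys, sigma=1.0):
    # log p(D | h_w) when y_i = w * x_i + Gaussian noise N(0, sigma^2)
    errors = ys - w * xs
    return -0.5 * np.sum(errors ** 2) / sigma ** 2 - len(xs) * np.log(sigma * np.sqrt(2 * np.pi))

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])
ws = np.linspace(0.0, 4.0, 401)

w_ml  = ws[np.argmax([log_likelihood(w, xs, ys) for w in ws])]
w_lsq = ws[np.argmin([np.sum((ys - w * xs) ** 2) for w in ws])]
print(w_ml, w_lsq)    # both ≈ 2.04: the ML fit and the least-squares fit coincide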
Bayes optimal Classifier
A weighted majority classifier
 What is the most probable classification of the new
instance given the training data?
 The most probable classification of the new instance is
obtained by combining the predictions of all hypotheses,
weighted by their posterior probabilities
 If the classification of the new example can take any
value vj from some set V, then the probability P(vj|D)
that the correct classification for the new instance is
vj is just:

P(vj|D) = ∑ hi∈H P(vj|hi) P(hi|D)

Bayes optimal classification:

argmax vj∈V ∑ hi∈H P(vj|hi) P(hi|D)
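A minimal sketch of the weighted vote, assuming the hypothesis posteriors P(hi|D) and per-hypothesis predictions P(vj|hi) are already available (the numbers are illustrative):

# Bayes optimal classification: weight each hypothesis's vote by its posterior
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# P(v|h) for each hypothesis; here every hypothesis predicts one class deterministically
predictions = {"h1": {"+": 1.0, "-": 0.0},
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

scores = {}
for v in ("+", "-"):
    scores[v] = sum(predictions[h][v] * posteriors[h] for h in posteriors)

print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-' is the Bayes optimal classification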

Gibbs Algorithm
 The Bayes optimal classifier provides the best
result, but it can be expensive if there are many
hypotheses

 Gibbs algorithm:
 Choose one hypothesis at random, according
to P(h|D)
 Use this to classify the new instance
 Surprising fact: assuming the target concept is drawn according to a
correct, uniform prior distribution over H,
 picking a hypothesis at random in this way
 gives an expected error no worse than twice that of the Bayes
optimal classifier
Joint distribution

 A joint distribution for toothache, cavity, catch (the dentist's probe
catches in my tooth :-( )

 We need to know the conditional probabilities of the conjunction
of toothache and cavity
 What can the dentist conclude if the probe catches in the aching
tooth?

P(cavity | toothache ∧ catch) = P(toothache ∧ catch | cavity) P(cavity) / P(toothache ∧ catch)

 For n possible Boolean variables there are 2^n possible combinations


Conditional Independence
 Once we know that the patient has cavity we do
not expect the probability of the probe catching to
depend on the presence of toothache
P (catch | cavity ∧ toothache) = P (catch | cavity )
P (toothache | cavity ∧ catch) = P (toothache | cavity )

 Independence between a and b


P ( a | b) = P ( a )
P(b | a ) = P(b)
P (a ∧ b) = P (a ) P (b)
P (toothache, catch, cavity,Weather = cloudy ) =
= P(Weather = cloudy ) P (toothache, catch, cavity )

• The decomposition of large probabilistic domains into
weakly connected subsets via conditional
independence is one of the most important
developments in the recent history of AI

• This can work well even if the assumption is not true!

A single cause directly influences a number of
effects, all of which are conditionally
independent given the cause:

P(cause, effect1, effect2, ..., effectn) = P(cause) ∏ i=1..n P(effecti | cause)
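One way to see why this matters is to count how many numbers each representation needs; a small illustrative sketch (the value of n is chosen arbitrarily):

# Parameter count: full joint table vs. "one cause, n effects" factorization
# (all variables Boolean)
n = 10                                  # number of effect variables

full_joint = 2 ** (n + 1) - 1           # one probability per assignment, minus 1 (they sum to 1)
factored   = 1 + 2 * n                  # P(cause) plus P(effect_i|cause) and P(effect_i|¬cause)
print(full_joint, factored)             # 2047 vs 21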
Naive Bayes Classifier
 Along with decision trees, neural networks and
nearest neighbor, one of the most practical learning
methods

 When to use:
 Moderate or large training set available
 Attributes that describe instances are conditionally
independent given classification
 Successful applications:
 Diagnosis
 Classifying text documents
Naive Bayes Classifier
 Assume a target function f: X → V, where
each instance x is described by attributes a1,
a2, ..., an
 The most probable value of f(x) is:

vMAP = argmax vj∈V P(vj | a1, a2, ..., an) = argmax vj∈V P(a1, a2, ..., an | vj) P(vj)

 Naive Bayes assumption:

P(a1, a2, ..., an | vj) = ∏ i P(ai | vj)

 which gives the Naive Bayes classifier:

vNB = argmax vj∈V P(vj) ∏ i P(ai | vj)
Naive Bayes Algorithm
 For each target value vj
  estimate P(vj)
 For each attribute value ai of each
attribute a
  estimate P(ai|vj)
Training dataset

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age<=30, income=medium, student=yes, credit_rating=fair)

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no
Naïve Bayesian Classifier:
Example
 Compute the class priors and P(ai|Ci) for each class

P(buys_computer="yes") = 9/14
P(buys_computer="no") = 5/14

P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

 X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X|buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci) × P(Ci):
P(X|buys_computer="yes") × P(buys_computer="yes") = 0.028
P(X|buys_computer="no") × P(buys_computer="no") = 0.007

 Therefore X is classified as buys_computer = "yes"
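The same computation as a minimal Python sketch over the training table above (no smoothing; attribute order is age, income, student, credit_rating):

# Naive Bayes on the buys_computer table: count-based estimates
rows = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
x = ("<=30", "medium", "yes", "fair")   # the query instance X

for v in ("yes", "no"):
    class_rows = [r for r in rows if r[-1] == v]
    p = len(class_rows) / len(rows)                      # prior P(v)
    for i, a in enumerate(x):                            # product of P(ai|v)
        p *= sum(r[i] == a for r in class_rows) / len(class_rows)
    print(v, round(p, 3))    # yes 0.028, no 0.007  ->  classify as 'yes'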
 Conditional independence assumption is
often violated

 ...but it works surprisingly well anyway


Estimating Probabilities
 So far we have estimated probabilities by the fraction
nc/n of times the event is observed to occur over
the total number of opportunities n
 This provides poor estimates when nc is very small

 What if none of the training instances with target
value vj have attribute value ai?
 nc is 0

 When nc is very small, use the m-estimate of probability:

P(ai|vj) = (nc + m·p) / (n + m)

 n is the number of training examples for which v=vj
 nc is the number of examples for which v=vj and a=ai
 p is the prior estimate of the probability
 m is the weight given to the prior (i.e. the number of
"virtual" examples)
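A one-function sketch of the m-estimate; the prior p and weight m in the example call are illustrative choices:

def m_estimate(nc, n, p, m):
    """m-estimate of probability: (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# E.g. an attribute value never seen with this class (nc = 0),
# with a uniform prior p = 1/3 over 3 attribute values and m = 3 virtual examples:
print(m_estimate(0, 9, 1/3, 3))   # 0.083... instead of 0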
Naïve Bayesian Classifier:
Comments
 Advantages:
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages:
 Assumption of class-conditional independence, therefore loss of
accuracy
 In practice, dependencies exist among variables
 E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by the Naïve
Bayesian Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
 Uncertainty & Probability
 Bayes' rule
 Choosing Hypotheses - Maximum a posteriori
 Maximum Likelihood - Bayes concept learning
 Maximum Likelihood of real-valued functions
 Bayes optimal Classifier
 Joint distributions
 Naive Bayes Classifier
Bayesian Belief Networks

Example network: Burglary and Earthquake are parents of Alarm;
Alarm is the parent of JohnCalls and MaryCalls.

Burglary:   P(B) = 0.001
Earthquake: P(E) = 0.002

Alarm:
Burg.  Earth.  P(A)
t      t       .95
t      f       .94
f      t       .29
f      f       .001

JohnCalls:          MaryCalls:
A    P(J)           A    P(M)
t    .90            t    .70
f    .05            f    .01
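To show how the network is used, the sketch below multiplies out one full joint probability using the chain rule for Bayesian networks; the particular query (both neighbors call, the alarm rings, no burglary, no earthquake) is an illustrative choice, not from the slides:

# Joint probability from the burglary network:
# P(J, M, A, ¬B, ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

joint = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(joint)   # ≈ 0.000628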
Thank you
