
Bayes Classification

By
SATHISHKUMAR G
(sathishsak111@gmail.com)
Classification
 Uncertainty & Probability
 Bayes' rule
 Choosing Hypotheses - Maximum a posteriori
 Maximum Likelihood - Bayes concept learning
 Maximum Likelihood of real-valued functions
 Bayes optimal Classifier
 Joint distributions
 Naive Bayes Classifier
Uncertainty
 Our main tool is probability theory, which
assigns to each sentence a numerical degree of
belief between 0 and 1

 It provides a way of summarizing uncertainty
Variables
 Boolean random variables: cavity might be true or false
 Discrete random variables: weather might be sunny, rainy,
cloudy, snow
 P(Weather=sunny)
 P(Weather=rainy)
 P(Weather=cloudy)
 P(Weather=snow)
 Continuous random variables: the temperature has continuous
values
Where do probabilities come
from?
 Frequentist:
 From experiments: from any finite sample we can estimate the true
fraction and also calculate how accurate our estimate is likely to be
 Subjectivist:
 The agent's degree of belief
 Objectivist:
 Probabilities are part of the true nature of the universe, e.g. that a
coin comes up heads with probability 0.5 is a property of the coin itself
 Before the evidence is obtained: prior probability
 P(a), the prior probability that the proposition a is true
 P(cavity)=0.1

 After the evidence is obtained: posterior probability
 P(a|b), the probability of a given that all we know is b
 P(cavity|toothache)=0.8
Axioms of Probability
(Kolmogorov's axioms,
first published in German in 1933)

 All probabilities are between 0 and 1. For any
proposition a: 0 ≤ P(a) ≤ 1

 P(true)=1, P(false)=0

 The probability of a disjunction is given by
P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

 Product rule
P(a ∧ b) = P(a | b) P(b)
P(a ∧ b) = P(b | a) P(a)
Theorem of total probability
If events A1, ..., An are mutually
exclusive with ∑i P(Ai) = 1, then

P(B) = ∑i P(B | Ai) P(Ai)
Bayes's rule
 (Reverend Thomas Bayes, 1702-1761)
• He set down his findings on probability in
"Essay Towards Solving a Problem in the
Doctrine of Chances" (1763), published
posthumously in the Philosophical
Transactions of the Royal Society of London

P(b|a) = P(a|b) P(b) / P(a)
Diagnosis
 What is the probability of meningitis in a patient with a stiff
neck?

 A doctor knows that the disease meningitis causes the patient to have
a stiff neck 50% of the time -> P(s|m) = 0.5

 Prior Probabilities:
• That the patient has meningitis is 1/50,000 -> P(m) = 0.00002
• That the patient has a stiff neck is 1/20 -> P(s) = 0.05

P(m|s) = P(s|m) P(m) / P(s)

P(m|s) = (0.5 × 0.00002) / 0.05 = 0.0002
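As a quick check, the same calculation can be reproduced in a few lines of Python (a minimal sketch; the variable names are illustrative, not from the slides):

# Bayes' rule for the meningitis example: P(m|s) = P(s|m) P(m) / P(s)
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0002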
Normalization

1 = P(y | x) + P(¬y | x)

P(y | x) = P(x | y) P(y) / P(x)
P(¬y | x) = P(x | ¬y) P(¬y) / P(x)

P(Y | X) = α × P(X | Y) P(Y)

α ⟨P(y | x), P(¬y | x)⟩

α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

 P(h) = prior probability of hypothesis h
 P(D) = prior probability of training data D
 P(h|D) = probability of h given D
 P(D|h) = probability of D given h
Choosing Hypotheses
 Generally we want the most probable
hypothesis given the training data

 Maximum a posteriori hypothesis hMAP:

hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)

 If we assume P(hi)=P(hj) for all hi and hj, then
we can further simplify and choose the

 Maximum likelihood (ML) hypothesis:

hML = argmax h∈H P(D|h)
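As a quick illustration of the difference, here is a minimal sketch in Python; the two hypotheses and their priors/likelihoods below are made-up numbers, not from the slides:

# MAP vs. ML over a tiny, hypothetical hypothesis space
hypotheses = {
    "h1": {"prior": 0.8, "likelihood": 0.2},   # P(h1), P(D|h1)
    "h2": {"prior": 0.2, "likelihood": 0.5},   # P(h2), P(D|h2)
}

h_ml  = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

print(h_ml)    # h2: highest P(D|h)
print(h_map)   # h1: the prior outweighs h2's likelihood advantage (0.16 > 0.10)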


Example
 Does the patient have cancer or not?

A patient takes a lab test and the result comes
back positive. The test returns a correct
positive result (+) in only 98% of the cases in
which the disease is actually present, and a
correct negative result (-) in only 97% of the
cases in which the disease is not present.
Furthermore, 0.008 of the entire population have
this cancer.

Suppose a positive result (+) is returned...

P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

⇒ hMAP = ¬cancer

Normalization
0.0078 / (0.0078 + 0.0298) = 0.20745
0.0298 / (0.0078 + 0.0298) = 0.79255
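The same arithmetic as a minimal Python sketch (variable names are illustrative):

# Cancer test example: posteriors for cancer / no-cancer given a positive test
p_cancer = 0.008
p_pos_given_cancer = 0.98      # sensitivity
p_pos_given_no_cancer = 0.03   # 1 - specificity (0.97 correct negatives)

unnorm_cancer = p_pos_given_cancer * p_cancer              # ≈ 0.0078
unnorm_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)  # ≈ 0.0298

alpha = 1 / (unnorm_cancer + unnorm_no_cancer)
print(alpha * unnorm_cancer)     # ≈ 0.21
print(alpha * unnorm_no_cancer)  # ≈ 0.79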

 The result of Bayesian inference depends


strongly on the prior probabilities, which
must be available in order to apply the
method
Brute-Force
Bayes Concept Learning
 For each hypothesis h in H, calculate the
posterior probability P(h|D)

 Output the hypothesis hMAP with the highest
posterior probability

 Given no prior knowledge that one
hypothesis is more likely than another,
what values should we specify for P(h)?

 What choice shall we make for P(D|h)?

 Choose P(h) to be the uniform distribution
 P(h) = 1/|H| for all h in H

 P(D|h) = 1 if h is consistent with D
 P(D|h) = 0 otherwise
P(D)

P(D) = ∑ hi∈H P(D|hi) P(hi)

P(D) = ∑ hi∈VSH,D 1 · (1/|H|) + ∑ hi∉VSH,D 0 · (1/|H|)

P(D) = |VSH,D| / |H|

 The version space VSH,D is the subset of hypotheses from H
that are consistent with the training examples in D

P(h|D) = 0                                          if h is inconsistent with D

P(h|D) = (1 · 1/|H|) / (|VSH,D| / |H|) = 1/|VSH,D|  if h is consistent with D
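A minimal sketch of this brute-force procedure, assuming a tiny hypothesis space of threshold functions (the hypotheses, data, and helper name posterior are all illustrative):

# Brute-force Bayes concept learning with a uniform prior P(h) = 1/|H|:
# consistent hypotheses get posterior 1/|VS|, all others get 0.
def posterior(hypotheses, data):
    prior = 1 / len(hypotheses)
    consistent = [all(h(x) == label for x, label in data) for h in hypotheses]
    p_data = sum(consistent) * prior               # P(D) = |VS| / |H|
    return [prior / p_data if c else 0.0 for c in consistent]

# Hypotheses: "x >= t" for a few thresholds t
hypotheses = [lambda x, t=t: x >= t for t in (1, 2, 3, 4)]
data = [(2.5, True), (1.5, False)]                 # only t = 2 is consistent
print(posterior(hypotheses, data))                 # [0.0, 1.0, 0.0, 0.0]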

Maximum Likelihood of a real-
valued function
 The ML hypothesis maximizes the likelihood p(D|h) of the observed
(noisy) training values
 Maximize the natural log of this instead...
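To make this concrete: under the usual assumption that the observed target values are the true function plus zero-mean Gaussian noise, maximizing the log-likelihood is the same as minimizing the sum of squared errors. The sketch below checks this numerically for a one-parameter hypothesis space; all names and numbers are illustrative:

import numpy as np

def log_likelihood(w, xs, ys, sigma=1.0):
    # log p(D | h_w) when y_i = w * x_i + Gaussian noise N(0, sigma^2)
    errors = ys - w * xs
    return -0.5 * np.sum(errors ** 2) / sigma ** 2 - len(xs) * np.log(sigma * np.sqrt(2 * np.pi))

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.1, 3.9, 6.2])
ws = np.linspace(0.0, 4.0, 401)

w_ml  = ws[np.argmax([log_likelihood(w, xs, ys) for w in ws])]
w_lsq = ws[np.argmin([np.sum((ys - w * xs) ** 2) for w in ws])]
print(w_ml, w_lsq)    # both ≈ 2.04: the ML fit and the least-squares fit coincide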
Bayes optimal Classifier
A weighted majority classifier
 What is the most probable classification of the new
instance given the training data?
 The most probable classification of the new instance is
obtained by combining the predictions of all hypotheses,
weighted by their posterior probabilities
 If the classification of the new example can take any
value vj from some set V, then the probability P(vj|D)
that the correct classification for the new instance is
vj is just:

P(vj|D) = ∑ hi∈H P(vj|hi) P(hi|D)

Bayes optimal classification:

argmax vj∈V ∑ hi∈H P(vj|hi) P(hi|D)
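A minimal sketch of the weighted vote, assuming the hypothesis posteriors P(hi|D) and per-hypothesis predictions P(vj|hi) are already available (the numbers are illustrative):

# Bayes optimal classification: weight each hypothesis's vote by its posterior
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# P(v|h) for each hypothesis; here every hypothesis predicts one class deterministically
predictions = {"h1": {"+": 1.0, "-": 0.0},
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

scores = {}
for v in ("+", "-"):
    scores[v] = sum(predictions[h][v] * posteriors[h] for h in posteriors)

print(scores)                        # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))   # '-' is the Bayes optimal classification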

Gibbs Algorithm
 The Bayes optimal classifier provides the best
result, but it can be expensive if there are many
hypotheses

 Gibbs algorithm:
 Choose one hypothesis at random, according
to P(h|D)
 Use this to classify the new instance
 Surprising fact: assuming the target concept is drawn according to a
correct, uniform prior distribution over H,
 picking a hypothesis at random in this way
 gives an expected error no worse than twice that of the Bayes
optimal classifier
Joint distribution

 A joint distribution for toothache, cavity, catch (the dentist's probe
catches in my tooth :-( )

 We need to know the conditional probabilities of the conjunction
of toothache and cavity
 What can the dentist conclude if the probe catches in the aching
tooth?

P(cavity | toothache ∧ catch) = P(toothache ∧ catch | cavity) P(cavity) / P(toothache ∧ catch)

 For n possible Boolean variables there are 2^n possible combinations


Conditional Independence
 Once we know that the patient has cavity we do
not expect the probability of the probe catching to
depend on the presence of toothache
P (catch | cavity ∧ toothache) = P (catch | cavity )
P (toothache | cavity ∧ catch) = P (toothache | cavity )

 Independence between a and b


P ( a | b) = P ( a )
P(b | a ) = P(b)
P (a ∧ b) = P (a ) P (b)
P (toothache, catch, cavity,Weather = cloudy ) =
= P(Weather = cloudy ) P (toothache, catch, cavity )

• The decomposition of large probabilistic domains into
weakly connected subsets via conditional
independence is one of the most important
developments in the recent history of AI

• This can work well even if the assumption is not true!

A single cause directly influences a number of
effects, all of which are conditionally
independent given the cause:

P(cause, effect1, effect2, ..., effectn) = P(cause) ∏ i=1..n P(effecti | cause)
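One way to see why this matters is to count how many numbers each representation needs; a small illustrative sketch (the value of n is chosen arbitrarily):

# Parameter count: full joint table vs. "one cause, n effects" factorization
# (all variables Boolean)
n = 10                                  # number of effect variables

full_joint = 2 ** (n + 1) - 1           # one probability per assignment, minus 1 (they sum to 1)
factored   = 1 + 2 * n                  # P(cause) plus P(effect_i|cause) and P(effect_i|¬cause)
print(full_joint, factored)             # 2047 vs 21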
Naive Bayes Classifier
 Along with decision trees, neural networks and
nearest neighbor, one of the most practical learning
methods

 When to use:
 Moderate or large training set available
 Attributes that describe instances are conditionally
independent given classification
 Successful applications:
 Diagnosis
 Classifying text documents
Naive Bayes Classifier
 Assume a target function f: X → V, where
each instance x is described by attributes a1,
a2, ..., an
 The most probable value of f(x) is:

vMAP = argmax vj∈V P(vj | a1, a2, ..., an) = argmax vj∈V P(a1, a2, ..., an | vj) P(vj)

 Naive Bayes assumption:

P(a1, a2, ..., an | vj) = ∏ i P(ai | vj)

 which gives the Naive Bayes classifier:

vNB = argmax vj∈V P(vj) ∏ i P(ai | vj)
Naive Bayes Algorithm
 For each target value vj
  estimate P(vj)
 For each attribute value ai of each
attribute a
  estimate P(ai|vj)
Training dataset

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age<=30, income=medium, student=yes, credit_rating=fair)

age     income  student credit_rating  buys_computer
<=30    high    no      fair           no
<=30    high    no      excellent      no
31…40   high    no      fair           yes
>40     medium  no      fair           yes
>40     low     yes     fair           yes
>40     low     yes     excellent      no
31…40   low     yes     excellent      yes
<=30    medium  no      fair           no
<=30    low     yes     fair           yes
>40     medium  yes     fair           yes
<=30    medium  yes     excellent      yes
31…40   medium  no      excellent      yes
31…40   high    yes     fair           yes
>40     medium  no      excellent      no
Naïve Bayesian Classifier:
Example
 Compute the class priors and P(ai|Ci) for each class

P(buys_computer="yes") = 9/14
P(buys_computer="no") = 5/14

P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

 X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X|buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci) × P(Ci):
P(X|buys_computer="yes") × P(buys_computer="yes") = 0.028
P(X|buys_computer="no") × P(buys_computer="no") = 0.007

 Therefore X is classified as buys_computer = "yes"
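The same computation as a minimal Python sketch over the training table above (no smoothing; attribute order is age, income, student, credit_rating):

# Naive Bayes on the buys_computer table: count-based estimates
rows = [
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
    ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
x = ("<=30", "medium", "yes", "fair")   # the query instance X

for v in ("yes", "no"):
    class_rows = [r for r in rows if r[-1] == v]
    p = len(class_rows) / len(rows)                      # prior P(v)
    for i, a in enumerate(x):                            # product of P(ai|v)
        p *= sum(r[i] == a for r in class_rows) / len(class_rows)
    print(v, round(p, 3))    # yes 0.028, no 0.007  ->  classify as 'yes'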
 Conditional independence assumption is
often violated

 ...but it works surprisingly well anyway


Estimating Probabilities
 So far we have estimated probabilities by the fraction
nc/n of times the event is observed to occur over
the total number of opportunities n
 This provides poor estimates when nc is very small

 What if none of the training instances with target
value vj have attribute value ai?
 nc is 0

 When nc is very small, use the m-estimate of probability:

P(ai|vj) = (nc + m·p) / (n + m)

 n is the number of training examples for which v=vj
 nc is the number of examples for which v=vj and a=ai
 p is the prior estimate of the probability
 m is the weight given to the prior (i.e. the number of
"virtual" examples)
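A one-function sketch of the m-estimate; the prior p and weight m in the example call are illustrative choices:

def m_estimate(nc, n, p, m):
    """m-estimate of probability: (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# E.g. an attribute value never seen with this class (nc = 0),
# with a uniform prior p = 1/3 over 3 attribute values and m = 3 virtual examples:
print(m_estimate(0, 9, 1/3, 3))   # 0.083... instead of 0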
Naïve Bayesian Classifier:
Comments
 Advantages:
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages:
 Assumption of class-conditional independence, therefore loss of
accuracy
 In practice, dependencies exist among variables
 E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
 Dependencies among these cannot be modeled by the Naïve
Bayesian Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
 Uncertainty & Probability
 Bayes' rule
 Choosing Hypotheses - Maximum a posteriori
 Maximum Likelihood - Bayes concept learning
 Maximum Likelihood of real-valued functions
 Bayes optimal Classifier
 Joint distributions
 Naive Bayes Classifier
Bayesian Belief Networks

Example network: Burglary and Earthquake are parents of Alarm;
Alarm is the parent of JohnCalls and MaryCalls.

Burglary:   P(B) = 0.001
Earthquake: P(E) = 0.002

Alarm:
Burg.  Earth.  P(A)
t      t       .95
t      f       .94
f      t       .29
f      f       .001

JohnCalls:          MaryCalls:
A    P(J)           A    P(M)
t    .90            t    .70
f    .05            f    .01
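To show how the network is used, the sketch below multiplies out one full joint probability using the chain rule for Bayesian networks; the particular query (both neighbors call, the alarm rings, no burglary, no earthquake) is an illustrative choice, not from the slides:

# Joint probability from the burglary network:
# P(J, M, A, ¬B, ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

joint = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(joint)   # ≈ 0.000628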
Thank you
