Lecture 7. Probabilistic Reasoning


Knowledge Engineering

(IT4362E)
Quang Nhat NGUYEN
(quang.nguyennhat@hust.edu.vn)
Hanoi University of Science and Technology
School of Information and Communication Technology
Academic year 2013-2014
Content
Introduction
First-order logic
Knowledge representation
Semantic Web
Logic programming
Uncertain reasoning
Probability theory
Probabilistic reasoning
Expert systems
Knowledge discovery by Machine learning
Knowledge discovery by Data mining
Basic probability concepts
Suppose we have an experiment (e.g., a dice roll) whose outcome depends on chance
Sample space S. The set of all possible outcomes
E.g., S = {1, 2, 3, 4, 5, 6} for a dice roll
Event E. A subset of the sample space
E.g., E = {1}: the result of the roll is one
E.g., E = {1, 3, 5}: the result of the roll is an odd number
Event space W. The set of possible worlds in which the outcome can occur
E.g., W includes all dice rolls
Random variable A. A random variable represents an event, and there is some degree of chance (probability) that the event occurs
Visualizing probability
P(A): the fraction of possible worlds in which A is true
(Figure: the event space of all possible worlds has total area 1; the worlds in which A is true form a region whose area is P(A), and the remaining worlds are those in which A is false.)
[http://www.cs.cmu.edu/~awm/tutorials]
Boolean random variables
A Boolean random variable can take either of the two Boolean values, true or false
The axioms
0 ≤ P(A) ≤ 1
P(true) = 1
P(false) = 0
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
The corollaries
P(not A) = P(~A) = 1 - P(A)
P(A) = P(A, B) + P(A, ~B)
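These axioms and corollaries can be checked numerically on a small sample space. The following is a minimal sketch (not part of the original slides), assuming a fair six-sided dice, the event A = {1, 3, 5} (odd result), and an extra example event B = {4, 5, 6}:

```python
from fractions import Fraction

# Sample space of a fair six-sided dice roll; each outcome is equally likely.
S = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event & S), len(S))

A = {1, 3, 5}   # the result is an odd number
B = {4, 5, 6}   # the result is greater than 3 (an assumed example event)

# The axioms
assert 0 <= P(A) <= 1
assert P(S) == 1                                # P(true)  = 1
assert P(set()) == 0                            # P(false) = 0
assert P(A | B) == P(A) + P(B) - P(A & B)       # P(A or B) = P(A) + P(B) - P(A and B)

# The corollaries
assert P(S - A) == 1 - P(A)                     # P(~A) = 1 - P(A)
assert P(A) == P(A & B) + P(A & (S - B))        # P(A) = P(A, B) + P(A, ~B)
print(P(A), P(A | B))                           # 1/2 5/6
```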
Multi-valued random variables
A multi-valued random variable can take a value from a set of k (> 2) values {v_1, v_2, ..., v_k}
The axioms
P(A=v_i ∧ A=v_j) = 0, if i ≠ j
P(A=v_1 ∨ A=v_2 ∨ ... ∨ A=v_k) = 1
The corollaries
P(A=v_1 ∨ A=v_2 ∨ ... ∨ A=v_i) = Σ_{j=1..i} P(A=v_j)
Σ_{j=1..k} P(A=v_j) = 1
P(B ∧ [A=v_1 ∨ A=v_2 ∨ ... ∨ A=v_i]) = Σ_{j=1..i} P(B ∧ A=v_j)
[http://www.cs.cmu.edu/~awm/tutorials]
Conditional probability (1)
P(A|B) is the fraction of worlds in which A is true given that B is true
Example
A: I will go to the football match tomorrow
B: It will not be raining tomorrow
P(A|B): The probability that I will go to the football match if (given that) it will not be raining tomorrow
Conditional probability (2)
Definition: P(A|B) = P(A, B) / P(B)
(Figure: the worlds in which B is true, and within them the worlds in which A is also true.)
Corollaries:
P(A, B) = P(A|B) . P(B)
P(A|B) + P(~A|B) = 1
Σ_{i=1..k} P(A=v_i|B) = 1
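As a minimal sketch of the definition P(A|B) = P(A, B) / P(B) (not from the slides), the snippet below again treats the outcomes of a fair dice roll as the possible worlds, with assumed example events A (odd result) and B (result at most 3):

```python
from fractions import Fraction

# Possible worlds: the outcomes of a fair six-sided dice roll.
S = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event & S), len(S))

A = {1, 3, 5}   # the result is odd
B = {1, 2, 3}   # the result is at most 3 (an assumed example event)

# P(A|B) = P(A, B) / P(B): the fraction of the B-worlds in which A is also true.
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)                               # 2/3
assert P(A & B) == p_A_given_B * P(B)            # P(A, B) = P(A|B) . P(B)
assert p_A_given_B + P((S - A) & B) / P(B) == 1  # P(A|B) + P(~A|B) = 1
```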
Independent variables (1)
Two events A and B are statistically independent if the probability of A is the same value
when B occurs, or
when B does not occur, or
when nothing is known about the occurrence of B
Example
A: I will play a football match tomorrow
B: Bob will play the football match
P(A|B) = P(A)
Whether Bob will play the football match tomorrow does not influence my decision to go to the football match.
Independent variables (2)
From the definition of independent variables
P(A|B) = P(A), we can derive the following rules
P(~A|B) = P(~A)
P(B|A) = P(B)
P(A, B) = P(A) . P(B)
P(~A, B) = P(~A) . P(B)
P(A, ~B) = P(A) . P(~B)
P(~A, ~B) = P(~A) . P(~B)
Conditional probability with >2 variables
P(A|B, C) is the probability of A given B and C
(Figure: a small diagram with nodes B and C pointing to A, labelled P(A|B, C).)
Example
A: I will walk along the river tomorrow morning
B: The weather is beautiful tomorrow morning
C: I will get up early tomorrow morning
P(A|B, C): The probability that I will walk along the river tomorrow morning if (given that) the weather is nice and I get up early
Conditional independence
Two variables A and C are conditionally independent given variable B if the probability of A given B is the same as the probability of A given B and C
Formal definition: P(A|B, C) = P(A|B)
Example
A: I will play the football match tomorrow
B: The football match will take place indoors
C: It will not be raining tomorrow
P(A|B, C) = P(A|B)
Knowing that the match will take place indoors, the probability that I will play the match does not depend on the weather
Probability - Important rules
Chain rule
P(A, B) = P(A|B) . P(B) = P(B|A) . P(A)
P(A|B) = P(A, B) / P(B) = P(B|A) . P(A) / P(B)
P(A, B|C) = P(A, B, C) / P(C) = P(A|B, C) . P(B, C) / P(C) = P(A|B, C) . P(B|C)
(Conditional) independence
P(A|B) = P(A); if A and B are independent
P(A, B|C) = P(A|C) . P(B|C); if A and B are conditionally independent given C
P(A_1, ..., A_n|C) = P(A_1|C) ... P(A_n|C); if A_1, ..., A_n are conditionally independent given C
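These rules can be illustrated on a small joint distribution. The sketch below (not from the slides) uses an assumed toy joint P(A, B, C), chosen so that A and B are conditionally independent given C; the numbers are illustrative only:

```python
# An assumed toy joint distribution P(A, B, C) in which A and B are
# conditionally independent given C (numbers chosen for illustration).
joint = {
    (True, True, True): 0.24,   (True, False, True): 0.16,
    (False, True, True): 0.06,  (False, False, True): 0.04,
    (True, True, False): 0.06,  (True, False, False): 0.09,
    (False, True, False): 0.14, (False, False, False): 0.21,
}

def P(**fixed):
    """Marginal probability of the assignments in fixed, e.g. P(A=True, C=False)."""
    names = ("A", "B", "C")
    return sum(p for vals, p in joint.items()
               if all(vals[names.index(k)] == v for k, v in fixed.items()))

# Chain rule: P(A, B|C) = P(A|B, C) . P(B|C)
lhs = P(A=True, B=True, C=True) / P(C=True)
rhs = (P(A=True, B=True, C=True) / P(B=True, C=True)) * (P(B=True, C=True) / P(C=True))
print(round(lhs, 3), round(rhs, 3))                                    # 0.48 0.48

# Conditional independence given C: P(A, B|C) = P(A|C) . P(B|C)
print(round((P(A=True, C=True) / P(C=True)) * (P(B=True, C=True) / P(C=True)), 3))  # 0.48

# ... but A and B are not unconditionally independent in this joint:
print(round(P(A=True, B=True), 3), round(P(A=True) * P(B=True), 3))    # 0.3 0.275
```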
Bayes theorem
P(h|D) = P(D|h) . P(h) / P(D)
P(h): Prior probability of hypothesis (e.g., classification) h
P(D): Prior probability that the data D is observed
P(D|h): Probability of observing the data D given hypothesis h
P(h|D): Probability of hypothesis h given the observed data D
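A minimal sketch of the computation (not from the slides), with hypothetical numbers for the prior and the likelihoods; the evidence P(D) is expanded over the hypothesis h and its negation:

```python
# Bayes theorem: P(h|D) = P(D|h) . P(h) / P(D)
# Hypothetical numbers, used only to show the shape of the computation.
p_h = 0.6               # prior P(h)
p_D_given_h = 0.125     # likelihood P(D|h)
p_D_given_not_h = 0.25  # likelihood P(D|~h)

# P(D) expanded over the two hypotheses: P(D) = P(D|h).P(h) + P(D|~h).P(~h)
p_D = p_D_given_h * p_h + p_D_given_not_h * (1 - p_h)
p_h_given_D = p_D_given_h * p_h / p_D
print(round(p_D, 3), round(p_h_given_D, 3))   # 0.175 0.429
```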
Bayes theorem - Example (1)
Assume that we have the following data (of a person):
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
[Mitchell, 1997]
Bayes theorem - Example (2)
Dataset D. The data of the days when the outlook is sunny and the wind is strong
Hypothesis h. The person plays tennis
Prior probability P(h). Probability that the person plays tennis (i.e., irrespective of the outlook and the wind)
Prior probability P(D). Probability that the outlook is sunny and the wind is strong
P(D|h). Probability that the outlook is sunny and the wind is strong, given that the person plays tennis
P(h|D). Probability that the person plays tennis, given that the outlook is sunny and the wind is strong
We are interested in this posterior probability!
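If these probabilities are read as relative frequencies in the 12-day table above (an assumption made here only for illustration), they can be computed by simple counting, as in this sketch:

```python
# (Outlook, Wind, Play Tennis) for the 12 days in the table above.
days = [
    ("Sunny", "Weak", "No"),   ("Sunny", "Strong", "No"),     ("Overcast", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"),   ("Rain", "Weak", "Yes"),       ("Rain", "Strong", "No"),
    ("Overcast", "Strong", "Yes"), ("Sunny", "Weak", "No"),   ("Sunny", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"),   ("Sunny", "Strong", "Yes"),    ("Overcast", "Strong", "Yes"),
]

plays = [d for d in days if d[2] == "Yes"]
sunny_strong = [d for d in days if d[0] == "Sunny" and d[1] == "Strong"]

p_h = len(plays) / len(days)                   # prior P(h): 8/12
p_D = len(sunny_strong) / len(days)            # prior P(D): 2/12
p_D_given_h = sum(1 for d in plays if d[0] == "Sunny" and d[1] == "Strong") / len(plays)
p_h_given_D = p_D_given_h * p_h / p_D          # Bayes theorem: the posterior of interest

print(round(p_h, 3), round(p_D, 3), round(p_D_given_h, 3), round(p_h_given_D, 3))
# 0.667 0.167 0.125 0.5
```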
Maximum a posteriori (MAP)
Given a set H of possible hypotheses (e.g., possible classifications), the learner finds the most probable hypothesis h (∈ H) given the observed data D
Such a maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) . P(h) / P(D)   (by Bayes theorem)
      = argmax_{h∈H} P(D|h) . P(h)   (P(D) is a constant, independent of h)
MAP hypothesis - Example
The set H contains two hypotheses
h_1: The person will play tennis
h_2: The person will not play tennis
Compute the two posterior probabilities P(h_1|D), P(h_2|D)
The MAP hypothesis: h_MAP = h_1 if P(h_1|D) ≥ P(h_2|D); otherwise h_MAP = h_2
Because P(D) = P(D, h_1) + P(D, h_2) is the same for both h_1 and h_2, we ignore it
So, we compute the two formulae: P(D|h_1) . P(h_1) and P(D|h_2) . P(h_2), and make the conclusion:
If P(D|h_1) . P(h_1) ≥ P(D|h_2) . P(h_2), the person will play tennis;
Otherwise, the person will not play tennis
Maximum likelihood estimation (MLE)
Recall of MAP. Given a set H of possible hypotheses, find the hypothesis that maximizes: P(D|h) . P(h)
Assumption in MLE. Every hypothesis in H is equally probable a priori: P(h_i) = P(h_j), ∀h_i, h_j ∈ H
MLE finds the hypothesis that maximizes P(D|h), where P(D|h) is called the likelihood of the data D given h
The maximum likelihood (ML) hypothesis
h_ML = argmax_{h∈H} P(D|h)
ML hypothesis - Example
The set H contains two hypotheses
h_1: The person will play tennis
h_2: The person will not play tennis
D: The data of the days when the outlook is sunny and the wind is strong
Compute the two likelihood values of the data D given the two hypotheses: P(D|h_1) and P(D|h_2)
P(Outlook=Sunny, Wind=Strong|h_1) = 1/8
P(Outlook=Sunny, Wind=Strong|h_2) = 1/4
The ML hypothesis: h_ML = h_1 if P(D|h_1) ≥ P(D|h_2); otherwise h_ML = h_2
Because P(Outlook=Sunny, Wind=Strong|h_1) < P(Outlook=Sunny, Wind=Strong|h_2), we arrive at the conclusion: The person will not play tennis
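A small sketch (not from the slides) that recomputes the two likelihoods quoted above from the 12-day table of the earlier example and selects the ML hypothesis; reading P(D|h) as a relative frequency within each class is an assumption of this sketch:

```python
# (Outlook, Wind, Play Tennis) for the 12 days of the earlier table.
days = [
    ("Sunny", "Weak", "No"),   ("Sunny", "Strong", "No"),     ("Overcast", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"),   ("Rain", "Weak", "Yes"),       ("Rain", "Strong", "No"),
    ("Overcast", "Strong", "Yes"), ("Sunny", "Weak", "No"),   ("Sunny", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"),   ("Sunny", "Strong", "Yes"),    ("Overcast", "Strong", "Yes"),
]

def likelihood(play_value):
    """P(Outlook=Sunny, Wind=Strong | h), where h fixes the Play Tennis value."""
    given = [d for d in days if d[2] == play_value]
    return sum(1 for d in given if d[0] == "Sunny" and d[1] == "Strong") / len(given)

p_D_h1, p_D_h2 = likelihood("Yes"), likelihood("No")   # 1/8 and 1/4, as on the slide
h_ML = "h_1 (will play tennis)" if p_D_h1 >= p_D_h2 else "h_2 (will not play tennis)"
print(p_D_h1, p_D_h2, h_ML)   # 0.125 0.25 h_2 (will not play tennis)
```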
Naïve Bayes classifier (1)
Problem definition
A training set D, where each training instance x is represented as an n-dimensional attribute vector: (x_1, x_2, ..., x_n)
A pre-defined set of classes: C = {c_1, c_2, ..., c_m}
Given a new instance z, which class should z be classified to?
We want to find the most probable class for instance z

c_MAP = argmax_{c_i∈C} P(c_i|z)
      = argmax_{c_i∈C} P(c_i|z_1, z_2, ..., z_n)
      = argmax_{c_i∈C} P(z_1, z_2, ..., z_n|c_i) . P(c_i) / P(z_1, z_2, ..., z_n)   (by Bayes theorem)
Naïve Bayes classifier (2)
To find the most probable class for z (continued)

c_MAP = argmax_{c_i∈C} P(z_1, z_2, ..., z_n|c_i) . P(c_i)   (P(z_1, z_2, ..., z_n) is the same for all classes)

Assumption in Naïve Bayes classifier. The attributes are conditionally independent given classification
P(z_1, z_2, ..., z_n|c_i) = ∏_{j=1..n} P(z_j|c_i)

Naïve Bayes classifier finds the most probable class for z
c_NB = argmax_{c_i∈C} P(c_i) . ∏_{j=1..n} P(z_j|c_i)
Naïve Bayes classifier - Algorithm
The learning (training) phase (given a training set)
For each classification (i.e., class label) c_i ∈ C
Estimate the prior probability: P(c_i)
For each attribute value x_j, estimate the probability of that attribute value given classification c_i: P(x_j|c_i)
The classification phase (given a new instance)
For each classification c_i ∈ C, compute the formula
P(c_i) . ∏_{j=1..n} P(x_j|c_i)
Select the most probable classification c*
c* = argmax_{c_i∈C} P(c_i) . ∏_{j=1..n} P(x_j|c_i)
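A minimal sketch of the two phases, with probabilities estimated as relative frequencies and no smoothing (smoothing is discussed under Issues (1) below); the function names train_nb and classify_nb are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate P(c_i) and P(x_j|c_i) as relative frequencies (no smoothing)."""
    n = len(labels)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}
    counts = defaultdict(lambda: defaultdict(Counter))   # counts[c][j][value]
    for x, c in zip(instances, labels):
        for j, v in enumerate(x):
            counts[c][j][v] += 1
    n_class = Counter(labels)
    cond = {c: {j: {v: k / n_class[c] for v, k in vals.items()}
                for j, vals in attrs.items()}
            for c, attrs in counts.items()}
    return prior, cond

def classify_nb(z, prior, cond):
    """Return the class c maximizing P(c) * prod_j P(z_j|c)."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(z):
            s *= cond[c][j].get(v, 0.0)   # unseen value -> probability 0 (see Issues (1))
        return s
    return max(prior, key=score)

# Tiny usage example with (Outlook, Wind) from the first four rows of the Play Tennis table:
X = [("Sunny", "Weak"), ("Sunny", "Strong"), ("Overcast", "Weak"), ("Rain", "Weak")]
y = ["No", "No", "Yes", "Yes"]
prior, cond = train_nb(X, y)
print(classify_nb(("Sunny", "Weak"), prior, cond))   # No
```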
Naïve Bayes classifier - Example (1)
Will a young student with medium income and fair credit rating buy a computer?
Rec. ID Age Income Student Credit_Rating Buy_Computer
1 Young High No Fair No
2 Young High No Excellent No
3 Medium High No Fair Yes
4 Old Medium No Fair Yes
5 Old Low Yes Fair Yes
6 Old Low Yes Excellent No
7 Medium Low Yes Excellent Yes
8 Young Medium No Fair No
9 Young Low Yes Fair Yes
10 Old Medium Yes Fair Yes
11 Young Medium Yes Excellent Yes
12 Medium Medium No Excellent Yes
13 Medium High Yes Fair Yes
14 Old Medium No Excellent No
http://www.cs.sunysb.edu/~cse634/lecture_notes/07classification.pdf
Naïve Bayes classifier - Example (2)
Representation of the problem
x = (Age=Young, Income=Medium, Student=Yes, Credit_Rating=Fair)
Two classes: c_1 (buy a computer) and c_2 (not buy a computer)
Compute the prior probability for each class
P(c_1) = 9/14
P(c_2) = 5/14
Compute the probability of each attribute value given each class
P(Age=Young|c_1) = 2/9; P(Age=Young|c_2) = 3/5
P(Income=Medium|c_1) = 4/9; P(Income=Medium|c_2) = 2/5
P(Student=Yes|c_1) = 6/9; P(Student=Yes|c_2) = 1/5
P(Credit_Rating=Fair|c_1) = 6/9; P(Credit_Rating=Fair|c_2) = 2/5
Naïve Bayes classifier - Example (3)
Compute the likelihood of instance x given each class
For class c_1
P(x|c_1) = P(Age=Young|c_1).P(Income=Medium|c_1).P(Student=Yes|c_1).P(Credit_Rating=Fair|c_1) = (2/9).(4/9).(6/9).(6/9) = 0.044
For class c_2
P(x|c_2) = P(Age=Young|c_2).P(Income=Medium|c_2).P(Student=Yes|c_2).P(Credit_Rating=Fair|c_2) = (3/5).(2/5).(1/5).(2/5) = 0.019
Find the most probable class
For class c_1: P(c_1).P(x|c_1) = (9/14).(0.044) = 0.028
For class c_2: P(c_2).P(x|c_2) = (5/14).(0.019) = 0.007
Conclusion: The person x will buy a computer!
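The arithmetic on this slide can be reproduced directly; the short sketch below (not from the slides) simply re-evaluates the products of the fractions estimated in Example (2):

```python
# Probabilities estimated in Example (2) for
# x = (Age=Young, Income=Medium, Student=Yes, Credit_Rating=Fair)
p_c1, p_c2 = 9 / 14, 5 / 14
p_x_c1 = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)   # P(x|c_1) ~ 0.044
p_x_c2 = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # P(x|c_2) ~ 0.019

score_c1 = p_c1 * p_x_c1                          # P(c_1).P(x|c_1) ~ 0.028
score_c2 = p_c2 * p_x_c2                          # P(c_2).P(x|c_2) ~ 0.007
print(round(score_c1, 3), round(score_c2, 3))
print("buy a computer" if score_c1 >= score_c2 else "not buy a computer")
```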
Naïve Bayes classifier - Issues (1)
What happens if no training instances associated with class c_i have attribute value x_j?
Then P(x_j|c_i) = 0, and hence: P(c_i) . ∏_{j=1..n} P(x_j|c_i) = 0
Solution: to use a Bayesian approach to estimate P(x_j|c_i)
P(x_j|c_i) = (n(c_i, x_j) + m.p) / (n(c_i) + m)
n(c_i): number of training instances associated with class c_i
n(c_i, x_j): number of training instances associated with class c_i that have attribute value x_j
p: a prior estimate for P(x_j|c_i)
Assume uniform priors: p = 1/k, if attribute f_j has k possible values
m: a weight given to the prior
To augment the n(c_i) actual observations by an additional m virtual samples distributed according to p
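A small sketch of this m-estimate (the function name m_estimate and the example counts are illustrative only, not from the slides):

```python
def m_estimate(n_ci_xj, n_ci, k, m=1.0):
    """m-estimate of P(x_j|c_i) with a uniform prior p = 1/k over the k attribute values."""
    p = 1.0 / k
    return (n_ci_xj + m * p) / (n_ci + m)

# With no matching training instances, the estimate stays above zero:
print(m_estimate(0, 9, k=3, m=3))   # (0 + 3*(1/3)) / (9 + 3) = 0.0833...
print(m_estimate(2, 9, k=3, m=3))   # (2 + 1) / (9 + 3) = 0.25
```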
Naïve Bayes classifier - Issues (2)
The limit of precision in computers' computing capability
P(x_j|c_i) < 1, for every attribute value x_j and class c_i
So, when the number of attribute values is very large
lim_{n→∞} ∏_{j=1..n} P(x_j|c_i) = 0
Solution: to use a logarithmic function of probability
c_NB = argmax_{c_i∈C} log( P(c_i) . ∏_{j=1..n} P(x_j|c_i) )
c_NB = argmax_{c_i∈C} ( log P(c_i) + Σ_{j=1..n} log P(x_j|c_i) )
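A short sketch (not from the slides) showing why the plain product underflows when there are many attributes and how the log form avoids it; the per-attribute probabilities are hypothetical:

```python
import math

# Hypothetical per-attribute probabilities P(x_j|c_i); with many attributes the
# direct product underflows to 0.0 in double-precision floating point.
probs = [1e-5] * 100
prior = 0.5

product = prior
for p in probs:
    product *= p
print(product)                                # 0.0 (underflow)

# log P(c_i) + sum_j log P(x_j|c_i) avoids the underflow,
# and the argmax is unchanged because log is monotonic.
log_score = math.log(prior) + sum(math.log(p) for p in probs)
print(round(log_score, 1))                    # -1152.0
```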
Document classification by NB - Training
Problem definition
A training set D, where each training example is a document associated with a class label: D = {(d_k, c_i)}
A pre-defined set of class labels: C = {c_i}
The training algorithm
From the documents collection contained in the training set D, extract the vocabulary of distinct terms (keywords): T = {t_j}
Let's denote D_c_i (⊆ D) the set of documents in D whose class label is c_i
For each class c_i
- Compute the prior probability of class c_i: P(c_i) = |D_c_i| / |D|
- For each term t_j, compute the probability of term t_j given class c_i:
  P(t_j|c_i) = (Σ_{d_k∈D_c_i} n(d_k, t_j) + 1) / (Σ_{d_k∈D_c_i} Σ_{t_m∈T} n(d_k, t_m) + |T|)
  (n(d_k, t_j): the number of occurrences of term t_j in document d_k)
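A minimal sketch of this training phase, using the add-one estimate from the formula above; the function name train_nb_text and the tiny corpus are illustrative assumptions, not from the slides:

```python
from collections import Counter

def train_nb_text(docs):
    """docs: list of (tokens, class_label) pairs. Returns the vocabulary T,
    the class priors P(c_i), and the add-one term estimates P(t_j|c_i)."""
    vocab = {t for tokens, _ in docs for t in tokens}
    classes = {c for _, c in docs}
    prior = {c: sum(1 for _, ci in docs if ci == c) / len(docs) for c in classes}
    cond = {}
    for c in classes:
        term_counts = Counter()                 # sum over d_k in D_c_i of n(d_k, t_j)
        for tokens, ci in docs:
            if ci == c:
                term_counts.update(tokens)
        denom = sum(term_counts.values()) + len(vocab)
        cond[c] = {t: (term_counts[t] + 1) / denom for t in vocab}
    return vocab, prior, cond

# A tiny hypothetical corpus, used only to exercise the estimator:
docs = [(["good", "match", "win"], "sport"),
        (["rain", "cold", "rain"], "weather"),
        (["win", "match"], "sport")]
vocab, prior, cond = train_nb_text(docs)
print(round(prior["sport"], 3), round(cond["sport"]["win"], 3))   # 0.667 0.3
```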
Document classification by NB - Classifying
To classify (assign the class label for) a new document d
The classification algorithm
From the document d, extract a set T_d of all terms (keywords) t_j that are known by the vocabulary T (i.e., T_d ⊆ T)
Additional assumption. The probability of term t_j given class c_i is independent of its position in the document
P(t_j at position k|c_i) = P(t_j at position m|c_i), ∀k, m
For each class c_i, compute the likelihood of document d given c_i
P(c_i) . ∏_{t_j∈T_d} P(t_j|c_i)
Classify document d in class c*
c* = argmax_{c_i∈C} P(c_i) . ∏_{t_j∈T_d} P(t_j|c_i)
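A minimal sketch of the classification step, scoring in log space (as suggested under Issues (2)) and ignoring terms outside the vocabulary T; the function name classify_nb_text is illustrative, and the commented usage line continues the hypothetical training sketch above:

```python
import math

def classify_nb_text(tokens, vocab, prior, cond):
    """Assign the class c maximizing P(c) * prod over t_j in T_d of P(t_j|c).
    Terms outside the vocabulary T are ignored; log-scores are used for stability."""
    known = [t for t in tokens if t in vocab]   # T_d: the terms known to the vocabulary
    def log_score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][t]) for t in known)
    return max(prior, key=log_score)

# Continuing the hypothetical train_nb_text sketch above:
# vocab, prior, cond = train_nb_text(docs)
# print(classify_nb_text(["win", "match", "unseen-term"], vocab, prior, cond))   # "sport"
```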
Naïve Bayes classifier - Summary
One of the most practical learning methods
Based on the Bayes theorem
Very fast in performance
For the training: only one pass over (scan through) the training set
For the classification: the computation time is linear in the number of attributes and the size of the documents collection
Despite its conditional independence assumption, the Naïve Bayes classifier shows good performance in several application domains
When to use?
A moderate or large training set is available
Instances are represented by a large number of attributes
Attributes that describe instances are conditionally independent given classification
References
T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.