Bayesian Classifier Notes

Chapter 6
Bayesian classifier and ML

estimation
The Bayesian classifier is an algorithm for classifying multiclass datasets. This is based on the
Bayes’ theorem in probability theory. Bayes in whose name the theorem is known was an English
statistician who was known for having formulated a specific case of a theorem that bears his name.
The classifier is also known as “naive Bayes Algorithm” where the word “naive” is an English word
with the following meanings: simple, unsophisticated, or primitive. We first explain Bayes’ theorem
and then describe the algorithm. Of course, we require the notion of conditional probability.
6.1 Conditional probability

The probability of the occurrence of an event A given that an event B has already occurred is called
the conditional probability of A given B and is denoted by P (A∣B). We have
P (A ∩ B)
P (A∣B) = if P (B) ≠ 0.
P (B)
6.1.1 Independent events

1. Two events A and B are said to be independent if
P (A ∩ B) = P (A)P (B).
2. Three events A, B, C are said to be pairwise independent if

P (B ∩ C) = P (B)P (C)
P (C ∩ A) = P (C)P (A)
P (A ∩ B) = P (A)P (B)
3. Three events A, B, C are said to be mutually independent if

P (B ∩ C) = P (B)P (C) (6.1)
P (C ∩ A) = P (C)P (A) (6.2)
P (A ∩ B) = P (A)P (B) (6.3)
P (A ∩ B ∩ C) = P (A)P (B)P (C) (6.4)
4. In general, a family of k events A1 , A2 , . . . , Ak is said to be mutually independent if for any

subfamily consisting of Ai1 , . . . Aim we have
P (Ai1 ∩ . . . ∩ Aim ) = P (Ai1 ) . . . P (Aim ).
61
CHAPTER 6. BAYESIAN CLASSIFIER AND ML ESTIMATION 62
Remarks
Consider events and respective probabilities as shown in Figure 6.1. It can be seen that, in this case,
the conditions Eqs.(6.1)–(6.3) are satisfied, but Eq.(6.4) is not satisfied. But if the probabilities are
as in Figure 6.2, then Eq.(6.4) is satisfied but all the conditions in Eqs.(6.1)–(6.2) are not satisfied.
Figure 6.1: Events A, B, C which are not mutually independent: Eqs.(6.1)–(6.3) are satisfied, but
Eq.(6.4) is not satisfied.
Figure 6.2: Events A, B, C which are not mutually independent: Eq.(6.4) is satisfied but Eqs.(6.1)–
(6.2) are not satisfied.
6.2 Bayes’ theorem

6.2.1 Theorem
Let A and B any two events in a random experiment. If P (A) ≠ 0, then
P (A∣B)P (B)
P (B∣A) = .
P (A)
6.2.2 Remarks
1. The importance of the result is that it helps us to “invert” conditional probabilities, that is, to
express the conditional probability P (A∣B) in terms of the conditional probability P (B∣A).
2. The following terminology is used in this context:

• A is called the proposition and B is called the evidence.
• P (A) is called the prior probability of proposition and P (B) is called the prior proba-
bility of evidence.
• P (A∣B) is called the posterior probability of A given B.

• P (B∣A) is called the likelihood of B given A.
6.2.3 Generalisation
Let the sample space be divided into disjoint events B1 , B2 , . . . , Bn and A be any event. Then we
have
P (A∣Bk )P (Bk )
P (Bk ∣A) = n
∑i=1 P (A∣Bi )P (Bi )
6.2.4 Examples
Problem 1
Consider a set of patients coming for treatment in a certain clinic. Let A denote the event that
a “Patient has liver disease” and B the event that a “Patient is an alcoholic.” It is known from
experience that 10% of the patients entering the clinic have liver disease and 5% of the patients are
alcoholics. Also, among those patients diagnosed with liver disease, 7% are alcoholics. Given that
a patient is alcoholic, what is the probability that he will have liver disease?
Solution
Using the notations of probability, we have
P (A) = 10% = 0.10

P (B) = 5% = 0.05
P (B∣A) = 7% = 0.07
P (B∣A)P (A)
P (A∣B) =
P (B)
0.07 × 0.10
=
0.05
= 0.14
Problem 2
Three factories A, B, C of an electric bulb manufacturing company produce respectively 35%. 35%
and 30% of the total output. Approximately 1.5%, 1% and 2% of the bulbs produced by these
factories are known to be defective. If a randomly selected bulb manufactured by the company was
found to be defective, what is the probability that the bulb was manufactures in factory A?
Solution
Let A, B, C denote the events that a randomly selected bulb was manufactured in factory A, B, C
respectively. Let D denote the event that a bulb is defective. We have the following data:
P (A) = 0.35, P (B) = 0.35, P (C) = 0.30

P (D∣A) = 0.015, P (D∣B) = 0.010, P (D∣C) = 0.020
We are required to find P (A∣D). By the generalisation of the Bayes’ theorem we have:
P (D∣A)P (A)
P (A∣D) =
P (D∣A)P (A) + P (D∣B)P (B) + P (D∣C)P (C)
0.015 × 0.35
=
0.015 × 0.35 + 0.010 × 0.35 + 0.020 × 0.30
= 0.356.
6.3 Naive Bayes algorithm

6.3.1 Assumption
The naive Bayes algorithm is based on the following assumptions:
• All the features are independent and are unrelated to each other. Presence or absence of a
feature does not influence the presence or absence of any other feature.
• The data has class-conditional independence, which means that events are independent so
long as they are conditioned on the same class value.
These assumptions are, in general, true in many real world problems. It is because of these assump-
tions, the algorithm is called a naive algorithm.
6.3.2 Basic idea

Suppose we have a training data set consisting of N examples having n features. Let the features
be named as (F1 , . . . , Fn ). A feature vector is of the form (f1 , f2 , . . . , fn ). Associated with each
example, there is a certain class label. Let the set of class labels be {c1 , c2 , . . . , cp }.
Suppose we are given a test instance having the feature vector
X = (x1 , x2 , . . . , xn ).
We are required to determine the most appropriate class label that should be assigned to the test
instance. For this purpose we compute the following conditional probabilities
P (c1 ∣X), P (c2 ∣X), . . . , P (cp ∣X). (6.5)
and choose the maximum among them. Let the maximum probability be P (ci ∣X). Then, we choose
ci as the most appropriate class label for the training instance having X as the feature vector.
The direct computation of the probabilities given in Eq.(6.5) are difficult for a number of reasons.
The Bayes’ theorem can b applied to obtain a simpler method. This is explained below.
6.3.3 Computation of probabilities

Using Bayes’ theorem, we have:
P (X∣ck )P (ck )
P (ck ∣X) = (6.6)
P (X)
Since, by assumption, the data has class-conditional independence, we note that the events “x1 ∣ck ”,
“x2 ∣ck ”, ⋯, xn ∣ck are independent (because they are all conditioned on the same class label ck ).
Hence we have
P (X∣ck ) = P ((x1 , x2 , . . . , xn )∣ck )

= P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )
Using this in Eq,(6.6) we get

P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck )
P (ck ∣X) = .
P (X)
Since the denominator P (X) is independent of the class labels, we have
P (ck ∣X) ∝ P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck ).
So it is enough to find the maximum among the following values:
P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck ), k = 1, . . . , p.

Remarks
The various probabilities in the above expression are computed as follows:
No. of examples with class label ck
P (ck ) =
Total number of examples
No. of examples with jth feature equal to xj and class label ck
P (xj ∣ck ) =
No. of examples with class label ck
6.3.4 The algorithm

Algorithm: Naive Bayes
Let there be a training data set having n features F1 , . . . , Fn . Let f1 denote an arbitrary value of F1 ,
f2 of F2 , and so on. Let the set of class labels be {c1 , c2 , . . . , cp }. Let there be given a test instance
having the feature vector
X = (x1 , x2 , . . . , xn ).
We are required to determine the most appropriate class label that should be assigned to the test
instance.
Step 1. Compute the probabilities P (ck ) for k = 1, . . . , p.
Step 2. Form a table showing the conditional probabilities
P (f1 ∣ck ), P (f2 ∣ck ), ... , P (fn ∣ck )
for all values of f1 , f2 , . . . , fn and for k = 1, . . . , p.

Step 3. Compute the products
qk = P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck )
for k = 1, . . . , p.
Step 4. Find j such qj = max{q1 , q2 , . . . , qp }.
Step 5. Assign the class label cj to the test instance X.
Remarks
In the above algorithm, Steps 1 and 2 constitute the learning phase of the algorithm. The remaining
steps constitute the testing phase. For testing purposes, only the table of probabilities is required;
the original data set is not required.
6.3.5 Example
Problem
Consider a training data set consisting of the fauna of the world. Each unit has three features named
“Swim”, “Fly” and “Crawl”. Let the possible values of these features be as follows:
Swim Fast, Slow, No
Fly Long, Short, Rarely, No
Crawl Yes, No
For simplicity, each unit is classified as “Animal”, “Bird” or “Fish”. Let the training data set be as in
Table 6.1. Use naive Bayes algorithm to classify a particular species if its features are (Slow, Rarely,
No)?
Sl. No. Swim Fly Crawl Class

1 Fast No No Fish
2 Fast No Yes Animal
3 Slow No No Animal
4 Fast No No Animal
5 No Short No Bird
6 No Short No Bird
7 No Rarely No Animal
8 Slow No Yes Animal
9 Slow No No Fish
10 Slow No Yes Fish
11 No Long No Bird
12 Fast No No Bird
Table 6.1: Sample data set for naive Bayes algorithm
Solution
In this example, the features are
F1 = “Swim”, F2 = “Fly”, F3 = “Crawl”.
The class labels are
c1 = “Animal”, c2 = “ Bird”, c3 = “Fish”.
The test instance is (Slow, Rarely, No) and so we have:
x1 = “Slow”, x2 = “Rarely”, x3 = “No”.
We construct the frequency table shown in Table 6.2 which summarises the data. (It may be noted
that the construction of the frequency table is not part of the algorithm.)
Features
Class Swim (F1 ) Fly (F2 ) Crawl (F3 ) Total
Fast Slow No Long Short Rarely No Yes No
Animal (c1 ) 2 2 1 0 0 1 4 2 3 5
Bird (c2 ) 1 0 3 1 2 0 1 1 3 4
Fish (c3 ) 1 2 0 0 0 0 3 0 3 3
Total 4 4 4 1 2 1 8 4 8 12
Table 6.2: Frequency table for the data in Table 6.1
Step 1. We compute following probabilities.

No. of records with class label “Animal”
P (c1 ) =
= 5/12
No. of records with class label “Bird”
P (c2 ) =
= 4/12
No of records with class label “Fish”
P (c3 ) =
= 3/12
Step 2. We construct the following table of conditional probabilities:
Features
Swim (F1 ) Fly (F2 ) Crawl (F3 )
Class
f1 f2 f3
Fast Slow No Long Short Rarely No Yes No
Animal (c1 ) 2/5 2/5 1/5 0/5 0/5 1/5 4/5 2/5 3/5
Bird (c2 ) 1/4 0/4 3/4 1/4 2/4 0/4 1/4 0/4 4/4
Fish (c3 ) 13 2/3 0/3 0/3 0/3 0/3 3/3 0/3 3/3
Table 6.3: Table of the conditional probabilities P (fi ∣ck )
Note: The conditional probabilities are calculated as follows:

No. of records with F1 = Slow and class label c1
P ((F1 = Slow)∣c1 ) =
No. of records with class label c1
= 2/5.
Step 3. We now calculate the following numbers:
q1 = P (x1 ∣c1 )P (x2 ∣c1 )P (x3 ∣c1 )P (c1 )

= (2/5) × (1/5) × (3/5) × (5/12)
= 0.02
q2 = P (x1 ∣c2 )P (x2 ∣c2 )P (x3 ∣c2 )P (c2 )
= (0/4) × (0/4) × (3/4) × (4/12)
=0
q3 = P (x1 ∣c3 )P (x2 ∣c3 )P (x3 ∣c3 )P (c3 )
= (2/3) × (0/3) × (3/3) × (3/12)
=0
Step 4. Now
max{q1 , q2 , q3 } = 0.05.
Step 5. The maximum is q1 an it corresponds to the class label
c1 = “ Animal”.
So we assign the class label “Animal” to the test instance “(Slow, Rarely, No)”.
6.4 Using numeric features with naive Bayes algorithm

The naive Bayes algorithm can be applied to a data set only if the features are categorical. This is
so because, the various probabilities are computed using the various frequencies and the frequencies
can be counted only if each feature has a limited set of values.
If a feature is numeric, it has to be discretized before applying the algorithm. The discretization
is effected by putting the numeric values into categories known as bins. Because of this discretization
is also known as binning. This is ideal when there are large amounts of data.
There are several different ways to discretize a numeric feature.
1. If there are natural categories or cut points in the distribution of values, use these cut points to
create the bins. For example, let the data consists of records of times when certain activities
were carried out. The the categories, or bins, may be created as in Figure 6.3.
Figure 6.3: Discretization of numeric data: Example
2. If there are no obvious cut points, we may discretize the feature using quantiles. We may
divide the data into three bins with tertiles, four bins with quartiles, or five bins with quintiles,
etc.
6.5 Maximum likelihood estimation (ML estimation)

To develop a Bayesian classifier, we need the probabilities P (x∣ck ) for the class labels c1 , . . . , ck .
These probabilities are estimated from the given data. There is need to know whether the sample
is truly random so that the computed probabilities are good approximations to true probabilities. If
they are good approximations of true probabilities, then there would be an underlying probability
distribution. Suppose we have reasons to believe that the underlying distribution has a particular
form, say binomial, Poisson or normal. These forms are defined by probability functions or proba-
bility density functions. There are parameters which define these functions, and these parameters are
to be estimated to test whether a given data follow some particular distribution. Maximum likelihood
estimation is particular method to estimate the parameters of a probability distribution.
Definition
Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical
model, given observations. MLE attempts to find the parameter values that maximize the likelihood
function, given the observations. The resulting estimate is called a maximum likelihood estimate,
which is also abbreviated as MLE.
6.5.1 The general MLE method

Suppose we have a random sample X = {x1 , . . . , xn } taken from a probability distribution having
the probability mass function or probability density function p(x∣θ) where x denotes a value of the
random variable and θ denotes the set of parameters that appear in the function.
The likelihood of sample X is a function of the parameter θ and is defined as
l(θ) = p(x1 ∣θ)p(x2 ∣θ) . . . p(xn ∣θ).
In maximum likelihood estimation, we find the value of θ that makes the value of the likelihood
function maximum. For computation convenience, we define the log likelihood function as the
logarithm of the likelihood function:
L(θ) = log l(θ)

= log p(x1 ∣θ) + log p(x2 ∣θ) + ⋯ + log p(xn ∣θ).
A value of θ that maximizes L(θ) will also maximise l(θ) and vice-versa. Hence, in maximum like-
lihood estimation, we find θ that maximizes the log likelihood function. Sometimes the maximum
likelihood estimate of θ is denoted by θ̂.
6.5.2 Special cases

1. Bernoulli density
In a Bernoulli distribution there are two outcomes: An event occurs or it does not, for example, an
instance is a positive example of the class, or it is not. The event occurs and the Bernoulli random
variable X takes the value 1 with probability p, and the nonoccurrence of the event has probability
1 − p and this is denoted by X taking the value 0.
The probability function of X is given by
f (x∣p) = px (1 − p)1−x , x = 0, 1.
In this function, the probability p is the only parameter.
Estimation of p
Consider a random sample X = {x1 , . . . , xn } taken from a Bernoulli distribution with the probability
function f (x∣p). The log likelihood function is
L(p) = log f (x1 ∣p) + ⋯ + log f (xn ∣p)
= log px1 (1 − p)1−x1 + ⋯ + log pxn (1 − p)1−xn
= [x1 log p + (1 − x1 ) log(1 − p)] + ⋯ + [xn log p + (1 − xn ) log(1 − p)]
To find the value of p that maximizes L(p) we set up the equation
dL
= 0,
dp
that is,
x1 1 − x1 xn 1 − xn
[ − ]+⋯+[ − ] = 0.
p 1−p p 1−p
Solving this equation, we have the maximum likelihood estimate of p as
1
p̂ = (x1 + ⋯ + xn ).
n
2. Multinomial density
Suppose that the outcome of a random event is one of K classes, each of which has a probability of
occurring pi with
p1 + ⋯ + pK = 1.
We represent each outcome by an ordered K-tuple x = (x1 , . . . , xK ) where exactly one of x1 , . . . , xK
is 1 and all others are 0. xi = 1 if the outcome in the i-th class occurs. The probability function can
be expressed as
f (x∣p, . . . , pK ) = px1 1 . . . pxKK .
Here, p1 , . . . , pK are the parameters.
We choose n random samples. The i-the sample may be represented by
xi = (x1i , . . . , xKi ).
The values of the parameters that maximizes the likelihood function can be shown to be
1
p̂k = (xk1 + xk2 + ⋯ + xkn ).
n
(We leave the details of the derivation as an exercise.)

Bayesian Classifier Notes

Uploaded by

Copyright:

Available Formats

Bayesian Classifier Notes

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bayesian Classifier Notes

Uploaded by

Copyright:

Available Formats

Chapter 6

Bayesian classifier and ML

6.1 Conditional probability

6.1.1 Independent events

2. Three events A, B, C are said to be pairwise independent if

3. Three events A, B, C are said to be mutually independent if

4. In general, a family of k events A1 , A2 , . . . , Ak is said to be mutually independent if for any

6.2 Bayes’ theorem

2. The following terminology is used in this context:

• P (A∣B) is called the posterior probability of A given B.

P (A) = 10% = 0.10

P (A) = 0.35, P (B) = 0.35, P (C) = 0.30

6.3 Naive Bayes algorithm

6.3.2 Basic idea

P (c1 ∣X), P (c2 ∣X), . . . , P (cp ∣X). (6.5)

6.3.3 Computation of probabilities

P (X∣ck ) = P ((x1 , x2 , . . . , xn )∣ck )

Using this in Eq,(6.6) we get

So it is enough to find the maximum among the following values:

P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck ), k = 1, . . . , p.

6.3.4 The algorithm

P (f1 ∣ck ), P (f2 ∣ck ), ... , P (fn ∣ck )

for all values of f1 , f2 , . . . , fn and for k = 1, . . . , p.

qk = P (x1 ∣ck )P (x2 ∣ck )⋯P (xn ∣ck )P (ck )

Sl. No. Swim Fly Crawl Class

Table 6.1: Sample data set for naive Bayes algorithm

Table 6.2: Frequency table for the data in Table 6.1

Step 1. We compute following probabilities.

Step 2. We construct the following table of conditional probabilities:

Table 6.3: Table of the conditional probabilities P (fi ∣ck )

Note: The conditional probabilities are calculated as follows:

Step 3. We now calculate the following numbers:

q1 = P (x1 ∣c1 )P (x2 ∣c1 )P (x3 ∣c1 )P (c1 )

Step 5. The maximum is q1 an it corresponds to the class label

6.4 Using numeric features with naive Bayes algorithm

Figure 6.3: Discretization of numeric data: Example

6.5 Maximum likelihood estimation (ML estimation)

6.5.1 The general MLE method

l(θ) = p(x1 ∣θ)p(x2 ∣θ) . . . p(xn ∣θ).

L(θ) = log l(θ)

6.5.2 Special cases

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.