Module - 4 Bayesian Learning
BAYESIAN LEARNING
CONTENT
• Introduction
• Bayes theorem
• Bayes theorem and concept learning
• Maximum likelihood and Least Squared Error Hypothesis
• Maximum likelihood Hypotheses for predicting probabilities
• Minimum Description Length Principle
• Naive Bayes classifier
• Bayesian belief networks
• EM algorithm
• One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known
in advance they are often estimated based on background knowledge, previously
available data, and assumptions about the form of the underlying distributions.
According to Bayes theorem,

  P(h|D) = P(D|h) P(h) / P(D)

• P(h|D) increases with P(h) and with P(D|h).
• P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.
• In many learning scenarios, the learner considers some set of candidate hypotheses
H and is interested in finding the most probable hypothesis h ∈ H given the
observed data D. Any such maximally probable hypothesis is called a maximum a
posteriori (MAP) hypothesis.
• Bayes theorem can be used to calculate the posterior probability of each candidate hypothesis; hMAP is a MAP hypothesis provided

  hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h) / P(D) = argmax h∈H P(D|h) P(h)

  (in the final step P(D) is dropped because it is a constant independent of h)
• P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis:

  hML = argmax h∈H P(D|h)
The available data come from a particular laboratory test with two possible outcomes: + (positive) and - (negative).
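The MAP calculation for a two-hypothesis medical-test scenario like this can be sketched as follows; the prior and likelihood values below are assumed purely for illustration, not taken from the notes:

```python
# Hedged sketch: MAP inference for a two-hypothesis lab-test example.
# The prior and likelihood values below are assumed for illustration only.
p_cancer = 0.008                 # assumed prior P(cancer)
p_pos_given_cancer = 0.98        # assumed likelihood P(+ | cancer)
p_pos_given_no_cancer = 0.03     # assumed likelihood P(+ | no cancer)

# Unnormalized posteriors P(D|h) P(h) for the observation D = "+"
post_cancer = p_pos_given_cancer * p_cancer
post_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)

h_map = "cancer" if post_cancer > post_no_cancer else "no cancer"
print(h_map)  # -> no cancer: the MAP hypothesis despite the positive test
```

Note that P(D) never needs to be computed: comparing the unnormalized products P(D|h)P(h) is enough to identify the MAP hypothesis.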
Since Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, we can use it as the basis for a straightforward learning algorithm that calculates the posterior probability of each possible hypothesis and then outputs the most probable one.
Let us choose P(h) and P(D|h) to be consistent with the following assumptions:
• The training data D is noise free (i.e., di = c(xi))
• The target concept c is contained in the hypothesis space H
• We have no a priori reason to believe that any hypothesis is more probable than any other.
• P(D|h) is the probability of observing the target values D = (d1 . . .dm) for the
fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds
• Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore, P(D|h) = 1 if h is consistent with D, and P(D|h) = 0 otherwise.
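This brute-force Bayes concept learner can be sketched directly under the three assumptions above (uniform prior, noise-free data, target in H); the toy hypotheses and examples below are hypothetical:

```python
# Sketch of the brute-force Bayes concept learner under the stated assumptions:
# uniform prior P(h) = 1/|H|, noise-free data, and target concept contained in H.
# The hypotheses and training examples are hypothetical toy values.
hypotheses = {
    "h1": lambda x: x >= 1,
    "h2": lambda x: x >= 2,
    "h3": lambda x: x >= 3,
}
data = [(2, True), (1, False)]  # pairs (instance xi, classification di)

prior = 1.0 / len(hypotheses)
unnorm = {}
for name, h in hypotheses.items():
    # P(D|h) = 1 if h is consistent with every example, 0 otherwise
    likelihood = 1.0 if all(h(x) == d for x, d in data) else 0.0
    unnorm[name] = likelihood * prior

z = sum(unnorm.values())  # equals |VS_{H,D}| / |H|
posterior = {name: p / z for name, p in unnorm.items()}
print(posterior)  # each consistent hypothesis gets probability 1/|VS|
```

Under these assumptions every hypothesis consistent with D receives posterior 1/|VS_{H,D}| and every inconsistent hypothesis receives posterior 0, which is why any consistent learner outputs a MAP hypothesis here.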
Example:
• Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the probability distributions P(h) and P(D|h) defined above.
• Are there other probability distributions for P(h) and P(D|h) under which FIND-S outputs MAP
hypotheses? Yes.
• Because FIND-S outputs a maximally specific hypothesis from the version space, its output
hypothesis will be a MAP hypothesis relative to any prior probability distribution that favours more
specific hypotheses.
A straightforward Bayesian analysis will show that, under certain assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood (ML) hypothesis.
• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-
valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R] and training examples of the form
<xi,di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each example is corrupted by
random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable representing
the noise.
– It is assumed that the values of the ei are drawn independently and that they are distributed
according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP
hypothesis assuming all hypotheses are equally probable a priori.
Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the various P(di|h):

  P(D|h) = Π i=1..m P(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and

  P(D|h) = Π i=1..m (1/√(2πσ²)) e^(−(di − h(xi))² / (2σ²))
Taking the logarithm, the first term in this expression is a constant independent of h and can therefore be discarded, giving

  hML = argmax h∈H Σ i=1..m −(1/(2σ²)) (di − h(xi))²

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:
• the hML is the hypothesis that minimizes the sum of the squared errors:

  hML = argmin h∈H Σ i=1..m (di − h(xi))²
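The equivalence above can be checked numerically: ranking candidate hypotheses by Gaussian log-likelihood picks the same hypothesis as ranking by sum of squared errors. The target values, noise variance, and candidate slopes below are hypothetical:

```python
import math

# Sketch: under zero-mean Gaussian noise, ranking hypotheses by Gaussian
# log-likelihood agrees with ranking by (negative) sum of squared errors.
# The observations, variance, and candidate slopes are hypothetical.
xs = [1.0, 2.0, 3.0, 4.0]
ds = [2.1, 3.9, 6.2, 7.8]   # noisy observations of an assumed target f(x) = 2x
sigma2 = 0.25               # assumed noise variance

def sse(w):
    # sum of squared errors for the linear hypothesis h(x) = w*x
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w):
    # ln P(D|h) for Normal noise around h(x) = w*x
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (d - w * x) ** 2 / (2 * sigma2) for x, d in zip(xs, ds))

candidates = [1.5, 2.0, 2.5]
h_ml = max(candidates, key=log_likelihood)   # maximum likelihood choice
h_lse = min(candidates, key=sse)             # least-squared-error choice
assert h_ml == h_lse  # same hypothesis under both criteria
print(h_ml)
```

Because the log-likelihood is a constant minus the scaled squared error, the two criteria always agree, regardless of the (unknown) value of σ².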
Use Equation (4) to substitute for P(di|h, xi) in Equation (5) to obtain

  G(h, D) = Σ i=1..m di ln h(xi) + (1 − di) ln(1 − h(xi))

Equation (7) describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting.
Derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using
gradient ascent
• The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the
various network weights that define the hypothesis h represented by the learned network
• In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is

  ∂G(h, D)/∂wjk = Σ i=1..m (di − h(xi)) / (h(xi)(1 − h(xi))) · σ′(Σk wjk xijk) · xijk

where xijk is the kth input to unit j for the ith training example, and σ′(x) is the derivative of the sigmoid squashing function.
Finally, substituting the expression σ′(x) = σ(x)(1 − σ(x)) into Equation (1), we obtain a simple expression for the derivatives that constitute the gradient:

  ∂G(h, D)/∂wjk = Σ i=1..m (di − h(xi)) xijk

Gradient ascent then repeatedly updates each weight by wjk ← wjk + Δwjk, with

  Δwjk = η Σ i=1..m (di − h(xi)) xijk

where η is a small positive constant that determines the step size of the gradient ascent search.
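The weight-training rule for a single sigmoid unit can be sketched as follows; the toy training data, learning rate, and iteration count are all assumed:

```python
import math

# Sketch of the gradient-ascent weight rule  Δw_jk = η Σ_i (d_i − h(x_i)) x_ijk
# for a single sigmoid unit. The toy data, learning rate, and iteration
# count below are assumptions for illustration.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each training example: (inputs x_i, target probability d_i in {0, 1})
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
w = [0.0, 0.0]
eta = 0.5

for _ in range(200):  # gradient ascent on G(h, D)
    grad = [0.0, 0.0]
    for x, d in data:
        h = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
        for k in range(len(w)):
            grad[k] += (d - h) * x[k]          # (d_i − h(x_i)) x_ik
    w = [wk + eta * gk for wk, gk in zip(w, grad)]  # ascend the gradient

preds = [sigmoid(sum(wk * xk for wk, xk in zip(w, x))) for x, _ in data]
print(preds)  # predictions move toward the targets 1, 0, 1
```

Note the sign: the rule ascends the gradient of G(h, D) (a likelihood), whereas backpropagation descends the gradient of a squared-error loss; the update (di − h(xi)) xijk has the same form in both cases.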
• This equation can be interpreted as a statement that short hypotheses are preferred, assuming a
particular representation scheme for encoding hypotheses and data
• −log2 P(h): the description length of h under the optimal encoding for the hypothesis space H: LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
• −log2 P(D|h): the description length of the training data D given hypothesis h, under its optimal encoding: LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.
Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum given by the description length of the hypothesis plus the description length of the data given the hypothesis:

  hMAP = argmin h∈H LCH(h) + LCD|h(D|h)

where CH and CD|h are the optimal encodings for H and for D given h.
The Minimum Description Length (MDL) principle recommends the hypothesis minimizing this sum for whatever codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis:

  hMDL = argmin h∈H LC1(h) + LC2(D|h)

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
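The identity between MDL with optimal codes and MAP follows because optimal code lengths are −log2 of probabilities, so minimizing total description length maximizes the product P(h)P(D|h). A sketch with hypothetical priors and likelihoods:

```python
import math

# Sketch: with optimal (Shannon) codes, code lengths are −log2 of the
# probabilities, so minimizing L(h) + L(D|h) selects the MAP hypothesis.
# The priors and likelihoods below are hypothetical.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}          # P(h)
likelihood = {"h1": 0.01, "h2": 0.20, "h3": 0.10}  # P(D|h)

def description_length(h):
    # L_CH(h) + L_CD|h(D|h) = −log2 P(h) − log2 P(D|h)
    return -math.log2(prior[h]) - math.log2(likelihood[h])

h_mdl = min(prior, key=description_length)
h_map = max(prior, key=lambda h: prior[h] * likelihood[h])
assert h_mdl == h_map  # MDL with optimal codes coincides with MAP
print(h_mdl)
```

With suboptimal codes C1 and C2 the two can diverge, which is exactly why the MDL principle is stated relative to a chosen representation.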
Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
• For C1: C1 might be some obvious encoding, in which the description length grows with the
number of nodes and with the number of edges
• For C2: Suppose that the sequence of instances (x1 . . .xm) is already known to both the transmitter
and receiver, so that we need only transmit the classifications (f (x1) . . . f (xm)).
Now if the training classifications (f(x1) . . . f(xm)) are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples, and the description length of the classifications given the hypothesis is zero.
If examples are misclassified by h, then for each misclassification we need to transmit a message that identifies which example is misclassified, as well as its correct classification.
The hypothesis hMDL under the encodings C1 and C2 is just the one that minimizes the sum of these description lengths.
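This tradeoff can be made concrete with a back-of-the-envelope calculation; the per-node encoding cost, tree sizes, and error counts below are all assumptions:

```python
import math

# Sketch of the MDL tradeoff for decision trees: the cost of describing the
# tree grows with its size, while the cost of the data given the tree is paid
# only for misclassified examples. All sizes and costs are hypothetical.
m = 1000            # number of training examples (known to sender and receiver)
classes = 2         # number of class labels
bits_per_node = 8   # assumed cost of encoding one tree node under C1

def total_length(num_nodes, num_errors):
    tree_bits = num_nodes * bits_per_node
    # For each error: identify which example is wrong (log2 m bits)
    # plus its correct classification (log2 |classes| bits).
    error_bits = num_errors * (math.log2(m) + math.log2(classes))
    return tree_bits + error_bits

small_tree = total_length(num_nodes=5, num_errors=10)   # compact but imperfect
large_tree = total_length(num_nodes=60, num_errors=0)   # perfect on the data
print(small_tree < large_tree)  # MDL prefers the smaller, imperfect tree here
```

With these assumed costs, the small tree plus its error messages is cheaper to transmit than the large perfect tree, illustrating how MDL trades hypothesis complexity against training errors.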