Module4 Notes


Machine Learning Module 4 Notes for 3rd IA

Module 4: Introduction, Bayes theorem, Bayes theorem and concept learning, ML
and LS error hypothesis, ML for predicting probabilities, MDL principle, Naive
Bayes classifier, Bayesian belief networks, EM algorithm

1. Define (i) Prior Probability (ii) Conditional Probability
(iii) Posterior Probability.
Ans :
(i) The prior probability is the probability that an event will happen before you take
any new evidence into account. The prior probability of an event is the probability
of the event computed before the collection of new data. One begins with a prior
probability of an event and revises it in the light of new data. For example, if 0.01
of a population has schizophrenia then the probability that a person drawn at
random would have schizophrenia is 0.01. This is the prior probability.

A prior is a probability that expresses one's beliefs about a quantity before
some evidence is taken into account. In statistical inference and Bayesian
techniques, priors play an important role because they are combined with the
likelihood of the observed data to produce the posterior.

(ii) The conditional probability of an event A given that an event B has occurred is
written: P(A|B) and is calculated using:
P(A|B) = P(A∩B) / P(B) as long as P(B) > 0.

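Example: Draw two cards from a standard 52-card deck without replacement. Let A
be the event that the first card has one particular rank (say, a king) and B the
event that the second card has a different particular rank (say, an ace).
P(A) = 4/52
P(B|A) = 4/51
P(A and B) = P(A) × P(B|A) = 4/52 × 4/51 ≈ 0.006

(iii) A posterior probability, in Bayesian statistics, is the revised or updated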
probability of an event occurring after taking into consideration new information.
Posterior probability is calculated by updating the prior probability using Bayes'
theorem. In statistical terms, the posterior probability is the probability of event A
occurring given that event B has occurred.

Bayes' Theorem Formula

The formula to calculate the posterior probability of A occurring given that B
occurred:

P(A|B) = P(B|A) P(A) / P(B)

where:
A, B = events
P(B) = the probability of B, which must be greater than zero
P(B|A) = the probability of B occurring given that A is true
P(A) and P(B) = the probabilities of A occurring and B occurring independently
of each other

2. What is conditional independence?
Ans:
• Definition: X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y given the value of
Z; that is, if
(∀ xi, yj, zk) P(X= xi|Y= yj, Z= zk) = P(X= xi|Z= zk)
more compactly, we write
P(X|Y, Z) = P(X|Z)
• Example: Thunder is conditionally independent of Rain, given Lightning
P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
• Naive Bayes uses conditional independence to justify
P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)

3. What is Bayes theorem and the maximum a posteriori (MAP)
hypothesis? Derive an equation for the MAP hypothesis
and the maximum likelihood (ML) hypothesis using Bayes theorem.
Ans:

Bayes' theorem is a formula that describes how to update the probabilities of hypotheses
when given evidence. It follows simply from the axioms of conditional probability, but can be
used to powerfully reason about a wide range of problems involving belief updates.
Given a hypothesis h and evidence D, Bayes' theorem states that the relationship between
the probability of the hypothesis before getting the evidence, P(h), and the probability of the
hypothesis after getting the evidence, P(h|D), is

P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior (initial) probability that hypothesis h holds, before we have observed
any training data.
• P(D) = prior probability of training data D
• P(h|D) = posterior probability of h given D (it holds after we have seen the
training data D)
• P(D|h) = probability of observing data D given some world in which
hypothesis h holds.

Maximum a posteriori (MAP) hypothesis:

• In many learning scenarios, the learner considers some set of candidate
hypotheses H and is interested in finding the most probable hypothesis h ∈ H
given the observed data D.
• Any such maximally probable hypothesis is called a maximum a posteriori
(MAP) hypothesis hMAP. Applying Bayes theorem and dropping P(D), which does
not depend on h:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} \frac{P(D|h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D|h)\,P(h)$$

Maximum Likelihood:
In some cases we assume that every hypothesis in H is equally probable a
priori (P(hi) = P(hj) for all hi and hj in H). The expression then simplifies
further, and we need only consider the term P(D|h), which is often called the
likelihood of the data D given h. Any hypothesis that maximizes P(D|h) is called
a maximum likelihood (ML) hypothesis hML:

$$h_{ML} = \arg\max_{h \in H} P(D|h)$$

4. Explain how Bayes theorem can be used for
concept learning. Describe the brute-force MAP learning
algorithm. Hence derive the relation for P(h|D) when
h is consistent with D and when h is not consistent with D.
Ans:

Bayes theorem can be used for concept learning by computing the posterior
probability of every hypothesis given the training examples and selecting the
maximum a posteriori (MAP) hypothesis. Assume that the learner considers
some finite hypothesis space H defined over the instance space X, in which the
task is to learn some target concept c: X → {0,1}.
Assume a fixed set of instances <x1,…, xm>.
Assume D is the set of classifications: D = <c(x1),…,c(xm)>.
Assume that the learner is given some sequence of training examples
<<x1,d1>, <x2,d2>, …, <xm,dm>>, where xi is some instance from X
and di is the target value of xi (i.e. di = c(xi)).
Brute Force MAP Learning Algorithm:
1. For each hypothesis h in H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

$$h_{MAP} = \arg\max_{h \in H} P(h|D)$$

Assumptions :
The probability distributions P(h) and P(D|h) are chosen to be consistent with the
following assumptions:
1. The training data D is noise free (i.e. di = c(xi))
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable
than any other.

The values of P(h) and P(D|h)

• Choose P(h) to be the uniform distribution:
P(h) = 1/|H| for all h in H
• Choose P(D|h):
P(D|h) = 1 if di = h(xi) for all di in D (h is consistent with D)
P(D|h) = 0 otherwise

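Two cases:
• Apply Bayes theorem, P(h|D) = P(D|h) P(h) / P(D), to each case.
• Case 1: When h is inconsistent with the training data D:
P(h|D) = 0 · P(h) / P(D) = 0

• Case 2: When h is consistent with D, P(D|h) = 1 and P(h) = 1/|H|. Summing
P(D|h) P(h) over all h in H gives P(D) = |VSH,D| / |H|, where VSH,D is the
version space (the subset of hypotheses in H that are consistent with D). Hence
P(h|D) = (1 · 1/|H|) / (|VSH,D| / |H|)
= 1/|VSH,D|
• To summarize, Bayes theorem implies that the posterior probability P(h|D)
under our assumed P(h) and P(D|h) is
P(h|D) = 1/|VSH,D| if h is consistent with D
P(h|D) = 0 otherwise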
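
The brute-force MAP procedure can be sketched in a few lines of Python. This is a minimal illustration under the three assumptions listed above (noise-free data, target concept in H, uniform prior); the threshold hypothesis space, the training pairs, and the name brute_force_map are hypothetical and chosen only to keep the example small.

```python
from fractions import Fraction

def brute_force_map(hypotheses, data):
    """Return P(h|D) for every hypothesis, assuming a uniform prior and noise-free data."""
    prior = Fraction(1, len(hypotheses))  # P(h) = 1/|H|
    # P(D|h) is 1 when h agrees with every training example, otherwise 0.
    likelihood = {
        name: 1 if all(h(x) == d for x, d in data) else 0
        for name, h in hypotheses.items()
    }
    # P(D) = sum over h of P(D|h) P(h) = |VS| / |H|.
    # It is nonzero as long as at least one hypothesis is consistent with D.
    p_data = sum(likelihood[name] * prior for name in hypotheses)
    return {name: likelihood[name] * prior / p_data for name in hypotheses}

# Hypothetical hypothesis space: threshold concepts h_t(x) = 1 if x >= t else 0.
hypotheses = {t: (lambda x, t=t: int(x >= t)) for t in range(6)}
data = [(1, 0), (3, 1)]  # training pairs <x_i, d_i>

posterior = brute_force_map(hypotheses, data)
h_map = max(posterior, key=posterior.get)  # one of the consistent hypotheses
```

Only the thresholds t = 2 and t = 3 are consistent with both examples, so each receives posterior 1/|VS| = 1/2 and every other hypothesis receives 0.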

5. What are the relevance and features of Bayesian
learning? Explain the practical difficulties of Bayesian
learning. Discuss at least five applications of Bayesian
learning.
Ans:
Relevance of Bayesian Learning : Bayesian Learning is relevant for two reasons
• First reason: explicit manipulation of probabilities
• among the most practical approaches to certain types of learning
problems
• e.g. Bayes classifier is competitive with decision tree and neural
network learning
• Second reason: useful perspective for understanding learning methods
that do not explicitly manipulate probabilities
• determine conditions under which algorithms output the most
probable hypothesis
• e.g. justification of the error functions in ANNs
• e.g. justification of the inductive bias of decision trees
Features of Bayesian Learning Methods :
• Each observed training example can incrementally decrease or increase the
estimated probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic
predictions.
• New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities
• Provides a standard of optimal decision making against which other practical
methods can be measured.

The practical difficulties of Bayesian learning:

• Initial knowledge of many probabilities is required: Bayes theorem does not tell you how
to select a prior, and there is no single correct way to choose one. Bayesian
inference requires skill in translating subjective prior beliefs into a
mathematically formulated prior. If you do not proceed with caution, you
can generate misleading results, because posterior distributions can be
heavily influenced by the priors. From a practical point of view, it might
sometimes be difficult to convince subject matter experts who do not agree
with the validity of the chosen prior.
• Significant computational cost is required: Bayesian methods often come with a high
computational cost, especially in models with a large number of parameters.

Applications of Bayesian Learning:

• News Categorization
• Medical Diagnosis
• E-mail Spam Detection
• Face Recognition
• Sentiment Analysis
• Digit Recognition
• Weather Prediction
[Note: Students are instructed to discuss each application in detail, based on
their own understanding.]

6. Write a short note on the naïve Bayes classifier. Write and
explain the equation for the target value output, vNB, by
the naïve Bayes classifier.
Ans:
One highly practical Bayesian learning method is the naïve Bayes learner, often called the
naïve Bayes classifier. It assumes that all the attribute values are
conditionally independent given the value of the target variable. The naïve
Bayes classifier applies to learning tasks where each instance x is described by
a conjunction of attribute values and where the target function f(x) can take
on any value from some finite set V. A set of training examples of the target
function is provided, and a new instance is presented, described by the tuple of
attribute values <a1, a2, a3, …, an>. The learner is asked to predict the
target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most
probable target value, vMAP, given the attribute values <a1, a2, a3, …, an>
that describe the instance:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j | a_1, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, \ldots, a_n | v_j)\,P(v_j)$$

Under the conditional independence assumption, P(a1, …, an | vj) is the product
of the individual P(ai | vj), which gives the naïve Bayes output:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i | v_j)$$

7. Illustrate the application of the naïve Bayes classifier to the
PlayTennis concept learning problem.
Example: Consider the PlayTennis examples given below:

Learning Phase:
Test Phase:

8. Consider a football game between two rival teams,
Team 0 and Team 1. Suppose Team 0 wins 65% of the
time and Team 1 wins the remaining matches. Among
the games won by Team 0, only 30% of them come
from playing on Team 1’s football field. On the other
hand, 75% of the victories for Team 1 are obtained
while playing at home. If Team 1 is to host the next
match between the two teams, which team will most
likely emerge as the winner?
Solution:
Hence the hosting team, Team 1, is more likely to win: its posterior probability
of winning is 0.5738.
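
The value 0.5738 follows from Bayes theorem applied to the figures in the question. A minimal sketch of the arithmetic in Python (the variable names are illustrative):

```python
# Priors: P(team wins), taken from the question.
p_win = {"Team 0": 0.65, "Team 1": 0.35}
# Likelihoods: P(match is hosted by Team 1 | team wins), taken from the question.
p_host1_given_win = {"Team 0": 0.30, "Team 1": 0.75}

# Joint probabilities P(hosted by Team 1, team wins).
joint = {team: p_host1_given_win[team] * p_win[team] for team in p_win}
# Evidence: P(hosted by Team 1) = 0.30 * 0.65 + 0.75 * 0.35 = 0.4575.
p_host1 = sum(joint.values())

# Posterior: P(team wins | hosted by Team 1).
posterior = {team: joint[team] / p_host1 for team in joint}
winner = max(posterior, key=posterior.get)
```

Here P(Team 1 wins | Team 1 hosts) = 0.2625 / 0.4575 ≈ 0.5738 and P(Team 0 wins | Team 1 hosts) = 0.195 / 0.4575 ≈ 0.4262.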
9. The following table gives a data set about stolen
vehicles. Using the naïve Bayes classifier, classify the new
data (Red, SUV, Domestic).

Color    Type     Origin     Stolen
Red      Sports   Domestic   Yes
Red      Sports   Domestic   No
Red      Sports   Domestic   Yes
Yellow   Sports   Domestic   No
Yellow   Sports   Imported   Yes
Yellow   SUV      Imported   No
Yellow   SUV      Imported   Yes
Yellow   SUV      Domestic   No
Red      SUV      Imported   No
Red      Sports   Imported   Yes

Solution:
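For the given problem we need to estimate the class priors P(Yes) and P(No), and
the conditional probabilities P(Red | v), P(SUV | v) and P(Domestic | v) for each
class v, and then compare the two products P(v) P(Red | v) P(SUV | v) P(Domestic | v).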
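
The quantities to estimate are the class priors and, for each class, the conditional probabilities of Red, SUV and Domestic. A minimal sketch in Python, assuming plain relative-frequency estimates (no smoothing or m-estimate, which an exam solution may use instead); the function name naive_bayes_scores is illustrative.

```python
from collections import Counter

# (Color, Type, Origin, Stolen) rows copied from the table in the question.
rows = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]

def naive_bayes_scores(rows, query):
    """Return P(v) * product of P(a_i | v) for each class v, using raw frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(v)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / count  # conditional P(a_i | v)
        scores[label] = score
    return scores

scores = naive_bayes_scores(rows, ("Red", "SUV", "Domestic"))
prediction = max(scores, key=scores.get)
```

With these estimates the score for Yes is 0.5 × 3/5 × 1/5 × 2/5 = 0.024 and the score for No is 0.5 × 2/5 × 3/5 × 3/5 = 0.072, so the new vehicle is classified as not stolen.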
10. Explain how the naïve Bayes algorithm is useful for
learning and classifying text.
Answer:

Learning to Classify Text – Algorithm

S1: LEARN_NAIVE_BAYES_TEXT (Examples, V)
S2: CLASSIFY_NAIVE_BAYES_TEXT (Doc)

• Examples is a set of text documents along with their target values.
• V is the set of all possible target values.
• S1 learns the probability terms P(wk | vj), describing the
probability that a randomly drawn word from a document in class vj will be
the English word wk. It also learns the class prior probabilities P(vj).

S1: LEARN_NAIVE_BAYES_TEXT (Examples, V)
[V: classes, wk: word, docsj: documents of class vj]
1. Collect all words and other tokens that occur in Examples
• Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(vj) and P(wk | vj) probability terms
• For each target value vj in V do
• docsj ← subset of Examples for which the target value is vj
• P(vj) ← |docsj| / |Examples|
• Textj ← a single document created by concatenating all members of
docsj
• n ← total number of words in Textj (counting duplicate words
multiple times)
• for each word wk in Vocabulary
nk ← number of times word wk occurs in Textj
P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)

S2: CLASSIFY_NAIVE_BAYES_TEXT (Doc)
• positions ← all word positions in Doc that contain tokens found in
Vocabulary
• Return vNB, where ai is the word found at position i:

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in positions} P(a_i | v_j)$$

Example: We are given the sentence “A very close game”, a
training set of five sentences, and their corresponding
category (Sports or Not Sports). The goal is to build a naïve Bayes classifier that
tells us which category the sentence “A very close game” belongs
to. The strategy is to calculate the probability that the sentence is Sports
and the probability that it is Not Sports; the category with the higher
probability is the result.
Step 1: Feature Engineering
• Use word frequencies, i.e., count the occurrences of every word in the documents.
• Under the naïve (independence) assumption:
P(a very close game) = P(a) × P(very) × P(close) × P(game)
• P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports)
× P(game | Sports)
• P(a very close game | Not Sports) = P(a | Not Sports) × P(very | Not Sports) ×
P(close | Not Sports) × P(game | Not Sports)

Step 2: Calculating the probabilities
• Here, the word “close” does not occur in the category Sports,
so the raw frequency estimate is P(close | Sports) = 0, which would make
P(a very close game | Sports) = 0.
• To avoid such zero products, the probabilities are calculated using the
multinomial model with add-one (Laplace) smoothing:
P(wk | vj) = (nk + 1) / (n + |Vocabulary|)
With the smoothed estimates, P(a very close game | Sports) is the
higher probability, suggesting that the sentence belongs to the Sports category.
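
The learning and classification steps can be sketched in Python as follows. The five training sentences used here are hypothetical stand-ins, since the training set itself is not reproduced in these notes; the function names mirror the two procedures but are otherwise illustrative, and tokens are obtained by a simple lowercase whitespace split.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """Estimate P(v_j) and add-one smoothed P(w_k | v_j) from (text, label) pairs."""
    vocabulary = {w for text, _ in examples for w in text.lower().split()}
    priors, word_probs = {}, {}
    for label in {lbl for _, lbl in examples}:
        docs = [text for text, lbl in examples if lbl == label]
        priors[label] = len(docs) / len(examples)
        counts = Counter(w for text in docs for w in text.lower().split())
        n = sum(counts.values())  # total words in the concatenated class text
        word_probs[label] = {
            w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, model):
    """Return the most probable label and the unnormalized score of every label."""
    vocabulary, priors, word_probs = model
    # Only positions holding a known vocabulary token contribute to the product.
    words = [w for w in doc.lower().split() if w in vocabulary]
    scores = {
        label: priors[label] * math.prod(word_probs[label][w] for w in words)
        for label in priors
    }
    return max(scores, key=scores.get), scores

# Hypothetical training set (an assumption, not taken from these notes).
examples = [
    ("A great game", "Sports"),
    ("The election was over", "Not Sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not Sports"),
]
model = learn_naive_bayes_text(examples)
label, scores = classify_naive_bayes_text("A very close game", model)
```

For long documents the product of many small probabilities can underflow, so a practical implementation would usually sum log-probabilities instead.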

11. What are Bayesian belief nets? Where are they used? Can they solve all types of
problems?

Ans: Bayesian networks are a type of Probabilistic Graphical Model that
can be used to build models from data and/or expert opinion. Bayesian
networks are comprised of nodes and directed edges. Bayesian network
models capture both conditionally dependent and conditionally independent
relationships between random variables. Models can be prepared by experts
or learned from data, then used for inference to estimate the probabilities for
causal or subsequent events. Bayesian Belief networks describe conditional
independence among subsets of variables.
Bayesian belief networks:
• Represent the full joint distribution more compactly, with a smaller number of
parameters.
• Take advantage of conditional and marginal independences among the components of
the distribution.
Nodes
In many Bayesian networks, each node represents a Variable such as
someone's height, age or gender. A variable might be discrete, such as
Gender = {Female, Male} or might be continuous such as someone's age.

Links
Links are added between nodes to indicate that one node directly influences
the other. When a link does not exist between two nodes, this does not mean
that they are completely independent, as they may be connected via other
nodes. They may however become dependent or independent depending on
the evidence that is set on other nodes.

Directed Acyclic Graph (DAG)

A Bayesian network is a type of graph called a Directed Acyclic
Graph or DAG. A DAG is a graph with directed links that contains
no directed cycles.

Why are Bayesian nets used?
Bayesian networks can be built from data and/or expert opinion, and can
be used for a wide range of tasks including prediction, anomaly detection,
diagnostics, automated insight, reasoning, time series prediction and
decision making under uncertainty. They are also commonly referred to as
Bayes nets, belief networks and sometimes causal networks.

Example :

Two events can cause grass to be wet: an active sprinkler or rain. Rain
has a direct effect on the use of the sprinkler (namely, when it rains,
the sprinkler usually is not active). This situation can be modeled with
a Bayesian network with three nodes (Rain, Sprinkler, Grass wet) and directed
links Rain → Sprinkler, Rain → Grass wet and Sprinkler → Grass wet. Each
variable has two possible values, T (for true) and F (for false).
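
Inference in this small network can be done by enumerating the joint distribution, which factorizes as P(Rain) P(Sprinkler | Rain) P(Grass wet | Sprinkler, Rain). The sketch below is a minimal Python illustration; the conditional probability values are assumed for the sake of the example and are not given in these notes.

```python
from itertools import product

# Assumed (illustrative) conditional probability tables.
P_RAIN = {True: 0.2, False: 0.8}
# P(Sprinkler = s | Rain = r), indexed as P_SPRINKLER[r][s].
P_SPRINKLER = {
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}
# P(Grass wet = T | Sprinkler = s, Rain = r), keyed by (s, r).
P_WET = {
    (False, False): 0.0,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}

def joint(rain, sprinkler, wet):
    """Joint probability from the network factorization."""
    p_wet_true = P_WET[(sprinkler, rain)]
    p_wet = p_wet_true if wet else 1.0 - p_wet_true
    return P_RAIN[rain] * P_SPRINKLER[rain][sprinkler] * p_wet

# P(Grass wet = T): sum the joint over the unobserved variables.
p_wet = sum(joint(r, s, True) for r, s in product([True, False], repeat=2))
# P(Rain = T, Grass wet = T): sum over the sprinkler only.
p_rain_and_wet = sum(joint(True, s, True) for s in [True, False])
# Diagnostic query: P(Rain = T | Grass wet = T).
p_rain_given_wet = p_rain_and_wet / p_wet
```

Enumeration is exponential in the number of variables, so it is only practical for small networks like this one.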

Limitations of Bayesian Nets:

Bayesian nets cannot
• combine conflicting beliefs that are based on different implicit conditions, or
• carry out inference when the premises are based on different implicit
conditions.

12. Write and explain the EM algorithm. Discuss what
Gaussian mixtures are.
Ans:
• Given:
• Instances from X generated by a mixture of k Gaussian distributions
• Unknown means <μ1,…,μk> of the k Gaussians
• It is not known which instance xi was generated by which Gaussian
• Determine:
• Maximum likelihood estimates of <μ1,…,μk>

EM Algorithm (for k = 2 Gaussians with known, equal variance σ²)
• Pick a random initial h = <μ1, μ2>, then iterate:
• E step: Calculate the expected value E[zij] of each hidden variable zij,
assuming the current hypothesis h = <μ1, μ2> holds. Here zij is 1 if xi was
generated by the j-th Gaussian and 0 otherwise:

$$E[z_{ij}] = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}}$$

• M step: Calculate a new maximum likelihood hypothesis h' = <μ'1, μ'2>,
assuming the value taken on by each hidden variable zij is its expected value
E[zij] calculated above. Replace h = <μ1, μ2> by h' = <μ'1, μ'2>:

$$\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\,x_i}{\sum_{i=1}^{m} E[z_{ij}]}$$
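
The E and M steps can be sketched in Python for one-dimensional data. This is a minimal illustration assuming a known, shared standard deviation and equal mixing weights; the synthetic data, the fixed number of iterations, the initialization at the data extremes, and the function name em_means are all assumptions made for the example.

```python
import math
import random

def em_means(xs, mu_init, sigma=1.0, iters=50):
    """Estimate the means of a Gaussian mixture with known, shared sigma."""
    mu = list(mu_init)
    for _ in range(iters):
        # E step: E[z_ij] is the normalized Gaussian weight of x_i under mean j.
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M step: each mean becomes the E[z_ij]-weighted average of the data.
        mu = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(len(mu))
        ]
    return mu

# Synthetic data: 200 points around 0 and 200 points around 5 (unit variance).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

# Start from the extremes of the data so the two means begin well separated.
mu = em_means(xs, mu_init=[min(xs), max(xs)])
```

EM converges to a local maximum of the likelihood, so the result can depend on the initial means; practical implementations often run several restarts and stop on a convergence tolerance rather than a fixed iteration count.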
Gaussian Mixtures:

A Gaussian mixture is a function composed of several Gaussians, each
identified by k ∈ {1,…, K}, where K is the number of clusters in our dataset. Each
Gaussian k in the mixture has the following parameters:
• A mean μ that defines its centre.
• A covariance Σ that defines its width. This would be equivalent to the
dimensions of an ellipsoid in a multivariate scenario.
• A mixing probability π that defines how big or small the Gaussian function
will be.

[Figure: a mixture of three Gaussian functions, K = 3]

With three Gaussian functions (K = 3), each Gaussian explains the data
contained in one of the three clusters. The
mixing coefficients are themselves probabilities and must meet this condition:

$$\sum_{k=1}^{K} \pi_k = 1$$

Now how do we determine the optimal values for these parameters? To achieve
this we must ensure that each Gaussian fits the data points belonging to its
cluster. This is exactly what maximum likelihood does.
In general, the Gaussian density function is given by:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)$$

where x represents a data point, D is the number of dimensions of each data
point, and μ and Σ are the mean and covariance, respectively. If we have a dataset
of N = 1000 three-dimensional points (D = 3), then the data form a 1000 ×
3 matrix, μ is a 1 × 3 vector, and Σ is a 3 × 3 matrix.

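13. Derive the following:

a. Maximum Likelihood and Least Square Error Hypotheses
A straightforward Bayesian analysis will show that under certain assumptions
any learning algorithm that minimizes the squared error between the output
hypothesis predictions and the training data will output a maximum likelihood
hypothesis.
• Consider any real-valued target function f
• Training examples <xi, di>, where di is a noisy training value
• di = f(xi) + ei
• ei is a random variable (noise) drawn independently for each xi
according to some Gaussian distribution with mean = 0
• Then the maximum likelihood hypothesis hML is the one that minimizes the
sum of squared errors:

$$h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$

Derivation: because the noise is independent and Gaussian with variance σ²,

$$h_{ML} = \arg\max_{h \in H} p(D|h) = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2}$$

Maximize the natural log of this instead, and drop the terms that do not depend on h:

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left( -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right) = \arg\max_{h \in H} \sum_{i=1}^{m} -(d_i - h(x_i))^2 = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2$$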
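
The equivalence between maximizing the Gaussian likelihood and minimizing the squared error can be checked numerically. The sketch below is a small Python illustration under assumed conditions: the hypotheses are lines through the origin h_w(x) = w·x with w taken from a coarse grid, and the data points and noise level SIGMA are made up for the example.

```python
import math

# Hypothetical noisy training data, roughly following d = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ds = [2.1, 3.9, 6.2, 7.8]
SIGMA = 1.0  # assumed noise standard deviation

# Candidate hypotheses h_w(x) = w * x for w = 0.0, 0.1, ..., 4.0.
candidates = [i / 10 for i in range(41)]

def sum_squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w):
    # ln p(D | h_w) under independent zero-mean Gaussian noise.
    return sum(
        -0.5 * math.log(2 * math.pi * SIGMA ** 2) - (d - w * x) ** 2 / (2 * SIGMA ** 2)
        for x, d in zip(xs, ds)
    )

w_ml = max(candidates, key=log_likelihood)     # maximum likelihood hypothesis
w_ls = min(candidates, key=sum_squared_error)  # least squared error hypothesis
```

Because the log-likelihood equals a constant minus the squared error divided by 2σ², both searches select the same candidate whatever value is assumed for SIGMA.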
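b. Maximum Likelihood for Predicting Probabilities:
• Consider predicting survival probability from patient data
• Training examples <xi, di>, where di is 1 or 0
• We want to train a neural network to output a probability given xi (not a 0 or 1)
• In this case one can show that

$$h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i)\ln(1 - h(x_i))$$

• Weight update rule for a sigmoid unit:

$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$$

• where

$$\Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i))\,x_{ijk}$$

For the complete derivation, refer to the textbook “Machine Learning” by Tom M.
Mitchell, pages 168 to 171.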

c. Minimum Description Length (MDL) Principle

Occam’s razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes

$$h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D|h)$$

where LC(x) is the description length of x under encoding C.
Example:
• H = decision trees, D = training data labels
• LC1(h) is the number of bits to describe tree h
• LC2(D|h) is the number of bits to describe D given h
– Note LC2(D|h) = 0 if the examples are classified perfectly by h. We need
only describe exceptions
• Hence hMDL trades off tree size for training errors

$$\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(D|h)\,P(h) \\
&= \arg\max_{h \in H} \log_2 P(D|h) + \log_2 P(h) \\
&= \arg\min_{h \in H} -\log_2 P(D|h) - \log_2 P(h) \qquad (1)
\end{aligned}$$

Interesting fact from information theory:
The optimal (shortest expected length) code for an event with probability p uses
−log2 p bits.
So interpret (1):
• −log2 P(h) is the length of h under the optimal code
• −log2 P(D|h) is the length of D given h under the optimal code
→ prefer the hypothesis that minimizes
length(h) + length(misclassifications)

d. K-Means Algorithm
Algorithm
1. The sample space is initially partitioned into K clusters, and the observations
are randomly assigned to the clusters.
2. For each sample:
Calculate the distance from the observation to the centroid of each
cluster.
IF the sample is closest to the centroid of its own cluster THEN leave it ELSE
move it to the closest cluster.
3. Recompute the centroids and repeat step 2 until no observations are moved
from one cluster to another.
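
A minimal Python sketch of this loop is given below. It is a variant that starts from explicitly supplied initial centroids rather than a random partition, which makes the run reproducible; the sample points, the starting centroids and the function name kmeans are illustrative assumptions.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Assign points to the nearest centroid and update centroids until stable."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iters):
        # Assignment step: index of the closest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation moved, so the clustering has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its current members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Hypothetical two-dimensional samples forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, assignment = kmeans(points, centroids=[(0, 0), (10, 10)])
```

The result of K-means depends on the starting centroids, so in practice the algorithm is often run from several initializations and the clustering with the lowest within-cluster distance is kept.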
[Figures: how the K-means clustering algorithm works; K-means clustering example]

Derivation of the K-means Algorithm

• Refer to the textbook “Machine Learning” by Tom M. Mitchell, pages 195 to
196.

Note: All students are instructed to study the VTU prescribed textbook,
“Machine Learning” by Tom M. Mitchell, in addition to these notes for the main
examination.
