Module4 Notes
A prior is a probability that expresses one's beliefs about a quantity before any
evidence is taken into account. In statistical inference and Bayesian techniques,
the prior plays an important role: it is combined with the likelihood of the
observed data to produce the updated (posterior) probability.
(ii) The conditional probability of an event A given that an event B has occurred is
written P(A|B) and is calculated using:
P(A|B) = P(A∩B) / P(B), provided P(B) > 0.
Example (drawing a king first, then a queen, from a standard deck without replacement):
P(A) = 4/52 (the first card is a king)
P(B|A) = 4/51 (the second card is a queen, given a king was already drawn)
P(A and B) = P(A) × P(B|A) = (4/52) × (4/51) ≈ 0.006
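A minimal sketch of the calculation above, assuming the interpretation A = "first card is a king" and B = "second card is a queen", drawn without replacement:

```python
from fractions import Fraction

# Assumed events for the card example: A = first card is a king,
# B = second card is a queen, drawn without replacement from 52 cards.
p_a = Fraction(4, 52)          # P(A): 4 kings out of 52 cards
p_b_given_a = Fraction(4, 51)  # P(B|A): 4 queens out of the remaining 51

# Multiplication rule: P(A and B) = P(A) * P(B|A)
p_a_and_b = p_a * p_b_given_a
print(float(p_a_and_b))  # ~0.006

# Rearranging gives the conditional-probability definition:
# P(B|A) = P(A and B) / P(A)
assert p_a_and_b / p_a == p_b_given_a
```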
(iii) A posterior probability, in Bayesian statistics, is the revised or updated
probability of an event occurring after taking into consideration new information.
Posterior probability is calculated by updating the prior probability using Bayes'
theorem. In statistical terms, the posterior probability is the probability of event A
occurring given that event B has occurred.
By Bayes' theorem:
P(A|B) = P(B|A) · P(A) / P(B)
where:
A, B = events
P(B) > 0
P(B|A) = the probability of B occurring given that A is true
P(A) and P(B) = the probabilities of A occurring and B occurring independently of each other
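As a small numeric illustration of this update (the prior, sensitivity, and false-positive numbers below are assumptions chosen for the example, not values from the notes):

```python
# Assumed numbers for illustration: a diagnostic test where
# A = "condition present", B = "test is positive".
p_a = 0.01             # prior P(A)
p_b_given_a = 0.95     # sensitivity P(B|A)
p_b_given_not_a = 0.05 # false-positive rate P(B|not A)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: posterior P(A|B) = P(B|A)P(A) / P(B)
posterior = p_b_given_a * p_a / p_b
print(round(posterior, 3))  # 0.161: still low, because the prior is small
```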
Bayes' theorem is a formula that describes how to update the probabilities of hypotheses
when given evidence. It follows simply from the axioms of conditional probability, but can be
used to powerfully reason about a wide range of problems involving belief updates.
Given a hypothesis h and evidence D, Bayes' theorem states that the relationship between
the probability of the hypothesis before getting the evidence, P(h), and the probability of
the hypothesis after getting the evidence, P(h|D), is:
P(h|D) = P(D|h) · P(h) / P(D)
Assumptions :
The probability distributions P(h) and P(D|h) are chosen to be consistent with the
following assumptions:
1. The training data D is noise free (i.e. di = c(xi))
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable
than any other.
Two cases :
• Applying Bayes' theorem gives two cases.
• Case 1: When h is inconsistent with the training data D:
P(h|D) = (0 · P(h)) / P(D) = 0
• Case 2: When h is consistent with the training data D:
P(h|D) = ((1) · (1/|H|)) / (|VS_H,D| / |H|) = 1 / |VS_H,D|
where VS_H,D denotes the version space: the subset of hypotheses in H that are consistent with D.
The Bayesian approach to classifying a new instance is to assign the most
probable target value, v_MAP, given the attribute values <a1, a2, ..., an> that
describe the instance:
v_MAP = argmax over vj in V of P(vj | a1, a2, ..., an)
      = argmax over vj in V of P(a1, a2, ..., an | vj) · P(vj)
7. Illustrate the application of the Bayes Classifier to a
concept learning problem of Play Tennis Concept
example.
Example: Consider the Play tennis examples given below:
Learning Phase:
Test Phase:
Solution:
Solution outline: for the given problem we need to estimate, from the training data, the
prior probabilities P(vj) for each target value and the conditional probabilities
P(ai | vj) for each attribute value given each target value.
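A minimal sketch of the learning and test phases. The data below is the standard 14-example PlayTennis dataset from Mitchell's textbook (attributes Outlook, Temperature, Humidity, Wind); verify it against the table intended in the notes:

```python
from collections import Counter, defaultdict

# Standard PlayTennis training examples (Mitchell); verify against the notes.
data = [
    ("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]

# Learning phase: estimate P(vj) and P(ai|vj) by relative frequency.
label_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
for row in data:
    *attrs, label = row
    for i, a in enumerate(attrs):
        cond_counts[(i, label)][a] += 1

def classify(instance):
    # Test phase: v_NB = argmax_vj P(vj) * prod_i P(ai | vj)
    scores = {}
    for v, n in label_counts.items():
        p = n / len(data)
        for i, a in enumerate(instance):
            p *= cond_counts[(i, v)][a] / n
        scores[v] = p
    return max(scores, key=scores.get)

print(classify(("Sunny", "Cool", "High", "Strong")))  # -> No
```

With this data, the classic test instance <Sunny, Cool, High, Strong> is assigned "No", matching Mitchell's worked example.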
10. Explain how the Naïve Bayes algorithm is useful for
learning and classifying text.
Answer:
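As a sketch of the idea (the tiny corpus and class names below are invented for illustration): a multinomial naive Bayes model treats each document as a bag of words, estimates P(word | class) from word counts with Laplace (add-one) smoothing, and classifies by maximizing log P(class) + Σ log P(word_i | class):

```python
from collections import Counter
import math

# Toy corpus (assumed for illustration only).
train = [
    ("the team won the match", "sports"),
    ("great goal in the final game", "sports"),
    ("the election results were announced", "politics"),
    ("the minister gave a speech", "politics"),
]

# Learning phase: per-class document counts and word counts.
class_docs = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_docs}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def classify(text):
    best, best_lp = None, -math.inf
    for c in class_docs:
        total = sum(word_counts[c].values())
        # log P(c) + sum_i log P(w_i | c), with add-one smoothing
        lp = math.log(class_docs[c] / len(train))
        for w in text.split():
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(classify("the team scored a goal"))  # -> sports
```

Working in log-probabilities avoids numerical underflow when documents contain many words, which is why it is the usual choice for text.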
11.What are Bayesian Belief nets? Where are they used? Can it solve all types of
problems?
Links
Links are added between nodes to indicate that one node directly influences
the other. When a link does not exist between two nodes, this does not mean
that they are completely independent, as they may be connected via other
nodes. They may however become dependent or independent depending on
the evidence that is set on other nodes.
Example :
Two events can cause grass to be wet: an active sprinkler or rain. Rain
has a direct effect on the use of the sprinkler (namely, when it rains,
the sprinkler is usually not active). This situation can be modeled with
a Bayesian network in which each variable has two possible values,
T (for true) and F (for false).
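A minimal sketch of inference in this network by enumerating the joint distribution. The conditional probability table numbers below are assumptions chosen for illustration:

```python
from itertools import product

# Assumed CPTs for the Rain -> Sprinkler -> GrassWet network above.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(Sprinkler | Rain=T)
               False: {True: 0.4, False: 0.6}}    # P(Sprinkler | Rain=F)
P_wet = {(True, True): 0.99, (True, False): 0.9,  # P(Wet=T | Sprinkler, Rain)
         (False, True): 0.8, (False, False): 0.0}

def joint(r, s, w):
    # Chain rule for this network: P(R, S, W) = P(R) P(S|R) P(W|S,R)
    pw = P_wet[(s, r)] if w else 1 - P_wet[(s, r)]
    return P_rain[r] * P_sprinkler[r][s] * pw

# Inference by enumeration: P(Rain=T | GrassWet=T)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(round(num / den, 4))  # 0.3577 with these assumed numbers
```

The same enumeration answers any query over the network, but it scales exponentially in the number of variables, which is why exact inference in general Bayesian networks is intractable and approximate methods are used for large networks.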
EM Algorithm
• Pick a random initial hypothesis h = <μ1, μ2>, then iterate:
• E step: Calculate the expected value E[z_ij] of each hidden variable z_ij,
assuming the current hypothesis h = <μ1, μ2> holds.
• M step: Calculate a new maximum-likelihood hypothesis h' = <μ1', μ2'>, assuming
each hidden variable z_ij takes the expected value E[z_ij] computed in the E step;
then replace h by h' and repeat.
In the example, there are three Gaussian functions, hence K = 3; each
Gaussian explains the data contained in one of the three clusters. The
mixing coefficients πk are themselves probabilities and must meet this condition:
π1 + π2 + ... + πK = 1, with 0 ≤ πk ≤ 1 for each k.
Now how do we determine the optimal values for these parameters? To achieve
this we must ensure that each Gaussian fits the data points belonging to each
cluster. This is exactly what maximum likelihood does.
In general, the multivariate Gaussian density function is given by:
N(x | μ, Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) · exp(−(1/2) (x − μ)^T Σ^(−1) (x − μ))
where x represents a data point and D is the number of dimensions of each data
point; μ and Σ are the mean and covariance, respectively. If we have a dataset
comprised of N = 1000 three-dimensional points (D = 3), then the data forms a
1000 × 3 matrix, μ is a 1 × 3 vector, and Σ is a 3 × 3 matrix.
For the complete derivation, refer to the textbook "Machine Learning", Tom M.
Mitchell, pages 168 to 171.
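The two-means EM setting described above (two 1-D Gaussians with known, equal variances and unknown means h = <μ1, μ2>) can be sketched as follows; the data below is synthetic, generated for illustration:

```python
import math
import random

# Synthetic data (assumed): a mixture of two 1-D Gaussians with known
# variance, true means 0 and 4, and equal mixing weights.
random.seed(0)
sigma = 1.0
data = ([random.gauss(0.0, sigma) for _ in range(50)] +
        [random.gauss(4.0, sigma) for _ in range(50)])

mu = [random.random(), random.random() + 1.0]  # random initial h = <mu1, mu2>
for _ in range(50):
    # E step: E[z_ij] = probability that point x_i came from Gaussian j,
    # given the current means.
    E = []
    for x in data:
        w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        E.append([wj / sum(w) for wj in w])
    # M step: each new mean is the E[z_ij]-weighted average of the points.
    mu = [sum(E[i][j] * data[i] for i in range(len(data))) /
          sum(E[i][j] for i in range(len(data)))
          for j in range(2)]

print([round(m, 2) for m in mu])  # the means converge near 0 and 4
```

Each iteration is guaranteed not to decrease the data likelihood, so EM converges to a local maximum of the likelihood; a poor initialization can still lead to a poor local optimum.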
d. K Means Algorithm
Algorithm
1. The sample space is initially partitioned into K clusters and the observations
are randomly assigned to the clusters.
2. For each sample:
Calculate the distance from the sample to the centroid of each cluster.
IF the sample is closest to its own cluster's centroid THEN leave it where it
is, ELSE move it to the closest cluster.
3. Recompute the cluster centroids and repeat step 2 until no observations are
moved from one cluster to another.
How does the K-Means clustering algorithm work?
K-means clustering example
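The steps above can be sketched as follows (toy one-dimensional data assumed):

```python
import random

# Toy 1-D data (assumed): three well-separated groups around 0, 5, and 10.
random.seed(1)
data = [random.gauss(c, 0.5) for c in (0, 5, 10) for _ in range(20)]
K = 3

# Step 1: randomly assign observations to K clusters.
assign = [random.randrange(K) for _ in data]

moved = True
while moved:
    # Centroid of each cluster = mean of its members (re-seed empty clusters).
    centroids = []
    for k in range(K):
        members = [x for x, a in zip(data, assign) if a == k]
        centroids.append(sum(members) / len(members) if members
                         else random.choice(data))
    # Step 2: move each sample to the cluster with the closest centroid.
    moved = False
    for i, x in enumerate(data):
        best = min(range(K), key=lambda k: abs(x - centroids[k]))
        if best != assign[i]:
            assign[i], moved = best, True
    # Step 3: repeat until no sample changes cluster.

print(sorted(round(c, 1) for c in centroids))  # final cluster centroids
```

At termination every point sits in the cluster whose centroid is nearest to it; like EM, the result is a local optimum, so K-means is often restarted from several random assignments.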
• Refer to the textbook "Machine Learning", Tom M. Mitchell, pages 195 to 196.
Note: All students are hereby instructed to study the VTU-prescribed textbook,
Machine Learning by Tom M. Mitchell, in addition to these notes for the main
examination.