L09 Learning I Bayesian Learning
Semester I, 2024-25
Rohan Paul
Outline
• Last Class
• CSPs
• This Class
• Bayesian Learning, MLE/MAP, Learning in Probabilistic Models.
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary
reading on topics covered in class from AIMA Ch 20 sections 20.1 – 20.2.4.
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.
Learning Probabilistic Models
• Models are useful for making optimal decisions.
• Probabilistic models express a theory about the domain and can be used for
decision making.
• How to acquire these models in the first place?
• Solution: data or experience can be used to build these models
• Key question: how to learn from data?
• Bayesian view of learning (learning task itself is probabilistic inference)
• Learning with complete and incomplete data.
• Essentially, rely on counting.
Example: Which candy bag is it?
[Figure: probability (model → data) vs. statistics (data → model)]
Bayesian Learning – in a nutshell
• A prior P(H) over hypotheses and a likelihood P(d|H) for the observed data D1, D2, …, DN.
• Bayes rule: P(h_i | d) ∝ P(d | h_i) P(h_i).
• IID assumption: observations are independent and identically distributed given the hypothesis, so P(d | h_i) = Π_j P(d_j | h_i).
• This yields the posterior probability of each hypothesis given the observations; the belief can be updated incrementally as each observation arrives.
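As a sketch, the incremental belief update can be run on the candy-bag example from AIMA Sec. 20.1 (the lime fractions and priors below follow that example; treat the exact numbers as illustrative):

```python
# Incremental Bayesian update for the candy-bag example (cf. AIMA Sec. 20.1).
# Five hypotheses about the fraction of lime candies in the bag.

lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)
posterior = [0.1, 0.2, 0.4, 0.2, 0.1]          # prior P(h_i)

def update(posterior, observation):
    """One Bayes-rule step: P(h | d) ∝ P(d | h) P(h)."""
    likelihood = [f if observation == "lime" else 1 - f for f in lime_fraction]
    unnorm = [l * p for l, p in zip(likelihood, posterior)]
    z = sum(unnorm)                            # normalizing constant
    return [u / z for u in unnorm]

for _ in range(5):                             # observe five limes in a row
    posterior = update(posterior, "lime")

print([round(p, 3) for p in posterior])        # mass shifts toward h5 (all-lime)
```

After a few lime observations, the hypothesis that assigns them zero probability is ruled out and the all-lime hypothesis dominates, matching the convergence claim above.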
Bayesian Prediction – Evidence arrives incrementally
Key ideas
• Predictions are a weighted average over the predictions of the individual hypotheses (prediction by model averaging), and the belief over hypotheses changes as evidence arrives.
• Bayesian prediction eventually agrees with the true hypothesis: for any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish.
• Why keep all the hypotheses?
• When learning from small data, early commitment to a hypothesis is risky; later evidence may point to a different, more likely hypothesis.
• Better accounting of uncertainty in making predictions.
• Problem: this may be slow and intractable; we cannot always estimate and marginalize out the hypotheses.
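A minimal sketch of prediction by model averaging, P(X | d) = Σ_i P(X | h_i) P(h_i | d), using an assumed posterior over the five candy-bag hypotheses (the numbers are illustrative, not computed from real data):

```python
# Bayesian prediction by model averaging over all hypotheses.

lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(next = lime | h_i)
posterior = [0.0, 0.1, 0.4, 0.3, 0.2]          # assumed P(h_i | d), for illustration

# Weighted average of each hypothesis's prediction, weighted by belief:
p_next_lime = sum(f * p for f, p in zip(lime_fraction, posterior))
print(p_next_lime)   # → 0.65
```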
Marginalization over Hypotheses – challenging!
Can we pick one good hypothesis and just use that for predictions?
Maximum a-posteriori (MAP) Approximation
P(X | d): the probability of observing new data X, given the evidence d.
Maximum likelihood (ML) approximation: make predictions with the hypothesis that maximizes the data likelihood. This is essentially MAP with a uniform prior, i.e., no preference for one hypothesis over another.
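A sketch of the MAP choice: score each hypothesis by log P(d | h) + log P(h) and predict with the argmax alone (the candy-bag setup and priors below are illustrative; dropping the log-prior term recovers ML):

```python
import math

# MAP approximation: h_MAP = argmax_h P(d | h) P(h), then predict with h_MAP only.

lime_fraction = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)
prior = [0.1, 0.2, 0.4, 0.2, 0.1]              # P(h_i)
data = ["lime"] * 5                            # five lime candies observed

def log_posterior(i):
    """Unnormalized log posterior: log P(d | h_i) + log P(h_i)."""
    ll = math.log(prior[i]) if prior[i] > 0 else -math.inf
    for d in data:
        p = lime_fraction[i] if d == "lime" else 1 - lime_fraction[i]
        if p == 0:
            return -math.inf                   # hypothesis ruled out by the data
        ll += math.log(p)
    return ll

h_map = max(range(5), key=log_posterior)
print(h_map, lime_fraction[h_map])             # predict with this single hypothesis
```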
This is similar to observing tosses of a biased coin and estimating its bias parameter.
ML Estimation in General: Estimation for
Bernoulli Model
Even in the coin-tossing problem, the ML estimate is the fraction of heads (or tails) over the total number of tosses.
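As a sketch with toy data (the tosses below are made up for illustration), the Bernoulli MLE is just the relative count, which maximizes L(θ) = θ^h (1 − θ)^t:

```python
# MLE for a Bernoulli (biased-coin) model: theta_ML = (# heads) / (# tosses).

tosses = ["H", "H", "T", "H", "T", "H", "H", "T"]   # illustrative data
heads = tosses.count("H")
theta_ml = heads / len(tosses)
print(theta_ml)   # → 0.625
```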
MAP vs. MLE Estimation
• Maximum likelihood estimate (MLE)
• Estimates the parameters that maximize the data likelihood.
• Relative counts give MLE estimates
• Features: The attributes used to make the digit decision
• Pixels: (6,8)=ON
• Shape Patterns: NumComponents, AspectRatio, NumLoops
• …
Bayes Net for Classification
• Naïve Bayes: Assume all features are independent effects of the label
• P(Fi|Y) – for digit 3, what fraction of the time is this cell on?
• Conditioned on the class type how frequent is the feature
If one feature value was never seen in the training data, the likelihood goes to zero. But not seeing a feature value in training does not mean we will never see it at test time. Essentially, this is overfitting to the training data set.
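A toy sketch of the zero-count problem (the feature names and counts below are invented for illustration): one unseen feature value zeroes out the whole Naïve Bayes product P(y) · Π_i P(f_i | y).

```python
# Naive Bayes with raw relative-count estimates: an unseen (feature, class)
# pair gives P(f | y) = 0 and collapses the entire product to zero.

train = [({"pixel_6_8": 1, "loop": 1}, 3),   # (features, label), toy data
         ({"pixel_6_8": 1, "loop": 0}, 3),
         ({"pixel_6_8": 0, "loop": 0}, 1)]

def cond_prob(feature, value, label):
    """Relative-count estimate of P(feature = value | label)."""
    rows = [x for x, y in train if y == label]
    return sum(x[feature] == value for x in rows) / len(rows)

# loop=1 was never seen with label 1 in training, so the score collapses:
score = cond_prob("pixel_6_8", 0, 1) * cond_prob("loop", 1, 1)
print(score)   # → 0.0
```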
Laplace Smoothing
• Pretend that every outcome occurs once more than it is observed. Example data: H, H, T.
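With the H, H, T data above, add-one smoothing gives P(x) = (count(x) + 1) / (N + |X|):

```python
from collections import Counter

# Laplace (add-one) smoothing: every outcome is counted once more than observed.

data = ["H", "H", "T"]
outcomes = ["H", "T"]
counts = Counter(data)

p_smooth = {x: (counts[x] + 1) / (len(data) + len(outcomes)) for x in outcomes}
print(p_smooth)   # → {'H': 0.6, 'T': 0.4}
```

No probability is ever exactly zero, which fixes the unseen-feature problem on the previous slide.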
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
K-Means Clustering Algorithm
What objective is K-Means optimizing?
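K-Means minimizes the within-cluster sum of squared distances, J = Σ_k Σ_{x∈C_k} ||x − μ_k||². A minimal 1-D sketch of Lloyd's algorithm on toy data (the data and initial centers are illustrative; empty clusters are not handled):

```python
# K-Means (Lloyd's algorithm) in 1-D: alternate assignment and mean-update steps.

data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers = [0.0, 10.0]                      # illustrative initialization

for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    clusters = [[], []]
    for x in data:
        k = min(range(2), key=lambda k: (x - centers[k]) ** 2)
        clusters[k].append(x)
    # Update step: each center moves to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

# The objective J that these steps monotonically decrease:
objective = sum((x - centers[k]) ** 2 for k in range(2) for x in clusters[k])
print(centers, objective)
```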
GMMs are a generative model of data.
data.