L09 Learning I: Bayesian Learning

COL333/671: Introduction to AI

Semester I, 2024-25

Learning with Probabilities

Rohan Paul

Outline
• Last Class
• CSPs
• This Class
• Bayesian Learning, MLE/MAP, Learning in Probabilistic Models.
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary
reading on topics covered in class from AIMA Ch 20 sections 20.1 – 20.2.4.

Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.

Learning Probabilistic Models
• Models are useful for making optimal decisions.
• Probabilistic models express a theory about the domain and can be used for
decision making.
• How to acquire these models in the first place?
• Solution: data or experience can be used to build these models
• Key question: how to learn from data?
• Bayesian view of learning (learning task itself is probabilistic inference)
• Learning with complete and incomplete data.
• Essentially, rely on counting.
Example: Which candy bag is it?

(Figure: statistics vs. probability.)
Bayesian Learning – in a nutshell
(Figure: a hypothesis variable H with prior P(H) generates i.i.d. observations D1, D2, …, DN, each with likelihood P(d|H).)

In these slides X and d are used interchangeably.

Posterior Probability of Hypothesis given Observations
Now we are getting observations incrementally; how does our belief change? That is, what is the probability of a bag of a certain type given the observations so far?

Bayes Rule

i.i.d. assumption
Posterior Probability of Hypothesis given Observations
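The update on this slide can be written out as follows; a sketch in the standard notation, where α is a normalizing constant and d = d1, …, dN are the observations so far:

$$P(h_i \mid \mathbf{d}) = \alpha \, P(\mathbf{d} \mid h_i)\, P(h_i) = \alpha \, P(h_i) \prod_{j=1}^{N} P(d_j \mid h_i)$$

The product over individual observations is exactly where the i.i.d. assumption enters.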
Incremental Belief Update

The true hypothesis eventually dominates: the probability of indefinitely producing uncharacteristic data tends to 0.
Predictions given Belief over Hypotheses
What is the probability that the next candy is of type lime?

(Figure: predicted probability that the next candy is lime as a function of the number of observations.)
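The prediction itself marginalizes over the hypotheses; a sketch in the same notation:

$$P(X_{N+1} = \text{lime} \mid \mathbf{d}) = \sum_i P(X_{N+1} = \text{lime} \mid h_i)\, P(h_i \mid \mathbf{d})$$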
Bayesian Prediction – Evidence arrives incrementally

Key ideas
• Predictions are a weighted average over the predictions of the individual hypotheses (prediction by model averaging).
• Bayesian prediction eventually agrees with the true hypothesis.
• For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish.
• Why keep all the hypotheses?
• When learning from small data, early commitment to a hypothesis is risky; later evidence may point to a different hypothesis.
• Better accounting of uncertainty in making predictions.
• Problem: may be slow and intractable; we cannot always estimate and marginalize out the hypotheses.
(Figures: the changing belief over hypotheses and the model-averaged prediction as evidence arrives. A numerical sketch follows this list.)
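A minimal Python sketch of the incremental update and model-averaged prediction. The bag types and priors below follow the standard AIMA Ch. 20 candy example (cherry fractions 1.0, 0.75, 0.5, 0.25, 0.0 with priors 0.1, 0.2, 0.4, 0.2, 0.1); if the lecture used different numbers, treat these as assumptions.

# Incremental Bayesian learning for the candy-bag example (assumed AIMA numbers).
cherry_prob = [1.0, 0.75, 0.5, 0.25, 0.0]   # P(cherry | h_i) for the five bag types
prior       = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i)

def update(belief, candy):
    """One incremental Bayes update: belief is P(h_i | observations so far)."""
    likelihood = [p if candy == "cherry" else 1.0 - p for p in cherry_prob]
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lime(belief):
    """P(next candy = lime | observations) by model averaging over hypotheses."""
    return sum(b * (1.0 - p) for b, p in zip(belief, cherry_prob))

belief = prior[:]
for candy in ["lime"] * 10:                 # observe 10 limes in a row
    belief = update(belief, candy)
    print(belief, predict_lime(belief))

Each new candy multiplies the current belief by the per-hypothesis likelihood and renormalizes; the prediction is the weighted average described in the bullets above.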
Marginalization over Hypotheses – challenging!

Ideally, one needs to marginalize or account for all the hypotheses.

Can we pick one good hypothesis and just use that for predictions?
Maximum a-posteriori (MAP) Approximation
P(X|d): the probability of observing new data X, given the evidence d.

Estimate the best hypothesis given the data while incorporating prior knowledge.

What is the probability of a hypothesis given the data?

The prior term says which hypotheses are likelier than others; typically it is related to the number of bits needed to encode the hypothesis.
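A sketch of the MAP rule being described (standard form):

$$h_{MAP} = \arg\max_h P(h \mid \mathbf{d}) = \arg\max_h P(\mathbf{d} \mid h)\, P(h), \qquad P(X \mid \mathbf{d}) \approx P(X \mid h_{MAP})$$

Taking negative logs, maximizing P(d|h)P(h) is the same as minimizing -log P(d|h) - log P(h); the second term is the number of bits needed to encode the hypothesis, which is the MDL reading of the prior mentioned above.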
MAP Vs. Bayesian Estimation

Difference between marginalization (accounting for all hypotheses) vs. committing to a single hypothesis and making predictions from it.
Maximum Likelihood Estimation

Make predictions with the hypothesis that maximizes the data likelihood. Essentially, this assumes a uniform prior, with no preference for one hypothesis over another.

MLE is also called Maximum likelihood (ML) Approximation


Maximum Likelihood Approximation

θ represents the parameters of the probabilistic model.

These parameters define the specific configuration of the hypothesis or model we are using.

θ_ML stands for the Maximum Likelihood Estimate of the parameters: the value of θ that makes the observed data most likely under the model.
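In symbols (a sketch; the sum of logs uses the i.i.d. assumption):

$$\theta_{ML} = \arg\max_\theta P(\mathbf{d} \mid \theta) = \arg\max_\theta \sum_{j=1}^{N} \log P(d_j \mid \theta)$$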
ML Estimation in General: Bernoulli Model
The hypothesis is parameterized by the probability of generating a candy of a specific flavor.

Observed sequence: Cherry, Lime, Lime, Cherry, Cherry, Lime, Cherry, Cherry

This is a similar problem to observing tosses of a biased coin and estimating the bias/fractional parameter.
ML Estimation in General: Estimation for the Bernoulli Model

Observed sequence: Cherry, Lime, Lime, Cherry, Cherry, Lime, Cherry, Cherry

Even in the coin-tossing problem, one would take the fraction of heads (or tails) over the total number of tosses.
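A worked sketch for the sequence above (5 cherries and 3 limes, N = 8). Writing θ for P(cherry), the log-likelihood is c log θ + l log(1 - θ); setting its derivative to zero gives θ = c / (c + l):

$$\hat{\theta}_{\text{cherry}} = \frac{5}{8} = 0.625, \qquad \hat{\theta}_{\text{lime}} = \frac{3}{8} = 0.375$$

This is the same relative-count rule as estimating a coin's bias from its observed heads and tails.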
MAP vs. MLE Estimation
• Maximum likelihood estimate (MLE)
• Estimates the parameters that maximize the data likelihood.
• Relative counts give MLE estimates

• Maximum a posteriori estimate (MAP)


• Bayesian parameter estimation
• Encodes a prior over the parameters (not all parameter values are equally likely a priori).
• Combines the prior and the likelihood while
estimating the parameters.
ML Estimation in General: Learning Parameters
for a Probability Model
• Probabilistic models require
parameters (numbers in the
conditional probability tables).
• We need these values to make
predictions.
• Can we learn these from data (i.e.,
samples from the Bayes Net)?
• How to do this? Counting and averaging.
Can we use samples to estimate the values in the tables?
Learning Parameters for a Probability Model
Classification Problem
• Task: given inputs x, predict labels (classes) y
• Examples:
• Spam detection (input: document,
classes: spam / ham)
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms, classes: diseases)
• Fraud detection (input: account activity, classes: fraud / no fraud)
Bayes Net for Classification
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
• Get a large collection of example images, each labeled with a digit
• Note: someone has to hand-label all this data!
• Want to learn to predict labels of new, future digit images
• Features: the attributes used to make the digit decision
• Pixels: (6,8)=ON
• Shape patterns: NumComponents, AspectRatio, NumLoops
• …
(Figure: example digit images with labels 0, 1, 2, 1; one example is not clear.)
Bayes Net for Classification
• Naïve Bayes: assume all features are independent effects of the label
• Simple digit recognition:
• One feature (variable) Fij for each grid position <i,j>
• Feature values are on/off, based on whether the intensity is more or less than 0.5 in the underlying image
• Each input maps to a feature vector
(Figure: Bayes net with the label Y as parent of the features F1, F2, …, Fn.)
Parameter Estimation
• Need the estimates of local conditional probability tables.
• P(Y), the prior over labels
• P(Fi|Y) for each feature (evidence variable)
• These probabilities are collectively called the parameters of the model and denoted by θ.
• Till now, the table values were provided.
• Now, use data to acquire these values.
Parameter Estimation

• P(Y) – how frequent is the class type, e.g., digit 3?
• If you take a sample of images of digits, how frequent is this digit?
• P(Fi|Y) – for digit 3, what fraction of the time is the cell on?
• Conditioned on the class type, how frequent is the feature?
• Use relative frequencies from the data to estimate these values (a counting sketch follows this list).
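A minimal counting sketch in Python. The data format (a list of (features, label) pairs with binary features) and the optional pseudo-count k are illustrative assumptions; k = 0 gives the plain relative-frequency estimates, while k > 0 anticipates the Laplace smoothing discussed later.

from collections import Counter, defaultdict

def estimate_parameters(data, k=0):
    """data: list of (features, label) pairs, features a dict {name: 0/1}.
    Returns P(Y) and P(F=1 | Y) estimated by relative counts (plus k pseudo-counts)."""
    labels = sorted({y for _, y in data})
    feature_names = sorted({f for feats, _ in data for f in feats})
    label_counts = Counter(y for _, y in data)
    n = len(data)
    # Prior: P(Y = y) = count(y) / N, with optional smoothing.
    prior = {y: (label_counts[y] + k) / (n + k * len(labels)) for y in labels}
    # Conditionals: P(F = 1 | Y = y) = count(F = 1 and Y = y) / count(y).
    on_counts = defaultdict(Counter)
    for feats, y in data:
        for f, v in feats.items():
            on_counts[y][f] += v
    cond = {y: {f: (on_counts[y][f] + k) / (label_counts[y] + 2 * k)
                for f in feature_names} for y in labels}
    return prior, cond

# Tiny illustrative usage with made-up digit features:
data = [({"pixel(6,8)": 1, "loop": 1}, 3), ({"pixel(6,8)": 0, "loop": 1}, 3),
        ({"pixel(6,8)": 0, "loop": 0}, 1)]
prior, cond = estimate_parameters(data)
print(prior[3], cond[3]["pixel(6,8)"])   # 2/3 and 1/2 with k = 0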
Parameter Estimation: Complete Data

Note: the data is “complete”. Each data point has observed values for all the variables in the model.
Parameter Estimation
Problem: values not seen in the training data

If a feature value was not seen in the training data, the likelihood goes to zero.
Not seeing a feature value in the training data does not mean we will never see it at test time. Essentially, this is overfitting to the training data set.
Laplace Smoothing
• Pretend that every outcome occurs once more than it is observed (e.g., for the coin tosses H H T).

• If certain counts are not seen in training, that does not mean they have zero probability of occurring in the future.

• Another version of Laplace smoothing:
• instead of adding 1, add k to each count
• k is an adjustable parameter.

• Essentially, this encodes a prior (pseudo-counts). The formula is sketched after this list.
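The smoothed estimate referred to above can be sketched as (with |X| possible outcomes and k pseudo-counts per outcome):

$$P_{\mathrm{LAP},k}(x) = \frac{\mathrm{count}(x) + k}{N + k\,|X|}$$

For the coin tosses H, H, T and k = 1: P(H) = (2 + 1) / (3 + 2) = 3/5 and P(T) = (1 + 1) / (3 + 2) = 2/5, instead of the unsmoothed 2/3 and 1/3.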
Learning Multiple Parameters
• Estimate latent parameters
using MLE.
• There are two CPTs in this
example.
• Observations are of both
variables: Flavor and
Wrapper.
• Take log likelihood.
Learning Multiple Parameters
• Maximize the data likelihood to estimate the parameters.

Maximum likelihood parameter learning with complete data for a Bayes net decomposes into separate learning problems, one for each parameter.
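A sketch of why the problem decomposes, assuming the AIMA-style candy model with P(F = cherry) = θ, P(W = red | cherry) = θ1, P(W = red | lime) = θ2 (this parameterization is an assumption about the slide's figure). With c, l the flavor counts and r_c, g_c, r_l, g_l the red/green wrapper counts within each flavor:

$$\log P(\mathbf{d} \mid \theta, \theta_1, \theta_2) = \big[c\log\theta + l\log(1-\theta)\big] + \big[r_c\log\theta_1 + g_c\log(1-\theta_1)\big] + \big[r_l\log\theta_2 + g_l\log(1-\theta_2)\big]$$

Each bracket involves a single parameter, so maximizing gives θ = c/(c+l), θ1 = r_c/(r_c+g_c), θ2 = r_l/(r_l+g_l): each CPT entry is estimated from its own relative counts.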
How to learn the structure of the Bayes Net?
• Problem: Estimate/learn the structure
of the model
• Set up a search process (e.g., local search, hill climbing).
• For each structure, learn the
parameters.
• How to score a solution?
• Use Max. likelihood estimation.
• Penalize complexity of the structure
(don’t want a fully connected
model).
• Additionally check for validity of the
conditional independences.
Parameter Learning when some variables are
not observed
• If we knew the missing value for B, then we could estimate the CPTs (conditional probability tables).

• If we knew the CPTs, then we could infer the probability of the missing value of B.

• It is a chicken-and-egg problem. The data is incomplete: one sample has (A = 1, B = ?, C = 0).
Expectation Maximization
• Initialization
• Initialize CPT parameter values (ignoring missing information)
• Expectation
• Compute expected values of the unobserved variables assuming the current parameter values.
• Involves Bayes net inference (exact or approximate).
• Maximization
• Compute new parameters (of the CPTs) to maximize the probability of the data (observed and estimated).
• Alternate the E and M steps until convergence. Convergence (to a local optimum of the likelihood) is guaranteed.
Expectation Maximization
EM Example
Problem: learning the parameters of a Bayes
Net that models ratings given by reviewers.

We postulate that the ratings (1 or 2) are conditioned on the “genre” or “type” of the movie (Comedy or Drama).

As observations, we only see the ratings given by the reviewers.

Apply EM to learn the parameters.

Reviewers rate individually (their CPTs are


assumed to be the same).

Slide adapted from Dorsa Sadigh and Percy Liang


What objective are we optimizing in EM?
Maximum Marginal Likelihood
Latent Variables are variables in a model that
are not directly observed in the data but are
inferred through relationships with observed
variables.
In this example, G (genre of the movie) acts as
a latent variable because it influences the
observed ratings, but its value might not be
directly provided in the data.
Latent Vectors typically refer to multi-
dimensional representations of these latent
variables, but in this context, they simply mean
the possible values or states of G that we sum
over to compute the marginal likelihood in the
EM objective.

Marginalize over the latent variables in the likelihood.

Slide adapted from Dorsa Sadigh and Percy Liang
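A sketch of the objective in this example's notation (observed ratings r_j, latent genre G, parameters θ):

$$\max_\theta \sum_{j=1}^{N} \log P(R = r_j \mid \theta) = \max_\theta \sum_{j=1}^{N} \log \sum_{g} P(G = g,\, R = r_j \mid \theta)$$

The inner sum marginalizes out the latent variable; EM ascends this marginal likelihood without ever observing G.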


E and M steps

In the E-step, EM estimates the probabilities of the hidden (latent) variables given the observed data and the current parameters. This is computed for every value of h and for each setting of the evidence variables.
In the M-step, it updates the model parameters to maximize the likelihood of the observed data, given these estimated probabilities. The estimated (fractional) data points from the E-step are used to update the CPTs.

Slide adapted from Dorsa Sadigh and Percy Liang


EM: Estimating and using weighted samples

Estimated Fractional samples

(g=c, r1=2, r2=2) prob: 0.69


(g=d, r1=2, r2=2) prob: 0.31
(g=c, r1=1, r2=2) prob: 0.5
(g=d, r1=1, r2=2) prob: 0.5

Revising probabilities based on fractional samples.

The CPTs for the two reviewers are the same.
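A minimal sketch of how such fractional weights can arise, assuming illustrative CPT values p(g=c) = 0.5 and p(r=2|c) = 0.6, p(r=2|d) = 0.4. These specific numbers are an assumption consistent with the 0.69/0.31 and 0.5/0.5 splits shown; the lecture's actual values may differ.

# E-step for single data points under assumed CPTs (same CPT for both reviewers).
p_c = 0.5                      # prior P(g = comedy); assumed value
p_r2 = {"c": 0.6, "d": 0.4}    # P(r = 2 | g); assumed values

def posterior_comedy(r1, r2):
    """P(g = c | r1, r2) by Bayes rule over the two observed ratings."""
    def lik(g, r):
        return p_r2[g] if r == 2 else 1.0 - p_r2[g]
    joint_c = p_c * lik("c", r1) * lik("c", r2)
    joint_d = (1.0 - p_c) * lik("d", r1) * lik("d", r2)
    return joint_c / (joint_c + joint_d)

print(posterior_comedy(2, 2))  # ~0.69 -> the (g=c, r1=2, r2=2) fractional sample
print(posterior_comedy(1, 2))  # 0.5   -> the (g=c, r1=1, r2=2) fractional sample

In the M-step these weights act as fractional counts when the CPTs are re-estimated.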


Related Topic: Clustering
Example: Clustering images in a database
Clustering is subjective
Clustering is based on a distance metric

Clustering depends on the distance function used: Euclidean distance? Edit distance? …
K-Means Clustering

A GMM yields a probability distribution over the cluster assignment for each point, whereas K-Means gives a single hard assignment.

GMM: Gaussian Mixture Model

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
K-Means Clustering Algorithm
What objective is K-Means optimizing?

K-Means reduces the distortion metric with every step and converges (to a local minimum of the distortion).

(Figure: data points being re-clustered over Iteration I, Iteration II, Iteration III.)

How to pick “k”?

A suitable number of clusters (k) can be identified by comparing the distortion metric for different values of k.
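A minimal K-Means sketch in Python/numpy (the random initialization and the stopping rule below are simple illustrative choices):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize from data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    distortion = float(((X - centers[labels]) ** 2).sum())  # the objective being reduced
    return centers, labels, distortion

Running this for several values of k and comparing the returned distortion is one way to pick k, as suggested above.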
K-Means Application: Segmentation
The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance.

Apply K-Means in the colour space.


EM in Continuous Space: Gaussian Mixture
Modeling
• Problem: a clustering task where we want to discern multiple categories in a collection of given points.
• Assume a mixture of components
(Gaussian)
• Don’t know which data point comes
from which component.
• Use EM to iteratively determine the
assignments and the parameters of the
Gaussian components.

Web link: https://lukapopijac.github.io/gaussian-mixture-model/


Soft vs. hard assignments during clustering

Some slides courtesy: https://nakulgopalan.github.io/cs4641/course/20-gaussian-mixture-model.pdf
Gaussian Mixture Models (GMMs)

GMMs are a
generative model of
data.

They model how the


data was generated
from an underlying
model.
Each f is a normal (Gaussian) distribution. The overall data set is generated by sampling from the mixture.
Learning a GMM: Optimizing the likelihood of
generating the data

We want to fit the parameters of the Gaussian mixture model (the mixing fractions and the parameters of the individual Gaussians) given the data.

E-step: associate data points with clusters (compute responsibilities).
M-step: given the responsibilities, optimize the GMM parameters.
EM for GMMs
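A compact EM-for-GMM sketch in Python/numpy, matching the E/M split above; the initialization and fixed number of iterations are illustrative choices, not the lecture's.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, iters=50, seed=0):
    """EM for a Gaussian mixture: soft E-step, closed-form M-step. X has shape (n, d)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)                      # mixing fractions
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(cluster j | point i).
        r = np.column_stack([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                             for j in range(k)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the soft counts.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return weights, means, covs, r

The returned responsibilities r are the soft cluster assignments, in contrast to the hard assignments produced by K-Means.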
GMM Example

Colours indicate cluster membership likelihood.
Example (sequence of figures)

Online demo: https://lukapopijac.github.io/gaussian-mixture-model/
