L09 Learning I: Bayesian Learning

COL333/671: Introduction to AI

Semester I, 2024-25

Learning with Probabilities

Rohan Paul

Outline
• Last Class
• CSPs
• This Class
• Bayesian Learning, MLE/MAP, Learning in Probabilistic Models.
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary
reading on topics covered in class from AIMA Ch 20 sections 20.1 – 20.2.4.

Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.

Learning Probabilistic Models
• Models are useful for making optimal decisions.
• Probabilistic models express a theory about the domain and can be used for
decision making.
• How to acquire these models in the first place?
• Solution: data or experience can be used to build these models
• Key question: how to learn from data?
• Bayesian view of learning (learning task itself is probabilistic inference)
• Learning with complete and incomplete data.
• Essentially, rely on counting.
Example: Which candy bag is it?

(Figure: statistics vs. probability.)
Bayesian Learning – in a nutshell
(Figure: a hypothesis variable H with prior P(H) generates i.i.d. observations D1, D2, …, DN, each with likelihood P(d|H).)

In these slides X and d are used interchangeably.

Posterior Probability of Hypothesis given Observations
Now we are getting observations incrementally; how does our belief change? That is, what is the probability of a bag of a certain type given the observations so far?

Bayes Rule

i.i.d. assumption
Posterior Probability of Hypothesis given Observations
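The update on this slide can be written out as follows; a sketch in the standard notation, where α is a normalizing constant and d = d1, …, dN are the observations so far:

$$P(h_i \mid \mathbf{d}) = \alpha \, P(\mathbf{d} \mid h_i)\, P(h_i) = \alpha \, P(h_i) \prod_{j=1}^{N} P(d_j \mid h_i)$$

The product over individual observations is exactly where the i.i.d. assumption enters.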
Incremental Belief Update

The true hypothesis eventually dominates: the probability of indefinitely producing uncharacteristic data tends to 0.
Predictions given Belief over Hypotheses
What is the probability that the next candy is of type lime?

(Figure: predicted probability that the next candy is lime as a function of the number of observations.)
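The prediction itself marginalizes over the hypotheses; a sketch in the same notation:

$$P(X_{N+1} = \text{lime} \mid \mathbf{d}) = \sum_i P(X_{N+1} = \text{lime} \mid h_i)\, P(h_i \mid \mathbf{d})$$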
Bayesian Prediction – Evidence arrives incrementally

Key ideas
• Predictions are a weighted average over the predictions of the individual hypotheses (prediction by model averaging).
• Bayesian prediction eventually agrees with the true hypothesis.
• For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish.
• Why keep all the hypotheses?
• When learning from small data, early commitment to a hypothesis is risky; later evidence may point to a different hypothesis.
• Better accounting of uncertainty in making predictions.
• Problem: may be slow and intractable; we cannot always estimate and marginalize out the hypotheses.
(Figures: the changing belief over hypotheses and the model-averaged prediction as evidence arrives. A numerical sketch follows this list.)
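A minimal Python sketch of the incremental update and model-averaged prediction. The bag types and priors below follow the standard AIMA Ch. 20 candy example (cherry fractions 1.0, 0.75, 0.5, 0.25, 0.0 with priors 0.1, 0.2, 0.4, 0.2, 0.1); if the lecture used different numbers, treat these as assumptions.

# Incremental Bayesian learning for the candy-bag example (assumed AIMA numbers).
cherry_prob = [1.0, 0.75, 0.5, 0.25, 0.0]   # P(cherry | h_i) for the five bag types
prior       = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i)

def update(belief, candy):
    """One incremental Bayes update: belief is P(h_i | observations so far)."""
    likelihood = [p if candy == "cherry" else 1.0 - p for p in cherry_prob]
    unnorm = [b * l for b, l in zip(belief, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lime(belief):
    """P(next candy = lime | observations) by model averaging over hypotheses."""
    return sum(b * (1.0 - p) for b, p in zip(belief, cherry_prob))

belief = prior[:]
for candy in ["lime"] * 10:                 # observe 10 limes in a row
    belief = update(belief, candy)
    print(belief, predict_lime(belief))

Each new candy multiplies the current belief by the per-hypothesis likelihood and renormalizes; the prediction is the weighted average described in the bullets above.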
Marginalization over Hypotheses – challenging!

Ideally, one needs to marginalize or account for all the hypotheses.

Can we pick one good hypothesis and just use that for predictions?
Maximum a-posteriori (MAP) Approximation
P(X|d): the probability of observing new data X, given the evidence d.

Estimate the best hypothesis given the data while incorporating prior knowledge.

What is the probability of a hypothesis given the data?

The prior term says which hypotheses are likelier than others; typically it is related to the number of bits needed to encode the hypothesis.
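A sketch of the MAP rule being described (standard form):

$$h_{MAP} = \arg\max_h P(h \mid \mathbf{d}) = \arg\max_h P(\mathbf{d} \mid h)\, P(h), \qquad P(X \mid \mathbf{d}) \approx P(X \mid h_{MAP})$$

Taking negative logs, maximizing P(d|h)P(h) is the same as minimizing -log P(d|h) - log P(h); the second term is the number of bits needed to encode the hypothesis, which is the MDL reading of the prior mentioned above.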
MAP Vs. Bayesian Estimation

Difference between marginalization (accounting for all hypotheses) vs. committing to a single hypothesis and making predictions from it.
Maximum Likelihood Estimation

Make predictions with the hypothesis that maximizes the data likelihood. Essentially, this assumes a uniform prior, with no preference for one hypothesis over another.

MLE is also called Maximum likelihood (ML) Approximation


Maximum Likelihood Approximation

θ represents the parameters of the probabilistic model.

These parameters define the specific configuration of the hypothesis or model we are using.

θ_ML stands for the Maximum Likelihood Estimate of the parameters: the value of θ that makes the observed data most likely under the model.
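In symbols (a sketch; the sum of logs uses the i.i.d. assumption):

$$\theta_{ML} = \arg\max_\theta P(\mathbf{d} \mid \theta) = \arg\max_\theta \sum_{j=1}^{N} \log P(d_j \mid \theta)$$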
ML Estimation in General: Bernoulli Model
The hypothesis is parameterized by the probability of generating a candy of a specific flavor.

Observed sequence: Cherry, Lime, Lime, Cherry, Cherry, Lime, Cherry, Cherry

This is a similar problem to observing tosses of a biased coin and estimating the bias/fractional parameter.
ML Estimation in General: Estimation for the Bernoulli Model

Observed sequence: Cherry, Lime, Lime, Cherry, Cherry, Lime, Cherry, Cherry

Even in the coin-tossing problem, one would take the fraction of heads (or tails) over the total number of tosses.
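A worked sketch for the sequence above (5 cherries and 3 limes, N = 8). Writing θ for P(cherry), the log-likelihood is c log θ + l log(1 - θ); setting its derivative to zero gives θ = c / (c + l):

$$\hat{\theta}_{\text{cherry}} = \frac{5}{8} = 0.625, \qquad \hat{\theta}_{\text{lime}} = \frac{3}{8} = 0.375$$

This is the same relative-count rule as estimating a coin's bias from its observed heads and tails.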
MAP vs. MLE Estimation
• Maximum likelihood estimate (MLE)
• Estimates the parameters that maximize the data likelihood.
• Relative counts give MLE estimates

• Maximum a posteriori estimate (MAP)


• Bayesian parameter estimation
• Encodes a prior over the parameters (not all parameter values are equally likely a priori).
• Combines the prior and the likelihood while
estimating the parameters.
ML Estimation in General: Learning Parameters
for a Probability Model
• Probabilistic models require
parameters (numbers in the
conditional probability tables).
• We need these values to make
predictions.
• Can we learn these from data (i.e.,
samples from the Bayes Net)?
• How to do this? Counting and averaging.
Can we use samples to estimate the values in the tables?
Learning Parameters for a Probability Model
Classification Problem
• Task: given inputs x, predict labels (classes) y
• Examples:
• Spam detection (input: document,
classes: spam / ham)
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms, classes: diseases)
• Fraud detection (input: account activity, classes: fraud / no fraud)
Bayes Net for Classification
• Input: images / pixel grids
• Output: a digit 0-9
• Setup:
• Get a large collection of example images, each labeled with a digit
• Note: someone has to hand-label all this data!
• Want to learn to predict labels of new, future digit images
• Features: the attributes used to make the digit decision
• Pixels: (6,8)=ON
• Shape patterns: NumComponents, AspectRatio, NumLoops
• …
(Figure: example digit images with labels 0, 1, 2, 1; one example is not clear.)
Bayes Net for Classification
• Naïve Bayes: assume all features are independent effects of the label
• Simple digit recognition:
• One feature (variable) Fij for each grid position <i,j>
• Feature values are on/off, based on whether the intensity is more or less than 0.5 in the underlying image
• Each input maps to a feature vector
(Figure: Bayes net with the label Y as parent of the features F1, F2, …, Fn.)
Parameter Estimation
• Need the estimates of local conditional probability tables.
• P(Y), the prior over labels
• P(Fi|Y) for each feature (evidence variable)
• These probabilities are collectively called the parameters of the model and denoted by θ.
• Till now, the table values were provided.
• Now, use data to acquire these values.
Parameter Estimation

• P(Y) – how frequent is the class type, e.g., digit 3?
• If you take a sample of images of digits, how frequent is this digit?
• P(Fi|Y) – for digit 3, what fraction of the time is the cell on?
• Conditioned on the class type, how frequent is the feature?
• Use relative frequencies from the data to estimate these values (a counting sketch follows this list).
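A minimal counting sketch in Python. The data format (a list of (features, label) pairs with binary features) and the optional pseudo-count k are illustrative assumptions; k = 0 gives the plain relative-frequency estimates, while k > 0 anticipates the Laplace smoothing discussed later.

from collections import Counter, defaultdict

def estimate_parameters(data, k=0):
    """data: list of (features, label) pairs, features a dict {name: 0/1}.
    Returns P(Y) and P(F=1 | Y) estimated by relative counts (plus k pseudo-counts)."""
    labels = sorted({y for _, y in data})
    feature_names = sorted({f for feats, _ in data for f in feats})
    label_counts = Counter(y for _, y in data)
    n = len(data)
    # Prior: P(Y = y) = count(y) / N, with optional smoothing.
    prior = {y: (label_counts[y] + k) / (n + k * len(labels)) for y in labels}
    # Conditionals: P(F = 1 | Y = y) = count(F = 1 and Y = y) / count(y).
    on_counts = defaultdict(Counter)
    for feats, y in data:
        for f, v in feats.items():
            on_counts[y][f] += v
    cond = {y: {f: (on_counts[y][f] + k) / (label_counts[y] + 2 * k)
                for f in feature_names} for y in labels}
    return prior, cond

# Tiny illustrative usage with made-up digit features:
data = [({"pixel(6,8)": 1, "loop": 1}, 3), ({"pixel(6,8)": 0, "loop": 1}, 3),
        ({"pixel(6,8)": 0, "loop": 0}, 1)]
prior, cond = estimate_parameters(data)
print(prior[3], cond[3]["pixel(6,8)"])   # 2/3 and 1/2 with k = 0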
Parameter Estimation: Complete Data

Note: the data is “complete”. Each data point has observed values for all the variables in the model.
Parameter Estimation
Problem: values not seen in the training data

If a feature value was not seen in the training data, the likelihood goes to zero.
Not seeing a feature value in the training data does not mean we will never see it at test time. Essentially, this is overfitting to the training data set.
Laplace Smoothing
• Pretend that every outcome occurs once more than it is observed (e.g., for the coin tosses H H T).

• If certain counts are not seen in training, that does not mean they have zero probability of occurring in the future.

• Another version of Laplace smoothing:
• instead of adding 1, add k to each count
• k is an adjustable parameter.

• Essentially, this encodes a prior (pseudo-counts). The formula is sketched after this list.
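The smoothed estimate referred to above can be sketched as (with |X| possible outcomes and k pseudo-counts per outcome):

$$P_{\mathrm{LAP},k}(x) = \frac{\mathrm{count}(x) + k}{N + k\,|X|}$$

For the coin tosses H, H, T and k = 1: P(H) = (2 + 1) / (3 + 2) = 3/5 and P(T) = (1 + 1) / (3 + 2) = 2/5, instead of the unsmoothed 2/3 and 1/3.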
Learning Multiple Parameters
• Estimate latent parameters
using MLE.
• There are two CPTs in this
example.
• Observations are of both
variables: Flavor and
Wrapper.
• Take log likelihood.
Learning Multiple Parameters
• Maximize the data likelihood to estimate the parameters.

Maximum likelihood parameter learning with complete data for a Bayes net decomposes into separate learning problems, one for each parameter.
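A sketch of why the problem decomposes, assuming the AIMA-style candy model with P(F = cherry) = θ, P(W = red | cherry) = θ1, P(W = red | lime) = θ2 (this parameterization is an assumption about the slide's figure). With c, l the flavor counts and r_c, g_c, r_l, g_l the red/green wrapper counts within each flavor:

$$\log P(\mathbf{d} \mid \theta, \theta_1, \theta_2) = \big[c\log\theta + l\log(1-\theta)\big] + \big[r_c\log\theta_1 + g_c\log(1-\theta_1)\big] + \big[r_l\log\theta_2 + g_l\log(1-\theta_2)\big]$$

Each bracket involves a single parameter, so maximizing gives θ = c/(c+l), θ1 = r_c/(r_c+g_c), θ2 = r_l/(r_l+g_l): each CPT entry is estimated from its own relative counts.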
How to learn the structure of the Bayes Net?
• Problem: Estimate/learn the structure
of the model
• Set up a search process (e.g., local search, hill climbing).
• For each structure, learn the
parameters.
• How to score a solution?
• Use Max. likelihood estimation.
• Penalize complexity of the structure
(don’t want a fully connected
model).
• Additionally check for validity of the
conditional independences.
Parameter Learning when some variables are
not observed
• If we knew the missing value for B, then we could estimate the CPTs (conditional probability tables).

• If we knew the CPTs, then we could infer the probability of the missing value of B.

• It is a chicken-and-egg problem. The data is incomplete: one sample has (A = 1, B = ?, C = 0).
Expectation Maximization
• Initialization
• Initialize CPT parameter values (ignoring missing information)
• Expectation
• Compute expected values of the unobserved variables assuming the current parameter values.
• Involves Bayes net inference (exact or approximate).
• Maximization
• Compute new parameters (of the CPTs) to maximize the probability of the data (observed and estimated).
• Alternate the E and M steps until convergence. Convergence (to a local optimum of the likelihood) is guaranteed.
Expectation Maximization
EM Example
Problem: learning the parameters of a Bayes
Net that models ratings given by reviewers.

We postulate that the ratings (1 or 2) are conditioned on the “genre” or “type” of the movie (Comedy or Drama).

As observations, we only see the ratings given by the reviewers.

Apply EM to learn the parameters.

Reviewers rate individually (their CPTs are


assumed to be the same).

Slide adapted from Dorsa Sadigh and Percy Liang


What objective are we optimizing in EM?
Maximum Marginal Likelihood
Latent Variables are variables in a model that
are not directly observed in the data but are
inferred through relationships with observed
variables.
In this example, G (genre of the movie) acts as
a latent variable because it influences the
observed ratings, but its value might not be
directly provided in the data.
Latent Vectors typically refer to multi-
dimensional representations of these latent
variables, but in this context, they simply mean
the possible values or states of G that we sum
over to compute the marginal likelihood in the
EM objective.

Marginalize over the latent variables in the likelihood.

Slide adapted from Dorsa Sadigh and Percy Liang
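A sketch of the objective in this example's notation (observed ratings r_j, latent genre G, parameters θ):

$$\max_\theta \sum_{j=1}^{N} \log P(R = r_j \mid \theta) = \max_\theta \sum_{j=1}^{N} \log \sum_{g} P(G = g,\, R = r_j \mid \theta)$$

The inner sum marginalizes out the latent variable; EM ascends this marginal likelihood without ever observing G.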


E and M steps

In the E-step, EM estimates the probabilities of the hidden (latent) variables given the observed data and the current parameters. This is computed for every value of h and for each setting of the evidence variables.
In the M-step, it updates the model parameters to maximize the likelihood of the observed data, given these estimated probabilities. The estimated (fractional) data points from the E-step are used to update the CPTs.

Slide adapted from Dorsa Sadigh and Percy Liang


EM: Estimating and using weighted samples

Estimated Fractional samples

(g=c, r1=2, r2=2) prob: 0.69


(g=d, r1=2, r2=2) prob: 0.31
(g=c, r1=1, r2=2) prob: 0.5
(g=d, r1=1, r2=2) prob: 0.5

Revising probabilities based on fractional samples.

The CPTs for the two reviewers are the same.
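A minimal sketch of how such fractional weights can arise, assuming illustrative CPT values p(g=c) = 0.5 and p(r=2|c) = 0.6, p(r=2|d) = 0.4. These specific numbers are an assumption consistent with the 0.69/0.31 and 0.5/0.5 splits shown; the lecture's actual values may differ.

# E-step for single data points under assumed CPTs (same CPT for both reviewers).
p_c = 0.5                      # prior P(g = comedy); assumed value
p_r2 = {"c": 0.6, "d": 0.4}    # P(r = 2 | g); assumed values

def posterior_comedy(r1, r2):
    """P(g = c | r1, r2) by Bayes rule over the two observed ratings."""
    def lik(g, r):
        return p_r2[g] if r == 2 else 1.0 - p_r2[g]
    joint_c = p_c * lik("c", r1) * lik("c", r2)
    joint_d = (1.0 - p_c) * lik("d", r1) * lik("d", r2)
    return joint_c / (joint_c + joint_d)

print(posterior_comedy(2, 2))  # ~0.69 -> the (g=c, r1=2, r2=2) fractional sample
print(posterior_comedy(1, 2))  # 0.5   -> the (g=c, r1=1, r2=2) fractional sample

In the M-step these weights act as fractional counts when the CPTs are re-estimated.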


Related Topic: Clustering
Example: Clustering images in a database
Clustering is subjective
Clustering is based on a distance metric

Clustering depends on the distance function used: Euclidean distance? Edit distance? …
K-Means Clustering

A GMM yields a probability distribution over the cluster assignment for each point, whereas K-Means gives a single hard assignment.

GMM: Gaussian Mixture Model

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
K-Means Clustering Algorithm
What objective is K-Means optimizing?

K-Means reduces the distortion metric with every step and converges (to a local minimum of the distortion).

(Figure: data points being re-clustered over Iteration I, Iteration II, Iteration III.)

How to pick “k”?

A suitable number of clusters (k) can be identified by comparing the distortion metric for different values of k.
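A minimal K-Means sketch in Python/numpy (the random initialization and the stopping rule below are simple illustrative choices):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate hard assignments and centroid updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize from data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    distortion = float(((X - centers[labels]) ** 2).sum())  # the objective being reduced
    return centers, labels, distortion

Running this for several values of k and comparing the returned distortion is one way to pick k, as suggested above.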
K-Means Application: Segmentation
The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance.

Apply K-Means in the colour space.


EM in Continuous Space: Gaussian Mixture
Modeling
• Problem: a clustering task where we want to discern multiple categories in a collection of given points.
• Assume a mixture of components
(Gaussian)
• Don’t know which data point comes
from which component.
• Use EM to iteratively determine the
assignments and the parameters of the
Gaussian components.

Web link: https://lukapopijac.github.io/gaussian-mixture-model/


Soft vs. hard assignments during clustering

Some slides courtesy: https://nakulgopalan.github.io/cs4641/course/20-gaussian-mixture-model.pdf
Gaussian Mixture Models (GMMs)

GMMs are a
generative model of
data.

They model how the


data was generated
from an underlying
model.
Each f is a normal (Gaussian) distribution. The overall data set is generated by sampling from the mixture.
Learning a GMM: Optimizing the likelihood of
generating the data

We want to fit the parameters of the Gaussian mixture model (the mixing fractions and the parameters of the individual Gaussians) given the data.

E-step: associate data points with clusters (compute responsibilities).
M-step: given the responsibilities, optimize the GMM parameters.
EM for GMMs
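A compact EM-for-GMM sketch in Python/numpy, matching the E/M split above; the initialization and fixed number of iterations are illustrative choices, not the lecture's.

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, iters=50, seed=0):
    """EM for a Gaussian mixture: soft E-step, closed-form M-step. X has shape (n, d)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)                      # mixing fractions
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(cluster j | point i).
        r = np.column_stack([weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
                             for j in range(k)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances from the soft counts.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return weights, means, covs, r

The returned responsibilities r are the soft cluster assignments, in contrast to the hard assignments produced by K-Means.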
GMM Example

Colours indicate cluster membership likelihood.
Example (sequence of figures)

Online demo: https://lukapopijac.github.io/gaussian-mixture-model/
