07 - Bayesian Learning
Bayesian Learning
Learning in an uncertain world
Joaquin Vanschoren
XKCD, Randall Munroe
Bayes' rule
Rule for updating the probability of a hypothesis c given data x:

$$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$$

P(c) is the prior probability of class c: what you believed before you saw the data x
P(x|c) is the likelihood of data point x given that the class is c (computed from your dataset)
P(x) is the prior probability of the data (marginal likelihood): the likelihood of the data x under any class
$$P(\text{exploded}|\text{yes}) = \frac{P(\text{yes}|\text{exploded})\,P(\text{exploded})}{P(\text{yes})} = \frac{(1 - P(\text{lie}))\,P(\text{exploded})}{P(\text{exploded})(1 - P(\text{lie})) + P(\text{lie})(1 - P(\text{exploded}))} = \frac{1}{1.25226 \times 10^{12}}$$
Example: a test for a condition C has sensitivity P(pos|C) = 0.96, false positive rate P(pos|¬C) = 0.04, and the prior prevalence is P(C) = 0.015. After a positive test:

$$P(C|\text{pos}) = \frac{P(\text{pos}|C)P(C)}{P(\text{pos})} = \frac{P(\text{pos}|C)P(C)}{P(\text{pos}|C)P(C) + P(\text{pos}|\neg C)(1 - P(C))} = \frac{0.96 \times 0.015}{0.96 \times 0.015 + 0.04 \times 0.985} = 0.268$$
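A quick sanity check of this calculation in Python (a minimal sketch; the posterior helper is hypothetical, the numbers are those from the example above):

```python
def posterior(likelihood_pos, prior, false_pos_rate):
    """Bayes' rule for a binary test: P(C|pos)."""
    evidence = likelihood_pos * prior + false_pos_rate * (1 - prior)
    return likelihood_pos * prior / evidence

print(posterior(likelihood_pos=0.96, prior=0.015, false_pos_rate=0.04))  # ~0.268
```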
Bayesian models
Learn the joint distribution P(x, y) = P(x|y)P(y).
Assumes that the data is Gaussian distributed (!)
With every input x you get P(y|x), hence a mean and standard deviation for y (blue)
Easily updatable with new data using Bayes' rule ('turning the crank')
Previous posterior P(y|x) becomes the new prior P(y)
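As an illustration of this updating, here is a minimal sketch for the simplest case: a Gaussian prior on an unknown mean with known noise variance (the values and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Prior belief about an unknown mean: N(mu0, var0)
mu0, var0 = 0.0, 1.0
noise_var = 0.5                      # assumed known observation noise variance

rng = np.random.default_rng(0)
data = rng.normal(2.0, np.sqrt(noise_var), size=20)

mu, var = mu0, var0
for x in data:
    # Conjugate update: the previous posterior acts as the prior for the next point
    var_new = 1.0 / (1.0 / var + 1.0 / noise_var)
    mu = var_new * (mu / var + x / noise_var)
    var = var_new

print(mu, var)   # mean moves toward ~2.0, variance shrinks as data accumulates
```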
Generative models
The joint distribution represents the training data for a particular output (e.g. a class)
You can sample a new point x with high predicted likelihood P(x, c): that new point will be very similar to the training points
Generate new (likely) points according to the same distribution: generative model
Generate examples that are fake but corresponding to a desired output
Generative neural networks (e.g. GANs) can do this very accurately for text, images, ...
Naive Bayes
Predict the probability that a point belongs to a certain class, using Bayes' Theorem
$$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$$
What's the probability that your friend will play golf if the weather is sunny?
On numeric data
Likelihoods are computed per class, assuming a Gaussian distribution per feature:

$$p(x_i|c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}\right)$$

where μ_c and σ_c are the mean and standard deviation of feature x_i within class c.
What do the predictions of Gaussian Naive Bayes look like?
Other Naive Bayes classifiers:
BernoulliNB
Assumes binary data
Feature statistics: Number of non-zero entries per class
MultinomialNB
Assumes count data
Feature statistics: Average value per class
Mostly used for text classification (bag-of-words data)
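A minimal scikit-learn sketch of Gaussian Naive Bayes (the synthetic dataset here is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.theta_)                     # per-class feature means
print(nb.predict_proba(X_test[:3]))  # posterior class probabilities P(c|x)
print(nb.score(X_test, y_test))
```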
Bayesian Networks
What if we know that some variables are not independent?
A Bayesian Network is a directed acyclic graph representing variables as nodes and conditional dependencies as edges.
If an edge (A, B) connects random variables A and B, then P(B|A) is a factor in the joint probability distribution. We must know P(B|A) for all values of B and A.
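A minimal sketch of how such a factorization is used, for a hypothetical chain A → B → C with made-up conditional probability tables:

```python
# P(A), P(B|A), P(C|B) for binary variables, as plain dictionaries (illustrative numbers)
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}  # P_B_given_A[a][b]
P_C_given_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}  # P_C_given_B[b][c]

def joint(a, b, c):
    # The joint factorizes along the edges: P(A, B, C) = P(A) P(B|A) P(C|B)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

# Marginal P(C=True), summing the joint over all values of A and B
p_c = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
print(p_c)
```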
For linear regression, we collect the inputs (with a bias column) in a design matrix:

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}$$
Example: Olympic marathon data
$$y_i = w_1 x_i + w_0 + \epsilon_i$$

Assume that the noise is distributed according to a Gaussian distribution with zero mean and variance σ²:

$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

That means that y(x) is now a Gaussian distribution with mean wx and variance σ²:

$$y \sim \mathcal{N}(wx, \sigma^2)$$
We get an uncertainty estimate on our predictions, but it is the same for all predictions
You would expect to be more certain near your training points
How to learn probabilities?
Maximum Likelihood Estimation (MLE): Maximize P (X|w)
$$P(X|w) = \prod_{i=0}^{n} P(y_i|x_i; w) = \prod_{i=0}^{n} \mathcal{N}(w x_i, \sigma^2 I)$$
Maximum A Posteriori estimation (MAP): Maximize the posterior P (w|X)
This can be done using Bayes' rule after we choose a (Gaussian) prior P (w):
$$P(w|X) = \frac{P(X|w)P(w)}{P(X)}$$
Marginalize out w: consider all possible models (some are more likely)
If the prior P(w) is Gaussian, then P(y|x_test, X) is also Gaussian!
We choose a zero-mean Gaussian prior with variance α:

$$w \sim \mathcal{N}(\mathbf{0}, \alpha I), \quad \text{i.e.} \quad w_i \sim \mathcal{N}(0, \alpha)$$
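A minimal numpy sketch of the resulting posterior over the weights, under the assumptions above (Gaussian noise with variance σ² and prior variance α; the data and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x - 1.0 + rng.normal(0, 0.1, size=x.shape)   # synthetic data: w1=2, w0=-1

X = np.column_stack([np.ones_like(x), x])              # design matrix with bias column
alpha, sigma2 = 1.0, 0.1 ** 2                          # prior variance and noise variance

# Posterior over w is Gaussian with covariance S and mean m
S = np.linalg.inv(np.eye(2) / alpha + X.T @ X / sigma2)
m = S @ X.T @ y / sigma2
print(m)                     # close to [w0, w1] = [-1, 2]
print(np.sqrt(np.diag(S)))   # posterior standard deviation of each weight
```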
Sampling from the prior (weight space)
We can sample from the prior distribution to see what form we are imposing on the functions a priori
(before seeing any data).
Draw w (left) independently from a Gaussian density w ∼ N(0, αI)
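In code, this could look as follows (a sketch for the 1D linear model, assuming the bias+slope basis; purely illustrative):

```python
import numpy as np

alpha = 1.0
x = np.linspace(-1, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])      # basis functions: bias + x

rng = np.random.default_rng(1)
for _ in range(5):
    w = rng.normal(0, np.sqrt(alpha), size=2)    # w ~ N(0, alpha I)
    f = Phi @ w                                  # one random linear function from the prior
    print(f[:3])                                 # plot f against x to visualize the prior
```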
If the covariance is 0, knowing x1 tells us nothing about x2 (the conditional and marginal distributions are the same)
Sampling from higher-dimensional distributions
Instead of sampling w and then multiplying by Φ, we can also generate examples of f(x) directly.
f with n values can be sampled from an n-dimensional Gaussian distribution with zero mean and a covariance matrix K given by a kernel function, e.g. the RBF kernel:

$$k(x, x') = \alpha \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)$$

where ‖x − x′‖² = (x − x′)⊤(x − x′) is the squared distance between the two input vectors, the length parameter l controls the smoothness of the function, and α the vertical variation.
Now the influence of a point decreases smoothly but exponentially
These are our priors P(y) = N(0, K), with mean 0
We now want to condition it on our training data:

$$P(y|x_{test}, X) = \mathcal{N}(\mu, \Sigma)$$
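A minimal numpy sketch of sampling functions from this prior, assuming the RBF kernel above (the rbf_kernel helper is hypothetical, and the small jitter term is only for numerical stability):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, length=0.3):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * length^2)), for 1D inputs
    return alpha * np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * length ** 2))

x = np.linspace(0, 1, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)  # 5 draws from P(y) = N(0, K)
print(samples.shape)                             # (5, 100); plot each row as a function of x
```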
Computing the posterior P (y|X)
Assuming that P(X) is a Gaussian density with a covariance given by kernel matrix K, the model likelihood becomes:

$$P(y|X) = \frac{P(y)\,P(X|y)}{P(X)} = \frac{1}{(2\pi)^{n/2}|K|^{1/2}} \exp\left(-\frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y\right)$$
Hence, the negative log likelihood (the objective function) is given by:
$$E(\theta) = \frac{1}{2}\log|K| + \frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y$$
The model parameters (e.g. noise variance σ²) and the kernel parameters (e.g. lengthscale, variance) can be embedded in the covariance function and learned from data.
Good news: This loss function can be optimized using linear algebra (Cholesky Decomposition)
Bad news: This is cubic in the number of data points AND the number of features: $O(n^3 d^3)$
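A sketch of this objective in numpy using a Cholesky decomposition (a minimal implementation under the assumptions above; it reuses the hypothetical rbf_kernel helper, and sigma2 is the noise variance):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, length=0.3):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * length^2)), for 1D inputs
    return alpha * np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * length ** 2))

def gp_negative_log_likelihood(x, y, alpha, length, sigma2):
    K = rbf_kernel(x, x, alpha, length) + sigma2 * np.eye(len(x))   # K + sigma^2 I
    L = np.linalg.cholesky(K)                    # Cholesky factorization, O(n^3)
    log_det = 2 * np.sum(np.log(np.diag(L)))     # log|K + sigma^2 I|
    v = np.linalg.solve(L, y)                    # y^T (K + sigma^2 I)^{-1} y = v^T v
    return 0.5 * log_det + 0.5 * v @ v
```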
Making predictions
The model makes predictions for f that are unaffected by future values of f*.
If we think of f* as test points, we can still write down a joint probability density over the training values f and the test values f*.
This joint probability density will be Gaussian, with a covariance matrix given by our kernel function,
k(xi , xj ).
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K & K_* \\ K_*^\top & K_{*,*} \end{bmatrix}\right)$$
where K is the kernel matrix computed between all the training points,
K_* is the kernel matrix computed between the training points and the test points,
K_{*,*} is the kernel matrix computed between all the test points and themselves.
Conditional Density P (y|xtest , X)
Finally, we need to define conditional distributions to answer particular questions of interest
We will need the conditional density for making predictions:

$$\mathbf{f}_*|\mathbf{y} \sim \mathcal{N}(\mu_f, C_f)$$
The mean is the same as the one computed with kernel ridge (if given the same kernel and
hyperparameters)
The Gaussian process learned the covariance and the hyperparameters
The values on the diagonal of the covariance matrix give us the variance, so we can simply plot the mean
and 95% confidence interval
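The standard closed-form expressions for this conditional, as a numpy sketch (again a minimal illustration, reusing the hypothetical rbf_kernel helper; values are not tuned):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, length=0.3):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * length^2)), for 1D inputs
    return alpha * np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * length ** 2))

def gp_predict(x_train, y_train, x_test, alpha=1.0, length=0.3, sigma2=0.01):
    K = rbf_kernel(x_train, x_train, alpha, length) + sigma2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, alpha, length)        # K_* (train x test)
    K_ss = rbf_kernel(x_test, x_test, alpha, length)        # K_** (test x test)
    K_inv_y = np.linalg.solve(K, y_train)
    mu = K_s.T @ K_inv_y                                    # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)            # posterior covariance
    std = np.sqrt(np.diag(cov))                             # ~95% interval: mu +/- 2 * std
    return mu, std
```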
Gaussian Processes in practice (with GPy)
GPRegression
Other kernels:
GPy.kern.BasisFuncKernel?
Build model:
m = GPy.models.GPRegression(X,Y,kernel)
Matérn is a generalized RBF kernel that can scale between RBF and Exponential
Build the untrained GP. The shaded region corresponds to ~95% confidence intervals (i.e. +/- 2 standard deviations)
Train the model (optimize the parameters): maximize the likelihood of the data.
Best to optimize with a few restarts: the optimizer may converge to a high-noise solution. The optimizer is then restarted with a few random initializations of the parameter values.
You can also show results in 2D
We can plot 2D slices using the fixed_inputs argument to the plot function.
fixed_inputs is a list of tuples containing which of the inputs to fix, and to which value.
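Putting these steps together (a sketch based on the GPy calls mentioned above; X and Y are assumed to be numpy arrays of shape (n, 1), and the data here is synthetic):

```python
import numpy as np
import GPy

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (20, 1))
Y = np.sin(6 * X) + rng.normal(0, 0.1, (20, 1))

kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=0.2)
m = GPy.models.GPRegression(X, Y, kernel)

m.optimize_restarts(num_restarts=5)   # maximize the likelihood, restarting to avoid high-noise optima
print(m)                              # fitted kernel variance, lengthscale, and noise variance
m.plot()                              # posterior mean and ~95% confidence band
```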
Gaussian Processes with scikit-learn
GaussianProcessRegressor
Hyperparameters:
kernel : kernel specifying the covariance function of the GP
Default: "1.0 * RBF(1.0)"
Typically leave at default. Will be optimized during fitting
alpha : regularization parameter
Tikhonov regularization of covariance between the training points.
Adds a (small) value to diagonal of the kernel matrix during fitting.
Larger values:
correspond to increased noise level in the observations
also reduce potential numerical issues during fitting
Default: 1e-10
n_restarts_optimizer : number of restarts of the optimizer
Default: 0. Best to do at least a few iterations.
Optimizer finds kernel parameters maximizing log-marginal likelihood
Retrieve predictions and confidence interval after fitting:
y_pred, sigma = gp.predict(x, return_std=True)
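A minimal scikit-learn sketch tying these hyperparameters together (the data is synthetic and the kernel/alpha values are just examples):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 30)

gp = GaussianProcessRegressor(kernel=1.0 * RBF(1.0), alpha=0.01,
                              n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)

x = np.linspace(0, 5, 100).reshape(-1, 1)
y_pred, sigma = gp.predict(x, return_std=True)   # posterior mean and standard deviation
print(gp.kernel_)                                # optimized kernel hyperparameters
```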
Example
Example with noisy data
Gaussian processes: Conclusions
Advantages:
The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals.
The prediction interpolates the observations (at least for regular kernels).
Versatile: different kernels can be specified.
Disadvantages:
They are typically not sparse, i.e., they use the whole sample/feature information to perform the
prediction.
Sparse GPs also exist: they remember only the most important points
They lose efficiency in high-dimensional spaces, namely when the number of features exceeds a few dozen.
Gaussian processes and neural networks
You can prove that a Gaussian process is equivalent to a neural network with one infinitely wide hidden layer
You can build deep Gaussian Processes by constructing layers of GPs
Bayesian optimization
The incremental updates you can do with Bayesian models allow a more effective way to optimize
functions
E.g. to optimize the hyperparameter settings of a machine learning algorithm/pipeline
After a number of random search iterations we know more about the performance of
hyperparameter settings on the given dataset
We can use this data to train a model, and predict which other hyperparameter values might be
useful
More generally, this is called model-based optimization
This model is called a surrogate model
This is often a probabilistic (e.g. Bayesian) model that predicts confidence intervals for all
hyperparameter settings
We use the predictions of this model to choose the next point to evaluate
With every new evaluation, we update the surrogate model and repeat
Example (see figure):
Consider only 1 continuous hyperparameter (X-axis)
You can also do this for many more hyperparameters
Y-axis shows cross-validation performance
Evaluate a number of random hyperparameter settings (black dots)
Sometimes an initialization design is used
Train a model, and predict the expected performance of other (unseen) hyperparameter values
Mean value (black line) and distribution (blue band)
An acquisition function (green line) trades off maximal expected performance and maximal uncertainty
Exploitation vs exploration
The optimal value of the acquisition function is the next hyperparameter setting to be evaluated
Repeat a fixed number of times, or until time budget runs out
Shahriari et al. Taking the Human Out of the Loop: A Review of Bayesian Optimization
In 2 dimensions:
Surrogate models
Surrogate model can be anything as long as it can do regression and is probabilistic
Gaussian Processes are commonly used
Smooth, good extrapolation, but don't scale well to many hyperparameters (cubic)
Sparse GPs: select ‘inducing points’ that minimize info loss, more scalable
Multi-task GPs: transfer surrogate models from other tasks
Random Forests
A lot more scalable, but don't extrapolate well
Often an interpolation between predictions is used instead of the raw (step-wise)
predictions
Bayesian Neural Networks:
Expensive, sensitive to hyperparameters
Acquisition Functions
When we have trained the surrogate model, we ask it to predict the performance of a number of candidate points
Can be simply random sampling
Better: Thompson sampling
fit a Gaussian distribution (a mixture of Gaussians) over the sampled points
sample new points close to the means of the fitted Gaussians
Typical acquisition function: Expected Improvement
Models the predicted performance as a Gaussian distribution with the predicted mean and
standard deviation
Computes the expected performance improvement over the previous best configuration
$$EI(X) := \mathbb{E}\left[\max\{0, f(X^+) - f_{t+1}(X)\}\right]$$

where X⁺ is the best configuration found so far.
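A minimal sketch of Expected Improvement under a Gaussian predictive distribution, using its standard closed form for minimization (mu, sigma, and best_f stand for the surrogate's predicted means, predicted standard deviations, and the best observed value; all numbers are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f):
    """Closed-form EI for minimization: improvement = best_f - predicted value."""
    sigma = np.maximum(sigma, 1e-12)             # avoid division by zero
    z = (best_f - mu) / sigma
    return (best_f - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the candidate configuration with the highest expected improvement
mu = np.array([0.30, 0.25, 0.40])       # predicted means for three candidates
sigma = np.array([0.05, 0.10, 0.02])    # predicted standard deviations
print(np.argmax(expected_improvement(mu, sigma, best_f=0.28)))
```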