07 - Bayesian Learning


Lecture 7.

Bayesian Learning
Learning in an uncertain world
Joaquin Vanschoren
Comic: XKCD, Randall Munroe
Bayes' rule
Rule for updating the probability of a hypothesis c given data x:

P(c|x) = P(x|c)P(c) / P(x)

P(c|x) is the posterior probability of class c given data x.
P(c) is the prior probability of class c: what you believed before you saw the data x.
P(x|c) is the likelihood of data point x given that the class is c (computed from your dataset).
P(x) is the prior probability of the data (marginal likelihood): the likelihood of the data under any circumstance (no matter what the class is).

Example: exploding sun
Let's compute the probability that the sun has exploded, given that a (possibly lying) detector says it has.

Prior P(exploded): the sun has an estimated lifespan of 10 billion years, so P(exploded) = 1 / (4.38 x 10^13)
Likelihood that the detector lies: P(lie) = 1/36

P(exploded|yes) = P(yes|exploded) P(exploded) / P(yes)
                = (1 - P(lie)) P(exploded) / [P(exploded)(1 - P(lie)) + P(lie)(1 - P(exploded))]
                = 1 / (1.25226 x 10^12)

The one positive observation of the detector increases the probability, but it remains extremely small.


Example: COVID test
What is the probability of having COVID-19 if a 96% accurate test returns positive? Assume a false positive rate of 4%.
Prior P(C) = 0.015 (117M cases, 7.9B people)
P(TP) = P(pos|C) = 0.96, and P(FP) = P(pos|notC) = 0.04

P(C|pos) = P(pos|C) P(C) / P(pos)
         = P(pos|C) P(C) / [P(pos|C) P(C) + P(pos|notC) (1 - P(C))]
         = (0.96 * 0.015) / (0.96 * 0.015 + 0.04 * 0.985)
         = 0.268

If the test is positive, the posterior P(C|pos) = 0.268 becomes the new prior. After a 2nd positive test: P(C|pos) ≈ 0.9.
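A minimal sketch of this update in Python (using the same numbers as above; the function name is just for illustration):

def posterior(prior, p_pos_given_c, p_pos_given_not_c):
    # Bayes' rule for a positive test result
    evidence = p_pos_given_c * prior + p_pos_given_not_c * (1 - prior)
    return p_pos_given_c * prior / evidence

p1 = posterior(0.015, 0.96, 0.04)   # first positive test  -> ~0.268
p2 = posterior(p1, 0.96, 0.04)      # second positive test -> ~0.90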
Bayesian models
Learn the joint distribution P(x, y) = P(x|y)P(y).
Assumes that the data is Gaussian distributed (!)
For every input x you get P(y|x), hence a mean and standard deviation for y (blue)
For every desired output y you get P(x|y), hence you can sample new points x (red)
Easily updatable with new data using Bayes' rule ('turning the crank')
Previous posterior P(y|x) becomes the new prior P(y)
Generative models
The joint distribution P(x, c) represents the training data for a particular output (e.g. a class)
You can sample a new point x with high predicted likelihood P(x, c): that new point will be very similar to the training points
Generate new (likely) points according to the same distribution: generative model
Generate examples that are fake but correspond to a desired output
Generative neural networks (e.g. GANs) can do this very accurately for text, images, ...
Naive Bayes
Predict the probability that a point belongs to a certain class, using Bayes' Theorem:

P(c|x) = P(x|c)P(c) / P(x)

Problem: since x is a vector, computing P(x|c) can be very complex
Naively assume that all features are conditionally independent from each other, in which case:

P(x|c) = P(x1|c) × P(x2|c) × ... × P(xn|c)

Very fast: only needs to extract statistics from each feature.


On categorical data

What's the probability that your friend will play golf if the weather is sunny?
On numeric data

We need to fit a distribution (e.g. Gaussian) over the data points


GaussianNB: computes the mean μ_c and standard deviation σ_c of the feature values per class:

p(x = v | c) = 1 / sqrt(2π σ_c²) · exp( −(v − μ_c)² / (2σ_c²) )

We can now make predictions using Bayes' theorem:

p(c | x) = p(x|c) p(c) / p(x)
What do the predictions of Gaussian Naive Bayes look like?
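As a rough illustration of the API (toy synthetic data; attribute names may vary slightly between scikit-learn versions), GaussianNB simply stores these per-class statistics:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=200, centers=2, random_state=0)  # toy 2-class data

nb = GaussianNB().fit(X, y)
print(nb.theta_)                # per-class feature means
print(nb.var_)                  # per-class feature variances
print(nb.predict_proba(X[:3]))  # P(c|x) for the first three points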
Other Naive Bayes classifiers:
BernoulliNB
Assumes binary data
Feature statistics: Number of non-zero entries per class
MultinomialNB
Assumes count data
Feature statistics: Average value per class
Mostly used for text classification (bag-of-words data)
Bayesian Networks
What if we know that some variables are not independent?
A Bayesian Network is a directed acyclic graph representing variables as nodes and conditional
dependencies as edges.
If an edge (A, B) connects random variables A and B, then P(B|A) is a factor in the joint
probability distribution. We must know P(B|A) for all values of A and B.

The graph structure can be designed manually or learned (hard!)


Gaussian processes
Model the data as a Gaussian distribution, conditioned on the training points
Probabilistic interpretation of regression
Linear regression (recap):
y = f(xᵢ) = xᵢw + b

For one input feature: y = w₁·x₁ + b·1

We can solve this via linear algebra (closed-form solution): w = (XᵀX)⁻¹Xᵀy

w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))

X is our data matrix with an x₀ = 1 column to represent the bias b:

    ⎡ x₁ᵀ ⎤   ⎡ 1  x₁ ⎤
X = ⎢ x₂ᵀ ⎥ = ⎢ 1  x₂ ⎥
    ⎢  ⋮  ⎥   ⎢ ⋮   ⋮ ⎥
    ⎣ x_Nᵀ ⎦   ⎣ 1  x_N ⎦
Example: Olympic marathon data

We learned: y = w1 x + w0 = −0.013x + 28.895


Polynomial regression (recap)
We can fit a 2nd degree polynomial by using a basis expansion (adding more basis functions):
Φ = [1  x  x²]
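A small sketch of this expansion with numpy (synthetic target values, just to show the shape of the computation):

import numpy as np

x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(30)   # synthetic targets

Phi = np.column_stack([np.ones_like(x), x, x**2])        # basis expansion [1, x, x^2]
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)              # same closed-form solution as before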
Kernelized regression (recap)
We can also kernelize the model and learn a dual coefficient per data point
Probabilistic interpretation
These models do not give us any indication of the (un)certainty of the predictions
Assume that the data is inherently uncertain. This can be modeled explicitly by introducing a slack
variable ϵᵢ, known as noise:

yᵢ = w₁xᵢ + w₀ + ϵᵢ

Assume that the noise is distributed according to a Gaussian distribution with zero mean and
variance σ²:

ϵᵢ ∼ N(0, σ²)

That means that y(x) is now a Gaussian distribution with mean wx and variance σ²:

y ∼ N(wx, σ²)

We now have uncertainty in the predictions, but it is the same for all predictions
You would expect to be more certain nearby your training points
How to learn probabilities?
Maximum Likelihood Estimation (MLE): Maximize P (X|w)

Corresponds to optimizing w, using the (log) likelihood as the loss function
Every prediction has a mean defined by w and Gaussian noise

P(X|w) = ∏ᵢ P(yᵢ | xᵢ; w) = ∏ᵢ N(wxᵢ, σ²I)    (product over all n data points)
Maximum A Posteriori estimation (MAP): Maximize the posterior P (w|X)

This can be done using Bayes' rule after we choose a (Gaussian) prior P (w):
P(w|X) = P(X|w)P(w) / P(X)

Bayesian approach: model the prediction P(y|x_test, X) directly
Marginalize out w: consider all possible models (some are more likely)
If the prior P(w) is Gaussian, then P(y|x_test, X) is also Gaussian!
A multivariate Gaussian with mean μ and covariance matrix Σ:

P(y|x_test, X) = ∫_w P(y|x_test, w) P(w|X) dw = N(μ, Σ)
Gaussian prior P (w)
In the Bayesian approach, we assume a prior (Gaussian) distribution for the parameters: w ∼ N(0, αI)

With zero mean (μ = 0) and covariance matrix αI. For 2D:

αI = ⎡ α  0 ⎤
     ⎣ 0  α ⎦

I.e., every weight wᵢ is drawn from a Gaussian density with variance α: wᵢ ∼ N(0, α)
Sampling from the prior (weight space)
We can sample from the prior distribution to see what form we are imposing on the functions a priori
(before seeing any data).
Draw w (left) independently from a Gaussian density: w ∼ N(0, αI)
Use any normally distributed sampling technique, e.g. the Box-Muller transform
Every sample yields a polynomial function f(x) (right): f(x) = wϕ(x), for example with ϕ(x) being a polynomial basis expansion.
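A minimal sketch of this prior sampling with numpy (the value of α and the polynomial degree are arbitrary choices here):

import numpy as np

alpha, degree, n_samples = 1.0, 3, 5
x = np.linspace(-1, 1, 100)
Phi = np.column_stack([x**d for d in range(degree + 1)])       # polynomial basis [1, x, x^2, x^3]

for _ in range(n_samples):
    w = np.random.normal(0, np.sqrt(alpha), size=degree + 1)   # w ~ N(0, alpha*I)
    f = Phi @ w                                                # one random function from the prior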
Learning Gaussian distributions

We assume that our data is Gaussian distributed: P (y|xtest , X) = N (μ, Σ)

Example with learned mean [m, m] and covariance matrix ⎡ α  β ⎤
                                                       ⎣ β  α ⎦

The blue curve is the predicted P (y|xtest , X)


Understanding covariances
If two variables x₁ and x₂ covary strongly, knowing x₁ tells us a lot about x₂
If the covariance is 0, knowing x₁ tells us nothing about x₂ (the conditional and marginal distributions are the same)

For covariance matrix ⎡ 1  β ⎤ :
                      ⎣ β  1 ⎦
Sampling from higher-dimensional distributions
Instead of sampling w and then multiplying by Φ, we can also generate examples of f(x) directly.
f with n values can be sampled from an n-dimensional Gaussian distribution with zero mean and
covariance matrix K = αΦΦᵀ:

f ∼ N(0, K)

f is a stochastic process: a series of normally distributed variables (interpolated in the plot)
Repeat for 40 dimensions, with the polynomial transform Φ:
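Equivalently, a sketch of sampling f directly (same assumed α and basis as above; the small jitter term is only for numerical stability):

import numpy as np

alpha = 1.0
x = np.linspace(-1, 1, 40)
Phi = np.column_stack([x**d for d in range(4)])

K = alpha * Phi @ Phi.T + 1e-8 * np.eye(len(x))            # covariance in function space
f = np.random.multivariate_normal(np.zeros(len(x)), K)     # one sample f ~ N(0, K)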

More examples of covariances


Noisy functions

We normally add Gaussian noise to obtain our observations:


y = f + ϵ
Gaussian Process
Usually, we want our functions to be smooth: if two points are similar/nearby, the predictions should
be similar.
Hence, we need a similarity measure (a kernel)
In a Gaussian process we can do this by specifying the covariance function directly (not as K = αΦΦᵀ)
The covariance matrix is simply the kernel matrix: f ∼ N(0, K)

The RBF (Gaussian) covariance function (or kernel) is specified by

k(x, x′) = α exp( −‖x − x′‖² / (2ℓ²) )

where ‖x − x′‖² = (x − x′)ᵀ(x − x′) is the squared distance between the two input vectors,
and the length parameter ℓ controls the smoothness of the function and α the vertical variation.
Now the influence of a point decreases smoothly but exponentially
These are our priors P(y) = N(0, K), with mean 0
We now want to condition them on our training data: P(y|x_test, X) = N(μ, Σ)
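A simple sketch of this kernel in numpy (our own toy helper, not a library API; it is reused in the prediction sketch later):

import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, lengthscale=1.0):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return alpha * np.exp(-sq_dists / (2 * lengthscale**2))

X = np.linspace(-3, 3, 50)[:, None]
K = rbf_kernel(X, X)                                       # prior covariance: f ~ N(0, K)
f = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))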
Computing the posterior P (y|X)

Assuming that P(X) is a Gaussian density with a covariance given by kernel matrix K, the model
likelihood becomes:

P(y|X) = P(y)P(X|y) / P(X) = 1 / ( (2π)^(n/2) |K|^(1/2) ) · exp( −½ yᵀ(K + σ²I)⁻¹y )

Hence, the negative log likelihood (the objective function) is given by:

E(θ) = ½ log|K| + ½ yᵀ(K + σ²I)⁻¹y

The model parameters (e.g. noise variance σ²) and the kernel parameters (e.g. lengthscale,
variance) can be embedded in the covariance function and learned from data.
Good news: this loss function can be optimized using linear algebra (Cholesky decomposition)
Bad news: this is cubic in the number of data points AND the number of features: O(n³ d³)
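A sketch of this objective via a Cholesky decomposition (assuming a noisy kernel matrix; constant terms dropped; not the exact code used in the lecture):

import numpy as np

def neg_log_likelihood(K, y, noise_var):
    # 0.5*log|K + s2*I| + 0.5 * y^T (K + s2*I)^{-1} y, up to a constant
    Ky = K + noise_var * np.eye(len(y))
    L = np.linalg.cholesky(Ky)                            # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s2*I)^{-1} y
    return np.sum(np.log(np.diag(L))) + 0.5 * y @ alpha   # log|Ky| = 2*sum(log(diag(L)))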
Making predictions
The model makes predictions for f that are unaffected by future values of f∗.
If we think of f∗ as test points, we can still write down a joint probability density over the training
observations f and the test observations f∗.
This joint probability density will be Gaussian, with a covariance matrix given by our kernel function
k(xᵢ, xⱼ):

⎡ f  ⎤ ∼ N( 0, ⎡ K    K∗  ⎤ )
⎣ f∗ ⎦         ⎣ K∗ᵀ  K∗,∗ ⎦

where K is the kernel matrix computed between all the training points,
K∗ is the kernel matrix computed between the training points and the test points,
K∗,∗ is the kernel matrix computed between all the test points and themselves.
Conditional Density P (y|xtest , X)
Finally, we need to define conditional distributions to answer particular questions of interest
We will need the conditional density for making predictions.

f∗|y ∼ N(μ_f, C_f)

with a mean given by μ_f = K∗ᵀ [K + σ²I]⁻¹ y
and a covariance given by C_f = K∗,∗ − K∗ᵀ [K + σ²I]⁻¹ K∗.

Remember that our prediction consists of the mean and the variance:
P(y|x_test, X) = N(μ, Σ)

The mean is the same as the one computed with kernel ridge (if given the same kernel and
hyperparameters)
The Gaussian process learned the covariance and the hyperparameters
The values on the diagonal of the covariance matrix give us the variance, so we can simply plot the mean
and 95% confidence interval
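Putting the pieces together, a minimal GP-regression sketch using the rbf_kernel defined above (our own toy code, not a library implementation):

import numpy as np

def gp_predict(X_train, y_train, X_test, noise_var=0.1, alpha=1.0, lengthscale=1.0):
    K    = rbf_kernel(X_train, X_train, alpha, lengthscale) + noise_var * np.eye(len(X_train))
    K_s  = rbf_kernel(X_train, X_test, alpha, lengthscale)
    K_ss = rbf_kernel(X_test, X_test, alpha, lengthscale)
    mu   = K_s.T @ np.linalg.solve(K, y_train)          # posterior mean
    cov  = K_ss - K_s.T @ np.linalg.solve(K, K_s)       # posterior covariance
    return mu, cov

# The diagonal of cov gives the per-point variance, so the 95% band is mu +/- 2*sqrt(diag(cov))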
Gaussian Processes in practice (with GPy)
GPyRegression

Generate a kernel first


State the dimensionality of your input data
Variance and lengthscale are optional, default = 1
kernel = GPy.kern.RBF(input_dim=1, variance=1.,
lengthscale=1.)

Other kernels:
GPy.kern.BasisFuncKernel?

Build model:
m = GPy.models.GPRegression(X,Y,kernel)
Matern is a generalized RBF kernel that can scale between RBF and Exponential
Build the untrained GP. The shaded region corresponds to ~95% confidence intervals (i.e. +/- 2 standard deviations)
Train the model (optimize the parameters): maximize the likelihood of the data.
Best to optimize with a few restarts: the optimizer may converge to a high-noise solution. The optimizer is then restarted with a few random initializations of the parameter values.
You can also show results in 2D
We can plot 2D slices using the fixed_inputs argument to the plot function.
fixed_inputs is a list of tuples containing which of the inputs to fix, and to which value.
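Putting these GPy snippets together into one script (a sketch, assuming GPy is installed and using toy data):

import numpy as np
import GPy

X = np.random.uniform(-3., 3., (20, 1))
Y = np.sin(X) + np.random.randn(20, 1) * 0.05            # toy noisy observations

kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
m = GPy.models.GPRegression(X, Y, kernel)

m.optimize_restarts(num_restarts=5)                      # maximize the likelihood, with restarts
m.plot()                                                 # mean and ~95% confidence band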
Gaussian Processes with scikit-learn

GaussianProcessRegressor
Hyperparameters:
kernel : kernel specifying the covariance function of the GP
Default: "1.0 * RBF(1.0)"
Typically leave at default. Will be optimized during fitting
alpha : regularization parameter
Tikhonov regularization of covariance between the training points.
Adds a (small) value to diagonal of the kernel matrix during fitting.
Larger values:
correspond to increased noise level in the observations
also reduce potential numerical issues during fitting
Default: 1e-10
n_restarts_optimizer : number of restarts of the optimizer
Default: 0. Best to do at least a few iterations.
Optimizer finds kernel parameters maximizing log-marginal likelihood
Retrieve predictions and confidence interval after fitting:
y_pred, sigma = gp.predict(x, return_std=True)
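A minimal end-to-end sketch with scikit-learn (toy data; the kernel and alpha values are arbitrary choices):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.random.uniform(-3, 3, (20, 1))
y = np.sin(X).ravel() + 0.05 * np.random.randn(20)

gp = GaussianProcessRegressor(kernel=1.0 * RBF(1.0), alpha=1e-2, n_restarts_optimizer=5)
gp.fit(X, y)

x = np.linspace(-3, 3, 100)[:, None]
y_pred, sigma = gp.predict(x, return_std=True)           # mean and standard deviation per point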
Example
Example with noisy data
Gaussian processes: Conclusions
Advantages:
The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals.
The prediction interpolates the observations (at least for regular kernels).
Versatile: different kernels can be specified.
Disadvantages:
They are typically not sparse, i.e., they use the whole sample/feature information to perform the
prediction.
Sparse GPs also exist: they remember only the most important points
They lose efficiency in high-dimensional spaces, namely when the number of features exceeds a
few dozen.
Gaussian processes and neural networks
You can prove that a Gaussian process is equivalent to a neural network with one layer and an
infinite number of nodes
You can build deep Gaussian Processes by constructing layers of GPs
Bayesian optimization
The incremental updates you can do with Bayesian models allow a more effective way to optimize
functions
E.g. to optimize the hyperparameter settings of a machine learning algorithm/pipeline
After a number of random search iterations we know more about the performance of
hyperparameter settings on the given dataset
We can use this data to train a model, and predict which other hyperparameter values might be
useful
More generally, this is called model-based optimization
This model is called a surrogate model
This is often a probabilistic (e.g. Bayesian) model that predicts confidence intervals for all
hyperparameter settings
We use the predictions of this model to choose the next point to evaluate
With every new evaluation, we update the surrogate model and repeat
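A highly simplified sketch of such a model-based optimization loop (all names are illustrative; a toy upper-confidence-bound acquisition over a fixed candidate grid is used here, and the Expected Improvement acquisition discussed below could be swapped in):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def model_based_optimization(objective, candidates, n_init=5, n_iter=20):
    # Start from a few random evaluations (the 'initialization design')
    X = list(candidates[np.random.choice(len(candidates), n_init, replace=False)])
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        surrogate = GaussianProcessRegressor().fit(np.array(X), np.array(y))
        mu, sigma = surrogate.predict(candidates, return_std=True)
        acquisition = mu + 2 * sigma                     # trade off performance vs. uncertainty
        x_next = candidates[np.argmax(acquisition)]      # next setting to evaluate
        X.append(x_next)
        y.append(objective(x_next))
    return X[int(np.argmax(y))]                          # best configuration found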
Example (see figure):
Consider only 1 continuous hyperparameter (X-axis)
You can also do this for many more hyperparameters
Y-axis shows cross-validation performance
Evaluate a number of random hyperparameter settings (black dots)
Sometimes an initialization design is used
Train a model, and predict the expected performance of other (unseen) hyperparameter values
Mean value (black line) and distribution (blue band)
An acquisition function (green line) trades off maximal expected performance and maximal
uncertainty
Exploitation vs exploration
The optimal value of the acquisition function is the next hyperparameter setting to be evaluated
Repeat a fixed number of times, or until time budget runs out
Shahriari et al. Taking the Human Out of the Loop: A Review of Bayesian Optimization
In 2 dimensions:
Surrogate models
Surrogate model can be anything as long as it can do regression and is probabilistic
Gaussian Processes are commonly used
Smooth, good extrapolation, but don't scale well to many hyperparameters (cubic)
Sparse GPs: select ‘inducing points’ that minimize info loss, more scalable
Multi-task GPs: transfer surrogate models from other tasks
Random Forests
A lot more scalable, but don't extrapolate well
Often an interpolation between predictions is used instead of the raw (step-wise)
predictions
Bayesian Neural Networks:
Expensive, sensitive to hyperparameters
Acquisition Functions
When we have trained the surrogate model, we ask it to predict a number of samples
Can be simply random sampling
Better: Thompson sampling
fit a Gaussian distribution (a mixture of Gaussians) over the sampled points
sample new points close to the means of the fitted Gaussians
Typical acquisition function: Expected Improvement
Models the predicted performance as a Gaussian distribution with the predicted mean and
standard deviation
Computes the expected performance improvement over the previous best configuration X⁺:

EI(X) := E[ max{0, f(X⁺) − f_{t+1}(X)} ]

Computing this expectation requires an integration over the posterior distribution, but it has a
closed-form solution.
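A sketch of the closed-form Expected Improvement for a surrogate that returns a predictive mean and standard deviation (written here for maximization; sign conventions differ between references):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    # E[max(0, f(x) - f_best)] when f(x) ~ N(mu, sigma^2)
    sigma = np.maximum(sigma, 1e-12)                     # avoid division by zero
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)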
Bayesian Optimization: conclusions
More efficient way to optimize hyperparameters
More similar to what humans would do
Harder to parallelize
Choice of surrogate model depends on your search space
Very active research area
For very high-dimensional search spaces, random forests are popular
