07 - Bayesian Learning
Bayesian Learning
Learning in an uncertain world
Joaquin Vanschoren
XKCD, Randall Munroe
Bayes' rule
Rule for updating the probability of a hypothesis c given data x:

$$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$$

P(c) is the prior probability of class c: what you believed before you saw the data x
P(x|c) is the likelihood of data point x given that the class is c (computed from your dataset)
P(x) is the prior probability of the data (marginal likelihood): the likelihood of the data x under any class
$$P(\text{exploded}|\text{yes}) = \frac{P(\text{yes}|\text{exploded})\,P(\text{exploded})}{P(\text{yes})} = \frac{(1 - P(\text{lie}))\,P(\text{exploded})}{P(\text{exploded})(1 - P(\text{lie})) + P(\text{lie})(1 - P(\text{exploded}))} = \frac{1}{1.25226 \times 10^{12}}$$
Example: a test for a condition C has sensitivity P(pos|C) = 0.96, false positive rate P(pos|¬C) = 0.04, and the prior prevalence is P(C) = 0.015. After a positive test:

$$P(C|\text{pos}) = \frac{P(\text{pos}|C)P(C)}{P(\text{pos})} = \frac{P(\text{pos}|C)P(C)}{P(\text{pos}|C)P(C) + P(\text{pos}|\neg C)(1 - P(C))} = \frac{0.96 \times 0.015}{0.96 \times 0.015 + 0.04 \times 0.985} = 0.268$$
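A quick sanity check of this calculation in Python (a minimal sketch; the posterior helper is hypothetical, the numbers are those from the example above):

```python
def posterior(likelihood_pos, prior, false_pos_rate):
    """Bayes' rule for a binary test: P(C|pos)."""
    evidence = likelihood_pos * prior + false_pos_rate * (1 - prior)
    return likelihood_pos * prior / evidence

print(posterior(likelihood_pos=0.96, prior=0.015, false_pos_rate=0.04))  # ~0.268
```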
Bayesian models
Learn the joint distribution P(x, y) = P(x|y)P(y).
Assumes that the data is Gaussian distributed (!)
With every input x you get P(y|x), hence a mean and standard deviation for y (blue)
Easily updatable with new data using Bayes' rule ('turning the crank')
Previous posterior P(y|x) becomes the new prior P(y)
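As an illustration of this updating, here is a minimal sketch for the simplest case: a Gaussian prior on an unknown mean with known noise variance (the values and variable names are illustrative, not from the lecture):

```python
import numpy as np

# Prior belief about an unknown mean: N(mu0, var0)
mu0, var0 = 0.0, 1.0
noise_var = 0.5                      # assumed known observation noise variance

rng = np.random.default_rng(0)
data = rng.normal(2.0, np.sqrt(noise_var), size=20)

mu, var = mu0, var0
for x in data:
    # Conjugate update: the previous posterior acts as the prior for the next point
    var_new = 1.0 / (1.0 / var + 1.0 / noise_var)
    mu = var_new * (mu / var + x / noise_var)
    var = var_new

print(mu, var)   # mean moves toward ~2.0, variance shrinks as data accumulates
```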
Generative models
The joint distribution represents the training data for a particular output (e.g. a class)
You can sample a new point x with high predicted likelihood P(x, c): that new point will be very similar to the training points
Generate new (likely) points according to the same distribution: generative model
Generate examples that are fake but corresponding to a desired output
Generative neural networks (e.g. GANs) can do this very accurately for text, images, ...
Naive Bayes
Predict the probability that a point belongs to a certain class, using Bayes' Theorem
$$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$$
What's the probability that your friend will play golf if the weather is sunny?
On numeric data
Likelihoods are computed per class, assuming a Gaussian distribution per feature:

$$p(x_i|c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}\right)$$

where μ_c and σ_c are the mean and standard deviation of feature x_i within class c.
What do the predictions of Gaussian Naive Bayes look like?
Other Naive Bayes classifiers:
BernoulliNB
Assumes binary data
Feature statistics: Number of non-zero entries per class
MultinomialNB
Assumes count data
Feature statistics: Average value per class
Mostly used for text classification (bag-of-words data)
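A minimal scikit-learn sketch of Gaussian Naive Bayes (the synthetic dataset here is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.theta_)                     # per-class feature means
print(nb.predict_proba(X_test[:3]))  # posterior class probabilities P(c|x)
print(nb.score(X_test, y_test))
```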
Bayesian Networks
What if we know that some variables are not independent?
A Bayesian Network is a directed acyclic graph representing variables as nodes and conditional dependencies as edges.
If an edge (A, B) connects random variables A and B, then P(B|A) is a factor in the joint probability distribution. We must know P(B|A) for all values of B and A.
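A minimal sketch of how such a factorization is used, for a hypothetical chain A → B → C with made-up conditional probability tables:

```python
# P(A), P(B|A), P(C|B) for binary variables, as plain dictionaries (illustrative numbers)
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}  # P_B_given_A[a][b]
P_C_given_B = {True: {True: 0.5, False: 0.5}, False: {True: 0.2, False: 0.8}}  # P_C_given_B[b][c]

def joint(a, b, c):
    # The joint factorizes along the edges: P(A, B, C) = P(A) P(B|A) P(C|B)
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

# Marginal P(C=True), summing the joint over all values of A and B
p_c = sum(joint(a, b, True) for a in (True, False) for b in (True, False))
print(p_c)
```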
For linear regression, we collect the inputs (with a bias column) in a design matrix:

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}$$
Example: Olympic marathon data
$$y_i = w_1 x_i + w_0 + \epsilon_i$$

Assume that the noise is distributed according to a Gaussian distribution with zero mean and variance σ²:

$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

That means that y(x) is now a Gaussian distribution with mean wx and variance σ²:

$$y \sim \mathcal{N}(wx, \sigma^2)$$
We get an uncertainty estimate on our predictions, but it is the same for all predictions
You would expect to be more certain near your training points
How to learn probabilities?
Maximum Likelihood Estimation (MLE): Maximize P (X|w)
$$P(X|w) = \prod_{i=0}^{n} P(y_i|x_i; w) = \prod_{i=0}^{n} \mathcal{N}(w x_i, \sigma^2 I)$$
Maximum A Posteriori estimation (MAP): Maximize the posterior P (w|X)
This can be done using Bayes' rule after we choose a (Gaussian) prior P (w):
$$P(w|X) = \frac{P(X|w)P(w)}{P(X)}$$
Marginalize out w: consider all possible models (some are more likely)
If the prior P(w) is Gaussian, then P(y|x_test, X) is also Gaussian!
We choose a zero-mean Gaussian prior with variance α:

$$w \sim \mathcal{N}(\mathbf{0}, \alpha I), \quad \text{i.e.} \quad w_i \sim \mathcal{N}(0, \alpha)$$
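A minimal numpy sketch of the resulting posterior over the weights, under the assumptions above (Gaussian noise with variance σ² and prior variance α; the data and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x - 1.0 + rng.normal(0, 0.1, size=x.shape)   # synthetic data: w1=2, w0=-1

X = np.column_stack([np.ones_like(x), x])              # design matrix with bias column
alpha, sigma2 = 1.0, 0.1 ** 2                          # prior variance and noise variance

# Posterior over w is Gaussian with covariance S and mean m
S = np.linalg.inv(np.eye(2) / alpha + X.T @ X / sigma2)
m = S @ X.T @ y / sigma2
print(m)                     # close to [w0, w1] = [-1, 2]
print(np.sqrt(np.diag(S)))   # posterior standard deviation of each weight
```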
Sampling from the prior (weight space)
We can sample from the prior distribution to see what form we are imposing on the functions a priori
(before seeing any data).
Draw w (left) independently from a Gaussian density w ∼ N(0, αI)
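In code, this could look as follows (a sketch for the 1D linear model, assuming the bias+slope basis; purely illustrative):

```python
import numpy as np

alpha = 1.0
x = np.linspace(-1, 1, 50)
Phi = np.column_stack([np.ones_like(x), x])      # basis functions: bias + x

rng = np.random.default_rng(1)
for _ in range(5):
    w = rng.normal(0, np.sqrt(alpha), size=2)    # w ~ N(0, alpha I)
    f = Phi @ w                                  # one random linear function from the prior
    print(f[:3])                                 # plot f against x to visualize the prior
```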
If the covariance is 0, knowing x1 tells us nothing about x2 (the conditional and marginal distributions are the same)
Sampling from higher-dimensional distributions
Instead of sampling w and then multiplying by Φ, we can also generate examples of f(x) directly.
f with n values can be sampled from an n-dimensional Gaussian distribution with zero mean and a covariance matrix K given by a kernel function, e.g. the RBF kernel:

$$k(x, x') = \alpha \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)$$

where ‖x − x′‖² = (x − x′)⊤(x − x′) is the squared distance between the two input vectors, the length parameter l controls the smoothness of the function, and α the vertical variation.
Now the influence of a point decreases smoothly but exponentially
These are our priors P(y) = N(0, K), with mean 0
We now want to condition it on our training data:

$$P(y|x_{test}, X) = \mathcal{N}(\mu, \Sigma)$$
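A minimal numpy sketch of sampling functions from this prior, assuming the RBF kernel above (the rbf_kernel helper is hypothetical, and the small jitter term is only for numerical stability):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, length=0.3):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * length^2)), for 1D inputs
    return alpha * np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * length ** 2))

x = np.linspace(0, 1, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))     # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)  # 5 draws from P(y) = N(0, K)
print(samples.shape)                             # (5, 100); plot each row as a function of x
```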
Computing the posterior P (y|X)
Assuming that P(X) is a Gaussian density with a covariance given by kernel matrix K, the model likelihood becomes:

$$P(y|X) = \frac{P(y)\,P(X|y)}{P(X)} = \frac{1}{(2\pi)^{n/2}|K|^{1/2}} \exp\left(-\frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y\right)$$
Hence, the negative log likelihood (the objective function) is given by:
$$E(\theta) = \frac{1}{2}\log|K| + \frac{1}{2} y^\top (K + \sigma^2 I)^{-1} y$$
The model parameters (e.g. noise variance σ²) and the kernel parameters (e.g. lengthscale, variance) can be embedded in the covariance function and learned from data.
Good news: This loss function can be optimized using linear algebra (Cholesky Decomposition)
Bad news: This is cubic in the number of data points AND the number of features: $O(n^3 d^3)$
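A sketch of this objective in numpy using a Cholesky decomposition (a minimal implementation under the assumptions above; it reuses the hypothetical rbf_kernel helper, and sigma2 is the noise variance):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, length=0.3):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * length^2)), for 1D inputs
    return alpha * np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * length ** 2))

def gp_negative_log_likelihood(x, y, alpha, length, sigma2):
    K = rbf_kernel(x, x, alpha, length) + sigma2 * np.eye(len(x))   # K + sigma^2 I
    L = np.linalg.cholesky(K)                    # Cholesky factorization, O(n^3)
    log_det = 2 * np.sum(np.log(np.diag(L)))     # log|K + sigma^2 I|
    v = np.linalg.solve(L, y)                    # y^T (K + sigma^2 I)^{-1} y = v^T v
    return 0.5 * log_det + 0.5 * v @ v
```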
Making predictions
The model makes predictions for f that are unaffected by future values of f*.
If we think of f* as test points, we can still write down a joint probability density over the training values f and the test values f*.
This joint probability density will be Gaussian, with a covariance matrix given by our kernel function,
k(xi , xj ).
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K & K_* \\ K_*^\top & K_{*,*} \end{bmatrix}\right)$$
where K is the kernel matrix computed between all the training points,
K_* is the kernel matrix computed between the training points and the test points,
K_{*,*} is the kernel matrix computed between all the test points and themselves.
Conditional Density P (y|xtest , X)
Finally, we need to define conditional distributions to answer particular questions of interest
We will need the conditional density for making predictions:

$$\mathbf{f}_*|\mathbf{y} \sim \mathcal{N}(\mu_f, C_f)$$
The mean is the same as the one computed with kernel ridge (if given the same kernel and
hyperparameters)
The Gaussian process learned the covariance and the hyperparameters
The values on the diagonal of the covariance matrix give us the variance, so we can simply plot the mean
and 95% confidence interval
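The standard closed-form expressions for this conditional, as a numpy sketch (again a minimal illustration, reusing the hypothetical rbf_kernel helper; values are not tuned):

```python
import numpy as np

def rbf_kernel(X1, X2, alpha=1.0, length=0.3):
    # k(x, x') = alpha * exp(-||x - x'||^2 / (2 * length^2)), for 1D inputs
    return alpha * np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * length ** 2))

def gp_predict(x_train, y_train, x_test, alpha=1.0, length=0.3, sigma2=0.01):
    K = rbf_kernel(x_train, x_train, alpha, length) + sigma2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, alpha, length)        # K_* (train x test)
    K_ss = rbf_kernel(x_test, x_test, alpha, length)        # K_** (test x test)
    K_inv_y = np.linalg.solve(K, y_train)
    mu = K_s.T @ K_inv_y                                    # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)            # posterior covariance
    std = np.sqrt(np.diag(cov))                             # ~95% interval: mu +/- 2 * std
    return mu, std
```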
Gaussian Processes in practice (with GPy)
GPRegression
Other kernels:
GPy.kern.BasisFuncKernel?
Build model:
m = GPy.models.GPRegression(X,Y,kernel)
Matérn is a generalized RBF kernel that can scale between RBF and Exponential
Build the untrained GP. The shaded region corresponds to ~95% confidence intervals (i.e. +/- 2 standard deviations)
Train the model (optimize the parameters): maximize the likelihood of the data.
Best to optimize with a few restarts: the optimizer may converge to a high-noise solution. The optimizer is then restarted with a few random initializations of the parameter values.
You can also show results in 2D
We can plot 2D slices using the fixed_inputs argument to the plot function.
fixed_inputs is a list of tuples containing which of the inputs to fix, and to which value.
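Putting these steps together (a sketch based on the GPy calls mentioned above; X and Y are assumed to be numpy arrays of shape (n, 1), and the data here is synthetic):

```python
import numpy as np
import GPy

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (20, 1))
Y = np.sin(6 * X) + rng.normal(0, 0.1, (20, 1))

kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=0.2)
m = GPy.models.GPRegression(X, Y, kernel)

m.optimize_restarts(num_restarts=5)   # maximize the likelihood, restarting to avoid high-noise optima
print(m)                              # fitted kernel variance, lengthscale, and noise variance
m.plot()                              # posterior mean and ~95% confidence band
```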
Gaussian Processes with scikit-learn
GaussianProcessRegressor
Hyperparameters:
kernel : kernel specifying the covariance function of the GP
Default: "1.0 * RBF(1.0)"
Typically leave at default. Will be optimized during fitting
alpha : regularization parameter
Tikhonov regularization of covariance between the training points.
Adds a (small) value to diagonal of the kernel matrix during fitting.
Larger values:
correspond to increased noise level in the observations
also reduce potential numerical issues during fitting
Default: 1e-10
n_restarts_optimizer : number of restarts of the optimizer
Default: 0. Best to do at least a few iterations.
Optimizer finds kernel parameters maximizing log-marginal likelihood
Retrieve predictions and confidence interval after fitting:
y_pred, sigma = gp.predict(x, return_std=True)
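A minimal scikit-learn sketch tying these hyperparameters together (the data is synthetic and the kernel/alpha values are just examples):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (30, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 30)

gp = GaussianProcessRegressor(kernel=1.0 * RBF(1.0), alpha=0.01,
                              n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)

x = np.linspace(0, 5, 100).reshape(-1, 1)
y_pred, sigma = gp.predict(x, return_std=True)   # posterior mean and standard deviation
print(gp.kernel_)                                # optimized kernel hyperparameters
```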
Example
Example with noisy data
Gaussian processes: Conclusions
Advantages:
The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals.
The prediction interpolates the observations (at least for regular kernels).
Versatile: different kernels can be specified.
Disadvantages:
They are typically not sparse, i.e., they use the whole sample/feature information to perform the
prediction.
Sparse GPs also exist: they remember only the most important points
They lose efficiency in high-dimensional spaces, namely when the number of features exceeds a few dozen.
Gaussian processes and neural networks
You can prove that a Gaussian process is equivalent to a neural network with one infinitely wide hidden layer
You can build deep Gaussian Processes by constructing layers of GPs
Bayesian optimization
The incremental updates you can do with Bayesian models allow a more effective way to optimize
functions
E.g. to optimize the hyperparameter settings of a machine learning algorithm/pipeline
After a number of random search iterations we know more about the performance of
hyperparameter settings on the given dataset
We can use this data to train a model, and predict which other hyperparameter values might be
useful
More generally, this is called model-based optimization
This model is called a surrogate model
This is often a probabilistic (e.g. Bayesian) model that predicts confidence intervals for all
hyperparameter settings
We use the predictions of this model to choose the next point to evaluate
With every new evaluation, we update the surrogate model and repeat
Example (see figure):
Consider only 1 continuous hyperparameter (X-axis)
You can also do this for many more hyperparameters
Y-axis shows cross-validation performance
Evaluate a number of random hyperparameter settings (black dots)
Sometimes an initialization design is used
Train a model, and predict the expected performance of other (unseen) hyperparameter values
Mean value (black line) and distribution (blue band)
An acquisition function (green line) trades off maximal expected performance and maximal uncertainty
Exploitation vs exploration
The optimal value of the acquisition function is the next hyperparameter setting to be evaluated
Repeat a fixed number of times, or until time budget runs out
Shahriari et al. Taking the Human Out of the Loop: A Review of Bayesian Optimization
In 2 dimensions:
Surrogate models
Surrogate model can be anything as long as it can do regression and is probabilistic
Gaussian Processes are commonly used
Smooth, good extrapolation, but don't scale well to many hyperparameters (cubic)
Sparse GPs: select ‘inducing points’ that minimize info loss, more scalable
Multi-task GPs: transfer surrogate models from other tasks
Random Forests
A lot more scalable, but don't extrapolate well
Often an interpolation between predictions is used instead of the raw (step-wise)
predictions
Bayesian Neural Networks:
Expensive, sensitive to hyperparameters
Acquisition Functions
When we have trained the surrogate model, we ask it to predict the performance of a number of candidate points
Can be simply random sampling
Better: Thompson sampling
fit a Gaussian distribution (a mixture of Gaussians) over the sampled points
sample new points close to the means of the fitted Gaussians
Typical acquisition function: Expected Improvement
Models the predicted performance as a Gaussian distribution with the predicted mean and
standard deviation
Computes the expected performance improvement over the previous best configuration
$$EI(X) := \mathbb{E}\left[\max\{0, f(X^+) - f_{t+1}(X)\}\right]$$

where X⁺ is the best configuration found so far.
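A minimal sketch of Expected Improvement under a Gaussian predictive distribution, using its standard closed form for minimization (mu, sigma, and best_f stand for the surrogate's predicted means, predicted standard deviations, and the best observed value; all numbers are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f):
    """Closed-form EI for minimization: improvement = best_f - predicted value."""
    sigma = np.maximum(sigma, 1e-12)             # avoid division by zero
    z = (best_f - mu) / sigma
    return (best_f - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the candidate configuration with the highest expected improvement
mu = np.array([0.30, 0.25, 0.40])       # predicted means for three candidates
sigma = np.array([0.05, 0.10, 0.02])    # predicted standard deviations
print(np.argmax(expected_improvement(mu, sigma, best_f=0.28)))
```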