PRML Slides 3
Geometry of Least Squares
Consider the N-dimensional space whose axes are the target values t_1, …, t_N, so that t is a vector in this space. Each basis function φ_j, evaluated at the N data points, is also a vector in this space, and together these vectors span an M-dimensional subspace S.
w_ML minimizes the distance between t and its orthogonal projection onto S, i.e. y = Φ w_ML.
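As a quick numerical check of this picture, here is a minimal NumPy sketch (the sinusoidal data and polynomial basis are illustrative assumptions, not from the slides): the least-squares solution makes y = Φ w_ML the orthogonal projection of t onto the column space of Φ, so the residual t − y is orthogonal to every column of Φ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy sinusoid (illustrative choice).
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Design matrix Phi (N x M) for a cubic polynomial basis (M = 4).
Phi = np.vander(x, 4, increasing=True)

# Maximum-likelihood (least-squares) solution: w_ML = (Phi^T Phi)^{-1} Phi^T t.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# y is the orthogonal projection of t onto the subspace S spanned by the columns of Phi.
y = Phi @ w_ml
residual = t - y

# The residual is orthogonal to S: Phi^T (t - y) ~ 0.
print(np.allclose(Phi.T @ residual, 0.0, atol=1e-8))
```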
Sequential Learning
Data items considered one at a time (a.k.a. online learning); use stochastic (sequential) gradient descent:
w^(τ+1) = w^(τ) − η ∇E_n.
For the sum-of-squares error this gives w^(τ+1) = w^(τ) + η (t_n − w^(τ)T φ_n) φ_n, known as least-mean-squares (LMS).
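A minimal sketch of this update for the sum-of-squares error (the basis functions, learning rate η = 0.1, and data stream below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # Simple polynomial basis (illustrative): [1, x, x^2, x^3].
    return np.array([1.0, x, x**2, x**3])

w = np.zeros(4)          # initial weights
eta = 0.1                # learning rate (chosen by hand)

# Stream data one item at a time and apply the stochastic gradient step
# for the sum-of-squares error: w <- w + eta * (t_n - w^T phi_n) * phi_n.
for _ in range(50):
    x_n = rng.uniform(0, 1)
    t_n = np.sin(2 * np.pi * x_n) + 0.1 * rng.standard_normal()
    phi_n = phi(x_n)
    w += eta * (t_n - w @ phi_n) * phi_n

print(w)
```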
Regularized Least Squares (1)
Consider the error function E_D(w) + λ E_W(w), where λ is called the regularization coefficient. With a sum-of-squares error and a quadratic regularizer,
E(w) = ½ Σ_n (t_n − w^T φ(x_n))² + (λ/2) w^T w,
which is minimized by
w = (λ I + Φ^T Φ)^{-1} Φ^T t.
Regularized Least Squares (2)
With a more general regularizer, we have
E(w) = ½ Σ_n (t_n − w^T φ(x_n))² + (λ/2) Σ_j |w_j|^q.
q = 1 gives the lasso; q = 2 gives the quadratic regularizer.
Regularized Least Squares (3)
Lasso tends to generate sparser solutions than a quadratic regularizer.
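A small comparison sketch using scikit-learn (the data, polynomial design matrix, and regularization strengths are illustrative assumptions): with comparable regularization, the lasso drives several coefficients exactly to zero, while the quadratic (ridge) regularizer only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)

# Noisy sinusoid and a 10th-order polynomial design matrix (illustrative).
N = 30
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)
Phi = np.vander(x, 11, increasing=True)

lasso = Lasso(alpha=1e-3, max_iter=50_000).fit(Phi, t)   # q = 1 regularizer
ridge = Ridge(alpha=1e-3).fit(Phi, t)                    # q = 2 regularizer

print("lasso zero coefficients:", np.sum(np.isclose(lasso.coef_, 0.0)))
print("ridge zero coefficients:", np.sum(np.isclose(ridge.coef_, 0.0)))
```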
Multiple Outputs (1)
Analogously to the single-output case we have
y(x, W) = W^T φ(x),   p(t | x, W, β) = N(t | W^T φ(x), β^{-1} I),
where y is a K-dimensional column vector, W is an M × K matrix of parameters, and φ(x) is a vector of M basis functions.
The Bias-Variance Decomposition (5)
Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
The Bias-Variance Decomposition (6)
Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
The Bias-Variance Decomposition (7)
Example: 25 data sets from the sinusoidal, varying the degree of regularization, λ.
The Bias-Variance Trade-off
From these plots, we note that an over-regularized model (large λ) will have a high bias, while an under-regularized model (small λ) will have a high variance.
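A NumPy sketch of the experiment behind these plots (the Gaussian basis, noise level, and the particular λ values are illustrative assumptions): fit 25 sinusoidal data sets for each λ with the regularized least-squares solution and estimate (bias)² and variance of the resulting predictions.

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_basis(x, centres, s=0.1):
    # Design matrix with a bias column plus Gaussian basis functions.
    return np.column_stack(
        [np.ones_like(x)] + [np.exp(-(x - c) ** 2 / (2 * s ** 2)) for c in centres]
    )

L, N = 25, 25                       # 25 data sets of 25 points each
x_test = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_test)      # true regression function
centres = np.linspace(0, 1, 9)
Phi_test = gaussian_basis(x_test, centres)

for lam in [10.0, 0.1, 0.001]:      # illustrative regularization coefficients
    preds = []
    for _ in range(L):
        x = rng.uniform(0, 1, N)
        t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
        Phi = gaussian_basis(x, centres)
        # Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t.
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(Phi_test @ w)
    preds = np.array(preds)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - h) ** 2)          # (bias)^2
    variance = np.mean(preds.var(axis=0))          # variance
    print(f"lambda={lam:6.3f}  (bias)^2={bias2:.4f}  variance={variance:.4f}")
```

Large λ gives high (bias)² and low variance; small λ gives the reverse.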
Bayesian Linear Regression (1)
Define a conjugate prior over w: p(w) = N(w | m_0, S_0).
Combining this with the likelihood and using results for marginal and conditional Gaussians gives the posterior
p(w | t) = N(w | m_N, S_N),
where
m_N = S_N (S_0^{-1} m_0 + β Φ^T t),   S_N^{-1} = S_0^{-1} + β Φ^T Φ.
Bayesian Linear Regression (2)
A common choice for the prior is p(w) = N(w | 0, α^{-1} I), for which
m_N = β S_N Φ^T t,   S_N^{-1} = α I + β Φ^T Φ.
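A minimal sketch of this posterior computation (the data, Gaussian basis, and the values α = 2, β = 25 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative settings: precision of the prior (alpha) and of the noise (beta).
alpha, beta = 2.0, 25.0

# Toy sinusoidal data and a design matrix of 9 Gaussian basis functions plus a bias.
N = 10
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
centres = np.linspace(0, 1, 9)
Phi = np.column_stack(
    [np.ones_like(x)] + [np.exp(-(x - c) ** 2 / (2 * 0.1 ** 2)) for c in centres]
)

# Posterior over w for the prior p(w) = N(w | 0, alpha^{-1} I):
#   S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t.
S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

print(m_N)
```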
Predictive Distribution (2)
Example: Sinusoidal data, 9 Gaussian basis functions,
1 data point
Predictive Distribution (3)
Example: Sinusoidal data, 9 Gaussian basis functions,
2 data points
Predictive Distribution (4)
Example: Sinusoidal data, 9 Gaussian basis functions,
4 data points
Predictive Distribution (5)
Example: Sinusoidal data, 9 Gaussian basis functions,
25 data points
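The predictive distribution behind these plots is p(t | x, t) = N(t | m_N^T φ(x), σ_N²(x)) with σ_N²(x) = 1/β + φ(x)^T S_N φ(x). Below is a sketch for the 9-Gaussian-basis sinusoidal example (α, β, the basis width, and the data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0              # illustrative precisions

def design(x, centres, s=0.1):
    # 9 Gaussian basis functions evaluated at x (rows: points, columns: basis).
    x = np.atleast_1d(x)
    return np.column_stack([np.exp(-(x - c) ** 2 / (2 * s ** 2)) for c in centres])

centres = np.linspace(0, 1, 9)

# Sinusoidal training data (here: 4 points, as in one of the panels).
N = 4
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
Phi = design(x, centres)

# Posterior m_N, S_N with prior N(0, alpha^{-1} I).
S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive mean and variance over a grid:
#   mean = m_N^T phi(x),  var = 1/beta + phi(x)^T S_N phi(x).
x_grid = np.linspace(0, 1, 100)
Phi_grid = design(x_grid, centres)
pred_mean = Phi_grid @ m_N
pred_var = 1.0 / beta + np.sum(Phi_grid @ S_N * Phi_grid, axis=1)

print(pred_mean[:3], np.sqrt(pred_var[:3]))
```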
Equivalent Kernel (1)
The predictive mean can be written
y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_n k(x, x_n) t_n,
a weighted sum of the training targets, where k(x, x') = β φ(x)^T S_N φ(x') is known as the equivalent kernel or smoother matrix.
Examples of the equivalent kernel for polynomial and sigmoidal basis functions.
Equivalent Kernel (4)
The kernel as a covariance function: consider
cov[y(x), y(x')] = cov[φ(x)^T w, w^T φ(x')] = φ(x)^T S_N φ(x') = β^{-1} k(x, x').
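A short sketch checking the equivalent-kernel view numerically (same illustrative assumptions as before, Gaussian basis and hand-picked α, β): the predictive mean at x equals the weighted sum Σ_n k(x, x_n) t_n.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0                          # illustrative precisions

def phi(x, centres, s=0.1):
    # Vector of 9 Gaussian basis functions evaluated at a single point x.
    return np.exp(-(np.asarray(x) - centres) ** 2 / (2 * s ** 2))

centres = np.linspace(0, 1, 9)
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
Phi = np.stack([phi(x, centres) for x in x_train])

S_N = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

def k(x, x_prime):
    # Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x').
    return beta * phi(x, centres) @ S_N @ phi(x_prime, centres)

x0 = 0.3
mean_direct = m_N @ phi(x0, centres)
mean_kernel = sum(k(x0, xn) * tn for xn, tn in zip(x_train, t_train))
print(np.isclose(mean_direct, mean_kernel))      # True: weighted sum of targets
```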
Bayesian Model Comparison (1)
How do we choose the right model?
Assume we want to compare models M_i, i = 1, …, L, using data D; this requires computing the posterior
p(M_i | D) ∝ p(M_i) p(D | M_i),
where p(D | M_i) is the model evidence (or marginal likelihood).
Bayesian Model Comparison (4)
For a given model with a single parameter, w, consider the approximation
p(D) = ∫ p(D | w) p(w) dw ≃ p(D | w_MAP) Δw_posterior / Δw_prior,
where the posterior is assumed to be sharply peaked. Taking logarithms,
ln p(D) ≃ ln p(D | w_MAP) + ln(Δw_posterior / Δw_prior),
where the second term is negative, since Δw_posterior < Δw_prior.
Maximizing the Evidence Function (1)
To maximize ln p(t | α, β) w.r.t. α and β, define the eigenvector equation (β Φ^T Φ) u_i = λ_i u_i. Thus A = α I + β Φ^T Φ has eigenvalues λ_i + α.
Maximizing the Evidence Function (2)
We can now differentiate ln p(t | α, β) w.r.t. α and β, and set the results to zero, to get
α = γ / (m_N^T m_N),   1/β = (1/(N − γ)) Σ_n (t_n − m_N^T φ(x_n))²,
where γ = Σ_i λ_i / (α + λ_i).
In the accompanying figure, w_1 is not well determined by the likelihood, whereas w_2 is well determined by the likelihood.
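A sketch of the resulting fixed-point re-estimation of α and β (the data, basis, and starting values are illustrative assumptions): alternately compute γ and m_N, then update α and β.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy sinusoidal data with a Gaussian-basis design matrix (illustrative).
N = 25
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(N)
centres = np.linspace(0, 1, 9)
Phi = np.column_stack(
    [np.ones_like(x)] + [np.exp(-(x - c) ** 2 / (2 * 0.1 ** 2)) for c in centres]
)

alpha, beta = 1.0, 1.0                      # arbitrary starting values
eigvals = np.linalg.eigvalsh(Phi.T @ Phi)   # eigenvalues of Phi^T Phi

for _ in range(100):
    lam = beta * eigvals                    # eigenvalues of beta * Phi^T Phi
    gamma = np.sum(lam / (alpha + lam))     # effective number of well-determined parameters
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    alpha = gamma / (m_N @ m_N)                          # alpha = gamma / m_N^T m_N
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)    # 1/beta = sum of squared residuals / (N - gamma)

print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.2f}")
```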