Gaussian Processes For Regression: A Tutorial
José Melo
Faculty of Engineering, University of Porto
FEUP - Department of Electrical and Computer Engineering
Rua Dr. Roberto Frias, s/n 4200-465
Porto, PORTUGAL
jose.melo@fe.up.pt
Abstract

Gaussian processes are a powerful, non-parametric tool that can be used in supervised learning, namely in regression but also in classification problems. The main advantages of this method are the ability of GPs to provide uncertainty estimates and to learn the noise and smoothness parameters from the training data. The aim of this short tutorial is to present the basic theoretical aspects of Gaussian Processes, as well as a brief practical overview of their implementation.

The main motivation for this work was to develop a new approach to detect outliers in the acoustic navigation algorithms of Autonomous Underwater Vehicles, one capable of adjusting to different operation scenarios, since this is a major problem in the majority of Autonomous Underwater Vehicles. The last part of the tutorial gives a brief insight into this problem and the proposed solution, which involves Gaussian Processes as a predictor together with some background subtraction techniques.

1. Introduction

In the machine learning context, supervised learning is concerned with inferring the values of one or more outputs, or response variables, for a given set of inputs, or predictor variables, that have not yet been observed [4]. These predictions are based on training samples of previously solved cases. Depending on whether the output is continuous or discrete, we talk about regression or classification problems, respectively. Traditional approaches to this kind of problem usually consist of parametric models, in which the behaviour of the data is described by a previously defined model whose parameters are learned from the training data. By adjusting these parameters, it is possible to fit the model to the data. Once this is done, it is straightforward to use the model to predict the output when new inputs are provided. Both linear and non-linear regression techniques have been extensively used for this purpose, with different estimation techniques to fit the data, namely several flavours of least-squares algorithms, ridge regression, etc.

Despite all the advantages of these traditional regression techniques, all of them require assumptions about the smoothness of our model. While incorporating prior knowledge that correctly describes the evolution of the data can be of great value, sometimes this information is simply not available. And using a model that does not correctly characterize the data is likely to lead to poor results.

A completely different approach is given by Gaussian Processes, which set aside the parametric model viewpoint and instead define a prior probability distribution over all possible functions directly [3]. This paper focuses on introducing the use of Gaussian Processes in regression, and is organised as follows. In Section 2 the basic principles of Gaussian Processes are given. In Section 3 prediction with Gaussian Processes is derived, and learning with Gaussian Processes is covered in Section 4. In Section 5 we describe the application and in Section 6 the results obtained. In the end we present some conclusions and future work directions.

2. Gaussian Processes

Gaussian Processes (GPs) are a powerful non-parametric technique with explicit uncertainty models, finding use mainly in regression and classification problems. They are called non-parametric because, instead of trying to fit the parameters of a selected set of basis functions, GPs rather try to infer how all the measured data are correlated.

A GP is, by definition, a collection of random variables with the property that the joint distribution of any of its subsets is a joint Gaussian distribution. At this point, it is important to make a clear distinction between a Gaussian distribution and a Gaussian process.
A Gaussian distribution is a continuous probability distribution, informally known as the "bell shaped curve", and fully specified by its mean and variance: x ∼ N(µ, σ²). Moreover, a univariate Gaussian distribution can be defined by the function:

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)    (1)

Gaussian Processes, on the other hand, can be thought of as a generalization of the Gaussian probability distribution to infinitely many variables. A Gaussian process is a Gaussian random function, fully specified by a mean function m(x) and a covariance function k(x, x'):

f(x) \sim \mathcal{GP}(m(x), k(x, x'))    (2)

The correspondence between GPs and Gaussian distributions is then clear. The representation given by (2) means that "the function f is distributed as a GP with mean function m and covariance function k" [13]. To define an individual GP, one needs to choose a form for the mean function, m(x), and for the covariance function, k(x, x').

In most applications there is no prior knowledge about the mean function, m(x), of a given Gaussian Process. For simplicity, and because GPs are, by definition, a linear combination of random variables with a Normal distribution, it is commonly assumed to be zero [3]. If there is, however, enough information about the process being modelled for the mean function to be explicitly different from zero, this can be accommodated in a very trivial way, without loss of the results presented below.

The covariance function, k(x, x'), can in general be any function of two arguments such that k(x, x') generates a nonnegative definite covariance matrix K. By choosing the covariance function, one is implicitly making underlying assumptions about certain aspects of the process being modelled, such as smoothness, periodicity and stationarity, among others. There is obviously a great set of possible covariance functions, but the one most frequently used is the squared exponential covariance function:

k(x, x') = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2}|x - x'|^2\right)    (3)

This covariance function is also sometimes referred to as the Radial Basis Function. It is easy to see that for equation (3) the covariance between any two inputs is very close to σf² if the inputs are close to each other, and decreases exponentially as the distance between the inputs increases. Here, σf and l are what we call the hyperparameters, mainly due to the resemblance to the hyperparameters of a Neural Network. In most cases, the choice of these parameters can significantly influence the performance of the GP. It can be shown that using the squared exponential as a covariance function is equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere, and not just at the training points [16].

The output of the Gaussian process model is a normal distribution, expressed in terms of a mean and a variance. The mean value represents the most likely output and the variance can be interpreted as a measure of its confidence.
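To make equations (2) and (3) more concrete, the short sketch below is an illustration added for this tutorial (it assumes NumPy and is not code from the framework used later in the paper): it builds the squared exponential covariance matrix for a set of inputs and draws a few sample functions from the corresponding zero-mean GP prior.

```python
import numpy as np

def sq_exp_kernel(xa, xb, sigma_f=1.0, length=1.0):
    """Squared exponential covariance, eq. (3): k(x, x') = sigma_f^2 exp(-|x - x'|^2 / (2 l^2))."""
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sq_dist / length**2)

# Inputs at which we inspect the prior
x = np.linspace(0.0, 10.0, 100)
K = sq_exp_kernel(x, x, sigma_f=1.0, length=1.0)

# Draw three sample functions from the zero-mean GP prior f ~ N(0, K).
# A small jitter keeps the covariance matrix numerically positive definite.
samples = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-9 * np.eye(len(x)), size=3)
```

Inputs that lie close together receive a covariance close to σf², so the sampled functions vary smoothly; increasing the length-scale l makes them vary more slowly.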
3. Prediction with Gaussian Processes

Prediction problems are most of the time related to events occurring in a time-series. A typical example of a prediction problem can be stated in the following manner: given some observations {y1, y2, . . . , yN} of a dependent variable, subject to noise, at certain time instants {x1, x2, . . . , xN}, what is our best estimate of the dependent variable at a new time instant xN+1? In the Gaussian Process framework, the inputs would be the vector X = {x1, x2, . . . , xN}, and the test points would be the vector X∗, composed of all the points we want to predict — in this case, only xN+1.

If we are ready to make assumptions about the underlying model that the observed values follow, this problem is usually tackled with traditional linear regression methods. However, if no assumptions are made about the distribution of the observations, then Gaussian Processes are likely to be a better choice than their parametric counterparts.

Let us consider that we have a set of observations y, in which each element is a sample from a Gaussian distribution, representing the real value of the observation affected by some independent Gaussian noise with variance σn². We can then think of each observation as the sum of a function value plus additive Gaussian noise:

y = f(x) + \varepsilon    (4)

Given this, the objective is now to predict f∗, the expected value given the test input x∗. Recalling that a Gaussian Process is a set of random variables with a consistent joint Gaussian distribution, here taken with mean zero, we can represent our problem as:

\begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\!\left(0,\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)    (5)

Here, the different K matrices are built using any function k(x, x') able to act as a covariance function. In particular, as we are in the presence of observations corrupted by noise, the covariance between any two observations is given by:

\operatorname{cov}(y_p, y_q) = k(x_p, x_q) + \sigma_n^2\,\delta_{pq}    (6)
In equation (6), δpq is the Kronecker delta, a function of two variables that is equal to 1 if and only if its two inputs are equal, and 0 otherwise. By combining (3) and (6), for the vector of inputs X the covariance of the associated observations is given by equation (7); it should be noted that the diagonal elements of this matrix are σf² + σn².

\operatorname{cov}(y) = K(X, X) + \sigma_n^2 I    (7)

The prediction step consists in estimating the mean value and the variance of y∗. Considering equation (5), what is desired is to estimate the conditional distribution of y∗ given y. Since y and y∗ are jointly Gaussian random vectors, the conditional distribution of y∗ given y is given by equation (8). For simplicity of notation, in (9) and (10) we use k∗ = K(X, x∗), C_N = K(X, X) + σn²I and k∗∗ = K(X∗, X∗).

y_*\,|\,y \sim \mathcal{N}(\bar{f}_*, \operatorname{cov}(f_*))    (8)

\bar{f}_* = k_*^T C_N^{-1} y    (9)

\operatorname{cov}(f_*) = k_{**} - k_*^T C_N^{-1} k_*    (10)

The mean value of the prediction, \bar{f}_* in equation (9), gives us our best estimate for y∗, and is also known as the matrix of regression coefficients. The variance, cov(f∗), is the Schur complement of C_N, and is an indication of the uncertainty of our estimate. An important conclusion from these results is that the mean prediction \bar{f}_* is a linear combination of the observations y. Another aspect to underline is that the variance cov(f∗) does not depend on the observations, but only on the inputs.
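As a minimal illustration of equations (9) and (10), the predictive mean and covariance at a set of test inputs can be computed directly from the training data. This is again a sketch under the same NumPy assumption, not the implementation used in this work:

```python
import numpy as np

def gp_predict(x_train, y_train, x_test, sigma_f, length, sigma_n):
    """Predictive mean, eq. (9), and covariance, eq. (10), for noisy observations."""
    # Squared exponential covariance, as in eq. (3)
    k = lambda a, b: sigma_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / length**2)

    C_N = k(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))  # C_N = K(X, X) + sigma_n^2 I
    k_star = k(x_train, x_test)                                    # K(X, X*)
    k_ss = k(x_test, x_test)                                       # K(X*, X*)

    mean = k_star.T @ np.linalg.solve(C_N, y_train)                # eq. (9)
    cov = k_ss - k_star.T @ np.linalg.solve(C_N, k_star)           # eq. (10)
    return mean, cov
```

Note that C_N^{-1} is never formed explicitly; the corresponding linear systems are solved instead, which is cheaper and numerically more stable.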
For reasons that will become clearer ahead, we should also at this point introduce the marginal likelihood, p(y|X). By marginalization we mean that we are integrating over the function values f. The marginal likelihood is then the integral of the likelihood times the prior:

p(y|X) = \int p(y|f, X)\, p(f|X)\, df    (11)

It can be seen in [16] that under the Gaussian process model the prior is Gaussian, f|X ∼ N(0, K), and the likelihood is also Gaussian, y|f ∼ N(f, σn²I). Taking the logarithm to simplify the calculations, the result of the integration over f, the log marginal likelihood, is:

\log p(y|X) = -\frac{1}{2} y^T C_N^{-1} y - \frac{1}{2}\log|C_N| - \frac{n}{2}\log 2\pi    (12)

This exact inference is possible because both the prior and the likelihood are Gaussian; otherwise the integral in (11) would likely be intractable. The three terms in (12) play different roles in the likelihood. The first one is the only one involving the observations y and is, therefore, the data-fit term. The second term, on the other hand, depends only on the covariance matrix, and works in a way analogous to the regularization terms in linear regression, adding a penalty as the complexity increases. The last term is only a normalizing constant, and does not play a very specific role in the marginalization of the likelihood. A careful analysis of the effects of the hyperparameters on the log marginal likelihood can be found in [16].

A small note about the computational aspects of evaluating the log likelihood, as given by (12): there are complexity issues related to the inversion of C_N which, depending on the number of data points, might be quite heavy. Moreover, if C_N is an ill-conditioned matrix, its inversion is not trivial. There is already some work devoted to solving these difficulties, and more details about these issues can be found in [16], [6] and [17].
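One common remedy, described in [16], is to factorize C_N with a Cholesky decomposition instead of inverting it. The sketch below is an illustrative NumPy version of that idea, not code from this paper:

```python
import numpy as np

def log_marginal_likelihood(C_N, y):
    """Evaluate eq. (12) through a Cholesky factorization C_N = L L^T."""
    L = np.linalg.cholesky(C_N)                          # fails if C_N is not positive definite
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = C_N^{-1} y
    n = len(y)
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # equals 0.5 * log|C_N|
            - 0.5 * n * np.log(2.0 * np.pi))
```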
4. Learning the Hyperparameters

Given a covariance function, it is straightforward to make predictions for new test points, as it is only a matter of algebraic matrix manipulation. However, in practical applications it is unlikely that we know which covariance function to use. Clearly, the reliability of our regression is then dependent on how well we select the parameters that the chosen covariance function requires [17].

Let θ be the set of hyperparameters needed for a given covariance function. In particular, let us consider the case in (13), where (3) and (6) were combined to form a squared exponential covariance function for the prediction of noisy observations; then θ = {l, σf, σn}. The challenge now, assuming that the covariance function is adequate for the data, is to choose a value for each of the hyperparameters, the free parameters ruling the covariance function.

k(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2}(x_p - x_q)^2\right) + \sigma_n^2\,\delta_{pq}    (13)

For the covariance function in (13), l is the length-scale, σf the signal standard deviation and σn the noise standard deviation. The length-scale characterizes the distance in input space over which the function value can change significantly. Short length-scales mean that the predictive variance can grow rapidly away from the data points, and that predictions are only weakly correlated with each other. In the same way, we can think of σf as a vertical length-scale. The noise that affects the process is assumed to be random, so no correlation between different inputs is expected, and the σn² term is therefore only present on the diagonal of the covariance matrix.
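As a rough numerical illustration (the numbers are chosen only for this example), take l = 50 and σf = 10 in (13): two distinct inputs that are 10 apart get a covariance of 10² exp(−10²/(2·50²)) ≈ 98, i.e. they are almost fully correlated, while inputs 150 apart drop to about 10² exp(−150²/(2·50²)) ≈ 1.1; the σn² term is added only when p = q.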
Figure 1. Example of the effect of optimizing the hyperparameters. In both plots the same Gaussian Process regression was performed, but on the left the hyperparameters were not optimized, while on the right they were.

The trial-and-error approach for choosing appropriate values for each parameter is obviously not adequate. Besides the obvious problems of this random approach, the covariance function can be as complex as needed and, therefore, the number of hyperparameters can be large. What is needed is to find the set of parameters that optimizes the marginal likelihood.

Our maximum a posteriori estimate of θ occurs when the marginal likelihood is maximized, now written p(y|X, θ) to underline that we are interested in the hyperparameters. The problem of learning with Gaussian processes is exactly the problem of learning them. Care should be taken, as the maximization of p(y|X, θ) is a non-convex optimization task, so no guarantee of convergence is provided. Such an optimization is usually achieved through some standard gradient-based technique, as long as the partial derivatives of the covariance matrix with respect to each of the parameters are available.

\frac{\partial}{\partial\theta_k}\log p(y|X,\theta) = \frac{1}{2}\, y^T C_N^{-1}\frac{\partial C_N}{\partial\theta_k} C_N^{-1} y - \frac{1}{2}\operatorname{tr}\!\left(C_N^{-1}\frac{\partial C_N}{\partial\theta_k}\right)    (14)

Equation (14) shows the analytical formulation to compute the different partial derivatives of the log marginal likelihood. Because of its simplicity, gradient ascent is a common technique to find a set of near-optimal hyperparameters that maximizes the log likelihood: the gradients in (14) are iteratively combined with the standard update of equation (15) until convergence, provided the learning rate w is appropriate.

\theta_k \leftarrow \theta_k + w\,\frac{\partial}{\partial\theta_k}\log p(y|X,\theta)    (15)

An alternative to the maximum likelihood estimation of the parameters just described is to use a cross-validation (CV), or generalized cross-validation, algorithm. However, some previous works show that this approach is rather difficult whenever there is a large number of parameters to estimate [17]. More information about CV techniques to estimate values for the hyperparameters can be found in detail in [16].
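To make the update of equations (14) and (15) concrete, the sketch below performs gradient ascent on the log marginal likelihood for the covariance in (13), with the kernel derivatives written out analytically. It is only an illustration under the same NumPy assumption; the experiments in this paper rely on the Matlab framework of [16] instead.

```python
import numpy as np

def learn_hyperparameters(x, y, theta, rate=1e-3, steps=200):
    """Gradient ascent on the log marginal likelihood, eqs. (14)-(15).
    theta = [l, sigma_f, sigma_n] for the covariance in eq. (13)."""
    theta = np.asarray(theta, dtype=float)
    n = len(x)
    D2 = (x[:, None] - x[None, :]) ** 2              # squared distances |x_p - x_q|^2
    for _ in range(steps):
        l, sf, sn = theta
        E = np.exp(-0.5 * D2 / l**2)
        C_N = sf**2 * E + sn**2 * np.eye(n)

        C_inv = np.linalg.inv(C_N)
        alpha = C_inv @ y
        A = np.outer(alpha, alpha) - C_inv           # appears in d(log p)/d(theta_k)

        # analytic derivatives of C_N with respect to each hyperparameter
        dC = [sf**2 * E * D2 / l**3,                 # d C_N / d l
              2.0 * sf * E,                          # d C_N / d sigma_f
              2.0 * sn * np.eye(n)]                  # d C_N / d sigma_n

        grad = np.array([0.5 * np.trace(A @ dCk) for dCk in dC])
        theta = theta + rate * grad                  # eq. (15), with w as the learning rate
    return theta
```

In practice the optimization is often carried out on log(l), log(σf) and log(σn) to keep the parameters positive; that refinement is omitted here for brevity.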
Figure 3. LBL acoustic positioning system schematic diagram.

5. Application

The Ocean Systems Group (OSG), a study group within the Robotics unit of INESC TEC, has its main research efforts directed toward the development of small-sized autonomous robotic vehicles, both underwater and on the surface. Currently, the main challenge for this kind of vehicle, and one of the main research areas, is the improvement of the navigation algorithms that allow the vehicles to localize themselves within the environment.

While vehicles navigating on the surface can rely on the highly accurate GPS systems available, the same is not possible for vehicles that move underwater, as GPS signals are not available in that environment. Therefore the majority of Autonomous Underwater Vehicles (AUVs) rely on acoustic navigation algorithms and, in particular, on long baseline (LBL) acoustic positioning systems.

In these systems, prior to any mission, the vehicle is informed about the actual global coordinates of the beacons that constitute the acoustic network used. Then, in order to know its exact location at any given time, it has to interrogate each beacon, sending an acoustic signal at a specific frequency and waiting for the beacon reply. By timing these acoustic events, it is then possible to compute the actual distance of the vehicle to each of the two beacons and, therefore, its real-time global coordinates.

The algorithm used to estimate the distances d1 and d2, as can be seen in figure 3, assumes that the AUV remains stationary between the interrogation of a beacon and the reception of the corresponding answer. It is also considered that the depths the AUV reaches while in a mission are constant and quite small relative to the distances to both beacons; thereby, we can assume motion only in the horizontal plane. The measurements available are highly irregular and noisy, and require filtering.
Figure 2. Example of range measurements to a pair of buoys acquired during a mission. In green are the measurements classified by an expert as good, while the ones in blue are classified as spurious and caused by reflections.
The technique currently in use on the OSG vehicles is based on a Kalman Filter (KF). The KF plays a very important role in the estimation process, as it not only allows the elimination of spurious measurements, but also the fusion of navigation data coming from different sensors. An example of range measurements acquired during a standard mission can be seen in figure 2.

The filtering of the measurements is done by evaluating the covariance of the error associated with each measurement and comparing it to a design parameter γ. Even though this method achieves reasonable results, it is not very flexible, as it relies solely on a single parameter and does not allow adaptation to varying environmental conditions, such as the temperature and salinity of the water, which have a great effect on the measurements. As a consequence, γ has to be wide enough, but this causes spurious measurements to be accepted as if they were not spurious. In opposition to direct range measurements, these spurious measurements are mostly caused by reflections of the acoustic signals on the bottom of the sea or on the surface.

The main motivation for this work came from a paper by Bingham and Seering [2], where both direct measurements and reflections were modelled, but in an off-line post-processing environment. By using an Expectation-Maximization algorithm and a proper modelling of both the range and the reflection, some very interesting results were obtained.

For this work, the objective was to filter the range measurements using techniques from background subtraction, a widely used approach for detecting moving objects with static cameras. After a careful review of some of the techniques in use, it was decided to use a running Gaussian average to validate the range measurements online. This model is based on ideally fitting a Gaussian probability density function to the last n pixel values. In order to avoid fitting from scratch at each new frame, a running average is computed instead [12].

\mu_t = (1 - \rho)\,\mu_{t-1} + \rho\, X_t    (16)

\sigma_t^2 = (1 - \rho)\,\sigma_{t-1}^2 + \rho\,(X_t - \mu_t)^T (X_t - \mu_t)    (17)

Following the paper by Stauffer and Grimson [15], who proposed some changes to the traditional running Gaussian approaches, the running mean and covariance are updated according to (16) and (17). Xt is the new observation to be validated and ρ = N(Xt | µk, σk). A match is defined whenever an observation is within 2.5 standard deviations of the distribution.

The range measurements are expected to vary throughout time, as the distance from the AUV to both beacons also varies according to the motion of the vehicle. In that sense, the mean of the running Gaussian should also vary in the same way. To tackle this problem, what we propose is to predict the next range measurement based on the past measurements taken as correct. For such a prediction one can use either a linear regression or a Gaussian Process regression. Given the scope of this paper, which is intended to be a tutorial on the use of Gaussian Processes, the next section presents results comparing both approaches.
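A minimal sketch of this validation loop is given below. The helper gp_predict from Section 3 is assumed, the variable names are hypothetical, and a fixed update rate ρ is used instead of the likelihood-weighted one mentioned above, so this is only an illustration of the idea rather than the implementation running on the vehicles:

```python
import numpy as np

def validate_range(history_t, history_r, t_new, r_new, var, sigma_f, length, sigma_n, rho=0.05):
    """Accept or reject a new range r_new at time t_new.
    history_t / history_r hold previously accepted times and ranges."""
    # Predict the next range from the accepted history (GP regression, eqs. (9)-(10))
    mean_pred, _ = gp_predict(np.asarray(history_t), np.asarray(history_r),
                              np.asarray([t_new]), sigma_f, length, sigma_n)
    mu = mean_pred[0]

    # Match test: accept if within 2.5 standard deviations of the running Gaussian
    accepted = abs(r_new - mu) <= 2.5 * np.sqrt(var)
    if accepted:
        # Running update of the variance, in the spirit of eq. (17)
        var = (1.0 - rho) * var + rho * (r_new - mu) ** 2
        history_t.append(t_new)
        history_r.append(r_new)
    return accepted, var
```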
Figure 4. Linear Regression: results.

Figure 5. GP Regression: results.

6. Results

In this section we present results that compare a standard linear regression with a GP regression. All these results are comparable, in the sense that all of them relate to the same data, acquired during a mission performed in August 2011 in the Douro River. The linear regression algorithms were implemented by the author, but for the regression with Gaussian Processes the framework developed for Matlab by Carl Edward Rasmussen and Christopher K. I. Williams, freely available on the Internet, was used.

In figure 4 we can see the result of the linear regression, for which the basis is {1, x, x²}, i.e., a regression using a quadratic function. Even though at a coarse level the regression is able to follow the actual ranges, we can see that, especially at the inflexion points, there are a lot of noisy predictions that do not match the actual behaviour of the vehicle.

In figure 5 we have the corresponding Gaussian Process based regression. In this case the regression predictions also follow the actual range measurements closely. A careful look reveals that there are fewer noisy predictions at the inflexion points, with the regression fitting more closely to the data. This regression was made using a squared exponential function as in eq. (13), with the following hyperparameters: l = 50, σf = 10 and σn = 1.

Figure 6. GP Regression: detail with improved parameters. It is possible to note some outliers being rejected.

While for the linear regression there is not much to be improved, in the regression with the Gaussian Process we can still try to vary the hyperparameters, as described in the previous sections. In fact, the framework used provides the methods necessary to obtain the derivatives of the log marginal likelihood, which would be of great help.

However, and despite some effort, the parameters could not be optimized, due to lack of convergence. Recalling from the previous sections, the maximization of the log marginal likelihood is not a convex optimization problem, and therefore the gradient-based methods that were used, with a large set of learning rates, are not guaranteed to succeed, as indeed happened. On the other hand, due to the intrinsic and very dynamic nature of the problem, where the vehicle is always moving with different attitudes towards each of the buoys, it is probably very unlikely that the optimal parameters remain the same throughout the whole mission. Instead, it is more likely that these parameters keep varying, and so the gradient algorithm does not converge.

Even so, a simple trial-and-error approach led to a small improvement in the prediction, with its output more immune to spurious measurements. With the hyperparameters set to l = 50, σf = 10 and σn = 5, where one can note the change of the noise parameter σn from 1 to 5, we can achieve better results. In the detailed view in figure 6 we can confirm that some outliers are being ignored.

7. Conclusion and Future Work

In this paper regression with Gaussian Processes was covered. It was shown that it can be a quite effective method if there is some prior knowledge about the covariance of the measurements. In fact, this is of the utmost importance, as choosing a wrong covariance function can lead to poor performance.
A note as well on the potential of Gaussian processes for classification problems, where the approach is quite similar to the one used for prediction problems. The main difference is that in classification problems, due to the fact that an activation function is used, the integration of the likelihood times the prior is in fact intractable. Approximation algorithms must therefore be employed, such as the Laplace approximation or Expectation Propagation.

To conclude, even though Gaussian Processes are, in a sense, close to some ARMA models or even the Kalman Filter, they provide a very efficient way to adapt to the data in a non-parametric way. Given that we choose an adequate mean function and covariance function, the problem of learning with Gaussian processes is exactly the problem of learning the hyperparameters of the covariance function.

References

[1] R. M. S. Almeida. Sistema inteligente de posicionamento acústico subaquático. Master's thesis, Faculdade de Engenharia da Universidade do Porto, 2010.
[2] B. Bingham and W. Seering. Hypothesis grids: improving long baseline navigation for autonomous underwater vehicles. Oceanic Engineering, IEEE Journal of, 31(1):209-218, Jan. 2006.
[3] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[4] J. S. Cardoso. Machine Learning, Lecture Notes. Faculty of Engineering, University of Porto, January 2012.
[5] M. Ebden. Gaussian processes for regression: A quick introduction. Available at http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf, August 2008 (accessed January 2011).
[6] M. Gibbs and D. J. MacKay. Efficient implementation of Gaussian processes. Technical report, 1997.
[7] B. Huhle, T. Schairer, A. Schilling, and W. Strasser. Learning to localize with Gaussian process regression on omnidirectional image data. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 5208-5213, Oct. 2010.
[8] B. Huhle, T. Schairer, A. Schilling, and W. Strasser. Learning to localize with Gaussian process regression on omnidirectional image data. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pages 5208-5213, Oct. 2010.
[9] J. Ko, D. Klein, D. Fox, and D. Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Robotics and Automation, 2007 IEEE International Conference on, pages 742-747, April 2007.
[10] J. Ko, D. Klein, D. Fox, and D. Haehnel. Gaussian processes and reinforcement learning for identification and control of an autonomous blimp. In Robotics and Automation, 2007 IEEE International Conference on, pages 742-747, April 2007.
[11] J. Melo and A. Matos. Guidance and control of an ASV in AUV tracking operations. In OCEANS 2008, pages 1-7, Sept. 2008.
[12] M. Piccardi. Background subtraction techniques: a review. In Systems, Man and Cybernetics, 2004 IEEE International Conference on, volume 4, pages 3099-3104, Oct. 2004.
[13] C. E. Rasmussen. Gaussian processes in machine learning. Available at http://www.cs.ubc.ca/~hutter/earg/papers05/rasmussen_gps_in_ml.pdf (accessed January 2011).
[14] S. Srihari. Gaussian processes - lecture slides for machine learning and probabilistic graphical models. Technical report, Department of Computer Science and Engineering, University at Buffalo, 2011.
[15] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on, volume 2, 1999.
[16] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[17] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Learning and Inference in Graphical Models, pages 599-621. Kluwer, 1998.