Skew Gaussian Process For Nonlinear Regression
1 Department of Mathematics, Statistics and Physics, Qatar University, Qatar
2 Department of Statistics, Yarmouk University, Irbid, Jordan
In this article, we extend the Gaussian process regression model by assuming a skew Gaussian process prior on the input function and a skew Gaussian white noise on the error term. Under these assumptions, the predictive density of the output function at a new fixed input is obtained in closed form. Also, we study the Gaussian process predictor when the errors depart from Gaussianity to skew Gaussian white noise. The bias is derived in closed form and is studied for some special cases. We conduct a simulation study to compare the empirical distribution function of the Gaussian process predictor under Gaussian white noise and skew Gaussian white noise.
1. Introduction
In the statistical literature, the assumption of Gaussianity or normality has long been made on statistical models when analyzing spatial data. The popularity of the Gaussian assumption is due to its mathematical tractability. For example, the multivariate Gaussian distribution possesses the properties of closure under marginal and conditional distributions as well as closure under convolution. Despite such nice properties of the Gaussian distribution, it is found that the data distribution does not meet the assumption of Gaussianity for a large number of real data sets due to the presence of skewness. If the analysis of such data sets relies on the Gaussian assumption, then unrealistic or nonsensical estimates will be produced. The simplest way to analyze skewed data via the Gaussian model is to Gaussianize the data, i.e., to transform the data to near-Gaussian data. Such a transformation method is not recommended for the following reasons. (i) Finding a suitable transformation to achieve normality is not an easy issue in practice. (ii) Since such transformations are usually applied to the data component-wise, the normality of the marginal distributions does not guarantee joint normality. Hence, the estimates might
be subject to biases. (iii) Despite the difficulty in interpreting the transformed data, data skewness cannot be ignored, since it has an interpretation (Buccianti, 2005).
Recently, random processes that possess a skewness parameter have been defined by several researchers. Alodat and Aludaat (2007) employed the skew normal theory, as presented in Genton (2004), to define a new random process called the skew Gaussian process. They also gave an application to real data. Relying on the multivariate closed skew normal distribution of González-Farías et al. (2004), Allard and Naveau (2007) defined what they called the closed skew normal random field. For more examples of skew random processes or fields, we refer the reader to Zhang and El-Shaarawi (2009) and Alodat and Al-Rawwash (2009).
The cornerstone in defining a new skew process or field is the multivariate skew normal distribution, which appeared in the pioneering works of Azzalini (1985, 1986), Azzalini and Dalla Valle (1996), and Azzalini and Capitanio (1999). The skew-normal or skew Gaussian distribution is defined as follows. A random vector $Z_{(n\times 1)}$ is said to have an n-dimensional multivariate skew normal distribution if it has the probability density function (pdf)
$$P_Z(z) = 2\,\phi_n(z; 0, \Sigma)\,\Phi\!\left(\alpha^T z\right), \quad z \in \mathbb{R}^n, \qquad (1)$$
where $\phi_n(\cdot; 0, \Sigma)$ is the pdf of $N_n(0, \Sigma)$, $\Phi(\cdot)$ is the cdf of $N(0, 1)$, and $\alpha_{(n\times 1)}$ is a vector called the skewness parameter. A more general family than (1) is obtained by using the transformation $X = \mu + Z$, $\mu \in \mathbb{R}^n$. It is easy to show that the pdf of X is $P_X(x) = P_Z(x - \mu)$. We use the notation $X \sim SN_n(\mu, \Sigma, \alpha)$ to denote an n-dimensional skew normal distribution with parameters $\mu$, $\Sigma$, and $\alpha$.
Also a generalization of (1) is given by González-Farías et al. (2004) as follows. Let $\mu \in \mathbb{R}^p$, $v \in \mathbb{R}^q$, $D$ be an arbitrary $q \times p$ matrix, and $\Sigma$ and $\Delta$ be positive definite matrices of dimensions $p \times p$ and $q \times q$, respectively. A random vector Y is said to have a p-dimensional closed skew normal distribution (CSN) with parameters $q, \mu, \Sigma, D, v, \Delta$, denoted by $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, if its pdf takes the form
$$P_Y(y) = \frac{\phi_p(y; \mu, \Sigma)\,\Phi_q\!\left(D(y - \mu); v, \Delta\right)}{\Phi_q\!\left(0; v, \Delta + D\Sigma D^T\right)}, \quad y \in \mathbb{R}^p,$$
where $\phi_p(\cdot; \eta, \psi)$ and $\Phi_p(\cdot; \eta, \psi)$ are the pdf and the cumulative distribution function (cdf) of a p-dimensional normal distribution with mean vector $\eta$ and covariance matrix $\psi$.
Throughout this article, several lemmas and results about the multivariate CSN distribution
will be used extensively. So we present them in Appendix A. For their proofs, we refer the
reader to González-Farías et al. (2004) or Genton (2004).
Furthermore, it has been shown that the family of skew normal distributions possesses properties that are close to or coincide with those of the normal family. Besides these closure properties, it contains the normal family as a special case, i.e., when α = 0. Such properties have attracted researchers to extend well-known statistical techniques to the skew normality setting, and much work remains to be done in this direction. For example, the Gaussian process regression (GPR) model is a statistical technique introduced by Neal (1995) to treat a nonlinear regression Y(t) = f(t) + ε(t) from a Bayesian viewpoint. Simply, the technique assumes a Gaussian process as a prior on the unknown function f(t) while ε(t) is assumed to be a white noise process. Then the aim is to predict f(t) at a
new value of t. In other words, the Gaussian process provides us with a prior distribution over the space of all functions.
Since the Gaussian family is a sub-family of the skew Gaussian family, using the skew Gaussian process, i.e., a process whose finite dimensional distributions are of the form (1), as a prior on f(t) will allow us to define a distribution over a richer family of functions than the Gaussian one. Also, it will allow us to extend the error term in the above regression model to have a skewed distribution, which is closer to real data than its Gaussian counterpart.
It appears from the literature that GPR has significant applications in various fields of science. For example, it has been applied to model noisy data, to classification problems arising in machine learning, and to predict the inverse dynamics of a robot arm (Rasmussen and Williams, 2006). Brahim-Belhouari and Bermak (2004) applied the GPR model to predict the future value of a non-stationary time series. Schmidt et al. (2008) studied the sensitivity of GPR to the choice of correlation function. Based on a numerical study, they concluded that the predictions did not differ much among the different correlation functions. Vanhatalo et al. (2009) proposed a GPR with Student-t likelihood by approximating the joint distribution of the process by a Student distribution. The idea behind that approximation is to make the GPR model robust against outliers. The model they proposed is analytically intractable. Kuss (2006) proposed other robust models as alternatives to GPR. Macke et al. (2010) applied the GPR to estimate the cortical map of the human brain. They modeled the brain image of their experiment, where the activity at each voxel is measured, by a Gaussian process. Fyfe et al. (2008) applied the GPR to canonical correlation analysis with an application to neuron data.
The problem of treating the prediction problem of the nonlinear regression Y(t) = f(t) + ε(t) from a Bayesian viewpoint when both f(t) and ε(t) follow skew Gaussian processes has not yet been addressed in the literature. In this article, we extend the GPR model by assuming two independent skew Gaussian processes, one on f(t) and the other on ε(t). In other words, we consider the nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, $i = 1, 2, \dots, n$, i.e., for each i, $f(t_i)$ is measured as $Y_i$ but corrupted by the noise $\varepsilon(t_i)$. Then we put a skew Gaussian process as a prior on the function f(t). Also, we assume that the process ε(t) follows a skew Gaussian process. Under these assumptions, the following two prediction problems are considered: (i) prediction of f(t) at a fixed input t, and (ii) prediction of f(t) at a random input t*.
The rest of this article is organized as follows. In Sec. 2, we introduce the reader to the GPR model. In Sec. 3, we generalize the GPR model by assuming a skew Gaussian process on f(t) and another skew Gaussian process on ε(t). Then we derive the predictive density of the output function at a new input. Also, we derive the mean and the variance of the predictive distribution. In Sec. 4, it is assumed that the GPR predictor is used to analyze data with skewed errors, and we derive the bias and the variance of this predictor. In Sec. 5, we conduct a simulation study to compare the new model to the Gaussian one. Finally, we state our conclusions in Sec. 6.
2. The GPR Model
O'Hagan (1978) proposed the Gaussian process as a prior distribution over the space of functions to treat a nonlinear regression from a Bayesian viewpoint, while an application of O'Hagan's work to Bayesian learning in networks appeared in Neal (1995).
The GPR, as presented in Neal (1995), can be illustrated as follows. Consider a set of training data $Y = (Y_1, Y_2, \dots, Y_n)^T$, where the input vectors $t_1, t_2, \dots, t_n \in C \subseteq \mathbb{R}^p$ and their output values $Y_1, Y_2, \dots, Y_n$ are governed by the nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, where $\varepsilon(t_1), \varepsilon(t_2), \dots, \varepsilon(t_n)$ are iid Gaussian noises on C with mean 0 and variance $\tau^2$, and $f(\cdot)$ is an unknown function. The main question is "what is the predicted value of $f^* = f(t^*)$, the value of f(t) at a new input $t^*$?". To answer this question, a prior distribution is needed on f(t), i.e., a distribution over a set of functions is needed. This prior distribution should be defined on the class of all functions defined on the space of t. The set of all sample paths of a Gaussian process on C provides us with a
with
$$\begin{pmatrix} \Sigma & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) & k(t_1, t^*) \\ \vdots & \ddots & \vdots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) & k(t_n, t^*) \\ k(t^*, t_1) & \cdots & k(t^*, t_n) & k(t^*, t^*) \end{pmatrix},$$
where $\Sigma = \left(k(t_i, t_j)\right)_{i,j=1}^{n}$, $k = \left(k(t_1, t^*), \dots, k(t_n, t^*)\right)^T$, and $k^* = k(t^*, t^*)$.
Rasmussen (1996) showed that the predictive distribution of f* given Y and t* remains Gaussian and is given by
$$f^* \mid Y, t^* \sim N\!\left(\mu(t^*), \sigma^2(t^*)\right), \qquad (4)$$
where μ(t*) and σ²(t*) are the mean and the variance of the predictive distribution (4) and are given by
$$\mu^* = \mu(t^*) = k^T\left(\Sigma + \tau^2 I_n\right)^{-1} Y \quad \text{and} \quad \sigma^2(t^*) = k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k,$$
where $Y = (Y_1, Y_2, \dots, Y_n)^T$.
The distribution (4) can be used to draw several inferential statements about f(t*). For instance, when p = 1, a 100(1 − α)% prediction interval for f(t*) is given by [L, U], where L and U are the solutions of the following two equations:
$$\int_{-\infty}^{L} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2} \quad \text{and} \quad \int_{U}^{\infty} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2}.$$
Since (4) is Gaussian, this interval reduces to
$$\mu(t^*) \pm Z_{1-\frac{\alpha}{2}}\, \sigma(t^*),$$
where $Z_{1-\frac{\alpha}{2}}$ is the $100\left(1-\frac{\alpha}{2}\right)\%$ quantile of N(0, 1). Moreover, the mean μ(t*) serves as a predictor for f(t*) given the data Y and t*, while the variance σ²(t*) serves as a measure of uncertainty in μ(t*).
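To make these formulas concrete, the following minimal Python sketch computes μ(t*), σ²(t*), and the corresponding 95% prediction interval at a fixed input; the squared-exponential covariance function, its hyperparameters, and the simulated data are illustrative assumptions rather than choices made in this article.

import numpy as np
from scipy.stats import norm

def sq_exp_kernel(a, b, sigma_f=1.0, ell=1.0):
    # Squared-exponential covariance between two arrays of scalar inputs (illustrative choice).
    return sigma_f**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def gpr_predict(t, y, t_star, tau, sigma_f=1.0, ell=1.0):
    # Predictive mean mu(t*) = k^T (Sigma + tau^2 I_n)^{-1} Y and
    # variance sigma^2(t*) = k* - k^T (Sigma + tau^2 I_n)^{-1} k.
    K = sq_exp_kernel(t, t, sigma_f, ell)
    k = sq_exp_kernel(t, np.array([t_star]), sigma_f, ell).ravel()
    k_star = sigma_f**2                                  # k(t*, t*) for this kernel
    A = K + tau**2 * np.eye(len(t))
    mu = k @ np.linalg.solve(A, y)
    var = k_star - k @ np.linalg.solve(A, k)
    return mu, var

# Example: noisy observations of f(t) = sin(t) and a 95% prediction interval at t* = 1.3.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2 * np.pi, 20)
y = np.sin(t) + 0.1 * rng.standard_normal(t.size)
mu, var = gpr_predict(t, y, t_star=1.3, tau=0.1)
z = norm.ppf(0.975)
print(mu - z * np.sqrt(var), mu + z * np.sqrt(var))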
Now, assume that we are interested in predicting f(t) at t*, where t* is a random variable such that $t^* \sim N_p(\mu^*, \Sigma^*)$, i.e., we are interested in prediction at a random input. So the predictive pdf of f* given μ*, Σ* is (Girard et al., 2004):
$$P(f^* \mid \mu^*, \Sigma^*, Y) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^*. \qquad (5)$$
The integral in Eq. (5) does not have a closed form. Hence, an approximation to this integral is needed in order to report inferential statements about f*. Moreover, the main computational problem in GPR is the inversion of the matrix $\Sigma + \tau^2 I_n$ and the computation of the mean and variance of the predictive distribution of f* given Y at a random input t*. For this reason, we propose the following simple Monte Carlo approximation to (5):
$$P(f^* \mid \mu^*, \Sigma^*, Y) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \approx \frac{1}{N} \sum_{r=1}^{N} P\!\left(f^* \mid Y, t^{*(r)}\right),$$
where $t^{*(1)}, \dots, t^{*(N)}$ are independent samples from P(t*). Before closing this section, we refer to Girard et al. (2002) and Rasmussen and Williams (2006), where the reader can find several analytical approximation techniques to approximate the predictive density (5).
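As a rough illustration of this Monte Carlo scheme, the sketch below (reusing gpr_predict from the previous sketch) averages the Gaussian predictive densities over draws of the random input; the one-dimensional Gaussian input distribution is an illustrative assumption.

import numpy as np
from scipy.stats import norm

def predictive_density_random_input(f_grid, t, y, tau, mu_t, s_t, N=2000, rng=None):
    # Approximates P(f* | mu*, Sigma*, Y) on a grid of f* values by
    # (1/N) sum_r P(f* | Y, t*(r)) with t*(r) drawn from P(t*).
    rng = np.random.default_rng() if rng is None else rng
    dens = np.zeros_like(f_grid, dtype=float)
    for _ in range(N):
        t_star = rng.normal(mu_t, s_t)            # t*(r) ~ P(t*)
        mu, var = gpr_predict(t, y, t_star, tau)  # P(f* | Y, t*(r)) is N(mu, var)
        dens += norm.pdf(f_grid, loc=mu, scale=np.sqrt(var))
    return dens / N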
Definition 3.2. A skew Gaussian process Y(t) possesses a fixed skewness in all directions if for every n and $t_1, \dots, t_n$, the parameter α in (1) takes the form $\alpha = \alpha 1_n$, $\alpha \in \mathbb{R}$.
Throughout this article, we will assume that for each n and $t_1, \dots, t_n \in C$, the parameter $\Sigma = \left(k(t_i, t_j)\right)_{i,j=1}^{n}$, where $k(\cdot, \cdot)$ is a given covariance function.
Definition 3.3. A skew Gaussian process is called a skew white noise if for every n and $t_1, \dots, t_n \in C \subseteq \mathbb{R}^p$, $\varepsilon = \left(\varepsilon(t_1), \dots, \varepsilon(t_n)\right)^T \sim SN_n\!\left(0, \tau^2 I_n, \beta 1_n\right)$, where $\tau, \beta \in \mathbb{R}$.
and
$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \left(D_1 \;\; D_2\right),$$
where
$$D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2\times n} \quad \text{and} \quad D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2\times 1}.$$
Then:
i. the conditional distribution of f* given Y and t* is
$$f^* \mid Y, t^* \sim CSN_{1,2}\!\left(k^T\left(\Sigma + \tau^2 I_n\right)^{-1} Y,\; k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right), \qquad (6)$$
where
$$D^* = D_1 + D_2 k^T\left(\Sigma + \tau^2 I_n\right)^{-1};$$
ii. the predictive mean and the predictive variance are given by
$$E(f^* \mid Y, t^*) = \mu^* + \sigma^{*2}\left[D_{12}\,\Phi_2^{(1)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\right) + D_{22}\,\Phi_2^{(2)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\right)\right],$$
and
$$\begin{aligned} \mathrm{var}(f^* \mid Y, t^*) = {} & 2\sigma^{*4}\Big[\Phi_2^{(11)}\!\left(0_{2\times1}; -D^*Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12}^2 + \Phi_2^{(12)}\!\left(0_{2\times1}; -D^*Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} D_{22} \\ & {}+ \Phi_2^{(21)}\!\left(0_{2\times1}; -D^*Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} D_{22} + \Phi_2^{(22)}\!\left(0_{2\times1}; -D^*Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{22}^2\Big] \\ & {}+ 4\sigma^{*2}\left[\Phi_2^{(1)}\!\left(0_{2\times1}; -D^*Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} + \Phi_2^{(2)}\!\left(0_{2\times1}; -D^*Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{22}\right]\mu^* \\ & {}+ \mu^{*2} + \sigma^{*2} - \left[E(f^* \mid Y, t^*)\right]^2, \end{aligned}$$
where $\Phi_2^{(i)}(\cdot,\cdot)$ is the first partial derivative of $\Phi_2(\cdot,\cdot)$ with respect to the ith argument for i = 1, 2, and $\Phi_2^{(ij)}(\cdot,\cdot)$ is the second mixed partial derivative of $\Phi_2(\cdot,\cdot)$ with respect to the ith and jth arguments for i, j = 1, 2.
The above theorem shows that the predictive distribution of a new output follows a closed skew Gaussian distribution. As a special case, this predictive distribution reduces to (4) if the skewness is absent, i.e., if α = β = 0. Another predictor of f(t*) is the median of the conditional distribution of f* given Y. Neither the mean nor the median of the conditional distribution in our case has a simple closed form. Furthermore, part (i) of Theorem 3.1 can be used to predict the value of f(t) at a random input, say t*. For instance, assume that $t^* \sim N_p(a, B)$ and we wish to predict $f^* = f(t^*)$. Since $f^* \mid Y, t^* \sim CSN_{1,2}\!\left(\mu^*, \sigma^{*2}, D_2, -D^*Y, \Delta_A\right)$, then using the total probability law, we write
$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\, P(t^*)\, dt^*.$$
Unfortunately, it is difficult even for GPR to find a closed form for the integral in
the last equation, so an approximation for P (f ∗ |Y ) is needed. Here, we propose the
following simple Monte Carlo approximation for the predictive distribution at a random
input:
$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \cong \frac{1}{N}\sum_{r=1}^{N} P\!\left(f^* \mid Y, t^{*(r)}\right),$$
where $t^{*(1)}, \dots, t^{*(N)}$ are independent samples from P(t*). Since we are putting a skew Gaussian process prior on the function f(t), then for each n, the finite dimensional distribution of the skew Gaussian process is used as a prior for the distribution of the vector $f = \left(f(t_1), \dots, f(t_n)\right)^T$, i.e., $f \sim SN_n(0, \Sigma, \alpha 1_n)$, where $\Sigma$ is as defined in the previous sections. Since $Y = f + \varepsilon$, where $\varepsilon \sim SN_n\!\left(0, \tau^2 I_n, \beta 1_n\right)$, then the posterior distribution of f* given Y and t* can be obtained.
It will be shown in the proof of Theorem 3.1 (Appendix B1) that this posterior distribution simplifies to the pdf of the distribution
$$CSN_{1,2}\!\left(k^T\left(\Sigma + \tau^2 I_n\right)^{-1} Y,\; k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k,\; D_2,\; -D^*Y,\; \Delta_A\right).$$
To simulate from this distribution, we may utilize the following stochastic representation of the CSN distribution (Genton, 2004; Allard and Naveau, 2007):
i. Let V be a random vector from $N_2(-D^*Y, Q)$, where
$$Q = \Delta_A + D_2\left(k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k\right) D_2^T = \Delta_A + \sigma^{*2} D_2 D_2^T.$$
$$E^{G}(\hat{f}) = k^T\left(\Sigma + \tau^2 I_n\right)^{-1} f,$$
while under the assumption $\varepsilon \sim SN_n\!\left(0, \tau^2 I_n, \beta 1_n\right)$, we have that
$$E^{SG}(\hat{f}) = k^T\left(\Sigma + \tau^2 I_n\right)^{-1} E(f + \varepsilon) = k^T\left(\Sigma + \tau^2 I_n\right)^{-1} f + k^T\left(\Sigma + \tau^2 I_n\right)^{-1} E(\varepsilon).$$
Hence,
$$E^{SG}(\hat{f}) = E^{G}(\hat{f}) + k^T\left(\Sigma + \tau^2 I_n\right)^{-1}\sqrt{\frac{2}{\pi}}\,\frac{\tau^2 \beta 1_n}{\sqrt{1 + \beta^2 \tau^2 n}} = E^{G}(\hat{f}) + b\!\left(\tau^2, \beta^2, n\right), \text{ say.} \qquad (7)$$
$$\mathrm{var}^{SG}(\hat{f}) = \tau^2 k^T\left(\Sigma + \tau^2 I_n\right)^{-2} k - \frac{2}{\pi}\,\frac{\beta^2 \tau^4 k_0^2 n^2}{\left(1 + n\tau^2\beta^2\right) L_n^2},$$
where $L_n = \tau^2 + \sum_{i=1}^{n} k(t_1, t_i)$.
Theorem 4.1. Consider the setup in the above discussion. Then $b\!\left(\tau^2, \beta^2, n\right)$ and $\mathrm{var}^{SG}(\hat{f})$ satisfy the following properties.
i. $\mathrm{var}^{SG}(\hat{f}) \le \mathrm{var}^{G}(\hat{f})$ for all $\tau, \beta, n$.
ii. $\lim_{\beta\to\pm\infty} b\!\left(\tau^2, \beta^2, n\right) = \sqrt{\frac{2}{\pi}}\,\frac{\tau}{\sqrt{n}}\, k^T\left(\Sigma + \tau^2 I_n\right)^{-1} 1_n$, $\lim_{\beta\to 0} b\!\left(\tau^2, \beta^2, n\right) = 0$, and $\lim_{\tau\to 0} b\!\left(\tau^2, \beta^2, n\right) = 0$.
iii. Assume that $t_1, t_2, \dots, t_n$ are chosen so that they are the vertices of a regular polygon and $t^*$ is located at its center. If $k(\cdot,\cdot)$ is an isotropic covariance function and $k_0 = k(t_1, t^*)$, then
a. $b\!\left(\tau^2, \beta^2, n\right) > 0$ for all $\tau$, $n$ and $\beta \ne 0$, and $\lim_{\tau\to\infty} b\!\left(\tau^2, \beta^2, n\right) = 0$.
b. If $\sum_{i=1}^{n} k(t_1, t_i) = n^{-0.5} O(n)$, with $n^{-0.5} O(n) \to c \ne 0$ as $n \to \infty$, then $\lim_{n\to\infty} b\!\left(\tau^2, \beta^2, n\right) = \sqrt{\frac{2}{\pi}}\,\frac{\tau \beta k_0}{|\beta|\, c}$, and if $\sum_{i=1}^{n} k(t_1, t_i) = O(n)$, then $\lim_{n\to\infty} b\!\left(\tau^2, \beta^2, n\right) = 0$.
c. If $\sum_{i=1}^{n} k(t_1, t_i) = n^{-0.5} O(n)$, then $\lim_{n\to\infty} \mathrm{var}^{SG}(\hat{f}) = \left(1 - \frac{2}{\pi}\right)\frac{\tau^2 k_0^2}{c^2}$.
d. $\lim_{\beta\to 0} \mathrm{var}^{SG}(\hat{f}) = \frac{n\tau^2 k_0^2}{L_n^2}$ and $\lim_{\beta\to\pm\infty} \mathrm{var}^{SG}(\hat{f}) = \frac{n\tau^2 k_0^2}{L_n^2}\left(1 - \frac{2}{\pi}\right)$.
Proof. The proofs of (ii) and (iii)(a)–(iii)(b) are given in Appendix B. The proofs of the other parts are easy, so we leave them to the reader.
It can be noticed that if the Gaussian predictor is used for predicting skewed data, then the variance of this predictor cannot exceed its variance under Gaussian errors. On the other hand, the value of the predictor will be shifted to the left or to the right of the Gaussian one by an amount of $b\!\left(\tau^2, \beta^2, n\right)$. If an isotropic Gaussian covariance function is used, then $\sum_{i=1}^{n} k(t_1, t_i) = \sum_{i=1}^{n} \tau^2 \exp\!\left(-\frac{0.5\,\theta_0 i}{\lambda^2}\right) = O(n)$, where $\theta_0$ denotes the angle between $t_i$ and $t^*$ for all $i = 1, \dots, n$. So the Gaussian covariance function satisfies part (b) of Theorem 4.1.
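As a numerical illustration of the bias in (7) and of var^SG(f̂) under the symmetric design of Theorem 4.1(iii), the following Python sketch places t_1, ..., t_n on a regular polygon with t* at its centre and uses an isotropic squared-exponential covariance; the kernel and its parameter values are illustrative assumptions, not the article's.

import numpy as np

def bias_and_variance(n=8, tau=0.5, beta=2.0, radius=1.0, sigma_f=1.0, ell=1.0):
    ang = 2 * np.pi * np.arange(n) / n
    pts = radius * np.column_stack([np.cos(ang), np.sin(ang)])    # polygon vertices t_1, ..., t_n
    d2 = ((pts[:, None, :] - pts[None, :, :])**2).sum(-1)
    Sigma = sigma_f**2 * np.exp(-0.5 * d2 / ell**2)
    k = sigma_f**2 * np.exp(-0.5 * (pts**2).sum(-1) / ell**2)     # k(t_i, t*) with t* at the origin
    A_inv = np.linalg.inv(Sigma + tau**2 * np.eye(n))
    ones = np.ones(n)
    # b(tau^2, beta^2, n) = k^T (Sigma + tau^2 I)^{-1} sqrt(2/pi) tau^2 beta 1_n / sqrt(1 + beta^2 tau^2 n)
    b = k @ A_inv @ ones * np.sqrt(2 / np.pi) * tau**2 * beta / np.sqrt(1 + beta**2 * tau**2 * n)
    k0, Ln = k[0], tau**2 + Sigma[0].sum()
    var_g = tau**2 * k @ A_inv @ A_inv @ k                        # variance when beta = 0
    var_sg = var_g - (2 / np.pi) * beta**2 * tau**4 * k0**2 * n**2 / ((1 + n * tau**2 * beta**2) * Ln**2)
    return b, var_sg, var_g

print(bias_and_variance())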
5. Simulation study
In this section, we present an algorithm to simulate a realization from a skew Gaussian process, i.e., by simulating from its finite dimensional distributions. Then the algorithm is implemented in Matlab code to simulate from the GPR and SGPR predictors.
Assume that the target pdf P(x) can be written as $P(x) = c\, g(x)\, h(x)$, where $c \ge 1$, $0 < g(x) \le 1$ for all x, and h(x) is a pdf. If this is the case, then a random observation from P(x) is generated as follows.
1. Generate U from U(0, 1).
2. Generate Y from h(x).
3. If $U \le g(Y)$, then deliver Y as a realization of P(x); otherwise,
4. Go to Step 1.
For the $SN_n(0, \Sigma, \lambda)$ distribution, we may use this algorithm with c = 2, $g(x) = \Phi\!\left(\lambda^T x\right)$ and $h(x) = \phi_n(x; 0, \Sigma)$.
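A minimal Python sketch of this rejection step for $SN_n(0, \Sigma, \lambda)$, with c = 2, $g(x) = \Phi(\lambda^T x)$ and $h(x) = \phi_n(x; 0, \Sigma)$, is given below; the function name and interface are illustrative.

import numpy as np
from scipy.stats import norm

def rejection_sample_sn(Sigma, lam, rng=None):
    # Draw one observation from SN_n(0, Sigma, lam) by rejection.
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)
    while True:
        u = rng.uniform()                          # Step 1: U ~ U(0, 1)
        y = L @ rng.standard_normal(len(lam))      # Step 2: Y ~ N_n(0, Sigma)
        if u <= norm.cdf(lam @ y):                 # Step 3: accept when U <= g(Y) = Phi(lam^T Y)
            return y                               # otherwise Step 4: repeat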
$$U \sim N_q\!\left(v, \Delta + D\Sigma D^T\right) \mid U \le 0.$$
$$Y = X + \varepsilon,$$
where $X \sim SN_n\!\left(0, \Sigma, \alpha 1_n^T, 0, 1\right)$, $\varepsilon \sim SN_n\!\left(0, \tau^2 I_n, \beta 1_n^T, 0, 1\right)$, $\tau > 0$, and X, ε are independent random vectors. Then applying Proposition A.5 of Appendix A yields $Y \sim CSN_{n,2}\!\left(0, \Sigma + \tau^2 I_n, D^{\circ}, 0, \Delta^{\circ}\right)$, where
$$D^{\circ} = \begin{pmatrix} \alpha 1_n^T \Sigma\left(\Sigma + \tau^2 I_n\right)^{-1} \\ \beta\tau^2 1_n^T\left(\Sigma + \tau^2 I_n\right)^{-1} \end{pmatrix}, \quad \Delta^{\circ} = \begin{pmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{pmatrix},$$
and
$$A_{11} = 1 + \alpha^2 1_n^T \Sigma 1_n - \alpha^2 1_n^T \Sigma\left(\Sigma + \tau^2 I_n\right)^{-1}\Sigma 1_n, \quad A_{22} = 1 + n\beta^2\tau^2 - \beta^2\tau^4 1_n^T\left(\Sigma + \tau^2 I_n\right)^{-1} 1_n,$$
$$A_{12} = -\alpha\beta\tau^2 1_n^T \Sigma\left(\Sigma + \tau^2 I_n\right)^{-1} 1_n.$$
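The sketch below computes Σ° = Σ + τ²I_n, D°, and Δ° from given Σ, τ, α, and β, following the expressions above; all inputs and the helper name are illustrative, not from the article.

import numpy as np

def marginal_csn_params(Sigma, tau, alpha, beta):
    # Parameters of the marginal CSN_{n,2}(0, Sigma + tau^2 I_n, D°, 0, Delta°) of Y = X + eps.
    n = Sigma.shape[0]
    ones = np.ones(n)
    Sc = Sigma + tau**2 * np.eye(n)
    A_inv = np.linalg.inv(Sc)
    D_circ = np.vstack([alpha * ones @ Sigma @ A_inv,
                        beta * tau**2 * ones @ A_inv])
    A11 = 1 + alpha**2 * ones @ Sigma @ ones - alpha**2 * ones @ Sigma @ A_inv @ Sigma @ ones
    A22 = 1 + n * beta**2 * tau**2 - beta**2 * tau**4 * ones @ A_inv @ ones
    A12 = -alpha * beta * tau**2 * ones @ Sigma @ A_inv @ ones
    Delta_circ = np.array([[A11, A12], [A12, A22]])
    return Sc, D_circ, Delta_circ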
Although the marginal distribution of the data Y is a multivariate closed skew normal distribution, the problem of finding confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ is not an easy task, since these parameters are embedded in the distribution's parameters, i.e., in $\Sigma + \tau^2 I_n$, $D^{\circ}$, and $\Delta^{\circ}$. So one may consider Bayesian intervals instead. For this purpose, a prior distribution on $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ must be assumed. Let $P\!\left(\tau, \sigma^2, \alpha, \beta, \lambda\right)$ be the prior that represents our belief about the distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$. Then the posterior distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ given Y satisfies
$$P\!\left(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y\right) \propto L\!\left(\tau, \sigma^2, \alpha, \beta, \lambda; Y\right) \times P\!\left(\tau, \sigma^2, \alpha, \beta, \lambda\right).$$
Again, we face another problem in finding the normalizing constant for the posterior distribution. Hence, a Markov chain Monte Carlo (MCMC) method should be called for. For example, one may use the Metropolis-Hastings algorithm (Christian and Casella, 2004). To find confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, a large sample from $P\!\left(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y\right)$ is needed. To do so, we propose an algorithm in Appendix C to find such confidence intervals. On the other hand, once the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ have been estimated, their estimates can be plugged into the variance formula $\mathrm{var}(f^* \mid Y, t^*)$ to get an estimate of $\mathrm{var}(f^* \mid Y, t^*)$. Furthermore, to obtain a 95% Bayesian confidence band for $E(f^* \mid y, t^*)$, we also need to use simulation. To proceed, let $\Theta = \left(\tau, \sigma^2, \alpha, \beta, \lambda\right)$ and $\Psi(\Theta, t^*) = E(f^* \mid y, t^*)$. A simultaneous 95% Bayesian confidence band for $\Psi(\Theta, t^*)$ is obtained by finding L and U such that $P_{\Theta\mid Y, t^*}\!\left(L < \Psi(\Theta, t^*) < U \text{ for all } t^*\right) = 0.95$, which is equivalent to solving the following equation for L and U:
$$P_{\Theta\mid Y, t^*}\!\left(L < \inf_{t^*}\Psi\!\left(\Theta, t^*\right) \le \sup_{t^*}\Psi\!\left(\Theta, t^*\right) < U\right) = 0.95. \qquad (8)$$
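As a hedged sketch of such an MCMC scheme, the following random-walk Metropolis-Hastings sampler targets the posterior of (τ, σ², α, β, λ), reusing marginal_csn_params from the earlier sketch; the squared-exponential kernel, the flat prior, and the proposal scale are illustrative assumptions, not the article's choices.

import numpy as np
from scipy.stats import multivariate_normal as mvn

def csn_loglik(theta, t, y):
    # Log-likelihood of Y ~ CSN_{n,2}(0, Sigma°, D°, 0, Delta°).
    tau, sig2, alpha, beta, lam = theta
    Sigma = sig2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / lam**2)
    Sc, Dc, Delta = marginal_csn_params(Sigma, tau, alpha, beta)
    log_c = np.log(mvn(np.zeros(2), Delta + Dc @ Sc @ Dc.T).cdf(np.zeros(2)))
    return (mvn(np.zeros(len(y)), Sc).logpdf(y)
            + np.log(mvn(np.zeros(2), Delta).cdf(Dc @ y)) - log_c)

def mh_sample(t, y, theta0, n_iter=5000, step=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    lp = csn_loglik(theta, t, y)                          # flat prior: posterior ∝ likelihood
    draws = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        if prop[0] > 0 and prop[1] > 0 and prop[4] > 0:   # tau, sigma^2, lambda must stay positive
            lp_prop = csn_loglik(prop, t, y)
            if np.log(rng.uniform()) < lp_prop - lp:      # Metropolis accept/reject
                theta, lp = prop, lp_prop
        draws.append(theta.copy())
    return np.array(draws)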
Figure 1. (a) GPR (G) and skew Gaussian (SG) predictors with parameters α = −0.05, β = −5, 0, 1, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = −0.01, β = −5, −1, 0, 5, and τ = 0.1.
Figure 2. (a) GPR (G) and SG predictors with parameters α = 0, β = −5, 0, 1, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 0.05, β = −5, 0, 2, 5, and τ = 0.1.
Figure 3. (a) GPR (G) and SG predictors with parameters α = 2, β = −5, −2, 0, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5, and τ = 0.1.
Figure 4. (a) GPR (G) and SG predictors with parameters α = −0.01, β = −5, 0, 1.5, 5, and τ = 1; (b) GPR (G) and SG predictors with parameters α = 0, β = 5, 0, 2, 5, and τ = 1.
Figure 5. (a) GPR (G) and SG predictors with parameters α = 0.5, β = −5, 0, 1.5, 5 and τ = 1.
Figure 6. (a) GPR (G) and SG predictors with parameters α = 1.5, β = −5, −1, 0, 5, and τ = 1.5; (b) GPR (G) and SG predictors with parameters α = 4, β = −5, 0, 2, 5, and τ = 1.5.
Figure 7. (a) GPR (G) and SG predictors with parameters α = −0.1, β = −5, 0, 2, 4, and τ = 2; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5, and τ = 2.
3. If a Gaussian process is used for the errors, i.e., β = 0, then there is no difference between the two distributions when α ≤ 0 and τ is small (see Figs. 1, 2, 3, and 4).
4. For fixed values of α and moderate values of τ, the difference between the two distributions is very clear and seems to be an increasing function of |β| (see Figs. 4 and 5).
5. For fixed values of α and large values of τ, there is a huge difference between the two distributions (see Fig. 8).
GPR. These advantages encourage us to continue this work in the future. We highlight some of these possible works below.
1. Studying the effect of the choice of the covariance function on the skew Gaussian
process predictor.
2. Developing methods for estimating the hyper-parameters of the model.
3. Prediction at several inputs.
4. Defining more robust models by using more general distributions either on the input function f(t) or on the error term. For such future work, one may utilize the work of Lachos et al. (2010a) and Da Silva-Ferreira et al. (2011) by assuming that either the input function or the error term follows a random process whose finite dimensional distributions are scale mixtures of skew normal (SMSN) distributions, defined by
$$P(y) = 2\int \varphi_n\!\left(y; \mu, c(u)\Sigma\right)\,\Phi\!\left(c^{-\frac{1}{2}}(u)\,\lambda^T\Sigma^{-\frac{1}{2}}(y - \mu)\right) dH(u).$$
Lachos et al. (2010b) showed that the SMSN family includes several known families such as the skew-t, skew-slash, and skew-Cauchy families. This opens the way for further research on more robust models.
Although the process whose finite dimensional distributions are of SMSN type is very general, we face several computational challenges when finding the estimates of the hyperparameters. These challenges are due to the integration in the pdf of $Y \sim SMSN_n(\mu, \Sigma, \lambda, H)$. So, instead of conducting the numerical calculations, it could be easier to use an intensive statistical computing algorithm to calculate the integral in the pdf P(y). Since intensive computing requires large samples from the pdf of Y, we may utilize the stochastic representation $Y = \mu + c^{\frac{1}{2}}(U)\, Z$ for such simulation purposes.
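A minimal sketch of this simulation route is given below, reusing rejection_sample_sn from the earlier sketch; the choice c(u) = 1/u with U ~ Gamma(ν/2, ν/2), which yields a skew-t member of the SMSN family, is an illustrative assumption.

import numpy as np

def sample_smsn_skew_t(mu, Sigma, lam, nu, rng=None):
    # One draw of Y = mu + c^{1/2}(U) Z with Z ~ SN_n(0, Sigma, lam) and c(U) = 1/U.
    rng = np.random.default_rng() if rng is None else rng
    u = rng.gamma(shape=nu / 2.0, scale=2.0 / nu)   # U ~ Gamma(nu/2, rate nu/2)
    z = rejection_sample_sn(Sigma, lam, rng)        # Z ~ SN_n(0, Sigma, lam)
    return mu + z / np.sqrt(u)                      # c^{1/2}(U) Z = Z / sqrt(U)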
Azzalini and Capitanio (1999) have pointed out that the MLE for the skewness parameter of the multivariate skew normal distribution may diverge with positive probability. They also noticed that the Fisher information matrix is singular when the skewness parameter is zero. For the multivariate closed skew normal distribution, these issues have been considered in only a few papers. Here we refer to the work of Arellano-Valle et al. (2005). They used the skew normal distribution to model both the random effects and the error terms in the linear mixed effects model. They also showed that the response data vector has a multivariate closed skew normal distribution. Furthermore, they derived and implemented an EM algorithm to find the MLEs of all parameters. According to the literature, it can be noticed that the above issues concerning the MLEs of the closed skew normal distribution parameters are still not explored enough. Hence, we believe that further research should be conducted. For example, estimation of the SGP model parameters using the penalized maximum likelihood method could be called for. We leave these issues to a separate article.
References
Allard, D., Naveau, P. (2007). A new spatial skew-normal random field model. Commun. Statist.
Theory. Meth. 36:1821–1834.
Alodat, M. T., Aludaat, K. M. (2007). A skew Gaussian process. Pak. J. Statist. 23:89–97.
Alodat, M. T., Al-Rawwash, M. Y. (2009). Skew Gaussian random field. J. Computat. Appl. Math. 232(2):496–504.
Arellano-Valle, R. B., Bolfarine, H., Lachos, V. H. (2005). Skew-normal linear mixed models. J. Data Sci. 3:415–438.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scand. J. Statist. 12:171–178.
Azzalini, A. (1986). Further results on a class of distributions which includes the normal ones. Statistica 46:199–208.
Azzalini, A., Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. J. Roy. Statist. Soc. Ser. B 61:579–602.
Azzalini, A., Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika
83:715–726.
Brahim-Belhouari, S., Bermak, A. (2004). Gaussian process for non-stationary time series prediction.
Computat. Statist. Data Anal. 47:705–712.
Buccianti, A. (2005). Meaning of the λ parameter of skew-normal and log-skew normal distributions in fluid geochemistry. CODAWORK'05.
Christian, P. R., Casella, G. (2004). Monte Carlo Statistical Methods. New York: Springer.
Da Silva-Ferreira, C., Bolfarine, H., Lachos, V. (2011). Skew-scale mixture of skew-normal distribu-
tions. Statist. Methodol. 8:154–181.
Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with diverging number of Parameters.
Ann. Statist. 32:928–961.
Fyfe, C., Leen, G., Lai, P. L. (2008). Gaussian processes for canonical correlation analysis. Neuro
Comput. 71:3077–3088.
Genton, M. (2004). Skew-Elliptical Distributions and Their Applications: A Journey Beyond Nor-
mality. Boca Raton, FL: Chapman & Hall/CRC.
Girard, A., Kocijan, J., Murray-Smith, R., Rasmussen, C. E. (2004). Gaussian process model based
predictive control. Proc. Amer. Control Conf . Boston.
Girard, A., Rasmussen, C. E., Murray-Smith, R. M. (2002). Gaussian Process priors with uncer-
tain Inputs: Multiple-Step-Ahead Prediction. Technical Report TR-2002-119, Department of
computing Science, University of Glasgow.
González-Farías, G., Domínguez-Molina, J., Gupta, A. (2004). Additive properties of skew normal random vectors. J. Statist. Plan. Infer. 126:521–534.
Kuss, M. (2006). Gaussian process models for robust regression, classification, and reinforcement
learning. Ph.D. thesis, Technische Universität Darmstadt.
Lachos, V., Labra, F., Bolfarine, H., Gosh, H. (2010a). Multivariate measurements error models based
on scale mixtures of the skew-normal distribution. Statistics 44:541–556.
Lachos, V. H., Ghosh, P., Arellano-Valle, R. B. (2010b). Likelihood based inference for skew-normal
independent linear mixed models. Statistica Sinica 20:303–322.
Macke, J. H., Gerwinn, S., White, L. E., Kaschube, M., Bethge, M. (2010). Gaussian process methods for estimating cortical maps.
Neal, R. M. (1995). Bayesian learning for neural networks. Ph.D. thesis, Dept. of Computer Science, University of Toronto.
O'Hagan, A. (1978). On curve fitting and optimal design for prediction. J. Roy. Statist. Soc. Ser. B 40:1–42.
Rasmussen, C. E., Williams, C. (2006). Gaussian Processes for Machine Learning. Cambridge, MA:
MIT press.
Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and other methods for non-linear
regression, Ph.D. thesis, Dept. of Computer Science, University of Toronto.
Skew Gaussian Process 4953
Schmidt, A. M., Conceição, M. F., Moreira, G. A. (2008). Investigating the sensitivity of Gaussian processes to the choice of their correlation functions and prior specifications. J. Statist. Computat. Simul. 78(8):681–699.
Schott, J. R. (1997). Matrix Analysis for Statistics. New York: Wiley-Interscience.
Vanhatalo, J., Jylänki, P., Vehtari, A. (2009). Gaussian process regression with Student-t likelihood.
In: Bengio Y., Schuurmans D., Lafferty J., Williams C. K. I., Culotta A. Eds. Advances in Neural
Information Processing Systems 22:1910–1918.
Williams, C. K. I., Rasmussen, C. E. (1996) Gaussian processes for regression. Adv. Neur. Inform.
Process. Syst. 8:514–520.
Zhang, H., El-Shaarawi, A. (2009). On spatial skew Gaussian process applications. Environmetrics
10:982.
Appendix
$$p^+ = \sum_{i=1}^{n} p_i, \quad q^+ = \sum_{i=1}^{n} q_i, \quad \mu^+ = \left(\mu_1^T, \dots, \mu_n^T\right)^T, \quad \Sigma^+ = \oplus_{i=1}^{n}\Sigma_i,$$
$$D^+ = \oplus_{i=1}^{n} D_i, \quad v^+ = \left(v_1^T, \dots, v_n^T\right)^T, \quad \Delta^+ = \oplus_{i=1}^{n}\Delta_i,$$
and
$$A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.$$
Proposition A.3. If $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, then for two subvectors $Y_1$ and $Y_2$, where $Y^T = \left(Y_1^T, Y_2^T\right)$, $Y_1$ is k-dimensional, $1 \le k \le p$, and $\mu$, $\Sigma$, $D$ are partitioned as follows:
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad D = \left(D_1 \;\; D_2\right),$$
with $\mu_1$ of size $k\times 1$, $\Sigma_{11}$ of size $k\times k$, and $D_1$ of size $q\times k$, then the conditional distribution of $Y_2$ given $Y_1 = y_1$ is
$$Y_2 \mid Y_1 = y_1 \sim CSN_{p-k,\,q}\!\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_1 - \mu_1),\; \Sigma_{22.1},\; D_2,\; v - D^*(y_1 - \mu_1),\; \Delta\right),$$
where
$$D^* = D_1 + D_2\Sigma_{21}\Sigma_{11}^{-1}$$
and
$$\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$$
Proposition A.4. If $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, then the moment generating function of Y is:
$$M_Y(s) = \frac{\Phi_q\!\left(D\Sigma s;\, v,\, \Delta + D\Sigma D^T\right)}{\Phi_q\!\left(0;\, v,\, \Delta + D\Sigma D^T\right)}\, e^{s^T\mu + \frac{1}{2} s^T \Sigma s}, \quad s \in \mathbb{R}^p.$$
Proposition A.5. If $Y_1$ and $Y_2$ are independent vectors such that $Y_i \sim CSN_{p,q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$, $i = 1, 2$, then $Y_1 + Y_2 \sim CSN_{p,\,q_1+q_2}\!\left(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2, D^{\circ}, v^{\circ}, \Delta^{\circ}\right)$, where
$$D^{\circ} = \begin{pmatrix} D_1\Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1} \\ D_2\Sigma_2\left(\Sigma_1 + \Sigma_2\right)^{-1} \end{pmatrix}, \quad \Delta^{\circ} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
and
$$A_{11} = \Delta_1 + D_1\Sigma_1 D_1^T - D_1\Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1}\Sigma_1 D_1^T, \quad A_{22} = \Delta_2 + D_2\Sigma_2 D_2^T - D_2\Sigma_2\left(\Sigma_1 + \Sigma_2\right)^{-1}\Sigma_2 D_2^T,$$
$$A_{12} = -D_1\Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1}\Sigma_2 D_2^T, \quad v^{\circ} = \left(v_1^T, v_2^T\right)^T.$$
where $1_{n+1}$ denotes the column of ones of size (n + 1), and $I_n$ is the identity matrix of size $n \times n$. Since $\left(f^T, f^*\right)^T$ is independent of $\varepsilon(t)$, then by Proposition A.1 we have that
$$\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim CSN_{2n+1,\,2}\!\left(\mu^+, \Sigma^+, D^+, v^+, \Delta^+\right),$$
where
$$\mu^+ = \left(0^T_{1\times n}, 0, 0^T_{1\times n}\right)^T, \quad v^+ = (0, 0)^T, \quad \Delta^+ = I_2,$$
The first step to find the conditional distribution of f* given Y and t* is to find the joint pdf of f* and Y. To proceed, we write $\left(Y^T, f^*\right)^T$ as a linear combination of $\left(f^T, f^*, \varepsilon^T\right)^T$, i.e.,
$$\begin{pmatrix} Y \\ f^* \end{pmatrix} = \begin{pmatrix} f + \varepsilon \\ f^* \end{pmatrix} = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0^T_{n\times 1} & 1 & 0^T_{n\times 1} \end{pmatrix}\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix}.$$
To simplify the notation, let $A_{(n+1)\times(2n+1)} = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0^T_{n\times 1} & 1 & 0^T_{n\times 1} \end{pmatrix}$. It is straightforward to check that the matrix A is of rank (n + 1). Now, we are ready to apply Proposition A.2.
Hence,
$$\begin{pmatrix} Y \\ f^* \end{pmatrix} = A\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim CSN_{n+1,\,2}\!\left(\mu_A, \Sigma_A, D_A, v^+, \Delta_A\right),$$
where
$$\mu_A = A\mu^+ = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0^T_{n\times 1} & 1 & 0^T_{n\times 1} \end{pmatrix}\begin{pmatrix} 0_{n\times 1} \\ 0 \\ 0_{n\times 1} \end{pmatrix} = 0_{(n+1)\times 1},$$
$$\Sigma_A = A\Sigma^+ A^T = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0^T_{n\times 1} & 1 & 0^T_{n\times 1} \end{pmatrix}\begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & 0_{(n+1)\times n} \\ 0^T_{(n+1)\times n} & \tau^2 I_n \end{pmatrix}\begin{pmatrix} I_n & 0_{n\times 1} \\ 0^T_{n\times 1} & 1 \\ I_n & 0_{n\times 1} \end{pmatrix} = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix}_{(n+1)\times(n+1)}.$$
To proceed, we need to apply the following matrix identity, which can be found in Schott (1997). Let A be a matrix which is partitioned as follows:
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$
$$D_A = D^+\Sigma^+ A^T\Sigma_A^{-1} = \begin{pmatrix} \alpha 1^T_{n+1} & 0^T_{1\times n} \\ 0^T_{1\times(n+1)} & \beta 1^T_{n} \end{pmatrix}\begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & 0_{(n+1)\times n} \\ 0^T_{(n+1)\times n} & \tau^2 I_n \end{pmatrix}\begin{pmatrix} I_n & 0_{n\times 1} \\ 0^T_{n\times 1} & 1 \\ I_n & 0_{n\times 1} \end{pmatrix}\Sigma_A^{-1}$$
$$= \begin{pmatrix} \alpha 1^T_{n+1}(\Sigma, k)^T & \alpha 1^T_{n+1}\left(k^T, k^*\right)^T \\ \beta\tau^2 1^T_n & 0 \end{pmatrix}\Sigma_A^{-1} = \begin{pmatrix} D_{11(1\times n)} & D_{12} \\ D_{21(1\times n)} & D_{22} \end{pmatrix}_{2\times(n+1)},$$
where
$$D_{11} = \alpha 1^T_{n+1}(\Sigma, k)^T\left(\Sigma + \tau^2 I_n - k k^{*-1} k^T\right)^{-1} - \alpha 1^T_{n+1}\left(k^T, k^*\right)^T k^{*-1} k^T\left(\Sigma + \tau^2 I_n - k k^{*-1} k^T\right)^{-1},$$
$$D_{12} = -\alpha 1^T_{n+1}(\Sigma, k)^T\left(\Sigma + \tau^2 I_n\right)^{-1} k\left(k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k\right)^{-1} + \alpha 1^T_{n+1}\left(k^T, k^*\right)^T\left(k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k\right)^{-1},$$
$$D_{21} = \beta\tau^2 1^T_n\left(\Sigma + \tau^2 I_n - k k^{*-1} k^T\right)^{-1},$$
and
$$D_{22} = -\beta\tau^2 1^T_n\left(\Sigma + \tau^2 I_n\right)^{-1} k\left(k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k\right)^{-1},$$
and
$$A\Sigma^+ D^{+T} = \begin{pmatrix} \alpha(\Sigma, k) 1_{n+1} & \beta\tau^2 1_n \\ \alpha\left(k^T, k^*\right) 1_{n+1} & 0 \end{pmatrix}_{(n+1)\times 2}.$$
where
$$W_{11} = \left[\alpha 1^T_{n+1}(\Sigma, k)^T\left(\Sigma + \tau^2 I_n - k k^{*-1} k^T\right)^{-1} - \alpha 1^T_{n+1}\left(k^T, k^*\right)^T k^{*-1} k^T\left(\Sigma + \tau^2 I_n - k k^{*-1} k^T\right)^{-1}\right]\alpha(\Sigma, k) 1_{n+1}$$
and
$$W_{22} = n\beta^2\tau^4\, 1^T_n\left(\Sigma + \tau^2 I_n - k k^{*-1} k^T\right)^{-1} 1_n,$$
and
$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \left(D_1 \;\; D_2\right),$$
where
$$D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2\times n} \quad \text{and} \quad D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2\times 1},$$
where
$$D^* = D_1 + D_2 k^T\left(\Sigma + \tau^2 I_n\right)^{-1}.$$
Proof of (ii). Here, we have to find the mean and the variance of $f^* \mid Y, t^*$ by applying Proposition A.4. To complete this mission, we find the moment generating function of $f^* \mid Y, t^*$, which is equal to
$$M_{f^*\mid Y, t^*}(s) = \frac{\Phi_2\!\left(D_2\sigma^{*2} s;\, -D^*Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\right)}{\Phi_2\!\left(0_{2\times 1};\, -D^*Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\right)}\, e^{s\mu^* + \frac{1}{2}\sigma^{*2} s^2}, \quad s \in \mathbb{R},$$
where $\sigma^{*2} = k^* - k^T\left(\Sigma + \tau^2 I_n\right)^{-1} k$ and $\mu^* = k^T\left(\Sigma + \tau^2 I_n\right)^{-1} Y$. Let $\Phi_2^{(j)}(\cdot, \cdot)$ denote the first partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the jth component for j = 1, 2. Also, let $\Phi_2^{(ij)}(\cdot, \cdot)$ denote the mixed second partial derivative of $\Phi_2(\cdot, \cdot)$. Now we find the mean and the variance of $f^* \mid Y, t^*$:
$$E(f^*\mid Y, t^*) = \frac{\partial}{\partial s} M_{f^*\mid Y, t^*}(s)\Big|_{s=0} = \frac{\partial}{\partial s}\left[\frac{\Phi_2\!\left(D_2\sigma^{*2}s;\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)}{\Phi_2\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)}\, e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\right]\Bigg|_{s=0}$$
$$= \frac{1}{\Phi_2\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)}\Big[\Phi_2^{(1)}\!\left(D_2\sigma^{*2}s;\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2}$$
$$\quad + \Phi_2^{(2)}\!\left(D_2\sigma^{*2}s;\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{22}\sigma^{*2} + \Phi_2\!\left(D_2\sigma^{*2}s;\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\left(\mu^* + \sigma^{*2}s\right)\Big]e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\Big|_{s=0}.$$
Finally, we find that
$$E(f^*\mid y, t^*) = \mu^* + \sigma^{*2}\left[D_{12}\,\Phi_2^{(1)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right) + D_{22}\,\Phi_2^{(2)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\right],$$
$$E\!\left(f^{*2}\mid Y, t^*\right) = \frac{1}{\Phi_2\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)}\,\frac{\partial^2}{\partial s^2}\left[\Phi_2\!\left(\begin{pmatrix} D_{12}\sigma^{*2}s \\ D_{22}\sigma^{*2}s \end{pmatrix};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right) e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\right]\Bigg|_{s=0}$$
$$= \frac{1}{\Phi_2\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)}\,\frac{\partial}{\partial s}\Big[\Phi_2^{(1)}\!\left(D_2\sigma^{*2}s;\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2} + \cdots\Big]\Big|_{s=0}.$$
After substituting s = 0, $E\!\left(f^{*2}\mid Y, t^*\right)$ reduces to
$$\begin{aligned} E\!\left(f^{*2}\mid Y, t^*\right) = {} & 2\Big[\Phi_2^{(11)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\left(D_{12}\sigma^{*2}\right)^2 + \Phi_2^{(12)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2}D_{22}\sigma^{*2} \\ & {}+ \Phi_2^{(21)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2}D_{22}\sigma^{*2} + \Phi_2^{(22)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\left(D_{22}\sigma^{*2}\right)^2\Big] \\ & {}+ 4\Big[\Phi_2^{(1)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2} + \Phi_2^{(2)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{22}\sigma^{*2}\Big]\mu^* + \mu^{*2} + \sigma^{*2}. \end{aligned}$$
Hence,
$$\begin{aligned} \mathrm{var}\!\left(f^*\mid Y, t^*\right) = {} & E\!\left(f^{*2}\mid Y, t^*\right) - \left[E\!\left(f^*\mid Y, t^*\right)\right]^2 \\ = {} & 2\Big[\Phi_2^{(11)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\left(D_{12}\sigma^{*2}\right)^2 + \Phi_2^{(12)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2}D_{22}\sigma^{*2} \\ & {}+ \Phi_2^{(21)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2}D_{22}\sigma^{*2} + \Phi_2^{(22)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\left(D_{22}\sigma^{*2}\right)^2\Big] \\ & {}+ 4\Big[\Phi_2^{(1)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{12}\sigma^{*2} + \Phi_2^{(2)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)D_{22}\sigma^{*2}\Big]\mu^* + \mu^{*2} + \sigma^{*2} \\ & {}- \left(\mu^* + \sigma^{*2}\left[D_{12}\,\Phi_2^{(1)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right) + D_{22}\,\Phi_2^{(2)}\!\left(0_{2\times1};\, -D^*Y,\, \Delta_A + \sigma^{*2}D_2D_2^T\right)\right]\right)^2, \end{aligned}$$
where $\Phi_2^{(11)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the first component, $\Phi_2^{(12)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the second component, $\Phi_2^{(21)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the first component, and $\Phi_2^{(22)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the second component.
Since $\Sigma + \tau^2 I_n$ is circulant and $1_n$ is an eigenvector of any circulant matrix, then
$$\left(\Sigma + \tau^2 I_n\right)^{-1} 1_n = \frac{1}{L_n} 1_n,$$
where
$$L_n = \tau^2 + \sum_{i=1}^{n} k(t_1, t_i).$$
Hence,
$$b\!\left(\tau^2, \beta^2, n\right) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0 n}{\sqrt{1 + \beta^2\tau^2 n}\; L_n}.$$
It is easy to see that $b\!\left(\tau^2, \beta^2, n\right) > 0$ for all nonzero values of $\tau$, $\beta$, and $k_0$. If $\sum_{i=1}^{n} k(t_1, t_i) = n^{-0.5} O(n)$, where $n^{-0.5} O(n) \to c \ne 0$, then
$$b\!\left(\tau^2, \beta^2, n\right) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0 n}{\sqrt{1 + \beta^2\tau^2 n}\left(\tau^2 + n^{-0.5} O(n)\right)} = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0\sqrt{n}}{\sqrt{1 + \beta^2\tau^2 n}\left(\frac{\tau^2}{\sqrt{n}} + \frac{O(n)}{n}\right)}.$$
Hence,
$$\lim_{n\to\infty} b\!\left(\tau^2, \beta^2, n\right) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0}{\sqrt{\beta^2\tau^2}\, c} = \sqrt{\frac{2}{\pi}}\,\frac{\tau\beta k_0}{c\,|\beta|}.$$
To show that $\lim_{\tau\to\infty} b\!\left(\tau^2, \beta^2, n\right) = 0$, we notice that $\left(\Sigma + \tau^2 I_n\right)^{-1} 1_n = \frac{1}{L_n} 1_n$. Consequently, we find that
$$b\!\left(\tau^2, \beta^2, n\right) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0 n}{\sqrt{1 + \beta^2\tau^2 n}\; L_n} = \sqrt{\frac{2}{\pi}}\,\frac{n\beta k_0\tau^2}{\left(\tau^2 + \sum_{i=1}^{n} k(t_1, t_i)\right)\sqrt{1 + \beta^2\tau^2 n}}.$$
Hence $\lim_{\tau\to\infty} b\!\left(\tau^2, \beta^2, n\right) = 0$.
1. For each $t_j^*$, $j = 1, \dots, M$, simulate a large sample from the posterior, say $\Theta_1, \dots, \Theta_N$.
2. Find $T_i = \min_{j=1}^{M}\left\{\Psi\!\left(\Theta_i, t_j^*\right)\right\}$ and $U_i = \max_{j=1}^{M}\left\{\Psi\!\left(\Theta_i, t_j^*\right)\right\}$.
3. Set $l = -L_0$ and $u = -U_0$, and estimate p, the probability in (8), via $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} I\!\left(l \le T_i \text{ and } U_i \le u\right)$.
4. Do while $|\hat{p} - 0.95| > \omega$ (searching for the solution over the grid $(-L_0, L_1) \times (-U_0, U_1)$ with step sizes $d_1$ and $d_2$): update l and u via $l \leftarrow l + d_1$ and $u \leftarrow u + d_2$, and re-estimate $\hat{p}$ as in Step 3.
   End Do
5. Output: L = l and U = u.
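A rough Python sketch of this search, assuming posterior draws Θ_1, ..., Θ_N and a function psi(theta, t) for Ψ(Θ, t*) are already available, is given below; the grid limits, step size, and tolerance are illustrative.

import numpy as np

def simultaneous_band(psi, theta_draws, t_grid, L0=5.0, U0=5.0, d=0.05, target=0.95, tol=0.005):
    # Grid search for (l, u) such that the estimate of the probability in (8) is close to target.
    vals = np.array([[psi(th, tj) for tj in t_grid] for th in theta_draws])
    T = vals.min(axis=1)                             # T_i = min_j Psi(Theta_i, t_j*)
    U = vals.max(axis=1)                             # U_i = max_j Psi(Theta_i, t_j*)
    for l in np.arange(-L0, L0, d):
        for u in np.arange(-U0, U0, d):
            p_hat = np.mean((l < T) & (U < u))       # estimate of the probability in (8)
            if abs(p_hat - target) <= tol:
                return l, u, p_hat
    return None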