
Acoust. Sci. & Tech. 41, 2 (2020)  © 2020 The Acoustical Society of Japan

INVITED REVIEW

An introduction of Gaussian processes and deep Gaussian processes and their applications to speech processing

Tomoki Koriyama
Graduate School of Information Science and Technology, The University of Tokyo,
7–3–1 Hongo, Bunkyo-ku, Tokyo, 113–8656 Japan

Abstract: A Gaussian process (GP) is a distribution over functions that can be used as a machine learning framework. GP regression combines the characteristics of Bayesian models, which can predict the uncertainty of outputs, and of kernel methods, which enable nonlinear functions to be expressed with a small number of parameters. In this paper, we first describe the basics of GP regression and then introduce notable recent advances in GPs. Specifically, we focus on the stochastic variational GP, an approximation method applicable to huge amounts of training data, and explain a GP-based deep architecture model called the deep Gaussian process. Since GP regression is a general-purpose machine learning framework, it has many applications. In this paper, we introduce GP-based applications to speech information processing, including speech synthesis.

Keywords: Gaussian process regression, Kernel methods, Deep learning, Bayesian model, Speech processing

PACS number: 43.10.Ln [doi:10.1250/ast.41.457]

E-mail: tomoki koriyama@ipc.i.u-tokyo.ac.jp

1. INTRODUCTION

Recently, deep learning, which uses functions defined by deep neural networks (DNNs), has become popular and has driven the current boom in artificial intelligence (AI). The Gaussian process (GP) is another machine learning framework for predicting functions [1]. In analyses of the behavior of DNNs, GPs have attracted attention because a GP is related to a DNN with an infinite number of hidden units [2,3]. In this paper, we introduce the basics of machine learning using Gaussian processes and their application to speech information processing. In general, speech information processing uses a large amount of training data, and the relationship between input and output features is complicated. Therefore, we explain an approximation method of GP regression that is available for an arbitrary amount of training data, and describe the deep Gaussian process for expressing complicated functions.

2. MACHINE LEARNING AND BAYESIAN TRAINING

We briefly introduce a general machine learning framework in this section. In typical machine learning frameworks, we predict a function $f(\mathbf{x})$ that expresses the relationship among variables using training data. For example, in the regression task in which we predict a scalar value $y \in \mathbb{R}$ from a $D$-dimensional vector $\mathbf{x} \in \mathbb{R}^D$, it is assumed that the relationship is represented using $f: \mathbb{R}^D \to \mathbb{R}$ by

$y = f(\mathbf{x}) + \epsilon$  (1)

where $\epsilon$ is an observation noise. In the case of binary classification, where the output variable is yes ($y = 1$) or no ($y = 0$), $y$ can be predicted as follows:

$y = \begin{cases} 1 & (f(\mathbf{x}) \geq 0) \\ 0 & (f(\mathbf{x}) < 0). \end{cases}$  (2)

Since the use of a neural network (NN) as the function has become popular in recent years, we take an NN-based framework as an example to explain machine learning. When we use a neural network with one hidden layer and $D_H$ hidden units, the function is expressed by the following equation:

$f(\mathbf{x}) = b^{(2)} + \mathbf{w}^{(2)\top} \phi(\mathbf{b}^{(1)} + \mathbf{W}^{(1)\top} \mathbf{x})$  (3)

where $\mathbf{w}^{(2)} \in \mathbb{R}^{D_H}$ and $\mathbf{W}^{(1)} \in \mathbb{R}^{D \times D_H}$ are weight parameters, and $b^{(2)} \in \mathbb{R}$ and $\mathbf{b}^{(1)} \in \mathbb{R}^{D_H}$ are bias parameters. We represent the parameter set as $\theta = (\mathbf{w}^{(2)}, \mathbf{W}^{(1)}, b^{(2)}, \mathbf{b}^{(1)})$. $\phi$ is an element-wise activation function, for which we typically use a hyperbolic tangent (tanh) or rectified linear unit (ReLU) function.
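As a concrete (and purely hypothetical) illustration of (3), the following NumPy sketch evaluates a one-hidden-layer network with a tanh activation; the dimensions and the random parameter values are placeholders, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_H = 5, 16  # input and hidden dimensions (illustrative)

# Parameter set theta = (w2, W1, b2, b1), randomly drawn here for illustration only
W1 = rng.normal(size=(D, D_H))   # W^(1) in R^{D x D_H}
b1 = rng.normal(size=D_H)        # b^(1) in R^{D_H}
w2 = rng.normal(size=D_H)        # w^(2) in R^{D_H}
b2 = rng.normal()                # b^(2) in R

def f(x):
    """One-hidden-layer network f(x) = b2 + w2^T tanh(b1 + W1^T x), cf. (3)."""
    return b2 + w2 @ np.tanh(b1 + W1.T @ x)

x = rng.normal(size=D)
print(f(x))
```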
Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^\top \in \mathbb{R}^{N \times D}$ and $\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$ be input vectors and output variables, respectively. Here, we consider predicting a function $f(\cdot)$ from
the training data $\mathcal{D} = (\mathbf{X}, \mathbf{y})$. A widely used method is a point estimate of the model parameters $\theta$ using the framework of maximum likelihood or maximum a posteriori estimation. After training the parameters, we use the function $\hat{f}$ obtained from the optimal parameters $\hat{\theta}$ to predict the output $\hat{f}(\mathbf{x}^*)$ given an unknown input vector $\mathbf{x}^*$.

Bayesian learning is another approach to predicting the output from an unknown input, in which the posterior distributions of parameters are inferred. Based on Bayes' theorem, the posterior $p(\theta \mid \mathcal{D})$ is calculated as follows:

$p(\theta \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$  (4)

where $p(\theta)$ is a prior, which assumes how the parameters are distributed, and $p(\mathcal{D} \mid \theta)$ is a likelihood, which represents the goodness of fit of the parameters to the data. $p(\mathcal{D})$ is a marginal distribution, which is independent of the parameters.

The important point of Bayesian learning is that we have distributions of parameters. Since the function is uniquely determined from the parameters, we can obtain the prior and posterior distributions of the function from those of the parameters. We show an example of the distribution of functions sampled from the parameter distribution of a 1-hidden-layer neural network in Fig. 1(a). The method that infers the parameter distribution of a neural network is referred to as a Bayesian neural network.

Fig. 1 Example of sampled functions from (a) a Bayesian neural network (1 hidden layer, 200 hidden units, and tanh activation function) and (b) a Gaussian process (EQ kernel).

The purpose of Bayesian inference is to obtain the following predictive distribution by integrating out the posterior distribution of the function:

$p(y^* \mid \mathbf{x}^*) = \displaystyle\int p(f \mid \mathcal{D})\, p(y^* \mid f(\mathbf{x}^*))\, df$  (5)

Note that the predictive distribution does not depend on specific parameters but on the model itself.

Bayesian model selection is another property of Bayesian learning. In Bayesian model selection, the performance of a model, which is independent of its parameters, can be evaluated using the marginal likelihood (also called model evidence) given by

$p(\mathcal{D}) = \displaystyle\int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta.$  (6)

We can choose an appropriate model for given data by comparing marginal likelihoods.

The issue of Bayesian learning is that the calculations of predictive distributions and marginal likelihoods tend to be hard. Moreover, to derive closed forms of predictive distributions, the definitions of priors are often restricted.

3. GAUSSIAN PROCESS REGRESSION

3.1. Gaussian Process

In machine learning based on Gaussian processes, we assume a distribution over functions by considering the statistics of function values, instead of distributions over parameters. Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^\top$ and $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N)]^\top$ denote an input variable sequence and a function variable sequence, respectively, and we assume that the joint distribution of $\mathbf{f}$ becomes a multivariate Gaussian distribution. If the joint distribution of $\mathbf{f}$ becomes a multivariate Gaussian distribution for an arbitrary input sequence, the distribution over the function is referred to as a Gaussian process, which is expressed as follows:

$f \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)).$  (7)

A Gaussian process is formulated by a mean function $m(\cdot)$ and a kernel (covariance) function $k(\cdot, \cdot)$. The mean function represents that the mean of $f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots$ becomes $m(\mathbf{x})$ for any input $\mathbf{x}$ when the sampled functions are $f_1, f_2, \ldots$. Similarly, the variance of $(f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots)$ corresponds to $k(\mathbf{x}, \mathbf{x})$, and the covariance between $(f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots)$ and $(f_1(\mathbf{x}'), f_2(\mathbf{x}'), \ldots)$ is defined by $k(\mathbf{x}, \mathbf{x}')$. In practice, the constant function $m(\mathbf{x}) = 0$, called the zero-mean function, is generally used as the mean function. Figure 1(b) shows an example of functions sampled from a Gaussian process using an exponential quadratic (EQ) kernel (see Sect. 3.3) as the kernel function. With the EQ kernel, the covariance decreases with the distance between inputs, which gives the sampled functions the property that "if the inputs are similar, the outputs become similar."

The joint distribution of $\mathbf{f}$ using the zero-mean function is expressed as follows:

$p(\mathbf{f}) = \mathcal{N}(\mathbf{f}; \mathbf{0}, \mathbf{K}_N)$  (8)

$\mathbf{K}_N = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$  (9)

where $\mathbf{0}$ denotes a zero vector whose elements are all zero, and $\mathbf{K}_N$ is referred to as a Gram matrix.
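To illustrate (8) and (9), the following sketch draws sample functions from a zero-mean GP prior with an EQ kernel on a one-dimensional input grid, in the spirit of Fig. 1(b); the grid, the number of samples, and the small jitter added for numerical stability are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def eq_kernel(X1, X2):
    """Exponential quadratic kernel k(x, x') = exp(-||x - x'||^2 / 2)."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 100)[:, None]               # N x 1 input grid
K = eq_kernel(X, X)                                 # Gram matrix K_N, cf. (9)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))   # small jitter for numerical stability
samples = L @ rng.normal(size=(len(X), 3))          # three draws from N(0, K_N), cf. (8)
```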
It is known that Gaussian processes are closely related to neural networks [3,4]. Specifically, if a neural network has an infinite number of hidden units and its weight parameters are distributed independently and identically according to Gaussian distributions, the function represented by the neural network becomes a sample of a Gaussian process.

3.2. Gaussian Process Regression

We derive the predictive distribution given unknown inputs from the training data $\mathcal{D} = (\mathbf{X}, \mathbf{y})$. Let $\mathbf{X}_T = [\mathbf{x}_1, \ldots, \mathbf{x}_T]^\top \in \mathbb{R}^{T \times D}$ and $\mathbf{f}_T = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_T)]^\top \in \mathbb{R}^T$ be the sequences of unknown inputs and latent function variables, respectively. From the definition of a Gaussian process, the joint distribution of $\mathbf{f}_T$ and $\mathbf{f}$ is expressed by the following Gaussian distribution:

$p(\mathbf{f}, \mathbf{f}_T \mid \mathbf{X}_T, \mathcal{D}) = \mathcal{N}\!\left(\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_T \end{bmatrix}; \mathbf{0}, \begin{bmatrix} \mathbf{K}_N & \mathbf{K}_{NT} \\ \mathbf{K}_{TN} & \mathbf{K}_T \end{bmatrix}\right)$  (10)

where $\mathbf{K}_{TN} = \mathbf{K}_{NT}^\top$ is the Gram matrix between $\mathbf{X}_T$ and $\mathbf{X}$, and $\mathbf{K}_T$ is the Gram matrix obtained from $\mathbf{X}_T$. We adopt the following property of the Gaussian distribution: when the joint distribution is Gaussian, the conditional distribution is also Gaussian. Then, we obtain the predictive distribution as follows:

$p(\mathbf{f}_T \mid \mathbf{X}_T, \mathcal{D}) = \mathcal{N}(\mathbf{f}_T; \boldsymbol{\mu}_T, \boldsymbol{\Sigma}_T)$  (11)

$\boldsymbol{\mu}_T = \mathbf{A} \mathbf{y}$  (12)

$\boldsymbol{\Sigma}_T = \mathbf{K}_T - \mathbf{A} \mathbf{K}_{NT}$  (13)

$\mathbf{A} = \mathbf{K}_{TN} (\mathbf{K}_N + \sigma^2 \mathbf{I}_N)^{-1}.$  (14)

We also obtain the predictive distribution of the output variables $\mathbf{y}_T$ as follows:

$p(\mathbf{y}_T \mid \mathbf{X}_T, \mathcal{D}) = \mathcal{N}(\mathbf{y}_T; \boldsymbol{\mu}_T, \boldsymbol{\Sigma}_T + \sigma^2 \mathbf{I}_T)$  (15)

For the derivation of Gaussian process regression, [1] provides the details.
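A direct NumPy transcription of the predictive equations (11)-(15) might look as follows; the toy training data and the noise variance are illustrative assumptions, and a linear solve is used in place of an explicit matrix inverse for numerical stability.

```python
import numpy as np

def eq_kernel(X1, X2):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * d2)

def gp_predict(X, y, X_test, kernel, noise_var):
    """GP regression predictive distribution, cf. (11)-(15)."""
    K_N = kernel(X, X)                       # Gram matrix of training inputs
    K_TN = kernel(X_test, X)                 # cross Gram matrix
    K_T = kernel(X_test, X_test)             # Gram matrix of test inputs
    C = K_N + noise_var * np.eye(len(X))     # K_N + sigma^2 I_N
    A = np.linalg.solve(C, K_TN.T).T         # A = K_TN (K_N + sigma^2 I_N)^{-1}, cf. (14)
    mu = A @ y                               # predictive mean, cf. (12)
    cov_f = K_T - A @ K_TN.T                 # predictive covariance of f_T, cf. (13)
    cov_y = cov_f + noise_var * np.eye(len(X_test))  # cf. (15)
    return mu, cov_f, cov_y

# Toy example with assumed data: y = sin(x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
X_test = np.linspace(-3, 3, 50)[:, None]
mu, cov_f, cov_y = gp_predict(X, y, X_test, eq_kernel, noise_var=0.01)
```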
3.3. Kernel Function Selection

To perform the Gaussian process regression represented by (11), we have to define the kernel function that determines the values of the Gram matrices. A widely used kernel is the exponential quadratic (EQ) kernel defined by $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2)$. The EQ kernel is also referred to as the radial basis function (RBF) kernel, squared exponential (SE) kernel, or Gaussian kernel. The EQ kernel determines the value using the distance between the two inputs $\mathbf{x}$ and $\mathbf{x}'$. Besides the EQ kernel, any kernel function can be used as long as it is positive definite. For example, the sum of two kernel functions is also a positive definite kernel, and we can design a new kernel function by multiplying two kernel functions.

Table 1 Relationship between activation functions and kernel functions.

    Activation           Kernel
    cos                  EQ [5]
    ReLU                 ArcCos [6]
    error function       ArcSin [7]
    arbitrary function   NNGP [3]

Using the correspondence between Gaussian processes and neural networks with an infinite number of hidden units, we can consider the relationship between kernel functions and activation functions as shown in Table 1. The EQ kernel is related to a cosine activation function. Random Fourier features are approximation methods for kernel-based frameworks that use this relationship [5]. The error function in the table has a contour similar to the sigmoid and hyperbolic tangent (tanh) functions. Lee et al. recently showed that it is possible to define a kernel function for arbitrary activation functions [3].

The selection of an appropriate kernel function is not an easy task, just as the selection of an activation function is not. A basic principle is to choose the kernel that represents the relationship between the training data points. To overcome the kernel selection issue, deep Gaussian processes (described in Sect. 5) have been proposed, where the input features are warped by a hierarchical architecture.

3.4. Training of Gaussian Process Regression

In Gaussian process regression, hyperparameters such as kernel function parameters and noise variances are trained on the basis of the Bayesian model selection described in Sect. 2; namely, the marginal likelihood is maximized to optimize the hyperparameters. The log marginal likelihood is given by

$\log p(\mathbf{y} \mid \mathbf{X}) = -\dfrac{1}{2} \mathbf{y}^\top (\mathbf{K}_N + \sigma^2 \mathbf{I})^{-1} \mathbf{y} - \dfrac{1}{2} \log |\mathbf{K}_N + \sigma^2 \mathbf{I}| + \text{const.}$  (16)

The first and second terms in (16) represent the goodness of fit of the kernel function to the data and the complexity of the model, respectively. The complexity term works as a penalty term that prevents overfitting.
3.5. Characteristics of Gaussian Process Regression

Gaussian process regression has several notable characteristics: it is a nonparametric model, a kernel method, a Bayesian model, and based on Gaussian distributions.

Being a nonparametric model means that we utilize all the training data samples for prediction instead of compressing the training data into a fixed number of parameters. This is
confirmed by (12), which shows that the predictive mean is directly calculated as a weighted sum of the training data points $\mathbf{y}$.

The advantage of kernel methods is that complicated functions can be predicted with a small number of parameters. Moreover, kernel methods enable us to use diverse types of input features, without being restricted to continuous-valued features. For example, we can use structured input features such as trees and probability distribution functions [8].

Gaussian process regression also has the characteristics of Bayesian modeling. Specifically, training is performed based on Bayesian model selection, which considers the model complexity. Also, the uncertainty of the output features can be obtained by Bayesian inference.

Furthermore, since a Gaussian process is an assumption of a Gaussian distribution over function variables, we can utilize the properties of Gaussian distributions. For example, the derivation of the conditional distribution in (11) is based on a property of the Gaussian distribution. By utilizing these properties, we can obtain closed forms of the marginal and predictive distributions.

4. SCALABLE GAUSSIAN PROCESS COMPUTATION

4.1. Stochastic Variational Gaussian Process (SVGP)

One issue of Gaussian process regression is its computational complexity. Let $N$ be the number of samples in the training data. We require $O(N^2)$ storage for the $(N \times N)$ Gram matrix and $O(N^3)$ computational complexity for the inversion and determinant in (16) and (11). This restricts $N$ to at most about 10,000. However, speech processing requires many more training data points. For example, 60 minutes of speech data with a 5 ms frame shift exceeds 700,000 training data points.

One noteworthy technique to overcome this problem is an approximation method based on stochastic variational inference (SVI) [9,10], which is referred to as SVGP. The SVGP framework utilizes inducing point methods (also called pseudo data methods), which approximate the pairs of inputs $\mathbf{X}$ and function outputs $\mathbf{f}$ by a small number of representative pairs. We represent these pairs by inducing inputs $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_M)$ and inducing outputs $\mathbf{u} = (u_1, \ldots, u_M)$. $M$ ($\ll N$) is the number of inducing points, which is generally hundreds or approximately 1,000.

From the definition of a Gaussian process, the prior distribution of $\mathbf{u}$ is given by

$p(\mathbf{u}) = \mathcal{N}(\mathbf{u}; \mathbf{0}, \mathbf{K}_M)$  (17)

where $\mathbf{K}_M$ is the Gram matrix of the inputs $\mathbf{Z}$. Also, the joint distribution is given by

$p(\mathbf{f}, \mathbf{u}) = \mathcal{N}\!\left(\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix}; \mathbf{0}, \begin{bmatrix} \mathbf{K}_N & \mathbf{K}_{NM} \\ \mathbf{K}_{MN} & \mathbf{K}_M \end{bmatrix}\right)$  (18)

where $\mathbf{K}_{MN}$ ($= \mathbf{K}_{NM}^\top$) is the Gram matrix between $\mathbf{Z}$ and $\mathbf{X}$.

In SVGP, we train a variational distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{u}; \mathbf{m}, \mathbf{S})$ that approximates the posterior $p(\mathbf{u} \mid \mathbf{y})$. Using the variational distribution $q(\mathbf{u})$, the predictive distribution is approximated by

$p(\mathbf{f}_T \mid \mathbf{y}) = \displaystyle\int p(\mathbf{f}_T \mid \mathbf{u}, \mathbf{y})\, p(\mathbf{u} \mid \mathbf{y})\, d\mathbf{u} \approx \int p(\mathbf{f}_T \mid \mathbf{u})\, p(\mathbf{u} \mid \mathbf{y})\, d\mathbf{u} \approx \int p(\mathbf{f}_T \mid \mathbf{u})\, q(\mathbf{u})\, d\mathbf{u} = \int \dfrac{p(\mathbf{f}_T, \mathbf{u})}{p(\mathbf{u})}\, q(\mathbf{u})\, d\mathbf{u} = \mathcal{N}(\mathbf{f}_T; \boldsymbol{\mu}_T, \boldsymbol{\Sigma}_T) \triangleq q(\mathbf{f}_T)$  (19)

$\boldsymbol{\mu}_T = \mathbf{K}_{TM} \mathbf{K}_M^{-1} \mathbf{m}$  (20)

$\boldsymbol{\Sigma}_T = \mathbf{K}_T - \mathbf{K}_{TM} \mathbf{K}_M^{-1} \mathbf{K}_{MT} + \mathbf{K}_{TM} \mathbf{K}_M^{-1} \mathbf{S} \mathbf{K}_M^{-1} \mathbf{K}_{MT}.$  (21)

These equations mean that we require only $O(M^2)$ storage and $O(M^3)$ computational complexity. We here define the operation of calculating $q(\mathbf{f}_T)$ as $\mathrm{SVGP}(\mathbf{f}_T; \mathbf{X}_T, \mathbf{Z}, q(\mathbf{u}))$.

The variational distribution $q(\mathbf{u})$ can be trained by a variational Bayesian method. Using Jensen's inequality for the log marginal likelihood (16), we obtain the following evidence lower bound (ELBO):

$\mathcal{L} = \displaystyle\sum_{i=1}^{N} \mathbb{E}_{q(f(\mathbf{x}_i))}[\log p(y_i \mid f(\mathbf{x}_i))] - \mathrm{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$  (22)

where $\mathbb{E}_{p(\cdot)}$ denotes the expectation over the distribution $p(\cdot)$, and KL represents the Kullback–Leibler divergence. By maximizing the ELBO, we can optimize the variational distribution parameters $(\mathbf{m}, \mathbf{S})$.

The advantage of SVGP is that the parameters can be optimized by stochastic gradient methods, because the ELBO decomposes into a sum over the individual training data points. Hence, it is possible to use minibatch training, which is available for a large amount of training data, and to employ adaptive optimization techniques such as AdaGrad and Adam.

Since SVGP does not restrict the kernel function, it can be used for various kinds of input features. Moreover, it is easy to apply SVGP to classification tasks because we only have to calculate the expectation $\mathbb{E}_{q(f(\mathbf{x}_i))}[\log p(y_i \mid f(\mathbf{x}_i))]$ in (22).
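For concreteness, here is a minimal sketch of the SVGP predictive equations (20) and (21), assuming the inducing inputs Z and the variational parameters (m, S) have already been trained by maximizing (22); the kernel choice and all variable names are placeholders of this sketch.

```python
import numpy as np

def eq_kernel(X1, X2):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * d2)

def svgp_predict(X_test, Z, m, S, kernel=eq_kernel):
    """SVGP predictive distribution q(f_T) = N(mu_T, Sigma_T), cf. (20) and (21)."""
    K_M = kernel(Z, Z)
    K_TM = kernel(X_test, Z)
    K_T = kernel(X_test, X_test)
    A = np.linalg.solve(K_M, K_TM.T).T       # K_TM K_M^{-1}; only an M x M system is solved
    mu = A @ m                               # (20)
    cov = K_T - A @ K_TM.T + A @ S @ A.T     # (21)
    return mu, cov
```

Only an M x M linear system appears, which is where the O(M^3) computation and O(M^2) storage quoted above come from.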
4.2. Other Approximation Methods for Gaussian Process Regression

Besides SVGP, diverse approximation methods have been proposed to overcome the problem of computational complexity [11]. In the local GP approach, the training data are partitioned into several clusters, and Gaussian process regression is executed for each cluster [12]. This enables us to use all training data points in prediction while reducing the computational complexity to the cube of the number of frames in each cluster. Product of experts [13] and the Bayesian committee machine [14] combine the results of local GPs to enhance the prediction accuracy.

In inducing point methods including SVGP, the number of inducing points limits the performance. KISS-GP [15] focuses on the structure of the Gram matrices, which are approximated by Kronecker products and interpolation. Although KISS-GP can utilize many more inducing points than SVGP, the dimension of the input feature vectors has to be small (less than 4) due to the assumption of Kronecker structure. TT-GP [16] and SKIP [17] attempt to overcome this problem by approximating the Kronecker structure.

Other approaches to reducing the computational complexity approximate the kernel function by the inner product of finite-dimensional vectors. Random Fourier features [5] and sparse spectrum GP (SSGP) [18] approximate the EQ kernel with finite-dimensional orthogonal bases. Since SSGP has weight vectors as parameters instead of inducing points, its model structure is very close to that of Bayesian neural networks.
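As a sketch of the random Fourier feature idea mentioned above, the EQ kernel can be approximated by an inner product of finite-dimensional random cosine features; the number of features R and the unit length scale are assumptions of this illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 5, 500                                   # input dimension and number of random features (assumed)

# Spectral samples for the EQ kernel exp(-||x - x'||^2 / 2): omega ~ N(0, I)
Omega = rng.normal(size=(D, R))
b = rng.uniform(0.0, 2.0 * np.pi, size=R)

def phi(X):
    """Random Fourier feature map: k(x, x') is approximated by phi(x) @ phi(x')."""
    return np.sqrt(2.0 / R) * np.cos(X @ Omega + b)

X = rng.normal(size=(3, D))
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :])**2).sum(-1))
K_approx = phi(X) @ phi(X).T                    # approaches K_exact as R grows
```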
5. DEEP GAUSSIAN PROCESSES

One of the problems of Gaussian process regression is that the performance depends on the design of the kernel function. For example, the EQ kernel assumes that the kernel function depends only on the distance between two input vectors. Therefore, to focus on other measures such as the norms and angles of vectors, we have to choose other kernel functions. However, choosing the best kernel function is laborious work.

The deep Gaussian process (DGP) [19] was proposed to overcome this problem by transforming the feature space of the input variables using a deep architecture. In a DGP, we assume that the function $f: \mathbb{R}^{D_0} \to \mathbb{R}^{D_L}$ is expressed by a composition of several functions

$f = f^{L} \circ \cdots \circ f^{1}$  (23)

$f^{\ell} = (f^{\ell,1}, \ldots, f^{\ell,D_\ell})$  (24)

and the respective functions that output the value of dimension $d$ in layer $\ell$ are sampled from Gaussian processes. When the hidden layer variables are defined by

$\mathbf{H}^{\ell-1} = f^{\ell-1}(\ldots(f^{1}(\mathbf{X})))$  (25)

$\mathbf{h}^{\ell,d} = f^{\ell,d}(\mathbf{H}^{\ell-1})$  (26)

we can infer the predictive distribution of the upper-layer values $\mathbf{h}^{\ell,d}$ from $\mathbf{H}^{\ell-1}$ using Gaussian process regression. By repeating the inference up to the final layer $L$, we obtain the predictive distribution of the output variables.

5.1. Stochastic Variational Inference for DGP

To achieve scalable training of DGPs for a large amount of training data, Salimbeni et al. proposed a doubly stochastic variational inference (DSVI)-based DGP [20]. In DSVI-DGP, the inference of the $\ell$-th layer from the $(\ell-1)$-th layer is calculated using the following equations in the same manner as (21):

$q(\mathbf{h}_T^{\ell,d}) = \mathrm{SVGP}(\mathbf{h}_T^{\ell,d}; \mathbf{H}_T^{\ell-1}, \mathbf{Z}^{\ell}, q(\mathbf{u}^{\ell,d}))$  (27)

$q(\mathbf{u}^{\ell,d}) = \mathcal{N}(\mathbf{u}^{\ell,d}; \mathbf{m}^{\ell,d}, \mathbf{S}^{\ell,d})$  (28)

By repeating SVGP, we obtain the predictive distribution of the output variables given by

$p(\mathbf{Y}_T) = \displaystyle\int \cdots \int q(\mathbf{Y}_T \mid \mathbf{H}_T^{L})\, q(\mathbf{H}_T^{L} \mid \mathbf{H}_T^{L-1}) \cdots q(\mathbf{H}_T^{1} \mid \mathbf{X}_T)\, d\mathbf{H}_T^{L} \cdots d\mathbf{H}_T^{1}$  (29)

$q(\mathbf{H}_T^{\ell} \mid \mathbf{H}_T^{\ell-1}) = \displaystyle\prod_{d=1}^{D_\ell} q(\mathbf{h}_T^{\ell,d}).$  (30)

Since the calculation of this integral is intractable, the predictive means are practically used for the 1st to $(L-1)$-th layers.

The training of the DSVI-based DGP is executed by maximizing the following ELBO in the same manner as SVGP:

$\mathcal{L} \approx \dfrac{1}{S} \displaystyle\sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{d=1}^{D_L} \mathbb{E}_{q(\tilde{h}_{i,s}^{L,d})}[\log p(y_i^d \mid \tilde{h}_{i,s}^{L,d})] - \sum_{\ell=1}^{L} \sum_{d=1}^{D_\ell} \mathrm{KL}(q(\mathbf{u}^{\ell,d}) \,\|\, p(\mathbf{u}^{\ell,d}))$  (31)

where $S$ is the number of Monte Carlo sampling points, which is generally set to one. $q(\tilde{h}_{i,s}^{L,d})$ is the predictive distribution of the $L$-th layer. The Monte Carlo sampling is carried out using the predictive distribution of each layer, and the sampled values are used for the inference of the next layer.
composite function of several functions
f ¼ fL  f1 ð23Þ 5.2. The Relationship between Neural Networks and
Deep Gaussian Processes
f ‘ ¼ ð f ‘;1 ; . . . ; f ‘;D‘ Þ ð24Þ Deep Gaussian processes are closely related to
and respective functions that output the value of dimension (Bayesian) neural networks. Figure 2 shows the transition
d and layer ‘ are sampled from Gaussian processes. When from a neural network to a Gaussian process. The left
the hidden layer variables are defined by network in the figure is a 3-hidden-layer neural network.
H ‘1 ¼ f ‘1 ð. . . ð f 1 ðXÞÞ ð25Þ The hidden layer values g‘ are obtained using a weight
matrix W ‘ and an activation function. Here, we decompose

461
the weight matrix $\mathbf{W}^{\ell}$ into two matrices $\mathbf{W}^{\ell,a}$ and $\mathbf{W}^{\ell,b}$, and define the new hidden layer as $\mathbf{h}^{\ell}$. If we increase the number of hidden units of $\mathbf{g}^{\ell}$ to infinity, the relationship between $\mathbf{h}^{\ell-1}$ and $\mathbf{h}^{\ell}$ is represented by a Gaussian process. Therefore, a 3-hidden-layer neural network can be transformed into a 3-layer deep Gaussian process. The difference is that whereas Bayesian neural networks infer the posterior of the weight matrices, deep Gaussian processes infer the posterior of the inducing points.

Fig. 2 The relationship between a neural network and a deep Gaussian process: decomposition of the weight matrices and an infinite number of hidden units.

6. APPLICATION OF GAUSSIAN PROCESSES TO SPEECH PROCESSING

In the previous sections, the frameworks of Gaussian process regression and deep Gaussian processes have been described. Since the GP-based frameworks are general machine learning frameworks, they can be applied to speech information processing. For example, in the voice conversion task, Pilkington et al. used the acoustic feature vectors of source and target speakers as input and output features [21]. To reduce the computational complexity, they performed clustering of the acoustic features and employed a local GP approximation. Park et al. applied Gaussian process regression to voice activity detection [22]. In this study, a waveform sample is predicted from adjacent samples. Since the optimal kernel parameters of voiced regions differ from those of noisy segments, this difference is used to detect voiced regions.

In [23], Gaussian process regression is used for statistical parametric speech synthesis. This technique utilizes the advantage of kernel methods that structured inputs can be used. As the input for speech synthesis, they use a structured frame-level context that consists of phoneme- and phrase-level information combined with frame position information. To alleviate the computational complexity problem, an approximation method based on local GPs is used, and the partitioning of the training data is performed using the decision tree used in hidden Markov model (HMM)-based speech synthesis. However, this framework depends on the design of kernel functions for complicated contexts, and the performance is greatly affected by the construction of the decision trees.

DGP-based speech synthesis [24] incorporated a deep architecture that encodes the complicated context represented by a context vector with hundreds of dimensions. It has been reported that DGP-based speech synthesis can generate more natural-sounding speech than DNN-based speech synthesis.

Moungsri et al. proposed a duration model for speech synthesis that utilizes the uncertainties of multiple Gaussian process models [25]. By multiplying the predictive distributions of syllable- and phone-level durations, this method achieved more robust inference than a single Gaussian process regression.

For application to speech recognition, Lam et al. proposed inserting a Gaussian process layer into DNN-based speech recognition [26]. This model uses an approximation method based on random Fourier features.

In this paper, we have explained Gaussian process regression, which predicts output variables from input variables. As another Gaussian process model, however, the Gaussian process latent variable model (GPLVM) has been proposed, in which the distribution of latent input variables is inferred from observed variables. GPLVM can encode observed vectors into latent features with a small number of dimensions. The Gaussian process dynamical model (GPDM) is an extension of GPLVM that can represent time-series information. GPLVM and GPDM have been used in various applications such as music genre recognition [27], phone classification [28], dynamical latent representation of acoustic feature sequences [29], syllable-level stress detection [30], and semi-supervised prosody modeling [31].

7. CONCLUSIONS

In this paper, we introduced the basics of Gaussian processes, approximation techniques, and the extension to deep architecture models. Since Gaussian process regression is a general-purpose machine learning technique, it can be applied not only to speech information processing but also to a wide variety of other applications. Recent developments in scalable Gaussian processes enable us to use a huge amount of data for training. In future work, it is expected that deep Gaussian process models can be constructed as easily as neural network models.

REFERENCES

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (The MIT Press, Cambridge, Mass., 2006).
[2] A. G. de G. Matthews, M. Rowland, J. Hron, R. E. Turner and Z. Ghahramani, "Gaussian process behaviour in wide deep neural networks," Proc. ICLR (2018).
[3] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington and J. Sohl-Dickstein, "Deep neural networks as Gaussian processes," Proc. ICLR (2018).
[4] R. M. Neal, "Priors for infinite networks," in Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Vol. 118 (Springer, New York, 1996).
[5] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," Proc. NIPS, pp. 1177–1184 (2008).
[6] Y. Cho and L. K. Saul, "Kernel methods for deep learning," Proc. NIPS, pp. 342–350 (2009).
[7] C. K. I. Williams, "Computing with infinite networks," Neural Comput., 10, 1203–1216 (1998).
[8] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, Cambridge, 2004).
[9] J. Hensman, N. Fusi and N. D. Lawrence, "Gaussian processes for big data," Proc. UAI, pp. 282–290 (2013).
[10] J. Hensman, A. Matthews and Z. Ghahramani, "Scalable variational Gaussian process classification," Proc. AISTATS, pp. 1648–1656 (2015).
[11] H. Liu, Y.-S. Ong, X. Shen and J. Cai, "When Gaussian process meets big data: A review of scalable GPs," arXiv preprint, 1807.01065 (2018).
[12] E. Snelson and Z. Ghahramani, "Local and global sparse Gaussian process approximations," Proc. AISTATS, pp. 524–531 (2007).
[13] Y. Cao and D. J. Fleet, "Generalized product of experts for automatic and principled fusion of Gaussian process predictions," Proc. Modern Nonparametrics 3: Automating the Learning Pipeline Workshop at NIPS (2014).
[14] H. Liu, J. Cai, Y. Wang and Y.-S. Ong, "Generalized robust Bayesian committee machine for large-scale Gaussian process regression," Proc. ICML, pp. 3131–3140 (2018).
[15] A. G. Wilson and H. Nickisch, "Kernel interpolation for scalable structured Gaussian processes (KISS-GP)," Proc. ICML, pp. 1775–1784 (2015).
[16] P. Izmailov, A. Novikov and D. Kropotov, "Scalable Gaussian processes with billions of inducing inputs via tensor train decomposition," Proc. AISTATS, pp. 726–735 (2018).
[17] J. R. Gardner, G. Pleiss, R. Wu, K. Q. Weinberger and A. G. Wilson, "Product kernel interpolation for scalable Gaussian processes," Proc. AISTATS, pp. 1407–1416 (2018).
[18] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen and A. R. Figueiras-Vidal, "Sparse spectrum Gaussian process regression," J. Mach. Learn. Res., 11, 1865–1881 (2010).
[19] A. C. Damianou and N. D. Lawrence, "Deep Gaussian processes," Proc. AISTATS, pp. 207–215 (2013).
[20] H. Salimbeni and M. Deisenroth, "Doubly stochastic variational inference for deep Gaussian processes," Proc. NIPS, pp. 4591–4602 (2017).
[21] N. C. V. Pilkington, H. Zen and M. J. F. Gales, "Gaussian process experts for voice conversion," Proc. Interspeech, pp. 2761–2764 (2011).
[22] S. Park and S. Choi, "Gaussian process regression for voice activity detection and speech enhancement," Proc. IJCNN, pp. 2879–2882 (2008).
[23] T. Koriyama, T. Nose and T. Kobayashi, "Statistical parametric speech synthesis based on Gaussian process regression," IEEE J. Sel. Top. Signal Process., 8, 173–183 (2014).
[24] T. Koriyama and T. Kobayashi, "Statistical parametric speech synthesis using deep Gaussian processes," IEEE/ACM Trans. Audio Speech Lang. Process., 27, 948–959 (2019).
[25] D. Moungsri, T. Koriyama and T. Kobayashi, "Duration prediction using multiple Gaussian process experts for GPR-based speech synthesis," Proc. ICASSP, pp. 5495–5498 (2017).
[26] M. W. Y. Lam, S. Hu, X. Xie, S. Liu, J. Yu, R. Su, X. Liu and H. Meng, "Gaussian process neural networks for speech recognition," Proc. Interspeech, pp. 1778–1782 (2018).
[27] K. Markov and T. Matsui, "Music genre and emotion recognition using Gaussian processes," IEEE Access, 2, 688–697 (2014).
[28] H. Park, S. Yun, S. Park, J. Kim and C. D. Yoo, "Phoneme classification using constrained variational Gaussian process dynamical system," Proc. NIPS, pp. 2006–2014 (2012).
[29] G. E. Henter, M. R. Frean and W. B. Kleijn, "Gaussian process dynamical models for nonparametric speech representation and synthesis," Proc. ICASSP, pp. 4505–4508 (2012).
[30] D. Moungsri, T. Koriyama and T. Kobayashi, "Unsupervised stress information labeling using Gaussian process latent variable model for statistical speech synthesis," Proc. Interspeech, pp. 1517–1521 (2016).
[31] T. Koriyama and T. Kobayashi, "Semi-supervised prosody modeling using deep Gaussian process latent variable model," Proc. Interspeech, pp. 4450–4454 (2019).

APPENDIX: FORMULAS ABOUT GAUSSIAN DISTRIBUTION

The calculation of Gaussian process regression relies on the properties of the Gaussian distribution. In this section, we introduce representative formulas for the Gaussian distribution. When $p(\mathbf{x})$ is a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, we write $p(\mathbf{x}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$.

A.1. Joint and Conditional Distributions

When the joint distribution of $\mathbf{x}$ and $\mathbf{y}$ is the following multivariate Gaussian distribution,

$p\!\left(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}; \begin{bmatrix} \boldsymbol{\mu}_x \\ \boldsymbol{\mu}_y \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\ \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy} \end{bmatrix}\right)$  (A.1)

the conditional distribution of $\mathbf{y}$ given $\mathbf{x}$ also becomes a Gaussian distribution, expressed by

$p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y}; \boldsymbol{\mu}_{y|x}, \boldsymbol{\Sigma}_{y|x})$  (A.2)

$\boldsymbol{\mu}_{y|x} = \boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx} \boldsymbol{\Sigma}_{xx}^{-1} (\mathbf{x} - \boldsymbol{\mu}_x)$  (A.3)

$\boldsymbol{\Sigma}_{y|x} = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx} \boldsymbol{\Sigma}_{xx}^{-1} \boldsymbol{\Sigma}_{xy}.$  (A.4)
pp. 4591–4602 (2017).
[21] N. C. V. Pilkington, H. Zen and M. J. F. Gales, ‘‘Gaussian
process experts for voice conversion,’’ Proc. Interspeech, A.2. Linear Transformation on Gaussian Distribution
pp. 2761–2764 (2011). When the conditional mean of y depends on x as
[22] S. Park and S. Choi, ‘‘Gaussian process regression for voice follows:
activity detection and speech enhancement,’’ Proc. IJCNN,
pp. 2879–2882 (2008). pðyjxÞ ¼ N ðy; Ax þ b; LÞ ðA:5Þ
[23] T. Koriyama, T. Nose and T. Kobayashi, ‘‘Statistical para-
pðxÞ ¼ N ðx; m; SÞ ðA:6Þ
metric speech synthesis based on Gaussian process regres- R
sion,’’ IEEE J. Sel. Top. Signal Process., 8, 173–183 (2014). the marginal distribution of pðyÞ ¼ pðyjxÞpðxÞdx and the
[24] T. Koriyama and T. Kobayashi, ‘‘Statistical parametric speech conditional distribution of pðxjyÞ ¼ pðyjxÞpðxÞ=pðyÞ be-
synthesis using deep Gaussian processes,’’ IEEE/ACM Trans.
Audio Speech Lang. Process., 27, 948–959 (2019).
come Gaussian distributions given by
[25] D. Moungsri, T. Koriyama and T. Kobayashi, ‘‘Duration
prediction using multiple Gaussian process experts for GPR-

$p(\mathbf{y}) = \mathcal{N}(\mathbf{y}; \mathbf{A}\mathbf{m} + \mathbf{b}, \mathbf{L} + \mathbf{A}\mathbf{S}\mathbf{A}^\top)$  (A.7)

$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_{x|y}, \boldsymbol{\Sigma}_{x|y})$  (A.8)

$\boldsymbol{\mu}_{x|y} = \boldsymbol{\Sigma}_{x|y} (\mathbf{A}^\top \mathbf{L}^{-1} (\mathbf{y} - \mathbf{b}) + \mathbf{S}^{-1} \mathbf{m})$  (A.9)

$\boldsymbol{\Sigma}_{x|y} = (\mathbf{S}^{-1} + \mathbf{A}^\top \mathbf{L}^{-1} \mathbf{A})^{-1}.$  (A.10)
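As a quick numerical sanity check of (A.7), added here for illustration, one can sample x ~ N(m, S) and y|x ~ N(Ax + b, L) and compare the empirical moments of y with Am + b and L + A S A^T; all parameter values below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, n = 3, 2, 200_000

# Assumed parameters for illustration only
A = rng.normal(size=(d_y, d_x))
b = rng.normal(size=d_y)
m = rng.normal(size=d_x)
S_half = rng.normal(size=(d_x, d_x)); S = S_half @ S_half.T + np.eye(d_x)
L_half = rng.normal(size=(d_y, d_y)); L = L_half @ L_half.T + np.eye(d_y)

x = rng.multivariate_normal(m, S, size=n)                              # x ~ N(m, S)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(d_y), L, size=n)    # y|x ~ N(Ax + b, L)

print(np.abs(y.mean(0) - (A @ m + b)).max())           # small for large n, cf. (A.7) mean
print(np.abs(np.cov(y.T) - (L + A @ S @ A.T)).max())   # small for large n, cf. (A.7) covariance
```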
A.3. Sum of Variables of Gaussian Distribution

When $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent and Gaussian distributed as $\mathcal{N}(\mathbf{x}_1; \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\mathcal{N}(\mathbf{x}_2; \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$, respectively, the distribution of the sum of $\mathbf{x}_1$ and $\mathbf{x}_2$ is the following Gaussian:

$p(\mathbf{x}_1 + \mathbf{x}_2) = \mathcal{N}(\mathbf{x}_1 + \mathbf{x}_2; \boldsymbol{\mu}_1 + \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2).$  (A.11)

Tomoki Koriyama received his B.E. degree in computer science, and M.E. and Dr. Eng. degrees in information processing from Tokyo Institute of Technology, Tokyo, Japan, in 2009, 2010, and 2013, respectively. In 2013, he joined the Research Laboratory of the Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, as a Japan Society for the Promotion of Science Research Fellow. He is currently an assistant professor at the Graduate School of Information Science and Technology, The University of Tokyo, Japan. Dr. Koriyama was a recipient of the Awaya Prize Young Researcher Award from the Acoustical Society of Japan. He is a member of IEEE, ISCA, ASJ, IEICE, and IPSJ.
