Deep GP for Speech
INVITED REVIEW
Tomoki Koriyama
Graduate School of Information Science and Technology, The University of Tokyo,
7–3–1 Hongo, Bunkyo-ku, Tokyo, 113–8656 Japan
Abstract: A Gaussian process (GP) is a distribution over functions that can be used as a machine
learning framework. GP regression combines the characteristics of Bayesian models, which can predict
the uncertainty of outputs, and kernel methods, which enable nonlinear functions with a small number of
parameters. In this paper, we first describe the basics of GP regression and then introduce recent notable
advances in GPs. Specifically, we focus on the stochastic variational GP, an approximation method
applicable to huge amounts of training data, and explain a GP-based deep architecture model called the
deep Gaussian process. Since GP regression is a general-purpose machine learning framework, it has
many applications. In this paper, we introduce GP-based applications to speech information
processing, including speech synthesis.
Keywords: Gaussian process regression, Kernel methods, Deep learning, Bayesian model, Speech
processing
One issue of Gaussian process regression is computational complexity. Let N be the number of samples in the training data. We require O(N^2) storage for the (N \times N) Gram matrix and O(N^3) computational complexity for the inversion and determinant in (16) and (11). This restricts N to at most about 10,000. However, speech processing requires many more training data points. For example, 60 minutes of speech data with a 5 ms frame shift exceeds 700,000 training data points (3,600 s / 5 ms = 720,000 frames).

One noteworthy technique to overcome this problem is an approximation method based on stochastic variational inference (SVI) [9,10], which is referred to as SVGP. The SVGP framework utilizes inducing point methods (also called pseudo data methods), which approximate the pairs of input X and function output f by a small number of representative pairs. We represent the pairs by inducing inputs Z = (z_1, \ldots, z_M) and inducing outputs u = (u_1, \ldots, u_M). M (\ll N) is the number of inducing points, which is generally hundreds or approximately 1,000. From the definition of a Gaussian process, the prior distribution of u is given by

p(u) = \mathcal{N}(u; 0, K_M)    (17)

where K_M is the Gram matrix of the inputs Z. Also, the joint distribution is given by

and O(M^3) computational complexity. We here define the operation of calculating q(f_T) as \mathrm{SVGP}(f_T; X_T, Z, q(u)). The variational distribution q(u) can be trained by a variational Bayesian method. Using Jensen's inequality for the log marginal likelihood (16), we obtain the following evidence lower bound (ELBO):

\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(f(x_i))}[\log p(y_i \mid f(x_i))] - \mathrm{KL}(q(u) \,\|\, p(u))    (22)

where \mathbb{E}_{p(\cdot)} denotes the expectation over the distribution p(\cdot), and KL represents the Kullback–Leibler divergence. By maximizing the ELBO, we can optimize the variational distribution parameters (m, S).

The advantage of SVGP is that the parameters can be optimized by stochastic gradient methods, because the ELBO decomposes into a sum over the respective training data points. Hence, it is possible to use minibatch training suitable for a large amount of training data and to employ adaptive optimization techniques such as AdaGrad and Adam.

Since SVGP does not restrict the kernel function, it can be used for various kinds of input features. Moreover, it is easy to apply SVGP to classification tasks because we only have to calculate the expectation \mathbb{E}_{q(f(x_i))}[\log p(y_i \mid f(x_i))] in (22).
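To make the minibatch ELBO optimization of (22) concrete, the following is a minimal sketch using GPflow's SVGP model; the toy data, kernel choice, number of inducing points, batch size, and optimizer settings are illustrative assumptions rather than details from this paper. Each step draws a random minibatch, evaluates the negative ELBO rescaled to the full data set via num_data, and applies an Adam update, in the spirit of the stochastic gradient training described above.

import numpy as np
import tensorflow as tf
import gpflow

# Toy regression data standing in for acoustic features (placeholder values).
N, D = 100_000, 5
X = np.random.rand(N, D)
Y = np.sin(3.0 * X[:, :1]) + 0.1 * np.random.randn(N, 1)

# Inducing inputs Z initialized from a random subset of X (M << N).
M = 512
Z = X[np.random.choice(N, M, replace=False)].copy()

model = gpflow.models.SVGP(
    kernel=gpflow.kernels.SquaredExponential(),   # EQ kernel
    likelihood=gpflow.likelihoods.Gaussian(),
    inducing_variable=Z,
    num_data=N,   # rescales the minibatch ELBO to the full training set
)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
batch_size = 1024

for step in range(2000):
    idx = np.random.choice(N, batch_size, replace=False)
    minibatch = (X[idx], Y[idx])
    with tf.GradientTape() as tape:
        loss = -model.elbo(minibatch)   # negative ELBO of Eq. (22) on the minibatch
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

Because the ELBO decomposes over training points, each update touches only batch_size frames, which is what makes training on hundreds of thousands of speech frames feasible.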
4.2. Other Approximation Methods of Gaussian Process Regression
Besides SVGP, diverse approximation methods have been proposed to overcome the problem of computational complexity [11]. In local GP, the training data is partitioned into several clusters, and Gaussian process regression is executed for each cluster [12]. This enables us to use all training data points in prediction while reducing the computational complexity to the cube of the number of frames in each cluster. Product of experts [13] and the Bayesian committee machine [14] combine the results of local GPs to enhance prediction accuracy.

In inducing point methods, including SVGP, the number of inducing points limits the achievable performance. KISS-GP [15] focuses on the structure of Gram matrices, which are approximated by a Kronecker product and interpolation. Although KISS-GP can utilize a much larger number of inducing points than SVGP, the dimension of the input feature vectors has to be small (less than 4) due to the assumption of Kronecker structure. TT-GP [16] and SKIP [17] attempt to overcome this problem by approximating the Kronecker structure.

Other approaches to reducing the computational complexity approximate the kernel function by the inner product of finite-dimensional vectors. Random Fourier features [5] and sparse spectrum GP (SSGP) [18] approximate the EQ kernel with finite-dimensional orthogonal bases. Since SSGP has weight vectors as parameters instead of inducing points, its model structure is very close to that of Bayesian neural networks.

5. DEEP GAUSSIAN PROCESSES

One of the problems of Gaussian process regression is that the performance depends on the design of the kernel function. For example, the EQ kernel assumes that the kernel function depends only on the distance between two input vectors. Therefore, to focus on other measures such as the norms and angles of vectors, we have to choose other kernel functions. However, it is laborious to choose the best kernel function.

The deep Gaussian process (DGP) [19] was proposed to overcome this problem by transforming the feature space of input variables using a deep architecture. In DGP, we assume that the function f : \mathbb{R}^{D_0} \to \mathbb{R}^{D_L} is expressed by a composite of several functions

f = f_L \circ \cdots \circ f_1    (23)

f_\ell = (f^{\ell,1}, \ldots, f^{\ell,D_\ell})    (24)

and the respective functions that output the value of dimension d at layer \ell are sampled from Gaussian processes. When the hidden layer variables are defined by

H^{\ell-1} = f_{\ell-1}(\cdots(f_1(X)))    (25)

h^{\ell,d} = f^{\ell,d}(H^{\ell-1})    (26)

we can infer the predictive distribution of the upper-layer value h^{\ell,d} from H^{\ell-1} using Gaussian process regression. By repeating the inference until the final layer L, we obtain the predictive distribution of the output variable.

5.1. Stochastic Variational Inference for DGP
To achieve scalable training of DGP for a large amount of training data, Salimbeni et al. proposed a doubly stochastic variational inference (DSVI)-based DGP [20]. In DSVI-DGP, the inference of the \ell-th layer from the (\ell-1)-th layer is calculated using the following equations in the same manner as (21):

q(h_T^{\ell,d}) = \mathrm{SVGP}(h_T^{\ell,d}; H_T^{\ell-1}, Z^\ell, q(u^{\ell,d}))    (27)

q(u^{\ell,d}) = \mathcal{N}(u^{\ell,d}; m^{\ell,d}, S^{\ell,d})    (28)

By repeating SVGP, we obtain the predictive distribution of the output variable given by

p(Y_T) = \int \cdots \int q(Y_T \mid H_T^L)\, q(H_T^L \mid H_T^{L-1}) \cdots q(H_T^1 \mid X_T)\, dH_T^L \cdots dH_T^1    (29)

q(H_T^\ell \mid H_T^{\ell-1}) = \prod_{d=1}^{D_\ell} q(h_T^{\ell,d}).    (30)

Since the calculation of this integral is intractable, predictive means are practically used for the 1st to (L-1)-th layers.

The training of DSVI-based DGP is executed by maximizing the following ELBO in the same manner as SVGP:

\mathcal{L} \approx \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} \sum_{d=1}^{D_L} \mathbb{E}_{q(\tilde{h}_{i,s}^{L,d})}[\log p(y_i^d \mid \tilde{h}_{i,s}^{L,d})] - \sum_{\ell=1}^{L} \sum_{d=1}^{D_\ell} \mathrm{KL}(q(u^{\ell,d}) \,\|\, p(u^{\ell,d}))    (31)

where S is the number of points for Monte Carlo sampling, which is generally set to one, and q(\tilde{h}_{i,s}^{L,d}) is the predictive distribution of the L-th layer. The Monte Carlo sampling is carried out using the predictive distribution for each layer, and the sampled values are used for the inference of the next layer.

5.2. The Relationship between Neural Networks and Deep Gaussian Processes
Deep Gaussian processes are closely related to (Bayesian) neural networks. Figure 2 shows the transition from a neural network to a Gaussian process. The left network in the figure is a 3-hidden-layer neural network. The hidden layer values g^\ell are obtained using a weight matrix W^\ell and an activation function. Here, we decompose
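To illustrate the layer-by-layer inference of Sect. 5.1, the following NumPy sketch propagates Monte Carlo samples through a toy two-layer DGP as in (27)–(31): each output dimension of each layer has its own SVGP marginal, a sample \tilde{h} is drawn from it, and the samples are used as inputs to the next layer. The function svgp_predict implements the standard SVGP predictive marginal (the counterpart of (21), which is not reproduced in this excerpt), and all sizes and variational parameters are random placeholders rather than trained values.

import numpy as np

rng = np.random.default_rng(0)

def eq_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Exponentiated quadratic (EQ) kernel matrix k(A, B).
    sqdist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def svgp_predict(H_in, Z, m, S):
    # SVGP marginal q(h) = N(mu, var) for one output dimension,
    # given inducing inputs Z and variational q(u) = N(m, S).
    Kmm = eq_kernel(Z, Z) + 1e-6 * np.eye(len(Z))
    Kmn = eq_kernel(Z, H_in)                                  # (M, N)
    A = np.linalg.solve(Kmm, Kmn)                             # K_M^{-1} k_M(x_n), column-wise
    mu = A.T @ m
    var = 1.0 - np.einsum('mn,mn->n', A, (Kmm - S) @ A)       # diag of k(x, x) = 1 here
    return mu, np.maximum(var, 0.0)

# Toy two-layer DGP forward pass with S = 1 Monte Carlo sample per layer.
N, D0, D1, D2, M = 8, 3, 4, 1, 16
X = rng.normal(size=(N, D0))

H = X
for out_dim, in_dim in [(D1, D0), (D2, D1)]:
    Z = rng.normal(size=(M, in_dim))               # inducing inputs Z^l of this layer
    H_next = np.empty((N, out_dim))
    for d in range(out_dim):                       # one GP per output dimension, Eq. (30)
        m = 0.1 * rng.normal(size=M)               # variational mean m^{l,d}
        L_chol = 0.05 * np.tril(rng.normal(size=(M, M)))
        S = L_chol @ L_chol.T + 1e-6 * np.eye(M)   # variational covariance S^{l,d}
        mu, var = svgp_predict(H, Z, m, S)
        # Draw the sample h~ that is fed to the next layer, as in Eq. (31).
        H_next[:, d] = mu + np.sqrt(var) * rng.normal(size=N)
    H = H_next

print(H)   # one Monte Carlo draw from the approximate DGP output distribution

Replacing the sampling line with H_next[:, d] = mu corresponds to the practice mentioned above of using predictive means for the 1st to (L-1)-th layers at prediction time.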
REFERENCES

neural networks," Proc. ICLR (2018).
[3] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington and J. Sohl-Dickstein, "Deep neural networks as Gaussian processes," Proc. ICLR (2018).
[4] R. M. Neal, "Priors for infinite networks," in Bayesian Learning for Neural Networks, Lecture Notes in Statistics, Vol. 118 (Springer, New York, 1996).
[5] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," Proc. NIPS, pp. 1177–1184 (2008).
[6] Y. Cho and L. K. Saul, "Kernel methods for deep learning," Proc. NIPS, pp. 342–350 (2009).
[7] C. K. I. Williams, "Computing with infinite networks," Neural Comput., 10, 1203–1216 (1998).
[8] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis (Cambridge University Press, Cambridge, 2004).
[9] J. Hensman, N. Fusi and N. D. Lawrence, "Gaussian processes for big data," Proc. UAI, pp. 282–290 (2013).
[10] J. Hensman, A. Matthews and Z. Ghahramani, "Scalable variational Gaussian process classification," Proc. AISTATS, pp. 1648–1656 (2015).
[11] H. Liu, Y.-S. Ong, X. Shen and J. Cai, "When Gaussian process meets big data: A review of scalable GPs," arXiv preprint, 1807.01065 (2018).
[12] E. Snelson and Z. Ghahramani, "Local and global sparse Gaussian process approximations," Proc. AISTATS, pp. 524–531 (2007).
[13] Y. Cao and D. J. Fleet, "Generalized product of experts for automatic and principled fusion of Gaussian process predictions," Proc. Modern Nonparametrics 3: Automating the Learning Pipeline Workshop at NIPS (2014).
[14] H. Liu, J. Cai, Y. Wang and Y.-S. Ong, "Generalized robust Bayesian committee machine for large-scale Gaussian process regression," Proc. ICML, pp. 3131–3140 (2018).
[15] A. G. Wilson and H. Nickisch, "Kernel interpolation for scalable structured Gaussian processes (KISS-GP)," Proc. ICML, pp. 1775–1784 (2015).
[16] P. Izmailov, A. Novikov and D. Kropotov, "Scalable Gaussian processes with billions of inducing inputs via tensor train decomposition," Proc. AISTATS, pp. 726–735 (2018).
[17] J. R. Gardner, G. Pleiss, R. Wu, K. Q. Weinberger and A. G. Wilson, "Product kernel interpolation for scalable Gaussian processes," Proc. AISTATS, pp. 1407–1416 (2018).
[18] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen and A. R. Figueiras-Vidal, "Sparse spectrum Gaussian process regression," J. Mach. Learn. Res., 11, 1865–1881 (2010).
[19] A. C. Damianou and N. D. Lawrence, "Deep Gaussian processes," Proc. AISTATS, pp. 207–215 (2013).
[20] H. Salimbeni and M. Deisenroth, "Doubly stochastic variational inference for deep Gaussian processes," Proc. NIPS, pp. 4591–4602 (2017).
[21] N. C. V. Pilkington, H. Zen and M. J. F. Gales, "Gaussian process experts for voice conversion," Proc. Interspeech, pp. 2761–2764 (2011).
[22] S. Park and S. Choi, "Gaussian process regression for voice activity detection and speech enhancement," Proc. IJCNN, pp. 2879–2882 (2008).
[23] T. Koriyama, T. Nose and T. Kobayashi, "Statistical parametric speech synthesis based on Gaussian process regression," IEEE J. Sel. Top. Signal Process., 8, 173–183 (2014).
[24] T. Koriyama and T. Kobayashi, "Statistical parametric speech synthesis using deep Gaussian processes," IEEE/ACM Trans. Audio Speech Lang. Process., 27, 948–959 (2019).
[25] D. Moungsri, T. Koriyama and T. Kobayashi, "Duration prediction using multiple Gaussian process experts for GPR-based speech synthesis," Proc. ICASSP, pp. 5495–5498 (2017).
[26] M. W. Y. Lam, S. Hu, X. Xie, S. Liu, J. Yu, R. Su, X. Liu and H. Meng, "Gaussian process neural networks for speech recognition," Proc. Interspeech, pp. 1778–1782 (2018).
[27] K. Markov and T. Matsui, "Music genre and emotion recognition using Gaussian processes," IEEE Access, 2, 688–697 (2014).
[28] H. Park, S. Yun, S. Park, J. Kim and C. D. Yoo, "Phoneme classification using constrained variational Gaussian process dynamical system," Proc. NIPS, pp. 2006–2014 (2012).
[29] G. E. Henter, M. R. Frean and W. B. Kleijn, "Gaussian process dynamical models for nonparametric speech representation and synthesis," Proc. ICASSP, pp. 4505–4508 (2012).
[30] D. Moungsri, T. Koriyama and T. Kobayashi, "Unsupervised stress information labeling using Gaussian process latent variable model for statistical speech synthesis," Proc. Interspeech, pp. 1517–1521 (2016).
[31] T. Koriyama and T. Kobayashi, "Semi-supervised prosody modeling using deep Gaussian process latent variable model," Proc. Interspeech, pp. 4450–4454 (2019).

APPENDIX: FORMULAS ABOUT GAUSSIAN DISTRIBUTION

Calculations in Gaussian process regression rely on the properties of the Gaussian distribution. In this section, we introduce representative formulas about the Gaussian distribution. When p(x) is a Gaussian distribution with mean \mu and covariance matrix \Sigma, we write p(x) = \mathcal{N}(x; \mu, \Sigma).

A.1. Joint and Conditional Distributions
When the joint distribution of x and y is the following multivariate Gaussian distribution,

p\!\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix} x \\ y \end{bmatrix}; \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right)    (A.1)

the conditional distribution of y given x also becomes a Gaussian distribution expressed by

p(y \mid x) = \mathcal{N}(y; \mu_{y|x}, \Sigma_{y|x})    (A.2)

\mu_{y|x} = \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} (x - \mu_x)    (A.3)

\Sigma_{y|x} = \Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}.    (A.4)

A.2. Linear Transformation on Gaussian Distribution
When the conditional mean of y depends on x as follows:

p(y \mid x) = \mathcal{N}(y; Ax + b, L)    (A.5)

p(x) = \mathcal{N}(x; m, S)    (A.6)

the marginal distribution p(y) = \int p(y \mid x) p(x)\, dx and the conditional distribution p(x \mid y) = p(y \mid x) p(x) / p(y) become Gaussian distributions given by

p(y) = \mathcal{N}(y; Am + b, L + A S A^\top)    (A.7)

p(x \mid y) = \mathcal{N}\!\left(x; \Sigma_{x|y}\left(A^\top L^{-1} (y - b) + S^{-1} m\right), \Sigma_{x|y}\right), \quad \Sigma_{x|y} = \left(S^{-1} + A^\top L^{-1} A\right)^{-1}.    (A.8)
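As a small illustration of how (A.1)–(A.4) underlie GP regression, the NumPy sketch below applies the conditioning formulas to the joint Gaussian of noisy training outputs and test function values, which yields the usual GP predictive mean and covariance. The kernel hyperparameters, toy data, and noise variance are arbitrary placeholders, not values from this paper.

import numpy as np

rng = np.random.default_rng(1)

def eq_kernel(A, B, lengthscale=0.3, variance=1.0):
    # Exponentiated quadratic (EQ) kernel matrix k(A, B).
    sqdist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Toy 1-D training data and test inputs (placeholders).
X = rng.uniform(size=(30, 1))
y = np.sin(6.0 * X[:, 0]) + 0.1 * rng.normal(size=30)
Xs = np.linspace(0.0, 1.0, 100)[:, None]
noise_var = 0.1 ** 2

# Blocks of the joint Gaussian of (y, f*): the observed y plays the role of "x"
# in (A.1), the test values f* play the role of "y"; the prior means are zero.
Sxx = eq_kernel(X, X) + noise_var * np.eye(len(X))   # Sigma_xx (noisy outputs)
Sxy = eq_kernel(X, Xs)                               # Sigma_xy (cross-covariance)
Syy = eq_kernel(Xs, Xs)                              # Sigma_yy (test covariance)

# Predictive mean and covariance from (A.3) and (A.4).
mean_post = Sxy.T @ np.linalg.solve(Sxx, y)
cov_post = Syy - Sxy.T @ np.linalg.solve(Sxx, Sxy)

print(mean_post[:5])
print(np.diag(cov_post)[:5])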