Statistical Parametric Speech Synthesis Using Deep Gaussian Processes

This paper proposes a speech synthesis framework based on deep Gaussian processes (DGPs), which are deep architecture models composed of stacked Bayesian kernel regressions. In this method, we train a statistical model that maps contextual features to speech parameters, in a similar manner to deep neural network (DNN)-based speech synthesis. To apply DGPs to statistical parametric speech synthesis, our framework uses an approximation method, doubly stochastic variational inference, which scales to an arbitrary amount of data. Since DGP training is based on the marginal likelihood, which accounts not only for data fit but also for model complexity, DGPs are less vulnerable to overfitting than DNNs. In experimental evaluations, we compared the proposed DGP-based framework with a feedforward DNN-based one. Subjective and objective evaluation results showed that our DGP framework yielded a higher mean opinion score and lower acoustic feature distortions than the conventional framework.
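The "stacked Bayesian kernel regressions" idea can be illustrated with a minimal prior-sampling sketch: each layer is a Gaussian process, and the outputs of one layer serve as the inputs to the next. This is only an illustrative toy, not the paper's method; the actual framework trains the stack with doubly stochastic variational inference over inducing points, whereas the snippet below merely draws one sample from a two-layer DGP prior with assumed RBF kernels and fixed hyperparameters.

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    # Squared-exponential (RBF) kernel between two sets of inputs,
    # each given as a 2-D array of shape (n_points, n_dims).
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sample_gp_layer(X, rng, jitter=1e-6):
    # Draw one zero-mean GP function sample evaluated at the rows of X.
    # Jitter on the diagonal keeps the Cholesky factorization stable.
    K = rbf_kernel(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((len(X), 1))

rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 50)[:, None]  # stand-in for contextual features
h = sample_gp_layer(X, rng)              # hidden layer: GP draw over the inputs
y = sample_gp_layer(h, rng)              # output layer: GP draw over hidden values
print(y.shape)
```

Stacking the two draws gives a non-Gaussian marginal over `y`, which is what distinguishes a DGP from a single (shallow) GP; training then replaces these exact samples with variational posteriors at a small set of inducing points so the cost no longer grows cubically in the number of frames.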
