Statistical Parametric Speech Synthesis Based on Gaussian Process Regression

This paper proposes a statistical parametric speech synthesis technique based on Gaussian process regression (GPR). The GPR model directly predicts frame-level acoustic features from the corresponding frame context, which is derived from linguistic information. The frame context, which includes the relative position of the current frame within the phone as well as articulatory information, serves as the explanatory variable in GPR. To reduce the computational cost, we introduce cluster-based sparse Gaussian processes (GPs), namely local GPs and the partially independent conditional (PIC) approximation. Experimental results for both isolated phone synthesis and full-sentence continuous speech synthesis show that the proposed GPR-based technique, without dynamic features, slightly outperformed conventional hidden Markov model (HMM)-based speech synthesis that uses minimum generation error training with dynamic features.
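
As an illustration of the local-GP idea, the following is a minimal sketch rather than the system evaluated in the paper. It assumes a squared-exponential kernel with fixed hyperparameters, a one-dimensional frame context (the relative position of the frame within the phone), a toy scalar acoustic feature, and a hypothetical pre-assigned cluster label per frame. All function names and data are invented for this example. The PIC approximation, which additionally couples the clusters through shared inducing points, is not shown.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    # Squared-exponential kernel between the rows of a and b (assumed kernel choice).
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_fit(x, y, noise=1e-2):
    # Exact GPR on one cluster: alpha = (K + noise * I)^{-1} y via a Cholesky factor.
    k = rbf_kernel(x, x) + noise * np.eye(len(x))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    return {"x": x, "alpha": alpha}

def gp_predict_mean(model, x_new):
    # Posterior mean at new frame contexts: k(x_new, X) @ alpha.
    return rbf_kernel(x_new, model["x"]) @ model["alpha"]

def fit_local_gps(x, y, cluster_ids):
    # Local GPs: one independent GP per cluster, so each Cholesky
    # factorization only involves the frames assigned to that cluster.
    return {int(c): gp_fit(x[cluster_ids == c], y[cluster_ids == c])
            for c in np.unique(cluster_ids)}

# Toy data (hypothetical): the acoustic feature depends on the relative
# frame position and on which cluster the frame belongs to.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(200, 1))
cluster_ids = rng.integers(0, 2, size=200)
y = np.sin(2.0 * np.pi * x[:, 0]) + cluster_ids + 0.05 * rng.standard_normal(200)

models = fit_local_gps(x, y, cluster_ids)
x_new = np.array([[0.25]])
pred0 = gp_predict_mean(models[0], x_new)[0]
pred1 = gp_predict_mean(models[1], x_new)[0]
print(pred0, pred1)
```

Splitting N training frames into clusters of size M replaces a single O(N^3) factorization with N/M factorizations of cost O(M^3) each, which is the source of the computational saving that local GPs provide.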
