Prosody generation using frame-based Gaussian process regression and classification for statistical parametric speech synthesis

This paper proposes novel models of F0 contours and phone durations based on Gaussian process regression and classification (GPR and GPC) for statistical parametric speech synthesis. Although frame-based GPR has been shown to be effective for spectral feature modeling in previous studies, its application to prosodic features, i.e., F0 and phone duration, has not been sufficiently investigated because the kernel function was designed for phonetic information only. In this paper, we therefore propose a kernel function that accommodates multiple linguistic units such as syllables, moras, and accent phrases. The proposed kernel is based on temporal acoustic events, such as the beginning of an accent phrase, and uses the relative position between the target frame and each event. Objective and subjective experimental results show that the GPR/GPC-based F0 and duration models improve the prediction accuracy of acoustic features compared with HMM-based speech synthesis.
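The core idea of frame-based GPR with an event-relative kernel can be sketched as follows. This is a minimal illustration, not the paper's actual kernel: it assumes each frame is described by a single hypothetical feature, its relative position (in frames) to the nearest accent-phrase-start event, and uses a standard squared-exponential kernel with a plain GP posterior mean; the training values are toy log-F0 numbers invented for the example.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    # Squared-exponential kernel on the relative-position features.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X_train, y_train, X_test, noise=1e-2):
    # Standard GP regression posterior mean: k_*^T (K + sigma^2 I)^{-1} y.
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    return K_s @ np.linalg.solve(K, y_train)

# Hypothetical frame-level inputs: relative position of each frame to the
# beginning of the accent phrase (negative = before the event).
X_train = np.array([[-3.0], [-1.0], [0.0], [2.0], [4.0]])
y_train = np.array([4.8, 5.0, 5.3, 5.1, 4.9])  # toy log-F0 values per frame
X_test = np.array([[1.0]])                      # unseen frame position
f0_pred = gp_predict(X_train, y_train, X_test)
```

In the paper's setting the kernel combines such event-relative positions across several unit types (syllable, mora, accent phrase) with phonetic information, rather than a single scalar feature as here.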
