Duration prediction using multiple Gaussian process experts for GPR-based speech synthesis

This paper proposes an alternative multi-level approach to duration prediction, based on multiple Gaussian process experts, for improving prosody generation in statistical parametric speech synthesis. We use duration models at two levels: syllable and phone. First, the syllable- and phone-level duration models are trained separately. Then, their predictive distributions are combined as a product of Gaussians, and the means of the combined predictive distributions are used as the predicted durations for synthetic speech. We report objective and subjective evaluation results comparing the proposed technique with conventional ones in Gaussian process regression (GPR)-based speech synthesis.
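
To make the combination step concrete, the following is a minimal sketch of a product of Gaussians for a single syllable. It is not taken from the paper: it assumes that the phone-level predictive distributions are independent Gaussians, that the syllable-level expert is a Gaussian over the sum of the phone durations in that syllable, and that durations are given in milliseconds. The function name `combine_durations` and all numeric values are hypothetical. Under these assumptions the product is again Gaussian, and its mean has a closed form in which each phone absorbs a share of the syllable-level mismatch proportional to its own predictive variance.

```python
def combine_durations(phone_means, phone_vars, syl_mean, syl_var):
    """Combine phone- and syllable-level Gaussian predictions for one syllable.

    Assumption (not from the paper): the syllable duration is the sum of its
    phone durations, and the phone-level predictions are independent.
    Returns the means and marginal variances of the product distribution
    over the phone durations.
    """
    # Variance of (syllable prediction - sum of phone predictions).
    total_var = syl_var + sum(phone_vars)
    # Mismatch between the syllable-level mean and the summed phone-level means.
    residual = syl_mean - sum(phone_means)

    # Each phone shifts in proportion to its own predictive variance.
    means = [m + v * residual / total_var
             for m, v in zip(phone_means, phone_vars)]
    # Marginal variances shrink because the syllable expert adds information.
    variances = [v - v * v / total_var for v in phone_vars]
    return means, variances


if __name__ == "__main__":
    # Hypothetical predictions for a two-phone syllable (milliseconds).
    means, variances = combine_durations(
        phone_means=[50.0, 100.0],
        phone_vars=[100.0, 300.0],
        syl_mean=170.0,
        syl_var=100.0,
    )
    print(means, variances)
```

In this sketch, a phone whose prediction is more uncertain moves further toward agreement with the syllable-level expert, while a confident phone prediction stays close to its original mean. The combined means are what would be used as the predicted durations.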
