GPR-based Thai speech synthesis using multi-level duration prediction

Abstract: This paper proposes a multi-level Gaussian process regression (GPR)-based method for duration prediction that incorporates phone- and syllable-level duration models. In this method, we first train the syllable-level model and predict syllable durations from the input context labels. We then use the predicted syllable duration as an additional context for the phone-level model, which predicts phone durations. To apply multi-level duration prediction to the GPR-based speech synthesis framework, we designed phone- and syllable-level context sets for Thai that include linguistic information and the relative positions of speech units. We also examined a multi-level deep neural network (DNN)-based duration-prediction method built with the same approach as the proposed multi-level GPR-based one. We conducted objective and subjective evaluations using two hours of training data to compare the proposed method with single-level ones. The results indicate that the proposed multi-level duration-prediction method outperformed the single-level ones in both the DNN- and GPR-based frameworks. They also indicate that the proposed multi-level GPR-based method can provide better performance than the multi-level HMM-based duration-prediction method.
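The two-stage idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the `GPR` class is a toy exact Gaussian process regressor with an RBF kernel, the names `train_multilevel`, `predict_multilevel`, and `phone_parent` are hypothetical, the feature vectors and durations are made-up toy numbers, and the choice to train the phone-level model on predicted (rather than observed) syllable durations is a guess, since the abstract does not say which is used.

```python
import math

def rbf(a, b, length=1.0):
    """RBF kernel between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

class GPR:
    """Toy exact GP regressor (constant mean, RBF kernel, fixed noise)."""
    def __init__(self, length=1.0, noise=1e-2):
        self.length, self.noise = length, noise

    def fit(self, X, y):
        self.X = [list(x) for x in X]
        self.mean = sum(y) / len(y)
        K = [[rbf(a, b, self.length) + (self.noise if i == j else 0.0)
              for j, b in enumerate(self.X)] for i, a in enumerate(self.X)]
        self.alpha = solve(K, [v - self.mean for v in y])
        return self

    def predict(self, X):
        # Posterior mean: m + k(x, X) @ alpha
        return [self.mean + sum(rbf(x, xi, self.length) * a
                                for xi, a in zip(self.X, self.alpha))
                for x in X]

def augment(phone_X, phone_parent, syl_dur):
    """Append the parent syllable's duration to each phone context vector."""
    return [list(x) + [syl_dur[i]] for x, i in zip(phone_X, phone_parent)]

def train_multilevel(syl_X, syl_y, phone_X, phone_parent, phone_y):
    syl_model = GPR().fit(syl_X, syl_y)
    # Assumption: the phone model is trained on *predicted* syllable durations.
    syl_pred = syl_model.predict(syl_X)
    phone_model = GPR().fit(augment(phone_X, phone_parent, syl_pred), phone_y)
    return syl_model, phone_model

def predict_multilevel(syl_model, phone_model, syl_X, phone_X, phone_parent):
    syl_pred = syl_model.predict(syl_X)            # stage 1: syllable level
    phone_in = augment(phone_X, phone_parent, syl_pred)
    return syl_pred, phone_model.predict(phone_in)  # stage 2: phone level

# Toy data (made up): syllable context = [tone id, relative position].
syl_X = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [0.0, 1.0]]
syl_y = [0.30, 0.25, 0.40, 0.35]                   # durations in seconds
# Phone context = [is_vowel, position in syllable]; two phones per syllable.
phone_X = [[0.0, 0.0], [1.0, 1.0]] * 4
phone_parent = [0, 0, 1, 1, 2, 2, 3, 3]            # index of parent syllable
phone_y = [0.10, 0.20, 0.08, 0.17, 0.12, 0.28, 0.11, 0.24]

syl_model, phone_model = train_multilevel(
    syl_X, syl_y, phone_X, phone_parent, phone_y)
syl_pred, phone_pred = predict_multilevel(
    syl_model, phone_model, syl_X, phone_X, phone_parent)
```

A single-level baseline in this sketch would simply fit `GPR` on `phone_X` without the appended syllable duration, which is the comparison the abstract describes.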
