Contextual Additive Structure for HMM-Based Speech Synthesis

This paper proposes a spectral modeling technique based on an additive structure of context dependencies for HMM-based speech synthesis. Contextual additive structure models can represent complicated dependencies between acoustic features and context labels using multiple decision trees. However, the computational complexity of the context clustering is too high for the full context labels of speech synthesis. To overcome this problem, this paper proposes two approaches; covariance parameter tying and a likelihood calculation algorithm using the matrix inversion lemma. Additive structure models can be applied to HMM-based speech synthesis using these techniques and speech quality would significantly be improved. Experimental results show that the proposed method outperforms the conventional one in subjective listening tests.

[1]  Keiichi Tokuda,et al.  Hidden Markov models based on multi-space probability distribution for pitch pattern modeling , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[2]  Frank K. Soong,et al.  Generating natural F0 trajectory with additive trees , 2008, INTERSPEECH.

[3]  Hiroya Fujisaki,et al.  In search of models in speech communication research , 2009, INTERSPEECH.

[4]  Keiichi Tokuda,et al.  An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Frank K. Soong,et al.  Modeling pitch trajectory by hierarchical HMM with minimum generation error training , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Mark J. F. Gales Cluster adaptive training of hidden Markov models , 2000, IEEE Trans. Speech Audio Process..

[7]  Keiichi Tokuda,et al.  Speech synthesis using HMMs with dynamic features , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[8]  B. Juang,et al.  Context-dependent Phonetic Hidden Markov Models for Speaker-independent Continuous Speech Recognition , 2008 .

[9]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[10]  Kai-Fu Lee,et al.  Context-independent phonetic hidden Markov models for speaker-independent continuous speech recognition , 1990 .

[11]  K. Nakajima,et al.  Speech recognition using dynamic transformation of phoneme templates depending on acoustic/phonetic environments , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[12]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[13]  Koichi Shinoda,et al.  MDL-based context-dependent subword modeling for speech recognition , 2000 .

[14]  Alan W. Black,et al.  Generating F/sub 0/ contours from ToBI labels using linear regression , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[15]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[16]  Yoshihiko Nankaku,et al.  Spectral modeling with contextual additive structure for HMM-based speech synthesis , 2010, SSW.

[17]  Heiga Zen,et al.  Acoustic modeling with contextual additive structure for HMM-based speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .

[19]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[20]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[21]  George Zavaliagkos,et al.  Convolutional density estimation in hidden Markov models for speech recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[22]  Heiga Zen,et al.  A Covariance-Tying Technique for HMM-Based Speech Synthesis , 2010, IEICE Trans. Inf. Syst..

[23]  Heiga Zen,et al.  Context-dependent additive log f_0 model for HMM-based speech synthesis , 2009, INTERSPEECH.