Contextual partial additive structure for HMM-based speech synthesis

This paper proposes a spectral modeling technique based on a contextual partial additive structure for HMM-based speech synthesis. To represent complicated context dependencies, contextual additive structure models assume that acoustic features are formed by multiple independent components, each with its own context dependency. Conventional additive structure models are constrained to use a fixed number of additive components to generate acoustic features. However, it is more natural to assume that the number of components depends on the context. In the proposed technique, partial additive components, each affecting an arbitrary contextual sub-space, are created on demand to increase the likelihood. As a result, the number of components for each context is determined automatically from the training data. Experimental results show that the proposed technique outperforms the standard technique in a subjective test.
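The idea of a partial additive structure can be illustrated with a minimal sketch. It is not the paper's implementation: the names `PartialComponent` and `additive_mean`, the context keys `phone` and `accent`, and all numeric values are hypothetical. The sketch assumes that the mean of an acoustic feature for a given context is the sum of the means of those components whose contextual sub-space contains that context, and it ignores covariances and the likelihood-based training that decides when a new component is created.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Context = Dict[str, str]


@dataclass
class PartialComponent:
    """Hypothetical additive component active only on a contextual sub-space."""
    applies: Callable[[Context], bool]  # membership test for the sub-space
    mean: List[float]                   # contribution to the feature mean


def additive_mean(context: Context,
                  components: List[PartialComponent],
                  dim: int) -> Tuple[List[float], int]:
    """Sum the means of all components whose sub-space covers `context`.

    Returns the summed mean and the number of active components, which
    varies from context to context.
    """
    total = [0.0] * dim
    active = 0
    for comp in components:
        if comp.applies(context):
            active += 1
            total = [t + m for t, m in zip(total, comp.mean)]
    return total, active


# Illustrative values only (assumptions, not taken from the paper).
components = [
    # Base component covering every context.
    PartialComponent(lambda c: True, [1.0, 0.5]),
    # Partial component covering only contexts whose phone is "a".
    PartialComponent(lambda c: c["phone"] == "a", [0.2, -0.1]),
    # Narrower partial component: phone "a" with a high accent.
    PartialComponent(
        lambda c: c["phone"] == "a" and c["accent"] == "high",
        [0.05, 0.3],
    ),
]

mean_k, n_k = additive_mean({"phone": "k", "accent": "low"}, components, 2)
mean_a, n_a = additive_mean({"phone": "a", "accent": "high"}, components, 2)
print(n_k, mean_k)
print(n_a, mean_a)
```

In this sketch the context with phone "k" is covered by one component while the context with phone "a" and a high accent is covered by three, which mirrors the claim that the number of components depends on the context rather than being fixed in advance.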
