Using speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis

This paper presents a method for hidden Markov model (HMM)-based Mandarin-Tibetan cross-lingual statistical speech synthesis using speaker adaptive training (SAT). A Speech Assessment Methods Phonetic Alphabet (SAMPA) set is designed to label the initials and finals of Mandarin and Tibetan syllables according to the similarities in pronunciation between the two languages. A grapheme-to-phoneme conversion method converts Chinese or Tibetan sentences into SAMPA-based Pinyin sequences. A Mandarin statistical speech synthesis framework is employed for Mandarin-Tibetan cross-lingual speech synthesis: a context-dependent label format is designed to label the context information of Mandarin and Tibetan sentences, and a question set is built for context-dependent decision tree clustering. With initials and finals as the synthesis units, a set of average mixed-lingual models is trained from a large multi-speaker Mandarin corpus and a small single-speaker Tibetan corpus using SAT. A speaker adaptation transformation is then applied with the speaker-dependent (SD) training data to obtain a set of SD Mandarin or Tibetan models from the average mixed-lingual models, and Mandarin or Tibetan speech is synthesized from these SD models. Tests show that this method outperforms the method using only Tibetan SD models when only a small number of Tibetan training utterances are available. As the number of Tibetan training utterances increases, the performance of the two methods tends to converge. Mixing Tibetan training sentences into the corpus has only a small effect on the quality of synthesized Mandarin speech.
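The use of initials and finals as synthesis units can be illustrated with a minimal sketch. It assumes plain Hanyu Pinyin input with an optional trailing tone digit and uses the standard list of Mandarin initials; it is not the paper's SAMPA set, which is not reproduced here. The names `INITIALS`, `split_syllable` and `to_units` are hypothetical, and syllables beginning with `y` or `w` are simply treated as having no initial.

```python
# Illustrative only: standard Hanyu Pinyin initials, NOT the paper's SAMPA labels.
# Two-letter initials come first so "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable):
    """Split a pinyin syllable into (initial, final, tone).

    The tone is an int if the syllable ends in a digit, otherwise None.
    A syllable with no matching initial gets an empty-string initial.
    """
    tone = None
    if syllable and syllable[-1].isdigit():
        syllable, tone = syllable[:-1], int(syllable[-1])
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    return "", syllable, tone


def to_units(syllables):
    """Flatten a syllable sequence into a sequence of initial/final units."""
    units = []
    for syllable in syllables:
        initial, final, _tone = split_syllable(syllable)
        if initial:
            units.append(initial)
        units.append(final)
    return units


if __name__ == "__main__":
    print(split_syllable("zhong1"))
    print(to_units(["ni3", "hao3"]))
```

In a full system, each unit would additionally carry context information (such as tone and position in the sentence) in its label; that label format is specific to the paper and is not sketched here.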
