HMM-based polyglot speech synthesis by speaker and language adaptive training

This paper describes a technique for speaker and language adaptive training (SLAT) for HMM-based polyglot speech synthesis and its evaluation on a multilingual speech corpus. SLAT allows adaptive training and synthesis to be performed across multiple speakers and multiple languages. Experimental results show that SLAT achieves better naturalness than both language-dependent (LD-SAT) and language-independent (LI-SAT) speaker-adaptively trained models. In speaker similarity tests for cross-lingual adaptation, SLAT and LI-SAT outperform LD-SAT, but significant differences remain between polyglot adaptation and intra-language adaptation.
