Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

An increasingly common scenario when building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, factorizes the speaker-specific and language-specific characteristics in the data and models them with separate transforms. Language-specific factors are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees, while acoustic variation caused by speaker characteristics is handled by transforms based on constrained maximum-likelihood linear regression (CMLLR). Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to train a synthesis system, to synthesize speech in one language using speaker characteristics estimated in a different language, and to adapt to a new language.
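
As a concrete illustration of the model structure described above, the following Python sketch shows how the output distribution of a single Gaussian component might be evaluated under speaker and language factorization: the language factor enters through cluster mean interpolation and the speaker factor through a CMLLR feature-space transform. This is a minimal sketch under assumed names (the function and variable names are illustrative, not from the paper), and the cluster-dependent decision trees, which tie different parameters per cluster, are omitted for brevity.

```python
import numpy as np

def slf_log_likelihood(obs, cluster_means, lang_weights, cmllr_A, cmllr_b, cov_diag):
    """
    Illustrative sketch (not the paper's implementation) of evaluating one
    speaker-and-language-factorized Gaussian component.

    obs           : (D,)   acoustic feature vector
    cluster_means : (P, D) per-cluster mean vectors for this component
                           (language factor, cluster mean interpolation)
    lang_weights  : (P,)   language-dependent interpolation weights
    cmllr_A       : (D, D) speaker-dependent CMLLR transform matrix
    cmllr_b       : (D,)   speaker-dependent CMLLR bias
    cov_diag      : (D,)   diagonal covariance of the component
    """
    # Language factor: interpolate the cluster means with language-dependent weights.
    mean = lang_weights @ cluster_means

    # Speaker factor: constrained MLLR applied in the feature domain.
    obs_hat = cmllr_A @ obs + cmllr_b

    # Diagonal-covariance Gaussian log-likelihood, plus the CMLLR Jacobian term.
    diff = obs_hat - mean
    log_gauss = -0.5 * np.sum(np.log(2.0 * np.pi * cov_diag) + diff**2 / cov_diag)
    return log_gauss + np.log(np.abs(np.linalg.det(cmllr_A)))
```

In this factorized form, adapting to a new speaker would amount to re-estimating only the CMLLR transform, and adapting to a new language to re-estimating only the interpolation weights (and associated trees), which is what allows speaker characteristics estimated in one language to be reused when synthesizing another.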
