Paper information - Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize the speaker-specific and language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to (1) train a synthesis system, (2) synthesize speech in a language using speaker characteristics estimated in a different language, and (3) adapt to a new language.
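The two kinds of transform named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the usual textbook forms: a language-dependent mean built as a weighted sum of cluster mean vectors, and a speaker-dependent affine feature transform (CMLLR) whose log-determinant is added to the Gaussian log-likelihood. All function and variable names are hypothetical. The two-dimensional features and diagonal covariance are chosen only to keep the example short, and the cluster-dependent decision trees are omitted.

```python
import math


def interpolate_mean(cluster_means, language_weights):
    """Language-dependent mean: weighted sum of the cluster mean vectors."""
    dim = len(cluster_means[0])
    return [
        sum(w * mu[d] for w, mu in zip(language_weights, cluster_means))
        for d in range(dim)
    ]


def apply_affine(A, b, x):
    """Speaker-dependent feature transform: x_hat = A x + b."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def det2(A):
    """Determinant of a 2x2 matrix (the sketch is limited to 2-D features)."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def factorized_log_likelihood(x, A, b, cluster_means, language_weights, var):
    """log|det A| + log N(A x + b; interpolated mean, var)."""
    mean = interpolate_mean(cluster_means, language_weights)
    x_hat = apply_affine(A, b, x)
    return math.log(abs(det2(A))) + log_gauss_diag(x_hat, mean, var)


# Toy values, invented purely for illustration.
cluster_means = [[0.0, 0.0], [1.0, 2.0]]   # two clusters, 2-D features
language_weights = [1.0, 0.5]              # interpolation weights for one language
A = [[1.0, 0.0], [0.0, 1.0]]               # identity speaker transform
b = [0.0, 0.0]

mean = interpolate_mean(cluster_means, language_weights)
ll = factorized_log_likelihood([0.5, 1.0], A, b, cluster_means,
                               language_weights, [1.0, 1.0])
```

In this sketch the speaker transform acts on the observation and the language weights act on the model mean, so changing `language_weights` while keeping `A` and `b` fixed corresponds loosely to synthesizing in one language with speaker characteristics estimated in another.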
