Combining multiple high quality corpora for improving HMM-TTS

The most reliable way to build synthetic voices for end products is to start with high-quality recordings from professional voice talents. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) to combine a small number of these high-quality corpora, making the best use of them and improving overall voice quality in hidden Markov model based text-to-speech (HMM-TTS) systems. It is shown that integrated training, whether by CAT or by AVMs, yields better-sounding voices than speaker-dependent modelling. It is also shown that CAT has an advantage over AVMs when adapting to a new speaker: given a limited amount of adaptation data, CAT maintains a much higher voice quality, even when adapted to tiny amounts of speech.
