Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer

This paper presents advances in Google's hidden Markov model (HMM)-driven unit selection speech synthesis system. We describe several improvements to the run-time system, including minimal latency, high quality, and a fast refresh cycle for new voices. Traditionally, unit selection synthesizers have been limited in the amount of data they can handle and in the real applications they are built for. These limits are even more critical for real-life, large-scale applications, where high quality is expected and low latency is required given the available computational resources. In this paper we present an optimized engine that handles a large database at run time, together with a composite unit search approach that combines diphones and phrase-based units. In addition, we present a new voice-building strategy that handles large databases while keeping build times low.
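The core operation such a unit selection engine performs is a Viterbi-style dynamic-programming search over candidate units, minimizing the sum of target costs (how well a unit matches the desired specification) and concatenation costs (how smoothly adjacent units join). The sketch below is purely illustrative, not the paper's actual engine; the function names `target_cost` and `join_cost` are placeholders for whatever cost models (e.g. HMM-derived likelihoods) the system supplies.

```python
# Illustrative Viterbi-style unit selection sketch (hypothetical, not the
# paper's implementation): choose one candidate unit per target position so
# that the sum of target costs and concatenation (join) costs is minimized.

def select_units(candidates, target_cost, join_cost):
    """candidates: list of candidate-unit lists, one list per target position.
    target_cost(t, u): mismatch between target position t and unit u.
    join_cost(u, v): cost of concatenating unit u followed by unit v."""
    n = len(candidates)
    # cost[i][j]: cheapest cumulative cost ending in candidate j at position i
    cost = [[0.0] * len(c) for c in candidates]
    back = [[-1] * len(c) for c in candidates]
    for j, u in enumerate(candidates[0]):
        cost[0][j] = target_cost(0, u)
    for i in range(1, n):
        for j, u in enumerate(candidates[i]):
            # best predecessor for this candidate
            k = min(
                range(len(candidates[i - 1])),
                key=lambda p: cost[i - 1][p] + join_cost(candidates[i - 1][p], u),
            )
            cost[i][j] = (cost[i - 1][k]
                          + join_cost(candidates[i - 1][k], u)
                          + target_cost(i, u))
            back[i][j] = k
    # trace back the cheapest path from the last position
    j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

In a real-time engine this search is typically combined with beam pruning and preselection so that only a small fraction of the database is scored per utterance; a composite search additionally lets long phrase-sized units compete with diphone sequences on the same cost lattice.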
