Text-to-speech with cross-lingual neural network-based grapheme-to-phoneme models

Modern Text-To-Speech (TTS) systems increasingly need to deal with multilingual input. Navigation, social media and news are all domains with a large proportion of foreign words. However, when typical monolingual TTS voices are used, synthesis quality on such input is markedly lower. This is because traditional TTS derives pronunciations from a lexicon or a Grapheme-To-Phoneme (G2P) model built with a pre-defined sound inventory and phonotactic grammar for one language only. G2P models perform poorly on foreign words, while manual lexicon development is labour-intensive, expensive and requires extra storage. Furthermore, large phoneme inventories and phonotactic grammars contribute to data sparsity in unit selection systems. We present an automatic system for deriving pronunciations for foreign words that utilises the monolingual voice design and can rapidly scale to many languages. The proposed system, based on a neural network cross-lingual G2P model, does not increase the size of the voice database, does not require large data annotation efforts, is designed not to increase data sparsity in the voice, and can be sized to suit embedded applications.
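
As a concrete illustration of the kind of model the abstract describes, below is a minimal sketch of a neural network G2P classifier that maps the letters of a foreign word onto a target voice's phoneme inventory. This is not the authors' architecture: the window-based design, the toy grapheme and phoneme inventories, the names (WindowG2P, windows_for_word) and all hyperparameters are assumptions made purely for demonstration, using PyTorch as a convenient framework.

```python
# Illustrative sketch only: a window-based neural G2P classifier.
# The architecture, window size and phoneme inventory are assumptions,
# not the system described in the paper.
import torch
import torch.nn as nn

GRAPHEMES = list("abcdefghijklmnopqrstuvwxyz") + ["_"]   # "_" pads word edges
PHONEMES = ["AE", "B", "D", "IY", "K", "S", "T", "sil"]  # toy target-voice inventory
G2I = {g: i for i, g in enumerate(GRAPHEMES)}
WINDOW = 5                                               # letters of context per prediction

class WindowG2P(nn.Module):
    """Predicts one target-language phoneme per grapheme from a letter window."""
    def __init__(self, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(len(GRAPHEMES), 32)
        self.mlp = nn.Sequential(
            nn.Linear(WINDOW * 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(PHONEMES)),
        )

    def forward(self, windows):          # windows: (batch, WINDOW) int64 indices
        x = self.embed(windows)          # (batch, WINDOW, 32)
        return self.mlp(x.flatten(1))    # (batch, |PHONEMES|) logits

def windows_for_word(word):
    """Slide a fixed-size letter window over a word, padding the edges with '_'."""
    pad = "_" * (WINDOW // 2)
    padded = pad + word.lower() + pad
    rows = [[G2I.get(c, G2I["_"]) for c in padded[i:i + WINDOW]]
            for i in range(len(word))]
    return torch.tensor(rows, dtype=torch.long)

model = WindowG2P()
logits = model(windows_for_word("navigation"))
print([PHONEMES[i] for i in logits.argmax(dim=1).tolist()])  # untrained: arbitrary output
```

Because the output layer is defined over the target voice's own phoneme inventory, a model of this kind can assign foreign words a pronunciation the monolingual voice can realise without enlarging its sound inventory, which is the design constraint the abstract emphasises.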
