Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks

Word pronunciations, consisting of phoneme sequences and the associated syllabification and stress patterns, are vital for both speech recognition and text-to-speech (TTS) systems. For speech recognition, phoneme sequences for words may be learned from audio data. We train recurrent neural network (RNN) based models to predict the syllabification and stress pattern for such pronunciations, making them usable for TTS. We find that these RNN models significantly outperform naive rule-based models for almost all languages we tested. Further, we find additional improvements to the stress prediction model by using the spelling as features in addition to the phoneme sequence. Finally, we train a single RNN model to predict the phoneme sequence, syllabification, and stress for a given word. For several languages, this single RNN outperforms similar models trained specifically for either phoneme sequence or stress prediction. We report an exhaustive comparison of these approaches for twenty languages.
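To make the task concrete, the models described above consume a phoneme sequence and emit one tag per phoneme (e.g., syllable-boundary markers or stress levels). The following is a minimal sketch, assuming a toy phoneme inventory and an untrained Elman-style RNN with random weights; the paper's models are LSTM-based and trained, so this illustrates only the shape of the sequence-labelling task, not the actual system.

```python
import numpy as np

# Hypothetical phoneme inventory and per-phoneme tag set (assumptions,
# not the paper's actual symbol sets).
PHONEMES = ["p", "r", "@", "d", "I", "k", "t"]
LABELS = ["no-boundary", "syllable-boundary"]

rng = np.random.default_rng(0)
V, H, L = len(PHONEMES), 16, len(LABELS)
Wxh = rng.normal(0, 0.1, (H, V))   # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden (recurrent) weights
Why = rng.normal(0, 0.1, (L, H))   # hidden-to-output weights

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def tag(phoneme_seq):
    """Return one predicted label per input phoneme (argmax over scores)."""
    h = np.zeros(H)
    tags = []
    for ph in phoneme_seq:
        x = one_hot(PHONEMES.index(ph), V)
        h = np.tanh(Wxh @ x + Whh @ h)   # recurrent state update
        scores = Why @ h                 # unnormalised per-label scores
        tags.append(LABELS[int(np.argmax(scores))])
    return tags

tags = tag(["p", "r", "@", "d", "I", "k", "t"])
print(len(tags))  # one tag per phoneme
```

The key property shown here is the one-to-one alignment between input phonemes and output tags, which is what lets a recurrent tagger predict syllabification or stress over an existing pronunciation; the joint model in the paper instead maps spellings to full annotated pronunciations.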
