Dictionary Augmented Sequence-to-Sequence Neural Network for Grapheme to Phoneme Prediction

Both automatic speech recognition and text-to-speech systems need accurate pronunciations, typically obtained from a combination of a pronunciation lexicon and a grapheme-to-phoneme (G2P) model. G2P models typically struggle to predict pronunciations for tail words, and we hypothesized that one reason is that they try to discover general pronunciation rules without using prior knowledge of the pronunciations of related words. Our new approach extends a sequence-to-sequence G2P model by injecting this prior knowledge. In addition, our model can be updated without retraining the system. We show that the new model performs significantly better for German, both on a tightly controlled task and in our real-world system. Finally, the simplification of the system allows faster and easier scaling to other languages.
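The idea of injecting prior lexicon knowledge into a sequence-to-sequence G2P model can be illustrated with a minimal sketch. The abstract does not specify the injection mechanism, so everything below is an assumption for illustration: we assume the augmentation takes the form of appending the lexicon pronunciation of a related word to the grapheme input sequence, and we use a toy longest-shared-prefix heuristic for "relatedness". The lexicon entry and phone symbols are likewise illustrative, not from the paper.

```python
def augment_input(word, lexicon, find_related):
    """Build a model input that injects prior knowledge: the graphemes of
    `word`, plus the lexicon pronunciation of a related word (if any),
    separated by a special <sep> token. A seq2seq G2P model would then be
    trained on these augmented sequences instead of bare grapheme strings."""
    tokens = list(word)
    related = find_related(word, lexicon)
    if related is not None:
        tokens += ["<sep>"] + lexicon[related].split()
    return tokens

def longest_prefix_related(word, lexicon):
    """Toy relatedness heuristic (an assumption, not the paper's method):
    pick the lexicon entry sharing the longest prefix (>= 3 chars) with `word`."""
    best = None
    for entry in lexicon:
        k = 0
        while k < min(len(entry), len(word)) and entry[k] == word[k]:
            k += 1
        if k >= 3 and (best is None or k > best[0]):
            best = (k, entry)
    return best[1] if best else None

# Illustrative German-flavored lexicon entry (hypothetical phone symbols).
lexicon = {"spiel": "SH P IY L"}

# For the unseen word "spielen", the model input carries the related
# pronunciation as context; for an unrelated word, it is just the graphemes.
print(augment_input("spielen", lexicon, longest_prefix_related))
print(augment_input("xyz", lexicon, longest_prefix_related))
```

Because the prior knowledge arrives through the input sequence rather than the model weights, updating the lexicon changes the model's context at inference time without retraining, which matches the update property claimed in the abstract.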
