Multilingual Speech Recognition with a Single End-to-End Model

Training a conventional automatic speech recognition (ASR) system to support multiple languages is challenging because the sub-word units, lexicons, and word inventories are typically language-specific. In contrast, sequence-to-sequence models are well suited to multilingual ASR because they jointly encapsulate the acoustic, pronunciation, and language models in a single network. In this work we present a single sequence-to-sequence ASR model trained on nine Indian languages whose scripts have very little overlap. Specifically, we take the union of the language-specific grapheme sets and train a grapheme-based sequence-to-sequence model jointly on data from all languages. We find that this model, which is given no explicit information about language identity, improves recognition performance by 21% relative to analogous sequence-to-sequence models trained on each language individually. By modifying the model to accept a language identifier as an additional input feature, we improve performance by a further 7% relative and eliminate confusion between languages.
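The two ideas in the abstract — pooling the language-specific grapheme inventories into one output vocabulary, and feeding a language identifier as an extra input feature — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names, the special symbols, and the one-hot encoding of the language ID are assumptions for the sake of the example.

```python
def union_grapheme_vocab(transcripts_by_lang):
    """Build the union of language-specific grapheme sets.

    transcripts_by_lang: dict mapping a language code to a list of
    transcript strings in that language's native script.
    Returns a sorted grapheme list with (assumed) seq2seq special
    symbols prepended; each Unicode character is treated as a grapheme.
    """
    graphemes = set()
    for texts in transcripts_by_lang.values():
        for text in texts:
            graphemes.update(text)
    return ["<sos>", "<eos>"] + sorted(graphemes)


def append_language_id(frames, lang, lang_list):
    """Append a one-hot language-ID vector to every acoustic frame.

    frames: list of per-frame feature vectors (lists of floats).
    lang: language code of this utterance; lang_list: all languages,
    in a fixed order shared by training and inference.
    """
    one_hot = [1.0 if code == lang else 0.0 for code in lang_list]
    return [frame + one_hot for frame in frames]
```

Because the scripts barely overlap, the union vocabulary is close to the sum of the per-language grapheme sets, and the shared decoder must learn which script to emit — which is exactly the confusion the language-ID feature removes.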
