Sequence-Based Multi-Lingual Low Resource Speech Recognition

Techniques for multi-lingual and cross-lingual speech recognition can help in low-resource scenarios, both to bootstrap systems and to enable the analysis of new languages and domains. End-to-end approaches, in particular sequence-based techniques, are attractive because of their simplicity and elegance. While it is possible to integrate traditional multi-lingual bottleneck feature extractors as front-ends, we show that end-to-end multi-lingual training of sequence models is effective for context-independent models trained with the Connectionist Temporal Classification (CTC) loss. Our model improves performance on the Babel languages by over 6% absolute word/phoneme error rate compared to mono-lingual systems built in the same setting. We also show that the trained model can be adapted cross-lingually to an unseen language using only 25% of the target-language data. Training on multiple languages is important for very low-resource cross-lingual target scenarios, but not for multi-lingual test scenarios, where it instead appears beneficial to include large, well-prepared datasets.
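The CTC loss mentioned above scores a label sequence by summing, via a forward (alpha) recursion, over all frame-level alignments that collapse to that sequence once repeats and blanks are removed. As a hedged illustration (not the paper's implementation, which trains recurrent acoustic models), the sketch below computes the CTC negative log-likelihood for a single utterance from per-frame log-probabilities, assuming symbol 0 is the blank and the label sequence is non-empty:

```python
import numpy as np

def ctc_loss(log_probs, labels, blank=0):
    """CTC negative log-likelihood of `labels` (no blanks) for one utterance.

    log_probs: array of shape (T, V) with per-frame log-probabilities
    over V output symbols; `blank` is the CTC blank index.
    """
    # Extended label sequence: a blank before, between, and after labels.
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), log_probs.shape[0]

    def logsumexp(vals):
        m = max(vals)
        if m == -np.inf:
            return -np.inf
        return m + np.log(sum(np.exp(v - m) for v in vals))

    # alpha[t, s]: log-prob of all alignment prefixes ending at ext[s] at frame t.
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]            # stay on the same symbol
            if s > 0:
                cands.append(alpha[t - 1, s - 1])  # advance by one
            # Skip the blank only between two *different* labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = logsumexp(cands) + log_probs[t, ext[s]]

    # Valid alignments end on the last label or the trailing blank.
    return -logsumexp([alpha[T - 1, S - 1], alpha[T - 1, S - 2]])
```

For example, with two frames of uniform probabilities over {blank, a} and the target "a", the three collapsing alignments (aa, blank-a, a-blank) each have probability 0.25, so the loss is -log(0.75). In practice the recursion is vectorized and run batched on GPU, but the dynamic program is the same.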
