Sequence-Based Multi-Lingual Low Resource Speech Recognition

Techniques for multi-lingual and cross-lingual speech recognition can help in low-resource scenarios, both to bootstrap systems and to enable the analysis of new languages and domains. End-to-end approaches, in particular sequence-based techniques, are attractive because of their simplicity and elegance. While it is possible to integrate traditional multi-lingual bottleneck feature extractors as front-ends, we show that end-to-end multi-lingual training of sequence models is effective for context-independent models trained with the Connectionist Temporal Classification (CTC) loss. Our model improves performance on the Babel languages by over 6% absolute word/phoneme error rate compared to mono-lingual systems built in the same setting. We also show that the trained model can be adapted cross-lingually to an unseen language using only 25% of the target-language data. Training on multiple languages is important for very low-resource cross-lingual target scenarios, but not for multi-lingual test scenarios, where it instead appears beneficial to include large, well-prepared datasets.
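The CTC loss mentioned above scores a label sequence by summing, via a forward (alpha) recursion, over all frame-level alignments that collapse to that sequence once repeats and blanks are removed. As a hedged illustration (not the paper's implementation, which trains recurrent acoustic models), the sketch below computes the CTC negative log-likelihood for a single utterance from per-frame log-probabilities, assuming symbol 0 is the blank and the label sequence is non-empty:

```python
import numpy as np

def ctc_loss(log_probs, labels, blank=0):
    """CTC negative log-likelihood of `labels` (no blanks) for one utterance.

    log_probs: array of shape (T, V) with per-frame log-probabilities
    over V output symbols; `blank` is the CTC blank index.
    """
    # Extended label sequence: a blank before, between, and after labels.
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), log_probs.shape[0]

    def logsumexp(vals):
        m = max(vals)
        if m == -np.inf:
            return -np.inf
        return m + np.log(sum(np.exp(v - m) for v in vals))

    # alpha[t, s]: log-prob of all alignment prefixes ending at ext[s] at frame t.
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]            # stay on the same symbol
            if s > 0:
                cands.append(alpha[t - 1, s - 1])  # advance by one
            # Skip the blank only between two *different* labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = logsumexp(cands) + log_probs[t, ext[s]]

    # Valid alignments end on the last label or the trailing blank.
    return -logsumexp([alpha[T - 1, S - 1], alpha[T - 1, S - 2]])
```

For example, with two frames of uniform probabilities over {blank, a} and the target "a", the three collapsing alignments (aa, blank-a, a-blank) each have probability 0.25, so the loss is -log(0.75). In practice the recursion is vectorized and run batched on GPU, but the dynamic program is the same.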
