A comparative study of BNF and DNN multilingual training on cross-lingual low-resource speech recognition

This paper presents a comparative study of bottleneck feature (BNF) and deep neural network (DNN) multilingual training for low-resource speech recognition. In addition, we compare system performance after fine-tuning. The evaluation was conducted on the IARPA Babel data: the source languages are Cantonese, Pashto, Tagalog, and Turkish, and the target languages are Vietnamese and Tamil. Compared with the monolingual baseline systems, the BNF and DNN methods achieved similar relative word error rate (WER) reductions of 3.0-5.1% on the Limited Language Pack (LLP) data and 6.1-8.6% on the Very Limited Language Pack (VLLP) data. With fine-tuning, the BNF method gained a further significant improvement, while the DNN method obtained only marginal gains. Overall, we observed that the DNN method performs worse on the smaller data set (VLLP) in the cross-lingual transfer case and on the noisier data (Tamil) in the fine-tuning case.
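The improvements above are stated as relative WER reductions, i.e. the drop in WER expressed as a percentage of the baseline WER. A minimal sketch of that arithmetic (the baseline and system WERs below are illustrative placeholders, not numbers from the paper):

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent: how much of the baseline
    error rate the new system removes."""
    return (baseline_wer - system_wer) / baseline_wer * 100.0

# Hypothetical example: a baseline at 50.0% WER and a multilingual
# system at 47.5% WER give a 5.0% relative reduction.
print(relative_wer_reduction(50.0, 47.5))
```

Note that a "relative" reduction of 5% on a 50% baseline corresponds to only a 2.5-point absolute drop, which is why papers typically state which convention they use.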