Improving Deliverable Speech-to-Text Systems with Multilingual Knowledge Transfer

This paper reports our recent progress on using multilingual data to improve speech-to-text (STT) systems that can be easily delivered. We continued the work BBN conducted on the use of multilingual data for improving Babel evaluation systems, but focused on training time-delay neural network (TDNN) based chain models. As in the Babel evaluations, we used multilingual data in two ways: first, to train multilingual deep neural networks (DNNs) for extracting bottleneck (BN) features, and second, to initialize training on target languages. Our results show that TDNN chain models trained on multilingual DNN bottleneck features yield significant gains over their counterparts trained on MFCC plus i-vector features. By initializing from models trained on multilingual data, TDNN chain models achieve substantial improvements over random initialization of the network weights on target languages. Two other important findings are: 1) initialization with multilingual TDNN chain models produces larger gains on target languages that have less training data; 2) inclusion of target languages in multilingual training, whether for BN feature extraction or for initialization, has limited impact on performance measured on those target languages. Our results also reveal that for TDNN chain models, the combination of multilingual BN features and multilingual initialization achieves the best performance on all target languages.
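
For illustration only, the sketch below shows the two uses of multilingual data described above in miniature: a multilingual DNN with shared hidden layers, a bottleneck layer, and per-language output layers whose bottleneck activations can serve as features, plus a helper that copies matching multilingual weights into a target-language model as an initialization. This is a minimal PyTorch sketch, not the authors' Kaldi setup; all layer sizes, language names, and the MultilingualBNExtractor / init_from_multilingual names are assumptions, and a plain feed-forward network stands in for the TDNN chain model for brevity.

```python
# Minimal sketch (not the authors' Kaldi recipe) of the two uses of multilingual data:
# (1) a multilingual DNN with shared hidden layers, a bottleneck layer, and one output
#     head per language, whose bottleneck activations are used as features, and
# (2) transferring the multilingual weights into a target-language model as its
#     initialization. Layer sizes and language names are illustrative.
import torch
import torch.nn as nn


class MultilingualBNExtractor(nn.Module):
    """Shared hidden layers + bottleneck, with one softmax head per language."""

    def __init__(self, feat_dim=40, hidden_dim=1024, bn_dim=80, senones_per_lang=None):
        super().__init__()
        senones_per_lang = senones_per_lang or {"lang_a": 3000, "lang_b": 3000}
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, bn_dim),          # bottleneck layer
        )
        # Language-specific output layers, trained jointly on the pooled data.
        self.heads = nn.ModuleDict({
            lang: nn.Linear(bn_dim, n_senones)
            for lang, n_senones in senones_per_lang.items()
        })

    def forward(self, x, lang):
        bn = self.shared(x)
        return self.heads[lang](bn)

    def extract_bn(self, x):
        # Bottleneck activations used as input features for the acoustic model.
        with torch.no_grad():
            return self.shared(x)


def init_from_multilingual(target_model, multilingual_model):
    """Copy all shape-matching multilingual weights into the target-language model;
    layers that differ in shape (e.g. the output layer) keep their random init."""
    src = multilingual_model.state_dict()
    dst = target_model.state_dict()
    transferred = {k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape}
    dst.update(transferred)
    target_model.load_state_dict(dst)
    return target_model


if __name__ == "__main__":
    frames = torch.randn(8, 40)                      # 8 frames of 40-dim features
    ml_dnn = MultilingualBNExtractor()
    bn_feats = ml_dnn.extract_bn(frames)             # multilingual BN features
    print(bn_feats.shape)                            # torch.Size([8, 80])

    # Target-language model initialized from the multilingual network.
    target = MultilingualBNExtractor(senones_per_lang={"target_lang": 2500})
    target = init_from_multilingual(target, ml_dnn)
```

In this toy setup, only the shared hidden and bottleneck layers transfer to the target language, while the language-specific output layer is trained from scratch, which mirrors the general idea of reusing multilingual weights and retraining on the target language.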
