Research on Transfer Learning for Khalkha Mongolian Speech Recognition Based on TDNN

Automatic speech recognition (ASR) systems that combine neural networks with hidden Markov models (NN/HMM) have achieved state-of-the-art results on various benchmarks, but most rely on large amounts of training data. ASR therefore remains difficult for low-resource languages such as Khalkha Mongolian. Transfer learning has been shown to exploit out-of-domain data effectively to improve ASR performance in similar data-scarce settings. In this paper, we investigate two weight-transfer approaches to improve Khalkha Mongolian ASR based on lattice-free maximum mutual information (LF-MMI). In addition, i-vector features are concatenated with MFCC features as input to validate the effectiveness of the transferred Khalkha Mongolian models. Experimental results show that weight transfer from out-of-domain Chahar speech yields substantial improvements over the baseline model on Khalkha speech, and that transferring part of the model outperforms transferring the whole model. Furthermore, splicing i-vectors together with MFCCs as input features further improves the acoustic model. The word error rate (WER) of the best model is reduced by a relative 10.96% compared with the in-domain Khalkha baseline model.
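The two ideas in the abstract — transferring only part of a pretrained model's weights, and splicing a per-utterance i-vector onto each MFCC frame — can be sketched as follows. This is a minimal illustration in plain Python, not the paper's actual Kaldi/TDNN recipe; the layer names (`tdnn1`, `output`, etc.) and dimensions are hypothetical.

```python
def partial_transfer(source_params, target_params, shared_prefixes):
    """Copy only parameters whose names start with one of `shared_prefixes`
    from the source (e.g. Chahar) model into the target (Khalkha) model.
    Layers not matched (e.g. the output layer) keep their fresh
    initialization and are retrained on the target data."""
    transferred = []
    for name, value in source_params.items():
        if name in target_params and any(name.startswith(p) for p in shared_prefixes):
            target_params[name] = value
            transferred.append(name)
    return transferred


def splice_features(mfcc_frames, ivector):
    """Append the per-utterance i-vector to every MFCC frame, mirroring the
    MFCC + i-vector input features described in the abstract."""
    return [list(frame) + list(ivector) for frame in mfcc_frames]


# Hypothetical example: transfer only the hidden TDNN layers.
src = {"tdnn1.weight": [0.1], "tdnn2.weight": [0.2], "output.weight": [0.9]}
tgt = {"tdnn1.weight": [0.0], "tdnn2.weight": [0.0], "output.weight": [0.0]}
moved = partial_transfer(src, tgt, shared_prefixes=("tdnn1", "tdnn2"))
# The output layer stays freshly initialized for the target language.

spliced = splice_features([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.6])
# Each 2-dim "MFCC" frame now carries the same 2-dim "i-vector" appended.
```

Transferring the whole model would simply pass every layer prefix to `partial_transfer`; the abstract's finding is that restricting the copy to the shared hidden layers works better.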
