Merging of Native and Non-native Speech for Low-resource Accented ASR

This paper presents our recent study on low-resource automatic speech recognition ASR system with accented speech. We propose multi-accent Subspace Gaussian Mixture Models SGMM and accent-specific Deep Neural Networks DNN for improving non-native ASR performance. In the SGMM framework, we present an original language weighting strategy to merge the globally shared parameters of two models based on native and non-native speech respectively. In the DNN framework, a native deep neural net is fine-tuned to non-native speech. Over the non-native baseline, we achieved relative improvement of 15i?ź% for multi-accent SGMM and 34i?ź% for accent-specific DNN with speaker adaptation.

[1]  Xin Chen,et al.  Deep neural network acoustic modeling for native and non-native Mandarin speech recognition , 2012, The 9th International Symposium on Chinese Spoken Language Processing.

[2]  Liang Lu,et al.  Cross-Lingual Subspace Gaussian Mixture Models for Low-Resource Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3]  Yifan Gong,et al.  Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[5]  Chao Huang,et al.  Accent modeling based on pronunciation dictionary adaptation for large vocabulary Mandarin speech recognition , 2000, INTERSPEECH.

[6]  Geoffrey E. Hinton A Practical Guide to Training Restricted Boltzmann Machines , 2012, Neural Networks: Tricks of the Trade.

[7]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[8]  J. Hansen,et al.  A STUDY OF TEMPORAL FEATURES AND FREQUENCY CHARACTERISTICS IN AMERICAN ENGLISH FOREIGN ACCENT , 1997 .

[9]  Brian Kingsbury,et al.  Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10]  Tien Ping Tan,et al.  Acoustic Model Interpolation for Non-Native Speech Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[11]  Tien Ping Tan,et al.  Acoustic model merging using acoustic models from multilingual speakers for automatic speech recognition , 2014, 2014 International Conference on Asian Language Processing (IALP).

[12]  Florian Metze,et al.  Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training , 2013, INTERSPEECH.

[13]  John J. Morgan,et al.  Making a Speech Recognizer Tolerate Non-native Speech through Gaussian Mixture Merging , 2004 .

[14]  Richard C. Rose,et al.  Dealing with acoustic mismatch for training multilingual subspace Gaussian mixture models for speech recognition , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Paul Deléglise,et al.  TED-LIUM: an Automatic Speech Recognition dedicated corpus , 2012, LREC.

[16]  Petr Motlícek,et al.  Using out-of-language data to improve an under-resourced speech recognizer , 2014, Speech Communication.

[17]  Steve Renals,et al.  Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[18]  Kai Feng,et al.  The subspace Gaussian mixture model - A structured model for speech recognition , 2011, Comput. Speech Lang..

[19]  Ngoc Thang Vu,et al.  Multilingual deep neural network based acoustic modeling for rapid language adaptation , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Kai Feng,et al.  SUBSPACE GAUSSIAN MIXTURE MODELS FOR SPEECH RECOGNITION , 2009 .

[21]  Georg Heigold,et al.  Multilingual acoustic models using distributed deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[22]  Yifan Gong,et al.  Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation , 2014, INTERSPEECH.

[23]  Silke Goronzy,et al.  Robust Adaptation to Non-Native Accents in Automatic Speech Recognition , 2002, Lecture Notes in Computer Science.

[24]  Jean Paul Haton,et al.  Fully Automated Non-Native Speech Recognition Using Confusion-Based Acoustic Model Integration and Graphemic Constraints , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[25]  Rong Tong,et al.  Subspace Gaussian mixture model for computer-assisted language learning , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .