Discriminative Data Selection from Multiple ASR Systems' Hypotheses for Unsupervised Acoustic Model Training (Speech) -- (17th Spoken Language Symposium)

This paper addresses unsupervised training of a DNN acoustic model by exploiting a large amount of unlabeled data with CRF-based classifiers. In the proposed scheme, ASR hypotheses are obtained from complementary GMM-based and DNN-based ASR systems. A set of dedicated classifiers is then designed and trained to select the better hypothesis and to verify the selected data. We demonstrate that the classifiers can effectively filter usable data from the unlabeled data for acoustic model training. The proposed method significantly improved ASR accuracy over the baseline system, and it outperformed models trained on data selected by confidence measure scores (CMS) as well as models trained on the output of simple ROVER-based system combination.
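
The select-then-verify flow described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the names `Utterance`, `select_training_data`, `choose_dnn`, and `accept` are hypothetical, the two callables stand in for the trained CRF-based classifiers, and selection is done per utterance for simplicity, a granularity the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Utterance:
    utt_id: str
    gmm_hyp: List[str]  # word sequence from the GMM-based system
    dnn_hyp: List[str]  # word sequence from the DNN-based system


def select_training_data(
    utterances: List[Utterance],
    choose_dnn: Callable[[Utterance], bool],          # stand-in for the selection classifier
    accept: Callable[[Utterance, List[str]], bool],   # stand-in for the verification classifier
) -> List[Tuple[str, List[str]]]:
    """Return (utt_id, hypothesis) pairs judged usable as training labels."""
    selected = []
    for utt in utterances:
        if utt.gmm_hyp == utt.dnn_hyp:
            # Both systems agree, so there is nothing to choose between.
            hyp = utt.dnn_hyp
        else:
            # Systems disagree: ask the selection classifier which one to keep.
            hyp = utt.dnn_hyp if choose_dnn(utt) else utt.gmm_hyp
        # Verification step: keep the chosen hypothesis only if it is accepted.
        if accept(utt, hyp):
            selected.append((utt.utt_id, hyp))
    return selected
```

In practice the two callables would wrap trained classifiers operating on features of the hypotheses; here any function with the matching signature can be plugged in to experiment with the control flow.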
