Active Learning for LF-MMI Trained Neural Networks in ASR

This paper investigates how active learning (AL) affects the training of neural network acoustic models based on lattice-free maximum mutual information (LF-MMI) in automatic speech recognition (ASR). To exploit the most informative examples in fresh datasets, we studied several data selection criteria based on heterogeneous neural networks. In particular, we examined the relationships among the transcription cost of human labeling, example informativeness, and the data selection criteria for active learning. For comparison, we applied both semi-supervised training (SST) and active learning to improve the acoustic models. Experiments were performed on both small-scale and large-scale ASR systems. The results suggest that our AL scheme benefits much more from fresh data than SST in reducing the word error rate (WER): AL yields a 6∼13% relative WER reduction over a baseline trained on 4000 hours of transcribed data, while selecting only 1.2K hours of informative utterances for human labeling.
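
The abstract does not spell out the selection criterion, but the sketch below illustrates one common instantiation of informativeness-based selection: confidence-based uncertainty sampling under a fixed transcription budget. The `Utterance` structure, the `select_for_labeling` helper, and all numbers are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    utt_id: str
    duration: float    # audio length in seconds
    confidence: float  # hypothetical per-utterance score in [0, 1],
                       # e.g. an average lattice posterior of the 1-best path

def select_for_labeling(pool: List[Utterance], budget_hours: float) -> List[Utterance]:
    """Greedily pick the least-confident utterances until the
    human-transcription budget (in hours of audio) is exhausted."""
    budget_sec = budget_hours * 3600.0
    selected, spent = [], 0.0
    # Lowest confidence first: the utterances the current model is most
    # uncertain about are assumed to be the most informative to label.
    for utt in sorted(pool, key=lambda u: u.confidence):
        if spent + utt.duration <= budget_sec:
            selected.append(utt)
            spent += utt.duration
    return selected

if __name__ == "__main__":
    pool = [
        Utterance("utt1", 5.2, 0.91),
        Utterance("utt2", 7.8, 0.42),
        Utterance("utt3", 6.1, 0.35),
    ]
    for utt in select_for_labeling(pool, budget_hours=0.004):
        print(utt.utt_id, utt.confidence)
```

Committee-based variants replace the single confidence score with disagreement among heterogeneous models, but the budget-constrained greedy loop stays the same.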