LMC-SMCA: A New Active Learning Method in ASR

In Automatic Speech Recognition (ASR), transcribed data take substantial effort to obtain. It is worthwhile to explore how to selective the samples with more information from un-transcribed datapool to get a better model with the limited cost. Therefore, active learning in ASR becomes a research topic. In this manuscript, we proposed two new methods of active learning. One is Signal-Model Committee Approach (SMCA) and the other is LM-based Certainty Approach (LMCA). These two methods respectively evaluate the information amount of samples from different angles and can be applied together for joint sampling in some scenarios. We conducted many comparative experiments on Listen, Attend and Spell (LAS) model according to different demands. In experiments, we compared our approach with the random sampling and another state-of-the-art committee-based approach: heterogeneous neural networks (HNN) based approach. We examined our approach in CER in Chinese Mandarin speech recognition task. The results show that proposed approach is not only simple to use, but also has the best performance.

[1]  Tara N. Sainath,et al.  N-best entropy based data selection for acoustic modeling , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Titouan Parcollet,et al.  The Pytorch-kaldi Speech Recognition Toolkit , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Jean-Luc Gauvain,et al.  Active learning based data selection for limited resource STT and KWS , 2015, INTERSPEECH.

[4]  Dong Yu,et al.  Active Learning and Semi-supervised Learning for Speech Recognition: a Unified Framework Using the Global Entropy Reduction Maximization Criterion Computer Speech and Language Article in Press Active Learning and Semi-supervised Learning for Speech Recognition: a Unified Framework Using the Global E , 2022 .

[5]  Tara N. Sainath,et al.  Deep Neural Network Language Models , 2012, WLM@NAACL-HLT.

[6]  David A. Cohn,et al.  Improving generalization with active learning , 1994, Machine Learning.

[7]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[9]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[10]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[11]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Silvio Savarese,et al.  Active Learning for Convolutional Neural Networks: A Core-Set Approach , 2017, ICLR.

[13]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[14]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[15]  Zoubin Ghahramani,et al.  Deep Bayesian Active Learning with Image Data , 2017, ICML.

[16]  Navdeep Jaitly,et al.  Hybrid speech recognition with Deep Bidirectional LSTM , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[17]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[18]  Xin Yuan,et al.  Improving Low-Resource Speech Recognition Based on Improved NN-HMM Structures , 2020, IEEE Access.

[19]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[20]  Philip N. Garner,et al.  An Investigation of Multilingual ASR Using End-to-end LF-MMI , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Sanjeev Khudanpur,et al.  A time delay neural network architecture for efficient modeling of long temporal contexts , 2015, INTERSPEECH.

[22]  F. Jelinek,et al.  Perplexity—a measure of the difficulty of speech recognition tasks , 1977 .

[23]  Navdeep Jaitly,et al.  Towards Better Decoding and Language Model Integration in Sequence to Sequence Models , 2016, INTERSPEECH.

[24]  Hermann Ney,et al.  A comprehensive study of deep bidirectional LSTM RNNS for acoustic modeling in speech recognition , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Quoc V. Le,et al.  Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  John Doherty,et al.  Committee-Based Active Learning for Surrogate-Assisted Particle Swarm Optimization of Expensive Problems , 2017, IEEE Transactions on Cybernetics.

[27]  Hank Liao,et al.  Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[28]  Yiming Wang,et al.  Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI , 2016, INTERSPEECH.

[29]  Sebastian Ruder,et al.  Universal Language Model Fine-tuning for Text Classification , 2018, ACL.

[30]  Derek Hoiem,et al.  Learning without Forgetting , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  John R. Huizenga,et al.  Cold fusion: The scientific fiasco of the century , 1992 .

[33]  Jian Zhang,et al.  A Certainty-Based Active Learning Framework of Meeting Speech Summarization , 2014 .

[34]  Tara N. Sainath,et al.  Shallow-Fusion End-to-End Contextual Biasing , 2019, INTERSPEECH.

[35]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[36]  Dilek Z. Hakkani-Tür,et al.  Active learning: theory and applications to automatic speech recognition , 2005, IEEE Transactions on Speech and Audio Processing.

[37]  Hui Bu,et al.  AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale , 2018, ArXiv.

[38]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[39]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[40]  Thomas Niesler,et al.  A variable-length category-based n-gram language model , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[41]  Yanhua Long,et al.  Active Learning for LF-MMI Trained Neural Networks in ASR , 2018, INTERSPEECH.

[42]  Yoshua Bengio,et al.  On Using Monolingual Corpora in Neural Machine Translation , 2015, ArXiv.