Improving ASR Error Detection with RNNLM Adaptation

Applications of automatic speech recognition (ASR), such as broadcast transcription and dialog systems, can benefit from the ability to detect errors in the ASR output. The field of ASR error detection has emerged as a way to detect and subsequently correct ASR errors. The most common approach to ASR error detection is feature-based: a set of features is extracted from the ASR output and used to train a classifier that predicts correct/incorrect labels. Language models (LMs), either from the ASR decoder or externally trained, can provide features to an ASR error detection system through scores computed on the ASR output. Recently, recurrent neural network language model (RNNLM) features were proposed for ASR error detection and improved the classification rate, thanks to their ability to model longer-range context. RNNLM adaptation, through the introduction of auxiliary features that encode domain, has been shown to improve ASR performance. This work investigates whether RNNLM adaptation techniques can also improve ASR error detection performance in the context of multi-genre broadcast ASR. The results show that an overall improvement of about 1% in the F-measure can be achieved using adapted RNNLM features.
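As a rough illustration of the feature-based setup and of the F-measure used for evaluation, the sketch below labels each hypothesised word as correct (0) or incorrect (1) from a single language-model score and then scores the predictions against reference labels. It is only a toy under stated assumptions: the log-probabilities, the reference labels, the threshold value, and the function names `detect_errors` and `f_measure` are all hypothetical, and a fixed threshold stands in for the trained classifier described above. It is not the system evaluated in this work.

```python
from typing import List, Sequence, Tuple


def detect_errors(lm_logprobs: Sequence[float], threshold: float) -> List[int]:
    """Flag a word as an ASR error (1) when its LM log-probability is below the threshold.

    Assumption: a single LM score per word and a fixed threshold replace the
    multi-feature trained classifier of a real error detection system.
    """
    return [1 if lp < threshold else 0 for lp in lm_logprobs]


def f_measure(y_true: Sequence[int], y_pred: Sequence[int],
              positive: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical per-word LM log-probabilities for one ASR hypothesis.
lm_logprobs = [-1.2, -7.5, -0.8, -6.1, -2.0, -5.5]
# Hypothetical reference labels: 1 = word is an ASR error, 0 = word is correct.
reference_labels = [0, 1, 0, 0, 0, 1]

predicted = detect_errors(lm_logprobs, threshold=-5.0)
precision, recall, f1 = f_measure(reference_labels, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F-measure={f1:.3f}")
```

In a feature-based system, the single score would be replaced by a feature vector per word (for example, decoder LM scores together with baseline or adapted RNNLM scores), and the threshold by a classifier trained on labelled ASR output.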
