Speaker-Adapted Confidence Measures for ASR Using Deep Bidirectional Recurrent Neural Networks

In recent years, deep bidirectional recurrent neural networks (DBRNN) and DBRNN with long short-term memory cells (DBLSTM) have outperformed the most accurate classifiers for confidence estimation in automatic speech recognition. At the same time, we have recently shown that speaker adaptation of confidence measures using DBLSTM yields significant improvements over non-adapted confidence measures. Building on these two recent contributions to the state of the art in confidence estimation, this paper presents a comprehensive study of speaker-adapted confidence measures using DBRNN and DBLSTM models. First, we present new empirical evidence of the superiority of recurrent neural network (RNN)-based confidence classifiers, evaluated over a large speech corpus consisting of the English LibriSpeech and the Spanish poliMedia tasks. Second, we show new results on speaker-adapted confidence measures in a multitask framework in which RNN-based confidence classifiers trained with LibriSpeech are adapted to speakers of the TED-LIUM corpus. These experiments confirm that speaker-adapted confidence measures outperform their non-adapted counterparts. Last, we describe a method for unsupervised adaptation of the acoustic DBLSTM model, based on confidence measures, that results in better automatic speech recognition performance.
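As a rough illustration of how confidence measures can drive unsupervised adaptation, the sketch below keeps only the recognized words whose confidence reaches a threshold, so that only those words would serve as adaptation targets. The abstract does not specify the selection rule; the `Word` type, the `select_adaptation_words` function, the example scores, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word with a confidence score (hypothetical structure)."""
    text: str
    confidence: float  # classifier's estimate that the word is correct, in [0, 1]


def select_adaptation_words(hypothesis, threshold=0.9):
    """Keep only the words whose confidence is at least `threshold`.

    The threshold value is an assumption, not a setting taken from the paper.
    """
    return [w for w in hypothesis if w.confidence >= threshold]


# Made-up first-pass recognition output for one utterance.
hypothesis = [
    Word("speaker", 0.97),
    Word("adopted", 0.41),   # likely misrecognition, filtered out below
    Word("confidence", 0.93),
]

selected = select_adaptation_words(hypothesis)
print([w.text for w in selected])  # ['speaker', 'confidence']
```

A stricter threshold trades adaptation data for cleaner labels; how that trade-off is actually handled in the paper is not stated in the abstract.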
