Improved Deep Duel Model for Rescoring N-Best Speech Recognition List Using Backward LSTMLM and Ensemble Encoders

We previously proposed a neural network (NN) model called a deep duel model (DDM) for rescoring N-best speech recognition hypothesis lists. A DDM is composed of a long short-term memory (LSTM)-based encoder followed by a binary classifier built from fully connected linear layers. Given the feature vector sequences of two hypotheses in an N-best list, the DDM encodes the features and, based on the output binary-class probabilities, selects the hypothesis with the lower word error rate (WER). By repeating this one-on-one hypothesis comparison (duel) for each hypothesis pair in the N-best list, we can find the oracle (lowest-WER) hypothesis as the survivor of the duels. We showed that the DDM can exploit the score provided by a forward LSTM-based recurrent NN language model (LSTMLM) as an additional feature to select hypotheses accurately. In this study, we further improve the selection performance by introducing two modifications: adding the score provided by a backward LSTMLM, which uses succeeding words to predict the current word, and employing ensemble encoders, which have a high feature encoding capability. By combining these two modifications, our DDM achieves a relative WER reduction of over 10% compared with a strong baseline obtained using both the forward and backward LSTMLMs.
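
The duel-based selection described above can be illustrated with a minimal sketch. It assumes a sequential knockout order (the current survivor duels the next hypothesis in the list), which is one possible way to arrange the pairwise comparisons and is not stated in the abstract. The function name `select_by_duels` and the callable `first_is_better` are hypothetical; the callable stands in for the trained DDM and is expected to return the probability that the first hypothesis has a WER no higher than the second.

```python
from typing import Callable, Sequence, TypeVar

H = TypeVar("H")  # a hypothesis; in the DDM this would be a feature vector sequence


def select_by_duels(
    nbest: Sequence[H],
    first_is_better: Callable[[H, H], float],
) -> H:
    """Return the survivor of sequential one-on-one duels over an N-best list.

    `first_is_better(h1, h2)` is a hypothetical stand-in for the DDM: it
    returns the probability that h1 has a WER no higher than that of h2.
    """
    if not nbest:
        raise ValueError("N-best list is empty")
    survivor = nbest[0]
    for challenger in nbest[1:]:
        # The challenger replaces the survivor when the classifier favors it.
        if first_is_better(survivor, challenger) < 0.5:
            survivor = challenger
    return survivor


# Toy data: (text, WER) pairs. The toy duel function reads the known WER
# directly, which a real model cannot do; it only exercises the control flow.
toy_nbest = [("hyp a", 0.30), ("hyp b", 0.10), ("hyp c", 0.20)]


def toy_duel(h1, h2) -> float:
    return 1.0 if h1[1] <= h2[1] else 0.0


best = select_by_duels(toy_nbest, toy_duel)
```

With a perfect duel function, as in the toy case, the survivor is the lowest-WER hypothesis after N-1 comparisons. With a learned classifier the outcome can depend on the order of the duels, so the ordering is a design choice in its own right.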
