Effective Sentence Scoring Method Using BERT for Speech Recognition

In automatic speech recognition, language models (LMs) have been used in many ways to improve performance. Some studies have tried to use bidirectional LMs (biLMs) for rescoring the n-best hypothesis list decoded from the acoustic model. Despite their theoretical advantages over conventional unidirectional LMs (uniLMs), previous biLMs have not yielded notable improvements over uniLMs in experiments. This is due to an architectural limitation: the rightward and leftward representations are not fused in these biLMs. Recently, BERT addressed this issue by proposing masked language modeling and achieved state-of-the-art performance on many downstream tasks by fine-tuning the pre-trained model. In this paper, we propose an effective sentence scoring method that adapts BERT to the n-best list rescoring task without any fine-tuning step. The core idea of our modification is to bridge the gap between training and testing environments by considering only masked language modeling within a single sentence. Experimental results on the LibriSpeech corpus show that the proposed scoring method using our biLM outperforms uniLMs for n-best list rescoring, consistently and significantly across all experimental conditions. Additionally, an analysis of where word errors occur in a sentence demonstrates that our biLM is more robust than the uniLM, especially when a recognized sentence is short or a misrecognized word is at the beginning of the sentence. Consequently, we empirically show that left and right representations should be fused in biLMs for scoring a sentence.
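The scoring idea described above can be sketched as a pseudo-log-likelihood: mask each token of a hypothesis in turn, ask the bidirectional LM for the probability of the original token given the rest of the sentence, and sum the log-probabilities. The sketch below is a minimal illustration of that loop, not the paper's implementation; `masked_lm_prob` and the toy probability table stand in for an actual BERT forward pass and are purely hypothetical.

```python
import math

# Toy per-token probabilities standing in for BERT's masked-token
# predictions; illustrative only, not the paper's model.
TOY_PROBS = {"the": 0.6, "cat": 0.3, "cap": 0.05, "sat": 0.4}

def masked_lm_prob(tokens, i):
    """Hypothetical masked-LM query: a real implementation would replace
    tokens[i] with [MASK], run BERT, and return the softmax probability
    of the original token at position i. Here we look it up in a toy table."""
    return TOY_PROBS.get(tokens[i], 0.01)

def pseudo_log_likelihood(tokens):
    # Sentence score = sum over positions of log P(token_i | all other tokens).
    return sum(math.log(masked_lm_prob(tokens, i)) for i in range(len(tokens)))

def rescore_nbest(hypotheses):
    # Re-rank an n-best list by the biLM sentence score (higher is better).
    return sorted(hypotheses, key=pseudo_log_likelihood, reverse=True)

hyps = [["the", "cap", "sat"], ["the", "cat", "sat"]]
best = rescore_nbest(hyps)[0]  # the hypothesis with the higher biLM score
```

Because every position is conditioned on both left and right context, a word that fits poorly anywhere in the sentence, including at the very beginning, lowers the score, which is the intuition behind the robustness result reported in the abstract.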
