Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition

Contextual knowledge is important for real-world automatic speech recognition (ASR) applications. In this paper, a novel tree-constrained pointer generator (TCPGen) component is proposed that incorporates such knowledge, in the form of a list of biasing words, into both attention-based encoder-decoder and transducer end-to-end ASR models in a neural-symbolic way. TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut between the tree and the final ASR output distribution to facilitate recognising biasing words during decoding. Systems were trained and evaluated on the Librispeech corpus, where biasing words were extracted at the scale of an utterance, a chapter, or a book to simulate different application scenarios. Experimental results showed that TCPGen consistently improved word error rates (WERs) compared to the baselines and, in particular, achieved significant WER reductions on the biasing words. TCPGen is highly efficient: it can handle 5,000 biasing words and distractors while adding only a small overhead to memory use and computation cost.
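As a rough illustration of the symbolic side of this idea, the sketch below (Python, using a toy character-chunk tokeniser rather than the subword units used in the paper; not the authors' implementation) builds a prefix tree over a biasing list and, given the subword prefix decoded so far, returns the set of valid next subwords. This is the kind of tree-constrained candidate set that a pointer-generator-style shortcut could then assign probability mass to at each decoding step.

```python
# Minimal sketch of a subword prefix tree for a biasing word list, assuming
# words are pre-split into subword units (here by a toy rule, purely
# illustrative and not the tokenisation used in the paper).

class TrieNode:
    def __init__(self):
        self.children = {}       # subword piece -> TrieNode
        self.is_word_end = False

def build_prefix_tree(biasing_words, tokenise):
    """Insert each biasing word's subword sequence into a prefix tree."""
    root = TrieNode()
    for word in biasing_words:
        node = root
        for piece in tokenise(word):
            node = node.children.setdefault(piece, TrieNode())
        node.is_word_end = True
    return root

def valid_next_pieces(root, decoded_pieces):
    """Return the subwords that extend the currently matched prefix.

    If the decoded prefix leaves the tree, fall back to the word-initial
    branches so a new biasing word can start being matched.
    """
    node = root
    for piece in decoded_pieces:
        if piece not in node.children:
            return set(root.children)
        node = node.children[piece]
    return set(node.children)

# Toy tokeniser: split each word into 3-character chunks.
toy_tokenise = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]

tree = build_prefix_tree(["turin", "turner"], toy_tokenise)
print(valid_next_pieces(tree, ["tur"]))   # {'in', 'ner'}
```

In TCPGen, a set of valid tree branches like this is what the neural component scores at each decoding step, and the resulting tree-constrained distribution is interpolated with the standard model output distribution via the learned shortcut described in the abstract.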
