论文信息 - Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50khour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).

[1] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.

[2] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.

[3] Ankur Bapna,et al. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation , 2018, ACL.

[4] Yan Yin,et al. Attention-based Transducer for Online Speech Recognition , 2020, ArXiv.

[5] Alex Graves,et al. Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[6] Tara N. Sainath,et al. Lower Frame Rate Neural Network Acoustic Models , 2016, INTERSPEECH.

[7] Xiaofeng Liu,et al. Rnn-Transducer with Stateless Prediction Network , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Arun Narayanan,et al. From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[9] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[10] Zhong Meng,et al. Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability , 2020, INTERSPEECH.

[11] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[12] David Suendermann-Oeft,et al. Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[13] Srinivas Bangalore,et al. Spoken Language Understanding without Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] Yongqiang Wang,et al. Towards End-to-end Spoken Language Understanding , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] S. Arikawa,et al. Byte Pair Encoding: a Text Compression Scheme That Accelerates Pattern Matching , 1999 .

[16] Tara N. Sainath,et al. Streaming End-to-end Speech Recognition for Mobile Devices , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Andreas Stolcke,et al. DO as I Mean, Not as I Say: Sequence Loss Training for Spoken Language Understanding , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18] Yoshua Bengio,et al. Speech Model Pre-training for End-to-End Spoken Language Understanding , 2019, INTERSPEECH.

[19] Yifan Gong,et al. Improving RNN Transducer Modeling for End-to-End Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[20] A. Rastrow,et al. Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces , 2020, INTERSPEECH.

[21] Raghavendra Bilgi,et al. Improving RNN-T ASR Performance with Date-Time and Location Awareness , 2021, TDS.

[22] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[23] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24] Partha Talukdar,et al. AD3: Attentive Deep Document Dater , 2018, EMNLP.

[25] Tara N. Sainath,et al. Multistate Encoding with End-To-End Speech RNN Transducer Network , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).