Top-Down Attention in End-to-End Spoken Language Understanding

Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascading combination of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to suboptimal overall performance. More recently, End-to-End SLU (E2E SLU) was proposed to perform SLU directly from speech through a joint optimization of the modules, addressing some of the shortcomings of the traditional cascade. A key challenge of this approach is how best to integrate feature learning across the ASR and NLU sub-tasks to maximize overall performance. While ASR models generally focus on low-level acoustic features and NLU models rely on higher-level contextual information, ASR models can nonetheless leverage top-down syntactic and semantic information to improve recognition. Based on this insight, we propose Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features, leading to better optimization of both tasks. We validate our model on the public Fluent Speech Commands dataset and a large custom dataset. Results show that TD-SLU outperforms the selected baselines on both ASR and NLU quality metrics, and suggest that the added high-level syntactic and semantic information improves the model’s performance.
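
To make the fusion mechanism concrete, below is a minimal PyTorch sketch of the kind of top-down attention gate the abstract describes: low-level ASR frame features act as queries in a cross-attention over high-level NLU features, and a learned sigmoid gate controls how much of the attended top-down context is mixed back into the bottom-up ASR stream. All module names, dimensions, and the exact gating form are illustrative assumptions; the abstract does not specify the paper's implementation details.

```python
import torch
import torch.nn as nn

class TopDownAttentionGate(nn.Module):
    """Hedged sketch of top-down attention with an attention gate.

    ASR features (queries) attend to NLU features (keys/values),
    and a per-position, per-channel sigmoid gate decides how much
    of the resulting top-down context to inject. Names, dimensions,
    and the gating form are assumptions, not the paper's exact design.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Queries come from the ASR stream; keys/values from the NLU stream.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate computed from the concatenated bottom-up and top-down features.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, asr_feats: torch.Tensor, nlu_feats: torch.Tensor) -> torch.Tensor:
        # asr_feats: (batch, T_audio, d_model)  low-level acoustic frames
        # nlu_feats: (batch, T_text, d_model)   high-level semantic states
        context, _ = self.cross_attn(asr_feats, nlu_feats, nlu_feats)
        g = self.gate(torch.cat([asr_feats, context], dim=-1))
        # Gated residual fusion keeps the ASR stream's shape intact.
        return self.norm(asr_feats + g * context)


# Toy usage: fuse 100 audio frames with 12 NLU token states.
fusion = TopDownAttentionGate()
asr = torch.randn(2, 100, 256)
nlu = torch.randn(2, 12, 256)
fused = fusion(asr, nlu)  # (2, 100, 256), same shape as the ASR input
```

The gated residual form means the module can fall back to the purely bottom-up ASR features (gate near zero) when the high-level context is uninformative, which is one plausible reason a gate helps over unconditioned cross-attention.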
