Top-Down Attention in End-to-End Spoken Language Understanding

Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascading combination of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to suboptimal overall performance. More recently, End-to-End SLU (E2E SLU) was proposed to perform SLU directly from speech through a joint optimization of the modules, addressing some of the shortcomings of the traditional cascade. A key challenge of this approach is how best to integrate feature learning across the ASR and NLU sub-tasks to maximize overall performance. While ASR models generally focus on low-level acoustic features and NLU models rely on higher-level contextual information, ASR models can nonetheless leverage top-down syntactic and semantic information to improve recognition. Based on this insight, we propose Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features, leading to better optimization of both tasks. We validate our model on the public Fluent Speech Commands dataset and a large custom dataset. Results show that TD-SLU outperforms the selected baselines on both ASR and NLU quality metrics, and suggest that the added high-level syntactic and semantic information improves the model’s performance.
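
To make the fusion mechanism concrete, below is a minimal PyTorch sketch of the kind of top-down attention gate the abstract describes: low-level ASR frame features act as queries in a cross-attention over high-level NLU features, and a learned sigmoid gate controls how much of the attended top-down context is mixed back into the bottom-up ASR stream. All module names, dimensions, and the exact gating form are illustrative assumptions; the abstract does not specify the paper's implementation details.

```python
import torch
import torch.nn as nn

class TopDownAttentionGate(nn.Module):
    """Hedged sketch of top-down attention with an attention gate.

    ASR features (queries) attend to NLU features (keys/values),
    and a per-position, per-channel sigmoid gate decides how much
    of the resulting top-down context to inject. Names, dimensions,
    and the gating form are assumptions, not the paper's exact design.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Queries come from the ASR stream; keys/values from the NLU stream.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate computed from the concatenated bottom-up and top-down features.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, asr_feats: torch.Tensor, nlu_feats: torch.Tensor) -> torch.Tensor:
        # asr_feats: (batch, T_audio, d_model)  low-level acoustic frames
        # nlu_feats: (batch, T_text, d_model)   high-level semantic states
        context, _ = self.cross_attn(asr_feats, nlu_feats, nlu_feats)
        g = self.gate(torch.cat([asr_feats, context], dim=-1))
        # Gated residual fusion keeps the ASR stream's shape intact.
        return self.norm(asr_feats + g * context)


# Toy usage: fuse 100 audio frames with 12 NLU token states.
fusion = TopDownAttentionGate()
asr = torch.randn(2, 100, 256)
nlu = torch.randn(2, 12, 256)
fused = fusion(asr, nlu)  # (2, 100, 256), same shape as the ASR input
```

The gated residual form means the module can fall back to the purely bottom-up ASR features (gate near zero) when the high-level context is uninformative, which is one plausible reason a gate helps over unconditioned cross-attention.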
