Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

End-to-end (E2E) spoken language understanding (SLU) systems infer the semantics of a spoken utterance directly from the audio signal. Training an E2E system remains a challenge, however, largely due to the scarcity of paired audio-semantics data. In this paper, we treat an E2E system as a multi-modal model, with audio and text functioning as its two modalities, and use a cross-modal latent space (CMLS) architecture in which a shared latent space is learned between the "acoustic" and "text" embeddings. We propose different multi-modal losses that explicitly guide the acoustic embeddings closer to the text embeddings obtained from a semantically powerful pre-trained BERT model. We train the CMLS model on two publicly available E2E datasets across different cross-modal losses and show that our proposed triplet loss function achieves the best performance. It yields relative improvements of 1.4% and 4%, respectively, over an E2E model without a cross-modal space, and relative improvements of 0.7% and 1% over a previously published CMLS model that uses an $L_2$ loss. The gains are larger for the smaller, more complex E2E dataset, demonstrating the efficacy of an effective cross-modal loss function, especially when limited E2E training data is available.
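To make the triplet objective concrete, below is a minimal PyTorch sketch of one plausible formulation: each acoustic embedding (anchor) is pulled toward its paired BERT text embedding (positive) and pushed away from the text embedding of a mismatched utterance (negative). The function name, margin value, cosine-distance choice, and rolled-batch negative-sampling scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def triplet_cmls_loss(acoustic_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      margin: float = 0.5) -> torch.Tensor:
    """Hypothetical triplet loss over a shared cross-modal latent space.

    acoustic_emb: (batch, dim) acoustic embeddings (anchors).
    text_emb:     (batch, dim) paired BERT text embeddings (positives).
    """
    # Cosine distance between each anchor and its paired text embedding.
    pos_dist = 1.0 - F.cosine_similarity(acoustic_emb, text_emb)

    # Simple in-batch negatives: roll the batch by one so every anchor
    # is compared against a mismatched text embedding.
    neg_text = torch.roll(text_emb, shifts=1, dims=0)
    neg_dist = 1.0 - F.cosine_similarity(acoustic_emb, neg_text)

    # Hinge: penalize anchors whose positive is not closer than the
    # negative by at least the margin.
    return F.relu(pos_dist - neg_dist + margin).mean()
```

In practice, harder negatives (e.g., the closest mismatched text embedding in the batch rather than the rolled one shown here) typically tighten the shared space further, at the cost of extra pairwise-distance computation.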
