Improving Spoken Language Understanding with Cross-Modal Contrastive Learning

Spoken language understanding (SLU) is conventionally based on a pipeline architecture that suffers from error propagation. To mitigate this problem, end-to-end (E2E) models have been proposed to map speech input directly to the desired semantic outputs. Meanwhile, other work leverages linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which uses cross-modal contrastive learning to learn better multi-modal representations. In particular, we design a two-stream multi-modal framework and perform a contrastive learning task across the speech and text representations. Moreover, CMCL combines a shared multi-modal classification task with the contrastive learning task to guide the learned representations toward better intent classification. We also investigate the efficacy of cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on the FSC and SmartLights datasets, respectively, outperforming state-of-the-art methods. Accuracy decreases by only 0.32% and 2.8% when training on 10% and 1% of the FSC dataset, respectively, indicating CMCL's advantage in few-shot scenarios.
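The abstract does not spell out the exact form of the objective, so the following is a minimal sketch of one plausible instantiation: a symmetric InfoNCE contrastive loss over paired speech and text embeddings from the two streams, combined with a single classification head shared across both modalities. The function and class names, the temperature value, and the weighting term `lambda_cl` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired speech/text embeddings.

    speech_emb, text_emb: (batch, dim) utterance-level representations
    from the two streams; paired items share the same row index.
    Temperature is a hypothetical hyperparameter, not from the paper.
    """
    # L2-normalize so dot products are cosine similarities.
    speech = F.normalize(speech_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are positives.
    logits = speech @ text.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (speech->text and text->speech).
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)


class SharedIntentClassifier(torch.nn.Module):
    """A single intent-classification head applied to both modalities."""

    def __init__(self, dim, num_intents):
        super().__init__()
        self.head = torch.nn.Linear(dim, num_intents)

    def forward(self, speech_emb, text_emb, labels):
        # The same head classifies both streams, encouraging a shared space.
        loss_speech = F.cross_entropy(self.head(speech_emb), labels)
        loss_text = F.cross_entropy(self.head(text_emb), labels)
        return loss_speech + loss_text


def total_loss(speech_emb, text_emb, labels, classifier, lambda_cl=1.0):
    """Combined objective: shared classification plus contrastive alignment.

    lambda_cl weights the contrastive term (an assumed hyperparameter).
    """
    return (classifier(speech_emb, text_emb, labels)
            + lambda_cl * cross_modal_contrastive_loss(speech_emb, text_emb))
```

Under this reading, the contrastive term pulls each utterance's speech embedding toward its paired text embedding while pushing apart mismatched pairs within the batch, and the shared classifier ensures both aligned representations remain discriminative for intent labels.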
