Improving Spoken Language Understanding with Cross-Modal Contrastive Learning

Spoken language understanding (SLU) is conventionally based on a pipeline architecture that suffers from error propagation. To mitigate this problem, end-to-end (E2E) models have been proposed to map speech input directly to the desired semantic outputs. Meanwhile, other work leverages linguistic information in addition to acoustic information by adopting a multi-modal architecture. In this work, we propose a novel multi-modal SLU method, named CMCL, which uses cross-modal contrastive learning to learn better multi-modal representations. In particular, we design a two-stream multi-modal framework and perform a contrastive learning task across the speech and text representations. Moreover, CMCL combines a shared multi-modal classification task with the contrastive learning task to guide the learned representations toward better intent classification. We also investigate the efficacy of cross-modal contrastive learning during pretraining. CMCL achieves 99.69% and 92.50% accuracy on the FSC and SmartLights datasets, respectively, outperforming state-of-the-art methods. Accuracy decreases by only 0.32% and 2.8% when training on 10% and 1% of the FSC dataset, respectively, indicating CMCL's advantage in few-shot scenarios.
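The abstract does not spell out the exact form of the objective, so the following is a minimal sketch of one plausible instantiation: a symmetric InfoNCE contrastive loss over paired speech and text embeddings from the two streams, combined with a single classification head shared across both modalities. The function and class names, the temperature value, and the weighting term `lambda_cl` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired speech/text embeddings.

    speech_emb, text_emb: (batch, dim) utterance-level representations
    from the two streams; paired items share the same row index.
    Temperature is a hypothetical hyperparameter, not from the paper.
    """
    # L2-normalize so dot products are cosine similarities.
    speech = F.normalize(speech_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are positives.
    logits = speech @ text.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (speech->text and text->speech).
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2t + loss_t2s)


class SharedIntentClassifier(torch.nn.Module):
    """A single intent-classification head applied to both modalities."""

    def __init__(self, dim, num_intents):
        super().__init__()
        self.head = torch.nn.Linear(dim, num_intents)

    def forward(self, speech_emb, text_emb, labels):
        # The same head classifies both streams, encouraging a shared space.
        loss_speech = F.cross_entropy(self.head(speech_emb), labels)
        loss_text = F.cross_entropy(self.head(text_emb), labels)
        return loss_speech + loss_text


def total_loss(speech_emb, text_emb, labels, classifier, lambda_cl=1.0):
    """Combined objective: shared classification plus contrastive alignment.

    lambda_cl weights the contrastive term (an assumed hyperparameter).
    """
    return (classifier(speech_emb, text_emb, labels)
            + lambda_cl * cross_modal_contrastive_loss(speech_emb, text_emb))
```

Under this reading, the contrastive term pulls each utterance's speech embedding toward its paired text embedding while pushing apart mismatched pairs within the batch, and the shared classifier ensures both aligned representations remain discriminative for intent labels.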
