Spoken Language Intent Detection using Confusion2Vec

Decoding a speaker's intent is a crucial part of spoken language understanding (SLU). In real-life scenarios, noise or errors in the text transcriptions make the task more challenging. In this paper, we address spoken language intent detection under the noisy conditions imposed by automatic speech recognition (ASR) systems. We propose to employ the confusion2vec word feature representation to compensate for the errors made by ASR and to increase the robustness of the SLU system. Confusion2vec, motivated by human speech production and perception, models acoustic relationships between words in addition to the semantic and syntactic relations of words in human language. We hypothesize that ASR often makes errors involving acoustically similar words, and that confusion2vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. Through experiments on the ATIS benchmark dataset, we demonstrate the robustness of the proposed model, which achieves state-of-the-art results under noisy ASR conditions. Going from clean to noisy transcripts, our system reduces classification error rate (CER) by 20.84% and improves robustness by 37.48% (lower CER degradation) relative to the previous state of the art. Improvements are also demonstrated when training the intent detection models on noisy transcripts.
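
The two reported figures are relative quantities, and a small sketch may make them easier to read. The Python below assumes that CER is the percentage of utterances assigned the wrong intent, that "CER degradation" is the noisy-transcript CER minus the clean-transcript CER, and that both improvements are relative reductions against a baseline. These definitions are one plausible reading of the abstract, not a confirmed description of the paper's evaluation. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def classification_error_rate(predicted, reference):
    """Percentage of utterances whose predicted intent differs from the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one utterance is required")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * errors / len(reference)


def cer_degradation(clean_cer, noisy_cer):
    """Absolute increase in CER when moving from clean to noisy transcripts (assumed definition)."""
    return noisy_cer - clean_cer


def relative_reduction(baseline, proposed):
    """Relative reduction, in percent, of `proposed` with respect to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - proposed) / baseline


if __name__ == "__main__":
    # Hypothetical CER values in percent; these are NOT results from the paper.
    baseline_clean, baseline_noisy = 2.0, 6.0
    proposed_clean, proposed_noisy = 2.0, 4.5

    # Relative CER reduction on noisy transcripts.
    print(relative_reduction(baseline_noisy, proposed_noisy))

    # Relative reduction in CER degradation (the assumed "robustness" measure).
    print(relative_reduction(
        cer_degradation(baseline_clean, baseline_noisy),
        cer_degradation(proposed_clean, proposed_noisy),
    ))
```

Under this reading, a system can improve robustness even when its clean-transcript CER is unchanged, because only the gap between clean and noisy conditions enters the second measure.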
