A Hybrid Approach to Scalable and Robust Spoken Language Understanding in Enterprise Virtual Agents

Spoken language understanding (SLU) extracts the intended meaning from a user utterance and is a critical component of conversational virtual agents. In enterprise virtual agents (EVAs), language understanding is particularly challenging. First, the users are infrequent callers who are unfamiliar with the expectations of a pre-designed conversation flow. Second, the users are paying customers of an enterprise who demand a reliable, consistent, and efficient experience when resolving their issues. In this work, we describe a general and robust framework for intent and entity extraction that uses a hybrid of statistical and rule-based approaches. Our framework includes confidence modeling that incorporates information from all components in the SLU pipeline, a critical addition for ensuring accuracy in EVAs. Our focus is on creating accurate and scalable SLU that can be deployed rapidly for a large class of EVA applications with little need for human intervention.
