Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding

Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU). In this paper, we propose to use recurrent neural networks (RNNs) for this task, and present several novel architectures designed to efficiently model past and future temporal dependencies. Specifically, we implemented and compared several important RNN architectures, including Elman, Jordan, and hybrid variants. To facilitate reproducibility, we implemented these networks with the publicly available Theano neural network toolkit and completed experiments on the well-known airline travel information system (ATIS) benchmark. In addition, we compared the approaches on two custom SLU data sets from the entertainment and movies domains. Our results show that the RNN-based models outperform the conditional random field (CRF) baseline by 2% in absolute error reduction on the ATIS benchmark. We improve the state-of-the-art by 0.5% in the Entertainment domain, and 6.7% for the movies domain.

[1]  Van Nostrand,et al.  Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm , 1967 .

[2]  Jeffrey L. Elman,et al.  Finding Structure in Time , 1990, Cogn. Sci..

[3]  Chin-Hui Lee,et al.  A speech understanding system based on statistical representation of semantics , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Richard M. Schwartz,et al.  Hidden Understanding Models of Natural Language , 1994, ACL.

[5]  Renato De Mori,et al.  The Application of Semantic Classification Trees to Natural Language Understanding , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[7]  Michael I. Jordan Serial Order: A Parallel Distributed Processing Approach , 1997 .

[8]  Victor Zue,et al.  Conversational interfaces: advances and challenges , 1997, Proceedings of the IEEE.

[9]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[10]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[11]  Hermann Ney,et al.  Natural language understanding using statistical machine translation , 2001, INTERSPEECH.

[12]  Yuji Matsumoto,et al.  Chunking with Support Vector Machines , 2001, NAACL.

[13]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[14]  Steve Young,et al.  A data-driven spoken language understanding system , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[15]  Gökhan Tür,et al.  Optimizing SVMs for complex call classification , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[16]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[17]  Ye-Yi Wang,et al.  Spoken language understanding , 2005, IEEE Signal Processing Magazine.

[18]  Jean-Luc Gauvain,et al.  Training Neural Network Language Models on Very Large Corpora , 2005, HLT.

[19]  Alex Acero,et al.  Spoken Language Understanding "” An Introduction to the Statistical Framework , 2005 .

[20]  Alex Acero,et al.  Discriminative models for spoken language understanding , 2006, INTERSPEECH.

[21]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[22]  Gary Geunbae Lee,et al.  Structures for Spoken Language Understanding: A Two-Step Approach , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[23]  Giuseppe Riccardi,et al.  Generative and discriminative algorithms for spoken language understanding , 2007, INTERSPEECH.

[24]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[25]  Dong Yu,et al.  An Integrative and Discriminative Technique for Spoken Utterance Classification , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[26]  Jian Peng,et al.  Conditional Neural Fields , 2009, NIPS.

[27]  Dong Yu,et al.  Sequential Labeling Using Deep-Structured Conditional Random Fields , 2010, IEEE Journal of Selected Topics in Signal Processing.

[28]  Gökhan Tür,et al.  What is left to be understood in ATIS? , 2010, 2010 IEEE Spoken Language Technology Workshop.

[29]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[30]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Gokhan Tur,et al.  Spoken Language Understanding: Systems for Extracting Semantic Information from Speech , 2011 .

[32]  Gökhan Tür,et al.  Exploiting query click logs for utterance domain detection in spoken language understanding , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[33]  Gökhan Tür,et al.  Sentence simplification for spoken language understanding , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Bhuvana Ramabhadran,et al.  Deep belief nets for natural language call-routing , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Li Deng,et al.  Speech Recognition, Machine Translation, and Speech Translation—A Unified Discriminative Learning Paradigm [Lecture Notes] , 2011, IEEE Signal Processing Magazine.

[36]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[37]  Alex Acero,et al.  Semantic Frame‐Based Spoken Language Understanding , 2011 .

[38]  Gökhan Tür,et al.  Use of kernel deep convex networks and end-to-end learning for spoken language understanding , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[39]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[40]  Andrew Y. Ng,et al.  Semantic Compositionality through Recursive Matrix-Vector Spaces , 2012, EMNLP.

[41]  Tara N. Sainath,et al.  FUNDAMENTAL TECHNOLOGIES IN MODERN SPEECH RECOGNITION Digital Object Identifier 10.1109/MSP.2012.2205597 , 2012 .

[42]  Yoshua Bengio,et al.  Unsupervised and Transfer Learning Challenge: a Deep Learning Approach , 2011, ICML Unsupervised and Transfer Learning.

[43]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[44]  James R. Glass,et al.  A Conversational Movie Search System Based on Conditional Random Fields , 2012, INTERSPEECH.

[45]  Dilek Z. Hakkani-Tür,et al.  Exploiting the Semantic Web for unsupervised spoken language understanding , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[46]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[47]  Gökhan Tür,et al.  Towards deeper understanding: Deep convex networks for semantic utterance classification , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[48]  Matthew Henderson,et al.  Discriminative spoken language understanding using word confusion networks , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[49]  Tara N. Sainath,et al.  Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization , 2012, INTERSPEECH.

[50]  Geoffrey Zweig,et al.  Context dependent recurrent neural network language model , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[51]  Dong Yu,et al.  Error back propagation for sequence training of Context-Dependent Deep NetworkS for conversational speech transcription , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[52]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[53]  Geoffrey Zweig,et al.  Recurrent neural networks for language understanding , 2013, INTERSPEECH.

[54]  Ruhi Sarikaya,et al.  Convolutional neural network based triangular CRF for joint intent detection and slot filling , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[55]  Tiejun Zhao,et al.  Compound Embedding Features for Semi-supervised Learning , 2013, NAACL.

[56]  Ruhi Sarikaya,et al.  Deep belief network based semantic taggers for spoken language understanding , 2013, INTERSPEECH.

[57]  Li Deng,et al.  Speech-Centric Information Processing: An Optimization-Oriented Approach , 2013, Proceedings of the IEEE.

[58]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[59]  Geoffrey Zweig,et al.  Linguistic Regularities in Continuous Space Word Representations , 2013, NAACL.

[60]  Yoshua Bengio,et al.  Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding , 2013, INTERSPEECH.

[61]  Larry P. Heck,et al.  Learning deep structured semantic models for web search using clickthrough data , 2013, CIKM.

[62]  Dong Yu,et al.  Automatic Speech Recognition: A Deep Learning Approach , 2014 .

[63]  Gökhan Tür,et al.  Zero-Shot Learning and Clustering for Semantic Utterance Classification , 2013, ICLR.

[64]  Jianfeng Gao,et al.  Modeling Interestingness with Deep Neural Networks , 2014, EMNLP.

[65]  Yelong Shen,et al.  A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval , 2014, CIKM.

[66]  Geoffrey Zweig,et al.  Spoken language understanding using long short-term memory neural networks , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[67]  Christopher Meek,et al.  Semantic Parsing for Single-Relation Question Answering , 2014, ACL.

[68]  Larry P. Heck,et al.  Deep learning of knowledge graph embeddings for semantic parsing of Twitter dialogs , 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[69]  Geoffrey Zweig,et al.  Recurrent conditional random field for language understanding , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[70]  Dong Yu,et al.  Deep Learning: Methods and Applications , 2014, Found. Trends Signal Process..