Discriminative methods for statistical spoken dialogue systems

Dialogue promises a natural and effective method for users to interact with and obtain information from computer systems. Statistical spoken dialogue systems are able to disambiguate in the presence of errors by maintaining probability distributions over what they believe to be the state of a dialogue. Traditionally, however, these distributions have been derived using generative models, which do not directly optimise for the criterion of interest and cannot easily exploit arbitrary, potentially useful information. This thesis shows how discriminative methods can overcome these problems in Spoken Language Understanding (SLU) and Dialogue State Tracking (DST). A robust method for SLU is proposed, based on features extracted from the full posterior distribution over automatic speech recognition (ASR) hypotheses, encoded in the form of word confusion networks. This method uses discriminative classifiers trained on unaligned input/output pairs. Performance is evaluated both off-line on a corpus and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance in terms of both accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the system’s output. For DST, a new word-based tracking method is presented that maps directly from the speech recognition results to the dialogue state without using an explicit semantic decoder. The method is based on a recurrent neural network structure that is capable of generalising to unseen dialogue state hypotheses and requires very little feature engineering. The method is evaluated in the second and third Dialog State Tracking Challenges, as well as in a live user trial. The results demonstrate consistently high performance across all of the off-line metrics and a substantial increase in the quality of the dialogues in the live trial.
The proposed method is shown to extend readily to expanding dialogue domains by exploiting robust features and a new method for on-line unsupervised adaptation. It is also shown how the neural network structure can be adapted to output structured joint distributions, which improves on estimating the dialogue state as a product of marginal distributions.
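
The contrast between a joint distribution and a product of marginals can be illustrated with a toy calculation. The sketch below is not the thesis's model: the two slots (`food`, `area`), their values, and all probabilities are invented purely for illustration. It shows that when slot values are correlated, multiplying per-slot marginals spreads probability mass over combinations that the joint belief considers unlikely.

```python
# Toy illustration only: slot names, values, and probabilities are hypothetical.
foods = ["chinese", "indian"]
areas = ["north", "south"]

# A joint belief over (food, area) in which the two slots are strongly correlated.
joint = {
    ("chinese", "north"): 0.45,
    ("chinese", "south"): 0.05,
    ("indian", "north"): 0.05,
    ("indian", "south"): 0.45,
}

# Per-slot marginals, as a tracker with independent slot outputs would report.
food_marginal = {f: sum(joint[(f, a)] for a in areas) for f in foods}
area_marginal = {a: sum(joint[(f, a)] for f in foods) for a in areas}

# Product-of-marginals approximation of the joint belief.
factored = {
    (f, a): food_marginal[f] * area_marginal[a] for f in foods for a in areas
}

# Compare the two beliefs for every slot-value combination.
for pair in joint:
    print(pair, "joint:", joint[pair], "factored:", round(factored[pair], 3))
```

With these made-up numbers each marginal is uniform, so the factored belief assigns the same probability to all four combinations and loses the correlation that the joint belief captures.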
