Integrating planning for task-completion dialogue policy learning

Training a task-completion dialogue agent with real users via reinforcement learning (RL) can be prohibitively expensive, because it requires many interactions with users. One alternative is to resort to a user simulator, but the discrepancy between simulated and real users makes the learned policy unreliable in practice. This paper addresses these challenges by integrating planning into dialogue policy learning based on the Dyna-Q framework, providing a more sample-efficient approach to learning dialogue policies. The proposed agent consists of a planner, trained online with limited real user experience, that can generate large amounts of simulated experience to supplement the limited real user experience, and a policy model trained on this hybrid experience. The effectiveness of our approach is validated on a movie-booking task in both a simulation setting and a human-in-the-loop setting.
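To make the Dyna-Q integration concrete, below is a minimal sketch of the three interleaved stages the abstract describes: direct RL on real user experience, world-model (planner) learning, and planning on simulated experience. The `agent`, `world_model`, and `env` interfaces and method names (`epsilon_greedy`, `update_q`, `fit`, `predict`, `sample_initial_state`) are hypothetical placeholders for illustration, not the paper's actual API; `K` stands for the number of planning rounds per real episode.

```python
import random

def train_dyna_q_dialogue(agent, world_model, env, num_episodes, K, batch_size=32):
    """Hypothetical sketch of a Dyna-Q-style dialogue policy training loop.

    agent, world_model, and env are assumed interfaces, not the paper's code.
    """
    real_buffer, sim_buffer = [], []
    for _ in range(num_episodes):
        # (1) Direct RL: interact with the real user (or environment).
        state, done = env.reset(), False
        while not done:
            action = agent.epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            real_buffer.append((state, action, reward, next_state, done))
            state = next_state
        agent.update_q(random.sample(real_buffer, min(batch_size, len(real_buffer))))

        # (2) World-model learning: refine the planner on real transitions,
        #     so simulated experience tracks real user behavior.
        world_model.fit(real_buffer)

        # (3) Planning: roll out K simulated episodes from the world model
        #     and train the policy on this cheap, simulated experience.
        for _ in range(K):
            state, done = world_model.sample_initial_state(), False
            while not done:
                action = agent.epsilon_greedy(state)
                next_state, reward, done = world_model.predict(state, action)
                sim_buffer.append((state, action, reward, next_state, done))
                state = next_state
        agent.update_q(random.sample(sim_buffer, min(batch_size, len(sim_buffer))))
```

Because the policy is updated on both real and simulated transitions, a larger `K` trades more planner usage for fewer costly real-user interactions, which is the source of the sample-efficiency gain the paper claims.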
