Efficient Exploration for Dialog Policy Learning with Deep BBQ Networks & Replay Buffer Spiking

When rewards are sparse and action spaces large, Q-learning with $\epsilon$-greedy exploration can be inefficient. This poses problems for otherwise promising applications such as task-oriented dialogue systems, where the primary reward signal, indicating successful completion of a task, requires a complex sequence of appropriate actions. Under these circumstances, a randomly exploring agent may never stumble upon a successful outcome in a reasonable amount of time. We present two techniques that significantly improve the efficiency of exploration for deep Q-learning agents in dialogue systems. First, we introduce an exploration technique based on Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop neural network; this yields marked improvements over common approaches such as $\epsilon$-greedy and Boltzmann exploration. Second, we show that spiking the replay buffer with experiences from a small number of successful episodes, which are easy to harvest for dialogue tasks, can make Q-learning feasible where it would otherwise fail.
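To make the two ideas concrete, the sketch below shows (in Python/PyTorch) how Thompson-sampling exploration with a Bayes-by-Backprop Q-network and replay buffer spiking might look in outline. It is a minimal illustration under stated assumptions, not the paper's implementation: the class and function names (`BayesLinear`, `BBQNetwork`, `spike_replay_buffer`), network sizes, and prior/posterior settings are all illustrative assumptions.

```python
# Minimal sketch (assumed names and hyperparameters, not the authors' code).
# (1) Thompson sampling with a Bayes-by-Backprop Q-network: draw one Monte
#     Carlo sample of the weights from the variational posterior and act
#     greedily with respect to the sampled Q-function.
# (2) Replay buffer spiking: pre-fill the replay buffer with transitions from
#     a handful of successful (e.g., rule-based) dialogues before Q-learning.

import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class BayesLinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over weights and biases."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(out_dim, in_dim))
        self.w_rho = nn.Parameter(-3.0 * torch.ones(out_dim, in_dim))  # sigma = softplus(rho)
        self.b_mu = nn.Parameter(torch.zeros(out_dim))
        self.b_rho = nn.Parameter(-3.0 * torch.ones(out_dim))

    def forward(self, x, sample=True):
        if sample:  # one Monte Carlo draw from the posterior (the Thompson sample)
            w = self.w_mu + F.softplus(self.w_rho) * torch.randn_like(self.w_mu)
            b = self.b_mu + F.softplus(self.b_rho) * torch.randn_like(self.b_mu)
        else:       # posterior mean, e.g. for evaluation
            w, b = self.w_mu, self.b_mu
        return F.linear(x, w, b)


class BBQNetwork(nn.Module):
    """Q-network whose weights are sampled once per action selection."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.l1 = BayesLinear(state_dim, hidden)
        self.l2 = BayesLinear(hidden, n_actions)

    def forward(self, state, sample=True):
        h = torch.relu(self.l1(state, sample))
        return self.l2(h, sample)

    def act(self, state):
        # Thompson sampling: greedy action under a single posterior sample.
        with torch.no_grad():
            q = self.forward(state.unsqueeze(0), sample=True)
        return int(q.argmax(dim=1).item())


def spike_replay_buffer(buffer, successful_episodes):
    """Seed the buffer with (s, a, r, s', done) tuples from successful dialogues."""
    for episode in successful_episodes:
        for transition in episode:
            buffer.append(transition)


if __name__ == "__main__":
    state_dim, n_actions = 30, 10
    q_net = BBQNetwork(state_dim, n_actions)

    replay_buffer = deque(maxlen=10_000)
    # A toy stand-in for a few dialogues collected with a hand-crafted policy.
    demo_episodes = [[(torch.randn(state_dim), random.randrange(n_actions),
                       1.0, torch.randn(state_dim), True)]]
    spike_replay_buffer(replay_buffer, demo_episodes)

    print("sampled action:", q_net.act(torch.randn(state_dim)))
```

In this sketch the exploration bonus comes entirely from posterior uncertainty: early in training the sampled Q-functions disagree, producing diverse actions, and as the posterior concentrates the policy becomes greedy, which is the behavior the Thompson-sampling approach relies on.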
