Efficient Exploration for Dialogue Policy Learning with BBQ Networks & Replay Buffer Spiking

When rewards are sparse and action spaces are large, Q-learning with ε-greedy exploration can be inefficient. This poses problems for otherwise promising applications such as task-oriented dialogue systems, where the primary reward signal, indicating successful completion of a task, requires a complex sequence of appropriate actions. Under these circumstances, a randomly exploring agent might never stumble upon a successful outcome in a reasonable amount of time. We present two techniques that significantly improve the efficiency of exploration for deep Q-learning agents in dialogue systems. First, we introduce an exploration technique based on Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop neural network, and demonstrate marked improvement over common approaches such as ε-greedy and Boltzmann exploration. Second, we show that spiking the replay buffer with experiences from a small number of successful episodes, which are easy to harvest for dialogue tasks, can make Q-learning feasible when it might otherwise fail.
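The first technique can be illustrated with a minimal sketch (not the authors' code; the class and parameter names below are hypothetical): a Q-network whose weights carry a factorized Gaussian posterior, in the spirit of Bayes-by-Backprop, with Thompson sampling performed by drawing one Monte Carlo weight sample per decision and acting greedily with respect to the sampled Q-values.

```python
# Minimal sketch (assumed, not the authors' implementation) of Thompson-sampling
# action selection with a Bayesian Q-network whose weights have a factorized
# Gaussian posterior (a mean and log-variance per weight), as in Bayes-by-Backprop.
import numpy as np


class BBQNetwork:
    """Single-hidden-layer Q-network with a Gaussian posterior over its weights."""

    def __init__(self, state_dim, num_actions, hidden=64, seed=0):
        self.rng = np.random.default_rng(seed)
        # Variational posterior parameters for each weight matrix.
        self.mu = {
            "W1": 0.1 * self.rng.standard_normal((state_dim, hidden)),
            "W2": 0.1 * self.rng.standard_normal((hidden, num_actions)),
        }
        self.logvar = {name: np.full(w.shape, -6.0) for name, w in self.mu.items()}

    def sample_weights(self):
        # One Monte Carlo draw of the weights from the posterior.
        return {
            name: self.mu[name]
            + np.exp(0.5 * self.logvar[name]) * self.rng.standard_normal(self.mu[name].shape)
            for name in self.mu
        }

    def q_values(self, state, weights):
        hidden = np.maximum(0.0, state @ weights["W1"])  # ReLU hidden layer
        return hidden @ weights["W2"]

    def act_thompson(self, state):
        # Thompson sampling: sample one Q-function from the posterior, then act
        # greedily with respect to it; weight uncertainty drives exploration.
        return int(np.argmax(self.q_values(state, self.sample_weights())))


# Usage with a made-up dialogue-state feature vector.
net = BBQNetwork(state_dim=30, num_actions=10)
action = net.act_thompson(np.random.default_rng(1).standard_normal(30))
```

Replay buffer spiking, under the same caveat, amounts to pre-filling the replay buffer with transitions from a few successful dialogues (e.g., collected with a rule-based agent) before learning begins, so that early minibatches already contain the sparse success reward.

```python
# Minimal sketch (again assumed, names hypothetical) of replay buffer spiking:
# seed the replay buffer with transitions harvested from a few successful
# dialogues before Q-learning starts, so early minibatches contain reward signal.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")


def spike_replay_buffer(successful_episodes, capacity=10_000):
    """successful_episodes: a list of episodes, each a list of Transition tuples."""
    buffer = deque(maxlen=capacity)
    for episode in successful_episodes:
        buffer.extend(episode)  # inject the demonstration transitions up front
    return buffer


def sample_minibatch(buffer, batch_size=32):
    # Uniform sampling; the spiked transitions are gradually diluted as the
    # agent's own experience fills the (bounded) buffer.
    return random.sample(list(buffer), min(batch_size, len(buffer)))
```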
