Sample-Efficient Deep RL with Generative Adversarial Tree Search

We propose Generative Adversarial Tree Search (GATS), a sample-efficient Deep Reinforcement Learning (DRL) algorithm. While Monte Carlo Tree Search (MCTS) is known to be effective for search and planning in RL, it is often sample-inefficient and therefore expensive to apply in practice. In this work, we train Generative Adversarial Networks (GANs) to model an environment's dynamics and train a predictor to learn the reward function. While typical DRL algorithms estimate the Q-function or optimize directly over a parameterized policy, we exploit the data collected through interaction with the environment to train both a reward predictor and a conditional GAN that simulates state transitions. During planning, we deploy finite-depth MCTS, using the trained generative model for the tree search and the estimated Q-values at the leaves, to find the best policy. We theoretically show that GATS improves the bias-variance trade-off in DRL. On the Atari game Pong, GATS significantly reduces the bias in the Q estimates and reduces the sample complexity of DQN by a factor of 200%.
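To make the planning step concrete, below is a minimal sketch of the finite-depth lookahead described above, assuming a learned dynamics model gdm(state, action), a reward predictor rp(state, action, next_state), and a Q-estimator q(state) that returns per-action values. These names are hypothetical stand-ins for the trained networks in the paper, not its actual implementation.

```python
import numpy as np

def gats_plan(state, gdm, rp, q, actions, depth, gamma=0.99):
    """Depth-limited lookahead: expand the tree with the learned generative
    model, score transitions with the reward predictor, and bootstrap the
    leaves with the estimated Q-values."""
    if depth == 0:
        return float(np.max(q(state))), None
    best_value, best_action = -np.inf, None
    for a in actions:
        next_state = gdm(state, a)            # simulated next state
        r = rp(state, a, next_state)          # predicted reward
        child_value, _ = gats_plan(next_state, gdm, rp, q, actions, depth - 1, gamma)
        value = r + gamma * child_value
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# Toy usage with random stand-ins, for illustration only.
rng = np.random.default_rng(0)
gdm = lambda s, a: s + a
rp = lambda s, a, s2: float(rng.normal())
q = lambda s: rng.normal(size=4)
value, action = gats_plan(0.0, gdm, rp, q, actions=range(4), depth=3)
```

In the toy usage, the depth controls how far the learned model is rolled forward before the Q-estimate takes over at the leaves, which is the lever that trades model bias against value-estimation bias.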
