Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning

Humans learn to play video games significantly faster than state-of-the-art reinforcement learning (RL) algorithms. People appear to build simple models that are easy to learn and that support planning and strategic exploration. Inspired by this, we investigate two issues in leveraging model-based RL for sample efficiency. First, we investigate how to perform strategic exploration when exact planning is not feasible, and empirically show that optimistic Monte Carlo Tree Search outperforms posterior sampling methods. Second, we show how to learn simple deterministic models with an object representation to support fast learning. We illustrate the benefit of these ideas by introducing a novel algorithm, Strategic Object Oriented Reinforcement Learning (SOORL), that outperforms state-of-the-art algorithms in the game of Pitfall! in fewer than 50 episodes.
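
To make the planning idea concrete, below is a minimal sketch of optimistic Monte Carlo Tree Search in the spirit described above: UCT-style search through a learned deterministic model, with an additional optimism bonus that decays with how often the agent has actually tried a (state, action) pair in the real environment. The class name, the `model.step(state, action) -> (next_state, reward)` interface, and all constants are illustrative assumptions, not the paper's SOORL implementation.

```python
# Hypothetical sketch of optimistic MCTS over a learned deterministic model.
# Assumes a discrete action set and a model with step(state, action) -> (next_state, reward).
import math
from collections import defaultdict


class OptimisticMCTS:
    def __init__(self, model, actions, horizon=20, c_uct=1.0, bonus=1.0, gamma=0.99):
        self.model = model            # learned deterministic model
        self.actions = actions        # list of discrete actions
        self.horizon = horizon        # planning depth
        self.c_uct = c_uct            # UCT exploration constant inside the search tree
        self.bonus = bonus            # optimism bonus scale for rarely-tried real (s, a) pairs
        self.gamma = gamma
        self.N = defaultdict(int)     # in-tree visit counts N(s, a)
        self.Q = defaultdict(float)   # running mean return for (s, a)

    def plan(self, root, n_simulations=100, real_counts=None):
        """Run simulations from `root` and return the greedy action."""
        real_counts = real_counts if real_counts is not None else defaultdict(int)
        for _ in range(n_simulations):
            self._simulate(root, depth=0, real_counts=real_counts)
        return max(self.actions, key=lambda a: self.Q[(root, a)])

    def _simulate(self, state, depth, real_counts):
        if depth >= self.horizon:
            return 0.0
        # Action selection: UCT term plus a count-based optimism bonus that
        # shrinks as (state, action) is tried more often in the environment.
        total = sum(self.N[(state, a)] for a in self.actions) + 1

        def score(a):
            n = self.N[(state, a)]
            uct = self.c_uct * math.sqrt(math.log(total) / (n + 1))
            optimism = self.bonus / math.sqrt(real_counts[(state, a)] + 1)
            return self.Q[(state, a)] + uct + optimism

        action = max(self.actions, key=score)
        next_state, reward = self.model.step(state, action)
        ret = reward + self.gamma * self._simulate(next_state, depth + 1, real_counts)

        # Incremental mean update of Q(s, a).
        self.N[(state, action)] += 1
        self.Q[(state, action)] += (ret - self.Q[(state, action)]) / self.N[(state, action)]
        return ret


# Toy usage on a deterministic chain where moving "right" eventually yields reward.
class ChainModel:
    def step(self, state, action):
        nxt = state + 1 if action == "right" else max(state - 1, 0)
        return nxt, (1.0 if nxt == 5 else 0.0)


if __name__ == "__main__":
    planner = OptimisticMCTS(ChainModel(), actions=["left", "right"], horizon=10)
    print(planner.plan(root=0, n_simulations=200))  # expected to prefer "right"
```

The optimism bonus here plays the role of directed exploration during planning, standing in for the approximately optimistic planning discussed in the abstract; posterior-sampling baselines would instead plan greedily in a model drawn from the posterior.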
