Strategic Object Oriented Reinforcement Learning

Humans learn to play video games significantly faster than state-of-the-art reinforcement learning (RL) algorithms. Inspired by this, we introduce strategic object oriented reinforcement learning (SOORL), which learns a simple dynamics model through automatic model selection and performs efficient planning with strategic exploration. We compare different exploration strategies in a model-based setting in which exact planning is impossible. Additionally, we evaluate our approach on Pitfall!, arguably the hardest Atari game, and achieve significantly improved exploration and performance over prior methods. A minimal illustrative sketch of planning with an exploration bonus is given below.
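The abstract does not spell out the planner, so the following is only a minimal sketch of what "efficient planning with strategic exploration" can look like: a depth-limited Monte Carlo rollout planner over a learned dynamics model, with a count-based optimism bonus added to rewards. The `ToyChainModel`, the `step(state, action)` interface, and all hyper-parameters are illustrative assumptions, not details taken from the paper.

```python
import math
import random
from collections import defaultdict

BETA = 1.0      # exploration bonus scale (assumed hyper-parameter)
DEPTH = 10      # planning horizon for each rollout
ROLLOUTS = 50   # Monte Carlo rollouts per candidate action
GAMMA = 0.99    # discount factor

visit_counts = defaultdict(int)  # pseudo-counts over (state, action) pairs


def bonus(state, action):
    """Optimistic reward bonus that decays with the visit count."""
    return BETA / math.sqrt(1 + visit_counts[(state, action)])


def rollout_value(model, state, actions, depth):
    """Discounted return of one random rollout, rewards augmented with bonuses."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        action = random.choice(actions)
        next_state, reward, done = model.step(state, action)  # assumed model interface
        total += discount * (reward + bonus(state, action))
        discount *= GAMMA
        if done:
            break
        state = next_state
    return total


def plan(model, state, actions):
    """Choose the action whose sampled rollouts look best under the optimistic model."""
    values = {}
    for a in actions:
        returns = []
        for _ in range(ROLLOUTS):
            next_state, reward, done = model.step(state, a)
            future = 0.0 if done else GAMMA * rollout_value(model, next_state, actions, DEPTH)
            returns.append(reward + bonus(state, a) + future)
        values[a] = sum(returns) / len(returns)
    best = max(values, key=values.get)
    visit_counts[(state, best)] += 1
    return best


class ToyChainModel:
    """Tiny deterministic chain MDP, included only to make the sketch runnable."""
    def __init__(self, length=8):
        self.length = length

    def step(self, state, action):
        next_state = min(state + 1, self.length - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == self.length - 1 else 0.0
        done = next_state == self.length - 1
        return next_state, reward, done


if __name__ == "__main__":
    print(plan(ToyChainModel(), 0, [0, 1]))  # typically picks action 1 (move toward the goal)
```

Because the bonus shrinks as a (state, action) pair is revisited, the planner is drawn toward rarely tried actions early on and toward genuinely rewarding ones later, which is the basic trade-off any strategic exploration scheme in this setting has to make.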
