Uncertainty-sensitive Learning and Planning with Ensembles

We propose a reinforcement learning framework for discrete environments in which an agent makes both strategic and tactical decisions. The former manifests itself through the use of a value function, while the latter is powered by a tree-search planner. These tools complement each other: the planning module performs a local \textit{what-if} analysis, which allows the agent to avoid tactical pitfalls and boosts the backups of the value function, while the value function, being global in nature, compensates for the inherent locality of the planner. To further solidify this synergy, we introduce an exploration mechanism with two distinctive components: uncertainty modelling and risk measurement. To model the uncertainty we use value function ensembles, and to reflect risk we propose several functionals that summarize the uncertainty implied by the ensemble. We show that our method performs well on hard exploration environments: Deep-sea, toy Montezuma's Revenge, and Sokoban. In all cases we obtain a speed-up in learning and a boost in performance.
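
The concrete risk functionals are not given in this excerpt; the snippet below is a minimal, hypothetical sketch of the general idea, assuming an ensemble of value functions evaluated at a planner leaf and three illustrative summaries of the ensemble spread (mean-plus-std, upper quantile, and voting). All names and parameters here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of risk functionals over an ensemble of value estimates.
# Names (ensemble_values, optimistic_value, ...) are illustrative only.
import numpy as np

def ensemble_values(state, value_fns):
    """Evaluate every value function in the ensemble at a single state."""
    return np.array([v(state) for v in value_fns])

def optimistic_value(values, kappa=1.0):
    """Mean-plus-std functional: favours states the ensemble disagrees on."""
    return values.mean() + kappa * values.std()

def quantile_value(values, q=0.75):
    """Upper-quantile functional: a milder, outlier-robust form of optimism."""
    return np.quantile(values, q)

def voting_value(values, threshold=0.0):
    """Fraction of ensemble members that rate the state above a threshold."""
    return (values > threshold).mean()

# Illustrative usage, e.g. as a leaf evaluation inside a tree-search planner:
value_fns = [lambda s, w=w: float(np.dot(w, s)) for w in np.random.randn(5, 4)]
state = np.ones(4)
vals = ensemble_values(state, value_fns)
print(optimistic_value(vals), quantile_value(vals), voting_value(vals))
```

In such a scheme the planner would rank candidate nodes by the chosen functional rather than by a single point estimate, so that ensemble disagreement translates directly into an exploration incentive.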
