Experience Selection in Deep Reinforcement Learning for Control

Experience replay is a technique that allows off-policy reinforcement-learning methods to reuse past experiences. The stability and speed of convergence of reinforcement learning, as well as the eventual performance of the learned policy, depend strongly on which experiences are replayed. This in turn depends on two important choices: which and how many experiences to retain in the experience replay buffer, and how to sample from that buffer the experiences that are to be replayed. We propose new methods for the combined problem of experience retention and experience sampling, which we refer to together as experience selection. We focus our investigation specifically on the control of physical systems, such as robots, where exploration is costly. To determine which experiences to keep and which to replay, we investigate different proxies for their immediate and long-term utility, including age, temporal-difference error, and the strength of the applied exploration noise. Since no currently available method works in all situations, we propose guidelines for using prior knowledge about the characteristics of the control problem at hand to choose an appropriate experience replay strategy.
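
The two choices described above, retention and sampling, can be made explicit in code. The following is a minimal Python sketch of that structure, not the implementation used in the paper: the class name `SelectiveReplayBuffer`, its constructor arguments, and the use of the absolute TD error as the utility proxy for both choices are illustrative assumptions.

```python
# Illustrative sketch (not the authors' implementation): a replay buffer that
# separates the two choices discussed above -- retention (what to overwrite
# when the buffer is full) and sampling (what to replay) -- using the
# absolute TD error as a proxy for an experience's utility.
import random
from dataclasses import dataclass


@dataclass
class Experience:
    state: list
    action: list
    reward: float
    next_state: list
    done: bool
    td_error: float = 1.0  # utility proxy; refreshed after the experience is replayed


class SelectiveReplayBuffer:
    def __init__(self, capacity, retention="fifo", sampling="uniform"):
        self.capacity = capacity
        self.retention = retention   # "fifo" (age-based) or "lowest_td" (utility-based)
        self.sampling = sampling     # "uniform" or "td_proportional"
        self.buffer = []
        self._next = 0               # FIFO write pointer

    def add(self, exp):
        if len(self.buffer) < self.capacity:
            self.buffer.append(exp)
            return
        if self.retention == "fifo":
            # Overwrite the oldest experience (age as the retention proxy).
            self.buffer[self._next] = exp
            self._next = (self._next + 1) % self.capacity
        else:
            # Overwrite the experience with the smallest |TD error|
            # (estimated utility as the retention proxy).
            idx = min(range(len(self.buffer)),
                      key=lambda i: abs(self.buffer[i].td_error))
            self.buffer[idx] = exp

    def sample(self, batch_size):
        batch_size = min(batch_size, len(self.buffer))
        if self.sampling == "uniform":
            return random.sample(self.buffer, batch_size)
        # Sample proportionally to |TD error|, as in prioritized replay.
        weights = [abs(e.td_error) + 1e-6 for e in self.buffer]
        return random.choices(self.buffer, weights=weights, k=batch_size)
```

Depending on the `retention` and `sampling` arguments, the same buffer behaves as a standard first-in-first-out buffer with uniform replay or as a buffer that preferentially keeps and replays high-TD-error experiences.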
