A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning

Across machine learning, the use of curricula has shown strong empirical potential to improve learning from data by avoiding local optima of training objectives. For reinforcement learning (RL), curricula are especially interesting, as the underlying optimization has a strong tendency to get stuck in local optima due to the exploration-exploitation trade-off. Recently, a number of approaches for the automatic generation of curricula in RL have been shown to increase performance while requiring less expert knowledge than manually designed curricula. However, these approaches are seldom investigated from a theoretical perspective, preventing a deeper understanding of their mechanics. In this paper, we present an approach for automated curriculum generation in RL with a clear theoretical underpinning. More precisely, we formalize the well-known self-paced learning paradigm as inducing a distribution over training tasks that trades off task complexity against the objective of matching a desired target task distribution. Experiments show that training on this induced distribution helps to avoid poor local optima across RL algorithms in different tasks with uninformative rewards and challenging exploration requirements.
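As an illustrative sketch of the trade-off described above (the notation here is ours, not taken verbatim from the paper), the induced training distribution p(c) over task parameters c can be thought of as maximizing expected agent performance while staying close to a desired target task distribution mu(c):

\max_{p(c)} \; \mathbb{E}_{p(c)}\!\left[ J(\pi, c) \right] \;-\; \alpha \, D_{\mathrm{KL}}\!\left( p(c) \,\|\, \mu(c) \right),

where J(pi, c) denotes the expected return of policy pi on task c and alpha >= 0 controls how strongly the curriculum is pulled toward the target tasks: small alpha favors tasks the current policy can already solve, while large alpha forces the training distribution to match mu(c).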
