Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning

What goals should a multi-goal reinforcement learning agent pursue during training in long-horizon tasks? When the desired (test-time) goal distribution is too distant to offer a useful learning signal, we argue that the agent should not pursue unobtainable goals. Instead, it should set its own intrinsic goals that maximize the entropy of the historical achieved-goal distribution. We propose to optimize this objective by having the agent pursue past achieved goals in sparsely explored areas of the goal space, which focuses exploration on the frontier of the achievable goal set. We show that our strategy achieves an order of magnitude better sample efficiency than the prior state of the art on long-horizon multi-goal tasks, including maze navigation and block stacking.
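To make the goal-selection idea concrete, here is a minimal sketch of one way to pick an intrinsic goal from sparsely explored regions of the goal space. It assumes past achieved goals are stored in a buffer (NumPy array) and that goal-space density is estimated with a Gaussian kernel density estimate, which is one plausible density model; the function and parameter names (select_intrinsic_goal, bandwidth, num_candidates) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def select_intrinsic_goal(achieved_goals, bandwidth=0.1, num_candidates=100, rng=None):
    """Pick a past achieved goal from a sparsely explored region of goal space.

    achieved_goals: array of shape (N, goal_dim) holding historical achieved goals.
    Returns the candidate with the lowest estimated density, i.e. a goal on the
    frontier of the achievable goal set; repeatedly pursuing such goals tends to
    spread the achieved-goal distribution and raise its entropy.
    """
    rng = rng if rng is not None else np.random.default_rng()

    # Fit a density model to the historical achieved-goal distribution.
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(achieved_goals)

    # Sample candidate goals from the buffer and score their log-density.
    n = len(achieved_goals)
    idx = rng.choice(n, size=min(num_candidates, n), replace=False)
    candidates = achieved_goals[idx]
    log_density = kde.score_samples(candidates)

    # The lowest-density candidate lies in a sparsely explored region.
    return candidates[np.argmin(log_density)]
```

In a training loop, this selector would replace sampling from the (distant) test-time goal distribution: at the start of each episode the agent conditions its goal-reaching policy on the returned intrinsic goal and relabels experience as usual for multi-goal RL.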
