Local Search for Policy Iteration in Continuous Control

We present an algorithm for local, regularized policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework. Our algorithm can be interpreted as a natural extension of work on KL-regularized RL and introduces a form of tree search for continuous action spaces. We demonstrate that additional computation spent on model-based policy improvement during learning can improve data efficiency, and confirm that model-based policy improvement during action selection can also be beneficial. Quantitatively, our algorithm improves data efficiency on several continuous control benchmarks (when a model is learned in parallel), and it provides significant improvements in wall-clock time in high-dimensional domains (when a ground-truth model is available). The unified framework also helps us to better understand the space of model-based and model-free algorithms. In particular, we demonstrate that some benefits attributed to model-based RL can be obtained without a model, simply by utilizing more computation.
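To make the flavor of the approach concrete, the following is a minimal sketch of one local, KL-regularized policy-improvement step in the spirit described above; it is an illustration of the general technique, not the paper's exact procedure. The helpers policy_sample (draws candidate actions from the current policy) and score_fn (evaluates candidates with a learned Q-function, or with short model rollouts in a model-based variant) are hypothetical names introduced here for the example.

import numpy as np

def kl_regularized_improvement(state, policy_sample, score_fn,
                               num_actions=64, temperature=0.5):
    # Draw local candidate actions from the current policy (hypothetical helper).
    actions = policy_sample(state, num_actions)
    # Score each candidate: a learned Q-function in a model-free variant,
    # or short model rollouts in a model-based variant (hypothetical helper).
    scores = score_fn(state, actions)
    # Exponentiate (shifted) scores; the temperature controls how far the
    # improved policy may move from the current one, i.e. the strength of
    # the KL regularization.
    weights = np.exp((scores - scores.max()) / temperature)
    weights /= weights.sum()
    # The weighted candidates define a non-parametric, locally improved policy;
    # a parametric policy would then be fit to them by weighted maximum likelihood.
    return actions, weights

Under this sketch, the same computational budget can be spent either on evaluating candidates more thoroughly (model-based rollouts) or on proposing and scoring more candidates (model-free), which mirrors the abstract's point that extra computation alone can recover some benefits attributed to model-based RL.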
