Local Search for Policy Iteration in Continuous Control

We present an algorithm for local, regularized policy improvement in reinforcement learning (RL) that allows us to formulate model-based and model-free variants in a single framework. Our algorithm can be interpreted as a natural extension of work on KL-regularized RL and introduces a form of tree search for continuous action spaces. We demonstrate that additional computation spent on model-based policy improvement during learning can improve data efficiency, and confirm that model-based policy improvement during action selection can also be beneficial. Quantitatively, our algorithm improves data efficiency on several continuous control benchmarks (when a model is learned in parallel), and it provides significant improvements in wall-clock time in high-dimensional domains (when a ground-truth model is available). The unified framework also helps us to better understand the space of model-based and model-free algorithms. In particular, we demonstrate that some benefits attributed to model-based RL can be obtained without a model, simply by utilizing more computation.
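To make the flavor of the approach concrete, the following is a minimal sketch of one local, KL-regularized policy-improvement step in the spirit described above; it is an illustration of the general technique, not the paper's exact procedure. The helpers policy_sample (draws candidate actions from the current policy) and score_fn (evaluates candidates with a learned Q-function, or with short model rollouts in a model-based variant) are hypothetical names introduced here for the example.

import numpy as np

def kl_regularized_improvement(state, policy_sample, score_fn,
                               num_actions=64, temperature=0.5):
    # Draw local candidate actions from the current policy (hypothetical helper).
    actions = policy_sample(state, num_actions)
    # Score each candidate: a learned Q-function in a model-free variant,
    # or short model rollouts in a model-based variant (hypothetical helper).
    scores = score_fn(state, actions)
    # Exponentiate (shifted) scores; the temperature controls how far the
    # improved policy may move from the current one, i.e. the strength of
    # the KL regularization.
    weights = np.exp((scores - scores.max()) / temperature)
    weights /= weights.sum()
    # The weighted candidates define a non-parametric, locally improved policy;
    # a parametric policy would then be fit to them by weighted maximum likelihood.
    return actions, weights

Under this sketch, the same computational budget can be spent either on evaluating candidates more thoroughly (model-based rollouts) or on proposing and scoring more candidates (model-free), which mirrors the abstract's point that extra computation alone can recover some benefits attributed to model-based RL.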
