Model-Based Reinforcement Learning via Latent-Space Collocation

The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad capabilities. Visual model-based reinforcement learning (RL) methods that directly plan future actions have shown impressive results on tasks requiring only short-horizon reasoning; however, these methods struggle on temporally extended tasks. We argue that it is easier to solve long-horizon tasks by planning sequences of states rather than just actions, because the effects of actions compound greatly over time and are harder to optimize. To achieve this, we draw on the idea of collocation, which has shown good results on long-horizon tasks in the optimal control literature, and adapt it to the image-based setting by utilizing learned latent state space models. The resulting latent collocation method (LatCo) optimizes trajectories of latent states and improves over previously proposed shooting methods for visual model-based RL on tasks with sparse rewards and long-term goals. Videos and code: https://orybkin.github.io/latco/.
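
To make the core idea concrete, below is a minimal sketch of latent-space collocation: jointly optimizing a sequence of latent states and actions so that the planned trajectory maximizes predicted reward while softly satisfying learned latent dynamics. This is not the authors' implementation; the function names, toy dynamics/reward models, and hyperparameters are hypothetical stand-ins for the learned models the paper assumes.

```python
# Sketch of latent-space collocation (hypothetical names, not the LatCo code).
# Decision variables: the planned latent states z_1..z_T and actions a_1..a_T.
# Objective: maximize predicted reward while penalizing violations of the
# learned latent dynamics z_{t+1} ~ f(z_t, a_t).
import jax
import jax.numpy as jnp

Z_DIM, A_DIM, HORIZON, STEPS, LR, LAMBDA = 4, 2, 10, 200, 1e-1, 10.0

def dynamics(z, a):
    # Stand-in for a learned latent dynamics model.
    return jnp.tanh(z + 0.1 * jnp.concatenate([a, a]))

def reward(z):
    # Stand-in for a learned reward predictor on latent states.
    return -jnp.sum((z - 1.0) ** 2)

def collocation_objective(plan, z0):
    zs, acts = plan["z"], plan["a"]
    zs_full = jnp.concatenate([z0[None], zs], axis=0)
    # Dynamics violation: distance between each planned state and the model's prediction.
    pred = jax.vmap(dynamics)(zs_full[:-1], acts)
    violation = jnp.sum((zs_full[1:] - pred) ** 2)
    # Negative reward plus a penalty on dynamics violations (soft constraint).
    return -jnp.sum(jax.vmap(reward)(zs_full[1:])) + LAMBDA * violation

@jax.jit
def update(plan, z0):
    grads = jax.grad(collocation_objective)(plan, z0)
    return jax.tree_util.tree_map(lambda p, g: p - LR * g, plan, grads)

z0 = jnp.zeros(Z_DIM)
plan = {"z": jnp.zeros((HORIZON, Z_DIM)), "a": jnp.zeros((HORIZON, A_DIM))}
for _ in range(STEPS):
    plan = update(plan, z0)
print("first planned action:", plan["a"][0])
```

In a full planner the penalty weight would typically be scheduled or handled with a Lagrangian-style update, and the first action of the optimized plan would be executed in a model-predictive control loop; the sketch only illustrates why optimizing states directly avoids backpropagating through long action sequences.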
