Deep Value Model Predictive Control

In this paper, we introduce an actor-critic algorithm called Deep Value Model Predictive Control (DMPC), which combines model-based trajectory optimization with value function estimation. The DMPC actor is a Model Predictive Control (MPC) optimizer whose objective function is defined in terms of a value function estimated by the critic. We show that the MPC actor is an importance sampler that minimizes an upper bound on the cross-entropy to the state distribution of the optimal sampling policy. In experiments on a Ballbot system, we show that the algorithm can solve obstacle avoidance and target-reaching tasks efficiently from sparse, binary reward signals. Compared to previous work, we show that including the value function in the running cost of the trajectory optimizer speeds up convergence. We also discuss the strategies needed to make the algorithm robust in practice.
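
To make the actor-critic structure concrete, the sketch below illustrates the loop under simple assumptions: a toy linear model stands in for the Ballbot dynamics, the MPC actor is a naive random-shooting optimizer, and the critic is a quadratic cost-to-go with a single weight updated by TD(0). All names (dynamics, mpc_actor, td_update, and so on) are hypothetical and not the paper's implementation; the point is only to show the learned value function appearing in the running cost of the trajectory optimizer.

```python
import numpy as np

def dynamics(x, u):
    """Toy linear model standing in for the Ballbot dynamics."""
    return x + 0.1 * u

def running_cost(x, u, value_fn):
    """Stage cost that also includes the learned value estimate,
    mirroring how DMPC places the value function in the running cost."""
    return 0.01 * float(u @ u) + value_fn(x)

def mpc_actor(x0, value_fn, horizon=10, samples=64, noise=0.5, rng=None):
    """Naive sampling-based MPC: score random control sequences with the
    value-augmented cost and return the first action of the best one."""
    rng = rng or np.random.default_rng(0)
    best_u, best_cost = None, np.inf
    for _ in range(samples):
        u_seq = noise * rng.standard_normal((horizon, x0.shape[0]))
        x, cost = x0.copy(), 0.0
        for u in u_seq:
            cost += running_cost(x, u, value_fn)
            x = dynamics(x, u)
        cost += value_fn(x)  # terminal value estimate
        if cost < best_cost:
            best_cost, best_u = cost, u_seq[0]
    return best_u

def make_value_fn(w):
    """Quadratic cost-to-go approximator V(x) = w * ||x||^2."""
    return lambda x: float(w * (x @ x))

def td_update(w, x, x_next, stage_cost, gamma=0.99, lr=1e-3):
    """One-step semi-gradient TD(0) update of the single critic weight w."""
    td_error = stage_cost + gamma * w * float(x_next @ x_next) - w * float(x @ x)
    return w + lr * td_error * float(x @ x)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = 1.0                    # critic weight
    x = np.array([1.0, -0.5])  # initial state
    for step in range(50):
        u = mpc_actor(x, make_value_fn(w), rng=rng)
        x_next = dynamics(x, u)
        stage_cost = float(x_next @ x_next)  # dense stand-in for the sparse reward
        w = td_update(w, x, x_next, stage_cost)
        x = x_next
    print("final state:", x, "critic weight:", w)
```

In the actual algorithm the critic would be a neural network and the actor a proper MPC solver over the system model; the sketch only preserves the structure of the interaction between the two.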
