Paper Information - Model-Free Trajectory Optimization for Reinforcement Learning

Model-Free Trajectory Optimization for Reinforcement Learning

Many recent Trajectory Optimization algorithms alternate between a local approximation of the dynamics and a conservative policy update. However, linearly approximating the dynamics in order to derive the new policy can bias the update and prevent convergence to the optimal policy. In this article, we propose a new model-free algorithm that backpropagates a local, quadratic, time-dependent Q-Function, allowing the policy update to be derived in closed form. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics, and it demonstrates improved performance in comparison to related Trajectory Optimization algorithms that linearize the dynamics.
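To illustrate why a quadratic Q-Function permits a closed-form update, the sketch below works through the scalar case. It is not the algorithm from the article, whose details the abstract does not give. It only uses the standard fact that multiplying a Gaussian by the exponential of a concave quadratic yields another Gaussian. The function names, the parameterization Q(s, a) = 0.5*q_aa*a^2 + q_as*a*s + q_a*a, and the fixed temperature `eta` are all assumptions made for this illustration. How the temperature would be chosen to meet a given KL bound is left open here, so it is simply an input.

```python
import math


def kl_regularized_update(K, k, var, q_aa, q_as, q_a, eta):
    """Closed-form update pi_new(a|s) proportional to pi_old(a|s) * exp(Q(s, a) / eta).

    Assumed (hypothetical) setup, scalar state and action:
      pi_old(a|s) = N(K*s + k, var)
      Q(s, a)     = 0.5*q_aa*a**2 + q_as*a*s + q_a*a  (+ terms without a)
    Returns the gain, offset and variance of the new linear-Gaussian policy.
    """
    if eta <= 0.0 or var <= 0.0:
        raise ValueError("eta and var must be positive")
    # Precisions add: old precision plus the curvature of -Q/eta.
    prec_new = 1.0 / var - q_aa / eta
    if prec_new <= 0.0:
        raise ValueError("Q is too convex in a for this eta")
    var_new = 1.0 / prec_new
    # Linear terms in a add as well, for the s-dependent and constant parts.
    K_new = var_new * (K / var + q_as / eta)
    k_new = var_new * (k / var + q_a / eta)
    return K_new, k_new, var_new


def gaussian_kl(m1, v1, m0, v0):
    """KL( N(m1, v1) || N(m0, v0) ) for scalar Gaussians."""
    return 0.5 * (v1 / v0 + (m1 - m0) ** 2 / v0 - 1.0 + math.log(v0 / v1))


def kl_new_vs_old(s, old, new):
    """KL between new and old policy at state s; each policy is (K, k, var)."""
    (K0, k0, v0), (K1, k1, v1) = old, new
    return gaussian_kl(K1 * s + k1, v1, K0 * s + k0, v0)


if __name__ == "__main__":
    old = (0.0, 0.0, 1.0)              # old policy: a ~ N(0, 1) for every s
    q_aa, q_as, q_a = -1.0, 1.0, 0.0   # Q is maximized at a = s
    for eta in (1.0, 10.0, 100.0):
        new = kl_regularized_update(*old, q_aa, q_as, q_a, eta)
        print(eta, new, kl_new_vs_old(1.0, old, new))
```

In this toy setting a larger `eta` keeps the new policy closer to the old one (smaller KL), while a smaller `eta` moves the gain toward the greedy maximizer of Q. A KL-constrained method would therefore pick the temperature so that the divergence matches the allowed bound.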

[1] W. Müller. Jacobson, D. H. and D. Q. Mayne: Differential Dynamic Programming. Modern Analytic and Computational Methods in Science and Mathematics, No. 24. American Elsevier Publ. Co., Inc., New York 1970. XVI, 208 S., 17 Abb., Dfl. 51.50, 1973.

[2] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[3] Jun Nakanishi, et al. Learning Attractor Landscapes for Learning Motor Primitives, 2002, NIPS.

[4] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[5] Emanuel Todorov, et al. Optimal Control Theory, 2006.

[6] Stefan Schaal, et al. Path integral-based stochastic optimal control for rigid body dynamics, 2009, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

[7] Marc Toussaint, et al. Robot trajectory optimization using approximate inference, 2009, ICML '09.

[8] Yuval Tassa, et al. Iterative local dynamic programming, 2009, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

[9] Yasemin Altun, et al. Relative Entropy Policy Search, 2010.

[10] Yuval Tassa, et al. Stochastic Differential Dynamic Programming, 2010, Proceedings of the 2010 American Control Conference.

[11] Jan Peters, et al. A biomimetic approach to robot table tennis, 2010, 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[12] Csaba Szepesvári, et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[13] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search, 2011, ICML.

[14] Paul Wagner, et al. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration, 2011, NIPS.

[15] Jan Peters, et al. Hierarchical Relative Entropy Policy Search, 2014, AISTATS.

[16] Jan Peters, et al. A Survey on Policy Search for Robotics, 2013, Found. Trends Robotics.

[17] Luca Bascetta, et al. Adaptive Step-Size for Policy Gradient Methods, 2013, NIPS.

[18] Daniele Calandriello, et al. Safe Policy Iteration, 2013, ICML.

[19] Sergey Levine, et al. Learning Complex Neural Network Policies with Trajectory Optimization, 2014, ICML.

[20] Yunpeng Pan, et al. Probabilistic Differential Dynamic Programming, 2014, NIPS.

[21] Sergey Levine, et al. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics, 2014, NIPS.

[22] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[23] Luís Paulo Reis, et al. Model-Based Relative Entropy Stochastic Search, 2016, NIPS.

[24] Anastasios Kyrillidis, et al. Dropping Convexity for Faster Semi-definite Optimization, 2015, COLT.