Predictor-Corrector Policy Optimization

We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. The new "PicCoLOed" algorithm optimizes a policy by recursively repeating two steps: In the Prediction Step, the learner uses a model to predict the unseen future gradient and then applies the predicted estimate to update the policy; in the Correction Step, the learner runs the updated policy in the environment, receives the true gradient, and then corrects the policy using the gradient error. Unlike previous algorithms, PicCoLO corrects for the mistakes of using imperfect predicted gradients and hence does not suffer from model bias. The development of PicCoLO is made possible by a novel reduction from predictable online learning to adversarial online learning, which provides a systematic way to modify existing first-order algorithms to achieve the optimal regret with respect to predictable information. We show, in both theory and simulation, that the convergence rate of several first-order model-free algorithms can be improved by PicCoLO.
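
As a rough illustration of the two steps described above, the update can be sketched with plain gradient descent as the base first-order learner (in the paper the framework wraps an arbitrary first-order algorithm); the names `predict_grad`, `true_grad`, `lr`, and `num_rounds` below are hypothetical placeholders, not the paper's API.

```python
import numpy as np

def piccolo_sketch(theta, predict_grad, true_grad, num_rounds=100, lr=0.01):
    """Minimal sketch of a "PicCoLOed" update, assuming gradient descent
    as the base learner. `predict_grad` returns a model-based gradient
    estimate; `true_grad` returns the gradient observed by running the
    policy in the environment."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(num_rounds):
        # Prediction Step: update the policy with the model-predicted gradient.
        g_hat = predict_grad(theta)
        theta_hat = theta - lr * g_hat

        # Correction Step: run the updated policy, observe the true gradient,
        # and correct the policy using the gradient-prediction error.
        g = true_grad(theta_hat)
        theta = theta_hat - lr * (g - g_hat)
    return theta
```

When the predicted gradients are accurate, the correction term is small and the learner effectively takes model-based steps; when they are poor, the correction step removes the accumulated error, which is the intuition behind the abstract's claim that PicCoLO does not suffer from model bias.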
