Approximate Newton Methods for Policy Search in Markov Decision Processes

Approximate Newton methods are standard optimization tools that aim to retain the benefits of Newton's method, such as a fast rate of convergence, while alleviating its drawbacks, such as the computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analyse the structure of the Hessian of the total expected reward, a standard objective function for MDPs. We show that, like the gradient, the Hessian exhibits useful structure in the context of MDPs, and we use this analysis to motivate two Gauss-Newton methods for MDPs. Like the Gauss-Newton method for non-linear least squares, these methods drop certain terms of the Hessian. The resulting approximate Hessians possess desirable properties, such as negative definiteness, and we establish several important performance guarantees, including guaranteed ascent directions, invariance to affine transformations of the parameter space, and convergence guarantees. Finally, we provide a unifying perspective on key policy search algorithms, demonstrating that our second Gauss-Newton algorithm is closely related to both the EM algorithm and natural gradient ascent applied to MDPs, while performing significantly better in practice on a range of challenging domains.
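To illustrate the general flavour of such an update (and not the paper's specific Gauss-Newton construction), the sketch below takes one approximate-Newton ascent step for a tabular softmax policy: a sampled policy gradient is preconditioned by a return-weighted outer-product curvature matrix, which stands in for an approximate Hessian obtained by dropping terms. The function names, the particular curvature surrogate, and the regularisation term are assumptions made purely for this example.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a vector of logits.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def gauss_newton_step(theta, states, actions, returns, reg=1e-3):
    """One illustrative approximate-Newton ascent step from sampled data.

    theta: (n_states, n_actions) logits of a tabular softmax policy.
    states, actions, returns: equal-length arrays from sampled trajectories.
    NOTE: this is a hedged sketch, not the paper's Gauss-Newton matrices.
    """
    d = theta.size
    grad = np.zeros(d)
    curvature = np.zeros((d, d))
    for s, a, G in zip(states, actions, returns):
        pi = softmax(theta[s])
        # Score vector d/dtheta log pi(a|s) for the tabular softmax policy.
        score = np.zeros_like(theta)
        score[s, a] += 1.0
        score[s] -= pi
        score = score.ravel()
        grad += G * score
        # Return-weighted outer product: a simple positive semi-definite
        # curvature surrogate (assumes non-negative returns).
        curvature += G * np.outer(score, score)
    n = len(returns)
    grad /= n
    curvature = curvature / n + reg * np.eye(d)
    # Ascent direction: precondition the gradient with the inverse curvature.
    step = np.linalg.solve(curvature, grad)
    return theta + step.reshape(theta.shape)

# Toy usage with fabricated sample data (3 states, 2 actions):
theta = np.zeros((3, 2))
theta = gauss_newton_step(theta,
                          states=np.array([0, 1, 2, 1]),
                          actions=np.array([1, 0, 1, 1]),
                          returns=np.array([1.0, 0.5, 2.0, 1.5]))

Because the curvature surrogate is (regularised) positive definite, solving against it always yields an ascent direction, which mirrors the negative-definiteness property the paper establishes for its approximate Hessians.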
