An online actor-critic learning approach with Levenberg-Marquardt algorithm

This paper focuses on improving the efficiency of online actor-critic design by basing it on the Levenberg-Marquardt (LM) algorithm rather than the traditional chain rule. Over the decades, several generations of adaptive/approximate dynamic programming (ADP) structures have been proposed in the community and have been applied successfully in many settings. Neural networks trained with backpropagation have been one of the most important approaches to tuning the parameters in such ADP designs. In this paper, we study the integration of the Levenberg-Marquardt method into the regular actor-critic design to improve weight updating and learning, with quadratic convergence under certain conditions. Specifically, for the critic network, we adopt the LM method to improve learning performance, while for the action network, we use a neural network with backpropagation to provide an appropriate control action. A detailed learning algorithm is presented, followed by benchmark tests on the pendulum swing-up and balance task and the cart-pole balance task. Simulation results and a comparative study demonstrate the effectiveness of this approach.
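
For orientation, the standard LM weight update can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the toy model `tanh(w0 * x + w1)`, the names `lm_step`, `residual`, and `jacobian`, and the factor-of-10 damping schedule are all hypothetical choices made here. In an actor-critic setting the residual would presumably be the critic's prediction error, which is not reproduced in this sketch.

```python
import numpy as np

def lm_step(w, residual_fn, jacobian_fn, mu):
    """One Levenberg-Marquardt step for minimizing 0.5 * ||e(w)||^2."""
    e = residual_fn(w)                      # residual vector e(w)
    J = jacobian_fn(w)                      # Jacobian de/dw, shape (n_samples, n_weights)
    A = J.T @ J + mu * np.eye(w.size)       # damped Gauss-Newton matrix
    delta = np.linalg.solve(A, -J.T @ e)    # solve (J^T J + mu I) delta = -J^T e
    w_new = w + delta
    # Accept the step and relax the damping only if the squared error decreases;
    # otherwise keep the old weights and increase the damping (assumed schedule).
    if np.sum(residual_fn(w_new) ** 2) < np.sum(e ** 2):
        return w_new, mu / 10.0
    return w, mu * 10.0

# Toy stand-in for a critic: fit y = tanh(w0 * x + w1) to noise-free data.
x = np.linspace(-2.0, 2.0, 50)
w_true = np.array([1.5, -0.5])
y = np.tanh(w_true[0] * x + w_true[1])

def residual(w):
    return np.tanh(w[0] * x + w[1]) - y

def jacobian(w):
    s = 1.0 - np.tanh(w[0] * x + w[1]) ** 2   # derivative of tanh
    return np.column_stack([s * x, s])

w = np.array([0.1, 0.1])
mu = 1e-2
for _ in range(50):
    w, mu = lm_step(w, residual, jacobian, mu)

final_error = float(np.sum(residual(w) ** 2))
```

With a large damping value `mu` the step approaches a small gradient-descent step, and with a small `mu` it approaches a Gauss-Newton step, which is the source of the fast local convergence that motivates using LM for the critic.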
