Value-gradient learning

We describe VGL(λ), an Adaptive Dynamic Programming algorithm for learning a critic function over a large continuous state space. The algorithm, which requires a learned model of the environment, extends Dual Heuristic Dynamic Programming (DHP) to include a bootstrapping parameter analogous to the λ used in the reinforcement learning algorithm TD(λ). We provide on-line and batch-mode implementations of the algorithm, and summarise the theoretical relationships between this method and its precursor algorithms, DHP and TD(λ), together with the motivations for using it over them. Experiments on control problems using a neural-network critic and a greedy policy are provided.

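As a rough illustration of the batch-mode recursion the abstract describes, the sketch below computes λ-weighted value-gradient targets G′_t backwards along a stored trajectory and accumulates a gradient-descent step on the critic weights. The recursion G′_t = (DU/Dx)_t + γ(Df/Dx)_tᵀ(λG′_{t+1} + (1−λ)G̃_{t+1}) blends the critic's own bootstrapped output G̃ with the recursively computed target, much as TD(λ) blends n-step returns, and setting λ = 0 recovers a DHP-style target. All helper names here (`vgl_lambda_update`, `critic_grad`, `critic_jac_w`, `model_jac`, `cost_grad`) are hypothetical stand-ins for quantities supplied by the critic network and the learned model, and any weighting matrix on the critic error is taken to be the identity; this is a minimal sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def vgl_lambda_update(xs, critic_grad, critic_jac_w, model_jac, cost_grad,
                      lam=0.7, gamma=1.0, alpha=1e-3):
    """One batch-mode VGL(lambda)-style critic update along a trajectory.

    xs           -- states x_0 .. x_T visited under the greedy policy
    critic_grad  -- critic output G~(x) approximating dV/dx        (hypothetical)
    critic_jac_w -- Jacobian dG~(x)/dw, shape (state_dim, n_w)     (hypothetical)
    model_jac    -- total derivative Dx_{t+1}/Dx_t from the learned
                    model, including the policy's state dependence (hypothetical)
    cost_grad    -- total derivative DU_t/Dx_t of the step cost    (hypothetical)
    """
    T = len(xs) - 1
    G_target = cost_grad(xs[T])                    # terminal-cost gradient
    dw = np.zeros(critic_jac_w(xs[0]).shape[1])    # accumulated weight step
    for t in range(T - 1, -1, -1):
        # Blend the bootstrapped critic estimate at x_{t+1} with the
        # recursive target, analogous to the lambda-return of TD(lambda).
        blended = lam * G_target + (1.0 - lam) * critic_grad(xs[t + 1])
        # Back the blended gradient through the model by the chain rule.
        G_target = cost_grad(xs[t]) + gamma * model_jac(xs[t]).T @ blended
        # Gradient-descent step on ||G' - G~||^2 at x_t.
        dw += critic_jac_w(xs[t]).T @ (G_target - critic_grad(xs[t]))
    return alpha * dw

# Illustrative use with a linear critic G~(x) = W x on a 2-d toy system:
n = 2
W = np.zeros((n, n))
step = vgl_lambda_update(
    xs=[np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.8, 0.1])],
    critic_grad=lambda x: W @ x,
    critic_jac_w=lambda x: np.kron(np.eye(n), x),  # dG~/dvec(W)
    model_jac=lambda x: 0.9 * np.eye(n),           # stand-in model Jacobian
    cost_grad=lambda x: 2.0 * x)                   # e.g. quadratic cost U = x.x
W += step.reshape(n, n)                            # apply the weight increment
```

At λ = 1 the targets are backed up purely through the learned model, which is what links this recursion to backpropagation through time; an on-line variant would apply the update after each step rather than accumulating it over the whole trajectory.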