Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data

Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have proven important in a variety of applications, including feedback control of dynamical systems. ADP generally requires full knowledge of the system's internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. We consider deterministic linear dynamical systems, a class of great interest to the control systems community. In control theory, such methods are referred to as output feedback (OPFB). The stochastic counterpart of the systems considered here is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller requiring only OPFB. It is shown that, as with Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed to implement the learning algorithms or the resulting OPFB control. Only the order of the system and an upper bound on its "observability index" must be known. The learned OPFB controller takes the form of a polynomial autoregressive moving-average (ARMA) controller whose performance is equivalent to that of the optimal state-variable feedback gain.
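
To make the data-driven mechanism concrete, the following sketch illustrates a Q-learning-style rendering of the OPFB policy-iteration idea in Python/NumPy: the value is parameterized as a quadratic form in the vector of the last N measured inputs and outputs, policy evaluation is a batch least-squares fit of the Bellman equation to measured data, and policy improvement returns updated ARMA gains. The plant, weights, exploration noise, and iteration counts are illustrative assumptions rather than the paper's exact formulation or numerical example; the simulation model appears only to generate data and to validate the result, and the learning loop itself touches measured inputs and outputs only.

import numpy as np

rng = np.random.default_rng(0)

# Plant, used only to generate measured data and to validate the result.
# The learning loop itself never reads A, B, or C.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
n, m, p = 2, 1, 1            # state, input, and output dimensions
N = 2                        # upper bound on the observability index
Qy, R = 1.0, 1.0             # output and input weights

dim_z = N * (m + p)          # z_k = [u_{k-1}, u_{k-2}, y_{k-1}, y_{k-2}]
dim_w = m + dim_z            # w_k = [u_k, z_k]

def quad_basis(w):
    # Unique quadratic monomials w_i * w_j, i <= j.
    return np.array([w[i] * w[j] for i in range(len(w)) for j in range(i, len(w))])

def theta_to_H(theta, d):
    # Recover the symmetric kernel H from the monomial coefficients.
    H, idx = np.zeros((d, d)), 0
    for i in range(d):
        for j in range(i, d):
            H[i, j] = theta[idx] if i == j else theta[idx] / 2.0
            H[j, i] = H[i, j]
            idx += 1
    return H

def collect(F, T=600, noise=0.5):
    # Run the plant under u = F z + exploration noise; log (w_k, r_k, z_{k+1}).
    x = rng.standard_normal(n)
    u_hist, y_hist = [0.0] * N, [0.0] * N       # past inputs and outputs, newest first
    data = []
    for k in range(T):
        z = np.array(u_hist + y_hist)           # measured data vector z_k
        y = (C @ x).item()                      # measured output y_k
        u = float(F @ z) + noise * rng.standard_normal()
        r = Qy * y ** 2 + R * u ** 2            # one-step cost r_k
        x = A @ x + B.ravel() * u               # plant update (data generation only)
        u_hist = [u] + u_hist[:-1]
        y_hist = [y] + y_hist[:-1]
        z_next = np.array(u_hist + y_hist)      # z_{k+1}
        if k >= N:                              # keep samples once the histories are real data
            data.append((np.concatenate(([u], z)), r, z_next))
    return data

# Policy iteration using only measured input/output data.
F = np.zeros(dim_z)                             # initial policy u = 0 (plant is open-loop stable)
for it in range(8):
    data = collect(F)
    Phi = np.array([quad_basis(w)
                    - quad_basis(np.concatenate(([float(F @ zn)], zn)))
                    for w, _, zn in data])
    rvec = np.array([r for _, r, _ in data])
    theta, *_ = np.linalg.lstsq(Phi, rvec, rcond=None)   # policy evaluation
    H = theta_to_H(theta, dim_w)
    Huu, Huz = H[:m, :m], H[:m, m:]
    F = -np.linalg.solve(Huu, Huz).ravel()               # policy improvement: new ARMA gains
    print(f"iteration {it}: F = {np.round(F, 4)}")

# Model-based check: at convergence the (u, u) block of the learned kernel
# should equal R + B'PB from the discrete-time Riccati recursion.
P = np.zeros((n, n))
for _ in range(1000):
    P = (C.T * Qy) @ C + A.T @ P @ A \
        - A.T @ P @ B @ np.linalg.inv(R + B.T @ P @ B) @ B.T @ P @ A
print("R + B'PB =", round((R + B.T @ P @ B).item(), 4),
      "  learned H_uu =", round(Huu.item(), 4))

The learned gain vector F multiplies [u_{k-1}, u_{k-2}, y_{k-1}, y_{k-2}], so the controller has exactly the autoregressive moving-average form described above; at convergence, the learned H_uu block should match R + B'PB obtained from the model-based Riccati recursion used here only as a check.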
