Solutions to finite horizon cost problems using actor-critic reinforcement learning

Actor-critic reinforcement learning algorithms have proven to be a successful tool for learning the optimal control of a range of (repetitive) tasks on systems with (partially) unknown, possibly nonlinear dynamics. Most of the reinforcement learning literature published to date deals only with modeling the task at hand as a Markov decision process with an infinite horizon cost function. In practice, however, a solution is sometimes desired for the case where the cost function is defined over a finite horizon, which means that the optimal control problem becomes time-varying and thus harder to solve. This paper adapts two previously introduced actor-critic algorithms from the infinite horizon setting to the finite horizon setting and applies them to learning a task on a nonlinear system, using radial basis function networks and without requiring any assumptions or knowledge about the system dynamics. Simulations on a typical nonlinear motion control problem show that actor-critic algorithms are capable of solving the difficult problem of time-varying optimal control. Moreover, the benefit of using a model learning technique is demonstrated.
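
To make the finite-horizon idea concrete, the following is a minimal sketch of how a finite-horizon actor-critic with radial basis function (RBF) features can be organized: because the optimal value function and policy are time-varying over a horizon of N steps, each time step keeps its own critic and actor weight vectors, and the critic backup truncates at the horizon. This is an assumed illustration only, not the paper's exact algorithms; the toy double-integrator plant, the quadratic cost, the position-only RBF features, and all tuning constants are invented for the example.

```python
# Hedged sketch of a finite-horizon actor-critic with RBF features.
# Assumptions (not from the paper): toy 1-D double-integrator dynamics,
# quadratic stage cost, Gaussian exploration, position-only RBF features.
import numpy as np

N = 20                                  # horizon length (steps)
dt = 0.05                               # sample time
centers = np.linspace(-1.0, 1.0, 9)     # RBF centers over position
width = 0.15                            # RBF width

def rbf(pos):
    """Normalized Gaussian RBF feature vector (position only, for brevity)."""
    phi = np.exp(-((pos - centers) ** 2) / (2.0 * width ** 2))
    return phi / (phi.sum() + 1e-8)

def step(state, u):
    """Assumed toy dynamics: double integrator with small process noise."""
    pos, vel = state
    vel = vel + dt * u + 0.01 * np.random.randn()
    pos = pos + dt * vel
    return np.array([pos, vel])

def cost(state, u):
    """Assumed quadratic stage cost (reward is taken as its negative)."""
    pos, vel = state
    return pos ** 2 + 0.1 * vel ** 2 + 0.01 * u ** 2

n_feat = len(centers)
V = np.zeros((N + 1, n_feat))   # critic weights, one vector per time step; V[N] stays zero
A = np.zeros((N, n_feat))       # actor weights, one vector per time step
alpha_c, alpha_a, sigma = 0.1, 0.05, 0.3

for episode in range(500):
    state = np.array([np.random.uniform(-1, 1), 0.0])
    for k in range(N):
        phi = rbf(state[0])
        u_mean = A[k] @ phi
        u = u_mean + sigma * np.random.randn()       # exploratory action

        next_state = step(state, u)
        r = cost(state, u)
        phi_next = rbf(next_state[0])

        # Finite-horizon TD error with reward = -cost; because V[N] is fixed
        # at zero, the backup naturally terminates at the end of the horizon.
        delta = -r + V[k + 1] @ phi_next - V[k] @ phi

        V[k] += alpha_c * delta * phi                        # critic update at step k
        A[k] += alpha_a * delta * (u - u_mean) * phi         # Gaussian-policy-gradient-style actor update

        state = next_state
```

The key design point illustrated here is the time-indexed parameterization: indexing the critic and actor weights by the step k is one straightforward way to represent the time-varying solution of a finite-horizon problem with function approximators.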
