Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms

Despite their many empirical successes, approximate value-function based approaches to reinforcement learning suffer from a paucity of theoretical guarantees on the performance of the policy generated by the value function. In this paper we pursue an alternative approach: first compute the gradient of the average reward with respect to the parameters controlling the state transitions in a Markov chain (be they parameters of a class of approximate value functions generating a policy by some form of look-ahead, or parameters directly parameterizing a set of policies), and then use gradient ascent to generate a new set of parameters with increased average reward. We call this method “direct” reinforcement learning because we are not attempting to first find an accurate value function from which to generate a policy; we are instead adjusting the parameters to directly improve the average reward. We present an algorithm for computing approximations to the gradient of the average reward from a single sample path of the underlying Markov chain. We show that the accuracy of these approximations depends on the relationship between the discount factor used by the algorithm and the mixing time of the Markov chain, and that the error can be made arbitrarily small by setting the discount factor suitably close to 1. We extend this algorithm to the case of partially observable Markov decision processes controlled by stochastic policies. We prove that both algorithms converge with probability 1.
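
As a rough illustration of the kind of estimator the abstract describes, the sketch below shows a GPOMDP-style eligibility-trace estimate of the average-reward gradient computed from a single sample path, with a discount factor beta that trades bias (beta should be close to 1 when the chain mixes slowly) against the variance of the estimate. The `env` and `policy` interfaces (`reset`, `step`, `sample`, `grad_log_prob`) are placeholders assumed for this sketch, not APIs from the paper.

```python
import numpy as np

def policy_gradient_estimate(env, policy, theta, beta=0.95, T=100_000):
    """Estimate the gradient of the average reward from one sample path.

    Assumed interfaces (hypothetical, for illustration only):
      env.reset() -> observation
      env.step(action) -> (next_observation, reward)
      policy.sample(theta, obs) -> action
      policy.grad_log_prob(theta, obs, action) -> gradient of log prob w.r.t. theta
    """
    grad = np.zeros_like(theta)   # running average of reward-weighted traces
    z = np.zeros_like(theta)      # eligibility trace
    obs = env.reset()
    for t in range(T):
        action = policy.sample(theta, obs)
        # accumulate the discounted eligibility trace
        z = beta * z + policy.grad_log_prob(theta, obs, action)
        obs, reward = env.step(action)
        # running mean of reward * trace over the sample path
        grad += (reward * z - grad) / (t + 1)
    return grad
```

The returned vector is a biased estimate of the average-reward gradient; as the abstract notes, the bias shrinks as beta approaches 1 relative to the chain's mixing time, at the price of higher-variance estimates for a fixed path length.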
