A reinforcement learning neural network for adaptive control of Markov chains

This paper considers reinforcement learning in a dynamically changing environment. Specifically, we study adaptive control of finite-state Markov chains with a finite number of controls, where the transition and payoff structures are unknown. The objective is to find an optimal policy that maximizes the expected total discounted payoff over an infinite horizon. A stochastic neural network model is proposed for the controller. The parameters of the network, which determine a randomized control strategy, are updated at each time instant by a simple learning scheme that uses an adaptive critic to estimate certain relevant parameters. We prove that the controller asymptotically chooses an optimal action in each state of the Markov chain with high probability.
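To make the setting concrete, the following is a minimal sketch of a generic tabular actor-critic loop on a finite controlled Markov chain. It is not the network or the learning scheme of this paper, which the abstract does not specify. The softmax parameterization, the TD(0) critic, the step sizes, and the toy chain (`TOY_P`, `TOY_R`) are all assumptions chosen for illustration. The controller sees only sampled transitions and payoffs, never the transition or payoff tables themselves.

```python
import math
import random


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_actor_critic(P, R, gamma=0.9, steps=20000,
                     alpha_critic=0.05, alpha_actor=0.01, seed=0):
    """Generic tabular actor-critic (illustrative, not the paper's scheme).

    P[s][a] is a list of next-state probabilities and R[s][a] a payoff.
    Both are used only to simulate the chain; the learner never reads them.
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor parameters
    value = [0.0] * n_states                              # critic estimates
    s = 0
    for _ in range(steps):
        # Randomized control: sample an action from the current policy.
        probs = softmax(theta[s])
        a = rng.choices(range(n_actions), weights=probs)[0]
        # Simulate one transition of the (unknown) chain.
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]
        r = R[s][a]
        # Critic: temporal-difference error for the discounted payoff.
        td = r + gamma * value[s_next] - value[s]
        value[s] += alpha_critic * td
        # Actor: move preferences along the softmax score, scaled by td.
        for b in range(n_actions):
            score = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha_actor * td * score
        s = s_next
    return theta, value


# Hypothetical two-state, two-action chain: action 1 pays more in both
# states and tends to lead to the better state 1.
TOY_P = [[[0.9, 0.1], [0.1, 0.9]],
         [[0.9, 0.1], [0.1, 0.9]]]
TOY_R = [[0.0, 1.0],
         [0.0, 2.0]]
```

Calling `run_actor_critic(TOY_P, TOY_R)` returns the learned preferences and value estimates; applying `softmax` to each row of the preferences gives the randomized control strategy in each state. In this toy chain action 1 is better in both states, so the learned strategy should come to favor it.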
