论文信息 - Average reward reinforcement learning: Foundations, algorithms, and empirical results

Average reward reinforcement learning: Foundations, algorithms, and empirical results

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms are described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric calledn-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms while several algorithms can provably generategain-optimal policies that maximize average reward, none of them can reliably filter these to producebias-optimal (orT-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.

Sridhar Mahadevan | S. Mahadevan

[1] Ronald A. Howard,et al. Dynamic Programming and Markov Processes , 1960 .

[2] D. White,et al. Dynamic programming, Markov chains, and the method of successive approximations , 1963 .

[3] Rutherford Aris,et al. Discrete Dynamic Programming , 1965, The Mathematical Gazette.

[4] A. F. Veinott. Discrete Dynamic Programming with Sensitive Discount Optimality Criteria , 1969 .

[5] Eric V. Denardo,et al. Computing a Bias-Optimal Policy in a Discrete-Time Markov Decision Problem , 1970, Oper. Res..

[6] A. Hordijk,et al. A MODIFIED FORM OF THE ITERATIVE METHOD OF DYNAMIC PROGRAMMING , 1975 .

[7] Paul J. Schweitzer,et al. Successive Approximation Methods for Solving Nested Functional Equations in Markov Decision Problems , 1984, Math. Oper. Res..

[8] Richard Wheeler,et al. Decentralized learning in finite Markov chains , 1985, 1985 24th IEEE Conference on Decision and Control.

[9] Dimitri P. Bertsekas,et al. Dynamic Programming: Deterministic and Stochastic Models , 1987 .

[10] Kumpati S. Narendra,et al. Learning automata - an introduction , 1989 .

[11] A. Jalali,et al. Computationally efficient adaptive control algorithms for Markov chains , 1989, Proceedings of the 28th IEEE Conference on Decision and Control,.