Average reward reinforcement learning: Foundations, algorithms, and empirical results

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best-studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.

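Since the abstract highlights R-learning and the independent estimation of the average reward and the relative values, the following is a minimal tabular sketch of the standard R-learning updates (after Schwartz, 1993), given only as an illustration. The Gym-style `env` interface (`reset`, `step`, `actions`), the step-count loop, and the learning-rate names `alpha` and `beta` are assumptions made for this sketch, not details taken from the paper.

```python
import random
from collections import defaultdict


def r_learning(env, steps=100_000, alpha=0.01, beta=0.1, epsilon=0.1):
    """Minimal tabular R-learning sketch (after Schwartz, 1993).

    R[(s, a)] holds the relative action values; rho is a *separate*
    estimate of the average reward per step -- the independent
    estimation the survey identifies as crucial to convergence.
    The env interface (reset/step/actions) is assumed for illustration.
    """
    R = defaultdict(float)
    rho = 0.0
    s = env.reset()

    for _ in range(steps):
        acts = env.actions(s)
        greedy_a = max(acts, key=lambda x: R[(s, x)])
        # epsilon-greedy exploration
        a = random.choice(acts) if random.random() < epsilon else greedy_a

        s_next, r = env.step(a)
        best_current = R[(s, greedy_a)]
        best_next = max(R[(s_next, x)] for x in env.actions(s_next))

        # Relative-value update: no discount factor; the average-reward
        # estimate rho is subtracted from the immediate reward instead.
        R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])

        # rho is updated only on greedy steps, keeping its estimate
        # independent of exploratory actions and of the relative values.
        if a == greedy_a:
            rho += alpha * (r - rho + best_next - best_current)

        s = s_next

    return R, rho
```

In contrast to Q-learning, no discount factor appears anywhere in these updates; the subtraction of rho plays the role that discounting plays in the better-studied framework.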