Reinforcement Learning Algorithms for Average-Payoff Markovian Decision Processes

Reinforcement learning (RL) has become a central paradigm for solving learning-control problems in robotics and artificial intelligence. RL researchers have focussed almost exclusively on problems where the controller has to maximize the discounted sum of payoffs. However, as emphasized by Schwartz (1993), in many problems, e.g., those for which the optimal behavior is a limit cycle, it is more natural and computationally advantageous to formulate tasks so that the controller's objective is to maximize the average payoff received per time step. In this paper I derive new average-payoff RL algorithms as stochastic approximation methods for solving the system of equations associated with the policy evaluation and optimal control questions in average-payoff RL tasks. These algorithms are analogous to the popular TD and Q-learning algorithms already developed for the discounted-payoff case. One of the algorithms derived here is a significant variation of Schwartz's R-learning algorithm. Preliminary empirical results are presented to validate these new algorithms.

[1]  R. Bellman Dynamic programming. , 1957, Science.

[2]  Dimitri Bertsekas,et al.  Distributed dynamic programming , 1981, 1981 20th IEEE Conference on Decision and Control including the Symposium on Adaptive Processes.

[3]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[4]  Dimitri P. Bertsekas,et al.  Dynamic Programming: Deterministic and Stochastic Models , 1987 .

[5]  C. Watkins Learning from delayed rewards , 1989 .

[6]  A. Jalali,et al.  Adaptive control of Markov chains with local updates , 1990 .

[7]  Anton Schwartz,et al.  A Reinforcement Learning Method for Maximizing Undiscounted Rewards , 1993, ICML.

[8]  Satinder Singh,et al.  Learning to Solve Markovian Decision Processes , 1993 .

[9]  Andrew G. Barto,et al.  Learning to Act Using Real-Time Dynamic Programming , 1995, Artif. Intell..

[10]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[11]  B. Pasik-Duncan,et al.  Adaptive Control , 1996, IEEE Control Systems.