Paper Information - An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies

Average-reward reinforcement learning (ARL) is an undiscounted optimality framework that is generally applicable to a broad range of control tasks. ARL computes gain-optimal control policies that maximize the expected payoff per step. However, gain optimality has some intrinsic limitations as an optimality criterion, since, for example, it cannot distinguish between different policies that all reach an absorbing goal state but incur varying costs. A more selective criterion is bias optimality, which can filter gain-optimal policies to select those that reach absorbing goals with the minimum cost. While several ARL algorithms for computing gain-optimal policies have been proposed, none of these algorithms can guarantee bias optimality, since this requires solving at least two nested optimality equations. In this paper, we describe a novel model-based ARL algorithm for computing bias-optimal policies. We test the proposed algorithm using an admission control queuing system, and show that it is able to utilize the queue much more efficiently than a gain-optimal method by learning bias-optimal policies.
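The distinction between gain and bias can be illustrated with a small sketch. The toy chain below is hypothetical and not taken from the paper: two fixed policies both reach an absorbing goal state with zero reward, paying a cost of 1 per step along the way. The helper names `gain_and_bias` and `deterministic_chain` are assumptions made for this illustration, and the code only evaluates fixed policies; it is not the model-based learning algorithm the abstract describes.

```python
def deterministic_chain(next_state):
    """Build a transition matrix where state s always moves to next_state[s]."""
    n = len(next_state)
    return [[1.0 if t == next_state[s] else 0.0 for t in range(n)]
            for s in range(n)]


def gain_and_bias(P, r, start, horizon=1000):
    """Approximate the gain and bias of a fixed policy from one start state.

    P[s][t] is the probability of moving from s to t under the policy and
    r[s] is the expected one-step reward in s. Assumes an aperiodic chain,
    so the expected reward at a late step approximates the gain.
    """
    n = len(r)
    dist = [0.0] * n
    dist[start] = 1.0
    expected_rewards = []
    for _ in range(horizon + 1):
        # Expected reward collected at this step under the current distribution.
        expected_rewards.append(sum(dist[s] * r[s] for s in range(n)))
        # Push the state distribution one step forward.
        dist = [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]
    gain = expected_rewards[-1]
    # Bias: accumulated difference between per-step reward and the gain.
    bias = sum(x - gain for x in expected_rewards[:-1])
    return gain, bias


# States 0..4 cost 1 per step; state 5 is the absorbing goal with reward 0.
rewards = [-1.0, -1.0, -1.0, -1.0, -1.0, 0.0]
short_policy = deterministic_chain([4, 2, 3, 4, 5, 5])  # 0 -> 4 -> goal
long_policy = deterministic_chain([1, 2, 3, 4, 5, 5])   # 0 -> 1 -> 2 -> 3 -> 4 -> goal

gain_short, bias_short = gain_and_bias(short_policy, rewards, start=0)
gain_long, bias_long = gain_and_bias(long_policy, rewards, start=0)
print(gain_short, bias_short)
print(gain_long, bias_long)
```

In this sketch both policies have the same gain, so a gain-optimal criterion cannot separate them, while the bias from the start state is higher for the shorter route. That is the kind of difference bias optimality is meant to capture.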

S. Mahadevan
