Efficient Average Reward Reinforcement Learning Using Constant Shifting Values

There are two classes of average reward reinforcement learning (RL) algorithms: model-based ones that explicitly maintain MDP models and model-free ones that do not learn such models. Though model-free algorithms are known to be more efficient, they often cannot converge to optimal policies due to the perturbation of parameters. In this paper, a novel model-free algorithm is proposed, which makes use of constant shifting values (CSVs) estimated from prior knowledge. To encourage exploration during the learning process, the algorithm constantly subtracts the CSV from the rewards. A terminating condition is proposed to handle the unboundedness of Q-values caused by such subtraction. The convergence of the proposed algorithm is proved under very mild assumptions. Furthermore, linear function approximation is investigated to generalize our method to handle large-scale tasks. Extensive experiments on representative MDPs and the popular game Tetris show that the proposed algorithms significantly outperform the state-of-the-art ones.
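For concreteness, the sketch below illustrates the general idea described in the abstract: a tabular Q-learning loop that subtracts a constant shifting value from every observed reward and stops once Q-values grow too large. This is not the authors' exact algorithm; the environment interface, the CSV `rho_c`, and the bound `q_bound` standing in for the paper's terminating condition are all illustrative assumptions.

```python
import numpy as np

def csv_q_learning(env, rho_c, n_states, n_actions,
                   alpha=0.1, epsilon=0.1, q_bound=1e6, max_steps=100_000):
    """Sketch of average-reward Q-learning with a constant shifting value (CSV).

    rho_c   -- constant shifting value assumed to be estimated from prior knowledge
    q_bound -- hypothetical threshold standing in for the paper's terminating
               condition, used here to stop once Q-values grow past a fixed magnitude
    """
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy action selection to encourage exploration
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # subtract the CSV from the observed reward before the TD update
        td_target = (r - rho_c) + np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        # Q-values may grow without bound under this subtraction, so stop
        # once they exceed the (assumed) threshold
        if np.abs(Q).max() > q_bound:
            break
        s = env.reset() if done else s_next
    return Q
```

The update mirrors R-learning-style average-reward methods, except that the shifting term is a fixed constant rather than a learned estimate of the gain, which is what makes the terminating condition necessary.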
