Average-Reward Reinforcement Learning with Trust Region Methods

Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. First, we develop a unified trust region theory covering both the discounted and average criteria; for the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach with the average criterion, and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted on the continuous control benchmark MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.
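For concreteness, the two criteria contrasted above can be written as follows (standard textbook definitions; the notation here is ours, not taken from the paper):

$$J_\gamma(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad \rho(\pi) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{N-1} r_t\right],$$

where $\gamma \in (0,1)$ is the discount factor. The discounted objective $J_\gamma(\pi)$ geometrically down-weights future rewards, whereas the long-run average objective $\rho(\pi)$ weights all future rewards equally, which is why the paper targets it for engineering tasks with no natural discounting.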
