Average-Reward Off-Policy Policy Evaluation with Function Approximation

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, together with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, both in a simple domain and in challenging robot simulation tasks.
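To make the two estimation targets concrete, the sketch below shows the standard off-policy differential semi-gradient TD(0) update with linear features, which maintains exactly the quantities named in the abstract: a reward-rate estimate and a weight vector for the differential value function. This is the naive bootstrapping baseline whose possible divergence under the deadly triad motivates Gradient-TD-style corrections; it is not the paper's proposed algorithm, and the function name, data format, and step sizes are illustrative assumptions.

import numpy as np

def differential_td_sketch(transitions, num_features, alpha=0.01, beta=0.01):
    # transitions: iterable of (x, r, x_next, rho) tuples (hypothetical format), where
    # x and x_next are feature vectors of the current and next state, r is the observed
    # reward, and rho = pi(a|s) / b(a|s) is the importance-sampling ratio.
    w = np.zeros(num_features)   # linear weights: v_hat(s) = w . x(s), the differential value estimate
    r_bar = 0.0                  # running estimate of the reward rate
    for x, r, x_next, rho in transitions:
        # differential TD error: rewards are centered by the reward-rate estimate instead of discounted
        delta = r - r_bar + w @ x_next - w @ x
        w = w + alpha * rho * delta * x       # semi-gradient update of the differential value weights
        r_bar = r_bar + beta * rho * delta    # update of the reward-rate estimate
    return w, r_bar

Under off-policy sampling with function approximation, this semi-gradient update can diverge; avoiding that failure mode without resorting to density-ratio estimation is precisely what the proposed algorithms address.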

[1] Gabriel Dulac-Arnold et al. Challenges of Real-World Reinforcement Learning, 2019, ArXiv.

[2] Qiang Liu et al. Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning, 2020, ICLR.

[3] Shalabh Bhatnagar et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML.

[4] Yifei Ma et al. Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling, 2019, NeurIPS.

[5] Le Song et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation, 2017, ICML.

[6] Bo Liu et al. Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces, 2014, ArXiv.

[7] Qiang Liu et al. Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation, 2019, ICLR.

[8] Peter L. Bartlett et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction, 2019, ICML.

[9] J. Zico Kolter et al. The Fixed Points of Off-Policy TD, 2011, NIPS.

[10] Lantao Yu et al. MOPO: Model-based Offline Policy Optimization, 2020, NeurIPS.

[11] Huizhen Yu et al. On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning, 2017, ArXiv.

[12] Qiang Liu et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.

[13] Alborz Geramifard et al. Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping, 2008, UAI.

[14] Hengshuai Yao et al. Provably Convergent Two-Timescale Off-Policy Actor-Critic with Function Approximation, 2019, ICML.

[15] Marc G. Bellemare et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift, 2019, AAAI.

[16] Martin L. Puterman et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[17] Philip S. Thomas et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.

[18] Martha White et al. Planning with Expectation Models, 2019, IJCAI.

[19] Yao Liu et al. Combining Parametric and Nonparametric Models for Off-Policy Evaluation, 2019, ICML.

[20] Shie Mannor et al. Consistent On-Line Off-Policy Evaluation, 2017, ICML.

[21] Shane Legg et al. Human-level control through deep reinforcement learning, 2015, Nature.

[22] Richard S. Sutton et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[23] Thorsten Joachims et al. MOReL: Model-Based Offline Reinforcement Learning, 2020, NeurIPS.

[24] Yao Liu et al. Representation Balancing MDPs for Off-Policy Policy Evaluation, 2018, NeurIPS.

[25] Shalabh Bhatnagar et al. A Convergent Off-Policy Temporal Difference Algorithm, 2019, ECAI.

[26] Panos M. Pardalos et al. Approximate dynamic programming: solving the curses of dimensionality, 2009, Optim. Methods Softw.

[27] Ali H. Sayed et al. Distributed Policy Evaluation Under Multiple Behavior Strategies, 2013, IEEE Transactions on Automatic Control.

[28] Peter Norvig et al. Artificial Intelligence: A Modern Approach, 1995.

[29] Lihong Li. A perspective on off-policy evaluation in reinforcement learning, 2019, Frontiers of Computer Science.

[30] Philip Bachman et al. Deep Reinforcement Learning that Matters, 2017, AAAI.

[31] Dilan Görür et al. A maximum-entropy approach to off-policy evaluation in average-reward MDPs, 2020, NeurIPS.

[32] S. Whiteson et al. GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values, 2020, ICML.

[33] Dimitri P. Bertsekas et al. Convergence Results for Some Temporal Difference Methods Based on Least Squares, 2009, IEEE Transactions on Automatic Control.

[34] Masatoshi Uehara et al. Minimax Weight and Q-Function Learning for Off-Policy Evaluation, 2019, ICML.

[35] Sergey Levine et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models, 2018, NeurIPS.

[36] Nan Jiang et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.

[37] Richard S. Sutton et al. Learning and Planning in Average-Reward Markov Decision Processes, 2020, ICML.

[38] Justin A. Boyan et al. Least-Squares Temporal Difference Learning, 1999, ICML.

[39] Ilya Kostrikov et al. AlgaeDICE: Policy Gradient from Arbitrary Experience, 2019, ArXiv.

[40] Martha White et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning, 2015, J. Mach. Learn. Res.

[41] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[42] Bo Dai et al. GenDICE: Generalized Offline Estimation of Stationary Values, 2020, ICLR.

[43] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008, Texts and Readings in Mathematics.

[44] Herke van Hoof et al. Addressing Function Approximation Error in Actor-Critic Methods, 2018, ICML.

[45] Richard S. Sutton et al. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation, 2008, NIPS.

[46] Marek Petrik et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms, 2015, UAI.

[47] Lihong Li et al. Stochastic Variance Reduction Methods for Policy Evaluation, 2017, ICML.

[48] Bo Liu et al. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization, 2018, NeurIPS.

[49] Leemon C. Baird et al. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[50] John N. Tsitsiklis et al. Average cost temporal-difference learning, 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[51] Geoffrey E. Hinton et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[52] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[53] Ronald A. Howard et al. Dynamic Programming and Markov Processes, 1960.

[54] Donald E. Kirk et al. Optimal control theory: an introduction, 1970.

[55] Bo Dai et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections, 2019, NeurIPS.

[56] Shimon Whiteson et al. Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning, 2020, AAAI.

[57] Emma Brunskill et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.