Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation

Infinite-horizon off-policy policy evaluation is a highly challenging task due to the excessively large variance of typical importance sampling (IS) estimators. Recently, Liu et al. (2018a) proposed an approach that significantly reduces the variance of infinite-horizon off-policy evaluation by estimating the stationary density ratio, but at the cost of introducing potentially high bias due to error in the density ratio estimate. In this paper, we develop a bias-reduced augmentation of their method that can take advantage of a learned value function to obtain higher accuracy. Our method is doubly robust in that the bias vanishes when either the density ratio or the value function estimate is perfect; more generally, the bias is reduced whenever either of them is accurate. Both theoretical and empirical results show that our method yields significant advantages over previous methods.
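
To make the doubly robust property concrete, the following is a schematic of such a bias-reduced estimator for the normalized discounted return (a sketch with notation introduced here for illustration; not necessarily the paper's exact formula):

\[
\hat{R}_{\mathrm{DR}}
= (1-\gamma)\,\mathbb{E}_{s_0 \sim \mu_0}\big[\hat{V}(s_0)\big]
+ \mathbb{E}_{(s,a,r,s') \sim d_{\pi_0}}\!\Big[\hat{w}(s)\,\tfrac{\pi(a\mid s)}{\pi_0(a\mid s)}\big(r + \gamma \hat{V}(s') - \hat{V}(s)\big)\Big],
\]

where \(\hat{w}(s) \approx d_{\pi}(s)/d_{\pi_0}(s)\) is the estimated stationary density ratio and \(\hat{V}\) is a learned value function for the target policy \(\pi\). If \(\hat{V} = V^{\pi}\), the temporal-difference residual has zero conditional mean once actions are reweighted by \(\pi/\pi_0\), so the estimator is unbiased for any \(\hat{w}\); if \(\hat{w}\) is exact, the \(\hat{V}\) terms cancel by stationarity of \(d_{\pi}\) and the estimator is again unbiased. Either condition alone removes the bias, which is the doubly robust property described above.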

[1] Marc G. Bellemare, et al. Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift, 2019, AAAI.

[2] Lihong Li, et al. Learning from Logged Implicit Exploration Data, 2010, NIPS.

[3] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[4] Peter W. Glynn, et al. Stochastic Simulation: Algorithms and Analysis, 2007.

[5] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[6] Qiang Liu, et al. Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation, 2018, NeurIPS.

[7] Bo Dai, et al. DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections, 2019, NeurIPS.

[8] Masatoshi Uehara, et al. Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes, 2019, ArXiv.

[9] Nan Jiang, et al. Doubly Robust Off-policy Value Evaluation for Reinforcement Learning, 2015, ICML.

[10] Joaquin Quiñonero Candela, et al. Counterfactual reasoning and learning systems: the example of computational advertising, 2013, J. Mach. Learn. Res.

[11] Qiang Liu, et al. A Kernel Loss for Solving the Bellman Equation, 2019, NeurIPS.

[12] Le Song, et al. Learning from Conditional Distributions via Dual Kernel Embeddings, 2016, ArXiv.

[13] J. M. Robins, et al. Marginal Mean Models for Dynamic Regimes, 2001, Journal of the American Statistical Association.

[14] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[15] Philip S. Thomas, et al. Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation, 2017, NIPS.

[16] Moshe Ben-Horim, et al. A linear programming approach, 1977.

[17] Lihong Li, et al. Toward Minimax Off-policy Value Estimation, 2015, AISTATS.

[18] Jun S. Liu, et al. Monte Carlo strategies in scientific computing, 2001.

[19] Mehrdad Farajtabar, et al. More Robust Doubly Robust Off-policy Evaluation, 2018, ICML.

[20] Le Song, et al. Smoothed Dual Embedding Control, 2017, ArXiv.

[21] Philip S. Thomas, et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning, 2016, ICML.

[22] John Langford, et al. Doubly Robust Policy Evaluation and Learning, 2011, ICML.

[23] Wei Chu, et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms, 2010, WSDM '11.

[24] Mengdi Wang, et al. Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning, 2016, ArXiv.

[25] Shie Mannor, et al. Consistent On-Line Off-Policy Evaluation, 2017, ICML.

[26] Nan Jiang, et al. Doubly Robust Off-policy Evaluation for Reinforcement Learning, 2015, ArXiv.

[27] Peter Stone, et al. Importance Sampling Policy Evaluation with an Estimated Behavior Policy, 2018, ICML.

[28] Miroslav Dudík, et al. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits, 2016, ICML.

[29] Louis Wehenkel, et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories, 2013, Ann. Oper. Res.

[30] Le Song, et al. SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation, 2017, ICML.

[31] Benjamin Van Roy, et al. The Linear Programming Approach to Approximate Dynamic Programming, 2003, Oper. Res.

[32] Masatoshi Uehara, et al. Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes, 2019, J. Mach. Learn. Res.

[33] Philip S. Thomas, et al. Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing, 2017, AAAI.

[34] Marek Petrik, et al. Finite-Sample Analysis of Proximal Gradient TD Algorithms, 2015, UAI.

[35] Yifei Ma, et al. Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling, 2019, NeurIPS.

[36] Yao Liu, et al. Representation Balancing MDPs for Off-Policy Policy Evaluation, 2019.

[37] Emma Brunskill, et al. Off-Policy Policy Gradient with State Distribution Correction, 2019, UAI.