Importance Sampling Policy Evaluation with an Estimated Behavior Policy

We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of estimating the expected return of one policy using data generated by a different behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns under the two policies. In this paper, we study importance sampling with an estimated behavior policy, where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or with a behavior policy estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for sampling error in the action space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
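
The following is a minimal sketch of the idea, not the paper's implementation. It assumes a tabular setting, a count-based (maximum-likelihood) Markovian estimate of the behavior policy, and ordinary trajectory-wise importance sampling. The function names, the array layout `policy[state, action]`, and the toy one-step data are all hypothetical choices made for illustration.

```python
import numpy as np


def estimate_behavior_policy(trajectories, n_states, n_actions):
    """Count-based estimate of a Markovian behavior policy (illustrative assumption).

    Each trajectory is a list of (state, action, reward) tuples.
    """
    counts = np.zeros((n_states, n_actions))
    for traj in trajectories:
        for s, a, _ in traj:
            counts[s, a] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # States never visited in the data fall back to a uniform distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_actions)


def importance_sampling(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of the return of pi_e."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e[s, a] / pi_b[s, a]  # cumulative likelihood ratio
            ret += gamma**t * r                # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))


# Toy one-state, one-step problem: action 0 pays 1, action 1 pays 0.
pi_e = np.array([[0.5, 0.5]])       # policy to evaluate; its true value is 0.5
pi_b_true = np.array([[0.5, 0.5]])  # true behavior policy
# Four sampled trajectories in which action 0 happens to be over-represented.
data = [[(0, 0, 1.0)], [(0, 0, 1.0)], [(0, 0, 1.0)], [(0, 1, 0.0)]]

pi_b_hat = estimate_behavior_policy(data, n_states=1, n_actions=2)
is_true = importance_sampling(data, pi_e, pi_b_true)
is_hat = importance_sampling(data, pi_e, pi_b_hat)
print(f"IS with true behavior policy:      {is_true:.3f}")
print(f"IS with estimated behavior policy: {is_hat:.3f}")
```

On this toy data the true behavior policy gives an estimate of 0.75, because action 0 was drawn three times out of four rather than the expected two. Re-weighting by the empirical action frequencies (0.75 and 0.25) removes that sampling error and recovers 0.5. Rewards here are deterministic, so the correction is exact; with stochastic rewards or longer horizons it would only be partial.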
