State Relevance for Off-Policy Evaluation

Importance sampling-based estimators for off-policy evaluation (OPE) are valued for their simplicity, unbiasedness, and reliance on relatively few assumptions. However, the variance of these estimators is often high, especially when trajectories are of different lengths. In this work, we introduce Omitting-States-Irrelevant-to-Return Importance Sampling (OSIRIS), an estimator that reduces variance by strategically omitting likelihood ratios associated with certain states. We formalize the conditions under which OSIRIS is unbiased and has lower variance than ordinary importance sampling, and we demonstrate these properties empirically.
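
As a rough illustration of the core idea (not the paper's exact estimator), the sketch below contrasts ordinary trajectory-wise importance sampling with an OSIRIS-style variant that simply skips the per-step likelihood ratio at states judged irrelevant to the return. The relevance test is_relevant, the policy interfaces pi_e and pi_b, and the use of undiscounted returns are all illustrative assumptions, not definitions taken from the paper.

    import numpy as np

    def osiris_style_estimate(trajectories, pi_e, pi_b, is_relevant):
        # trajectories: list of trajectories, each a list of (state, action, reward)
        #               tuples collected under the behavior policy.
        # pi_e(a, s), pi_b(a, s): action probabilities under the evaluation and
        #               behavior policies (assumed callable interfaces).
        # is_relevant(s): placeholder state-relevance test; the paper's actual,
        #               data-driven criterion is not reproduced here.
        estimates = []
        for traj in trajectories:
            ratio = 1.0
            undiscounted_return = 0.0
            for (s, a, r) in traj:
                # Ordinary importance sampling multiplies in every per-step
                # likelihood ratio; the OSIRIS idea is to omit the ratios at
                # states deemed irrelevant to the return, which shrinks the
                # variance of the product.
                if is_relevant(s):
                    ratio *= pi_e(a, s) / pi_b(a, s)
                undiscounted_return += r
            estimates.append(ratio * undiscounted_return)
        return float(np.mean(estimates))

Setting is_relevant = lambda s: True recovers ordinary trajectory-wise importance sampling, which is a convenient sanity check when experimenting with different relevance criteria.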
