Marginalized Off-Policy Evaluation for Reinforcement Learning

Off-policy evaluation is concerned with evaluating the performance of a target policy using historical data collected by different behavior policies. In real-world applications of reinforcement learning, executing a policy can be costly and dangerous, so off-policy evaluation is often a crucial step. Existing methods for off-policy evaluation are mainly built on modeling the problem as a discrete tree Markov decision process (MDP), and they suffer from high variance caused by the cumulative product of importance weights. In this paper, we propose a new off-policy evaluation approach based directly on discrete directed acyclic graph (DAG) MDPs. Our approach can be applied to most existing off-policy evaluation estimators without modification and can reduce their variance dramatically. We also provide a theoretical analysis of our approach and evaluate it empirically.
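To make the variance issue concrete, the following is an illustrative sketch (not taken from the abstract) contrasting the standard trajectory-wise importance-sampling estimator, whose weight is a cumulative product over the horizon, with a marginalized estimator that replaces the product by a per-step ratio of state-visitation distributions. The symbols $d_t^{\pi}$ and $d_t^{\mu}$ for the time-$t$ state distributions under the target policy $\pi$ and the behavior policy $\mu$, and the estimates $\hat d_t^{\pi}, \hat d_t^{\mu}$, are assumptions introduced here for illustration; the paper's actual estimator may differ.

\[
\hat V_{\mathrm{IS}}
= \frac{1}{n}\sum_{i=1}^{n}
\left(\prod_{t=0}^{T-1}\frac{\pi\!\left(a_t^{(i)}\mid s_t^{(i)}\right)}{\mu\!\left(a_t^{(i)}\mid s_t^{(i)}\right)}\right)
\sum_{t=0}^{T-1}\gamma^{t} r_t^{(i)},
\qquad
\hat V_{\mathrm{MIS}}
= \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\gamma^{t}\,
\frac{\hat d_t^{\pi}\!\left(s_t^{(i)}\right)}{\hat d_t^{\mu}\!\left(s_t^{(i)}\right)}\,
\frac{\pi\!\left(a_t^{(i)}\mid s_t^{(i)}\right)}{\mu\!\left(a_t^{(i)}\mid s_t^{(i)}\right)}\, r_t^{(i)}.
\]

The left-hand estimator multiplies $T$ importance ratios per trajectory, so its variance can grow exponentially with the horizon; the right-hand, marginalized form weights each step by a single state-distribution ratio, which is the general mechanism by which marginalization avoids the cumulative product.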
