Conditional Importance Sampling for Off-Policy Learning

The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives on, and understanding of, existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.
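
As a point of reference for the framework described above (a minimal illustration, not the paper's specific construction), the identities below recall standard importance sampling for off-policy evaluation and the tower-property argument that motivates replacing a ratio with a conditional expectation. The conditioning variable $Z$ is a placeholder for whatever statistic a particular algorithm conditions on.

For a target policy $\pi$, behaviour policy $\mu$, and per-step ratio $\rho_t = \pi(A_t \mid S_t) / \mu(A_t \mid S_t)$, off-policy corrections rest on identities of the form
\[
\mathbb{E}_\mu\left[\rho_t\, X\right] \;=\; \mathbb{E}_\mu\left[\mathbb{E}_\mu[\rho_t \mid Z]\, X\right],
\]
which holds by the tower rule whenever $X$ is measurable with respect to $Z$. By the conditional Jensen inequality, $\mathbb{E}_\mu\big[(\mathbb{E}_\mu[\rho_t \mid Z]\, X)^2\big] \le \mathbb{E}_\mu\big[(\rho_t\, X)^2\big]$, so replacing a ratio with its conditional expectation preserves the corrected expectation while never increasing the second moment; different choices of conditioning variable then yield different estimators.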
