Conditional Importance Sampling for Off-Policy Learning

The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives on, and understanding of, existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.
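
As a point of reference for the framework described above (a minimal illustration, not the paper's specific construction), the identities below recall standard importance sampling for off-policy evaluation and the tower-property argument that motivates replacing a ratio with a conditional expectation. The conditioning variable $Z$ is a placeholder for whatever statistic a particular algorithm conditions on.

For a target policy $\pi$, behaviour policy $\mu$, and per-step ratio $\rho_t = \pi(A_t \mid S_t) / \mu(A_t \mid S_t)$, off-policy corrections rest on identities of the form
\[
\mathbb{E}_\mu\left[\rho_t\, X\right] \;=\; \mathbb{E}_\mu\left[\mathbb{E}_\mu[\rho_t \mid Z]\, X\right],
\]
which holds by the tower rule whenever $X$ is measurable with respect to $Z$. By the conditional Jensen inequality, $\mathbb{E}_\mu\big[(\mathbb{E}_\mu[\rho_t \mid Z]\, X)^2\big] \le \mathbb{E}_\mu\big[(\rho_t\, X)^2\big]$, so replacing a ratio with its conditional expectation preserves the corrected expectation while never increasing the second moment; different choices of conditioning variable then yield different estimators.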
