A Unified View of Multi-step Temporal Difference Learning