[1] R. Pemantle, et al. Nonconvergence to Unstable Points in Urn Models and Stochastic Approximations, 1990.
[2] Leemon C. Baird, et al. Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.
[3] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.
[4] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.
[5] R. Sutton, et al. Off-policy Learning with Recognizers, 2000.
[6] Kenji Doya, et al. Reinforcement Learning in Continuous Time and Space, 2000, Neural Computation.
[7] John N. Tsitsiklis, et al. Simulation-based optimization of Markov reward processes, 2001, IEEE Trans. Autom. Control.
[8] Leslie Pack Kaelbling, et al. Effective reinforcement learning for mobile robots, 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation.
[9] Michail G. Lagoudakis, et al. Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.
[10] Peter Dayan, et al. Q-learning, 1992, Machine Learning.
[11] Hajime Kimura, et al. Reinforcement Learning with an Off-Policy Actor-Critic Algorithm, 2004.
[12] Steven J. Bradtke, et al. Linear Least-Squares Algorithms for Temporal Difference Learning, 2004, Machine Learning.
[13] Doina Precup, et al. Off-policy Learning with Options and Recognizers, 2005, NIPS.
[14] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[15] Stefan Schaal, et al. Natural Actor-Critic, 2003, Neurocomputing.
[16] R. Sutton, et al. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation, 2008, NIPS.
[17] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008.
[18] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009, Automatica.
[19] Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML.
[20] Shalabh Bhatnagar, et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, 2009, NIPS.
[21] Shalabh Bhatnagar, et al. Toward Off-Policy Learning Control with Function Approximation, 2010, ICML.
[22] Richard S. Sutton, et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces, 2010, Artificial General Intelligence.
[23] Michael Delp. Experiments in Off-Policy Reinforcement Learning with the GQ(λ) Algorithm, 2011.
[24] Patrick M. Pilarski, et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction, 2011, AAMAS.
[25] R. Sutton, et al. Gradient temporal-difference learning algorithms, 2011.