[1] Daniele Calandriello, et al. Safe Policy Iteration, 2013, ICML.
[2] Nando de Freitas, et al. Sample Efficient Actor-Critic with Experience Replay, 2016, ICLR.
[3] R. Sutton, et al. Gradient temporal-difference learning algorithms, 2011.
[4] Sham M. Kakade, et al. A Natural Policy Gradient, 2001, NIPS.
[5] Richard E. Turner, et al. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning, 2017, NIPS.
[6] Marc G. Bellemare, et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.
[7] Philip S. Thomas, et al. High-Confidence Off-Policy Evaluation, 2015, AAAI.
[8] Long Ji Lin, et al. Self-improving reactive agents based on reinforcement learning, planning and teaching, 1992, Machine Learning.
[9] Stephen J. Wright, et al. A Fast and Reliable Policy Improvement Algorithm, 2016, AISTATS.
[10] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.
[11] Jun Morimoto, et al. Trial and Error: Using Previous Experiences as Simulation Models in Humanoid Motor Learning, 2016, IEEE Robotics & Automation Magazine.
[12] Marc G. Bellemare, et al. Q($\lambda$) with Off-Policy Corrections, 2016.
[13] L. V. D. Heyden, et al. Perturbation bounds for the stationary probabilities of a finite Markov chain, 1984.
[14] Doina Precup, et al. Eligibility Traces for Off-Policy Policy Evaluation, 2000, ICML.
[15] Martha White, et al. Linear Off-Policy Actor-Critic, 2012, ICML.
[16] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.
[17] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.
[18] Stefan Schaal, et al. Reinforcement Learning for Humanoid Robotics, 2003.
[19] Paul Wagner, et al. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration, 2011, NIPS.
[20] Paul Wagner, et al. Policy oscillation is overshooting, 2014, Neural Networks.
[21] Sergey Levine, et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2015, ICLR.
[22] Jun Morimoto, et al. Efficient Sample Reuse in Policy Gradients with Parameter-Based Exploration, 2012, Neural Computation.
[23] D. Bertsekas. Approximate policy iteration: a survey and some new methods, 2011.
[24] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[25] Jeff G. Schneider, et al. Covariant policy search, 2003, IJCAI.
[26] Philip S. Thomas, et al. High Confidence Policy Improvement, 2015, ICML.
[27] Guy Lever, et al. Deterministic Policy Gradient Algorithms, 2014, ICML.
[28] Sergey Levine, et al. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic, 2016, ICLR.