[1] Philip Thomas. Bias in Natural Actor-Critic Algorithms , 2014, ICML.
[2] S. Resnick. A Probability Path , 1999.
[3] Long Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching , 1992, Machine Learning.
[4] Leonid Peshkin,et al. Learning from Scarce Experience , 2002, ICML.
[5] Richard E. Turner,et al. Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning , 2017, NIPS.
[6] Herke van Hoof,et al. Addressing Function Approximation Error in Actor-Critic Methods , 2018, ICML.
[7] Martha White,et al. Linear Off-Policy Actor-Critic , 2012, ICML.
[8] Tzuu-Hseng S. Li,et al. Backward Q-learning: The combination of Sarsa algorithm and Q-learning , 2013, Eng. Appl. Artif. Intell.
[9] Sergey Levine,et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.
[10] Marc G. Bellemare,et al. A Distributional Perspective on Reinforcement Learning , 2017, ICML.
[11] John Langford,et al. Doubly Robust Policy Evaluation and Learning , 2011, ICML.
[12] Yuval Tassa,et al. MuJoCo: A physics engine for model-based control , 2012, IROS.
[13] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.
[14] Koray Kavukcuoglu,et al. Combining policy gradient and Q-learning , 2016, ICLR.
[15] Sergey Levine,et al. PLATO: Policy learning using adaptive trajectory optimization , 2016, ICRA.
[16] Sergey Levine,et al. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic , 2016, ICLR.
[17] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.
[18] Koray Kavukcuoglu,et al. PGQ: Combining policy gradient and Q-learning , 2016, ArXiv.
[19] Satinder Singh,et al. Self-Imitation Learning , 2018, ICML.
[20] Doina Precup,et al. Policy Gradient Methods for Off-policy Control , 2015, ArXiv.
[21] Wojciech Zaremba,et al. OpenAI Gym , 2016, ArXiv.
[22] Yuval Tassa,et al. Continuous control with deep reinforcement learning , 2015, ICLR.
[23] Neil D. Lawrence,et al. Dataset Shift in Machine Learning , 2009.
[24] Guy Lever,et al. Deterministic Policy Gradient Algorithms , 2014, ICML.
[25] Dale Schuurmans,et al. Bridging the Gap Between Value and Policy Based Reinforcement Learning , 2017, NIPS.
[26] Nando de Freitas,et al. Sequential Monte Carlo Methods in Practice , 2001, Statistics for Engineering and Information Science.
[27] Alec Radford,et al. Proximal Policy Optimization Algorithms , 2017, ArXiv.
[28] Joseph Kang,et al. Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data , 2007, Statistical Science.
[29] Pieter Abbeel,et al. On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient , 2010, NIPS.
[30] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.
[31] C. Robert,et al. Rethinking the Effective Sample Size , 2018, International Statistical Review.
[32] Marc G. Bellemare,et al. Safe and Efficient Off-Policy Reinforcement Learning , 2016, NIPS.
[33] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.
[34] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 1992, Machine Learning.
[35] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.
[36] Philip Bachman,et al. Deep Reinforcement Learning that Matters , 2017, AAAI.
[37] Marlos C. Machado,et al. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents , 2017, J. Artif. Intell. Res.
[38] J. Robins,et al. Doubly Robust Estimation in Missing Data and Causal Inference Models , 2005, Biometrics.
[39] Sergey Levine,et al. Guided Policy Search , 2013, ICML.
[40] Marc G. Bellemare,et al. The Arcade Learning Environment: An Evaluation Platform for General Agents , 2012, J. Artif. Intell. Res.
[41] Tom Schaul,et al. Unifying Count-Based Exploration and Intrinsic Motivation , 2016, NIPS.
[42] C. Robert,et al. Monte Carlo Statistical Methods , 1999, Springer Texts in Statistics.
[43] Xi-Ren Cao,et al. A basic formula for online policy gradient algorithms , 2005, IEEE Transactions on Automatic Control.
[44] David Silver,et al. Deep Reinforcement Learning with Double Q-Learning , 2015, AAAI.
[45] Dengyong Zhou,et al. Action-dependent Control Variates for Policy Optimization via Stein's Identity , 2017, ICLR.
[46] Matthew Hausknecht and Peter Stone. On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning , 2016.
[47] Nando de Freitas,et al. Sample Efficient Actor-Critic with Experience Replay , 2016, ICLR.
[48] John N. Tsitsiklis,et al. Actor-Critic Algorithms , 1999, NIPS.
[49] Sergey Levine,et al. The Mirage of Action-Dependent Baselines in Reinforcement Learning , 2018, ICML.
[50] Larry Rudolph,et al. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? , 2018, ArXiv.