Bridging the Gap Between Value and Policy Based Reinforcement Learning
Ofir Nachum | Mohammad Norouzi | Kelvin Xu | Dale Schuurmans
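The paper listed above develops an entropy-regularized view in which the optimal value function and optimal policy are linked by a softmax (log-sum-exp) Bellman backup and a "path consistency" identity. As a minimal illustration only — the tiny deterministic MDP, its rewards, and the variable names below are all made up for the sketch and do not come from the paper — the one-step version of that identity can be checked numerically:

```python
import numpy as np

# Hypothetical 3-state, 2-action deterministic MDP (illustration only).
# next_state[s, a] is the successor state; reward[s, a] the reward.
np.random.seed(0)
n_states, n_actions = 3, 2
next_state = np.array([[1, 2], [2, 0], [0, 1]])
reward = np.random.rand(n_states, n_actions)
gamma, tau = 0.9, 0.5  # discount and entropy temperature

# Soft value iteration: V(s) = tau * log sum_a exp((r + gamma V(s')) / tau)
V = np.zeros(n_states)
for _ in range(1000):
    Q = reward + gamma * V[next_state]             # soft Q-values
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))  # log-sum-exp backup

# The optimal policy is Boltzmann in the soft Q-values.
pi = np.exp((Q - V[:, None]) / tau)  # rows sum to 1

# One-step path consistency at the fixed point:
#   V(s) - gamma * V(s') = r(s, a) - tau * log pi(a|s)  for every (s, a).
residual = V[:, None] - gamma * V[next_state] - (reward - tau * np.log(pi))
print(np.abs(residual).max())  # ~0 once value iteration has converged
```

The point of the check is that, unlike the hard-max Bellman equation, the softmax consistency holds for *every* action under the optimal policy, which is what lets value-based and policy-based updates be derived from the same residual.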
[1] Jing Peng,et al. Function Optimization using Connectionist Reinforcement Learning Algorithms , 1991 .
[2] Mahesan Niranjan,et al. On-line Q-learning using connectionist systems , 1994 .
[3] Gerald Tesauro,et al. Temporal Difference Learning and TD-Gammon , 1995, J. Int. Comput. Games Assoc..
[4] Ben J. A. Kröse,et al. Learning from delayed rewards , 1995, Robotics Auton. Syst..
[5] Gerald Tesauro,et al. Temporal difference learning and TD-Gammon , 1995, CACM.
[6] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Two Volume Set , 1995 .
[7] Michael L. Littman,et al. Algorithms for Sequential Decision Making , 1996 .
[8] John N. Tsitsiklis,et al. Analysis of Temporal-Difference Learning with Function Approximation , 1996, NIPS.
[9] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[10] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[11] Adrian S. Lewis,et al. Convex Analysis And Nonlinear Optimization , 2000 .
[12] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.
[13] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.
[14] Sanjoy Dasgupta,et al. Off-Policy Temporal Difference Learning with Function Approximation , 2001, ICML.
[15] Peter Dayan,et al. Q-learning , 1992, Machine Learning.
[16] Nuttapong Chentanez,et al. Intrinsically Motivated Reinforcement Learning , 2004, NIPS.
[17] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.
[18] Jing Peng,et al. Incremental multi-step Q-learning , 1994, Machine Learning.
[19] Pieter Abbeel,et al. Apprenticeship learning via inverse reinforcement learning , 2004, ICML.
[20] H. Kappen. Path integrals and symmetry breaking for optimal control theory , 2005, physics/0505066.
[21] Emanuel Todorov,et al. Linearly-solvable Markov decision problems , 2006, NIPS.
[22] Csaba Szepesvári,et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path , 2006, Machine Learning.
[23] Anind K. Dey,et al. Maximum Entropy Inverse Reinforcement Learning , 2008, AAAI.
[24] J. Andrew Bagnell,et al. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy , 2010 .
[25] Jürgen Schmidhuber,et al. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010) , 2010, IEEE Transactions on Autonomous Mental Development.
[26] Yasemin Altun,et al. Relative Entropy Policy Search , 2010 .
[27] Emanuel Todorov,et al. Policy gradients in linearly-solvable MDPs , 2010, NIPS.
[28] Wei Chu,et al. A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.
[29] Eduardo F. Morales,et al. An Introduction to Reinforcement Learning , 2011 .
[30] Vicenç Gómez,et al. Dynamic Policy Programming with Function Approximation , 2011, AISTATS.
[32] Hilbert J. Kappen,et al. Dynamic policy programming , 2010, J. Mach. Learn. Res..
[33] Vicenç Gómez,et al. Optimal control as a graphical model inference problem , 2009, Machine Learning.
[34] Alex Graves,et al. Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.
[35] Jan Peters,et al. Reinforcement learning in robotics: A survey , 2013, Int. J. Robotics Res..
[36] Guy Lever,et al. Deterministic Policy Gradient Algorithms , 2014, ICML.
[37] Philip S. Thomas,et al. Personalized Ad Recommendation Systems for Life-Time Value Optimization with Guarantees , 2015, IJCAI.
[38] Sergey Levine,et al. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models , 2015, ArXiv.
[39] J. Andrew Bagnell,et al. Approximate MaxEnt Inverse Optimal Control and Its Application for Mental Simulation of Human Interactions , 2015, AAAI.
[40] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.
[41] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[42] Shane Legg,et al. Human-level control through deep reinforcement learning , 2015, Nature.
[43] Yuval Tassa,et al. Continuous control with deep reinforcement learning , 2015, ICLR.
[44] Dale Schuurmans,et al. Reward Augmented Maximum Likelihood for Neural Structured Prediction , 2016, NIPS.
[45] Kavosh Asadi,et al. A New Softmax Operator for Reinforcement Learning , 2016, ArXiv.
[46] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.
[47] Roy Fox,et al. Taming the Noise in Reinforcement Learning via Soft Updates , 2015, UAI.
[48] Stefano Ermon,et al. Generative Adversarial Imitation Learning , 2016, NIPS.
[49] Tom Schaul,et al. Dueling Network Architectures for Deep Reinforcement Learning , 2015, ICML.
[50] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.
[51] Tom Schaul,et al. Unifying Count-Based Exploration and Intrinsic Motivation , 2016, NIPS.
[52] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.
[53] Sergey Levine,et al. End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..
[54] Koray Kavukcuoglu,et al. PGQ: Combining policy gradient and Q-learning , 2016, ArXiv.
[55] Tom Schaul,et al. Prioritized Experience Replay , 2015, ICLR.
[56] Sergey Levine,et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.
[57] Marc G. Bellemare,et al. Safe and Efficient Off-Policy Reinforcement Learning , 2016, NIPS.
[58] Marc G. Bellemare,et al. The Reactor: A Sample-Efficient Actor-Critic Architecture , 2017, ArXiv.
[59] Kavosh Asadi,et al. An Alternative Softmax Operator for Reinforcement Learning , 2016, ICML.
[60] Sergey Levine,et al. Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic , 2016, ICLR.
[61] Nando de Freitas,et al. Sample Efficient Actor-Critic with Experience Replay , 2016, ICLR.
[62] Sergey Levine,et al. Reinforcement Learning with Deep Energy-Based Policies , 2017, ICML.
[63] Dale Schuurmans,et al. Improving Policy Gradient by Exploring Under-appreciated Rewards , 2016, ICLR.
[64] Pieter Abbeel,et al. Equivalence Between Policy Gradients and Soft Q-Learning , 2017, ArXiv.
[65] Sergey Levine,et al. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).
[66] Marc G. Bellemare,et al. The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning , 2017, ICLR.