Monte-Carlo Tree Search as Regularized Policy Optimization
Michal Valko | Yunhao Tang | Thomas Hubert | Ioannis Antonoglou | Jean-Bastien Grill | Rémi Munos | Florent Altché
[1] J. Andrew Bagnell,et al. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy , 2010 .
[2] Matthieu Geist,et al. A Theory of Regularized Markov Decision Processes , 2019, ICML.
[3] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 1992, Machine Learning.
[4] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.
[5] Marc G. Bellemare,et al. The Arcade Learning Environment: An Evaluation Platform for General Agents , 2012, J. Artif. Intell. Res.
[6] Yuval Tassa,et al. DeepMind Control Suite , 2018, ArXiv.
[7] David Budden,et al. Distributed Prioritized Experience Replay , 2018, ICLR.
[8] Sergey Levine,et al. Reinforcement Learning with Deep Energy-Based Policies , 2017, ICML.
[9] Demis Hassabis,et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm , 2017, ArXiv.
[10] Peter Auer,et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res.
[11] Rémi Munos,et al. Learning to Search with MCTSnets , 2018, ICML.
[12] Shimon Whiteson,et al. TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning , 2017, ICLR.
[13] Stephen P. Boyd,et al. Convex Optimization , 2004, Cambridge University Press.
[14] Yunhao Tang,et al. Discretizing Continuous Action Space for On-Policy Optimization , 2019, AAAI.
[15] Alec Radford,et al. Proximal Policy Optimization Algorithms , 2017, ArXiv.
[16] Sébastien Bubeck,et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn.
[17] Koray Kavukcuoglu,et al. Combining policy gradient and Q-learning , 2016, ICLR.
[18] Simon M. Lucas,et al. A Survey of Monte Carlo Tree Search Methods , 2012, IEEE Transactions on Computational Intelligence and AI in Games.
[19] Tom Schaul,et al. The Predictron: End-To-End Learning and Planning , 2016, ICML.
[20] Navdeep Jaitly,et al. Discrete Sequential Prediction of Continuous Actions for Deep RL , 2017, ArXiv.
[21] Demis Hassabis,et al. Mastering the game of Go without human knowledge , 2017, Nature.
[22] Roy Fox,et al. Taming the Noise in Reinforcement Learning via Soft Updates , 2015, UAI.
[23] Richard Evans,et al. Deep Reinforcement Learning in Large Discrete Action Spaces , 2015, arXiv:1512.07679.
[24] Tim Salimans,et al. Policy Gradient Search: Online Planning and Expert Iteration without Search Trees , 2019, ArXiv.
[25] Demis Hassabis,et al. Mastering Atari, Go, chess and shogi by planning with a learned model , 2019, Nature.
[26] Yuval Tassa,et al. Maximum a Posteriori Policy Optimisation , 2018, ICLR.
[27] Koray Kavukcuoglu,et al. PGQ: Combining policy gradient and Q-learning , 2016, ArXiv.
[28] Sergey Levine,et al. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review , 2018, ArXiv.
[29] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, MIT Press.
[30] Jessica B. Hamrick,et al. Combining Q-Learning and Search with Amortized Value Estimates , 2020, ICLR.
[31] Michal Valko,et al. Planning in entropy-regularized Markov decision processes and games , 2019, NeurIPS.
[32] Vicenç Gómez,et al. A unified view of entropy-regularized Markov decision processes , 2017, ArXiv.
[33] Igor Vajda,et al. On Divergences and Informations in Statistics and Information Theory , 2006, IEEE Transactions on Information Theory.
[34] Christopher D. Rosin,et al. Multi-armed bandits with episode context , 2011, Annals of Mathematics and Artificial Intelligence.
[35] Andriy Mnih,et al. Q-Learning in enormous action spaces via amortized approximate maximization , 2020, ArXiv.
[36] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[37] H. Francis Song,et al. V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control , 2019, ICLR.
[38] Satinder Singh,et al. Value Prediction Network , 2017, NIPS.
[39] Csaba Szepesvári,et al. Bandit Based Monte-Carlo Planning , 2006, ECML.
[40] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[41] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.