Provably Efficient Exploration in Policy Optimization
[1] Quanquan Gu,et al. Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping , 2020, ICML.
[2] Sham M. Kakade,et al. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift , 2019, J. Mach. Learn. Res..
[3] Yishay Mansour,et al. Adversarial Stochastic Shortest Path , 2020, ArXiv.
[4] Mengdi Wang,et al. Model-Based Reinforcement Learning with Value-Targeted Regression , 2020, L4DC.
[5] Csaba Szepesvári,et al. Learning with Good Feature Representations in Bandits and in RL with a Generative Model , 2019, ICML.
[6] Lin F. Yang,et al. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? , 2019, ICLR.
[7] Jian Peng,et al. √n-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank , 2019, COLT.
[8] Zhaoran Wang,et al. Neural Policy Gradient Methods: Global Optimality and Rates of Convergence , 2019, ICLR.
[9] S. Kakade,et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes , 2019, COLT.
[10] Michael I. Jordan,et al. Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.
[11] Mengdi Wang,et al. Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound , 2019, ICML.
[12] Zhuoran Yang,et al. A Theoretical Analysis of Deep Q-Learning , 2019, L4DC.
[13] Benjamin Van Roy,et al. Comments on the Du-Kakade-Wang-Yang Lower Bounds , 2019, ArXiv.
[14] Nevena Lazic,et al. Exploration-Enhanced POLITEX , 2019, ArXiv.
[15] Yongxin Chen,et al. On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost , 2019, ArXiv.
[16] Qi Cai,et al. Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy , 2019, ArXiv.
[17] Ruosong Wang,et al. Provably Efficient Q-learning with Function Approximation via Distribution Shift Error Checking Oracle , 2019, NeurIPS.
[18] Martin J. Wainwright,et al. Variance-reduced Q-learning is minimax optimal , 2019, ArXiv.
[19] Jalaj Bhandari,et al. Global Optimality Guarantees For Policy Gradient Methods , 2019, ArXiv.
[20] Peter L. Bartlett,et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction , 2019, ICML.
[21] Yishay Mansour,et al. Online Convex Optimization in Adversarial Markov Decision Processes , 2019, ICML.
[22] Nan Jiang,et al. Information-Theoretic Considerations in Batch Reinforcement Learning , 2019, ICML.
[23] Mengdi Wang,et al. Sample-Optimal Parametric Q-Learning Using Linearly Additive Features , 2019, ICML.
[24] Matthieu Geist,et al. A Theory of Regularized Markov Decision Processes , 2019, ICML.
[25] Yishay Mansour,et al. Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function , 2019, NeurIPS.
[26] Michael I. Jordan,et al. Is Q-learning Provably Efficient? , 2018, NeurIPS.
[27] William Yang Wang,et al. Deep Reinforcement Learning for NLP , 2018, ACL.
[28] Benjamin Recht,et al. Simple random search provides a competitive approach to reinforcement learning , 2018, ArXiv.
[29] Sham M. Kakade,et al. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator , 2018, ICML.
[30] Xian Wu,et al. Variance reduced value iteration and faster algorithms for solving Markov decision processes , 2017, SODA.
[31] Xian Wu,et al. Near-Optimal Time and Sample Complexities for Solving Markov Decision Processes with a Generative Model , 2018, NeurIPS.
[32] Demis Hassabis,et al. Mastering the game of Go without human knowledge , 2017, Nature.
[33] Alec Radford,et al. Proximal Policy Optimization Algorithms , 2017, ArXiv.
[34] Marcello Restelli,et al. Boosted Fitted Q-Iteration , 2017, ICML.
[35] Vicenç Gómez,et al. A unified view of entropy-regularized Markov decision processes , 2017, ArXiv.
[36] Rémi Munos,et al. Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.
[37] Tor Lattimore,et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.
[38] Nan Jiang,et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable , 2016, ICML.
[39] Zheng Wen,et al. Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization , 2013, Math. Oper. Res..
[40] Benjamin Van Roy,et al. On Lower Bounds for Regret in Reinforcement Learning , 2016, ArXiv.
[41] Pieter Abbeel,et al. Benchmarking Deep Reinforcement Learning for Continuous Control , 2016, ICML.
[42] Demis Hassabis,et al. Mastering the game of Go with deep neural networks and tree search , 2016, Nature.
[43] Benjamin Van Roy,et al. Generalization and Exploration via Randomized Value Functions , 2014, ICML.
[44] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.
[45] Csaba Szepesvári,et al. Online Markov Decision Processes Under Bandit Feedback , 2010, IEEE Transactions on Automatic Control.
[46] Gergely Neu,et al. Online learning in episodic Markovian decision processes by relative entropy policy search , 2013, NIPS.
[47] Hilbert J. Kappen,et al. On the Sample Complexity of Reinforcement Learning with a Generative Model , 2012, ICML.
[48] Sébastien Bubeck,et al. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems , 2012, Found. Trends Mach. Learn..
[49] András György,et al. The adversarial stochastic shortest path problem with unknown transition probabilities , 2012, AISTATS.
[50] Hilbert J. Kappen,et al. Dynamic policy programming , 2010, J. Mach. Learn. Res..
[51] Csaba Szepesvári,et al. Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.
[52] Hilbert J. Kappen,et al. Speedy Q-Learning , 2011, NIPS.
[53] Wei Chu,et al. Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.
[54] Csaba Szepesvári,et al. Error Propagation for Approximate Policy and Value Iteration , 2010, NIPS.
[55] John N. Tsitsiklis,et al. Linearly Parameterized Bandits , 2008, Math. Oper. Res..
[56] Csaba Szepesvári,et al. The Online Loop-free Stochastic Shortest-Path Problem , 2010, COLT.
[57] Lin Xiao,et al. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization , 2009, J. Mach. Learn. Res..
[58] Yishay Mansour,et al. Online Markov Decision Processes , 2009, Math. Oper. Res..
[59] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..
[60] Shie Mannor,et al. Markov Decision Processes with Arbitrary Reward Processes , 2008, Math. Oper. Res..
[61] Csaba Szepesvári,et al. Finite-Time Bounds for Fitted Value Iteration , 2008, J. Mach. Learn. Res..
[62] Thomas P. Hayes,et al. Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.
[63] Csaba Szepesvári,et al. Fitted Q-iteration in continuous action-space MDPs , 2007, NIPS.
[64] Michael L. Littman,et al. Efficient Reinforcement Learning with Relocatable Action Models , 2007, AAAI.
[65] Lihong Li,et al. PAC model-free reinforcement learning , 2006, ICML.
[66] Gábor Lugosi,et al. Prediction, Learning, and Games , 2006, Cambridge University Press.
[67] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[68] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.
[69] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.
[70] Steven J. Bradtke,et al. Linear Least-Squares algorithms for temporal difference learning , 2004, Machine Learning.
[71] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning , 2003, PhD thesis, University College London.
[72] Peter Auer,et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..
[73] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.
[74] J. Baxter,et al. Direct gradient-based reinforcement learning , 2000, 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353).
[75] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[76] Justin A. Boyan,et al. Least-Squares Temporal Difference Learning , 1999, ICML.
[77] Vijay R. Konda,et al. Actor-Critic Algorithms , 1999, NIPS.
[78] Reid G. Simmons,et al. Complexity Analysis of Real-Time Reinforcement Learning , 1993, AAAI.
[79] A. Nemirovsky,et al. Problem Complexity and Method Efficiency in Optimization , 1983, Wiley.