On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
Alekh Agarwal | Sham M. Kakade | Jason D. Lee | Gaurav Mahajan
[1] R. Bellman,et al. Functional Approximations and Dynamic Programming , 1959 .
[2] John Darzentas,et al. Problem Complexity and Method Efficiency in Optimization , 1983 .
[3] Jing Peng,et al. Function Optimization using Connectionist Reinforcement Learning Algorithms , 1991 .
[4] Yoav Freund,et al. A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.
[5] John N. Tsitsiklis,et al. Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.
[6] K. Ball. An Elementary Introduction to Modern Convex Geometry , 1997, Flavors of Geometry.
[8] Justin A. Boyan,et al. Least-Squares Temporal Difference Learning , 1999, ICML.
[9] John N. Tsitsiklis,et al. Actor-Critic Algorithms , 1999, NIPS.
[10] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[11] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.
[12] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.
[13] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..
[14] Rémi Munos,et al. Error Bounds for Approximate Policy Iteration , 2003, ICML.
[15] Jeff G. Schneider,et al. Covariant Policy Search , 2003, IJCAI.
[16] Sham M. Kakade. On the Sample Complexity of Reinforcement Learning , 2003 .
[17] Jeff G. Schneider,et al. Policy Search by Dynamic Programming , 2003, NIPS.
[18] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.
[19] Michael Kearns,et al. Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.
[20] Csaba Szepesvári,et al. Finite time bounds for sampling based fitted value iteration , 2005, ICML.
[21] Rémi Munos,et al. Error Bounds for Approximate Value Iteration , 2005, AAAI.
[22] Stefan Schaal,et al. Natural Actor-Critic , 2003, Neurocomputing.
[23] Yurii Nesterov,et al. Cubic regularization of Newton method and its global performance , 2006, Math. Program..
[24] Csaba Szepesvári,et al. Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path , 2006, COLT.
[25] Gábor Lugosi,et al. Prediction, learning, and games , 2006 .
[26] Adrian S. Lewis,et al. The Łojasiewicz Inequality for Nonsmooth Subanalytic Functions with Applications to Subgradient Dynamical Systems , 2006, SIAM J. Optim..
[27] Shalabh Bhatnagar,et al. Natural actor-critic algorithms , 2009, Autom..
[29] Yishay Mansour,et al. Online Markov Decision Processes , 2009, Math. Oper. Res..
[30] Alessandro Lazaric,et al. Analysis of a Classification-based Policy Iteration Algorithm , 2010, ICML.
[31] Yasemin Altun,et al. Relative Entropy Policy Search , 2010 .
[32] Hédy Attouch,et al. Proximal Alternating Minimization and Projection Methods for Nonconvex Problems: An Approach Based on the Kurdyka-Łojasiewicz Inequality , 2008, Math. Oper. Res..
[33] Csaba Szepesvári,et al. Error Propagation for Approximate Policy and Value Iteration , 2010, NIPS.
[34] Sham M. Kakade,et al. Towards Minimax Policies for Online Linear Optimization with Bandit Feedback , 2012, COLT.
[35] Hilbert J. Kappen,et al. Dynamic policy programming , 2010, J. Mach. Learn. Res..
[36] Shai Shalev-Shwartz,et al. Online Learning and Online Convex Optimization , 2012, Found. Trends Mach. Learn..
[37] Sham M. Kakade,et al. Random Design Analysis of Ridge Regression , 2012, COLT.
[38] Saeed Ghadimi,et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming , 2013, SIAM J. Optim..
[39] Matthieu Geist,et al. Local Policy Search in a Convex Space and Conservative Policy Iteration as Boosted Policy Search , 2014, ECML/PKDD.
[40] F. John. Extremum Problems with Inequalities as Subsidiary Conditions , 2014 .
[41] Shai Ben-David,et al. Understanding Machine Learning: From Theory to Algorithms , 2014 .
[42] Csaba Szepesvári,et al. Online Markov Decision Processes Under Bandit Feedback , 2010, IEEE Transactions on Automatic Control.
[43] Bruno Scherrer,et al. Approximate Policy Iteration Schemes: A Comparison , 2014, ICML.
[44] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.
[45] Matthieu Geist,et al. Approximate modified policy iteration and its application to the game of Tetris , 2015, J. Mach. Learn. Res..
[46] Furong Huang,et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition , 2015, COLT.
[47] Mark W. Schmidt,et al. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition , 2016, ECML/PKDD.
[48] Saeed Ghadimi,et al. Accelerated gradient methods for nonconvex nonlinear and stochastic programming , 2013, Mathematical Programming.
[49] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.
[50] Nan Jiang,et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable , 2016, ICML.
[51] Vicenç Gómez,et al. A unified view of entropy-regularized Markov decision processes , 2017, ArXiv.
[52] Prateek Jain,et al. Parallelizing Stochastic Gradient Descent for Least Squares Regression: Mini-batching, Averaging, and Model Misspecification , 2016, J. Mach. Learn. Res..
[53] Amir Beck,et al. First-Order Methods in Optimization , 2017 .
[54] Sham M. Kakade,et al. Towards Generalization and Simplicity in Continuous Control , 2017, NIPS.
[55] Alec Radford,et al. Proximal Policy Optimization Algorithms , 2017, ArXiv.
[56] Michael I. Jordan,et al. How to Escape Saddle Points Efficiently , 2017, ICML.
[58] Sham M. Kakade,et al. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator , 2018, ICML.
[59] Yuval Tassa,et al. Maximum a Posteriori Policy Optimisation , 2018, ICLR.
[60] Mengdi Wang,et al. Sample-Optimal Parametric Q-Learning Using Linearly Additive Features , 2019, ICML.
[61] Nevena Lazic,et al. Exploration-Enhanced POLITEX , 2019, ArXiv.
[62] Qi Cai,et al. Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy , 2019, ArXiv.
[63] Peter L. Bartlett,et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction , 2019, ICML.
[64] Nan Jiang,et al. Information-Theoretic Considerations in Batch Reinforcement Learning , 2019, ICML.
[65] Matthieu Geist,et al. A Theory of Regularized Markov Decision Processes , 2019, ICML.
[66] Nicolas Le Roux,et al. Understanding the impact of entropy on policy optimization , 2018, ICML.
[67] Jalaj Bhandari,et al. Global Optimality Guarantees For Policy Gradient Methods , 2019, ArXiv.
[68] J. Lee,et al. Neural Temporal-Difference Learning Converges to Global Optima , 2019, NeurIPS.
[69] Michael I. Jordan,et al. Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.
[70] Shie Mannor,et al. Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs , 2019, AAAI.