论文信息 - Reinforcement Learning: Theory and Algorithms - 字舞流文

Reinforcement Learning: Theory and Algorithms

S. Kakade | Alekh Agarwal | Nan Jiang

[1] Ruosong Wang,et al. What are the Statistical Limits of Offline RL with Linear Function Approximation? , 2020, ICLR.

[2] Quanquan Gu,et al. Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping , 2020, ICML.

[3] Aleksandrs Slivkins,et al. Corruption Robust Exploration in Episodic Reinforcement Learning , 2019, COLT.

[4] Sham M. Kakade,et al. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift , 2019, J. Mach. Learn. Res..

[5] Siddhartha Srinivasa,et al. Imitation Learning as f-Divergence Minimization , 2019, WAFR.

[6] Wen Sun,et al. PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning , 2020, NeurIPS.

[7] Andrey Kolobov,et al. Policy Improvement from Multiple Experts , 2020, ArXiv.

[8] S. Kakade,et al. FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs , 2020, NeurIPS.

[9] Mengdi Wang,et al. Model-Based Reinforcement Learning with Value-Targeted Regression , 2020, L4DC.

[10] Yuxin Chen,et al. Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model , 2020, NeurIPS.

[11] Dale Schuurmans,et al. On the Global Convergence Rates of Softmax Policy Gradient Methods , 2020, ICML.

[12] Mikael Henaff,et al. Disagreement-Regularized Imitation Learning , 2020, ICLR.

[13] Nan Jiang,et al. $Q^\star$ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison , 2020, 2003.03924.

[14] Dylan J. Foster,et al. Logarithmic Regret for Adversarial Online Control , 2020, ICML.

[15] Mengdi Wang,et al. Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation , 2020, ICML.

[16] Babak Hassibi,et al. The Power of Linear Controllers in LQR Control , 2020, 2022 IEEE 61st Conference on Decision and Control (CDC).

[17] Max Simchowitz,et al. Naive Exploration is Optimal for Online LQR , 2020, ICML.

[18] Max Simchowitz,et al. Improper Learning for Non-Stochastic Control , 2020, COLT.

[19] Ambuj Tewari,et al. Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles , 2019, AISTATS.

[20] Ruosong Wang,et al. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? , 2020, ICLR.

[21] S. Kakade,et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes , 2019, COLT.

[22] Michael I. Jordan,et al. Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.

[23] Mengdi Wang,et al. Reinforcement Leaning in Feature Space: Matrix Bandit, Kernels, and Regret Bound , 2019, ICML.

[24] Qi Cai,et al. Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy , 2019, ArXiv.

[25] Lin F. Yang,et al. Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , 2019, COLT.

[26] Peter L. Bartlett,et al. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction , 2019, ICML.

[27] Byron Boots,et al. Provably Efficient Imitation Learning from Observation Alone , 2019, ICML.

[28] Nan Jiang,et al. Information-Theoretic Considerations in Batch Reinforcement Learning , 2019, ICML.

[29] Sham M. Kakade,et al. Online Control with Adversarial Disturbances , 2019, ICML.

[30] Benjamin Recht,et al. Certainty Equivalent Control of LQR is Efficient , 2019, ArXiv.

[31] Yishay Mansour,et al. Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret , 2019, ArXiv.

[32] Mengdi Wang,et al. Sample-Optimal Parametric Q-Learning Using Linearly Additive Features , 2019, ICML.

[33] Matthieu Geist,et al. A Theory of Regularized Markov Decision Processes , 2019, ICML.

[34] Nicolas Le Roux,et al. Understanding the impact of entropy on policy optimization , 2018, ICML.

[35] Nan Jiang,et al. Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches , 2018, COLT.

[36] Lin F. Yang,et al. Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model , 2018, 1806.01492.

[37] Nikolai Matni,et al. On the Sample Complexity of the Linear Quadratic Regulator , 2017, Foundations of Computational Mathematics.

[38] Nikolai Matni,et al. A System-Level Approach to Controller Synthesis , 2016, IEEE Transactions on Automatic Control.

[39] Nan Jiang,et al. On Oracle-Efficient PAC Reinforcement Learning with Rich Observations , 2018 .

[40] Nikolai Matni,et al. Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator , 2018, NeurIPS.

[41] Sergey Levine,et al. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review , 2018, ArXiv.

[42] Nolan Wagener,et al. Fast Policy Learning through Imitation and Reinforcement , 2018, UAI.

[43] Nan Jiang,et al. Hierarchical Imitation and Reinforcement Learning , 2018, ICML.

[44] Michael I. Jordan,et al. Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification , 2018, COLT.

[45] Byron Boots,et al. Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning , 2018, ICLR.

[46] Byron Boots,et al. Convergence of Value Aggregation for Imitation Learning , 2018, AISTATS.

[47] Sham M. Kakade,et al. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator , 2018, ICML.

[48] Sergey Levine,et al. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations , 2017, Robotics: Science and Systems.

[49] Tom Schaul,et al. Deep Q-learning From Demonstrations , 2017, AAAI.

[50] Prateek Jain,et al. Non-convex Optimization for Machine Learning , 2017, Found. Trends Mach. Learn..

[51] Amir Beck,et al. First-Order Methods in Optimization , 2017 .

[52] Marcello Restelli,et al. Boosted Fitted Q-Iteration , 2017, ICML.

[53] Vicenç Gómez,et al. A unified view of entropy-regularized Markov decision processes , 2017, ArXiv.

[54] Rémi Munos,et al. Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.

[55] Byron Boots,et al. Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction , 2017, ICML.

[56] Tor Lattimore,et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.

[57] Evangelos A. Theodorou,et al. Model Predictive Path Integral Control: From Theory to Parallel Computation , 2017 .

[58] Nan Jiang,et al. Contextual Decision Processes with low Bellman rank are PAC-Learnable , 2016, ICML.

[59] Benjamin Van Roy,et al. On Lower Bounds for Regret in Reinforcement Learning , 2016, ArXiv.

[60] Stefano Ermon,et al. Generative Adversarial Imitation Learning , 2016, NIPS.

[61] John Langford,et al. PAC Reinforcement Learning with Rich Observations , 2016, NIPS.

[62] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[63] Christoph Dann,et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning , 2015, NIPS.

[64] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.

[65] John Langford,et al. Learning to Search Better than Your Teacher , 2015, ICML.

[66] Sébastien Bubeck,et al. Convex Optimization: Algorithms and Complexity , 2014, Found. Trends Mach. Learn..

[67] Sergey Levine,et al. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics , 2014, NIPS.

[68] Matthieu Geist,et al. Local Policy Search in a Convex Space and Conservative Policy Iteration as Boosted Policy Search , 2014, ECML/PKDD.

[69] J. Andrew Bagnell,et al. Reinforcement and Imitation Learning via Interactive No-Regret Learning , 2014, ArXiv.

[70] Bruno Scherrer,et al. Approximate Policy Iteration Schemes: A Comparison , 2014, ICML.

[71] Saeed Ghadimi,et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming , 2013, SIAM J. Optim..

[72] Sergey Levine,et al. Guided Policy Search , 2013, ICML.

[73] Hilbert J. Kappen,et al. On the Sample Complexity of Reinforcement Learning with a Generative Model , 2012, ICML.

[74] Martial Hebert,et al. Activity Forecasting , 2012, ECCV.

[75] Leslie Pack Kaelbling,et al. LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics , 2012, 2012 IEEE International Conference on Robotics and Automation.

[76] Shai Shalev-Shwartz,et al. Online Learning and Online Convex Optimization , 2012, Found. Trends Mach. Learn..

[77] Csaba Szepesvári,et al. Regret Bounds for the Adaptive Control of Linear Quadratic Systems , 2011, COLT.

[78] Csaba Szepesvári,et al. Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[79] Yinyu Ye,et al. The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate , 2011, Math. Oper. Res..

[80] Geoffrey J. Gordon,et al. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.

[81] John Langford,et al. Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.

[82] Anind K. Dey,et al. Modeling Interaction via the Principle of Maximum Causal Entropy , 2010, ICML.

[83] Alessandro Lazaric,et al. Analysis of a Classification-based Policy Iteration Algorithm , 2010, ICML.

[84] J. Andrew Bagnell,et al. Efficient Reductions for Imitation Learning , 2010, AISTATS.

[85] Siddhartha S. Srinivasa,et al. Planning-based prediction for pedestrians , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[86] Yishay Mansour,et al. Online Markov Decision Processes , 2009, Math. Oper. Res..

[87] Russ Tedrake,et al. LQR-trees: Feedback motion planning on sparse randomized trees , 2009, Robotics: Science and Systems.

[88] John Langford,et al. Search-based structured prediction , 2009, Machine Learning.

[89] Sham M. Kakade,et al. A spectral algorithm for learning Hidden Markov Models , 2008, J. Comput. Syst. Sci..

[90] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[91] Anind K. Dey,et al. Maximum Entropy Inverse Reinforcement Learning , 2008, AAAI.

[92] Csaba Szepesvári,et al. Finite-Time Bounds for Fitted Value Iteration , 2008, J. Mach. Learn. Res..

[93] Thomas P. Hayes,et al. Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[94] Csaba Szepesvári,et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path , 2006, Machine Learning.

[95] Robert E. Schapire,et al. A Game-Theoretic Approach to Apprenticeship Learning , 2007, NIPS.

[96] Kevin L. Moore,et al. Iterative Learning Control: Brief Survey and Categorization , 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[97] Csaba Szepesvári,et al. Finite time bounds for sampling based fitted value iteration , 2005, ICML.

[98] Yinyu Ye,et al. A New Complexity Result on Solving the Markov Decision Problem , 2005, Math. Oper. Res..

[99] Rémi Munos,et al. Error Bounds for Approximate Value Iteration , 2005, AAAI.

[100] E. Todorov,et al. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems , 2005, Proceedings of the 2005, American Control Conference, 2005..

[101] Stefan Schaal,et al. Natural Actor-Critic , 2003, Neurocomputing.

[102] Pieter Abbeel,et al. Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[103] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[104] Michael Kearns,et al. Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[105] Satinder Singh,et al. An upper bound on the loss from approximate optimal-value functions , 1994, Machine Learning.

[106] Ronald J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[107] Jeff G. Schneider,et al. Policy Search by Dynamic Programming , 2003, NIPS.

[108] Rémi Munos,et al. Error Bounds for Approximate Policy Iteration , 2003, ICML.

[109] Martin Zinkevich,et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[110] Jeff G. Schneider,et al. Covariant policy search , 2003, IJCAI 2003.

[111] Venkataramanan Balakrishnan,et al. Semidefinite programming duality and linear time-invariant systems , 2003, IEEE Trans. Autom. Control..

[112] Sham M. Kakade,et al. On the sample complexity of reinforcement learning. , 2003 .

[113] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[114] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[115] Peter Auer,et al. The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[116] S. R. Jammalamadaka,et al. Empirical Processes in M-Estimation , 2001 .

[117] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.

[118] Andrew Y. Ng,et al. Pharmacokinetics of a novel formulation of ivermectin after administration to goats , 2000, ICML.

[119] Yishay Mansour,et al. Approximate Planning in Large POMDPs via Reusable Trajectories , 1999, NIPS.

[120] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[121] Michael Kearns,et al. Efficient Reinforcement Learning in Factored MDPs , 1999, IJCAI.

[122] Yishay Mansour,et al. On the Complexity of Policy Iteration , 1999, UAI.

[123] Philip M. Long,et al. Associative Reinforcement Learning using Linear Probabilistic Concepts , 1999, ICML.

[124] Michael Kearns,et al. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms , 1998, NIPS.

[125] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[126] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[127] Peter W. Glynn,et al. Likelihood ratio gradient estimation for stochastic systems , 1990, CACM.

[128] B. Anderson,et al. Optimal control: linear quadratic methods , 1990 .

[129] Colin McDiarmid,et al. Surveys in Combinatorics, 1989: On the method of bounded differences , 1989 .

[130] Dean Pomerleau,et al. ALVINN, an autonomous land vehicle in a neural network , 2015 .

[131] Dante C. Youla,et al. Modern Wiener-Hopf Design of Optimal Controllers. Part I , 1976 .

[132] R Bellman,et al. DYNAMIC PROGRAMMING AND LAGRANGE MULTIPLIERS. , 1956, Proceedings of the National Academy of Sciences of the United States of America.

[133] H. Robbins. Some aspects of the sequential design of experiments , 1952 .