Reinforcement Learning: Theory and Algorithms

[1]  Ruosong Wang,et al.  What are the Statistical Limits of Offline RL with Linear Function Approximation? , 2020, ICLR.

[2]  Quanquan Gu,et al.  Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping , 2020, ICML.

[3]  Aleksandrs Slivkins,et al.  Corruption Robust Exploration in Episodic Reinforcement Learning , 2019, COLT.

[4]  Sham M. Kakade,et al.  On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift , 2019, J. Mach. Learn. Res..

[5]  Siddhartha Srinivasa,et al.  Imitation Learning as f-Divergence Minimization , 2019, WAFR.

[6]  Wen Sun,et al.  PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning , 2020, NeurIPS.

[7]  Andrey Kolobov,et al.  Policy Improvement from Multiple Experts , 2020, ArXiv.

[8]  S. Kakade,et al.  FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs , 2020, NeurIPS.

[9]  Mengdi Wang,et al.  Model-Based Reinforcement Learning with Value-Targeted Regression , 2020, L4DC.

[10]  Yuxin Chen,et al.  Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model , 2020, NeurIPS.

[11]  Dale Schuurmans,et al.  On the Global Convergence Rates of Softmax Policy Gradient Methods , 2020, ICML.

[12]  Mikael Henaff,et al.  Disagreement-Regularized Imitation Learning , 2020, ICLR.

[13]  Nan Jiang,et al.  $Q^\star$ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison , 2020, 2003.03924.

[14]  Dylan J. Foster,et al.  Logarithmic Regret for Adversarial Online Control , 2020, ICML.

[15]  Mengdi Wang,et al.  Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation , 2020, ICML.

[16]  Babak Hassibi,et al.  The Power of Linear Controllers in LQR Control , 2020, 2022 IEEE 61st Conference on Decision and Control (CDC).

[17]  Max Simchowitz,et al.  Naive Exploration is Optimal for Online LQR , 2020, ICML.

[18]  Max Simchowitz,et al.  Improper Learning for Non-Stochastic Control , 2020, COLT.

[19]  Ambuj Tewari,et al.  Sample Complexity of Reinforcement Learning using Linearly Combined Model Ensembles , 2019, AISTATS.

[20]  Ruosong Wang,et al.  Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning? , 2020, ICLR.

[21]  S. Kakade,et al.  Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes , 2019, COLT.

[22]  Michael I. Jordan,et al.  Provably Efficient Reinforcement Learning with Linear Function Approximation , 2019, COLT.

[23]  Mengdi Wang,et al.  Reinforcement Leaning in Feature Space: Matrix Bandit, Kernels, and Regret Bound , 2019, ICML.

[24]  Qi Cai,et al.  Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy , 2019, ArXiv.

[25]  Lin F. Yang,et al.  Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal , 2019, COLT.

[26]  Peter L. Bartlett,et al.  POLITEX: Regret Bounds for Policy Iteration using Expert Prediction , 2019, ICML.

[27]  Byron Boots,et al.  Provably Efficient Imitation Learning from Observation Alone , 2019, ICML.

[28]  Nan Jiang,et al.  Information-Theoretic Considerations in Batch Reinforcement Learning , 2019, ICML.

[29]  Sham M. Kakade,et al.  Online Control with Adversarial Disturbances , 2019, ICML.

[30]  Benjamin Recht,et al.  Certainty Equivalent Control of LQR is Efficient , 2019, ArXiv.

[31]  Yishay Mansour,et al.  Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret , 2019, ArXiv.

[32]  Mengdi Wang,et al.  Sample-Optimal Parametric Q-Learning Using Linearly Additive Features , 2019, ICML.

[33]  Matthieu Geist,et al.  A Theory of Regularized Markov Decision Processes , 2019, ICML.

[34]  Nicolas Le Roux,et al.  Understanding the impact of entropy on policy optimization , 2018, ICML.

[35]  Nan Jiang,et al.  Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches , 2018, COLT.

[36]  Lin F. Yang,et al.  Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model , 2018, 1806.01492.

[37]  Nikolai Matni,et al.  On the Sample Complexity of the Linear Quadratic Regulator , 2017, Foundations of Computational Mathematics.

[38]  Nikolai Matni,et al.  A System-Level Approach to Controller Synthesis , 2016, IEEE Transactions on Automatic Control.

[39]  Nan Jiang,et al.  On Oracle-Efficient PAC Reinforcement Learning with Rich Observations , 2018 .

[40]  Nikolai Matni,et al.  Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator , 2018, NeurIPS.

[41]  Sergey Levine,et al.  Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review , 2018, ArXiv.

[42]  Nolan Wagener,et al.  Fast Policy Learning through Imitation and Reinforcement , 2018, UAI.

[43]  Nan Jiang,et al.  Hierarchical Imitation and Reinforcement Learning , 2018, ICML.

[44]  Michael I. Jordan,et al.  Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification , 2018, COLT.

[45]  Byron Boots,et al.  Truncated Horizon Policy Search: Combining Reinforcement Learning & Imitation Learning , 2018, ICLR.

[46]  Byron Boots,et al.  Convergence of Value Aggregation for Imitation Learning , 2018, AISTATS.

[47]  Sham M. Kakade,et al.  Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator , 2018, ICML.

[48]  Sergey Levine,et al.  Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations , 2017, Robotics: Science and Systems.

[49]  Tom Schaul,et al.  Deep Q-learning From Demonstrations , 2017, AAAI.

[50]  Prateek Jain,et al.  Non-convex Optimization for Machine Learning , 2017, Found. Trends Mach. Learn..

[51]  Amir Beck,et al.  First-Order Methods in Optimization , 2017 .

[52]  Marcello Restelli,et al.  Boosted Fitted Q-Iteration , 2017, ICML.

[53]  Vicenç Gómez,et al.  A unified view of entropy-regularized Markov decision processes , 2017, ArXiv.

[54]  Rémi Munos,et al.  Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.

[55]  Byron Boots,et al.  Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction , 2017, ICML.

[56]  Tor Lattimore,et al.  Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning , 2017, NIPS.

[57]  Evangelos A. Theodorou,et al.  Model Predictive Path Integral Control: From Theory to Parallel Computation , 2017 .

[58]  Nan Jiang,et al.  Contextual Decision Processes with low Bellman rank are PAC-Learnable , 2016, ICML.

[59]  Benjamin Van Roy,et al.  On Lower Bounds for Regret in Reinforcement Learning , 2016, ArXiv.

[60]  Stefano Ermon,et al.  Generative Adversarial Imitation Learning , 2016, NIPS.

[61]  John Langford,et al.  PAC Reinforcement Learning with Rich Observations , 2016, NIPS.

[62]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[63]  Christoph Dann,et al.  Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning , 2015, NIPS.

[64]  Sergey Levine,et al.  Trust Region Policy Optimization , 2015, ICML.

[65]  John Langford,et al.  Learning to Search Better than Your Teacher , 2015, ICML.

[66]  Sébastien Bubeck,et al.  Convex Optimization: Algorithms and Complexity , 2014, Found. Trends Mach. Learn..

[67]  Sergey Levine,et al.  Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics , 2014, NIPS.

[68]  Matthieu Geist,et al.  Local Policy Search in a Convex Space and Conservative Policy Iteration as Boosted Policy Search , 2014, ECML/PKDD.

[69]  J. Andrew Bagnell,et al.  Reinforcement and Imitation Learning via Interactive No-Regret Learning , 2014, ArXiv.

[70]  Bruno Scherrer,et al.  Approximate Policy Iteration Schemes: A Comparison , 2014, ICML.

[71]  Saeed Ghadimi,et al.  Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming , 2013, SIAM J. Optim..

[72]  Sergey Levine,et al.  Guided Policy Search , 2013, ICML.

[73]  Hilbert J. Kappen,et al.  On the Sample Complexity of Reinforcement Learning with a Generative Model , 2012, ICML.

[74]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[75]  Leslie Pack Kaelbling,et al.  LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics , 2012, 2012 IEEE International Conference on Robotics and Automation.

[76]  Shai Shalev-Shwartz,et al.  Online Learning and Online Convex Optimization , 2012, Found. Trends Mach. Learn..

[77]  Csaba Szepesvári,et al.  Regret Bounds for the Adaptive Control of Linear Quadratic Systems , 2011, COLT.

[78]  Csaba Szepesvári,et al.  Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.

[79]  Yinyu Ye,et al.  The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate , 2011, Math. Oper. Res..

[80]  Geoffrey J. Gordon,et al.  A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.

[81]  John Langford,et al.  Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.

[82]  Anind K. Dey,et al.  Modeling Interaction via the Principle of Maximum Causal Entropy , 2010, ICML.

[83]  Alessandro Lazaric,et al.  Analysis of a Classification-based Policy Iteration Algorithm , 2010, ICML.

[84]  J. Andrew Bagnell,et al.  Efficient Reductions for Imitation Learning , 2010, AISTATS.

[85]  Siddhartha S. Srinivasa,et al.  Planning-based prediction for pedestrians , 2009, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[86]  Yishay Mansour,et al.  Online Markov Decision Processes , 2009, Math. Oper. Res..

[87]  Russ Tedrake,et al.  LQR-trees: Feedback motion planning on sparse randomized trees , 2009, Robotics: Science and Systems.

[88]  John Langford,et al.  Search-based structured prediction , 2009, Machine Learning.

[89]  Sham M. Kakade,et al.  A spectral algorithm for learning Hidden Markov Models , 2008, J. Comput. Syst. Sci..

[90]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[91]  Anind K. Dey,et al.  Maximum Entropy Inverse Reinforcement Learning , 2008, AAAI.

[92]  Csaba Szepesvári,et al.  Finite-Time Bounds for Fitted Value Iteration , 2008, J. Mach. Learn. Res..

[93]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.

[94]  Csaba Szepesvári,et al.  Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path , 2006, Machine Learning.

[95]  Robert E. Schapire,et al.  A Game-Theoretic Approach to Apprenticeship Learning , 2007, NIPS.

[96]  Kevin L. Moore,et al.  Iterative Learning Control: Brief Survey and Categorization , 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[97]  Csaba Szepesvári,et al.  Finite time bounds for sampling based fitted value iteration , 2005, ICML.

[98]  Yinyu Ye,et al.  A New Complexity Result on Solving the Markov Decision Problem , 2005, Math. Oper. Res..

[99]  Rémi Munos,et al.  Error Bounds for Approximate Value Iteration , 2005, AAAI.

[100]  E. Todorov,et al.  A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems , 2005, Proceedings of the 2005, American Control Conference, 2005..

[101]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[102]  Pieter Abbeel,et al.  Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[103]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[104]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 2002, Machine Learning.

[105]  Satinder Singh,et al.  An upper bound on the loss from approximate optimal-value functions , 1994, Machine Learning.

[106]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[107]  Jeff G. Schneider,et al.  Policy Search by Dynamic Programming , 2003, NIPS.

[108]  Rémi Munos,et al.  Error Bounds for Approximate Policy Iteration , 2003, ICML.

[109]  Martin Zinkevich,et al.  Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[110]  Jeff G. Schneider,et al.  Covariant policy search , 2003, IJCAI 2003.

[111]  Venkataramanan Balakrishnan,et al.  Semidefinite programming duality and linear time-invariant systems , 2003, IEEE Trans. Autom. Control..

[112]  Sham M. Kakade,et al.  On the sample complexity of reinforcement learning. , 2003 .

[113]  John Langford,et al.  Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[114]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[115]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[116]  S. R. Jammalamadaka,et al.  Empirical Processes in M-Estimation , 2001 .

[117]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[118]  Andrew Y. Ng,et al.  Pharmacokinetics of a novel formulation of ivermectin after administration to goats , 2000, ICML.

[119]  Yishay Mansour,et al.  Approximate Planning in Large POMDPs via Reusable Trajectories , 1999, NIPS.

[120]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[121]  Michael Kearns,et al.  Efficient Reinforcement Learning in Factored MDPs , 1999, IJCAI.

[122]  Yishay Mansour,et al.  On the Complexity of Policy Iteration , 1999, UAI.

[123]  Philip M. Long,et al.  Associative Reinforcement Learning using Linear Probabilistic Concepts , 1999, ICML.

[124]  Michael Kearns,et al.  Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms , 1998, NIPS.

[125]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[126]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[127]  Peter W. Glynn,et al.  Likelihood ratio gradient estimation for stochastic systems , 1990, CACM.

[128]  B. Anderson,et al.  Optimal control: linear quadratic methods , 1990 .

[129]  Colin McDiarmid,et al.  Surveys in Combinatorics, 1989: On the method of bounded differences , 1989 .

[130]  Dean Pomerleau,et al.  ALVINN, an autonomous land vehicle in a neural network , 2015 .

[131]  Dante C. Youla,et al.  Modern Wiener-Hopf Design of Optimal Controllers. Part I , 1976 .

[132]  R Bellman,et al.  DYNAMIC PROGRAMMING AND LAGRANGE MULTIPLIERS. , 1956, Proceedings of the National Academy of Sciences of the United States of America.

[133]  H. Robbins Some aspects of the sequential design of experiments , 1952 .