Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces

In this paper, we set forth a new vision of reinforcement learning, developed over the past few years, that yields mathematically rigorous solutions to longstanding questions that have remained unresolved: (i) how to design reliable, convergent, and robust reinforcement learning algorithms; (ii) how to guarantee that reinforcement learning satisfies pre-specified "safety" constraints and remains in a stable region of the parameter space; (iii) how to design "off-policy" temporal difference learning algorithms in a reliable and stable manner; and (iv) how to integrate reinforcement learning into the rich theory of stochastic optimization. We provide detailed answers to all of these questions using the powerful framework of proximal operators. The key idea that emerges is the use of primal-dual spaces connected through a Legendre transform. This allows temporal difference updates to occur in the dual space, which confers a variety of important technical advantages. The Legendre transform elegantly generalizes past algorithms for solving reinforcement learning problems, such as natural gradient methods, which we show are closely related to the previously unconnected framework of mirror descent. Equally importantly, proximal operator theory enables the systematic development of operator splitting methods that safely and reliably decompose the products of gradient terms arising in recent variants of gradient-based temporal difference learning. This key technical innovation finally makes it possible to design "true" stochastic gradient methods for reinforcement learning. Finally, Legendre transforms yield a variety of other benefits, including the ability to model sparsity and domain geometry. Our work builds extensively on recent results on the convergence of saddle-point algorithms and on the theory of monotone operators.
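To make the dual-space idea concrete, the following is a minimal sketch (in Python, with illustrative step sizes and a synthetic stream of transitions) of a mirror-descent TD(0) update using the standard p-norm link functions: the weights are mapped into the dual space by the Legendre transform of psi(theta) = 0.5 * ||theta||_p^2, the temporal difference update is applied there, and the conjugate map brings the result back to the primal space. This is an illustration of the general mechanism described above, not the paper's own algorithm; the function and parameter names are hypothetical.

import numpy as np

def pnorm_link(w, p):
    # Gradient of psi(w) = 0.5 * ||w||_p^2; maps a vector to its dual.
    norm = np.linalg.norm(w, ord=p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)

def mirror_descent_td0(transitions, n_features, alpha=0.05, gamma=0.99, p=2.5):
    # transitions: iterable of (phi, reward, phi_next) feature-vector triples.
    q = p / (p - 1)  # conjugate exponent: 1/p + 1/q = 1
    theta = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        delta = r + gamma * phi_next @ theta - phi @ theta  # TD error
        theta_dual = pnorm_link(theta, p)   # Legendre map: primal -> dual
        theta_dual += alpha * delta * phi   # TD update performed in the dual space
        theta = pnorm_link(theta_dual, q)   # conjugate map: dual -> primal
    return theta

With p = 2 both link functions reduce to the identity and the iteration collapses to ordinary TD(0); other choices of p change the geometry of the update, which is one route to the sparsity and domain-geometry benefits mentioned above.

The decomposition of products of gradient terms can likewise be illustrated with the published GTD2 recursion of Sutton et al. (2009): the gradient of the projected Bellman error involves a product of expectations that cannot be sampled from a single transition, so a second set of weights w estimates one factor, yielding a two-timescale primal-dual update. The sketch below shows one such step under the same assumptions as above; the step sizes are again illustrative.

def gtd2_step(theta, w, phi, r, phi_next, alpha=0.01, beta=0.1, gamma=0.99):
    # One GTD2 update: theta are the primal weights; w are the auxiliary
    # (dual) weights estimating E[phi phi^T]^{-1} E[delta phi].
    delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w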
