Better Parameter-free Stochastic Optimization with ODE Updates for Coin-Betting

Parameter-free stochastic gradient descent (PFSGD) algorithms achieve optimal theoretical performance without requiring the learning rate to be set. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close this empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models. The new update is derived by solving an Ordinary Differential Equation (ODE) and admits a closed-form expression. We show empirically that this new parameter-free algorithm outperforms algorithms run with the "best default" learning rate and almost matches the performance of finely tuned baselines, with nothing to tune.
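To make the coin-betting mechanism concrete, the sketch below shows the standard Krichevsky-Trofimov coin-betting update on which this line of work builds (in the style of Orabona and Pal, 2016), not the ODE-based variant introduced in the paper. The function name, the unit-norm gradient assumption, and the initial "wealth" epsilon are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def coin_betting_sgd(grad_fn, x0, n_steps, epsilon=1.0):
    """Minimal sketch of vanilla coin-betting (KT) optimization.

    Assumes each stochastic (sub)gradient returned by grad_fn has norm
    at most 1; epsilon is the initial betting wealth. This is the
    baseline coin-betting scheme, not the paper's ODE-based update.
    """
    x = x0.copy()
    sum_neg_grads = np.zeros_like(x0)  # running sum of "coin outcomes" -g_i
    wealth = epsilon                   # epsilon plus cumulative reward <-g_i, x_i - x0>
    avg = np.zeros_like(x0)            # running average of iterates
    for t in range(1, n_steps + 1):
        # Bet a KT fraction (sum of past outcomes)/t of the current wealth
        # in the direction of the accumulated negative gradients.
        x = x0 + (sum_neg_grads / t) * wealth
        g = grad_fn(x)                 # stochastic (sub)gradient at the current iterate
        wealth += np.dot(-g, x - x0)
        sum_neg_grads += -g
        avg += (x - avg) / t
    return avg  # average iterate, as typically analyzed for stochastic convex problems
```

No learning rate appears anywhere: the step implicitly scales with the accumulated wealth, which is what makes the method parameter-free.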
