The Power of Factorial Powers: New Parameter Settings for (Stochastic) Optimization

The convergence rates of convex and non-convex optimization methods depend on the choice of a host of constants, including step sizes, Lyapunov function constants, and momentum constants. In this work we propose the use of factorial powers as a flexible tool for defining constants that appear in convergence proofs. We list a number of remarkable properties that these sequences enjoy, and show how they can be applied to convergence proofs to simplify or improve the convergence rates of the momentum method, the accelerated gradient method, and the stochastic variance-reduced gradient method (SVRG).
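
As a minimal illustrative sketch, assuming the "factorial powers" referred to above are the standard rising and falling factorials (in Graham–Knuth–Patashnik notation; this is an assumption, not a definition taken from the abstract): the degree-r rising and falling factorial powers of x are
\[
x^{\overline{r}} = x (x+1) \cdots (x+r-1), \qquad
x^{\underline{r}} = x (x-1) \cdots (x-r+1).
\]
These sequences telescope exactly under the forward difference operator,
\[
(t+1)^{\underline{r}} - t^{\underline{r}} = r\, t^{\underline{r-1}},
\qquad\text{so that}\qquad
\sum_{t=0}^{T-1} t^{\underline{r}} = \frac{T^{\underline{r+1}}}{r+1},
\]
a discrete analogue of the integral identity \(\int_0^T t^r\, dt = T^{r+1}/(r+1)\). Under this reading, using factorial powers as step-size or averaging constants lets the sums that arise in convergence proofs be evaluated in closed form rather than bounded by integrals.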
