Sarah Frank-Wolfe: Methods for Constrained Optimization with Best Rates and Practical Features

The Frank-Wolfe (FW) method is a popular approach for solving optimization problems with structured constraints that arise in machine learning applications. In recent years, stochastic versions of FW have gained popularity, motivated by large datasets for which computing the full gradient is prohibitively expensive. In this paper, we present two new variants of the FW algorithm for stochastic finite-sum minimization. Our algorithms achieve the best convergence guarantees among existing stochastic FW approaches for both convex and non-convex objective functions. Our methods avoid permanently collecting large batches, an issue common to many stochastic projection-free approaches. Moreover, our second approach requires neither large batches nor full deterministic gradients, a typical weakness of many techniques for finite-sum problems. The faster theoretical rates of our approaches are confirmed experimentally.
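To make the setting concrete, the sketch below illustrates how a SARAH-type recursive gradient estimator can, in general, be combined with a projection-free Frank-Wolfe update for finite-sum minimization over a constraint set accessed through a linear minimization oracle. This is only a minimal generic sketch, not the paper's algorithms: the estimator refresh probability `p`, the batch size, the step-size schedule `gamma`, and the user-supplied functions `grad_i` and `lmo` are assumptions introduced here for illustration.

```python
import numpy as np

def sarah_frank_wolfe(grad_i, lmo, x0, n, T, batch_size=32, p=0.05,
                      gamma=lambda t: 2.0 / (t + 2), seed=None):
    """Generic SARAH-style stochastic Frank-Wolfe sketch (illustrative only).

    grad_i(x, idx): average gradient of f_i over the indices in idx at x.
    lmo(g):         argmin_{s in C} <g, s>, the linear minimization oracle.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = grad_i(x, np.arange(n))            # initial full-gradient estimate
    x_prev = x.copy()
    for t in range(T):
        s = lmo(v)                         # FW direction from the LMO
        step = gamma(t)
        x_prev, x = x, x + step * (s - x)  # projection-free convex-combination update
        if rng.random() < p:
            v = grad_i(x, np.arange(n))    # occasional full refresh of the estimator
        else:
            idx = rng.choice(n, size=batch_size, replace=False)
            # SARAH recursion: previous estimate plus a mini-batch gradient
            # difference evaluated at consecutive iterates
            v = v + grad_i(x, idx) - grad_i(x_prev, idx)
    return x
```

The key design point this sketch conveys is that the gradient estimator is updated recursively from mini-batch differences, so large batches are never collected permanently; how often (if at all) a full gradient is recomputed, and with what parameters, is exactly where the paper's two variants differ from such a generic scheme.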
