Generalized stochastic Frank–Wolfe algorithm with stochastic “substitute” gradient for structured convex optimization

The stochastic Frank–Wolfe method has recently attracted much general interest in the context of optimization for statistical and machine learning due to its ability to work with a more general feasible region. However, there has been a complexity gap in the dependence on the optimality tolerance $\varepsilon$ in the guaranteed convergence rate for stochastic Frank–Wolfe compared to its deterministic counterpart. In this work, we present a new generalized stochastic Frank–Wolfe method which closes this gap for the class of structured optimization problems encountered in statistical and machine learning characterized by empirical loss minimization with a certain type of “linear prediction” property (formally defined in the paper), which is typically present in loss minimization problems in practice. Our method also introduces the notion of a “substitute gradient” that is a not-necessarily-unbiased estimate of the gradient. We show that our new method is equivalent to a particular randomized coordinate mirror descent algorithm applied to the dual problem, which in turn provides a new interpretation of randomized dual coordinate descent in the primal space. Also, in the special case of a strongly convex regularizer, our generalized stochastic Frank–Wolfe method (as well as the randomized dual coordinate descent method) exhibits linear convergence. Furthermore, we present computational experiments that indicate that our method outperforms other stochastic Frank–Wolfe methods for a sufficiently small optimality tolerance, consistent with the theory developed herein.
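
To make the setting concrete, the sketch below shows a generic stochastic Frank–Wolfe iteration on a least-squares problem over an $\ell_1$ ball: a gradient estimate replaces the exact gradient, a linear minimization oracle over the feasible region replaces projection, and the iterate is updated by a convex combination. This is only an illustrative sketch under assumed choices (least-squares loss, $\ell_1$-ball constraint, the classic $2/(k+2)$ step size, and a mini-batch gradient standing in for the paper's "substitute gradient"); it is not the paper's generalized algorithm or its step-size rule.

```python
import numpy as np

def lmo_l1_ball(grad, radius):
    """Linear minimization oracle over the l1 ball of the given radius:
    argmin_{||s||_1 <= radius} <grad, s> is a single signed, scaled coordinate."""
    s = np.zeros_like(grad)
    i = np.argmax(np.abs(grad))
    s[i] = -radius * np.sign(grad[i])
    return s

def stochastic_frank_wolfe(A, b, radius=10.0, n_iters=500, batch_size=32, seed=0):
    """Generic stochastic Frank-Wolfe for min_x (1/2n)||Ax - b||^2 over the l1 ball.
    The mini-batch gradient below is only a stand-in for a 'substitute gradient'."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for k in range(n_iters):
        idx = rng.choice(n, size=batch_size, replace=False)    # sample a mini-batch
        g = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size       # stochastic gradient estimate
        s = lmo_l1_ball(g, radius)                              # solve the linear subproblem
        gamma = 2.0 / (k + 2.0)                                 # open-loop step size
        x = (1.0 - gamma) * x + gamma * s                       # convex-combination update
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 50))
    x_true = np.zeros(50); x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    x_hat = stochastic_frank_wolfe(A, b)
    print("final objective:", 0.5 * np.mean((A @ x_hat - b) ** 2))
```

The appeal of this projection-free template in the structured setting is that the linear subproblem over the $\ell_1$ ball is solved in closed form by a single signed coordinate, so each iteration requires only a gradient estimate and one coordinate update of the search direction rather than a projection.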
