Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions

Summary form only given. Stochastic optimization algorithms have many desirable features for large-scale machine learning, and accordingly have been the focus of renewed and intensive study in the last several years (e.g., see the papers [2], [5], [14] and references therein). The empirical efficiency of these methods is backed by strong theoretical guarantees, providing sharp bounds on their convergence rates. These convergence rates are known to depend on the structure of the underlying objective function, with faster rates possible for objective functions that are smooth and/or (strongly) convex, or for optima that have desirable features such as sparsity. More precisely, for an objective function that is strongly convex, stochastic gradient descent enjoys a convergence rate ranging from O(1/T), when feature vectors are extremely sparse, to O(d/T), when feature vectors are dense [10], [6]. A complementary type of condition is that of sparsity, either exact or approximate, in the optimal solution. Sparse models have proven useful in many application areas (see the overview papers [4], [9], [3] and references therein for further background), and many optimization-based statistical procedures seek to exploit such sparsity via ℓ1-regularization. A significant feature of optimization algorithms for sparse problems is their mild logarithmic scaling with the problem dimension [11], [12], [5], [14]. More precisely, it is known [11], [12] that when the optimal solution θ has at most s non-zero entries, appropriate versions of the stochastic mirror descent algorithm converge at a rate O(s√((log d)/T)). Srebro et al. [13] exploit the smoothness of many common loss functions; in application to sparse linear regression, their analysis yields improved rates of the form O(η√((s log d)/T)), where η is the noise variance. While the √(log d) scaling with the dimension makes these methods attractive in high dimensions, the scaling with respect to the number of iterations T is relatively slow, namely O(1/√T) versus O(1/T) for strongly convex problems.

The algorithm proposed in this paper aims to exploit both strong convexity and sparsity. We show that the algorithm has convergence rate O((s log d)/T) for a strongly convex problem with an s-sparse optimum in d dimensions. Moreover, this rate is unimprovable up to constant factors, meaning that no algorithm can converge at a substantially faster rate. The method builds on recent work on multi-step methods for strongly convex problems [7], [8], but involves some new ingredients that are essential to obtain optimal rates for statistical problems with sparse optima. Numerical simulations confirm our theoretical predictions regarding the convergence rate of the algorithm, and demonstrate its performance in comparison to other methods: regularized dual averaging [14] and stochastic gradient descent algorithms. We refer the reader to the full report [1] for more details.
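As a concrete illustration of the kind of procedure and experiment discussed above, the following Python sketch runs an epoch-based stochastic gradient method with ℓ1 soft-thresholding on a synthetic sparse linear regression problem. It is only a minimal sketch of the general multi-step idea under assumptions of our own choosing, not the algorithm analyzed in the paper: the function names, epoch schedule, step sizes, and regularization levels are all illustrative.

# A minimal illustrative sketch (not the algorithm from the paper): an
# epoch-based stochastic gradient method with l1 soft-thresholding on a
# synthetic sparse linear regression problem. The epoch schedule, step sizes,
# and regularization levels are assumptions chosen only to make the
# multi-step idea concrete.
import numpy as np

rng = np.random.default_rng(0)

# Strongly convex problem (identity design covariance) with an s-sparse optimum.
d, s, sigma = 500, 10, 0.5
theta_star = np.zeros(d)
theta_star[:s] = 1.0


def sample_gradient(theta):
    """Stochastic gradient of the squared loss at a freshly drawn sample."""
    x = rng.standard_normal(d)
    y = x @ theta_star + sigma * rng.standard_normal()
    return (x @ theta - y) * x


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1; keeps the iterates sparse."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def multi_epoch_sgd(num_epochs=6, epoch_len=2000, step0=2e-3, reg0=0.2):
    """Run proximal SGD in epochs: within an epoch the step size and l1
    penalty are held fixed; across epochs both are halved and the next epoch
    restarts from the previous epoch's averaged iterate (one common
    multi-step template; the schedule in the paper may differ)."""
    center = np.zeros(d)
    for k in range(num_epochs):
        step, reg = step0 / 2 ** k, reg0 / 2 ** k
        theta, running_avg = center.copy(), np.zeros(d)
        for t in range(1, epoch_len + 1):
            theta = soft_threshold(theta - step * sample_gradient(theta), step * reg)
            running_avg += (theta - running_avg) / t
        center = running_avg
        err = np.sum((center - theta_star) ** 2)
        print(f"epoch {k}: squared error of epoch average = {err:.4f}")
    return center


if __name__ == "__main__":
    multi_epoch_sgd()

The design choice illustrated here is the multi-step structure: the step size and ℓ1 penalty are tightened geometrically across epochs while each epoch restarts from the previous epoch's average, one common way of combining strong convexity with sparsity-promoting regularization. To see why an O((s log d)/T) rate matters, note that for d = 10,000, s = 100, and T = 10^5, the quantity s√((log d)/T) is roughly 0.96, whereas s(log d)/T is roughly 0.0092, about two orders of magnitude smaller.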

[1] Saeed Ghadimi, et al. Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization, II: Shrinking Procedures and Optimal Algorithms, 2013, SIAM J. Optim.

[2] John Darzentas, et al. Problem Complexity and Method Efficiency in Optimization, 1983.

[3] Martin J. Wainwright, et al. Fast global convergence rates of gradient methods for high-dimensional statistical recovery, 2010, NIPS.

[4] Martin J. Wainwright, et al. Restricted Eigenvalue Properties for Correlated Gaussian Designs, 2010, J. Mach. Learn. Res.

[5] Martin J. Wainwright, et al. Minimax Rates of Estimation for High-Dimensional Linear Regression Over ℓq-Balls, 2009, IEEE Transactions on Information Theory.

[6] Ting Hu, et al. Online Learning with Samples Drawn from Non-identical Distributions, 2009, J. Mach. Learn. Res.

[7] S. Geer. High-dimensional generalized linear models and the lasso, 2008, 0804.0703.

[8] Yoram Singer, et al. Pegasos: primal estimated sub-gradient solver for SVM, 2007, ICML '07.

[9] Martin J. Wainwright, et al. Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization, 2010, IEEE Transactions on Information Theory.

[10] J. Hiriart-Urruty, et al. Convex analysis and minimization algorithms, 1993.

[11] Y. Nesterov. Gradient methods for minimizing composite objective function, 2007.

[12] Martin J. Wainwright, et al. Fast global convergence of gradient methods for high-dimensional statistical recovery, 2011, arXiv.

[13] Elad Hazan, et al. An optimal algorithm for stochastic strongly-convex optimization, 2010, 1006.2425.

[14] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.

[15] Ambuj Tewari, et al. Smoothness, Low Noise and Fast Rates, 2010, NIPS.

[16] Ambuj Tewari, et al. Stochastic methods for ℓ1 regularized loss minimization, 2009, ICML '09.

[17] Sara van de Geer, et al. Statistics for High-Dimensional Data: Methods, Theory and Applications, 2011.

[18] Lin Xiao, et al. A Proximal-Gradient Homotopy Method for the Sparse Least-Squares Problem, 2012, SIAM J. Optim.

[19] Yoram Singer, et al. Efficient Online and Batch Learning Using Forward Backward Splitting, 2009, J. Mach. Learn. Res.

[20] Yurii Nesterov, et al. Primal-dual subgradient methods for convex problems, 2005, Math. Program.

[21] Ambuj Tewari, et al. Composite objective mirror descent, 2010, COLT 2010.

[22] Alexander Shapiro, et al. Stochastic Approximation approach to Stochastic Programming, 2013.

[23] Sara van de Geer, et al. Statistics for High-Dimensional Data, 2011.

[24] Elad Hazan, et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[25] Martin J. Wainwright, et al. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, 2009, NIPS.

[26] Lin Xiao, et al. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization, 2009, J. Mach. Learn. Res.

[27] V. Buldygin, et al. Metric characterization of random variables and random processes, 2000.

[28] Adam Tauman Kalai, et al. Logarithmic Regret Algorithms for Online Convex Optimization, 2006, COLT.

[29] P. Bickel, et al. Simultaneous analysis of Lasso and Dantzig selector, 2008, 0801.1095.

[30] Martin J. Wainwright, et al. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling, 2010, IEEE Transactions on Automatic Control.

[31] Po-Ling Loh, et al. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity, 2011, NIPS.

[32] Shuheng Zhou, et al. Reconstruction from Anisotropic Random Measurements, 2012, 25th Annual Conference on Learning Theory (COLT).

[33] Gregory Piatetsky-Shapiro, et al. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, 2000.

[34] Alexander Shapiro, et al. Validation analysis of mirror descent stochastic approximation method, 2012, Math. Program.

[35] Yoram Singer, et al. Pegasos: primal estimated sub-gradient solver for SVM, 2011, Math. Program.

[36] Y. Nesterov, et al. Primal-dual subgradient methods for minimizing uniformly convex functions, 2010, 1401.1792.