Solving Empirical Risk Minimization in the Current Matrix Multiplication Time

Many convex problems in machine learning and computer science share the same form:
\begin{align*} \min_{x} \sum_{i} f_i(A_i x + b_i), \end{align*}
where the $f_i$ are convex functions on $\mathbb{R}^{n_i}$ with constant $n_i$, $A_i \in \mathbb{R}^{n_i \times d}$, $b_i \in \mathbb{R}^{n_i}$, and $\sum_i n_i = n$. This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time
\begin{align*} O^* \big( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2 + 1/6} ) \log (n / \delta) \big), \end{align*}
where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. Note that the runtime depends only logarithmically on the condition numbers and other data-dependent parameters, which are captured by $\delta$. For the current bounds $\omega \sim 2.38$ [Vassilevska Williams'12, Le Gall'14] and $\alpha \sim 0.31$ [Le Gall, Urrutia'18], our runtime $O^* ( n^{\omega} \log (n / \delta))$ matches the current best runtime for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all currently known techniques cannot give an $\omega$ below $2.168$, which is larger than our $2 + 1/6$. Our result generalizes the very recent result on solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a broader class of problems. Our algorithm introduces two concepts that differ from [Cohen, Lee, Song'19]:
$\bullet$ We give a robust deterministic central path method, whereas the previous method uses a stochastic central path that updates the weights by a random sparse vector.
$\bullet$ We propose an efficient data structure to maintain the central path of interior point methods even when the weight update vector is dense.
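To make the problem form concrete, the following is a minimal sketch (not the paper's interior point algorithm) that evaluates the objective $\sum_i f_i(A_i x + b_i)$ when each block has $n_i = 1$ and $f_i$ is the logistic loss, a standard empirical risk minimization instance of the form above. All names in the snippet (erm_objective, A_blocks, etc.) are illustrative.

```python
import numpy as np

# Sketch of the block-structured objective  min_x sum_i f_i(A_i x + b_i)
# with n_i = 1 per block and f_i(t) = log(1 + exp(-t)) (logistic loss).
# This only evaluates the objective; it is not the paper's algorithm.

def erm_objective(x, A_blocks, b_blocks, f_blocks):
    """Evaluate sum_i f_i(A_i x + b_i) for block-structured data."""
    return sum(f(A @ x + b) for A, b, f in zip(A_blocks, b_blocks, f_blocks))

rng = np.random.default_rng(0)
d, n = 5, 20
features = rng.standard_normal((n, d))      # one data point a_i per block
labels = rng.choice([-1.0, 1.0], size=n)    # labels y_i

# For logistic regression, block i is A_i = y_i * a_i^T (a 1 x d matrix),
# b_i = 0, and f_i is the logistic loss.
A_blocks = [labels[i] * features[i][None, :] for i in range(n)]
b_blocks = [np.zeros(1) for _ in range(n)]
f_blocks = [lambda t: np.log1p(np.exp(-t)).item() for _ in range(n)]

x = np.zeros(d)
print(erm_objective(x, A_blocks, b_blocks, f_blocks))  # n * log(2) at x = 0
```

Least squares regression and linear programming fit the same template by choosing quadratic $f_i$ or indicator-style constraint penalties, respectively; the paper's contribution is solving all such problems to relative accuracy $\delta$ in the stated time.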

[1]  Anirban Dasgupta,et al.  Sampling algorithms and coresets for ℓp regression , 2007, SODA '08.

[2]  Yin Tat Lee,et al.  Leverage Score Sampling for Faster Accelerated Regression and ERM , 2017, ALT.

[3]  Dean P. Foster,et al.  Finding Linear Structure in Large Datasets with Scalable Canonical Correlation Analysis , 2015, ICML.

[4]  Mikhail Kapralov,et al.  Sparse fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time , 2016, STOC.

[5]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[6]  Don Coppersmith,et al.  Matrix multiplication via arithmetic progressions , 1987, STOC.

[7]  Eric C. Price,et al.  Sparse recovery and Fourier sampling , 2013 .

[8]  Martin J. Wainwright,et al.  Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence , 2015, SIAM J. Optim..

[9]  Piotr Indyk,et al.  (Nearly) Sample-Optimal Sparse Fourier Transform , 2014, SODA.

[10]  Dean P. Foster,et al.  Faster Ridge Regression via the Subsampled Randomized Hadamard Transform , 2013, NIPS.

[11]  Rong Jin,et al.  Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-type of Risk Bounds , 2017, COLT.

[12]  Shai Shalev-Shwartz,et al.  Fast Rates for Empirical Risk Minimization of Strict Saddle Problems , 2017, COLT.

[13]  Zeyuan Allen-Zhu,et al.  Natasha: Faster Non-Convex Stochastic Optimization Via Strongly Non-Convex Parameter , 2017, ArXiv.

[14]  G. S. Watson,et al.  Smooth regression analysis , 1964 .

[15]  Vasileios Nakos,et al.  (Nearly) Sample-Optimal Sparse Fourier Transform in Any Dimension; RIPless and Filterless , 2019, 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS).

[16]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[17]  Josh Alman,et al.  Limits on the Universal method for matrix multiplication , 2018, CCC.

[18]  Francis R. Bach,et al.  Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions , 2014, AISTATS.

[19]  Francis Bach,et al.  SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives , 2014, NIPS.

[20]  Yin Tat Lee,et al.  An homotopy method for ℓp regression provably beyond self-concordance and in input-sparsity time , 2018, STOC.

[21]  E. Nadaraya On Estimating Regression , 1964 .

[22]  Zaïd Harchaoui,et al.  Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice , 2017, J. Mach. Learn. Res..

[23]  Mark W. Schmidt,et al.  Minimizing finite sums with the stochastic average gradient , 2013, Mathematical Programming.

[24]  Eric Moulines,et al.  Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning , 2011, NIPS.

[25]  Shai Ben-David,et al.  Understanding Machine Learning: From Theory to Algorithms , 2014 .

[26]  Y. Nesterov A method for solving the convex programming problem with convergence rate O(1/k^2) , 1983 .

[27]  Tong Zhang,et al.  Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization , 2013, Mathematical Programming.

[28]  David P. Woodruff,et al.  Fast Regression with an ℓ∞ Guarantee , 2017 .

[29]  Pravin M. Vaidya,et al.  Speeding-up linear programming using fast matrix multiplication , 1989, 30th Annual Symposium on Foundations of Computer Science.

[30]  Piotr Indyk,et al.  Sample-Optimal Fourier Sampling in Any Constant Dimension , 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[31]  Zaïd Harchaoui,et al.  A Universal Catalyst for First-Order Optimization , 2015, NIPS.

[32]  Zhao Song,et al.  A Robust Sparse Fourier Transform in the Continuous Setting , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[33]  Léon Bottou,et al.  The Tradeoffs of Large Scale Learning , 2007, NIPS.

[34]  Michael I. Jordan,et al.  Non-convex Finite-Sum Optimization Via SCSG Methods , 2017, NIPS.

[35]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[36]  K. Clarkson  Subgradient and sampling algorithms for ℓ1 regression , 2005, SODA '05.

[37]  Yuchen Zhang,et al.  Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization , 2014, ICML.

[38]  Richard Peng,et al.  Iterative Refinement for $\ell_p$-norm Regression , 2019, SODA.

[39]  Yurii Nesterov,et al.  Efficiency of the Accelerated Coordinate Descent Method on Structured Optimization Problems , 2017, SIAM J. Optim..

[40]  Naomi S. Altman,et al.  Quantile regression , 2019, Nature Methods.

[41]  Yuanyuan Liu,et al.  Guaranteed Sufficient Decrease for Variance Reduced Stochastic Gradient Descent , 2017, 1703.06807.

[42]  Piotr Indyk,et al.  Nearly optimal sparse fourier transform , 2012, STOC '12.

[43]  P. Bartlett,et al.  Local Rademacher complexities , 2005, math/0508275.

[44]  Roger Koenker,et al.  Galton, Edgeworth, Frisch, and prospects for quantile regression in econometrics , 2000 .

[45]  Michael I. Jordan,et al.  A generalized mean field algorithm for variational inference in exponential families , 2002, UAI.

[46]  V. Strassen Gaussian elimination is not optimal , 1969 .

[47]  Virginia Vassilevska Williams,et al.  Multiplying matrices faster than coppersmith-winograd , 2012, STOC '12.

[48]  Alexander J. Smola,et al.  Stochastic Variance Reduction for Nonconvex Optimization , 2016, ICML.

[49]  P. Massart,et al.  Adaptive estimation of a quadratic functional by model selection , 2000 .

[50]  Zeyuan Allen-Zhu,et al.  Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization , 2018, ICML.

[51]  Alexander Shapiro,et al.  Stochastic Approximation approach to Stochastic Programming , 2013 .

[52]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[53]  Josh Alman,et al.  Limits on All Known (and Some Unknown) Approaches to Matrix Multiplication , 2018, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS).

[54]  Tong Zhang,et al.  A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization , 2016, J. Mach. Learn. Res..

[55]  D. Cox The Regression Analysis of Binary Sequences , 1958 .

[56]  Dominik Csiba,et al.  Data sampling strategies in stochastic algorithms for empirical risk minimization , 2018, 1804.00437.

[57]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[58]  Xue Chen,et al.  Fourier-Sparse Interpolation without a Frequency Gap , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[59]  Tong Zhang,et al.  Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[60]  Mark W. Schmidt,et al.  A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets , 2012, NIPS.

[61]  Michael I. Jordan,et al.  On the Local Minima of the Empirical Risk , 2018, NeurIPS.

[62]  Sham M. Kakade,et al.  Competing with the Empirical Risk Minimizer in a Single Pass , 2014, COLT.

[63]  Zeyuan Allen-Zhu,et al.  Variance Reduction for Faster Non-Convex Optimization , 2016, ICML.

[64]  François Le Gall,et al.  Powers of tensors and fast matrix multiplication , 2014, ISSAC.

[65]  Taiji Suzuki,et al.  Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization , 2017, NIPS.

[66]  Martin J. Wainwright,et al.  Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares , 2014, J. Mach. Learn. Res..

[67]  Vladimir Vapnik,et al.  Principles of Risk Minimization for Learning Theory , 1991, NIPS.

[68]  Piotr Indyk,et al.  Simple and practical algorithm for sparse Fourier transform , 2012, SODA.

[69]  Lin Xiao,et al.  A Proximal Stochastic Gradient Method with Progressive Variance Reduction , 2014, SIAM J. Optim..

[70]  Zeyuan Allen-Zhu,et al.  Natasha 2: Faster Non-Convex Optimization Than SGD , 2017, NeurIPS.

[71]  François Le Gall,et al.  Improved Rectangular Matrix Multiplication using Powers of the Coppersmith-Winograd Tensor , 2017, SODA.

[72]  Yin Tat Lee,et al.  Solving linear programs in the current matrix multiplication time , 2018, STOC.

[73]  F. Bach,et al.  Non-parametric Stochastic Approximation with Large Step sizes , 2014, 1408.0361.

[74]  Shai Shalev-Shwartz,et al.  Stochastic dual coordinate ascent methods for regularized loss , 2012, J. Mach. Learn. Res..

[75]  Yuanyuan Liu,et al.  Variance Reduced Stochastic Gradient Descent with Sufficient Decrease , 2017, ArXiv.

[76]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[77]  Zeyuan Allen-Zhu,et al.  Katyusha: the first direct acceleration of stochastic gradient methods , 2016, J. Mach. Learn. Res..

[78]  Boris Polyak,et al.  Acceleration of stochastic approximation by averaging , 1992 .

[79]  Zeyuan Allen-Zhu,et al.  Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives , 2015, ICML.

[80]  Josh Alman,et al.  Further Limitations of the Known Approaches for Matrix Multiplication , 2017, ITCS.

[81]  Prasad Raghavendra,et al.  Agnostic Learning of Monomials by Halfspaces Is Hard , 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[82]  Mikhail Kapralov,et al.  Sample Efficient Estimation and Recovery in Sparse FFT via Isolation on Average , 2017, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[83]  Shai Shalev-Shwartz,et al.  SDCA without Duality, Regularization, and Individual Convexity , 2016, ICML.