Solving Empirical Risk Minimization in the Current Matrix Multiplication Time

Many convex problems in machine learning and computer science share the same form:
\begin{align*} \min_{x} \sum_{i} f_i(A_i x + b_i), \end{align*}
where the $f_i$ are convex functions on $\mathbb{R}^{n_i}$ with constant $n_i$, $A_i \in \mathbb{R}^{n_i \times d}$, $b_i \in \mathbb{R}^{n_i}$, and $\sum_i n_i = n$. This problem generalizes linear programming and includes many problems in empirical risk minimization. In this paper, we give an algorithm that runs in time
\begin{align*} O^* \big( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2 + 1/6} ) \log (n / \delta) \big), \end{align*}
where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. Note that the runtime depends only logarithmically on the condition numbers and other data-dependent parameters, which are captured by $\delta$. For the current bounds $\omega \sim 2.38$ [Vassilevska Williams'12, Le Gall'14] and $\alpha \sim 0.31$ [Le Gall, Urrutia'18], our runtime $O^* ( n^{\omega} \log (n / \delta))$ matches the current best runtime for solving a dense least squares regression problem, a special case of the problem we consider. Very recently, [Alman'18] proved that all currently known techniques cannot give an $\omega$ below $2.168$, which is larger than our $2 + 1/6$. Our result generalizes the very recent result on solving linear programs in the current matrix multiplication time [Cohen, Lee, Song'19] to a broader class of problems. Our algorithm introduces two concepts that differ from [Cohen, Lee, Song'19]:
$\bullet$ We give a robust deterministic central path method, whereas the previous method uses a stochastic central path that updates the weights by a random sparse vector.
$\bullet$ We propose an efficient data structure to maintain the central path of interior point methods even when the weight update vector is dense.
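To make the problem form concrete, the following is a minimal sketch (not the paper's interior point algorithm) that evaluates the objective $\sum_i f_i(A_i x + b_i)$ when each block has $n_i = 1$ and $f_i$ is the logistic loss, a standard empirical risk minimization instance of the form above. All names in the snippet (erm_objective, A_blocks, etc.) are illustrative.

```python
import numpy as np

# Sketch of the block-structured objective  min_x sum_i f_i(A_i x + b_i)
# with n_i = 1 per block and f_i(t) = log(1 + exp(-t)) (logistic loss).
# This only evaluates the objective; it is not the paper's algorithm.

def erm_objective(x, A_blocks, b_blocks, f_blocks):
    """Evaluate sum_i f_i(A_i x + b_i) for block-structured data."""
    return sum(f(A @ x + b) for A, b, f in zip(A_blocks, b_blocks, f_blocks))

rng = np.random.default_rng(0)
d, n = 5, 20
features = rng.standard_normal((n, d))      # one data point a_i per block
labels = rng.choice([-1.0, 1.0], size=n)    # labels y_i

# For logistic regression, block i is A_i = y_i * a_i^T (a 1 x d matrix),
# b_i = 0, and f_i is the logistic loss.
A_blocks = [labels[i] * features[i][None, :] for i in range(n)]
b_blocks = [np.zeros(1) for _ in range(n)]
f_blocks = [lambda t: np.log1p(np.exp(-t)).item() for _ in range(n)]

x = np.zeros(d)
print(erm_objective(x, A_blocks, b_blocks, f_blocks))  # n * log(2) at x = 0
```

Least squares regression and linear programming fit the same template by choosing quadratic $f_i$ or indicator-style constraint penalties, respectively; the paper's contribution is solving all such problems to relative accuracy $\delta$ in the stated time.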

[1]  Anirban Dasgupta,et al.  Sampling algorithms and coresets for ℓp regression , 2007, SODA '08.

[2]  Yin Tat Lee,et al.  Leverage Score Sampling for Faster Accelerated Regression and ERM , 2017, ALT.

[3]  Dean P. Foster,et al.  Finding Linear Structure in Large Datasets with Scalable Canonical Correlation Analysis , 2015, ICML.

[4]  Mikhail Kapralov,et al.  Sparse fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time , 2016, STOC.

[5]  Yurii Nesterov,et al.  Introductory Lectures on Convex Optimization - A Basic Course , 2014, Applied Optimization.

[6]  Don Coppersmith,et al.  Matrix multiplication via arithmetic progressions , 1987, STOC.

[7]  Eric C. Price,et al.  Sparse recovery and Fourier sampling , 2013 .

[8]  Martin J. Wainwright,et al.  Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence , 2015, SIAM J. Optim..

[9]  Piotr Indyk,et al.  (Nearly) Sample-Optimal Sparse Fourier Transform , 2014, SODA.

[10]  Dean P. Foster,et al.  Faster Ridge Regression via the Subsampled Randomized Hadamard Transform , 2013, NIPS.

[11]  Rong Jin,et al.  Empirical Risk Minimization for Stochastic Convex Optimization: $O(1/n)$- and $O(1/n^2)$-type of Risk Bounds , 2017, COLT.

[12]  Shai Shalev-Shwartz,et al.  Fast Rates for Empirical Risk Minimization of Strict Saddle Problems , 2017, COLT.

[13]  Zeyuan Allen-Zhu,et al.  Natasha: Faster Non-Convex Stochastic Optimization Via Strongly Non-Convex Parameter , 2017, ArXiv.

[14]  G. S. Watson,et al.  Smooth regression analysis , 1964 .

[15]  Vasileios Nakos,et al.  (Nearly) Sample-Optimal Sparse Fourier Transform in Any Dimension; RIPless and Filterless , 2019, 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS).

[16]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[17]  Josh Alman,et al.  Limits on the Universal method for matrix multiplication , 2018, CCC.

[18]  Francis R. Bach,et al.  Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions , 2014, AISTATS.

[19]  Francis Bach,et al.  SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives , 2014, NIPS.

[20]  Yin Tat Lee,et al.  An homotopy method for ℓp regression provably beyond self-concordance and in input-sparsity time , 2018, STOC.

[21]  E. Nadaraya On Estimating Regression , 1964 .

[22]  Zaïd Harchaoui,et al.  Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice , 2017, J. Mach. Learn. Res..

[23]  Mark W. Schmidt,et al.  Minimizing finite sums with the stochastic average gradient , 2013, Mathematical Programming.

[24]  Eric Moulines,et al.  Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning , 2011, NIPS.

[25]  Shai Ben-David,et al.  Understanding Machine Learning: From Theory to Algorithms , 2014 .

[26]  Y. Nesterov A method for solving the convex programming problem with convergence rate O(1/k^2) , 1983 .

[27]  Tong Zhang,et al.  Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization , 2013, Mathematical Programming.

[28]  David P. Woodruff,et al.  Fast Regression with an ℓ∞ Guarantee , 2017 .

[29]  Pravin M. Vaidya,et al.  Speeding-up linear programming using fast matrix multiplication , 1989, 30th Annual Symposium on Foundations of Computer Science.

[30]  Piotr Indyk,et al.  Sample-Optimal Fourier Sampling in Any Constant Dimension , 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[31]  Zaïd Harchaoui,et al.  A Universal Catalyst for First-Order Optimization , 2015, NIPS.

[32]  Zhao Song,et al.  A Robust Sparse Fourier Transform in the Continuous Setting , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[33]  Léon Bottou,et al.  The Tradeoffs of Large Scale Learning , 2007, NIPS.

[34]  Michael I. Jordan,et al.  Non-convex Finite-Sum Optimization Via SCSG Methods , 2017, NIPS.

[35]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[36]  K. Clarkson  Subgradient and sampling algorithms for ℓ1 regression , 2005, SODA '05.

[37]  Yuchen Zhang,et al.  Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization , 2014, ICML.

[38]  Richard Peng,et al.  Iterative Refinement for $\ell_p$-norm Regression , 2019, SODA.

[39]  Yurii Nesterov,et al.  Efficiency of the Accelerated Coordinate Descent Method on Structured Optimization Problems , 2017, SIAM J. Optim..

[40]  Naomi S. Altman,et al.  Quantile regression , 2019, Nature Methods.

[41]  Yuanyuan Liu,et al.  Guaranteed Sufficient Decrease for Variance Reduced Stochastic Gradient Descent , 2017, 1703.06807.

[42]  Piotr Indyk,et al.  Nearly optimal sparse fourier transform , 2012, STOC '12.

[43]  P. Bartlett,et al.  Local Rademacher complexities , 2005, math/0508275.

[44]  Roger Koenker,et al.  Galton, Edgeworth, Frisch, and prospects for quantile regression in econometrics , 2000 .

[45]  Michael I. Jordan,et al.  A generalized mean field algorithm for variational inference in exponential families , 2002, UAI.

[46]  V. Strassen Gaussian elimination is not optimal , 1969 .

[47]  Virginia Vassilevska Williams,et al.  Multiplying matrices faster than coppersmith-winograd , 2012, STOC '12.

[48]  Alexander J. Smola,et al.  Stochastic Variance Reduction for Nonconvex Optimization , 2016, ICML.

[49]  P. Massart,et al.  Adaptive estimation of a quadratic functional by model selection , 2000 .

[50]  Zeyuan Allen-Zhu,et al.  Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization , 2018, ICML.

[51]  Alexander Shapiro,et al.  Stochastic Approximation approach to Stochastic Programming , 2013 .

[52]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[53]  Josh Alman,et al.  Limits on All Known (and Some Unknown) Approaches to Matrix Multiplication , 2018, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS).

[54]  Tong Zhang,et al.  A General Distributed Dual Coordinate Optimization Framework for Regularized Loss Minimization , 2016, J. Mach. Learn. Res..

[55]  D. Cox The Regression Analysis of Binary Sequences , 1958 .

[56]  Dominik Csiba,et al.  Data sampling strategies in stochastic algorithms for empirical risk minimization , 2018, 1804.00437.

[57]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[58]  Xue Chen,et al.  Fourier-Sparse Interpolation without a Frequency Gap , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[59]  Tong Zhang,et al.  Accelerating Stochastic Gradient Descent using Predictive Variance Reduction , 2013, NIPS.

[60]  Mark W. Schmidt,et al.  A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets , 2012, NIPS.

[61]  Michael I. Jordan,et al.  On the Local Minima of the Empirical Risk , 2018, NeurIPS.

[62]  Sham M. Kakade,et al.  Competing with the Empirical Risk Minimizer in a Single Pass , 2014, COLT.

[63]  Zeyuan Allen-Zhu,et al.  Variance Reduction for Faster Non-Convex Optimization , 2016, ICML.

[64]  François Le Gall,et al.  Powers of tensors and fast matrix multiplication , 2014, ISSAC.

[65]  Taiji Suzuki,et al.  Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization , 2017, NIPS.

[66]  Martin J. Wainwright,et al.  Iterative Hessian Sketch: Fast and Accurate Solution Approximation for Constrained Least-Squares , 2014, J. Mach. Learn. Res..

[67]  Vladimir Vapnik,et al.  Principles of Risk Minimization for Learning Theory , 1991, NIPS.

[68]  Piotr Indyk,et al.  Simple and practical algorithm for sparse Fourier transform , 2012, SODA.

[69]  Lin Xiao,et al.  A Proximal Stochastic Gradient Method with Progressive Variance Reduction , 2014, SIAM J. Optim..

[70]  Zeyuan Allen-Zhu,et al.  Natasha 2: Faster Non-Convex Optimization Than SGD , 2017, NeurIPS.

[71]  François Le Gall,et al.  Improved Rectangular Matrix Multiplication using Powers of the Coppersmith-Winograd Tensor , 2017, SODA.

[72]  Yin Tat Lee,et al.  Solving linear programs in the current matrix multiplication time , 2018, STOC.

[73]  F. Bach,et al.  Non-parametric Stochastic Approximation with Large Step sizes , 2014, 1408.0361.

[74]  Shai Shalev-Shwartz,et al.  Stochastic dual coordinate ascent methods for regularized loss , 2012, J. Mach. Learn. Res..

[75]  Yuanyuan Liu,et al.  Variance Reduced Stochastic Gradient Descent with Sufficient Decrease , 2017, ArXiv.

[76]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[77]  Zeyuan Allen-Zhu,et al.  Katyusha: the first direct acceleration of stochastic gradient methods , 2016, J. Mach. Learn. Res..

[78]  Boris Polyak,et al.  Acceleration of stochastic approximation by averaging , 1992 .

[79]  Zeyuan Allen-Zhu,et al.  Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives , 2015, ICML.

[80]  Josh Alman,et al.  Further Limitations of the Known Approaches for Matrix Multiplication , 2017, ITCS.

[81]  Prasad Raghavendra,et al.  Agnostic Learning of Monomials by Halfspaces Is Hard , 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[82]  Mikhail Kapralov,et al.  Sample Efficient Estimation and Recovery in Sparse FFT via Isolation on Average , 2017, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[83]  Shai Shalev-Shwartz,et al.  SDCA without Duality, Regularization, and Individual Convexity , 2016, ICML.