Competing with the Empirical Risk Minimizer in a Single Pass

In many estimation problems, such as linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function, and we aim to do so using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the $M$-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties:

* The algorithm can be implemented in linear time with a single pass over the observed data, using space linear in the size of a single sample.
* The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors.
* The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.
* The algorithm is easily parallelizable.

Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.
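To make the single-pass, constant-space setting concrete, here is a minimal sketch of a streaming estimator in this spirit: stochastic gradient descent with iterate (Polyak-Ruppert style) averaging on a least-squares objective. This is an illustrative example under assumed conditions, not a reproduction of the paper's exact algorithm or step-size schedule; the function name and parameters are hypothetical.

```python
import numpy as np

def streaming_averaged_sgd(samples, dim, step_size=0.05):
    """Single-pass SGD with iterate averaging (illustrative sketch).

    `samples` yields (x, y) pairs for the least-squares objective
    0.5 * E[(x @ w - y)^2]. Each sample is touched exactly once,
    and the space used is O(dim), independent of the sample count.
    """
    w = np.zeros(dim)      # current SGD iterate
    w_bar = np.zeros(dim)  # running average of all iterates so far
    for t, (x, y) in enumerate(samples, start=1):
        grad = (x @ w - y) * x       # unbiased estimate of the gradient
        w -= step_size * grad        # SGD step
        w_bar += (w - w_bar) / t     # online update of the iterate average
    return w_bar

# Usage: recover a linear model from a stream of noisy samples.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
X = rng.normal(size=(20000, 2))
y = X @ w_true + 0.1 * rng.normal(size=20000)
w_hat = streaming_averaged_sgd(zip(X, y), dim=2)
```

Returning the average `w_bar` rather than the final iterate `w` is what makes this style of estimator statistically competitive: averaging smooths out the noise of the individual SGD steps.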
