Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent (a.k.a. the Robbins-Monro algorithm) and a simple modification in which iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or to the setting of the proportionality constant. This situation is remedied by using slower decays together with averaging, which robustly leads to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
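The two algorithms named in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the step-size form `C * n**(-alpha)`, and the synthetic least-squares oracle are all hypothetical choices made for this example. Setting `alpha=1` gives a learning rate proportional to the inverse of the iteration count, while `alpha=0.5` is one example of a slower decay used together with averaging.

```python
import numpy as np


def sgd_with_averaging(grad_estimate, theta0, n_iters, C=1.0, alpha=0.5):
    """Run SGD with step size C * n**(-alpha).

    Returns the last iterate (Robbins-Monro) and the running average
    of all iterates (Polyak-Ruppert averaging).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    theta_bar = np.zeros_like(theta)
    for n in range(1, n_iters + 1):
        gamma = C * n ** (-alpha)            # decaying learning rate
        theta = theta - gamma * grad_estimate(theta)
        theta_bar += (theta - theta_bar) / n  # running mean of iterates 1..n
    return theta, theta_bar


def least_squares_oracle(w_star, noise_std, rng):
    """Hypothetical oracle: unbiased gradient estimates of
    0.5 * E[(<x, theta> - y)^2], with x ~ N(0, I) and y = <x, w_star> + noise.
    """
    def grad_estimate(theta):
        x = rng.standard_normal(w_star.shape[0])
        y = x @ w_star + noise_std * rng.standard_normal()
        return (x @ theta - y) * x
    return grad_estimate


# Assumed toy setting: 5-dimensional least-squares regression.
rng = np.random.default_rng(0)
w_star = rng.standard_normal(5)
oracle = least_squares_oracle(w_star, 0.1, rng)
last, avg = sgd_with_averaging(oracle, np.zeros(5), 20000, C=0.1, alpha=0.5)
print("last-iterate error:", np.linalg.norm(last - w_star))
print("averaged-iterate error:", np.linalg.norm(avg - w_star))
```

Rerunning the sketch with `alpha=1` and different values of `C` is one way to probe the sensitivity to the proportionality constant that the abstract describes; the constants above are illustrative only.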
