Stochastic gradient descent with finite sample sizes

The minimization of empirical risks over finite sample sizes is an important problem in large-scale machine learning. A variety of algorithms have been proposed in the literature to alleviate the computational burden per iteration at the expense of convergence speed and accuracy. Many of these approaches can be interpreted as stochastic gradient descent algorithms, where data is sampled from particular empirical distributions. In this work, we leverage this interpretation and draw from recent results in the field of online adaptation to derive new tight performance expressions for empirical implementations of stochastic gradient descent, mini-batch gradient descent, and importance sampling. The expressions are exact to first order in the step-size parameter and are tighter than existing bounds. We further quantify the performance gained from employing mini-batch solutions, and propose an optimal importance sampling algorithm.
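To make the sampling schemes discussed above concrete, the following is a minimal sketch of stochastic gradient descent over a finite sample for an empirical least-squares risk, covering uniform sampling, mini-batch averaging, and non-uniform (importance) sampling with the usual reweighting that keeps the gradient estimate unbiased. This is not the analysis or the optimal sampling scheme of this work; the function empirical_sgd, the choice of sampling probabilities proportional to the squared row norms, and all parameter values are illustrative assumptions.

import numpy as np

def empirical_sgd(A, b, step, iters, batch=1, probs=None, rng=None):
    """Stochastic gradient descent over a finite sample {(a_n, b_n)}.

    Minimizes the empirical risk (1/N) * sum_n 0.5 * (a_n^T w - b_n)^2.
    Each iteration samples `batch` indices from the empirical distribution
    `probs` (uniform if None) and applies an importance-weighted gradient step.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, M = A.shape
    if probs is None:
        probs = np.full(N, 1.0 / N)
    w = np.zeros(M)
    for _ in range(iters):
        idx = rng.choice(N, size=batch, p=probs)
        # Importance weights 1/(N p_n) keep the stochastic gradient an
        # unbiased estimate of the empirical-risk gradient even when the
        # sampling distribution is non-uniform.
        weights = 1.0 / (N * probs[idx])
        residuals = A[idx] @ w - b[idx]
        grad = (weights * residuals) @ A[idx] / batch
        w -= step * grad
    return w

# Hypothetical usage with synthetic data (all names and settings illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(1000)

# A common importance-sampling heuristic: sample proportionally to the
# squared row norms (per-sample gradient Lipschitz constants). This is a
# standard choice from the literature, not the scheme derived in this paper.
p = np.linalg.norm(A, axis=1) ** 2
p /= p.sum()

w_uniform = empirical_sgd(A, b, step=0.01, iters=5000, batch=1, rng=rng)
w_batch = empirical_sgd(A, b, step=0.01, iters=5000, batch=10, rng=rng)
w_importance = empirical_sgd(A, b, step=0.01, iters=5000, batch=1, probs=p, rng=rng)

Comparing the steady-state error of the three runs (e.g., the residual norm of A @ w - b) illustrates the mini-batch and importance-sampling effects that the paper quantifies analytically.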
