Random Barzilai-Borwein step size for mini-batch algorithms

Abstract Mini-batch algorithms, a well-studied and highly popular approach in stochastic optimization, are favored by practitioners because they accelerate training through better use of parallel processing power and reduced stochastic variance. However, mini-batch algorithms often rely on either a diminishing step size or a step size tuned by hand, which in practice can be time-consuming. In this paper, we propose using an improved Barzilai–Borwein (BB) method to automatically compute step sizes for the state-of-the-art mini-batch semi-stochastic gradient descent (mS2GD) method, which yields a new algorithm: mS2GD-RBB. We prove that mS2GD-RBB converges at a linear rate for strongly convex objective functions. To further validate the efficacy and scalability of the improved BB method, we also incorporate it into another modern mini-batch algorithm, the Accelerated Mini-Batch Prox SVRG (Acc-Prox-SVRG) method. In a machine learning context, numerical experiments on three benchmark data sets indicate that the proposed methods outperform several advanced stochastic optimization methods.
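
To give a concrete sense of how a Barzilai–Borwein step size can be plugged into a mini-batch stochastic gradient loop, the sketch below illustrates the general idea in Python/NumPy on a ridge-regularized least-squares problem. This is not the paper's mS2GD-RBB or its Acc-Prox-SVRG variant: the use of plain mini-batch SGD, the per-epoch full gradient used to form the BB pair, the curvature safeguard, and all constants are illustrative assumptions.

```python
# Illustrative sketch (not the paper's mS2GD-RBB): mini-batch SGD on a
# ridge-regularized least-squares problem, with a Barzilai-Borwein (BB1)
# step size recomputed once per epoch from epoch snapshots of the iterate
# and the full gradient. Safeguards and constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 1e-3
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)

def full_grad(x):
    # Gradient of (1/2n)||Ax - b||^2 + (lam/2)||x||^2
    return A.T @ (A @ x - b) / n + lam * x

def minibatch_grad(x, idx):
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx) + lam * x

x = np.zeros(d)
step = 0.1                      # initial step size (assumed)
batch, epochs = 50, 30
x_prev, g_prev = None, None

for epoch in range(epochs):
    g = full_grad(x)            # reference gradient for the BB pair (assumed design choice)
    if x_prev is not None:
        s, y = x - x_prev, g - g_prev
        sy = s @ y
        if sy > 1e-12:          # keep the previous step if the curvature estimate is unusable
            step = (s @ s) / sy # BB1 step size: s^T s / (s^T y)
    x_prev, g_prev = x.copy(), g.copy()

    for _ in range(n // batch):
        idx = rng.choice(n, size=batch, replace=False)
        x -= step * minibatch_grad(x, idx)

    print(f"epoch {epoch:2d}  step {step:.4f}  ||grad|| {np.linalg.norm(full_grad(x)):.3e}")
```

The BB1 choice approximates the inverse curvature along the most recent displacement, which is what lets the step size adapt without hand tuning; in fully stochastic settings the pair (s, y) is typically built from averaged or snapshot quantities to control noise, and the paper's randomized BB rule within mS2GD presumably refines how this pair is formed and bounded.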
