A Multi-Batch L-BFGS Method for Machine Learning

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update its Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm on a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.
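
The abstract does not spell out the stabilization mechanism, so the sketch below illustrates one natural realization of the idea: consecutive batches are drawn with a shared overlap, and the curvature pair (s_k, y_k) is formed from gradients evaluated on that same overlap at w_k and w_{k+1}, so the difference y_k compares gradients of the same sample loss. This is a minimal sketch, not the authors' implementation; the names (`two_loop_recursion`, `multi_batch_lbfgs`, and a user-supplied `grad(w, idx)` returning the gradient over the samples indexed by `idx`) are illustrative assumptions.

```python
import numpy as np


def two_loop_recursion(g, pairs):
    """Standard L-BFGS two-loop recursion: apply the inverse-Hessian
    approximation defined implicitly by the curvature pairs (s_i, y_i)
    to the gradient g, without ever forming a matrix."""
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):                     # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if pairs:                                        # standard initial scaling
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), a in zip(pairs, reversed(alphas)):   # oldest pair first
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q += (a - b) * s
    return q                                         # approximates H_k * g


def multi_batch_lbfgs(w, grad, n, batch_size, overlap_size,
                      steps=100, lr=0.5, mem=10, seed=0):
    """Sketch of multi-batch L-BFGS: the batch S_k changes every iteration,
    but consecutive batches share an overlap O_k, and the curvature pair
    (s_k, y_k) uses gradients evaluated on that same overlap at w_k and
    w_{k+1}, keeping y_k a consistent gradient difference."""
    rng = np.random.default_rng(seed)
    pairs = []
    batch = rng.choice(n, size=batch_size, replace=False)
    for _ in range(steps):
        d = two_loop_recursion(grad(w, batch), pairs)    # direction H_k g_k
        w_new = w - lr * d
        # Draw the next batch so that it overlaps the current one.
        overlap = rng.choice(batch, size=overlap_size, replace=False)
        rest = np.setdiff1d(np.arange(n), overlap)
        fresh = rng.choice(rest, size=batch_size - overlap_size, replace=False)
        # Curvature pair from gradients on the shared samples only.
        s = w_new - w
        y = grad(w_new, overlap) - grad(w, overlap)
        if s @ y > 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):
            pairs.append((s, y))                     # skip update otherwise
            pairs = pairs[-mem:]                     # keep a limited memory
        w, batch = w_new, np.concatenate([overlap, fresh])
    return w
```

In this sketch the overlap gradients are recomputed for clarity; since the overlap is a subset of both batches, in practice its gradient at w_k can be obtained as a by-product of the full-batch gradient evaluation, so the extra cost per iteration is only the overlap gradient at w_{k+1}. The update is skipped when the curvature s_k^T y_k is not sufficiently positive, a common safeguard in quasi-Newton methods.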
