Optimization Methods for Large-Scale Machine Learning

This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research: techniques that diminish noise in the stochastic directions, and methods that make use of second-order derivative approximations.
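For concreteness, the following is a minimal sketch of the basic mini-batch SG iteration that the abstract refers to, applied to an empirical risk of the form R_n(w) = (1/n) Σ_i f_i(w). This is an illustrative example only, not the authors' implementation: the interface `grad_fi`, the least-squares loss, the decaying step-size schedule, and all parameter values are assumptions chosen for the sketch.

```python
import numpy as np

def sgd(grad_fi, w0, n, alpha0=0.1, decay=1e-2, batch_size=32, epochs=10, seed=0):
    """Minimal mini-batch stochastic gradient sketch.

    grad_fi(w, idx) is assumed to return the average gradient of the
    component losses f_i over the sampled indices `idx`.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    k = 0
    for _ in range(epochs):
        for _ in range(n // batch_size):
            idx = rng.integers(0, n, size=batch_size)  # sample a mini-batch with replacement
            alpha_k = alpha0 / (1.0 + decay * k)       # diminishing step size
            w -= alpha_k * grad_fi(w, idx)             # SG step
            k += 1
    return w

# Usage on synthetic least-squares data, f_i(w) = 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=1000)

def grad_fi(w, idx):
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

w_hat = sgd(grad_fi, np.zeros(5), n=1000)
```

The noise-reduction and second-order methods surveyed in the paper (e.g., variance-reduced gradients or stochastic quasi-Newton updates) modify the direction used in the update step above while keeping the same sampled-gradient structure.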
