Optimization Methods for Supervised Machine Learning: From Linear Models to Deep Learning

The goal of this tutorial is to introduce key models, algorithms, and open questions related to the use of optimization methods for solving problems arising in machine learning. It is written with an INFORMS audience in mind, specifically those readers who are familiar with the basics of optimization algorithms but less familiar with machine learning. We begin by deriving a formulation of a supervised learning problem and show how it leads to various optimization problems, depending on the context and underlying assumptions. We then discuss some of the distinctive features of these optimization problems, focusing on the examples of logistic regression and the training of deep neural networks. The latter half of the tutorial focuses on optimization algorithms, first for convex logistic regression, for which we discuss first-order methods, the stochastic gradient method, variance-reducing stochastic methods, and second-order methods. Finally, we discuss how these approaches can be applied to the training of deep neural networks, emphasizing the difficulties that arise from the complex, nonconvex structure of these models.
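To make the flavor of the algorithms discussed concrete, the following is a minimal sketch (not taken from the tutorial) of the stochastic gradient method applied to regularized logistic regression; the step size, regularization constant, and synthetic data are illustrative assumptions, not recommendations from the source.

```python
# Illustrative sketch: stochastic gradient method for L2-regularized
# logistic regression with labels y_i in {-1, +1}.
# Hyperparameters (step, reg, epochs) are hypothetical choices for the example.
import numpy as np

def stochastic_gradient_logreg(X, y, step=0.1, reg=1e-3, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Gradient of log(1 + exp(-y_i * w^T x_i)) + (reg/2) * ||w||^2
            margin = y[i] * X[i].dot(w)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin)) + reg * w
            w -= step * grad  # one stochastic gradient step per sample
    return w

if __name__ == "__main__":
    # Synthetic data purely for demonstration.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    y = np.sign(X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200))
    print(stochastic_gradient_logreg(X, y))
```

Variance-reducing methods (e.g., SVRG- or SAGA-type schemes) and second-order methods discussed in the tutorial modify the per-sample update above, but the overall loop structure is similar.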
