Train Like a (Var)Pro: Efficient Training of Neural Networks with Variable Projection

Deep neural networks (DNNs) have achieved state-of-the-art performance across a variety of traditional machine learning tasks, e.g., speech recognition, image classification, and segmentation. The ability of DNNs to efficiently approximate high-dimensional functions has also motivated their use in scientific applications, e.g., to solve partial differential equations (PDEs) and to generate surrogate models. In this paper, we consider the supervised training of DNNs, which arises in many of the above applications. We focus on the central problem of optimizing the weights of a given DNN such that it accurately approximates the relation between observed input and target data. Devising effective solvers for this optimization problem is notoriously challenging due to the large number of weights, non-convexity, data sparsity, and the non-trivial choice of hyperparameters. To solve the optimization problem more efficiently, we propose the use of variable projection (VarPro), a method originally designed for separable nonlinear least-squares problems. Our main contribution is the Gauss-Newton VarPro method (GNvpro), which extends the reach of the VarPro idea to non-quadratic objective functions, most notably the cross-entropy loss functions arising in classification. These extensions make GNvpro applicable to all training problems that involve a DNN whose last layer is an affine mapping, which is common in many state-of-the-art architectures. In numerical experiments from classification and surrogate modeling, GNvpro not only solves the optimization problem more efficiently but also yields DNNs that generalize better than those trained with commonly used optimization schemes.
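To make the VarPro idea concrete, the following is a minimal sketch for the classical separable least-squares case only, not the GNvpro method itself. It assumes a hypothetical one-hidden-layer tanh feature extractor, a small Tikhonov regularizer `alpha`, and synthetic data; the function names (`features`, `solve_last_layer`, `reduced_loss`) are illustrative and do not come from the paper. For fixed nonlinear weights `theta`, the weights `W` of the affine last layer solve a linear least-squares problem and can be eliminated, leaving a reduced objective in `theta` alone.

```python
import numpy as np

def features(theta, X, width):
    """Hypothetical feature extractor: tanh(X K + b), plus a column of ones.

    The ones column lets a single matrix W represent the affine last layer.
    """
    d = X.shape[1]
    K = theta[: d * width].reshape(d, width)
    b = theta[d * width:]
    Z = np.tanh(X @ K + b)
    return np.hstack([Z, np.ones((X.shape[0], 1))])

def full_loss(W, Z, Y, alpha=1e-3):
    """Regularized least-squares loss for given last-layer weights W."""
    return 0.5 * np.sum((Z @ W - Y) ** 2) + 0.5 * alpha * np.sum(W ** 2)

def solve_last_layer(Z, Y, alpha=1e-3):
    """Optimal W for fixed features Z via the regularized normal equations."""
    A = Z.T @ Z + alpha * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ Y)

def reduced_loss(theta, X, Y, width, alpha=1e-3):
    """Reduced objective: the loss with W eliminated, as a function of theta only."""
    Z = features(theta, X, width)
    W = solve_last_layer(Z, Y, alpha)
    return full_loss(W, Z, Y, alpha), W

# Synthetic regression data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Y = np.sin(X.sum(axis=1, keepdims=True))

width = 8
theta = rng.standard_normal(3 * width + width)  # entries of K followed by b

loss_vp, W_opt = reduced_loss(theta, X, Y, width)

# Compare against arbitrary last-layer weights for the same features.
Z = features(theta, X, width)
W_rand = rng.standard_normal(W_opt.shape)
loss_rand = full_loss(W_rand, Z, Y)
```

In this sketch an outer optimizer would update only `theta`, calling `reduced_loss` at each step. For non-quadratic losses such as cross-entropy, the inner problem has no closed-form solution and would presumably require an iterative convex solver in place of `solve_last_layer`.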
