Vector-output ReLU Neural Network Problems are Copositive Programs: Convex Analysis of Two Layer Networks and Polynomial-time Algorithms

We describe the convex semi-infinite dual of the two-layer vector-output ReLU neural network training problem. This semi-infinite dual admits a finite-dimensional representation, but its support is over a convex set that is difficult to characterize. In particular, we demonstrate that the non-convex neural network training problem is equivalent to a finite-dimensional convex copositive program. Our work is the first to identify this strong connection between the global optima of neural networks and those of copositive programs. We thus demonstrate how neural networks implicitly attempt to solve copositive programs via semi-nonnegative matrix factorization, and we draw key insights from this formulation. We describe the first algorithms for provably finding the global minimum of the vector-output neural network training problem; they are polynomial in the number of samples for fixed data rank, yet exponential in the dimension. In the case of convolutional architectures, however, the computational complexity is exponential only in the filter size and polynomial in all other parameters. We characterize the circumstances in which the global optimum of this neural network training problem can be found exactly via soft-thresholded SVD, and we provide a copositive relaxation that is guaranteed to be exact for certain classes of problems and that matches the solution found by stochastic gradient descent in practice.
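The abstract refers to recovering the global optimum via soft-thresholded SVD under certain conditions. As a minimal, illustrative sketch only (the specific matrix to threshold, the regularization level lam, and the conditions under which this coincides with the paper's closed-form solution are assumptions here, not taken from the text), singular-value soft-thresholding can be written in NumPy as:

```python
import numpy as np

def soft_thresholded_svd(M, lam):
    """Shrink each singular value of M toward zero by lam and reconstruct.

    This is the proximal operator of the nuclear norm; the paper's claim is
    that, in specific regimes, the two-layer vector-output ReLU training
    problem admits a global optimum of this closed form.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)   # soft-threshold the singular values
    return (U * s_shrunk) @ Vt            # rescale U's columns and recombine
```

Each singular value is reduced by lam and clipped at zero, so the result is a low-rank, shrunken reconstruction of M; which data matrix plays the role of M in the paper's setting is left unspecified here.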
