A Geometric Analysis of Neural Collapse with Unconstrained Features

We provide the first global optimization landscape analysis of Neural Collapse, an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported in [27], this phenomenon implies that (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of the last-layer activations collapses to zero. We study the problem under a simplified unconstrained feature model, which isolates the topmost layers from the classifier of the neural network. In this context, we show that the classical cross-entropy loss with weight decay has a benign global landscape, in the sense that the only global minimizers are the Simplex ETFs, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions. Our analysis of the simplified model not only explains what kind of features are learned in the last layer, but also shows why they can be efficiently optimized, matching the empirical observations in practical deep network architectures. These findings have important practical implications. For example, our experiments demonstrate that one may set the feature dimension equal to the number of classes and fix the last-layer classifier to be a Simplex ETF during network training, which reduces memory cost by over 20% on ResNet18 without sacrificing generalization performance. The source code is available at https://github.com/tding1/Neural-Collapse.
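
The fixed-classifier experiment above is easy to reproduce in outline. Below is a minimal PyTorch sketch (not the released code at the repository above; the helper simplex_etf, the module ETFClassifier, and the parameters feat_dim and num_classes are illustrative names) of the standard Simplex ETF construction M = sqrt(K/(K-1)) P (I_K - (1/K) 1_K 1_K^T), where P has orthonormal columns, installed as a frozen, bias-free last layer with the feature dimension set equal to the number of classes K.

```python
import torch
import torch.nn as nn


def simplex_etf(num_classes: int, feat_dim: int) -> torch.Tensor:
    """Return a (num_classes, feat_dim) Simplex ETF classifier matrix.

    Rows have unit norm and pairwise cosine similarity -1/(K-1), where
    K = num_classes. Requires feat_dim >= num_classes.
    """
    assert feat_dim >= num_classes, "ETF construction needs feat_dim >= num_classes"
    K = num_classes
    # Partial orthogonal matrix P of shape (feat_dim, K) with P^T P = I_K.
    P = torch.linalg.qr(torch.randn(feat_dim, K)).Q
    # sqrt(K/(K-1)) * (I_K - (1/K) 11^T): zero-mean, unit-norm class directions.
    M = (K / (K - 1)) ** 0.5 * (torch.eye(K) - torch.ones(K, K) / K)
    return (P @ M).T  # shape (K, feat_dim)


class ETFClassifier(nn.Module):
    """Bias-free last layer whose weights are frozen to a Simplex ETF."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # A buffer is saved and moved with the module but never updated by the optimizer.
        self.register_buffer("weight", simplex_etf(num_classes, feat_dim))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return features @ self.weight.T  # logits of shape (batch, num_classes)


# Feature dimension equal to the number of classes, as in the experiment above.
head = ETFClassifier(feat_dim=10, num_classes=10)
logits = head(torch.randn(32, 10))  # plug in after any backbone's penultimate features
```

Because the ETF weights are registered as a buffer rather than a parameter, they receive no gradient updates and are excluded from the optimizer; only the feature extractor producing the features is trained, which corresponds to the fixed-classifier, reduced-width configuration behind the memory saving reported above.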

[1] Ilya P. Razenshteyn, et al. Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm, 2021, COLT.

[2] Dustin G. Mixon, et al. Neural collapse with unconstrained features, 2020, Sampling Theory, Signal Processing, and Data Analysis.

[3] Andrea Montanari, et al. The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training, 2020, The Annals of Statistics.

[4] Massimiliano Pontil, et al. Reexamining Low Rank Matrix Factorization for Trace Norm Regularization, 2017, Mathematics in Engineering.

[5] Yiping Lu, et al. An Unconstrained Layer-Peeled Perspective on Neural Collapse, 2021, ICLR.

[6] Marc Niethammer, et al. Dissecting Supervised Contrastive Learning, 2021, ICML.

[7] X. Y. Han, et al. Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path, 2021, ICLR.

[8] Zhihui Zhu, et al. Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training, 2021, NeurIPS.

[9] Chi Jin, et al. A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network, 2021, COLT.

[10] S. Mallat, et al. Separation and Concentration in Deep Networks, 2020, ICLR.

[11] Ting Chen, et al. Intriguing Properties of Contrastive Losses, 2020, NeurIPS.

[12] John Wright, et al. Deep Networks and the Multiple Manifold Problem, 2020, ICLR.

[13] Ohad Shamir, et al. Gradient Methods Never Overfit On Separable Data, 2020, J. Mach. Learn. Res.

[14] James M. Rehg, et al. Orthogonal Over-Parameterized Training, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Mert Pilanci, et al. Revealing the Structure of Deep Neural Networks via Convex Duality, 2020, ICML.

[16] Andrea Montanari, et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, 2019, Communications on Pure and Applied Mathematics.

[17] Yingli Tian, et al. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18] Hangfeng He, et al. Layer-Peeled Model: Toward Understanding Well-Trained Deep Neural Networks, 2021, ArXiv.

[19] Yi Ren, et al. Kronecker-factored Quasi-Newton Methods for Convolutional Neural Networks, 2021, ArXiv.

[20] Stefan Steinerberger, et al. Neural Collapse with Cross-Entropy Loss, 2020, ArXiv.

[21] Eduard A. Gorbunov, et al. Recent Theoretical Advances in Non-Convex Optimization, 2020, ArXiv.

[22] E. Weinan, et al. On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers, 2020, ArXiv.

[23] John Wright, et al. Deep Networks from the Principle of Rate Reduction, 2020, ArXiv.

[24] R. Arora, et al. Adversarial Robustness of Supervised Sparse Coding, 2020, NeurIPS.

[25] Michael Elad, et al. Another step toward demystifying deep neural networks, 2020, Proceedings of the National Academy of Sciences.

[26] Vardan Papyan, et al. Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra, 2020, ArXiv.

[27] David L. Donoho, et al. Prevalence of neural collapse during the terminal phase of deep learning training, 2020, Proceedings of the National Academy of Sciences.

[28] Weijie J. Su, et al. Benign Overfitting and Noisy Features, 2020, ArXiv.

[29] John Wright, et al. From Symmetry to Geometry: Tractable Nonconvex Problems, 2020, ArXiv.

[30] Nathan Srebro, et al. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy, 2020, NeurIPS.

[31] Dawei Li, et al. The Global Landscape of Neural Networks: An Overview, 2020, IEEE Signal Processing Magazine.

[32] Chong You, et al. Deep Isometric Learning for Visual Recognition, 2020, ICML.

[33] Matus Telgarsky, et al. Gradient descent follows the regularization path for general losses, 2020, COLT.

[34] Zhihui Zhu, et al. Robust Recovery via Implicit Bias of Discrepant Learning Rates for Double Over-parameterization, 2020, NeurIPS.

[35] Chong You, et al. Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction, 2020, NeurIPS.

[36] Ruo-Yu Sun, et al. Optimization for Deep Learning: An Overview, 2020, Journal of the Operations Research Society of China.

[37] Tongzhou Wang, et al. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, 2020, ICML.

[38] Nadav Cohen, et al. Implicit Regularization in Deep Learning May Not Be Explainable by Norms, 2020, NeurIPS.

[39] Zhihui Zhu, et al. Geometric Analysis of Nonconvex Optimization Landscapes for Overcomplete Learning, 2020, ICLR.

[40] Chong You, et al. Rethinking Bias-Variance Trade-off for Generalization of Neural Networks, 2020, ICML.

[41] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[42] Wei-Cheng Chang, et al. Pre-training Tasks for Embedding-based Large-scale Retrieval, 2020, ICLR.

[43] Zhihui Zhu, et al. Finding the Sparsest Vectors in a Subspace: Theory, Algorithms, and Applications, 2020, ArXiv.

[44] Jeffrey Pennington, et al. Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks, 2020, ICLR.

[45] Demis Hassabis, et al. Improved protein structure prediction using potentials from deep learning, 2020, Nature.

[46] Zhihui Zhu, et al. Exact Recovery of Multichannel Sparse Blind Deconvolution via Gradient Descent, 2020, SIAM J. Imaging Sci.

[47] Boaz Barak, et al. Deep double descent: where bigger models and more data hurt, 2019, ICLR.

[48] Ross B. Girshick, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Pengcheng Zhou, et al. Short-and-Sparse Deconvolution - A Geometric Approach, 2019, ICLR.

[50] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.

[51] Kaifeng Lyu, et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[52] John Wright, et al. Structured Local Optima in Sparse Blind Deconvolution, 2018, IEEE Transactions on Information Theory.

[53] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[54] Qian Qian, et al. The Implicit Bias of AdaGrad on Separable Data, 2019, NeurIPS.

[55] Sanjeev Arora, et al. Implicit Regularization in Deep Matrix Factorization, 2019, NeurIPS.

[56] Francis Bach, et al. Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks, 2019, NeurIPS.

[57] Michael I. Jordan, et al. On Nonconvex Optimization for Machine Learning, 2019, J. ACM.

[58] Daniel Kunin, et al. Loss Landscapes of Regularized Linear Autoencoders, 2019, ICML.

[59] John Wright, et al. Geometry and Symmetry in Short-and-Sparse Deconvolution, 2019, ICML.

[60] Francis Bach, et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[61] Yuan Cao, et al. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks, 2018, ArXiv.

[62] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[63] Yuxin Chen, et al. Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview, 2018, IEEE Transactions on Signal Processing.

[64] Yuanzhi Li, et al. A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[65] Nathan Srebro, et al. Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate, 2018, AISTATS.

[66] Yonina C. Eldar, et al. The Global Optimization Geometry of Shallow Linear Neural Networks, 2018, Journal of Mathematical Imaging and Vision.

[67] Nathan Srebro, et al. Convergence of Gradient Descent on Separable Data, 2018, AISTATS.

[68] Suvrit Sra, et al. Small nonlinearities in activation functions create bad local minima in neural networks, 2018, ICLR.

[69] Zhihui Zhu, et al. Distributed Low-rank Matrix Factorization With Exact Consensus, 2019, NeurIPS.

[70] Tengyuan Liang, et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize, 2018, The Annals of Statistics.

[71] Thomas Laurent, et al. Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global, 2017, ICML.

[72] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks (invited paper), 2018, NeurIPS.

[73] Jascha Sohl-Dickstein, et al. Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks, 2018, ICML.

[74] Nathan Srebro, et al. Implicit Bias of Gradient Descent on Linear Convolutional Networks, 2018, NeurIPS.

[75] R. Srikant, et al. Adding One Neuron Can Eliminate All Bad Local Minima, 2018, NeurIPS.

[76] Nathan Srebro, et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[77] Jorge Nocedal, et al. A Progressive Batching L-BFGS Method for Machine Learning, 2018, ICML.

[78] Meisam Razaviyayn, et al. Learning Deep Models: Critical Points and Local Openness, 2018, ICLR.

[79] Elad Hoffer, et al. Fix your classifier: the marginal value of training the last weight layer, 2018, ICLR.

[80] Ohad Shamir, et al. Spurious Local Minima are Common in Two-Layer ReLU Neural Networks, 2017, ICML.

[81] Raef Bassily, et al. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning, 2017, ICML.

[82] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[83] Suvrit Sra, et al. Global optimality conditions for deep neural networks, 2017, ICLR.

[84] Xiangyu Zhang, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[85] Zhihui Zhu, et al. Global Optimality in Low-Rank Matrix Optimization, 2017, IEEE Transactions on Signal Processing.

[86] Junwei Lu, et al. Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization, 2016, 2018 Information Theory and Applications Workshop (ITA).

[87] Qiuwei Li, et al. The non-convex geometry of low-rank matrix optimization, 2016, Information and Inference: A Journal of the IMA.

[88] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.

[89] Charles R. Johnson, et al. Matrix Analysis, 1985.

[90] Liwei Wang, et al. The Expressive Power of Neural Networks: A View from the Width, 2017, NIPS.

[91] Yi Zheng, et al. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis, 2017, ICML.

[92] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[93] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[94] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[95] John Wright, et al. Complete Dictionary Recovery Over the Sphere II: Recovery by Riemannian Trust-Region Method, 2015, IEEE Transactions on Information Theory.

[96] John Wright, et al. Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture, 2015, IEEE Transactions on Information Theory.

[97] Yonina C. Eldar, et al. Convolutional Phase Retrieval, 2017, NIPS.

[98] Paul Covington, et al. Deep Neural Networks for YouTube Recommendations, 2016, RecSys.

[99] Michael I. Jordan, et al. Gradient Descent Only Converges to Minimizers, 2016, COLT.

[100] Tengyu Ma, et al. Matrix Completion has No Spurious Local Minimum, 2016, NIPS.

[101] Nathan Srebro, et al. Global Optimality of Local Search for Low Rank Matrix Recovery, 2016, NIPS.

[102] Kenji Kawaguchi, et al. Deep Learning without Poor Local Minima, 2016, NIPS.

[103] Anish Shah, et al. Deep Residual Networks with Exponential Linear Unit, 2016, ArXiv.

[104] John Wright, et al. A Geometric Analysis of Phase Retrieval, 2016, 2016 IEEE International Symposium on Information Theory (ISIT).

[105] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[106] John Wright, et al. Finding a Sparse Vector in a Subspace: Linear Sparsity Using Alternating Directions, 2014, IEEE Transactions on Information Theory.

[107] Zhaoran Wang, et al. A Nonconvex Optimization Framework for Low Rank Matrix Estimation, 2015, NIPS.

[108] John Wright, et al. When Are Nonconvex Problems Not Scary?, 2015, ArXiv.

[109] Alexander Cloninger, et al. Provable approximation properties for deep neural networks, 2015, ArXiv.

[110] René Vidal, et al. Global Optimality in Tensor Factorization, Deep Learning, and Beyond, 2015, ArXiv.

[111] Naftali Tishby, et al. Deep learning and the information bottleneck principle, 2015, 2015 IEEE Information Theory Workshop (ITW).

[112] Furong Huang, et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, 2015, COLT.

[113] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[114] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[115] Ryota Tomioka, et al. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2014, ICLR.

[116] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[117] D. Costarelli, et al. Constructive Approximation by Superposition of Sigmoidal Functions, 2013.

[118] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[119] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[120] Geoffrey E. Hinton, et al. Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[121] Pablo A. Parrilo, et al. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization, 2007, SIAM Rev.

[122] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[123] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[124] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.

[125] Renato D. C. Monteiro, et al. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, 2003, Math. Program.

[126] G. Watson, Characterization of the subdifferential of some matrix norms, 1992.

[127] Kurt Hornik, et al. Approximation capabilities of multilayer feedforward networks, 1991, Neural Networks.

[128] Kurt Hornik, et al. Neural networks and principal component analysis: Learning from examples without local minima, 1989, Neural Networks.