Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training

Significance

The remarkable development of deep learning over the past decade relies heavily on sophisticated heuristics and tricks. To better exploit its potential in the coming decade, perhaps a rigorous framework for reasoning about deep learning is needed, which, however, is not easy to build due to the intricate details of neural networks. For near-term purposes, a practical alternative is to develop a mathematically tractable surrogate model that nevertheless maintains many characteristics of neural networks. This paper proposes a model of this kind, which we term the Layer-Peeled Model. The effectiveness of this model is evidenced by, among other things, its ability to reproduce a known empirical pattern and to predict a hitherto-unknown phenomenon when training deep-learning models on imbalanced datasets.

Abstract

In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep-learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which, in part, explains the recently discovered phenomenon of neural collapse [V. Papyan, X. Y. Han, D. L. Donoho, Proc. Natl. Acad. Sci. U.S.A. 117, 24652–24663 (2020)]. More importantly, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto-unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep-learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon was first predicted by the Layer-Peeled Model before being confirmed by our computational experiments.
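
For concreteness, here is a sketch of the kind of optimization program the Layer-Peeled Model refers to. The notation below (class sizes n_k, feature dimension p, norm budgets E_W and E_H) and the exact form of the constraints are assumptions made for illustration, based on the description above rather than quoted verbatim. Only the last-layer features h_{k,i} and the classifier W = [w_1, ..., w_K]^T are retained as free variables; everything below the top layer is "peeled off" and summarized by a norm budget on the features:

\[
\min_{W,\,H}\ \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{n_k}\mathcal{L}\!\left(W h_{k,i},\, y_k\right)
\quad\text{subject to}\quad
\frac{1}{K}\sum_{k=1}^{K}\lVert w_k\rVert_2^2 \le E_W,
\qquad
\frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k}\sum_{i=1}^{n_k}\lVert h_{k,i}\rVert_2^2 \le E_H,
\]

where \mathcal{L} is the cross-entropy loss and N = n_1 + \cdots + n_K. In the class-balanced case (n_1 = \cdots = n_K), the statement that any solution forms a simplex equiangular tight frame means that the optimal classifier vectors (and the matching class-mean features) have equal lengths and equal, maximally separated pairwise angles:

\[
\lVert w_1\rVert = \cdots = \lVert w_K\rVert,
\qquad
\frac{\langle w_k, w_{k'}\rangle}{\lVert w_k\rVert\,\lVert w_{k'}\rVert} = -\frac{1}{K-1}
\quad\text{for all } k \neq k'.
\]

Under imbalance, Minority Collapse is the opposite behavior for the minority classes: as the ratio of majority to minority class sizes grows, the minority classifier vectors draw closer together, and in the limit the angles between them shrink to zero, which is the fundamental limit on minority-class performance mentioned above.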
