The Unexpected Deterministic and Universal Behavior of Large Softmax Classifiers

This paper provides a large-dimensional analysis of the Softmax classifier. We discover and prove that, when the classifier is trained on data satisfying mild statistical modeling assumptions, its weights become deterministic and depend solely on the statistical means and covariances of the data. As a striking consequence, despite the implicit and non-linear nature of the underlying optimization problem, the performance of the Softmax classifier is the same as if it were trained on a mere Gaussian mixture model, thereby challenging the intuition that non-linearities inherently extract advanced statistical features from the data. Our findings are supported both theoretically and numerically, using CNN representations of images produced by GANs.
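The central claim above can be probed numerically: a softmax (multinomial logistic) classifier trained on non-Gaussian data should perform essentially the same as one trained on Gaussian data sharing the same class means and covariances. Below is a minimal sketch of such an experiment; the synthetic two-class data, the dimensions, and all helper names (`non_gaussian_class`, `gaussian_match`, `softmax_accuracy`) are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch: compare a softmax classifier trained on non-Gaussian features
# with one trained on a Gaussian mixture matching the class means/covariances.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 5000, 128                      # samples per class, feature dimension (assumed)

def non_gaussian_class(shift):
    """Concentrated but non-Gaussian features with a class-dependent mean shift."""
    z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, p))   # unit-variance uniform noise
    return np.tanh(z) + shift

shift0, shift1 = 0.0, 2.0 / np.sqrt(p)
X0, X1 = non_gaussian_class(shift0), non_gaussian_class(shift1)

def gaussian_match(X):
    """Gaussian sample with the same empirical mean and covariance as X."""
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n)

G0, G1 = gaussian_match(X0), gaussian_match(X1)
y = np.r_[np.zeros(n), np.ones(n)]

def softmax_accuracy(A0, A1, T0, T1):
    """Train a (soft)max classifier on (A0, A1), report accuracy on (T0, T1)."""
    clf = LogisticRegression(max_iter=1000).fit(np.r_[A0, A1], y)
    return clf.score(np.r_[T0, T1], y)

# Fresh test sets drawn from the original (non-Gaussian) distribution.
T0, T1 = non_gaussian_class(shift0), non_gaussian_class(shift1)

print("trained on original data  :", softmax_accuracy(X0, X1, T0, T1))
print("trained on Gaussian match :", softmax_accuracy(G0, G1, T0, T1))
# Under the paper's result, the two accuracies should be close for large n and p.
```

In this sketch the two printed accuracies are expected to agree up to finite-sample fluctuations, which is the behavior the theorem predicts in the large-dimensional regime.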
