A Geometric Look at Double Descent Risk: Volumes, Singularities, and Distinguishabilities

The double descent risk phenomenon has received growing interest in the machine learning and statistics communities, since it challenges well-understood notions behind the classical U-shaped train–test curves. Motivated by Rissanen's minimum description length (MDL), Balasubramanian's Occam's Razor, and Amari's information geometry, we investigate how the logarithm of the model volume, $\log V$, extends the intuition behind the AIC and BIC model selection criteria. We find that for the particular model classes of isotropic linear regression, statistical lattices, and the stochastic perceptron unit, the $\log V$ term may be decomposed into a sum of distinct components. These components extend the idea of model complexity inherent in AIC and BIC, and are driven by new, albeit intuitive, notions of (i) model richness and (ii) model distinguishability. Our theoretical analysis helps explain how the double descent phenomenon may manifest, as well as why generalization error need not continue to grow with increasing model dimensionality.
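
For context, here is a minimal sketch of where a $\log V$ term arises, following the standard MDL/razor expansions of Rissanen [41] and Balasubramanian [31, 33]; this is presumably the quantity the abstract denotes by $\log V$, and the notation is generic ($\mathcal{M}$ is the model family, $\Theta$ its parameter space, $g(\theta)$ the Fisher information metric, $d$ the parameter count, $n$ the sample size, $\hat{\theta}$ the maximum-likelihood estimate). Up to curvature corrections at $\hat{\theta}$, the stochastic complexity of data $D$ under $\mathcal{M}$ admits the asymptotic form
$$
-\log p(D \mid \mathcal{M}) \;\approx\; -\log p(D \mid \hat{\theta}) \;+\; \frac{d}{2}\log\frac{n}{2\pi} \;+\; \underbrace{\log \int_{\Theta} \sqrt{\det g(\theta)}\, d\theta}_{\log V} \;+\; o(1).
$$
The first two terms recover a BIC-style penalty that grows with the parameter count $d$, whereas $\log V$ is the Riemannian (Fisher) volume of the model manifold and, following [31], can be read as a count of statistically distinguishable distributions the family can realize; this is the term the abstract decomposes into richness and distinguishability contributions.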

[1] Roman Vershynin, et al. High-Dimensional Probability, 2018.

[2] Frank Nielsen, et al. Lightlike Neuromanifolds, Occam's Razor and Deep Learning, 2019, arXiv.

[3] Levent Sagun, et al. A jamming transition from under- to over-parametrization affects generalization in deep learning, 2018, Journal of Physics A: Mathematical and Theoretical.

[4] Koji Tsuda, et al. Legendre decomposition for tensors, 2018, NeurIPS.

[5] Alan Agresti, et al. Categorical Data Analysis, 2003.

[6] F. Opitz. Information geometry and its applications, 2012, 9th European Radar Conference.

[7] Philip M. Long, et al. Benign overfitting in linear regression, 2019, Proceedings of the National Academy of Sciences.

[8] K. Hofmann, et al. Continuous Lattices and Domains, 2003.

[9] H. Akaike, et al. Information Theory and an Extension of the Maximum Likelihood Principle, 1973.

[10] G. Schwarz. Estimating the Dimension of a Model, 1978.

[11] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[12] C. E. Shannon. A Mathematical Theory of Communication, 1948.

[13] Geoffrey E. Hinton, et al. A Learning Algorithm for Boltzmann Machines, 1985, Cognitive Science.

[14] Tengyu Ma, et al. Optimal Regularization Can Mitigate Double Descent, 2020, ICLR.

[15] Jorma Rissanen, et al. Stochastic Complexity in Learning, 1995, Journal of Computer and System Sciences.

[16] Taiji Suzuki, et al. Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint, 2020, ICLR.

[17] Maxim Raginsky, et al. Information-theoretic analysis of generalization capability of learning algorithms, 2017, NIPS.

[18] Vladimir Vapnik. The Nature of Statistical Learning Theory, 1995.

[19] David Tse, et al. Fundamentals of Wireless Communication, 2005.

[20] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[21] A. Barron, et al. Jeffreys' prior is asymptotically least favorable under entropy risk, 1994.

[22] Levent Sagun, et al. Scaling description of generalization with number of parameters in deep learning, 2019, Journal of Statistical Mechanics: Theory and Experiment.

[23] Boaz Barak, et al. Deep double descent: where bigger models and more data hurt, 2019, ICLR.

[24] Mikhail Belkin, et al. To understand deep learning we need to understand kernel learning, 2018, ICML.

[25] Shun-ichi Amari, et al. Information geometry of neural network—an overview, 1997.

[26] Koji Tsuda, et al. Information decomposition on structured space, 2016, IEEE International Symposium on Information Theory (ISIT).

[27] Shun-ichi Amari, et al. Information geometry on hierarchy of probability distributions, 2001, IEEE Transactions on Information Theory.

[28] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[29] Andrea Montanari, et al. Linearized two-layers neural networks in high dimension, 2019, The Annals of Statistics.

[30] Christopher M. Bishop. Pattern Recognition and Machine Learning, 2006, Springer.

[31] Vijay Balasubramanian, et al. A Geometric Formulation of Occam's Razor for Inference of Parametric Distributions, 1996, adap-org/9601001.

[32] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.

[33] V. Balasubramanian. MDL, Bayesian Inference and the Geometry of the Space of Probability Distributions, 2006.

[34] Trevor Hastie, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009, Springer.

[35] Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory, 2009.

[36] Amiel Feinstein, et al. Information and information stability of random variables and processes, 1964.

[37] H. Jeffreys. An invariant form for the prior probability in estimation problems, 1946, Proceedings of the Royal Society of London, Series A: Mathematical and Physical Sciences.

[38] Shun-ichi Amari, et al. Methods of Information Geometry, 2000.

[39] Brian A. Davey, et al. An Introduction to Lattices and Order, 1989.

[40] Koji Tsuda, et al. Tensor Balancing on Statistical Manifold, 2017, ICML.

[41] Jorma Rissanen, et al. Fisher information and stochastic complexity, 1996, IEEE Transactions on Information Theory.

[42] C. R. Rao, et al. Information and the Accuracy Attainable in the Estimation of Statistical Parameters, 1992.

[43] Florent Krzakala, et al. Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime, 2020, ICML.

[44] Emre Telatar, et al. Capacity of Multi-antenna Gaussian Channels, 1999, European Transactions on Telecommunications.

[45] Levent Sagun, et al. The jamming transition as a paradigm to understand the loss landscape of deep neural networks, 2018, Physical Review E.