Lightlike Neuromanifolds, Occam's Razor and Deep Learning

How do deep neural networks benefit from a very high-dimensional parameter space? Their high complexity versus their stunning generalization performance forms an intriguing paradox. We take an information-theoretic approach and find that the locally varying dimensionality of the parameter space can be studied with the tools of singular semi-Riemannian geometry. We adapt the Fisher information metric to this singular neuromanifold and introduce a new prior that interpolates between Jeffreys' prior and the Gaussian prior. We then derive a minimum description length for a deep learning model, in which the spectrum of the Fisher information matrix plays a key role in reducing the model complexity.
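
The degeneracy alluded to above can be made concrete numerically. The following is a minimal sketch, not the paper's code: it builds a tiny tanh regression network that is singular by construction (one hidden unit's outgoing weight is zeroed, an "eliminated" unit), estimates the Fisher information matrix under a unit-variance Gaussian output model, and inspects its eigenvalue spectrum. The network sizes, the synthetic inputs, and the helper `output_grad` are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): estimate the Fisher
# information matrix of a tiny tanh network and inspect its spectrum.
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression network: 2 inputs -> 8 tanh hidden units -> 1 output.
n_in, n_hidden = 2, 8
W1 = rng.normal(size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=n_hidden)
# "Eliminated" hidden unit: its incoming weights no longer affect the output,
# so the Fisher information matrix is exactly rank-deficient by construction.
w2[0] = 0.0

def output_grad(x):
    """Gradient of the scalar network output w.r.t. all parameters, flattened."""
    h = np.tanh(W1 @ x + b1)
    dh = w2 * (1.0 - h ** 2)        # backprop through the tanh layer
    dW1 = np.outer(dh, x)           # d y / d W1
    db1 = dh                        # d y / d b1
    dw2 = h                         # d y / d w2
    db2 = np.array([1.0])           # d y / d b2
    return np.concatenate([dW1.ravel(), db1, dw2, db2])

# Fisher information under a unit-variance Gaussian output model: F = E[ g g^T ],
# with the expectation taken over a batch of synthetic inputs.
X = rng.normal(size=(256, n_in))
G = np.stack([output_grad(x) for x in X])
F = G.T @ G / len(X)

eigvals = np.linalg.eigvalsh(F)[::-1]   # descending order
print("parameter count:     ", F.shape[0])
print("numerical rank:      ", int(np.sum(eigvals > 1e-10 * eigvals[0])))
print("smallest eigenvalue: ", eigvals[-1])
```

Here the three parameters feeding the eliminated unit never influence the output, so at least three Fisher eigenvalues are exactly zero and the numerical rank falls below the raw parameter count. Roughly speaking, in the MDL picture of the abstract such degenerate (or near-degenerate) directions do not contribute to the complexity term driven by the Fisher information, so the effective complexity is governed by the non-zero part of the spectrum rather than by the number of parameters.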
