Lightlike Neuromanifolds, Occam's Razor and Deep Learning

How do deep neural networks benefit from a very high-dimensional parameter space? Their high complexity versus their stunning generalization performance forms an intriguing paradox. We take an information-theoretic approach and find that the locally varying dimensionality of the parameter space can be studied with the tools of singular semi-Riemannian geometry. We adapt the Fisher information metric to this singular neuromanifold and introduce a new prior that interpolates between Jeffreys' prior and the Gaussian prior. We then derive a minimum description length for a deep learning model, in which the spectrum of the Fisher information matrix plays a key role in reducing the model complexity.
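
The degeneracy alluded to above can be made concrete numerically. The following is a minimal sketch, not the paper's code: it builds a tiny tanh regression network that is singular by construction (one hidden unit's outgoing weight is zeroed, an "eliminated" unit), estimates the Fisher information matrix under a unit-variance Gaussian output model, and inspects its eigenvalue spectrum. The network sizes, the synthetic inputs, and the helper `output_grad` are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): estimate the Fisher
# information matrix of a tiny tanh network and inspect its spectrum.
import numpy as np

rng = np.random.default_rng(0)

# Tiny regression network: 2 inputs -> 8 tanh hidden units -> 1 output.
n_in, n_hidden = 2, 8
W1 = rng.normal(size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=n_hidden)
# "Eliminated" hidden unit: its incoming weights no longer affect the output,
# so the Fisher information matrix is exactly rank-deficient by construction.
w2[0] = 0.0

def output_grad(x):
    """Gradient of the scalar network output w.r.t. all parameters, flattened."""
    h = np.tanh(W1 @ x + b1)
    dh = w2 * (1.0 - h ** 2)        # backprop through the tanh layer
    dW1 = np.outer(dh, x)           # d y / d W1
    db1 = dh                        # d y / d b1
    dw2 = h                         # d y / d w2
    db2 = np.array([1.0])           # d y / d b2
    return np.concatenate([dW1.ravel(), db1, dw2, db2])

# Fisher information under a unit-variance Gaussian output model: F = E[ g g^T ],
# with the expectation taken over a batch of synthetic inputs.
X = rng.normal(size=(256, n_in))
G = np.stack([output_grad(x) for x in X])
F = G.T @ G / len(X)

eigvals = np.linalg.eigvalsh(F)[::-1]   # descending order
print("parameter count:     ", F.shape[0])
print("numerical rank:      ", int(np.sum(eigvals > 1e-10 * eigvals[0])))
print("smallest eigenvalue: ", eigvals[-1])
```

Here the three parameters feeding the eliminated unit never influence the output, so at least three Fisher eigenvalues are exactly zero and the numerical rank falls below the raw parameter count. Roughly speaking, in the MDL picture of the abstract such degenerate (or near-degenerate) directions do not contribute to the complexity term driven by the Fisher information, so the effective complexity is governed by the non-zero part of the spectrum rather than by the number of parameters.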
