Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited

Neural networks appear to have mysterious generalization properties when using parameter counting as a proxy for complexity. Indeed, neural networks often have many more parameters than there are data points, yet still provide good generalization performance. Moreover, when we measure generalization as a function of parameters, we see double descent behaviour, where the test error decreases, increases, and then decreases again. We show that many of these properties become understandable when viewed through the lens of effective dimensionality, which measures the dimensionality of the parameter space determined by the data. We relate effective dimensionality to posterior contraction in Bayesian deep learning, model selection, width-depth tradeoffs, double descent, and functional diversity in loss surfaces, leading to a richer understanding of the interplay between parameters and functions in deep models. We also show that effective dimensionality compares favourably to alternative norm- and flatness-based generalization measures.
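As a minimal sketch of the quantity at play, the snippet below computes effective dimensionality from the eigenvalues of a curvature matrix (e.g. the Hessian of the loss), assuming the standard definition N_eff(H, z) = Σ_i λ_i / (λ_i + z) with a regularization constant z; the eigenvalue spectrum and the constant here are illustrative, not taken from the paper.

```python
import numpy as np

def effective_dimensionality(eigenvalues, z):
    """Effective dimensionality N_eff(H, z) = sum_i lam_i / (lam_i + z),
    computed from the eigenvalues of a Hessian (or Gauss-Newton) matrix.

    eigenvalues: array of non-negative curvature eigenvalues.
    z: regularization constant (e.g. a prior precision in a Bayesian setting).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(lam / (lam + z)))

# Toy spectrum: a few large eigenvalues and many near-zero ones.
# Only directions with lam >> z contribute, so N_eff stays far below the raw
# parameter count even though the model has 1000 parameters.
spectrum = np.concatenate([np.array([100.0, 50.0, 10.0]), 1e-4 * np.ones(997)])
print(effective_dimensionality(spectrum, z=1.0))  # roughly 3 out of 1000
```

The toy example illustrates the paper's central point: the number of parameters the data actually determines can be orders of magnitude smaller than the number of parameters in the model.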
