Revisiting complexity and the bias-variance tradeoff

The recent success of high-dimensional models such as deep neural networks (DNNs) has led many to question the validity of the bias-variance tradeoff principle in high dimensions. We reexamine it with respect to two key choices: the model class and the complexity measure. We argue that failing to suitably specify either one can falsely suggest that the tradeoff does not hold. This observation motivates us to seek a valid complexity measure, defined with respect to a reasonably good class of models. Building on Rissanen's principle of minimum description length (MDL), we propose a novel MDL-based complexity (MDL-COMP). We focus on the context of linear models, which have recently been used as a stylized, tractable approximation to DNNs in high dimensions. MDL-COMP is defined via an optimality criterion over the encodings induced by a good ridge estimator class. We derive closed-form expressions for MDL-COMP and show that for a dataset with $n$ observations and $d$ parameters it is \emph{not always} equal to $d/n$; rather, it is a function of the singular values of the design matrix and the signal-to-noise ratio. For random Gaussian design, we find that while MDL-COMP scales linearly with $d$ in low dimensions ($d < n$), for $d > n$ the scaling is exponentially smaller, growing only as $\log d$. We hope that such slow growth of complexity in high dimensions can help shed light on the good generalization performance of several well-tuned high-dimensional models. Moreover, via an array of simulations and real-data experiments, we show that a data-driven practical variant, Prac-MDL-COMP, can inform hyperparameter tuning for ridge regression in limited-data settings, sometimes improving upon cross-validation.
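To make the role of the singular values and the Prac-MDL-COMP-style hyperparameter selection concrete, below is a minimal Python sketch. It chooses the ridge penalty $\lambda$ by minimizing a two-part codelength: a penalized data-fit term plus a complexity term $\frac{1}{2}\sum_i \log(1 + s_i^2/\lambda)$ over the singular values $s_i$ of the design matrix. The precise objective here is an illustrative assumption in the spirit of Prac-MDL-COMP, not necessarily the paper's exact criterion, and the function names (`prac_mdl_objective`, `select_lambda`) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ridge_estimate(X, y, lam):
    """Closed-form ridge solution: theta_hat = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def prac_mdl_objective(lam, X, y):
    """Illustrative two-part codelength for ridge with penalty lam (assumed form):
    penalized residual fit + a log-determinant complexity term in the singular values."""
    n = X.shape[0]
    theta = ridge_estimate(X, y, lam)
    resid = y - X @ theta
    s = np.linalg.svd(X, compute_uv=False)            # singular values of the design matrix
    complexity = 0.5 * np.sum(np.log1p(s**2 / lam))   # at most min(n, d) nonzero terms
    fit = 0.5 * (resid @ resid + lam * theta @ theta)
    return (fit + complexity) / n

def select_lambda(X, y, lo=1e-6, hi=1e4):
    """Pick lambda minimizing the (assumed) codelength over a log-scale bracket."""
    res = minimize_scalar(
        lambda t: prac_mdl_objective(np.exp(t), X, y),
        bounds=(np.log(lo), np.log(hi)), method="bounded")
    return np.exp(res.x)

# Usage on synthetic data in the overparameterized regime (d > n).
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)
print(f"lambda selected by the MDL-style criterion: {select_lambda(X, y):.4g}")
```

Note that when $d > n$ the design matrix has at most $n$ nonzero singular values, so the complexity term above has at most $n$ summands; this is the mechanism by which an MDL-style complexity can grow far more slowly than $d$ in the overparameterized regime.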
