Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

In many applications, one works with neural network models trained by someone else. For such pretrained models, one may not have access to training or test data. Moreover, one may not know details about the model, e.g., the specifics of the training data, the loss function, the hyperparameter values, etc. Given one or many pretrained models, it is a challenge to say anything about the expected performance or quality of the models. Here, we address this challenge by providing a detailed meta-analysis of hundreds of publicly available pretrained models. We examine norm-based capacity control metrics as well as power-law-based metrics from the recently developed Theory of Heavy-Tailed Self-Regularization. We find that norm-based metrics correlate well with reported test accuracies for well-trained models, but that they often cannot distinguish well-trained from poorly trained models. We also find that power-law-based metrics can do much better: quantitatively better at discriminating among a series of well-trained models with a given architecture, and qualitatively better at discriminating well-trained from poorly trained models. These methods can be used to identify when a pretrained neural network has problems that cannot be detected simply by examining training/test accuracies.
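To make the setup concrete, the sketch below illustrates how such data-free metrics might be computed layer by layer for a publicly available pretrained model: the log of the largest eigenvalue of W^T W as a simple norm-based metric, and the power-law exponent alpha fit to the layer's empirical spectral density as a heavy-tailed metric. This is only an illustrative approximation under stated assumptions (numpy, torchvision, and the powerlaw package installed; convolutional kernels naively flattened into 2-D matrices), not the analysis pipeline used in the paper.

```python
# A minimal sketch of the kind of data-free diagnostics described above,
# using off-the-shelf tools (numpy, torchvision, and the `powerlaw`
# package of Alstott et al.).  This is an illustration, not the authors'
# released analysis code: conv kernels are simply flattened here, and the
# paper's treatment of layers and of the power-law fit is more careful.
import numpy as np
import powerlaw
import torchvision.models as models

# Any publicly available pretrained model will do; no data is needed.
model = models.vgg16(weights="IMAGENET1K_V1")   # older torchvision: pretrained=True

for name, param in model.named_parameters():
    if "weight" not in name or param.dim() < 2:
        continue                                  # skip biases and 1-D parameters
    # Flatten conv kernels (out, in, kh, kw) into a 2-D matrix W.
    W = param.detach().cpu().numpy().reshape(param.shape[0], -1)

    # Empirical spectral density: eigenvalues of W^T W are the squared singular values.
    eigs = np.linalg.svd(W, compute_uv=False) ** 2

    log_spectral_norm = np.log10(eigs.max())      # a norm-based metric
    fit = powerlaw.Fit(eigs)                      # heavy-tailed (power-law) fit
    alpha = fit.power_law.alpha                   # smaller alpha ~ heavier-tailed ESD

    print(f"{name:45s} log lambda_max = {log_spectral_norm:6.2f}   alpha = {alpha:5.2f}")
```

The meta-analysis then asks how such per-layer metrics, aggregated over a model, track the model's reported test accuracy, without ever touching training or test data.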
