Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth

A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of the effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger-capacity (wider or deeper) models. We demonstrate that this block structure arises when model capacity is large relative to the size of the training set, and that it is indicative of the underlying layers preserving and propagating the dominant principal component of their representations. This discovery has important ramifications for the features learned by different models: representations outside the block structure are often similar across architectures of varying widths and depths, but the block structure is unique to each model. We also analyze the output predictions of different model architectures, finding that even when overall accuracy is similar, wide and deep models exhibit distinctive error patterns and variations across classes.
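
For concreteness, the kind of representation comparison described above is typically carried out by computing a similarity score between the activations of every pair of layers and inspecting the resulting heatmap, in which a block structure appears as a large set of mutually similar layers. The sketch below uses linear centered kernel alignment (CKA) as the similarity measure; it is an illustrative sketch rather than the paper's exact implementation, and the function names (`linear_cka`, `similarity_heatmap`) and the NumPy-based layout are assumptions made here for the example.

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment (CKA) between two activation matrices.

    x: array of shape (n_examples, n_features_x), activations of one layer.
    y: array of shape (n_examples, n_features_y), activations of another layer.
    Returns a scalar similarity in [0, 1].
    """
    # Center each feature so the implicit (linear) Gram matrices are centered.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||y^T x||_F^2 is the biased HSIC estimator for linear kernels (up to scaling);
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    cross = np.linalg.norm(y.T @ x) ** 2
    norm_x = np.linalg.norm(x.T @ x)
    norm_y = np.linalg.norm(y.T @ y)
    return cross / (norm_x * norm_y)

def similarity_heatmap(layer_activations):
    """Pairwise CKA between all layers of a model.

    layer_activations: list of (n_examples, n_features_i) arrays collected on a
    fixed batch of inputs. A block structure shows up as a large bright square
    of mutually similar layers in the returned matrix.
    """
    n = len(layer_activations)
    heatmap = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            heatmap[i, j] = linear_cka(layer_activations[i], layer_activations[j])
    return heatmap
```

Applied to activations collected from a wide or deep model on a fixed batch of inputs, a heatmap of this kind is one way to visualize the block structure; the dominant-principal-component interpretation can then be checked by comparing each layer's top principal component across the layers inside the block.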
