On the importance of single directions for generalization

Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks that generalize and those that do not remain unclear. Additionally, the tuning properties of single directions (defined as the activation of a single unit, or of a linear combination of units, in response to an input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network's reliance on single directions is a good predictor of its generalization performance: across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single-direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
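Two measurements anchor these claims: reliance on single directions, probed by clamping individual units to zero and tracking how quickly accuracy degrades, and class selectivity, a per-unit index of the form (mu_max - mu_-max) / (mu_max + mu_-max) computed over class-conditional mean activations. The NumPy sketch below illustrates both under stated assumptions: the synthetic activations `H`, the least-squares readout `W_out`, and the random ablation order are illustrative stand-ins for a trained network's hidden layer, not the paper's actual experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: H holds hidden-layer activations for
# 1000 inputs over 10 classes, W_out is a linear readout fit to the labels.
# Both are synthetic here purely to keep the sketch self-contained.
n, hidden, classes = 1000, 64, 10
labels = rng.integers(0, classes, size=n)
H = rng.normal(size=(n, hidden)) + np.eye(classes)[labels] @ rng.normal(size=(classes, hidden))
H = np.maximum(H, 0.0)  # ReLU-like non-negative activations
W_out = np.linalg.lstsq(H, np.eye(classes)[labels], rcond=None)[0]

def selectivity(unit_acts, labels, classes):
    """Class selectivity index: (mu_max - mu_-max) / (mu_max + mu_-max),
    where mu_max is the unit's largest class-conditional mean activation
    and mu_-max is its mean activation over all remaining classes."""
    means = np.array([unit_acts[labels == c].mean() for c in range(classes)])
    top = means.argmax()
    mu_max, mu_rest = means[top], np.delete(means, top).mean()
    return (mu_max - mu_rest) / (mu_max + mu_rest + 1e-12)

def accuracy(H, ablated):
    """Classification accuracy with the units in `ablated` clamped to zero."""
    Hm = H.copy()
    if ablated:
        Hm[:, list(ablated)] = 0.0
    return (np.argmax(Hm @ W_out, axis=1) == labels).mean()

# Cumulative single-direction ablation: zero out units one at a time in a
# random order and record accuracy after each step. A curve that falls
# steeply indicates heavy reliance on single directions.
order = rng.permutation(hidden)
curve, ablated = [], set()
for u in order:
    ablated.add(u)
    curve.append(accuracy(H, ablated))

sel = [selectivity(H[:, u], labels, classes) for u in range(hidden)]
print(f"mean selectivity: {np.mean(sel):.3f}; "
      f"accuracy after ablating 50% of units: {curve[hidden // 2]:.3f}")
```

In the paper's experiments this kind of ablation curve, rather than any single unit's selectivity, is what tracks generalization: ablating in order of decreasing selectivity does not hurt much more than ablating in random order, which is the sense in which selectivity is a poor predictor of task importance.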
