Can Subnetwork Structure Be the Key to Out-of-Distribution Generalization?

Can a model with a particular structure avoid being biased towards spurious correlations in out-of-distribution (OOD) generalization? Peters et al. (2016) give a positive answer for the linear case. In this paper, we use a functional modular probing method to analyze deep model structures in the OOD setting. We demonstrate that even biased models (those that focus on spurious correlations) still contain unbiased functional subnetworks. Furthermore, we articulate and empirically demonstrate the functional lottery ticket hypothesis: the full network contains a subnetwork that can achieve better OOD performance. We then propose Modular Risk Minimization (MRM) to solve the subnetwork selection problem. Our algorithm learns the subnetwork structure from a given dataset and can be combined with other OOD regularization methods. Experiments on various OOD generalization tasks corroborate the effectiveness of our method.
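To make the subnetwork-selection idea concrete, the sketch below learns a binary mask over the weights of a frozen, pretrained layer using the Gumbel-Softmax relaxation [83], in the spirit of Modular Risk Minimization. It is a minimal illustration under stated assumptions, not the authors' released implementation: the module and parameter names (MaskedLinear, mask_logits, sparsity_coef) and the exact sparsity penalty are choices made for exposition only.

```python
# Minimal sketch (assumptions noted above) of learning a functional subnetwork
# mask on top of a frozen, pretrained linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Linear layer whose frozen weights are gated by a learnable binary mask."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Pretrained weights are kept fixed; only the mask logits are trained.
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = (None if linear.bias is None
                     else nn.Parameter(linear.bias.detach().clone(), requires_grad=False))
        # One logit per weight, initialised so the expected mask is close to 1 (keep).
        self.mask_logits = nn.Parameter(torch.full_like(self.weight, 2.0))

    def sample_mask(self, tau: float = 0.5) -> torch.Tensor:
        # Relaxed Bernoulli via binary Gumbel-Softmax [83]; hard=True yields
        # 0/1 masks with a straight-through gradient estimate.
        logits = torch.stack([self.mask_logits, -self.mask_logits], dim=-1)
        return F.gumbel_softmax(logits, tau=tau, hard=True)[..., 0]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight * self.sample_mask(), self.bias)


def masked_objective(model: nn.Module, x, y, sparsity_coef: float = 1e-4):
    """Task loss plus a penalty on the expected fraction of kept weights."""
    task_loss = F.cross_entropy(model(x), y)
    keep_prob = torch.cat([torch.sigmoid(m.mask_logits).flatten()
                           for m in model.modules()
                           if isinstance(m, MaskedLinear)])
    return task_loss + sparsity_coef * keep_prob.mean()
```

In such a setup only the mask logits are optimized while the pretrained weights stay fixed; thresholding (or hard-sampling) the learned mask then defines the candidate "functional lottery ticket" subnetwork, which can be retrained and combined with other OOD regularizers as described in the abstract.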

[1] Jonas Peters, et al. Causal inference by using invariant prediction: identification and confidence intervals, 2015, arXiv:1501.01332.

[2] Aleksander Madry, et al. Robustness May Be at Odds with Accuracy, 2018, ICLR.

[3] Vladimir Vapnik, et al. An overview of statistical learning theory, 1999, IEEE Trans. Neural Networks.

[4] Yoshua Bengio, et al. An Analysis of the Adaptation Speed of Causal Models, 2020, AISTATS.

[5] Bernhard Schölkopf, et al. Recurrent Independent Mechanisms, 2021, ICLR.

[6] Joseph D. Janizek, et al. AI for radiographic COVID-19 detection selects shortcuts over signal, 2020, Nature Machine Intelligence.

[7] Roger B. Grosse, et al. Picking Winning Tickets Before Training by Preserving Gradient Flow, 2020, ICLR.

[8] Aaron C. Courville, et al. What Do Compressed Deep Neural Networks Forget?, 2019, arXiv:1911.05248.

[9] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[10] Bernhard Schölkopf, et al. Domain Generalization via Invariant Feature Representation, 2013, ICML.

[11] Masanori Koyama, et al. Out-of-Distribution Generalization with Maximal Invariant Predictor, 2020, arXiv.

[12] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[13] M. Bethge, et al. Shortcut learning in deep neural networks, 2020, Nature Machine Intelligence.

[14] Suyog Gupta, et al. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017, ICLR.

[15] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[16] Yingwei Li, et al. Shape-Texture Debiased Neural Network Training, 2020, ICLR.

[17] Xin Wang, et al. Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization, 2019, ICML.

[18] Andreas Geiger, et al. Counterfactual Generative Networks, 2021, ICLR.

[19] Philip H. S. Torr, et al. SNIP: Single-shot Network Pruning based on Connection Sensitivity, 2018, ICLR.

[20] Max Welling, et al. Learning Sparse Neural Networks through L0 Regularization, 2017, ICLR.

[21] Christopher Joseph Pal, et al. A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms, 2019, ICLR.

[22] Timo Aila, et al. Pruning Convolutional Neural Networks for Resource Efficient Inference, 2016, ICLR.

[23] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[24] Percy Liang, et al. Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately, 2020, FAccT.

[25] Pradeep Ravikumar, et al. The Risks of Invariant Risk Minimization, 2020, ICLR.

[26] Yann LeCun, et al. Gradient-Based Learning Applied to Document Recognition, 1998, Proceedings of the IEEE.

[27] Raquel Urtasun, et al. MLPrune: Multi-Layer Pruning for Automated Neural Network Compression, 2018.

[28] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.

[29] Svetlana Lazebnik, et al. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning, 2017, CVPR 2018.

[30] David Lopez-Paz, et al. In Search of Lost Domain Generalization, 2020, ICLR.

[31] Pietro Perona, et al. Recognition in Terra Incognita, 2018, ECCV.

[32] Percy Liang, et al. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization, 2019, arXiv.

[33] Bernhard Schölkopf, et al. Invariant Models for Causal Transfer Learning, 2015, J. Mach. Learn. Res.

[34] Luigi Gresele, et al. Learning explanations that are hard to vary, 2020, arXiv.

[35] Dan Klein, et al. Neural Module Networks, 2015, CVPR 2016.

[36] Thomas L. Griffiths, et al. Automatically Composing Representation Transformations as a Means for Generalization, 2018, ICLR.

[37] Regina Barzilay, et al. Domain Extrapolation via Regret Minimization, 2020, arXiv.

[38] Yadong Mu, et al. Informative Dropout for Robust Representation Learning: A Shape-bias Perspective, 2020, ICML.

[39] Gintare Karolina Dziugaite, et al. Pruning Neural Networks at Initialization: Why are We Missing the Mark?, 2020, arXiv.

[40] Koby Crammer, et al. A theory of learning from different domains, 2010, Machine Learning.

[41] Koby Crammer, et al. Learning from Multiple Sources, 2006, NIPS.

[42] Richard Zemel, et al. Exchanging Lessons Between Algorithmic Fairness and Domain Generalization, 2020, arXiv.

[43] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, 1992, NIPS.

[44] Nathan Srebro, et al. Does Invariant Risk Minimization Capture Invariance?, 2021, arXiv.

[45] Dana H. Ballard, et al. Modular Learning in Neural Networks, 1987, AAAI.

[46] Matthias Bethge, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 2018, ICLR.

[47] Mingjie Sun, et al. Rethinking the Value of Network Pruning, 2018, ICLR.

[48] Tengyu Ma, et al. In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness, 2020, ICLR.

[49] Bolei Zhou, et al. Places: A 10 Million Image Database for Scene Recognition, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50] Hod Lipson, et al. The evolutionary origins of modularity, 2012, Proceedings of the Royal Society B: Biological Sciences.

[51] Peter Stone, et al. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science, 2017, Nature Communications.

[52] Prateek Jain, et al. The Pitfalls of Simplicity Bias in Neural Networks, 2020, NeurIPS.

[53] Stuart Russell, et al. Neural Networks are Surprisingly Modular, 2020, arXiv.

[54] Eric P. Xing, et al. Learning Robust Representations by Projecting Superficial Statistics Out, 2018, ICLR.

[55] Aaron C. Courville, et al. Systematic generalisation with group invariant predictions, 2021, ICLR.

[56] M. E. J. Newman, et al. Modularity and community structure in networks, 2006, Proceedings of the National Academy of Sciences of the United States of America.

[57] Aaron C. Courville, et al. Out-of-Distribution Generalization via Risk Extrapolation, 2020.

[58] Zenon W. Pylyshyn, et al. Connectionism and cognitive architecture: A critical analysis, 1988, Cognition.

[59] Amit Dhurandhar, et al. Empirical or Invariant Risk Minimization? A Sample Complexity Perspective, 2020, arXiv.

[60] Bernhard Schölkopf, et al. On causal and anticausal learning, 2012, ICML.

[61] Behnam Neyshabur, et al. Understanding the Failure Modes of Out-of-Distribution Generalization, 2021, ICLR.

[62] G. Marcus. Rethinking Eliminative Connectionism, 1998, Cognitive Psychology.

[63] Seong Joon Oh, et al. Learning De-biased Representations with Biased Representations, 2019, ICML.

[64] Fei Chen, et al. Risk Variance Penalization: From Distributional Robustness to Causality, 2020, arXiv.

[65] Samuel Ritter, et al. Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study, 2017, ICML.

[66] Emily Denton, et al. Characterising Bias in Compressed Models, 2020, arXiv.

[67] Matthias Bethge, et al. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet, 2019, ICLR.

[68] Donald A. Adjeroh, et al. Unified Deep Supervised Domain Adaptation and Generalization, 2017, ICCV.

[69] Sjoerd van Steenkiste, et al. Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks, 2020, ICLR.

[70] Percy Liang, et al. An Investigation of Why Overparameterization Exacerbates Spurious Correlations, 2020, ICML.

[71] Aaron C. Courville, et al. Out-of-Distribution Generalization via Risk Extrapolation (REx), 2020, ICML.

[72] Kunio Kashino, et al. Understanding Community Structure in Layered Neural Networks, 2018, Neurocomputing.

[73] Amit Dhurandhar, et al. Invariant Risk Minimization Games, 2020, ICML.

[74] Tommi S. Jaakkola, et al. Invariant Rationalization, 2020, ICML.

[75] Jason Yosinski, et al. Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask, 2019, NeurIPS.

[76] Ullrich Köthe, et al. Learning Robust Models Using The Principle of Independent Causal Mechanisms, 2020, arXiv.

[77] Rob Fergus, et al. Visualizing and Understanding Convolutional Networks, 2013, ECCV.

[78] Aaron C. Courville, et al. Gradient Starvation: A Learning Proclivity in Neural Networks, 2020, NeurIPS.

[79] Michael C. Mozer, et al. Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment, 1988, NIPS.

[80] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[81] Jinwoo Shin, et al. Learning from Failure: Training Debiased Classifier from Biased Classifier, 2020, arXiv.

[82] Xin Dong, et al. Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon, 2017, NIPS.

[83] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.