How Neural Networks Extrapolate: From Feedforward to Graph Neural Networks

We study how neural networks trained by gradient descent extrapolate, i.e., what they learn outside the support of the training distribution. Previous works report mixed empirical results when extrapolating with neural networks: while multilayer perceptrons (MLPs) do not extrapolate well in certain simple tasks, Graph Neural Networks (GNNs), structured networks with MLP modules, have shown some success in more complex tasks. Working towards a theoretical explanation, we identify conditions under which MLPs and GNNs extrapolate well. First, we quantify the observation that ReLU MLPs quickly converge to linear functions along any direction from the origin, which implies that ReLU MLPs do not extrapolate most non-linear functions well. However, they can provably learn a linear target function when the training distribution is sufficiently "diverse". Second, in connection to analyzing the successes and limitations of GNNs, these results suggest a hypothesis for which we provide theoretical and empirical evidence: the success of GNNs in extrapolating algorithmic tasks to new data (e.g., larger graphs or edge weights) relies on encoding task-specific non-linearities in the architecture or features.
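To build intuition for the first claim, the sketch below (not from the paper; the quadratic target, training range, and network width are illustrative assumptions) trains a small ReLU MLP on y = x^2 over [-1, 1] and then queries it far outside that range, where its predictions grow roughly linearly along each direction rather than quadratically.

```python
# Minimal sketch, assuming PyTorch, a 2-layer ReLU MLP, and an illustrative
# quadratic target y = x^2 with training data restricted to [-1, 1].
import torch
import torch.nn as nn

torch.manual_seed(0)

# Training data inside the support: x in [-1, 1], target y = x^2.
x_train = torch.linspace(-1.0, 1.0, 256).unsqueeze(1)
y_train = x_train ** 2

model = nn.Sequential(nn.Linear(1, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_train), y_train)
    loss.backward()
    opt.step()

# Probe far outside the training support: along each direction from the
# origin the fitted ReLU MLP behaves approximately linearly, so it badly
# underestimates the quadratic target as |x| grows.
with torch.no_grad():
    x_test = torch.tensor([[2.0], [4.0], [8.0], [16.0]])
    preds = model(x_test)
    print(torch.cat([x_test, preds, x_test ** 2], dim=1))
```

Running this, the gap between the second column (MLP prediction) and the third column (true x^2) widens rapidly with |x|, consistent with the convergence-to-linearity observation quantified in the paper.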
