Optimization of Graph Neural Networks: Implicit Acceleration by Skip Connections and More Depth

Graph Neural Networks (GNNs) have been studied through the lens of expressive power and generalization. However, their optimization properties are less well understood. We take the first step towards analyzing GNN training by studying the gradient dynamics of GNNs. First, we analyze linearized GNNs and prove that, despite the non-convexity of training, convergence to a global minimum at a linear rate is guaranteed under mild assumptions that we validate on real-world graphs. Second, we study what may affect the training speed of GNNs. Our results show that the training of GNNs is implicitly accelerated by skip connections, more depth, and/or a good label distribution. Empirical results confirm that our theoretical findings for linearized GNNs align with the training behavior of nonlinear GNNs. Our results provide the first theoretical support for the success of GNNs with skip connections in terms of optimization, and suggest that deep GNNs with skip connections would be promising in practice.
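
To make the linearized setting concrete, below is a minimal NumPy sketch of a linearized GNN forward pass: each layer multiplies the node representations by a normalized adjacency matrix and a weight matrix, with no nonlinear activation in between. The function names and the residual-style `skip` option are illustrative assumptions for this sketch, not the exact skip-connection architecture analyzed in the paper.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize an adjacency matrix with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    adj = adj + np.eye(adj.shape[0])
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(adj.sum(axis=1)))
    return deg_inv_sqrt @ adj @ deg_inv_sqrt

def linear_gnn(adj, features, weights, skip=False):
    """Forward pass of a linearized GNN (no nonlinearities).

    Without skip connections each layer computes H_l = A_hat @ H_{l-1} @ W_l.
    With skip=True, a residual-style term H_{l-1} @ W_l is added as well; this
    only illustrates how skip connections reuse earlier representations and is
    not necessarily the architecture studied in the paper.
    """
    a_hat = normalize_adjacency(adj)
    h = features
    for w in weights:
        h = a_hat @ h @ w + (h @ w if skip else 0.0)
    return h

# Toy usage: a 4-node graph, 3 input features, two linear layers.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
x = rng.normal(size=(4, 3))
ws = [rng.normal(size=(3, 8)), rng.normal(size=(8, 2))]
print(linear_gnn(adj, x, ws, skip=True).shape)  # (4, 2)
```

Because no nonlinearity separates the layers, the composed weight matrices act as a single linear map on graph-propagated features; this is what makes a global-convergence analysis tractable even though the training objective remains non-convex in the individual weight matrices.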
