Towards Deepening Graph Neural Networks: A GNTK-based Optimization Perspective

Graph convolutional networks (GCNs) and their variants have achieved great success in dealing with graph-structured data. However, it is well known that deep GCNs suffer from the over-smoothing problem, where node representations tend to become indistinguishable as more layers are stacked. Theoretical research on deep GCNs to date has focused primarily on expressive power rather than trainability, an optimization perspective. Compared to expressivity, trainability addresses a more fundamental question: given a sufficiently expressive space of models, can a gradient descent-based optimizer successfully find a good solution? This work fills this gap by exploiting the Graph Neural Tangent Kernel (GNTK), which governs the optimization trajectory of wide GCNs under gradient descent. We characterize the asymptotic behavior of the GNTK in the large-depth limit, which reveals that the trainability of wide and deep GCNs drops at an exponential rate during optimization. We further extend our theoretical framework to analyze residual-connection-like techniques, which we find can only mildly mitigate the exponential decay of trainability. To overcome the exponential decay problem more fundamentally, we propose Critical DropEdge, a connectivity-aware and graph-adaptive sampling method inspired by our theoretical insights on trainability. Experimental evaluation consistently confirms that the proposed method achieves better results than relevant counterparts in both the infinite-width and finite-width settings.
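To make the GNTK machinery concrete, below is a minimal sketch, in plain NumPy, of a GNTK-style recursion for a ReLU GCN on a single graph with a symmetrically normalized adjacency. It is an illustration under standard assumptions (the arc-cosine kernel maps for ReLU and aggregation applied to both the covariance and the tangent kernel), not the authors' implementation; the function names `relu_kernels` and `gcn_ntk` and the exact placement of the aggregation step are illustrative choices. Running it with increasing `depth` is one simple way to inspect how the node-level kernel behaves as layers are stacked, which is the regime the trainability analysis above concerns.

```python
import numpy as np

def relu_kernels(sigma):
    """Arc-cosine kernel maps for ReLU: returns E[phi(u)phi(v)] and
    E[phi'(u)phi'(v)] for (u, v) drawn from a centered Gaussian with
    covariance `sigma` (standard NNGP/NTK formulas)."""
    diag = np.sqrt(np.diag(sigma))
    outer = np.outer(diag, diag)
    # Correlation, clipped for numerical safety before arccos.
    corr = np.clip(sigma / (outer + 1e-12), -1.0, 1.0)
    theta = np.arccos(corr)
    k = (outer / (2 * np.pi)) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
    k_dot = (np.pi - theta) / (2 * np.pi)
    return k, k_dot

def gcn_ntk(adj, features, depth):
    """Minimal GNTK-style recursion for a depth-layer ReLU GCN on one graph.

    adj:      (n, n) adjacency matrix
    features: (n, d) node feature matrix
    depth:    number of propagation layers

    Returns the (n, n) node-level tangent kernel after `depth` layers.
    """
    n = adj.shape[0]
    # Symmetrically normalized adjacency with self-loops, as in standard GCNs.
    a_hat = adj + np.eye(n)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(1)))
    a_hat = d_inv_sqrt @ a_hat @ d_inv_sqrt

    sigma = features @ features.T   # input covariance (layer-0 NNGP kernel)
    theta = sigma.copy()            # layer-0 tangent kernel
    for _ in range(depth):
        # Neighborhood aggregation acts on both kernels.
        sigma = a_hat @ sigma @ a_hat.T
        theta = a_hat @ theta @ a_hat.T
        # ReLU transformation of the covariance and its derivative kernel.
        k, k_dot = relu_kernels(sigma)
        theta = theta * k_dot + k
        sigma = k
    return theta
```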
