Grain: Improving Data Efficiency of Graph Neural Networks via Diversified Influence Maximization

Data selection methods, such as active learning and core-set selection, are useful tools for improving the data efficiency of deep learning models on large-scale datasets. However, recent deep learning models have moved beyond independent and identically distributed data to graph-structured data, such as social networks, e-commerce user-item graphs, and knowledge graphs. This evolution has led to the emergence of Graph Neural Networks (GNNs), which go beyond the models that existing data selection methods are designed for. We therefore present Grain, an efficient framework that opens up a new perspective by connecting data selection in GNNs with social influence maximization. By exploiting the common patterns of GNNs, Grain introduces a novel feature propagation concept, a diversified influence maximization objective with novel influence and diversity functions, and a greedy algorithm with an approximation guarantee into a unified framework. Empirical studies on public datasets demonstrate that Grain significantly improves both the performance and the efficiency of data selection (including active learning and core-set selection) for GNNs. To the best of our knowledge, this is the first attempt to bridge two largely parallel threads of research, data selection and social influence maximization, in the setting of GNNs, paving new ways for improving data efficiency.

PVLDB Reference Format: Wentao Zhang, Zhi Yang, Yexin Wang, Yu Shen, Yang Li, Liang Wang, Bin Cui. Grain: Improving Data Efficiency of Graph Neural Networks via Diversified Influence Maximization. PVLDB, 14(11): 2473-2482, 2021. doi:10.14778/3476249.3476295

PVLDB Availability Tag: The source code of this research paper has been made publicly available at https://github.com/zwt233/Grain.
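To give a concrete sense of the greedy scheme such an objective admits, below is a minimal Python sketch of the standard greedy maximization of a monotone submodular set function (a coverage-style influence term plus a diversity bonus), which carries the classic (1 - 1/e) approximation guarantee. The propagation routine, influence sets, and diversity function here are illustrative assumptions, not Grain's actual implementation, whose influence and diversity functions are defined over propagated node features.

    import numpy as np

    def propagate_features(adj, feats, k=2):
        """Toy k-step feature propagation (assumption: dense, row-normalized
        adjacency, in the spirit of SGC-style models): X_hat = A_norm^k @ X."""
        deg = adj.sum(axis=1, keepdims=True)
        a_norm = adj / np.clip(deg, 1.0, None)
        x_hat = feats
        for _ in range(k):
            x_hat = a_norm @ x_hat
        return x_hat

    def score(selected, influence_sets, diversity_fn):
        """Illustrative diversified-influence objective: coverage of the
        selected seeds' influence sets plus a diversity bonus. If both terms
        are monotone submodular, the greedy loop below inherits the classic
        (1 - 1/e) approximation guarantee."""
        covered = (set().union(*(influence_sets[v] for v in selected))
                   if selected else set())
        return len(covered) + diversity_fn(selected)

    def greedy_select(candidates, budget, influence_sets, diversity_fn):
        """Standard greedy maximization: repeatedly add the node with the
        largest marginal gain until the labeling budget is exhausted."""
        selected = []
        for _ in range(budget):
            base = score(selected, influence_sets, diversity_fn)
            best = max((v for v in candidates if v not in selected),
                       key=lambda v: score(selected + [v], influence_sets,
                                           diversity_fn) - base)
            selected.append(best)
        return selected

    if __name__ == "__main__":
        # Toy usage on hypothetical data: each node influences itself
        # and its one-hop neighbors.
        adj = np.array([[0, 1, 1, 0],
                        [1, 0, 0, 1],
                        [1, 0, 0, 0],
                        [0, 1, 0, 0]], dtype=float)
        influence_sets = {v: {v} | set(np.nonzero(adj[v])[0]) for v in range(4)}
        diversity = lambda s: len(s)  # placeholder; Grain uses feature-space diversity
        print(greedy_select(range(4), budget=2,
                            influence_sets=influence_sets, diversity_fn=diversity))

The design point this sketch illustrates is that once influence and diversity are cast as a single monotone submodular objective, node selection reduces to budgeted greedy maximization, so the approximation guarantee comes for free from Nemhauser et al.'s classic analysis.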
