SINE: Scalable Incomplete Network Embedding

Attributed network embedding aims to learn low-dimensional vector representations for the nodes of a network, where each node carries rich attributes/features describing its content. Because network topology and node attributes often exhibit high correlation, incorporating node attribute proximity into network embedding benefits the learned representations. In reality, large-scale networks often have incomplete or missing node content or linkages, yet existing attributed network embedding algorithms all assume that networks are complete; their performance is therefore vulnerable to missing data, and they suffer from poor scalability. In this paper, we propose a Scalable Incomplete Network Embedding (SINE) algorithm for learning node representations from incomplete graphs. SINE formulates a probabilistic learning framework that separately models node-context and node-attribute relationship pairs. Unlike existing attributed network embedding algorithms, SINE offers the flexibility to make the best use of the information that is available and to mitigate the negative effects of missing information on representation learning. A stochastic gradient descent-based online algorithm is derived to learn node representations, allowing SINE to scale to large networks with high learning efficiency. We evaluate the effectiveness and efficiency of SINE through extensive experiments on real-world networks. The results confirm that SINE outperforms state-of-the-art baselines on node classification, node clustering, and link prediction under settings with missing links and node attributes. SINE is also scalable and efficient on large networks with millions of nodes/edges and high-dimensional node features. The source code of this paper is available at https://github.com/daokunzhang/SINE.
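
The abstract only outlines the learning framework, so the following is a minimal sketch, under assumed details, of how node-context and node-attribute pairs could be modeled separately and trained online with negative-sampled stochastic gradient descent. It is not the authors' released implementation: the class name SineSketch, the helper _sgd_pair, and all hyperparameter defaults are hypothetical illustrations.

```python
# Minimal sketch (assumptions, not the authors' code): two skip-gram-style
# objectives, one over node-context pairs and one over node-attribute pairs,
# updated independently by SGD with negative sampling.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SineSketch:
    def __init__(self, n_nodes, n_attrs, dim=128, lr=0.025, neg=5):
        self.U = rng.normal(scale=0.1, size=(n_nodes, dim))  # node embeddings (output)
        self.C = np.zeros((n_nodes, dim))                     # context-node parameters
        self.A = np.zeros((n_attrs, dim))                     # attribute parameters
        self.lr, self.neg = lr, neg

    def _sgd_pair(self, u, table, j):
        # One positive pair (u, j) plus `neg` uniformly drawn negatives.
        samples = [(j, 1.0)] + [(int(rng.integers(len(table))), 0.0)
                                for _ in range(self.neg)]
        grad_u = np.zeros_like(self.U[u])
        for k, label in samples:
            score = sigmoid(self.U[u] @ table[k])
            g = self.lr * (label - score)        # gradient of the logistic loss
            grad_u += g * table[k]
            table[k] += g * self.U[u]
        self.U[u] += grad_u

    def fit(self, context_pairs, attribute_pairs, epochs=5):
        # context_pairs: (node, co-occurring node), e.g. from random walks;
        # attribute_pairs: (node, attribute id) for observed attributes only.
        for _ in range(epochs):
            for u, v in context_pairs:
                self._sgd_pair(u, self.C, v)     # node-context relationship
            for u, a in attribute_pairs:
                self._sgd_pair(u, self.A, a)     # node-attribute relationship
        return self.U

# Toy usage: four nodes, three attributes, a handful of observed pairs.
model = SineSketch(n_nodes=4, n_attrs=3, dim=8)
embeddings = model.fit(context_pairs=[(0, 1), (1, 0), (1, 2), (2, 1)],
                       attribute_pairs=[(0, 0), (1, 0), (2, 2)])
print(embeddings.shape)  # (4, 8)
```

Because the two pair types are handled by independent updates, a node with missing attributes simply contributes no node-attribute pairs and a node with missing links contributes no node-context pairs; whatever pairs remain still update its embedding, which mirrors the flexibility toward incomplete data described in the abstract.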
