The Network Completion Problem: Inferring Missing Nodes and Edges in Networks

Network structures, such as social networks, web graphs, and networks from systems biology, play important roles in many areas of science and in our everyday lives. To study these networks, one first needs to collect reliable large-scale network data. Although social and information networks have become ubiquitous, collecting complete network data remains a challenge. The collected data is often incomplete, with nodes and edges missing. Commonly, only a part of the network can be observed, and we would like to infer the unobserved part. We address this issue by studying the Network Completion Problem: given a network with missing nodes and edges, can we complete the missing part? We cast the problem in the Expectation Maximization (EM) framework: we use the observed part of the network to fit a model of network structure, then estimate the missing part of the network using the model, re-estimate the parameters, and so on. We combine EM with the Kronecker graphs model and design a scalable Metropolized Gibbs sampling approach that allows both estimation of the model parameters and inference about the missing nodes and edges of the network. Experiments on synthetic and several real-world networks show that our approach can effectively recover the network even when about half of the nodes are missing. Our algorithm outperforms not only classical link-prediction approaches but also the state-of-the-art stochastic block modeling approach. Furthermore, our algorithm easily scales to networks with tens of thousands of nodes.
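
To make the "estimate the missing part using the model" step concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a 2x2 initiator matrix with illustrative values, a directed graph with self-loops allowed, and, most importantly, a known mapping from nodes to rows of the Kronecker probability matrix. The approach described above instead samples that mapping with Metropolized Gibbs sampling and re-estimates the parameters. The function names `kronecker_edge_probs` and `sample_missing_part` are hypothetical.

```python
import numpy as np


def kronecker_edge_probs(theta, k):
    """Return the k-th Kronecker power of the initiator matrix theta.

    Entry (u, v) is the probability of edge u -> v in a stochastic
    Kronecker graph with len(theta) ** k nodes.
    """
    theta = np.asarray(theta, dtype=float)
    probs = theta.copy()
    for _ in range(k - 1):
        probs = np.kron(probs, theta)
    return probs


def sample_missing_part(adj, missing, probs, rng):
    """Resample every adjacency entry that touches a missing node.

    Assumption: the node-to-row mapping is known, so edges are
    independent Bernoulli draws with the probabilities in `probs`.
    Entries between two observed nodes are left untouched.
    """
    completed = adj.copy()
    mask = np.zeros(adj.shape, dtype=bool)
    mask[missing, :] = True
    mask[:, missing] = True
    draws = rng.random(adj.shape) < probs
    completed[mask] = draws[mask]
    return completed


# Illustrative initiator values (an assumption, not taken from the paper).
theta = np.array([[0.9, 0.6],
                  [0.6, 0.2]])
probs = kronecker_edge_probs(theta, 3)  # 8 x 8 edge-probability matrix

rng = np.random.default_rng(0)
full = (rng.random(probs.shape) < probs).astype(int)  # ground-truth graph

# Hide two nodes: every entry in their rows and columns becomes unknown.
missing = [6, 7]
observed = full.copy()
observed[missing, :] = 0
observed[:, missing] = 0

completed = sample_missing_part(observed, missing, probs, rng)
```

In a full EM loop, draws like `completed` would feed a parameter update for `theta`, and the two steps would alternate until the estimates stabilize; that update is omitted here.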
