Graph-based features for supervised link prediction

The growing ubiquity of social networks has spurred research in link prediction, which aims to predict new connections from the existing ones in a network. The 2011 IJCNN Social Network Challenge asked participants to separate real edges from fake ones in a set of 8960 edges sampled from an anonymized, directed graph depicting a subset of relationships on Flickr. Our method incorporates 94 distinct graph features, used as input for classification with Random Forests. We present a three-pronged approach to the link prediction task, along with several novel variations on established similarity metrics, and we discuss the challenges of processing a graph with more than a million nodes. The best classification results came from combining a large number of features that model different aspects of the graph structure. Our method achieved an area under the receiver operating characteristic (ROC) curve of 0.9695, the second-best overall score in the competition and the best score among entries that did not de-anonymize the dataset.
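To make the idea of graph features for a candidate edge concrete, the following is a minimal sketch, not the 94-feature set used in our method. It assumes a directed graph given as a plain edge list and computes four well-known neighborhood scores (common neighbors, Jaccard, Adamic-Adar, preferential attachment) for a candidate edge u -> v. The function names `build_adjacency` and `edge_features`, and the choice to compare the out-neighbors of u with the in-neighbors of v, are illustrative assumptions.

```python
import math


def build_adjacency(edges):
    """Return (out_nbrs, in_nbrs) dicts of sets for a directed edge list."""
    out_nbrs, in_nbrs = {}, {}
    for u, v in edges:
        out_nbrs.setdefault(u, set()).add(v)
        in_nbrs.setdefault(v, set()).add(u)
        # Make sure every node has an entry in both maps.
        out_nbrs.setdefault(v, set())
        in_nbrs.setdefault(u, set())
    return out_nbrs, in_nbrs


def edge_features(u, v, out_nbrs, in_nbrs):
    """Illustrative feature vector for the candidate directed edge u -> v."""
    a = out_nbrs.get(u, set())  # nodes that u points to
    b = in_nbrs.get(v, set())   # nodes that point to v
    common = a & b              # intermediaries w with u -> w -> v
    union = a | b

    jaccard = len(common) / len(union) if union else 0.0

    # Adamic-Adar: rarer (low-degree) intermediaries count for more.
    adamic_adar = 0.0
    for w in common:
        degree = len(out_nbrs.get(w, ())) + len(in_nbrs.get(w, ()))
        if degree > 1:
            adamic_adar += 1.0 / math.log(degree)

    preferential_attachment = len(a) * len(b)
    return [len(common), jaccard, adamic_adar, preferential_attachment]


# Toy graph (hypothetical data, for illustration only).
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
out_nbrs, in_nbrs = build_adjacency(edges)
features = edge_features(1, 4, out_nbrs, in_nbrs)
```

Vectors of this kind, computed for known real and fake edges, could then serve as rows of a training matrix for a Random Forest classifier; which features to include and how to label training edges are choices this sketch leaves open.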
