New perspectives and methods in link prediction

This paper examines important factors for link prediction in networks and provides a general, high-performance framework for the prediction task. Link prediction in sparse networks presents a significant challenge because of the inherent disproportion between the links that can form and the links that do form. Previous research has typically approached this as an unsupervised problem. While this is not the first work to explore supervised learning, many factors that significantly influence and guide classification remain unexplored. In this paper, we consider these factors by first motivating the use of a supervised framework through a careful investigation of issues such as network observational period, generality of existing methods, variance reduction, topological causes and degrees of imbalance, and sampling approaches. We also present an effective flow-based prediction algorithm, offer formal bounds on imbalance in sparse network link prediction, and employ an evaluation method appropriate for the observed imbalance. Our careful consideration of the above issues ultimately leads to a completely general framework that outperforms unsupervised link prediction methods by more than 30% in AUC.
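To make the scale of the imbalance concrete, the following is a minimal sketch, not the paper's formal bound. It assumes an undirected simple graph and treats every currently unlinked node pair as a candidate link. The function name `imbalance_ratio` and the toy ring network are hypothetical and chosen only for illustration.

```python
def imbalance_ratio(num_nodes, edges):
    """Count unlinked pairs, linked pairs, and their ratio.

    Assumes an undirected simple graph: self-loops are ignored and
    duplicate or reversed edges are counted once.
    """
    links = {frozenset(e) for e in edges if e[0] != e[1]}
    possible_pairs = num_nodes * (num_nodes - 1) // 2
    non_links = possible_pairs - len(links)
    return non_links, len(links), non_links / len(links)


# Toy sparse network (illustrative only): a ring of 1000 nodes.
n = 1000
ring = [(i, (i + 1) % n) for i in range(n)]
non_links, links, ratio = imbalance_ratio(n, ring)
print(f"{non_links} unlinked pairs vs {links} links (ratio {ratio:.1f} to 1)")
```

Because the number of possible pairs grows quadratically with the number of nodes while the number of links in a sparse network grows roughly linearly, the ratio worsens as the network grows. This is one reason accuracy is uninformative here and a ranking-based measure such as AUC is used instead.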
