Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study

We address the problem of within-network classification in sparsely labeled networks. Recent work has demonstrated success with statistical relational learning (SRL) and semi-supervised learning (SSL) on such problems. However, both approaches rely on the availability of labeled nodes to infer the values of missing labels. When few labels are available, the performance of these approaches can degrade. In addition, many such approaches are sensitive to the specific set of nodes labeled; so, although average performance may be acceptable, the performance on a specific task may not be. We explore a complementary approach to within-network classification, based on the use of label-independent (LI) features - i.e., features calculated without using the values of class labels. While previous work has made some use of LI features, the effects of these features on classification performance have not been extensively studied. Here, we present an empirical study to better understand these effects. Through experiments on several real-world data sets, we show that the use of LI features produces classifiers that are less sensitive to specific label assignments and can lead to performance improvements of over 40% for both SRL- and SSL-based classifiers. We also examine the relative utility of individual LI features and show that, in many cases, it is a combination of a few diverse network-based structural characteristics that is most informative.
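To make the notion of label-independent features concrete, the following is a minimal sketch of the general idea, not the paper's exact pipeline: it computes a few structural features that require no class labels (degree, betweenness centrality, clustering coefficient), combines them with a simple label-dependent neighbor statistic, and trains a random forest on a sparsely labeled toy graph. The feature choices, the karate-club example, and the helper names are illustrative assumptions made here, assuming networkx and scikit-learn are available.

```python
# Sketch only: label-independent (LI) structural features plus one
# label-dependent neighbor statistic, fed to a random forest.
# The specific features and data set are illustrative, not the paper's.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def label_independent_features(G):
    """Per-node features computed without reference to any class labels."""
    degree = dict(G.degree())
    betweenness = nx.betweenness_centrality(G)
    clustering = nx.clustering(G)
    return {n: [degree[n], betweenness[n], clustering[n]] for n in G.nodes()}

def neighbor_label_fraction(G, labels, node, positive_class=1):
    """Label-dependent feature: fraction of labeled neighbors in the positive class."""
    labeled = [labels[v] for v in G.neighbors(node) if v in labels]
    if not labeled:
        return 0.5  # uninformative value when no neighbor is labeled
    return sum(1 for y in labeled if y == positive_class) / len(labeled)

# Toy sparsely labeled network: only a handful of nodes carry known labels.
G = nx.karate_club_graph()
true_labels = {n: int(G.nodes[n]["club"] == "Officer") for n in G.nodes()}
known = {n: true_labels[n] for n in [0, 33, 5, 29]}  # sparse label set

li = label_independent_features(G)
X, y, unlabeled = [], [], []
for n in G.nodes():
    feats = li[n] + [neighbor_label_fraction(G, known, n)]
    if n in known:
        X.append(feats)
        y.append(known[n])
    else:
        unlabeled.append((n, feats))

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
preds = {n: clf.predict([f])[0] for n, f in unlabeled}
accuracy = np.mean([preds[n] == true_labels[n] for n in preds])
print(f"accuracy on unlabeled nodes: {accuracy:.2f}")
```

Because the LI features depend only on network structure, they are unchanged no matter which nodes happen to be labeled, which is the intuition behind the reduced sensitivity to specific label assignments reported above.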
