Active Learning for Networked Data

We introduce a novel active learning algorithm for classification of network data. In this setting, training instances are connected by a set of links to form a network, the labels of linked nodes are correlated, and the goal is to exploit these dependencies to accurately label the nodes. This problem arises in many domains, including social and biological network analysis and document classification, and there has been much recent interest in methods that collectively classify the nodes in the network. Labeled examples are often expensive to obtain, but network information is frequently available, and we show how an active learning algorithm can take advantage of this network structure. Our algorithm effectively exploits the links between instances and the interaction between the local and collective aspects of a classifier to learn accurately from fewer labeled examples. We experiment with two real-world benchmark collective classification domains, and show that we achieve highly accurate results even when only a small fraction of the data is labeled.
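To make the idea concrete, the following is a minimal sketch of an active learning loop over networked data. It is not the paper's algorithm: the threshold-based local classifier, the neighbor-majority collective classifier, and the query heuristic (prefer unlabeled nodes where the local and collective views disagree) are all illustrative assumptions chosen to mirror the abstract's description of exploiting the interaction between the local and collective aspects of a classifier.

```python
def local_predict(features, node):
    # Content-only ("local") classifier: a simple threshold on a scalar
    # feature score. Illustrative stand-in for any attribute-based model.
    return 1 if features[node] >= 0.5 else 0

def collective_predict(adj, labels, node):
    # Relational ("collective") classifier: majority vote over the node's
    # currently labeled neighbors; returns None when no neighbor is labeled.
    votes = [labels[n] for n in adj[node] if n in labels]
    if not votes:
        return None
    return 1 if sum(votes) * 2 >= len(votes) else 0

def select_query(adj, features, labels, unlabeled):
    # Hypothetical query strategy: prefer an unlabeled node whose local and
    # collective predictions disagree -- resolving the conflict is assumed
    # to be the most informative label to acquire.
    best, best_score = None, -1
    for node in sorted(unlabeled):
        loc = local_predict(features, node)
        col = collective_predict(adj, labels, node)
        score = 1 if (col is not None and col != loc) else 0
        if score > best_score:
            best, best_score = node, score
    return best

def active_learn(adj, features, oracle, budget):
    # Acquire `budget` labels from the oracle, then label the rest of the
    # network collectively where possible, falling back to the local view.
    labels, unlabeled = {}, set(adj)
    for _ in range(budget):
        q = select_query(adj, features, labels, unlabeled)
        labels[q] = oracle[q]  # ask the oracle (human annotator) for the label
        unlabeled.discard(q)
    preds = {}
    for node in adj:
        if node in labels:
            preds[node] = labels[node]
        else:
            col = collective_predict(adj, labels, node)
            preds[node] = col if col is not None else local_predict(features, node)
    return preds
```

On a small two-community graph whose labels follow the link structure, three queries chosen this way can suffice to label every node correctly, including a node whose features alone would mislead the local classifier; this is the kind of gain from link structure that the abstract describes.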
