Active Sampling of Networks

Network classification typically assumes that all edges are known when computing the joint distribution of the instances in the network. That is, for any instance, its neighbors and their attributes are known. This holds in social networks such as Facebook, where a person’s friends are known and an attribute of the person can be predicted from the descriptions of those friends. In other domains, however, relationship information may not be available for every node, whether because of privacy or legal restrictions or because determining a node’s connections carries a cost. For example, it is unreasonable to expect access to the phone records of an entire population when attempting to identify a handful of individuals involved in illegal or fraudulent activities. We refer to this problem domain as Active Sampling: instances’ labels and edges are acquired through an iterative process in order to identify a handful of instances in a network. In this work, we develop this problem domain formally and present methods for estimating the probability that an instance is positively labeled using only the previously acquired samples. Furthermore, we extend our methods to allow for collective inference and learned priors, and we demonstrate the robustness of the techniques on two synthetic and two real-world datasets.

Appearing in Proceedings of the Workshop on Mining and Learning with Graphs (MLG-2012), Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
