Retrieval of Relevant and Non-redundant Nodes

We discuss the problem of discovering interesting nodes in networks. We adapt a generic model for choosing relevant and non-redundant pieces of information to networks and probabilistic relations. The model assumes that one or more query nodes have been given; the problem is to identify other nodes that are relevant with respect to the query nodes but non-redundant with respect to each other. Negative query nodes can also be specified. This contrasts with mainstream graph mining, where one typically looks for frequent patterns rather than interesting individuals. We consider two instances of the model: one where node proximity (and relevance) is measured by the shortest path, and one where the graph is probabilistic or uncertain and proximity reflects the probability that the nodes are connected. The generic model also has a simple parameterization that allows for different behaviors with respect to query nodes. We compare different similarity measures and empirically evaluate two algorithms on two applications: social networks and biomedical networks. The results indicate that the model and methods are useful for finding a relevant and non-redundant set of nodes.
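To make the selection problem concrete, the following is a minimal sketch of the shortest-path instance. It is not the paper's algorithm: it swaps in a greedy, MMR-style selection rule, and the proximity function 1/(1 + hop distance), the averaging over query nodes, the trade-off weight `lam`, and all function names are assumptions made for illustration only.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def proximity(dist, v):
    """Assumed similarity: 1 / (1 + hops); unreachable nodes get 0."""
    return 1.0 / (1.0 + dist[v]) if v in dist else 0.0


def select_nodes(graph, positive, negative=(), k=2, lam=0.7):
    """Greedily pick k nodes that are close to the positive query nodes,
    far from the negative ones, and not close to nodes already picked."""
    dist = {u: bfs_distances(graph, u) for u in graph}

    def relevance(v):
        pos = sum(proximity(dist[q], v) for q in positive) / len(positive)
        neg = 0.0
        if negative:
            neg = sum(proximity(dist[q], v) for q in negative) / len(negative)
        return pos - neg

    # Query nodes themselves are never returned; sorting makes ties deterministic.
    candidates = sorted(set(graph) - set(positive) - set(negative))
    selected = []
    while candidates and len(selected) < k:

        def score(v):
            # Redundancy is the proximity to the closest already-selected node.
            redundancy = max((proximity(dist[s], v) for s in selected), default=0.0)
            return lam * relevance(v) - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy graph: a, b, c are all neighbors of the query node q,
# and a and b are also neighbors of each other.
example_graph = {
    "q": ["a", "b", "c"],
    "a": ["q", "b"],
    "b": ["q", "a"],
    "c": ["q"],
}

print(select_nodes(example_graph, positive=["q"], k=2, lam=0.7))  # ['a', 'c']
print(select_nodes(example_graph, positive=["q"], k=2, lam=1.0))  # ['a', 'b']
```

In the toy graph all three candidates are equally relevant to `q`. With `lam=0.7` the second pick is `c` rather than `b`, because `b` is adjacent to the already selected `a`; with `lam=1.0` redundancy is ignored and `b` is returned. For the probabilistic instance, `proximity` could in principle be replaced by an estimate of the probability that two nodes are connected, which this sketch does not attempt.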
