Link Discovery in Graphs Derived from Biological Databases

Public biological databases contain vast amounts of rich data that can also be used to create and evaluate new biological hypotheses. We propose a method for link discovery in biological databases, i.e., for the prediction and evaluation of implicit or previously unknown connections between biological entities and concepts. In our framework, information extracted from available databases is represented as a graph in which vertices correspond to entities and concepts, and edges represent known, annotated relationships between them. A link, i.e., an implicit and possibly unknown relation between two entities, is manifested as a path or a subgraph connecting the corresponding vertices. We propose measures of link goodness based on three factors (edge reliability, relevance, and rarity), and we give these factors a proper probabilistic interpretation. We give practical methods for finding and evaluating links in large graphs and report experimental results with Alzheimer genes and protein interactions.
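To make the idea of a path-based link concrete, the following is a minimal sketch and not the method of the paper. It assumes, purely for illustration, that each edge carries three values in (0, 1] for reliability, relevance, and rarity, that their product can be read as an edge probability, and that the goodness of a path is the product of its edge probabilities. Under those assumptions the best path can be found with Dijkstra's algorithm on negative log-probabilities. The vertex names, the weights, and the functions `add_edge` and `best_path` are all hypothetical.

```python
import heapq
import math


def add_edge(graph, u, v, reliability, relevance, rarity):
    """Add an undirected edge with three illustrative factor values."""
    graph.setdefault(u, {})[v] = (reliability, relevance, rarity)
    graph.setdefault(v, {})[u] = (reliability, relevance, rarity)


def best_path(graph, source, target):
    """Return (probability, path) for the most probable source-target path.

    Assumption: edge probability = reliability * relevance * rarity, and
    path probability = product of edge probabilities. Maximizing a product
    of probabilities equals minimizing the sum of -log(p), so plain
    Dijkstra applies.
    """
    dist = {source: 0.0}   # best known sum of -log(p) per vertex
    prev = {}              # predecessor on the best known path
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, (reliability, relevance, rarity) in graph.get(u, {}).items():
            p = reliability * relevance * rarity
            if p <= 0.0:
                continue  # an edge with zero probability cannot be used
            nd = d - math.log(p)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return math.exp(-dist[target]), path


# Toy graph with made-up entities and weights.
toy = {}
add_edge(toy, "geneA", "proteinX", 0.9, 1.0, 0.8)
add_edge(toy, "proteinX", "geneB", 0.8, 1.0, 0.5)
add_edge(toy, "geneA", "proteinY", 0.6, 1.0, 0.9)
add_edge(toy, "proteinY", "geneB", 0.7, 1.0, 0.9)

probability, path = best_path(toy, "geneA", "geneB")
print(path, round(probability, 4))
```

A single best path is only one way to score a link. The abstract also allows a link to be a subgraph, which would require a reliability-style computation over several paths at once; that is not attempted here.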
