Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks

Computational approaches to generating hypotheses from biomedical literature have been studied intensively in recent years. Nevertheless, automatically discovering novel, cross-silo biomedical hypotheses from large-scale literature repositories remains a challenge. To address this challenge, we first model a biomedical literature repository as a comprehensive network of biomedical concepts and formulate hypothesis generation as a process of link discovery on the concept network. We extract the relevant information from the biomedical literature corpus and generate a concept network and a concept-author map on a cluster using the MapReduce framework. We then extract a set of heterogeneous features, such as random-walk-based features, neighborhood features, and common-author features. Because the number of candidate links to consider is large in our concept network, we address the scalability problem by also extracting these features on a cluster with the MapReduce framework. We further model link discovery as a classification problem carried out on a training data set automatically extracted from two network snapshots taken in two consecutive time periods. We study how this set of heterogeneous features, which covers both topological and semantic features derived from the concept network, affects the accuracy of the proposed supervised link discovery process. A case study of hypothesis generation based on the proposed method is presented in the paper.
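The automated class labeling and two of the feature families described above can be illustrated with a minimal sketch. It is not the paper's implementation: it runs in memory on a toy graph rather than on a MapReduce cluster, it omits the random-walk-based features, and every name in it (`build_adjacency`, `label_pairs`, `pair_features`, the concept identifiers, and the author map) is a hypothetical choice made for illustration. The assumption is that a concept pair unlinked in the earlier snapshot is a positive example if it is linked in the later snapshot and a negative example otherwise.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {concept: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def label_pairs(edges_t1, edges_t2):
    """Label every pair that is unlinked in snapshot 1.

    Label 1: the pair is linked in snapshot 2. Label 0: it is still unlinked.
    (Assumed labeling rule; only concepts present in snapshot 1 are paired.)
    """
    linked_t1 = {frozenset(e) for e in edges_t1}
    linked_t2 = {frozenset(e) for e in edges_t2}
    concepts = sorted(build_adjacency(edges_t1))
    labels = {}
    for u, v in combinations(concepts, 2):
        pair = frozenset((u, v))
        if pair not in linked_t1:
            labels[(u, v)] = 1 if pair in linked_t2 else 0
    return labels


def pair_features(u, v, adj, authors):
    """Compute neighborhood and common-author features for one concept pair."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "common_authors": len(authors.get(u, set()) & authors.get(v, set())),
    }


# Toy data (made up): two snapshots of a concept network and a concept-author map.
edges_t1 = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]
edges_t2 = edges_t1 + [("a", "c")]
authors = {"a": {"x", "y"}, "c": {"y", "z"}}

labels = label_pairs(edges_t1, edges_t2)
adj_t1 = build_adjacency(edges_t1)
features = {pair: pair_features(pair[0], pair[1], adj_t1, authors) for pair in labels}
```

The resulting `features` and `labels` dictionaries share the same keys, so they could be passed as a training set to any off-the-shelf classifier. Features are computed on the earlier snapshot only, which keeps information from the later snapshot out of the inputs.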

[18]  Xiaofeng Wang,et al.  Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule , 2010 .