Mining strong relevance between heterogeneous entities from unstructured biomedical data

Huge volumes of biomedical text discussing different biomedical entities are generated every day. Hidden in these unstructured data are strong relevance relationships between those entities, which are critical for many applications, including building knowledge bases for the biomedical domain and semantic search among biomedical entities. In this paper, we study the problem of discovering strong relevance between heterogeneously typed biomedical entities from massive biomedical text data. We first build an entity correlation graph from the data, in which the collection of paths linking two heterogeneous entities offers rich semantic context for their relationship, especially those paths that follow the patterns of the top-$k$ meta paths inferred from the data. Guided by these meta paths, we design a novel relevance measure, named $\mathsf{EntityRel}$, to compute the strong relevance between two heterogeneous entities. Our intuition is that two entities of heterogeneous types are strongly relevant if they have strong direct links, or if they are closely linked to other strongly relevant heterogeneous entities along paths following the selected patterns. We provide experimental results on mining strong relevance between drugs and diseases. More than 20 million MEDLINE abstracts and 5 types of biological entities (Drug, Disease, Compound, Target, MeSH) are used to construct the entity correlation graph. A prototype drug search engine for disease queries is implemented. Extensive comparisons are made against multiple state-of-the-art methods on Drug–Disease relevance discovery.
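The intuition above (strong direct links, plus close links through intermediate entities along selected meta paths) can be illustrated with a minimal sketch. This is not the $\mathsf{EntityRel}$ definition, which the abstract does not give; it is a simplified stand-in that adds the direct edge weight to a decayed sum of edge-weight products over meta-path instances. The toy graph, its weights, the `decay` parameter, and the function names `build_adjacency`, `path_score`, and `toy_relevance` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy entity correlation graph (names and weights are made up).
NODE_TYPE = {
    "aspirin": "Drug",
    "ibuprofen": "Drug",
    "COX-1": "Target",
    "COX-2": "Target",
    "inflammation": "Disease",
}

EDGES = [
    ("aspirin", "COX-1", 0.9),
    ("aspirin", "COX-2", 0.6),
    ("ibuprofen", "COX-2", 0.8),
    ("COX-1", "inflammation", 0.5),
    ("COX-2", "inflammation", 0.7),
    ("aspirin", "inflammation", 0.4),  # direct Drug-Disease link
]


def build_adjacency(edges):
    """Build an undirected weighted adjacency map: node -> {neighbor: weight}."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def path_score(adj, node_type, source, target, meta_path):
    """Sum of edge-weight products over all paths from source to target
    whose sequence of node types matches meta_path."""
    if node_type[source] != meta_path[0]:
        return 0.0
    frontier = {source: 1.0}
    for next_type in meta_path[1:]:
        reached = defaultdict(float)
        for node, score in frontier.items():
            for neighbor, weight in adj[node].items():
                if node_type[neighbor] == next_type:
                    reached[neighbor] += score * weight
        frontier = reached
    return frontier.get(target, 0.0)


def toy_relevance(adj, node_type, source, target, meta_paths, decay=0.5):
    """Direct link weight plus a decayed sum of meta-path scores.
    An illustrative stand-in, not the paper's measure."""
    direct = adj[source].get(target, 0.0)
    indirect = sum(
        path_score(adj, node_type, source, target, mp) for mp in meta_paths
    )
    return direct + decay * indirect


adj = build_adjacency(EDGES)
meta_paths = [["Drug", "Target", "Disease"]]
for drug in ("aspirin", "ibuprofen"):
    score = toy_relevance(adj, NODE_TYPE, drug, "inflammation", meta_paths)
    print(drug, round(score, 3))
```

In this toy graph, aspirin scores higher than ibuprofen for the query "inflammation" because it has both a direct link and two Drug-Target-Disease paths, whereas ibuprofen is connected only through a single target.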
