Predicting Disease-Gene Associations using Cross-Document Graph-based Features

In the context of personalized medicine, text mining methods are an attractive option for identifying disease-gene associations, because they can generate novel links between diseases and genes that complement knowledge from structured databases. The most straightforward way to extract such links from text is to assume an association between every gene and disease that co-occur within the same document. However, this approach (i) tends to yield many spurious associations, (ii) does not distinguish between the different relevant types of association, and (iii) cannot aggregate knowledge that is spread across documents. We therefore propose an approach in which disease-gene co-occurrences and gene-gene interactions are represented in an RDF graph. A machine learning classifier is trained on features extracted from this graph to separate disease-gene pairs into valid associations and spurious ones. On the manually curated Genetic Testing Registry, our approach yields a 30-point increase in F1 score over a plain co-occurrence baseline.
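The contrast between the plain co-occurrence baseline and graph-derived, cross-document features can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the interaction list, the feature names, and the use of plain Python dictionaries in place of an RDF graph are all assumptions made for illustration, and no classifier is trained here.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy corpus: gene and disease mentions per document (illustrative only).
DOCS = {
    "doc1": {"genes": {"BRCA1"}, "diseases": {"breast cancer"}},
    "doc2": {"genes": {"BRCA1", "BARD1"}, "diseases": set()},
    "doc3": {"genes": {"BARD1"}, "diseases": {"breast cancer"}},
    "doc4": {"genes": {"TP53"}, "diseases": {"breast cancer", "influenza"}},
}

# Hypothetical gene-gene interaction edges (assumed, not taken from the paper).
INTERACTIONS = {("BRCA1", "BARD1")}


def cooccurrence_counts(docs):
    """Count, for each (gene, disease) pair, the documents mentioning both.

    The plain baseline treats every pair with a count above zero as an association.
    """
    counts = defaultdict(int)
    for mentions in docs.values():
        for gene, disease in product(mentions["genes"], mentions["diseases"]):
            counts[(gene, disease)] += 1
    return dict(counts)


def build_adjacency(interactions):
    """Turn undirected interaction edges into a gene -> neighbours mapping."""
    adjacency = defaultdict(set)
    for gene_a, gene_b in interactions:
        adjacency[gene_a].add(gene_b)
        adjacency[gene_b].add(gene_a)
    return adjacency


def pair_features(gene, disease, counts, adjacency):
    """Compute two illustrative features for one candidate pair.

    direct:         documents in which the gene and the disease co-occur
    via_neighbours: co-occurrences of the disease with genes that interact
                    with this gene, i.e. evidence gathered across documents
    """
    direct = counts.get((gene, disease), 0)
    via_neighbours = sum(
        counts.get((neighbour, disease), 0)
        for neighbour in adjacency.get(gene, ())
    )
    return {"direct": direct, "via_neighbours": via_neighbours}


counts = cooccurrence_counts(DOCS)
adjacency = build_adjacency(INTERACTIONS)
for gene, disease in sorted(counts):
    print(gene, disease, pair_features(gene, disease, counts, adjacency))
```

In this toy data the baseline accepts all four co-occurring pairs, including TP53 and influenza. The neighbour-based feature additionally records that BRCA1 and breast cancer are supported by a document that never mentions BRCA1, which is the kind of cross-document signal a trained classifier could weigh.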
