GeDex: A consensus Gene-disease Event Extraction System based on frequency patterns and supervised learning

Motivation: The genetic mechanisms involved in human diseases are fundamental in biomedical research. Several databases with curated associations between genes and diseases have emerged in the last decades. However, because manual curation of the literature is demanding and time-consuming, these databases still lack large amounts of information. Current automatic approaches extract associations by considering each abstract or sentence independently, which can lead to contradictory predictions for the same association. Therefore, there is a need for automatic strategies that provide a literature consensus of gene-disease associations and are not prone to making contradictory predictions.

Results: Here, we present GeDex, an effective and freely available automatic approach to extract consensus gene-disease associations from biomedical literature, based on a predictive model trained with four simple features. As far as we know, it is the only system that reports a single consensus prediction from multiple sentences supporting the same association. We tested our approach on the curated fraction of DisGeNET (F-score 0.77) and validated it on a manually curated dataset, obtaining competitive performance compared to pre-existing methods (F-score 0.74). In addition, we effectively recovered associations from an article collection on chronic pulmonary diseases and found that a large proportion of them is not reported in current databases. Our results demonstrate that GeDex, despite its simplicity, is a competitive tool that can successfully assist the curation of existing databases.

Availability: GeDex is available at https://bitbucket.org/laigen/gedex/src/master/ and can be used as a Docker image at https://hub.docker.com/r/laigen/gedex. The code and data used to generate the figures of this publication are available in the same Bitbucket repository.

Contact: cmendezc@ccg.unam.mx

Supplementary information: Supplementary data are available at bioRxiv online.
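To illustrate what a consensus over multiple sentences could look like, the following is a minimal sketch in Python and not the GeDex implementation. The abstract does not name the four features, so the two frequency features used here (`n_sentences`, `n_abstracts`), the function names, the input tuple layout, and the `min_abstracts` threshold are all assumptions. The threshold rule is only a placeholder for the trained predictive model. The point is that evidence is grouped by gene-disease pair before any decision is made, so each pair receives exactly one prediction.

```python
from collections import defaultdict


def consensus_features(mentions):
    """Aggregate sentence-level mentions into one feature row per pair.

    `mentions` is an iterable of (gene, disease, abstract_id) tuples, one per
    sentence in which the gene and the disease co-occur (hypothetical layout).
    """
    sentence_counts = defaultdict(int)
    abstract_ids = defaultdict(set)
    for gene, disease, abstract_id in mentions:
        pair = (gene, disease)
        sentence_counts[pair] += 1
        abstract_ids[pair].add(abstract_id)
    # Illustrative frequency features only; not the features used by GeDex.
    return {
        pair: {
            "n_sentences": sentence_counts[pair],
            "n_abstracts": len(abstract_ids[pair]),
        }
        for pair in sentence_counts
    }


def predict_consensus(features, min_abstracts=2):
    """Placeholder decision rule standing in for a trained classifier.

    Returns a single boolean per gene-disease pair, so two sentences about
    the same pair can never yield contradictory predictions.
    """
    return {
        pair: row["n_abstracts"] >= min_abstracts
        for pair, row in features.items()
    }


if __name__ == "__main__":
    # Toy input with placeholder names; not real associations.
    toy_mentions = [
        ("GENE_A", "DISEASE_X", "doc1"),
        ("GENE_A", "DISEASE_X", "doc1"),
        ("GENE_A", "DISEASE_X", "doc2"),
        ("GENE_B", "DISEASE_Y", "doc3"),
    ]
    for pair, label in predict_consensus(consensus_features(toy_mentions)).items():
        print(pair, label)
```

In a sentence-by-sentence system, the three `GENE_A` sentences would each be classified separately and could disagree; here they collapse into one feature row and one label.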
