Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research

Background: Current biomedical research needs to leverage the large amount of information reported in scientific publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key for identifying actionable knowledge in free-text repositories. We present BeFree, a system for identifying relationships between biomedical entities, with a special focus on genes and their associated diseases.

Results: By exploiting morpho-syntactic information in the text, BeFree identifies gene-disease, drug-disease and drug-target associations with state-of-the-art performance. Applying BeFree to real-case scenarios shows its effectiveness in extracting information relevant for translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and through integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, integrated with current biomedical knowledge, provided insights into the kind of information that can be found in the literature and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered by BeFree is collected in expert-curated databases. Thus, there is a pressing need for alternatives to manual curation in order to review, prioritize and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications.

Conclusions: BeFree is a novel text mining system that performs competitively in identifying gene-disease, drug-disease and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE yields a large dataset of gene-disease associations, and that only a small proportion of this dataset (2%) is actually recorded in curated resources, which raises several issues regarding data prioritization and curation. We propose joint analysis of text-mined and expert-curated data as a suitable approach both to assess data quality and to highlight novel and interesting information.
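
The comparison between text-mined and curated associations described above can be illustrated with a minimal sketch. It is not the BeFree implementation: the function name `curated_fraction` and the gene and disease identifiers are hypothetical placeholders, and associations are assumed to be representable as simple (gene, disease) pairs.

```python
def curated_fraction(text_mined, curated):
    """Return the fraction of text-mined pairs that also appear in the curated set."""
    text_mined = set(text_mined)
    if not text_mined:
        return 0.0
    return len(text_mined & set(curated)) / len(text_mined)


# Hypothetical placeholder pairs; these are not real associations.
text_mined = {
    ("GENE_A", "DISEASE_X"),
    ("GENE_B", "DISEASE_X"),
    ("GENE_C", "DISEASE_Y"),
    ("GENE_D", "DISEASE_Z"),
}
curated = {
    ("GENE_A", "DISEASE_X"),
    ("GENE_E", "DISEASE_Y"),
}

fraction = curated_fraction(text_mined, curated)

# Pairs found only by text mining are the candidates for review and prioritization.
novel = sorted(text_mined - curated)

print(f"Share of text-mined pairs present in the curated set: {fraction:.0%}")
print(f"Pairs found only by text mining: {len(novel)}")
```

Representing associations as sets of pairs keeps both the overlap and the text-mining-only remainder a single set operation each, under the assumption that gene and disease identifiers have already been normalized to a shared vocabulary.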
