Beegle: from literature mining to disease-gene discovery

Disease-gene identification is a challenging process with multiple applications in functional genomics and personalized medicine. Typically, it involves both finding genes already known to be associated with the disease (through literature search) and carrying out preliminary experiments or screens (e.g. linkage or association studies, copy number analyses, expression profiling) to determine a set of promising candidates for experimental validation. This requires extensive time and monetary resources. We describe Beegle, an online search and discovery engine that attempts to simplify this process by automating the typical approaches. It first mines the literature to quickly extract a set of genes known to be linked with a given query. It then applies the learning methodology of Endeavour (a gene prioritization tool) to train a genomic model and rank a set of candidate genes, generating novel hypotheses. In a realistic evaluation setup, Beegle as a search engine has an average recall of 84% in the top 100 returned genes, which improves the discovery engine by 12.6% in the top 5% of prioritized genes. Beegle is publicly available at http://beegle.esat.kuleuven.be/.
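The recall figure quoted above can be read as "recall within the top k returned genes". The sketch below shows the standard definition of that metric on toy data. The function name `recall_at_k`, the gene labels, and the example lists are hypothetical and are not taken from Beegle. The exact evaluation protocol behind the reported numbers is not described here, so this only illustrates the metric itself.

```python
from typing import Sequence, Set


def recall_at_k(ranked_genes: Sequence[str], known_genes: Set[str], k: int) -> float:
    """Fraction of the known genes that appear among the first k ranked genes."""
    if not known_genes:
        raise ValueError("known_genes must not be empty")
    top_k = set(ranked_genes[:k])
    return len(top_k & known_genes) / len(known_genes)


# Hypothetical toy data (placeholder gene labels, not real results).
ranked = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]
known = {"GENE_B", "GENE_D", "GENE_X"}

recall_top3 = recall_at_k(ranked, known, k=3)  # only GENE_B is in the top 3
recall_top5 = recall_at_k(ranked, known, k=5)  # GENE_B and GENE_D are in the top 5
```

An average recall over many queries would then be the mean of `recall_at_k` across those queries, each with its own ranked list and set of known genes.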
