Concept-based query expansion for retrieving gene related publications from MEDLINE

Background: Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests.

Results: This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively.

Conclusions: QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
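The weighting idea described in the abstract can be illustrated with a minimal sketch. This is not QuExT's actual implementation: the concept class names, the linear weighted-sum scoring formula, the document identifiers, and the match counts below are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical concept classes and weights; the abstract mentions genes,
# proteins and pathways but does not specify classes or default values.
DEFAULT_WEIGHTS: Dict[str, float] = {"gene": 1.0, "protein": 0.5, "pathway": 0.5}


def score(concept_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of concept matches (an assumed scoring formula)."""
    return sum(weights.get(cls, 0.0) * n for cls, n in concept_counts.items())


def rank(docs: Dict[str, Dict[str, int]],
         weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs, highest score first; ties break on doc_id."""
    scored = [(doc_id, score(counts, weights)) for doc_id, counts in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Made-up counts of concept matches per document, per concept class.
docs = {
    "doc_a": {"gene": 4, "protein": 0, "pathway": 0},
    "doc_b": {"gene": 1, "protein": 0, "pathway": 4},
    "doc_c": {"gene": 0, "protein": 2, "pathway": 1},
}

# A user interested mainly in pathways raises that class's weight.
pathway_weights = {"gene": 0.2, "protein": 0.2, "pathway": 2.0}

for label, weights in [("default", DEFAULT_WEIGHTS), ("pathway focus", pathway_weights)]:
    print(label, [doc_id for doc_id, _ in rank(docs, weights)])
```

With these made-up counts, the default weights place doc_a first, while the pathway-heavy weights move doc_b to the top. The set of retrieved documents stays the same and only their order changes, which is the kind of user control over ranking the abstract describes.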
