Semantically linking and browsing PubMed abstracts with gene ontology

Background: Technological advances over the past decade have led to massive progress in the field of biotechnology. This progress is documented in the form of research articles, and PubMed is currently the most widely used repository for biomedical literature. As of 2007, PubMed contained about 17 million abstracts, a volume that calls for methods to efficiently retrieve and browse relevant information. State-of-the-art technologies such as GOPubmed use simple keyword-based techniques to retrieve abstracts from PubMed and link them to the Gene Ontology (GO). This paper introduces a semantics-enabled technique, called SEGOPubmed, that links PubMed to the Gene Ontology for ontology-based browsing. A Latent Semantic Analysis (LSA) framework is used to semantically interface PubMed abstracts with the Gene Ontology.

Results: An empirical analysis compares the performance of SEGOPubmed with that of GOPubmed. The analysis is first performed using a few well-referenced query words. A statistical analysis is then performed using a GO-curated dataset as ground truth. The analysis suggests that SEGOPubmed performs better than the classic GOPubmed because it incorporates semantics.

Conclusions: The LSA technique is applied to the PubMed abstracts retrieved on the basis of the user query and the semantic similarity between the query and the abstracts. The analyses using well-referenced keywords show that the proposed semantics-sensitive technique outperformed string-comparison-based techniques in associating relevant abstracts with GO terms. SEGOPubmed also extracted abstracts in which the keywords do not appear in isolation (i.e., they appear in combination with other terms) and which could not be retrieved by simple term-matching techniques.
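The contrast between keyword matching and LSA-based matching described above can be illustrated with a minimal sketch. This is not the SEGOPubmed implementation: the four "abstracts", the query word, the raw-count weighting, the rank k = 2, and the 0.5 similarity threshold are all assumptions chosen for illustration, and the helper names (`term_vector`, `fold_in`, `cosine`) are hypothetical.

```python
import numpy as np

# Toy stand-ins for PubMed abstracts (invented text, not real records).
abstracts = [
    "apoptosis caspase activation cell death",
    "programmed cell death caspase cascade",
    "glucose transport membrane insulin",
    "insulin signaling glucose uptake",
]
query = "apoptosis"  # stand-in for a GO term or user query word

def tokenize(text):
    return text.lower().split()

# Vocabulary built from the toy corpus.
vocab = sorted({tok for doc in abstracts for tok in tokenize(doc)})
index = {term: i for i, term in enumerate(vocab)}

def term_vector(text):
    """Raw term-count vector over the corpus vocabulary."""
    vec = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in index:
            vec[index[tok]] += 1.0
    return vec

# Term-by-document count matrix (rows: terms, columns: abstracts).
A = np.column_stack([term_vector(doc) for doc in abstracts])

# Truncated SVD: keep the k largest singular triplets (k = 2 is an assumption).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, docs_k = U[:, :k], s[:k], Vt[:k].T  # docs_k: one row per abstract

def fold_in(text):
    """Project a query into the latent space: q^T U_k S_k^{-1}."""
    return term_vector(text) @ U_k / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

q_k = fold_in(query)
lsa_scores = [cosine(q_k, d) for d in docs_k]

# Plain keyword matching: the query token must occur in the abstract.
keyword_hits = [i for i, doc in enumerate(abstracts) if query in tokenize(doc)]
# LSA matching: latent-space cosine similarity above an assumed threshold.
lsa_hits = [i for i, score in enumerate(lsa_scores) if score > 0.5]

print("keyword hits:", keyword_hits)
print("LSA hits:", lsa_hits)
```

In this toy corpus, keyword matching returns only the first abstract, while the latent-space similarity also ranks the second abstract highly even though it never contains the query word, because the two share co-occurring terms. A real system would likely add stemming, stop-word removal, and term weighting before the decomposition.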
