GOClonto: An ontological clustering approach for conceptualizing PubMed abstracts

Concurrent with progress in the biomedical sciences, an overwhelming volume of textual knowledge is accumulating in the biomedical literature. PubMed is the most comprehensive database for collecting and managing this literature. To help researchers make sense of collections of PubMed abstracts, numerous clustering methods have been proposed that group similar abstracts based on their shared features. However, most of these methods do not explore the semantic relationships among the resulting groups of documents, relationships that could help explain how the groups relate to one another. To address this issue, we propose an ontological clustering method called GOClonto for conceptualizing PubMed abstracts. GOClonto uses latent semantic analysis (LSA) and the Gene Ontology (GO) to identify key gene-related concepts and their relationships, and then allocates PubMed abstracts to these key gene-related concepts. In experiments on two PubMed abstract collections, GOClonto identifies key gene-related concepts and outperforms the STC (suffix tree clustering) algorithm, the Lingo algorithm, the Fuzzy Ants algorithm, and the TRS (tolerance rough set) based clustering algorithm. Moreover, the two ontologies generated by GOClonto exhibit informative conceptual structures.
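The abstract does not spell out GOClonto's procedure, so the following is only a minimal sketch of the general idea of combining LSA with a list of ontology term names, not the authors' implementation. It assumes a toy corpus, two hypothetical candidate concept names standing in for GO terms, a raw term-count matrix, a rank-2 truncated SVD, and cosine similarity in the latent space for allocating each abstract to one concept. All data and helper names (`bag_of_words`, `to_latent`, `cosine`) are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins for PubMed abstracts (hypothetical text, not real data).
abstracts = [
    "dna repair pathway responds to dna damage",
    "dna damage triggers repair of dna",
    "apoptosis cell death signaling",
    "cell death by apoptosis signaling cascade",
]
# Toy stand-ins for GO term names (hypothetical candidate concepts).
go_terms = ["dna repair", "apoptosis"]

# Vocabulary built from the abstracts only.
vocab = sorted({w for text in abstracts for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """Raw term-count vector over the vocabulary; unknown words are ignored."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-by-document count matrix (rows: terms, columns: abstracts).
A = np.stack([bag_of_words(t) for t in abstracts], axis=1)

# LSA step: keep the k leading left singular vectors of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of latent dimensions
U_k = U[:, :k]

def to_latent(text):
    """Project a text into the k-dimensional latent space."""
    return U_k.T @ bag_of_words(text)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

doc_vecs = [to_latent(t) for t in abstracts]
term_vecs = [to_latent(t) for t in go_terms]

# Allocate each abstract to the most similar candidate concept.
assignments = [
    go_terms[int(np.argmax([cosine(d, c) for c in term_vecs]))]
    for d in doc_vecs
]

for text, concept in zip(abstracts, assignments):
    print(f"{concept:>10} <- {text}")
```

On this toy corpus the two groups of abstracts share no vocabulary, so the two latent dimensions separate them cleanly. Real abstracts would need term weighting, stop-word removal, and a principled choice of `k`, and the sketch does not model the GO relationships between concepts that the abstract describes.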
