Multi-label literature classification based on the Gene Ontology graph

Background

The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. Manual annotation of proteins with Gene Ontology terms is outpaced by the rate at which biomedical knowledge accumulates in the literature, which motivates the development of text mining approaches that facilitate the process by automatically extracting Gene Ontology annotations from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification.

Results

In this research, we investigated methods for enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We studied three graph-based multi-label classification algorithms for literature classification: a novel stochastic algorithm and two top-down hierarchical classification methods. We systematically evaluated these graph-based classification algorithms and compared them to a conventional flat multi-label algorithm. The results indicate that, by utilizing information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting to curators Gene Ontology annotations that are closely related to the true annotations even when they fail to predict the true ones directly. A software package implementing the studied algorithms is available to the research community.

Conclusion

By utilizing information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have greater potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.
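The general idea of top-down hierarchical multi-label classification mentioned above can be illustrated with a minimal sketch. It is not the software package or any of the studied algorithms: the toy graph, the term names, and the keyword-based `keyword_classifier` are hypothetical stand-ins for a real ontology graph and trained per-term classifiers. The sketch only shows the control flow in which a term's children are evaluated only after the term itself has been accepted.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy graph: parent term -> child terms (not the real GO structure).
GRAPH: Dict[str, List[str]] = {
    "root": ["binding", "catalytic_activity"],
    "binding": ["protein_binding", "dna_binding"],
    "catalytic_activity": ["kinase_activity"],
    "protein_binding": [],
    "dna_binding": [],
    "kinase_activity": [],
}

# Hypothetical keyword sets standing in for trained binary classifiers.
KEYWORDS: Dict[str, Set[str]] = {
    "binding": {"binds", "binding"},
    "protein_binding": {"protein"},
    "dna_binding": {"dna"},
    "catalytic_activity": {"catalyzes"},
    "kinase_activity": {"phosphorylates", "kinase"},
}


def keyword_classifier(term: str, tokens: Set[str]) -> bool:
    """Accept a term if any of its (assumed) keywords occurs in the text."""
    return bool(KEYWORDS.get(term, set()) & tokens)


def top_down_predict(
    text: str,
    graph: Dict[str, List[str]],
    accept: Callable[[str, Set[str]], bool],
    root: str = "root",
) -> Set[str]:
    """Walk the graph from the root; descend only below accepted terms."""
    tokens = set(text.lower().split())
    predicted: Set[str] = set()
    visited = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child in visited:
                continue  # a term with several parents is evaluated once
            visited.add(child)
            if accept(child, tokens):
                predicted.add(child)
                stack.append(child)  # only accepted terms are expanded
    return predicted


labels = top_down_predict("The kinase binds DNA", GRAPH, keyword_classifier)
print(sorted(labels))
```

In this toy run, `kinase_activity` is never evaluated because its parent `catalytic_activity` is rejected, even though the word "kinase" appears in the text. A flat multi-label classifier would instead test every term independently, ignoring the graph.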

[1]  Alex A. Freitas,et al.  A review of performance evaluation measures for hierarchical classifiers , 2007 .

[2]  Bin Zheng,et al.  Research Paper: Enhancing Text Categorization with Semantic-enriched Representation and Training Data Augmentation , 2006, J. Am. Medical Informatics Assoc..

[3]  Grigorios Tsoumakas,et al.  Multi-Label Classification: An Overview , 2007, Int. J. Data Warehous. Min..

[4]  William R Hersh,et al.  The TREC 2004 genomics track categorization task: classifying full text biomedical documents , 2006, Journal of biomedical discovery and collaboration.

[5]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[6]  Thomas Hofmann,et al.  Hierarchical document categorization with support vector machines , 2004, CIKM '04.

[7]  Stan Matwin,et al.  Functional Annotation of Genes Using Hierarchical Text Categorization , 2005 .

[8]  Karin M. Verspoor,et al.  Protein annotation as term categorization in the gene ontology using word proximity networks , 2005, BMC Bioinformatics.

[9]  Emily Dimmer,et al.  The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology , 2004, Nucleic Acids Res..

[10]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[11]  Bin Zheng,et al.  BMC Bioinformatics BioMed Central , 2005 .

[12]  Carole A. Goble,et al.  Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation , 2003, Bioinform..

[13]  Martijn J. Schuemie,et al.  Literature-based concept profiles for gene annotation: The issue of weighting , 2008, Int. J. Medical Informatics.

[14]  Juho Rousu,et al.  Kernel-Based Learning of Hierarchical Multilabel Classification Models , 2006, J. Mach. Learn. Res..

[15]  Anton Yuryev,et al.  Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks , 2007, BMC Bioinformatics.

[16]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[17]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[18]  David W. Conrath,et al.  Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[19]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[20]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[21]  K. Bretonnel Cohen,et al.  Manual curation is not sufficient for annotation of genomic databases , 2007, ISMB/ECCB.

[22]  Emily Dimmer,et al.  The Gene Ontology Annotation (GOA) Database - An integrated resource of GO annotations to the UniProt Knowledgebase , 2003, Silico Biol..

[23]  Robert E. Schapire,et al.  Hierarchical multi-label prediction of gene function , 2006, Bioinform..

[24]  Susan T. Dumais,et al.  Hierarchical classification of Web content , 2000, SIGIR '00.

[25]  W. K. Hastings,et al.  Monte Carlo Sampling Methods Using Markov Chains and Their Applications , 1970 .

[26]  Rolf Apweiler,et al.  GOAnnotator: linking protein GO annotations to evidence text , 2006, Journal of biomedical discovery and collaboration.

[27]  Karin M. Verspoor,et al.  A categorization approach to automated ontological function annotation , 2006, Protein science : a publication of the Protein Society.

[28]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[29]  Robert Stevens,et al.  Protein classification using ontology classification , 2006, ISMB.

[30]  Daphne Koller,et al.  Hierarchically Classifying Documents Using Very Few Words , 1997, ICML.

[31]  Emily Dimmer,et al.  An evaluation of GO annotation retrieval for BioCreAtIvE and GOA , 2005, BMC Bioinformatics.

[32]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[33]  C. Lindberg The Unified Medical Language System (UMLS) of the National Library of Medicine. , 1990, Journal.

[34]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[35]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .