GOTA: GO term annotation of biomedical literature

BackgroundFunctional annotation of genes and gene products is a major challenge in the post-genomic era. Nowadays, gene function curation is largely based on manual assignment of Gene Ontology (GO) annotations to genes by using published literature. The annotation task is extremely time-consuming, therefore there is an increasing interest in automated tools that can assist human experts.ResultsHere we introduce GOTA, a GO term annotator for biomedical literature. The proposed approach makes use only of information that is readily available from public repositories and it is easily expandable to handle novel sources of information. We assess the classification capabilities of GOTA on a large benchmark set of publications. The overall performances are encouraging in comparison to the state of the art in multi-label classification over large taxonomies. Furthermore, the experimental tests provide some interesting insights into the potential improvement of automated annotation tools.ConclusionsGOTA implements a flexible and expandable model for GO annotation of biomedical literature. The current version of the GOTA tool is freely available at http://gota.apice.unibo.it.

[1]  Claudio Sartori,et al.  A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf , 2015, DATA.

[2]  Ee-Peng Lim,et al.  Hierarchical text classification and evaluation , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[3]  Ellen M. Voorhees,et al.  The Tenth Text REtrieval Conference, TREC 2001 | NIST , 2002 .

[4]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[5]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[6]  Christophe Dessimoz,et al.  Quality of Computationally Inferred Gene Ontology Annotations , 2012, PLoS Comput. Biol..

[7]  K. Bretonnel Cohen,et al.  Text mining for the biocuration workflow , 2012, Database J. Biol. Databases Curation.

[8]  Saso Dzeroski,et al.  Decision trees for hierarchical multi-label classification , 2008, Machine Learning.

[9]  Thomas Lengauer,et al.  A new measure for functional similarity of gene products based on Gene Ontology , 2006, BMC Bioinformatics.

[10]  Robert E. Schapire,et al.  Hierarchical multi-label prediction of gene function , 2006, Bioinform..

[11]  Patrick Ruch,et al.  Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases , 2013, Database J. Biol. Databases Curation.

[12]  Patrick Ruch,et al.  Closing the loop: from paper to protein annotation using supervised Gene Ontology classification , 2014, Database J. Biol. Databases Curation.

[13]  Daniel W. A. Buchan,et al.  A large-scale evaluation of computational protein function prediction , 2013, Nature Methods.

[14]  Claudio Gentile,et al.  Incremental Algorithms for Hierarchical Classification , 2004, J. Mach. Learn. Res..

[15]  Thomas Hofmann,et al.  Hierarchical document categorization with support vector machines , 2004, CIKM '04.

[16]  Daniel L. Rubin,et al.  Biomedical ontologies: a functional perspective , 2007, Briefings Bioinform..

[17]  Tanya Z. Berardini,et al.  Building an efficient curation workflow for the Arabidopsis literature corpus , 2012, Database J. Biol. Databases Curation.

[18]  Thomas H. Wonnacott,et al.  Introductory Statistics , 2007, Technometrics.

[19]  Jung-Hsien Chiang,et al.  Overview of the gene ontology task at BioCreative IV , 2014, Database J. Biol. Databases Curation.

[20]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[21]  Michael Schroeder,et al.  GoPubMed: exploring PubMed with the Gene Ontology , 2005, Nucleic Acids Res..

[22]  Alfonso Valencia,et al.  Evaluation of BioCreAtIvE assessment of task 2 , 2005, BMC Bioinformatics.

[23]  Christophe Dessimoz,et al.  The what, where, how and why of gene ontology—a primer for bioinformaticians , 2011, Briefings Bioinform..

[24]  Karin M. Verspoor,et al.  A categorization approach to automated ontological function annotation , 2006, Protein science : a publication of the Protein Society.

[25]  Jane Lomax,et al.  Get ready to GO! A biologist's guide to the Gene Ontology , 2005, Briefings Bioinform..

[26]  Suzanna E Lewis,et al.  Gene Ontology: looking backwards and forwards , 2004, Genome Biology.

[27]  N. Altman An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression , 1992 .

[28]  Alex A. Freitas,et al.  A survey of hierarchical classification across different application domains , 2010, Data Mining and Knowledge Discovery.

[29]  Paul N. Bennett,et al.  Refined experts: improving classification in large taxonomies , 2009, SIGIR.

[30]  Carol Friedman,et al.  Information theory applied to the sparse gene ontology annotation network to predict novel gene function , 2007, ISMB/ECCB.

[31]  Ellen M. Voorhees,et al.  Overview of the TREC 2004 Novelty Track. , 2005 .

[32]  Juho Rousu,et al.  Kernel-Based Learning of Hierarchical Multilabel Classification Models , 2006, J. Mach. Learn. Res..

[33]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[34]  Yiming Yang,et al.  Support vector machines classification with a very large-scale taxonomy , 2005, SKDD.

[35]  Marco Masseroli,et al.  Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations , 2014, KDIR.

[36]  Raymond Y. K. Lau,et al.  Unsupervised Multi-label Text Classification Using a World Knowledge Ontology , 2012, PAKDD.

[37]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.