Applying Support Vector Machines for Gene Ontology based gene function prediction

Background: The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions.

Results: We have developed a large-scale annotation system that tackles all of these shortcomings. In our approach, annotation was provided through Gene Ontology terms by applying multiple Support Vector Machines (SVM) for the classification of correct and false predictions. The general performance of the system was benchmarked with a large dataset. An organism-wise cross-validation was performed to define confidence estimates, resulting in an average precision of 80% for 74% of all test sequences. The validation results show that the prediction performance was organism-independent and could reproduce the annotation of other automated systems as well as high-quality manual annotations. We applied our trained classification system to Xenopus laevis sequences, yielding functional annotation for more than half of the known expressed genome. Compared to the currently available annotation, we provided more than twice the number of contigs with good quality annotation, and additionally we assigned a confidence value to each predicted GO term.

Conclusions: We present a complete automated annotation system that overcomes many of the usual problems by applying a controlled vocabulary of Gene Ontology and an established classification method on large and well-described sequence data sets. In a case study, the function for Xenopus laevis contig sequences was predicted and the results are publicly available at ftp://genome.dkfz-heidelberg.de/pub/agd/gene_association.agd_Xenopus.
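The organism-wise cross-validation and the "precision for a fraction of test sequences" figure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the function names `organism_folds` and `precision_and_coverage`, and the toy data are all hypothetical, and no SVM is trained here. The sketch only shows how one organism can be held out per fold and how precision and sequence coverage might be computed at a chosen confidence threshold. Coverage is computed over sequences that received at least one prediction, which is an assumption of this example.

```python
from collections import defaultdict


def organism_folds(records):
    """Yield (held_out_organism, train, test), holding out one organism per fold.

    `records` is a list of dicts with an "organism" key (hypothetical layout).
    """
    by_org = defaultdict(list)
    for rec in records:
        by_org[rec["organism"]].append(rec)
    for org in sorted(by_org):
        test = by_org[org]
        train = [r for o, recs in by_org.items() if o != org for r in recs]
        yield org, train, test


def precision_and_coverage(predictions, threshold):
    """Compute precision and sequence coverage at a confidence threshold.

    `predictions` is a list of (sequence_id, confidence, is_correct) tuples,
    one per predicted GO term (hypothetical layout).
    """
    kept = [p for p in predictions if p[1] >= threshold]
    all_seqs = {p[0] for p in predictions}
    covered = {p[0] for p in kept}
    precision = sum(1 for p in kept if p[2]) / len(kept) if kept else 0.0
    coverage = len(covered) / len(all_seqs) if all_seqs else 0.0
    return precision, coverage


# Toy data, invented purely for illustration.
records = [
    {"organism": "A", "id": "a1"},
    {"organism": "A", "id": "a2"},
    {"organism": "B", "id": "b1"},
    {"organism": "C", "id": "c1"},
]
folds = list(organism_folds(records))

predictions = [
    ("s1", 0.9, True),
    ("s1", 0.4, False),
    ("s2", 0.8, True),
    ("s3", 0.7, False),
    ("s4", 0.2, True),
]
precision, coverage = precision_and_coverage(predictions, threshold=0.5)
```

With the toy predictions, three GO terms pass the 0.5 threshold, two of them correct, and three of the four sequences keep at least one term. Raising the threshold trades coverage for precision, which is the kind of trade-off a reported pair such as "80% precision for 74% of sequences" summarises.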
