BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm633 Sequence analysis

MOTIVATION Accurate automatic assignment of protein functions remains a challenge for genome annotation. We developed an automatic annotation procedure for four bacterial genomes and compared several machine-learning methods using 5-fold cross-validation.

RESULTS The analyzed genomes had been manually annotated with FunCat categories at MIPS, which provided a gold standard. The features describe a pair of sequences rather than each sequence alone. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. After training, we scored all pairs from the validation sets, selected the pair with the highest predicted score and annotated the target protein with the functional categories of the prototype protein in that pair. Integrating the data with machine-learning methods gave significantly higher annotation accuracy than using individual descriptors alone. The neural network approach performed best. The descriptors derived from InterPro domains and sequence similarity contributed most to the performance of the method. The predicted annotation scores make it possible to distinguish reliable from unreliable annotations. The approach was applied to annotate the protein sequences of 180 complete bacterial genomes.

AVAILABILITY The FUNcat Annotation Tool (FUNAT) is available online as Web Services at http://mips.gsf.de/proj/funat.
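The annotation-transfer step described in the abstract (score every target/prototype pair, keep the best-scoring pair, copy the prototype's categories to the target) can be sketched as follows. This is only an illustrative sketch, not the FUNAT implementation: the function name `transfer_annotations`, the `Annotation` record, the `reliability_cutoff` parameter and its default value are all assumptions introduced here, and the pair scores are assumed to come from an already trained model.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """Result of transferring categories to one target protein (hypothetical record)."""
    prototype: str        # annotated protein whose categories were copied
    score: float          # predicted score of the selected pair
    categories: Set[str]  # functional categories transferred to the target
    reliable: bool        # True if the score reaches the assumed cutoff


def transfer_annotations(
    scored_pairs: Iterable[Tuple[str, str, float]],
    prototype_categories: Dict[str, Set[str]],
    reliability_cutoff: float = 0.5,  # assumed value, not taken from the paper
) -> Dict[str, Annotation]:
    """For each target, keep the highest-scoring pair and copy its prototype's categories.

    scored_pairs yields (target_id, prototype_id, predicted_score) tuples.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for target, prototype, score in scored_pairs:
        # Skip prototypes that have no known annotation to transfer.
        if prototype not in prototype_categories:
            continue
        # Keep only the best-scoring prototype seen so far for this target.
        if target not in best or score > best[target][1]:
            best[target] = (prototype, score)

    return {
        target: Annotation(
            prototype=prototype,
            score=score,
            categories=set(prototype_categories[prototype]),
            reliable=score >= reliability_cutoff,
        )
        for target, (prototype, score) in best.items()
    }
```

The `reliable` flag mirrors the statement that predicted scores separate reliable from unreliable annotations; in practice any cutoff would have to be calibrated on cross-validation results rather than fixed in advance.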
