Benchmarking PSI-BLAST in genome annotation.

The recognition of remote protein homologies is a major aspect of the structural and functional annotation of newly determined genomes. Here we benchmark the coverage and error rate of genome annotation using the widely used homology-searching program PSI-BLAST (position-specific iterated basic local alignment search tool). This study evaluates the one-to-many success rate for recognition: there are often several homologues in the database, and only one needs to be identified to annotate the sequence. In contrast, previous benchmarks considered one-to-one recognition, in which a single query was required to find a particular target. The benchmark constructs a model genome from the full sequences of the Structural Classification of Proteins (SCOP) database and searches against a target library of remotely homologous domains (<20 % identity). The structural benchmark provides a reliable list of correct and false homology assignments. PSI-BLAST successfully annotated 40 % of the domains in the model genome that had at least one homologue in the target library. This coverage is more than three times that obtained when one-to-one recognition is evaluated (11 % coverage of domains). Although a structural benchmark was used, the results apply equally to purely sequence-based homology searches. Accordingly, structural and sequence assignments were made to the sequences of Mycoplasma genitalium and Mycobacterium tuberculosis (see http://www.bmm.icnet.uk). The extent of missed assignments and of new superfamilies can be estimated for these genomes for both structural and functional annotations.
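The difference between the two success rates can be illustrated with a minimal sketch. It is not the authors' benchmark code: the function name `coverage`, the domain and target identifiers, and the toy data are all hypothetical. It also assumes that one-to-one coverage is counted as the fraction of true query-target pairs detected, which is only one plausible reading of the definition above.

```python
from typing import Dict, Set, Tuple


def coverage(
    true_homologues: Dict[str, Set[str]],
    hits: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Return (one_to_many, one_to_one) coverage for a set of query domains.

    true_homologues maps each query to the targets known to be homologous.
    hits maps each query to the targets reported by the search program.
    Queries with no homologue in the target library are excluded, as in
    the benchmark description.
    """
    queries = [q for q, targets in true_homologues.items() if targets]
    if not queries:
        raise ValueError("no query has a homologue in the target library")

    annotated = 0      # queries with at least one correct hit
    pairs_found = 0    # correct query-target pairs detected
    pairs_total = 0    # all true query-target pairs
    for q in queries:
        correct = hits.get(q, set()) & true_homologues[q]
        annotated += bool(correct)
        pairs_found += len(correct)
        pairs_total += len(true_homologues[q])

    return annotated / len(queries), pairs_found / pairs_total


# Hypothetical toy data: d4 has no homologue and is therefore ignored.
true_homologues = {
    "d1": {"t1", "t2", "t3"},
    "d2": {"t4"},
    "d3": {"t5", "t6"},
    "d4": set(),
}
hits = {"d1": {"t2"}, "d2": {"t4"}, "d3": {"t9"}}  # t9 is a false hit

one_to_many, one_to_one = coverage(true_homologues, hits)
print(f"one-to-many: {one_to_many:.2f}, one-to-one: {one_to_one:.2f}")
```

In this toy case two of the three eligible queries are annotated, while only two of the six true pairs are found, so the one-to-many rate is higher, in the same direction as the 40 % versus 11 % reported above.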