Protein families and TRIBES in genome sequence space.

Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called TRIBES that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar among prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale-free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The TRIBES protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes/.
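
As an illustration of what a scale-free family-size distribution means in practice, the sketch below estimates a power-law exponent from a list of family sizes by fitting a straight line to log(frequency) against log(size). It is a rough diagnostic only, and not necessarily the procedure used for TRIBES: the function name `fit_power_law_exponent` and the toy sizes are hypothetical and are not data from the resource.

```python
import math
from collections import Counter


def fit_power_law_exponent(family_sizes):
    """Estimate gamma in frequency(size) ~ size**(-gamma).

    Uses ordinary least squares on the log-log size histogram.
    This is a simple diagnostic, assumed here for illustration only.
    """
    counts = Counter(family_sizes)  # family size -> number of families
    if len(counts) < 2:
        raise ValueError("need at least two distinct family sizes")

    xs = [math.log(size) for size in counts]
    ys = [math.log(freq) for freq in counts.values()]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    return -slope  # a power law gives a negative slope on log-log axes


# Toy data (hypothetical): 64 families of size 1, 16 of size 2,
# 4 of size 4 and 1 of size 8, i.e. frequency proportional to size**-2.
toy_sizes = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
gamma = fit_power_law_exponent(toy_sizes)
print(f"estimated exponent: {gamma:.2f}")
```

On the toy input the fitted exponent is 2 by construction. With real family sizes the histogram tail is noisy, so a straight line on log-log axes should be read as suggestive rather than as proof of a power law.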
