An efficient algorithm for large-scale detection of protein families.

Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can contribute significantly to delineating the functional diversity of homologous proteins, to predicting function from domain architecture or the presence of sequence motifs, and to comparative genomics, where it provides valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm to assign proteins to families on the basis of precomputed sequence similarity information. This approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is well suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome, and the resulting families have been used to annotate a large proportion of human proteins.
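To illustrate how a Markov cluster procedure can turn precomputed pairwise similarities into families, the following is a minimal sketch in pure Python. It is not the TRIBE-MCL implementation. The function names (`normalize_columns`, `mcl`), the parameter defaults, the use of unit edge weights for "a significant hit exists", and the unit self-loops are all assumptions made for this example. The sketch alternates expansion (squaring the column-stochastic matrix) with inflation (raising entries to a power and renormalising columns) until the matrix stops changing, then reads clusters from the rows of attractor nodes.

```python
def normalize_columns(m):
    """Scale each column in place so that it sums to 1 (column-stochastic)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop.

    similarity: square list of lists of non-negative weights (hypothetical
    input format; 0 means no detectable similarity).
    Returns a sorted list of clusters, each a sorted list of node indices.
    """
    n = len(similarity)
    m = [[float(x) for x in row] for row in similarity]
    for i in range(n):
        m[i][i] = 1.0  # assumption: unit self-loops, matching unit edge weights
    normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: one step of flow spreading (matrix squaring).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: strengthen strong flows, weaken weak ones, renormalise.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded]
        )
        delta = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if delta < tol:
            break

    # Attractors keep flow on their own diagonal entry; each attractor row
    # lists the nodes it attracts. Identical clusters are merged via a set.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy input: proteins 0-2 all hit each other, 3-4 hit each other,
    # and protein 5 has no hits at all.
    toy = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    print(mcl(toy))  # prints [[0, 1, 2], [3, 4], [5]]
```

In practice the input would carry weighted similarity scores rather than 0/1 entries. The inflation value and the weight given to self-loops then both influence how finely the graph is split, so they should be treated as tunable parameters rather than fixed constants.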
