A functional hierarchical organization of the protein sequence space

Background
It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually validated alignments of known protein families. Such methods can achieve high sensitivity but are limited by the manual labor they require. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic, unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins based solely on sequence similarity.

Results
In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure uses only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. As a result, 78% of the clusters in the compressed tree (5,300 clusters) are assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust.

Conclusions
We present an automatically generated, unbiased method that provides a hierarchical classification of all currently known proteins.
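To make the automatic comparison between clusters and existing annotations more concrete, here is a minimal sketch of how a cluster might be matched to an annotation keyword. It is not ProtoNet's actual procedure: the Jaccard-style overlap score, the 0.5 threshold, the function names, and the protein identifiers are all illustrative assumptions.

```python
from typing import Dict, Optional, Set, Tuple


def correspondence(cluster: Set[str], annotated: Set[str]) -> float:
    """Jaccard overlap between a cluster and the proteins carrying one annotation.

    Assumption: this score is chosen for illustration only.
    """
    union = cluster | annotated
    if not union:
        return 0.0
    return len(cluster & annotated) / len(union)


def best_annotation(
    cluster: Set[str],
    annotations: Dict[str, Set[str]],
    min_score: float = 0.5,  # hypothetical confidence threshold
) -> Optional[Tuple[str, float]]:
    """Return the best-matching (keyword, score), or None if nothing passes the threshold."""
    best: Optional[Tuple[str, float]] = None
    for keyword, proteins in annotations.items():
        score = correspondence(cluster, proteins)
        if score >= min_score and (best is None or score > best[1]):
            best = (keyword, score)
    return best


# Toy data: protein IDs and keywords are made up.
cluster = {"P1", "P2", "P3"}
annotations = {
    "kinase": {"P1", "P2", "P3", "P4"},
    "channel": {"P9"},
}
result = best_annotation(cluster, annotations)
```

In this toy case the cluster shares three of four proteins with the "kinase" keyword, so it passes the assumed threshold, while "channel" has no overlap and is ignored.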
