Entropy-driven partitioning of the hierarchical protein space

Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations of protein families can support automatic unsupervised methods for identifying high-quality protein families. We present a method that yields, within the ProtoNet hierarchy, an optimal partition of clusters relative to manual annotation schemes. The method's principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2 478 328 proteins from UniRef50. Of 4 929 553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26 891 clusters. The high quality of the partition is validated by its close correspondence with the set of clusters that best describe thousands of Pfam keywords. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich the BF's clusters a priori. We show how the entropy-based method helps overcome the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods.

Availability and implementation: A catalog of BF clusters for thousands of Pfam keywords is provided at http://protonet.cs.huji.ac.il/bestFront/

Contact: michall@cc.huji.ac.il
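The principle of minimizing an entropy-derived distance between an annotation-based partition and candidate partitions can be illustrated with a small sketch. The abstract does not specify the distance, so the code below assumes the variation of information, VI(A, B) = H(A|B) + H(B|A), as a stand-in. It also compares a few whole partitions of a toy protein set rather than searching cuts through a clustering tree. The function names and the toy data are hypothetical and are not taken from ProtoNet.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def variation_of_information(part_a, part_b):
    """Entropy-based distance between two partitions of the same items.

    Each partition is a sequence of cluster labels, one per item.
    Uses VI(A, B) = 2*H(A, B) - H(A) - H(B), which equals H(A|B) + H(B|A).
    This is an assumed stand-in for the distance named in the abstract.
    """
    if len(part_a) != len(part_b):
        raise ValueError("partitions must label the same items")
    joint = list(zip(part_a, part_b))
    return 2 * entropy(joint) - entropy(part_a) - entropy(part_b)


def closest_partition(reference, candidates):
    """Return the name of the candidate partition with minimal VI to the reference."""
    return min(
        candidates,
        key=lambda name: variation_of_information(reference, candidates[name]),
    )


# Toy annotation-based partition of six proteins (hypothetical labels).
annotation = ["kinase", "kinase", "kinase", "protease", "protease", "protease"]

# Three hypothetical levels of granularity for the same six proteins.
levels = {
    "coarse": [0, 0, 0, 0, 0, 0],    # everything in one cluster
    "matching": [0, 0, 0, 1, 1, 1],  # two clusters
    "fine": [0, 1, 2, 3, 4, 5],      # every protein on its own
}

best = closest_partition(annotation, levels)
```

In this toy case the two-cluster level has distance zero to the annotation, while the single-cluster and singleton levels are penalized for merging or splitting annotated families, respectively. A symmetric distance is used here so that both over-merging and over-splitting count against a candidate.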
