Top-Down Clustering for Protein Subfamily Identification

We propose a novel method for protein subfamily identification, that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree from a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Unlike existing methods, it constructs the hierarchical tree top-down rather than bottom-up, and it associates particular mutations with each division into subclusters. The motivating hypothesis is that this approach may yield a better tree topology, and hence more accurate subfamily identification, while also indicating functionally important sites and allowing easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis: the method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow new sequences to be classified with an accuracy approaching that of hidden Markov models.
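
To make the idea of top-down tree construction with mutation-labelled splits concrete, the following is a minimal Python sketch, not the method evaluated here. It assumes a toy alignment given as equal-length strings, uses the sum of pairwise Hamming distances as a stand-in split heuristic, and replaces the post-pruning procedure with a simple minimum cluster size. All function names (`spread`, `best_split`, `build_tree`, `classify`) and the `min_size` parameter are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of alignment columns in which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def spread(seqs):
    """Sum of pairwise Hamming distances within a cluster (assumed heuristic)."""
    return sum(hamming(a, b) for a, b in combinations(seqs, 2))


def best_split(seqs, min_size):
    """Return the test 'column pos holds residue res' that most reduces spread.

    Only splits that leave at least min_size sequences on each side are kept.
    Returns None when no such split exists.
    """
    base = spread(seqs)
    best = None
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            yes = [s for s in seqs if s[pos] == res]
            no = [s for s in seqs if s[pos] != res]
            if len(yes) < min_size or len(no) < min_size:
                continue
            gain = base - spread(yes) - spread(no)
            if best is None or gain > best[0]:
                best = (gain, pos, res, yes, no)
    return best


def build_tree(seqs, min_size=2):
    """Recursively divide the alignment top-down; each inner node stores its test."""
    split = best_split(seqs, min_size) if len(seqs) >= 2 * min_size else None
    if split is None:
        return {"leaf": list(seqs)}
    _, pos, res, yes, no = split
    return {
        "pos": pos,
        "res": res,
        "yes": build_tree(yes, min_size),
        "no": build_tree(no, min_size),
    }


def classify(tree, seq):
    """Route an aligned sequence down the tree using only the stored tests."""
    while "leaf" not in tree:
        tree = tree["yes"] if seq[tree["pos"]] == tree["res"] else tree["no"]
    return tree["leaf"]
```

For example, `build_tree(["ACDA", "ACDC", "GCEA", "GCEC"])` splits on the first column, and `classify` then assigns a new aligned sequence to a cluster by checking that single position. This illustrates how the mutations attached to each division can serve as a lightweight classifier.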

[1]  Gert Vriend,et al.  Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems , 2001, Nucleic Acids Res..

[2]  W. Fitch Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology , 1971 .

[3]  Celine Vens,et al.  Top-Down Induction of Phylogenetic Trees , 2010, EvoBIO.

[4]  William R. Taylor,et al.  The rapid generation of mutation data matrices from protein sequences , 1992, Comput. Appl. Biosci..

[5]  Hasan H. Otu,et al.  Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets , 2010, BMC Bioinformatics.

[6]  J A Eisen,et al.  Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. , 1998, Genome research.

[7]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[8]  Peter J Bickel,et al.  Finding important sites in protein sequences , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[9]  David A. Lee,et al.  GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains , 2009, Nucleic acids research.

[10]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[11]  Duncan P. Brown,et al.  Subfamily HMMS in Functional Genomics , 2004, Pacific Symposium on Biocomputing.

[12]  陈奕欣,et al.  The Universal Protein Resource (UniProt) , 2007, Nucleic Acids Res..

[13]  Salim Bougouffa,et al.  SitesIdentify: a protein functional site prediction tool , 2009, BMC Bioinformatics.

[14]  Duncan P. Brown,et al.  Automated Protein Subfamily Identification and Classification , 2007, PLoS Comput. Biol..

[15]  Robert R. Sokal,et al.  A statistical method for evaluating systematic relationships , 1958 .

[16]  Elena Marchiori,et al.  Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics , 2007, Lecture Notes in Computer Science.

[17]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[18]  Mark A. Gluck,et al.  Information, Uncertainty and the Utility of Categories , 1985 .

[19]  D. Baker,et al.  Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design , 2005, Nucleic acids research.

[20]  Luc De Raedt,et al.  Top-Down Induction of Clustering Trees , 1998, ICML.

[21]  H. Lehmann,et al.  Nucleic Acid Research , 1967 .

[22]  M. Salemi,et al.  The phylogenetic handbook : a practical approach to DNA and protein phylogeny , 2003 .

[23]  Paul D. Thomas,et al.  On the quality of tree-based protein classification , 2005, Bioinform..

[24]  Evgueni A. Haroutunian,et al.  Information Theory and Statistics , 2011, International Encyclopedia of Statistical Science.

[25]  Michal Linial,et al.  Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data , 2008, PloS one.

[26]  J. Berger Statistical Decision Theory and Bayesian Analysis , 1988 .