Paper information - Top-Down Clustering for Protein Subfamily Identification

Top-Down Clustering for Protein Subfamily Identification

We propose a novel method for protein subfamily identification, that is, finding subgroups of functionally closely related sequences within a protein family. In line with phylogenomic analysis, the method first builds a hierarchical tree from a multiple alignment of the protein sequences, then uses a post-pruning procedure to extract clusters from the tree. Unlike existing methods, it constructs the hierarchical tree top-down rather than bottom-up, and it associates particular mutations with each division into subclusters. The motivating hypothesis is that this approach may yield a better tree topology, and hence more accurate subfamily identification, while also indicating functionally important sites and allowing easy classification of new proteins. A thorough experimental evaluation confirms the hypothesis: the novel method yields more accurate clusters and a better tree topology than the state-of-the-art method SCI-PHY, identifies known functional sites, and identifies mutations that alone allow new sequences to be classified with an accuracy approaching that of hidden Markov models.
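The top-down idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The heterogeneity measure (per-column deviations from the majority residue), the function names, and the `min_size` stopping rule are all assumptions made for illustration, and the post-pruning step is omitted. Each split tests whether a sequence carries a given residue at a given alignment position, so every division is tied to one mutation, and the same tests route a new sequence to a cluster.

```python
from collections import Counter

def heterogeneity(seqs):
    """Assumed measure: per column, count residues that differ from the majority residue."""
    total = 0
    for column in zip(*seqs):
        total += len(column) - Counter(column).most_common(1)[0][1]
    return total

def best_split(seqs, min_size):
    """Return (gain, position, residue) for the best 'residue at position' test, or None."""
    base = heterogeneity(seqs)
    best = None
    for pos in range(len(seqs[0])):
        for residue in sorted(set(s[pos] for s in seqs)):
            inside = [s for s in seqs if s[pos] == residue]
            outside = [s for s in seqs if s[pos] != residue]
            if len(inside) < min_size or len(outside) < min_size:
                continue
            gain = base - heterogeneity(inside) - heterogeneity(outside)
            if best is None or gain > best[0]:
                best = (gain, pos, residue)
    return best

def build_tree(seqs, min_size=2):
    """Recursively divide aligned sequences; stop when no test reduces heterogeneity."""
    split = best_split(seqs, min_size)
    if split is None or split[0] <= 0:
        return {"leaf": seqs}
    _, pos, residue = split
    return {
        "test": (pos, residue),
        "yes": build_tree([s for s in seqs if s[pos] == residue], min_size),
        "no": build_tree([s for s in seqs if s[pos] != residue], min_size),
    }

def classify(tree, seq):
    """Route a new aligned sequence to a cluster using only the tests stored in the tree."""
    while "leaf" not in tree:
        pos, residue = tree["test"]
        tree = tree["yes"] if seq[pos] == residue else tree["no"]
    return tree["leaf"]

# Toy alignment (hypothetical data): four sequences of equal length.
example = ["ACDA", "ACDC", "GTDA", "GTDC"]
tree = build_tree(example)
```

On this toy alignment the root test is "residue A at position 0", which separates the two pairs of sequences. A new sequence is then assigned by checking that single position, which mirrors the abstract's claim that the identified mutations alone can classify new proteins.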

Celine Vens | Hendrik Blockeel | Eduardo P. Costa
