Efficient functional clustering of protein sequences using the Dirichlet process

MOTIVATION: Automatic clustering of protein sequences is an important problem in computational biology. The recent explosion in genome sequences has given biological researchers a vast number of novel protein sequences. However, the majority of these sequences have no experimental evidence for their molecular function in the cell, and the responsibility for correctly annotating them falls upon the bioinformatics community. Ideally, we would like to group sequences of similar or identical molecular function automatically, without relying on experimental evidence.

RESULTS: In this article I present a novel probabilistic framework that models subfamilies within a known protein family. Given a multiple sequence alignment, the model uses Dirichlet mixture densities to estimate amino acid preferences within subfamily clusters, and places a Dirichlet process prior on the overall set of clusters. Based on results from several datasets, the model accurately partitions the data into functional subgroups.

AVAILABILITY: The algorithm is implemented as C++ software available at bpg-research.berkeley.edu/~duncanb/dpcluster/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
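The two ingredients named in the abstract, a Dirichlet process prior over clusters and Dirichlet-based estimates of amino acid preferences within a cluster, can be illustrated with a minimal Python sketch. It is not the author's C++ implementation. The function names `sample_crp_partition` and `log_marginal_column` are hypothetical, the prior is drawn through its Chinese restaurant process representation, and the column score assumes a single Dirichlet prior rather than the Dirichlet mixture densities the model actually uses.

```python
import math
import random


def sample_crp_partition(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter: larger values favour
    opening new clusters. Illustrative only, not the paper's sampler.
    """
    rng = random.Random(seed)
    labels = []
    sizes = []  # sizes[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels


def log_marginal_column(counts, pseudocounts):
    """Log marginal likelihood of one alignment column within one cluster.

    counts[i] is how often residue i occurs in the column and
    pseudocounts[i] is the matching Dirichlet parameter. A single
    Dirichlet prior is assumed here as a simplification.
    """
    total_pseudo = sum(pseudocounts)
    total_count = sum(counts)
    result = math.lgamma(total_pseudo) - math.lgamma(total_pseudo + total_count)
    for c, a in zip(counts, pseudocounts):
        result += math.lgamma(a + c) - math.lgamma(a)
    return result


if __name__ == "__main__":
    # Prior draw of subfamily labels for ten sequences.
    print(sample_crp_partition(10, alpha=1.0, seed=1))
    # Score of a column with three identical residues over a two-letter alphabet.
    print(log_marginal_column([3, 0], [0.5, 0.5]))
```

In a sampler built on these pieces, a sequence's probability of joining a cluster would combine the prior weight (cluster size, or `alpha` for a new cluster) with the product of column likelihoods given the sequences already in that cluster.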
