Bayesian search of functionally divergent protein subgroups and their function specific residues

Motivation: The rapid increase in the amount of protein sequence data has created a need for automated identification of evolutionarily related subgroups from large datasets. Existing methods typically require a priori specification of the number of putative groups, which defines the resolution of the classification solution.

Results: We introduce a Bayesian model-based approach to the simultaneous identification of evolutionary groups and conserved parts of the protein sequences. The model-based approach provides an intuitive and efficient way of determining the number of groups from the sequence data, in contrast to the ad hoc methods often used for similar purposes. Our model recognizes the areas in the sequences that are relevant for the clustering and regards other areas as noise. We have implemented the method using a fast stochastic optimization algorithm, which yields a clustering associated with the estimated maximum posterior probability. The method has been shown to have high specificity and sensitivity in simulated and real clustering tasks. With real datasets the method also highlights the residues close to the active site.

Availability: Software 'kPax' is available at http://www.rni.helsinki.fi/jic/softa.html

Contact: pekka.marttinen@helsinki.fi

Supplementary information: http://www.rni.helsinki.fi/~jic/softa.html
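To illustrate how a model-based score can select the number of groups without fixing it in advance, the following is a minimal sketch and not the kPax implementation. It assumes gap-free aligned sequences, an independent categorical model per alignment column with a symmetric Dirichlet prior, and a uniform prior over partitions; the function names `log_marginal` and `score_partition` and the parameter `alpha` are hypothetical. Each column is scored either with one distribution shared by all sequences (treated as noise for the clustering) or with a separate distribution per group (treated as group-specific), and the better-scoring role is kept.

```python
from collections import Counter
from math import lgamma

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumed, no gaps)


def log_marginal(symbols, alpha=1.0, alphabet=ALPHABET):
    """Log marginal likelihood of the observed symbols under a categorical
    model with a symmetric Dirichlet(alpha) prior (Dirichlet-multinomial)."""
    k = len(alphabet)
    n = len(symbols)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for count in Counter(symbols).values():
        out += lgamma(alpha + count) - lgamma(alpha)
    return out


def score_partition(seqs, labels, alpha=1.0):
    """Return (log score, indices of columns flagged as group-specific).

    For every column, compare one shared distribution against one
    distribution per group, and keep whichever scores higher.
    """
    groups = sorted(set(labels))
    total = 0.0
    informative = []
    for j in range(len(seqs[0])):
        column = [s[j] for s in seqs]
        shared = log_marginal(column, alpha)
        split = sum(
            log_marginal([s[j] for s, g in zip(seqs, labels) if g == grp], alpha)
            for grp in groups
        )
        if split > shared:
            informative.append(j)
        total += max(split, shared)
    return total, informative


# Toy alignment: columns 0 and 2 differ between two subgroups,
# columns 1 and 3 are conserved across all sequences.
seqs = ["ACDA", "ACDA", "ACDA", "GCHA", "GCHA", "GCHA"]
one_group = [0, 0, 0, 0, 0, 0]
two_groups = [0, 0, 0, 1, 1, 1]
three_groups = [0, 0, 0, 1, 1, 2]

for name, labels in [("1 group", one_group),
                     ("2 groups", two_groups),
                     ("3 groups", three_groups)]:
    score, cols = score_partition(seqs, labels)
    print(f"{name}: log score = {score:.2f}, group-specific columns = {cols}")
```

On this toy alignment the two-group partition receives the highest score and only columns 0 and 2 are flagged as group-specific. A stochastic search over partitions, such as the one described in the abstract, would propose moves between partitions and retain those that increase a score of this kind.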
