A New Alignment-Independent Algorithm for Clustering Protein Sequences

The rapid growth of available protein data makes clustering within protein families increasingly important; the challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as multi-domain, circular-permutation and tandem-repeat protein sequences, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is designed specifically for non-aligned protein sequences. It allows us to develop a new alignment-independent algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families.
