CLUSS: Clustering of protein sequences based on a new similarity measure

Background: The rapid growth of available protein data makes clustering within protein families increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is designed specifically for non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".

Results: To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on both hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity.

Conclusion: We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for alignment-dependent algorithms.
