Clustering Protein Sequences Using Affinity Propagation Based on an Improved Similarity Measure

Protein databases are growing rapidly, so it is increasingly important to cluster protein sequences based on sequence information alone. In this paper we improve the similarity measure proposed by Kelil et al., cluster sequences using the affinity propagation (AP) algorithm, and provide a method for choosing the input preference of the AP algorithm. We tested our method extensively and compared its performance with four other methods on several datasets drawn from the COG, G protein, CAZy, and SCOP databases. We consistently observed that the number of clusters obtained for a given set of proteins closely approximates the correct number of clusters in that set. Moreover, in our experiments, the quality of the clusters, as quantified by the F-measure, was better than that of the other algorithms (on average, 15% better than BlastClust, 56% better than TribeMCL, 23% better than CLUSS, and 42% better than Spectral clustering).
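Since cluster quality is reported as an F-measure, a small sketch may help show how such a score can be computed from reference family labels and predicted cluster labels. It assumes the common class-size-weighted, best-match definition (each reference class is scored against the cluster that matches it best); the abstract does not state the exact formula, so this definition, the function name `clustering_f_measure`, and the toy labels are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted best-match F-measure of a clustering.

    Assumed definition (not confirmed by the abstract): for each
    reference class, take the highest F-score over all clusters,
    then average those scores weighted by class size.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one labelled sequence is required")

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    # Number of sequences shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_labels))

    best = {cls: 0.0 for cls in class_sizes}
    for (cls, clu), n_ij in overlap.items():
        precision = n_ij / cluster_sizes[clu]
        recall = n_ij / class_sizes[cls]
        f_score = 2 * precision * recall / (precision + recall)
        best[cls] = max(best[cls], f_score)

    return sum(class_sizes[cls] / n * best[cls] for cls in class_sizes)


# Hypothetical toy example: six sequences from two families, where one
# member of family "A" was placed in the cluster dominated by family "B".
families = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
score = clustering_f_measure(families, clusters)
```

A score of 1.0 means every reference family is recovered exactly as one cluster; splitting or merging families lowers the score.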

[1]  B. Blaisdell,  A measure of the similarity of sets of sequences not requiring sequence alignment. , 1986, Proceedings of the National Academy of Sciences of the United States of America.

[2]  Christian Gautier,et al.  Statistical method for predicting protein coding regions in nucleic acid sequences , 1987, Comput. Appl. Biosci..

[3]  Jack-Gérard Postaire,et al.  Mode Detection by Relaxation , 1988, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[5]  Jack-Gérard Postaire,et al.  Cluster Analysis by Binary Morphology , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  William R. Pearson,  Protein sequence comparison and protein evolution , 1995, ISMB 1995.

[7]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[8]  D. Davison,et al.  A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. , 1997, Biometrics.

[9]  Alexander Schliep,et al.  Clustering Protein Sequences - Structure Prediction by Transitive Homology , 2001, German Conference on Bioinformatics.

[10]  Jean-Paul Delahaye,et al.  Transformation distances: a family of dissimilarity measures based on movements of segments , 1999, Bioinform..

[11]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[12]  Tim J. P. Hubbard,et al.  SCOP: a Structural Classification of Proteins database , 1999, Nucleic Acids Res..

[13]  Todd Richmond,et al.  Phylogenetic classification of proteins encoded in complete genomes , 2000, Genome Biology.

[14]  Nathan Linial,et al.  ProtoMap: automatic classification of protein sequences and hierarchy of protein families , 2000, Nucleic Acids Res..

[15]  Anton J. Enright,et al.  GeneRAGE: a robust algorithm for sequence clustering and domain detection , 2000, Bioinform..

[16]  Martin Vingron,et al.  The SYSTERS protein sequence cluster set , 2000, Nucleic Acids Res..

[17]  Tiee-Jian Wu,et al.  Statistical Measures of DNA Sequence Dissimilarity under Markov Chain Models of Base Composition , 2001, Biometrics.

[18]  William R. Pearson,et al.  Protein sequence comparison and Protein evolution Tutorial - ISMB2000 , 2001 .

[19]  Steve Baker,et al.  Integrated gene and species phylogenies from unaligned whole genome protein sequences , 2002, Bioinform..

[20]  Alexander Schliep,et al.  ProClust: improved clustering of protein sequences with an extended graph-based approach , 2002, ECCB.

[21]  Anton J. Enright,et al.  An efficient algorithm for large-scale detection of protein families. , 2002, Nucleic acids research.

[22]  Jonas S. Almeida,et al.  Alignment-free sequence comparison-a review , 2003, Bioinform..

[23]  James A. Casbon,et al.  Spectral clustering of protein sequences , 2006, Nucleic acids research.

[24]  Erik L. L. Sonnhammer,et al.  Scoredist: A simple and robust protein sequence distance estimator , 2005, BMC Bioinformatics.

[25]  Jonas S. Almeida,et al.  Comparative evaluation of word composition distances for the recognition of SCOP relationships , 2004, Bioinform..

[26]  Patrice Koehl,et al.  The ASTRAL Compendium in 2004 , 2003, Nucleic Acids Res..

[27]  Paul M. B. Vitányi,et al.  Clustering by compression , 2003, IEEE Transactions on Information Theory.

[28]  Ryszard Brzezinski,et al.  Two exo-beta-D-glucosaminidases/exochitosanases from actinomycetes define a new subfamily within family 2 of glycoside hydrolases. , 2006, The Biochemical journal.

[29]  Ryszard Brzezinski,et al.  Exo-beta-D-glucosaminidase from Amycolatopsis orientalis: catalytic residues, sugar recognition specificity, kinetics, and synergism. , 2006, Glycobiology.

[30]  Jin Xu,et al.  Prediction of pi-turns in proteins using PSI-BLAST profiles and secondary structure information. , 2006, Biochemical and biophysical research communications.

[31]  András Kocsor,et al.  Application of compression-based distance measures to protein sequence classification: a methodological study , 2005 .

[32]  Delbert Dueck,et al.  Clustering by Passing Messages Between Data Points , 2007, Science.

[33]  Shengrui Wang,et al.  CLUSS: Clustering of protein sequences based on a new similarity measure , 2007, BMC Bioinformatics.