A Clustering Method for Molecular Sequences based on Pairwise Similarity

This paper presents a method for clustering a large and mixed set of uncharacterized sequences provided by genome projects. As the measure for clustering, we use a fast approximation of sequence similarity (the FASTA score). However, when two sequences have diverged substantially in the course of evolution, FASTA sometimes underestimates their similarity compared with the rigorous Smith-Waterman algorithm. In addition, the distance derived from the similarity score may not be a metric, since the triangle inequality may fail when the sequences have a multi-domain structure. To cope with these problems, we introduce a new graph structure, called a p-quasi complete graph, for describing a cluster of sequences with a confidence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, does a graph contain a 0.5-quasi complete subgraph of size k?) is NP-complete. We therefore outline an approximation algorithm for clustering a set of sequences into subsets corresponding to p-quasi complete graphs. The effectiveness of our method is demonstrated by clustering Escherichia coli protein sequences.
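The abstract does not spell out the definition of a p-quasi complete graph, so the following is only a minimal sketch under a stated assumption: a vertex set of size k is treated as p-quasi complete when every member is adjacent to at least p(k - 1) of the other members, so that p = 1 gives an ordinary clique. The function names, the score threshold, and the toy scores are all hypothetical and are not taken from the paper. The sketch only verifies a candidate cluster. It does not search for one, which is the NP-complete part.

```python
def build_similarity_graph(scores, threshold):
    """Build an undirected graph from pairwise similarity scores.

    scores: dict mapping (seq_a, seq_b) pairs to a similarity score
            (for instance a FASTA score; the values here are made up).
    threshold: hypothetical cutoff; pairs scoring at or above it get an edge.
    Returns an adjacency dict {sequence: set of neighbours}.
    """
    adj = {}
    for (a, b), score in scores.items():
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if score >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_p_quasi_complete(adj, members, p):
    """Check the ASSUMED degree condition on the subgraph induced by members.

    Every vertex must be adjacent to at least p * (k - 1) other members,
    where k is the number of members.
    """
    members = set(members)
    k = len(members)
    if k <= 1:
        return True
    needed = p * (k - 1)
    return all(len(adj.get(v, set()) & members) >= needed for v in members)


# Toy data: A-C and B-D score low even though both are similar to a
# common neighbour, as can happen with multi-domain proteins.
scores = {
    ("A", "B"): 200, ("B", "C"): 180, ("C", "D"): 150, ("A", "D"): 160,
    ("A", "C"): 40, ("B", "D"): 30,
}
adj = build_similarity_graph(scores, threshold=100)

print(is_p_quasi_complete(adj, "ABCD", 0.5))  # each vertex has 2 of 3 possible neighbours
print(is_p_quasi_complete(adj, "ABCD", 1.0))  # a clique would need all 3
```

In this toy case the four sequences form a cycle rather than a clique, so they pass the check at p = 0.5 but fail at p = 1. This illustrates how a relaxed density requirement can keep a cluster together when some pairwise scores are missing or underestimated.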
