A new approach to sequence comparison: normalized sequence alignment

The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with the maximal score, but it is unable to find the local alignment with the maximum degree of similarity (e.g., the maximal percentage of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% similarity? As a result, local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly conserved or even unrelated fragments. This may lead to problems in the comparison of long genomic sequences and in comparative gene prediction, as recently pointed out by Zhang et al., 1999 [33]. In this paper we propose a new sequence comparison algorithm (normalized local alignment) that reports the regions with the maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n^2 log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
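To make the fractional-programming idea concrete, here is a minimal Python sketch of a Dinkelbach-style iteration wrapped around a Smith-Waterman recurrence. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not define the normalization, so the objective score / (length + L) is an assumption, as are the constant `L`, the unit match/mismatch/gap scores, and the function names `best_parametric` and `normalized_local_alignment`. This simple iteration also does not by itself give the O(n^2 log n) bound quoted above.

```python
from fractions import Fraction


def best_parametric(a, b, lam, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman variant that maximizes score - lam * length.

    `length` counts the characters of a and b consumed by the alignment
    (2 per aligned pair, 1 per gap column). Returns (score, length) of an
    optimal local alignment; the empty alignment is (0, 0).
    """
    zero = (Fraction(0), 0, 0)  # (parametric value, raw score, length)
    prev = [zero] * (len(b) + 1)
    best = zero
    for i in range(1, len(a) + 1):
        cur = [zero] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cands = [zero]  # option: start a fresh alignment here
            for (v, sc, ln), ds, dl in (
                (prev[j - 1], s, 2),  # align a[i-1] with b[j-1]
                (prev[j], gap, 1),    # gap in b
                (cur[j - 1], gap, 1), # gap in a
            ):
                cands.append((v + ds - lam * dl, sc + ds, ln + dl))
            cur[j] = max(cands, key=lambda t: t[0])
            if cur[j][0] > best[0]:
                best = cur[j]
        prev = cur
    return best[1], best[2]


def normalized_local_alignment(a, b, L=10):
    """Maximize score / (length + L) by Dinkelbach iteration (assumed objective).

    Exact rational arithmetic is used so the stopping test is exact.
    Returns (normalized score, raw score, length).
    """
    lam = Fraction(0)  # lam = 0 is ordinary Smith-Waterman
    while True:
        score, length = best_parametric(a, b, lam)
        new_lam = Fraction(score, length + L)
        if new_lam <= lam:  # the ratio no longer improves: lam is optimal
            return new_lam, score, length
        lam = new_lam
```

With `lam = 0` the inner routine is plain Smith-Waterman. Each later round penalizes every aligned character by the current ratio, so a poorly conserved intermediate stretch is kept only if it pays for its own length. Larger values of `L` favor longer alignments.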

[1]  T J Gibson,et al.  PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. , 1996, Nucleic acids research.

[2]  Werner Dinkelbach On Nonlinear Fractional Programming , 1967 .

[3]  B. Berger,et al.  Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction , 2000 .

[4]  M. I. Kanehisa,et al.  Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries , 1982, Nucleic Acids Res..

[5]  Moshe Sniedovich,et al.  Dynamic Programming , 1991 .

[6]  S F Altschul,et al.  Locally optimal subalignments using nonlinear similarity functions. , 1986, Bulletin of mathematical biology.

[7]  Daniel H. Huson,et al.  The Conserved Exon Method for Gene Finding , 2000, ISMB.

[8]  P. Sellers Pattern recognition in genetic sequences by mismatch density , 1984 .

[9]  Enrique Vidal,et al.  Fast Computation of Normalized Edit Distances , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  S F Altschul,et al.  Significance levels for biological sequence comparison using non-linear similarity functions. , 1988, Bulletin of mathematical biology.

[11]  B. John Oommen,et al.  The Normalized String Editing Problem Revisited , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  M. Waterman,et al.  The Erdos-Renyi Law in Distribution, for Coin Tossing and Sequence Matching , 1990 .

[13]  Piotr Berman,et al.  Post-processing long pairwise alignments , 1999, Bioinform..

[14]  Abdullah N. Arslan,et al.  Efficient Algorithms For Normalized Edit Distance , 2000 .

[15]  Ömer Egecioglu,et al.  An efficient uniform-cost normalized edit distance algorithm , 1999, 6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware.

[16]  R. Gibbs,et al.  PipMaker--a web server for aligning two genomic DNA sequences. , 2000, Genome research.

[17]  W. Pearson Comparison of methods for searching protein sequence databases , 1995, Protein science : a publication of the Protein Society.

[18]  Piotr Berman,et al.  Alignments without low-scoring regions , 1998, RECOMB '98.

[19]  E. G. Shpaer,et al.  Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA. , 1996, Genomics.

[20]  P. Pevzner,et al.  Gene recognition via spliced sequence alignment. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[21]  Nimrod Megiddo Combinatorial Optimization with Rational Objective Functions , 1979, Math. Oper. Res..

[22]  Enrique Vidal,et al.  Computation of Normalized Edit Distance and Applications , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[23]  C. Chothia,et al.  Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Valentin I. Spitkovsky,et al.  A dictionary-based approach for gene annotation. , 1999 .

[25]  N N Alexandrov,et al.  Statistical significance of ungapped sequence alignments. , 1998, Pacific Symposium on Biocomputing.

[26]  Ömer Egecioglu,et al.  Parallel algorithms for fast computation of normalized edit distances , 1996, Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing.

[27]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[28]  Pavel A. Pevzner,et al.  Parametric Recomputing in Alignment Graphs , 1994, CPM.

[29]  M S Waterman,et al.  Rapid and accurate estimates of statistical significance for sequence data base searches. , 1994, Proceedings of the National Academy of Sciences of the United States of America.