The Ruzzo-Tompa algorithm can find the maximal paths in weighted, directed graphs on a one-dimensional lattice

Biological sequences can contain regions of unusual composition; for example, proteins contain DNA-binding domains, transmembrane regions, and charged regions. The linear-time Ruzzo-Tompa algorithm finds such regions by taking a sequence of scores as input and returning the corresponding “maximal segments”, i.e., contiguous, disjoint subsequences having the greatest total scores. Just as gaps improved the sensitivity of BLAST searches, they might also improve the sensitivity of searches for regions of unusual composition. Accordingly, we generalize the Ruzzo-Tompa algorithm from sequences of scores to paths in weighted, directed graphs on a one-dimensional lattice. Within the generalization, unfavorable scores can be deleted from contiguous, disjoint subsequences by paying a penalty, so that the Ruzzo-Tompa algorithm can then find gapped subsequences having the greatest total gapped scores. An application to finding gapped inexact repeats in biological sequences illustrates some of the concepts.
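
For orientation, the original, ungapped Ruzzo-Tompa procedure on a plain sequence of scores can be sketched as follows. This is a minimal illustration and not the graph generalization described in the abstract. The function name `maximal_segments` and the `(start, end, total)` output format (with `end` exclusive) are assumptions made for this example. For brevity, the sketch scans the candidate list backwards directly, so it does not by itself guarantee the linear-time bound, which needs extra bookkeeping to shortcut that scan.

```python
def maximal_segments(scores):
    """Return the maximal segments of `scores` as (start, end, total) tuples.

    `end` is exclusive. Each candidate stores its bounds together with
    L (cumulative total before its first score) and
    R (cumulative total through its last score).
    """
    candidates = []  # disjoint candidate segments, ordered left to right
    total = 0        # running cumulative score
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a segment
        start, end, L, R = i, i + 1, left, total
        while True:
            # Find the rightmost candidate j whose L is strictly below ours.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= L:
                j -= 1
            if j < 0 or candidates[j][3] >= R:
                # No candidate can be profitably merged: keep the new segment.
                candidates.append((start, end, L, R))
                break
            # Otherwise extend left to absorb candidates j..end, then retry.
            start, L = candidates[j][0], candidates[j][2]
            del candidates[j:]
    return [(a, b, R - L) for a, b, L, R in candidates]


if __name__ == "__main__":
    print(maximal_segments([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```

On the scores `4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5`, the sketch reports three segments: the single score 4, the single score 3, and the run from index 4 through index 10 with total 7.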
