Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps

Some biological sequences contain subsequences of unusual composition: for example, some proteins contain DNA-binding domains, transmembrane regions and charged regions, and some DNA sequences contain repeats. The linear-time Ruzzo-Tompa (RT) algorithm finds subsequences of unusual composition, taking a sequence of scores as input and returning the corresponding 'maximal segments' as output. In principle, permitting gaps in the output subsequences could improve sensitivity. Here, the input of the RT algorithm is generalised to a finite, totally ordered, weighted graph, so that the algorithm locates paths of maximal weight through increasing but not necessarily adjacent vertices. Because unfavourable letters can be deleted at a penalty, the generalisation permits gaps. The program RepWords, which finds inexact simple repeats in DNA, exemplifies the general concepts by outperforming a similar existing ad hoc tool. With minimal programming effort, the generalised Ruzzo-Tompa algorithm could improve the performance of many programs for finding biological subsequences of unusual composition.
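
To make the input and output of the classic (ungapped) RT algorithm concrete, the following is a minimal Python sketch. It is not the generalised algorithm described above and is not taken from RepWords. The function name `maximal_segments` and the example scores are illustrative assumptions. For brevity, the sketch scans its list of candidate segments directly, so it omits the extra bookkeeping on which the linear-time guarantee depends.

```python
def maximal_segments(scores):
    """Sketch of the classic Ruzzo-Tompa procedure (ungapped).

    Returns a list of (start, end, score) tuples, with `end` exclusive,
    one per maximal scoring segment, in left-to-right order.
    """
    # Each candidate is [start, end, L, R], where L is the cumulative
    # score just before the segment and R the cumulative score at its end.
    segments = []
    total = 0
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a candidate
        cand = [i, i + 1, left, total]
        while True:
            # Find the rightmost earlier candidate with a smaller L.
            j = len(segments) - 1
            while j >= 0 and segments[j][2] >= cand[2]:
                j -= 1
            if j < 0 or segments[j][3] >= cand[3]:
                # No candidate to merge with: keep cand as it is.
                segments.append(cand)
                break
            # Otherwise extend cand leftwards to the start of candidate j,
            # discard candidates j onwards, and re-examine the merged one.
            cand = [segments[j][0], cand[1], segments[j][2], cand[3]]
            del segments[j:]
    return [(a, b, r - l) for a, b, l, r in segments]


if __name__ == "__main__":
    example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
    print(maximal_segments(example))
    # [(0, 1, 4), (2, 3, 3), (4, 11, 7)]
```

In this example the last segment spans positions 4 to 10 and absorbs two negative scores, because the positive scores around them more than compensate. The generalisation described in the abstract goes further by allowing a path to skip unfavourable letters at a penalty, which this sketch does not attempt.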
