Rime: Repeat identification

We present Rime (Repeat Identification: long, Multiple, and with Edits), an algorithm for detecting long similar fragments that occur at least twice in a set of biological sequences. The problem becomes computationally challenging when the number of copies of a repeat is allowed to grow and when a non-negligible number of insertions, deletions and substitutions is allowed. Rime handles instances whose size and combination of parameters are beyond the reach of other existing methods. It does so by using a filter as a preprocessing step and then exploiting the information gathered by the filter in the subsequent repeat inference step. To the best of our knowledge, Rime is the first algorithm that can accurately deal with very long repeats (up to a few thousand in length), possibly occurring several times, with a rate of differences (substitutions and indels) among copies of the same repeat of 10-15% or even more.
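To make the idea of a filtering step concrete, the following is a minimal sketch of a generic q-gram filter based on the classical q-gram lemma: two strings within edit distance e share at least max(|a|, |b|) - q + 1 - q*e q-grams, so pairs sharing fewer can be discarded before any expensive comparison. This is not Rime's actual filter, whose details are not given here; the function names and the parameters `q` and `max_edits` are assumptions chosen for illustration.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q):
    """Number of q-grams common to a and b, counted with multiplicity."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def qgram_threshold(length, q, max_edits):
    """Lower bound from the q-gram lemma on the number of shared q-grams."""
    return length - q + 1 - q * max_edits


def may_be_similar(a, b, q, max_edits):
    """Necessary (not sufficient) condition for edit distance <= max_edits.

    Returns False only when the pair can be safely discarded.
    """
    length = max(len(a), len(b))
    return shared_qgrams(a, b, q) >= qgram_threshold(length, q, max_edits)


if __name__ == "__main__":
    # Hypothetical toy fragments: the second differs by one substitution.
    print(may_be_similar("ACGTACGTAC", "ACGTTCGTAC", q=3, max_edits=1))
    print(may_be_similar("ACGTACGTAC", "TTTTTTTTTT", q=3, max_edits=1))
```

A filter of this kind never rejects a true match, but it may let false candidates through; those are left to the later, more costly inference step.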
