Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree

Finding repetitive structures in genomes and proteins is important for understanding their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale, local repetitive structures are better understood than large, complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Even with the best engineering efforts, their space usage is 19 to 50 times the text size, which makes them impractical for massive data such as the whole human genome. We focus on finding all the maximal repeats in massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our method achieves this space efficiency while remaining fast. In fact, it is orders of magnitude faster than the prior methods on massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB of internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
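To make the two central notions concrete, the following is a minimal illustrative sketch, not the method described above. It assumes the usual definition of a maximal repeat (a substring that occurs at least twice and whose every one-character extension, to the left or to the right, occurs fewer times) and computes the Burrows-Wheeler Transform naively by sorting rotations. The function names `maximal_repeats` and `bwt` are hypothetical, and both routines use far more time and space than a practical tool could afford; they are only suitable for toy inputs.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Brute-force illustration: map each maximal repeat to its start positions.

    Assumed definition: w is a maximal repeat if it occurs at least twice and
    neither its left contexts nor its right contexts are all identical.
    """
    n = len(text)
    # Record every start position of every substring (cubic space, toy inputs only).
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[text[i:j]].append(i)

    result = {}
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue
        # Character just before / just after each occurrence; None marks a text boundary.
        left = {text[s - 1] if s > 0 else None for s in starts}
        right = {text[s + len(w)] if s + len(w) < n else None for s in starts}
        # Two or more distinct contexts on a side means no extension on that side
        # keeps all occurrences.
        if len(left) > 1 and len(right) > 1:
            result[w] = starts
    return result


def bwt(text, sentinel="$"):
    """Naive BWT: last column of the sorted rotations of text + sentinel.

    Assumes the sentinel does not occur in text and sorts before every
    character of text.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(maximal_repeats("banana"))  # {'a': [1, 3, 5], 'ana': [1, 3]}
    print(bwt("banana"))              # annb$aa
```

In the sorted rotations of `banana$`, the two rotations that begin with `ana` are adjacent, and their BWT characters are `n` and `b`. Two distinct preceding characters mean `ana` cannot be extended to the left without losing an occurrence, which is the kind of information a BWT-based repeat finder can read off without storing a full suffix tree.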
