slaMEM: efficient retrieval of maximal exact matches using a sampled LCP array

MOTIVATION: Maximal exact matches (MEMs) are a powerful tool for multiple sequence alignment and approximate string matching. The most efficient algorithms for collecting them use compressed indexes built around the longest common prefix (LCP) array. However, the space-efficient representations of these indexes rely on encoding techniques that are computationally expensive. With the deluge of data generated by high-throughput sequencing, new approaches are needed to handle larger genomic sequences.

RESULTS: In this work, we have developed a new sampled representation of the LCP array, optimized to work with the backward search method inherent to the FM-Index. Unlike previous implementations, which sacrifice running time to save space, ours is both fast and space-efficient. This representation is used by the new software slaMEM, developed to retrieve MEMs efficiently. The results show that the new algorithm is competitive with existing state-of-the-art approaches.

AVAILABILITY AND IMPLEMENTATION: The software is implemented in C and is operating system independent. The source code is freely available for download at http://github.com/fjdf/slaMEM/ under the GPLv3 license.
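To make the notion of a MEM concrete, the following is a minimal brute-force sketch in Python. It is not slaMEM's algorithm, which uses an FM-Index with a sampled LCP array; it only illustrates the definition: a match between a reference and a query that cannot be extended to the left or to the right. The function name `find_mems` and the `min_len` parameter are assumptions made for this illustration.

```python
def find_mems(ref, query, min_len=1):
    """Return all maximal exact matches as (ref_pos, query_pos, length).

    Brute-force illustration of the MEM definition only; it runs in
    roughly O(len(ref) * len(query) * match_length) time and is not
    the index-based method used by slaMEM.
    """
    mems = []
    for i in range(len(ref)):
        for j in range(len(query)):
            if ref[i] != query[j]:
                continue
            # Left-maximality: skip if the match could be extended leftwards.
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue
            # Extend to the right as far as possible (right-maximality).
            length = 0
            while (i + length < len(ref)
                   and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    # Hypothetical toy input chosen for illustration.
    print(find_mems("ACGTACGT", "CGTA", min_len=3))
```

For the toy input above, the sketch reports two MEMs of length at least 3: `CGTA` at reference position 1 and `CGT` at reference position 5, both starting at query position 0. Index-based methods such as the one described here aim to find the same matches without comparing every pair of positions.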
