E-MEM: efficient computation of maximal exact matches for very large genomes

MOTIVATION: Alignment of similar whole genomes is often performed using anchors given by the maximal exact matches (MEMs) between their sequences. Despite a significant amount of research on this problem, computing MEMs for large genomes remains challenging. The leading current algorithms employ full-text indexes, with the sparse suffix array giving the best results. Still, their memory requirements are high, their parallelization is not very efficient, and they cannot handle very large genomes.

RESULTS: We present a new algorithm, efficient computation of MEMs (E-MEM), that does not use full-text indexes. Our algorithm uses much less space and is highly amenable to parallelization. It can compute all MEMs of minimum length 100 between the whole human and mouse genomes on a 12-core machine in 10 min using 2 GB of memory; the required memory can be as low as 600 MB. It can run efficiently on genomes of any size. Extensive testing and comparison with the best current algorithms are provided.

AVAILABILITY AND IMPLEMENTATION: The source code of E-MEM is freely available at: http://www.csd.uwo.ca/~ilie/E-MEM/

CONTACT: ilie@csd.uwo.ca

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
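To make the notion of a maximal exact match concrete, the sketch below enumerates MEMs by brute force under the usual definition: an exact match between the two sequences that cannot be extended to the left or to the right, and whose length is at least a given minimum. This is only an illustration of the definition, not the E-MEM algorithm; the function name `maximal_exact_matches` and its parameters are hypothetical, and the quadratic scan is suitable only for toy inputs.

```python
def maximal_exact_matches(ref, query, min_len):
    """Brute-force illustration of the MEM definition (not E-MEM itself).

    Returns a list of (ref_pos, query_pos, length) triples such that
    ref[ref_pos:ref_pos+length] == query[query_pos:query_pos+length],
    the match cannot be extended in either direction, and
    length >= min_len.
    """
    mems = []
    for i in range(len(ref)):
        for j in range(len(query)):
            # A match starting here is not left-maximal if the preceding
            # characters agree, so it will be found from an earlier start.
            if i > 0 and j > 0 and ref[i - 1] == query[j - 1]:
                continue
            # Extend to the right while characters agree; stopping at a
            # mismatch or at a sequence end makes the match right-maximal.
            length = 0
            while (i + length < len(ref)
                   and j + length < len(query)
                   and ref[i + length] == query[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    # Toy sequences chosen for illustration only.
    print(maximal_exact_matches("ACGTACGT", "TTACGTTT", 4))
```

On the toy input, the match `TACGT` (reference position 3, query position 1) is reported once at its full length of 5, rather than as several shorter overlapping matches, which is the property that makes MEMs useful as alignment anchors.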
