Constructing Burrows-Wheeler transforms of large string collections via merging

The high throughput of biological sequencing technologies has created a need for sequencing formats that are both compressed and accessible. Recently, the Multi-String Burrows-Wheeler Transform (MSBWT) has become a prevalent method for transforming sequence data to improve compression while still providing access to the reads through an auxiliary FM-index. Many algorithms exist for building the MSBWT of a collection of strings, but they do not scale well as the length of those strings increases. We propose a new method for constructing the MSBWT of a collection of strings, based on previous work on merging two or more MSBWTs. For a collection of m strings composed of N symbols, where LCPavg is the average longest common prefix of all suffixes, the method requires O(N * LCPavg * log(m)) time and O(N) bits of memory. We evaluate the speed of the algorithm on multiple datasets that vary in both the number of strings and the string length.

Availability: https://code.google.com/p/msbwt/source/browse/MUSCython/MultimergeCython.pyx
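To make the idea of building an MSBWT by merging concrete, the following is a minimal Python sketch, not the authors' implementation. The function names `naive_msbwt` and `merge_msbwt` are hypothetical. The sketch assumes equal-length reads, each terminated by a `$` symbol that sorts before every other symbol. The merge uses one well-known strategy, iteratively refining an interleave of the two inputs until it stops changing; the quadratic rotation sort serves only as a reference for small inputs.

```python
def naive_msbwt(reads):
    """Reference construction: sort every rotation of every '$'-terminated read."""
    rotations = []
    for read in reads:
        text = read + "$"  # '$' sorts before A, C, G, T in ASCII
        rotations.extend(text[i:] + text[:i] for i in range(len(text)))
    rotations.sort()
    # The transform is the last symbol of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def merge_msbwt(bwt0, bwt1):
    """Merge two MSBWTs by refining an interleave of their rows (illustrative only)."""
    sources = (bwt0, bwt1)
    alphabet = sorted(set(bwt0) | set(bwt1))
    # interleave[i] records which input supplies row i of the merged transform.
    # Start with all rows of bwt0 followed by all rows of bwt1.
    interleave = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        buckets = {symbol: [] for symbol in alphabet}
        cursor = [0, 0]
        for src in interleave:
            symbol = sources[src][cursor[src]]
            cursor[src] += 1
            # Stable bucketing by symbol orders the rows by one more leading symbol.
            buckets[symbol].append(src)
        refined = [src for symbol in alphabet for src in buckets[symbol]]
        if refined == interleave:
            break  # the interleave is stable, so the rows are fully ordered
        interleave = refined
    # Read the symbols out of the two inputs in interleave order.
    cursor = [0, 0]
    merged = []
    for src in interleave:
        merged.append(sources[src][cursor[src]])
        cursor[src] += 1
    return "".join(merged)


# Usage: merging two small transforms matches building one over all reads.
left = naive_msbwt(["ACCA", "CAAA"])
right = naive_msbwt(["AAAC", "ACCA"])
merged = merge_msbwt(left, right)
```

Each pass of the loop extends the number of leading symbols by which the rows are ordered by one, so the number of passes is governed by how long the shared prefixes between the two inputs are. This is one intuition for why an LCP term appears in the stated time bound. The sketch stores the interleave as a Python list for readability; a bit vector would be needed to approach the O(N)-bit memory figure.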
