Space-Efficient Whole Genome Comparisons with BurrowsWheeler Transforms

The starting point for any alignment of mammalian genomes is the computation of exact matches satisfying various criteria. Time-efficient, O(n), data structures for this computation, such as the suffix tree, require O(n log(n)) space, several times the space of the genomes themselves. Thus, any reasonable whole-genome comparative project finds itself requiring tens of Gigabytes of RAM to maintain time-efficiency. This is beyond most modern workstations. With a new data structure, the compressed suffix array (CSA) implemented via the Burrows-Wheeler transform, we can trade time-efficiency for space-efficiency, taking O(n log(n)) time, but running in O(n) space, typically in total space less than or equal to that of the genomes themselves. If space is more expensive than time, this is an appropriate approach to consider. The most space-efficient implementation of this data structure requires 5 bits per nucleotide character to build on-line, in the worst case, and 2.5 bits per character to store once built. We present a description of this data structure and how it is used to obtain matches. An implementation (called bbbwt) is demonstrated by aligning two mammalian genomes on a modest workstation equipped with under 2 GB of free RAM in time superior to that of the implementations of other data structures.

[1]  Ross Lippert,et al.  A Space-Efficient Construction of the Burrows-Wheeler Transform for Genomic Data , 2005, J. Comput. Biol..

[2]  Roberto Grossi,et al.  Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract) , 2000, STOC '00.

[3]  Siu-Ming Yiu,et al.  A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays , 2002, Algorithmica.

[4]  S. Salzberg,et al.  Alignment of whole genomes. , 1999, Nucleic acids research.

[5]  Giovanni Manzini,et al.  Opportunistic data structures with applications , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[6]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[7]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[8]  Enno Ohlebusch,et al.  The Enhanced Suffix Array and Its Applications to Genome Analysis , 2002, WABI.

[9]  J. Stoye,et al.  REPuter: the manifold applications of repeat analysis on a genomic scale. , 2001, Nucleic acids research.

[10]  Wing-Kai Hon,et al.  Space-Economical Algorithms for Finding Maximal Unique Matches , 2002, CPM.

[11]  Stefan Kurtz,et al.  Reducing the space requirement of suffix trees , 1999 .

[12]  Marie-France Sagot,et al.  Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification , 2000, J. Comput. Biol..

[13]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[14]  Siu-Ming Yiu,et al.  A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays , 2002, COCOON.

[15]  Wing-Kai Hon,et al.  Constructing Compressed Suffix Arrays with Large Alphabets , 2003, ISAAC.

[16]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[17]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[18]  S. Srinivasa Rao,et al.  Space Efficient Suffix Trees , 1998, FSTTCS.