High Throughput Short Read Alignment via Bi-directional BWT

The advancement of sequencing technologies has made it feasible for researchers to consider many high-throughput biological applications. A core step of these applications is to align an enormous amount of short reads to a reference genome. For example, to resequence a human genome, billions of reads of 35 bp are produced in 1-2 weeks, putting a lot of pressure of faster software for alignment. Based on existing indexing and pattern matching technologies, several short read alignment software have been developed recently. Yet this is still strong need to further improve the speed. In this paper, we show a new indexing data structure called bi-directional BWT, which allows us to build the fastest software for aligning short reads. When compared with existing software (Bowtie is the best), our software is at least 3 times faster for finding unique best alignments, and 25 times faster for finding all possible alignments. We believe that bi-directional BWTis an interesting data structure on its own and couldbe applied to other pattern matching problems.

[1]  Nancy F. Hansen,et al.  Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry , 2008, Nature.

[2]  Ross Lippert,et al.  Space-Efficient Whole Genome Comparisons with BurrowsWheeler Transforms , 2005, J. Comput. Biol..

[3]  Timothy T. Harkins,et al.  Transcriptome sequencing with the Genome Sequencer FLX system , 2008 .

[4]  김동규,et al.  [서평]「Algorithms on Strings, Trees, and Sequences」 , 2000 .

[5]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[6]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[7]  R. Durbin,et al.  Mapping Quality Scores Mapping Short Dna Sequencing Reads and Calling Variants Using P

, 2022 .

[8]  Siu-Ming Yiu,et al.  Compressed indexing and local alignment of DNA , 2008, Bioinform..

[9]  Cole Trapnell,et al.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome , 2009, Genome Biology.

[10]  Gonzalo Navarro,et al.  Rank and select revisited and extended , 2007, Theor. Comput. Sci..

[11]  D. Bentley,et al.  Whole-genome re-sequencing. , 2006, Current opinion in genetics & development.

[12]  A. Mortazavi,et al.  Genome-Wide Mapping of in Vivo Protein-DNA Interactions , 2007, Science.

[13]  Bin Ma,et al.  ZOOM! Zillions of oligos mapped , 2008, Bioinform..

[14]  Allen D. Delaney,et al.  Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing , 2007, Nature Methods.

[15]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[16]  Gabor T. Marth,et al.  Whole-genome sequencing and variant discovery in C. elegans , 2008, Nature Methods.

[17]  Ruiqiang Li,et al.  SOAP: short oligonucleotide alignment program , 2008, Bioinform..