An Experimental Study of Compressed Indexing and Local Alignments of DNA

Recent experimental studies on compressed indexes (BWT, CSA, FM-index) have confirmed their practicality for indexing long DNA sequences such as the human genome (about 3 billion characters) in the main memory [5,13,16]. However, these indexes are designed for exact pattern matching, which is too stringent for most biological applications. The demand is often on finding local alignments (pairs of similar substrings with gaps allowed). In this paper, we show how to build a software called BWT-SW that exploits a BWT index of a text T to speed up the dynamic programming for finding all local alignments with any pattern P. Experiments reveal that BWT-SW is very efficient (e.g., aligning a pattern of length 3,000 with the human genome takes less than a minute). We have also analyzed BWT-SW mathematically, using a simpler model (with gaps disallowed) and random strings. We find that the expected running time is O(|T|0.628|P|). As far as we know, BWT-SW is the first practical tool that can find all local alignments.

[1]  Ross Lippert,et al.  Space-Efficient Whole Genome Comparisons with BurrowsWheeler Transforms , 2005, J. Comput. Biol..

[2]  Roberto Grossi,et al.  Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract) , 2000, STOC '00.

[3]  Giovanni Manzini,et al.  Opportunistic data structures with applications , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[4]  Siu-Ming Yiu,et al.  Practical aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences , 2004, ALENEX/ANALC.

[5]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[6]  Giovanni Manzini,et al.  An experimental study of an opportunistic index , 2001, SODA '01.

[7]  Bin Ma,et al.  Patternhunter Ii: Highly Sensitive and Fast Homology Search , 2004, J. Bioinform. Comput. Biol..

[8]  Wing-Kai Hon,et al.  Constructing Compressed Suffix Arrays with Large Alphabets , 2003, ISAAC.

[9]  Jignesh M. Patel,et al.  OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences , 2003, VLDB.

[10]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[11]  Kunihiko Sadakane,et al.  New text indexing functionalities of the compressed suffix arrays , 2003, J. Algorithms.

[12]  Malcolm P. Atkinson,et al.  Database indexing for large DNA and protein sequence collections , 2002, The VLDB Journal.

[13]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[14]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[15]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[16]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[17]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[18]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[19]  Stefan Kurtz,et al.  Reducing the space requirement of suffix trees , 1999 .