Fast search in DNA sequence databases using punctuation and indexing

Exact pattern searching in DNA sequence databases has applications in identification of highly conserved regulatory sequences, the design of hybridization probes, and improving performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, Compressed-Punctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences. cp-BM encodes two bits to represent each A, T, C, G character (4-character 8 bit (4C8B) compression), plus punctuator characters to indicate unambiguously the encoding frame of the compressed target sequence, thereby solving the misalignment problem in searching patterns with ordinary 4C8B compression. cp-BM searches DNA patterns at least 6 times faster than AGREP for pattern lengths ≥ 128 and between 2-fold and 5-fold faster than d-BM for all pattern lengths. cp-BM's performance is enhanced by punctuator indexing and multiple punctuators, especially for short sequences, yielding greater than 10-fold enhancements compared to d-BM and AGREP. In addition, cp-BM outperformed BLAT for sequences 64 or more bases in length, and was more than three-fold faster for 256 base sequences.

[1]  Ricardo A. Baeza-Yates,et al.  Direct pattern matching on compressed text , 1998, Proceedings. String Processing and Information Retrieval: A South American Symposium (Cat. No.98EX207).

[2]  Xin Chen,et al.  A compression algorithm for DNA sequences and its applications in genome comparison , 2000, RECOMB '00.

[3]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[4]  Ayumi Shinohara,et al.  A Boyer-Moore Type Algorithm for Compressed Pattern Matching , 2000, CPM.

[5]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[6]  Robert S. Boyer,et al.  A fast string searching algorithm , 1977, CACM.

[7]  Lei Chen,et al.  Compressed pattern matching in DNA sequences , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[8]  Gary Benson,et al.  Efficient two-dimensional compressed matching , 1992, Data Compression Conference, 1992..

[9]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[10]  Ayumi Shinohara,et al.  Speeding Up Pattern Matching by Text Compression , 2000, CIAC.

[11]  Philip Gage,et al.  A new algorithm for data compression , 1994 .

[12]  김동규,et al.  [서평]「Algorithms on Strings, Trees, and Sequences」 , 2000 .