Bidirectional search in a string with wavelet trees and bidirectional matching statistics

Searching for genes encoding microRNAs (miRNAs) is an important task in genome analysis. Because the secondary structure of an miRNA, but not its sequence, is highly conserved, miRNA genes can be found by locating regions of a genomic DNA sequence that match the structure. Algorithms that search the DNA sequence bidirectionally are known to outperform those based on unidirectional search for this task. The data structures that support bidirectional search (affix trees and affix arrays), however, are rather complex and consume a large amount of space. Here, we present a new data structure, the bidirectional wavelet index, that supports bidirectional search in much less space. With it, candidates for RNA secondary structural patterns can be searched for in large genomes, for example the complete human genome. Another important application of this data structure is short read alignment. As a second contribution, we show how bidirectional matching statistics can be computed in linear time.
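To make the notion of bidirectional search concrete, the following is a minimal Python sketch, not the bidirectional wavelet index itself. It keeps an explicit list of occurrence positions and filters it each time the current match is extended by one character to the left or to the right, whereas a real index would maintain suffix array intervals instead. The function names (`occurrences`, `extend_left`, `extend_right`, `hairpins`), the restriction to Watson-Crick pairs on DNA, and the toy sequence are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Watson-Crick pairing on the DNA alphabet (assumption: no wobble pairs).
COMPLEMENT: Dict[str, str] = {"A": "T", "C": "G", "G": "C", "T": "A"}


def occurrences(text: str, pattern: str) -> List[int]:
    """Start positions of every occurrence of pattern in text (naive scan)."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]


def extend_left(text: str, starts: List[int], c: str) -> List[int]:
    """Given the starts of a string w, return the starts of c + w."""
    return [s - 1 for s in starts if s > 0 and text[s - 1] == c]


def extend_right(text: str, starts: List[int], length: int, c: str) -> List[int]:
    """Given the starts of a string w of the given length, return the starts of w + c."""
    return [s for s in starts
            if s + length < len(text) and text[s + length] == c]


def hairpins(text: str, loop: str, stem_len: int) -> List[Tuple[int, str]]:
    """Find the loop flanked by stem_len complementary base pairs.

    The search starts in the middle (the loop) and alternates one
    extension to the left with one extension to the right, which a
    unidirectional index cannot do directly.
    """
    frontier = [(occurrences(text, loop), len(loop))]
    for _ in range(stem_len):
        next_frontier = []
        for starts, length in frontier:
            for base, partner in COMPLEMENT.items():
                left = extend_left(text, starts, base)
                both = extend_right(text, left, length + 1, partner)
                if both:
                    next_frontier.append((both, length + 2))
        frontier = next_frontier
    return sorted((s, text[s:s + length])
                  for starts, length in frontier for s in starts)


if __name__ == "__main__":
    genome = "TTGACGAAAGTCAA"  # toy sequence, made up for this sketch
    print(hairpins(genome, "GAAA", 3))
```

Each extension here costs time proportional to the number of surviving occurrences. An index-based approach aims to perform the same left and right extensions without scanning the text, which is what makes the structure-driven search practical on large genomes.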
