A Practical Method for Approximate Subsequence Search in DNA Databases

We propose an accurate and efficient method for approximate subsequence search in large DNA databases. The method uses a binary trie as its primary index structure and stores in it all the window subsequences extracted from a DNA sequence. To answer an approximate subsequence query, it traverses the binary trie breadth-first and applies dynamic programming along each traversed path to retrieve all matching subsequences. Because the trie stores only window subsequences of a predetermined length, long query sequences incur a large post-processing time. To overcome this problem, we divide a long query sequence into shorter pieces, search for each piece separately, and then merge the results.
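The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 2-bit base encoding, the use of edit distance as the similarity measure, and the names `Node`, `build_trie`, `search`, and `candidate_starts` are all assumptions made for illustration. The query-splitting step uses a standard pigeonhole filter as a stand-in, since the abstract does not specify how piece results are merged.

```python
from collections import deque

# Assumed 2-bit encoding: each base occupies two levels of the binary trie.
CODE = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
BASE = {bits: base for base, bits in CODE.items()}


class Node:
    __slots__ = ("child", "positions")

    def __init__(self):
        self.child = [None, None]  # 0-branch and 1-branch
        self.positions = []        # start offsets of windows ending here


def build_trie(seq, w):
    """Insert every length-w window of seq into a binary trie."""
    root = Node()
    for start in range(len(seq) - w + 1):
        node = root
        for base in seq[start:start + w]:
            for bit in CODE[base]:
                if node.child[bit] is None:
                    node.child[bit] = Node()
                node = node.child[bit]
        node.positions.append(start)
    return root


def window_starts(node):
    """Collect the start offsets of all windows stored below node."""
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        out.extend(n.positions)
        stack.extend(c for c in n.child if c is not None)
    return out


def search(root, query, k):
    """Breadth-first traversal carrying one edit-distance DP row per path.

    Returns {window_start: distance} for every window that has a prefix
    within edit distance k of the query.
    """
    hits = {}
    queue = deque([(root, list(range(len(query) + 1)))])
    while queue:
        node, row = queue.popleft()
        if row[-1] <= k:
            for start in window_starts(node):
                hits[start] = min(hits.get(start, row[-1]), row[-1])
        if min(row) > k:
            continue  # no extension of this path can come back within k
        for b1 in (0, 1):
            mid = node.child[b1]
            if mid is None:
                continue
            for b2 in (0, 1):
                nxt = mid.child[b2]
                if nxt is None:
                    continue
                base = BASE[(b1, b2)]
                new = [row[0] + 1]
                for j, q in enumerate(query, 1):
                    new.append(min(row[j] + 1, new[j - 1] + 1,
                                   row[j - 1] + (q != base)))
                queue.append((nxt, new))
    return hits


def candidate_starts(root, query, k, w):
    """Split a long query into pieces of length <= w and merge the hits.

    Pigeonhole filter (an assumption, not the paper's merge step): if the
    whole query matches with at most k errors, at least one of the p pieces
    matches with at most k // p errors. The returned offsets are candidates
    that still need verification, and with insertions or deletions the true
    start may differ from a candidate by up to k positions.
    """
    pieces = [(off, query[off:off + w]) for off in range(0, len(query), w)]
    budget = k // len(pieces)
    candidates = set()
    for off, piece in pieces:
        for start in search(root, piece, budget):
            if start - off >= 0:
                candidates.add(start - off)
    return candidates
```

For example, after `root = build_trie("ACGTACGTTG", 4)`, the call `search(root, "ACGT", 0)` reports the windows starting at offsets 0 and 4. The call `candidate_starts(root, "ACGTTG", 0, 4)` returns both offsets as candidates, of which only offset 4 survives verification against the full query. That verification is the post-processing cost that grows with query length.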
