An efficient approach for sequence matching in large DNA databases

DNA sequence matching is one of the most crucial operations in molecular biology. Because DNA databases contain a huge volume of sequences, fast indexes are essential for processing sequence matching efficiently. In this paper, we first point out the problems of the suffix tree, an index structure widely used for DNA sequence matching, with respect to storage overhead, search performance, and the difficulty of seamless integration with a DBMS. We then propose a new index structure that resolves these problems. The proposed index structure consists of two parts: the primary part represents the trie as a binary bit string without any pointers, and the secondary part provides fast access to the trie's leaf nodes that must be visited for post-processing. We also present efficient algorithms for DNA sequence matching based on this index. To verify the superiority of the proposed approach, we evaluate its performance through a series of experiments. The results show that the proposed approach requires less storage space and can be a few orders of magnitude faster than the suffix tree.
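
The abstract does not spell out the bit-string layout, so the following is only a minimal sketch of how a trie might be stored without pointers, not the paper's actual structure. It assumes a 2-bit code per base, indexes fixed-length windows of length k, and serializes the binary trie breadth-first with two child-presence bits per internal node; a child is then located by counting 1 bits (a rank operation) instead of following a pointer. The names `build_index`, `find`, and `leaf_positions` are hypothetical, and the `leaf_positions` table merely stands in for the secondary part that leads from a leaf to the data needed for post-processing.

```python
from collections import deque

# Assumed 2-bit encoding of the DNA alphabet (not taken from the paper).
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def to_bits(s):
    """Encode a DNA string as a string of '0'/'1' characters."""
    return "".join(CODE[c] for c in s)


def build_index(sequence, k):
    """Index every length-k window of `sequence`.

    Returns (bits, leaf_positions):
      bits           - breadth-first bit string, two bits per internal node
                       (is there a 0-child? is there a 1-child?)
      leaf_positions - for each leaf, in breadth-first order, the window
                       start offsets that end at that leaf
    """
    # Temporary pointer-based trie, used only while building.
    root = {}
    for pos in range(len(sequence) - k + 1):
        node = root
        for b in to_bits(sequence[pos:pos + k]):
            node = node.setdefault(b, {})
        node.setdefault("pos", []).append(pos)

    # Breadth-first serialization; all leaves sit at depth 2*k.
    bits, leaf_positions = [], []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == 2 * k:
            leaf_positions.append(node["pos"])
            continue
        for b in "01":
            child = node.get(b)
            bits.append(1 if child is not None else 0)
            if child is not None:
                queue.append((child, depth + 1))
    return bits, leaf_positions


def find(bits, leaf_positions, k, pattern):
    """Return start offsets of exact occurrences of a length-k pattern."""
    if len(pattern) != k:
        raise ValueError("this sketch only supports patterns of length k")
    internal = len(bits) // 2  # number of internal nodes
    node = 0                   # breadth-first number of the root
    for b in to_bits(pattern):
        p = 2 * node + int(b)
        if bits[p] == 0:
            return []
        # The i-th 1 bit belongs to node i, so a rank query replaces the
        # pointer. The linear-time sum is for clarity only; a practical
        # index would keep precomputed counts.
        node = sum(bits[:p + 1])
    return leaf_positions[node - internal]


if __name__ == "__main__":
    bits, leaves = build_index("ACGTAC", 2)
    print(find(bits, leaves, 2, "AC"))  # [0, 4]
    print(find(bits, leaves, 2, "GG"))  # []
```

Breadth-first order is chosen here because it makes child lookup a pure counting operation; other pointerless layouts, such as a preorder bit string, are equally possible and the abstract does not say which one the authors use.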
