Reference-based indexing of sequence databases

We consider the problem of similarity search in a very large sequence database with edit distance as the similarity measure. Given limited main memory, our goal is to develop a reference-based index that reduces the number of costly edit distance computations needed to answer a query. The idea in reference-based indexing is to select a small set of reference sequences that serve as surrogates for the other sequences in the database. We consider two novel strategies for selecting references, as well as a new strategy for assigning references to database sequences. Our experimental results show that our selection and assignment methods far outperform competing methods. For example, our methods prune up to 20 times as many sequences as the Omni method and up to 30 times as many as frequency vectors. Our methods also scale well to databases containing many sequences, very long sequences, or both.
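
The general mechanism behind reference-based pruning can be illustrated with a minimal sketch. It is not the selection or assignment strategy proposed in this work: the references are assumed to be given, every database sequence is compared against every reference, and the names `build_index` and `range_query` are hypothetical. Because edit distance is a metric, the precomputed distances to the references bound the query-to-sequence distance through the triangle inequality, so many sequences can be discarded or accepted without computing their edit distance to the query.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def build_index(database: List[str], references: List[str]) -> List[List[int]]:
    """Precompute the distance from each sequence to each reference.

    Assumption: all references are used for all sequences; assigning a
    subset of references to each sequence is not modelled here.
    """
    return [[edit_distance(s, r) for r in references] for s in database]


def range_query(query: str, radius: int, database: List[str],
                references: List[str],
                index: List[List[int]]) -> Tuple[List[str], int]:
    """Return sequences within `radius` of `query` and the number pruned."""
    q_to_ref = [edit_distance(query, r) for r in references]
    results, pruned = [], 0
    for s, s_to_ref in zip(database, index):
        # Triangle inequality: |d(q,r) - d(s,r)| <= d(q,s) <= d(q,r) + d(s,r)
        lower = max(abs(qd - sd) for qd, sd in zip(q_to_ref, s_to_ref))
        upper = min(qd + sd for qd, sd in zip(q_to_ref, s_to_ref))
        if lower > radius:
            pruned += 1           # cannot be a match; skip the costly comparison
        elif upper <= radius:
            results.append(s)     # guaranteed match; no comparison needed
        elif edit_distance(query, s) <= radius:
            results.append(s)     # bounds inconclusive; verify directly
    return results, pruned


if __name__ == "__main__":
    db = ["ACGT", "ACGA", "TTTT", "GGGGGGGG"]
    refs = ["ACGT"]
    idx = build_index(db, refs)
    print(range_query("ACGT", 1, db, refs, idx))
```

In this sketch the cost of a query is one edit distance computation per reference plus one per sequence whose bounds are inconclusive, which is why the choice of references largely determines how many sequences are pruned.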
