Indexing techniques for metric databases with costly searches

Similarity search in database systems is becoming an increasingly important task in modern application domains such as artificial intelligence, computational biology, pattern recognition and data mining. As the kinds of information being stored have evolved, applications with new data types such as text, images, videos, audio, DNA and protein sequences have begun to appear. Despite extensive research and the development of a plethora of index structures, similarity search is still too costly in many application domains, especially when measuring the similarity between a pair of objects is expensive. In this dissertation, the queries we consider fall into two classes: similarity search queries and similarity join queries. Several new indexing techniques to improve the performance of both classes are proposed. For similarity search queries, reference-based indexing methods applicable to both static and growing databases are proposed. For similarity join queries, a generalized nearest neighbor framework and several search and optimization algorithms are proposed. Extensive experiments evaluate the parameters used by the proposed methods and their performance improvements over state-of-the-art algorithms.
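
As a rough illustration of the general idea behind reference-based indexing, and not the specific methods proposed in the dissertation, the sketch below precomputes the distance from every database object to a small set of reference objects and then uses the triangle inequality to bound the query-to-object distance, so that the expensive distance function is called only when the bounds are inconclusive. The edit distance, the function names (`build_index`, `range_search`) and the sample strings are assumptions chosen for illustration; how references are selected, which is central to such methods, is not addressed here.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; stands in for a costly metric (assumed here)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def build_index(db: Sequence[str], refs: Sequence[str],
                dist: Callable[[str, str], int]) -> List[List[int]]:
    """Precompute the distance from each database object to each reference."""
    return [[dist(obj, ref) for ref in refs] for obj in db]


def range_search(query: str, radius: int, db: Sequence[str],
                 refs: Sequence[str], index: List[List[int]],
                 dist: Callable[[str, str], int]) -> Tuple[List[str], int]:
    """Return objects within `radius` of `query` and the number of
    query-to-object distance computations that could not be avoided."""
    q_to_ref = [dist(query, ref) for ref in refs]
    results: List[str] = []
    computed = 0
    for obj, row in zip(db, index):
        # Triangle inequality: |d(q,r) - d(o,r)| <= d(q,o) <= d(q,r) + d(o,r)
        lower = max(abs(qr - o_r) for qr, o_r in zip(q_to_ref, row))
        upper = min(qr + o_r for qr, o_r in zip(q_to_ref, row))
        if lower > radius:
            continue                 # pruned without computing d(q, o)
        if upper <= radius:
            results.append(obj)      # accepted without computing d(q, o)
            continue
        computed += 1                # bounds inconclusive: pay for the distance
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed


if __name__ == "__main__":
    # Hypothetical toy data; real databases and reference sets are far larger.
    db = ["ACGT", "ACGA", "TTTT", "GGGGGGGG", "ACGTT"]
    refs = ["ACGT", "TTTT"]
    index = build_index(db, refs, edit_distance)
    hits, computed = range_search("ACGG", 1, db, refs, index, edit_distance)
    print(hits, computed)
```

The sketch still computes the distance from the query to every reference, so it pays off only when the number of references is small relative to the database and the bounds eliminate most candidates.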
