D-Cache: Universal Distance Cache for Metric Access Methods

Caching of accessed disk pages has been used successfully for decades in database technology, effectively amortizing the I/O operations needed within a stream of query or update requests. However, in modern complex databases, such as multimedia databases, I/O cost becomes a minor performance factor. In particular, metric access methods (MAMs), used for similarity search in complex unstructured data, have been designed to minimize the number of distance computations rather than the I/O cost (when indexing or querying). Inspired by I/O caching in traditional databases, in this paper we introduce the idea of distance caching for use with MAMs, a novel approach to streamlining similarity search. As a result, we present the D-cache, a main-memory data structure that can be easily integrated into any MAM in order to save distance computations spent by queries and updates. In particular, we have modified two state-of-the-art MAMs to make use of the D-cache: the M-tree and Pivot tables. Moreover, we present the D-file, an index-free MAM based on simple sequential search augmented by the D-cache. The experimental evaluation shows that the performance gain achieved by the D-cache is significant for all the MAMs, especially for the D-file.
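
To make the idea of distance caching concrete, the following is a minimal Python sketch, not the D-cache or D-file as defined in the paper. It assumes an unbounded dictionary as the cache, integer object identifiers, and earlier query objects reused as pivots. The names `DistanceCache`, `lower_bound`, and `range_query` are hypothetical. The only property it relies on is the triangle inequality: for any pivot p, |d(q, p) - d(p, o)| is a lower bound on d(q, o), so an object can be discarded without computing d(q, o) whenever that bound exceeds the query radius.

```python
from typing import Callable, Dict, List, Optional, Tuple


class DistanceCache:
    """Hypothetical, unbounded in-memory cache of distances between object ids."""

    def __init__(self, metric: Callable[[object, object], float]):
        self.metric = metric
        self.store: Dict[Tuple[int, int], float] = {}
        self.computations = 0  # number of real metric evaluations

    @staticmethod
    def _key(a_id: int, b_id: int) -> Tuple[int, int]:
        # A metric is symmetric, so (a, b) and (b, a) share one entry.
        return (a_id, b_id) if a_id <= b_id else (b_id, a_id)

    def lookup(self, a_id: int, b_id: int) -> Optional[float]:
        """Return the cached distance, or None if it was never computed."""
        return self.store.get(self._key(a_id, b_id))

    def distance(self, a_id: int, a: object, b_id: int, b: object) -> float:
        """Return d(a, b), evaluating the metric only on a cache miss."""
        key = self._key(a_id, b_id)
        if key not in self.store:
            self.computations += 1
            self.store[key] = self.metric(a, b)
        return self.store[key]

    def lower_bound(self, q_id: int, o_id: int, pivot_ids) -> float:
        """Best triangle-inequality lower bound on d(q, o) from cached entries."""
        best = 0.0
        for p_id in pivot_ids:
            d_qp = self.lookup(q_id, p_id)
            d_po = self.lookup(p_id, o_id)
            if d_qp is not None and d_po is not None:
                best = max(best, abs(d_qp - d_po))
        return best


def range_query(cache: DistanceCache, data: Dict[int, object],
                q_id: int, q: object, radius: float,
                pivots: Dict[int, object]) -> List[int]:
    """Sequential scan that skips objects whose lower bound exceeds the radius."""
    # Distances from the query to the pivots are computed (or reused) first.
    for p_id, p in pivots.items():
        cache.distance(q_id, q, p_id, p)
    result = []
    for o_id, o in data.items():
        if cache.lower_bound(q_id, o_id, pivots) > radius:
            continue  # filtered out without a distance computation
        if cache.distance(q_id, q, o_id, o) <= radius:
            result.append(o_id)
    return result


if __name__ == "__main__":
    # Toy 1-D data under the absolute-difference metric (illustrative only).
    data = {0: 0.0, 1: 1.0, 2: 5.0, 3: 9.0, 4: 10.0}
    cache = DistanceCache(lambda x, y: abs(x - y))
    first = range_query(cache, data, 100, 0.5, 1.0, pivots={})
    second = range_query(cache, data, 101, 9.5, 1.0, pivots={100: 0.5})
    print(first, second, cache.computations)
```

In this toy run the first query computes all five distances. The second query reuses them through the earlier query object (id 100) as a pivot, so it needs one query-to-pivot distance plus two distances to the surviving candidates, for eight metric evaluations in total instead of ten. A practical cache would also need a size bound and a replacement policy, which this sketch leaves out.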
