Clustered pivot tables for I/O-optimized similarity search

Pivot tables are a popular metric access method, primarily designed as a main-memory index structure. They have repeatedly been shown to be very efficient in terms of distance computations, and hence well suited to computationally expensive distance functions. However, for cheaper distance functions and/or huge datasets exceeding the capacity of main memory, classic pivot tables become inefficient. The situation is changing dramatically with the rise of solid-state disks, which decrease seek times, so that even small fragments of data stored in secondary memory can now be accessed efficiently. In this paper, we propose a persistent variant of pivot tables, the clustered pivot tables, which focuses on minimizing I/O cost when accessing small data blocks (a few kilobytes). The clustered pivot tables employ a preprocessing method that uses the M-tree as a clustering technique, together with an original heuristic for I/O-optimized kNN query processing. In the experiments, we empirically show that the proposed method significantly reduces the number of I/O operations needed during query processing.
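
For readers unfamiliar with the baseline, the following is a minimal sketch of a classic main-memory pivot table with kNN filtering. It is not the clustered, persistent variant proposed here. It assumes a Euclidean distance and uses hypothetical names (`build_pivot_table`, `knn_query`) chosen only for illustration. Each object stores its distances to a few pivots, and the triangle inequality gives a lower bound on the query-to-object distance that lets most exact distance computations be skipped.

```python
import heapq
import math


def euclidean(a, b):
    # Assumed metric for this sketch; any metric distance would work.
    return math.dist(a, b)


def build_pivot_table(objects, pivots, dist=euclidean):
    # One row per object: its precomputed distances to every pivot.
    return [[dist(o, p) for p in pivots] for o in objects]


def knn_query(query, k, objects, pivots, table, dist=euclidean):
    # Distances from the query to the pivots (counted as computations).
    q_to_pivots = [dist(query, p) for p in pivots]
    computed = len(pivots)

    # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o) for every pivot p,
    # so the maximum over pivots is a lower bound on d(q,o).
    bounds = sorted(
        (max(abs(qp - op) for qp, op in zip(q_to_pivots, row)), idx)
        for idx, row in enumerate(table)
    )

    heap = []  # max-heap of the current k best, stored as (-distance, index)
    for lower_bound, idx in bounds:
        # Once the bound reaches the current k-th distance, no remaining
        # object can improve the result, because bounds are sorted.
        if len(heap) == k and lower_bound >= -heap[0][0]:
            break
        d = dist(query, objects[idx])
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))

    neighbours = sorted((-neg_d, idx) for neg_d, idx in heap)
    return neighbours, computed
```

The sketch scans the whole table in memory to compute the lower bounds. That scan is the step that becomes costly once the table no longer fits in main memory, which is the setting the clustered pivot tables address.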
