Mitigating the Curse of Dimensionality for Exact kNN Retrieval

Efficient data indexing and exact k-nearest-neighbor (kNN) retrieval remain challenging in high-dimensional spaces. This work highlights the difficulties of indexing high-dimensional and tightly clustered dataspaces by exploring several important tunable parameters for optimizing kNN query performance with the iDistance and iDStar algorithms. We experiment on real and synthetic datasets of varying size, cluster density, and dimensionality, and compare performance primarily through filter-and-refine efficiency and execution time. Results vary widely across parameter values and provide new insights into, and justification for, prior best-use practices. Local segmentation with iDStar consistently outperforms iDistance in any clustered space below 256 dimensions, setting a new benchmark for efficient and exact kNN retrieval in high-dimensional spaces. We propose several directions for future work to further increase performance in high-dimensional real-world settings.
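
For readers unfamiliar with iDistance, the following is a minimal sketch of the filter-and-refine idea the abstract refers to. It is not the authors' implementation. It assumes the standard iDistance mapping from the literature (each point is keyed by `i * c + dist(point, refs[i])` for its nearest reference point), and it uses a sorted Python list with `bisect` as a stand-in for the B+-tree. The names `build_index`, `knn`, `refs`, `c`, `step`, and `EPS` are hypothetical.

```python
import bisect
import math

EPS = 1e-9  # assumed slack so floating-point rounding cannot drop a borderline key


def build_index(points, refs, c):
    """Key each point by i * c + dist(point, refs[i]) for its nearest reference i."""
    entries = []
    radii = [0.0] * len(refs)  # farthest assigned point per partition
    for pid, p in enumerate(points):
        i = min(range(len(refs)), key=lambda j: math.dist(p, refs[j]))
        d = math.dist(p, refs[i])
        if d >= c:
            raise ValueError("c must exceed every partition radius")
        radii[i] = max(radii[i], d)
        entries.append((i * c + d, pid))
    entries.sort()  # stand-in for inserting the keys into a B+-tree
    return entries, radii


def knn(points, refs, c, index, q, k, step):
    """Exact kNN: grow the search radius r until the k-th best distance is <= r."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    entries, radii = index
    keys = [key for key, _ in entries]
    r = step
    while True:
        # Filter: collect candidates from every partition the query sphere touches.
        candidates = set()
        for i, ref in enumerate(refs):
            dq = math.dist(q, ref)
            if dq - r > radii[i]:
                continue  # the sphere around q cannot reach this partition
            lo = bisect.bisect_left(keys, i * c + max(0.0, dq - r - EPS))
            hi = bisect.bisect_right(keys, i * c + min(radii[i], dq + r) + EPS)
            candidates.update(pid for _, pid in entries[lo:hi])
        # Refine: compute true distances for the candidates only.
        scored = sorted((math.dist(q, points[pid]), pid) for pid in candidates)
        # By the triangle inequality every point within r of q is a candidate,
        # so the answer is exact once k candidates lie within r.
        if len(scored) >= k and scored[k - 1][0] <= r:
            return scored[:k], len(candidates)
        r += step


points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (2, 2)]
refs = [(0, 0), (5, 5)]
index = build_index(points, refs, c=10)
neighbors, examined = knn(points, refs, 10, index, q=(0.9, 0.1), k=3, step=0.5)
print([pid for _, pid in neighbors], examined)
```

The second return value counts the candidates whose true distance was computed in the final round. This is one simple proxy for filter-and-refine efficiency: the fewer candidates that survive the filter relative to the dataset size, the less refinement work is needed.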
