Rank-Based Similarity Search: Reducing the Dimensional Dependence

This paper introduces a data structure for k-NN search, the Rank Cover Tree (RCT), whose pruning tests rely solely on the comparison of similarity values; other properties of the underlying space, such as the triangle inequality, are not employed. Objects are selected according to their ranks with respect to the query object, allowing much tighter control on the overall execution costs. A formal theoretical analysis shows that with very high probability, the RCT returns a correct query result in time that depends very competitively on a measure of the intrinsic dimensionality of the data set. The experimental results for the RCT show that non-metric pruning strategies for similarity search can be practical even when the representational dimension of the data is extremely high. They also show that the RCT is capable of meeting or exceeding the level of performance of state-of-the-art methods that make use of metric pruning or other selection tests involving numerical constraints on distance values.
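
To make the idea of comparison-only selection concrete, the following is a minimal sketch and not the RCT itself: a brute-force selection of the k candidates ranked closest to a query, in which similarity values are only ever compared with one another, never added or bounded. The names `rank_select` and `dissim` are hypothetical and chosen for illustration. Because only the ordering matters, the dissimilarity need not satisfy the triangle inequality, and any monotone transformation of it yields the same result.

```python
from functools import cmp_to_key


def rank_select(query, candidates, k, dissim):
    """Return the k candidates ranked closest to `query`.

    Only pairwise comparisons of dissimilarity values are used; no
    arithmetic on the values and no triangle inequality is assumed.
    """
    def compare(a, b):
        da, db = dissim(query, a), dissim(query, b)
        # Yields -1, 0, or 1 depending on which candidate ranks closer.
        return (da > db) - (da < db)

    # sorted() is stable, so ties keep their original input order.
    ranked = sorted(candidates, key=cmp_to_key(compare))
    return ranked[:k]


# Squared difference violates the triangle inequality, yet it induces
# the same ranking as the absolute difference.
squared = lambda x, y: (x - y) ** 2
absolute = lambda x, y: abs(x - y)

points = [1, 4, 9, 6, 20]
by_squared = rank_select(5, points, 2, squared)
by_absolute = rank_select(5, points, 2, absolute)
```

Both calls select the same two neighbors, since the two measures order the candidates identically with respect to the query. A rank-based index such as the one described above would apply this kind of selection to small candidate sets rather than to the whole data set.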
