Similarity Search via Combinatorial Nets

We consider the nearest neighbor search problem in the so-called combinatorial framework: only direct comparisons between two pairwise similarity values are allowed. We assume that the similarity order for the input dataset has the following consistency property: if x is the a-th most similar object to y and y is the b-th most similar object to z, then x is among the D(a + b) most similar objects to z. Although this comparison oracle gives much less information than the standard general metric space model, where distance values are given, one can still design a deterministic preprocessing algorithm with almost linear time and space complexity and answer queries deterministically in near-logarithmic time. A key procedure of our main algorithm is the efficient construction of combinatorial nets. We show that this data structure is useful for solving other important problems. For example, motivated by navigability questions, we show that for any dataset a visibility graph can be constructed in which all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor in a logarithmic number of steps. Also, for the near-duplicate detection problem we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output.
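To make the consistency property concrete, the following minimal sketch computes rank tables using only a comparison oracle and then finds, by brute force, the smallest D for which the property holds on a toy dataset. It illustrates the definition only and is not the preprocessing algorithm of the paper. The names `make_oracle`, `rank_table`, and `disorder_constant` are hypothetical, and backing the oracle with points on a line is an assumption made purely so the example is runnable.

```python
from functools import cmp_to_key
from itertools import permutations


def make_oracle(points):
    """Hypothetical comparison oracle backed by 1-D points (an assumption).

    Callers only learn which of two objects is more similar to a reference
    object; the distance values themselves are never exposed.
    """
    def compare(ref, a, b):
        da = abs(points[ref] - points[a])
        db = abs(points[ref] - points[b])
        return (da > db) - (da < db)  # -1 if a is more similar to ref than b
    return compare


def rank_table(n, compare):
    """rank[y][x] = 1-based position of x among the other objects,
    ordered from most to least similar to y."""
    rank = {}
    for y in range(n):
        others = [x for x in range(n) if x != y]
        others.sort(key=cmp_to_key(lambda a, b, y=y: compare(y, a, b)))
        rank[y] = {x: i + 1 for i, x in enumerate(others)}
    return rank


def disorder_constant(n, rank):
    """Smallest D with rank[z][x] <= D * (rank[y][x] + rank[z][y])
    for all distinct x, y, z. Brute force over all triples, O(n^3)."""
    return max(
        rank[z][x] / (rank[y][x] + rank[z][y])
        for x, y, z in permutations(range(n), 3)
    )


# Toy usage: four objects placed on a line.
compare = make_oracle([0.0, 1.0, 3.0, 7.0])
rank = rank_table(4, compare)
print(disorder_constant(4, rank))
```

Here `rank[y][x]` plays the role of a and `rank[z][y]` the role of b in the property stated above. The cubic scan is only meant to show what D measures on a small input; it says nothing about how the paper's algorithms achieve their running times.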
