Similarity Search via Combinatorial Nets

We consider the nearest neighbor search problem in the so-called combinatorial framework: only direct comparisons between two pairwise similarity values are allowed. We assume that the similarity order of the input dataset satisfies the following consistency property: if x is the a-th most similar object to y and y is the b-th most similar object to z, then x is among the D(a + b) most similar objects to z. Although this comparison oracle gives much less information than the standard general metric space model, where distance values are given, it turns out that one can still design a deterministic preprocessing algorithm with almost linear time and space complexity and answer queries deterministically in near-logarithmic time. A key procedure of our main algorithm is the efficient construction of combinatorial nets. We show that this data structure is also useful for solving other important problems. For example, motivated by navigability questions, we show that for any dataset one can construct a visibility graph in which all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor in a logarithmic number of steps. Also, for the near-duplicate detection problem we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output.
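
To make the consistency property concrete, the following minimal Python sketch computes, by brute force, the smallest D for which the property holds on a small dataset, using only pairwise comparisons. The names `closer`, `ranks_from`, and `disorder_constant` are hypothetical and introduced here for illustration only. The sketch is far slower than near-linear time and is not the preprocessing algorithm or the combinatorial net construction described above; it only illustrates the assumption on the input.

```python
from functools import cmp_to_key
from itertools import permutations


def ranks_from(y, objects, closer):
    """Return {x: rank}, where rank 1 is the object most similar to y.

    `closer(y, u, v)` is a hypothetical comparison oracle: it returns True
    when u is more similar to y than v is. Similarity values themselves are
    never read. Ties are assumed not to occur.
    """
    others = [x for x in objects if x != y]
    order = sorted(
        others,
        key=cmp_to_key(lambda u, v: -1 if closer(y, u, v) else 1),
    )
    return {x: i + 1 for i, x in enumerate(order)}


def disorder_constant(objects, closer):
    """Smallest D with rank_z(x) <= D * (rank_y(x) + rank_z(y)) for all
    triples of distinct objects x, y, z (brute force over all triples)."""
    rank = {y: ranks_from(y, objects, closer) for y in objects}
    return max(
        rank[z][x] / (rank[y][x] + rank[z][y])
        for x, y, z in permutations(objects, 3)
    )


if __name__ == "__main__":
    # Assumed toy data: points on a line, compared by absolute difference.
    points = [0, 1, 3, 7]
    oracle = lambda y, u, v: abs(u - y) < abs(v - y)
    print(disorder_constant(points, oracle))
```

Here `rank[y][x] = a` and `rank[z][y] = b` correspond to "x is the a-th most similar object to y" and "y is the b-th most similar object to z", so the returned value is the smallest D for which x is always among the D(a + b) most similar objects to z.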

