Faster proximity searching with the distal SAT

Abstract. Proximity searching has been a source of puzzling behaviors and counterintuitive findings that contradict well-established rules of algorithm design. One example is the linked list: it is the worst data structure for exact searching, yet one of the most competitive for proximity searching. Common sense also dictates that an online data structure is less competitive than its full-knowledge, static version. A counterexample in proximity searching is the static Spatial Approximation Tree (SAT), which is slower than its dynamic version (DSAT). In this paper we show that changing only the insertion policy of the SAT, leaving every other aspect of the data structure untouched, can produce a systematically faster index. We call this index the Distal Spatial Approximation Tree (DiSAT). We found that even a random insertion policy produces a faster version of the SAT, which explains why the DSAT is faster than the SAT. In brief, the SAT is improved by selecting distal, instead of proximal, nodes. This is the exact opposite of the insertion policy proposed in the original paper, and it can be used in both main-memory and secondary-memory versions of the index. We tested our approach against representatives of the state of the art in exact proximity searching. As often happens in experimental setups, there is no absolute winner across all the aspects tested. Our data structure has no parameters to tune and a small memory footprint, and it can be constructed quickly. Our approach is among the most competitive; the indexes that outperform the DiSAT do so at the expense of larger memory usage or impractical construction time.
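
The difference between the insertion policies can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the policy is simply the order in which candidates are offered as neighbours of a node (closest first for the proximal policy, farthest first for the distal one, shuffled for the random one), and that a candidate becomes a neighbour when it is strictly closer to the node than to every neighbour chosen so far. The names `build_sat`, `range_search`, and the dictionary layout are hypothetical, and the search prunes with covering radii only, which is a simplification of the full SAT search.

```python
import math
import random


def build_sat(points, dist, policy="proximal", rng=None):
    """Build a SAT-like tree rooted at points[0] (hypothetical layout)."""
    rng = rng or random.Random(0)
    return _build(points[0], list(points[1:]), dist, policy, rng)


def _build(root, rest, dist, policy, rng):
    # The policy only decides the order in which candidates are examined.
    if policy == "proximal":      # closest to the root first
        order = sorted(rest, key=lambda x: dist(root, x))
    elif policy == "distal":      # farthest from the root first
        order = sorted(rest, key=lambda x: dist(root, x), reverse=True)
    elif policy == "random":
        order = list(rest)
        rng.shuffle(order)
    else:
        raise ValueError(f"unknown policy: {policy}")

    neighbours, others = [], []
    for x in order:
        # Assumed rule: x is a neighbour if it is closer to the root
        # than to every neighbour selected so far.
        if all(dist(x, root) < dist(x, n) for n in neighbours):
            neighbours.append(x)
        else:
            others.append(x)

    # Every remaining element goes to the subtree of its closest neighbour.
    bags = [[] for _ in neighbours]
    for x in others:
        i = min(range(len(neighbours)), key=lambda j: dist(x, neighbours[j]))
        bags[i].append(x)

    return {
        "point": root,
        # Covering radius: largest distance from the root to its subtree.
        "radius": max((dist(root, x) for x in rest), default=0.0),
        "children": [_build(n, bag, dist, policy, rng)
                     for n, bag in zip(neighbours, bags)],
    }


def range_search(node, q, r, dist, out):
    """Collect every point within distance r of q (covering-radius pruning only)."""
    if dist(q, node["point"]) <= r:
        out.append(node["point"])
    for child in node["children"]:
        # Triangle inequality: the subtree cannot intersect the query ball
        # when d(q, child) - covering_radius(child) > r.
        if dist(q, child["point"]) - child["radius"] <= r:
            range_search(child, q, r, dist, out)
    return out
```

Calling `build_sat(points, math.dist, policy="distal")` on a list of coordinate tuples and then `range_search(tree, q, r, math.dist, [])` returns the same answer set for all three policies; under the assumptions above, only the shape of the tree, and therefore the number of distance evaluations, changes.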
