Engineering Efficient and Effective Non-metric Space Library

We present a new similarity search library and discuss design and performance issues that arose during its development. We take the position that engineering is as important as algorithm design, and we aim to produce realistic benchmarks. To this end, we pay attention to a range of performance aspects and exploit modern hardware, which provides a high degree of parallelism. Because we focus on realistic measurements, we argue that the performance of a method should not be measured solely by the number of distance computations it performs: other costs, such as computing a cheaper distance function that approximates the original one, are often substantial. The paper includes preliminary experimental results that support this view. Rather than searching for the best method, we want to ensure that the library implements competitive baselines that can be useful for future work.
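The point about measurement can be illustrated with a minimal sketch of a filter-and-refine search. The sketch is not the library's API: `CountingDistance`, `cheap_proxy`, `filter_and_refine`, and `run_demo` are hypothetical names, squared L2 stands in for the original distance, and a distance over a prefix of the coordinates stands in for the cheaper approximation. It reports the number of original-distance calls next to the number of proxy calls and the wall-clock time.

```python
import random
import time


class CountingDistance:
    """Wraps a distance function and counts how often it is called."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.func(x, y)


def l2_squared(x, y):
    # Stand-in for the "original" (expensive) distance; an assumption.
    return sum((a - b) ** 2 for a, b in zip(x, y))


def cheap_proxy(x, y, prefix=4):
    # Hypothetical cheaper distance: squared L2 over the first `prefix`
    # coordinates only. It approximates (and lower-bounds) l2_squared.
    return sum((a - b) ** 2 for a, b in zip(x[:prefix], y[:prefix]))


def filter_and_refine(data, query, k, n_candidates, original, proxy):
    # Filter step: rank every point by the cheap proxy distance.
    ranked = sorted(range(len(data)), key=lambda i: proxy(data[i], query))
    candidates = ranked[:n_candidates]
    # Refine step: re-rank only the candidates by the original distance.
    refined = sorted(candidates, key=lambda i: original(data[i], query))
    return refined[:k]


def run_demo(n=2000, dim=32, k=5, n_candidates=50, seed=0):
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(dim)] for _ in range(n)]
    query = [rng.random() for _ in range(dim)]

    original = CountingDistance(l2_squared)
    proxy = CountingDistance(cheap_proxy)

    start = time.perf_counter()
    result = filter_and_refine(data, query, k, n_candidates, original, proxy)
    elapsed = time.perf_counter() - start

    return {
        "result": result,
        "original_calls": original.calls,
        "proxy_calls": proxy.calls,
        "elapsed_seconds": elapsed,
    }


if __name__ == "__main__":
    stats = run_demo()
    print("original distance calls:", stats["original_calls"])
    print("proxy distance calls:   ", stats["proxy_calls"])
    print("wall-clock seconds:     ", stats["elapsed_seconds"])
```

With these settings the original distance is evaluated only for the candidates, while the proxy is evaluated once per data point. A report based on original-distance calls alone would therefore hide most of the work, which is why the sketch also records elapsed time.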
