Cache-conscious performance optimization for similarity search

All-pairs similarity search can be implemented in two stages. The first stage is to partition the data and group potentially similar vectors. The second stage is to run a set of tasks where each task compares a partition of vectors with other candidate partitions. Because of data sparsity, accessing feature vectors in memory for runtime comparison in the second stage, incurs significant overhead due to the presence of memory hierarchy. This paper proposes a cache-conscious data layout and traversal optimization to reduce the execution time through size-controlled data splitting and vector coalescing. It also provides an analysis to guide the optimal choice for the parameter setting. Our evaluation with several application datasets verifies the performance gains obtained by the optimization and shows that the proposed scheme is upto 2.74x as fast as the cache-oblivious baseline.

[1]  Hector Garcia-Molina,et al.  Building a scalable and accurate copy detection mechanism , 1996, DL '96.

[2]  Jack J. Dongarra,et al.  A set of level 3 basic linear algebra subprograms , 1990, TOMS.

[3]  Joshua Alspector,et al.  Improved robustness of signature-based near-replica detection via lexicon randomization , 2004, KDD.

[4]  Martin L. Kersten,et al.  Generic Database Cost Models for Hierarchical Memory Systems , 2002, VLDB.

[5]  Daniele Quercia,et al.  Auralist: introducing serendipity into music recommendation , 2012, WSDM '12.

[6]  Ophir Frieder,et al.  Collection statistics for fast duplicate document detection , 2002, TOIS.

[7]  Jimmy J. Lin,et al.  No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity , 2011, SIGIR '11.

[8]  Jimmy J. Lin Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce , 2009, SIGIR.

[9]  John R. Gilbert,et al.  Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication , 2008, 2008 37th International Conference on Parallel Processing.

[10]  Wen-tau Yih,et al.  Adaptive near-duplicate detection via similarity learning , 2010, SIGIR.

[11]  Bing Liu,et al.  Opinion spam and analysis , 2008, WSDM '08.

[12]  Ranieri Baraglia,et al.  Scaling Out All Pairs Similarity Search with MapReduce , 2010, LSDS-IR@SIGIR.

[13]  Divyakant Agrawal,et al.  Detectives: detecting coalition hit inflation attacks in advertising networks streams , 2007, WWW '07.

[14]  Tao Yang,et al.  S+: Efficient 2D Sparse LU Factorization on Parallel Machines , 2000, SIAM J. Matrix Anal. Appl..

[15]  Jeffrey F. Naughton,et al.  Cache Conscious Algorithms for Relational Query Processing , 1994, VLDB.

[16]  Tao Yang,et al.  Optimizing parallel algorithms for all pairs similarity search , 2013, WSDM.

[17]  Xin Liu,et al.  Clustering and load balancing optimization for redundant content removal , 2012, WWW.

[18]  James Demmel,et al.  Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[19]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[20]  Iain S. Duff,et al.  An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS technical forum , 2002, TOMS.

[21]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[22]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near duplicate detection , 2008, WWW.

[23]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[24]  Mehran Sahami,et al.  A web-based kernel function for measuring the similarity of short text snippets , 2006, WWW '06.

[25]  Diego Fernández,et al.  Comparison of collaborative filtering algorithms , 2011, ACM Trans. Web.