Paper information - Cache-conscious performance optimization for similarity search

Cache-conscious performance optimization for similarity search

All-pairs similarity search can be implemented in two stages. The first stage partitions the data and groups potentially similar vectors. The second stage runs a set of tasks, where each task compares a partition of vectors with other candidate partitions. Because the data is sparse, accessing feature vectors in memory for runtime comparison in the second stage incurs significant overhead due to the memory hierarchy. This paper proposes a cache-conscious data layout and traversal optimization that reduces execution time through size-controlled data splitting and vector coalescing. It also provides an analysis to guide the choice of parameter settings. Our evaluation with several application datasets verifies the performance gains obtained by the optimization and shows that the proposed scheme is up to 2.74x as fast as the cache-oblivious baseline.
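The size-controlled splitting idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`dot`, `compare_task`), the dict-based sparse vector format, the dot-product similarity, and the `split_size` and `threshold` parameters are all assumptions made for illustration, and vector coalescing is not shown. The sketch only demonstrates the traversal order, in which a partition is cut into small splits and each candidate vector is compared against one whole split before the next candidate is fetched, so that the split is the part of the data that gets reused.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {feature_id: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[f] for f, w in u.items() if f in v)


def compare_task(partition, candidates, threshold, split_size):
    """Compare every vector in `partition` with every vector in `candidates`.

    The partition is processed in splits of at most `split_size` vectors
    (a hypothetical tuning parameter). For each split, all candidates are
    streamed past it, so the split's vectors are the ones touched repeatedly.
    Returns a sorted list of (partition_index, candidate_index, score).
    """
    results = []
    for start in range(0, len(partition), split_size):
        block = partition[start:start + split_size]  # one size-controlled split
        for j, other in enumerate(candidates):
            for offset, vec in enumerate(block):
                score = dot(vec, other)
                if score >= threshold:
                    results.append((start + offset, j, score))
    return sorted(results)


# Toy data, invented for this example.
partition = [{1: 0.6, 2: 0.8}, {2: 1.0}, {3: 1.0}]
candidates = [{1: 0.6, 2: 0.8}, {3: 1.0}]

pairs = [(i, j) for i, j, _ in compare_task(partition, candidates, 0.5, split_size=2)]
print(pairs)
```

The split size changes only the order of comparisons, not the set of pairs reported. Pure Python will not show the cache effect itself; the sketch is meant to make the loop structure concrete, and the choice of split size relative to cache capacity is the subject of the paper's analysis.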

Tao Yang | Xun Tang | Maha Alabduljalil
