Fast Document Cosine Similarity Self-Join on GPUs

Similarity search has been studied in many fields of computer science, including data mining, information retrieval, and databases. Document similarity self-join is a crucial part of many applications, such as near-duplicate document detection, document clustering, and web search. Given a collection of documents, document similarity self-join finds all pairs of documents whose similarity is no lower than a threshold value. However, similarity search is computation-intensive, and its running time grows quickly as the dataset size increases. Many serial algorithms therefore speed up the process by pruning the candidate set for each query object on high-dimensional sparse datasets, including documents. The efficiency of those serial algorithms nevertheless degrades badly as the threshold decreases. Parallel implementations based on OpenMP or MapReduce also adopt the pruning policy and do not fully solve the problem. In this context, taking into account the features of document datasets, we propose 2Step-SSJ, which solves the document similarity self-join in the CUDA environment on GPUs. 2Step-SSJ performs the similarity self-join in two steps, i.e., similarity computing on the inverted list and similarity computing on the forward list, which balances memory access against dot-product computation. The experimental results on three benchmark text corpora show that 2Step-SSJ solves the problem much faster than existing methods, generally achieving a speedup of 2x-23x over the state-of-the-art parallel algorithm while keeping a relatively stable running time across different threshold values.
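
To make the problem definition concrete, the following is a minimal serial sketch of a cosine similarity self-join that accumulates dot products through an inverted list. It is not the 2Step-SSJ algorithm and involves no GPU code; the function names (`normalize`, `cosine_self_join`), the dictionary-based sparse vectors, and the toy documents are assumptions chosen purely for illustration.

```python
import math
from collections import defaultdict


def normalize(doc):
    """Scale a sparse term-weight vector (a dict) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in doc.values()))
    return {t: w / norm for t, w in doc.items()} if norm > 0 else {}


def cosine_self_join(docs, threshold):
    """Return all (i, j, sim) with i < j and cosine similarity >= threshold."""
    vectors = [normalize(d) for d in docs]
    inverted = defaultdict(list)  # term -> [(doc_id, weight)] of earlier docs
    results = []
    for j, vec in enumerate(vectors):
        # Accumulate partial dot products with every earlier document
        # that shares at least one term with document j.
        scores = defaultdict(float)
        for term, w in vec.items():
            for i, wi in inverted[term]:
                scores[i] += wi * w
        # Vectors are unit length, so the dot product is the cosine similarity.
        for i, s in sorted(scores.items()):
            if s >= threshold:
                results.append((i, j, s))
        # Index document j only after querying, so each pair is seen once.
        for term, w in vec.items():
            inverted[term].append((j, w))
    return results


# Hypothetical toy corpus: term -> raw weight.
docs = [
    {"gpu": 1.0, "join": 1.0},
    {"gpu": 1.0, "join": 1.0, "cuda": 1.0},
    {"text": 1.0},
]
pairs = cosine_self_join(docs, threshold=0.8)
```

This sketch applies no pruning, so its cost does not depend on the threshold; the pruning-based serial algorithms mentioned above add bounds that skip candidates, which is why their running time is sensitive to the threshold value.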
