Fast T-overlap query algorithms using graphics processor units and its applications in web data query

Given a collection of sets and a query set, a T-Overlap query identifies all sets that share at least T elements with the query. The T-Overlap query is the foundation of set similarity queries and joins, and it plays an important role in web data query and processing, for example in behavior analysis of web users and near-duplicate detection of web documents. To answer T-Overlap queries efficiently, we design GPU-based algorithms rather than following the traditional CPU-based approach. We first design an inverted index on the GPU, then choose ScanCount, a straightforward but efficient T-Overlap algorithm, as the underlying algorithm for our GPU-based T-Overlap algorithms. Depending on whether queries are processed serially or in parallel, we propose three new algorithms built on this GPU-based inverted index. Among the three, GS-Parallel-Group processes a group of queries in parallel and supports a high degree of parallelism. Extensive experiments compare our GPU-based algorithms with state-of-the-art CPU-based algorithms. The results show that GS-Parallel-Group significantly outperforms the CPU-based algorithms.
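
To make the query definition concrete, the following is a minimal CPU-only sketch of the ScanCount idea over an inverted index: for each query element, scan its posting list and increment a counter per set, then report the sets whose counter reaches T. It is an illustration under stated assumptions, not the GPU algorithms proposed here; the function names `build_inverted_index` and `scan_count` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Map each element to the list of ids of the sets that contain it."""
    index = defaultdict(list)
    for set_id, elements in enumerate(sets):
        for element in set(elements):  # ignore duplicates within a set
            index[element].append(set_id)
    return index


def scan_count(index, num_sets, query, t):
    """Return ids of all sets sharing at least t elements with the query.

    Assumed ScanCount outline: one counter per set, incremented once for
    every posting-list entry reached from a query element.
    """
    counts = [0] * num_sets
    for element in set(query):
        for set_id in index.get(element, ()):
            counts[set_id] += 1
    return [set_id for set_id, count in enumerate(counts) if count >= t]


# Hypothetical sample collection and query.
sets = [{1, 2, 3, 4}, {2, 3, 5}, {6, 7}, {1, 2, 3, 9}]
index = build_inverted_index(sets)
result = scan_count(index, len(sets), {1, 2, 3}, 2)
print(result)
```

The counter array is what makes the approach attractive for parallel hardware: each posting-list entry contributes an independent increment, although a parallel version would need atomic updates or a partitioning scheme, which this sketch does not attempt.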
