TT-Join: Efficient Set Containment Join

In this paper, we study the problem of set containment join. Given two collections $R$ and $S$ of records, the set containment join $R \bowtie_{\subseteq} S$ retrieves all record pairs $\{(r, s)\} \in R \times S$ such that $r \subseteq s$. This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following the intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that both paradigms have limitations that stem from the nature of the intersection and union operators. In particular, the intersection-oriented method relies on intersecting the relevant inverted lists built on the elements of $S$. A nice property of the intersection-oriented method is that the join computation is verification-free. However, the number of records explored during the join process may be large because each record in $S$ has multiple replicas in the index. On the other hand, the union-oriented method generates a signature for each record in $R$, and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica to the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive, especially when the join result is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive with the intersection-oriented ones. In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the strengths of the intersection-oriented methods by imposing a variant of the prefix tree structure. We conduct extensive experiments on 20 real-life datasets, comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets and can achieve up to two orders of magnitude speedup.
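
As a rough illustration of the two paradigms contrasted above (and not of TT-Join itself), the following Python sketch implements a naive join from the definition, a simple intersection-oriented join, and a simple union-oriented join. The function names, the toy data, and the choice of a single-element signature (the least frequent element of each record in R) are assumptions made for illustration only; the methods evaluated in the paper use more elaborate index structures.

```python
from collections import Counter, defaultdict


def naive_join(R, S):
    """Definition of the join: all (i, j) such that R[i] is a subset of S[j]."""
    return sorted((i, j) for i, r in enumerate(R)
                  for j, s in enumerate(S) if r <= s)


def intersection_join(R, S):
    """Intersection-oriented sketch: inverted lists are built on the elements of S."""
    inverted = defaultdict(set)
    for j, s in enumerate(S):
        for e in s:
            inverted[e].add(j)  # a record of S is replicated once per element
    result = []
    for i, r in enumerate(R):
        if not r:
            matches = set(range(len(S)))  # the empty set is contained in every record
        else:
            lists = sorted((inverted.get(e, set()) for e in r), key=len)
            matches = set.intersection(*lists)  # no verification step is needed
        result.extend((i, j) for j in matches)
    return sorted(result)


def union_join(R, S):
    """Union-oriented sketch with a hypothetical one-element signature per record of R."""
    freq = Counter(e for rec in list(R) + list(S) for e in rec)
    by_signature = defaultdict(list)
    empty_ids = []
    for i, r in enumerate(R):
        if r:
            sig = min(r, key=lambda e: freq[e])  # assumed signature: rarest element
            by_signature[sig].append(i)          # one replica per record of R
        else:
            empty_ids.append(i)
    result = []
    for j, s in enumerate(S):
        # Candidates: union of the lists of every signature that occurs in s.
        candidates = list(empty_ids)
        for e in s:
            candidates.extend(by_signature.get(e, []))
        # Candidates only share one element with s, so each must be verified.
        result.extend((i, j) for i in candidates if R[i] <= s)
    return sorted(result)


R = [{1, 2}, {3}, {2, 4}]
S = [{1, 2, 3}, {2, 4, 5}, {3}]
print(naive_join(R, S))  # [(0, 0), (1, 0), (1, 2), (2, 1)]
```

On this toy input all three functions return the same pairs. The sketch makes the trade-off visible: the intersection-oriented index stores every record of S once per element but needs no subset check, whereas the union-oriented index stores every record of R once but must verify each candidate.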

[1]  Nikolaus Augsten,et al.  An Empirical Evaluation of Set Similarity Join Techniques , 2016, Proc. VLDB Endow..

[2]  Xuemin Lin,et al.  Top-k Set Similarity Joins , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[3]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[4]  Timos K. Sellis,et al.  Efficient answering of set containment queries for skewed item distributions , 2011, EDBT/ICDT '11.

[5]  Hector Garcia-Molina,et al.  Adaptive algorithms for set containment joins , 2003, TODS.

[6]  Sven Helmer,et al.  PIEJoin: Towards Parallel Set Containment Joins , 2016, SSDBM.

[7]  Sven Helmer,et al.  Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates , 1996, VLDB.

[8]  Renée J. Miller,et al.  LSH Ensemble: Internet-Scale Domain Search , 2016, Proc. VLDB Endow..

[9]  Hector Garcia-Molina,et al.  Divide-and-Conquer Algorithm for Computing Set Containment Joins , 2002, EDBT.

[10]  Guoliang Li,et al.  An Efficient Partition Based Method for Exact Set Similarity Joins , 2015, Proc. VLDB Endow..

[11]  Nikos Mamoulis,et al.  Set containment join revisited , 2015, Knowledge and Information Systems.

[12]  Jeffrey F. Naughton,et al.  Set Containment Joins: The Good, The Bad and The Ugly , 2000, VLDB.

[13]  Guoliang Li,et al.  Can we beat the prefix filtering?: an adaptive framework for similarity join and search , 2012, SIGMOD Conference.

[14]  Nikos Mamoulis,et al.  Efficient processing of joins on set-valued attributes , 2003, SIGMOD '03.

[15]  Vikram Pudi,et al.  Using Prefix-Trees for Efficiently Computing Set Joins , 2005, DASFAA.

[16]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[17]  Jiaheng Lu,et al.  Efficient Merging and Filtering Algorithms for Approximate String Searches , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[18]  Jan Hidders,et al.  Efficient and scalable trie-based algorithms for computing set containment relations , 2015, 2015 IEEE 31st International Conference on Data Engineering.

[19]  Cédric du Mouza,et al.  Subscription indexes for web syndication systems , 2012, EDBT '12.

[20]  Hector Garcia-Molina,et al.  Index structures for selective dissemination of information under the Boolean model , 1994, TODS.

[21]  Timos K. Sellis,et al.  A combination of trie-trees and inverted files for the indexing of set-valued attributes , 2006, CIKM '06.

[22]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[23]  Surajit Chaudhuri,et al.  A Primitive Operator for Similarity Joins in Data Cleaning , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[24]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[25]  Parag Agrawal,et al.  On indexing error-tolerant set containment , 2010, SIGMOD Conference.