Set containment join revisited

Given two collections of set objects $$R$$ and $$S$$, the set containment join $$R \bowtie_{\subseteq} S$$ returns all object pairs $$(r,s) \in R \times S$$ such that $$r \subseteq s$$. Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins ($$\mathtt{PRETTI}$$) builds an inverted index on the right-hand collection $$S$$ and a prefix tree on the left-hand collection $$R$$; the tree groups set objects with common prefixes and thus avoids redundant processing. In this paper, we present a framework that improves $$\mathtt{PRETTI}$$ in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms $$\mathtt{PRETTI}$$ by a wide margin.
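To make the baseline idea concrete, the following is a minimal sketch of a join that combines an inverted index on S with a prefix tree on R, in the spirit of the description above. It is an illustration under stated assumptions, not the paper's implementation: the function name `containment_join`, the dictionary-based tree nodes, and the use of Python sets for postings are all hypothetical choices, and the sketch omits the cost-model-based limiting of the tree and the partition-based progressive processing.

```python
from collections import defaultdict


def containment_join(R, S):
    """Return sorted (i, j) pairs such that R[i] is a subset of S[j].

    Illustrative sketch only: names and data layout are assumptions.
    """
    # Inverted index on S: item -> ids of the objects in S that contain it.
    inverted = defaultdict(set)
    for j, s in enumerate(S):
        for item in s:
            inverted[item].add(j)

    # Prefix tree on R: each object is inserted as its sorted item sequence,
    # so objects sharing a prefix share the corresponding path.
    root = {"children": {}, "ids": []}
    for i, r in enumerate(R):
        node = root
        for item in sorted(r):
            node = node["children"].setdefault(
                item, {"children": {}, "ids": []}
            )
        node["ids"].append(i)

    results = []
    # Depth-first traversal. `candidates` holds the ids of the objects in S
    # that contain every item on the path from the root to the current node.
    stack = [(root, set(range(len(S))))]
    while stack:
        node, candidates = stack.pop()
        # Objects of R ending at this node join with all current candidates.
        for i in node["ids"]:
            results.extend((i, j) for j in candidates)
        for item, child in node["children"].items():
            # One postings intersection is shared by the whole subtree.
            narrowed = candidates & inverted.get(item, set())
            if narrowed:
                stack.append((child, narrowed))
    return sorted(results)
```

For example, with `R = [{1, 2}, {1, 2, 3}, {2}]` and `S = [{1, 2, 3}, {2, 4}, {1, 2}]`, the objects `{1, 2}` and `{1, 2, 3}` share the path `1 → 2` in the tree, so the intersection of the postings for items 1 and 2 is computed once for both.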
