Efficient and scalable trie-based algorithms for computing set containment relations

Computing containment relations between massive collections of sets is a fundamental operation in data management, for example in graph analytics and data mining applications. Motivated by recent hardware trends, in this paper we present two novel solutions for computing set-containment joins over massive sets: the Patricia Trie-based Signature Join (PTSJ) and PRETTI+, a Patricia-trie-enhanced extension of the state-of-the-art PRETTI join. The compact trie structure not only enables efficient use of main memory, but also significantly boosts the performance of both approaches. By carefully analyzing the algorithms and conducting extensive experiments with various synthetic and real-world datasets, we show that, in many practical cases, our algorithms are an order of magnitude faster than the state of the art.
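To make the operation concrete, the following is a minimal sketch of a set-containment join with a bitmap-signature filter. It is not PTSJ or PRETTI+: it is a plain nested-loop baseline, and the names `signature`, `containment_join`, and `SIG_BITS`, as well as the modulo hash and the assumption that set elements are non-negative integers, are illustrative choices that do not come from the paper.

```python
from typing import Dict, FrozenSet, List, Tuple

SIG_BITS = 64  # hypothetical signature width, chosen for illustration


def signature(s: FrozenSet[int], bits: int = SIG_BITS) -> int:
    """Hash each element to one bit position and OR the bits together."""
    sig = 0
    for x in s:
        sig |= 1 << (x % bits)
    return sig


def containment_join(
    R: Dict[int, FrozenSet[int]], S: Dict[int, FrozenSet[int]]
) -> List[Tuple[int, int]]:
    """Return all pairs (r_id, s_id) such that R[r_id] contains S[s_id]."""
    r_sigs = {rid: signature(r) for rid, r in R.items()}
    result = []
    for sid, s in S.items():
        s_sig = signature(s)
        for rid, r in R.items():
            # Filter step: if s is a subset of r, every bit set in sig(s)
            # must also be set in sig(r). The converse does not hold, so
            # surviving candidates are verified with an exact subset test.
            if s_sig & ~r_sigs[rid] == 0 and s <= r:
                result.append((rid, sid))
    return sorted(result)


R = {1: frozenset({1, 2, 3}), 2: frozenset({2, 65})}
S = {10: frozenset({1, 2}), 11: frozenset({2}), 12: frozenset({1, 65})}
pairs = containment_join(R, S)
```

In this example the pair (2, 10) passes the signature filter because 65 and 1 hash to the same bit, and is then rejected by the exact subset test. The sketch still compares every signature in `R` against every signature in `S`, which is the kind of quadratic scan that an index over the signatures would aim to avoid.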
