Internal and external memory set containment join

A set containment join operates on two set-valued attributes, with a subset ($$\subseteq$$) relationship as the join condition. It has many real-world applications, such as publish/subscribe services and inclusion dependency discovery. Existing solutions can be broadly classified into union-oriented and intersection-oriented methods. According to several recent studies, union-oriented methods are not competitive because they involve an expensive subset enumeration step. Intersection-oriented methods build an inverted index on one attribute and perform inverted list intersection for the other attribute. Existing intersection-oriented methods intersect inverted lists one by one. In contrast, in this paper we propose to intersect all the inverted lists simultaneously while skipping many irrelevant entries in the lists. To share computation, we utilize the prefix tree structure and extend our novel list intersection method to operate on the prefix tree. To further improve efficiency, we propose to partition the data and process each partition separately. Each partition is associated with a much smaller inverted index, so the set containment join cost can be significantly reduced. Moreover, to support large-scale datasets that exceed the available memory, we develop a novel adaptive data partition method designed to fully leverage the available memory and achieve high I/O efficiency, thereby exhibiting outstanding performance for external memory set containment join. We evaluate our methods using both real-world and synthetic datasets. Experimental results demonstrate that our method outperforms state-of-the-art methods by up to 10$$\times$$ when the dataset resides entirely in memory. Furthermore, our approach achieves up to two orders of magnitude improvement in I/O efficiency compared with a baseline method when the dataset size exceeds the main memory space.
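To make the intersection-oriented idea concrete, the following is a minimal Python sketch of a set containment join that builds an inverted index on one collection and intersects all the relevant inverted lists in one pass, using binary search to skip entries. It is a generic illustration only, not the algorithm proposed in the paper: the function names (`build_inverted_index`, `intersect_all`, `containment_join`) are hypothetical, set identifiers are assumed to be list positions, and the prefix tree and partitioning techniques are omitted.

```python
from bisect import bisect_left
from collections import defaultdict


def build_inverted_index(sets):
    """Map each element to the sorted list of ids of the sets containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):
        for element in s:
            index[element].append(sid)  # ids arrive in increasing order
    return index


def intersect_all(lists):
    """Intersect several sorted integer lists at once.

    A shared candidate id is probed in every list; whenever a list lacks it,
    the candidate jumps to that list's next larger id, so the entries in
    between are skipped rather than scanned.
    """
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    pos = [0] * len(lists)
    result = []
    candidate = max(lst[0] for lst in lists)
    while True:
        matched = True
        for i, lst in enumerate(lists):
            p = bisect_left(lst, candidate, pos[i])
            if p == len(lst):
                return result  # one list is exhausted: no more matches
            pos[i] = p
            if lst[p] != candidate:
                candidate = lst[p]  # skip ahead and restart the round
                matched = False
                break
        if matched:
            result.append(candidate)
            candidate += 1  # look for the next common id


def containment_join(R, S):
    """Return all pairs (i, j) such that R[i] is a subset of S[j]."""
    index = build_inverted_index(S)
    pairs = []
    for i, r in enumerate(R):
        if not r:
            # The empty set is contained in every set.
            pairs.extend((i, j) for j in range(len(S)))
            continue
        lists = [index.get(element, []) for element in r]
        pairs.extend((i, j) for j in intersect_all(lists))
    return pairs


R = [{1, 2}, {3}, {2, 4}]
S = [{1, 2, 3}, {2, 4}, {1, 2, 4}]
result = containment_join(R, S)
print(result)  # [(0, 0), (0, 2), (1, 0), (2, 1), (2, 2)]
```

In this sketch, a set in `S` contains `R[i]` exactly when its id appears in the inverted list of every element of `R[i]`, which is why the join reduces to list intersection. Probing a shared candidate across all lists, instead of intersecting them pairwise, avoids materializing intermediate results.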
