Adaptive Top-k Overlap Set Similarity Joins

The set similarity join (SSJ) is core functionality in a range of applications, including data cleaning, near-duplicate object detection, and data integration. Threshold-based SSJ queries return all pairs of sets whose similarity is no smaller than a given threshold. Because the results, and their utility, are very sensitive to the threshold value, choosing an appropriate value is difficult: doing so requires prior knowledge of the data, which users often do not have. To avoid this problem, we propose a solution to the top-k overlap set similarity join (TkOSSJ), which returns the k pairs of sets with the highest overlap similarities. The state-of-the-art solution disregards the effect of the so-called step size, which is the number of elements accessed in each iteration of the algorithm, and this affects its performance negatively. To address this issue, we first propose an algorithm that uses a fixed step size, thus taking advantage of the benefits of a large step size. We then present an adaptive step size algorithm that automatically adjusts the step size, thus reducing redundant computations. An extensive empirical study offers insight into the new algorithms and indicates that they are capable of outperforming the state-of-the-art method on real, large-scale data sets.
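To make the query semantics concrete, the following is a minimal brute-force sketch of a top-k overlap join, assuming overlap similarity means the size of the intersection of two sets. It is not the fixed or adaptive step size algorithm proposed here, and the function name `topk_overlap_join` and the sample collection are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import combinations


def topk_overlap_join(sets, k):
    """Return the k pairs with the largest overlap as (overlap, i, j) tuples.

    Naive baseline: scores every pair, so it is quadratic in the number
    of sets. It only illustrates what a TkOSSJ query returns.
    """
    scored = (
        (len(a & b), i, j)  # overlap = size of the intersection (assumption)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
    )
    # nlargest keeps only k candidates in memory and returns them in
    # descending order; ties are broken by the pair indices.
    return heapq.nlargest(k, scored)


# Hypothetical toy collection of four sets.
collection = [
    {"a", "b", "c", "d"},
    {"b", "c", "d", "e"},
    {"a", "b"},
    {"x", "y"},
]

result = topk_overlap_join(collection, 2)
print(result)  # [(3, 0, 1), (2, 0, 2)]
```

Unlike a threshold-based query, the caller supplies only k; no knowledge of the similarity distribution in the data is needed to obtain a non-empty, bounded result.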
