An Empirical Evaluation of Set Similarity Join Techniques

Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous work. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm to adopt this technique, is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.
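To make the filter-verification idea concrete, the following is a minimal sketch of a prefix-filter self-join under a Jaccard threshold. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`prefix_len`, `verify`, `self_join`), the choice of Jaccard similarity, the restriction to a self-join over one collection, and the use of plain integer order as the global token order are all assumptions made here for brevity.

```python
import math

EPS = 1e-9  # guards ceil() against floating-point round-up


def prefix_len(size, t):
    # If Jaccard(r, s) >= t, the overlap is at least ceil(t * |r|),
    # so r and s must share a token within the first
    # |r| - ceil(t * |r|) + 1 tokens of r (in the global token order).
    return size - math.ceil(t * size - EPS) + 1


def verify(r, s, required):
    # Merge-style overlap count on two sorted token lists. Stops as
    # soon as the remaining tokens cannot reach `required`.
    i = j = overlap = 0
    while i < len(r) and j < len(s):
        if overlap + min(len(r) - i, len(s) - j) < required:
            return False
        if r[i] == s[j]:
            overlap += 1
            i += 1
            j += 1
        elif r[i] < s[j]:
            i += 1
        else:
            j += 1
    return overlap >= required


def self_join(sets, t):
    # Assumption: tokens are integers and their numeric order serves
    # as the global token order (real systems order by rarity).
    order = sorted(range(len(sets)), key=lambda i: len(set(sets[i])))
    records = [sorted(set(sets[i])) for i in order]
    index = {}   # token -> positions of records whose prefix has it
    result = []
    for rid, r in enumerate(records):
        if not r:
            continue
        prefix = r[:prefix_len(len(r), t)]
        # Filter: probe the index with the prefix, apply length filter.
        candidates = set()
        for tok in prefix:
            for sid in index.get(tok, ()):
                if len(records[sid]) >= t * len(r) - EPS:
                    candidates.add(sid)
        # Verification: check the required overlap for each candidate.
        for sid in sorted(candidates):
            s = records[sid]
            required = math.ceil(t / (1 + t) * (len(r) + len(s)) - EPS)
            if verify(r, s, required):
                result.append(tuple(sorted((order[sid], order[rid]))))
        for tok in prefix:
            index.setdefault(tok, []).append(rid)
    return sorted(result)
```

Calling `self_join(collection, 0.6)` returns index pairs into `collection` whose Jaccard similarity is at least 0.6. Because records are processed in ascending size order, every indexed candidate is no larger than the probing set, which is why a single lower bound on candidate size is enough for the length filter in this sketch.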
