Diversity in Similarity Joins

With the increasing ability of current applications to produce and consume complex data, such as images and geographic information, the similarity join has attracted considerable attention. However, this operator does not consider the relationships among the elements in the answer, so it generates results containing many pairs that are similar to one another and add little value to the final answer. Result diversification methods retrieve elements that are similar enough to satisfy the similarity conditions while also taking into account the diversity among the elements in the answer, producing a more heterogeneous result with smaller cardinality and thus a more meaningful answer. Still, diversity has been studied only when applied to unary operations. In this paper, we introduce the concept of diverse similarity joins: a similarity join operator that ensures smaller, more diversified, and more useful answers. Experiments performed on real and synthetic datasets show that our proposal allows exploiting diversity in similarity joins without diminishing their performance, while providing elements that cover the same data space distribution as the non-diverse answers.
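To make the idea concrete, the following is a minimal sketch of how diversity might be layered on top of a similarity range join. It is not the operator proposed in the paper: the nested-loop join, the pair-to-pair distance `pair_distance`, the greedy filtering strategy, and the parameters `eps` and `min_sep` are all assumptions chosen purely for illustration, and points are assumed to be numeric tuples compared with Euclidean distance.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def range_join(R, S, eps):
    """Naive similarity range join: every pair (r, s) with d(r, s) <= eps."""
    return [(r, s) for r, s in product(R, S) if dist(r, s) <= eps]


def pair_distance(p, q):
    """Hypothetical distance between two result pairs (an assumption):
    the larger of the distances between their corresponding components."""
    return max(dist(p[0], q[0]), dist(p[1], q[1]))


def diverse_join(R, S, eps, min_sep):
    """Greedy illustration of a diversified join (not the paper's algorithm).

    Pairs are visited from most to least similar, and a pair is kept only
    if it lies at least min_sep away from every pair already kept.
    """
    kept = []
    candidates = sorted(range_join(R, S, eps), key=lambda p: dist(p[0], p[1]))
    for pair in candidates:
        if all(pair_distance(pair, k) >= min_sep for k in kept):
            kept.append(pair)
    return kept


if __name__ == "__main__":
    R = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    S = [(0.0, 0.2), (5.0, 5.1)]
    print(range_join(R, S, eps=0.5))
    print(diverse_join(R, S, eps=0.5, min_sep=1.0))
```

In this toy input, two of the joined pairs come from nearly the same region of the data space, so the greedy filter keeps only one of them together with the pair from the distant region. The result is smaller yet still covers both regions, which is the behavior the abstract describes.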
