Inclusion of Physical Similarity Join Operators in a Commercial DBMS

Complex data, such as video, images and audio, require particular forms of querying, storing and indexing that commercial DBMSs still do not provide. One way for a DBMS to offer support for complex data is to extend the relational operators to represent similarity queries. Similarity queries retrieve data based on similarity relations among the stored data, which are derived from the intrinsic data content. One important type of similarity query is the similarity join, which returns the pairs of elements from two input datasets that satisfy a stated join condition: for instance, that the two elements are closer to each other than a given threshold (range join), or that one element is among the k nearest neighbors of the other (k-NN join). Existing algorithms for executing similarity joins essentially assume that the input datasets/relations are read directly from disk. However, the join is one of the most time-consuming operations in a query, so delaying it in the execution plan and joining filtered data in memory usually results in a performance gain. In this paper, we present range join algorithms developed to run in memory over filtered data on top of a commercial DBMS. Our proposal is to allow similarity joins to be executed at different positions in the query plan and to evaluate how distinct algorithms behave in varied situations. The results show that the best of the developed options are the state-of-the-art DBSimJoin algorithm for highly selective filtered inputs, and an index-based similarity join for queries in which the join selectivity is high and a proper data index is available.
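
To make the two join conditions above concrete, the following is a minimal sketch of a range join and a k-NN join written as naive nested loops over in-memory inputs. It is not one of the algorithms evaluated in the paper (neither DBSimJoin nor the index-based join). The function names `range_join` and `knn_join`, the use of Euclidean distance, and the inclusive comparison `<=` are all assumptions made for illustration only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def range_join(r, s, threshold):
    """Naive range join: all pairs (a, b), a in r and b in s,
    whose distance is at most `threshold` (inclusive bound is an assumption)."""
    return [(a, b) for a in r for b in s if dist(a, b) <= threshold]


def knn_join(r, s, k):
    """Naive k-NN join: pair each a in r with its k nearest elements of s."""
    pairs = []
    for a in r:
        # Sort s by distance to a and keep the k closest elements.
        nearest = sorted(s, key=lambda b: dist(a, b))[:k]
        pairs.extend((a, b) for b in nearest)
    return pairs


# Hypothetical 2-D feature vectors standing in for filtered tuples.
r = [(0.0, 0.0), (5.0, 5.0)]
s = [(0.0, 1.0), (4.0, 5.0), (9.0, 9.0)]

print(range_join(r, s, 1.5))
print(knn_join(r, s, 1))
```

Both functions compare every element of one input with every element of the other, so their cost grows with the product of the input sizes. This is why filtering the inputs before the join, or using an index when one is available, can matter so much for performance.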
