Semantic Similarity Group By Operators for Metric Data

Grouping operators in a DBMS summarize data by arranging elements into groups based on identity comparisons. For metric data, however, grouping by identity is seldom useful, and similarity is often a better fit. Operators that group data elements by similarity already exist, but they do not achieve good results for certain data domains or distributions. The major contributions of this work are a novel operator called SGB-Vote, which assigns groups using an election involving already assigned groups, and an extension for current operators that bounds each group to a maximum number of nearest neighbors. The operators were implemented in a framework and evaluated using real and synthetic datasets from diverse domains, considering both the quality of the groups and execution time. The results show that the proposed operators produce higher quality groups in all tested datasets and highlight that the operators can run efficiently inside a DBMS.
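
The abstract gives no algorithmic detail, so the following is only one possible reading of the "election" idea, not the SGB-Vote operator itself. It is a minimal Python sketch under stated assumptions: elements are processed in input order, the distance is Euclidean, a hypothetical threshold `eps` decides which already grouped elements may vote, and ties go to the group of the nearest voter. The function name `vote_group` and all parameters are illustrative.

```python
from collections import Counter
from math import dist


def vote_group(points, eps):
    """Assign a group id to each point by majority vote.

    Illustrative assumption, not the paper's algorithm: every already
    grouped point within distance eps casts one vote for its own group.
    """
    labels = []      # labels[i] is the group id of points[i]
    next_group = 0   # id to use when a new group must be opened
    for i, p in enumerate(points):
        # Voters: already grouped points within eps, nearest first.
        voters = sorted(
            (dist(p, points[j]), j)
            for j in range(i)
            if dist(p, points[j]) <= eps
        )
        if not voters:
            # No similar element has been grouped yet: open a new group.
            labels.append(next_group)
            next_group += 1
            continue
        votes = Counter(labels[j] for _, j in voters)
        best = max(votes.values())
        # Tie-break: among the top-voted groups, take the nearest voter's.
        winner = next(labels[j] for _, j in voters if votes[labels[j]] == best)
        labels.append(winner)
    return labels


points = [(0, 0), (0.5, 0), (10, 10), (10.4, 10), (0.2, 0.1)]
print(vote_group(points, eps=1.0))
```

In this toy input the last point lies near the first two, so both vote it into their group, while the two points around (10, 10) form a separate group. Inside a DBMS, an operator of this kind would more plausibly rely on a metric index than on the quadratic scan used here.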
