Similarity Group-By

Group-by is a core database operation that is used extensively in OLTP, OLAP, and decision support systems. In many application scenarios, values that are similar but not necessarily equal must be grouped together. In this paper we propose a new SQL construct that supports similarity-based group-by (SGB). SGB is not a new clustering algorithm; rather, it is a practical and fast similarity grouping query operator that is compatible with other SQL operators and can be combined with them to answer similarity-based queries efficiently. In contrast to expensive clustering algorithms, the proposed operator maintains low execution times while still generating meaningful groupings that address many application needs. The paper presents a general definition of the similarity group-by operation and gives three instances of this definition. It also discusses how optimization techniques for the regular group-by can be extended to SGB. The proposed operators are implemented inside PostgreSQL. The performance study shows that they scale well, with at most a 25% increase in execution time over the regular group-by.
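To make the idea of grouping similar but unequal values concrete, the following Python sketch groups one-dimensional values whenever consecutive sorted values lie within a fixed gap of each other, and then computes per-group aggregates much as a regular group-by would. It is only an illustration under stated assumptions: the abstract does not spell out the three SGB instances or the SQL syntax, so the gap-based rule, the function names `group_by_max_gap` and `aggregate`, and the parameter `max_gap` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def group_by_max_gap(values: List[float], max_gap: float) -> List[List[float]]:
    """Hypothetical 1-D similarity grouping (an assumption, not the paper's definition).

    After sorting, a value joins the current group if it is within
    max_gap of the previous value; otherwise it starts a new group.
    """
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)   # close enough to the previous value
        else:
            groups.append([v])     # gap too large: open a new group
    return groups


def aggregate(groups: List[List[float]]) -> List[Tuple[float, float, int]]:
    """Per-group (min, max, count), analogous to aggregates over a group-by."""
    return [(min(g), max(g), len(g)) for g in groups]


if __name__ == "__main__":
    sample = [12, 1, 10, 30, 2, 11]
    for row in aggregate(group_by_max_gap(sample, max_gap=1.5)):
        print(row)
```

A single sort followed by one linear pass keeps the cost close to that of a sort-based regular group-by, which matches the spirit of the low-overhead claim above, although the paper's actual operators and measurements may differ.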
