Similarity-aware Query Processing and Optimization

Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, require or can significantly benefit from the identification and processing of similarities in the data. Although some work has extended the semantics of individual operators, e.g., join and selection, to be aware of data similarities, there has been little study of the role, interaction, and implementation of similarity-aware operations as first-class database operators. This thesis proposes and studies several similarity-aware database operators and systematically analyzes their role as query operators, their interactions, their optimizations, and their implementation techniques. It presents a detailed study of two core similarity-aware operators: Similarity Group-by and Similarity Join. We describe multiple optimization techniques for these operators. Specifically, we present: (1) multiple non-trivial equivalence rules that enable similarity query transformations, (2) Eager and Lazy aggregation transformations for Similarity Group-by and Similarity Join that allow pre-aggregation before potentially expensive joins, and (3) techniques that use materialized views to answer similarity-based queries. We also present the main guidelines for implementing these operators as integral components of a database system's query engine, together with several key performance evaluation results of this implementation in an open-source database system. Finally, we introduce a comprehensive conceptual evaluation model for similarity queries with multiple similarity-aware predicates, i.e., Similarity Selection, Similarity Join, and Similarity Group-by. This model defines the expected correct result of a query that combines several such predicates, and we present multiple transformation rules that turn the initial evaluation plan into more efficient equivalent plans.
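To make the two core operators concrete, the following is a minimal sketch over one-dimensional numeric data. It is an illustration under stated assumptions, not the thesis implementation: the function names, the use of absolute difference as the distance, the distance-threshold form of the join, and the "maximum separation between neighboring elements" grouping criterion are all choices made here for the example. The operators studied in the thesis may use other distance functions and other join and grouping semantics.

```python
from typing import List, Sequence, Tuple


def similarity_join(r: Sequence[int], s: Sequence[int], eps: int) -> List[Tuple[int, int]]:
    """Distance-threshold join (illustrative): return every pair (a, b),
    a from r and b from s, whose absolute difference is at most eps.
    A plain nested loop is used for clarity, not efficiency."""
    return [(a, b) for a in r for b in s if abs(a - b) <= eps]


def similarity_group_by(values: Sequence[int], max_separation: int) -> List[List[int]]:
    """Similarity grouping (illustrative): sort the values and start a new
    group whenever the gap to the previous value exceeds max_separation."""
    groups: List[List[int]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_separation:
            groups[-1].append(v)   # close enough to the last element: same group
        else:
            groups.append([v])     # gap too large (or first value): new group
    return groups


if __name__ == "__main__":
    # Hypothetical sample data, chosen only to exercise the two functions.
    print(similarity_join([1, 5, 20], [2, 9, 21], eps=1))
    groups = similarity_group_by([30, 1, 11, 2, 10], max_separation=2)
    # An aggregate per similarity group, analogous to COUNT(*) with GROUP BY.
    print([(g[0], g[-1], len(g)) for g in groups])
```

In this sketch the grouping step plays the role that GROUP BY plays for exact equality: each resulting group can be fed to an ordinary aggregate function, which is what makes pre-aggregation before a join worth considering.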
