Multiple Similarity Queries: A Basic DBMS Operation for Mining in Metric Databases

Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest-neighbor queries are the most important query types. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. We introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach.

[1]  James Lee Hafner,et al.  Efficient Color Histogram Indexing for Quadratic Form Distance Functions , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[3]  Christian Böhm,et al.  Fast parallel similarity search in multimedia databases , 1997, SIGMOD '97.

[4]  Ulrich Bastian,et al.  The TYCHO Catalogue , 1997 .

[5]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[6]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[7]  Hans-Peter Kriegel,et al.  Algorithms for Characterization and Trend Detection in Spatial Databases , 1998, KDD.

[8]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[9]  Jaideep Srivastava,et al.  Web Mining: Pattern Discovery from World Wide Web Transactions , 1996 .

[10]  Hans-Peter Kriegel,et al.  Efficient User-Adaptable Similarity Search in Large Multimedia Databases , 1997, VLDB.

[11]  Hans-Peter Kriegel,et al.  Efficiently supporting multiple similarity queries for mining in metric databases , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[12]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[13]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.

[14]  Jiawei Han,et al.  Discovery of Spatial Association Rules in Geographic Information Databases , 1995, SSD.

[15]  C. V. Ramamoorthy,et al.  Knowledge and Data Engineering , 1989, IEEE Trans. Knowl. Data Eng..

[16]  A. Guttman,et al.  A Dynamic Index Structure for Spatial Searching , 1984, SIGMOD 1984.

[17]  Padhraic Smyth,et al.  From Data Mining to Knowledge Discovery: An Overview , 1996, Advances in Knowledge Discovery and Data Mining.

[18]  Shivakumar Venkataraman,et al.  Cost-based optimization of decision support queries using transient-views , 1998, SIGMOD '98.

[19]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[20]  Sharad Mehrotra,et al.  Efficient Query Refinement in Multimedia Databases , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[21]  William H. Press,et al.  The Art of Scientific Computing Second Edition , 1998 .

[22]  K. Wakimoto,et al.  Efficient and Effective Querying by Image Content , 1994 .

[23]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[24]  Bhavani M. Thuraisingham,et al.  Database Issues for Data Visualization: Supporting Interactive Database Explorartion , 1995, Workshop on Database Issues for Data Visualization.

[25]  Prasan Roy,et al.  Efficient and extensible algorithms for multi query optimization , 1999, SIGMOD '00.

[26]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[27]  Hanan Samet,et al.  Ranking in Spatial Databases , 1995, SSD.

[28]  Raymond T. Ng,et al.  Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining , 1996, IEEE Trans. Knowl. Data Eng..

[29]  A. Guttmma,et al.  R-trees: a dynamic index structure for spatial searching , 1984 .

[30]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[31]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.