Solving approximate similarity queries

Supporting similarity search in data repositories helps satisfy users' information needs, rather than only their data needs as conventional DBMSs do, and is desirable for many modern database applications. However, when the repository contains high-dimensional data, solutions to the similarity search problem become cost-inefficient because of the so-called dimensionality curse: it has been observed and shown that, in high-dimensional data spaces, the probability of overlap between a query and the data regions of a multidimensional access method (MAM) is very high. Hence, executing a similarity query may require accessing a vast number of data regions, and the performance of MAMs decreases significantly. Approximate similarity search has been introduced to reduce the complexity of the problem. However, most research so far has focused on approximate nearest neighbor (NN) and range queries in a single-feature data space. In practice, multiple-condition queries appear more frequently and are more complicated to handle. In this article, we present efficient approaches to three types of approximate similarity queries: approximate multi-feature NN, approximate single-feature NN, and approximate range queries. In particular, we use the Vague Query System, one of the flexible query answering systems for conventional DBMSs, as a case study to illustrate and establish the practical value of the proposed solutions. Experimental results with both synthetic and real data sets confirm the efficiency of these solutions.
