Flexible aggregate similarity search

Aggregate similarity search, also known as the aggregate nearest neighbor (Ann) query, has many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves the most similar (or top-k) objects to Q from a database P, where the similarity is an aggregation (e.g., sum, max) of the distances between the retrieved object p and all the objects in Q. In this paper, we add flexibility to the query definition: the similarity is an aggregation over the distances between p and any subset of φM objects in Q, for some support 0 < φ ≤ 1. We call this new definition flexible aggregate similarity (Fann) search; it generalizes the Ann problem. We then present algorithms for answering Fann queries exactly and approximately. Our approximation algorithms are especially appealing: they are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimension. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions demonstrate their superior efficiency and high quality.
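To make the query definition concrete, the following is a minimal brute-force sketch of a Fann query. It is not one of the paper's algorithms. It assumes Euclidean distance, rounds the subset size φM up to an integer, and uses the hypothetical function name `fann_brute_force`. For sum and max, the best subset of Q for a fixed p is simply the φM query objects nearest to p, so sorting the distances is enough.

```python
import math

def fann_brute_force(P, Q, phi, agg="sum"):
    """Return (best_p, best_cost) for a Fann query by exhaustive scan.

    P, Q : sequences of equal-length coordinate tuples (assumed Euclidean).
    phi  : support, 0 < phi <= 1; phi == 1 reduces to the plain Ann query.
    agg  : "sum" or "max" aggregate over the chosen subset of Q.
    """
    if not 0 < phi <= 1:
        raise ValueError("phi must satisfy 0 < phi <= 1")
    if agg not in ("sum", "max"):
        raise ValueError("agg must be 'sum' or 'max'")

    # Assumption: the subset size phi * M is rounded up to an integer.
    m = math.ceil(phi * len(Q))
    reducer = sum if agg == "sum" else max

    best_p, best_cost = None, math.inf
    for p in P:
        # The m smallest distances give the optimal subset for sum and max.
        dists = sorted(math.dist(p, q) for q in Q)
        cost = reducer(dists[:m])
        if cost < best_cost:
            best_p, best_cost = p, cost
    return best_p, best_cost

P = [(0, 0), (10, 0), (5, 5)]
Q = [(0, 1), (1, 0), (10, 1), (10, 10)]
print(fann_brute_force(P, Q, 1.0, "max"))  # all of Q must be covered
print(fann_brute_force(P, Q, 0.5, "max"))  # any 2 of the 4 query objects
```

With φ = 1 the central point (5, 5) wins because it must stay close to every query object; with φ = 0.5 the answer shifts to (0, 0), which sits next to two of them. The scan costs O(|P| · M log M) and is meant only to illustrate the definition.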
