On Group Nearest Group Query Processing

Given a data point set D, a query point set Q, and an integer k, the Group Nearest Group (GNG) query finds a subset ω (|ω| ≤ k) of points from D such that the total distance from all points in Q to their nearest points in ω is no greater than that for any other subset ω' (|ω'| ≤ k) of points in D. The GNG query is a partition-based clustering problem that arises in many real applications and is NP-hard. In this paper, the Exhaustive Hierarchical Combination (EHC) algorithm and the Subset Hierarchical Refinement (SHR) algorithm are developed for GNG query processing. While EHC can provide the optimal solution for k = 2, SHR is an efficient approximate approach that combines database techniques with a local search heuristic. Both approaches focus on minimizing the access to and evaluation of subsets of cardinality k in D, since the number of such subsets is exponentially greater than |D|. To that end, hierarchical blocks of data points at a high level are used to find an intermediate solution, which is then refined at a low level by following a guided search direction so that irrelevant subsets are pruned. Comprehensive experiments on both real and synthetic data sets demonstrate the superiority of SHR in terms of efficiency and quality.
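
To make the query definition concrete, the following is a minimal Python sketch of the GNG objective on small in-memory point sets. It is not EHC or SHR: `brute_force_gng` simply enumerates every subset of at most k points, and `swap_local_search` is a generic single-swap interchange heuristic, included only to illustrate what a local search over subsets looks like. All function names, the use of Euclidean distance, and the choice of starting subset are assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def gng_cost(subset, queries):
    """Total distance from each query point to its nearest point in subset."""
    return sum(min(dist(q, p) for p in subset) for q in queries)


def brute_force_gng(data, queries, k):
    """Evaluate every subset of size 1..k (exponential; tiny inputs only)."""
    best, best_cost = None, float("inf")
    for size in range(1, k + 1):
        for subset in combinations(data, size):
            cost = gng_cost(subset, queries)
            if cost < best_cost:
                best, best_cost = subset, cost
    return best, best_cost


def swap_local_search(data, queries, k):
    """Generic interchange heuristic (an assumption, not the paper's SHR).

    Starts from the first k data points and keeps replacing one chosen
    point with an unchosen one while the total distance decreases.
    """
    current = list(data[:k])
    cost = gng_cost(current, queries)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            for cand in data:
                if cand in current:
                    continue
                trial = current[:i] + [cand] + current[i + 1:]
                trial_cost = gng_cost(trial, queries)
                if trial_cost < cost - 1e-12:
                    current, cost = trial, trial_cost
                    improved = True
    return tuple(current), cost
```

The brute-force routine shows why subset evaluation dominates the cost: the number of candidate subsets grows combinatorially with |D| and k, which is the work that the abstract says both algorithms aim to reduce. The swap heuristic may stop at a local optimum, so its result is only an upper bound on the optimal total distance.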
