Data Structures for Minimization of Total Within-Group Distance for Spatio-temporal Clustering

Statistical principles suggest minimization of the total within-group distance (TWGD) as a robust criterion for clustering point data associated with a Geographical Information System [17]. This problem is NP-hard and, although it admits a linear programming formulation, must in practice be solved with heuristic methods. Heuristics proposed so far require quadratic time, which is prohibitively expensive for data mining applications. This paper introduces data structures for managing large two-dimensional point data sets and for fast clustering via interchange heuristics. These structures avoid quadratic time by approximating proximity information. Our scheme is illustrated with two-dimensional quadtrees, but can be extended to other structures suited to three-dimensional data or spatial data with time-stamps. The result is a fast and robust clustering method.
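
To make the cost argument concrete, the following is a minimal Python sketch, not the method of this paper. It assumes that TWGD means the sum of Euclidean distances over all pairs of points assigned to the same group; the abstract does not give the exact definition, so that reading is an assumption. The names `twgd` and `interchange_pass` are hypothetical. The sketch recomputes the objective from scratch for every candidate move, so each evaluation alone takes quadratic time in the number of points, which is the kind of cost that approximate proximity structures are meant to avoid.

```python
from itertools import combinations
from math import dist


def twgd(points, labels):
    """Assumed TWGD: sum of distances over all pairs sharing a group label."""
    return sum(
        dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
        if labels[i] == labels[j]
    )


def interchange_pass(points, labels, k):
    """One naive pass: move each point to the group that lowers TWGD most.

    Every candidate move triggers a full quadratic-time recomputation.
    """
    labels = list(labels)
    best = twgd(points, labels)
    for i in range(len(points)):
        keep = labels[i]              # best group found so far for point i
        for g in range(k):
            if g == keep:
                continue
            labels[i] = g             # tentatively move point i to group g
            cost = twgd(points, labels)
            if cost < best:
                best, keep = cost, g  # accept the improving move
        labels[i] = keep              # commit the best group for point i
    return labels, best


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    start = [0, 0, 1, 1, 1]           # point 2 starts in the wrong group
    print(interchange_pass(pts, start, k=2))
```

A single-point move like this can stall in a local minimum, which is one reason interchange heuristics are usually iterated or combined with swaps.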

[1] Jiong Yang, et al. STING: A Statistical Information Grid Approach to Spatial Data Mining, 1997, VLDB.

[2] Hanan Samet, et al. The Design and Analysis of Spatial Data Structures, 1989.

[3] Peter Eades, et al. FADE: Graph Drawing, Clustering, and Visual Abstraction, 2000, GD.

[4] Erhan Erkut, et al. Analysis of aggregation errors for the p-median problem, 1999, Comput. Oper. Res.

[5] Stan Openshaw, et al. Two exploratory space-time-attribute pattern analysers relevant to GIS, 1994.

[6] Paul H. Calamai, et al. The demand partitioning method for reducing aggregation errors in p-median problems, 1999, Comput. Oper. Res.

[7] Erich Schikuta, et al. The BANG-Clustering System: Grid-Based Data Analysis, 1997, IDA.

[8] Paul S. Bradley, et al. Initialization of Iterative Refinement Clustering Algorithms, 1998, KDD.

[9] Richard L. Church, et al. Applying simulated annealing to location-planning models, 1996, J. Heuristics.

[10] C. S. Wallace, et al. Unsupervised Learning Using MML, 1996, ICML.

[11] L. Greengard, et al. A Fast Adaptive Multipole Algorithm for Particle Simulations, 1988.

[12] Hans-Peter Kriegel, et al. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, 1995, SSD.

[13] Jiawei Han, et al. Efficient and Effective Clustering Methods for Spatial Data Mining, 1994, VLDB.

[14] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[15] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[16] L. Belbin. The Use of Non-hierarchical Allocation Methods for Clustering Large Sets of Data, 1987, Aust. Comput. J.

[17] Paul S. Bradley, et al. Clustering via Concave Minimization, 1996, NIPS.

[18] Manfred Horn. Analysis and Computational Schemes for p-Median Heuristics, 1996.

[19] M. Rao. Cluster Analysis and Mathematical Programming, 1971.

[20] Hrishikesh D. Vinod. Mathematica Integer Programming and the Theory of Grouping, 1969.

[21] Michael E. Houle, et al. Robust Clustering of Large Geo-referenced Data Sets, 1999, PAKDD.

[22] Tian Zhang, et al. BIRCH: an efficient data clustering method for very large databases, 1996, SIGMOD '96.

[23] Polly Bart, et al. Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph, 1968, Oper. Res.

[24] Piet Hut, et al. A hierarchical O(N log N) force-calculation algorithm, 1986, Nature.

[25] Peter J. Rousseeuw, et al. Robust regression and outlier detection, 1987.

[26] Michael E. Houle, et al. Fast Randomized Algorithms for Robust Estimation of Location, 2000, TSDM.

[27] Edward M. Reingold, et al. Graph drawing by force-directed placement, 1991, Softw. Pract. Exp.