Robust Distance-Based Clustering with Applications to Spatial Data Mining

Abstract. In this paper we present a medoid-based method for clustering geo-referenced data, suitable for applications in spatial data mining. The medoid method is related to k-MEANS, with the restriction that cluster representatives be chosen from among the data elements. Although the medoid method generally produces clusters of high quality, especially in the presence of noise, it is often criticized for the Ω(n²) time that it requires. Our method incorporates both proximity and density information to achieve high-quality clusters in subquadratic time, and it does not require the user to specify the number of clusters in advance. The time bound is achieved by means of a fast approximation to the medoid objective function, using Delaunay triangulations to store proximity information.
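To make the medoid objective and its quadratic cost concrete, the following is a minimal sketch of the standard brute-force formulation, not the subquadratic method of this paper. The function names `medoid_cost` and `best_single_medoid` and the sample points are illustrative assumptions, and Euclidean distance is assumed.

```python
import math

def medoid_cost(points, medoids):
    """Medoid objective: sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def best_single_medoid(points):
    """Brute-force 1-medoid: every data point is tried as the representative,
    and each trial scans all points, giving n * n distance evaluations."""
    return min(points, key=lambda c: medoid_cost(points, [c]))

# Four points close together plus one distant noise point (hypothetical data).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (100.0, 0.0)]

medoid = best_single_medoid(points)
mean_x = sum(x for x, _ in points) / len(points)

print(medoid)   # the representative is one of the data elements
print(mean_x)   # the mean is dragged toward the noise point
```

In this small example the medoid stays inside the dense group, while the mean is pulled well away from it by the single outlier, which illustrates the robustness to noise noted above. The nested scan over candidates and points is the source of the quadratic running time.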
