Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering

Many applications require the clustering of large amounts of high-dimensional data. Most clustering algorithms, however, do not work effectively and efficiently in high-dimensional space, which is due to the so-called "curse of dimensionality". In addition, the high-dimensional data often contains a significant amount of noise, which causes additional effectiveness problems. In this paper, we review and compare the existing algorithms for clustering high-dimensional data and show the impact of the curse of dimensionality on their effectiveness and efficiency. The comparison reveals that condensation-based approaches (such as BIRCH or STING) are the most promising candidates for achieving the necessary efficiency, but it also shows that basically all condensation-based approaches have severe weaknesses with respect to their effectiveness in high-dimensional space. To overcome these problems, we develop a new clustering technique called OptiGrid, which is based on constructing an optimal grid-partitioning of the data. The optimal grid-partitioning is determined by calculating the best partitioning hyperplanes for each dimension (if such a partitioning exists) using certain projections of the data. The advantages of our new approach are: (1) it has a firm mathematical basis; (2) it is by far more effective than existing clustering algorithms for high-dimensional data; and (3) it is very efficient even for large data sets of high dimensionality. To demonstrate the effectiveness and efficiency of our new approach, we perform a series of experiments on a number of different data sets, including real data sets from CAD and molecular biology. A comparison with one of the best known algorithms (BIRCH) shows the superiority of our new approach.
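The idea of choosing partitioning hyperplanes from projections of the data can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes axis-parallel projections only, uses a plain histogram as the density estimate, and scores a candidate cut by how deep the density valley is relative to the peaks on either side. The function names (`best_cut_1d`, `select_cuts`, `grid_cell_ids`) and the parameters `bins`, `max_cuts`, and `min_depth` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def best_cut_1d(values, bins=32):
    """Find the deepest density valley in a 1-D projection (histogram-based).

    Returns (cut_position, depth), where depth is the valley depth as a
    fraction of all points. Assumes bins >= 3.
    """
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # For every interior bin i: highest bin strictly to its left and right.
    left_max = np.maximum.accumulate(counts)[:-2]
    right_max = np.maximum.accumulate(counts[::-1])[::-1][2:]
    # A good cut has low density but separates two high-density regions.
    depth = np.minimum(left_max, right_max) - counts[1:-1]
    k = int(np.argmax(depth))
    return float(centers[k + 1]), float(depth[k]) / len(values)


def select_cuts(X, max_cuts=1, min_depth=0.02, bins=32):
    """Pick up to max_cuts axis-parallel cutting planes, deepest valley first."""
    candidates = []
    for dim in range(X.shape[1]):
        cut, depth = best_cut_1d(X[:, dim], bins)
        if depth >= min_depth:  # skip dimensions without a clear valley
            candidates.append((depth, dim, cut))
    candidates.sort(reverse=True)
    return [(dim, cut) for _, dim, cut in candidates[:max_cuts]]


def grid_cell_ids(X, cuts):
    """Label each point with the grid cell induced by the chosen cuts."""
    ids = np.zeros(len(X), dtype=np.int64)
    for bit, (dim, cut) in enumerate(cuts):
        ids |= (X[:, dim] > cut).astype(np.int64) << bit
    return ids


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(500, 3))
    b = rng.normal(0.0, 1.0, size=(500, 3))
    b[:, 0] += 10.0  # the two clusters differ only along dimension 0
    X = np.vstack([a, b])
    cuts = select_cuts(X, max_cuts=1)
    print("chosen cuts (dimension, position):", cuts)
    print("points per grid cell:", np.bincount(grid_cell_ids(X, cuts)))
```

In this sketch a dimension contributes a cut only if its projection shows a sufficiently deep valley, which mirrors the abstract's caveat that a partitioning is used "if such a partitioning exists". A fuller treatment would apply the procedure recursively to the populated grid cells and would use a smoother density estimate than a histogram.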
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
