Data Clustering Analysis - From Simple Groupings to Scalable Clustering With Constraints

Clustering is the problem of grouping data based on similarity and consists of maximizing the intra-group similarity while minimizing the iter-group similarity. While this problem has attracted the attention of many researchers for many years, we are witnessing a resurgence of interest in new clustering techniques in the data mining community. In this paper we discuss some very recent clustering approaches and recount our experience with some of these algorithms. We also present the problem of clustering in the presence of constraints and discuss the issue of cluster validation.

[1]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[2]  Erich Schikuta,et al.  Grid-clustering: an efficient hierarchical clustering method for very large data sets , 1996, Proceedings of 13th International Conference on Pattern Recognition.

[3]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[4]  Daniel A. Keim,et al.  Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering , 1999, VLDB.

[5]  Ambuj K. Singh,et al.  Dimensionality reduction for similarity searching in dynamic databases , 1998, SIGMOD '98.

[6]  Padhraic Smyth,et al.  Clustering Using Monte Carlo Cross-Validation , 1996, KDD.

[7]  Philip S. Yu,et al.  Redefining Clustering for High-Dimensional Applications , 2002, IEEE Trans. Knowl. Data Eng..

[8]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[9]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[10]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[11]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[12]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[13]  Robert H. Gross,et al.  Web Page Categorization and Feature Selection Using Association Rule and Principal Component Cluster , 1997 .

[14]  Zhexue Huang,et al.  CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES , 1997 .

[15]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[16]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[17]  George Karypis,et al.  C HAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling , 1999 .

[18]  Anthony K. H. Tung,et al.  Spatial clustering in the presence of obstacles , 2001, Proceedings 17th International Conference on Data Engineering.

[19]  Timothy Masters,et al.  Neural, Novel & Hybrid Algorithms for Time Series Prediction , 1995 .

[20]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[21]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[22]  Andrew Foss,et al.  A non-parametric approach to web log analysis , 2001 .

[23]  Michalis Vazirgiannis,et al.  c ○ 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. On Clustering Validation Techniques , 2022 .

[24]  Padhraic Smyth,et al.  From Data Mining to Knowledge Discovery: An Overview , 1996, Advances in Knowledge Discovery and Data Mining.

[25]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[26]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[27]  Anthony K. H. Tung,et al.  Spatial clustering methods in data mining : A survey , 2001 .

[28]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.

[29]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[30]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[31]  Alexander Thomasian,et al.  Clustering and singular value decomposition for approximate indexing in high dimensional spaces , 1998, CIKM '98.

[32]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[33]  Oren Etzioni,et al.  Web document clustering: a feasibility demonstration , 1998, SIGIR '98.

[34]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.