A Scale Dependent Model for Clustering by Optimization of Homogeneity and Separation

We present a model for clustering a set of high-dimensional data into homogeneous clusters that are well separated from each other. A novel feature of our model is that it allows the user to directly control the scale of the clusters. This is realized by formulating the clustering problem as an optimization problem whose objective function combines two measures of the quality of a clustering, namely homogeneity and separation, with a parameter that controls the scale. As an illustration of the use of our general framework, we present an application to clustering data based on pairwise similarity measured by the (uncentered) Pearson correlation coefficient, that is, the cosine of the angle between two data vectors in a Euclidean space. In this case, for a dataset of size n in a p-dimensional space, our algorithm outputs a clustering in O(npd + mpd^2) floating point operations, while typical hierarchical and partitioning algorithms require O(n^2 p) and O(npkm) operations, respectively. Here d is the number of clusters determined by our algorithm, k is the number of clusters specified by the user, and m is the number of iterations. Experimental results on synthetic, biological, and textured image data are presented to demonstrate the usefulness of the proposed model.
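The similarity measure named in the abstract can be illustrated with a minimal sketch. The function names `uncentered_pearson` and `centered_pearson` are hypothetical and chosen only for this example. The code shows the pairwise similarity alone, not the clustering model or its objective function, and the centered variant is included only for contrast.

```python
import math


def uncentered_pearson(x, y):
    """Cosine of the angle between x and y (no mean subtraction)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def centered_pearson(x, y):
    """Ordinary Pearson correlation: subtract each mean, then take the cosine."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return uncentered_pearson(
        [a - mean_x for a in x],
        [b - mean_y for b in y],
    )


if __name__ == "__main__":
    u = [1.0, 2.0, 3.0]
    v = [3.0, 4.0, 5.0]  # v = u + 2, a shifted copy of u
    print(uncentered_pearson(u, v))  # below 1: the shift changes the angle
    print(centered_pearson(u, v))    # close to 1: centering removes the shift
```

The two measures agree whenever both vectors already have zero mean. Otherwise the uncentered version is sensitive to a constant offset, as the shifted pair above shows.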
