The Minimum Code Length for Clustering Using the Gray Code

We propose new approaches that exploit compression algorithms for clustering numerical data. Our first contribution is a measure that scores the quality of a given clustering result with respect to a fixed encoding scheme. We call this measure the Minimum Code Length (MCL). Our second contribution is a general strategy for translating any encoding method into a clustering algorithm, which we call COOL (COding-Oriented cLustering). COOL has a low computational cost, since it scales linearly with the data set size. The clustering results of COOL are also shown to minimize MCL. To further illustrate this approach, we take the Gray code as the encoding scheme and present G-COOL. G-COOL can find clusters of arbitrary shapes and remove noise. Moreover, it is robust to changes in the input parameters: it requires only two lower bounds, one for the number of clusters and one for the size of each cluster, whereas most algorithms for finding arbitrarily shaped clusters work well only if all parameters are tuned appropriately. G-COOL is theoretically shown to achieve internal cohesion and external isolation, and is experimentally shown to work well on both synthetic and real data sets.
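For readers unfamiliar with the Gray code, the following minimal sketch shows the standard binary-reflected Gray code on non-negative integers. It is only an illustration of the encoding's defining property (consecutive values differ in exactly one bit); it is not the G-COOL algorithm, and the encoding of real-valued data used in the paper may differ. The function names `to_gray` and `from_gray` are hypothetical helpers introduced here.

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer (illustrative helper)."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Invert to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


# 3-bit Gray codes for 0..7; each code differs from its neighbor in one bit.
codes = [format(to_gray(i), "03b") for i in range(8)]
print(codes)
```

Because neighboring integers share all but one bit of their code words, nearby grid cells tend to share long code prefixes, which is the kind of property a coding-oriented clustering scheme can exploit.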
