A general framework for efficient clustering of large datasets based on activity detection

Data clustering is one of the most popular data mining techniques, with broad applications. K-Means is one of the most popular clustering algorithms because of its efficiency, its effectiveness, and its wide availability in commercial and noncommercial software packages. Efficient clustering of large datasets is especially useful; however, running K-Means on large data carries a heavy computational burden, which originates from the numerous distance calculations between the patterns and the centers. This paper proposes a framework, General Activity Detection (GAD), for fast clustering of large-scale data based on center activity detection. Within this framework, we design a set of algorithms for different scenarios: (i) an exact GAD algorithm, E-GAD, which is much faster than K-Means and produces the same clustering result; (ii) approximate GAD algorithms with different assumptions, which are faster than E-GAD while achieving different degrees of approximation; and (iii) GAD-based algorithms that handle the large clusters problem, which appears in many large-scale clustering applications. The framework provides a general solution for exploiting activity detection for fast clustering in both exact and approximate scenarios, and our proposed algorithms within the framework can achieve very high speed. We have conducted extensive experiments on several datasets from various real-world applications, including data compression, image clustering, and bioinformatics. Measured by clustering quality and CPU time, the experimental results show the effectiveness and high efficiency of our proposed algorithms. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 11–29, 2011 (This work is extended from our SDM'09 conference paper [1]. Supported in part by the U.S. National Science Foundation grants IIS-08-42769, BDI-05-15813, and IIS-05-13678, and Office of Naval Research (ONR) grant N00014-08-1-0565. Any opinions, findings, and conclusions expressed here are those of the authors and do not necessarily reflect the views of the funding agencies.)
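
To make the idea of center activity detection concrete, the following is a minimal Python sketch, not the paper's E-GAD algorithm. It assumes a simple filtering rule in the spirit of the code vector activity detection cited in the reference list: a center is "active" if it moved in the last update and "static" otherwise, and a pattern whose own center is static, or moved no farther away, only needs to be compared against the active centers. All names (`kmeans`, `use_activity`, `recompute_centers`, `sq_dist`) are hypothetical and chosen for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recompute_centers(points, labels, centers):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
        for j in range(k)
    ]


def kmeans(points, init_centers, use_activity=True, max_iter=100):
    """Lloyd-style K-Means; returns (labels, centers, distance-call count).

    With use_activity=True, static centers are skipped whenever that is
    safe (assumed rule, see the text above). With use_activity=False,
    every center is searched for every pattern in every iteration.
    """
    centers = [tuple(c) for c in init_centers]
    k = len(centers)

    # Initial assignment: a full search over all centers.
    labels, dists = [], []
    for p in points:
        all_d = [sq_dist(p, c) for c in centers]
        j = min(range(k), key=all_d.__getitem__)
        labels.append(j)
        dists.append(all_d[j])
    n_dist = len(points) * k

    for _ in range(max_iter):
        new_centers = recompute_centers(points, labels, centers)
        # A center is active if the update moved it.
        active = [j for j in range(k) if new_centers[j] != centers[j]]
        if not active:
            break  # no center moved: converged
        centers = new_centers
        is_active = [False] * k
        for j in active:
            is_active[j] = True

        for i, p in enumerate(points):
            cur = labels[i]
            if is_active[cur]:
                d_cur = sq_dist(p, centers[cur])
                n_dist += 1
                # Safe only if the pattern's own center did not move away.
                safe = d_cur <= dists[i]
            else:
                d_cur = dists[i]  # static center: distance is unchanged
                safe = True
            # Static centers were no closer than the old nearest distance,
            # so when `safe` holds they cannot beat the current center.
            candidates = active if (use_activity and safe) else range(k)
            best, best_d = cur, d_cur
            for j in candidates:
                if j != cur:
                    d = sq_dist(p, centers[j])
                    n_dist += 1
                    if d < best_d:
                        best, best_d = j, d
            labels[i], dists[i] = best, best_d

    return labels, centers, n_dist
```

Calling `kmeans(points, init_centers, use_activity=True)` and `kmeans(points, init_centers, use_activity=False)` on the same input should give the same labels and centers under this rule, while the returned distance-call count shows how many distance calculations the filter avoided. The savings tend to grow in later iterations, when most centers have stopped moving.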

[1]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[2]  Michael J. Brusco,et al.  Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques , 2007, J. Classif..

[3]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[4]  Philip S. Yu,et al.  Top 10 algorithms in data mining , 2007, Knowledge and Information Systems.

[5]  S. Ra,et al.  A fast mean-distance-ordered partial codebook search algorithm for image vector quantization , 1993 .

[6]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[7]  Baowen Xu,et al.  Stable initialization scheme for K-means clustering , 2009, Wuhan University Journal of Natural Sciences.

[8]  Andreas Stafylopatis,et al.  A clustering method based on boosting , 2004, Pattern Recognit. Lett..

[9]  Andrew W. Moore,et al.  Accelerating exact k-means algorithms with geometric reasoning , 1999, KDD '99.

[10]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.

[11]  Yizhou Sun,et al.  Ranking-based clustering of heterogeneous information networks with star network schema , 2009, KDD.

[12]  Antonio Torralba,et al.  80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[14]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[15]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[16]  Sandrine Dudoit,et al.  Bagging to Improve the Accuracy of A Clustering Procedure , 2003, Bioinform..

[17]  J. MacQueen.  Some methods for classification and analysis of multivariate observations , 1967 .

[18]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[19]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[20]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[21]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[22]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm , 1977 .

[23]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[24]  M. M. Astrahan.  Speech Analysis by Clustering, or the Hyperphoneme Method , 1970 .

[25]  Olli Nevalainen,et al.  A fast exact GLA based on code vector activity detection , 2000, IEEE Trans. Image Process..

[26]  Shen-Chuan Tai,et al.  Two fast nearest neighbor searching algorithms for image vector quantization , 1996, IEEE Trans. Commun..

[27]  Jeng-Shyang Pan,et al.  An efficient encoding algorithm for vector quantization based on subvector technique , 2003, IEEE Trans. Image Process..

[28]  Sangkyum Kim,et al.  GAD: General Activity Detection for Fast Clustering on Large Data , 2009, SDM.

[29]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[30]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[32]  Daphna Weinshall,et al.  Classification with Nonmetric Distances: Image Retrieval and Class Representation , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[33]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[34]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .

[35]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[36]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.

[37]  Yizhou Sun,et al.  RankClus: integrating clustering with ranking for heterogeneous information network analysis , 2009, EDBT '09.

[38]  Charles Elkan,et al.  Using the Triangle Inequality to Accelerate k-Means , 2003, ICML.

[39]  Yi-Ching Liaw,et al.  Fast-searching algorithm for vector quantization using projection and triangular inequality , 2004, IEEE Transactions on Image Processing.

[40]  Mohamed S. Kamel,et al.  Cumulative Voting Consensus Method for Partitions with Variable Number of Clusters , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Yi-Ching Liaw,et al.  A fast VQ codebook generation algorithm using codeword displacement , 2008, Pattern Recognit..

[42]  Sin-Horng Chen,et al.  Fast Algorithm for VQ Codebook Design , 1991 .

[43]  Ting Su,et al.  A deterministic method for initializing K-means clustering , 2004, 16th IEEE International Conference on Tools with Artificial Intelligence.

[44]  Boris Mirkin,et al.  Clustering for Data Mining: A Data Recovery Approach (Chapman & Hall/CRC Computer Science) , 2005 .

[45]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[46]  Robert M. Gray,et al.  An Improvement of the Minimum Distortion Encoding Algorithm for Vector Quantization , 1985, IEEE Trans. Commun..

[47]  Robert E. Tarjan,et al.  Graph Clustering and Minimum Cut Trees , 2004, Internet Math..

[48]  H. P. Friedman,et al.  On Some Invariant Criteria for Grouping Data , 1967 .

[49]  Andrew W. Moore,et al.  X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[50]  Xiaowei Xu,et al.  SCAN: a structural clustering algorithm for networks , 2007, KDD '07.

[51]  Aidong Zhang,et al.  WaveCluster: a wavelet-based clustering approach for spatial data in very large databases , 2000, The VLDB Journal.

[52]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[54]  Jon Louis Bentley,et al.  Multidimensional divide-and-conquer , 1980, CACM.