GAD: General Activity Detection for Fast Clustering on Large Data

In this paper, we propose GAD (General Activity Detection), a framework for fast clustering on large-scale data. Within this framework we design a set of algorithms for different scenarios: (1) an exact GAD algorithm, E-GAD, which is much faster than K-Means and produces the same clustering result; (2) approximate GAD algorithms based on different assumptions, which are faster than E-GAD and achieve different degrees of approximation; and (3) GAD-based algorithms that handle the "large clusters" problem, which appears in many large-scale clustering applications. Two existing activity detection algorithms, GT and CGAUTC, are special cases of the framework. The most important contribution of our work is that the framework gives a general way to exploit activity detection for fast clustering in both exact and approximate scenarios, and the algorithms we propose within it achieve very high speed. Extensive experiments on several large datasets from various real-world applications show that the proposed algorithms are effective and efficient.
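To make the idea of activity detection concrete, the following is a minimal sketch of one common form of it inside a K-Means loop. It is not the E-GAD algorithm from the paper, whose details are not given here. The rule it assumes is: a centroid is "active" if it moved in the last update and "static" otherwise, and a point whose current centroid did not move farther away only needs to be compared against the active centroids. All function names (`sq_dist`, `best_of`, `update_centroids`, `kmeans`) are hypothetical, and the equivalence with plain K-Means holds only when there are no exact distance ties.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_of(point, centroids, ids, best=None):
    """Scan the centroid ids; ties go to the lower index. Returns (index, sq. distance)."""
    for j in ids:
        d = sq_dist(point, centroids[j])
        if best is None or (d, j) < (best[1], best[0]):
            best = (j, d)
    return best

def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its points; empty clusters keep the old one."""
    k, dim = len(centroids), len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += p[t]
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
            for j in range(k)]

def kmeans(points, init, use_activity, max_iter=100):
    """Lloyd-style K-Means. Returns (labels, centroids, number of distance evaluations)."""
    centroids = [list(c) for c in init]
    k = len(centroids)
    assign = [best_of(p, centroids, range(k)) for p in points]
    n_dist = len(points) * k
    for _ in range(max_iter):
        labels = [j for j, _ in assign]
        new_centroids = update_centroids(points, labels, centroids)
        # Active centroids are those that changed in this update step.
        active = [j for j in range(k) if new_centroids[j] != centroids[j]]
        centroids = new_centroids
        if not active:
            break  # nothing moved, so the assignment cannot change
        for i, p in enumerate(points):
            cur, old_d = assign[i]
            if not use_activity:
                assign[i] = best_of(p, centroids, range(k))
                n_dist += k
                continue
            d_cur = sq_dist(p, centroids[cur])
            # If the current centroid did not move away, no static centroid can
            # be strictly closer, so only active centroids need checking.
            pool = active if d_cur <= old_d else range(k)
            others = [j for j in pool if j != cur]
            assign[i] = best_of(p, centroids, others, best=(cur, d_cur))
            n_dist += 1 + len(others)
    return [j for j, _ in assign], centroids, n_dist

# Illustrative data (an assumption, not from the paper): four Gaussian blobs in 2-D.
rng = random.Random(0)
blobs = [(0, 0), (6, 0), (0, 6), (6, 6)]
points = [[rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)] for cx, cy in blobs for _ in range(75)]
init = points[::75]  # one starting centroid taken from each blob

labels_fast, cents_fast, cost_fast = kmeans(points, init, use_activity=True)
labels_full, cents_full, cost_full = kmeans(points, init, use_activity=False)
print("same labels:", labels_fast == labels_full)
print("distance evaluations:", cost_fast, "with activity detection vs", cost_full, "without")
```

In this sketch the savings come from late iterations, when most centroids have stopped moving and the candidate pool for most points shrinks to a handful of active centroids. The distance-evaluation counter is only there to make that effect visible.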
