Parallel D2-Clustering: Large-Scale Clustering of Discrete Distributions

The discrete distribution clustering algorithm, namely D2-clustering, has demonstrated its usefulness in image classification and annotation where each object is represented by a bag of weighed vectors. The high computational complexity of the algorithm, however, limits its applications to large-scale problems. We present a parallel D2-clustering algorithm with substantially improved scalability. A hierarchical structure for parallel computing is devised to achieve a balance between the individual-node computation and the integration process of the algorithm. Additionally, it is shown that even with a single CPU, the hierarchical structure results in significant speed-up. Experiments on real-world large-scale image data, Youtube video data, and protein sequence data demonstrate the efficiency and wide applicability of the parallel D2-clustering algorithm. The loss in clustering accuracy is minor in comparison with the original sequential algorithm.

[1]  Sergei Vassilvitskii,et al.  How slow is the k-means method? , 2006, SCG '06.

[2]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[3]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[4]  Jianying Hu,et al.  K-means clustering of proportional data using L1 distance , 2008, 2008 19th International Conference on Pattern Recognition.

[5]  Ingrid Daubechies,et al.  Ten Lectures on Wavelets , 1992 .

[6]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[7]  Bodo Manthey,et al.  k-Means Has Polynomial Smoothed Complexity , 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[8]  Hongyuan Zha,et al.  A new Mallows distance based metric for comparing clusterings , 2005, ICML '05.

[9]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res..

[10]  David R. Westhead,et al.  TMB-Hunt: An amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins , 2005, BMC Bioinformatics.

[11]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[12]  Steven Salzberg,et al.  Clustering metagenomic sequences with interpolated Markov models , 2010, BMC Bioinformatics.

[13]  Alva L. Couch,et al.  Parallel K-means Clustering Algorithm on NOWs , 2003 .

[14]  E. Chang,et al.  Parallel algorithms for mining large-scale rich-media data , 2009, ACM Multimedia.

[15]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[16]  Keiji Yanai,et al.  Web video retrieval based on the Earth Mover’s Distance by integrating color, motion and sound , 2008, 2008 15th IEEE International Conference on Image Processing.

[17]  C. Mallows A Note on Asymptotic Joint Normality , 1972 .

[18]  James Ze Wang,et al.  Image retrieval: Ideas, influences, and trends of the new age , 2008, CSUR.

[19]  Amos Bairoch,et al.  The PROSITE database , 2005, Nucleic Acids Res..

[20]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[21]  Jonathan Dushoff,et al.  Unsupervised statistical clustering of environmental shotgun sequences , 2009, BMC Bioinformatics.

[22]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Stefan M. Rüger,et al.  Automated Image Annotation Using Global Features and Robust Nonparametric Density Estimation , 2005, CIVR.

[24]  Konrad Scheffler,et al.  Evolutionary fingerprinting of genes. , 2010, Molecular biology and evolution.

[25]  Anil K. Jain,et al.  A self-organizing network for hyperellipsoidal clustering (HEC) , 1994, Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94).

[26]  Rainer Lienhart,et al.  Comparison of automatic shot boundary detection algorithms , 1998, Electronic Imaging.

[27]  Lorenza Saitta,et al.  Image and Image-Set Modeling Using a Mixture Model , 2008 .

[28]  James Ze Wang,et al.  Real-Time Computerized Annotation of Pictures , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Edward Y. Chang,et al.  Parallel Spectral Clustering , 2008, ECML/PKDD.

[30]  Xiaojun Wan,et al.  A novel document similarity measure based on earth mover's distance , 2007, Inf. Sci..

[31]  Dong Xu,et al.  Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Leonidas J. Guibas,et al.  A metric for distributions with applications to image databases , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[33]  Narendra Karmarkar,et al.  A new polynomial-time algorithm for linear programming , 1984, Comb..

[34]  Jonathan Pevsner,et al.  Bioinformatics and functional genomics , 2003 .

[35]  Inderjit S. Dhillon,et al.  A Data-Clustering Algorithm on Distributed Memory Multiprocessors , 1999, Large-Scale Parallel Data Mining.

[36]  Shih-Fu Chang,et al.  Columbia University’s Baseline Detectors for 374 LSCOM Semantic Visual Concepts , 2007 .

[37]  Edward Y. Chang,et al.  PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications , 2009, AAIM.