Parallel Spectral Clustering

Spectral clustering algorithm has been shown to be more effective in finding clusters than most traditional algorithms. However, spectral clustering suffers from a scalability problem in both memory use and computational time when a dataset size is large. To perform clustering on large datasets, we propose to parallelize both memory use and computation on distributed computers. Through an empirical study on a large document dataset of 193,844 data instances and a large photo dataset of 637,137, we demonstrate that our parallel algorithm can effectively alleviate the scalability problem.

[1]  D. C. Sorensen,et al.  A portable implementation of ARPACK for distributed memory parallel architectures , 1996 .

[2]  Fan Chung,et al.  Spectral Graph Theory , 1996 .

[3]  Jitendra Malik,et al.  Spectral grouping using the Nystrom method , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[5]  Chao Yang,et al.  ARPACK users' guide - solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods , 1998, Software, environments, tools.

[6]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[7]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[8]  Inderjit S. Dhillon,et al.  A Data-Clustering Algorithm on Distributed Memory Multiprocessors , 1999, Large-Scale Parallel Data Mining.

[9]  Rajeev Thakur,et al.  Improving the Performance of Collective Operations in MPICH , 2003, PVM/MPI.

[10]  Vicente Hernández,et al.  SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems , 2005, TOMS.

[11]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  J. Demmel,et al.  Sun Microsystems , 1996 .

[13]  Chris H. Q. Ding,et al.  A min-max cut algorithm for graph partitioning and data clustering , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[14]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[15]  Andrew B. Kahng,et al.  New spectral methods for ratio cut partitioning and clustering , 1991, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[16]  Jianbo Shi,et al.  Multiclass spectral clustering , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[17]  Shuting Xu,et al.  A Parallel Hybrid Web Document Clustering Algorithm and its Performance Study , 2004, The Journal of Supercomputing.

[18]  Hao Zhang,et al.  Segmentation of 3D meshes through spectral clustering , 2004, 12th Pacific Conference on Computer Graphics and Applications, 2004. PG 2004. Proceedings..

[19]  Edward Y. Chang,et al.  Discovery of a perceptual distance function for measuring image similarity , 2003, Multimedia Systems.

[20]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[21]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[22]  Kien A. Hua,et al.  SamMatch: a flexible and efficient sampling-based image retrieval technique for large image databases , 1999, MULTIMEDIA '99.

[23]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.