Parallel Spectral Clustering in Distributed Systems

Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach by sparsifying the matrix with another by the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863, we show that our parallel algorithm can effectively handle large problems.

[1]  Yoshua Bengio,et al.  Greedy Spectral Embedding , 2005, AISTATS.

[2]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[3]  Ivor W. Tsang,et al.  Improved Nyström low-rank approximation and error analysis , 2008, ICML '08.

[4]  Joydeep Ghosh,et al.  A Unified Framework for Model-based Clustering , 2003, J. Mach. Learn. Res..

[5]  Matthias W. Seeger,et al.  Using the Nyström Method to Speed Up Kernel Machines , 2000, NIPS.

[6]  GhemawatSanjay,et al.  The Google file system , 2003 .

[7]  Robert B. Ross,et al.  Using MPI-2: Advanced Features of the Message Passing Interface , 2003, CLUSTER.

[8]  Edward Y. Chang,et al.  Support vector machine active learning for image retrieval , 2001, MULTIMEDIA '01.

[9]  Edward Y. Chang,et al.  Discovery of a perceptual distance function for measuring image similarity , 2003, Multimedia Systems.

[10]  Edward Y. Chang,et al.  Parallel Spectral Clustering , 2008, ECML/PKDD.

[11]  Michael I. Jordan,et al.  Learning Spectral Clustering , 2003, NIPS.

[12]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[13]  Carl Edward Rasmussen,et al.  Observations on the Nyström Method for Gaussian Process Prediction , 2002 .

[14]  Chao Yang,et al.  ARPACK users' guide - solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods , 1998, Software, environments, tools.

[15]  Jitendra Malik,et al.  Spectral grouping using the Nystrom method , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[17]  Ameet Talwalkar,et al.  Large-scale manifold learning , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Andrew W. Moore,et al.  An Investigation of Practical Approximate Nearest Neighbor Algorithms , 2004, NIPS.

[19]  H. Simon,et al.  A parallel Lanczos method for symmetric generalized eigenvalue problems , 1999 .

[20]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[21]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[22]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[23]  H. Simon,et al.  TRLAN User Guide , 1999 .

[24]  Francisco Tirado,et al.  Some Aspects About the Scalability of Scientific Applications on Parallel Architectures , 1996, Parallel Comput..

[25]  Bernhard Schölkopf,et al.  Sampling Techniques for Kernel Methods , 2001, NIPS.

[26]  Bernhard Schölkopf,et al.  A Local Learning Approach for Clustering , 2006, NIPS.

[27]  Vicente Hernández,et al.  SLEPc: A scalable and flexible toolkit for the solution of eigenvalue problems , 2005, TOMS.

[28]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[29]  Attila Gursoy Data Decomposition for Parallel K-means Clustering , 2003 .

[30]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[31]  Jianbo Shi,et al.  Multiclass spectral clustering , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[32]  Attila Gürsoy,et al.  Data Decomposition for Parallel K-means Clustering , 2003, Parallel Processing and Applied Mathematics.

[33]  Shuting Xu,et al.  A Parallel Hybrid Web Document Clustering Algorithm and its Performance Study , 2004, The Journal of Supercomputing.

[34]  Hao Zhang,et al.  Segmentation of 3D meshes through spectral clustering , 2004, 12th Pacific Conference on Computer Graphics and Applications, 2004. PG 2004. Proceedings..

[35]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[36]  Jianbo Shi,et al.  Learning Segmentation by Random Walks , 2000, NIPS.

[37]  Inderjit S. Dhillon,et al.  A Data-Clustering Algorithm on Distributed Memory Multiprocessors , 1999, Large-Scale Parallel Data Mining.

[38]  Andrew B. Kahng,et al.  New spectral methods for ratio cut partitioning and clustering , 1991, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[39]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Kenneth Steiglitz,et al.  Combinatorial Optimization: Algorithms and Complexity , 1981 .

[41]  Chris H. Q. Ding,et al.  Spectral Relaxation for K-means Clustering , 2001, NIPS.

[42]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[43]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[44]  J. Watts,et al.  Interprocessor collective communication library (InterCom) , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[45]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[46]  O. Marques BLZPACK: Description and User's Guide , 1995 .

[47]  Jeffrey K. Uhlmann,et al.  Satisfying General Proximity/Similarity Queries with Metric Trees , 1991, Inf. Process. Lett..

[48]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[49]  D. C. Sorensen,et al.  A portable implementation of ARPACK for distributed memory parallel architectures , 1996 .

[50]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[51]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[52]  Chris H. Q. Ding,et al.  A min-max cut algorithm for graph partitioning and data clustering , 2001, Proceedings 2001 IEEE International Conference on Data Mining.