Distributed approximate spectral clustering for large-scale datasets

Data-intensive applications are becoming important in many science and engineering fields because of the high rates at which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process with many current machine learning algorithms because of their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem: they require a kernel matrix, which takes O(N^2) time to compute and O(N^2) space to store, where N is the number of data points. The proposed algorithm substantially reduces the computation and memory overhead required to compute the kernel matrix without significantly affecting the accuracy of the results. In addition, the level of approximation can be controlled to trade off some accuracy of the results for reduced computing resources. The algorithm is designed to be independent of the kernel-based machine learning algorithm that is subsequently applied, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.
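The abstract does not describe how the kernel matrix is approximated, so the following is only an illustrative sketch under stated assumptions, not the proposed algorithm. It assumes a Gaussian kernel and a hypothetical bucketing step (random-hyperplane sign hashing, with the hypothetical parameter `n_bits` as the approximation knob), and computes kernel entries only between points that fall in the same bucket. It shows how such a block-wise scheme can store far fewer than N^2 entries.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel between the rows of X and the rows of Y.
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def bucket_by_sign_hash(X, n_bits, rng):
    # Assumption for illustration: group points by the sign pattern of
    # n_bits random projections. More bits give smaller buckets, i.e. a
    # coarser approximation that needs less memory.
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes) > 0
    keys = bits @ (1 << np.arange(n_bits))
    buckets = {}
    for i, k in enumerate(keys):
        buckets.setdefault(int(k), []).append(i)
    return buckets

def blockwise_kernel(X, buckets, sigma=1.0):
    # One small kernel matrix per bucket instead of a single N x N matrix.
    return {k: gaussian_kernel(X[idx], X[idx], sigma) for k, idx in buckets.items()}

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 16))
buckets = bucket_by_sign_hash(X, n_bits=4, rng=rng)
blocks = blockwise_kernel(X, buckets)

full_entries = X.shape[0] ** 2
approx_entries = sum(b.size for b in blocks.values())
print("full kernel entries:     ", full_entries)
print("block-wise kernel entries:", approx_entries)
```

Because each block depends only on the points in its own bucket, the blocks could be computed independently, which is the kind of structure a MapReduce job can exploit. Any downstream kernel-based method would then operate per block; how accuracy is affected depends on the bucketing and is not addressed by this sketch.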
