APPROXIMATE SPECTRAL CLUSTERING FOR LARGE-SCALE DATASETS

Many kernel-based clustering algorithms do not scale to large, high-dimensional datasets. The similarity matrix on which these algorithms rely requires O(N^2) time and space for N data points. In this thesis, we present the design of an approximation algorithm for clustering large, high-dimensional datasets. The design greatly reduces both the time needed to compute the similarity matrix and the space needed to store it, without significantly affecting clustering accuracy. Because the design is modular and self-contained, other kernel-based clustering algorithms could also adopt it to improve their performance. We implemented the proposed algorithm in the MapReduce distributed programming framework and evaluated it on synthetic datasets as well as a real dataset from Wikipedia containing more than three million documents. Our results demonstrate the high accuracy and the significant time and memory savings that our algorithm can achieve.
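To make the quadratic cost concrete, the following is a minimal sketch of the exact, dense similarity matrix that such algorithms start from. It is not the approximation algorithm of this thesis. The Gaussian kernel, the `sigma` parameter, and the function names are assumptions chosen for illustration only.

```python
import numpy as np


def gaussian_similarity(points: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Dense N x N Gaussian similarity matrix for an (N, d) array.

    Assumption: a Gaussian (RBF) kernel stands in for whatever kernel
    a particular clustering algorithm uses.
    """
    sq_norms = np.sum(points ** 2, axis=1)
    # Squared Euclidean distance between every pair of points: O(N^2) entries.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    # Clip tiny negative values caused by floating-point round-off.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def dense_similarity_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Memory needed to store a dense N x N matrix (float64 by default)."""
    return n * n * bytes_per_entry


# Small illustrative input; the data is random and purely hypothetical.
rng = np.random.default_rng(0)
S = gaussian_similarity(rng.normal(size=(200, 16)))
```

Under these assumptions, `dense_similarity_bytes(3_000_000)` gives 7.2 × 10^13 bytes, or roughly 72 TB of 64-bit entries for a dataset of three million points, which is why computing and storing the full matrix is impractical at that scale.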
