A Very Fast Method for Clustering Big Text Datasets

Large-scale text datasets have long eluded a family of particularly elegant and effective clustering methods that exploit pairwise similarities between data points. The obstacle is the prohibitive cost of operating on a similarity matrix: the state of the art is at best quadratic in both time and space. We present an extremely fast and simple method that also uses all pairwise similarities between data points, and show through experiments that it matches previous methods in clustering accuracy while running in linear time and space, without sampling data points or sparsifying the similarity matrix.
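The abstract does not spell out how all pairwise similarities can be used without building the similarity matrix. One common way to get this effect, shown here only as a minimal sketch and not as the authors' stated algorithm, assumes the similarity is an inner product of feature vectors (for example, cosine similarity on unit-length term vectors). In that case the similarity matrix factors as F Fᵀ, and any product with a vector can be computed as F (Fᵀ v), which touches only the entries of F. The function names, the toy matrix, and the row-normalized update below are all illustrative assumptions.

```python
import numpy as np


def implicit_similarity_matvec(F, v):
    """Return (F @ F.T) @ v without ever forming the n-by-n matrix F @ F.T.

    Cost is proportional to the number of entries of F, not to n squared.
    """
    return F @ (F.T @ v)


def implicit_row_sums(F):
    """Row sums of F @ F.T (the degree of each point in the similarity graph)."""
    return implicit_similarity_matvec(F, np.ones(F.shape[0]))


def normalized_power_step(F, v, degrees):
    """One update v <- D^{-1} (F F^T) v, rescaled to unit L1 norm.

    Assumed update rule for illustration; D is the diagonal matrix of row sums.
    """
    v_next = implicit_similarity_matvec(F, v) / degrees
    return v_next / np.abs(v_next).sum()


# Toy data (hypothetical): 6 "documents" with 4 nonnegative features each.
rng = np.random.default_rng(0)
F = rng.random((6, 4))
F /= np.linalg.norm(F, axis=1, keepdims=True)  # unit rows: F @ F.T holds cosine similarities

v = rng.random(6)
explicit = (F @ F.T) @ v                       # quadratic: builds the full 6x6 matrix
implicit = implicit_similarity_matvec(F, v)    # linear in the size of F

degrees = implicit_row_sums(F)
v_step = normalized_power_step(F, v, degrees)
```

On this toy input the explicit and implicit products agree up to floating-point error. With a sparse document-term matrix (for instance a SciPy sparse matrix, which supports the same `@` products), the implicit form would keep time and memory proportional to the number of nonzero term counts.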
