Spherical k-Means Clustering

Clustering text documents is a fundamental task in modern data analysis, requiring approaches that perform well in terms of both solution quality and computational efficiency. Spherical k-means clustering is one approach that addresses both requirements, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans, which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point algorithm, a genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). The performance of these solvers is investigated by means of a large-scale benchmark experiment.
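
To make the idea of prototype-based partitioning under cosine dissimilarity concrete, the following is a minimal pure-Python sketch of a fixed-point iteration for the standard spherical k-means problem. It is an illustration under stated assumptions, not the implementation used in the skmeans package: the function names (`normalize`, `cosine_dissimilarity`, `spherical_kmeans`), the random initialization from data points, the handling of empty clusters, and the toy term-weight matrix are all hypothetical choices made for this example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_dissimilarity(x, p):
    """Return 1 - cos(x, p), assuming x and p already have unit length."""
    return 1.0 - sum(a * b for a, b in zip(x, p))


def spherical_kmeans(rows, k, n_iter=100, seed=0):
    """Alternate between assigning documents and renormalizing cluster sums.

    Assumption: prototypes are initialized by sampling k documents at random.
    """
    rng = random.Random(seed)
    data = [normalize(r) for r in rows]
    prototypes = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each document goes to its least dissimilar prototype.
        new_labels = [
            min(range(k), key=lambda j: cosine_dissimilarity(x, prototypes[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # fixed point reached: assignments no longer change
        labels = new_labels
        # Update step: the new prototype is the normalized sum of its members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old prototype
                prototypes[j] = normalize([sum(col) for col in zip(*members)])
    return labels, prototypes


# Toy term-weight matrix (rows are documents, columns are terms); made-up data.
docs = [
    [3, 1, 0, 0],
    [2, 1, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 1, 3],
    [0, 0, 2, 5],
    [0, 0, 1, 2],
]
labels, prototypes = spherical_kmeans(docs, k=2)
print(labels)
```

Because both documents and prototypes are kept at unit length, the cosine dissimilarity reduces to one minus an inner product, which is what makes this kind of iteration cheap on sparse term-weight representations. The sketch uses dense lists for readability only.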
