Paper Information - Spherical k-Means Clustering

Spherical k-Means Clustering

Clustering text documents is a fundamental task in modern data analysis, requiring approaches that perform well in terms of both solution quality and computational efficiency. Spherical k-means clustering is one approach that addresses both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans, which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point algorithm, a genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). The performance of these solvers is investigated by means of a large-scale benchmark experiment.
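To make the idea of prototype-based partitioning under cosine dissimilarity concrete, the following is a minimal Python sketch of a fixed-point iteration for the standard spherical k-means problem. It is an illustration only and does not reproduce the skmeans package or its interface; the function names (`normalize`, `cosine_dissimilarity`, `spherical_kmeans`) and the `init_indices` parameter are hypothetical, and the tiny dense vectors stand in for what would normally be sparse term weight representations.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_dissimilarity(x, p):
    """Return 1 - cos(x, p), assuming x and p already have unit length."""
    return 1.0 - sum(a * b for a, b in zip(x, p))


def spherical_kmeans(docs, k, init_indices=None, max_iter=100, seed=0):
    """Alternate between assigning each document to its closest prototype
    and replacing each prototype by the normalized sum of its members."""
    data = [normalize(d) for d in docs]
    if init_indices is None:
        # Assumption for this sketch: start from k randomly chosen documents.
        init_indices = random.Random(seed).sample(range(len(data)), k)
    prototypes = [list(data[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: cosine_dissimilarity(x, prototypes[j]))
            for x in data
        ]
        if new_labels == labels:  # assignments are stable: fixed point reached
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous prototype
                prototypes[j] = normalize([sum(col) for col in zip(*members)])
    return labels, prototypes


# Toy term-weight vectors: the first three point along one direction,
# the last three along another; vector length plays no role.
docs = [[3, 0, 0.1], [5, 0.2, 0], [1, 0, 0],
        [0, 2, 0.1], [0.1, 4, 0], [0, 1, 0]]
labels, prototypes = spherical_kmeans(docs, k=2, init_indices=[0, 3])
print(labels)
```

Because documents and prototypes are normalized to unit length, only the direction of each term weight vector matters. Like any local iteration of this kind, the result depends on the starting prototypes, which is one reason to compare several solvers.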
