Recent Developments in Document Clustering

This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.

[1]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[2]  Joydeep Ghosh,et al.  Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres , 2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290).

[3]  Vipin Kumar,et al.  Multilevel Algorithms for Multi-Constraint Graph Partitioning , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[4]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[5]  George Karypis,et al.  Hierarchical Clustering Algorithms for Document Datasets , 2005, Data Mining and Knowledge Discovery.

[6]  Mohamed S. Kamel,et al.  Efficient phrase-based document indexing for Web document clustering , 2004, IEEE Transactions on Knowledge and Data Engineering.

[7]  Chris H. Q. Ding,et al.  NMF and PLSI: equivalence and a hybrid algorithm , 2006, SIGIR '06.

[8]  Yoshi Gotoh DIMENSIONALITY REDUCTION TECHNIQUES FOR SEARCH RESULTS CLUSTERING , 2004 .

[9]  Joachim M. Buhmann,et al.  A Resampling Approach to Cluster Validation , 2002, COMPSTAT.

[10]  Xin Liu,et al.  Document clustering with cluster refinement and model selection capabilities , 2002, SIGIR '02.

[11]  Derek Greene,et al.  Practical solutions to the problem of diagonal dominance in kernel document clustering , 2006, ICML.

[12]  Bernhard Schölkopf,et al.  A Kernel Approach for Learning from Almost Orthogonal Patterns , 2002, European Conference on Principles of Data Mining and Knowledge Discovery.

[13]  L. Sacks,et al.  Evaluating fuzzy clustering for relevance-based information access , 2003, The 12th IEEE International Conference on Fuzzy Systems, 2003. FUZZ '03..

[14]  Geoffrey E. Hinton,et al.  A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[15]  Shi Zhong,et al.  Efficient online spherical k-means clustering , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[16]  Xiaohua Hu,et al.  A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE , 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[17]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[18]  Vladimir Estivill-Castro,et al.  Why so many clustering algorithms: a position paper , 2002, SKDD.

[19]  Derek Greene,et al.  Producing Accurate Interpretable Clusters from High-Dimensional Data , 2005, PKDD.

[20]  William-Chandra Tjhi,et al.  Fuzzy co-clustering of Web documents , 2005, 2005 International Conference on Cyberworlds (CW'05).

[21]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[22]  Joydeep Ghosh,et al.  Scalable, Balanced Model-based Clustering , 2003, SDM.

[23]  Stefan M. Wild,et al.  Improving non-negative matrix factorizations through structured initialization , 2004, Pattern Recognit..

[24]  Inderjit S. Dhillon,et al.  Kernel k-means: spectral clustering and normalized cuts , 2004, KDD.

[25]  Rong Zhang,et al.  A large scale clustering scheme for kernel K-Means , 2002, Object recognition supported by user interaction for service robots.

[26]  Steffen Staab,et al.  WordNet improves text document clustering , 2003, SIGIR 2003.

[27]  Dell Zhang,et al.  Semantic, Hierarchical, Online Clustering of Web Search Results , 2004, APWeb.

[28]  Joydeep Ghosh,et al.  CLUMP: a scalable and robust framework for structure discovery , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[29]  Chaitanya Swamy,et al.  Correlation Clustering: maximizing agreements via semidefinite programming , 2004, SODA '04.

[30]  Roberto Basili,et al.  Complex Linguistic Features for Text Classification: A Comprehensive Study , 2004, ECIR.

[31]  Takeo Kanade,et al.  Discriminative cluster analysis , 2006, ICML.

[32]  Carlos Ordonez,et al.  FREM: fast and robust EM clustering for large data sets , 2002, CIKM '02.

[33]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[34]  Sven Meyer,et al.  The Suffix Tree Document Model Revisited , 1992 .

[35]  Doheon Lee,et al.  Evaluation of the performance of clustering algorithms in kernel-induced feature space , 2005, Pattern Recognit..

[36]  Chris H. Q. Ding,et al.  A min-max cut algorithm for graph partitioning and data clustering , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[37]  Jianbo Shi,et al.  Multiclass spectral clustering , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[38]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[39]  Greg Hamerly,et al.  Alternatives to the k-means algorithm that find better clusterings , 2002, CIKM '02.

[40]  Dawid Weiss,et al.  Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition , 2004, Intelligent Information Systems.

[41]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[42]  Santosh S. Vempala,et al.  A divide-and-merge methodology for clustering , 2005, PODS '05.

[43]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[44]  Mohamed S. Kamel,et al.  CorePhrase: Keyphrase Extraction for Document Clustering , 2005, MLDM.

[45]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[46]  George Karypis,et al.  Soft clustering criterion functions for partitional document clustering: a summary of results , 2004, CIKM '04.

[47]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[48]  M. Aldenderfer Cluster Analysis , 1984 .

[49]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[50]  Raghu Krishnapuram,et al.  Fuzzy co-clustering of documents and keywords , 2003, The 12th IEEE International Conference on Fuzzy Systems, 2003. FUZZ '03..

[51]  David A. Hull Stemming algorithms: a case study for detailed evaluation , 1996 .

[52]  C. Ding,et al.  On the Equivalence of Nonnegative Matrix Factorization and K-means - Spectral Clustering , 2005 .

[53]  Inderjit S. Dhillon,et al.  Clustering on the Unit Hypersphere using von Mises-Fisher Distributions , 2005, J. Mach. Learn. Res..

[54]  Oren Etzioni,et al.  Web document clustering: a feasibility demonstration , 1998, SIGIR '98.

[55]  Chris H. Q. Ding,et al.  K-means clustering via principal component analysis , 2004, ICML.

[56]  Santosh S. Vempala,et al.  On clusterings-good, bad and spectral , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[57]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.