Recent Developments in Document Clustering

This report aims to give a brief overview of the current state of document clustering research and present recent developments in a well-organized manner. Clustering algorithms are considered with two hypothetical scenarios in mind: online query clustering with tight efficiency constraints, and offline clustering with an emphasis on accuracy. A comparative analysis of the algorithms is performed along with a table summarizing important properties, and open problems as well as directions for future research are discussed.

Edward A. Fox | Nicholas Andrews
