Document Clustering: A Detailed Review

Document clustering is the automatic organization of documents into clusters such that documents within a cluster are highly similar to one another and dissimilar to documents in other clusters. It has been studied intensively because of its wide applicability in areas such as web mining, search engines, and information retrieval. The task involves measuring the similarity between documents and grouping similar documents together. Clustering provides an efficient representation and visualization of a document collection and thus eases navigation. In this paper, we give an overview of the document clustering methods studied in recent years, from basic traditional methods to fuzzy-based, genetic, co-clustering, and heuristic-oriented approaches. We also explain the document clustering procedure, including the feature selection process, along with applications, challenges, similarity measures, and the evaluation of document clustering algorithms.
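As a rough illustration of the two steps named above (measuring similarity between documents, then grouping similar ones), the following minimal sketch uses raw term-frequency vectors, cosine similarity, and a greedy single-pass grouping rule. It is not a method taken from the reviewed literature; the function names, the whitespace tokenizer, the sample documents, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumes whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(docs, threshold=0.5):
    """Greedy single-pass grouping (illustrative only).

    Each document joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    vectors = [tf_vector(d) for d in docs]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


# Hypothetical sample documents.
docs = [
    "web search engine ranking",
    "search engine ranking for web pages",
    "genetic algorithm crossover mutation",
    "mutation and crossover in a genetic algorithm",
]
print(group_documents(docs))  # [[0, 1], [2, 3]]
```

A practical system would normally replace the raw counts with weighted features (for example TF-IDF) after feature selection, and the greedy rule with one of the clustering algorithms surveyed in the paper; the sketch only shows how a similarity measure drives the grouping.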
