Efficient incremental phrase-based document clustering

Document clustering has become essential for applications that extract information from very large corpora. Such applications face two main challenges: representing documents efficiently, together with an efficient similarity measure, and coping with the dynamic nature of the corpus. This paper introduces an efficient document clustering model that stores and updates the clusters of a dataset incrementally. A new phrase-based similarity method is developed alongside the model to calculate the similarity between documents and clusters. Experimental results show that the new clustering model achieves more accurate results than traditional algorithms.
