Paper information - A word-based soft clustering algorithm for documents


Document clustering is an important tool for applications such as Web search engines. It gives the user a good overall view of the information contained in the documents. However, existing algorithms have notable drawbacks: hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually inefficient. We propose WBSC (Word-based Soft Clustering), an efficient soft clustering algorithm based on a given similarity measure. WBSC uses a hierarchical approach to cluster documents that share similar words. Compared with existing hard clustering algorithms such as K-means and its variants, WBSC is both effective and efficient.
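To make the hard versus soft distinction concrete, here is a minimal Python sketch. It is not the WBSC algorithm, whose details the abstract does not give. It assumes documents and clusters are represented as word sets, uses Jaccard overlap as a stand-in for the "given similarity measure", and the function names, example data, and threshold are all hypothetical.

```python
# Illustrative sketch only: NOT the WBSC algorithm from the paper.
# Assumptions: word-set documents, Jaccard similarity, a fixed threshold.

def jaccard(a, b):
    """Jaccard similarity between two word sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hard_assign(doc_words, clusters):
    """Hard clustering: return the single best-matching cluster label."""
    return max(clusters, key=lambda name: jaccard(doc_words, clusters[name]))


def soft_assign(doc_words, clusters, threshold=0.25):
    """Soft clustering: return every cluster label at or above the threshold."""
    return sorted(
        name
        for name, words in clusters.items()
        if jaccard(doc_words, words) >= threshold
    )


# Hypothetical clusters, each described by a set of characteristic words.
clusters = {
    "sports": {"team", "match", "score", "player"},
    "finance": {"market", "stock", "price", "investor"},
}

# A document that touches both themes.
doc = {"team", "player", "match", "stock", "price"}

print(hard_assign(doc, clusters))        # one label only
print(soft_assign(doc, clusters, 0.25))  # every label above the threshold
```

In this sketch the hard assignment keeps only the strongest theme, while the soft assignment reports both themes of the document, which is the limitation of hard clustering that the abstract points out.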

Ravikumar Kondadadi | King-Ip Lin
