A word-based soft clustering algorithm for documents

Document clustering is an important tool for applications such as Web search engines, because it gives the user a good overview of the information contained in a set of documents. However, existing algorithms have significant shortcomings: hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually inefficient. We propose WBSC (Word-Based Soft Clustering), an efficient soft clustering algorithm based on a given similarity measure. WBSC uses a hierarchical approach to cluster documents that share similar words. Compared with existing hard clustering algorithms such as K-means and its variants, WBSC is both effective and efficient.
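The difference between hard and soft assignment can be illustrated with a minimal sketch. This is not the WBSC algorithm itself, which the abstract does not specify in detail. The Jaccard word-overlap similarity, the threshold value, the function names, and the example clusters are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Word-overlap similarity between two word collections (an assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hard_assign(doc_words, cluster_words):
    """Hard clustering: the document goes to the single most similar cluster."""
    return max(cluster_words, key=lambda name: jaccard(doc_words, cluster_words[name]))


def soft_assign(doc_words, cluster_words, threshold=0.25):
    """Soft clustering: the document joins every cluster that is similar enough."""
    return sorted(
        name
        for name, words in cluster_words.items()
        if jaccard(doc_words, words) >= threshold
    )


# Hypothetical clusters, each described by a set of characteristic words.
clusters = {
    "sports": {"match", "team", "score", "league"},
    "finance": {"market", "stock", "price", "bank"},
}

# A document that touches both themes.
doc = {"team", "score", "league", "stock", "price"}

print(hard_assign(doc, clusters))        # only the best-matching theme
print(soft_assign(doc, clusters, 0.25))  # every theme above the threshold
```

In this sketch the hard assignment keeps only the dominant theme of the document, while the soft assignment also reports the secondary theme, which is the behavior the abstract argues for.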
