Clustering text documents is an important data mining problem with a wide range of applications. However, many clustering approaches fail to achieve high clustering quality because of the complexity of document semantics. Semantic smoothing, which has been widely studied in the field of Information Retrieval, has recently been proposed as an efficient solution. However, existing semantic smoothing methods are not effective for partitional clustering. In this paper, based on the principle of the TF*IDF scheme, we propose an improved semantic smoothing method that is suitable for both agglomerative and partitional clustering. Experimental results show that our method is more effective than previous methods in terms of cluster quality.
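As background for the TF*IDF principle mentioned above, the textbook weighting can be sketched as follows. This is only the standard scheme (term frequency times log inverse document frequency), not the improved semantic smoothing method proposed in the paper; the function name `tf_idf` and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook TF*IDF: relative term frequency times log(N / df).
    Illustrative only; not the smoothing method of the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    "gene protein expression".split(),
    "protein cluster analysis".split(),
    "document cluster analysis".split(),
]
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document receives weight zero, while a term confined to a single document receives the largest weight for its frequency, which is the discriminative behaviour a TF*IDF-style scheme is meant to capture.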