The essential difference between topic detection and text clustering lies in the distribution and the temporal characteristics of the news corpus. Topic detection should therefore be studied with respect to the news corpus, and the news corpus itself deserves in-depth and extensive research. The vector space model (VSM) is one of the simplest and most effective topic representation models, and K-means is a well-known and widely used partitional clustering method. We therefore conduct a topic detection experiment to study how the news corpus and K-means affect topic detection. From this experiment we derive the patterns by which they affect topic detection and summarize their optimal values. Finally, TDT evaluation methods show that the best overall topic detection performance on a large-scale corpus is 38.378% higher than that on a small-scale corpus. This experiment shows that topic detection based on K-means is well suited to large-scale data.
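As a rough illustration of the pipeline named above (VSM representation followed by K-means), the following minimal sketch builds TF-IDF vectors and clusters them. It is not the authors' implementation: the smoothed TF-IDF weighting, the cosine-based (spherical) variant of K-means, the seeding of centroids with the first k documents, and the toy headlines are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to unit-length sparse TF-IDF vectors (VSM)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumed weighting, not taken from the paper).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=20):
    """Spherical K-means: assign each vector to the most similar centroid.

    Centroids are seeded with the first k vectors (assumed, for determinism).
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] += w
            if not total:
                continue  # keep the old centroid for an empty cluster
            norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
            centroids[j] = {t: w / norm for t, w in total.items()}
    return labels


# Hypothetical toy "news" headlines covering two topics.
docs = [
    "earthquake rescue teams search rubble".split(),
    "election vote count candidates debate".split(),
    "earthquake aftershock rubble rescue".split(),
    "candidates debate before election vote".split(),
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)  # [0, 1, 0, 1]
```

In a study like the one described, the number of clusters and the corpus size would be the quantities varied; here both are fixed only to keep the sketch small.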
[1] Tao Wang, et al. The key technology of topic detection based on K-means. 2010 International Conference on Future Information Technology and Management Engineering, 2010.
[2] Li Hui. Research on the Algorithm of Feature Selection Based on Difference and Multiple Features. 2009.
[3] Li Xinwu. Research on Text Clustering Algorithm Based on K_means and SOM. 2008 International Symposium on Intelligent Information Technology Application Workshops, 2008.
[4] Moustafa Ghanem, et al. A novel refinement approach for text categorization. CIKM '05, 2005.
[5] Hui Xiong, et al. K-means clustering versus validation measures: a data distribution perspective. KDD '06, 2006.