HSPKNN: An effective and practical framework for hot topic detection of Internet news

With the rapid growth of information on the Internet, many Single-Pass based clustering methods are used in topic detection and tracking (TDT) because Single-Pass processes data incrementally. These methods compute similarities between the feature vector of each news report and the cluster centers of historical topics, so the accuracy of TDT suffers when the cluster centers cannot precisely represent the topics. To overcome this shortcoming, this paper proposes HSPKNN, an effective and practical framework for hot topic detection of Internet news. First, the news report stream is partitioned into segments by a time window. Next, an agglomerative hierarchical clustering algorithm is used to acquire candidate topics. Finally, an algorithm fusing Single-Pass and KNN detects topics from the candidate topics. Furthermore, to help users understand what each topic discusses, an algorithm that generates descriptive labels for the detected topics is proposed. Experimental results show that the proposed framework outperforms both Single-Pass based methods and agglomerative hierarchical clustering based methods for TDT. In addition, the proposed framework has been used in the TDT module of an application system. Both the experimental results and the application system demonstrate the effectiveness and practicality of the proposed framework.
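
For orientation, the baseline Single-Pass procedure that the abstract contrasts against can be sketched as follows. This is a minimal illustration of generic Single-Pass clustering, not the proposed HSPKNN framework. The sparse dictionary representation of feature vectors, the use of cosine similarity, the running-mean center update, the threshold value, and the names `cosine` and `single_pass` are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(vectors, threshold=0.5):
    """Assign each report, in arrival order, to the most similar topic center.

    A report joins the best-matching topic when the similarity reaches the
    (assumed) threshold; otherwise it starts a new topic. Each center is kept
    as the running mean of the vectors assigned to it.
    """
    centers, sizes, labels = [], [], []
    for vec in vectors:
        best, best_sim = -1, 0.0
        for i, center in enumerate(centers):
            sim = cosine(vec, center)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            n = sizes[best]
            center = centers[best]
            # Update the running mean over the union of terms.
            for term in set(center) | set(vec):
                center[term] = (center.get(term, 0.0) * n + vec.get(term, 0.0)) / (n + 1)
            sizes[best] = n + 1
            labels.append(best)
        else:
            centers.append(dict(vec))
            sizes.append(1)
            labels.append(len(centers) - 1)
    return labels, centers


if __name__ == "__main__":
    # Toy reports with made-up term weights.
    reports = [
        {"quake": 1.0, "rescue": 1.0},
        {"quake": 1.0, "aid": 1.0},
        {"election": 1.0, "vote": 1.0},
        {"vote": 1.0, "ballot": 1.0},
    ]
    labels, centers = single_pass(reports, threshold=0.4)
    print(labels)
```

Because every assignment depends only on the current centers, one early misassignment shifts a center and influences all later decisions. This is the sensitivity to imprecise cluster centers that the abstract identifies as the shortcoming of Single-Pass based methods.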