News topic detection based on hierarchical clustering and named entity

News topic detection is the process of organizing collections of news stories and real-time news or broadcast streams into news topics. Unlike traditional text analysis, it is an incremental clustering process, and it is generally divided into retrospective topic detection and online topic detection. This paper considers how the features of modern news data have changed over time and presents a new topic detection strategy based on hierarchical clustering and named entities. The detection process is likewise divided into retrospective and online steps, and the named entities in news stories are used in the topic clustering algorithm. To make the online step efficient and precise, the method first clusters the news stories in each time window into micro-clusters and then extracts three representation vectors from each micro-cluster to calculate its similarity to existing topics. The experimental results show a remarkable improvement over the most widely applied recent topic detection method.
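The online step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not say what the three representation vectors are, so the fields `terms`, `entities`, and `title`, the weights in `WEIGHTS`, the cosine measure, the greedy pairwise merge, and both thresholds are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical choice of the three representation vectors and their weights;
# the abstract does not specify either.
FIELDS = ("terms", "entities", "title")
WEIGHTS = {"terms": 0.4, "entities": 0.4, "title": 0.2}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vectors(story):
    """Build one count vector per field from a story given as token lists."""
    return {f: Counter(story[f]) for f in FIELDS}


def merge(a, b):
    """Combine two clusters by summing their count vectors field by field."""
    return {f: a[f] + b[f] for f in FIELDS}


def similarity(a, b):
    """Weighted sum of the per-field cosine similarities."""
    return sum(WEIGHTS[f] * cosine(a[f], b[f]) for f in FIELDS)


def micro_clusters(stories, threshold):
    """Agglomeratively merge the stories of one time window.

    The most similar pair is merged repeatedly until no pair reaches
    the threshold (an assumed stopping rule).
    """
    clusters = [vectors(s) for s in stories]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]]),
        )
        if similarity(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = merge(clusters[i], clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def assign(micro, topics, threshold):
    """Attach each micro-cluster to its closest topic or start a new topic."""
    for m in micro:
        scores = [similarity(m, t) for t in topics]
        if scores and max(scores) >= threshold:
            best = scores.index(max(scores))
            topics[best] = merge(topics[best], m)
        else:
            topics.append(m)
    return topics
```

Each story is passed as a dictionary of token lists, one list per field. Calling `assign(micro_clusters(window, 0.5), topics, 0.5)` once per time window keeps `topics` up to date, and comparing micro-clusters rather than individual stories against the existing topics reduces the number of similarity computations in the online step.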
