CDW: A text clustering model for diverse versions discovery

The development of information technology brings a flood of online news and event reports into daily life. One consequence of this information explosion is that a single incident is often described in several conflicting ways, which confuses readers. Although previous research has produced many algorithms for detecting and tracking events, little of it focuses on uncovering the diverse versions of a single event. In this paper, we propose a novel algorithm that discovers the different versions of an event from news reports. We map documents to a topic layer to obtain information about each topic, then extract the highly differentiated words of each topic and use them to cluster the documents. Experiments on two data sets show that our algorithm is effective and achieves higher accuracy than various related algorithms, including classical methods such as K-means and LDA.
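The pipeline outlined in the abstract (topic layer, then highly differentiated words, then clustering) can be illustrated with a minimal sketch. This is not the CDW algorithm itself, whose details are not given here. The sketch assumes that a rough document-to-topic assignment is already available (for example from a topic model), that "highly differentiated" means a word is much more probable in one topic than in any other, and that a document joins the version whose keywords it mentions most often. All function names, the smoothing constant, and the toy data are hypothetical.

```python
from collections import Counter

def topic_word_probs(docs, doc_topics, n_topics, smoothing=1.0):
    """Estimate P(word | topic) from a hard document-to-topic assignment
    (assumed input; a topic model could supply it)."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter() for _ in range(n_topics)]
    for doc, t in zip(docs, doc_topics):
        counts[t].update(doc)
    probs = []
    for c in counts:
        total = sum(c.values()) + smoothing * len(vocab)
        probs.append({w: (c[w] + smoothing) / total for w in vocab})
    return probs

def differentiated_words(probs, top_k):
    """For each topic, keep the top_k words with the largest ratio between
    their probability in this topic and their best probability elsewhere.
    Requires at least two topics. This ratio is an illustrative choice."""
    keywords = []
    for t, p in enumerate(probs):
        def score(w):
            best_other = max(q[w] for u, q in enumerate(probs) if u != t)
            return p[w] / best_other
        ranked = sorted(p, key=lambda w: (-score(w), w))  # ties: alphabetical
        keywords.append(set(ranked[:top_k]))
    return keywords

def cluster(docs, keywords):
    """Assign each document to the version whose keywords it mentions most."""
    labels = []
    for doc in docs:
        hits = [sum(1 for w in doc if w in kw) for kw in keywords]
        labels.append(max(range(len(keywords)), key=lambda t: hits[t]))
    return labels

# Toy corpus: four tokenized reports on one event with two competing versions.
docs = [
    ["fire", "factory", "accident", "gas"],
    ["fire", "factory", "gas", "leak"],
    ["fire", "factory", "arson", "suspect"],
    ["fire", "arson", "police", "suspect"],
]
doc_topics = [0, 0, 1, 1]  # hypothetical initial topic assignment

probs = topic_word_probs(docs, doc_topics, n_topics=2)
keywords = differentiated_words(probs, top_k=2)
labels = cluster(docs, keywords)
new_label = cluster([["gas", "explosion", "factory"]], keywords)[0]
```

Words shared by both versions, such as "fire", receive a ratio near 1 and are not selected, so the clustering step relies only on the vocabulary that separates one version from the other.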
