Automatic online news issue construction in web environment

Rather than running a keyword search, people often turn to the Internet simply to see what is going on. This calls for integrated, comprehensive information on news topics, which we call news issues, covering the background, history, current progress, differing opinions and discussions, and so on. Traditionally, news issues are compiled manually by website editors, which is time-consuming and makes real-time updating difficult. In this paper, we propose a three-step automatic online algorithm for news issue construction. The first step is topic detection, in which newly appearing stories are clustered into new topic candidates. The second step is topic tracking, in which those candidates are compared with previous topics and either merged into an existing topic or used to create a new one. In the final step, news issues are constructed by combining related topics and updated by inserting new topics. We simulate an automatic online news issue construction process under practical Web conditions to run the experiments. The best F-measure is above 90% for topic detection and close to 90% for topic detection and tracking. Four news issue construction results are generated at different time granularities: one meets needs such as "what's new", and the other three answer questions such as "what's hot" or "what's going on". With the proposed algorithm, news issues can be constructed effectively and automatically with real-time updates, relieving editors of much tedious manual work.
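
The three steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the features, similarity measure, or thresholds, so the bag-of-words vectors, cosine similarity, single-pass clustering, threshold values, and all function names below are assumptions chosen only to make the control flow concrete.

```python
import math
from collections import Counter

def vectorize(text):
    # Assumption: plain term-frequency bag of words stands in for the
    # (unspecified) story representation.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def merge_into(clusters, vec, members, threshold):
    # Single-pass step: join the most similar cluster if its similarity
    # reaches the threshold, otherwise open a new cluster.
    best, best_sim = None, threshold
    for i, cluster in enumerate(clusters):
        sim = cosine(cluster["centroid"], vec)
        if sim >= best_sim:
            best, best_sim = i, sim
    if best is None:
        clusters.append({"centroid": Counter(vec), "members": list(members)})
        return len(clusters) - 1
    clusters[best]["centroid"].update(vec)  # Counter.update adds counts
    clusters[best]["members"].extend(members)
    return best

def detect_topics(stories, threshold=0.5):
    # Step 1: cluster newly appearing stories into topic candidates.
    candidates = []
    for story in stories:
        merge_into(candidates, vectorize(story), [story], threshold)
    return candidates

def track_topics(topics, candidates, threshold=0.4):
    # Step 2: merge each candidate into a previous topic or start a new one.
    for cand in candidates:
        merge_into(topics, cand["centroid"], cand["members"], threshold)
    return topics

def build_issues(topics, threshold=0.2):
    # Step 3: group related topics into news issues. For simplicity this
    # rebuilds the issues from scratch instead of inserting incrementally.
    issues = []
    for topic in topics:
        merge_into(issues, topic["centroid"], [topic], threshold)
    return issues

# Hypothetical usage with two batches of made-up headlines.
topics = []
batch1 = ["earthquake hits city rescue teams",
          "earthquake rescue teams city arrive",
          "election vote results announced"]
batch2 = ["earthquake city rescue continues"]
for batch in (batch1, batch2):
    track_topics(topics, detect_topics(batch))
issues = build_issues(topics)
```

In this sketch the second batch is folded into the existing earthquake topic rather than creating a new one, which mirrors the merge-or-create decision of the tracking step. A looser threshold is used at the issue level on the assumption that related topics overlap less than stories within one topic.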
