o-HETM: An Online Hierarchical Entity Topic Model for News Streams

Nowadays, with the development of the Internet, the large volume of continuously streaming news has become overwhelming to the public. Constructing a dynamic topic hierarchy that organizes news articles by topics at multiple granularities lets users find what interests them as quickly as possible. However, this is nontrivial because news data are streaming and time-sensitive. To address these challenges, we propose a Hierarchical Entity Topic Model (HETM), which accounts for the timeliness of news data and for the importance of named entities in conveying the who/when/where of news articles. In addition, we propose online HETM (o-HETM), which equips HETM with a fast online inference algorithm to adapt it to streaming news. To make topics easier to understand, we extract key sentences for each topic to form a summary. Extensive experimental results demonstrate that HETM significantly improves topic quality and time efficiency compared to the state-of-the-art method HLDA (Hierarchical Latent Dirichlet Allocation). Moreover, o-HETM, with its online inference algorithm, further improves time efficiency substantially and is thus applicable to streaming news.
