Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models

Topic detection (TD) is a fundamental research issue in the Topic Detection and Tracking (TDT) community with practical implications: TD helps analysts separate the wheat from the chaff among thousands of incoming news streams. In this paper, we propose a simple and effective topic detection model called the temporal Discriminative Probabilistic Model (DPM), which is shown to be theoretically equivalent to the classic vector space model with feature selection and temporally discriminative weights. We compare DPM to its various probabilistic cousins, ranging from mixture models like von Mises-Fisher (vMF) to mixed membership models like Latent Dirichlet Allocation (LDA). Benchmark results on the TDT3 data set show that sophisticated models, such as vMF and LDA, do not necessarily lead to better results; in the case of LDA, notably worse performance was obtained under variational inference, which is likely due to the large number of LDA model parameters involved in document-level topic detection. In contrast, a relatively simple time-aware probabilistic model such as DPM suffices for both offline and online topic detection tasks, making DPM a theoretically elegant and effective model for practical topic detection.
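
To make the idea of a time-aware vector space model concrete, the following is a minimal sketch of single-pass online topic detection. It is not the DPM formulation: the exponential time decay, the raw term-frequency vectors, and the names `detect_topics`, `threshold`, and `decay_days` are all assumptions chosen for illustration, and feature selection is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def detect_topics(docs, threshold=0.3, decay_days=7.0):
    """Assign each document to a topic in a single pass.

    docs: list of (day, tokens) pairs sorted by day.
    threshold, decay_days: hypothetical tuning parameters (assumptions).
    Returns one topic index per document.
    """
    topics = []  # each topic keeps a term-count centroid and its last-seen day
    labels = []
    for day, tokens in docs:
        vec = Counter(tokens)
        best, best_score = None, 0.0
        for i, topic in enumerate(topics):
            # Assumed temporal weighting: similarity fades as the topic ages.
            decay = math.exp(-(day - topic["last_day"]) / decay_days)
            score = cosine(vec, topic["centroid"]) * decay
            if score > best_score:
                best, best_score = i, score
        if best is None or best_score < threshold:
            # No sufficiently similar recent topic: start a new one.
            topics.append({"centroid": Counter(vec), "last_day": day})
            labels.append(len(topics) - 1)
        else:
            topics[best]["centroid"].update(vec)
            topics[best]["last_day"] = day
            labels.append(best)
    return labels


docs = [
    (0, ["quake", "japan", "tsunami"]),
    (1, ["quake", "japan", "rescue"]),
    (1, ["election", "vote", "senate"]),
    (30, ["quake", "japan", "tsunami"]),
]
print(detect_topics(docs))  # [0, 0, 1, 2]
```

In this toy run the last story shares its vocabulary with the first topic but arrives 29 days later, so the decayed similarity falls below the threshold and it opens a new topic. This is the kind of temporal discrimination that a purely content-based vector space model would miss.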
