Paper Information - A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and we use a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate the hyperparameters from a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
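To make the ingredients named in the abstract concrete, the following is a minimal sketch of how a Dirichlet process prior and a Dirichlet-multinomial cluster model can combine for online clustering and novelty detection. It is not the authors' implementation. The class name `OnlineDPClusterer`, the parameters `concentration` and `prior_strength`, the hard (most probable) cluster assignment, and the toy uniform base distribution standing in for a general English language model are all assumptions made for illustration. The hyperparameters are fixed by hand here rather than estimated with empirical Bayes.

```python
import math
from collections import Counter


class OnlineDPClusterer:
    """Toy online clusterer: Chinese-restaurant-process prior over clusters,
    Dirichlet-multinomial predictive likelihood for each cluster (assumed setup)."""

    def __init__(self, base_probs, concentration=1.0, prior_strength=1.0):
        # Base distribution scaled into Dirichlet pseudo-counts (hypothetical scaling).
        self.base = {w: prior_strength * p for w, p in base_probs.items()}
        self.concentration = concentration
        self.clusters = []  # per-cluster word counts
        self.sizes = []     # number of documents per cluster

    def _log_predictive(self, doc, cluster_counts):
        # Dirichlet-multinomial log probability of the document's word counts,
        # omitting the multinomial coefficient (identical for every cluster).
        params = {w: a + cluster_counts.get(w, 0) for w, a in self.base.items()}
        total = sum(params.values())
        n = sum(doc.values())
        log_p = math.lgamma(total) - math.lgamma(total + n)
        for w, c in doc.items():
            log_p += math.lgamma(params[w] + c) - math.lgamma(params[w])
        return log_p

    def add(self, tokens):
        # Words outside the base vocabulary are ignored in this sketch.
        doc = Counter(t for t in tokens if t in self.base)
        # Existing cluster: prior weight proportional to its size.
        scores = [
            math.log(size) + self._log_predictive(doc, counts)
            for size, counts in zip(self.sizes, self.clusters)
        ]
        # New cluster: prior weight proportional to the concentration parameter,
        # likelihood computed from the base distribution alone.
        scores.append(math.log(self.concentration) + self._log_predictive(doc, {}))
        best = max(range(len(scores)), key=scores.__getitem__)
        is_novel = best == len(self.clusters)
        if is_novel:
            self.clusters.append(Counter())
            self.sizes.append(0)
        self.clusters[best].update(doc)
        self.sizes[best] += 1
        return best, is_novel


vocab = ["quake", "tremor", "rescue", "election", "vote", "ballot"]
model = OnlineDPClusterer({w: 1.0 / len(vocab) for w in vocab})
stream = [
    ["quake", "tremor", "rescue", "quake"],
    ["quake", "rescue", "tremor"],
    ["election", "vote", "ballot", "vote"],
]
results = [model.add(doc) for doc in stream]
print(results)
```

Each call to `add` returns the chosen cluster index and a flag saying whether the document opened a new cluster, which is the novelty decision in this sketch. A fuller treatment would keep the posterior probability of the new-cluster option as a continuous novelty score instead of committing to a hard assignment.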
