Incorporating topic transition in topic detection and tracking algorithms

Topics often transition from one document to another within a document collection. To improve the accuracy of topic detection and tracking (TDT) algorithms in discovering topics or classifying documents, this topic transition information should be fully exploited. However, TDT algorithms usually find topics with topic models such as LDA and pLSI. These are mixture models, which makes topic transition difficult to represent and implement. A topic transition model based on the hidden Markov model is presented, and learning topic transitions from documents is discussed. Based on the model, two TDT algorithms that incorporate topic transition, one for topic discovery and one for document classification, are provided to show how the proposed model can be applied. Experiments with the two algorithms on two real-world document collections, together with a performance comparison against other similar algorithms, show that accuracy reaches 93% for topic discovery on Reuters-21578 and 97.3% for document classification. Furthermore, the topic transitions discovered by the algorithm on a dataset collected from a BBS website are consistent with the results of manual analysis.
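To make the notion of a topic transition matrix concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that each document already carries a single observed topic label and that documents are ordered within a sequence (for example, posts in a thread), so the transition probabilities can be estimated by counting. In a true hidden Markov model the topics are latent and would be learned with a procedure such as Baum-Welch. The function name, the smoothing parameter, and the toy data are all hypothetical.

```python
from collections import Counter


def estimate_topic_transitions(sequences, num_topics, smoothing=1.0):
    """Estimate a row-stochastic topic transition matrix by counting.

    Assumption: topics are observed integer labels in [0, num_topics),
    one per document, which simplifies the hidden-state setting.
    """
    counts = Counter()
    for seq in sequences:
        # Count each adjacent (previous topic, next topic) pair.
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1

    matrix = []
    for i in range(num_topics):
        # Additive smoothing keeps unseen transitions at nonzero probability.
        row = [counts[(i, j)] + smoothing for j in range(num_topics)]
        total = sum(row)
        matrix.append([value / total for value in row])
    return matrix


# Hypothetical toy data: three ordered document sequences with topic labels.
threads = [[0, 0, 1, 2], [1, 1, 2, 2], [0, 1, 1, 2]]
A = estimate_topic_transitions(threads, num_topics=3)
for row in A:
    print([round(p, 3) for p in row])
```

Each row `A[i]` is an estimate of the distribution over the next topic given that the current document is about topic `i`. Mixture models such as LDA and pLSI have no direct counterpart to this matrix.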
