Variable space hidden Markov model for topic detection and analysis

Discovering topics in large document collections has recently become an important task. Most topic models treat a document as a word sequence, whether in discrete character form or in term-frequency form. However, the number of words varies greatly from one document to another, which causes several problems for current topic models in topic analysis. In addition, current topic models make topic transition analysis difficult. To overcome these deficiencies, a variable space hidden Markov model (VSHMM) is proposed to represent topics, and several operations based on space computation are presented. A hierarchical clustering algorithm that dynamically changes the number of components in the topic model is proposed to demonstrate the effectiveness of the VSHMM. A method for partitioning documents based on topic transitions is also presented. Experiments on a real-world dataset show that the VSHMM can improve accuracy while greatly reducing the algorithm's time complexity compared with an algorithm based on a current mixture model.
