Topic modeling based on a selective Zipf distribution

Automatically mining topics from a text corpus has become an important foundation for many topic analysis tasks, such as opinion recognition and Web content classification. Although a large number of topic models and topic mining methods have been proposed for different purposes and have shown success on topic analysis tasks, many applications still demand more accurate models and mining algorithms. A general criterion based on the computation of a Zipf fitness quantity is proposed to determine whether a topic description is well-formed. Based on this quantity, it is found that the popular Dirichlet prior on multinomial parameters cannot always produce well-formed topic descriptions. Hence, topic modeling based on LDA trained on a selective Zipf document set is proposed to improve the quality of the generated topic descriptions. Experiments on two standard text corpora, the AP dataset and Reuters-21578, show that modeling based on the selective Zipf distribution achieves better perplexity, i.e. a better ability to predict topics, while a topic extraction test on a collection of news documents about the recent financial crisis shows that the key words describing the topics are more meaningful and reasonable than those produced by traditional topic mining methods.
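The abstract does not give the exact definition of the Zipf fitness quantity, but the underlying idea can be sketched: rank the word frequencies of a topic description (or document) and measure how closely the rank-frequency curve follows a power law. The sketch below is a hypothetical illustration of such a quantity, using the Pearson correlation between log-rank and log-frequency; the paper's actual formula may differ.

```python
import math
from collections import Counter


def zipf_fitness(tokens):
    """Illustrative Zipf fitness quantity (not the paper's exact definition):
    absolute Pearson correlation between log-rank and log-frequency of the
    word counts in `tokens`. Values near 1 indicate a near-Zipfian (well-
    formed) distribution; values near 0 indicate a poor fit."""
    # Word frequencies sorted into rank order (most frequent first).
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        return 0.0  # a single word type carries no rank-frequency shape

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # e.g. all words equally frequent: no Zipfian decay
    return abs(cov / (sx * sy))
```

Under the selection scheme the abstract describes, documents whose fitness falls below a threshold would simply be excluded from the LDA training set, so that the Dirichlet prior is fit only to near-Zipfian data.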
