A comparative study of topic models for topic clustering of Chinese web news

Topic models are an increasingly useful tool for analyzing semantic-level meaning and capturing topical features. However, there is little research comparing topic models with one another. In this paper, we describe a comparative study of three topic models in the extrinsic application of topic clustering. A topic model distance is defined on the converged parameters of each topic model and is used for topic clustering. The topic models are then compared through the clustering results obtained from their corresponding topic distance matrices. A series of comparative experiments is carried out on a corpus of 5033 web news articles from 30 topics, with cosine distance as the baseline. Web page collections with different numbers of topics and documents are used in the experiments. The experimental results show that topic clustering using topic distance achieves better precision and recall on data sets containing related topics. Topic clustering using topic distance benefits from the topical features captured by the topic models. The complex topic model does provide more help than the simple topic model in topic clustering.
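The pipeline described above (a distance computed from fitted topic model parameters, a pairwise distance matrix, then clustering) can be illustrated with a minimal sketch. The abstract does not give the exact distance definition or the clustering algorithm, so everything specific here is an assumption: the per-document topic proportions `theta` are toy values rather than fitted parameters, Jensen-Shannon distance stands in for the topic model distance, and average-linkage agglomerative clustering stands in for the clustering step. The cosine distance function is included only to show where the baseline would plug in.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions.
    Assumed stand-in for the paper's topic model distance."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

def cosine_distance(u, v):
    """Baseline distance: one minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(rows, dist):
    """Pairwise distance matrix over a list of vectors."""
    n = len(rows)
    return [[dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]

def average_linkage(dmat, k):
    """Merge the two closest clusters (average linkage) until k remain.
    Assumed clustering step; the abstract does not name an algorithm."""
    clusters = [[i] for i in range(len(dmat))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dmat[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical per-document topic proportions (toy values, not fitted).
theta = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
]

clusters = average_linkage(distance_matrix(theta, js_distance), k=2)
baseline = average_linkage(distance_matrix(theta, cosine_distance), k=2)
```

In a real comparison, `theta` would come from each fitted topic model, the baseline would be computed on term vectors rather than on topic proportions, and the resulting clusters would be scored against the known topic labels with precision and recall.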
