Paper information - Optimizing Semantic Coherence in Topic Models

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. For people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) an analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).
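The abstract does not spell out the evaluation metric. As a minimal sketch, assuming the co-document-frequency form usually attributed to this paper (often called UMass coherence), a topic's score sums log((D(v_m, v_l) + 1) / D(v_l)) over ordered pairs of its top words, where D counts training documents containing the given word or words. The function name `topic_coherence` and the toy corpus below are hypothetical, chosen only for illustration.

```python
import math


def topic_coherence(top_words, documents):
    """Score a topic from document co-occurrence counts in the training data.

    top_words: a topic's words, ordered from most to least probable.
    documents: iterable of token lists (the training corpus itself).
    Assumed form: sum over m > l of log((D(v_m, v_l) + 1) / D(v_l)).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                # A top word absent from the corpus contributes nothing here;
                # this handling is an assumption of the sketch.
                continue
            co_freq = doc_freq(top_words[m], top_words[l])
            score += math.log((co_freq + 1) / denom)
    return score


# Hypothetical toy corpus: words that co-occur score higher than words that never do.
toy_docs = [
    ["gene", "dna", "cell"],
    ["gene", "dna"],
    ["cell", "protein"],
    ["stock", "market"],
]
coherent = topic_coherence(["gene", "dna"], toy_docs)
incoherent = topic_coherence(["gene", "market"], toy_docs)
```

Because the counts come only from the training documents, the sketch needs no human annotators and no external reference collection, which matches the property the abstract claims for the metric.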
