Incorporating Word Correlation Knowledge into Topic Modeling

This paper studies how to incorporate external word correlation knowledge to improve the coherence of topic modeling. Existing topic models assume words are generated independently and lack a mechanism to exploit the rich similarity relationships among words to learn coherent topics. To solve this problem, we build a Markov Random Field (MRF) regularized Latent Dirichlet Allocation (LDA) model, which defines an MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label. Under our model, the topic assignment of each word is not independent, but is instead affected by the topic labels of its correlated words. Similar words have a better chance of being placed in the same topic due to the regularization of the MRF, hence the coherence of topics can be boosted. In addition, our model accommodates the subtlety that whether two words are similar depends on which topic they appear in, which allows words with multiple senses to be assigned to different topics properly. We derive a variational inference method to infer the posterior probabilities and learn model parameters, and present techniques to deal with the hard-to-compute partition function in the MRF. Experiments on two datasets demonstrate the effectiveness of our model.
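To make the regularization idea concrete, the sketch below shows how an MRF coupling term could modify the per-token topic conditional in a collapsed-Gibbs-style update: the usual LDA factor is multiplied by exp(λ · #agreeing similar neighbors), so a word's probability of taking topic k rises when its correlated words already carry label k. This is an illustrative simplification of the idea, not the paper's actual variational inference procedure; all names (`topic_conditional`, `lam`, the count arrays) are hypothetical.

```python
import numpy as np

def topic_conditional(w, d, neighbor_topics, ndk, nkw, nk, alpha, beta, lam):
    """Conditional distribution over topics for one token of word w in doc d.

    ndk: doc-topic counts (docs x topics), nkw: topic-word counts (topics x vocab),
    nk:  total tokens per topic.
    neighbor_topics: current topic labels of words deemed similar to w.
    lam: MRF coupling strength; lam = 0 recovers the plain LDA conditional.
    """
    K, V = nkw.shape
    # Standard LDA collapsed conditional: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
    lda_part = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
    # MRF potential: reward topics that agree with the labels of similar words
    agree = np.array([np.sum(np.asarray(neighbor_topics) == k) for k in range(K)])
    p = lda_part * np.exp(lam * agree)
    return p / p.sum()
```

With symmetric counts and one neighbor labeled with topic 0, setting λ > 0 shifts probability mass toward topic 0 relative to λ = 0, which is exactly the "similar words share a topic" pressure the abstract describes.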
