Using Word Sense as a Latent Variable in LDA Can Improve Topic Modeling

Since it was proposed, LDA has been used successfully to model text documents. So far, words have been the common features used to induce latent topics, which are later used in document representation. Observation of documents indicates that polysemous words can make the latent topics less discriminative, resulting in less accurate document representation. We thus argue that semantically deterministic word senses can improve the quality of the latent topics. In this work, we propose a series of word-sense-aware LDA models that use word sense as an extra latent variable in topic induction. Preliminary experiments on benchmark datasets show that word sense can indeed improve topic modeling.
