Topic Models Incorporating Statistical Word Senses

LDA treats a surface word as identical across all documents and measures its contribution to each topic. However, a surface word may present different signatures in different contexts: polysemous words are used with different senses depending on where they appear. Intuitively, disambiguating word senses should enhance the discriminative capability of topic models. In this work, we propose a joint model that induces document topics and word senses simultaneously. Instead of relying on predefined word sense resources, we capture word sense information via a latent variable and induce it directly from the corpora in a fully unsupervised manner. Experimental results show that the proposed joint model significantly outperforms both classic LDA and a standalone sense-based LDA model in document clustering.
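To make the motivation concrete, the following minimal sketch shows how a bag-of-words representation of the kind LDA consumes maps every occurrence of a surface word to a single vocabulary id, so the two senses of "bank" become indistinguishable. The toy documents and the helper names `build_vocab` and `bag_of_words` are illustrative assumptions; this is not the joint model proposed in the paper.

```python
from collections import Counter

def build_vocab(docs):
    """Assign one integer id per distinct surface word (hypothetical helper)."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def bag_of_words(doc, vocab):
    """Count occurrences of each word id in a document (hypothetical helper)."""
    return Counter(vocab[token] for token in doc.lower().split())

# Toy corpus (an assumption for illustration): "bank" is used in a
# financial sense in the first document and a river sense in the second.
docs = [
    "the bank approved the loan",
    "the river bank flooded",
]

vocab = build_vocab(docs)
bows = [bag_of_words(doc, vocab) for doc in docs]

# Both senses share one id, so a model that only sees these counts
# cannot tell the two usages apart without an extra sense variable.
bank_id = vocab["bank"]
print(bank_id, bows[0][bank_id], bows[1][bank_id])
```

A sense-aware model would need some additional variable per token to separate the two usages; the abstract describes doing this with a latent sense variable learned without supervision.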
