Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

In this paper, we propose a novel topic model that incorporates dictionary definitions. Traditional topic models treat words as surface strings and assume no prior knowledge of word meaning; they infer topics solely from surface word co-occurrence. However, co-occurring words may not be semantically related in a way that is relevant for topic coherence. Exploiting dictionary definitions explicitly in our model yields a better representation of word semantics, which in turn leads to better text modeling. We use WordNet as the lexical resource for sense definitions. We show that explicitly modeling word definitions significantly improves performance over the baseline on a text categorization task.
