Improving Topic Coherence with Regularized Topic Models

Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, result set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g., web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflects broad patterns in the external data. Using thirteen datasets, we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.
