Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge

While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.

[1]  Naftali Tishby,et al.  Multivariate Information Bottleneck , 2001, Neural Computation.

[2]  David Sontag,et al.  Using Anchors to Estimate Clinical State without Labeled Data , 2014, AMIA.

[3]  Aram Galstyan,et al.  Discovering Structure in High-Dimensional Data Through Correlation Explanation , 2014, NIPS.

[4]  Chris H. Q. Ding,et al.  On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing , 2008, Comput. Stat. Data Anal..

[5]  James P. Bagrow,et al.  Transitions in climate and energy discourse between Hurricanes Katrina and Sandy , 2015, Journal of Environmental Studies and Sciences.

[6]  Patrick F. Reidy An Introduction to Latent Semantic Analysis , 2009 .

[7]  Hal Daumé,et al.  Incorporating Lexical Priors into Topic Models , 2012, EACL.

[8]  Thang Nguyen,et al.  Is Your Anchor Going Up or Down? Fast and Accurate Supervised Topic Models , 2015, NAACL.

[9]  Satish Chikkagoudar,et al.  Disentangling the Lexicons of Disaster Response in Twitter , 2014, WWW.

[10]  Leonard K. M. Poon,et al.  Progressive EM for Latent Tree Models and Hierarchical Topic Detection , 2015, AAAI.

[11]  Thomas L. Griffiths,et al.  Probabilistic author-topic models for information discovery , 2004, KDD.

[12]  Sanjeev Arora,et al.  Learning Topic Models -- Going beyond SVD , 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[13]  Xiaojin Zhu,et al.  Latent Dirichlet Allocation with Topic-in-Set Knowledge , 2009, HLT-NAACL 2009.

[14]  Andrew McCallum,et al.  Optimizing Semantic Coherence in Topic Models , 2011, EMNLP.

[15]  Chong Wang,et al.  Reading Tea Leaves: How Humans Interpret Topic Models , 2009, NIPS.

[16]  Chong Wang,et al.  Stochastic variational inference , 2012, J. Mach. Learn. Res..

[17]  Naftali Tishby,et al.  The information bottleneck method , 2000, ArXiv.

[18]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[19]  Xiaojin Zhu,et al.  Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Framework for Incorporating General Domain Knowledge into Latent Dirichlet Allocation Using First-Order Logic , 2022 .

[20]  Aram Galstyan,et al.  Maximally Informative Hierarchical Representations of High-Dimensional Data , 2014, AISTATS.

[21]  Sanjeev Arora,et al.  A Practical Algorithm for Topic Modeling with Provable Guarantees , 2012, ICML.

[22]  Wendy W. Chapman,et al.  A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries , 2001, J. Biomed. Informatics.

[23]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[24]  Jordan L. Boyd-Graber,et al.  Anchors Regularized: Adding Robustness and Extensibility to Scalable Topic-Modeling Algorithms , 2014, ACL.

[25]  Jun Zhu,et al.  Robust RegBayes: Selectively Incorporating First-Order Logic Domain Knowledge into Bayesian Models , 2014, ICML.

[26]  Xiaojin Zhu,et al.  Incorporating domain knowledge into topic modeling via Dirichlet Forest priors , 2009, ICML '09.

[27]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[28]  Petr Sojka,et al.  Software Framework for Topic Modelling with Large Corpora , 2010 .

[29]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[30]  Qiaozhu Mei,et al.  Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis , 2014, ICML.

[31]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[32]  Aleks Jakulin,et al.  Discrete Component Analysis , 2005, SLSFS.

[33]  David M. Blei,et al.  Supervised Topic Models , 2007, NIPS.

[34]  James R. Foulds,et al.  Latent Topic Networks: A Versatile Probabilistic Programming Framework for Topic Models , 2015, ICML.

[35]  Aram Galstyan,et al.  Toward Interpretable Topic Discovery via Anchored Correlation Explanation , 2016, ArXiv.

[36]  David Sontag,et al.  Anchored Discrete Factor Analysis , 2015, ArXiv.

[37]  Timothy Baldwin,et al.  Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality , 2014, EACL.

[38]  Thomas L. Griffiths,et al.  Hierarchical Topic Models and the Nested Chinese Restaurant Process , 2003, NIPS.

[39]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.