Keyword Assisted Embedded Topic Model

By illuminating latent structures in a corpus of text, topic models are an essential tool for categorizing, summarizing, and exploring large collections of documents. Probabilistic topic models, such as latent Dirichlet allocation (LDA), describe how words in documents are generated via a set of latent distributions called topics. Recently, the Embedded Topic Model (ETM) has extended LDA to exploit the semantic information in word embeddings and derive semantically richer topics. Because LDA and its extensions are unsupervised models, they are not designed to make efficient use of a user's prior knowledge of the domain. To this end, we propose the Keyword Assisted Embedded Topic Model (KeyETM), which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors over the vocabulary. Using both quantitative metrics and human responses on a topic intrusion task, we demonstrate that KeyETM produces better topics than other guided, generative models in the literature.
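
To make the mechanism concrete, the following is a minimal NumPy sketch, not the paper's actual implementation, of the two ideas the abstract names: ETM derives each topic's word distribution as a softmax over similarities between word embeddings and a topic embedding, and keyword guidance can be read as an informative prior that shifts each topic's distribution toward user-supplied seed words. The variable names, the additive-logit form of the guidance, and the strength parameter `gamma` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions (hypothetical): V vocabulary words, D-dim embeddings, K topics.
V, D, K = 1000, 50, 5
rng = np.random.default_rng(0)
rho = rng.normal(size=(V, D))    # word embeddings (pretrained in practice)
alpha = rng.normal(size=(K, D))  # topic embeddings (learned by ETM)

# ETM-style topic-word distributions: softmax over the vocabulary of
# word-embedding / topic-embedding inner products.
beta = softmax(rho @ alpha.T, axis=0).T  # shape (K, V); each row sums to 1

# Keyword guidance (illustrative): boost the logits of each topic's seed
# words before normalizing, so the topic's mass shifts toward them.
seed_words = {0: [3, 17, 42], 1: [5, 99]}  # hypothetical topic -> word indices
gamma = 2.0                                 # hypothetical prior strength
logits = rho @ alpha.T                      # (V, K)
for k, idxs in seed_words.items():
    logits[idxs, k] += gamma
beta_guided = softmax(logits, axis=0).T     # (K, V), keyword-informed topics
```

In the model itself such guidance would enter as a prior inside the variational objective rather than as a post-hoc shift of the logits; the sketch only conveys the direction of the bias.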
