A Semantic Cover Approach for Topic Modeling

We introduce a novel topic modeling approach based on constructing a semantic set cover for clusters of similar documents. Specifically, our approach first clusters documents using their TF-IDF representation, and then covers each cluster with a set of topic words chosen by semantic similarity, which is defined in terms of a word embedding. Computing a topic cover amounts to solving a minimum set cover problem. We compare our approach to Latent Dirichlet Allocation (LDA) on three metrics: 1) qualitative topic match, as judged by Amazon Mechanical Turk (MTurk) workers, 2) performance on classification tasks using each topic model as a sparse feature representation, and 3) topic coherence. We find that qualitative judgments significantly favor our approach, that it outperforms LDA on topic coherence, and that it is comparable to LDA on document classification tasks.
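The covering step can be illustrated with a minimal sketch. The abstract does not say how the minimum set cover is solved, which similarity measure is used, or how candidate topic words are chosen, so everything below is an assumption for illustration: the standard greedy set cover heuristic, cosine similarity with a hypothetical threshold of 0.8, a hand-made two-dimensional toy embedding, and the function name `greedy_topic_cover`. The clustering step is omitted; the input is the vocabulary of one already-formed cluster.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_topic_cover(cluster_words, candidates, embedding, threshold=0.8):
    """Greedy set cover (an assumed solver, not necessarily the paper's).

    A candidate "covers" a cluster word when their cosine similarity
    reaches the threshold. Returns the chosen topic words and any
    cluster words that no candidate could cover.
    """
    covers = {
        c: {w for w in cluster_words
            if cosine(embedding[c], embedding[w]) >= threshold}
        for c in candidates
    }
    uncovered = set(cluster_words)
    topic_words = []
    while uncovered and covers:
        # Pick the candidate covering the most still-uncovered words;
        # sorting makes tie-breaking deterministic (alphabetical).
        best = max(sorted(covers), key=lambda c: len(covers[c] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            break  # the remaining words cannot be covered at this threshold
        topic_words.append(best)
        uncovered -= gained
    return topic_words, uncovered


# Toy embedding (hypothetical values): two well-separated word groups.
toy_embedding = {
    "dog": (1.0, 0.1), "cat": (0.9, 0.2), "puppy": (1.0, 0.0),
    "stock": (0.1, 1.0), "bond": (0.0, 1.0), "market": (0.2, 0.9),
}
cluster = ["dog", "cat", "puppy", "stock", "bond", "market"]
topics, leftover = greedy_topic_cover(cluster, cluster, toy_embedding)
print(topics, leftover)
```

In this toy case the cluster vocabulary doubles as the candidate set, and one word from each group is enough to cover the cluster. The greedy heuristic gives an approximate rather than a guaranteed minimum cover.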
