Using Word Embedding to Evaluate the Coherence of Topics from Twitter Data

Scholars often seek to understand the topics discussed on Twitter using topic modelling approaches. Several metrics have been proposed for evaluating the coherence of the topics generated by these approaches, including metrics based on the pre-calculated Pointwise Mutual Information (PMI) of word pairs and on Latent Semantic Analysis (LSA) word representation vectors. However, since Twitter data contains abbreviations and other peculiarities (e.g. hashtags), it can be challenging to derive effective PMI statistics or LSA word representations from it. Recently, Word Embedding (WE) has emerged as a particularly effective approach for capturing the similarity among words. Hence, in this paper, we propose new Word Embedding-based topic coherence metrics. To determine the usefulness of these new metrics, we compare them with the existing PMI/LSA-based metrics. We also conduct a large-scale crowdsourced user study to determine whether the new Word Embedding-based metrics better align with human preferences. Using two Twitter datasets, our results show that the WE-based metrics capture the coherence of topics in tweets more robustly and efficiently than the PMI/LSA-based ones.
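To make the idea concrete, the sketch below illustrates one plausible form of a Word Embedding-based coherence metric: scoring a topic by the mean pairwise cosine similarity of its top words' embedding vectors, by analogy with PMI-based coherence, which averages PMI(w_i, w_j) over the same word pairs. This is a minimal illustration under stated assumptions, not the paper's exact formulation; the function names (we_coherence, cosine) and the toy vectors are invented for the example, and in practice the embeddings would come from a model such as word2vec or GloVe trained on tweets.

```python
# A minimal sketch of a Word Embedding-based topic coherence metric.
# The metric here -- the mean pairwise cosine similarity of a topic's
# top words -- mirrors the structure of PMI-based coherence, which
# averages PMI(w_i, w_j) over the same word pairs; the exact variants
# proposed in the paper may differ.
from itertools import combinations
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def we_coherence(top_words, embeddings):
    """Average pairwise cosine similarity of a topic's top-N words.

    top_words  -- list of the N highest-probability words of a topic
    embeddings -- mapping from word to its embedding vector; words
                  missing from the vocabulary are skipped
    """
    vectors = [embeddings[w] for w in top_words if w in embeddings]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical usage with toy 3-dimensional vectors:
toy_embeddings = {
    "election": np.array([0.90, 0.10, 0.00]),
    "vote":     np.array([0.80, 0.20, 0.10]),
    "ballot":   np.array([0.85, 0.15, 0.05]),
    "pizza":    np.array([0.00, 0.20, 0.90]),
}
coherent   = we_coherence(["election", "vote", "ballot"], toy_embeddings)
incoherent = we_coherence(["election", "vote", "pizza"], toy_embeddings)
assert coherent > incoherent  # a more topical word set scores higher
```

Because such a score depends only on embedding look-ups rather than on corpus-wide co-occurrence counts, it avoids re-estimating PMI statistics on noisy Twitter text, which is consistent with the efficiency argument made in the abstract.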
