Automatic Labelling of Topic Models Using Word Vectors and Letter Trigram Vectors

The native representation of LDA-style topics is a multinomial distribution over words, which can be time-consuming to interpret directly. As an alternative representation, automatic labelling has been shown to help readers interpret the topics more efficiently. We propose a novel framework for topic labelling using word vectors and letter trigram vectors. We generate labels automatically and propose automatic and human evaluations of our method. First, we use a chunk parser to generate candidate labels, then map topics and candidate labels to word vectors and letter trigram vectors in order to find which candidate label is most semantically related to that topic. A label is selected by calculating the similarity between the vector of a topic and the vectors of its candidate labels. Experiments on three common datasets show that both the labelling method and our approach to automatic evaluation are effective.
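The ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the letter trigram side, it assumes words are padded with `#` boundary markers before trigrams are extracted, it assumes a topic vector is the probability-weighted sum of the trigram vectors of the topic's top words, and it assumes cosine similarity as the similarity measure. The function names (`letter_trigrams`, `topic_vector`, `rank_labels`) and the toy topic are hypothetical. Candidate labels are passed in directly, so the chunk parsing step is omitted.

```python
import math
from collections import Counter


def letter_trigrams(text):
    """Count letter trigrams of each word, padding words with '#' boundaries."""
    counts = Counter()
    for word in text.lower().split():
        padded = "#" + word + "#"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like maps."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_vector(top_words):
    """Assumed composition: probability-weighted sum of word trigram vectors."""
    vec = Counter()
    for word, prob in top_words:
        for trigram, count in letter_trigrams(word).items():
            vec[trigram] += prob * count
    return vec


def rank_labels(top_words, candidates):
    """Return candidate labels sorted by similarity to the topic, best first."""
    topic = topic_vector(top_words)
    scored = [(cosine(topic, letter_trigrams(label)), label) for label in candidates]
    return [label for _, label in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Toy topic: (word, probability) pairs invented for illustration only.
toy_topic = [("stock", 0.30), ("market", 0.25), ("trading", 0.20)]
toy_candidates = ["football league", "stock market"]
ranking = rank_labels(toy_topic, toy_candidates)
```

Dense word vectors could be ranked in the same way by replacing `letter_trigrams` with an embedding lookup and keeping the similarity step unchanged. How the two representations are combined is not specified in the abstract, so that step is left out here.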
