Correlation between textual similarity and quality of LDA topic model results

The LDA topic model describes a corpus through its vocabulary. Our experiment aims to determine whether the quality of LDA output can be estimated with text similarity metrics, and if so, which metric is most relevant. To this end, we use a categorized corpus and apply each metric to every pair of categories. We report correlation scores between several metrics and the quality of the resulting topic model. The experiments also compare simple and complex term extraction within our framework. We observe very high correlations with the Hellinger distance, with or without complex terms, while the Soergel distance performs best when complex terms are included. These experiments constitute a case study on a categorized corpus of 20,000 article abstracts.
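To make the two best-performing metrics concrete, the following is a minimal sketch of the Hellinger and Soergel distances applied to term-frequency distributions of two corpus categories. The distributions and function names are illustrative assumptions, not the paper's actual pipeline, which operates on a full categorized corpus.

```python
import math

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions;
    # bounded in [0, 1], where 0 means identical distributions.
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)

def soergel(p, q):
    # Soergel distance: the L1 difference normalised by the sum of
    # element-wise maxima; also bounded in [0, 1].
    num = sum(abs(a - b) for a, b in zip(p, q))
    den = sum(max(a, b) for a, b in zip(p, q))
    return num / den if den else 0.0

# Toy term-frequency distributions for two hypothetical categories.
cat_a = [0.5, 0.3, 0.2]
cat_b = [0.4, 0.4, 0.2]
print(hellinger(cat_a, cat_b))
print(soergel(cat_a, cat_b))
```

In the study's setting, such pairwise distances between category vocabularies are what get correlated against topic model quality.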
