A Comparison of Approaches for Measuring the Semantic Similarity of Short Texts Based on Word Embeddings

Measuring the semantic similarity of texts plays a vital role in various natural language processing tasks. In this paper, we describe a set of experiments carried out to evaluate and compare the performance of different approaches for measuring the semantic similarity of short texts. We compare four models based on word embeddings: two variants of Word2Vec (one trained on a specific dataset and one extending it with embeddings of word senses), FastText, and TF-IDF. Since these models provide word vectors, we experiment with various methods that calculate the semantic similarity of short texts from word vectors. More precisely, for each of these models, we test five methods for aggregating word embeddings into a text embedding. We introduce three methods as variations of two commonly used similarity measures: one is an extension of the centroid-based cosine similarity, and the other two are variations of the Okapi BM25 function. We evaluate all approaches on two publicly available datasets, SICK and Lee, in terms of the Pearson and Spearman correlation. The results indicate that the extended methods perform better than the original ones in most cases.
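
As a rough illustration of the baseline idea mentioned above, the following sketch aggregates word vectors into a text embedding by averaging them (the centroid) and compares two texts with cosine similarity. It shows only the common centroid-based measure, not the extended variant the paper introduces. The embedding table `TOY_VECTORS`, the function names, and the whitespace tokenization are assumptions made for the example; in practice the vectors would come from a trained model such as Word2Vec or FastText.

```python
import math

# Hypothetical 2-dimensional embeddings, invented for this illustration only.
TOY_VECTORS = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.2],
    "car": [0.0, 1.0],
}


def centroid(tokens, vectors):
    """Average the vectors of all tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of the text is in the vocabulary
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid_similarity(text_a, text_b, vectors):
    """Cosine similarity between the centroids of two short texts."""
    # Assumption: lowercasing and whitespace splitting stand in for real tokenization.
    c_a = centroid(text_a.lower().split(), vectors)
    c_b = centroid(text_b.lower().split(), vectors)
    if c_a is None or c_b is None:
        return 0.0
    return cosine(c_a, c_b)


if __name__ == "__main__":
    print(centroid_similarity("a cat sleeps", "a dog sleeps", TOY_VECTORS))
    print(centroid_similarity("a cat sleeps", "a car stops", TOY_VECTORS))
```

With these toy vectors, the cat/dog pair scores higher than the cat/car pair. Out-of-vocabulary words are simply skipped here; how they are handled, and how individual words are weighted, is one of the design choices that distinguishes aggregation methods from one another.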
