A Study of Probabilistic and Algebraic Methods for Semantic Similarity

In this article we study and propose several novel solutions to the task of assessing the semantic similarity between two short texts. The proposed solutions are based on the probabilistic method of Latent Dirichlet Allocation (LDA) and on the algebraic method of Latent Semantic Analysis (LSA). Both LDA and LSA are fully automated methods for discovering latent topics or concepts from large collections of documents. We propose a novel word-to-word similarity measure based on LDA as well as several text-to-text similarity measures. We compare these measures with similar, known measures based on LSA. Experiments and results are presented on two data sets: the Microsoft Research Paraphrase corpus and the User Language Paraphrase corpus. We found that the novel word-to-word similarity measure based on LDA is extremely promising.
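To make the idea of a topic-based word-to-word measure concrete, the sketch below shows one plausible construction; it is not necessarily the measure proposed in the article. It assumes an already trained topic model whose topic-word distributions are available as a list of dictionaries (the toy values in `PHI` are invented for illustration), represents each word by its probability under every topic, and compares words by cosine similarity. The text-to-text step, a symmetrized average of best word matches, is likewise an assumption chosen for brevity, and all function names are hypothetical.

```python
import math

# Hypothetical topic-word distributions: PHI[t][w] stands for P(w | topic t).
# The values are invented for illustration, not taken from a trained model.
PHI = [
    {"car": 0.4, "automobile": 0.3, "engine": 0.2, "banana": 0.1},
    {"car": 0.05, "automobile": 0.05, "engine": 0.1, "banana": 0.8},
]


def topic_vector(word, phi):
    """Represent a word by its probability under each topic (0.0 if unseen)."""
    return [topic.get(word, 0.0) for topic in phi]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def word_similarity(w1, w2, phi):
    """Word-to-word similarity as the cosine of the two topic vectors (an assumption)."""
    return cosine(topic_vector(w1, phi), topic_vector(w2, phi))


def directed_similarity(src, dst, phi):
    """Average, over the words in src, of the best-matching word in dst."""
    if not src or not dst:
        return 0.0
    best = (max(word_similarity(w, v, phi) for v in dst) for w in src)
    return sum(best) / len(src)


def text_similarity(t1, t2, phi):
    """Symmetrized text-to-text similarity built from word-to-word scores."""
    a, b = t1.lower().split(), t2.lower().split()
    return 0.5 * (directed_similarity(a, b, phi) + directed_similarity(b, a, phi))


if __name__ == "__main__":
    print(word_similarity("car", "automobile", PHI))
    print(text_similarity("car engine", "automobile engine", PHI))
```

With these toy distributions, words that concentrate their probability in the same topic score higher than words that do not. In practice the distributions would come from a topic model trained on a large document collection.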
