The Role of Local and Global Weighting in Assessing the Semantic Similarity of Texts Using Latent Semantic Analysis

In this paper, we investigate the impact of several local and global weighting schemes on the ability of Latent Semantic Analysis (LSA) to capture semantic similarity between two texts. We worked with texts varying in size from sentences to paragraphs. We present a comparison of three local and three global weighting schemes across three standardized data sets related to semantic similarity tasks. For local weighting, we used binary weighting, term frequency, and log-type weighting. For global weighting, we relied on binary weighting, inverse document frequency (IDF) collected from the English Wikipedia, and entropy, the standard weighting scheme used by most LSA-based applications. We studied all possible combinations of these weighting schemes on the following three tasks and corresponding data sets: paraphrase identification at sentence level using the Microsoft Research Paraphrase Corpus, paraphrase identification at sentence level using data from the intelligent tutoring system iSTART, and mental model detection based on student-articulated paragraphs in MetaTutor, another intelligent tutoring system. Our experiments revealed that for sentence-level texts a combination of term-frequency local weighting with either IDF or binary global weighting works best, while for paragraph-level texts a combination of log-type local weighting with binary global weighting works best. We also found that global weights have a greater impact for sentence-level similarity, as the local weight is undermined by the small size of such texts.
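The weighting combinations described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: each cell of the term-document matrix is a local weight times a per-term global weight, the weighted matrix is factored with a truncated SVD, and documents are compared by cosine similarity in the reduced space. The function names, the toy corpus, and the dimensionality `k` are illustrative assumptions.

```python
import numpy as np

def local_weight(tf, scheme):
    # tf: term-frequency matrix (terms x documents)
    if scheme == "binary":
        return (tf > 0).astype(float)
    if scheme == "tf":
        return tf.astype(float)
    if scheme == "log":                      # log-type: log(1 + tf)
        return np.log1p(tf)
    raise ValueError(scheme)

def global_weight(tf, scheme):
    # One weight per term, applied to every document.
    n_docs = tf.shape[1]
    df = (tf > 0).sum(axis=1)                # document frequency per term
    if scheme == "binary":
        return np.ones(tf.shape[0])
    if scheme == "idf":
        return np.log(n_docs / np.maximum(df, 1))
    if scheme == "entropy":
        # 1 + (sum_j p_ij log p_ij) / log(n_docs): terms spread evenly
        # across documents get weights near 0, concentrated terms near 1.
        p = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1)
        plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
        return 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    raise ValueError(scheme)

def weighted_matrix(tf, local="log", glob="entropy"):
    return local_weight(tf, local) * global_weight(tf, glob)[:, None]

def lsa_similarity(tf, k=2, local="log", glob="entropy"):
    # Truncated SVD of the weighted matrix; documents become k-dim vectors.
    W = weighted_matrix(tf, local, glob)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k]).T
    a, b = docs[0], docs[1]                  # compare documents 0 and 1
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Swapping the `local` and `glob` arguments reproduces the nine combinations compared in the paper (e.g. `local="tf", glob="idf"` for the sentence-level winner, `local="log", glob="binary"` for the paragraph-level one).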
