The SIMILAR Corpus: A Resource To Foster The Qualitative Understanding of Semantic Similarity of Texts

We describe in this paper the SIMILAR corpus, which was developed to foster a deeper, qualitative understanding of word-to-word semantic similarity metrics and their role in the more general problem of text-to-text semantic similarity. The SIMILAR corpus fills a gap in existing resources meant to support the development of text-to-text similarity methods based on word-level similarities. Existing resources, such as data sets annotated with paraphrase information between two sentences, do not provide word-to-word semantic similarity annotations or quality judgments at the word level. We annotated 700 pairs of sentences from the Microsoft Research Paraphrase corpus with word-to-word semantic similarity information using both a greedy and an optimal protocol. We proposed a set of qualitative word-to-word semantic similarity relations, which were then used to annotate the corpus. We also present a detailed analysis of various quantitative word-to-word semantic similarity metrics and how they relate to our qualitative relations. A software tool was developed to facilitate the annotation of texts using the proposed protocol.
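To illustrate the difference between a greedy and an optimal matching of words across two sentences, the following is a minimal sketch and not the annotation protocol itself, whose details are not given in this abstract. It assumes a precomputed word-to-word similarity matrix `sim` (rows are words of the first sentence, columns are words of the second) and a one-to-one pairing. The function names and toy scores are hypothetical. The optimal variant uses brute-force enumeration for brevity; an assignment solver such as the Hungarian method would be the usual choice for longer sentences.

```python
from itertools import permutations


def greedy_alignment(sim):
    """Pair each row word, in order, with its best still-unused column word."""
    used = set()
    pairs = []
    for i, row in enumerate(sim):
        candidates = [(score, j) for j, score in enumerate(row) if j not in used]
        if not candidates:
            break  # no column words left to pair with
        _, best_j = max(candidates)
        used.add(best_j)
        pairs.append((i, best_j))
    return pairs


def optimal_alignment(sim):
    """Find the one-to-one pairing with the highest total similarity.

    Brute force over all assignments; assumes len(sim) <= len(sim[0])
    and is only practical for short sentences.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(sim[i][j] for i, j in enumerate(cols)),
    )
    return list(enumerate(best))


def total_similarity(sim, pairs):
    """Sum the similarity scores of the chosen word pairs."""
    return sum(sim[i][j] for i, j in pairs)


# Hypothetical similarity scores for two 2-word sentences.
sim = [
    [0.90, 0.80],
    [0.85, 0.10],
]
greedy = greedy_alignment(sim)
optimal = optimal_alignment(sim)
```

In this toy case the greedy pass commits the first word to its locally best match and leaves the second word with a poor one, whereas the optimal pairing accepts a slightly weaker first match to obtain a higher total.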
