The SIMILAR Corpus: A Resource To Foster The Qualitative Understanding of Semantic Similarity of Texts

This paper describes the SIMILAR corpus, which was developed to foster a deeper, qualitative understanding of word-to-word semantic similarity metrics and their role in the more general problem of text-to-text semantic similarity. The corpus fills a gap in existing resources meant to support the development of text-to-text similarity methods based on word-level similarities. Existing resources, such as data sets annotated with paraphrase information between two sentences, do not provide word-to-word semantic similarity annotations or quality judgments at the word level. We annotated 700 pairs of sentences from the Microsoft Research Paraphrase corpus with word-to-word semantic similarity information using both a greedy and an optimal protocol. We propose a set of qualitative word-to-word semantic similarity relations, which were then used to annotate the corpus. We also present a detailed analysis of various quantitative word-to-word semantic similarity metrics and how they relate to our qualitative relations. A software tool has been developed to facilitate the annotation of texts using the proposed protocol.
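The abstract does not define the greedy and optimal protocols, so the following is only a minimal sketch under stated assumptions. It assumes that "greedy" means each word in one sentence is paired independently with its most similar word in the other sentence, and that "optimal" means a one-to-one assignment that maximizes total similarity (the kind of problem the Hungarian method solves; here it is solved by exhaustive search, which is only practical for very short sentences). The function names and the similarity scores are hypothetical and do not come from the corpus.

```python
from itertools import permutations


def greedy_alignment(sim):
    """For each row word, pick the column word with the highest score.

    Several row words may end up paired with the same column word.
    """
    return [max(range(len(row)), key=lambda j: row[j]) for row in sim]


def optimal_alignment(sim):
    """One-to-one assignment that maximizes the total similarity.

    Exhaustive search over all assignments; assumes the matrix has at
    least as many columns as rows and is small enough to enumerate.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(sim[i][c] for i, c in enumerate(cols)),
    )
    return list(best)


# Made-up word-to-word similarity scores: rows are words of sentence A,
# columns are words of sentence B.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]

greedy = greedy_alignment(sim)    # rows 0 and 1 both choose column 0
optimal = optimal_alignment(sim)  # each column is used exactly once
print("greedy: ", greedy)
print("optimal:", optimal)
```

With these toy scores the two strategies disagree: the greedy pairing lets two words of sentence A share the same partner, while the one-to-one assignment moves the first word to its second-best match so that the overall total is higher.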

Nobal B. Niraula | Mihai C. Lintean | W. Baggett | V. Rus | Cristian Moldovan | Brent Morgan
